parse_date_time: Parse character and numeric date-time vectors with user friendly order formats.

Description

parse_date_time parses an input vector into POSIXct date-time object. It differs from strptime in two respects. First, it allows specification of the order in which the formats occur without the need to include separators and "%" prefix. Such a formating argument is refered to as "order". Second, it allows the user to specify several format-orders to handle heterogeneous date-time character representations. parse_date_time2 is a fast C parser of numeric orders. fast_strptime is a fast C parser of numeric formats only that accepts explicit format arguments, just as strptime.

Usage

parse_date_time(x, orders, tz = "UTC", truncated = 0,
    quiet = FALSE, locale = Sys.getlocale("LC_TIME"),
    select_formats = .select_formats)
  parse_date_time2(x, orders, tz = "UTC")
  fast_strptime(x, format, tz = "UTC")

Arguments

a character or numeric vector of dates

orders

a character vector of date-time formats. Each order string is series of formatting characters as listed strptime but might not include the "%" prefix, for example "ymd" will match all the possi

a character string that specifies the time zone with which to parse the dates

truncated

integer, number of formats that can be missing. The most common type of irregularity in date-time data is the truncation due to rounding or unavailability of the time stamp. If truncated parameter is non-zero parse_date_time

quiet

logical. When TRUE progress messages are not printed, and "no formats found" error is surpresed and the function simply returns a vector of NAs. This mirrors the behavior of base R functions strptime and as.POSIXct. Defa

locale

locale to be used, see locales. On linux systems you can use system("locale -a") to list all the installed locales.

select_formats

A function to select actual formats for parsing from a set of formats which matched a training subset of x. it receives a named integer vector and returns a character vector of selected formats. Names of the input vector are formats (

format

a character string of formats. It should include all the separators and each format must be prefixed with strptime.

Value

a vector of POSIXct date-time objects

Details

When several format-orders are specified parse_date_time sorts the supplied format-orders based on a training set and then applies them recursively on the input vector.

parse_date_time, and hence all the derived functions, such as ymd_hms, ymd etc, will drop into fast_strptime instead of strptime whenever the trained from input data formats are all numeric.

Here are all the formats recognized by lubridate. For numeric formats leading 0s are optional. As compared to strptime, some of the formats have been extended for efficiency reasons. They are marked with "*". Formats accepted by parse_date_time2 and fast_strptime are marked with "!".

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Examples

Run this code

x <- c("09-01-01", "09-01-02", "09-01-03")
parse_date_time(x, "ymd")
parse_date_time(x, "%y%m%d")
parse_date_time(x, "%y %m %d")
#  "2009-01-01 UTC" "2009-01-02 UTC" "2009-01-03 UTC"

## ** heterogenuous formats **
x <- c("09-01-01", "090102", "09-01 03", "09-01-03 12:02")
parse_date_time(x, c("%y%m%d", "%y%m%d %H%M"))

## different ymd orders:
x <- c("2009-01-01", "02022010", "02-02-2010")
parse_date_time(x, c("%d%m%Y", "ymd"))
##  "2009-01-01 UTC" "2010-02-02 UTC" "2010-02-02 UTC"

## ** truncated time-dates **
x <- c("2011-12-31 12:59:59", "2010-01-01 12:11", "2010-01-01 12", "2010-01-01")
parse_date_time(x, "%Y%m%d %H%M%S", truncated = 3)
parse_date_time(x, "ymd_hms", truncated = 3)
## [1] "2011-12-31 12:59:59 UTC" "2010-01-01 12:11:00 UTC"
## [3] "2010-01-01 12:00:00 UTC" "2010-01-01 00:00:00 UTC"

## ** fast parsing **
options(digits.secs = 3)
  ## random times between 1400 and 3000
  tt <- as.character(.POSIXct(runif(1e6, -17987443200, 32503680000)))
  system.time(out <- as.POSIXct(tt, tz = "UTC"))
  system.time(out1 <- ymd_hms(tt)) ## format learning overhead
  system.time(out2 <- parse_date_time2(tt, "YmdHMOS"))
  system.time(out3 <- fast_strptime(tt, "%Y-%m-%d %H:%M:%OS"))
  all.equal(out, out1)
  all.equal(out, out2)
  all.equal(out, out3)

## ** how to use select_formats **
## By default \%Y has precedence:
parse_date_time(c("27-09-13", "27-09-2013"), "dmy")
## [1] "13-09-27 UTC"   "2013-09-27 UTC"

## to give priority to \%y format, define your own select_format function:

my_select <-   function(trained){
   n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
   names(trained[ which.max(n_fmts) ])
}

parse_date_time(c("27-09-13", "27-09-2013"), "dmy", select_formats = my_select)
## '[1] "2013-09-27 UTC" "2013-09-27 UTC"