parse_date_time: Parse character and numeric date-time vectors with user-friendly order formats.

Description

parse_date_time parses an input vector into a POSIXct date-time object. It differs from strptime in two respects. First, it allows specification of the order in which the formats occur without the need to include separators and the "%" prefix. Such a formatting argument is referred to as an "order". Second, it allows the user to specify several format-orders to handle heterogeneous date-time character representations. parse_date_time2 is a fast C parser of numeric orders. fast_strptime is a fast C parser of numeric formats only that accepts explicit format arguments, just as strptime does.

Usage

parse_date_time(x, orders, tz = "UTC", truncated = 0, quiet = FALSE, locale = Sys.getlocale("LC_TIME"), select_formats = .select_formats, exact = FALSE)
parse_date_time2(x, orders, tz = "UTC", exact = FALSE, lt = FALSE)
fast_strptime(x, format, tz = "UTC", lt = TRUE)

Arguments

x

a character or numeric vector of dates

orders

a character vector of date-time formats. Each order string is a series of formatting characters as listed in strptime, but need not include the "%" prefix; for example, "ymd" will match all the possible dates in year, month, day order. Formatting orders may include arbitrary separators, which are discarded. See Details for implemented formats.

tz

a character string that specifies the time zone with which to parse the dates

truncated

integer, number of formats that can be missing. The most common type of irregularity in date-time data is truncation due to rounding or unavailability of the time stamp. If the truncated parameter is non-zero, parse_date_time also checks for truncated formats. For example, if the format order is "ymdhms" and truncated = 3, parse_date_time will correctly parse incomplete dates like 2012-06-01 12:23, 2012-06-01 12 and 2012-06-01. NOTE: The ymd family of functions is based on strptime, which currently fails to parse %y-%m formats.

quiet

logical. When TRUE, progress messages are not printed, the "no formats found" error is suppressed, and the function simply returns a vector of NAs. This mirrors the behavior of base R functions strptime and as.POSIXct. Default is FALSE.

locale

locale to be used, see locales. On Linux systems you can use system("locale -a") to list all the installed locales.

select_formats

A function to select actual formats for parsing from a set of formats which matched a training subset of x. It receives a named integer vector and returns a character vector of selected formats. Names of the input vector are formats (not orders) that matched the training set. Numeric values are the number of dates (in the training set) that matched the corresponding format. You should use this argument if the default selection method fails to select the formats in the right order. By default, the formats with the most formatting tokens (%) are selected, and %Y counts as 2.5 tokens (so that it has priority over %y%m). See examples.

exact

logical. If TRUE, the orders parameter is interpreted as an exact strptime format and no training or guessing is performed.

lt

logical. If TRUE, the returned object is of class POSIXlt, and POSIXct otherwise. For compatibility with the base `strptime` function, the default is TRUE for `fast_strptime` and FALSE for `parse_date_time2`.

format

a character string of formats. It should include all the separators, and each format must be prefixed with "%", just as in the format argument of strptime.
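The format and lt arguments are easiest to compare side by side. The following is a minimal sketch, assuming lubridate is attached; the input strings are made up for illustration and the variable names are not part of the package.

```r
library(lubridate)

# Illustrative input; any fully numeric date-time strings would do.
x <- c("2012-06-01 12:23:45", "2012-06-02 08:00:00")

# fast_strptime() takes a complete strptime-style format, separators included.
lt_out <- fast_strptime(x, "%Y-%m-%d %H:%M:%S")              # lt = TRUE by default
ct_out <- fast_strptime(x, "%Y-%m-%d %H:%M:%S", lt = FALSE)

# parse_date_time2() takes an order instead and defaults to lt = FALSE.
ct2_out <- parse_date_time2(x, "YmdHMS")

class(lt_out)   # includes "POSIXlt"
class(ct_out)   # includes "POSIXct"
```

Both functions parse the same instants here; only the class of the returned object differs.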

Value

a vector of POSIXct date-time objects
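As a quick check of the return value, the sketch below parses one valid string and one string that matches no format. It assumes lubridate is attached; the inputs are illustrative only.

```r
library(lubridate)

# A string that matches the order yields a POSIXct value.
ok <- parse_date_time("2012-06-01 12:23", "ymd HM")

# A string that matches nothing yields NA; quiet = TRUE silences the
# "no formats found" condition.
bad <- parse_date_time("not a date", "ymd", quiet = TRUE)

inherits(ok, "POSIXct")
is.na(bad)
```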

Details

When several format-orders are specified parse_date_time sorts the supplied format-orders based on a training set and then applies them recursively on the input vector.

parse_date_time, and all derived functions such as ymd_hms and ymd, will drop into fast_strptime instead of strptime whenever the formats guessed from the input data are all numeric.
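To see which strptime-style formats a set of orders can expand to for a given input, lubridate's exported guess_formats helper can be called directly. This is only a sketch: whether its output mirrors the internal training step exactly is an assumption, and the input vector is made up for illustration.

```r
library(lubridate)

x <- c("2009-01-01", "02-02-2010")

# guess_formats() returns a character vector of strptime-style formats
# (with "%" prefixes and separators) that the given orders match in x.
fmts <- guess_formats(x, c("dmY", "ymd"))
fmts
```

These are formats rather than orders, which is also what a custom select_formats function receives as names.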

The list below contains formats recognized by lubridate. For numeric formats leading 0s are optional. In contrast to strptime, some of the formats have been extended for efficiency reasons. They are marked with "*". Fast parsers, parse_date_time2 and fast_strptime, currently accept only formats marked with "!".

a

Abbreviated weekday name in the current locale. (Also matches full name.)

A

Full weekday name in the current locale. (Also matches abbreviated name).

You need not specify a and A formats explicitly. Weekday names are handled automatically if preproc_wday = TRUE.

b

Abbreviated month name in the current locale. (Also matches full name.)

B

Full month name in the current locale. (Also matches abbreviated name.)

d!

Day of the month as decimal number (01--31 or 0--31)

H!

Hours as decimal number (00--24 or 0--24).

I

Hours as decimal number (01--12 or 1--12).

j

Day of year as decimal number (001--366 or 1--366).

m!*

Month as decimal number (01--12 or 1--12). For parse_date_time, also matches abbreviated and full month names, as the b and B formats do.

M!

Minute as decimal number (00--59 or 0--59).

p

AM/PM indicator in the locale. Used in conjunction with I and not with H. An empty string in some locales.

S!

Second as decimal number (00--61 or 0--61), allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).

OS

Fractional second.

U

Week of the year as decimal number (00--53 or 0--53) using Sunday as the first day of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.

w

Weekday as decimal number (0--6, Sunday is 0).

W

Week of the year as decimal number (00--53 or 0--53) using Monday as the first day of the week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.

y!*

Year without century (00--99 or 0--99). In parse_date_time also matches year with century (Y format).

Y!

Year with century.

z!*

ISO8601 signed offset in hours and minutes from UTC. For example -0800, -08:00 or -08 all represent 8 hours behind UTC. This format also matches the Z (Zulu) UTC indicator. Because strptime doesn't fully support ISO8601, this format is implemented as a union of 4 orders: Ou (Z), Oz (-0800), OO (-08:00) and Oo (-08). You can use these four orders like any other, but it is rarely necessary. parse_date_time2 and fast_strptime support all of the timezone formats.

r*

Matches Ip and H orders.

R*

Matches HM and IMp orders.

T*

Matches IMSp, HMS, and HMOS orders.
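The extended formats can be combined in a single order. The sketch below assumes lubridate is attached and uses the order "ymdTz", chosen here for illustration, so that T covers the time part and z covers the different offset spellings; the input strings are made up.

```r
library(lubridate)

# Three spellings of a UTC offset: Zulu, -0800 and -08:00.
x <- c("2012-03-04T05:06:07Z",
       "2012-03-04T05:06:07-0800",
       "2012-03-04T05:06:07-08:00")

# One order handles all three strings; the result is expressed in the
# default tz, UTC.
out <- parse_date_time(x, "ymdTz")
out
```

With the default tz = "UTC", the two strings carrying an 8-hour negative offset should print eight hours later than the Z string.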

Examples


## ** orders are much easier to write **
x <- c("09-01-01", "09-01-02", "09-01-03")
parse_date_time(x, "ymd")
parse_date_time(x, "y m d")
parse_date_time(x, "%y%m%d")
#  "2009-01-01 UTC" "2009-01-02 UTC" "2009-01-03 UTC"

## ** heterogeneous date-times **
x <- c("09-01-01", "090102", "09-01 03", "09-01-03 12:02")
parse_date_time(x, c("ymd", "ymd HM"))

## ** different ymd orders **
x <- c("2009-01-01", "02022010", "02-02-2010")
parse_date_time(x, c("dmY", "ymd"))
##  "2009-01-01 UTC" "2010-02-02 UTC" "2010-02-02 UTC"

## ** truncated time-dates **
x <- c("2011-12-31 12:59:59", "2010-01-01 12:11", "2010-01-01 12", "2010-01-01")
parse_date_time(x, "Ymd HMS", truncated = 3)
parse_date_time(x, "ymd_hms", truncated = 3)
## [1] "2011-12-31 12:59:59 UTC" "2010-01-01 12:11:00 UTC"
## [3] "2010-01-01 12:00:00 UTC" "2010-01-01 00:00:00 UTC"

## ** specifying exact formats and avoiding training and guessing **
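## x is reset here: the output below corresponds to this vector, not the truncated one above
x <- c("09-01-01", "090102", "09-01 03", "09-01-03 12:02")
parse_date_time(x, c("%m-%d-%y", "%m%d%y", "%m-%d-%y %H:%M"), exact = TRUE)
## [1] "2001-09-01 00:00:00 UTC" "2002-09-01 00:00:00 UTC" NA "2003-09-01 12:02:00 UTC"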
parse_date_time(c('12/17/1996 04:00:00','4/18/1950 0130'),
                c('%m/%d/%Y %I:%M:%S','%m/%d/%Y %H%M'), exact = TRUE)
## [1] "1996-12-17 04:00:00 UTC" "1950-04-18 01:30:00 UTC"

## ** fast parsing **
## Not run: 
#   options(digits.secs = 3)
#   ## random times between 1400 and 3000
#   tt <- as.character(.POSIXct(runif(1000, -17987443200, 32503680000)))
#   tt <- rep.int(tt, 1000)
# 
#   system.time(out <- as.POSIXct(tt, tz = "UTC"))
#   system.time(out1 <- ymd_hms(tt)) # constant overhead on long vectors
#   system.time(out2 <- parse_date_time2(tt, "YmdHMOS"))
#   system.time(out3 <- fast_strptime(tt, "%Y-%m-%d %H:%M:%OS"))
# 
#   all.equal(out, out1)
#   all.equal(out, out2)
#   all.equal(out, out3)
# ## End(Not run)

## ** how to use `select_formats` argument **
## By default %Y has precedence:
parse_date_time(c("27-09-13", "27-09-2013"), "dmy")
## [1] "13-09-27 UTC"   "2013-09-27 UTC"

## to give priority to %y format, define your own select_format function:

my_select <-   function(trained){
   n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
   names(trained[ which.max(n_fmts) ])
}

parse_date_time(c("27-09-13", "27-09-2013"), "dmy", select_formats = my_select)
## [1] "2013-09-27 UTC" "2013-09-27 UTC"

## ** invalid times with "fast" parsing **
parse_date_time("2010-03-14 02:05:06",  "YmdHMS", tz = "America/New_York")
## [1] NA
parse_date_time2("2010-03-14 02:05:06",  "YmdHMS", tz = "America/New_York")
## [1] "2010-03-14 03:05:06 EDT"
parse_date_time2("2010-03-14 02:05:06",  "YmdHMS", tz = "America/New_York", lt = TRUE)
## [1] "2010-03-14 02:05:06 America/New_York"