templates: Path Templates

Description

A friendly, focused alternative to using regular expressions for path parsing.

Details

The purpose of the dirdf package is to let you, the user, write a path specification that we can apply to file paths, extracting the relevant chunks into data frame columns. The most obvious mechanism for doing so is a regular expression, and indeed, dirdf lets you provide a regex argument.

But for most reasonable directory/file naming conventions, regex is overkill; its power is wasted on something like YYYY-MM/DD/LocationId/SubjectId.csv, yet you still have to pay the price of regexes being difficult to write and to read, and easy to get subtly wrong.

Path templates are a friendlier alternative. A path template is a string that consists of variable names and delimiters. A variable name is any contiguous run of alphanumeric characters (optionally, with a trailing ? character); delimiters are everything else.

For example:

Year-Month/Day/FirstName_MiddleInitial?_LastName.ext

In this example, Year, Month, Day, FirstName, MiddleInitial, LastName, and ext are variable names. All of the dash, slash, underscore, and period characters between them are considered delimiters.
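As a rough illustration of that split, the base R sketch below separates the example template into variable names and delimiters. It only approximates the rule stated above and is not dirdf's actual parser; the names `variables` and `delimiters` are made up for this example.

```r
template <- "Year-Month/Day/FirstName_MiddleInitial?_LastName.ext"

# Variable names: runs of alphanumeric characters, optionally followed by "?"
m <- gregexpr("[[:alnum:]]+[?]?", template)
variables <- regmatches(template, m)[[1]]

# Delimiters: whatever lies between the variable names
# (invert = TRUE also returns empty strings before the first and after
# the last match, so those are dropped)
delimiters <- regmatches(template, m, invert = TRUE)[[1]]
delimiters <- delimiters[nzchar(delimiters)]

variables
delimiters
```

The trailing `?` is kept on `MiddleInitial?` here so that the optional marker stays visible; a real parser would presumably strip it and record the variable as optional.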

When parsed, this template will match each variable to any number of non-slash characters, up until the next delimiter. (Slash will never be considered part of a variable match, as we consider it the path separator.)

The trailing question mark makes MiddleInitial? optional; both its value and its preceding delimiter (_ in this case) can be omitted from target paths, in which case the resulting value for that variable will be NA (or in some edge cases, "").
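To make these matching rules concrete, here is a hand-written regular expression that behaves roughly the way the example template is described to behave, applied with base R only. The pattern and the names `pattern`, `vars`, `fields`, and `parsed` are assumptions for illustration, not how dirdf is implemented; the sketch reports an absent optional part as NA and does not try to reproduce the "" edge cases.

```r
paths <- c(
  "1860-02/01/Abel_Magwitch.csv",
  "1847-10/13/Bertha_A_Mason.csv"
)

# Hand-written approximation of the template
# "Year-Month/Day/FirstName_MiddleInitial?_LastName.ext".
# Each variable matches non-slash characters up to the next delimiter;
# the optional variable and its preceding "_" sit in an optional group.
pattern <- paste0(
  "^([^/-]+)-([^/]+)/([^/]+)/",     # Year-Month/Day/
  "([^/_]+)(?:_([^/_]+))?_",        # FirstName, optional _MiddleInitial, then _
  "([^/.]+)\\.([^/]+)$"             # LastName.ext
)
vars <- c("Year", "Month", "Day", "FirstName",
          "MiddleInitial", "LastName", "ext")

matches <- regmatches(paths, regexec(pattern, paths, perl = TRUE))

# Drop the full match (first element), keep the seven capture groups
fields <- do.call(rbind, lapply(matches, function(m) m[-1]))
colnames(fields) <- vars
parsed <- as.data.frame(fields, stringsAsFactors = FALSE)

# An optional group that did not take part in the match comes back
# empty; report it as NA instead
parsed$MiddleInitial[!nzchar(parsed$MiddleInitial)] <- NA

parsed
```

Even for this small naming convention the regex is noticeably harder to read than the one-line template, which is the trade-off the template syntax is meant to avoid.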

Examples

library(dirdf)

# Template with an optional MiddleInitial variable
template <- "Year-Month/Day/FirstName_MiddleInitial?_LastName.ext"
paths <- c(
  "1860-02/01/Abel_Magwitch.csv",   # no middle initial
  "1847-10/13/Bertha_A_Mason.csv"   # middle initial "A"
)

# Extract the template variables from each path into data frame columns
dirdf_parse(paths, template)