The purpose of the dirdf package is to let you, the user, write a path
specification that we can apply to file paths, extracting out relevant chunks
into data frame columns. The most obvious mechanism for doing so is a regular
expression, and indeed, dirdf lets you provide a regex argument.
But for most reasonable directory/file naming conventions, regex is overkill;
its power is wasted on something like YYYY-MM/DD/LocationId/SubjectId.csv,
yet you still have to pay the price of regexes being difficult to write and
to read, and easy to get subtly wrong.
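To make that cost concrete, here is a rough sketch of a regular expression for the naming convention above. It is written in Python purely for illustration and is not part of dirdf; the group names and the sample path are made up for this example.

```python
import re

# Hypothetical regex for paths shaped like YYYY-MM/DD/LocationId/SubjectId.csv.
# Every group has to spell out which characters it may not contain.
PATH_RE = re.compile(
    r"(?P<year>[^/-]+)-(?P<month>[^/]+)/"
    r"(?P<day>[^/]+)/"
    r"(?P<location_id>[^/]+)/"
    r"(?P<subject_id>[^/.]+)\.csv"
)

# Made-up sample path, used only to exercise the pattern.
match = PATH_RE.fullmatch("2016-03/14/Site12/S0042.csv")
fields = match.groupdict() if match else None
print(fields)
```

Even for this simple layout, the pattern is noticeably harder to read than the naming convention it describes.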
Path templates are a friendlier alternative. A path template is a string that
consists of variable names and delimiters. A variable name is any contiguous
run of alphanumeric characters (optionally, with a trailing ?
character); delimiters are everything else.
For example:
Year-Month/Day/FirstName_MiddleInitial?_LastName.ext
In this example, Year, Month, Day, FirstName, MiddleInitial, LastName, and
ext are variable names. All of the dash, slash, underscore, and period
characters between them are considered delimiters.
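The split into variable names and delimiters can be sketched in a few lines. This is a Python illustration of the rule as described here, not dirdf's actual parser; the function name split_template is hypothetical.

```python
import re

def split_template(template):
    """Split a path template into ("variable", text) and ("delimiter", text) tokens.

    A variable is a run of alphanumeric characters, optionally followed by
    a single "?"; anything else is treated as a delimiter.
    """
    tokens = []
    for m in re.finditer(r"([A-Za-z0-9]+\??)|([^A-Za-z0-9]+)", template):
        if m.group(1) is not None:
            tokens.append(("variable", m.group(1)))
        else:
            tokens.append(("delimiter", m.group(2)))
    return tokens

tokens = split_template("Year-Month/Day/FirstName_MiddleInitial?_LastName.ext")
variables = [text for kind, text in tokens if kind == "variable"]
delimiters = [text for kind, text in tokens if kind == "delimiter"]
print(variables)
print(delimiters)
```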
When parsed, this template will match each variable to any number of
non-slash characters, up until the next delimiter. (Slash will never be
considered part of a variable match, as we consider it the path separator.)
The trailing question mark makes MiddleInitial? optional; both its value and
its preceding delimiter (_ in this case) can be omitted from target paths, in
which case the resulting value for that variable will be NA (or in some edge
cases, "").
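One way to picture these matching rules is to translate a template into a regular expression. The sketch below is in Python and is only an assumption about how such a translation could work, not dirdf's implementation. The helper names template_to_regex and parse_path are hypothetical, Python's None stands in for NA, and the "" edge cases are not modeled. It also assumes every variable name is a valid Python group name, so it must not start with a digit.

```python
import re

def template_to_regex(template):
    """Translate a path template into a compiled regex (illustrative only)."""
    parts = []
    pending_delim = ""  # escaped delimiter seen since the previous variable
    for m in re.finditer(r"([A-Za-z0-9]+)(\?)?|([^A-Za-z0-9]+)", template):
        name, optional, delim = m.group(1), m.group(2), m.group(3)
        if delim is not None:
            pending_delim = re.escape(delim)
            continue
        # A variable matches non-slash characters, lazily, so that it
        # stops at the next delimiter.
        capture = "(?P<%s>[^/]*?)" % name
        if optional:
            # An optional variable is skipped together with its preceding delimiter.
            parts.append("(?:%s%s)?" % (pending_delim, capture))
        else:
            parts.append(pending_delim + capture)
        pending_delim = ""
    parts.append(pending_delim)  # any trailing delimiter
    return re.compile("".join(parts))

def parse_path(template, path):
    """Return a dict of variable values, or None if the path does not match."""
    m = template_to_regex(template).fullmatch(path)
    return None if m is None else m.groupdict()

TEMPLATE = "Year-Month/Day/FirstName_MiddleInitial?_LastName.ext"
with_middle = parse_path(TEMPLATE, "2023-04/17/John_Q_Smith.txt")
without_middle = parse_path(TEMPLATE, "2023-04/17/John_Smith.txt")
print(with_middle)
print(without_middle)
```

In the second path the middle initial and its leading underscore are both absent, so MiddleInitial comes back as None, which plays the role of NA in this sketch.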