Extract metadata from a group of packets. This is an experimental high-level function for interacting with the metadata in a way that we hope will be useful. We will expand it over time based on the feedback we get, so let us know what you think. See Details for how to use this.
orderly_metadata_extract(
  expr = NULL,
  name = NULL,
  location = NULL,
  allow_remote = NULL,
  fetch_metadata = FALSE,
  extract = NULL,
  options = NULL,
  root = NULL
)
A data.frame, the columns of which vary based on the names of
extract; see Details for more information.
expr: The query expression. A NULL expression matches everything.
name: Optionally, the name of the packet to scope the query on. This
will be intersected with the scope argument and is shorthand for
scope = list(name = "name")
location: Optional vector of locations to pull from. We might in future expand this to allow wildcards or exceptions.
allow_remote: Logical, indicating if we should allow packets
to be found that are not currently unpacked (i.e., are known
only to a location that we have metadata from). If this is
TRUE, then in conjunction with orderly_dependency() you might
pull a large quantity of data. The default is NULL, which is
treated as TRUE if remote locations are listed explicitly as a
character vector in the location argument, or if you have
specified fetch_metadata = TRUE, and as FALSE otherwise.
fetch_metadata: Logical, indicating if we should pull
metadata immediately before the search. If location is given,
then we will pass this through to
orderly_location_fetch_metadata() to filter locations
to update. If pulling many packets in sequence, you will want
to set this option to FALSE after the first pull, otherwise
it will update the metadata between every packet, which will be
needlessly slow.
extract: A character vector of columns to extract, possibly named. See Details for the format.
options: DEPRECATED. Please don't use this any more, and
instead use the arguments location, allow_remote and
fetch_metadata directly.
root: The path to the root directory, or NULL (the
default) to search for one from the current working
directory. This function does not require that the directory is
configured for orderly, and can be any outpack root (see
orderly_init() for details).
Within custom.orderly, additional fields can be extracted. The
format of this is subject to change, both in the stored metadata
and schema (in the short term) and in the way we deserialise it.
It is probably best not to rely on this right now, and we will
expand this section once you can rely on it.
Extracting data from outpack metadata is challenging to do in a
way that works in data structures familiar to R users, because it
is naturally tree structured, and because not all metadata may be
present in all packets (e.g., a packet that does not depend on
another will not have a dependency section, and one that was run
in a context without git will not have git metadata). If you just
want the raw tree-structured data, you can always use
orderly_metadata() to load the full metadata for any
packet (even one that is not currently available on your computer
and is only known about), and the structure of the data will
remain fairly constant across orderly versions.
However, sometimes we want to extract data in order to ask specific questions like:
what parameter combinations are available across a range of packets?
when were a particular set of packets used?
what files did these packets produce?
Later we'd like to ask even more complex questions like:
at what version did the file graph.png change?
what inputs changed between these versions?
...but being able to answer these questions requires a similar approach to interrogating metadata across a range of packets.
The orderly_metadata_extract function aims to simplify the
process of pulling out bits of metadata and arranging it into a
data.frame (of sorts) for you. It has a little mini-language in
the extract argument for doing some simple rewriting of results,
but you can always do this yourself.
In order to use this function you need to know what metadata are available; we will expand the vignette with more worked examples to make this easier to understand. The function works on top-level keys, which are:
id: the packet id (this is always returned)
name: the packet name
parameters: a key-value pair of values, with string keys and atomic values. There is no guarantee about presence of keys between packets, or their types.
time: a key-value pair of times, with string keys and time values (see DateTimeClasses; these are stored as seconds since 1970 in the actual metadata). At present start and end are always present.
files: files present in each packet. This is a data.frame (per packet), each with columns path (relative), size (in bytes) and hash.
depends: dependencies used by each packet. This is a data.frame (per packet), each with columns packet (id), query (string, used to find packet) and files (another data.frame with columns there and here corresponding to filenames upstream and in this packet, respectively).
git: either metadata about the state of git or null. If given then sha and branch are strings, while url is an array of strings/character vector (can have zero, one or more elements).
session: some information about the session that the packet was run in (this is unstandardised, and even the orderly version may change)
custom: additional metadata added by its respective engine. For packets run by orderly, there will be an orderly field here, which is itself a list:
    artefacts: A data.frame with artefact information, containing columns description (a string) and paths (a list column of paths).
    shared: A data.frame of the copied shared resources with their original name (there) and name as copied into the packet (here).
    role: A data.frame of identified roles of files, with columns path and role.
    description: A list of information from orderly_description() with human-readable descriptions and tags.
    session: A list of information about the session as run, with a list platform containing information about the platform (R version as version, operating system as os and system name as system) and packages containing columns package, version and attached.
The nesting here makes providing a universally useful data format
difficult; if considering files we have a data.frame with a
files column, which is a list of data.frames; similar
nestedness applies to depends and the orderly custom
data. However, you should be able to fairly easily process the
data into the format you need it in.
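As a rough sketch of that kind of post-processing, the base R code below stacks a files list column into one long data.frame with the packet id repeated on each row. The meta object is built by hand to mimic the shape described above; the ids, paths, sizes and hashes are invented, and flatten_files is a hypothetical helper, not part of orderly.

```r
# Hand-built stand-in for the result of extracting "name" and "files":
# one row per packet, with `files` as a list column of data.frames.
# All values here are invented for illustration.
meta <- data.frame(
  id = c("20240101-000000-aaaaaaaa", "20240102-000000-bbbbbbbb"),
  name = c("data", "parameters"),
  stringsAsFactors = FALSE)
meta$files <- list(
  data.frame(path = c("data.rds", "orderly.R"),
             size = c(1024, 210),
             hash = c("sha256:aaa", "sha256:bbb"),
             stringsAsFactors = FALSE),
  data.frame(path = "orderly.R",
             size = 305,
             hash = "sha256:ccc",
             stringsAsFactors = FALSE))

# Hypothetical helper: stack the per-packet data.frames, repeating the
# packet id on every row so each file can be traced back to its packet.
flatten_files <- function(meta) {
  rows <- Map(function(id, files) {
    data.frame(id = rep(id, nrow(files)), files, stringsAsFactors = FALSE)
  }, meta$id, meta$files)
  out <- do.call(rbind, unname(rows))
  rownames(out) <- NULL
  out
}

files_long <- flatten_files(meta)
files_long
```

The same pattern should carry over to depends, with the caveat that its files column is itself a nested data.frame and would need a second pass.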
The simplest extraction uses names of top-level keys:
extract = c("name", "parameters", "files")
This creates a data.frame with columns corresponding to these
keys, one row per packet. Because name is always a string, it
will be a character vector, but because parameters and files
are more complex, these will be list columns.
You must not provide id; it is always returned, as a character
vector, and is always the first column. If your extraction could
possibly return data from locations (i.e., you have
allow_remote = TRUE or have given a value for location) then we
add a logical column local which indicates if the packet is local
to your archive, meaning that you have all the files from it locally.
You can rename the columns by providing a name to entries within
extract, for example:
extract = c("name", pars = "parameters", "files")
is the same as above, except that the parameters column has
been renamed pars.
More interestingly, we can index into a structure like
parameters; suppose we want the value of the parameter x, we
could write:
extract = c(x = "parameters.x")
which is allowed because for each packet the parameters element is a list.
However, we do not know what type x is (and it might vary
between packets). We can add that information ourselves though and write:
extract = c(x = "parameters.x is number")
to create a numeric column. If any packet has a value of x that
is non-numeric, your call to orderly_metadata_extract will fail
with an error, and if a packet lacks a value of x, a missing
value of the appropriate type will be added.
Note that this does not do any coercion to number; it will error
if a non-NULL non-numeric value is found. Valid types for use
with is <type> are boolean, number and string (note that
these differ slightly from R's names because we want to emphasise
that these are scalar quantities; also note that there is no
integer here as this may produce unexpected errors with
integer-like numeric values). You can also use list but this is
the default. Things in the schema that are known to be scalar
atomics (such as name) will be automatically simplified.
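The behaviour described for "is number" can be mimicked in a few lines of base R, which may help when reasoning about what to expect. This is only an illustration of the documented semantics (typed missing value when the key is absent, error on a non-numeric value, no coercion from strings); extract_number is a hypothetical helper and is not how orderly implements it.

```r
# Hypothetical helper mimicking the documented semantics of
# "parameters.<key> is number". NOT orderly's implementation.
extract_number <- function(parameters, key) {
  vapply(parameters, function(p) {
    value <- p[[key]]
    if (is.null(value)) {
      return(NA_real_)  # packet lacks the key: typed missing value
    }
    if (!is.numeric(value) || length(value) != 1) {
      stop(sprintf("Expected a scalar number for '%s'", key))
    }
    as.numeric(value)   # integer-like values are accepted as numbers
  }, numeric(1))
}

# One parameter list per packet; the third packet has no `x`.
parameters <- list(list(x = 1), list(x = 2.5), list(y = "a"))
x <- extract_number(parameters, "x")
x

# A string value is not coerced; it is an error.
bad <- try(extract_number(list(list(x = "1")), "x"), silent = TRUE)
inherits(bad, "try-error")
```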
You can index into the array-valued elements (files and
depends) in the same way as for the object-valued elements:
extract = c(file_path = "files.path", file_hash = "files.hash")
would get you a list column of file names per packet and another
of hashes, but this is probably less useful than the data.frame
you'd get from extracting just files because you no longer have
the hash information aligned.
You can index fairly deeply; it should be possible to get the orderly "display name" with:
extract = c(display = "custom.orderly.description.display is string")
If the path you need to extract has a dot in it (most likely a
package name for a plugin, such as custom.orderly.db) you need
to escape the dot with a backslash (so, custom.orderly\.db). In
R code you will probably need to write two backslashes, or use a
raw string (in recent versions of R).
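The two spellings mentioned above produce the same string, which can be checked directly. Raw string constants need R 4.0.0 or later; the column name db used in the extract vector is only an example.

```r
# A doubled backslash in an ordinary string literal...
escaped <- "custom.orderly\\.db"
# ...and a single backslash in a raw string (R >= 4.0.0).
raw <- r"(custom.orderly\.db)"

# Both hold the same characters: custom.orderly\.db
identical(escaped, raw)
cat(escaped, "\n")

# Either can then be used in `extract` (the name `db` is illustrative):
extract <- c(db = escaped)
```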
path <- orderly_example()
# Generate a bunch of packets:
suppressMessages({
  orderly_run("data", echo = FALSE, root = path)
  for (n in c(2, 4, 6, 8)) {
    orderly_run("parameters", list(max_cyl = n), echo = FALSE, root = path)
  }
})
# Without a query, we get a summary over all packets; this will
# often be too much:
orderly_metadata_extract(root = path)
# Pass in a query to limit things:
meta <- orderly_metadata_extract(quote(name == "parameters"), root = path)
meta
# The parameters are present as a list column:
meta$parameters
# You can also lift values from the parameters into columns of their own:
orderly_metadata_extract(
  quote(name == "parameters"),
  extract = c(max_cyl = "parameters.max_cyl is number"),
  root = path)