Automatically Retrieve Multidimensional Distributed Data Sets
Tool to automatically fetch, transform and arrange subsets of multidimensional data sets (collections of files) stored in local and/or remote file systems or servers, using multicore capabilities where possible. The tool provides an interface to perceive a collection of data sets as a single large multidimensional data array, and enables the user to request for automatic retrieval, processing and arrangement of subsets of the large array. Wrapper functions to add support for custom file formats can be plugged in/out, making the tool suitable for any research field where large multidimensional data sets are involved.
Automatically retrieve multidimensional distributed data sets in R
The first step in data analysis made easy
Data retrieval and alignment is the first step in data analysis in any field and is often highly complex and time-consuming, especially nowadays in the era of Big Data, where large multidimensional data sets from diverse sources need to be combined and processed. Taking subsets of these datasets (Divide) to then be processed efficiently (and Conquer) becomes an indispensable technique.
startR (Subset, Transform, Arrange and ReTrieve multidimensional subsets in R) is an R project started at BSC with the aim to develop a tool that allows the user to automatically retrieve, homogenize and align subsets of multidimensional distributed data sets. It is an open source project that is open to external collaboration and funding, and will continuously evolve to support as many data set formats as possible while maximizing its efficiency.
What it does
startR, through its main function
Start(), provides an interface that allows to perceive and access one or a collection of data sets as if they all were part of a large multidimensional array. Indices or bounds can be specified for each of the dimensions in order to crop the whole array into a smaller sub-array.
Start() will perform the required operations to fetch the corresponding regions of the corresponding files (potentially distributed over various remote servers) and arrange them into a local R multidimensional array. By default, as many cores as available locally are used in this procedure.
Usually, in retrieval processes previous to multidimensional data analysis, it is needed to apply a set of common transformations, pre-processes or reorderings to the data as it comes in.
Start() accepts user-defined transformation or reordering functions to be applied for such purposes.
Start() is not bound to a specific file format. Interface functions to custom file formats can be provided for
Start() to read them. As of April 2017
startR includes interface functions to the following file formats:
Metadata and auxilliary data is also preserved and arranged by
Start() in the measure that it is retrieved by the interface functions for a specific file format.
How to use it
Start() is the only function in the package intended for direct use. This function has a rather steep learning curve but it makes the retrieval process straightforward and highly customizable. The header looks as follows:
Start(..., return_vars = NULL, synonims = NULL, file_opener = NcOpener, file_var_reader = NcVarReader, file_dim_reader = NcDimReader, file_data_reader = NcDataReader, file_closer = NcCloser, transform = NULL, transform_params = NULL, transform_vars = NULL, transform_extra_cells = 0, apply_indices_after_transform = FALSE, pattern_dims = NULL, metadata_dims = NULL, selector_checker = SelectorChecker, num_procs = NULL, silent = FALSE)
Usually most of the required information will be provided through the
... parameter and only a few of the parameters in the function header will be used.
The parameters can be grouped as follows:
- The parameters provided via
..., with information on the structure of the datasets which to take data from, information on which dimensions they have, which indices to take from each of the dimensions, and how to reorder each of the dimensions if needed.
synonimswith information to identify dimensions and variablse across the multiple files/datasets (useful if they don't follow a homogeneous naming convention).
return_varswith information on auxilliary data to be retrieved.
file_*parameters, which allow to specify interface functions to the desired file format.
transform_*parameters, which allow to specify a transformation to be applied to the data as it comes in.
- Other general configuration parameters.
Examples / Tutorial
Next, an explanation on how to use
Start(), starting from a simple example and progressively adding complexity.
Retrieving a single entire file
Start() can be used in the simplest situation to take a subset of data from a single file.
Let's imagine we have a file named 'file.nc' which contains an array of data (a single variable) with the following dimensions:
c(a = 5, time = 12, b = 100, c = 100)
It is then possible to take the entire data file as follows:
data <- Start(dataset = '/path/to/file.nc', a = 'all', time = 'all', b = 'all', c = 'all') * Exploring files... This will take a variable amount of time depending * on the issued request and the performance of the file server... * Detected dimension sizes: * dataset: 1 * a: 5 * time: 3 * b: 100 * c: 100 * Total size of requested data: * 1 x 5 x 3 x 100 x 100 x 8 bytes = 1.2 Mb * If the size of the requested data is close to or above the free shared * RAM memory, R may crash. * If the size of the requested data is close to or above the half of the * free RAM memory, R may crash. * Will now proceed to read and process 1 data files: * /path/to/file.nc * Loading... This may take several minutes... starting worker pid=30733 on localhost:11276 at 15:30:39.997 starting worker pid=30749 on localhost:11276 at 15:30:40.180 starting worker pid=30765 on localhost:11276 at 15:30:40.382 starting worker pid=30781 on localhost:11276 at 15:30:40.557 * Successfully retrieved data.
None of the provided parameters to
Start() are in the set of known parameters (
synonims, ...). Each unknown parameter is interpreted as a specification of a dimension of the data set you want to retrieve data from, where the name of the parameter matches the name of the dimension and the value associated expresses the indices you want to retrieve from the dimension. In this example, we have defined that the data set has 5 dimensions, with the names 'dataset', 'a', 'time', 'b', and 'c', and we want to take all indices of each of these dimensions.
It is mandatory to make
Start() aware of all the existing dimensions in the file (unless they are of length 1).
Note that the file to read is considered to be an element that belongs to the 'dataset' dimension (could be any other name).
Start() automatically looks for at least one of the dimension specifications with an expression pointing to a set of files or URLs.
The returned result looks as follows:
str(data) List of 5 $ Data : num [1, 1:5, 1:3, 1:100, 1:100] 1 2 3 4 5 6 7 8 ... $ Variables :List of 2 ..$ common : NULL ..$ dataset1: NULL $ Files : chr [1(1d)] "/path/to/file.nc" $ NotFoundFiles: NULL $ FileSelectors:List of 1 ..$ dataset1:List of 1 .. ..$ dataset:List of 1 .. .. ..$ : chr "dataset1"
In this case, most of the returned information is empty.
These are the dimensions of the actual data array:
dim(data$Data) dataset a time b c 1 5 3 100 100
Reordering array dimensions
If the dimensions are specified and requested in a different order, the resulting array will be arranged following the same order:
data <- Start(dataset = '/path/to/file.nc', a = 'all', b = 'all', c = 'all', time = 'all') dim(data$Data) dataset a b c time 1 5 100 100 3
Retrieving multiple files
Assuming we have the files
/path/to/ |-> group_a/ | |-> file_1.nc | |-> file_2.nc | |-> file_3.nc |-> group_b/ |-> file_1.nc |-> file_2.nc |-> file_3.nc
We can load them as follows:
data <- Start(dataset = '/path/to/group_$group$/file_$number$.nc', group = 'all', number = 'all', a = 'all', b = 'all', c = 'all', time = 'all') dim(data$Data) dataset group number a b c time 1 2 3 5 100 100 3
Path pattern tags that depend on other tags
Assuming we have the files
/path/to/ |-> group_a/ | |-> file_1.nc | |-> file_2.nc | |-> file_3.nc |-> group_b/ |-> file_4.nc |-> file_5.nc |-> file_6.nc
We can load them as follows:
data <- Start(dataset = '/path/to/group_$group$/file_$number$.nc', group = 'all', number = 'all', number_depends = 'group', a = 'all', b = 'all', c = 'all', time = 'all') dim(data$Data) dataset group number a b c time 1 2 3 5 100 100 3
Dimensions inside the files that go across files
Assuming the 'time' dimension goes across all the 'number' files in a group. We would like to select time indices e.g. 3 to 7 without
Start() crashing because of indices out of bounds. We can do so as follows:
data <- Start(dataset = '/path/to/group_$group$/file_$number$.nc', group = 'all', number = 'all', number_depends = 'group', a = 'all', b = 'all', c = 'all', time = indices(list(2, 5)), time_across = 'number') dim(data$Data) dataset group number a b c time 1 2 3 5 100 100 5
In this case, the dimension 'number' is of length 3 because we have retrieved data from 3 different 'number's: the 'time' index 3 from 'number' 1, the 'time' indices 4 to 6 from 'number' 2 and the 'time' index 7 from 'number 3. The non-taken indices from a 'number' are filled in with NA in the returned array.
Taking specific indices of a dimension
data <- Start(dataset = '/path/to/file.nc', a = indices(c(1, 3)), b = 'all', c = indices(list(10, 20)), time = 'all') dim(data$Data) dataset a b c time 1 2 100 11 3
Taking specific indices of a dimension in function of associated values
data <- Start(dataset = '/path/to/file.nc', a = c('value1', 'value2', 'value5'), a_var = 'x', b = 'all', c = indices(list(10, 20)), time = 'all', return_vars = list(a_var = NULL)) dim(data$Data) dataset a b c time 1 3 100 11 3
Taking data from NetCDF files with multiple variables
Now let us imagine the data array in the file has an extra dimension, 'var', of length 2, and a variable 'var_names' with the names of the variables at each position along the dimension 'var'. The names of the 2 variables are 'x' and 'y'. We would like being able to tell
Start() to take only the variable 'y', regardless of its position along the 'var' dimension. This can be achieved by defining the 'var' dimension with more detail, using the '*_var' parameters:
data <- Start(dataset = '/path/to/file.nc', var = 'y', var_var = 'var_names', a = 'all', b = 'all', c = 'all', time = 'all', return_vars = list(var_names = NULL)) dim(data$Data) dataset var a b c time 1 1 5 100 100 3
Taking specific indices of a dimension in function of associated values, with tolerance
Dimension and variable name synonims
Reordering inner dims with associated values
Defining interface functions to a custom file format
Explanation of outputs
Other configuration parameters
Functions in startR
|Sort||Sort Dimension Reorder for 'startR'|
|Start||Automatically Retrieve Multidimensional Distributed Data Sets|
|NcCloser||NetCDF File Closer for 'startR'|
|NcDataReader||NetCDF File Data Reader for 'startR'|
|NcVarReader||NetCDF Variable Reader for 'startR'|
|SelectorChecker||Default Selector Checker for 'startR'|
|CDORemapper||CDO Remap Data Transformation for 'startR'|
|CircularSort||Circular Sort Dimension Reorder for 'startR'|
|NcDimReader||NetCDF Dimension Reader for 'startR'|
|NcOpener||NetCDF File Opener for 'startR'|
|Subset||Subset a Data Array|
|indices||Mark Dimension Selectors as Indices|
Last month downloads
|Packaged||2017-04-22 02:58:45 UTC; nmanubens|
|Date/Publication||2017-04-22 04:26:58 UTC|
Include our badge in your README