readinandsketch: Create a sketch from a file containing a (very large) data set

Description

This function calculates a sketch of a data set stored in a file. The sketch can be used to perform approximate frequentist or Bayesian linear regression. The sketch is a substitute data set with the same number of variables but a much smaller number of observations. An analysis based on the sketch is much faster, and its results are provably close to the results on the original data set. The file is read in sequentially, which makes it possible to sketch data sets that are too large to be loaded into R completely.

Usage

readinandsketch(file, nrows = 50000, epsilon = NULL, obs_sketch = NULL,
                affine = TRUE, method = c("C", "S", "R"), header = FALSE,
                sep = "", col.names, skip = 0, warn = FALSE, ...)

Arguments

file

The name of a file that contains the (large) data set. The data set should consist of both the design matrix X and the vector Y, which contains the values of the dependent variable. The order of the columns is arbitrary.

nrows

A positive integer that controls how many rows are read into memory per iteration. This differs from its use in read.table, because the remaining rows are read in subsequent iterations. For that reason, nrows has to be larger than 0.

epsilon

Approximation error of the sketch (see Details). Only one of epsilon and obs_sketch can be used. If both are specified, epsilon is currently used and obs_sketch is ignored. Possible values for epsilon lie in the interval (0, 0.5].

obs_sketch

Desired number of observations of the sketch (see Details). Only one of epsilon and obs_sketch can be used. If both are specified, epsilon is currently used and obs_sketch is ignored.

affine

Boolean. Choose TRUE if your model includes an intercept term and your data set does not contain a corresponding column. The corresponding column will be added as the new left-most column of the sketch. If you do not want an added intercept term, choose FALSE.

method

The sketching method to be used. Possible values are "R", "S", and "C". See Details.

header

Boolean. If TRUE, the first line of the file is used as variable names; see read.table.

sep

The field separator character, see read.table.

col.names

An optional vector containing the variable names, see read.table.

skip

An integer giving the number of lines of the data file to skip before beginning to read data; see read.table.

warn

Boolean. If TRUE, show a warning if the sketch will result in a matrix of larger dimension than the original matrix.

...

Additional arguments that will be passed on to read.table.

Value

Returns a data frame, which contains both the sketched data frame SX and the sketched vector SY. The order of the columns is the same as in the original data set. If affine is TRUE, the corresponding intercept column is added as the new left-most column of the sketch. Please omit the standard intercept term from any models based on sketches in that case.
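
The advice to omit the standard intercept term can be illustrated with base R alone. The following is a minimal sketch under stated assumptions: the data are simulated, the explicit column of ones only stands in for the intercept column of a sketch (in a real sketch that column is transformed like every other column), and the column name intercept is hypothetical.

```r
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 + 3 * dat$x1 - dat$x2 + rnorm(n, sd = 0.1)

# Stand-in for a result obtained with affine = TRUE: an explicit intercept
# column is the new left-most column. The name "intercept" is an assumption.
with_icpt <- cbind(intercept = 1, dat)

# Suppress R's automatic intercept with "0 +"; the explicit column takes over.
fit_explicit <- lm(y ~ 0 + ., data = with_icpt)

# Reference fit on the data without the explicit column.
fit_standard <- lm(y ~ x1 + x2, data = dat)

# Both fits estimate the same three coefficients.
all.equal(unname(coef(fit_explicit)), unname(coef(fit_standard)))
```

Leaving the automatic intercept in place as well would put two intercept columns into the model, which is why the formula drops it.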

Details

This function reads a data set iteratively and calculates or updates a sketch of the data read so far. This sketch can then be used for frequentist or Bayesian linear regression, especially on large data sets. The functionality used here is the same as in sketch, but readinandsketch can also handle data sets that are too large to be loaded into working memory.
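
Why a sketch can be updated chunk by chunk is easiest to see for a linear sketching map. The following base-R illustration assumes a dense matrix of random signs as the sketching matrix; it is not the package's implementation, and the names S, k and chunk_size are hypothetical. It shows that summing the contributions of row chunks gives the same result as sketching the whole matrix at once.

```r
set.seed(42)
n <- 1000        # observations in the full data set
d <- 4           # number of variables
k <- 50          # observations in the sketch
chunk_size <- 300

X <- matrix(rnorm(n * d), nrow = n, ncol = d)

# Hypothetical sketching matrix: k x n random signs, scaled by 1/sqrt(k).
# It is stored in full here only to allow the comparison below; a streaming
# implementation would generate the columns for each chunk on the fly.
S <- matrix(sample(c(-1, 1), k * n, replace = TRUE), nrow = k) / sqrt(k)

# Sketch computed in one step on the complete matrix.
sketch_full <- S %*% X

# Sketch accumulated over chunks of rows.
sketch_chunked <- matrix(0, nrow = k, ncol = d)
for (s in seq(1, n, by = chunk_size)) {
  idx <- s:min(s + chunk_size - 1, n)
  sketch_chunked <- sketch_chunked +
    S[, idx, drop = FALSE] %*% X[idx, , drop = FALSE]
}

all.equal(sketch_full, sketch_chunked)
```

Only the current chunk and the k x d running sum have to be held in memory, independent of n.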

In principle, nrows can be any positive integer value. If using the methods "R" or "C", small integer values will only lead to an increased running time. If using method "S", however, nrows has to be at least as large as the number of observations \(k\) in the sketch, otherwise there will be an error.

If the number of observations in the data set is a multiple of nrows, there will be one additional empty run, where no data is read and a sketch of an empty data set is calculated. This does not influence the resulting sketch.
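
The extra run can be reproduced with a plain read.table loop on an open connection. This is a sketch under assumptions: the stopping rule (stop as soon as a run returns fewer than nrows rows) is inferred from the description above rather than taken from the package source, and count_read_runs is a hypothetical helper.

```r
# Count how many read runs a chunked reader needs, and how many rows the
# last run returned. Hypothetical helper, for illustration only.
count_read_runs <- function(n_rows, chunk_size) {
  tmp <- tempfile(fileext = ".txt")
  write.table(data.frame(x = seq_len(n_rows), y = seq_len(n_rows)),
              tmp, sep = "\t", row.names = FALSE, col.names = FALSE)
  con <- file(tmp, open = "r")  # an open connection keeps its read position
  runs <- 0L
  repeat {
    runs <- runs + 1L
    chunk <- tryCatch(
      read.table(con, nrows = chunk_size, sep = "\t", header = FALSE),
      error = function(e) NULL  # read.table stops if no lines are left
    )
    n_read <- if (is.null(chunk)) 0L else nrow(chunk)
    # A short or empty chunk signals the end of the file.
    if (n_read < chunk_size) break
  }
  close(con)
  unlink(tmp)
  list(runs = runs, last_chunk_rows = n_read)
}

count_read_runs(20, 10)  # row count is a multiple of the chunk size
count_read_runs(25, 10)  # row count is not a multiple of the chunk size
```

With 20 rows and a chunk size of 10, the reader cannot know after the second full chunk that the file is exhausted, so a third run is needed and finds nothing. With 25 rows, the third run returns a short chunk and reading stops there.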

References

Geppert, L., Ickstadt, K., Munteanu, A., Quedenfeld, J., & Sohler, C. (2017). Random projections for Bayesian regression. Statistics and Computing, 27(1), 79-101. doi:10.1007/s11222-015-9608-z

Examples

# NOT RUN {
  # Sketch a tab-separated file with a header row using method "R"
  sketchR = readinandsketch(file.choose(), header = TRUE, sep = '\t',
                            nrows = 10000, epsilon = 0.1, method = 'R')
# }