This function calculates a sketch of a file. The sketch can be used to perform approximate frequentist or Bayesian linear regression. The sketch is a substitute data set of the same dimension but with a much smaller number of observations. An analysis based on the sketch is much faster, and its results are provably close to the results on the original data set. The file is read sequentially, which makes it possible to sketch data sets that are too large to be loaded into R completely.
readinandsketch(file, nrows = 50000, epsilon = NULL, obs_sketch = NULL,
                affine = TRUE, method = c("C", "S", "R"), header = FALSE,
                sep = "", col.names, skip = 0, warn = FALSE, ...)
file: The name of a file that contains the (large) data set. The data set should contain both the design matrix X and the vector Y, which holds the values of the dependent variable. The order of the columns is arbitrary.
nrows: A positive integer that controls how many rows are read into memory per iteration. This differs from its use in read.table, because the remaining rows are read in subsequent iterations. For that reason, nrows has to be larger than 0.
epsilon: Approximation error of the sketch (see Details). Only one of epsilon and obs_sketch can be used; if both are specified, epsilon is currently used and obs_sketch is ignored. Possible values for epsilon lie in the interval (0, 0.5].
obs_sketch: Desired number of observations of the sketch (see Details). Only one of epsilon and obs_sketch can be used; if both are specified, epsilon is currently used and obs_sketch is ignored.
affine: Boolean. Choose TRUE if your model includes an intercept term and your data set does not contain a corresponding column. The corresponding column is then added as the new left-most column of the sketch. If you do not want an added intercept term, choose FALSE.
method: The sketching method to be used. Possible values are "R", "S", and "C". See Details.
header: Boolean. If TRUE, the first line of the file is used as variable names, see read.table.
sep: The field separator character, see read.table.
col.names: An optional vector containing the variable names, see read.table.
skip: Integer. The number of lines of the data file to skip before beginning to read data, see read.table.
warn: Boolean. If TRUE, a warning is shown if the sketch would result in a matrix of larger dimension than the original matrix.
...: Additional arguments that are passed on to read.table.
Returns a data frame, which contains both the sketched data frame SX and the sketched vector SY. The order of the columns is the same as in the original data set. If affine is TRUE, the corresponding intercept column is added as the new left-most column of the sketch. Please omit the standard intercept term from any models based on sketches in that case.
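As a rough illustration of the advice to omit the standard intercept term, the following base R snippet fits a model on a stand-in data frame that already carries an explicit column of ones. The data frame and the column name `intercept` are assumptions made for this example; they are not the output of readinandsketch, and the name of the added column in a real sketch may differ.

```r
# Stand-in for a sketch returned with affine = TRUE: the left-most
# column is an explicit intercept column (hypothetical name "intercept").
set.seed(7)
x1 <- rnorm(20)
x2 <- rnorm(20)
y <- 1 + 2 * x1 - x2 + rnorm(20, sd = 0.1)
stand_in <- data.frame(intercept = 1, x1 = x1, x2 = x2, y = y)

# "0 +" drops R's automatic intercept, so only the explicit column is used.
fit <- lm(y ~ 0 + ., data = stand_in)

# Reference fit with the automatic intercept and without the explicit column.
ref <- lm(y ~ x1 + x2, data = stand_in)

coef(fit)
coef(ref)
```

Both fits estimate the same three coefficients. Leaving out `0 +` in the first call would give the model two intercept columns, and lm would report NA for one of them.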
This function reads a data set iteratively and updates a sketch of the data read so far. The sketch can then be used for frequentist or Bayesian linear regression, especially on large data sets. The functionality is the same as in sketch, but readinandsketch can also handle data sets that are too large to be loaded into working memory.
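The idea of updating a sketch chunk by chunk can be illustrated with a generic linear random projection. The snippet below is only a sketch under stated assumptions: it uses a projection with random +1/-1 entries, keeps the full data in memory for simplicity, and is not the package's internal code. All names (`A`, `k`, `chunk_size`, `sketch_mat`) are hypothetical.

```r
set.seed(42)
n <- 1000   # observations in the original data
d <- 3      # predictors
k <- 50     # observations in the sketch

# Original data: design matrix X and response Y side by side.
A <- cbind(matrix(rnorm(n * d), n, d), rnorm(n))

chunk_size <- 250
sketch_mat <- matrix(0, k, ncol(A))

for (start in seq(1, n, by = chunk_size)) {
  rows <- start:min(start + chunk_size - 1, n)
  # Random +1/-1 projection for this chunk, scaled by 1 / sqrt(k).
  S_chunk <- matrix(sample(c(-1, 1), k * length(rows), replace = TRUE),
                    nrow = k) / sqrt(k)
  # The projection is linear, so chunk contributions simply add up.
  sketch_mat <- sketch_mat + S_chunk %*% A[rows, , drop = FALSE]
}

dim(sketch_mat)

# Least squares on the sketch versus on the full data.
lm.fit(sketch_mat[, 1:d], sketch_mat[, d + 1])$coefficients
lm.fit(A[, 1:d], A[, d + 1])$coefficients
```

Because only the current chunk and the k-row sketch are needed at any time, the same loop would work when each chunk comes from a file instead of an in-memory matrix.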
In principle, nrows can be any positive integer value. With the methods "R" or "C", small values only lead to an increased running time. With method "S", however, nrows has to be at least as large as the number of observations \(k\) in the sketch; otherwise an error occurs.
If the number of observations in the data set is a multiple of nrows, there will be one additional empty run, where no data is read and a sketch of an empty data set is calculated. This does not influence the resulting sketch.
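The extra empty run can be reproduced with a small base R loop that reads a file in blocks of nrows lines from an open connection. This is a minimal sketch of chunked reading in general, assuming a whitespace-separated file without a header; the helper `read_in_chunks` is hypothetical and is not part of the package.

```r
# Write a small demo file with 10 rows and no header.
set.seed(1)
demo <- data.frame(x1 = rnorm(10), x2 = rnorm(10), y = rnorm(10))
path <- tempfile(fileext = ".txt")
write.table(demo, path, row.names = FALSE, col.names = FALSE)

# Hypothetical helper: returns the number of rows read in each iteration.
read_in_chunks <- function(path, nrows) {
  con <- file(path, open = "r")
  on.exit(close(con))
  chunk_sizes <- integer(0)
  repeat {
    chunk <- tryCatch(
      read.table(con, nrows = nrows, header = FALSE),
      error = function(e) NULL  # read.table stops when no lines remain
    )
    n_read <- if (is.null(chunk)) 0L else nrow(chunk)
    chunk_sizes <- c(chunk_sizes, n_read)
    # A short (or empty) chunk signals the end of the file.
    if (n_read < nrows) break
  }
  chunk_sizes
}

read_in_chunks(path, nrows = 5)  # 10 is a multiple of 5: one empty run at the end
read_in_chunks(path, nrows = 4)  # last chunk is short, no empty run
```

When the row count is a multiple of nrows, the reader cannot know that the file is exhausted until one further read returns nothing, which is why the final run is empty.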
Geppert, L., Ickstadt, K., Munteanu, A., Quedenfeld, J., & Sohler, C. (2017). Random projections for Bayesian regression. Statistics and Computing, 27(1), 79-101. doi:10.1007/s11222-015-9608-z
# NOT RUN {
sketchR = readinandsketch(file.choose(), header = TRUE, sep = '\t',
                          nrows = 10000, epsilon = 0.1, method = 'R')
# }