ScalablePCA: Perform Principal Component Analysis on a large data set

Description

Runs prcomp on subsamples of the rows of the data set and compiles the results for the first principal component.

Usage

ScalablePCA(x, filename = NULL, db = NULL, subsample = 10000, n.subsamples = 1000, ignore.cols, use.cols, return.sds = FALSE, progress.bar = FALSE)

Arguments

x

data.frame, data over which to run PCA.

filename

character, name of the file containing the data. This must be a tab-delimited file with a header row formatted per the default options for read.delim.

db

database connection object, connection to the table containing the data (NOT IMPLEMENTED).

subsample

numeric or logical. If an integer, the size of each subsample. If FALSE, runs PCA on the entire data set.

n.subsamples

numeric, number of subsamples.

ignore.cols

numeric, indices of columns not to include.

use.cols

numeric, indices of columns to use.

return.sds

logical, if TRUE return the standard deviations of each network's edge weights.

progress.bar

logical, if TRUE then progress in running subsamples will be shown.
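
As an illustration of the file format the filename argument expects, the following minimal sketch writes the iris data to a tab-delimited file with a header row and reads it back with the default options of read.delim. The temporary file and the variable names tmp and dat are assumptions made for this example; they are not part of the package.

```r
# Illustrative only: create a tab-delimited file with a header row,
# the layout that read.delim parses with its default options.
data(iris)
tmp <- tempfile(fileext = ".tsv")   # hypothetical file location
write.table(iris, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file back exactly as read.delim would by default.
dat <- read.delim(tmp)
str(dat)
```

A file prepared this way should be suitable to pass as filename, for example together with use.cols = 1:4 to skip the non-numeric Species column.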

Value

If return.sds is FALSE, returns a named vector of component weights for the first dimension of the principal component analysis (see the example for a comparison to prcomp). If return.sds is TRUE, returns a list with the following elements.

coefficients

named vector of the component weights for first dimension of principal component analysis (see example for comparison to prcomp).

sds

named vector of the standard deviations of each network's edge weights.

Details

Scales the function prcomp to data sets with an arbitrarily large number of rows by running prcomp on repeated subsamples of the rows.
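
The subsampling idea can be sketched in base R as follows. This is not the package's implementation: the helper name subsample_pc1, the sign-alignment step, and the choice to combine subsamples by averaging are all assumptions made for illustration, since the page does not state how the per-subsample results are compiled.

```r
# Hypothetical helper (not part of the package): estimate the first
# principal-component weights by averaging over random row subsamples.
subsample_pc1 <- function(x, subsample = 10, n.subsamples = 200, seed = 1) {
  set.seed(seed)
  x <- as.matrix(x)
  loadings <- matrix(NA_real_, nrow = n.subsamples, ncol = ncol(x),
                     dimnames = list(NULL, colnames(x)))
  for (i in seq_len(n.subsamples)) {
    rows <- sample(nrow(x), subsample)
    v <- prcomp(x[rows, , drop = FALSE],
                center = FALSE, scale. = FALSE)$rotation[, 1]
    # The sign of a principal component is arbitrary, so flip each vector
    # to point the same way as the first one before combining (assumption).
    if (i > 1 && sum(v * loadings[1, ]) < 0) v <- -v
    loadings[i, ] <- v
  }
  list(coefficients = colMeans(loadings),   # averaged weights
       sds = apply(loadings, 2, sd))        # spread across subsamples
}

data(iris)
res  <- subsample_pc1(iris[, 1:4])
full <- prcomp(iris[, 1:4], center = FALSE, scale. = FALSE)$rotation[, 1]
```

Comparing res$coefficients with full (up to an overall sign) gives a rough sense of how closely the subsampled estimate tracks prcomp on the complete data.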

References

https://github.com/shaptonstahl/

Examples


data(iris)        # provides example data
prcomp(iris[,1:4], center=FALSE, scale.=FALSE)$rotation[,1]
ScalablePCA(iris, subsample=10, use.cols=1:4)
ScalablePCA(iris, subsample=10, ignore.cols=5)
