big.PCA: PCA/Singular Value Decomposition for big.matrix

Description

At the time of submission there was no existing native method to conduct principle components analysis (PCA) on big.matrix objects. This function allows singular value decomposition (SVD) of very large matrices, very efficiently, versus the default method. The major speed advantages occur when the 'bigalgebra' package is installed, and when the argument for this function 'SVD'=TRUE. Regular PCA can be conducted using SVD=FALSE but it will be slower and the maximum matrix size able to produce a result, given memory limits, will be smaller. SVD is not exactly the same as PCA, but from my testing the components produced will correlate R>.9 with components of PCA on the same matrix. This function is not completely native to big.matrix objects so there is one step where the matrix submitted needs to be loaded into memory, so if your big.matrix object is larger than the allowed size of a standard R-matrix [[which is roughly 3GB; you can check using NCmisc::estimate.memory()], then this function will fail unless you set the option 'thin' to a percentage that, multiplied by the original matrix memory-load, is under 3GB. For large matrices in my applications, components produced with thinning are still highly correlated with components produced using the full dataset. For a breakdown of thinning methods, see the description for the function thin() for more information. Even with medium sized matrices, for instance 15,000 x 50,000 in size, this function is orders of magnitude faster than the standard R PCA functions, usually running in a matter of minutes, rather than hours or days in examples that I have tested, due to much better handling of memory for internal transpose and eigen operations by using the 'bigmemory' architecture.

Usage

big.PCA(bigMat, dir = getwd(), pcs.to.keep = 50, thin = FALSE,
  SVD = TRUE, LAP = FALSE, center = TRUE, save.pcs = FALSE,
  use.bigalgebra = TRUE, pcs.fn = "PCsEVsFromPCA.RData",
  return.loadings = FALSE, verbose = FALSE,
  delete.existing = getOption("deleteFileBacked"), ...)

Arguments

bigMat

a big.matrix object, or any argument accepted by get.big.matrix(), which includes paths to description files or even a standard matrix object.

dir

directory containing the filebacked.big.matrix, and also where the output file will be saved by default if the save.pcs option is set TRUE.

pcs.to.keep

integer, number of principle components to keep. Singular Value Decomposition methods are able to run faster if they don't need to calculate every single PC for a large matrix. Default is to calculate only the first 50; in practice even fewer than this are generally used directly. Apart from reducing processing time, this can also reduce storage/RAM burden for the resulting matrix. Set to NA, or a number >= min(dim(bigMat)) in order to keep all PCs.

thin

decimal, percentage of the original number of rows you want to thin the matrix to. see function thin() for details of how this can be done, pass arguments to thin() using ... Even though this PCA function uses mainly 'big.matrix' native methods, there is a step where the matrix must be stored fully in memory, so this limits the size of what matrix can be processed, depending on RAM limits. If you want to conduct PCA/SVD on a matrix larger than RAM you can thin the matrix to a percentage of the original size. Usually such a large matrix will contain correlated measures and so the exclusion of some data-rows (variables) will have only a small impact on the resulting principle components. In some applications tested using this function, using only 5 of 200,000 variables a PCA gave extremely similar results to using the full dataset.

SVD

logical, whether to use a Singular Value Decomposition method or a PCA method. The eigenvalues and eigenvectors of each alternative will be highly correlated so for most applications, such as PC-correction, this shouldn't make much difference to the result. However, using SVD=TRUE can provide significant advantages in speed, or even allow a solution on a matrix that would be to large to practically compute full PCA. Note that only in SVD mode, and with the bigalgebra package installed will the full speed advantage of this function be utilised.

LAP

logical, whether to use La.svd() instead of svd() when SVD=TRUE, see base:svd for more info.

center

whether to 'centre' the matrix rows by subtracting rowMeans() before conducting the PCA. This is usually advisable, although you may wish to skip this if the matrix is already centred to save extra processing. unlike prcomp there is no option to standardize or use the correlation matrix, if this is desired, please standardize the bigMat object before running this function. Alternatively, 'center' can be a vector of length nrow(bigMat) which gives an offset to correct each row by.

save.pcs

whether to save the principle component matrix and eigenvalues to a binary file with name pcs.fn

use.bigalgebra

logical, whether to use the bigalgebra package for algebraic operations. For large datasets bigalgebra should provide a substantial speedup, and also facilitates use of larger matrices. This relies on having bigalgebra installed and loaded, which requires some manual configuration as bigalgebra is not currently available on CRAN, but only SVN and RForge. See svn.bigalgebra.install() or big.algebra.install.help() Default is to use bigalgebra if available (TRUE), but setting this FALSE prevents the check for bigalgebra which would be cleaner if you know that you don't have it installed.

pcs.fn

name of the binary when save.pcs=TRUE

return.loadings

logical, whether to return the 'loadings' (or the other singular vector when SVD=T); could result in a speed decrease

verbose

whether to display detailed progress of the PCA

delete.existing

logical, whether to automatically delete filebacked matrices (if they exist) before rewriting. This is because of an update since 20th October 2015 where bigmemory won't allow overwrite of an existing filebacked matrix. If you wish to set this always TRUE or FALSE, use options(deleteFileBacked)

...

if thin is TRUE, then these should be any additional arguments for thin(), e.g, 'pref', 'keep', 'how', etc.i

Value

A list of principle components/singular vectors (may be incomplete depending on options selected), and of the eigenvalues/singular values.

Examples

Run this code

# NOT RUN {
# create an example matrix and its transpose
min.dim <- 200; nvar <- 500; subset.size <- 50
mat <- matrix(rnorm(min.dim*nvar),ncol=min.dim) 
prv.large(mat)
t.mat <- t(mat)
# create two alternative covariance matrices
MMs <- t.mat %*% mat
MsM <- mat %*% t.mat
# run singular value decomposition
pca <- svd(mat)   
D <- pca$d # singular values (=sqrt(eigenvalues))
V <- pca$v # right singular vector
U <- pca$u # left singular vector
sig <- mat-mat; diag(sig) <- D; 
MMs2 <- V %*% (t(sig) %*% sig) %*% t(V)
sig <- t.mat-t.mat; diag(sig) <- D; 
MsM2 <- U %*% (sig %*% t(sig)) %*% t(U)
# show that the covariance matrices are equal to the functions of 
# the left and right singular vectors
prv(MMs,MsM); prv(MMs2,MsM2)
pr <- princomp(mat) # PCA using eigendecomposition of cov matrix
L <- matrix(rep(0,40000),ncol=200); diag(L) <- pr[[1]]^2 # eigenvalues as diag
mat2 <- (pr[[2]]) %*% L %*%  solve(pr[[2]]) # = eigenvectors * eigenvalues * inv(eigenvectors)
prv.large(cov(mat)); prv.large(mat2) #  == COVmat (may be slight tolerance differences)
## Now demonstrate the correlation between SVD and PCA ##
# the right singular vector is highly correlated with the pca loadings:
median(abs(diag(cor(V,pr[["loadings"]]))))
# the left singular vector is highly correlated with the pca scores (eigenvectors):
median(abs(diag(cor(U,pr[["scores"]]))))
cor(pr$sdev,D) # the singular values are equivalent to the eigenvalues

## MAIN EXAMPLES ##
orig.dir <- getwd(); setwd(tempdir()); # move to temporary dir
if(file.exists("testMyBig.bck")) { unlink(c("testMyBig.bck","testMyBig.dsc")) }
bmat <- as.big.matrix(mat,backingfile="testMyBig.bck",
 descriptorfile="testMyBig.dsc",  backingpath = getwd())
result <- big.PCA(bmat) #,verbose=TRUE)
headl(result)
# plot the eigenvalues with a linear fit line and elbow placed at 13
Eigv <- pca.scree.plot(result$Evalues,M=bmat,elbow=6,printvar=FALSE)
rm(bmat) 
unlink(c("testMyBig.bck","testMyBig.dsc"))
##  generate some data with reasonable intercorrelations ##
mat2 <- sim.cor(500,200,genr=function(n){ (runif(n)/2+.5) })
bmat2 <- as.big.matrix(mat2,backingfile="testMyBig2.bck",
 descriptorfile="testMyBig2.dsc",  backingpath = getwd())
# calculate PCA on decreasing subset size 
result2 <- big.PCA(bmat2,thin=FALSE)
normal <- result2$PCs; rm(result2)
result3 <- big.PCA(bmat2,thin=TRUE,keep=.5, pref="t1")
thinned <- result3$PCs; rm(result3)
result4 <- big.PCA(bmat2,thin=TRUE,keep=.5, pref="t2", how="cor")
corred <- result4$PCs; rm(result4)
result5 <- big.PCA(bmat2,thin=TRUE,keep=.5, pref="t3", how="pca")
pced <- result5$PCs; rm(result5)
result6 <- big.PCA(bmat2,thin=TRUE,keep=.2, pref="t4")
thinner <- result6$PCs; rm(result6)
## correlate the resulting PCs with the un-thinned PCs
cors.thin.with.orig <- apply(cor(normal,thinned),1,max)
cors.corred.with.orig <- apply(cor(normal,corred),1,max)
cors.pced.with.orig <- apply(cor(normal,pced),1,max)
cors.thinner.with.orig <-apply(cor(normal,thinner),1,max)
plot(cors.thin.with.orig,type="l",col="red",ylim=c(0,1))
lines(cors.thinner.with.orig,col="orange")
lines(cors.corred.with.orig,col="lightblue")
lines(cors.pced.with.orig,col="lightgreen")
# can see that the first component is highly preserved,
# and next components, somewhat preserved; try using different thinning methods
rm(bmat2) 
unlink(c("testMyBig2.bck","testMyBig2.dsc"))
setwd(orig.dir)
# }

Run the code above in your browser using DataLab

Description

Usage

Arguments

Value

See Also

Examples