bootSVD (version 0.1)

bootSVD: Calculates bootstrap distribution of PCA (i.e. SVD) results

Description

Applies fast bootstrap PCA, using the method from (Fisher et al., 2014). Dimension of the sample is denoted by $p$, and sample size is denoted by $n$.
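The core idea behind the method can be sketched in base R. This is a minimal illustration under simplifying assumptions (a tall $p$ by $n$ matrix, a single resample, no re-centering of the bootstrap sample), not the package's actual implementation, and all variable names are hypothetical. Because every bootstrap sample lies in the $n$-dimensional subspace spanned by the original sample, the SVD of a resampled $p$ by $n$ matrix can be recovered from the SVD of an $n$ by $n$ matrix.

```r
set.seed(1)
p <- 500; n <- 20; K <- 2
Y <- matrix(rnorm(p * n), p, n)   # tall toy data: columns are observations

# SVD of the original sample: Y = V D U'
s <- svd(Y)
V <- s$u; d <- s$d; U <- s$v

# One bootstrap resample of the n observations (columns of Y)
idx <- sample(n, n, replace = TRUE)

# Low-dimensional step: SVD of the n by n matrix (D U')[, idx]
DUt <- diag(d) %*% t(U)
ld  <- svd(DUt[, idx])

# High-dimensional bootstrap PCs: rotate the original PCs by the
# low-dimensional left singular vectors
V_boot <- V %*% ld$u[, 1:K]

# Direct (slow) computation on the resampled p by n matrix, for comparison
direct <- svd(Y[, idx])

# Absolute inner products between the two sets of PCs; values near 1
# indicate agreement up to sign
agreement <- abs(colSums(V_boot * direct$u[, 1:K]))
```

In this sketch only the $n$ by $n$ decomposition is repeated per resample; the $p$-dimensional matrix V is touched once, in the final rotation.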

Usage

bootSVD(Y = NULL, K, V = NULL, d = NULL, U = NULL, B = 50,
  output = "HD_moments", talk = TRUE, bInds = NULL,
  percentiles = c(0.025, 0.975), centerSamples = TRUE, mc.cores = 1)

Arguments

Y
initial data sample. Can be either tall ($p$ by $n$) or wide ($n$ by $p$). If Y is entered and V, d and U are not, bootSVD will also compute the SVD of Y.
K
number of PCs to calculate the bootstrap distribution for.
V
(optional) full set of $p$-dimensional PCs for the sample data matrix. If Y is wide, these are the right singular vectors of Y (i.e. $Y=UDV'$). If Y is tall, these are the left singular vectors of Y (i.e. $Y=VDU'$).
U
(optional) full set of $n$-dimensional singular vectors of Y. If Y is wide, these are the left singular vectors of Y (i.e. $Y=UDV'$). If Y is tall, these are the right singular vectors of Y (i.e. $Y=VDU'$).
d
(optional) vector of the singular values of Y. For example, if Y is tall, then we have $Y=VDU'$ with D=diag(d).
B
number of bootstrap samples to compute.
output
a vector telling which descriptions of the bootstrap distribution should be calculated. Can include any of the following: 'initial_SVD','HD_moments', 'full_HD_PC_dist', 'full_LD_PC_dist', 'd_dist', and 'U_dist'. If output is set to 'a
talk
If TRUE, the function will print progress during calculation procedure.
bInds
a ($B$ by $n$) matrix of bootstrap indices, where B is the number of bootstrap samples, and n is the sample size. The purpose of setting a specific bootstrap sampling index is to allow the results to be more easily compar
percentiles
a vector containing percentiles to be used to calculate element-wise percentile confidence intervals for the PCs (both the $p$-dimensional components and the $n$-dimensional components). For example, percentiles=c(.025,.975) will resu
mc.cores
passed to mclapply. Used when transforming the $n$-dimensional PCs to the $p$-dimensional PCs.
centerSamples
whether each bootstrap sample should be centered before calculating the SVD.
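
As a rough illustration of the conventions above, the following base R sketch builds the optional V, d, and U inputs for a wide Y, along with a bInds matrix of the documented ($B$ by $n$) shape. The toy data and variable names are assumptions made for illustration; only the argument names come from the Usage section. The final call is left commented out because it requires the bootSVD package and has not been verified here.

```r
set.seed(2)
n <- 15; p <- 200; B <- 50
Y <- matrix(rnorm(n * p), n, p)   # wide toy data: n by p

# For wide Y, Y = U D V'
s <- svd(Y)
U <- s$u   # n-dimensional singular vectors (n by n)
d <- s$d   # singular values
V <- s$v   # p-dimensional PCs (p by n)

# For the tall orientation t(Y) = V D U', the same PCs appear as the
# left singular vectors (up to sign)
s_tall   <- svd(t(Y))
same_PCs <- abs(colSums(s_tall$u[, 1:3] * V[, 1:3]))

# B by n matrix of bootstrap indices: each row resamples 1..n with replacement
bInds <- matrix(sample(n, B * n, replace = TRUE), nrow = B, ncol = n)

# Hypothetical call reusing the precomputed pieces (requires bootSVD):
# b <- bootSVD(V = V, d = d, U = U, K = 2, B = B, bInds = bInds)
```

Fixing bInds in this way means two runs draw exactly the same resamples, which is what makes their results directly comparable.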

Value

  • Output is a list which can include any of the following elements, depending on what is specified in the output argument: initial_SVD, HD_moments, HD_percentiles, full_HD_PC_dist, full_LD_PC_dist, d_dist, and U_dist.

    In addition, the following results are always included in the output, regardless of what is specified in the output argument:

  • LD_moments: A list that is comparable to HD_moments, but instead describes the variability of the $n$-dimensional principal components of the resampled score matrices. LD_moments contains the bootstrap expected value (EPCs), bootstrap variance (varPCs), and bootstrap standard deviation (sdPCs) for each of the $n$-dimensional PCs. Each of these three elements of LD_moments is also a list, which contains $K$ vectors, one for each PC. LD_moments also contains momentCI, a list of $K$ ($n$ by 2) matrices containing element-wise moment-based confidence intervals for the PCs.
  • LD_percentiles: A list of $K$ matrices, each of dimension ($n$ by $2$). The $k^{th}$ matrix in LD_percentiles contains element-wise percentile intervals for the $k^{th}$ $n$-dimensional PC.
  • If the dimension of the sample is especially large (e.g. $p>10000$), operations involving the entire sample data matrix may require too much memory, creating computational bottlenecks. In these cases, the sample data can be partitioned into several smaller data files, and simple block matrix algebra can be used to compute the bootstrap distribution of the principal components. This matrix algebra is explained in more detail in section 1 of the supplemental materials of (Fisher et al., 2014).
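
One way such a block computation could look is sketched below in base R. This is an illustration of the general idea (accumulating the $n$ by $n$ cross-product one block at a time), not a transcription of the supplement. The in-memory list `blocks` is a hypothetical stand-in for separate data files, and the blocks are recombined at the end only to check the result.

```r
set.seed(3)
p <- 600; n <- 12
Y <- matrix(rnorm(p * n), p, n)   # tall toy data: p by n

# Partition the p rows into 3 blocks (stand-ins for separate files)
block_id <- rep(1:3, each = p / 3)
blocks   <- lapply(1:3, function(i) Y[block_id == i, , drop = FALSE])

# Y'Y is the sum of the per-block cross-products, so only one block
# needs to be in memory at a time
YtY <- Reduce(`+`, lapply(blocks, crossprod))

# The eigendecomposition of the n by n matrix Y'Y gives U and d^2
e <- eigen(YtY, symmetric = TRUE)
d <- sqrt(e$values)
U <- e$vectors

# Each block of V follows from Y = V D U', i.e. V_i = Y_i U D^{-1}
V_blocks <- lapply(blocks, function(Yi) Yi %*% U %*% diag(1 / d))

# Recombine only to compare against a direct SVD of the full matrix
V   <- do.call(rbind, V_blocks)
ref <- svd(Y)
```

The resulting V, d, and U could then be supplied through the corresponding arguments instead of Y, so that the full data matrix never has to be held in memory at once.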

References

Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922

Examples

#use small n, small B for a quick illustration
set.seed(0)
Y<-simEEG(n=100, centered=TRUE, wide=TRUE)
b<-bootSVD(Y,B=200,K=2,output='all')

#explore results
matplot(b$initial_SVD$V[,1:4],type='l',main='Fitted PCs',lty=1)
legend('bottomright',paste0('PC',1:4),col=1:4,lty=1,lwd=2)

######################
# look specifically at 2nd PC
k<-2

######
#looking at HD variability

#plot several draws from bootstrap distribution
VsByK<-reindexPCsByK(b$full_HD_PC_dist)
matplot(t(VsByK[[k]][1:20,]),type='l',lty=1,
		main=paste0('20 Draws from bootstrap\ndistribution of HD PC ',k))

#plot pointwise CIs
matplot(b$HD_moments$momentCI[[k]],type='l',col='blue',lty=1,
		main=paste0('CIs For HD PC ',k))
matlines(b$HD_percentiles[[k]],type='l',col='darkgreen',lty=1)
lines(b$initial_SVD$V[,k])
legend('topright',c('Fitted PC','Moment CIs','Percentile CIs'),
		lty=1,col=c('black','blue','darkgreen'))
abline(h=0,lty=2,col='darkgrey')

######
# looking at LD variability

# plot several draws from bootstrap distribution
AsByK<-reindexPCsByK(b$full_LD_PC_dist)
matplot(t(AsByK[[k]][1:50,]),type='l',lty=1,
		main=paste0('50 Draws from bootstrap\ndistribution of LD PC ',k),
		xlim=c(1,10),xlab='PC index (truncated)')

# plot pointwise CIs
matplot(b$LD_moments$momentCI[[k]],type='o',col='blue',
		lty=1,main=paste0('CIs For LD PC ',k),xlim=c(1,10),
		xlab='PC index (truncated)',pch=1)
matlines(b$LD_percentiles[[k]],type='o',pch=1,col='darkgreen',lty=1)
abline(h=0,lty=2,col='darkgrey')
legend('topright',c('Moment CIs','Percentile CIs'),lty=1,
		pch=1,col=c('blue','darkgreen'))
#Note: variability is mostly due to rotations with the third and fourth PC.

# Bootstrap eigenvalue distribution
dsByK<-reindexDsByK(b$d_dist)
boxplot(dsByK[[k]]^2,main=paste0('Covariance Matrix Eigenvalue ',k),ylab='Bootstrap Distribution')
points(b$initial_SVD$d[k]^2,pch=18,col='red')
legend('bottomright','Sample Value',pch=18,col='red')