gPCA.batchdetect: Guided Principal Components Analysis

Description

Tests for batch effects in an $n \times p$ data set, with the batch vector given by batch, using the $\delta$ statistic resulting from guided principal components analysis (gPCA).

Usage

gPCA.batchdetect(x, batch, filt = NULL, nperm = 1000, center = FALSE, scaleY = FALSE,
                 seed = NULL)

Arguments

x

an $n \times p$ matrix of data where $n$ denotes the number of observations and $p$ denotes the number of features (e.g. probe, gene, SNP, etc.).

batch

a length $n$ vector that indicates batch (group or class) for each observation.

filt

(optional) the number of features to retain after applying a variance filter. If NULL, no filter is applied. Filtering can significantly reduce the processing time in the case of very large data sets.

nperm

the number of permutations to perform for the permutation test. Default is 1000.

center

(logical) Is the data matrix x already centered? If not, leave center=FALSE (default) and gPCA.batchdetect will center it for you.

scaleY

(logical) Do you want to scale the Y matrix by the number of samples in each batch? If not, then scaleY=FALSE (default); otherwise, scaleY=TRUE.

seed

the seed number for set.seed(). Default is NULL.
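
The variance filter controlled by filt can be pictured with a minimal base R sketch. The helper name variance_filter is hypothetical and not part of the gPCA package; the sketch assumes only that the filt features with the largest sample variance are retained, and the package's internal filtering may differ in details such as tie handling or column order.

```r
# Hypothetical helper (not part of gPCA): keep the `filt` columns of x
# with the largest sample variance.
variance_filter <- function(x, filt = NULL) {
  if (is.null(filt) || filt >= ncol(x)) return(x)    # no filtering requested
  v <- apply(x, 2, var)                              # per-feature variance
  keep <- order(v, decreasing = TRUE)[seq_len(filt)] # indices of top-variance features
  x[, sort(keep), drop = FALSE]                      # preserve original column order
}

# Simulated data: 20 observations, 50 features
set.seed(1)
x <- matrix(rnorm(20 * 50), nrow = 20, ncol = 50)
x_filt <- variance_filter(x, filt = 10)
dim(x_filt)
```

Filtering before the permutation test shrinks every SVD that follows, which is where the reduction in processing time comes from.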

Value

delta: test statistic $\delta$ from gPCA.
p.val: $p$-value associated with $\delta$ resulting from gPCA.
delta.p: length nperm vector of delta values resulting from the permutation test.
batch: returns your length $n$ batch vector.
filt: returns the number of features the variance filter retained.
n: the number of observations
p: the number of features
b: the number of batches
PCu: principal component matrix from unguided PCA.
PCg: principal component matrix from gPCA.
varPCu1: the proportion of the total variance associated with the first principal component of unguided PCA.
varPCg1: the proportion of the total variance associated with the first principal component of gPCA.
cumulative.var.u: length $n$ vector of the cumulative variance of the $i=1,\dots,n$ principal components from unguided PCA.
cumulative.var.g: length $b$ vector of the cumulative variance of the $k=1,\dots,b$ principal components from gPCA.

Details

Guided principal components analysis (gPCA) is an extension of principal components analysis (PCA) that guides the singular value decomposition (SVD) of PCA by applying SVD to $\mathbf{Y}'\mathbf{X}$, where $\mathbf{Y}$ is an $n \times b$ batch indicator matrix whose entry $(i,k)$ is one when observation $i$ ($i=1,\dots,n$) is in batch $k$ ($k=1,\dots,b$) and zero otherwise.
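
The decomposition can be sketched in base R on simulated data. This is an illustration under stated assumptions, not the package's internal code: the data are simulated, and $\delta$ is taken to be the variance of the data projected onto the first guided direction divided by the variance along the first unguided direction, which is how Reese et al. describe the statistic.

```r
# Simulated data: 30 observations, 8 features, 3 batches of 10
set.seed(42)
n <- 30; p <- 8
batch <- rep(1:3, each = 10)
x <- matrix(rnorm(n * p), n, p)
x[batch == 2, 1] <- x[batch == 2, 1] + 3   # inject a batch shift in feature 1

# Center the columns of x
x <- scale(x, center = TRUE, scale = FALSE)

# n x b indicator matrix: Y[i, k] = 1 when observation i is in batch k
Y <- model.matrix(~ factor(batch) - 1)

# Unguided PCA uses the SVD of X; guided PCA uses the SVD of Y'X
svd_u <- svd(x)
svd_g <- svd(t(Y) %*% x)

# Project the data onto the first right singular vector of each decomposition
pc_u1 <- x %*% svd_u$v[, 1]
pc_g1 <- x %*% svd_g$v[, 1]

# Assumed form of delta: guided variance relative to unguided variance
delta <- as.numeric(var(pc_g1) / var(pc_u1))
delta
```

Because the first unguided principal component maximizes variance over all directions, this ratio cannot exceed one; values near one suggest that the dominant direction of variation coincides with batch.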

A call to gPCA.batchdetect() returns the test statistic $\delta$ and its one-sided $p$-value, along with the values of $\delta_p$ from the permutation test. The $\delta_p$ values can be used to visualize the permutation distribution of your test using the gDist function. For more information on gPCA, please see Reese et al. (References).
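
The permutation test can be sketched in the same spirit. The helper gpca_delta is hypothetical, the data are simulated, and the $p$-value is computed here as the share of permuted statistics at least as large as the observed one; the package's exact handling of ties may differ.

```r
# Hypothetical helper (not part of gPCA): delta for a centered matrix x
gpca_delta <- function(x, batch) {
  Y <- model.matrix(~ factor(batch) - 1)   # n x b batch indicator matrix
  v_g <- svd(t(Y) %*% x)$v[, 1]            # first guided direction
  v_u <- svd(x)$v[, 1]                     # first unguided direction
  as.numeric(var(x %*% v_g) / var(x %*% v_u))
}

# Simulated data with a shift in one batch
set.seed(7)
n <- 30; p <- 8
batch <- rep(1:3, each = 10)
x <- matrix(rnorm(n * p), n, p)
x[batch == 2, 1] <- x[batch == 2, 1] + 3
x <- scale(x, center = TRUE, scale = FALSE)

# Observed statistic
delta <- gpca_delta(x, batch)

# Permutation distribution: recompute delta after shuffling the batch labels
nperm <- 200
delta_p <- replicate(nperm, gpca_delta(x, sample(batch)))

# One-sided p-value: share of permuted statistics at least as large as observed
p_val <- mean(delta_p >= delta)
p_val
```

Shuffling the labels breaks any association between batch and the data while leaving the data themselves unchanged, so delta_p approximates the distribution of $\delta$ when there is no batch effect.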

References

Reese, S. E., Archer, K. J., Therneau, T. M., Atkinson, E. J., Vachon, C. M., de Andrade, M., Kocher, J. A., and Eckel-Passow, J. E. A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal components analysis. Bioinformatics, (in review).

Examples

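library(gPCA)

# Load the example data set shipped with the package
data(caseDat)
batch <- caseDat$batch
data <- caseDat$data

# Test for batch effects; with center = FALSE the function centers x itself
out <- gPCA.batchdetect(x = data, batch = batch, center = FALSE, nperm = 250)
out$delta
out$p.val

## Plots:
gDist(out)
CumulativeVarPlot(out, ug = "unguided", col = "blue")
PCplot(out, ug = "unguided", type = "1v2")
PCplot(out, ug = "unguided", type = "comp", npcs = 4)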
