sGPCA: Sparse Generalized Principal Component Analysis

Description

Computes the rank K sparse, sparse non-negative, two-way sparse, and two-way sparse non-negative GPCA solutions.

Usage

sgpca(X, Q, R, K = 1, lamu = 0, lamvs = 0, posu = FALSE, posv = FALSE, 
threshold = 1e-07, maxit = 1000, full.path = FALSE)

Arguments

X

The n x p data matrix. X must be of class matrix with all numeric values.

Q

The row generalizing operator, an n x n matrix. Q can be of class matrix or class dgCMatrix, must be positive semi-definite, and must have operator norm one.

R

The column generalizing operator, a p x p matrix. R can be of class matrix or class dgCMatrix, must be positive semi-definite, and must have operator norm one.

K

The number of GPCA components to compute. The default value is one.

lamu

The regularization parameter that determines the sparsity level for the row factor, U. The default value is 0. If the data is oriented with rows as samples, non-zero lamu corresponds to two-way sparse methods.

lamvs

A scalar or vector of regularization parameters that determine the sparsity level for the column factor, V. The default is 0, with non-zero values corresponding to sparse or two-way sparse methods. If lamvs is a vector, then the BIC method is used to select the optimal sparsity level. Alternatively, if full.path is specified, then the solution at each value of lamvs is returned.

posu

Flag indicating whether the row factor, U, should be constrained to be non-negative. The default value is FALSE.

posv

Flag indicating whether the column factor, V, should be constrained to be non-negative. The default value is FALSE.

threshold

Sets the threshold for convergence. The default value is 1e-07.

maxit

Sets the maximum number of iterations. The default value is 1000.

full.path

Flag indicating whether the entire solution path, or the solution at each value of lamvs, should be returned. The default value is FALSE.

Value

U: The left sparse GPCA factors, an n x K matrix. If full.path is specified with r values of lamvs, then U is an n x K x r array.
V: The right sparse GPCA factors, a p x K matrix. If full.path is specified with r values of lamvs, then V is a p x K x r array.
D: A vector of the K sparse GPCA values. If full.path is specified with r values of lamvs, then D is a K x r matrix.
cumulative.prop.var: The cumulative proportion of variance explained by the components.
bics: The BIC values computed for each value of lamvs and each of the K components.
optlams: Optimal regularization parameter as chosen by the BIC method for each of the K components.

Details

The sgpca function can fit any combination of sparsity and/or non-negativity for both the row and column generalized PCs. Regularization is used to encourage sparsity in the GPCA factors by placing an L1 penalty on the GPC loadings, V, and/or the sample GPCs, U. Non-negativity constraints on V and/or U yield sparse non-negative and two-way non-negative GPCA. Generalizing operators as described for gpca can be used with this function and have the same properties.
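
As a minimal sketch of the operator requirements, the base R snippet below rescales a symmetric positive semi-definite matrix so that its largest eigenvalue, and hence its operator norm, is one. The helper name normalize_operator and the toy exponential smoother are illustrative assumptions and are not part of sGPCA.

```r
# Hypothetical helper (not part of sGPCA): check that M is symmetric and
# positive semi-definite, then rescale it to have operator norm one.
normalize_operator <- function(M, tol = 1e-8) {
  stopifnot(is.matrix(M), nrow(M) == ncol(M), isSymmetric(M))
  ev <- eigen(M, symmetric = TRUE, only.values = TRUE)$values
  if (max(ev) <= 0) stop("matrix has no positive eigenvalue")
  if (min(ev) < -tol * max(ev)) stop("matrix is not positive semi-definite")
  M / max(ev)
}

# Toy example: an exponential-decay smoother over 5 ordered time points
tt <- 1:5
Q_raw <- exp(-abs(outer(tt, tt, "-")) / 3)
Q <- normalize_operator(Q_raw)

# Largest eigenvalue of the rescaled operator (one, up to rounding)
max(eigen(Q, symmetric = TRUE, only.values = TRUE)$values)
```

The same rescaling appears in the Examples section, where each covariance matrix is divided by its largest eigenvalue before being passed to sgpca.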

When lamvs = 0, lamu = 0, posu = FALSE, and posv = FALSE, sgpca returns the same GPCA solution as gpca. The magnitudes of the regularization parameters lamvs and lamu determine the sparsity of the factors V and U, respectively, with larger values yielding sparser factors. If more than one value of lamvs is given, then sgpca selects the optimal one by minimizing the BIC derived from the generalized least squares update for each factor.

If full.path = TRUE, then the full path of solutions (U, D, and V) is returned for each value of lamvs given. This option is best used with 50 or 100 values of lamvs so that the regularization paths are approximated well. Numerically, the path begins with the GPCA solution, lamvs = 0, and uses warm starts at each step as lamvs increases.
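
To illustrate what a warm-started path looks like, the base R sketch below solves an L1-penalized generalized least squares problem for a single loading vector over a grid of 50 penalty values, starting each solve from the previous solution. It is a sketch under stated assumptions, not the package's code: the helper ista_v, the toy operator R, and the vector y (standing in for the regression target of the V update) are all hypothetical. The unit step size relies on R having operator norm one.

```r
# Soft-thresholding operator: the proximal map of the L1 penalty.
soft_threshold <- function(z, lam) sign(z) * pmax(abs(z) - lam, 0)

# Hypothetical helper (not part of sGPCA): iterative soft-thresholding for
#   minimize over v:  0.5 * t(y - v) %*% R %*% (y - v) + lam * sum(abs(v))
# The gradient of the smooth part is -R %*% (y - v); its Lipschitz constant
# is the operator norm of R, assumed to be one, so a unit step is used.
ista_v <- function(y, R, lam, v0 = y, tol = 1e-7, maxit = 1000) {
  v <- v0
  for (it in seq_len(maxit)) {
    v_new <- soft_threshold(v + drop(R %*% (y - v)), lam)
    converged <- sqrt(sum((v_new - v)^2)) < tol
    v <- v_new
    if (converged) break
  }
  v
}

# Toy column operator with operator norm one
p <- 8
K <- exp(-abs(outer(1:p, 1:p, "-")) / 2)
R <- K / max(eigen(K, symmetric = TRUE, only.values = TRUE)$values)
y <- c(2, 1.5, 1, 0.2, -0.1, 0.05, -0.8, -1.6)  # assumed regression target

# Path over an increasing grid, starting at lam = 0 (the unpenalized solution)
lams <- seq(0, 1, length.out = 50)
path <- matrix(0, p, length(lams))
v <- y
for (j in seq_along(lams)) {
  v <- ista_v(y, R, lams[j], v0 = v)  # warm start from the previous solution
  path[, j] <- v
}

# Number of non-zero loadings at a few grid points
colSums(path != 0)[c(1, 25, 50)]
```

Warm starts matter because neighbouring penalty values usually have similar solutions, so each solve needs few iterations when it begins from the previous one.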

Proximal gradient descent is used to compute each rank-one solution. Multiple components are calculated in a greedy manner via deflation. Each rank-one solution is solved by iteratively fitting generalized least squares problems with penalties or non-negativity constraints. These regression problems are solved by the Iterative Soft-Thresholding Algorithm (ISTA) or projected gradient descent.
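
The greedy deflation step can be sketched in base R as follows. To keep the example short, it uses unpenalized alternating updates for each rank-one fit; in the sparse or non-negative case, the two update lines would be replaced by soft-thresholding or projected gradient iterations. The helpers qnorm_vec, rank_one_gpca, gpca_deflate, and smoother are hypothetical names, and the update and deflation formulas are assumptions based on the description above rather than the package's actual implementation.

```r
# Norm induced by a positive semi-definite operator M: sqrt(t(x) %*% M %*% x)
qnorm_vec <- function(x, M) sqrt(drop(crossprod(x, M %*% x)))

# Hypothetical helper: one unpenalized rank-one fit by alternating updates.
rank_one_gpca <- function(X, Q, R, tol = 1e-7, maxit = 1000) {
  v <- rep(1, ncol(X)); v <- v / qnorm_vec(v, R)
  u <- rep(0, nrow(X))
  for (it in seq_len(maxit)) {
    u_new <- drop(X %*% R %*% v)          # row factor update
    u_new <- u_new / qnorm_vec(u_new, Q)
    v_new <- drop(t(X) %*% Q %*% u_new)   # column factor update
    v_new <- v_new / qnorm_vec(v_new, R)
    done <- sqrt(sum((u_new - u)^2) + sum((v_new - v)^2)) < tol
    u <- u_new; v <- v_new
    if (done) break
  }
  list(u = u, v = v, d = drop(t(u) %*% Q %*% X %*% R %*% v))
}

# Hypothetical helper: K components, one at a time, deflating X after each fit.
gpca_deflate <- function(X, Q, R, K = 2) {
  U <- matrix(0, nrow(X), K); V <- matrix(0, ncol(X), K); D <- numeric(K)
  for (k in seq_len(K)) {
    fit <- rank_one_gpca(X, Q, R)
    U[, k] <- fit$u; V[, k] <- fit$v; D[k] <- fit$d
    X <- X - fit$d * tcrossprod(fit$u, fit$v)  # remove the fitted component
  }
  list(U = U, V = V, D = D)
}

# Toy data and smoothing operators rescaled to operator norm one
smoother <- function(m, theta) {
  K <- exp(-abs(outer(1:m, 1:m, "-")) / theta)
  K / max(eigen(K, symmetric = TRUE, only.values = TRUE)$values)
}
set.seed(1)
X <- matrix(rnorm(60), 10, 6)
fit <- gpca_deflate(X, smoother(10, 3), smoother(6, 2), K = 2)
fit$D
```

With identity operators, this sketch reduces to the power method with deflation, so the values in D should then agree with the leading singular values of X.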

References

Genevera I. Allen, Logan Grosenick, and Jonathan Taylor, "A generalized least squares matrix decomposition", arXiv:1102.3074, 2011.

Genevera I. Allen and Mirjana Maletic-Savatic, "Sparse Non-negative Generalized PCA with Applications to Metabolomics", Bioinformatics, 27:21, 3029-3035, 2011.

Examples

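library(sGPCA)
library(fields) #provides the ozone2 data and Exp.cov
data(ozone2)
#Keep only the stations (columns) with no missing values
ind = which(apply(is.na(ozone2$y),2,sum)==0)
X = ozone2$y[,ind]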
n = nrow(X)
p = ncol(X)
#Generalizing Operators - Spatio-Temporal Smoothers
R = Exp.cov(ozone2$lon.lat[ind,],theta=5)
er = eigen(R,only.values=TRUE)
R = R/max(er$values)
Q = Exp.cov(c(1:n),c(1:n),theta=3)
eq = eigen(Q,only.values=TRUE)
Q = Q/max(eq$values)

#Sparse GPCA
fit = sgpca(X,Q,R,K=1,lamu=0,lamvs=c(.5,1))
fit$cumulative.prop.var #cumulative proportion of variance explained
fit$optlams #optimal regularization param chosen by BIC
fit$bics #BIC values for each lambda

#Sparse Non-negative GPCA
fit = sgpca(X,Q,R,K=1,lamu=0,lamvs=1,posv=TRUE)

#Two-way Sparse GPCA
fit = sgpca(X,Q,R,K=1,lamu=1,lamvs=1)

#Two-way Sparse Non-negative GPCA
fit = sgpca(X,Q,R,K=1,lamu=1,lamvs=1,posu=TRUE,posv=TRUE)

#Return full regularization paths for inputted lambda values
fit = sgpca(X,Q,R,K=1,lamu=0,lamvs=c(.1,.5,1),full.path=TRUE)