irlba (version 2.3.2)

ssvd: Sparse regularized low-rank matrix approximation.

Description

Estimate an \(\ell_1\)-penalized singular value or principal components decomposition (SVD or PCA) that introduces sparsity in the right singular vectors, based on the fast and memory-efficient sPCA-rSVD algorithm of Haipeng Shen and Jianhua Huang.

Usage

ssvd(x, k = 1, n = 2, maxit = 500, tol = 0.001, center = FALSE,
  scale. = FALSE, alpha = 0, tsvd = NULL, ...)

Arguments

x

A numeric real- or complex-valued matrix or real-valued sparse matrix.

k

Matrix rank of the computed decomposition (see the Details section below).

n

Number of nonzero components in the right singular vectors. If k > 1, a single value of n specifies the number of nonzero components in each regularized right singular vector; alternatively, supply a vector of length k giving the desired number of nonzero components in each returned vector. See the examples.

maxit

Maximum number of soft-thresholding iterations.

tol

Convergence is determined when \(\|U_j - U_{j-1}\|_F < tol\), where \(U_j\) is the matrix of estimated left regularized singular vectors at iteration \(j\).

center

a logical value indicating whether the variables should be shifted to be zero centered. Alternatively, a centering vector of length equal to the number of columns of x can be supplied. Use center=TRUE to perform a regularized sparse PCA.

scale.

a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. Alternatively, a vector of length equal to the number of columns of x can be supplied.

The value of scale. determines how column scaling is performed (after centering). If scale. is a numeric vector with length equal to the number of columns of x, then each column of x is divided by the corresponding value from the scale. vector. If scale. is TRUE, scaling is done by dividing the (centered) columns of x by their standard deviations if center=TRUE, and by their root mean squares otherwise. If scale. is FALSE, no scaling is done. See the scale function for more details.
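To illustrate, the centering and scaling described above can be reproduced with base R before calling ssvd. This is a minimal sketch of the documented behavior (not irlba's internal code), assuming x is a numeric matrix:

xc  <- scale(x, center = TRUE, scale = FALSE)  # center the columns of x
sds <- apply(xc, 2, sd)                        # per-column standard deviations
xs  <- sweep(xc, 2, sds, "/")                  # divide each centered column by its sd
# ssvd(xs, n = 50) should then behave like ssvd(x, n = 50, center = TRUE, scale. = TRUE)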

alpha

Optional scalar regularization parameter between zero and one (see Details below).

tsvd

Optional initial rank-k truncated SVD or PCA (skips computation if supplied).
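For example (a hedged sketch, assuming the list returned by irlba, with components u, d, and v, is an acceptable tsvd value):

t0 <- irlba(x, nv = 2)                   # rank-2 truncated SVD
s  <- ssvd(x, k = 2, n = 20, tsvd = t0)  # reuse it to skip the initial SVD step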

...

Additional arguments passed to irlba.

Value

A list containing the following components:

  • u regularized left singular vectors with orthonormal columns

  • d regularized upper-triangular projection matrix so that x %*% v == u %*% d

  • v regularized, sparse right singular vectors with columns of unit norm

  • center, scale the centering and scaling used, if any

  • lambda the per-column regularization parameter found to obtain the desired sparsity

  • iter number of soft thresholding iterations

  • n value of input parameter n

  • alpha value of input parameter alpha
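As a quick illustration (a sketch on made-up data, not part of the package documentation), these properties can be checked directly on the returned list:

set.seed(2)
x <- matrix(rnorm(100 * 60), 100)
s <- ssvd(x, k = 2, n = c(5, 10))   # different sparsity per column of v
all.equal(x %*% s$v, s$u %*% s$d)   # x %*% v equals u %*% d
zapsmall(crossprod(s$u))            # columns of u are orthonormal
colSums(s$v^2)                      # columns of v have unit norm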

Details

The ssvd function implements a version of an algorithm by Shen and Huang that computes a penalized SVD or PCA, introducing sparsity in the right singular vectors by solving a penalized least squares problem. In the rank-1 case, the algorithm finds vectors \(u, w\) that minimize $$\|x - u w^T\|_F^2 + \lambda \|w\|_1$$ such that \(\|u\| = 1\), and then sets \(v = w / \|w\|\) and \(d = u^T x v\); see the referenced paper for details. The penalty \(\lambda\) is implicitly determined from the specified desired number of nonzero values n. Higher-rank output is determined similarly, using a sequence of \(\lambda\) values chosen to maintain the desired number of nonzero elements in each column of v specified by n. Unlike the standard SVD or PCA, the columns of the returned v may not be orthogonal when k > 1.
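To make the rank-1 iteration concrete, here is a simplified sketch of the idea in R. It illustrates the scheme described above, with the penalty chosen implicitly from n; it is not irlba's actual implementation, and it assumes n < ncol(x) and ignores ties in the thresholding step:

ssvd_rank1_sketch <- function(x, n, maxit = 500, tol = 1e-3) {
  u <- drop(svd(x, nu = 1, nv = 0)$u)  # initialize from the leading singular vector
  for (iter in seq_len(maxit)) {
    w <- drop(crossprod(x, u))                         # w = t(x) %*% u
    lambda <- sort(abs(w), decreasing = TRUE)[n + 1]   # penalty implied by n
    w <- sign(w) * pmax(abs(w) - lambda, 0)            # soft-threshold: n nonzeros survive
    u_new <- drop(x %*% w)
    u_new <- u_new / sqrt(sum(u_new^2))                # renormalize u
    if (sqrt(sum((u_new - u)^2)) < tol) { u <- u_new; break }
    u <- u_new
  }
  v <- w / sqrt(sum(w^2))                              # v = w / ||w||
  list(u = u, v = v, d = drop(crossprod(u, x %*% v)), iter = iter)
}

On data like the first example below, ssvd_rank1_sketch(x, n = 50) should recover a sparsity pattern similar to ssvd(x, n = 50), though without irlba's fast truncated SVD machinery.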

References

  • Shen, Haipeng, and Jianhua Z. Huang. "Sparse principal component analysis via regularized low rank matrix approximation." Journal of Multivariate Analysis 99.6 (2008): 1015-1034.

  • Witten, Daniela M., Robert Tibshirani, and Trevor Hastie. "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis." Biostatistics 10.3 (2009): 515-534.

Examples

set.seed(1)
u <- matrix(rnorm(200), ncol=1)
v <- matrix(c(runif(50, min=0.1), rep(0,250)), ncol=1)
u <- u / drop(sqrt(crossprod(u)))
v <- v / drop(sqrt(crossprod(v)))
x <- u %*% t(v) + 0.001 * matrix(rnorm(200*300), ncol=300)
s <- ssvd(x, n=50)
table(actual=v[, 1] != 0, estimated=s$v[, 1] != 0)
oldpar <- par(mfrow=c(2, 1))
plot(u, cex=2, main="u (black circles), Estimated u (blue discs)")
points(s$u, pch=19, col=4)
plot(v, cex=2, main="v (black circles), Estimated v (blue discs)")
points(s$v, pch=19, col=4)

# Compare with SPC from the PMA package, regularizing only the v vector and choosing
# the regularization constraint `sum(abs(s$v))` computed above by ssvd
# (they find about the same solution in this "sparse SVD" case):
if (requireNamespace("PMA", quietly = TRUE)) {
  p <- PMA::SPC(x, sumabsv=sum(abs(s$v)), center=FALSE)
  table(actual=v[, 1] != 0, estimated=p$v[, 1] != 0)
  # compare optimized values
  print(c(ssvd=s$d, SPC=p$d))

  # Same example, but computing a "sparse PCA", again about the same results:
  sp <- ssvd(x, n=50, center=TRUE)
  pp <- PMA::SPC(x, sumabsv=sum(abs(sp$v)), center=TRUE)
  print(c(ssvd=sp$d, SPC=pp$d))
}


# Let's consider a trivial rank-2 example (k=2) with noise. Like the
# last example, we know the exact number of nonzero elements in each
# solution vector of the noise-free matrix. Note the application of
# different sparsity constraints on each column of the estimated v.
# Also, the decomposition is unique only up to sign, which we adjust
# for below.
set.seed(1)
u <- qr.Q(qr(matrix(rnorm(400), ncol=2)))
v <- matrix(0, ncol=2, nrow=300)
v[sample(300, 15), 1] <- runif(15, min=0.1)
v[sample(300, 50), 2] <- runif(50, min=0.1)
v <- qr.Q(qr(v))
x <- u %*% (c(2, 1) * t(v)) + .001 * matrix(rnorm(200 * 300), 200)
s <- ssvd(x, k=2, n=colSums(v != 0))

# Compare actual and estimated vectors (adjusting for sign):
s$u <- sign(u) * abs(s$u)
s$v <- sign(v) * abs(s$v)
table(actual=v[, 1] != 0, estimated=s$v[, 1] != 0)
table(actual=v[, 2] != 0, estimated=s$v[, 2] != 0)
plot(v[, 1], cex=2, main="True v1 (black circles), Estimated v1 (blue discs)")
points(s$v[, 1], pch=19, col=4)
plot(v[, 2], cex=2, main="True v2 (black circles), Estimated v2 (blue discs)")
points(s$v[, 2], pch=19, col=4)
par(oldpar)
