huge: High-dimensional undirected graph estimation in one-step mode

Description

The main function for high-dimensional undirected graph estimation. It lets the user run huge.npn(), huge.scr(), and huge.subgraph() sequentially as a pipeline to analyze data.

Usage

huge(L, ind.group = NULL, lambda = NULL, n.lambda = NULL, lambda.min = NULL, 
alpha = 1, sym = "or", npn = TRUE, npn.func = "shrinkage", npn.thresh = NULL, 
approx = FALSE, scr = TRUE, scr.num = NULL, verbose = TRUE)

Arguments

L

There are two options for input L: (1) An n by d data matrix L representing n observations in d dimensions. (2) A list L containing L$data as an

ind.group

A length k vector indexing a subset of all d variables. Only applicable when estimating a subgraph of the whole graph. The default value is c(1:d).

lambda

A sequence of decreasing positive numbers to control the regularization in Meinshausen & Buhlmann Graph Estimation via Lasso (GEL) when approx = FALSE or Graph Estimation via Correlation Approximation (GECA) when approx = TRUE. Typical usage

n.lambda

The number of regularization/thresholding parameters. The default value is 30 if approx = TRUE and 10 if approx = FALSE.

lambda.min

The smallest value for lambda, as a fraction of the upper bound (MAX) of the regularization/thresholding parameter which makes all estimates equal to 0. The program can automatically generate lambda as a

alpha

The tuning parameter for the elastic-net regression. The default value is 1 (lasso). When some dense pattern exists in the graph or some variables are highly correlated, the elastic-net is encouraged for its grouping effect. Only applicable w

sym

Symmetrize the output graphs. If sym = "and", the edge between node i and node j is selected only when both node i and node j are selected as neighbors for each other. If sym = "or"
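The "and" rule can be illustrated with a small base R sketch. The matrix nbr below is a hypothetical example, not an object produced by huge(). The "or" rule shown here is an assumption: it is taken to be the usual counterpart, keeping an edge when either node selects the other.

```r
# Hypothetical asymmetric neighborhood matrix: nbr[i, j] = 1 means
# node j was selected as a neighbor of node i.
nbr <- matrix(c(0, 1, 0,
                1, 0, 1,
                0, 0, 0), nrow = 3, byrow = TRUE)

# "and": keep edge i-j only if i selects j AND j selects i.
adj_and <- (nbr * t(nbr) > 0) * 1

# "or" (assumed rule): keep edge i-j if either node selects the other.
adj_or <- ((nbr + t(nbr)) > 0) * 1

print(adj_and)
print(adj_or)
```

In this toy case node 2 selects node 3 but not the reverse, so the 2-3 edge appears only under the assumed "or" rule.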

npn

If npn = TRUE, the nonparanormal transformation is applied to the input data L or L$data. The default value is TRUE.

npn.func

The transformation function used in the nonparanormal transformation. If npn.func = "truncation", the truncated ECDF is applied. If npn.func = "shrinkage", the shrunken ECDF is applied. The default value is "shrinkage".

npn.thresh

The truncation threshold used in nonparanormal transformation, only applicable when npn.func = "truncation". The default value is 1/(4*(n^0.25)*sqrt(pi*log(n))).
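To make the default threshold concrete, here is a minimal base R sketch of a truncated-ECDF Gaussianization. The function names npn_default_thresh and npn_truncation_sketch are hypothetical, and the code is an illustration of the idea rather than the package's implementation, which may differ in details such as rescaling.

```r
# Default truncation threshold quoted above, as a function of sample size n.
npn_default_thresh <- function(n) {
  1 / (4 * (n^0.25) * sqrt(pi * log(n)))
}

# Sketch (not the package's code): rank-based ECDF per column,
# clipped to [thresh, 1 - thresh], then mapped through qnorm().
npn_truncation_sketch <- function(x, thresh = npn_default_thresh(nrow(x))) {
  n <- nrow(x)
  u <- apply(x, 2, rank) / n              # column-wise empirical CDF values
  u <- pmin(pmax(u, thresh), 1 - thresh)  # truncate away from 0 and 1
  qnorm(u)                                # map to standard normal scores
}

set.seed(1)
x <- matrix(rexp(200 * 3), nrow = 200)    # skewed toy data
z <- npn_truncation_sketch(x)
summary(as.vector(z))
```

The truncation keeps the largest observation in each column from being mapped to an infinite normal score.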

approx

If approx = FALSE, GEL is implemented. If approx = TRUE, GECA is implemented. The default value is approx = FALSE.

scr

If scr = TRUE, the graph screening procedure is applied to preselect the neighborhood before GEL. The default value is TRUE. Only applicable when approx = FALSE.
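The idea of preselecting a neighborhood can be sketched in base R as follows. This assumes screening by largest absolute marginal correlation, in the spirit of sure independence screening; the function screen_sketch is hypothetical and the package's actual procedure may differ.

```r
# Sketch (assumed rule): for each variable, keep the scr.num variables
# with the largest absolute marginal correlation as candidate neighbors.
screen_sketch <- function(x, scr.num) {
  d <- ncol(x)
  r <- abs(cor(x))
  diag(r) <- 0  # a variable is not its own neighbor
  # Column j of the result holds the candidate neighbors of variable j.
  sapply(seq_len(d), function(j) {
    order(r[, j], decreasing = TRUE)[seq_len(scr.num)]
  })
}

set.seed(2)
x <- matrix(rnorm(100 * 6), nrow = 100)
x[, 2] <- x[, 1] + 0.1 * rnorm(100)  # make variables 1 and 2 strongly related
ind.mat <- screen_sketch(x, scr.num = 3)
print(ind.mat)
```

The result is a scr.num by d matrix of indices, one column per variable, which mirrors the layout described for ind.mat in the Value section.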

scr.num

The neighborhood size after the graph screening (the number of remaining neighbors per node). Only applicable when scr = TRUE. The default value is n-1 when p>n and p-1 (equivalent to disabling graph scr

verbose

If verbose = FALSE, tracing information printing is disabled. The default value is TRUE.

Value

An object with S3 class "huge" is returned:
data

The n by d data matrix from the input.

theta

The true graph structure from the input. Only applicable when the input list L contains L$theta as the true graph structure.

ind.group

The ind.group from the input.

ind.mat

The scr.num by k matrix with each column corresponding to a variable in ind.group and containing the indices of the remaining neighbors after the graph screening. Only applicable when scr = TRUE and approx = FALSE.

lambda

The sequence of regularization parameters used in GEL or thresholding parameters in GECA.

alpha

The alpha from the input. Only applicable when approx = FALSE.

sym

The sym from the input. Only applicable when approx = FALSE.

npn

The npn from the input.

scr

The scr from the input. Only applicable when approx = FALSE.

graph

Returns "subgraph path" when k<d and "fullgraph path" when k==d.

path

A list of k by k adjacency matrices of estimated graphs, returned as the solution path corresponding to lambda.

sparsity

The sparsity levels of the solution path.

approx

The correlation graph estimation indicator from the input.

rss

A k by n.lambda matrix. Each row corresponds to a variable in ind.group and contains all RSS's (Residual Sum of Squares) along the lasso solution path. Only applicable when approx = FALSE.

df

A k by n.lambda matrix. Each row corresponds to a variable in ind.group and contains the number of nonzero coefficients along the lasso solution path. Only applicable when approx = FALSE.

Details

This function provides a general framework for high-dimensional undirected graph estimation. The package integrates data preprocessing (Gaussianization), graph screening, graph estimation, and model selection techniques into a pipeline. The nonparanormal transformation is applied to preprocess the data and helps relax the normality assumption. The graph screening subroutine preselects the graph neighborhood of each variable. In the graph estimation stage, the structure of either the whole graph or a pre-specified subgraph is estimated by the Meinshausen & Buhlmann Graph Estimation via Lasso (GEL) strategy on the pre-screened data. In the case d >> n or d >> k, the computation is memory-optimized and targets larger-scale problems (with d > 3000). We also provide another efficient method, Graph Estimation via Correlation Approximation (GECA).
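As a rough illustration of the correlation-based alternative, the base R sketch below thresholds the absolute sample correlation matrix along a decreasing sequence of thresholding parameters. The function geca_sketch is hypothetical, and both the thresholding rule and the sparsity definition (fraction of possible edges present) are assumptions made for this example, not a description of the package internals.

```r
# Sketch (assumed rule): connect i and j when |cor(x_i, x_j)| > lambda.
geca_sketch <- function(x, lambda) {
  d <- ncol(x)
  r <- abs(cor(x))
  diag(r) <- 0  # no self-loops
  # One adjacency matrix per thresholding parameter.
  path <- lapply(lambda, function(l) (r > l) * 1)
  # Assumed sparsity measure: share of off-diagonal entries that are edges.
  sparsity <- sapply(path, function(a) sum(a) / (d * (d - 1)))
  list(lambda = lambda, path = path, sparsity = sparsity)
}

set.seed(3)
x <- matrix(rnorm(200 * 5), nrow = 200)
x[, 2] <- x[, 1] + 0.5 * rnorm(200)  # variables 1 and 2 are strongly correlated
out <- geca_sketch(x, lambda = c(0.8, 0.4, 0.1))
print(out$sparsity)
```

Because lambda decreases along the path, each graph in this sketch contains the edges of the previous one, so the sparsity level never decreases.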

References

Tuo Zhao and Han Liu. HUGE: A Package for High-dimensional Undirected Graph Estimation. Technical Report, Carnegie Mellon University, 2010.

Han Liu, John Lafferty and Larry Wasserman. The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs. Journal of Machine Learning Research (JMLR), Vol.10, Page 2295-2328, 2009.

Jianqing Fan and Jinchi Lv. Sure independence screening for ultra-high dimensional feature space (with discussion). Journal of Royal Statistical Society B, Vol.70, Page 849-911, 2008.

Jerome Friedman, Trevor Hastie and Rob Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol.33, No.1, 2008.

Nicolai Meinshausen and Peter Buhlmann. High-dimensional Graphs and Variable Selection with the Lasso. The Annals of Statistics, Vol.34, Page 1436-1462, 2006.

Examples

#generate data
L = huge.generator(n = 200, d = 80, graph = "hub")

#subset indices
ind.group = c(1:50)

#subgraph solution path estimation with input as a list
out1 = huge(L, ind.group = ind.group)
summary(out1)
plot(out1)
plot(out1, align = TRUE)

#subgraph solution path estimation using the correlation graph estimation
out3 = huge(L$data, ind.group = ind.group, approx = TRUE)
summary(out3)
plot(out3)

#fullgraph solution path estimation using elastic net
out4 = huge(L, alpha = 0.7)
summary(out4)
plot(out4)