Estimate cross-sectional network structures using regularization. The function first computes the correlations (if needed), constructs a grid of tuning parameters tailored to the chosen penalty, and then selects the final network by minimizing a user‑specified information criterion.
regularization_net(
data = NULL,
ns = NULL,
mat = NULL,
likelihood = "obs_based",
n_calc = "average",
count_diagonal = TRUE,
ic_type = NULL,
extended_gamma = 0.5,
penalty = "atan",
vary = "lambda",
n_lambda = NULL,
lambda_min_ratio = 0.01,
n_gamma = 50,
pen_diag = FALSE,
lambda = NULL,
gamma = NULL,
ordered = FALSE,
missing_handling = "two-step-em",
nimp = 20,
imp_method = "pmm",
...
)
Value
A list with the following elements:
pcor: Estimated partial correlation matrix corresponding to the selected (optimal) network.
n: Effective sample size used, either supplied directly via ns or derived based on n_calc.
cor_method: Correlation estimation method used for each variable pair.
In addition, the list contains the full set of results returned by the model selection procedure (all evaluated networks and their fit statistics) and a list of settings used in the estimation procedure.
Arguments
data: Optional raw data matrix or data frame containing the variables to be included in the network. May include missing values. If data is not provided (NULL), a covariance or correlation matrix must be supplied in mat.
ns: Optional numeric sample size specification. Can be a single value (one sample size for all variables) or a vector (e.g., variable-wise sample sizes). When data is provided and ns is NULL, sample sizes are derived automatically from data. When mat is supplied instead of raw data, ns must be provided and should reflect the sample size underlying mat.
mat: Optional covariance or correlation matrix for the variables to be included in the network. Used only when data is NULL. If both data and mat are supplied, mat is ignored. When mat is used, ns must also be provided.
likelihood: Character string specifying how the log-likelihood is computed. Possible values are:
"obs_based": Uses the observed-data log-likelihood.
"mat_based": Uses the log-likelihood based on the sample correlation matrix.
n_calc: Character string specifying how the effective sample size is determined. When data are provided, it controls how the observation counts across variables are aggregated. When ns is a vector, it controls how the entries of ns are combined. If both data and ns are supplied, the values in ns take precedence. This argument is ignored when ns is a single numeric value. Possible values are:
"average": Uses the average sample size across variables or across the entries of ns.
"max": Uses the maximum sample size across variables or across the entries of ns.
"total": Uses the total number of observations. Applicable only when ns is not provided by the user.
count_diagonal: Logical; should observations contributing to the diagonal elements be included when computing the sample size? Only relevant when data is provided and n_calc = "average".
ic_type: Character string specifying the type of information criterion used for model selection. Possible values are "aic", "bic", and "ebic". If no input is provided, defaults to "ebic" when penalty = "glasso" and to "bic" otherwise.
extended_gamma: Numeric gamma parameter used in the extended information criterion calculation. Only relevant when ic_type = "ebic".
penalty: Character string indicating the type of penalty used for regularization. Available options are described in the Details section.
vary: Character string specifying which penalty parameter(s) are varied during regularization to determine the optimal network. Possible values are "lambda", "gamma", or "both".
n_lambda: Number of lambda values to be evaluated. If not specified, the default is 100 when penalty = "glasso" and 50 otherwise. If vary == "gamma", n_lambda is set to 1.
lambda_min_ratio: Ratio of the smallest to the largest lambda value.
n_gamma: Number of gamma values to be evaluated. Set to 1 if vary == "lambda".
pen_diag: Logical; should the diagonal elements be penalized in the regularization process?
lambda: Optional user-specified vector of lambda values.
gamma: Optional user-specified vector of gamma values.
ordered: Logical vector indicating which variables in data are treated as ordered (ordinal). Only used when data is provided. If a single logical value is supplied, it is recycled to the length of data.
missing_handling: Character string specifying how correlations are estimated from the data input in the presence of missing values. Possible values are:
"two-step-em": Uses a classical EM algorithm to estimate the correlation matrix from data.
"stacked-mi": Uses stacked multiple imputation to estimate the correlation matrix from data.
"pairwise": Uses pairwise deletion to compute correlations from data.
"listwise": Uses listwise deletion to compute correlations from data.
nimp: Number of imputations (default: 20) to be used when missing_handling = "stacked-mi".
imp_method: Character string specifying the imputation method to be used when missing_handling = "stacked-mi" (default: "pmm", predictive mean matching).
...: Further arguments passed to internal functions.
Penalties
This function supports a range of convex and nonconvex penalties for regularized network estimation.
For convex penalties, the graphical lasso can be used via
penalty = "glasso" (Friedman et al., 2008).
Another option is the adaptive lasso, specified with
penalty = "adapt".
By default, it employs \(\gamma = 0.5\) (Zou and Li, 2008).
Smaller values of \(\gamma\) (i.e., \(\gamma \to 0\)) correspond
to stronger penalization, whereas \(\gamma = 1\) yields standard
\(\ell_1\) regularization.
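The role of \(\gamma\) can be made concrete with a small sketch. The functional form \(\lambda |\theta|^{\gamma}\) used here is an assumption for illustration, inferred from the statement that \(\gamma = 1\) yields the standard \(\ell_1\) penalty; it is not necessarily mantar's exact parameterization:

```r
# Assumed penalty form for illustration: pen(theta) = lambda * |theta|^gamma.
# gamma = 1 recovers the lasso (ell_1) penalty; smaller gamma penalizes
# small edge weights more heavily.
adapt_pen <- function(theta, lambda, gamma = 0.5) lambda * abs(theta)^gamma

adapt_pen(0.5, lambda = 0.1, gamma = 1)    # 0.05: standard ell_1 penalty
adapt_pen(0.5, lambda = 0.1, gamma = 0.5)  # ~0.071: heavier penalty on this small edge
```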
The available nonconvex penalties follow the work of
Williams (2020), who identified the
atan penalty as particularly promising. It serves as the default in this
implementation because it has desirable theoretical properties, including
consistency in recovering the true model as \(n \to \infty\).
Additional nonconvex penalties are included for completeness. These were
originally implemented in the now-deprecated R package GGMncv (Williams, 2021),
and the implementation in mantar is based on the corresponding methods
from that package.
Several algorithms exist for nonconvex regularized network estimation. In mantar, we use the one-step estimator of Zou and Li (2008) because of its computational efficiency and its good performance in settings where \(n > p\), which is typically the case in psychological research.
Atan: penalty = "atan" (Wang and Zhu, 2016). This is currently the default.
Exponential: penalty = "exp" (Wang et al., 2018).
Log: penalty = "log" (Mazumder et al., 2011).
MCP: penalty = "mcp" (Zhang, 2010).
SCAD: penalty = "scad" (Fan and Li, 2001).
Seamless \(\ell_0\): penalty = "selo" (Dicker et al., 2013).
SICA: penalty = "sica" (Lv and Fan, 2009).
Information Criteria
The argument ic_type specifies which information criterion is computed.
All criteria are computed based on the log-likelihood of the estimated model.
"aic":Akaike Information Criterion akaike.1974mantar; defined as AIC = -2 + 2k, where \(\ell\) is the log-likelihood of the model and \(k\) is the number of freely estimated edge parameters (non-zero edges).
"bic":Bayesian Information Criterion schwarz.1978mantar; defined as BIC = -2 + k (n), where \(\ell\) is the log-likelihood of the model, \(k\) is the number of freely estimated edge parameters (non-zero edges), and \(n\) is the sample size.
"ebic":Extended Bayesian Information Criterion chen.2008mantar; particularly useful in high-dimensional settings. Defined as EBIC = -2 + k (n) + 4 k (p), where \(\ell\) is the log-likelihood, \(k\) is the number of freely estimated edges (non-zero edges), \(n\) is the sample size, \(p\) is the number of variables, and \(\gamma\) is the extended-penalty parameter.
Conditional Defaults
By default, some tuning parameters depend on the chosen penalty.
Specifically, when penalty = "glasso", the number of lambda
values n_lambda defaults to 100 and ic_type
defaults to "ebic". For all other penalties, the defaults are
n_lambda = 50 and ic_type = "bic". These defaults can
be overridden by specifying n_lambda and/or ic_type
explicitly.
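A common way to build such a tuning-parameter grid, and a plausible sketch of what happens internally (the exact grid construction used by mantar is an assumption here), is to space n_lambda values logarithmically between a maximum lambda and lambda_min_ratio times that maximum:

```r
# Log-spaced lambda grid from lambda_max down to lambda_min_ratio * lambda_max
# (illustrative sketch; the internal grid may differ).
lambda_grid <- function(lambda_max, n_lambda = 50, lambda_min_ratio = 0.01) {
  exp(seq(log(lambda_max), log(lambda_min_ratio * lambda_max),
          length.out = n_lambda))
}

# glasso-style defaults: 100 values between 1 and 0.01
g <- lambda_grid(1, n_lambda = 100, lambda_min_ratio = 0.01)
range(g)   # 0.01 to 1
length(g)  # 100
```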
Missing Handling
To handle missing data, the function offers two approaches: a two-step expectation-maximization (EM) algorithm and stacked multiple imputation. According to simulations by Nehler et al. (2025), stacked multiple imputation performs reliably across a range of sample sizes. In contrast, the two-step EM algorithm provides accurate results primarily when the sample size is large relative to the amount of missingness and network complexity, but may still be preferred in such cases due to its much faster runtime. Currently, the function only supports variables that are directly included in the network analysis; auxiliary variables for missing handling are not yet supported. During imputation, all variables are imputed by default using predictive mean matching (see, e.g., van Buuren, 2018), with all other variables in the data set serving as predictors.
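The difference between the "pairwise" and "listwise" options can be seen directly with base R's cor(), which exposes both strategies through its use argument:

```r
# Pairwise vs. listwise deletion on data with missing values in one variable.
set.seed(1)
d <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
d$x[1:20] <- NA  # missingness in x only

# Pairwise deletion: each correlation uses all complete pairs for that pair
cor_pw <- cor(d, use = "pairwise.complete.obs")
# Listwise deletion: every correlation uses only the 80 fully observed rows
cor_lw <- cor(d, use = "complete.obs")

# The y-z correlation rests on all 100 rows under pairwise deletion,
# but only on the 80 complete rows under listwise deletion:
cor_pw["y", "z"]
cor_lw["y", "z"]
```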
Examples
# Estimate regularized network from full data set
# using observed-data log-likelihood and atan penalty
result <- regularization_net(mantar_dummy_full_cont,
                             likelihood = "obs_based",
                             penalty = "atan")

# View estimated partial correlation network
result$pcor
# Estimate regularized network from data set with missings,
# using correlation-matrix-based log-likelihood, the glasso penalty,
# and stacked multiple imputation to handle missings.
# nimp is set to 10 for faster computation in this example
# (not recommended in practice).
result <- regularization_net(mantar_dummy_mis_mix,
                             likelihood = "mat_based",
                             penalty = "glasso",
                             missing_handling = "stacked-mi",
                             nimp = 10,
                             ordered = c(FALSE, FALSE, TRUE, TRUE,
                                         FALSE, FALSE, TRUE, TRUE))

# View used correlation method and effective sample size
result$cor_method
result$n

# View estimated partial correlation network
result$pcor