neighborhood_net: Network Estimation via Neighborhood Selection using Information Criteria

Description

Estimates a network structure through node-wise regression models, where each regression is selected via an information-criterion–based stepwise procedure. The selected regression coefficients are subsequently combined into partial correlations to form the final network.

Usage

neighborhood_net(
  data = NULL,
  ns = NULL,
  mat = NULL,
  n_calc = "individual",
  ic_type = "bic",
  ordered = FALSE,
  pcor_merge_rule = "and",
  missing_handling = "two-step-em",
  nimp = 20,
  imp_method = "pmm",
  ...
)

Value

A list with the following elements:

pcor: Partial correlation matrix estimated from the node-wise regressions.
betas: Matrix of regression coefficients from the final regression models.
ns: Sample sizes used for each variable in the node-wise regressions.
args: List of settings used in the network estimation.

Arguments

data

Optional raw data matrix or data frame containing the variables to be included in the network. May include missing values. If data is not provided (NULL), a covariance or correlation matrix must be supplied in mat.

ns

Optional numeric sample size specification. Can be a single value (same sample size is used for all regressions) or a vector (e.g., variable-wise sample sizes). When data is provided and ns is NULL, sample sizes are derived automatically from data. When mat is supplied instead of raw data, ns must be provided and should reflect the sample size underlying mat.

mat

Optional covariance or correlation matrix for the variables to be included in the network. Used only when data is NULL. If both data and mat are supplied, mat is ignored. When mat is used, ns must also be provided.
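As a rough illustration of the matrix-based input described above, the sketch below builds a correlation matrix and its matching sample size from R's built-in mtcars data (chosen here only for convenience; it is not part of this package). The call to neighborhood_net() is guarded so that the snippet also runs when the package is not installed, and it assumes the function is exported from a package named mantar.

```r
# Correlation matrix and sample size from a built-in data set (illustrative only)
vars <- c("mpg", "disp", "hp", "wt")
cor_mat <- cor(mtcars[, vars])
n_obs <- nrow(mtcars)

# Assumption: neighborhood_net() is provided by the 'mantar' package
if (requireNamespace("mantar", quietly = TRUE)) {
  # mat replaces data, so ns must be given explicitly
  fit <- mantar::neighborhood_net(mat = cor_mat, ns = n_obs)
  print(fit$pcor)
}
```

A single value for ns is used here because every correlation in cor_mat is based on the same complete rows.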

n_calc

Character string specifying how per-variable sample sizes for node-wise regression models are computed when ns is not supplied. If ns is provided, its values are used directly and n_calc is ignored. Possible values are:

"individual": For each variable, uses the number of non-missing observations for that variable.

"average": Computes the average number of non-missing observations across all variables and uses this average as the sample size for every variable.

"max": Computes the maximum number of non-missing observations across all variables and uses this maximum as the sample size for every variable.

"total": Uses the total number of rows in data as the sample size for every variable.
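The four rules can be sketched in base R as follows. The helper sample_sizes() is hypothetical and only mirrors the descriptions above; it is not the package's internal code, and whether the package rounds the average is not stated here.

```r
# Hypothetical helper (not part of the package): per-variable sample sizes
# derived from the missingness pattern under each n_calc rule.
sample_sizes <- function(data, n_calc = c("individual", "average", "max", "total")) {
  n_calc <- match.arg(n_calc)
  n_obs <- colSums(!is.na(data))  # non-missing observations per variable
  p <- ncol(data)
  n <- switch(n_calc,
    individual = n_obs,
    average    = rep(mean(n_obs), p),
    max        = rep(max(n_obs), p),
    total      = rep(nrow(data), p)
  )
  names(n) <- colnames(data)
  n
}

# Toy data with a different number of missing values per variable
toy <- data.frame(
  a = c(1, 2, NA, 4, 5),
  b = c(NA, NA, 3, 4, 5),
  c = c(1, 2, 3, 4, NA)
)

sample_sizes(toy, "individual")
sample_sizes(toy, "average")
sample_sizes(toy, "max")
sample_sizes(toy, "total")
```

In this toy data the rules give different values only because the variables differ in how many observations are missing; with complete data all four coincide.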

ic_type

Character string specifying the information criterion used for model selection in the node-wise regression models. Options are "bic" (default), "aic", and "aicc".

ordered

Logical vector indicating whether each variable in data should be treated as ordered categorical. Only used when data is provided. If a single logical value is supplied, it is recycled to all variables.

pcor_merge_rule

Character string specifying how regression weights from the node-wise models are merged into partial correlations. Possible values are:

"and": Estimates a partial correlation only if the regression weights in both directions (e.g., from node 1 to 2 and from node 2 to 1) are non-zero in the final models.

"or": Estimates a partial correlation if the regression weight in at least one direction is non-zero. When the weight in only one direction is included in the final models, that weight is used as the partial correlation.
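The difference between the two rules can be illustrated with a small sketch. The helper merge_betas() is hypothetical, and combining two non-zero weights as the signed geometric mean is an assumption (a common convention in node-wise estimation); the exact formula used by the function is not given in this documentation.

```r
# Hypothetical helper (not the package's internal code).
# betas[i, j] is the weight of predictor j in the regression for node i.
merge_betas <- function(betas, rule = c("and", "or")) {
  rule <- match.arg(rule)
  p <- nrow(betas)
  pcor <- matrix(0, p, p, dimnames = dimnames(betas))
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      b_ij <- betas[i, j]
      b_ji <- betas[j, i]
      if (b_ij != 0 && b_ji != 0) {
        # Assumption: signed geometric mean; opposite signs are set to zero
        value <- if (sign(b_ij) == sign(b_ji)) sign(b_ij) * sqrt(b_ij * b_ji) else 0
      } else if (rule == "or") {
        # Only one direction was selected: use the available weight
        value <- b_ij + b_ji
      } else {
        value <- 0
      }
      pcor[i, j] <- pcor[j, i] <- value
    }
  }
  pcor  # diagonal is left at zero in this sketch
}

betas <- matrix(c(0,   0.4, 0,
                  0.9, 0,   0.3,
                  0,   0,   0), nrow = 3, byrow = TRUE)

pcor_and <- merge_betas(betas, "and")  # edge 2-3 is dropped
pcor_or  <- merge_betas(betas, "or")   # edge 2-3 is kept with weight 0.3
```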

missing_handling

Character string specifying how correlations are estimated from the data input in the presence of missing values. Possible values are:

"two-step-em": Uses a classical EM algorithm to estimate the correlation matrix from data.

"stacked-mi": Uses stacked multiple imputation to estimate the correlation matrix from data.

"pairwise": Uses pairwise deletion to compute correlations from data.

"listwise": Uses listwise deletion to compute correlations from data.
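To make the two deletion-based options concrete, the following base R sketch contrasts pairwise and listwise deletion on a toy data set. It relies only on stats::cor() and is meant as an illustration of the concepts, not as the code the function runs internally.

```r
# Toy data with scattered missing values
toy <- data.frame(
  x = c(1.0, 2.1, 2.9, 4.2, NA, 6.1),
  y = c(2.0, NA, 3.1, 3.9, 5.2, 5.8),
  z = c(0.5, 1.9, 1.1, NA, 2.8, 3.5)
)

# Pairwise deletion: each correlation uses all rows observed on that pair
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Listwise deletion: only rows without any missing value are used
cor_listwise <- cor(toy, use = "complete.obs")

# Number of rows contributing to each correlation
observed <- 1 - is.na(toy)           # 1 = observed, 0 = missing
n_pairwise <- crossprod(observed)    # pair-specific counts
n_listwise <- sum(complete.cases(toy))
```

Pairwise deletion keeps more information per correlation, but the resulting entries are based on different subsets of rows, which is one reason the choice of sample size (see n_calc) matters.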

nimp

Number of imputations (default: 20) to be used when missing_handling = "stacked-mi".

imp_method

Character string specifying the imputation method to be used when missing_handling = "stacked-mi" (default: "pmm", i.e., predictive mean matching).

...

Further arguments passed to internal functions.

Details

This function estimates a network structure using neighborhood selection guided by information criteria. Simulations by Williams et al. (2019) indicated that using the "and" rule for merging regression weights tends to yield more accurate partial correlation estimates than the "or" rule.

The argument ic_type specifies which information criterion is computed. All criteria are computed based on the log-likelihood of the maximum likelihood estimated regression model, where the residual variance determines the likelihood. The following options are available:

"aic": Akaike Information Criterion (Akaike, 1974); defined as \(\mathrm{AIC} = -2\ell + 2k\), where \(\ell\) is the log-likelihood of the model and \(k\) is the number of estimated parameters (including the intercept).
"bic": Bayesian Information Criterion (Schwarz, 1978); defined as \(\mathrm{BIC} = -2\ell + k \log(n)\), where \(\ell\) is the log-likelihood of the model, \(k\) is the number of estimated parameters (including the intercept), and \(n\) is the sample size.
"aicc": Corrected Akaike Information Criterion (Hurvich & Tsai, 1989); particularly useful in small samples where AIC tends to be biased. Defined as \(\mathrm{AIC}_c = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}\), where \(k\) is the number of estimated parameters (including the intercept) and \(n\) is the sample size.
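The three criteria can be computed from a log-likelihood, a parameter count, and a sample size as in the sketch below. Both helper functions are hypothetical. The log-likelihood uses the standard Gaussian form evaluated at the maximum likelihood residual variance, which matches the description above but is not confirmed here as the package's exact implementation.

```r
# Gaussian log-likelihood at the ML estimate of the residual variance sigma2
gaussian_loglik <- function(sigma2, n) {
  -n / 2 * (log(2 * pi * sigma2) + 1)
}

# Hypothetical helper: AIC, BIC and AICc as defined above
info_criteria <- function(loglik, k, n) {
  aic <- -2 * loglik + 2 * k
  c(
    aic  = aic,
    bic  = -2 * loglik + k * log(n),
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
  )
}

# Example: residual variance 0.6, three parameters, 200 observations
ll <- gaussian_loglik(sigma2 = 0.6, n = 200)
info_criteria(ll, k = 3, n = 200)
```

Because the AICc correction term shrinks as \(n\) grows relative to \(k\), AIC and AICc select nearly the same models in large samples.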

Missing Handling

Besides pairwise and listwise deletion, the function offers two approaches for handling missing data: a two-step expectation-maximization (EM) algorithm and stacked multiple imputation. According to simulations by Nehler and Schultze (2024), stacked multiple imputation performs reliably across a range of sample sizes. In contrast, the two-step EM algorithm provides accurate results primarily when the sample size is large relative to the amount of missingness and the network complexity. In such cases it may still be preferred because of its much faster runtime.

Currently, the function only supports variables that are directly included in the network analysis; auxiliary variables for missing handling are not yet supported. During imputation, all variables are imputed by default using predictive mean matching (see, e.g., van Buuren, 2018), with all other variables in the data set serving as predictors.

References

Examples

# Estimate network from full data set
# using the Akaike Information Criterion
result <- neighborhood_net(
  data = mantar_dummy_full_cont,
  ic_type = "aic"
)

# View estimated partial correlations
result$pcor

# Estimate network for a data set with missing values
# using the Bayesian Information Criterion, individual sample sizes, and two-step EM
result_mis <- neighborhood_net(
  data = mantar_dummy_mis_cont,
  n_calc = "individual",
  missing_handling = "two-step-em",
  ic_type = "bic"
)

# View estimated partial correlations
result_mis$pcor
