cggm: Clusterpath Estimator of the Gaussian Graphical Model

Description

Compute the clusterpath estimator of the Gaussian Graphical Model (CGGM) for fixed values of the tuning parameters. The result is a sparse estimate of the precision matrix or the covariance matrix in which the variables are clustered.

Usage

cggm(
  S,
  W_cpath,
  lambda_cpath,
  W_lasso = NULL,
  lambda_lasso = 0,
  eps_lasso = 0.005,
  gss_tol = 0.005,
  conv_tol = 1e-07,
  fusion_threshold = NULL,
  tau = 0.001,
  max_iter = 5000,
  expand = FALSE,
  max_difference = 0.01,
  verbose = 0
)

Value

An object of class "CGGM" with the following components:

A,R

Lists of matrices. Each pair of matrices with the same index parametrizes the estimated precision matrix for the corresponding value of the aggregation parameter lambda_cpath. Using these directly is not recommended. Instead, use the accessor function get_Theta() to extract the estimated precision matrix for a given index of the aggregation parameter.

clusters

An integer matrix in which each row contains the cluster assignment of each variable for the corresponding value of the aggregation parameter lambda_cpath. Use the accessor function get_clusters() to extract the cluster assignment for a given index of the aggregation parameter.

lambdas

A vector with the values for the aggregation parameter lambda_cpath for which the CGGM loss function has been minimized.

Theta

List of matrices. Contains the solution to the minimization procedure for each value of the aggregation parameter lambda_cpath. Using these directly is not recommended. Instead, use the accessor function get_Theta() to extract the estimated precision matrix for a given index of the aggregation parameter.

losses

A vector with the values of the minimized CGGM loss function for each value of the aggregation parameter lambda_cpath.

cluster_counts

An integer vector containing the number of clusters obtained for each value of the aggregation parameter lambda_cpath.

loss_progression

A list of vectors. Contains, for each value of the aggregation parameter lambda_cpath, the value of the loss function for each iteration of the minimization procedure. This is only part of the output if expand = FALSE.

fusion_threshold

The threshold value used to determine whether two clusters should be fused.

cluster_solution_index

An integer vector containing the index of the value of the aggregation parameter lambda_cpath for which a certain number of clusters was attained. For example, cluster_solution_index[2] yields the index of the smallest value for lambda_cpath for which a solution with two clusters was found. Contains -1 if there is no value for lambda_cpath with that number of clusters.

n

The number of values of the aggregation parameter lambda_cpath for which the CGGM loss function was minimized.

inputs

A list of the inputs of the function, used internally and in cggm_refit(). It consists of eight components:

S (the sample covariance matrix)
W_cpath (the weight matrix for the clusterpath penalty)
gss_tol (the tolerance for the GSS algorithm)
conv_tol (the convergence tolerance)
max_iter (the maximum number of iterations)
lambda_lasso (the penalty parameter for the lasso penalty)
eps_lasso (parameter used for the quadratic approximation of the lasso penalty)
W_lasso (the weight matrix for the lasso penalty)
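
The relation between cluster_counts and cluster_solution_index described above can be illustrated in base R. This is a minimal sketch and not the package's internal code: the cluster_counts vector below is made up for illustration, and the loop simply follows the description (first index with a given number of clusters, or -1 if that number never occurs).

```r
# Hypothetical cluster counts for five values of lambda_cpath
# (illustrative values, not output of cggm()); p = 4 variables.
cluster_counts <- c(4L, 4L, 2L, 2L, 1L)
p <- 4

# For each number of clusters k = 1, ..., p, take the first index at
# which k clusters occur, or -1 if k clusters never occur.
cluster_solution_index <- vapply(
  seq_len(p),
  function(k) {
    idx <- which(cluster_counts == k)
    if (length(idx) == 0) -1L else idx[1]
  },
  integer(1)
)

cluster_solution_index
```

In this made-up sequence no solution has three clusters, so the third entry is -1. Because lambda_cpath is increasing, the first matching index corresponds to the smallest value of lambda_cpath with that number of clusters.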

Arguments

S: The sample covariance matrix of the data.
W_cpath: The weight matrix used in the clusterpath penalty.
lambda_cpath: A numeric vector of tuning parameters for regularization. Should be a sequence of monotonically increasing values.
W_lasso: The weight matrix used in the lasso penalty. Defaults to NULL, which is interpreted as all weights being zero (no penalization).
lambda_lasso: The penalty parameter used for the lasso penalty. Defaults to 0 (no penalization).
eps_lasso: Parameter that governs the quadratic approximation of the lasso penalty. Within the interval c(-eps_lasso, eps_lasso) the absolute value function is approximated by a quadratic function. Defaults to 0.005.
gss_tol: The tolerance value used in the golden section search (GSS) algorithm. Defaults to 0.005.
conv_tol: The tolerance used to determine convergence. Defaults to 1e-7.
fusion_threshold: The threshold for fusing two clusters. If NULL, defaults to tau times the median distance between the rows of solve(S).
tau: The parameter used to determine the fusion threshold. Defaults to 0.001.
max_iter: The maximum number of iterations allowed for the optimization algorithm. Defaults to 5000.
expand: Determines whether the vector lambda_cpath should be expanded with additional values in order to find a sequence of solutions that (a) terminates in the minimum number of clusters and (b) has consecutive solutions for Theta that are not too different from each other. The allowed degree of difference between consecutive solutions is determined by max_difference. Defaults to FALSE.
max_difference: The maximum allowed difference between consecutive solutions of Theta if expand = TRUE. The difference is computed as norm(Theta[i-1]-Theta[i], "F") / norm(Theta[i-1], "F"). Defaults to 0.01.
verbose: Determines the amount of information printed during the optimization. Printing information slows down the algorithm significantly. Defaults to 0 (nothing is printed).
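
Two of the quantities described in the arguments above, the default fusion threshold and the relative difference used with max_difference, can be reproduced in base R. The following is a rough sketch rather than the package's implementation. It assumes that "distance between the rows" means the Euclidean distance as computed by dist(), and the helper relative_difference() and the matrices used are hypothetical.

```r
# Small positive definite matrix standing in for a sample covariance matrix
S <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 4, 1,
    0, 0, 1, 4),
  nrow = 4
)

# Default fusion threshold as described for fusion_threshold = NULL:
# tau times the median distance between the rows of solve(S).
# Assumption: Euclidean distance between rows, via dist().
tau <- 0.001
fusion_threshold <- tau * median(as.vector(dist(solve(S))))

# Relative difference between two consecutive solutions, following the
# formula given for max_difference (hypothetical helper name).
relative_difference <- function(Theta_prev, Theta_next) {
  norm(Theta_prev - Theta_next, "F") / norm(Theta_prev, "F")
}

Theta_prev <- solve(S)
Theta_next <- Theta_prev * 1.005  # hypothetical next solution, 0.5% larger

# Check against the default max_difference of 0.01
relative_difference(Theta_prev, Theta_next) <= 0.01
```

Scaling a matrix by 1.005 changes it by 0.5% in relative Frobenius norm, so this pair of solutions would fall within the default max_difference of 0.01.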

Author

Daniel J.W. Touw

References

D.J.W. Touw, A. Alfons, P.J.F. Groenen and I. Wilms (2025) Clusterpath Gaussian Graphical Modeling. arXiv:2407.00644. doi:10.48550/arXiv.2407.00644.

Examples

## CGGM can be used to estimate a clustered precision matrix

# Generate data
set.seed(3)
Theta <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 4, 1,
    0, 0, 1, 4),
  nrow = 4
)
X <- mvtnorm::rmvnorm(n = 100, sigma = solve(Theta))

# Estimate the covariance matrix
S <- cov(X)

# Compute the weight matrix for the clusterpath (clustering) weights
W_cpath <- clusterpath_weights(S, phi = 1, k = 2)

# Compute the weight matrix for the lasso (sparsity) weights
W_lasso <- lasso_weights(S)

# Set values to be used for the aggregation parameter
lambdas <- seq(0, 0.2, by = 0.01)

# Estimate the precision matrix for each value of the aggregation
# parameter and a fixed value of the sparsity parameter
fit <- cggm(S, W_cpath = W_cpath, lambda_cpath = lambdas,
            W_lasso = W_lasso, lambda_lasso = 0.2)

# The index of the first value of lambda_cpath for which there are 2 clusters
keep <- fit$cluster_solution_index[2]

# Accessor functions that retrieve the solution with 2 clusters
get_Theta(fit, index = keep)
get_clusters(fit, index = keep)


# Often, it is not clear which values of the aggregation parameter
# make up the right sequence. But it can be expanded automatically.
fit <- cggm(S, W_cpath = W_cpath, lambda_cpath = lambdas,
            W_lasso = W_lasso, lambda_lasso = 0.2,
            expand = TRUE)

# A solution with 2 clusters
keep <- fit$cluster_solution_index[2]
get_Theta(fit, index = keep)
get_clusters(fit, index = keep)


## CGGM can also be used to estimate a clustered covariance matrix

# Generate data
set.seed(3)
Sigma <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 4, 1,
    0, 0, 1, 4),
  nrow = 4
)
X <- mvtnorm::rmvnorm(n = 100, sigma = Sigma)

# Estimate the covariance matrix and compute its inverse
S <- cov(X)
S_inv <- solve(S)

# Compute the weight matrix for the clusterpath (clustering) weights.
# The input is now the sample precision matrix.
W_cpath <- clusterpath_weights(S_inv, phi = 1, k = 2)

# Compute the weight matrix for the lasso (sparsity) weights.
# The input is again the sample precision matrix.
W_lasso <- lasso_weights(S_inv)

# Set values to be used for the aggregation parameter
lambdas <- seq(0, 0.2, by = 0.01)

# Use the sample precision matrix to estimate the covariance matrix
# for each value of the aggregation parameter and a fixed value of
# the sparsity parameter
fit <- cggm(S_inv, W_cpath = W_cpath, lambda_cpath = lambdas,
            W_lasso = W_lasso, lambda_lasso = 0.2, expand = TRUE)

# A solution with 2 clusters
keep <- fit$cluster_solution_index[2]
get_Theta(fit, index = keep)
get_clusters(fit, index = keep)
