SimulateGraphical: Data simulation for Gaussian Graphical Modelling

Description

Simulates data from a Gaussian Graphical Model (GGM).

Usage

SimulateGraphical(
  n = 100,
  pk = 10,
  theta = NULL,
  implementation = HugeAdjacency,
  topology = "random",
  nu_within = 0.1,
  nu_between = NULL,
  nu_mat = NULL,
  v_within = c(0.5, 1),
  v_between = c(0.1, 0.2),
  v_sign = c(-1, 1),
  continuous = TRUE,
  pd_strategy = "diagonally_dominant",
  ev_xx = NULL,
  scale_ev = TRUE,
  u_list = c(1e-10, 1),
  tol = .Machine$double.eps^0.25,
  scale = TRUE,
  output_matrices = FALSE,
  ...
)

Value

A list with:

data: simulated data with n observation and sum(pk) variables.
theta: adjacency matrix of the simulated graph.
omega: simulated (true) precision matrix. Only returned if output_matrices=TRUE.
phi: simulated (true) partial correlation matrix. Only returned if output_matrices=TRUE.
sigma: simulated (true) covariance matrix. Only returned if output_matrices=TRUE.
u: value of the constant u used for the simulation of omega. Only returned if output_matrices=TRUE.

Arguments

n: number of observations in the simulated dataset.
pk: vector of the number of variables per group in the simulated dataset. The number of nodes in the simulated graph is sum(pk). With multiple groups, the simulated (partial) correlation matrix has a block structure, where blocks arise from the integration of the length(pk) groups. This argument is only used if theta is not provided.
theta: optional binary and symmetric adjacency matrix encoding the conditional independence structure.
implementation: function for simulation of the graph. By default, algorithms implemented in huge.generator are used. Alternatively, a user-defined function can be used. It must take pk, topology and nu as arguments and return a (sum(pk)*(sum(pk))) binary and symmetric matrix for which diagonal entries are all equal to zero. This function is only applied if theta is not provided.
topology: topology of the simulated graph. If using implementation=HugeAdjacency, possible values are listed for the argument graph of huge.generator. These are: "random", "hub", "cluster", "band" and "scale-free".
nu_within: probability of having an edge between two nodes belonging to the same group, as defined in pk. If length(pk)=1, this is the expected density of the graph. If implementation=HugeAdjacency, this argument is only used for topology="random" or topology="cluster" (see argument prob in huge.generator). Only used if nu_mat is not provided.
nu_between: probability of having an edge between two nodes belonging to different groups, as defined in pk. By default, the same density is used for within and between blocks (nu_within=nu_between). Only used if length(pk)>1. Only used if nu_mat is not provided.
nu_mat: matrix of probabilities of having an edge between nodes belonging to a given pair of node groups defined in pk.
v_within: vector defining the (range of) nonzero entries in the diagonal blocks of the precision matrix. These values must be between -1 and 1 if pd_strategy="min_eigenvalue". If continuous=FALSE, v_within is the set of possible precision values. If continuous=TRUE, v_within is the range of possible precision values.
v_between: vector defining the (range of) nonzero entries in the off-diagonal blocks of the precision matrix. This argument is the same as v_within but for off-diagonal blocks. It is only used if length(pk)>1.
v_sign: vector of possible signs for precision matrix entries. Possible inputs are: -1 for positive partial correlations, 1 for negative partial correlations, or c(-1, 1) for both positive and negative partial correlations.
continuous: logical indicating whether to sample precision values from a uniform distribution between the minimum and maximum values in v_within (diagonal blocks) or v_between (off-diagonal blocks) (if continuous=TRUE) or from proposed values in v_within (diagonal blocks) or v_between (off-diagonal blocks) (if continuous=FALSE).
pd_strategy: method to ensure that the generated precision matrix is positive definite (and hence can be a covariance matrix). If pd_strategy="diagonally_dominant", the precision matrix is made diagonally dominant by setting the diagonal entries to the sum of absolute values on the corresponding row and a constant u. If pd_strategy="min_eigenvalue", diagonal entries are set to the sum of the absolute value of the smallest eigenvalue of the precision matrix with zeros on the diagonal and a constant u.
ev_xx: expected proportion of explained variance by the first Principal Component (PC1) of a Principal Component Analysis. This is the largest eigenvalue of the correlation (if scale_ev=TRUE) or covariance (if scale_ev=FALSE) matrix divided by the sum of eigenvalues. If ev_xx=NULL (the default), the constant u is chosen by maximising the contrast of the correlation matrix.
scale_ev: logical indicating if the proportion of explained variance by PC1 should be computed from the correlation (scale_ev=TRUE) or covariance (scale_ev=FALSE) matrix. If scale_ev=TRUE, the correlation matrix is used as parameter of the multivariate normal distribution.
u_list: vector with two numeric values defining the range of values to explore for constant u.
tol: accuracy for the search of parameter u as defined in optimise.
scale: logical indicating if the true mean is zero and true variance is one for all simulated variables. The observed mean and variance may be slightly off by chance.
output_matrices: logical indicating if the true precision and (partial) correlation matrices should be included in the output.
...: additional arguments passed to the graph simulation function provided in implementation.

Details

The simulation is done in two steps with (i) generation of a graph, and (ii) sampling from multivariate Normal distribution for which nonzero entries in the partial correlation matrix correspond to the edges of the simulated graph. This procedure ensures that the conditional independence structure between the variables corresponds to the simulated graph.

Step 1 is done using SimulateAdjacency.

In Step 2, the precision matrix (inverse of the covariance matrix) is simulated using SimulatePrecision so that (i) its nonzero entries correspond to edges in the graph simulated in Step 1, and (ii) it is positive definite (see MakePositiveDefinite). The inverse of the precision matrix is used as covariance matrix to simulate data from a multivariate Normal distribution.

The outputs of this function can be used to evaluate the ability of a graphical model to recover the conditional independence structure.

References

ourstabilityselectionfake

Examples

Run this code

oldpar <- par(no.readonly = TRUE)
par(mar = rep(7, 4))

# Simulation of random graph with 50 nodes
set.seed(1)
simul <- SimulateGraphical(n = 100, pk = 50, topology = "random", nu_within = 0.05)
print(simul)
plot(simul)

# Simulation of scale-free graph with 20 nodes
set.seed(1)
simul <- SimulateGraphical(n = 100, pk = 20, topology = "scale-free")
plot(simul)

# Extracting true precision/correlation matrices
set.seed(1)
simul <- SimulateGraphical(
  n = 100, pk = 20,
  topology = "scale-free", output_matrices = TRUE
)
str(simul)

# Simulation of multi-block data
set.seed(1)
pk <- c(20, 30)
simul <- SimulateGraphical(
  n = 100, pk = pk,
  pd_strategy = "min_eigenvalue"
)
mycor <- cor(simul$data)
Heatmap(mycor,
  col = c("darkblue", "white", "firebrick3"),
  legend_range = c(-1, 1), legend_length = 50,
  legend = FALSE, axes = FALSE
)
for (i in 1:2) {
  axis(side = i, at = c(0.5, pk[1] - 0.5), labels = NA)
  axis(
    side = i, at = mean(c(0.5, pk[1] - 0.5)),
    labels = ifelse(i == 1, yes = "Group 1", no = "Group 2"),
    tick = FALSE, cex.axis = 1.5
  )
  axis(side = i, at = c(pk[1] + 0.5, sum(pk) - 0.5), labels = NA)
  axis(
    side = i, at = mean(c(pk[1] + 0.5, sum(pk) - 0.5)),
    labels = ifelse(i == 1, yes = "Group 2", no = "Group 1"),
    tick = FALSE, cex.axis = 1.5
  )
}

# User-defined function for graph simulation
CentralNode <- function(pk, hub = 1) {
  theta <- matrix(0, nrow = sum(pk), ncol = sum(pk))
  theta[hub, ] <- 1
  theta[, hub] <- 1
  diag(theta) <- 0
  return(theta)
}
simul <- SimulateGraphical(n = 100, pk = 10, implementation = CentralNode)
plot(simul) # star
simul <- SimulateGraphical(n = 100, pk = 10, implementation = CentralNode, hub = 2)
plot(simul) # variable 2 is the central node

# User-defined adjacency matrix
mytheta <- matrix(c(
  0, 1, 1, 0,
  1, 0, 0, 0,
  1, 0, 0, 1,
  0, 0, 1, 0
), ncol = 4, byrow = TRUE)
simul <- SimulateGraphical(n = 100, theta = mytheta)
plot(simul)

# User-defined adjacency and block structure
simul <- SimulateGraphical(n = 100, theta = mytheta, pk = c(2, 2))
mycor <- cor(simul$data)
Heatmap(mycor,
  col = c("darkblue", "white", "firebrick3"),
  legend_range = c(-1, 1), legend_length = 50, legend = FALSE
)

par(oldpar)

Run the code above in your browser using DataLab

State of Data and AI Literacy Report 2025