To specify a degree-corrected stochastic blockmodel, you must specify
the degree-heterogeneity parameters (via n
or theta
),
the mixing matrix (via k
or B
), and the relative block
probabilities (optional, via pi
). We provide defaults for most of these
options to enable rapid exploration, or you can invest the effort
for more control over the model parameters. We strongly recommend
setting the expected_degree
or expected_density
argument
to avoid large memory allocations associated with
sampling large, dense graphs.
dcsbm(
n = NULL,
theta = NULL,
k = NULL,
B = NULL,
...,
pi = rep(1/k, k),
sort_nodes = TRUE,
force_identifiability = FALSE,
poisson_edges = TRUE,
allow_self_loops = TRUE
)
An undirected_dcsbm
S3 object, a subclass of the
undirected_factor_model()
with the following additional
fields:
theta
: A numeric vector of degree-heterogeneity parameters.
z
: The community memberships of each node, as a factor()
.
The factor will have k
levels, where k
is the number of
communities in the stochastic blockmodel. There will not
always necessarily be observed nodes in each community.
pi
: Sampling probabilities for each block.
sorted
: Logical indicating where nodes are arranged by
block (and additionally by degree heterogeneity parameter)
within each block.
(degree heterogeneity) The number of nodes in the blockmodel.
Use when you don't want to specify the degree-heterogeneity
parameters theta
by hand. When n
is specified, theta
is randomly generated from a LogNormal(2, 1)
distribution.
This is subject to change, and may not be reproducible.
n
defaults to NULL
. You must specify either n
or theta
, but not both.
(degree heterogeneity) A numeric vector
explicitly specifying the degree heterogeneity
parameters. This implicitly determines the number of nodes
in the resulting graph, i.e. it will have length(theta)
nodes.
Must be positive. Setting to a vector of ones recovers
a stochastic blockmodel without degree correction.
Defaults to NULL
. You must specify either n
or theta
,
but not both.
(mixing matrix) The number of blocks in the blockmodel.
Use when you don't want to specify the mixing-matrix by hand.
When k
is specified, the elements of B
are drawn
randomly from a Uniform(0, 1)
distribution.
This is subject to change, and may not be reproducible.
k
defaults to NULL
. You must specify either k
or B
, but not both.
(mixing matrix) A k
by k
matrix of block connection
probabilities. The probability that a node in block i
connects
to a node in community j
is Poisson(B[i, j])
. Must be
a square matrix. matrix
and Matrix
objects are both
acceptable. If B
is not symmetric, it will be
symmetrized via the update B := B + t(B)
. Defaults to NULL
.
You must specify either k
or B
, but not both.
Arguments passed on to undirected_factor_model
expected_degree
If specified, the desired expected degree
of the graph. Specifying expected_degree
simply rescales S
to achieve this. Defaults to NULL
. Do not specify both
expected_degree
and expected_density
at the same time.
expected_density
If specified, the desired expected density
of the graph. Specifying expected_density
simply rescales S
to achieve this. Defaults to NULL
. Do not specify both
expected_degree
and expected_density
at the same time.
(relative block probabilities) Relative block
probabilities. Must be positive, but do not need to sum
to one, as they will be normalized internally.
Must match the dimensions of B
or k
. Defaults to
rep(1 / k, k)
, or a balanced blocks.
Logical indicating whether or not to sort the nodes
so that they are grouped by block and by theta
. Useful for plotting.
Defaults to TRUE
. When TRUE
, nodes are first sorted by block
membership, and then by degree-correction parameters within each block.
Additionally, pi
is sorted in increasing order, and the columns
of the B
matrix are permuted to match the new order of pi
.
Logical indicating whether or not to
normalize theta
such that it sums to one within each block. Defaults
to FALSE
, since this behavior can be surprise when theta
is set
to a vector of all ones to recover the DC-SBM case.
Logical indicating whether or not
multiple edges are allowed to form between a pair of
nodes. Defaults to TRUE
. When FALSE
, sampling proceeds
as usual, and duplicate edges are removed afterwards. Further,
when FALSE
, we assume that S
specifies a desired between-factor
connection probability, and back-transform this S
to the
appropriate Poisson intensity parameter to approximate Bernoulli
factor connection probabilities. See Section 2.3 of Rohe et al. (2017)
for some additional details.
Logical indicating whether or not
nodes should be allowed to form edges with themselves.
Defaults to TRUE
. When FALSE
, sampling proceeds allowing
self-loops, and these are then removed after the fact.
There are two levels of randomness in a degree-corrected
stochastic blockmodel. First, we randomly chose a block
membership for each node in the blockmodel. This is
handled by dcsbm()
. Then, given these block memberships,
we randomly sample edges between nodes. This second
operation is handled by sample_edgelist()
,
sample_sparse()
, sample_igraph()
and
sample_tidygraph()
, depending depending on your desired
graph representation.
Let \(z_i\) represent the block membership of node \(i\). To generate \(z_i\) we sample from a categorical distribution (note that this is a special case of a multinomial) with parameter \(\pi\), such that \(\pi_i\) represents the probability of ending up in the ith block. Block memberships for each node are independent.
In addition to block membership, the DCSBM also allows nodes to have different propensities for edge formation. We represent this propensity for node \(i\) by a positive number \(\theta_i\). Typically the \(\theta_i\) are constrained to sum to one for identifiability purposes, but this doesn't really matter during sampling (i.e. without the sum constraint scaling \(B\) and \(\theta\) has the same effect on edge probabilities, but whether \(B\) or \(\theta\) is responsible for this change is uncertain).
Once we know the block memberships \(z\) and the degree
heterogeneity parameters \(theta\), we need one more
ingredient, which is the baseline intensity of connections
between nodes in block i
and block j
. Then each edge
\(A_{i,j}\) is Poisson distributed with parameter
$$ \lambda[i, j] = \theta_i \cdot B_{z_i, z_j} \cdot \theta_j. $$
Other stochastic block models:
directed_dcsbm()
,
mmsbm()
,
overlapping_sbm()
,
planted_partition()
,
sbm()
Other undirected graphs:
chung_lu()
,
erdos_renyi()
,
mmsbm()
,
overlapping_sbm()
,
planted_partition()
,
sbm()
set.seed(27)
lazy_dcsbm <- dcsbm(n = 1000, k = 5, expected_density = 0.01)
lazy_dcsbm
# sometimes you gotta let the world burn and
# sample a wildly dense graph
dense_lazy_dcsbm <- dcsbm(n = 500, k = 3, expected_density = 0.8)
dense_lazy_dcsbm
# explicitly setting the degree heterogeneity parameter,
# mixing matrix, and relative community sizes rather
# than using randomly generated defaults
k <- 5
n <- 1000
B <- matrix(stats::runif(k * k), nrow = k, ncol = k)
theta <- round(stats::rlnorm(n, 2))
pi <- c(1, 2, 4, 1, 1)
custom_dcsbm <- dcsbm(
theta = theta,
B = B,
pi = pi,
expected_degree = 50
)
custom_dcsbm
edgelist <- sample_edgelist(custom_dcsbm)
edgelist
# efficient eigendecompostion that leverages low-rank structure in
# E(A) so that you don't have to form E(A) to find eigenvectors,
# as E(A) is typically dense. computation is
# handled via RSpectra
population_eigs <- eigs_sym(custom_dcsbm)
Run the code above in your browser using DataLab