san: Use Simulated Annealing to attempt to match a network to a vector of mean statistics

Description

This function attempts to find a network or networks whose statistics match those passed in via the target.stats vector.

Usage

san(object, ...)
# S3 method for formula
san(
  object,
  response = NULL,
  reference = ~Bernoulli,
  constraints = ~.,
  target.stats = NULL,
  nsim = NULL,
  basis = NULL,
  output = c("network", "edgelist", "ergm_state"),
  only.last = TRUE,
  control = control.san(),
  verbose = FALSE,
  offset.coef = NULL,
  ...
)
# S3 method for ergm_model
san(
  object,
  reference = ~Bernoulli,
  constraints = ~.,
  target.stats = NULL,
  nsim = NULL,
  basis = NULL,
  output = c("network", "edgelist", "ergm_state"),
  only.last = TRUE,
  control = control.san(),
  verbose = FALSE,
  offset.coef = NULL,
  ...
)

Value

A network or list of networks that hopefully have network statistics close to the target.stats vector. Additionally, attr()-style attributes formula and stats are included.

Arguments

object

Either a formula or an ergm object. The formula should be of the form y ~ <model terms>, where y is a network object or a matrix that can be coerced to a network object. For the details on the possible <model terms>, see ergmTerm. To create a network object in , use the network() function, then add nodal attributes to it using the %v% operator if necessary.

...

Further arguments passed to other functions.

response

Either a character string, a formula, or NULL (the default), to specify the response attributes and whether the ERGM is binary or valued. Interpreted as follows:

NULL

Model simple presence or absence, via a binary ERGM.

character string

The name of the edge attribute whose value is to be modeled. Type of ERGM will be determined by whether the attribute is logical (TRUE/FALSE) for binary or numeric for valued.

a formula

must be of the form NAME~EXPR|TYPE (with | being literal). EXPR is evaluated in the formula's environment with the network's edge attributes accessible as variables. The optional NAME specifies the name of the edge attribute into which the results should be stored, with the default being a concise version of EXPR. Normally, the type of ERGM is determined by whether the result of evaluating EXPR is logical or numeric, but the optional TYPE can be used to override by specifying a scalar of the type involved (e.g., TRUE for binary and 1 for valued).

reference

A one-sided formula specifying the reference measure ($h(y)$) to be used. See help for ERGM reference measures implemented in the ergm package.

constraints

A one-sided formula specifying one or more constraints on the support of the distribution of the networks being simulated. See the documentation for a similar argument for ergm and see list of implemented constraints for more information. For simulate.formula, defaults to no constraints. For simulate.ergm, defaults to using the same constraints as those with which object was fitted.

target.stats

A vector of the same length as the number of non-offset statistics implied by the formula, which is either object itself in the case of san.formula or object$formula in the case of san.ergm.

nsim

Number of networks to generate. Deprecated: just use replicate().

basis

If not NULL, a network object used to start the Markov chain. If NULL, this is taken to be the network named in the formula.

output

Character, one of "network" (default), "edgelist", or "ergm_state": determines the output format. Partial matching is performed.

only.last

if TRUE, only return the last network generated; otherwise, return a network.list with nsim networks.

control

A list of control parameters for algorithm tuning, typically constructed with control.san(). Its documentation gives the the list of recognized control parameters and their meaning. The more generic utility snctrl() (StatNet ConTRoL) also provides argument completion for the available control functions and limited argument name checking.

verbose

A logical or an integer to control the amount of progress and diagnostic information to be printed. FALSE/0 produces minimal output, with higher values producing more detail. Note that very high values (5+) may significantly slow down processing.

offset.coef

A vector of offset coefficients; these must be passed in by the user. Note that these should be the same set of coefficients one would pass to ergm via its offset.coef argument.

formula

(By default, the formula is taken from the ergm object. If a different formula object is wanted, specify it here.

Methods (by class)

san(formula): Sufficient statistics are specified by a formula.
san(ergm_model): A lower-level function that expects a pre-initialized ergm_model.

Details

Acceptance probabilities for proposed toggles are computed as we now describe. There are two contributions: one from targeted statistics and one from offsets.

For the targeted statistics, a matrix of weights W is determined on each san iteration as follows. On the first iteration, the matrix W is the n by n identity matrix (n = number of target statistics), divided by n. On subsequent iterations: if control$SAN.invcov.diag = FALSE (the default), then the matrix W is the inverse of the covariance matrix of the targeted statistics, divided by the sum of its (the inverse's) diagonal; if control$SAN.invcov.diag = TRUE, then W is the inverse of the diagonal (regarded as a matrix) of the covariance matrix of the targeted statistics, divided by the sum of its (the inverse's) diagonal. In either of these two cases, the covariance matrix is computed based on proposals (not acceptances) made on the previous iteration, and the normalization for W is such that sum(diag(W)) = 1. The component of the acceptance probability coming from the targeted statistics is then computed for a given W as exp([y.Wy - x.Wx]/T) where T is the temperature, y the column vector of differences network statistics - target statistics computed before the current proposal is made, x the column vector of differences network statistics - target statistics computed assuming the current proposal is accepted, and . the dot product. If control$SAN.maxit > 1, then on the ith iteration, the temperature T takes the value control$SAN.tau * (1/i - 1/control$SAN.maxit)/(1 - 1/control$SAN.maxit); if control$SAN.maxit = 1, then the temperature T takes the value 0. Thus, T steps down from control$SAN.tau to 0 and is always 0 on the final iteration.

Offsets also contribute to the acceptance probability, as follows. If eta are the canonical offsets and Delta the corresponding change statistics for a given proposal, then the offset contribution to the acceptance probability is simply exp(eta.Delta) where . denotes the dot product. By default, finite offsets are ignored, but this behavior can be changed by setting control$SAN.ignore.finite.offsets = FALSE.

The overall acceptance probability is the product of the targeted statistics contribution and the offset contribution (with the product capped at one).

Examples

Run this code

# \donttest{
# initialize x to a random undirected network with 50 nodes and a density of 0.1
x <- network(50, density = 0.05, directed = FALSE)
 
# try to find a network on 50 nodes with 300 edges, 150 triangles,
# and 1250 4-cycles, starting from the network x
y <- san(x ~ edges + triangles + cycle(4), target.stats = c(300, 150, 1250))

# check results
summary(y ~ edges + triangles + cycle(4))

# initialize x to a random directed network with 50 nodes
x <- network(50)

# add vertex attributes
x %v% 'give' <- runif(50, 0, 1)
x %v% 'take' <- runif(50, 0, 1)

# try to find a set of 100 directed edges making the outward sum of
# 'give' and the inward sum of 'take' both equal to 62.5, so in
# edges (i,j) the node i tends to have above average 'give' and j
# tends to have above average 'take'
y <- san(x ~ edges + nodeocov('give') + nodeicov('take'), target.stats = c(100, 62.5, 62.5))

# check results
summary(y ~ edges + nodeocov('give') + nodeicov('take'))


# initialize x to a random undirected network with 50 nodes
x <- network(50, directed = FALSE)

# add a vertex attribute
x %v% 'popularity' <- runif(50, 0, 1)

# try to find a set of 100 edges making the total sum of
# popularity(i) and popularity(j) over all edges (i,j) equal to
# 125, so nodes with higher popularity are more likely to be
# connected to other nodes
y <- san(x ~ edges + nodecov('popularity'), target.stats = c(100, 125))
 
# check results
summary(y ~ edges + nodecov('popularity'))

# creates a network with denser "core" spreading out to sparser
# "periphery"
plot(y)
# }

Run the code above in your browser using DataLab