gof-methods: Conduct Goodness-of-Fit Diagnostics on ERGMs, TERGMs, SAOMs, and logit models

Description

Assess goodness of fit of btergm and other network models.

Usage

# S4 method for btergm
gof(object, target = NULL, formula = getformula(object), 
    nsim = 100, MCMC.interval = 1000, MCMC.burnin = 10000, 
    parallel = c("no", "multicore", "snow"), ncpus = 1, cl = NULL, 
    statistics = c(dsp, esp, deg, ideg, geodesic, rocpr, 
    walktrap.modularity), verbose = TRUE, ...)
# S4 method for mtergm
gof(object, target = NULL, formula = getformula(object), 
    nsim = 100, MCMC.interval = 1000, MCMC.burnin = 10000, 
    parallel = c("no", "multicore", "snow"), ncpus = 1, cl = NULL, 
    statistics = c(dsp, esp, deg, ideg, geodesic, rocpr, 
    walktrap.modularity), verbose = TRUE, ...)
# S4 method for ergm
gof(object, target = NULL, formula = getformula(object), 
    nsim = 100, MCMC.interval = 1000, MCMC.burnin = 10000, 
    parallel = c("no", "multicore", "snow"), ncpus = 1, cl = NULL, 
    statistics = c(dsp, esp, deg, ideg, geodesic, rocpr, 
    walktrap.modularity), verbose = TRUE, ...)
# S4 method for matrix
gof(object, covariates, coef, target = NULL, nsim = 100, 
    mcmc = FALSE, MCMC.interval = 1000, MCMC.burnin = 10000, 
    parallel = c("no", "multicore", "snow"), ncpus = 1, cl = NULL, 
    statistics = c(dsp, esp, deg, ideg, geodesic, rocpr, 
    walktrap.modularity), verbose = TRUE, ...)
# S4 method for network
gof(object, covariates, coef, target = NULL, 
    nsim = 100, mcmc = FALSE, MCMC.interval = 1000, 
    MCMC.burnin = 10000, parallel = c("no", "multicore", "snow"), 
    ncpus = 1, cl = NULL, statistics = c(dsp, esp, deg, ideg, 
    geodesic, rocpr, walktrap.modularity), verbose = TRUE, ...)
# S4 method for sienaFit
gof(object, period = NULL, parallel = c("no", 
    "multicore", "snow"), ncpus = 1, cl = NULL, structzero = 10, 
    statistics = c(esp, deg, ideg, geodesic, rocpr, walktrap.modularity), 
    groupName = object$f$groupNames[[1]], varName = NULL, 
    outofsample = FALSE, sienaData = NULL, sienaEffects = NULL, 
    nsim = NULL, verbose = TRUE, ...)
createGOF(simulations, target, statistics = c(dsp, esp, deg, 
    ideg, geodesic, rocpr, walktrap.modularity), parallel = "no", 
    ncpus = 1, cl = NULL, verbose = TRUE, ...)

Arguments

An optional parallel or snow cluster for use if parallel = "snow". If not supplied, a cluster on the local machine is created temporarily.

coef

A vector of coefficients.

covariates

A list of matrices or network objects that serve as covariates for the dependent network. The covariates in this list are automatically added to the formula as edgecov terms.

formula

A model formula from which networks are simulated for comparison. By default, the formula from the btergm object x is used. It is possible to hand over a formula with only a single response network and/or dyad or edge covariates or with lists of response networks and/or covariates. It is also possible to use indices like networks[[4]] or networks[3:5] inside the formula.

groupName

The group name used in the Siena model.

mcmc

Should statnet's MCMC methods be used for simulating new networks? If mcmc = FALSE, new networks are simulated based on predicted tie probabilities of the regression equation.

MCMC.burnin

Internally, this package uses the simulation facilities of the ergm package to create new networks against which to compare the original network(s) for goodness-of-fit assessment. This argument sets the MCMC burnin to be passed over to the simulation command. The default value is 10000. There is no general rule of thumb on the selection of this parameter, but if the results look suspicious (e.g., when the model fit is perfect), increasing this value may be helpful.

MCMC.interval

Internally, this package uses the simulation facilities of the ergm package to create new networks against which to compare the original network(s) for goodness-of-fit assessment. This argument sets the MCMC interval to be passed over to the simulation command. The default value is 1000, which means that every 1000th simulation outcome from the MCMC sequence is used. There is no general rule of thumb on the selection of this parameter, but if the results look suspicious (e.g., when the model fit is perfect), increasing this value may be helpful.

ncpus

The number of CPU cores used for parallel GOF assessment (only if parallel is activated). If the number of cores should be detected automatically on the machine where the code is executed, one can try the detectCores() function from the parallel package. On some HPC clusters, the number of available cores is saved as an environment variable; for example, if MOAB is used, the number of available cores can sometimes be accessed using Sys.getenv("MOAB_PROCCOUNT"), depending on the implementation. Note that the maximum number of connections in a single R session (i.e., to other cores or for opening files etc.) is 128, so fewer than 128 cores should be used at a time.

nsim

The number of networks to be simulated at each time step. Example: If there are six time steps in the formula and nsim = 100, a total of 600 new networks is simulated. The comparison between simulated and observed networks is only done within time steps. For example, the first 100 simulations are compared with the first observed network, simulations 101-200 with the second observed network etc.

object

A btergm, ergm, or sienaFit object (for the btergm, ergm, and sienaFit methods, respectively). Or a network object or matrix (for the network and matrix methods, respectively).

outofsample

Should out-of-sample prediction be attempted? If so, some additional arguments must be provided: sienaData, sienaEffects, and nsim. The sienaData object must contain a base and a target network for out-of-sample prediction. The sienaEffects must contain the effects to be used for the simulations. The estimates will be taken from the estimated object, and they will be injected into a new SAOM and fixed during the sampling procedure. nsim determines how many simulations are used for the out-of-sample comparison.

parallel

Use multiple cores in a computer or nodes in a cluster to speed up the simulations. The default value "no" means parallel computing is switched off. If "multicore" is used (only available for sienaAlgorithm and sienaModel objects), the mclapply function from the parallel package (formerly in the multicore package) is used for parallelization. This should run on any kind of system except MS Windows because it is based on forking. It is usually the fastest type of parallelization. If "snow" is used, the parLapply function from the parallel package (formerly in the snow package) is used for parallelization. This should run on any kind of system including cluster systems and including MS Windows. It is slightly slower than the former alternative if the same number of cores is used. However, "snow" provides support for MPI clusters with a large amount of cores, which multicore does not offer (see also the cl argument). Note that "multicore" will only work if all cores are on the same node. For example, if there are three nodes with eight cores each, a maximum of eight CPUs can be used. Parallel computing is described in more detail on the help page of btergm.

period

Which transition between time periods should be used for GOF assessment? By default, all transitions between all time periods are used. For example, if there are three consecutive networks, this will extract simulations from the transitions between 1 and 2 and between 2 and 3, respectively, and these simulations will be compared to the networks at time steps 2 and 3, respectively. The time period can be provided as a numeric, e.g., period = 4 for extracting the simulations between time steps 4 and 5 (= the fourth transition) and predicting the fifth network. Values lower than 1 or larger than the number of consecutive networks minus 1 are therefore not permitted. This argument is only used if out-of-sample prediction is switched off.

sienaData

An object of the class siena, which is usually created using the sienaDataCreate function in the RSiena package. This argument is only used for out-of-sample prediction. The object must be based on a sienaDependent object that contains two networks: the base network from which to simulate forward, and the target network which you want to predict out-of-sample. The object can contain further objects for storing covariates etc. that are necessary for estimating new networks. The best practice is to create an object that is identical to the siena object used for estimating the model, except that it contains the base and the target network instead of the dependent variable/networks.

sienaEffects

An object of the class sienaEffects, which is usually created using the getEffects() and the includeEffects() functions in the RSiena package. The best practice is to provide a sienaEffects object that is identical to the object used to create the original model (that is, it should contain the same effects), except that it should be based on the siena object provided through the sienaData argument. In other words, the sienaEffects object should be based on the base and target network used for out-of-sample prediction, and it should contain the same effects as those used for the original estimation. This argument is used only for out-of-sample prediction.

simulations

A list of network objects or sparse matrices (generated using the Matrix package) representing simulated networks.

statistics

A list of functions used for comparison of observed and simulated networks. Note that the list should contain the actual functions, not a character representation of them. See gof-statistics for details.

target

In the gof function: A network or list of networks to which the simulations are compared. If left empty, the original networks from the btergm object x are used as observed networks. In the createGOF function: a list of sparse matrices (generated using the Matrix package) or a list of network objects (generated using the network package). The simulations are compared against these target networks.

structzero

Which value was used for structural zeros (usually nodes that have dropped out of the network or have not yet joined the network) in the dependent variable/network? These nodes are removed from the observed network and the simulations before comparison. Usually, the value 10 is used for structural zeros in Siena.

varName

The variable name that denotes the dependent networks in the Siena model.

verbose

Print details?

...

Arbitrary further arguments to be passed on to the statistics. See also the help page for the gof-statistics.

Details

The generic gof function provides goodness-of-fit measures and degeneracy checks for btergm, mtergm, ergm, sienaFit, and custom dyadic-independent models. The user can provide a list of network statistics for comparing simulated networks based on the estimated model with the observed network(s). See gof-statistics. The objects created by these methods can be displayed using various plot and print methods (see gof-plot).

In-sample GOF assessment is the default, which means that the same time steps are used for creating simulations and for comparison with the observed network(s). It is possible to do out-of-sample prediction by specifying a (list of) target network(s) using the target argument. If a formula is provided, the simulations are based on the networks and covariates specified in the formula. This is helpful in situations where complex out-of-sample predictions have to be evaluated. A usage scenario could be to simulate from a network at time t (provided through the formula argument) and compare to an observed network at time t + 1 (the target argument). This can be done, for example, to assess predictive performance between time steps of the original networks, or to check whether the model performs well with regard to a newly measured network given the old data from the previous time step.

Predictive fit can also be assessed for stochastic actor-oriented models (SAOM) as implemented in the RSiena package. After compiling the usual objects (model, data, effects), one of the time steps can be predicted based on the previous time step and the SAOM using the sienaFit method of the gof function. By default, however, within-sample fit is used for SAOMs, just like for (T)ERGMs.

The gof methods for networks and matrices serve to assess the goodness of fit of a dyadic-independence model. To do this, the method requires a vector of coefficients (one coefficient for the intercept or edges term and one coefficient for each covariate), a list of covariates (in matrix or network shape), and a dependent network or matrix. This is useful for assessing the goodness of fit of QAP-adjusted logistic regression models (as implemented in the netlogit function in the sna package) or other dyadic-independence models, such as models fitted using glm. Note that this method only works with cross-sectional models and does not accept lists of networks as input data.

The createGOF function is used internally by the gof function in order to create a gof object from a list of simulated networks and a list of target networks to compare against. It can also be used directly by the end user if the user wants to supply lists of simulated and target networks from other sources.

References

Leifeld, Philip, Skyler J. Cranmer and Bruce A. Desmarais (2017): Temporal Exponential Random Graph Models with btergm: Estimation and Bootstrap Confidence Intervals. Journal of Statistical Software 83(6): 1-36. http://dx.doi.org/10.18637/jss.v083.i06.

Leifeld, Philip and Skyler J. Cranmer (2014): A Theoretical and Empirical Comparison of the Temporal Exponential Random Graph Model and the Stochastic Actor-Oriented Model. Paper presented at the 7th Political Networks Conference, McGill University, Montreal, Canada, May 30. https://arxiv.org/abs/1506.06696.

Examples

Run this code

# NOT RUN {
# First, create data and fit a TERGM...
networks <- list()
for(i in 1:10){            # create 10 random networks with 10 actors
  mat <- matrix(rbinom(100, 1, .25), nrow = 10, ncol = 10)
  diag(mat) <- 0           # loops are excluded
  nw <- network(mat)       # create network object
  networks[[i]] <- nw      # add network to the list
}

covariates <- list()
for (i in 1:10) {          # create 10 matrices as covariate
  mat <- matrix(rnorm(100), nrow = 10, ncol = 10)
  covariates[[i]] <- mat   # add matrix to the list
}

fit <- btergm(networks ~ edges + istar(2) +
    edgecov(covariates), R = 100)

# Then assess the goodness of fit:
g <- gof(fit, statistics = c(triad.directed, esp, fastgreedy.modularity, 
    rocpr), nsim = 50)
g
plot(g)  # see ?"gof-plot" for details

# createGOF can also be used with user-supplied simulations:
library("statnet")
data(florentine)
gest <- ergm(flomarriage ~ edges + absdiff("wealth"))
sim <- simulate(gest, nsim = 50)
g <- createGOF(sim, list(flomarriage), statistics = c(esp, ideg), roc = FALSE)
g
plot(g)

# The help page for the Knecht dataset (?knecht) contains another example.
# }

Run the code above in your browser using DataLab