ergm: Exponential Family Random Graph Models

Description

ergm is used to fit linear exponential random network models, in which the probability of a given network, $y$, on a set of nodes is $\exp(\theta{\cdot}g(y))/c(\theta)$, where $g(y)$ is a vector of network statistics, $\theta$ is a parameter vector of the same length and $c(\theta)$ is the normalizing constant for the distribution. ergm can return either a maximum pseudo-likelihood estimate or an approximate maximum likelihood estimator based on a Monte Carlo scheme.

Usage

ergm(formula, theta0="MPLE",
     MPLEonly=FALSE, MLestimate=!MPLEonly, seed=NULL,
     burnin=10000, MCMCsamplesize=10000, interval=100, maxit=3,
     constraints=~.,
     meanstats = NULL,
     control=control.ergm(),
     eval.loglik=FALSE,
     verbose=FALSE, ...)

Arguments

formula

formula; an Rformula object, of the form y ~ , where y is a network object or a matrix that can be

theta0

vector; the parameter value used to generate the MCMC sample and as a starting value for the estimation. By default the MPLE is used (theta0="MPLE").

MPLEonly

logical; TRUE if the maximum pseudo-likelihood estimate is to be computed and returned. Note that MPLEonly=TRUE will render moot most other parameters in this list. See ergmMPLE<

MLestimate

logical; TRUE if only the Monte Carlo maximum likelihood estimate is to be computed and returned.

burnin

count; the number of proposals before any MCMC sampling is done. It typically is set to a fairly large number.

MCMCsamplesize

count; the number of network statistics, randomly drawn from a given distribution on the set of all networks, returned by the Metropolis-Hastings algorithm.

interval

count; the number of proposals between sampled statistics.

maxit

count; the number of times the parameter for the MCMC should be updated by maximizing the MCMC likelihood. At each step the parameter is changed to the values that maximizes the MCMC likelihood based on the current sample.

constraints

A one-sided formula specifying one or more constraints on the support of the distribution of the networks being modeled, using syntax similar to the formula argument. Multiple constraints may be given, separated by +

meanstats

vector; the mean-value parameter value for the model. If this is given, the algorithm finds the natural parameter value corresponding to this mean-value parameter. If it is missing, the mean-value parameters used are the observed statisti

control

A list of control parameters for algorithm tuning. Constructed using control.ergm.

seed

integer; random number integer seed. Defaults to NULL to use whatever the state of the random number generator is at the time of the call.

eval.loglik

For dyad-dependent models, whether to use bridge sampling to evaluate the log-likelihoood associated with the fit. Has no effect for dyad-independent models. Defaults to FALSE because bridge sampling takes additional time.

verbose

logical; if this is TRUE, the program will print out additional information, including goodness of fit statistics.

...

Additional arguments, to be passed to lower-level functions in the future.

Value

ergm returns an object of class ergm that is a list consisting of the following elements:
coefThe Monte Carlo maximum likelihood estimate of $\theta$, the vector of coefficients for the model parameters.
sampleThe $n\times p$ matrix of network statistics, where $n$ is the sample size and $p$ is the number of network statistics specified in the model, that is used in the maximum likelihood estimation routine.
iterationsThe number of Newton-Raphson iterations required before convergence.
MCMCthetaThe value of $\theta$ used to produce the Markov chain Monte Carlo sample. As long as the Markov chain mixes sufficiently well, sample is roughly a random sample from the distribution of network statistics specified by the model with the parameter equal to MCMCtheta. If MPLEonly=TRUE then MCMCtheta equals the MPLE.
loglikelihoodThe approximate change in log-likelihood in the last iteration. The value is only approximate because it is estimated based on the MCMC random sample.
gradientThe value of the gradient vector of the approximated loglikelihood function, evaluated at the maximizer. This vector should be very close to zero.
covarApproximate covariance matrix for the MLE, based on the inverse Hessian of the approximated loglikelihood evaluated at the maximizer.
samplesizeThe size of the MCMC sample
failureLogical: Did the MCMC estimation fail?
mc.seMCMC standard error estimates
newnetworkThe final network at the end of the MCMC simulation
burninIf included, the burnin used for the MCMC simulation
intervalIf included, the interval used for the MCMC simulation
networkOriginal network
theta.originalThe first value of theta0
mplefitThe MPLE fit as a glm object.
null.devianceDeviance of the null model.
mle.likThe approximate log-likelihood for the MLE. The value is only approximate because it is estimated based on the MCMC random sample.
etamapThe set of functions mapping the true parameter theta to the canonical parameter eta (irrelevant except in a curved exponential family model)
degeneracy.valueScore calculated to assess the degree of degeneracy in the model.
degeneracy.typeSupporting output for degeneracy.value. Mainly for internal use.
formulaThe original formula entered into the ergm function.
constraintsConstraints used by original ergm call
prop.weightsMCMC proposal weights used by original ergm call (part of the control.ergm function output).
offsetvector of logical telling which model parameters are to be set at a fixed value (i.e., not estimated).
droplist of terms that were dropped due to extreme values of the corresponding statistics on the observed network.
See the method print.ergm for details on how an ergm object is printed. Note that the method summary.ergm returns a summary of the relevant parts of the ergm object in concise summary format.

Model Terms

The ergm function allows the user to explore a large number of potential models for their network data. The terms currently supported by the program, and a brief description of each is given in the documentation ergm-terms. In the formula for the model, the model terms are various function-like calls, some of which require arguments, separated by + signs. For a more detailed understanding of the model terms, see and Morris, Handcock and Hunter (2008).

Notes on model specification

Although each of the statistics in a given model is a summary statistic for the entire network, it is rarely necessary to calculate statistics for an entire network in a proposed Metropolis-Hastings step.Thus, for example, if the triangle term is included in the model, a census of all triangles in the observed network is never taken; instead, only the change in the number of triangles is recorded for each edge toggle.

In the implementation of ergm, the model is initialized in R, then all the model information is passed to a C program that generates the sample of network statistics using MCMC. This sample is then returned to R, which implements a simple Newton-Raphson algorithm to approximate the MLE. An alternative style of maximum likelihood estimation is to use a stochastic approximation algorithm. This can be chosen with the control.ergm(style="Robbins-Monro") option. The mechanism for proposing new networks for the MCMC sampling scheme, which is a Metropolis-Hastings algorithm, depends on two things: The constraints, which define the set of possible networks that could be proposed in a particular Markov chain step, and the weights placed on these possible steps by the proposal distribution. The former may be controlled using the constraints argument described above. The latter may be controlled using the prop.weights argument to the control.ergm function.

The package is designed so that the user could conceivably add additional proposal types.

Placing Bounds on Degrees:

There are many times when one may wish to condition on the number of inedges or outedges possessed by a node, either as a consequence of some intrinsic property of that node (e.g., to control for activity or popularity processes), to account for known outliers of some kind, and thus we wish to limit its indegree, an intrinsic property of the sampling scheme whence came our data (e.g., the survey asked everyone to name only three friends total) or as a function of the attributes of the nodes to which a node has edges (e.g., we specify that nodes designated male have a maximum number of outdegrees to nodes designated female). To accomplish this we use the constraints term bd.

Let's consider the simple cases first. Suppose you want to condition on the total number of degrees regardless of attributes. That is, if you had a survey that asked respondents to name three alters and no more, then you might want to limit your maximal outdegree to three without regard to any of the alters' attributes. The argument is then:

constraints=~bd(maxout=3)

Similar calls are used to restrict the number of indegrees (maxin), the minimum number of outdegrees (minout), and the minimum number of indegrees (minin).

You can also set ego specific limits. For example:

constraints=bd(maxout=rep(c(3,4),c(36,35)))

limits the first 36 to 3 and the other 35 to 4 outdegrees.

Multiple restrictions can be combined. bd is very flexible. In general, the bd term can contain up to five arguments:

bd(attribs=attribs, maxout=maxout, maxin=maxin, minout=minout, minin=minin)

Omitted arguments are unrestricted, and arguments of length 1 are replicated out to all nodes (as above). If an individual entry in maxout,..., minin is NA then no restriction of that kind is applied to that actor.

In general, attribs is a matrix of the attributes on which we are conditioning. The dimensions of attribs are n_nodes rows by attrcount columns, where attrcount is the number of distinct attribute values on which we want to condition (i.e., a separate column is required for male and female if we want to condition on the number of ties to both male and female partners). The value of attribs[n, i], therefore, is TRUE if node n has attribute value i, and FALSE otherwise. (Note that, since each column represents only a single value of a single attribute, the values of this matrix are all Boolean (TRUE or FALSE).) It is important to note that attribs is a matrix of nodal attributes, not alter attributes.

So, for instance, if we wanted to construct an attribs matrix with two columns, one each for male and female attribute values (we are conditioning on these values of the attribute sex), and the attribute sex is represented in ads.sex as an n_node-long vector of 0s and 1s (men and women), then our code would look as follows:

# male column: bit vector, TRUE for males attrsex1 <- (ads.sex == 0) # female column: bit vector, TRUE for females attrsex2 <- (ads.sex == 1) # now create attribs matrix attribs <- matrix(ncol=2,nrow=71, data=c(attrsex1,attrsex2))

maxout is a matrix of alter attributes, with the same dimensions as the attribs matrix. maxout is n_nodes rows by attrcount columns. The value of maxout[n,i], therefore, is the maximum number of outdegrees permitted from node n to nodes with the attribute i (where a NA means there is no maximum).

For example: if we wanted to create a maxout matrix to work with our attribs matrix above, with a maximum from every node of five outedges to males and five outedges to females, our code would look like this:

# every node has maximum of 5 outdegrees to male alters maxoutsex1 <- c(rep(5,71)) # every node has maximum of 5 outdegrees to female alters maxoutsex2 <- c(rep(5,71)) # now create maxout matrix maxout <- cbind(maxoutsex1,maxoutsex2)

The maxin, minout, and minin matrices are constructed exactly like the maxout matrix, except for the maximum allowed indegree, the minimum allowed outdegree, and the minimum allowed indegree, respectively. Note that in an undirected network, we only look at the outdegree matrices; maxin and minin will both be ignored in this case.

References

Admiraal R, Handcock MS (2007). networksis: Simulate bipartite graphs with fixed marginals through sequential importance sampling. Statnet Project, Seattle, WA. Version 1. http://statnetproject.org.

Bender-deMoll S, Morris M, Moody J (2008). Prototype Packages for Managing and Animating Longitudinal Network Data: dynamicnetwork and rSoNIA. Journal of Statistical Software, 24(7). http://www.jstatsoft.org/v24/i07/.

Butts CT (2006). netperm: Permutation Models for Relational Data. Version 0.2, http://erzuli.ss.uci.edu/R.stuff.

Butts CT (2007). sna: Tools for Social Network Analysis. Version 1.5, http://erzuli.ss.uci.edu/R.stuff.

Butts CT (2008). network: A Package for Managing Relational Data in R. Journal of Statistical Software, 24(2). http://www.jstatsoft.org/v24/i02/.

Butts CT, with help~from David~Hunter, Handcock MS (2007). network: Classes for Relational Data. Version 1.2, http://erzuli.ss.uci.edu/R.stuff.

Goodreau SM, Handcock MS, Hunter DR, Butts CT, Morris M (2008a). A statnet Tutorial. Journal of Statistical Software, 24(8). http://www.jstatsoft.org/v24/i08/.

Goodreau SM, Kitts J, Morris M (2008b). Birds of a Feather, or Friend of a Friend? Using Exponential Random Graph Models to Investigate Adolescent Social Networks. Demography, 45, in press.

Handcock, M. S. (2003) Assessing Degeneracy in Statistical Models of Social Networks, Working Paper #39, Center for Statistics and the Social Sciences, University of Washington. www.csss.washington.edu/Papers/wp39.pdf

Handcock MS (2003b). degreenet: Models for Skewed Count Distributions Relevant to Networks. Statnet Project, Seattle, WA. Version 1.0, http://statnetproject.org.

Handcock MS, Hunter DR, Butts CT, Goodreau SM, Morris M (2003a). ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks. Statnet Project, Seattle, WA. Version 2, http://statnetproject.org.

Handcock MS, Hunter DR, Butts CT, Goodreau SM, Morris M (2003b). statnet: Software Tools for the Statistical Modeling of Network Data. Statnet Project, Seattle, WA. Version 2, http://statnetproject.org.

Hunter, D. R. and Handcock, M. S. (2006) Inference in curved exponential family models for networks, Journal of Computational and Graphical Statistics.

Hunter DR, Handcock MS, Butts CT, Goodreau SM, Morris M (2008b). ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks. Journal of Statistical Software, 24(3). http://www.jstatsoft.org/v24/i03/.

Krivitsky PN, Handcock MS (2007). latentnet: Latent position and cluster models for statistical networks. Seattle, WA. Version 2, http://statnetproject.org.

Morris M, Handcock MS, Hunter DR (2008). Specification of Exponential-Family Random Graph Models: Terms and Computational Aspects. Journal of Statistical Software, 24(4). http://www.jstatsoft.org/v24/i04/. Snijders, T.A.B. (2002), Markov Chain Monte Carlo Estimation of Exponential Random Graph Models. Journal of Social Structure. Available from http://www.cmu.edu/joss/content/articles/volume3/Snijders.pdf.

Examples

Run this code

#
# load the Florentine marriage data matrix
#
data(flo)
#
# attach the sociomatrix for the Florentine marriage data
# This is not yet a network object.
#
flo
#
# Create a network object out of the adjacency matrix
#
flomarriage <- network(flo,directed=FALSE)
flomarriage
#
# print out the sociomatrix for the Florentine marriage data
#
flomarriage[,]
#
# create a vector indicating the wealth of each family (in thousands of lira) 
# and add it as a covariate to the network object
#
flomarriage %v% "wealth" <- c(10,36,27,146,55,44,20,8,42,103,48,49,10,48,32,3)
flomarriage
#
# create a plot of the social network
#
plot(flomarriage)
#
# now make the vertex size proportional to their wealth
#
plot(flomarriage, vertex.cex="wealth", main="Marriage Ties")
#
# Use 'data(package = "ergm")' to list the data sets in a
#
data(package="ergm")
#
# Load a network object of the Florentine data
#
data(florentine)
#
# Fit a model where the propensity to form ties between
# families depends on the absolute difference in wealth
#
gest <- ergm(flomarriage ~ edges + absdiff("wealth"))
summary(gest)
#
# add terms for the propensity to form 2-stars and triangles
# of families 
#
gest <- ergm(flomarriage ~ kstar(1:2) + absdiff("wealth") + triangle)
summary(gest)

# import synthetic network that looks like a molecule
data(molecule)
# Add a attribute to it to mimic the atomic type
molecule %v% "atomic type" <- c(1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
#
# create a plot of the social network
# colored by atomic type
#
plot(molecule, vertex.col="atomic type",vertex.cex=3)

# measure tendency to match within each atomic type
gest <- ergm(molecule ~ edges + kstar(2) + triangle + nodematch("atomic type"),
  MCMCsamplesize=10000)
summary(gest)

# compare it to differential homophily by atomic type
gest <- ergm(molecule ~ edges + kstar(2) + triangle
                              + nodematch("atomic type",diff=TRUE),
  MCMCsamplesize=10000)
summary(gest)

Run the code above in your browser using DataLab

Data engineering and BI courses are free!