
preprocess(object, ..., lag = FALSE, covariate = FALSE,
    memory = c("no", "autoregression", "stability",
    "innovation", "loss"), na = NA, na.method = "fillmode",
    structzero = -9, structzero.method = "remove",
    verbose = FALSE)
btergm function) or a liobject
argument), the missing data are also removed from the object
covariate argument. If lag = TRUE and covariate = TRUE, the first item in the list is adjusted
lag = TRUE is active). See the entry for lag for more details.
memory = "no" will not create a memory term. memor
handleMissings function).
remove and fillmode. See the handleMissings function for details.
handleMissings function).
remove and fillmode. See the handleMissings function for details.
Before calling the btergm function, lists of network matrices or covariates can be preprocessed using the preprocess function. The first object that is provided as an argument is adjusted to all remaining objects that are handed over via the ... argument. The preprocess function can deal with constant or time-varying matrix, vector and data.frame objects. The user can specify whether the object of interest is a dependent network or a covariate and whether NA and structural zero removal should take into account missing data at previous or following time steps (lag = TRUE).
The preprocessing procedure consists of four steps: (1) removal or imputation of missing data and/or structural zeroes in each object at each time step; (2) cross-sectional adjustment of matrix dimensions (e.g., if the object of interest has more observations than another object at the current time step, the surplus observations are dropped from the object of interest); (3) backward-looking adjustment of matrix dimensions (if the object of interest is a dependent variable/network and the lag = TRUE argument is used); and (4) forward-looking adjustment of matrix dimensions (if the object of interest is a covariate and the lag = TRUE argument is used). The last two steps can be helpful for building lagged or delayed covariates which have to take into account missing data at other time steps as well.
For example, if 26 nodes are present during the first time step, 25 during the second, 23 during the third, and 25 during the fourth time step, the arguments covariate = TRUE and lag = TRUE return a list of three objects with 25, 23 and 25 nodes, respectively, because the covariate is assumed to lag one step behind the other objects provided after the first object.
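The dimension adjustments in steps (2) to (4) can be illustrated with a minimal base R sketch. It is not the package's implementation: the helpers keep_shared and make_net are hypothetical names introduced here, node labels are assumed to be stored as row and column names, and missing data handling (step 1) is ignored.

```r
# Hypothetical helper (not part of the package): keep only the nodes of `x`
# that also occur in `reference`, matching on row names.
keep_shared <- function(x, reference) {
  shared <- intersect(rownames(x), rownames(reference))
  x[shared, shared, drop = FALSE]
}

# Hypothetical helper: an empty adjacency matrix with labelled nodes.
make_net <- function(nodes) {
  matrix(0, nrow = length(nodes), ncol = length(nodes),
         dimnames = list(nodes, nodes))
}

# Three time steps with changing node sets: a-e, b-f, c-e
nets <- list(make_net(letters[1:5]),
             make_net(letters[2:6]),
             make_net(letters[3:5]))

# Backward-looking: dependent networks at t = 2, 3 are reduced to the
# nodes that were also present at t = 1, 2.
dep <- Map(keep_shared, nets[2:3], nets[1:2])

# Forward-looking: lagged covariates at t = 1, 2 are reduced to the
# nodes that are also present at t = 2, 3.
lagged <- Map(keep_shared, nets[1:2], nets[2:3])

sapply(dep, nrow)     # number of nodes per dependent network
sapply(lagged, nrow)  # number of nodes per lagged covariate
```

In this toy setting, each dependent network and its lagged covariate end up with the same node labels, which is the property the lag = TRUE argument is meant to guarantee.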
If the memory argument is specified (see below for details on the allowed values), the function will not return a lagged network, but will attempt to create a memory term. A memory term is a dyadic covariate which captures the temporal process.
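To make the idea of a dyadic memory covariate concrete, the following base R sketch derives two such covariates from a lagged network. The encodings are assumptions made for illustration (the lagged network itself for an autoregression-style term; +1 for previous ties and -1 for previous non-ties for a stability-style term); the coding used by the package may differ.

```r
# Toy lagged network (t - 1) with three nodes
lagged <- matrix(c(0, 1, 0,
                   1, 0, 0,
                   0, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(letters[1:3], letters[1:3]))

# Assumed autoregression-style covariate: the previous network as is
autoregression <- lagged

# Assumed stability-style covariate: +1 where a tie existed before,
# -1 where it did not, so that both stable ties and stable non-ties count
stability <- ifelse(lagged == 1, 1, -1)
diag(stability) <- 0  # self-ties are not meaningful in this toy example

stability
```

Either matrix has the same dimensions as the network it was derived from, so it could be used like any other dyadic covariate at the following time step.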
# This example illustrates the usefulness of the preprocess function.
set.seed(12345) # fix the seed so that the random networks are reproducible
# first network: nodes a to j present
mat1 <- rbinom(100, 1, 0.1)
mat1 <- matrix(mat1, nrow = 10) # has 10 nodes
rownames(mat1) <- letters[1:10]
colnames(mat1) <- letters[1:10]
# second network: nodes c to n present
mat2 <- rbinom(144, 1, 0.1)
mat2 <- matrix(mat2, nrow = 12) # has 12 nodes
rownames(mat2) <- letters[3:14]
colnames(mat2) <- letters[3:14]
# third network: nodes a and d to k present
mat3 <- rbinom(81, 1, 0.1)
mat3 <- matrix(mat3, nrow = 9) # has 9 nodes
rownames(mat3) <- letters[c(1, 4:11)]
colnames(mat3) <- letters[c(1, 4:11)]
# fourth network: same as second matrix
mat4 <- mat2
networks <- list(mat1, mat2, mat3, mat4)
# btergm without cross-temporal dependencies:
model.1 <- btergm(networks ~ edges + mutual)
summary(model.1)
# When cross-temporal dependencies are specified, the dimensions
# of the matrices do not match. This would cause a problem for btergm:
## Not run:
# btergm(networks[2:4] ~ edges + mutual + edgecov(networks[1:3])) # ERROR!
## End(Not run)
# This is because the first network in the dependent network and the
# first network in the lagged covariate are expected to have the same
# dimensions (and also at the second and third time step, of course).
# Therefore, missing nodes in the covariate (here: {k, l, m, n} at t=1,
# {a} at t=2, and {c, l, m, n} at t=3) must be removed from the
# dependent network at t=2, t=3 and t=4 as well:
dep <- preprocess(networks, lag = TRUE, covariate = FALSE)
# This reduces the size of dep from 12 to 8 at t=2, from 9 to 8 at
# t=3, and from 12 to 8 at t=4, and it removes the first network from
# the list. Moreover, some nodes are present in the lagged covariate
# but not in the dependent network (that is, at the next time step).
# Therefore, node sets {a, b}, {c, l, m, n}, and {a} must be removed
# from the lagged covariate at t=1, t=2, and t=3, respectively, to make
# the dimensions compatible:
lag <- preprocess(networks, lag = TRUE, covariate = TRUE)
# To compare the dimensions of the original versus preprocessed
# dependent networks and covariates, try the following code:
cbind(
  "original_dep" = lapply(networks[2:4], nrow),
  "original_lag" = lapply(networks[1:3], nrow),
  "new_dep" = lapply(dep, nrow),
  "new_lag" = lapply(lag, nrow)
)
# The dependent networks were reduced from 12, 9 and 12 to 8, 8 and
# 8 nodes, and the lagged networks were reduced from 10, 12 and 9 to
# 8, 8 and 8 nodes, respectively. The lagged node sets are now
# compatible. To see this:
cbind(rownames(dep[[1]]), rownames(lag[[1]]))
cbind(rownames(dep[[2]]), rownames(lag[[2]]))
cbind(rownames(dep[[3]]), rownames(lag[[3]]))
# Note, however, that the composition still changes within each list
# across some of the time steps:
cbind(rownames(dep[[1]]), rownames(dep[[2]]), rownames(dep[[3]]))
cbind(rownames(lag[[1]]), rownames(lag[[2]]), rownames(lag[[3]]))
# We can now use the btergm function on the preprocessed lists:
model.2 <- btergm(dep ~ edges + mutual + edgecov(lag))
summary(model.2)
# The model can now be estimated because the current and lagged networks
# have the same node sets at each time step. The disadvantage of this
# approach is that some observations are lost. The advantage, however,
# is that cross-temporal theories can be tested.
# However, since the node sets still differ across time steps, ROC and
# PR curves cannot be estimated. This is true because a simulation from
# nodes {c ... j} cannot be compared to a target network with nodes
# {d ... k}. Therefore, the following command would compare the wrong
# sets of nodes to estimate prediction performance:
## Not run:
# gof.2 <- gof(model.2, classicgof = FALSE, rocprgof = TRUE) # PROBLEM!
## End(Not run)
# To solve this problem, the most obvious approach is to estimate the
# model at earlier time steps and compute the out-of-sample predictive
# performance only for the last network:
model.3 <- btergm(dep[1:2] ~ edges + mutual + edgecov(lag[1:2]))
gof.3 <- gof(model.3, target = dep[[3]], formula = dep[[3]] ~ edges +
    mutual + edgecov(lag[[3]]), classicgof = FALSE, rocprgof = TRUE)
# This models time steps 2 and 3 as a function of the lagged network
# at time steps 1 and 2, uses the resulting coefficients to predict
# the network at time step 4, and compares network 4 to simulations
# based on the coefficients from the previous time steps and the
# lagged network at the third time step. As the matrices within the
# third list item have identical node sets, predictive performance
# can be computed. The resulting ROC and PR curves can be plotted
# as follows:
plot(gof.3, boxplot = FALSE, pr = FALSE, roc.random = TRUE,
    ylab = "TPR/PPV", xlab = "FPR/TPR", roc.main = "ROC and PR")
plot(gof.3, boxplot = FALSE, roc = FALSE, pr.random = TRUE,
    rocpr.add = TRUE)
legend("right", legend = c("ROC", "ROC random graph", "PR",
    "PR random graph"), col = c("#bd0017", "#bd001744", "#5886be",
    "#5886be44"), lty = 1, lwd = 3)
# For another example with real-world data, see vignette("knecht")