network: Define a Network Generator

Description

Define a network generator by providing a function (via the argument netfun) that simulates a network of connected friends for observations i in 1:n. This network then serves as a backbone for defining and simulating from structural equation models for dependent data. In particular, the network allows new nodes to be defined as functions of the previously simulated node values of i's friends, across all observations i.

Let F_i denote the set of friends of observation i (observations in F_i are assumed to be "connected" to i), and refer to the union of these sets F_i as a "network" on n observations, denoted by F. A user-supplied network generating function netfun should simulate such a network F by returning a matrix of n rows, where each row i defines a friend set F_i; that is, row i should be a vector of observations in 1:n that are connected to i (friends of i), with the remainder filled by NAs. Each friend set F_i can contain up to Kmax unique indices j from 1:n, excluding i itself. F_i is also allowed to be empty (row i has only NAs), implying that i has no friends.

The functionality is illustrated in the examples below. For additional information see Details. To learn how to use the node function for defining a node as a function of the friend node values, see Syntax and Network Summary Measures.
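The matrix format expected from netfun can be sketched with a small base R function. The name ring.net and its argument k are hypothetical and are not part of the package. The sketch only illustrates the shape of the return value: n rows, Kmax columns, friend IDs first, NA padding after.

```r
# Hypothetical network generator (not part of simcausal): each observation i
# is connected to the next `k` observations on a ring of size n.
ring.net <- function(n, Kmax, k = 2, ...) {
  stopifnot(k <= Kmax, k < n)
  NetInd_k <- matrix(NA_integer_, nrow = n, ncol = Kmax)
  for (i in seq_len(n)) {
    # IDs of the next k observations, wrapping around at n (never i itself)
    friends_i <- ((i + seq_len(k) - 1L) %% n) + 1L
    NetInd_k[i, seq_len(k)] <- as.integer(friends_i)
  }
  NetInd_k
}

# Row i holds the friend set F_i; the unused third column stays NA
F_mat <- ring.net(n = 5, Kmax = 3, k = 2)
F_mat
```

A function of this shape could then be referenced by its character name in netfun, with k supplied as an additional named argument.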

Usage

network(name, Kmax, netfun, ..., params = list())

Arguments

name

Character name for the network, to be used in future versions

Kmax

Either an R expression that evaluates to an integer constant or an integer specifying the maximum number of friends (connections) any simulated observation can have.

netfun

Character name of the user-defined network generating function. This can be any R function that returns a matrix of friend IDs of dimension c(n, Kmax). The function must accept a named argument n that specifies the total sample size.

...

Named arguments specifying distribution parameters that are accepted by the network sampling function in netfun. These parameters can be R expressions that are themselves formulas of the past node names.

params

A list of additional named parameters to be passed on to the netfun function. The parameters have to be either constants or character strings of R expressions of the past node names.

Value

A list containing the network object(s) of type DAG.net. This list is used when data are simulated with the sim function.

Syntax

The network function call that defines the network of friends can be added to a growing DAG object by using '+' syntax, much like a new node is added to a DAG. Subsequently defined nodes (node function calls) can employ the double square bracket subsetting syntax to reference previously simulated node values for specific friends in F_i simultaneously across all observations i. For example, VarName[[net_indx]] can be used inside the node formula to reference the node VarName values of i's friends in F_i[net_indx], simultaneously across all i in 1:n.

The friend subsetting index net_indx can be any non-negative integer vector that takes values from 0 to Kmax. A value of 0 refers to the VarName node values of observation i itself (this is equivalent to just using VarName in the node formula). A net_indx value of 1 refers to the node VarName values for observations in F_i[1], across all i in 1:n (that is, the value of VarName of i's first friend F_i[1] if the friend exists, and NA otherwise), and so on, up to a net_indx value of Kmax, which references the VarName values of the last friend, as defined by observations in F_i[Kmax] across all i. Note that net_indx can be a vector (e.g., net_indx=c(1:Kmax)), in which case the result of the query VarName[[c(1:Kmax)]] is a matrix of Kmax columns and n rows.

By default, VarName[[j]] evaluates to missing (NA) when observation i does not have a friend under F_i[j] (i.e., in the jth spot of i's friend set). This default behavior can be changed to return 0 instead of NA by passing the additional argument replaceNAw0 = TRUE to the corresponding node function.
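The indexing rules can be emulated in base R to make the shapes concrete. The helper friend.values below is hypothetical. It is a sketch of what VarName[[net_indx]] is described as returning, not the package's internal implementation, and the friend matrix and node values are made up for illustration.

```r
# Assumed friend matrix for n = 4, Kmax = 2 (row i lists i's friends, NA-padded)
NetInd <- matrix(c(2L, 3L,
                   1L, NA,
                   NA, NA,
                   1L, 2L), nrow = 4, byrow = TRUE)
W <- c(10, 20, 30, 40)  # stand-in for previously simulated node values

# Hypothetical helper emulating W[[net_indx]]; index 0 means "i itself"
friend.values <- function(x, NetInd, net_indx, replaceNAw0 = FALSE) {
  out <- sapply(net_indx, function(j) {
    if (j == 0) x else x[NetInd[, j]]
  })
  out <- matrix(out, nrow = length(x))
  if (replaceNAw0) out[is.na(out)] <- 0
  out
}

friend.values(W, NetInd, 1)                        # first friend's W, NA if none
friend.values(W, NetInd, 0:2)                      # columns: own W, friend 1, friend 2
friend.values(W, NetInd, 1:2, replaceNAw0 = TRUE)  # missing friends become 0
```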

Network Summary Measures

One can also define summary measures of the network covariates by specifying a node formula that applies an R function to the result of VarName[[net_indx]]. The rules for defining and applying such summary measures are identical to the rules for defining summary measures for time-varying nodes VarName[t_indx]. For example, use sum(VarName[[net_indx]]) to define a summary measure as a sum of VarName values of friends in F_i[net_indx], across all observations i in 1:n. Similarly, use mean(VarName[[net_indx]]) to define a summary measure as a mean of VarName values of friends in F_i[net_indx], across all i. For more details on defining such summary functions see the simcausal vignette.
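As a rough base R analogue, assuming that a summary such as sum(VarName[[net_indx]]) is applied per observation (that is, row-wise over the n-by-Kmax matrix of friend values), the effect of the NA default versus replaceNAw0 = TRUE can be sketched as follows. The matrix netW holds made-up values.

```r
# n x Kmax matrix standing in for the result of W[[1:Kmax]] (assumed values)
netW <- matrix(c(20, 30,
                 10, NA,
                 NA, NA,
                 10, 20), nrow = 4, byrow = TRUE)

# Row-wise analogue of sum(W[[1:Kmax]]) with the default NA behavior:
# the result is NA for any i with fewer than Kmax friends
sum_NA <- rowSums(netW)

# Analogue with replaceNAw0 = TRUE: missing friends contribute 0
netW0 <- netW
netW0[is.na(netW0)] <- 0
sum_0 <- rowSums(netW0)

# Mean over existing friends only (NaN when i has no friends),
# versus the mean of the zero-filled matrix, which divides by Kmax
mean_obs  <- rowMeans(netW, na.rm = TRUE)
mean_zero <- rowMeans(netW0)
```

In this sketch the zero-filled mean differs from the mean over existing friends whenever a friend slot is empty, which is worth keeping in mind when choosing between the two behaviors.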

Details

Without the network of friends, the DAG objects constructed by calling the node function can only specify structural equation models for independent and identically distributed data. That is, if no network is specified, a node for each observation i can be defined conditionally only on i's own previously simulated node values. As a result, any two observations simulated under such a data-generating model are always independent and identically distributed. Defining a network F allows one to define a new structural equation model where a node for each observation i can depend not only on its own simulated past, but also on the previously simulated node values of i's friends (F_i). This is accomplished by allowing the data-generating distribution for each observation i's node to be defined conditionally on the past node values of i's friends (observations in F_i). The network of friends can be used in subsequent calls to the node function, where new nodes (random variables) can depend on the node values of i's friends. During simulation it is assumed that all observations in F_i can simultaneously influence i.
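The dependence described above can be illustrated outside the package with a few lines of base R. This is a sketch under assumed values (the friend matrix, coefficients, and variable names are made up), not a reproduction of how the package simulates data.

```r
set.seed(1)
n <- 6
# Assumed friend matrix (Kmax = 2): observations 1 and 2 share friend 3
NetInd <- matrix(c(3L, NA,
                   3L, 4L,
                   NA, NA,
                   5L, NA,
                   6L, 1L,
                   NA, NA), nrow = n, byrow = TRUE)

# Baseline covariate, simulated independently for each observation
W <- rbinom(n, 1, 0.5)

# Friends' W values as an n x Kmax matrix; empty friend slots count as 0
netW <- matrix(W[NetInd], nrow = n)
netW[is.na(netW)] <- 0

# A[i] depends on W[i] and on the sum of W over i's friends, so A[1] and A[2]
# are both influenced by W[3] and are no longer independent of each other
pA <- plogis(-1 + 0.5 * W + 1.5 * rowSums(netW))
A <- rbinom(n, 1, pA)
```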

Note that the current version of the package does not allow combining time-varying node indexing Var[t] and network node indexing Var[[net_indx]] for the same data generating distribution.

Each argument for the input network can be an evaluable R expression. All formulas are captured by delayed evaluation and are evaluated during the simulation. Formulas can refer to standard or user-specified R functions, which must only be applied to the values of previously defined nodes (i.e., nodes that were defined prior to the network() function call).

To force the immediate evaluation of any variable inside these expressions, wrap the variable in the .() function; see Example 2 for .(t_end) in node.
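The difference between delayed and forced evaluation can be illustrated with base R's quote() and bquote(), where bquote() also uses .() to substitute a variable's current value. This is only an analogy: whether the package implements its .() the same way is an assumption, and the variable names are made up.

```r
scale <- 2
delayed <- quote(W1 * scale)       # `scale` is looked up only when evaluated
forced  <- bquote(W1 * .(scale))   # `scale` is replaced by its current value, 2

scale <- 100                       # value changes before "simulation" time
dat <- data.frame(W1 = c(1, 2, 3))

eval(delayed, dat)  # uses scale = 100
eval(forced, dat)   # uses scale = 2
```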

Examples

#--------------------------------------------------------------------------------------------------
# EXAMPLE 1. USING igraph R PACKAGE TO SIMULATE NETWORKS
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
# Example of a network sampler, to be provided as the "netfun" argument to network(, netfun=);
# Generates a random graph according to the G(n,m) Erdos-Renyi model using the igraph package;
# Returns an (n,Kmax) matrix of net IDs (friends) by row;
# Row i contains the IDs (row numbers) of i's friends;
# i's friends are assumed connected to i and can influence i in equations defined by node();
# When i has fewer than Kmax friends, the remaining entries of row i are filled with NAs;
# Argument m_pn (> 0): the total number of edges in the network,
# expressed as a fraction (or multiplier) of n (sample size)
#--------------------------------------------------------------------------------------------------
generate.igraph.ER <- function(n, m_pn, Kmax, ...) {
  m <- as.integer(m_pn[1]*n)
  if (n<=10) m <- 20
  igraph.ER <- igraph::sample_gnm(n = n, m = m, directed = TRUE)
  sparse_AdjMat <- igraph.to.sparseAdjMat(igraph.ER)
  NetInd_out <- sparseAdjMat.to.NetInd(sparse_AdjMat)
  if (Kmax < NetInd_out$Kmax) message("Kmax changed, new Kmax = ", NetInd_out$Kmax)
  return(NetInd_out$NetInd_k)
}

Kmax <- 5
D <- DAG.empty()

# Sample ER model network using igraph::sample_gnm with m_pn argument:
D <- D + network("NetInd_k", Kmax = Kmax, netfun = "generate.igraph.ER", m_pn = 50)

# W1 - categorical (6 categories, 0-5):
nW1cat <- 6
rbinom2 <- function(n, size, prob) rbinom(n, size = size, prob = prob[1,])
D <- D + node("W1", distr = "rbinom2", size = (nW1cat-1), prob = c(0.4, 0.5, 0.7, 0.4))

# W2 - binary infection status at t=0, positively correlated with W1:
prob_W2 <- seq.int(0.45, 0.8, length.out = nW1cat)
D <- D + node("W2", distr = "rbern", asis.params = list(prob = "prob_W2[W1+1]"))

# W3 - binary confounder:
prob_W3 <- 0.6
D <- D + node("W3", distr = "rbern", prob = prob_W3)

# A[i] is a function of W1[i] and the totals of i's friends' values of W1, W2 and W3:
D <- D + node("A", distr = "rbern",
              prob = plogis(2 + -0.5 * W1 +
                            -0.1 * sum(W1[[1:Kmax]]) +
                            -0.4 * sum(W2[[1:Kmax]]) +
                            -0.7 * sum(W3[[1:Kmax]])),
              replaceNAw0 = TRUE)

# Y[i] is a function of netW3 (friends of i W3 values) and the total N of i's friends 
# who are infected AND untreated:
D <- D + node("Y", distr = "rbern",
              prob = plogis(-1 + 2 * sum(W2[[1:Kmax]] * (1 - A[[1:Kmax]])) +
                            -2 * sum(W3[[1:Kmax]])
                            ),
              replaceNAw0 = TRUE)

# Can add N untreated friends to the above outcome Y equation: sum(1 - A[[1:Kmax]]):
D <- D + node("Y", distr = "rbern",
              prob = plogis(-1 + 1.5 * sum(W2[[1:Kmax]] * (1 - A[[1:Kmax]])) +
                            -2 * sum(W3[[1:Kmax]]) +
                            0.25 * sum(1 - A[[1:Kmax]])
                            ),
              replaceNAw0 = TRUE)

# Can add N infected friends at baseline to the above outcome Y equation: sum(W2[[1:Kmax]]):
D <- D + node("Y", distr = "rbern",
              prob = plogis(-1 + 1 * sum(W2[[1:Kmax]] * (1 - A[[1:Kmax]])) +
                            -2 * sum(W3[[1:Kmax]]) +
                            0.25 * sum(1 - A[[1:Kmax]]) +
                            0.25 * sum(W2[[1:Kmax]])
                            ),
              replaceNAw0 = TRUE)

Dset <- set.DAG(D)

# Simulating data from the above SEM:
datnet <- sim(Dset, n = 1000, rndseed = 543)
head(datnet, 100)

# Extracting the network matrix from the simulated data:
attributes(datnet)$netind_cl
head(attributes(datnet)$netind_cl$NetInd)

#--------------------------------------------------------------------------------------------------
# EXAMPLE 2. USING CUSTOM NETWORK GENERATING FUNCTION
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
# Example of a user-defined network sampler function
# Arguments Kmax, bslVar[i] (W1) & nF are evaluated in the environment of the simulated data, then
# passed to the generateNET() function
# - unif.F: when TRUE, sample the friend set from a discrete uniform distr (no weighting by bslVar)
# - bslVar[i]: used for constructing weights for the probability of selecting i as
#   someone else's friend (weighted sampling); when missing, sampling is uniform
# - nF[i]: total number of friends that need to be sampled for observation i
#--------------------------------------------------------------------------------------------------
generateNET <- function(n, Kmax, unif.F = FALSE, bslVar, nF, ...) {
  nW1cat <- 6
  W1cat_arr <- c(1:nW1cat)/2
  prob_F <- plogis(-4.5 + 2.5*W1cat_arr) / sum(plogis(-4.5 + 2.5*W1cat_arr))
  NetInd_k <- matrix(NA_integer_, nrow = n, ncol = Kmax)
  nFriendTot <- rep(0L, n)

  for (index in (1:n)) {
    FriendSampSet <- setdiff( c(1:n), index)
    nFriendSamp <- max(nF[index] - nFriendTot[index], 0L)
    if (nFriendSamp>0) {
      if (length(FriendSampSet)==1)  {
        friends_i <- FriendSampSet
      } else {
        if (missing(bslVar) || unif.F[1]) {
           # Sample with uniform prob, no weighting by bslVar:
          friends_i <- sort(sample(FriendSampSet, size = nFriendSamp))
        } else {
          # Sample from the possible friend set, with the probability of selecting each j
          # based on the categorical bslVar[j];
          # bslVar[i] thus affects the probability of i being selected as someone's friend
          friends_i <- sort(sample(FriendSampSet, size = nFriendSamp,
                            prob = prob_F[bslVar[FriendSampSet] + 1]))
        }
      }
      NetInd_k[index, ] <- c(as.integer(friends_i), rep_len(NA_integer_, Kmax - length(friends_i)))
      nFriendTot[index] <- nFriendTot[index] + nFriendSamp
    }
  }
  return(NetInd_k)
}

Kmax <- 6
D <- DAG.empty()

# W1 - categorical or continuous confounder (6 categories, 0-5):
nW1cat <- 6
rbinom2 <- function(n, size, prob) rbinom(n, size = size, prob = prob[1,])
D <- D + node("W1", distr = "rbinom2", size = (nW1cat-1), prob = c(0.4, 0.5, 0.7, 0.4))

# W2 - binary infection status at t=0, positively correlated with W1:
prob_W2 <- seq.int(0.45, 0.8, length.out = nW1cat)
D <- D + node("W2", distr = "rbern",
            prob = (W1==0)*.(prob_W2[1]) + (W1==1)*.(prob_W2[2]) + (W1==2)*.(prob_W2[3]) +
                   (W1==3)*.(prob_W2[4]) + (W1==4)*.(prob_W2[5]) + (W1==5)*.(prob_W2[6]))

# W3 - binary confounder:
prob_W3 <- 0.6
D <- D + node("W3", distr = "rbern", prob = prob_W3)

# nF: total number of friends for each i (nF[i]), each nF[i] is influenced by categorical W1 
normprob <- function(x) x / sum(x)
k_arr <- c(1:Kmax)
pN_0 <- 0.02
prob_Ni_W1_0 <- normprob(c(pN_0, plogis(-3 - 0 - k_arr / 2)))    # W1=0 probabilities of |F_i|
prob_Ni_W1_1 <- normprob(c(pN_0, plogis(-1.5 - 0 - k_arr / 3)))  # W1=1 probabilities of |F_i|
prob_Ni_W1_2 <- normprob(c(pN_0, pnorm(-2*abs(2 - k_arr) / 5)))  # W1=2 probabilities of |F_i|
prob_Ni_W1_3 <- normprob(c(pN_0, pnorm(-2*abs(3 - k_arr) / 5)))  # W1=3 probabilities of |F_i|
prob_Ni_W1_4 <- normprob(c(pN_0, plogis(-4 + 2 * (k_arr - 2))))  # W1=4 probabilities of |F_i|
prob_Ni_W1_5 <- normprob(c(pN_0, plogis(-4 + 2 * (k_arr - 3))))  # W1=5 probabilities of |F_i|

D <- D + node("nF.plus1", distr = "rcategor.int",
              probs = (W1 == 0)*.(prob_Ni_W1_0) + (W1 == 1)*.(prob_Ni_W1_1) +
                      (W1 == 2)*.(prob_Ni_W1_2) + (W1 == 3)*.(prob_Ni_W1_3) +
                      (W1 == 4)*.(prob_Ni_W1_4) + (W1 == 5)*.(prob_Ni_W1_5))

# Adding the network generator that depends on nF and categorical W1:
D <- D + network("NetInd_k", Kmax = Kmax, netfun = "generateNET", bslVar = W1, nF = nF.plus1 - 1)

# Define A[i] as a function of W1[i] and the total sums of i's friends' values of W1, W2 and W3:
D <- D + node("A", distr = "rbern",
              prob = plogis(2 + -0.5 * W1 +
                            -0.1 * sum(W1[[1:Kmax]]) +
                            -0.4 * sum(W2[[1:Kmax]]) +
                            -0.7 * sum(W3[[1:Kmax]])),
              replaceNAw0 = TRUE)

# Y[i] is a function of the total N of i's friends who are infected AND untreated
# and of the friends' W3 values
D <- D + node("pYRisk", distr = "rconst",
              const = plogis(-1 + 2 * sum(W2[[1:Kmax]] * (1 - A[[1:Kmax]])) +
                              -1.5 * sum(W3[[1:Kmax]])),
              replaceNAw0 = TRUE)

D <- D + node("Y", distr = "rbern", prob = pYRisk)
Dset <- set.DAG(D)

# Simulating data from the above SEM:
datnet <- sim(Dset, n = 1000, rndseed = 543)