What does it mean to add a network to a DAG?
When using only node or node_td to define a DAG, all observations are usually generated independently of each other (unless a custom node function explicitly does otherwise). This reflects the classic i.i.d. assumption that is made in many statistical models. For some data generation processes, however, this assumption is too restrictive. The spread of an infectious disease is a classic example, because whether an individual becomes infected depends on the infection status of the individuals they are in contact with.
The network() function allows users to relax this assumption by making it possible to define one or more networks that can then be added to DAG objects using the + syntax. These networks should contain a single vertex for each observation that should be generated, placing each row of the dataset into one place in the network. Through the use of the net function it is then possible to define new nodes as a function of the neighbors of an observation, where the neighbors of a vertex are defined as all other vertices that are directly connected to it. For example, one could use the mean age of an observation's neighbors in a regression model, or use the number of infected neighbors to model the probability of infection. By combining this network-simulation approach with the already extensive capabilities of DAG-based simulations, almost any data generation process (DGP) can be modelled. This approach is described more rigorously in the excellent paper by Sofrygin et al. (2017).
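The two neighbor-based quantities mentioned above can be illustrated without the package. The following is a minimal base R sketch, not the package's implementation: the adjacency matrix adj, the toy vectors age and infected and all result names are hypothetical and chosen only for illustration. It computes, for each individual, what expressions such as a neighbor mean or a neighbor sum conceptually correspond to.

```r
# Hypothetical undirected network of 5 individuals, stored as an adjacency
# matrix: adj[i, j] == 1 means that i and j are directly connected.
adj <- matrix(0, nrow = 5, ncol = 5)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1  # mirror the edges, since the network is undirected

# Toy data: one value per individual (one row of the dataset per vertex)
age <- c(30, 40, 50, 60, 70)
infected <- c(1, 0, 0, 1, 0)

# Mean age of each individual's neighbors:
# sum of the neighbors' ages divided by the number of neighbors
mean_age_neighbors <- as.vector(adj %*% age) / rowSums(adj)

# Number of infected neighbors of each individual
n_infected_neighbors <- as.vector(adj %*% infected)

print(data.frame(id = 1:5, mean_age_neighbors, n_infected_neighbors))
```

Individual 3, for example, is connected to individuals 1, 2 and 4, so its neighbor mean age is (30 + 40 + 60) / 3 and it has two infected neighbors. Values like these could then enter a regression model for a downstream node.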
Supported network types:
Users may add any number of networks to a DAG object, making it possible to embed individuals in multiple distinct networks at the same time. These networks can then be used simultaneously to define one or more (possibly time-varying) nodes, using multiple net function calls in the respective formula arguments. It is also possible to define dynamic networks that change over time, possibly as a function of the generated data, the simulation time or previous states of the network. Examples are given below and in the associated vignette.
The package directly supports undirected and directed, unweighted and weighted networks. It also supports different definitions of what the neighbors of an observation are. Note, however, that only networks which include exactly one vertex per observation are supported.
Weighted Networks:
It is possible to supply weighted networks to network(). The weights are then stored and made available to the user inside the net function through the internal ..weight.. variable. For example, if a weighted network was supplied, the following would be valid syntax: net(weighted.mean(A, ..weight..)) (assuming that A is a previously defined variable). Note that ..weight.. must be used explicitly, otherwise the weights are ignored.
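To make the difference between using and ignoring the weights concrete, here is a small base R sketch. It does not call the package; the weighted adjacency matrix w, the variable A and the helper neighbor_summary() are hypothetical names used for illustration only. It contrasts a weighted neighbor mean with the unweighted mean that results when the weights are not used.

```r
# Hypothetical weighted, undirected network of 4 individuals:
# w[i, j] > 0 is the weight of the edge between i and j, 0 means no edge.
w <- matrix(0, nrow = 4, ncol = 4)
edges <- rbind(c(1, 2), c(1, 3), c(2, 4), c(3, 4))
weights <- c(1, 3, 2, 2)
w[edges] <- weights
w[edges[, 2:1]] <- weights  # same weight in both directions

# A previously generated variable, one value per individual
A <- c(10, 20, 40, 80)

# Hypothetical helper: apply a summary function to the neighbors' values
# of A, passing the corresponding edge weights as second argument
neighbor_summary <- function(i, fun) {
  nb <- which(w[i, ] > 0)
  fun(A[nb], w[i, nb])
}

# Weighted mean of the neighbors' values, using the edge weights
weighted_mean_A <- sapply(1:4, neighbor_summary, fun = weighted.mean)

# Plain mean of the neighbors' values, ignoring the edge weights
unweighted_mean_A <- sapply(1:4, neighbor_summary, fun = function(x, wt) mean(x))

print(data.frame(id = 1:4, weighted_mean_A, unweighted_mean_A))
```

For individual 1, the neighbors are 2 (weight 1) and 3 (weight 3), so the weighted mean is (20 * 1 + 40 * 3) / 4 = 35, whereas the unweighted mean is 30.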
Directed Networks:
Supplying directed networks is also possible. If this is done, users usually need to specify the mode argument of the net function when defining the formula arguments. This argument allows users to define different kinds of neighborhoods for each observation, based on the direction of the edges.
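The effect of edge direction on who counts as a neighbor can be sketched in base R as follows. This is an illustration only and does not use the package: the adjacency matrix adj, the variable infected and the labels "out", "in" and "all" are assumptions made for this example; the values actually accepted by the mode argument are listed on the net documentation page.

```r
# Hypothetical directed network of 4 individuals:
# adj[i, j] == 1 means that there is an edge pointing from i to j.
adj <- matrix(0, nrow = 4, ncol = 4)
edges <- rbind(c(1, 2), c(1, 3), c(3, 2), c(4, 1))
adj[edges] <- 1

infected <- c(1, 0, 1, 0)

# "out": neighbors are the vertices that i points to
n_inf_out <- as.vector(adj %*% infected)

# "in": neighbors are the vertices that point to i
n_inf_in <- as.vector(t(adj) %*% infected)

# "all": neighbors are all vertices connected to i, regardless of direction
adj_all <- (adj + t(adj) > 0) * 1
n_inf_all <- as.vector(adj_all %*% infected)

print(data.frame(id = 1:4, n_inf_out, n_inf_in, n_inf_all))
```

Individual 2 has no outgoing edges but two incoming ones (from 1 and 3), so its number of infected neighbors is 0 under the outgoing definition and 2 under the incoming one. This is why the neighborhood definition usually has to be chosen deliberately for directed networks.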
Order of Generation:
Generally, all networks are created in the order in which they were added to the DAG, unless sort_dag or tx_nodes_order are changed in sim_from_dag or sim_discrete_time respectively. The only exception is that all networks created using the network() function are created after all other root nodes have already been generated.
Computational considerations:
Including net() terms in a node might significantly increase the amount of RAM used and the required computation time, especially with very large networks and / or large values of n_sim and / or max_t (the latter is only relevant in discrete-time simulations using sim_discrete_time). The reason for this is that each time a node is generated or updated over time, the mapping of individuals to their neighbors' values plus the subsequent aggregation has to be performed, which requires merge() calls and similar operations. Usually this should not be a problem, but it might be for some large discrete-time simulations. If the same net call is used in multiple nodes, it can be beneficial to put it into an extra node call and save the result, to avoid re-calculating the same thing over and over again (see examples).
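The cost described above, and the benefit of computing a neighbor aggregate only once, can be sketched in base R. This is a rough illustration under stated assumptions and not the package's internal code: the edge list edges, the table values and the two linear predictors are hypothetical.

```r
# Hypothetical edge list in "long" format: one row per (individual, neighbor)
edges <- data.frame(
  id       = c(1, 1, 2, 2, 3, 3, 3, 4),
  neighbor = c(2, 3, 1, 3, 1, 2, 4, 3)
)

# Current values of a variable A, one row per individual
values <- data.frame(neighbor = 1:4, A = c(2, 4, 6, 8))

# Step 1: map each individual to its neighbors' values (the expensive merge)
mapped <- merge(edges, values, by = "neighbor")

# Step 2: aggregate the neighbors' values per individual
mean_A <- aggregate(A ~ id, data = mapped, FUN = mean)
mean_A <- mean_A[order(mean_A$id), ]

# Steps 1 and 2 are done once; the stored result is reused by two
# downstream quantities instead of repeating the merge and aggregation
lin_pred_1 <- -1 + 0.5 * mean_A$A
lin_pred_2 <- 2 - 0.1 * mean_A$A

print(cbind(mean_A, lin_pred_1, lin_pred_2))
```

In a simulation, steps 1 and 2 would be repeated for every node that contains the aggregate and, in discrete-time simulations, at every time step, which is why storing the aggregate in its own node can pay off.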
Further information:
For a theoretical treatment, please consult the paper by Sofrygin et al. (2017), who also describe their slightly different implementation of this method in the simcausal package. For more information on how to specify network-based dependencies in a DAG (using simDAG) after adding a network, please consult the net documentation page or the associated vignette.