simcausal (version 0.5.0)

node: Create Node Object(s)

Description

Define a single DAG node and its distribution or define many repeated-measure/time-varying nodes by using argument t. The node distribution is allowed to vary as a function of time (t). Conditionaing on past nodes is accomplished by using the syntactic sugar, such as, NodeName[t]. After all the nodes have been added to the DAG, call set.DAG, a DAG object constructor, and add.action, an action (intervention) constructor.

Usage

node(name, t, distr, EFU, order, ..., params = list(), asis.params = list())

Arguments

name
Character node name or a vector of names when specifying a multivariate node. For time-dependent nodes the names will be automatically expanded to a scheme "name_t" for each t provided specified.
t
Node time-point(s). Allows specification of several time-points when t is a vector of positive integers, in which case the output will consist of a named list of length(t) nodes, corresponding to each value in t.
distr
Character name of the node distribution, can be a standard distribution R function, s.a. rnorm, rbinom, runif or user defined. The function must accept a named argument "n" to specify the total sample size. Distributional parameters (arguments) must be pa
EFU
End-of-Follow Up flag for designating a survival/censoring type node, only applies to Bernoulli nodes. When EFU=TRUE this node becomes an indicator for the end of follow-up event (censoring, end of study, death, etc). When simulated variable
order
An optional integer parameter specifying the order in which these nodes will be sampled. The value of order has to start at 1 and be unique for each new node, can be specified as a range / vector and has to be of the same length as the argument t
...
Named arguments specifying distribution parameters that are accepted by the distr function. The parameters can be R expressions that are themselves formulas of the past node names.
params
A list of additional named parameters to be passed on to the distr function. The parameters have to be either constants or character strings of R expressions of the past node names.
asis.params
(ADVANCED USE) A list of additional named distributional parameters that will be evaluated "as is", inside the currently simulated data.frame + the calling environment, without any modifications to the R expression strings inside the asis.params

Value

  • A list containing node object(s) (expanded to several nodes if t is an integer vector of length > 1)

Details

The combination of name and t must uniquely identify each node in the DAG. Use argument t to identify measurements of the same attribute (e.g. 'A1c') at various time points. The collection of all unique t values, across all nodes, should consist of non-negative integers (i.e., starting at 0).

The optional order argument can be specified, used for determining the sampling order of each node. When order not specified, it is automatically inferred based on the actual order in which the nodes were added to the DAG (earlier added nodes get lower order value and are sampled first)

All node calls that share the same generic name name must also share the same EFU value (if any is specified in at least one of them). A value of TRUE for the EFU indicates that if a simulated value for a measurement of the attribute represented by node is 1 then all the following nodes with that measurement (in terms of higher t values) in the DAG will be unobserved (i.e., their simulated value will be set to NA).

Node formulas (parameters of the distribution)

Each formula of an input node is an evaluable R expression. All formulas are delayed in the evaluation until the simulation time. Formulas can refer to standard or user-specified R functions that must only apply to the values of parent nodes, i.e. a subset of the node(s) with an order value strictly lower than that of the node characterized by the formula. Formulas must reference the parent nodes with unique name identifiers, employing the square bracket vector subsetting name[t] for referencing a parent node at a particular time point t (if any time-points were specified). The square bracket notation is used to index a generic name with the relevant time point as illustrated in the examples. When an input node is used to define several nodes (i.e., several measurement of the same attribute, t=0:5), the formula(s) specified in that node can apply to each node indexed by a given time point denoted by t. This generic expression t can then be referenced within a formula to simultaneously identify a different set of parent nodes for each time point as illustrated below. Note that the parents of each node represented by a given node object are implicitly defined by the nodes referenced in formulas of that node call.

(ADVANCED USE) All distribution parameters (e.g., mean, probs, sd, unifmin and unifmax) are interpreted by delayed evaluation, to force immediate evaluation of a variable Var wrap the variable inside .() function, as in .(Var). See Example 2 for a working example that evaluates .(t_end) variable.

Multivariate random variables (multivariate nodes)

Starting from v.0.5, a single node call can be used for defining a multivariate (and possibly correlated) random vector. To define a random vector that has more than 1 dimension, use the argument name to specify a vector with names for each dimension, e.g.,

node(c("X1","X2"), distr = "rmvnorm", mean = c(0,1)), sigma = matrix(c(1,0.75,0.75,1), ncol=2)

will define a bi-variate (correlated) normally distributed node, the simulated data set will contain this bi-variately distributed random variable in columns "X1" and "X2". Note that the multivariate sampling distribution function (such as function rmvnorm from the package mvtnorm) must return a matrix of n rows (number of observations) and length(name) columns (dimensionality). See additional examples below.

Note that one can also define time-varying multivariate nodes, e.g.,

node(c("X1","X2"), t=0:5, distr = "rmvnorm", mean = c(0,1)).

However, time-varying indexing of parent nodes is not allowed in any multivariate node formulas, e.g., using expressions like

node(c("X1","X2"), t=6, distr = "rmvnorm", mean = c(X1[t-1],1)),

is not possible and will not return any sensible results.

Examples

Run this code
#---------------------------------------------------------------------------------------
# EXAMPLE 1A: Define some Bernoulli nodes, survival outcome Y and put it together in a 
# DAG object
#---------------------------------------------------------------------------------------
W1 <- node(name = "W1", distr = "rbern", 
	prob = plogis(-0.5), order = 1)
W2 <- node(name = "W2", distr = "rbern", 
	prob = plogis(-0.5 + 0.5 * W1), order = 2)
A <- node(name = "A", distr = "rbern", 
	prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2), order = 3)
Y <- node(name = "Y", distr = "rbern", 
	prob = plogis(-0.1 + 1.2 * A + 0.3 * W1 + 0.3 * W2), order = 4)
D1A <- set.DAG(c(W1,W2,A,Y))

#---------------------------------------------------------------------------------------
# EXAMPLE 1B: Same as 1A using +node interface and no order argument
#---------------------------------------------------------------------------------------
D1B <- DAG.empty()
D1B <- D1B + node(name = "W1", distr = "rbern", 
	prob = plogis(-0.5))
D1B <- D1B + node(name = "W2", distr = "rbern", 
	prob = plogis(-0.5 + 0.5 * W1))
D1B <- D1B + node(name = "A", distr = "rbern", 
	prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2))
D1B <- D1B + node(name = "Y", distr = "rbern", 
	prob = plogis(-0.1 + 1.2 * A + 0.3 * W1 + 0.3 * W2))
D1B <- set.DAG(D1B)

#---------------------------------------------------------------------------------------
# EXAMPLE 1C: Same as 1A and 1B using add.nodes interface and no order argument
#---------------------------------------------------------------------------------------
D1C <- DAG.empty()
D1C <- add.nodes(D1C, node(name = "W1", distr = "rbern", 
	prob = plogis(-0.5)))
D1C <- add.nodes(D1C, node(name = "W2", distr = "rbern", 
	prob = plogis(-0.5 + 0.5 * W1)))
D1C <- add.nodes(D1C, node(name = "A", distr = "rbern", 
	prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2)))
D1C <- add.nodes(D1C, node(name = "Y", distr = "rbern", 
	prob = plogis(-0.1 + 1.2 * A + 0.3 * W1 + 0.3 * W2)))
D1C <- set.DAG(D1C)

#---------------------------------------------------------------------------------------
# EXAMPLE 1D: Add a uniformly distributed node and redefine outcome Y as categorical
#---------------------------------------------------------------------------------------
D_unif <- DAG.empty()
D_unif <- D_unif + 
node("W1", distr = "rbern", prob = plogis(-0.5)) + 
node("W2", distr = "rbern", prob = plogis(-0.5 + 0.5 * W1)) + 
node("W3", distr = "runif", min = plogis(-0.5 + 0.7 * W1 + 0.3 * W2), max = 10) + 
node("An", distr = "rbern", prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2 - 0.2 * sin(W3)))
# Categorical syntax 1 (probabilities as values):
D_cat_1 <- D_unif + node("Y", distr = "rcategor", probs = {0.3; 0.4})
D_cat_1 <- set.DAG(D_cat_1)
# Categorical syntax 2 (probabilities as formulas):
D_cat_2 <- D_unif + 
node("Y", distr = "rcategor", 
	probs={plogis(-0.1 + 1.2 * An + 0.3 * W1 + 0.3 * W2 + 0.2 * cos(W3)); 
			plogis(-0.5 + 0.7 * W1)})
D_cat_2 <- set.DAG(D_cat_2)

#---------------------------------------------------------------------------------------
# EXAMPLE 2A: Define Bernoulli nodes using R rbinom() function, defining prob argument
# for L2 as a function of node L1
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D + 
node("L1", t = 0, distr = "rbinom", 
	prob = 0.05, size = 1) + 
node("L2", t = 0, distr = "rbinom", 
	prob = ifelse(L1[0] == 1, 0.5, 0.1), size = 1)
D <- set.DAG(D)

#---------------------------------------------------------------------------------------
# EXAMPLE 2B: Equivalent to 2A, passing argument size to rbinom inside a named list
# params
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D + 
node("L1", t = 0, distr = "rbinom", 
	prob = 0.05, params = list(size = 1)) + 
node("L2", t = 0, distr = "rbinom", 
	prob = ifelse(L1[0] == 1,0.5,0.1), params = list(size = 1))
D <- set.DAG(D)

#---------------------------------------------------------------------------------------
# EXAMPLE 2C: Equivalent to 2A and 2B, define Bernoulli nodes using a wrapper "rbern"
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D + 
node("L1", t = 0, distr = "rbern", prob = 0.05) + 
node("L2", t = 0, distr = "rbern", prob = ifelse(L1[0] == 1, 0.5, 0.1))
D <- set.DAG(D)

#---------------------------------------------------------------------------------------
# EXAMPLE 3: Define node with normal distribution using rnorm() R function
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D + node("L2", t = 0, distr = "rnorm", mean = 10, sd = 5)
D <- set.DAG(D)

#---------------------------------------------------------------------------------------
# EXAMPLE 4: Define 34 Bernoulli nodes, or 2 Bernoulli nodes over 17 time points,
# prob argument contains .() expression that is immediately evaluated in the calling 
# environment (.(t_end) will evaluate to 16)
#---------------------------------------------------------------------------------------
t_end <- 16
D <- DAG.empty()
D <- D + 
node("L2", t = 0:t_end, distr = "rbinom", 
	prob = ifelse(t == .(t_end), 0.5, 0.1), size = 1) + 
node("L1", t = 0:t_end, distr = "rbinom", 
	prob = ifelse(L2[0] == 1, 0.5, 0.1), size = 1)
D <- set.DAG(D)

#---------------------------------------------------------------------------------------
# EXAMPLE 5: Defining new distribution function 'rbern', defining and passing a custom 
# vectorized node function 'customfun'
#---------------------------------------------------------------------------------------
rbern <- function(n, prob) { # defining a bernoulli wrapper based on R rbinom function
  rbinom(n = n, prob = prob, size = 1)
}
customfun <- function(arg, lambda) {
  res <- ifelse(arg == 1, lambda, 0.1)
  res
}
D <- DAG.empty()
D <- D +
node("W1", distr = "rbern", prob = 0.05) +
node("W2", distr = "rbern", prob = customfun(W1, 0.5)) +
node("W3", distr = "rbern", prob = ifelse(W1 == 1, 0.5, 0.1))
D1d <- set.DAG(D, vecfun = c("customfun"))
sim1d <- simobs(D1d, n = 200, rndseed = 1)


#---------------------------------------------------------------------------------------
# EXAMPLE 6: Defining latent variables I and U.Y (will be hidden from simulated data)
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D +
  node("I",
    distr = "rcategor.int",
    probs = c(0.1, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1)) +
  node("W1",
    distr = "rnorm",
    mean = ifelse(I == 1, 0, ifelse(I == 2, 3, 10)) + 0.6 * I,
    sd = 1) +
  node("W2",
    distr = "runif",
    min = 0.025*I, max = 0.7*I) +
  node("W3",
    distr = "rbern",
    prob = plogis(-0.5 + 0.7*W1 + 0.3*W2 - 0.2*I)) +
  node("A",
    distr = "rbern",
    prob = plogis(+4.2 - 0.5*W1 + 0.2*W2/2 + 0.2*W3)) +
	node("U.Y", distr = "rnorm", mean = 0, sd = 1) +
  node("Y",
    distr = "rconst",
    const = -0.5 + 1.2*A + 0.1*W1 + 0.3*W2 + 0.2*W3 + 0.2*I + U.Y)
Dset1 <- set.DAG(D, latent.v = c("I", "U.Y"))
simobs(Dset1, n = 200, rndseed = 1)

#---------------------------------------------------------------------------------------
# EXAMPLE 7: Multivariate random variables
#---------------------------------------------------------------------------------------
require("mvtnorm")
D <- DAG.empty()

# 3 dimensional normal (uncorrelated), using rmvnorm function from rmvnorm package:
D <- D +
  node(c("X1","X2","X3"), distr = "rmvnorm", mean = c(0,1,2))

# Bivariate normal (correlation coef 0.75):
D <- D +
  node(c("Y1","Y2"), distr = "rmvnorm",
    mean = c(0,1),
    sigma = matrix(c(1,0.75,0.75,1), ncol=2))

# Can use any component of such multivariate nodes when defining future nodes:
D <- D + node("A", distr = "rconst", const = 1 - X1)
Dset1 <- set.DAG(D, verbose = TRUE)
dat1 <- sim(Dset1, n = 200)

# Bivariate uniform copula using copula package (correlation coef 0.75):
require("copula")
D <- DAG.empty()
D <- D +
  node(c("Y1","Y2"), distr = "rCopula",
    copula = normalCopula(0.75, dim = 2))
Dset2 <- set.DAG(D)
dat2 <- sim(Dset2, n = 200)

# Bivariate binomial from previous copula, with same correlation:
vecfun.add("qbinom")
D <- D + node("A.Bin1", distr = "rconst", const = qbinom(Y1, 10, 0.5))
D <- D + node("A.Bin2", distr = "rconst", const = qbinom(Y2, 15, 0.7))
Dset3 <- set.DAG(D)
dat3 <- sim(Dset3, n = 200)

# Same as "A.Bin1" and "A.Bin2", but directly using rmvbin function in bindata package:
require("bindata")
D <- DAG.empty()
D <- D +
  node(c("B.Bin1","B.Bin2"), distr = "rmvbin",
    margprob = c(0.5, 0.5),
    bincorr = matrix(c(1,0.75,0.75,1), ncol=2))
Dset4 <- set.DAG(D)
dat4 <- sim(Dset4, n = 200)

# Time-varying multivar node (3 time-points, 2 dimensional normal):
D <- DAG.empty()
D <- D +
  node(c("X1", "X2"), t = 0:2, distr = "rmvnorm",
    mean = c(0,1),
    sigma = matrix(rep(0.75,4), ncol=2))
Dset5 <- set.DAG(D)
dat5 <- sim(Dset5, n = 200)

Run the code above in your browser using DataLab