getSample: Generates Simulated Data

Description

Generates a run of simulated observations of the form $(W,X,Y)$ to investigate the "effect" of $X$ on $Y$ taking $W$ into account.

Usage

getSample(n, O, lambda0, p = rep(1/3, 3), omega = rep(1/2, 3), 
    Sigma1 = matrix(c(1, 1/sqrt(2), 1/sqrt(2), 1), 2, 2), sigma2 = omega[2]/5, 
    Sigma3 = Sigma1, f = identity, verbose = FALSE)

Arguments

An integer, the number of observations to be generated.

A 3x3 numeric matrix or data.frame. Rows are 3 baseline observations used as "class centers" for the simulation. Columns are:

$W$, baseline covariate (e.g. DNA methylation), or "confounder" in a causal model

Value

Returns a list with the following tags:
obsA matrix of n observations. The c(W,X,Y) colums of obs have the same interpretation as the columns of the input argument O.
psiA numeric, approximated value of the true parameter $\psi$ obtained using the known $\theta$ and the joint empirical distribution of $(X,W)$ based on the same run of observations as in obs. The larger the sample size, the more accurate the approximation.
gA function, the known positive conditional probability $P(X=x_0|W)$.
muA function, the known conditional expectation $E_P(X|W)$.
muAuxA function, the known conditional expectation $E_P(X|X\neq x_0, W)$.
thetaA function, the known conditional expectation $E_P(Y|X,W)$.
theta0A function, the known conditional expectation $E_P(Y|X=x_0,W)$.
sigma2A positive numeric, the known expectation $E_P(f(X-x_0)^2)$.
effICA function, the known efficient influence curve of the functional $\Psi$ at $P$, assuming that the reference value $x_0=0$.
varICA positive numeric, approximated value of the variance of the efficient influence curve of the functional $\Psi$ at $P$ and evaluated at $O$, obtained using the same run of observations as in obs. The larger the sample size, the more accurate the approximation.

item

lambda0
p
omega
Sigma1
sigma2
Sigma3
f
verbose

code

FALSE

eqn

$f(0)=0$

Details

The parameter of interest is defined as $\psi=\Psi(P)$ with $$\Psi(P) = \frac{E_P[f(X-x_0) * (\theta(X,W) - \theta(x_0,W))]}{E_P[f(X-x_0)^2]},$$ with $P$ the distribution of the random vector $(W,X,Y)$, $\theta(X,W) = E_P[Y|X,W]$, $x_0$ the reference value for $X$, and $f$ a user-supplied function such that $f(0)=0$ (e.g., $f=identity$, the default value). The value $\psi$ is obtained using the known $\theta$ and the joint empirical distribution of $(X,W)$ based on the same run of observations as in obs. Seeing $W, X, Y$ as DNA methylation, DNA copy number and gene expression, respectively, the simulation scheme implements the following constraints:

There are two or three copy number classes: normal regions (k=2), and regions of copy number gains and/or losses (k=1and/ork=3).
In normal regions, gene expression levels are negatively correlated with DNA methylation.
In regions of copy number alteration, copy number and expression are positively correlated.

References

Chambaz, A., Neuvial, P., & van der Laan, M. J. (2012). Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic journal of statistics, 6, 1059--1099.

Examples

Run this code

## Parameters for the simulation (case 'f=identity')
O <- cbind(W=c(0.05218652, 0.01113460),
           X=c(2.722713, 9.362432),
           Y=c(-0.4569579, 1.2470822))
O <- rbind(NA, O)
lambda0 <- function(W) {-W}
p <- c(0, 1/2, 1/2)
omega <- c(0, 3, 3)
S <- matrix(c(10, 1, 1, 0.5), 2 ,2)

## Simulating a data set of 200 i.i.d. observations
sim <- getSample(2e2, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
str(sim)

obs <- sim$obs
head(obs)
pairs(obs)

## Adding (dummy) baseline covariates
V <- matrix(runif(3*nrow(obs)), ncol=3)
colnames(V) <- paste("V", 1:3, sep="")
obsV <- cbind(V, obs)
head(obsV)

## True psi and confidence intervals (case 'f=identity')      
sim01 <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
truePsi1 <- sim01$psi

confInt01 <- truePsi1+c(-1, 1)*qnorm(.975)*sqrt(sim01$varIC/nrow(sim01$obs))
confInt1 <- truePsi1+c(-1, 1)*qnorm(.975)*sqrt(sim01$varIC/nrow(obs))
msg <- "Case f=identity:
"
msg <- c(msg, "ttrue psi is: ", paste(signif(truePsi1, 3)), "")
msg <- c(msg, "t95%-confidence interval for the approximation is: ", 
         paste(signif(confInt01, 3)), "")
msg <- c(msg, "toptimal 95%-confidence interval is: ", 
         paste(signif(confInt1, 3)), "")
cat(msg)

## True psi and confidence intervals (case 'f=atan')
f2 <- function(x) {1*atan(x/1)}
sim02 <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S, f=f2);
truePsi2 <- sim02$psi;

confInt02 <- truePsi2+c(-1, 1)*qnorm(.975)*sqrt(sim02$varIC/nrow(sim02$obs))
confInt2 <- truePsi2+c(-1, 1)*qnorm(.975)*sqrt(sim02$varIC/nrow(obs))

msg <- "Case f=atan:
"
msg <- c(msg, "ttrue psi is: ", paste(signif(truePsi2, 3)), "")
msg <- c(msg, "t95%-confidence interval for the approximation is: ", 
         paste(signif(confInt02, 3)), "")
msg <- c(msg, "toptimal 95%-confidence interval is: ", 
         paste(signif(confInt2, 3)), "")
cat(msg)

Run the code above in your browser using DataLab