gendata: Simulation Scenario from Bhatnagar et al. (2018+) sail paper

Description

Generates data for the different simulation scenarios presented in the accompanying paper. This function requires the truncnorm package to be installed.

Usage

gendata(n, p, corr, E = truncnorm::rtruncnorm(n, a = -1, b = 1), betaE,
  SNR, parameterIndex)

Arguments

n

number of observations

p

number of main effect variables (X)

corr

correlation between predictors

E

simulated environment vector of length n. Can be continuous or integer valued. Factors must be converted to numeric. Default: truncnorm::rtruncnorm(n, a = -1, b = 1)

betaE

exposure effect size

SNR

signal to noise ratio

parameterIndex

simulation scenario index. See details for more information.

Value

A list with the following elements:

x: matrix of dimension n x p of simulated main effects
y: simulated response vector of length n
e: simulated exposure vector of length n
Y.star: linear predictor vector of length n
f1: the function f1 evaluated at x_1 (f1(X1))
f2: the function f2 evaluated at x_2 (f2(X2))
f3: the function f3 evaluated at x_3 (f3(X3))
f4: the function f4 evaluated at x_4 (f4(X4))
betaE: the value for $\beta_E$
f1.f: the function f1
f2.f: the function f2
f3.f: the function f3
f4.f: the function f4
X1: an n length vector of the first predictor
X2: an n length vector of the second predictor
X3: an n length vector of the third predictor
X4: an n length vector of the fourth predictor
scenario: a character string giving the simulation scenario identifier as described in Bhatnagar et al. (2018+)
causal: character vector of causal variable names
not_causal: character vector of noise variables

Details

We evaluate the performance of our method on three of its defining characteristics: 1) the strong heredity property, 2) non-linearity of predictor effects and 3) interactions.

Heredity Property

Truth obeys weak hierarchy (parameterIndex = 2) $$Y* = f_1(X_{1}) + f_2(X_{2}) + \beta_E * X_{E} + X_{E} * f_3(X_{3}) + X_{E} * f_4(X_{4}) $$
Truth only has interactions (parameterIndex = 3) $$Y* = X_{E} * f_3(X_{3}) + X_{E} * f_4(X_{4}) $$

Non-linearity

Truth is linear (parameterIndex = 4) $$Y* = \sum_{j=1}^{4}\beta_j X_{j} + \beta_E * X_{E} + X_{E} * X_{3} + X_{E} * X_{4} $$

Interactions

Truth only has main effects (parameterIndex = 5) $$Y* = \sum_{j=1}^{4} f_j(X_{j}) + \beta_E * X_{E} $$

The functions are from the paper by Lin and Zhang (2006):

f2: f2 <- function(t) 3 * (2 * t - 1)^2
f3: f3 <- function(t) 4 * sin(2 * pi * t) / (2 - sin(2 * pi * t))
f4: f4 <- function(t) 6 * (0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) + 0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 + 0.5 * sin(2 * pi * t)^3)
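The listed functions can be tried out directly in base R. The following is a minimal sketch, not the package's own code: the inputs x3, x4 and xe are hypothetical stand-ins drawn with runif (they are not produced by gendata), and the linear predictor follows the interaction-only formula given for parameterIndex = 3.

```r
# Functions as listed above (from Lin and Zhang, 2006)
f2 <- function(t) 3 * (2 * t - 1)^2
f3 <- function(t) 4 * sin(2 * pi * t) / (2 - sin(2 * pi * t))
f4 <- function(t) 6 * (0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) +
                         0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 +
                         0.5 * sin(2 * pi * t)^3)

set.seed(1)
n <- 50

# Hypothetical inputs for illustration only (assumption, not gendata output)
x3 <- runif(n)
x4 <- runif(n)
xe <- runif(n, min = -1, max = 1)

# Interaction-only linear predictor (parameterIndex = 3)
y_star <- xe * f3(x3) + xe * f4(x4)
```

All three functions are vectorized, so they can be applied to whole columns of a predictor matrix at once.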

The response is generated as $$Y = Y* + k*error$$ where Y* is the linear predictor, the error term is generated from a standard normal distribution, and k is chosen such that the signal-to-noise ratio is SNR = Var(Y*)/Var(k*error), i.e., the variance of the response variable Y due to error is 1/SNR of the variance of Y due to Y*.
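One way to choose k is to solve the signal-to-noise relation for the noise scale using sample variances. The sketch below assumes exactly that; the linear predictor y_star and the value snr are hypothetical placeholders, and the package's internal computation may differ in detail.

```r
set.seed(42)
n <- 200

# Hypothetical linear predictor (stands in for Y.star; an assumption)
y_star <- 3 * runif(n) + rnorm(n)
snr <- 2

# Standard normal noise
error <- rnorm(n)

# Scale factor chosen so that var(y_star) / var(k * error) equals snr
k <- sqrt(var(y_star) / (snr * var(error)))

# Simulated response
y <- y_star + k * error

# Empirical signal-to-noise ratio implied by this choice of k
empirical_snr <- var(y_star) / var(k * error)
```

Because k is computed from the sample variances of the same draws, the empirical ratio matches the requested value up to floating point error rather than only in expectation.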

The covariates are simulated as described in Huang et al. (2010). First, we generate $w_1,\ldots, w_p, u, v$ independently from $Normal(0,1)$ truncated to the interval [0,1], for each observation $i=1,\ldots,n$. Then we set $x_j = (w_j + t*u)/(1 + t)$ for $j = 1,\ldots, 4$ and $x_j = (w_j + t*v)/(1 + t)$ for $j = 5,\ldots, p$, where the parameter $t$ controls the amount of correlation among predictors. This leads to a compound symmetry correlation structure where $Corr(x_j,x_k) = t^2/(1+t^2)$ for $1 \le j \le 4, 1 \le k \le 4, j \ne k$, and $Corr(x_j,x_k) = t^2/(1+t^2)$ for $5 \le j \le p, 5 \le k \le p, j \ne k$, but the covariates of the nonzero and zero components are independent.
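The construction above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the helper rtnorm01 is a hypothetical inverse-CDF sampler used here in place of truncnorm::rtruncnorm so that the block runs without extra packages, and t_par is a hypothetical name for the parameter $t$.

```r
set.seed(123)
n <- 2000
p <- 10
t_par <- 1  # hypothetical value of the correlation-controlling parameter t

# Normal(0, 1) truncated to [0, 1], drawn by inverse-CDF sampling
# (stand-in for truncnorm::rtruncnorm(n, a = 0, b = 1); an assumption)
rtnorm01 <- function(n) qnorm(runif(n, min = pnorm(0), max = pnorm(1)))

w <- matrix(rtnorm01(n * p), nrow = n, ncol = p)
u <- rtnorm01(n)
v <- rtnorm01(n)

x <- matrix(NA_real_, nrow = n, ncol = p)
# The first four predictors share the common component u
x[, 1:4] <- (w[, 1:4] + t_par * u) / (1 + t_par)
# The remaining predictors share the common component v
x[, 5:p] <- (w[, 5:p] + t_par * v) / (1 + t_par)

# Theoretical within-block correlation for this value of t
target <- t_par^2 / (1 + t_par^2)

within_block <- cor(x[, 1], x[, 2])  # should be close to target
across_block <- cor(x[, 1], x[, 5])  # should be close to 0
```

The sample correlations only approximate the theoretical values, so a larger n gives a closer match.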

References

Lin, Y., & Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5), 2272-2297.

Huang, J., Horowitz, J. L., & Wei, F. (2010). Variable selection in nonparametric additive models. The Annals of Statistics, 38(4), 2282.

Bhatnagar, S. R., Yang, Y., & Greenwood, C. M. T. (2018+). Sparse additive interaction models with the strong heredity property. Preprint.

Examples

# NOT RUN {
DT <- gendata(n = 75, p = 100, corr = 0, betaE = 2, SNR = 1, parameterIndex = 1)
# }
