Generates data for the different simulation studies presented in the
accompanying paper. This function requires the truncnorm package to be
installed.
gendata(n, p, corr, E = truncnorm::rtruncnorm(n, a = -1, b = 1), betaE,
  SNR, parameterIndex)
n: number of observations
p: number of main effect variables (X)
corr: correlation between predictors
E: simulated environment vector of length n. Can be continuous or integer valued. Factors must be converted to numeric. Default: truncnorm::rtruncnorm(n, a = -1, b = 1)
betaE: exposure effect size
SNR: signal to noise ratio
parameterIndex: simulation scenario index. See details for more information.
A list with the following elements:
matrix of dimension n x p of simulated main effects
simulated response vector of length n
simulated exposure vector of length n
linear predictor vector of length n
the function f1 evaluated at x_1 (f1(X1))
the function f2 evaluated at x_2 (f2(X2))
the function f3 evaluated at x_3 (f3(X3))
the function f4 evaluated at x_4 (f4(X4))
the value for \(\beta_E\)
the function f1
the function f2
the function f3
the function f4
an n length vector of the first predictor
an n length vector of the second predictor
an n length vector of the third predictor
an n length vector of the fourth predictor
a character representing the simulation scenario identifier as described in Bhatnagar et al. (2018+)
character vector of causal variable names
character vector of noise variables
We evaluate the performance of our method on three of its defining characteristics: 1) the strong heredity property, 2) non-linearity of predictor effects and 3) interactions.
Truth obeys strong hierarchy (parameterIndex = 1) $$Y* = \sum_{j=1}^{4} f_j(X_{j}) + \beta_E * X_{E} + X_{E} * f_3(X_{3}) + X_{E} * f_4(X_{4}) $$
Truth obeys weak hierarchy (parameterIndex = 2) $$Y* = f_1(X_{1}) + f_2(X_{2}) + \beta_E * X_{E} + X_{E} * f_3(X_{3}) + X_{E} * f_4(X_{4}) $$
Truth only has interactions (parameterIndex = 3) $$Y* = X_{E} * f_3(X_{3}) + X_{E} * f_4(X_{4}) $$
Truth is linear (parameterIndex = 4) $$Y* = \sum_{j=1}^{4}\beta_j X_{j} + \beta_E * X_{E} + X_{E} * X_{3} + X_{E} * X_{4} $$
Truth only has main effects (parameterIndex = 5) $$Y* = \sum_{j=1}^{4} f_j(X_{j}) + \beta_E * X_{E} $$
The functions are from the paper by Lin and Zhang (2006):
f2 <- function(t) 3 * (2 * t - 1)^2
f3 <- function(t) 4 * sin(2 * pi * t) / (2 - sin(2 * pi * t))
f4 <- function(t) 6 * (0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) + 0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 + 0.5 * sin(2 * pi * t)^3)
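To illustrate how these functions enter a scenario, here is a minimal sketch of the interactions-only linear predictor (parameterIndex = 3). The inputs x3, x4 and xe are hypothetical stand-ins drawn with base R, not output from gendata. The definitions of f3 and f4 are copied from the lines above.

```r
# Component functions copied from the definitions above
f3 <- function(t) 4 * sin(2 * pi * t) / (2 - sin(2 * pi * t))
f4 <- function(t) 6 * (0.1 * sin(2 * pi * t) + 0.2 * cos(2 * pi * t) +
  0.3 * sin(2 * pi * t)^2 + 0.4 * cos(2 * pi * t)^3 + 0.5 * sin(2 * pi * t)^3)

set.seed(123)
n <- 50

# Hypothetical inputs for illustration only (not produced by gendata)
x3 <- runif(n)          # stand-in for the third predictor, on [0, 1]
x4 <- runif(n)          # stand-in for the fourth predictor, on [0, 1]
xe <- runif(n, -1, 1)   # stand-in for the exposure, on [-1, 1]

# Interactions-only scenario: Y* = X_E * f3(X_3) + X_E * f4(X_4)
y_star <- xe * f3(x3) + xe * f4(x4)
```

Because the denominator of f3 never drops below 1, the linear predictor stays finite for any input.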
The response is generated as $$Y = Y* + k*error$$ where Y* is the linear predictor and the error term is generated from a standard normal distribution. The constant k is chosen such that the signal-to-noise ratio is SNR = Var(Y*)/Var(k*error), i.e., the variance of the response variable Y due to error is 1/SNR of the variance of Y due to Y*.
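The scaling constant k can be worked out as in the sketch below. It assumes sample variances are used and that y_star is a hypothetical linear predictor. The package's internal computation may differ in detail.

```r
set.seed(42)
n <- 200
SNR <- 2                      # target signal-to-noise ratio

y_star <- rnorm(n, sd = 3)    # hypothetical linear predictor Y* (illustration only)
error <- rnorm(n)             # standard normal noise

# Choose k so that Var(Y*) / Var(k * error) equals SNR
k <- sqrt(var(y_star) / (SNR * var(error)))
y <- y_star + k * error

# Ratio of signal variance to scaled-noise variance actually achieved
achieved_snr <- var(y_star) / var(k * error)
```

Solving Var(Y*) / (k^2 * Var(error)) = SNR for k gives the square-root expression used here, so achieved_snr matches SNR up to floating-point error.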
The covariates are simulated as described in Huang et al. (2010). First,
we generate \(w_1,\ldots, w_p, u, v\) independently from \(Normal(0,1)\)
truncated to the interval [0,1], for each observation \(i=1,\ldots,n\).
Then we set \(x_j = (w_j + t*u)/(1 + t)\) for \(j = 1,\ldots, 4\) and
\(x_j = (w_j + t*v)/(1 + t)\) for \(j = 5,\ldots, p\), where the
parameter \(t\) controls the amount of correlation among predictors. This
leads to a compound symmetry correlation structure within each block,
where \(Corr(x_j,x_k) = t^2/(1+t^2)\) for \(j \ne k\) with either
\(1 \le j, k \le 4\) or \(5 \le j, k \le p\), while the covariates of the
nonzero and zero components are independent of each other.
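The construction above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation. The helper rtnorm01 is a hypothetical inverse-CDF substitute for truncnorm::rtruncnorm(n, a = 0, b = 1), and t_par is a hypothetical name for the parameter \(t\).

```r
# Hypothetical base-R stand-in for truncnorm::rtruncnorm(n, a = 0, b = 1):
# inverse-CDF sampling of a standard normal truncated to [lower, upper]
rtnorm01 <- function(n, lower = 0, upper = 1) {
  qnorm(runif(n, pnorm(lower), pnorm(upper)))
}

set.seed(2024)
n <- 5000
p <- 10
t_par <- 1   # hypothetical value of t; implies Corr = t^2 / (1 + t^2) = 0.5

w <- matrix(rtnorm01(n * p), nrow = n, ncol = p)
u <- rtnorm01(n)
v <- rtnorm01(n)

x <- matrix(NA_real_, nrow = n, ncol = p)
x[, 1:4] <- (w[, 1:4] + t_par * u) / (1 + t_par)   # first block shares u
x[, 5:p] <- (w[, 5:p] + t_par * v) / (1 + t_par)   # second block shares v

within_block <- cor(x[, 1], x[, 2])    # should be close to 0.5
between_block <- cor(x[, 1], x[, 5])   # should be close to 0
```

Since \(w_j\), \(u\) and \(v\) share the same variance, the covariance \(t^2 Var(u)\) divided by the variance \((1 + t^2) Var(u)\) gives the stated \(t^2/(1+t^2)\). Predictors in different blocks share no common term, so they are uncorrelated.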
Lin, Y., & Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5), 2272-2297.
Huang, J., Horowitz, J. L., & Wei, F. (2010). Variable selection in nonparametric additive models. The Annals of Statistics, 38(4), 2282.
Bhatnagar, S. R., Yang, Y., & Greenwood, C. M. T. (2018+). Sparse additive interaction models with the strong heredity property. Preprint.
# NOT RUN {
DT <- gendata(n = 75, p = 100, corr = 0, betaE = 2, SNR = 1, parameterIndex = 1)
# }