dataset: Generates test datasets.

Description

Provides the sample datasets M1-M7 used in the paper "Conditional Variance Estimation for Sufficient Dimension Reduction" by Lukas Fertl and Efstathia Bura. The general model is given by $$Y = g(B'X) + \epsilon$$ where $B$ is the dimension-reduction matrix and $\epsilon$ is the error term.

Usage

dataset(name = "M1", n = NULL, p = 20, sd = 0.5, ...)

Arguments

name

One of "M1", "M2", "M3", "M4", "M5", "M6" or "M7". Alternatively, just the dataset number 1-7.

n

Number of samples.

p

Dimension of the random variable $X$.

sd

Standard deviation of the error term $\epsilon$.

...

Additional parameters used only for "M2" (namely pmix and lambda); see below.

Value

List with elements

X: data, an $n\times p$ matrix.
Y: response.
B: the dim-reduction matrix.
name: name of the dataset (name parameter).

M1

The predictors are distributed as $X\sim N_p(0, \Sigma)$ with $\Sigma_{i, j} = 0.5^{|i - j|}$ for $i, j = 1,..., p$. The subspace dimension is $k = 1$ and the default sample size is $n = 100$. With $p = 20$ and $b_1 = (1,1,1,1,1,1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p$, the response $Y$ is given as $$Y = \cos(b_1'X) + \epsilon$$ where $\epsilon$ follows a generalized normal distribution with location 0 and shape parameter 0.5, and the scale parameter is chosen such that $Var(\epsilon) = 0.5$.
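The M1 construction can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's own code: the helper `rgnorm_sketch` is a hypothetical name, and it relies on the standard facts that a generalized normal variable with scale $\alpha$ and shape $\beta$ has variance $\alpha^2\Gamma(3/\beta)/\Gamma(1/\beta)$ and that $|\epsilon/\alpha|^\beta$ is Gamma distributed with shape $1/\beta$. The variance value is taken from the description above.

```r
# Sketch of model M1 in base R (illustrative only).
set.seed(1)
n <- 100; p <- 20

# Covariance with entries Sigma[i, j] = 0.5^|i - j|
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))

# Rows of X are N_p(0, Sigma): chol() returns upper-triangular R with R'R = Sigma
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# Direction b_1: first six entries equal to 1 / sqrt(6), the rest 0
b1 <- c(rep(1, 6), rep(0, p - 6)) / sqrt(6)

# Hypothetical helper: generalized normal noise with location 0.
# The scale alpha solves variance = alpha^2 * gamma(3 / shape) / gamma(1 / shape).
rgnorm_sketch <- function(n, shape, variance) {
  alpha <- sqrt(variance * gamma(1 / shape) / gamma(3 / shape))
  g <- rgamma(n, shape = 1 / shape, rate = 1)   # |eps / alpha|^shape ~ Gamma
  sample(c(-1, 1), n, replace = TRUE) * alpha * g^(1 / shape)
}
eps <- rgnorm_sketch(n, shape = 0.5, variance = 0.5)

# Response: Y = cos(b_1'X) + eps
Y <- cos(X %*% b1) + eps
```

Using the Cholesky factor keeps the sketch free of extra packages; any other square root of $\Sigma$ would serve equally well.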

M2

The predictors are distributed as $X \sim Z 1_p \lambda + N_p(0, I_p)$ with $Z \sim 2 Binom(p_{mix}) - 1\in\{-1, 1\}$, where $1_p$ is the $p$-dimensional vector of ones. The subspace dimension is $k = 1$ and the default sample size is $n = 100$. With $p = 20$ and $b_1 = (1,1,1,1,1,1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p$, the response $Y$ is $$Y = \cos(b_1'X) + 0.5\epsilon$$ where $\epsilon$ is standard normal. pmix defaults to 0.3 and lambda defaults to 1.
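A minimal base R sketch of the M2 mixture is given below. It assumes that $Binom(p_{mix})$ denotes a single Bernoulli draw with success probability pmix, so that $Z = 1$ with probability pmix; this reading is an assumption, and the code is not taken from the package.

```r
# Sketch of model M2 in base R (illustrative only).
set.seed(2)
n <- 100; p <- 20
pmix <- 0.3; lambda <- 1

# Z takes values in {-1, 1}: 2 * Bernoulli(pmix) - 1 (assumed reading)
Z <- 2 * rbinom(n, size = 1, prob = pmix) - 1

# Row i is Z_i * lambda * 1_p plus N_p(0, I_p) noise; the length-n vector
# is recycled down the columns, so every entry of row i is shifted by Z_i * lambda
X <- matrix(rnorm(n * p), n, p) + Z * lambda

b1 <- c(rep(1, 6), rep(0, p - 6)) / sqrt(6)

# Response: Y = cos(b_1'X) + 0.5 * eps with standard normal eps
Y <- cos(X %*% b1) + 0.5 * rnorm(n)
```

The predictors form a two-component normal mixture whose centers sit at $\pm\lambda 1_p$, so larger values of lambda separate the two clusters further.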

M3

The predictors are distributed as $X\sim N_p(0, I_p)$. The subspace dimension is $k = 1$ and the default sample size is $n = 100$. With $p = 20$ and $b_1 = (1,1,1,1,1,1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p$, the response $Y$ is $$Y = 2 \log(|b_1'X| + 2) + 0.5\epsilon$$ where $\epsilon$ is standard normal.

M4

The predictors are distributed as $X\sim N_p(0,\Sigma)$ with $\Sigma_{i, j} = 0.5^{|i - j|}$ for $i, j = 1,..., p$. The subspace dimension is $k = 2$ and the default sample size is $n = 100$. With $p = 20$, $b_1 = (1,1,1,1,1,1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p$ and $b_2 = (1,-1,1,-1,1,-1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p$, the response $Y$ is given as $$Y = \frac{b_1'X}{0.5 + (1.5 + b_2'X)^2} + 0.5\epsilon$$ where $\epsilon$ is standard normal.

M5

The predictors are distributed as $X\sim U([0,1]^p)$, where $U([0, 1]^p)$ is the uniform distribution with independent components on the $p$-dimensional hypercube. The subspace dimension is $k = 2$ and the default sample size is $n = 200$. With $p = 20$, $b_1 = (1,1,1,1,1,1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p$ and $b_2 = (1,-1,1,-1,1,-1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p$, the response $Y$ is given as $$Y = \cos(\pi b_1'X)(b_2'X + 1)^2 + 0.5\epsilon$$ where $\epsilon$ is standard normal.
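The two-direction case can be sketched in base R as shown below, using M5 as the example. The code is an illustration rather than the package's implementation; it stacks $b_1$ and $b_2$ into a $p \times 2$ matrix `B` (a variable name chosen here for illustration) and works with the reduced predictors $B'X$.

```r
# Sketch of model M5 in base R (illustrative only).
set.seed(5)
n <- 200; p <- 20

# Independent uniform components on the hypercube [0, 1]^p
X <- matrix(runif(n * p), n, p)

# The two directions; only the first six coordinates are non-zero
b1 <- c(rep(1, 6), rep(0, p - 6)) / sqrt(6)
b2 <- c(rep(c(1, -1), 3), rep(0, p - 6)) / sqrt(6)
B <- cbind(b1, b2)          # p x 2 with orthonormal columns

# Reduced predictors: column 1 is b_1'X, column 2 is b_2'X
XB <- X %*% B

# Response: Y = cos(pi * b_1'X) * (b_2'X + 1)^2 + 0.5 * eps
Y <- cos(pi * XB[, 1]) * (XB[, 2] + 1)^2 + 0.5 * rnorm(n)
```

Because $b_1$ and $b_2$ are orthonormal, `crossprod(B)` equals the $2 \times 2$ identity, which gives a quick sanity check when comparing an estimated basis against `B`.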

M6

The predictors are distributed as $X\sim N_p(0, I_p)$. The subspace dimension is $k = 3$ and the default sample size is $n = 200$. With $p = 20$, $b_1 = e_1$, $b_2 = e_2$ and $b_3 = e_p$, where $e_j$ is the $j$-th unit vector in the $p$-dimensional space, the response $Y$ is given as $$Y = (b_1'X)^2+(b_2'X)^2+(b_3'X)^2+0.5\epsilon$$ where $\epsilon$ is standard normal.

M7

The predictors are distributed as $X\sim t_3(I_p)$, where $t_3(I_p)$ is the standard multivariate t-distribution with 3 degrees of freedom. The subspace dimension is $k = 4$ and the default sample size is $n = 200$. With $p = 20$, $b_1 = e_1$, $b_2 = e_2$, $b_3 = e_3$ and $b_4 = e_p$, where $e_j$ is the $j$-th unit vector in the $p$-dimensional space, the response $Y$ is given as $$Y = (b_1'X)(b_2'X)^2+(b_3'X)(b_4'X)+0.5\epsilon$$ where $\epsilon$ follows a generalized normal distribution with location 0 and shape parameter 1, and the scale parameter is chosen such that $Var(\epsilon) = 0.25$.
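The M7 ingredients can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code. It uses two standard facts: a standard multivariate t vector is a standard normal vector divided by $\sqrt{W/\nu}$ with a single $W\sim\chi^2_\nu$ shared across components, and a generalized normal distribution with shape 1 is a Laplace distribution with variance $2\alpha^2$ for scale $\alpha$. The formula and variance value are taken from the description above as written.

```r
# Sketch of model M7 in base R (illustrative only).
set.seed(7)
n <- 200; p <- 20; df <- 3

# Standard multivariate t: each N_p(0, I_p) row is divided by sqrt(W / df),
# with one chi-squared draw W_i shared by all components of row i
W <- rchisq(n, df = df)
X <- matrix(rnorm(n * p), n, p) / sqrt(W / df)

# b_1 = e_1, b_2 = e_2, b_3 = e_3, b_4 = e_p as columns of the identity
B <- diag(p)[, c(1, 2, 3, p)]
XB <- X %*% B                 # same as X[, c(1, 2, 3, p)]

# Laplace noise (generalized normal, shape 1) with Var(eps) = 2 * alpha^2 = 0.25,
# drawn as the difference of two exponentials with rate 1 / alpha
alpha <- sqrt(0.25 / 2)
eps <- rexp(n, rate = 1 / alpha) - rexp(n, rate = 1 / alpha)

# Response: Y = (b_1'X)(b_2'X)^2 + (b_3'X)(b_4'X) + 0.5 * eps
Y <- XB[, 1] * XB[, 2]^2 + XB[, 3] * XB[, 4] + 0.5 * eps
```

With only 3 degrees of freedom the predictors are heavy tailed, so occasional very large values in `X` and `Y` are expected rather than a sign of an error.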

References

Fertl, L. and Bura, E. (2021) "Conditional Variance Estimation for Sufficient Dimension Reduction" <arXiv:2102.08782>