Provides the sample datasets M1-M7 used in the paper "Conditional Variance Estimation for Sufficient Dimension Reduction" by Lukas Fertl and Efstathia Bura. The general model is given by: $$Y = g(B'X) + \epsilon$$
dataset(name = "M1", n = NULL, p = 20, sd = 0.5, ...)
name: One of "M1", "M2", "M3", "M4", "M5", "M6" or "M7". Alternatively, just the dataset number 1-7.
n: Number of samples.
p: Dimension of the random variable \(X\).
sd: Standard deviation of the error term \(\epsilon\).
...: Additional parameters, used only for "M2" (namely pmix and lambda); see the description of M2 below.
List with elements
X: data, an \(n\times p\) matrix.
Y: response.
B: the dim-reduction matrix.
name: name of the dataset (the name parameter).
For M1, the predictors are distributed as \(X\sim N_p(0, \Sigma)\) with \(\Sigma_{i, j} = 0.5^{|i - j|}\) for \(i, j = 1,..., p\). The subspace dimension is \(k = 1\) and the default sample size is \(n = 100\). With \(p = 20\) and \(b_1 = (1,1,1,1,1,1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p\), the response \(Y\) is given as $$Y = \cos(b_1'X) + \epsilon$$ where \(\epsilon\) follows a generalized normal distribution with location 0 and shape parameter 0.5, and the scale parameter is chosen such that \(Var(\epsilon) = 0.5\).
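The M1 recipe can be sketched in base R. This is an illustrative re-implementation under stated assumptions, not the package's own code: the names `rgnorm_sketch` and `make_m1_sketch` are hypothetical, and the sampler relies on the standard facts that, for a generalized normal distribution with scale \(\alpha\) and shape \(\beta\), \((|\epsilon|/\alpha)^\beta\) is Gamma(\(1/\beta\), 1) distributed and \(Var(\epsilon) = \alpha^2\Gamma(3/\beta)/\Gamma(1/\beta)\).

```r
# Hypothetical helper: draw n values from a generalized normal distribution
# with location 0, the given shape, and the scale chosen to match `variance`.
rgnorm_sketch <- function(n, shape, variance) {
    # Var = scale^2 * gamma(3 / shape) / gamma(1 / shape), solved for scale
    scale <- sqrt(variance * gamma(1 / shape) / gamma(3 / shape))
    # (|eps| / scale)^shape follows a Gamma(1 / shape, 1) distribution
    magnitude <- scale * rgamma(n, shape = 1 / shape)^(1 / shape)
    signs <- sample(c(-1, 1), n, replace = TRUE)
    signs * magnitude
}

# Hypothetical re-implementation of the M1 recipe (assumes p >= 6).
make_m1_sketch <- function(n = 100, p = 20) {
    Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))   # Sigma[i, j] = 0.5^|i - j|
    # chol(Sigma) is upper triangular R with R'R = Sigma, so the rows of
    # Z %*% R are N_p(0, Sigma) when the rows of Z are N_p(0, I_p).
    X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
    B <- matrix(c(rep(1, 6), rep(0, p - 6)) / sqrt(6), ncol = 1)
    eps <- rgnorm_sketch(n, shape = 0.5, variance = 0.5)
    Y <- drop(cos(X %*% B)) + eps
    list(X = X, Y = Y, B = B, name = "M1")
}

set.seed(1)
ds_m1 <- make_m1_sketch()
str(ds_m1)
```

The returned list mirrors the element names X, Y, B and name described above, so the sketch can be compared against the output of `dataset("M1")`.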
For M2, the predictors are distributed as \(X \sim Z 1_p \lambda + N_p(0, I_p)\) with
\(Z \sim 2 Binom(p_{mix}) - 1\in\{-1, 1\}\), where
\(1_p\) is the \(p\)-dimensional vector of ones. The subspace
dimension is \(k = 1\) and the default sample size is \(n = 100\).
With \(p = 20\) and \(b_1 = (1,1,1,1,1,1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p\),
the response \(Y\) is given as $$Y = \cos(b_1'X) + 0.5\epsilon$$ where \(\epsilon\) is
standard normal.
pmix defaults to 0.3 and lambda defaults to 1.
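A minimal base R sketch of the mixture predictors may help to see the roles of pmix and lambda. It is not the package's implementation: the function name `make_m2_sketch` is hypothetical, and \(Binom(p_{mix})\) is read here as a single Bernoulli draw with success probability pmix, so that \(Z = 1\) with probability pmix and \(Z = -1\) otherwise.

```r
# Hypothetical re-implementation of the M2 recipe (assumes p >= 6).
make_m2_sketch <- function(n = 100, p = 20, sd = 0.5, pmix = 0.3, lambda = 1) {
    # Z is 1 with probability pmix and -1 otherwise (one draw per sample)
    Z <- 2 * rbinom(n, size = 1, prob = pmix) - 1
    # Row i of the shift matrix is lambda * Z[i] * 1_p'
    shift <- outer(lambda * Z, rep(1, p))
    X <- shift + matrix(rnorm(n * p), n, p)
    B <- matrix(c(rep(1, 6), rep(0, p - 6)) / sqrt(6), ncol = 1)
    Y <- drop(cos(X %*% B)) + sd * rnorm(n)
    list(X = X, Y = Y, B = B, name = "M2")
}

set.seed(2)
ds_m2 <- make_m2_sketch(pmix = 0.5, lambda = 2)
# The row means cluster around +lambda and -lambda, one cluster per value of Z
summary(rowMeans(ds_m2$X))
```

With lambda = 0 the two mixture components coincide and the predictors reduce to \(N_p(0, I_p)\).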
For M3, the predictors are distributed as \(X\sim N_p(0, I_p)\). The subspace dimension is \(k = 1\) and the default sample size is \(n = 100\). With \(p = 20\) and \(b_1 = (1,1,1,1,1,1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p\), the response \(Y\) is given as $$Y = 2 \log(|b_1'X| + 2) + 0.5\epsilon$$ where \(\epsilon\) is standard normal.
For M4, the predictors are distributed as \(X\sim N_p(0,\Sigma)\) with \(\Sigma_{i, j} = 0.5^{|i - j|}\) for \(i, j = 1,..., p\). The subspace dimension is \(k = 2\) and the default sample size is \(n = 100\). With \(p = 20\), \(b_1 = (1,1,1,1,1,1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p\) and \(b_2 = (1,-1,1,-1,1,-1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p\), the response \(Y\) is given as $$Y = \frac{b_1'X}{0.5 + (1.5 + b_2'X)^2} + 0.5\epsilon$$ where \(\epsilon\) is standard normal.
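The two-dimensional case can be sketched in the same way. The function name `make_m4_sketch` is hypothetical and the code is an illustration under the definitions above, not the package source. The sketch also shows that \(b_1\) and \(b_2\) are orthonormal, so \(B = (b_1, b_2)\) satisfies \(B'B = I_2\).

```r
# Hypothetical re-implementation of the M4 recipe (assumes p >= 6).
make_m4_sketch <- function(n = 100, p = 20, sd = 0.5) {
    Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))   # Sigma[i, j] = 0.5^|i - j|
    # Rows of X are N_p(0, Sigma) because chol(Sigma)' chol(Sigma) = Sigma
    X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
    b1 <- c(rep(1, 6), rep(0, p - 6)) / sqrt(6)
    b2 <- c(rep(c(1, -1), 3), rep(0, p - 6)) / sqrt(6)
    B <- cbind(b1, b2)
    XB <- X %*% B                            # n x 2 matrix of reduced predictors
    Y <- XB[, 1] / (0.5 + (1.5 + XB[, 2])^2) + sd * rnorm(n)
    list(X = X, Y = Y, B = B, name = "M4")
}

set.seed(4)
ds_m4 <- make_m4_sketch()
# Orthonormal columns: crossprod(B) equals the 2 x 2 identity up to rounding
round(crossprod(ds_m4$B), 12)
```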
For M5, the predictors are distributed as \(X\sim U([0,1]^p)\), where \(U([0, 1]^p)\) is the uniform distribution with independent components on the \(p\)-dimensional hypercube. The subspace dimension is \(k = 2\) and the default sample size is \(n = 200\). With \(p = 20\), \(b_1 = (1,1,1,1,1,1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p\) and \(b_2 = (1,-1,1,-1,1,-1,0,...,0)' / \sqrt{6}\in\mathcal{R}^p\), the response \(Y\) is given as $$Y = \cos(\pi b_1'X)(b_2'X + 1)^2 + 0.5\epsilon$$ where \(\epsilon\) is standard normal.
For M6, the predictors are distributed as \(X\sim N_p(0, I_p)\). The subspace dimension is \(k = 3\) and the default sample size is \(n = 200\). With \(p = 20\), \(b_1 = e_1\), \(b_2 = e_2\) and \(b_3 = e_p\), where \(e_j\) is the \(j\)-th unit vector in the \(p\)-dimensional space, the response \(Y\) is given as $$Y = (b_1'X)^2+(b_2'X)^2+(b_3'X)^2+0.5\epsilon$$ where \(\epsilon\) is standard normal.
For M7, the predictors are distributed as \(X\sim t_3(I_p)\), where \(t_3(I_p)\) is the standard multivariate t-distribution with 3 degrees of freedom. The subspace dimension is \(k = 4\) and the default sample size is \(n = 200\). With \(p = 20\), \(b_1 = e_1\), \(b_2 = e_2\), \(b_3 = e_3\) and \(b_4 = e_p\), where \(e_j\) is the \(j\)-th unit vector in the \(p\)-dimensional space, the response \(Y\) is given as $$Y = (b_1'X)(b_2'X)^2+(b_3'X)(b_4'X)+0.5\epsilon$$ where \(\epsilon\) follows a generalized normal distribution with location 0 and shape parameter 1, and the scale parameter is chosen such that \(Var(\epsilon) = 0.25\).
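The heavy-tailed predictors and errors of M7 can be sketched as follows. The function name `make_m7_sketch` is hypothetical and this is not the package's code. Two standard facts are assumed: a multivariate t vector with 3 degrees of freedom and identity scale matrix is a standard normal vector divided by \(\sqrt{W/3}\) with \(W\sim\chi^2_3\), and a generalized normal distribution with shape 1 is a Laplace distribution with variance \(2\alpha^2\). The error term is scaled exactly as the formula above is written, \(0.5\epsilon\) with \(Var(\epsilon) = 0.25\).

```r
# Hypothetical re-implementation of the M7 recipe (assumes p >= 4).
make_m7_sketch <- function(n = 200, p = 20) {
    # Multivariate t_3(I_p): one chi-squared draw per observation, shared by
    # all p coordinates of that row (the length-n vector recycles down columns)
    W <- rchisq(n, df = 3)
    X <- matrix(rnorm(n * p), n, p) / sqrt(W / 3)
    # Shape 1 is the Laplace case: Var = 2 * scale^2 = 0.25
    scale <- sqrt(0.25 / 2)
    # A Laplace draw is a random sign times an exponential with mean `scale`
    eps <- sample(c(-1, 1), n, replace = TRUE) * rexp(n, rate = 1 / scale)
    B <- diag(p)[, c(1, 2, 3, p)]            # columns e_1, e_2, e_3, e_p
    XB <- X %*% B
    Y <- XB[, 1] * XB[, 2]^2 + XB[, 3] * XB[, 4] + 0.5 * eps
    list(X = X, Y = Y, B = B, name = "M7")
}

set.seed(7)
ds_m7 <- make_m7_sketch()
dim(ds_m7$B)
```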
Fertl, L. and Bura, E. (2021) "Conditional Variance Estimation for Sufficient Dimension Reduction" <arXiv:2102.08782>