Simulated gene expression data for demonstrating the features of marble.
data("dat")
dat consists of four components: X, Y, E, clin.
The data model for generating Y
Use subscript \(i\) to denote the \(i\)th subject. Let \((Y_{i}, X_{i}, E_{i}, clin_{i})\) (\(i=1,\ldots,n\)) be independent and identically distributed random vectors. \(Y_{i}\) is a continuous response variable representing the phenotype. \(X_{i}\) is the \(p\)-dimensional vector of genetic factors. The environmental factors and clinical factors are denoted as the \(q\)-dimensional vector \(E_{i}\) and the \(m\)-dimensional vector \(clin_{i}\), respectively. The random error \(\epsilon_{i}\) follows some heavy-tailed distribution.
For \(X_{ij}\) (\(j = 1,\ldots,p\)), the measurement of the \(j\)th genetic factor on the \(i\)th subject, consider the following model:
$$Y_{i} = \alpha_{0} + \sum_{k=1}^{q}\alpha_{k}E_{ik}+\sum_{t=1}^{m}\gamma_{t}clin_{it}+\beta_{j}X_{ij}+\sum_{k=1}^{q}\eta_{jk}X_{ij}E_{ik}+\epsilon_{i},$$
where \(\alpha_{0}\) is the intercept, and the \(\alpha_{k}\)'s and \(\gamma_{t}\)'s are the regression coefficients corresponding to the effects of the environmental and clinical factors, respectively. The \(\beta_{j}\)'s and \(\eta_{jk}\)'s are the regression coefficients of the genetic variants and the G\(\times\)E interaction effects, respectively. The G\(\times\)E interaction terms for the \(j\)th genetic factor are defined as \(W_{j} = (X_{j}E_{1},\ldots,X_{j}E_{q}).\) With a slight abuse of notation, denote \(\tilde{W} = W_{j}\), and let \(\tilde{W}_{i}\) be its \(i\)th row, \((X_{ij}E_{i1},\ldots,X_{ij}E_{iq})\).
Denote \(\alpha=(\alpha_{1}, \ldots, \alpha_{q})^{T}\), \(\gamma=(\gamma_{1}, \ldots, \gamma_{m})^{T}\), \(\beta=(\beta_{1}, \ldots, \beta_{p})^{T}\), \(\eta_{j}=(\eta_{j1}, \ldots, \eta_{jq})^{T}\) and \(\eta=(\eta_{1}^{T}, \ldots, \eta_{p}^{T})^{T}\); the interaction blocks for all \(p\) genetic factors are \((W_{1}, \dots, W_{p})\). Then, treating \(E_{i}\) and \(clin_{i}\) as row vectors, the model for the \(j\)th genetic factor can be written as
$$Y_{i} = \alpha_{0} + E_{i}\alpha + clin_{i}\gamma + X_{ij}\beta_{j} + \tilde{W}_{i}\eta_{j} + \epsilon_{i}.$$
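As a rough illustration of the model above, the following base R sketch simulates a response for a single genetic factor and checks that the summation form and the matrix form agree. The sample size, dimensions, coefficient values and the choice of a Student t distribution with 3 degrees of freedom for the heavy-tailed error are all assumptions made for this example; they are not the settings used to generate the packaged data.

```r
# Illustrative sketch only: all dimensions and coefficient values below are
# hypothetical and are NOT the values behind the packaged `dat`.
set.seed(1)
n <- 200   # subjects
q <- 2     # environmental factors
m <- 2     # clinical factors

E    <- matrix(rnorm(n * q), n, q)   # environmental factors E_i (rows)
clin <- matrix(rnorm(n * m), n, m)   # clinical factors clin_i (rows)
Xj   <- rnorm(n)                     # j-th genetic factor, X_ij

alpha0 <- 1                          # intercept
alpha  <- c(0.5, -0.5)               # environmental effects
gamma  <- c(0.3, 0.2)                # clinical effects
beta_j <- 0.8                        # main genetic effect
eta_j  <- c(0.6, 0)                  # G x E interaction effects

# W_j = (X_j E_1, ..., X_j E_q): entry (i, k) is X_ij * E_ik
Wj <- Xj * E

# Heavy-tailed error; a t distribution with 3 df is one possible choice
eps <- rt(n, df = 3)

# Matrix form of the model
Y <- drop(alpha0 + E %*% alpha + clin %*% gamma + beta_j * Xj + Wj %*% eta_j + eps)

# Summation form, written out subject by subject
Y_sum <- numeric(n)
for (i in seq_len(n)) {
  Y_sum[i] <- alpha0 + sum(alpha * E[i, ]) + sum(gamma * clin[i, ]) +
    beta_j * Xj[i] + sum(eta_j * Xj[i] * E[i, ]) + eps[i]
}

# TRUE when the two forms agree up to floating point tolerance
isTRUE(all.equal(Y, Y_sum))
```

Repeating the same construction for each \(j = 1,\ldots,p\) gives one such model per genetic factor, with the environmental and clinical terms shared across them.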
marble