An ExpressionSet object storing simulated genotype data. The minor allele frequency (MAF) of cases has the same prior as that of controls.
data("esSim")In this simulation, we generate additive-coded genotypes for 3 clusters of SNPs based on a mixture of 3 Bayesian hierarchical models.
In cluster \(+\), the minor allele frequency (MAF) \(\theta_{x+}\) of cases is greater than the MAF \(\theta_{y+}\) of controls.
In cluster \(0\), the MAF \(\theta_{0}\) of cases is equal to the MAF of controls.
In cluster \(-\), the MAF \(\theta_{x-}\) of cases is smaller than the MAF \(\theta_{y-}\) of controls.
The proportions of the 3 clusters of SNPs are \(\pi_{+}\), \(\pi_{0}\), and \(\pi_{-}\), respectively.
We assume a “half-flat shape” bivariate prior for the MAF in cluster \(+\) $$2h_{+}\left(\theta_{x+}\right)h_{+}\left(\theta_{y+}\right) I\left(\theta_{x+}>\theta_{y+}\right), $$ where \(I(a)\) is hte indicator function taking value \(1\) if the event \(a\) is true, and value \(0\) otherwise. The function \(h_{+}\) is the probability density function of the beta distribution \(Beta\left(\alpha_{+}, \beta_{+}\right)\).
We assume \(\theta_{0}\) has the beta prior \(Beta(\alpha_0, \beta_0)\).
We also assume a “half-flat shape” bivariate prior for the MAF in cluster \(-\) $$2h_{-}\left(\theta_{x-}\right)h_{-}\left(\theta_{y-}\right) I\left(\theta_{x-}>\theta_{y-}\right). $$ The function \(h_{-}\) is the probability density function of the beta distribution \(Beta\left(\alpha_{-}, \beta_{-}\right)\).
Given a SNP, we assume Hardy-Weinberg equilibrium holds for its genotypes. That is, given MAF \(\theta\), the probabilities of genotypes are $$Pr(geno=2) = \theta^2$$ $$Pr(geno=1) = 2\theta\left(1-\theta\right)$$ $$Pr(geno=0) = \left(1-\theta\right)^2$$
We also assume the genotypes \(0\) (wild-type), \(1\) (heterozygote), and \(2\) (mutation) follows a multinomial distribution \(Multinomial\left\{1, \left[ \theta^2, 2\theta\left(1-\theta\right), \left(1-\theta\right)^2 \right]\right\}\)
We set the number of cases as \(100\), the number of controls as \(100\), and the number of SNPs as \(1000\).
The hyperparameters are \(\alpha_{+}=2\), \(\beta_{+}=5\), \(\pi_{+}=0.1\), \(\alpha_{0}=2\), \(\beta_{0}=5\), \(\pi_{0}=0.8\), \(\alpha_{-}=2\), \(\beta_{-}=5\), \(\pi_{-}=0.1\).
Note that when we generate MAFs from the half-flat shape bivariate priors, we might get very small MAFs or get MAFs \(>0.5\). In these cased, we then delete this SNP.
So the final number of SNPs generated might be less than the initially-set number \(1000\) of SNPs.
For the dataset stored in esSim, there are \(872\) SNPs.
\(83\) SNPs are in cluster -, \(714\) SNPs are in cluster \(0\),
and \(75\) SNPs are in cluster \(+\).
Yan X, Xing L, Su J, Zhang X, Qiu W. Model-based clustering for identifying disease-associated SNPs in case-control genome-wide association studies. Scientific Reports 9, Article number: 13686 (2019) https://www.nature.com/articles/s41598-019-50229-6.
# NOT RUN {
data(esSim)
print(esSim)
pDat=pData(esSim)
print(pDat[1:2,])
print(table(pDat$memSubjs))
fDat=fData(esSim)
print(fDat[1:2,])
print(table(fDat$memGenes))
print(table(fDat$memGenes2))
# }
Run the code above in your browser using DataLab