Simulate Genotype Data from a Mixture of 3 Bayesian Hierarchical Models. The minor allele frequency (MAF) of cases has different priors than that of controls.
simGenoFuncDiffPriors(
nCases = 100,
nControls = 100,
nSNPs = 1000,
alpha.p.ca = 2,
beta.p.ca = 3,
alpha.p.co = 2,
beta.p.co = 8,
pi.p = 0.1,
alpha0 = 2,
beta0 = 5,
pi0 = 0.8,
alpha.n.ca = 2,
beta.n.ca = 8,
alpha.n.co = 2,
beta.n.co = 3,
pi.n = 0.1,
low = 0.02,
upp = 0.5,
verbose = FALSE)integer. Number of cases.
integer. Number of controls.
integer. Number of SNPs.
numeric. The first shape parameter of Beta prior in cluster \(+\) for cases.
numeric. The second shape parameter of Beta prior in cluster \(+\) for cases.
numeric. The first shape parameter of Beta prior in cluster \(+\) for controls.
numeric. The second shape parameter of Beta prior in cluster \(+\) for controls.
numeric. Mixture proportion for cluster \(+\).
numeric. The first shape parameter of Beta prior in cluster \(0\).
numeric. The second shape parameter of Beta prior in cluster \(0\).
numeric. Mixture proportion for cluster \(0\).
numeric. The first shape parameter of Beta prior in cluster \(-\) for cases.
numeric. The second shape parameter of Beta prior in cluster \(-\) for cases.
numeric. The first shape parameter of Beta prior in cluster \(-\) for controls.
numeric. The second shape parameter of Beta prior in cluster \(-\) for controls.
numeric. Mixture proportion for cluster \(-\).
numeric. A small positive value. If a MAF generated from half-flat shape
bivariate prior is smaller than low, we will delete the SNP to be generated.
numeric. A positive value. If a MAF generated from half-flat shape
bivariate prior is greater than upp, we will delete the SNP to be generated.
logical. Indicating if intermediate results or final results should be output to output screen.
An ExpressionSet object stores genotype data.
In this simulation, we generate additive-coded genotypes for 3 clusters of SNPs based on a mixture of 3 Bayesian hierarchical models.
In cluster \(+\), the minor allele frequency (MAF) \(\theta_{x+}\) of cases is greater than the MAF \(\theta_{y+}\) of controls.
In cluster \(0\), the MAF \(\theta_{0}\) of cases is equal to the MAF of controls.
In cluster \(-\), the MAF \(\theta_{x-}\) of cases is smaller than the MAF \(\theta_{y-}\) of controls.
The proportions of the 3 clusters of SNPs are \(\pi_{+}\), \(\pi_{0}\), and \(\pi_{-}\), respectively.
We assume a “half-flat shape” bivariate prior for the MAF in cluster \(+\) $$2h_{+}\left(\theta_{x+}\right)h_{+}\left(\theta_{y+}\right) I\left(\theta_{x+}>\theta_{y+}\right), $$ where \(I(a)\) is hte indicator function taking value \(1\) if the event \(a\) is true, and value \(0\) otherwise. The function \(h_{+}\) is the probability density function of the beta distribution \(Beta\left(\alpha_{+}, \beta_{+}\right)\).
We assume \(\theta_{0}\) has the beta prior \(Beta(\alpha_0, \beta_0)\).
We also assume a “half-flat shape” bivariate prior for the MAF in cluster \(-\) $$2h_{-}\left(\theta_{x-}\right)h_{-}\left(\theta_{y-}\right) I\left(\theta_{x-}>\theta_{y-}\right). $$ The function \(h_{-}\) is the probability density function of the beta distribution \(Beta\left(\alpha_{-}, \beta_{-}\right)\).
Given a SNP, we assume Hardy-Weinberg equilibrium holds for its genotypes. That is, given MAF \(\theta\), the probabilities of genotypes are $$Pr(geno=2) = \theta^2$$ $$Pr(geno=1) = 2\theta\left(1-\theta\right)$$ $$Pr(geno=0) = \left(1-\theta\right)^2$$
We also assume the genotypes \(0\) (wild-type), \(1\) (heterozygote), and \(2\) (mutation) follows a multinomial distribution \(Multinomial\left\{1, \left[ \theta^2, 2\theta\left(1-\theta\right), \left(1-\theta\right)^2 \right]\right\}\)
Note that when we generate MAFs from the half-flat shape bivariate priors, we might get very small MAFs or get MAFs \(>0.5\). In these cased, we then delete this SNP.
So the final number of SNPs generated might be less than the initially-set number of SNPs.
Yan X, Xing L, Su J, Zhang X, Qiu W. Model-based clustering for identifying disease-associated SNPs in case-control genome-wide association studies. Scientific Reports 9, Article number: 13686 (2019) https://www.nature.com/articles/s41598-019-50229-6.
# NOT RUN {
set.seed(2)
esSimDiffPriors = simGenoFuncDiffPriors(
nCases = 100,
nControls = 100,
nSNPs = 500,
alpha.p.ca = 2, beta.p.ca = 3,
alpha.p.co = 2, beta.p.co = 8, pi.p = 0.1,
alpha0 = 2, beta0 = 5, pi0 = 0.8,
alpha.n.ca = 2, beta.n.ca = 8,
alpha.n.co = 2, beta.n.co = 3, pi.n = 0.1,
low = 0.02, upp = 0.5, verbose = FALSE
)
print(esSimDiffPriors)
pDat = pData(esSimDiffPriors)
print(pDat[1:2,])
print(table(pDat$memSubjs))
fDat = fData(esSimDiffPriors)
print(fDat[1:2,])
print(table(fDat$memGenes))
print(table(fDat$memGenes2))
# }
Run the code above in your browser using DataLab