Power calculation for association between genotype and gene expression based on single cell RNAseq data via simulation from ZINB mixed effects regression model. This function can be used to calculate one of the 4 parameters (power, sample size, minimum detectable slope, and minimum allowable MAF) by setting the corresponding parameter as NULL and providing values for the other 3 parameters.
powerEQTL.scRNAseq.sim(slope,
n,
m,
power = NULL,
m.int = -1,
sigma.int = 1,
zero.p = 0.1,
theta = 1,
MAF = 0.2,
FWER = 0.05,
nTests = 1,
nSim = 1000,
estMethod = "GLMMadaptive",
nCores = 1,
n.lower = 2.01,
n.upper = 1e+4,
slope.lower = 1e-6,
slope.upper = log(1.0e+6),
MAF.lower = 0.05,
MAF.upper = 0.49
)
numeric. Slope (see details).
integer. Total number of subjects.
integer. Number of cells per subject.
numeric. Power for testing if the slope is equal to zero.
numeric. Mean of random intercept (see details).
numeric. Standard deviation of random intercept (see details).
numeric. Probability that an excess zero occurs.
numeric. Dispersion parameter of negative binomial distribution.
The smaller theta
is, the larger variance of NB random variable is.
numeric. Minor allele frequency of the SNP.
numeric. Family-wise Type I error rate.
integer. Number of tests (i.e., number of all (SNP, gene) pairs) in eQTL analysis.
integer. Number of simulated datasets to be generated.
character. Indicates which method would be used to fit zero inflated negative binomial mixed effects model. Currently, the possible choice is “GLMMadaptive”.
integer. Number of computer cores used by mclapply
for parallel computing. For Windows, nCores=1
.
numeric. Lower bound of the total number of subjects. Only used when calculating total number of subjects.
numeric. Upper bound of the total number of subjects. Only used when calculating total number of subjects.
numeric. Lower bound of the slope. Only used when calculating minimum slope.
numeric. Upper bound of the slope Only used when calculating minimum slope.
numeric. Lower bound of the MAF. Only used when calculating minimum MAF.
numeric. Upper bound of the MAF Only used when calculating minimum MAF.
power if the input parameter power = NULL
.
sample size (total number of subjects) if the input parameter n = NULL
;
minimum detectable slope if the input parameter slope = NULL
;
minimum allowable MAF if the input parameter MAF = NULL
.
This function calculates the power for testing if genotypes of a SNP is associated with the expression of a gene via nSim
datasets generated from zero-inflated negative binomial (ZINB) regression model.
Each dataset is generated from zero-inflated negative binomial mixed effects regression model with only one covariate: genotype. That is, the read counts of a gene follows a mixture of 2-component distributions. One component takes only one value: zero. The other component is negative binomial distribution, which takes non-negative values 0, 1, 2, .... The log mean of the negative binomial distribution is linear function of the genotype.
For each dataset, the p-value for testing if the slope for genotype is equal to zero will be calculated.
The proportion of p-values \(< \alpha\) is the estimated power, where \(\alpha = FWER/nTests\).
Each simulated dataset contains gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells. The gene expression levels (read counts) follow zero-inflated negative binomial distribution. Denote \(Y_{ij}\) as the read counts for the \(j\)-th cell of the \(i\)-th subject, \(i=1,\ldots, n\), \(j=1,\ldots, m\), \(n\) is the number of subjects, and \(m\) is the number of cells per subject. Denote \(p\) as the probability that \(Y_{ij}=0\) is an excess zero. With probability \(1-p\), \(Y_{ij}\) follows a negative binomial distribution \(NB(\mu, \theta)\), where \(\mu\) is the mean (i.e., \(\mu=E(Y_{ij})\)) and \(\theta\) is the dispersion parameter. The variance of the NB distribution is \(\mu + \mu^2/\theta\). The relationship between gene expression and genotype for the \(i\)-th subject is characterized by the equation $$\mu_i = \exp(\beta_{0i} + \beta_{1} x_i),$$ where \(\beta_{0i}\) is the random intercept following a normal distribution \(N(\beta_0, \sigma^2)\) to account for within-subject correlation of gene expression, \(\beta_0\) is the mean of the random intercept, \(\sigma\) is the standard deviation of the random intercept, \(\beta_1\) is the slope, and \(x_i\) is the additive-coded genotype for the SNP with minor allele frequency \(MAF\).
We assume that the SNP satisfies the Hardy-Weinberg Equilibrium. That is, the probabilities of the 3 genotypes \((0, 1, 2)\) are \((1-MAF)^2, 2 MAF (1-MAF), MAF^2\), respectively.
For simplicity, we assume that excess zeros are caused by technical issues, hence are not related to genotypes.
Dong X, Li X, Chang T-W, Scherzer CR, Weiss ST, and Qiu W. powerEQTL: An R package and shiny application for sample size and power calculation of bulk tissue and single-cell eQTL analysis. Bioinformatics, 2021;, btab385
# NOT RUN {
nSubj = 10
nCellPerSubj = 10
# calculate power
power = powerEQTL.scRNAseq.sim(
slope = 1.62, # slope
n = nSubj, # total number of subjects
m = nCellPerSubj, # number of cells per subject
power = NULL, # power to be estimated
m.int = -1, # mean of random intercept
sigma.int = 1, # SD of the random intercept
zero.p = 0.01, # probability that an excess zero occurs
theta = 1, # dispersion parameter of NB distribution NB(mu, theta)
MAF = 0.45,
FWER = 0.05,
nTests = 1,
nSim = 5, # number of simulations
estMethod = "GLMMadaptive", # parameter estimation method for ZINB
nCores = 1 # number of computer cores used by 'mclapply'
)
print(power)
# }
Run the code above in your browser using DataLab