powerEQTL.scRNAseq.sim: Power Calculation for Association Between Genotype and Gene Expression Based on Single Cell RNAseq Data via Simulation from ZINB Mixed Effects Regression Model

Description

Power calculation for association between genotype and gene expression based on single cell RNAseq data via simulation from ZINB mixed effects regression model. This function can be used to calculate one of the 4 parameters (power, sample size, minimum detectable slope, and minimum allowable MAF) by setting the corresponding parameter as NULL and providing values for the other 3 parameters.

Usage

powerEQTL.scRNAseq.sim(slope, 
		       n, 
		       m, 
                       power = NULL,
		       m.int = -1, 
		       sigma.int = 1, 
		       zero.p = 0.1, 
		       theta = 1, 
		       MAF = 0.2, 
		       FWER = 0.05, 
		       nTests = 1, 
		       nSim = 1000, 
		       estMethod = "GLMMadaptive", 
		       nCores = 1,
                       n.lower = 2.01,
                       n.upper = 1e+4,
                       slope.lower =  1e-6,
                       slope.upper = log(1.0e+6),
                       MAF.lower = 0.05,
                       MAF.upper = 0.49
)

Arguments

slope

numeric. Slope (see details).

integer. Total number of subjects.

integer. Number of cells per subject.

power

numeric. Power for testing if the slope is equal to zero.

m.int

numeric. Mean of random intercept (see details).

sigma.int

numeric. Standard deviation of random intercept (see details).

zero.p

numeric. Probability that an excess zero occurs.

theta

numeric. Dispersion parameter of negative binomial distribution. The smaller theta is, the larger variance of NB random variable is.

MAF

numeric. Minor allele frequency of the SNP.

FWER

numeric. Family-wise Type I error rate.

nTests

integer. Number of tests (i.e., number of all (SNP, gene) pairs) in eQTL analysis.

nSim

integer. Number of simulated datasets to be generated.

estMethod

character. Indicates which method would be used to fit zero inflated negative binomial mixed effects model. Currently, the possible choice is “GLMMadaptive”.

nCores

integer. Number of computer cores used by mclapply for parallel computing. For Windows, nCores=1.

n.lower

numeric. Lower bound of the total number of subjects. Only used when calculating total number of subjects.

n.upper

numeric. Upper bound of the total number of subjects. Only used when calculating total number of subjects.

slope.lower

numeric. Lower bound of the slope. Only used when calculating minimum slope.

slope.upper

numeric. Upper bound of the slope Only used when calculating minimum slope.

MAF.lower

numeric. Lower bound of the MAF. Only used when calculating minimum MAF.

MAF.upper

numeric. Upper bound of the MAF Only used when calculating minimum MAF.

Value

power if the input parameter power = NULL.

sample size (total number of subjects) if the input parameter n = NULL;

minimum detectable slope if the input parameter slope = NULL;

minimum allowable MAF if the input parameter MAF = NULL.

Details

This function calculates the power for testing if genotypes of a SNP is associated with the expression of a gene via nSim datasets generated from zero-inflated negative binomial (ZINB) regression model.

Each dataset is generated from zero-inflated negative binomial mixed effects regression model with only one covariate: genotype. That is, the read counts of a gene follows a mixture of 2-component distributions. One component takes only one value: zero. The other component is negative binomial distribution, which takes non-negative values 0, 1, 2, .... The log mean of the negative binomial distribution is linear function of the genotype.

For each dataset, the p-value for testing if the slope for genotype is equal to zero will be calculated.

The proportion of p-values $< \alpha$ is the estimated power, where $\alpha = FWER/nTests$.

Each simulated dataset contains gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells. The gene expression levels (read counts) follow zero-inflated negative binomial distribution. Denote $Y_{ij}$ as the read counts for the $j$-th cell of the $i$-th subject, $i=1,\ldots, n$, $j=1,\ldots, m$, $n$ is the number of subjects, and $m$ is the number of cells per subject. Denote $p$ as the probability that $Y_{ij}=0$ is an excess zero. With probability $1-p$, $Y_{ij}$ follows a negative binomial distribution $NB(\mu, \theta)$, where $\mu$ is the mean (i.e., $\mu=E(Y_{ij})$) and $\theta$ is the dispersion parameter. The variance of the NB distribution is $\mu + \mu^2/\theta$. The relationship between gene expression and genotype for the $i$-th subject is characterized by the equation $$\mu_i = \exp(\beta_{0i} + \beta_{1} x_i),$$ where $\beta_{0i}$ is the random intercept following a normal distribution $N(\beta_0, \sigma^2)$ to account for within-subject correlation of gene expression, $\beta_0$ is the mean of the random intercept, $\sigma$ is the standard deviation of the random intercept, $\beta_1$ is the slope, and $x_i$ is the additive-coded genotype for the SNP with minor allele frequency $MAF$.

We assume that the SNP satisfies the Hardy-Weinberg Equilibrium. That is, the probabilities of the 3 genotypes $(0, 1, 2)$ are $(1-MAF)^2, 2 MAF (1-MAF), MAF^2$, respectively.

For simplicity, we assume that excess zeros are caused by technical issues, hence are not related to genotypes.

References

Dong X, Li X, Chang T-W, Scherzer CR, Weiss ST, and Qiu W. powerEQTL: An R package and shiny application for sample size and power calculation of bulk tissue and single-cell eQTL analysis. Bioinformatics, 2021;, btab385

Examples

Run this code

# NOT RUN {
nSubj = 10
nCellPerSubj = 10

# calculate power
power = powerEQTL.scRNAseq.sim(
  slope = 1.62, # slope
  n = nSubj, # total number of subjects
  m = nCellPerSubj, # number of cells per subject 
  power = NULL, # power to be estimated
  m.int = -1, # mean of random intercept
  sigma.int = 1, # SD of the random intercept
  zero.p = 0.01, # probability that an excess zero occurs
  theta = 1, # dispersion parameter of NB distribution NB(mu, theta)
  MAF = 0.45, 
  FWER = 0.05,
  nTests = 1,
  nSim  =  5, # number of simulations
  estMethod = "GLMMadaptive", # parameter estimation method for ZINB
  nCores = 1 # number of computer cores used by 'mclapply'
)

print(power)


# }

Run the code above in your browser using DataLab