simDat.eQTL.scRNAseq: Generate Gene Expression Levels Of One Gene And Genotypes Of One SNP For Subjects With Multiple Cells Based On ZINB Mixed Effects Regression Model

Description

Generate gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells based on ZINB mixed effects regression model.

Usage

simDat.eQTL.scRNAseq(nSubj = 50, 
		     nCellPerSubj = 100, 
		     zero.p = 0.01, 
		     m.int = 0, 
		     sigma.int = 1, 
		     slope = 1, 
		     theta = 1, 
		     MAF = 0.45)

Arguments

nSubj

integer. Total number of subjects.

nCellPerSubj

integer. Number of cells per subject.

zero.p

numeric. Probability that an excess zero occurs.

m.int

numeric. Mean of random intercept (see details).

sigma.int

numeric. Standard deviation of random intercept (see details).

slope

numeric. Slope (see details).

theta

numeric. dispersion parameter of negative binomial distribution. The smaller theta is, the larger variance of NB random variable is.

MAF

numeric. Minor allele frequency of the SNP.

Value

A data frame with 3 columns:

subject id

geno

additive-coded genotype of the SNP

counts

gene expression of the gene

Details

This function simulates gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells based on zero-inflated negative binomial (ZINB) regression model with only one covariate: genotype. That is, the read counts of a gene follows a mixture of 2-component distributions. One component takes only one value: zero. The other component is negative binomial distribution, which takes non-negative values 0, 1, 2, .... The log mean of the negative binomial distribution is linear function of the genotype.

Denote $Y_{ij}$ as the read counts for the $j$-th cell of the $i$-th subject, $i=1,\ldots, n$, $j=1,\ldots, m$, $n$ is the number of subjects, and $m$ is the number of cells per subject.

Denote $p$ as the probability that $Y_{ij}=0$ is an excess zero. With probability $1-p$, $Y_{ij}$ follows a negative binomial distribution $NB(\mu, \theta)$, where $\mu$ is the mean (i.e., $\mu=E(Y_{ij})$) and $\theta$ is the dispersion parameter. The variance of the NB distribution is $\mu + \mu^2/\theta$. The relationship between gene expression and genotype for the $i$-th subject is characterized by the equation $$\mu_i = \exp(\beta_{0i} + \beta_{1} x_i),$$ where $\beta_{0i}$ is the random intercept following a normal distribution $N(\beta_0, \sigma^2)$ to account for within-subject correlation of gene expression, $\beta_0$ is the mean of the random intercept, $\sigma$ is the standard deviation of the random intercept, $\beta_1$ is the slope, and $x_i$ is the additive-coded genotype for the SNP with minor allele frequency $MAF$.

We assume that the SNP satisfies the Hardy-Weinberg Equilibrium. That is, the probabilities of the 3 genotypes $(0, 1, 2)$ are $(1-MAF)^2, 2 MAF (1-MAF), MAF^2$, respectively.

For simplicity, we assume that excess zeros are caused by technical issues, hence are not related to genotypes.

References

Dong X, Li X, Chang T-W, Scherzer CR, Weiss ST, and Qiu W. powerEQTL: An R package and shiny application for sample size and power calculation of bulk tissue and single-cell eQTL analysis. Bioinformatics, 2021;, btab385

Examples

Run this code

# NOT RUN {
frame = simDat.eQTL.scRNAseq(nSubj = 5, 
		     nCellPerSubj = 3, 
		     zero.p = 0.01, 
		     m.int = 0, 
		     sigma.int = 1, 
		     slope = 1, 
		     theta = 1, 
		     MAF = 0.45)
print(dim(frame))
print(frame[1:10,])

# }

Run the code above in your browser using DataLab