pagoda.varnorm: Normalize gene expression variance relative to transcriptome-wide expectations

Description

Normalizes gene expression magnitudes to ensure that the variance follows chi-squared statistics with respect to its ratio to the transcriptome-wide expectation as determined by local regression on expression magnitude (and optionally gene length). Corrects for batch effects.
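The idea in the description can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the procedure scde actually runs: the data are simulated, the trend is fitted with base R `loess` on log variance versus log mean, and the chi-squared reference is only approximate for count data. All variable names are hypothetical.

```r
# Simplified illustration only; this is NOT the scde implementation.
set.seed(1)
n.genes <- 500
n.cells <- 50
mu <- exp(runif(n.genes, 1, 6))            # hypothetical per-gene mean expression

# Simulated genes x cells count matrix (filled column by column, one cell per column)
x <- matrix(rnbinom(n.genes * n.cells, size = 5, mu = rep(mu, n.cells)),
            nrow = n.genes)

m <- rowMeans(x)                           # observed mean per gene
v <- apply(x, 1, var)                      # observed variance per gene

# Transcriptome-wide expectation: local regression of log variance on log mean
fit <- loess(log(v) ~ log(m))
expected.v <- exp(predict(fit))

# Ratio of observed to expected variance for each gene
ratio <- v / expected.v

# Under a chi-squared reference, (n - 1) * ratio is compared against n - 1
# degrees of freedom; small p-values flag overdispersed genes
p.over <- pchisq(ratio * (n.cells - 1), df = n.cells - 1, lower.tail = FALSE)
```

Genes whose ratio sits well above 1 vary more than the trend predicts for their expression magnitude.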

Usage

pagoda.varnorm(models, counts, batch = NULL, trim = 0, prior = NULL,
  fit.genes = NULL, plot = TRUE, minimize.underdispersion = FALSE,
  n.cores = detectCores(), n.randomizations = 100, weight.k = 0.9,
  verbose = 0, weight.df.power = 1, smooth.df = -1, max.adj.var = 10,
  theta.range = c(0.01, 100), gene.length = NULL)

Arguments

models

model matrix (select a subset of rows to normalize variance within a subset of cells)

counts

read count matrix

batch

measurement batch (optional)

trim

trim value for Winsorization (optional, can be set to 1-3 to reduce the impact of outliers, can be as large as 5 or 10 for datasets with several thousand cells)
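As a rough illustration of what Winsorization does, the sketch below clamps the most extreme values in each row to the nearest retained value. `winsorize.rows` is a hypothetical helper written for this example, and it assumes `trim` is read as a count of extreme values per gene at each end; it is not the routine used internally.

```r
# Hypothetical helper (not part of scde): clamp the `trim` most extreme values
# at each end of every row to the nearest value that is kept.
winsorize.rows <- function(mat, trim) {
  if (trim < 1) return(mat)
  stopifnot(2 * trim < ncol(mat))
  t(apply(mat, 1, function(x) {
    s <- sort(x)
    lo <- s[trim + 1]              # smallest retained value
    hi <- s[length(s) - trim]      # largest retained value
    pmin(pmax(x, lo), hi)
  }))
}

# Toy matrix with one outlier per gene
mat <- rbind(g1 = c(1, 2, 3, 4, 100),
             g2 = c(-50, 5, 6, 7, 8))
w <- winsorize.rows(mat, trim = 1)
```

With `trim = 1`, the outlier 100 in `g1` is pulled down to 4 and the outlier -50 in `g2` is pulled up to 5.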

prior

expression magnitude prior

fit.genes

a vector of gene names which should be used to establish the variance fit (default is NULL: use all genes). This can be used to specify, for instance, a set of spike-in control transcripts such as ERCC.
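One way to build such a vector is to select row names by pattern. The sketch below assumes the spike-in rows carry an "ERCC-" prefix; the toy matrix and gene names are made up for illustration, so check how spike-ins are named in your own data.

```r
# Toy count matrix; gene names and the "ERCC-" prefix are assumptions
set.seed(1)
counts <- matrix(rpois(5 * 4, lambda = 10), nrow = 5,
                 dimnames = list(c("ERCC-00002", "ERCC-00003", "GAPDH", "ACTB", "SOX2"),
                                 paste0("cell", 1:4)))

# Gene names of the spike-in controls
spike.ins <- grep("^ERCC-", rownames(counts), value = TRUE)
```

The resulting character vector could then be supplied as `fit.genes = spike.ins`.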

plot

whether to plot the results

minimize.underdispersion

whether underdispersion should be minimized (can increase sensitivity in datasets with highly complex populations, but cannot be used effectively in datasets where multiple batches are present)

n.cores

number of cores to use

n.randomizations

number of bootstrap sampling rounds to use in estimating average expression magnitude for each gene within the given set of cells

weight.k

k value to use in the final weight matrix

verbose

verbosity level

weight.df.power

power factor to use in determining effective number of degrees of freedom (can be increased for datasets exhibiting particularly high levels of noise at low expression magnitudes)

smooth.df

degrees of freedom to be used in calculating smoothed local regression between coefficient of variation and expression magnitude (and gene length, if provided). Leave at -1 for automated guess.

max.adj.var

maximum value allowed for the estimated adjusted variance (capping of adjusted variance is recommended when scoring pathway overdispersion relative to randomly sampled gene sets)
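The effect of a cap can be pictured with a one-line example. The values below are made up, and `pmin` is used here only to illustrate capping; how the cap is applied internally is not stated on this page.

```r
# Illustrative values only; not output from pagoda.varnorm
arv <- c(geneA = 0.8, geneB = 3.5, geneC = 42.0)
max.adj.var <- 10

# Values above the cap are replaced by the cap, others are unchanged
arv.capped <- pmin(arv, max.adj.var)
```

Capping keeps a handful of extremely variable genes from dominating scores computed over gene sets.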

theta.range

valid theta range (should be the same as was set in the knn.error.models() call)

gene.length

optional vector of gene lengths (corresponding to the rows of counts matrix)
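Because the lengths must correspond to the rows of the counts matrix, it can help to reorder a named vector explicitly before passing it in. The sketch below uses made-up gene names and lengths; `len` stands in for whatever annotation source you have.

```r
# Toy counts matrix with three genes (values are placeholders)
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2")))

# Hypothetical annotation: different order, plus a gene absent from counts
len <- c(geneC = 3000, geneA = 1200, geneD = 800, geneB = 2500)

# Reorder and subset so that entry i matches row i of counts
gene.length <- len[rownames(counts)]

# Fail early if any gene in counts has no annotated length
stopifnot(!anyNA(gene.length))
```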

Value

a list containing the following fields:
- mat: adjusted expression magnitude values
- matw: weight matrix corresponding to the expression matrix
- arv: a vector giving adjusted variance values for each gene
- avmodes: a vector of estimated average expression magnitudes for each gene
- modes: a list of batch-specific average expression magnitudes for each gene
- prior: estimated (or supplied) expression magnitude prior
- edf: estimated effective degrees of freedom
- fit.genes: the fit.genes parameter
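A common first look at the result is to rank genes by adjusted variance. The sketch below uses a mock list that mimics only the `arv` and `mat` fields, with made-up values, so that it runs without the package; a real result would come from pagoda.varnorm().

```r
# Mock object with two of the documented fields; values are made up
varinfo <- list(
  arv = c(geneA = 1.2, geneB = 4.7, geneC = 0.9, geneD = 2.3),
  mat = matrix(0, nrow = 4, ncol = 3,
               dimnames = list(paste0("gene", LETTERS[1:4]), paste0("cell", 1:3)))
)

# Names of the two genes with the highest adjusted variance
top <- names(sort(varinfo$arv, decreasing = TRUE))[1:2]

# Corresponding rows of the adjusted expression matrix
top.mat <- varinfo$mat[top, , drop = FALSE]
```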

Examples

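library(scde)
data(pollen)
# filter the counts matrix (genes and cells)
cd <- clean.counts(pollen)
# build error models based on k-nearest-neighbor cells
knn <- knn.error.models(cd, k = ncol(cd)/4, n.cores = 10, min.count.threshold = 2, min.nonfailed = 5, max.model.plots = 10)
# normalize variance, with Winsorization and a cap on adjusted variance
varinfo <- pagoda.varnorm(knn, counts = cd, trim = 3/ncol(cd), max.adj.var = 5, n.cores = 1, plot = FALSE)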