vst: Variance stabilizing transformation for UMI count data

Description

Apply variance stabilizing transformation to UMI count data using a regularized Negative Binomial regression model. This removes unwanted effects from UMI data and returns Pearson residuals. The function uses future_lapply; to run it on n cores, set plan(strategy = "multicore", workers = n) before calling it. If n_genes is set, only a (somewhat random) subset of genes is used for estimating the initial model parameters.
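As a rough sketch of the parallel setup described above, assuming the future package is installed: the worker count of 2 is an arbitrary choice, and "multisession" is swapped in for "multicore" here because forked processing is not supported on Windows.

```r
# Sketch: choose a parallel backend before calling vst().
# Assumption: the 'future' package is installed; n_workers is arbitrary.
n_workers <- 2

if (requireNamespace("future", quietly = TRUE)) {
  # "multisession" starts background R sessions and works on all platforms;
  # "multicore" relies on forking, which is not supported on Windows.
  old_plan <- future::plan("multisession", workers = n_workers)

  # ... call vst() here; its future_lapply calls use the active plan ...

  # Restore the plan that was active before.
  future::plan(old_plan)
}
```

Restoring the previous plan afterwards keeps the change local to the transformation step.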

Usage

vst(umi, cell_attr = NULL, latent_var = c("log_umi"),
  batch_var = NULL, latent_var_nonreg = NULL, n_genes = 2000,
  n_cells = NULL, method = "poisson", do_regularize = TRUE,
  res_clip_range = c(-sqrt(ncol(umi)), sqrt(ncol(umi))),
  bin_size = 256, min_cells = 5, residual_type = "pearson",
  return_cell_attr = FALSE, return_gene_attr = TRUE,
  return_corrected_umi = FALSE, min_variance = -Inf, bw_adjust = 3,
  gmean_eps = 1, theta_given = NULL, show_progress = TRUE)

Arguments

umi

A matrix of UMI counts with genes as rows and cells as columns

cell_attr

A data frame containing the independent variables (covariates); if omitted, a data frame with umi and gene will be generated

latent_var

The independent variables to regress out as a character vector; must match column names in cell_attr; default is c("log_umi")

batch_var

The variable indicating which batch a cell belongs to; no batch interaction terms are used if omitted

latent_var_nonreg

The non-regularized independent variables to regress out as a character vector; must match column names in cell_attr; default is NULL

n_genes

Number of genes to use when estimating parameters (default uses 2000 genes, set to NULL to use all genes)

n_cells

Number of cells to use when estimating parameters (default uses all cells)

method

Method to use for initial parameter estimation; one of 'poisson', 'nb_fast', 'nb', 'nb_theta_given'

do_regularize

Boolean that, if set to FALSE, will bypass parameter regularization and use all genes in first step (ignoring n_genes).

res_clip_range

Numeric vector of length two specifying the min and max values the residuals will be clipped to; default is c(-sqrt(ncol(umi)), sqrt(ncol(umi)))

bin_size

Number of genes to put in each bin (to show progress)

min_cells

Only use genes that have been detected in at least this many cells; default is 5

residual_type

What type of residuals to return; can be 'pearson', 'deviance', or 'none'; default is 'pearson'

return_cell_attr

Make cell attributes part of the output; default is FALSE

return_gene_attr

Calculate gene attributes and make part of output; default is TRUE

return_corrected_umi

If set to TRUE output will contain corrected UMI matrix; see correct function

min_variance

Lower bound for the estimated variance for any gene in any cell when calculating Pearson residuals; default is -Inf

bw_adjust

Kernel bandwidth adjustment factor used during regularization; factor will be applied to output of bw.SJ; default is 3

gmean_eps

Small value added when calculating geometric mean of a gene to avoid log(0); default is 1

theta_given

Named numeric vector of fixed theta values for the genes; will only be used if method is set to nb_theta_given; default is NULL

show_progress

Whether to print messages and show progress bar
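To make the input arguments more concrete, the sketch below builds a toy umi matrix and a matching cell_attr data frame by hand, and shows what gmean_eps and min_cells guard against. The simulated counts, the log10 scaling of log_umi, and the helper gene_gmean are assumptions made for illustration; they are not taken from the package source.

```r
set.seed(1)

# Toy UMI matrix: 6 genes (rows) x 8 cells (columns).
umi <- matrix(rpois(48, lambda = 2), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("cell", 1:8)))

# Hand-built cell attributes; column names must match latent_var.
# Assumption: log_umi is the log10 of each cell's total UMI count.
cell_attr <- data.frame(
  umi     = colSums(umi),         # total UMIs per cell
  gene    = colSums(umi > 0),     # genes detected per cell
  log_umi = log10(colSums(umi)),
  row.names = colnames(umi)
)

# Hypothetical helper: per-gene geometric mean with a pseudocount eps,
# so that zero counts do not produce log(0) = -Inf.
gene_gmean <- function(x, eps = 1) {
  exp(rowMeans(log(x + eps))) - eps
}
gmean <- gene_gmean(umi, eps = 1)

# min_cells-style filter: keep genes detected in at least 5 cells.
keep <- rowSums(umi > 0) >= 5
```

A data frame built this way could then be passed as cell_attr, with latent_var naming the columns to regress out.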

Value

A list with components

y

Matrix of transformed data, i.e. Pearson or deviance residuals; empty if residual_type = 'none'

umi_corrected

Matrix of corrected UMI counts (optional)

model_str

Character representation of the model formula

model_pars

Matrix of estimated model parameters per gene (theta and regression coefficients)

model_pars_outliers

Vector indicating whether a gene was considered to be an outlier

model_pars_fit

Matrix of fitted / regularized model parameters

model_str_nonreg

Character representation of model for non-regularized variables

model_pars_nonreg

Model parameters for non-regularized variables

genes_log_gmean_step1

log-geometric mean of genes used in initial step of parameter estimation

cells_step1

Cells used in initial step of parameter estimation

arguments

List of function call arguments

cell_attr

Data frame of cell meta data (optional)

gene_attr

Data frame with gene attributes such as mean, detection rate, etc. (optional)
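The transformed-data matrix in the returned list holds residuals from the per-gene model. As a rough sketch of how min_variance and res_clip_range could act on Pearson residuals, assuming the usual negative binomial variance mu + mu^2 / theta: the function pearson_sketch and all input values are hypothetical and do not reproduce the package's internal code.

```r
# Hypothetical helper: NB Pearson residuals with a variance floor and clipping.
pearson_sketch <- function(y, mu, theta, min_var = -Inf,
                           clip = c(-Inf, Inf)) {
  model_var <- mu + mu^2 / theta           # negative binomial variance
  model_var <- pmax(model_var, min_var)    # lower bound, as in min_variance
  res <- (y - mu) / sqrt(model_var)
  pmin(pmax(res, clip[1]), clip[2])        # clip to [min, max]
}

n_cells <- 100
clip_range <- c(-sqrt(n_cells), sqrt(n_cells))  # mirrors the default range

y  <- c(0, 3, 500)   # observed counts for one gene in three cells
mu <- c(1, 2, 4)     # expected counts under the model
res <- pearson_sketch(y, mu, theta = 10, clip = clip_range)
res
```

The third cell has an extreme count, so its residual is capped at the upper clipping bound instead of dominating downstream analyses.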

Details

In the first step of the algorithm, per-gene glm model parameters are learned. This step can be done on a subset of genes and/or cells to speed things up. If method is set to 'poisson', glm will be called with family = poisson and the negative binomial theta parameter will be estimated using the response residuals in MASS::theta.ml. If method is set to 'nb_fast', glm coefficients and theta are estimated as in the 'poisson' method, but coefficients are then re-estimated using a proper negative binomial model in a second call to glm with family = MASS::negative.binomial(theta = theta). If method is set to 'nb', coefficients and theta are estimated by a single call to MASS::glm.nb.
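The three estimation strategies can be illustrated for a single gene with plain glm and MASS calls. This is a minimal sketch on simulated data, not the package's internal code: the covariate values, the true coefficients, and the true theta of 5 are arbitrary assumptions.

```r
library(MASS)  # provides theta.ml, negative.binomial, glm.nb

set.seed(42)
n <- 500
log_umi <- rnorm(n, mean = 3.5, sd = 0.3)   # simulated per-cell covariate
mu_true <- exp(-7 + 2.3 * log_umi)          # log link, arbitrary coefficients
y <- rnbinom(n, mu = mu_true, size = 5)     # counts for one simulated gene

# 'poisson': Poisson glm for the coefficients, then theta by maximum
# likelihood from the fitted means.
fit_pois   <- glm(y ~ log_umi, family = poisson)
theta_pois <- as.numeric(theta.ml(y = y, mu = fitted(fit_pois)))

# 'nb_fast': keep that theta, re-estimate coefficients with an NB family.
fit_fast <- glm(y ~ log_umi, family = negative.binomial(theta = theta_pois))

# 'nb': coefficients and theta from a single glm.nb call.
fit_nb <- glm.nb(y ~ log_umi)

# Side-by-side comparison of the estimates.
est <- rbind(
  poisson = c(coef(fit_pois), theta = theta_pois),
  nb_fast = c(coef(fit_fast), theta = theta_pois),
  nb      = c(coef(fit_nb),   theta = fit_nb$theta)
)
print(est)
```

On data like this the three approaches tend to give similar coefficients; the practical difference is cost, since the Poisson fit is the cheapest and glm.nb iterates between coefficient and theta updates.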

Examples

# NOT RUN {
vst_out <- vst(pbmc)
# }