Apply a variance-stabilizing transformation to UMI count data using a regularized negative binomial regression model. This removes unwanted effects from the UMI data and returns Pearson residuals. The function uses future_lapply; to set the number of cores it uses to n, call plan(strategy = "multicore", workers = n). If n_genes is set, only a (somewhat random) subset of genes is used for estimating the initial model parameters.
vst(umi, cell_attr = NULL, latent_var = c("log_umi"),
batch_var = NULL, latent_var_nonreg = NULL, n_genes = 2000,
n_cells = NULL, method = "poisson", do_regularize = TRUE,
res_clip_range = c(-sqrt(ncol(umi)), sqrt(ncol(umi))),
bin_size = 256, min_cells = 5, residual_type = "pearson",
return_cell_attr = FALSE, return_gene_attr = TRUE,
return_corrected_umi = FALSE, min_variance = -Inf, bw_adjust = 3,
gmean_eps = 1, theta_given = NULL, show_progress = TRUE)
umi: A matrix of UMI counts with genes as rows and cells as columns
cell_attr: A data frame containing the independent variables; if omitted, a data frame with umi and gene will be generated
latent_var: The independent variables to regress out as a character vector; must match column names in cell_attr; default is c("log_umi")
batch_var: The independent variable indicating which batch a cell belongs to; no batch interaction terms are used if omitted
latent_var_nonreg: The non-regularized independent variables to regress out as a character vector; must match column names in cell_attr; default is NULL
n_genes: Number of genes to use when estimating parameters (default uses 2000 genes; set to NULL to use all genes)
n_cells: Number of cells to use when estimating parameters (default uses all cells)
method: Method to use for initial parameter estimation; one of 'poisson', 'nb_fast', 'nb', 'nb_theta_given'; default is 'poisson'
do_regularize: Boolean that, if set to FALSE, will bypass parameter regularization and use all genes in the first step (ignoring n_genes)
res_clip_range: Numeric of length two specifying the min and max values the results will be clipped to; default is c(-sqrt(ncol(umi)), sqrt(ncol(umi)))
bin_size: Number of genes to put in each bin (to show progress); default is 256
min_cells: Only use genes that have been detected in at least this many cells; default is 5
residual_type: What type of residuals to return; can be 'pearson', 'deviance', or 'none'; default is 'pearson'
return_cell_attr: Make cell attributes part of the output; default is FALSE
return_gene_attr: Calculate gene attributes and make them part of the output; default is TRUE
return_corrected_umi: If set to TRUE, the output will contain the corrected UMI matrix; see the correct function
min_variance: Lower bound for the estimated variance for any gene in any cell when calculating Pearson residuals; default is -Inf
bw_adjust: Kernel bandwidth adjustment factor used during regularization; the factor will be applied to the output of bw.SJ; default is 3
gmean_eps: Small value added when calculating the geometric mean of a gene to avoid log(0); default is 1
theta_given: Named numeric vector of fixed theta values for the genes; will only be used if method is set to 'nb_theta_given'; default is NULL
show_progress: Whether to print messages and show a progress bar; default is TRUE
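To make the roles of min_variance and res_clip_range more concrete, here is a minimal sketch of how a Pearson residual under a negative binomial model could be computed and then clipped. The helper names pearson_residual_sketch and clip_sketch are hypothetical and written for this illustration only; the sketch assumes the usual negative binomial variance mu + mu^2 / theta and is not claimed to be the package's internal code.

```r
# Sketch only: hypothetical helpers, not functions exported by the package.

# Pearson residual under a negative binomial model with mean mu and
# dispersion theta; min_var acts as a lower bound on the model variance.
pearson_residual_sketch <- function(y, mu, theta, min_var = -Inf) {
  model_var <- mu + mu^2 / theta          # negative binomial variance
  model_var <- pmax(model_var, min_var)   # apply the lower bound
  (y - mu) / sqrt(model_var)
}

# Clip values to a length-two range c(min, max).
clip_sketch <- function(x, range) pmin(pmax(x, range[1]), range[2])

n_cells <- 100                             # assumed number of cells
res <- pearson_residual_sketch(y = c(0, 3, 500), mu = c(1, 3, 2), theta = 10)
res_clipped <- clip_sketch(res, c(-sqrt(n_cells), sqrt(n_cells)))

res
res_clipped
```

In this toy input the third observation is far above its expected value, so its residual is capped at sqrt(n_cells), which mirrors the default clipping range in the usage above.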
A list with components
Matrix of transformed data, i.e. Pearson residuals, or deviance residuals; empty if residual_type = 'none'
Matrix of corrected UMI counts (optional)
Character representation of the model formula
Matrix of estimated model parameters per gene (theta and regression coefficients)
Vector indicating whether a gene was considered to be an outlier
Matrix of fitted / regularized model parameters
Character representation of model for non-regularized variables
Model parameters for non-regularized variables
log-geometric mean of genes used in initial step of parameter estimation
Cells used in initial step of parameter estimation
List of function call arguments
Data frame of cell meta data (optional)
Data frame with gene attributes such as mean, detection rate, etc. (optional)
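The log-geometric mean listed among the components, together with the gmean_eps argument, can be illustrated with a small sketch. The helper row_gmean_sketch is a hypothetical name, and both the form exp(mean(log(x + eps))) - eps and the use of a base-10 logarithm are assumptions made for this example rather than statements about the package internals.

```r
# Sketch only: hypothetical helper; the exact internal formula is an assumption.

# Per-gene geometric mean with a small offset eps to avoid log(0).
row_gmean_sketch <- function(x, eps = 1) {
  exp(rowMeans(log(x + eps))) - eps
}

# Toy UMI matrix: genes as rows, cells as columns.
umi <- matrix(c(0, 1, 3,
                0, 0, 0,
                7, 7, 7),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), NULL))

gm <- row_gmean_sketch(umi, eps = 1)

# Log-geometric mean (base 10 assumed) for genes with a positive geometric mean.
log_gm <- log10(gm[gm > 0])

gm
log_gm
```

A gene with no counts at all (g2 here) has a geometric mean of zero; in practice such genes would already be excluded by the min_cells filter.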
In the first step of the algorithm, per-gene glm model parameters are learned. This step can be done
on a subset of genes and/or cells to speed things up.
If method is set to 'poisson', glm will be called with family = poisson and
the negative binomial theta parameter will be estimated using the response residuals in
MASS::theta.ml.
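As a rough illustration of the 'poisson' method for a single gene, the sketch below fits a Poisson GLM and then passes the fitted means to MASS::theta.ml. The simulated counts, the covariate values, and all variable names are assumptions made for this example; they are not taken from the package internals.

```r
# Sketch only: simulated data for one gene; names and values are illustrative.
set.seed(42)
n_cells <- 500
log_umi <- rnorm(n_cells, mean = 3.5, sd = 0.2)  # assumed log10 depth per cell
mu_true <- exp(-6 + 2.3 * log_umi)               # assumed true mean of the gene
y <- rnbinom(n_cells, size = 5, mu = mu_true)    # overdispersed counts, theta = 5

# Step 1: Poisson regression provides the regression coefficients.
fit_pois <- glm(y ~ log_umi, family = poisson)

# Step 2: maximum likelihood estimate of theta given the Poisson fitted means.
theta_hat <- as.numeric(MASS::theta.ml(y = y, mu = fitted(fit_pois)))

coef(fit_pois)
theta_hat
```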
If method is set to 'nb_fast', glm coefficients and theta are estimated as in the
'poisson' method, but coefficients are then re-estimated using a proper negative binomial
model in a second call to glm with
family = MASS::negative.binomial(theta = theta).
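The 'nb_fast' variant can be sketched for one gene as follows: theta comes from the Poisson fit as before, and the coefficients are then re-estimated with a negative binomial family that holds theta fixed. The simulated data and variable names are assumptions made for illustration only.

```r
# Sketch only: simulated data for one gene; names and values are illustrative.
set.seed(42)
n_cells <- 500
log_umi <- rnorm(n_cells, mean = 3.5, sd = 0.2)  # assumed log10 depth per cell
y <- rnbinom(n_cells, size = 5, mu = exp(-6 + 2.3 * log_umi))

# First pass: Poisson fit and theta estimate, as in the 'poisson' method.
fit_pois <- glm(y ~ log_umi, family = poisson)
theta_hat <- as.numeric(MASS::theta.ml(y = y, mu = fitted(fit_pois)))

# Second pass: re-estimate the coefficients with theta held fixed.
fit_nb <- glm(y ~ log_umi,
              family = MASS::negative.binomial(theta = theta_hat))

rbind(poisson = coef(fit_pois), nb_fast = coef(fit_nb))
```

The two coefficient rows are usually close; the second pass mainly changes how observations are weighted, since the negative binomial variance grows faster with the mean than the Poisson variance.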
If method is set to 'nb', coefficients and theta are estimated by a single call to
MASS::glm.nb.
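For comparison, a minimal sketch of the 'nb' method for one gene: a single call to MASS::glm.nb estimates the coefficients and theta jointly. As in the other sketches, the simulated data and variable names are assumptions made for illustration.

```r
# Sketch only: simulated data for one gene; names and values are illustrative.
set.seed(42)
n_cells <- 500
log_umi <- rnorm(n_cells, mean = 3.5, sd = 0.2)  # assumed log10 depth per cell
y <- rnbinom(n_cells, size = 5, mu = exp(-6 + 2.3 * log_umi))

# Joint estimation of regression coefficients and theta.
fit_glmnb <- MASS::glm.nb(y ~ log_umi)

coef(fit_glmnb)
fit_glmnb$theta
```

This alternates between coefficient and theta updates until convergence, so it is typically the slowest of the three options when repeated over thousands of genes.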
# NOT RUN {
vst_out <- vst(pbmc)
# }