deconvolute: Deconvolute bulk RNA-Seq using single-cell RNA-Seq signature

Description

Deconvolution of bulk RNA-Seq using a vector projection method with adjustable compensation for spillover.
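
As a rough illustration of the idea (not the package's internal code), the sketch below projects a toy bulk sample onto each column of a small signature matrix, builds a spillover matrix by projecting the signatures onto each other, and blends the inverse of that matrix with the identity according to a compensation amount. The objects sig and bulk, the helpers project() and mix_comp(), and the blending rule itself are assumptions made for illustration.

```r
# Toy signature: 4 genes x 2 cell subclasses (hypothetical values)
sig <- matrix(c(10, 8, 1, 0,
                 2, 0, 9, 7),
              nrow = 4,
              dimnames = list(paste0("gene", 1:4), c("A", "B")))

# Scalar projection of each column of x onto each signature column,
# in units of that signature: (s . x) / (s . s)
project <- function(x, sig) {
  crossprod(sig, x) / colSums(sig^2)
}

# Bulk sample built from 3 parts A and 1 part B
bulk <- sig %*% c(A = 3, B = 1)

raw   <- project(bulk, sig)   # uncompensated output, inflated by overlap
spill <- project(sig, sig)    # spill[i, j]: signal of signature j seen by i

# Assumed blending rule: comp_amount = 1 gives full compensation,
# comp_amount = 0 leaves the raw projection unchanged
mix_comp <- function(spill, comp_amount) {
  comp_amount * solve(spill) + (1 - comp_amount) * diag(nrow(spill))
}

full <- mix_comp(spill, 1) %*% raw   # recovers the mixing amounts 3 and 1
none <- mix_comp(spill, 0) %*% raw   # identical to raw
```

In this toy case full compensation recovers the known mixture exactly because the bulk sample is noise-free; with real data the result is only an estimate.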

Usage

deconvolute(
  mk,
  test,
  logged_bulk = FALSE,
  count_space = TRUE,
  comp_amount = 1,
  group_comp_amount = 0,
  weights = NULL,
  weight_method = "equal",
  adjust_comp = TRUE,
  use_filter = TRUE,
  arith_mean = FALSE,
  convert_bulk = FALSE,
  check_comp = FALSE,
  npass = 1,
  outlier_method = c("var.e", "cooks", "rstudent"),
  outlier_cutoff = switch(outlier_method, var.e = 4, cooks = 1, rstudent = 10),
  outlier_quantile = 0.9,
  verbose = TRUE,
  cores = 1L
)
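
The default for outlier_cutoff depends on outlier_method. The sketch below shows how such a default can resolve through R's lazy argument evaluation, assuming the method is resolved with match.arg() before the cutoff is first used. pick_cutoff() is a hypothetical stand-in, not the package source.

```r
# Hypothetical stand-in for the argument handling, not the package source
pick_cutoff <- function(outlier_method = c("var.e", "cooks", "rstudent"),
                        outlier_cutoff = switch(outlier_method,
                                                var.e = 4,
                                                cooks = 1,
                                                rstudent = 10)) {
  # Resolve the method first; the default cutoff is evaluated only when used
  outlier_method <- match.arg(outlier_method)
  list(method = outlier_method, cutoff = outlier_cutoff)
}

pick_cutoff()                                 # method "var.e", cutoff 4
pick_cutoff("cooks")                          # method "cooks", cutoff 1
pick_cutoff("rstudent", outlier_cutoff = 3)   # explicit cutoff overrides default
```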

Value

A list object of S3 class 'deconv' containing:

call

the matched call

mk

the original 'cellMarkers' class object

subclass

list object containing:

output, the amount of each subclass based purely on projected gene expression
percent, the proportion of each subclass scaled as a percentage so that the total amount across all subclasses adds to 100%
spillover, the spillover matrix
compensation, the mixed final compensation matrix which incorporates comp_amount
rawcomp, the original unadjusted compensation matrix
comp_amount, the final values for the amount of compensation across each cell subclass after adjustment to prevent negative values
residuals, the residuals, i.e. gene expression minus fitted values
var.e, variance of weighted residuals for each gene
weights, vector of weights
resvar, \(s^2\), the estimate of the gene expression variance for each sample
se, standard errors of cell counts
hat, diagonal elements of the hat matrix
removed, vector of outlying genes removed during successive passes
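
The relationship between output and percent can be sketched in base R as follows. The matrix values and the orientation (samples in rows, cell subclasses in columns) are assumptions for illustration.

```r
# Hypothetical output matrix: samples in rows, cell subclasses in columns
output <- matrix(c(30, 50, 20,
                    5, 10, 35),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("s1", "s2"), c("A", "B", "C")))

# Scale each sample so that its subclasses add up to 100%
percent <- output / rowSums(output) * 100
rowSums(percent)
```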

group

similar list object to subclass, but with results for the cell group analysis.

nest_output

alternative matrix of cell output results for each subclass adjusted so that the cell outputs across subclasses are nested as a proportion of cell group outputs.

nest_percent

alternative matrix of cell proportion results for each subclass adjusted so that the percentages across subclasses are nested within cell group percentages. The total percentage still adds to 100%.
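
One way to picture the nesting, shown for a single sample: subclass percentages are rescaled within each cell group so that they add up to that group's percentage. This is a sketch of one plausible reading of the description above, not the package's code; all names and values are hypothetical.

```r
# Hypothetical results for one sample
subclass_pct <- c(T_CD4 = 30, T_CD8 = 30, B_naive = 25, B_mem = 15)
group_of     <- c(T_CD4 = "T", T_CD8 = "T", B_naive = "B", B_mem = "B")
group_pct    <- c(T = 50, B = 50)

# Fraction of each subclass within its own group
within <- subclass_pct / ave(subclass_pct, group_of, FUN = sum)

# Rescale so that each group's subclasses sum to the group percentage
nest_pct <- within * group_pct[group_of]

sum(nest_pct)                    # total is still 100
tapply(nest_pct, group_of, sum)  # matches group_pct for each group
```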

comp_amount

original argument comp_amount

comp_check

optional list element returned when check_comp = TRUE

Arguments

mk: object of class 'cellMarkers'. See cellMarkers().
test: matrix of bulk RNA-Seq to be deconvoluted with genes in rows and samples in columns. We recommend raw counts as input, but normalised data can be provided, in which case set logged_bulk = TRUE.
logged_bulk: Logical, whether log2 transformed bulk RNA-Seq data is used as input in test.
count_space: Logical, whether deconvolution is performed in count space (as opposed to log2 space). Signature and test revert to count scale by 2^ exponentiation during deconvolution.
comp_amount: either a single value from 0-1 for the amount of compensation or a numeric vector with the same length as the number of cell subclasses to deconvolute.
group_comp_amount: either a single value from 0-1 for the amount of compensation for cell group analysis or a numeric vector with the same length as the number of cell groups to deconvolute.
weights: Optional vector of weights which affects how much each gene in the gene signature matrix affects the deconvolution.
weight_method: Optional. Choices include "none" or "equal". With "equal", gene weights are calculated so that each gene has equal weighting in the vector projection; "equal" overrules any vector supplied by weights.
adjust_comp: logical, whether to optimise comp_amount to prevent negative cell proportion projections.
use_filter: logical, whether to use denoised signature matrix.
arith_mean: logical, whether to use arithmetic means (if available) for signature matrix. Mainly useful with pseudo-bulk simulation.
convert_bulk: either "ref" to convert bulk RNA-Seq to scRNA-Seq scaling using reference data or "qqmap" using quantile mapping of the bulk to scRNA-Seq datasets, or "none" (or FALSE) for no conversion.
check_comp: logical, whether to analyse compensation values across subclasses. See plot_comp().
npass: Number of passes. Setting npass to 2 or more activates removal of genes with excess variance of the residuals.
outlier_method: Method for identifying outlying genes. Options are to use the variance of the residuals for each gene, Cook's distance or absolute Studentized residuals (see details).
outlier_cutoff: Cutoff for removing genes which are outliers based on the method selected by outlier_method.
outlier_quantile: Controls the quantile used for the cutoff when identifying outliers with outlier_method = "cooks" or "rstudent".
verbose: logical, whether to show messages.
cores: Number of cores for parallelisation via parallel::mclapply().

Author

Myles Lewis

Details

Equal weighting of genes by setting weight_method = "equal" can help deconvolution of subclusters whose signature genes have low expression. It is enabled by default.

If a normalised (i.e. logged) bulk matrix is provided instead of raw counts, then it is important that zero expression is true zero. For this reason we do not recommend the use of VST (variance stabilising transformation) counts, which have a variable offset.
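
The point about true zero can be illustrated with a small base R sketch. It assumes a log2(x + 1) transform, undone by 2^x - 1; the constant shift of 3 is only a crude stand-in for an offset on the log scale, since real VST output is not a simple shift.

```r
counts <- c(0, 1, 7, 1023)

# log2(x + 1) keeps zero counts at exactly zero and is undone by 2^x - 1
logged <- log2(counts + 1)
back   <- 2^logged - 1

# With an offset on the log scale, zero counts no longer map back to zero
shifted      <- logged + 3
back_shifted <- 2^shifted - 1

logged[1]        # 0: zero expression is true zero
back_shifted[1]  # 7: a zero count now looks like expression
```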

Multipass deconvolution can be activated by setting npass to 2 or higher. This is designed to remove genes which behave inconsistently due to noise in either the sc or bulk datasets, which becomes increasingly likely with a larger signature gene set, i.e. if nsubclass is large. It may also be worth trying if you receive the warning message "Detected genes with extreme residuals". Three methods are available for identifying outlier genes (i.e. genes whose residuals are too noisy), controlled by outlier_method:

var.e, this calculates the variance of the residuals across samples for each gene. Genes whose residual variance is an outlier based on Z-score standardisation are removed during successive passes.
cooks, this considers the deconvolution as if it were a regression and applies Cook's distance to the residuals and the hat matrix. This seems to be the most stringent method (removes the fewest genes).
rstudent, externally Studentized residuals are used.
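
The var.e approach can be sketched in base R as follows. This is a minimal illustration on simulated residuals, not the package's implementation; the matrix res, the injected noisy gene and the use of unweighted residuals are all assumptions.

```r
set.seed(1)

# Hypothetical residual matrix: 200 genes x 12 samples
res <- matrix(rnorm(200 * 12), nrow = 200,
              dimnames = list(paste0("gene", 1:200), NULL))
res["gene7", ] <- res["gene7", ] * 8     # make one gene much noisier

var_e <- apply(res, 1, var)              # variance of residuals per gene
z     <- (var_e - mean(var_e)) / sd(var_e)   # Z-score standardisation

outlier_cutoff <- 4                      # default cutoff for "var.e"
outliers <- names(z)[z > outlier_cutoff]
outliers
```

In a multipass run, genes flagged this way would be dropped from the signature before the next pass.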

The cutoff specified by outlier_cutoff, which determines which genes are outliers, is very sensitive to the outlier method. With var.e the variances are Z-score scaled. With Cook's distance it is typical to consider a value of >1 as a fairly strong indication of an outlier, while 0.5 is considered a possible outlier. Studentized residuals are expected to follow a t distribution. However, since gene expression itself does not derive from a normal distribution, the errors and residuals are not normally distributed either, which probably explains the need for a very high cutoff. In practice the choice of settings seems to be dataset dependent.
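
To see how the regression-style diagnostics behave, the sketch below treats one simulated bulk sample as a no-intercept regression on a toy signature matrix, with genes as observations, and computes Cook's distance, externally Studentized residuals and hat values using base R's stats functions. It illustrates the diagnostics only and is not the package's code; the data and the injected outlier are hypothetical.

```r
set.seed(42)
n_genes <- 60

# Toy signature (genes x subclasses) and one simulated bulk sample
sig <- cbind(A = rexp(n_genes, 1 / 50), B = rexp(n_genes, 1 / 50))
rownames(sig) <- paste0("gene", seq_len(n_genes))
bulk <- drop(sig %*% c(3, 1)) + rnorm(n_genes, sd = 5)
bulk["gene10"] <- bulk["gene10"] + 2000   # inject one gross outlier

fit <- lm(bulk ~ 0 + sig)    # no intercept: genes are the observations

cd <- cooks.distance(fit)    # Cook's distance per gene
rs <- abs(rstudent(fit))     # absolute externally Studentized residuals
h  <- hatvalues(fit)         # diagonal of the hat matrix

rownames(sig)[rs > 10]       # genes beyond the default "rstudent" cutoff
rownames(sig)[cd > 1]        # genes beyond the default "cooks" cutoff
```

Cook's distance combines the size of the residual with the leverage of the gene, so a gene with a large residual but low leverage may still fall below the cutoff of 1, whereas the Studentized residual depends far less on leverage.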
