CorrectDropout: Correct for experimental/bioinformatic dropout of labeled RNA.

Description

Estimates and corrects for dropout using the strategy described here, which is similar to the one originally presented in Berg et al. 2024.

Usage

CorrectDropout(
  obj,
  strategy = c("grandR", "bakR"),
  grouping_factors = NULL,
  features = NULL,
  populations = NULL,
  fraction_design = NULL,
  repeatID = NULL,
  exactMatch = TRUE,
  read_cutoff = 25,
  dropout_cutoff = 5,
  ...
)

Value

An EZbakRData object with the specified "fractions" table replaced with a dropout-corrected table.

Arguments

obj

An EZbakRFractions object, which is an EZbakRData object on which you have run EstimateFractions().

strategy

Which dropout correction strategy to use. Options are:

grandR: Described here. Cite that work and grandR if using this strategy. Quasi-non-parametric strategy that finds an estimate of the dropout rate that eliminates any linear correlation between the newness of a transcript and the difference in +s4U and -s4U normalized read counts.
bakR: Described here. Uses a simple generative model of dropout to derive a likelihood function, and the dropout rate is estimated via the method of maximum likelihood.

The "bakR" strategy has the advantage of being model-derived, making it possible to assess model fit and thus whether the simple assumptions of both the "bakR" and "grandR" dropout models are met. The "grandR" strategy has the advantage of being more robust. Thus, the "grandR" strategy is currently used by default.

grouping_factors

Which sample-detail columns in the metadf should be used to group -s4U samples when calculating the average -s4U RPM. The default value of NULL will cause all sample-detail columns to be used.

features

Character vector of the features by which reads were stratified when estimating the proportions of each RNA population. It is used to identify the fractions table to correct. The default of NULL expects there to be only one fractions table in the EZbakRFractions object.

populations

Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment.

fraction_design

"Design matrix" specifying which RNA populations exist in your samples. By default, this will be created automatically and will assume that all combinations of the mutrate_populations you have requested to analyze are present in your data. If this is not the case for your data, then you will have to create one manually. See docs for EstimateFractions (run ?EstimateFractions()) for more details.

repeatID

If multiple fractions tables exist with the same metadata, then this is the numerical index by which they are distinguished.

exactMatch

If TRUE, then features must exactly match the features metadata for a given fractions table for it to be used. This means that, by default, you cannot specify a subset of features. Set this to FALSE if you would like to specify a feature subset.

read_cutoff

Minimum number of reads for a feature to be used to fit the dropout model.

dropout_cutoff

Maximum ratio of -s4U:+s4U RPMs for a feature to be used to fit the dropout model (i.e., simple outlier filtering cutoff).

...

Parameters passed to internal calculate_dropout() function; namely dropout_cutoff_min, which sets the minimum dropout value used for fitting the dropout model.
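
As a rough illustration of how read_cutoff and dropout_cutoff restrict the set of features used to fit the dropout model, the sketch below applies both filters to a toy table in base R. The column names (reads, rpm_minus, rpm_plus) and the inclusive comparisons are assumptions made for this example only; they are not taken from EZbakR's internals, which may apply the cutoffs differently.

```r
# Toy per-feature summary; all column names here are hypothetical.
features_tbl <- data.frame(
  feature   = c("geneA", "geneB", "geneC", "geneD"),
  reads     = c(120, 10, 300, 80),  # read counts used for the coverage filter
  rpm_minus = c(50, 4, 90, 60),     # average -s4U RPM
  rpm_plus  = c(40, 3, 75, 5),      # +s4U RPM
  stringsAsFactors = FALSE
)

read_cutoff    <- 25  # default from the Usage section
dropout_cutoff <- 5   # default from the Usage section

# Ratio of -s4U to +s4U RPMs, the quantity that dropout_cutoff caps
features_tbl$dropout <- features_tbl$rpm_minus / features_tbl$rpm_plus

# Keep well-covered features whose ratio is not an extreme outlier
# (assumption: both comparisons are inclusive)
keep    <- features_tbl$reads >= read_cutoff &
           features_tbl$dropout <= dropout_cutoff
fit_set <- features_tbl[keep, ]
```

In this toy table, geneB is excluded for low coverage and geneD for an outlying -s4U:+s4U ratio, so only geneA and geneC would inform the fit.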

Details

Dropout is the disproportionate loss of labeled RNA, or of reads from labeled RNA, described independently here and here. It can originate from a combination of bioinformatic (loss of high mutation content reads due to alignment problems), technical (loss of labeled RNA during RNA extraction), and biological (transcriptional shutoff in rare cases caused by metabolic label toxicity) sources.

CorrectDropout() compares label-fed and label-free controls from the same experimental conditions to estimate and correct for this dropout. It assumes that there is a single number (referred to as the dropout rate, or pdo) which describes the rate at which labeled RNA is lost (relative to unlabeled RNA). pdo ranges from 0 (no dropout) to 1 (complete loss of all labeled RNA), and is thus interpreted as the fraction of labeled RNA/reads from labeled RNA disproportionately lost, relative to the equivalent unlabeled species.
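
The single-rate assumption above can be made concrete with a small base R sketch. It is not EZbakR's or grandR's implementation. It only works through what follows if a fraction pdo of labeled RNA is lost while unlabeled RNA is unaffected: how the observed fraction new is distorted, how to invert that distortion, and how pdo could be recovered from noise-free toy data by finding the value that removes any linear trend between corrected newness and the corrected +s4U:-s4U signal. All function and variable names are hypothetical.

```r
# Forward model: observed fraction new when a fraction `pdo` of labeled RNA is lost
apply_dropout <- function(fn_true, pdo) {
  fn_true * (1 - pdo) / (fn_true * (1 - pdo) + (1 - fn_true))
}

# Inverse of the forward model: recover the true fraction new
correct_fn <- function(fn_obs, pdo) {
  fn_obs / (fn_obs + (1 - fn_obs) * (1 - pdo))
}

# Noise-free toy data (assumption: every feature shares the same pdo)
fn_true  <- seq(0.05, 0.95, by = 0.05)
pdo_true <- 0.3
fn_obs   <- apply_dropout(fn_true, pdo_true)

# Under this model the +s4U signal relative to -s4U shrinks to 1 - pdo * fn_true
ratio_obs <- 1 - pdo_true * fn_true

# For a candidate pdo, correct both quantities and measure the remaining trend
slope_after_correction <- function(pdo) {
  fn_corr    <- correct_fn(fn_obs, pdo)
  ratio_corr <- ratio_obs * (fn_obs / (1 - pdo) + (1 - fn_obs))
  unname(coef(lm(log(ratio_corr) ~ fn_corr))[2])
}

# The slope is negative when under-correcting and positive when over-correcting,
# so the root of the slope is the dropout rate estimate
pdo_hat <- uniroot(slope_after_correction, interval = c(0, 0.9), tol = 1e-10)$root
```

Real data are noisy, so an actual estimator needs replicate handling and outlier filtering that this sketch leaves out.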

Examples


# Simulate data to analyze
simdata <- EZSimulate(30)

# Create EZbakR input
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)

# Estimate Fractions
ezbdo <- EstimateFractions(ezbdo)

# Correct for dropout
ezbdo <- CorrectDropout(ezbdo)
