Learn R Programming

noisyR: Enhancing biological signal in sequencing datasets by characterising random technical noise

High-throughput sequencing enables an unprecedented resolution in transcript quantification, at the cost of magnifying the impact of technical noise. The consistent reduction of random background noise to capture functionally meaningful biological signals is still challenging. Intrinsic sequencing variability introducing low-level expression variations can obscure patterns in downstream analyses.

The noisyR package comprises an end-to-end pipeline for quantifying and removing technical noise from HTS datasets. The three main pipeline steps are [i] similarity calculation across samples, [ii] noise quantification, and [iii] noise removal; each step can be finely tuned using hyperparameters; optimal, data-driven values for these parameters are also determined.

Preprint: https://www.biorxiv.org/content/10.1101/2021.01.17.427026v2

Workflow diagram of the noisyr pipeline

Similarity calculation across samples

For the sample-similarity calculation, two approaches are available:

  1. The count matrix approach uses the original, un-normalised expression matrix, as provided after alignment and feature quantification; each sample is processed individually, only the relative expressions across samples are compared. Relying on the hypothesis that the majority of genes are not DE, most of the evaluations are expected to point towards a high similarity across samples. Choosing from a collection of >45 similarity metrics, users can select a measure to assess the localised consistency in expression across samples. A sliding window approach is used to compare the similarity of ranks or abundances for the selected features between samples. The window length is a hyperparameter, which can be user-defined or inferred from the data.
  2. The transcript approach uses as input the alignment files derived from read-mappers (in BAM format). For each sample and each exon, the point-to-point similarity of expression across the transcript is calculated across samples in a pairwise all-versus-all comparison.

The output formats for the two approaches correspond to their inputs: a denoised count matrix for the count matrix approach, and denoised BAM files for the transcript approach.

Main functions: calculate_expression_similarity_counts(), calculate_expression_similarity_transcript()

Supporting functions: get_methods_correlation_distance(), optimise_window_length(), calculate_expression_profile()

Input preparation functions: cast_matrix_to_numeric(), cast_gtf_to_genes()

Noise quantification

The noise quantification step uses the expression-similarity relation calculated in step i to determine the noise threshold, representing the level below which the gene expression is considered noisy.

For example, if a similarity threshold is used as input then the corresponding expression from a (smoothed) expression-similarity line plot is selected as the noise threshold for each sample. The shape of the distribution can vary across experiments; we provide functionality for different thresholds and recommend the choice of the one that results in the lowest variance in the noise thresholds across samples.

Options for smoothing, or summarising the observations in a box plot and selecting the minimum abundance for which the interquartile range (or median, or 5-95% range) is consistently above the similarity threshold are also available. As a general rule (due to the number of observations), we recommend using the smoothing with the count matrix approach, and the boxplot representation with the transcript option.

Indicative plots of the Pearson correlation calculated on windows of increasing average abundance for the count matrix-based noise removal approach (left) and per exon for the transcript-based noise removal approach (right).

Main function: calculate_noise_threshold_base()

Supporting functions: get_methods_calculate_noise_threshold(), calculate_first_minimum_density()

Visualisation functions: plot_expression_similarity(), calculate_noise_threshold_method_statistics()

Noise removal

The third step uses the noise threshold calculated in step ii to remove noise from the count matrix or BAM files.

  • For the count matrix approach, genes whose expression is below the noise thresholds for every sample are removed and the average noise threshold is calculated and added to every entry. This ensures that the fold-changes observed by downstream analyses are not biased by low expression, while still preserving the structure and relative expression levels in the data.
  • For the transcript approach, genes are removed from the BAM files if the expression of all their exons is below the noise thresholds for every sample. The removal is done at gene level to avoid scenarios that are not biologically possible.

Main functions: remove_noise_from_matrix(), remove_noise_from_bams()

Required packages

CRAN (use install.packages())

  • utils
  • grDevices
  • tibble
  • dplyr
  • magrittr
  • ggplot2
  • philentropy
  • doParallel
  • foreach

Bioconductor (use BiocManager::install())

  • preprocessCore
  • IRanges
  • GenomicRanges
  • Rsamtools

Copy Link

Version

Install

install.packages('noisyr')

Monthly Downloads

230

Version

1.0.0

License

GPL-2

Issues

Pull Requests

Stars

Forks

Maintainer

Ilias Moutsopoulos

Last Published

April 16th, 2021

Functions in noisyr (1.0.0)

noisyr_counts

Run the noisyR pipeline for the count matrix approach
remove_noise_from_bams

Function to remove the noisy reads from the BAM files
remove_noise_from_matrix

Function to remove the noisy reads from the expression matrix
calculate_expression_similarity_counts

Calcualate the expression levels and expression levels similarity matrices using the count matrix
noisyr

Run the noisyR pipeline
get_methods_correlation_distance

Show the methods for calculating correlation or distance
noisyr_transcript

Run the noisyR pipeline for the transcript approach
plot_expression_similarity

Plot the similarity against expression levels
optimise_window_length

Optimise the elements per window for the count matrix approach
get_methods_calculate_noise_threshold

Show the methods for calculating a noise threshold
filter_genes_transcript

Function to filter the gene table for the transcript approach
cast_gtf_to_genes

Function to extract exon names and positions from a gtf file
cast_matrix_to_numeric

Cast a matrix of any type to numeric
calculate_first_minimum_density

Function to find the first local minimum of the density of a vector
calculate_expression_similarity_transcript

Calcualte the distance matrices using the BAM files
calculate_expression_profile

Calculate the expression profile of a gene
calculate_noise_threshold_method_statistics

Function to tabulate statistics for different methods of calculating the noise threshold
calculate_noise_threshold

Function to calculate the noise threshold for a given expression matrix and parameters