readyomics (version 0.1.1)

process_ngs: Process next generation sequencing data

Description

This function performs quality control, filtering, normalization, and transformation of raw sequencing counts. It can also build phyloseq objects for downstream ecological analyses, and optionally returns the intermediate matrices from each processing step.

Usage

process_ngs(
  X,
  sample_data,
  taxa_table = NULL,
  phylo_tree = NULL,
  remove_ids = NULL,
  min_reads = 500,
  min_prev = 0.1,
  normalise = c("load", "TSS", "none"),
  load_colname = NULL,
  min_load = 10000,
  transform = c("clr", "log", "none"),
  impute_control = list(method = "GBM", output = "p-counts", z.delete = FALSE,
    z.warning = 1, suppress.print = TRUE),
  raw_phyloseq = TRUE,
  eco_phyloseq = TRUE,
  return_all = FALSE,
  verbose = TRUE
)

Value

A named list containing:

X_processed

Matrix of processed feature counts after filtering, normalization, and transformation.

sdata_final

Matched and filtered sample_data corresponding to retained samples.

phyloseq_raw

phyloseq object created from raw filtered data. NULL if raw_phyloseq = FALSE.

phyloseq_eco

phyloseq object from ecosystem abundance data. NULL if eco_phyloseq = FALSE or normalise != "load".

X_matched

(Optional) Matched and filtered count matrix, pre-normalization. Returned only if return_all = TRUE.

X_norm

(Optional) Normalized count matrix. Returned only if return_all = TRUE.

X_prev

(Optional) Prevalence-filtered matrix, pre-transformation. Returned only if return_all = TRUE.

Arguments

X

A numeric matrix or data frame of raw counts with samples as rows and features (e.g., taxa) as columns. Row names must be sample IDs.

sample_data

A data frame containing sample-level data. Must include a column named sample_id, and its row names must match the row names of X.

taxa_table

Optional. Taxonomy annotation table to build phyloseq objects. Row names must match column names of X.

phylo_tree

Optional. Phylogenetic tree to add to phyloseq objects.

remove_ids

A regex or character vector identifying rows of X (samples, e.g., QC samples) to remove. Set to NULL to skip.

min_reads

Numeric. Minimum number of total reads required per sample. Default is 500.

min_prev

Numeric between 0 and 1. Minimum feature prevalence threshold. Default is 0.1 (i.e., feature must be present in >= 10 % of samples).
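
The two thresholds above can be illustrated with a minimal base R sketch. This is not the package's internal code: the toy matrix is made up, and the order of operations (samples filtered first, prevalence computed on the retained samples) is an assumption made for illustration.

```r
# Toy count matrix: 4 samples (rows) x 3 features (columns); values are made up.
counts <- matrix(
  c(600,  0, 400,
    300,  0, 100,
    900, 50,   0,
    700,  0, 800),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample", 1:4), paste0("ASV", 1:3))
)

min_reads <- 500
min_prev <- 0.5  # stricter than the 0.1 default so the toy data shows an effect

# Sample filter: keep rows whose total read count reaches min_reads.
keep_samples <- rowSums(counts) >= min_reads
counts_kept <- counts[keep_samples, , drop = FALSE]

# Feature filter: prevalence is the fraction of retained samples
# in which the feature has a non-zero count.
prevalence <- colMeans(counts_kept > 0)
counts_filtered <- counts_kept[, prevalence >= min_prev, drop = FALSE]
```

Here sample2 (400 total reads) is dropped by the read filter, and ASV2 (present in one of the three retained samples) is dropped by the prevalence filter.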

normalise

Normalization method. One of "load" (microbial load data), "TSS" (total sum scaling), or "none".

load_colname

Column name in sample_data containing microbial load values. Required if normalise = "load".

min_load

Numeric. Minimum expected microbial load. A warning is issued if any microbial load value is below min_load. Default is 1e4.
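
As a rough illustration of the two normalization options, the base R sketch below computes total sum scaling and then rescales the proportions by a per-sample load. The load scaling formula (relative abundance multiplied by load) is an assumption in the spirit of quantitative microbiome profiling; the package's exact computation may differ. The counts and load values are hypothetical.

```r
# Toy counts: 2 samples (rows) x 3 features (columns); values are made up.
counts <- matrix(
  c(10, 30, 60,
    50, 25, 25),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample1", "sample2"), paste0("ASV", 1:3))
)
load <- c(sample1 = 1e5, sample2 = 2e5)  # hypothetical microbial load per sample
min_load <- 1e4

# "TSS": divide each row by its total, so every sample sums to 1.
tss <- counts / rowSums(counts)

# Flag suspiciously low loads, mirroring the documented min_load warning.
if (any(load < min_load)) {
  warning("Some microbial load values are below min_load.")
}

# "load" (assumed form): scale relative abundances by the sample's load,
# so every sample sums to its measured load.
eco <- tss * load[rownames(tss)]
```

After this step each row of tss sums to 1, while each row of eco sums to the corresponding load value.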

transform

Transformation method. One of "clr" (centered log-ratio with zero imputation), "log" (pseudo-log using log1p()), or "none". Note: When using "clr", zero values are imputed using zCompositions::cmultRepl().

impute_control

A named list of arguments to be passed to zCompositions::cmultRepl().
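
To make the "log" and "clr" options concrete, here is a minimal base R sketch. Note the swap: zeros are replaced with a fixed pseudocount of 0.5 purely as a stand-in, whereas the function itself imputes zeros with zCompositions::cmultRepl() using the settings in impute_control. The toy matrix is made up.

```r
# Toy counts: 2 samples (rows) x 3 features (columns); values are made up.
counts <- matrix(
  c(10,  0, 90,
    20, 30, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample1", "sample2"), paste0("ASV", 1:3))
)

# "log": log1p(x) = log(1 + x), so zero counts map to 0.
x_log <- log1p(counts)

# "clr": zeros must be replaced before taking logs.
# Stand-in replacement (NOT what cmultRepl does): a fixed pseudocount.
counts_nz <- counts
counts_nz[counts_nz == 0] <- 0.5

# CLR: log counts minus the per-sample mean of the log counts
# (equivalently, the log of each count over the row's geometric mean).
log_counts <- log(counts_nz)
x_clr <- log_counts - rowMeans(log_counts)
```

A quick sanity check on any CLR result is that each row sums to zero, up to floating-point error.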

raw_phyloseq

Logical. If TRUE, constructs a phyloseq object from the table of raw counts (after removing failed runs, if any). Default is TRUE.

eco_phyloseq

Logical. If TRUE, constructs a phyloseq object from the ecosystem abundances (i.e., counts normalised with normalise = "load"). Default is TRUE.

return_all

Logical. If TRUE, additional intermediate data matrices (X_matched, X_norm, X_prev) are included in the output. Default is FALSE.

verbose

Logical. If TRUE, prints progress messages during execution. Default is TRUE.

Details

  • Zeros are imputed with zCompositions::cmultRepl() before CLR transformation.

  • QC or other samples are removed if remove_ids is specified.

  • Sample IDs in X and sample_data row names are matched and aligned.

  • Can generate both a phyloseq_raw object containing raw counts and a phyloseq_eco object containing ecosystem counts. The latter requires a load_colname column in sample_data, so that counts can be normalized by microbial load (recommended best practice).
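
The sample removal and ID matching steps listed above can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: the objects are hypothetical, remove_ids is treated as a single regex, and samples missing from either input are simply dropped.

```r
# Hypothetical inputs: X has a QC row and a sample with no metadata;
# sample_data has a sample with no counts.
X <- matrix(
  1:12,
  nrow = 4,
  dimnames = list(c("sample2", "QC_1", "sample1", "sample3"),
                  paste0("ASV", 1:3))
)
sample_data <- data.frame(
  sample_id = c("sample1", "sample2", "sample4"),
  row.names = c("sample1", "sample2", "sample4")
)

# Remove rows whose ID matches the pattern (here, QC samples).
remove_ids <- "^QC"
X_clean <- X[!grepl(remove_ids, rownames(X)), , drop = FALSE]

# Keep only IDs present in both inputs, then put both in the same order.
shared <- intersect(rownames(sample_data), rownames(X_clean))
X_matched <- X_clean[shared, , drop = FALSE]
sdata_matched <- sample_data[shared, , drop = FALSE]
```

Indexing both objects by the same shared vector guarantees that row i of the count matrix and row i of the metadata describe the same sample.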

References

McMurdie, P. J., & Holmes, S. (2013). phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE, 8(4), e61217. doi:10.1371/journal.pone.0061217

Martín-Fernández, J. A., Hron, K., Templ, M., Filzmoser, P., & Palarea-Albaladejo, J. (2015). Bayesian-multiplicative treatment of count zeros in compositional data sets. Statistical Modelling, 15(2), 134–158. doi:10.1177/1471082X14535524

Palarea-Albaladejo, J., & Martín-Fernández, J. A. (2015). zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemometrics and Intelligent Laboratory Systems, 143, 85–96. doi:10.1016/j.chemolab.2015.02.019

Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., & Egozcue, J. J. (2017). Microbiome datasets are compositional: And this is not optional. Frontiers in Microbiology, 8, 2224. doi:10.3389/fmicb.2017.02224

Vandeputte, D., Kathagen, G., D’hoe, K., Vieira-Silva, S., Valles-Colomer, M., Sabino, J., Wang, J., Tito, R. Y., De Commer, L., Darzi, Y., Vermeire, S., Falony, G., & Raes, J. (2017). Quantitative microbiome profiling links gut community variation to microbial load. Nature, 551(7681), 507–511. doi:10.1038/nature24460

See Also

  • build_phyloseq()

  • zCompositions::cmultRepl()

Examples

if (requireNamespace("phyloseq", quietly = TRUE)) {
  # Mock count matrix: 5 samples (rows) x 5 ASVs (columns)
  mock_X <- matrix(
    sample(0:1000, 25, replace = TRUE),
    nrow = 5,
    dimnames = list(paste0("sample", 1:5), paste0("ASV", 1:5))
  )

  # Sample metadata: sample_id and row names both match the row names of mock_X
  mock_sample_data <- data.frame(
    sample_id = paste0("sample", 1:5),
    load = c(1e5, 2e5, 1e4, 5e4, 1.5e5),
    condition = factor(rep(c("A", "B"), length.out = 5)),
    row.names = paste0("sample", 1:5)
  )

  # Taxonomy table: row names match the column names of mock_X
  mock_taxa_table <- data.frame(
    Kingdom = rep("Bacteria", 5),
    Genus = paste0("Genus", 1:5),
    row.names = paste0("ASV", 1:5)
  )

  # Normalise by microbial load and skip the transformation step
  result <- process_ngs(
    X = mock_X,
    sample_data = mock_sample_data,
    taxa_table = mock_taxa_table,
    normalise = "load",
    load_colname = "load",
    transform = "none",
    verbose = FALSE
  )
}