Data.filter: Filtering for Microbial Features of Low Abundance and Low Prevalence.

Description

This function filters an OTU/ASV table based on overall counts and prevalence thresholds, and optionally applies a logarithmic transformation. When grouping variables are provided, the function performs abundance and prevalence filtering within each group separately.

Usage

Data.filter(
  Data,
  metadata,
  OTU_counts_filter_value = 1000,
  OTU_filter_value = NA,
  log_base = NA,
  Group_var = NULL
)

Value

A list of class FilteredData containing:

filtered_table: The filtered OTU count table, optionally log-transformed.
parameters: A list of the filtering parameters used.
metadata: The input metadata, possibly augmented with a combined grouping variable if multiple grouping variables were provided.

Arguments

Data: A data frame, or a list object generated by Data.rf.classifier that contains the selected biomarker count table. Rows represent OTUs/ASVs and columns represent samples.
metadata: A data frame containing information about all samples, including at least the group and individual identifiers (Group and ID), the sampling time point of each sample, and other relevant information.
OTU_counts_filter_value: An integer giving the minimum total abundance of an OTU/ASV summed over all samples. If the summed abundance of an OTU is below this positive integer threshold, the OTU is excluded; otherwise it is retained. The default is 1000. Note: if the input Data is the table of important OTUs produced by sample classification, this argument should be NA, as low-abundance OTUs/ASVs may already have been filtered out during classification by Data.rf.classifier.
OTU_filter_value: Numeric between 0 and 1. This specifies the minimum prevalence rate of an OTU/ASV across all samples within each group or individual. OTUs/ASVs with a prevalence rate below the given threshold will be removed.
log_base: The base of the logarithm. The default is NA, in which case no logarithmic transformation is applied; this suits datasets that are not very large. For large datasets, the base can be 2, "e", or 10.
Group_var: A string or a character vector naming the grouping variables. The names must match columns in metadata that designate sample groups, and are used to pre-process the OTU data of each group or individual separately. For instance, to split the OTU table by the Group variable, set Group_var = "Group"; to split the data by both Group and Diet (if present in metadata) in order to study the interaction between grouping variables, set Group_var = c("Group","Diet").
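
As a rough illustration of what passing several grouping variables implies, the sketch below builds a combined grouping factor in base R. The toy metadata, the Diet levels, and the use of interaction() are assumptions made for illustration; the way Data.filter combines the variables internally may differ.

```r
# Hypothetical metadata with two categorical variables (illustration only)
meta <- data.frame(
  Group = rep(c("A", "B"), each = 4),
  Diet  = rep(c("HF", "LF"), times = 4),
  row.names = paste0("Sample", 1:8)
)

# Combine the two variables into a single factor;
# each level is one Group-by-Diet combination, e.g. "A_HF"
combined <- interaction(meta$Group, meta$Diet, sep = "_", drop = TRUE)

# List the samples that fall into each combined group
samples_by_group <- split(rownames(meta), combined)
print(samples_by_group)
```

Each element of samples_by_group then identifies the set of samples over which a per-group prevalence could be computed.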

Author

Shijia Li

Details

The function executes several key steps:

Input Validation: It first checks whether the input Data is a data frame or a list generated by the function Data.rf.classifier. If Data is a list but not a data frame, its first element is extracted. If Data is neither a data frame nor an appropriate list, the function stops with an error.
OTU Count Filtering: If an OTU_counts_filter_value is provided (i.e., not NA), OTUs with total counts (across all samples) less than or equal to this value are removed.
Logarithmic Transformation: If a log_base is specified (allowed values are 10, 2, or "e"), a log transformation (with an offset of 1 to avoid log(0)) is applied to the data. If log_base is NA, the data remain untransformed.
Prevalence Filtering without Grouping: When Group_var is not provided (NULL), if an OTU_filter_value is specified, the function filters out OTUs whose prevalence (the proportion of samples with a non-zero count) is less than the threshold. If OTU_filter_value is not provided, a warning is issued and no prevalence filtering is applied.
Group-based Prevalence Filtering: If one or more grouping variables are specified in Group_var, the function first checks that these variables exist in metadata. For each group (or combination of groups if multiple variables are provided), the prevalence of each OTU is calculated, and OTUs are retained if they meet the prevalence threshold in at least one group. The filtered OTU table is then returned.
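
The steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the helper sketch_filter and its arguments (counts, groups, min_total, min_prev) are hypothetical names, and groups is assumed to be a vector with one entry per sample column.

```r
# Hypothetical helper for illustration; NOT the package's implementation.
sketch_filter <- function(counts, groups = NULL, min_total = NA,
                          min_prev = NA, log_base = NA) {
  counts <- as.matrix(counts)

  # Step 1: drop OTUs whose total count is at or below the threshold
  if (!is.na(min_total)) {
    counts <- counts[rowSums(counts) > min_total, , drop = FALSE]
  }

  # Step 2: optional log transform, offset by 1 to avoid log(0)
  if (!is.na(log_base)) {
    base <- if (identical(log_base, "e")) exp(1) else as.numeric(log_base)
    counts <- log(counts + 1, base = base)
  }

  # Step 3: prevalence = proportion of samples with a non-zero value;
  # an OTU is kept if it meets the threshold in at least one group
  if (!is.na(min_prev)) {
    if (is.null(groups)) groups <- rep("all", ncol(counts))
    pass <- lapply(split(seq_len(ncol(counts)), groups), function(idx) {
      rowMeans(counts[, idx, drop = FALSE] > 0) >= min_prev
    })
    keep <- Reduce(`|`, pass)
    counts <- counts[keep, , drop = FALSE]
  }
  counts
}

# Toy table: OTU1 occurs only in group A, OTU2 only in group B
toy <- matrix(c(5, 0, 0, 0,
                0, 0, 0, 1,
                3, 3, 3, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("OTU", 1:3), paste0("S", 1:4)))
toy_groups <- c("A", "A", "B", "B")

ungrouped <- sketch_filter(toy, min_prev = 0.5)
grouped   <- sketch_filter(toy, groups = toy_groups, min_prev = 0.5)
logged    <- sketch_filter(toy, min_total = 1, log_base = 2)
```

In this toy table, OTU1 and OTU2 each appear in one of four samples, so an ungrouped threshold of 0.5 removes them, whereas the grouped call retains them because each reaches 0.5 within its own group. Because log(0 + 1) is 0, applying the transformation before the prevalence step does not change which values count as non-zero.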

Examples

# Example OTU table
set.seed(123)
otu_table <- as.data.frame(matrix(sample(0:100, 100, replace = TRUE), nrow = 10))
rownames(otu_table) <- paste0("OTU", 1:10)
colnames(otu_table) <- paste0("Sample", 1:10)


# Example metadata
metadata <- data.frame(
  Group = rep(c("A", "B"), each = 5),
  row.names = paste0("Sample", 1:10)
)

# Filter OTU table without grouping
filtered_data <- Data.filter(
  Data = otu_table,
  metadata = metadata,
  OTU_counts_filter_value = 50,
  OTU_filter_value = 0.2
)

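# Filter OTU table with grouping
# (OTU_counts_filter_value is set explicitly: the default of 1000 is too high
# for this small example, where no row sum can exceed 1000)
filtered_data_grouped <- Data.filter(
  Data = otu_table,
  metadata = metadata,
  OTU_counts_filter_value = 50,
  OTU_filter_value = 0.5,
  Group_var = "Group"
)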
