Data.rf.classifier: Random Forest classification for OTU/ASV Data

Description

This function implements a random forest classification model tailored for OTU/ASV datasets. It performs data filtering, model training, performance evaluation, cross-validation, and biomarker selection (identifying important microbial features) based on Mean Decrease Accuracy.

Details

The function processes the input OTU count data and corresponding metadata in several steps:

Data Filtering and Preparation: If a minimum count threshold (OTU_counts_filter_value) is provided, OTUs with total counts below this value are removed. The OTU table is then transposed and merged with the metadata, where a specific column (specified by Group) indicates the group labels.
Data Partitioning: The combined dataset is split into training and testing subsets based on the proportion specified by train_p.
Model Training: A random forest classifier is trained on the training data. The function computes the margin scores for the training samples, which are plotted to visualize the model’s confidence.
Performance Evaluation: Predictions are made on both training and testing datasets. Confusion matrices are generated to compare the actual versus predicted classes.
Feature Importance and Cross-Validation: OTU importance is assessed using Mean Decrease Accuracy. Repeated k-fold cross-validation (by default 10-fold, repeated reps times) is performed to determine the optimal number of OTUs (biomarkers). A cross-validation error curve is plotted, and the user is prompted to enter the best number of OTUs based on the plot.
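
The first two steps can be illustrated with a minimal base R sketch. It is not the function's actual implementation: the object names (otu, meta, min_total, combined) are hypothetical, the metadata is assumed to carry sample names as row names, and a simple random split is assumed where the function may partition the data differently.

```r
# Toy inputs: 20 OTUs (rows) x 10 samples (columns), one group label per sample.
set.seed(123)
otu <- matrix(sample(0:100, 200, replace = TRUE), nrow = 20,
              dimnames = list(paste0("OTU", 1:20), paste0("Sample", 1:10)))
meta <- data.frame(Group = rep(c("Control", "Treatment"), each = 5),
                   row.names = colnames(otu))  # assumption: row names = sample names

# Step 1: drop OTUs whose total count across all samples is below the threshold.
min_total <- 50  # plays the role of OTU_counts_filter_value
otu_kept <- otu[rowSums(otu) >= min_total, , drop = FALSE]

# Transpose so that samples are rows, then attach the group label by sample name.
combined <- data.frame(t(otu_kept), check.names = FALSE)
combined$Group <- factor(meta[rownames(combined), "Group"])

# Step 2: random split; train_p is the fraction of samples used for training.
train_p <- 0.7
n_train <- round(train_p * nrow(combined))
train_idx <- sample(seq_len(nrow(combined)), size = n_train)
train_set <- combined[train_idx, , drop = FALSE]
test_set  <- combined[-train_idx, , drop = FALSE]
```

With 10 samples and train_p = 0.7, this leaves 7 samples for training and 3 for testing.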

Usage

Data.rf.classifier(
  raw_data,
  metadata,
  train_p,
  Group,
  OTU_counts_filter_value = NA,
  reps = 5,
  cv_fold = 10,
  title_size = 10,
  axis_title_size = 8,
  legend_title_size = 8,
  legend_text_size = 6,
  seed = 123
)

Value

An object of class DataRFClassifier with the following elements:

Input_data: The transposed and (optionally) filtered OTU table.
Predicted_results_on_train_set: A vector of predicted group labels for the training set.
Predicted_results_on_test_set: A vector of predicted group labels for the test set.
Traindata_confusion_matrix: A confusion matrix comparing actual vs. predicted group labels for the training set.
Testdata_confusion_matrix: A confusion matrix comparing actual vs. predicted group labels for the test set.
Margin_scores_train: A ggplot object displaying the margin scores of the training set samples.
OTU_importance: A data frame of OTU importance metrics, sorted by Mean Decrease Accuracy.
Classifier: A random forest classifier object trained on the training set.
cross_validation: A ggplot object showing the cross-validation error curve as a function of the number of features.
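
As a rough guide to what these elements contain, the sketch below produces comparable objects directly with the randomForest package. It assumes the function builds on randomForest (the argument descriptions refer to rfcv), but the exact calls it makes are not documented here; the toy data and the names x, y, rf, and cv are hypothetical.

```r
library(randomForest)

set.seed(123)
# Toy training data: 40 samples x 20 OTUs in two groups (hypothetical values).
x <- matrix(rpois(40 * 20, lambda = 20), nrow = 40,
            dimnames = list(paste0("Sample", 1:40), paste0("OTU", 1:20)))
y <- factor(rep(c("Control", "Treatment"), each = 20))
x[y == "Treatment", 1:3] <- x[y == "Treatment", 1:3] + 15  # make 3 OTUs informative

# Train with importance = TRUE so that Mean Decrease Accuracy is computed.
rf <- randomForest(x = x, y = y, importance = TRUE)

# Confusion matrix of actual vs. predicted labels (here on the training data).
pred <- predict(rf, x)
conf_mat <- table(Actual = y, Predicted = pred)

# Margin of each training sample, derived from the forest's votes.
margins <- margin(rf)

# OTU importance, sorted by Mean Decrease Accuracy.
imp <- as.data.frame(importance(rf))
imp <- imp[order(imp$MeanDecreaseAccuracy, decreasing = TRUE), ]

# One round of k-fold cross-validation over nested feature subsets.
cv <- rfcv(trainx = x, trainy = y, cv.fold = 10)
cv$error.cv  # cross-validated error for each number of features in cv$n.var
```

A repeated cross-validation like the one described above would presumably call rfcv reps times and average the error curves; that detail is an assumption, not something this page confirms.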

Arguments

raw_data: A numeric matrix or data frame of counts data with OTUs/ASVs as rows and samples as columns.
metadata: A data frame containing information about all samples, including at least the grouping of each sample and individual information (Group and ID), the sampling time point of each sample, and any other relevant information.
train_p: A number between 0 and 1 giving the proportion of samples assigned to the training set. For example, train_p = 0.7 randomly selects 70% of the samples as the training set. For more information, see rfcv.
Group: A string naming the column in metadata that contains the group label of each sample.
OTU_counts_filter_value: An integer giving the minimum total count that an OTU/ASV must have across all samples. OTUs/ASVs whose total count is below this threshold are removed; all others are retained. Defaults to NA, in which case no filtering is applied.
reps: An integer. The number of replications for cross-validation. Defaults to 5. For more details, see rfcv.
cv_fold: An integer. The number of folds in the cross-validation. Defaults to 10. See rfcv.
title_size: Numeric value for the font size of plot titles. Defaults to 10.
axis_title_size: Numeric value for the font size of axis titles. Defaults to 8.
legend_title_size: Numeric value for the font size of legend titles. Defaults to 8.
legend_text_size: Numeric value for the font size of legend text. Defaults to 6.
seed: An integer used as the random seed, for reproducible results. Defaults to 123.

Author

Shijia Li

Examples

# \donttest{
# Example OTU count data (20 OTUs x 10 samples)
set.seed(123)
otu_data <- matrix(sample(0:100, 200, replace = TRUE), nrow = 20)
colnames(otu_data) <- paste0("Sample", 1:10)
rownames(otu_data) <- paste0("OTU", 1:20)

# Example metadata with group labels
metadata <- data.frame(Group = rep(c("Control", "Treatment"), each = 5))

# Run the classifier
result <- Data.rf.classifier(raw_data = otu_data,
                             metadata = metadata,
                             train_p = 0.7,
                             Group = "Group",
                             OTU_counts_filter_value = 50)
# }
