This function reads a species abundance profile from kraken2 format_report output, performs filtering to remove low-abundance and unwanted taxa, and extracts a clean taxonomy table for downstream analysis.
pre_format_report(dir = "", exclude = NULL, relative_threshold = 1e-04)
Returns a list with two components:
A matrix of filtered species abundance data
A data.frame of taxonomy information for each species
dir: Character. Path to the directory containing the input file "mpa_profile_species.txt". Default is an empty string (current directory).
exclude: Character. Pattern used to exclude specific taxa (e.g., "g__Streptococcus"). Matching is done with grepl(), so the value is treated as a regular expression. Default is NULL (no exclusion).
relative_threshold: Numeric. Relative abundance threshold for filtering low-abundance taxa. Taxa with mean relative abundance below this threshold are removed. Default is 1e-4.
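To make the two filtering arguments concrete, the following base-R sketch applies a grepl() exclusion and a mean relative abundance cutoff to a toy matrix. The matrix `abund` and the helper `filter_taxa()` are hypothetical and exist only for illustration. The real function delegates the abundance step to pcutils::rm_low, whose exact behavior (including whether relative abundance is computed before or after exclusion) may differ from what is assumed here.

```r
# Toy abundance matrix: rows are taxa, columns are samples (hypothetical data).
abund <- matrix(
  c(50000, 48000,
    30000, 32000,
    20000, 20000,
        1,     2),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("g__Streptococcus|s__Streptococcus_mitis",
      "g__Bacteroides|s__Bacteroides_fragilis",
      "g__Escherichia|s__Escherichia_coli",
      "g__Rare|s__Rare_species"),
    c("S1", "S2")
  )
)

# Hypothetical helper, not part of any package: mimics the two filters.
filter_taxa <- function(abund, exclude = NULL, relative_threshold = 1e-4) {
  # Drop rows whose name matches the exclusion pattern, if one is given.
  if (!is.null(exclude)) {
    abund <- abund[!grepl(exclude, rownames(abund)), , drop = FALSE]
  }
  # Convert each sample (column) to relative abundance.
  rel <- sweep(abund, 2, colSums(abund), "/")
  # Keep taxa whose mean relative abundance reaches the threshold.
  keep <- rowMeans(rel) >= relative_threshold
  abund[keep, , drop = FALSE]
}

filtered <- filter_taxa(abund, exclude = "g__Streptococcus")
rownames(filtered)
```

In this toy case the Streptococcus row is removed by the pattern and the rare species is removed by the threshold, leaving the Bacteroides and Escherichia rows.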
The function performs the following steps:
1. Reads the MetaPhlAn-style species profile table ("mpa_profile_species.txt")
2. Removes "unclassified" entries and the "cellular_organisms" category
3. Filters out taxa matching the exclude pattern (if provided)
4. Applies relative abundance filtering using pcutils::rm_low
5. Extracts and formats taxonomy information from the MetaPhlAn-style names
6. Cleans species names by removing the "s__" prefix
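The taxonomy extraction and name cleaning steps can be sketched in base R as follows. This is an illustration under stated assumptions, not the function's actual code: the lineage strings are made up, the rank column names are hypothetical, and every lineage is assumed to carry all seven ranks separated by "|".

```r
# Hypothetical MetaPhlAn-style lineage strings (ranks separated by "|").
lineages <- c(
  "k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus_mitis",
  "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_fragilis"
)

# Assumed column names for the seven ranks.
ranks <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# Split each lineage on the literal "|" and stack the pieces into a matrix.
parts <- do.call(rbind, strsplit(lineages, "|", fixed = TRUE))
taxonomy <- as.data.frame(parts, stringsAsFactors = FALSE)
colnames(taxonomy) <- ranks

# Use the species name without its "s__" prefix as the row identifier.
species <- sub("^s__", "", taxonomy$Species)
rownames(taxonomy) <- species

taxonomy[, c("Genus", "Species")]
```

Note that `fixed = TRUE` matters here: without it, "|" is interpreted as regular-expression alternation rather than a literal separator. Lineages with missing ranks would need padding before `rbind`, which this sketch does not handle.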