ko2kegg_abundance: Convert KO abundance in PICRUSt2 export files to KEGG pathway abundance

Description

This function reads KO (KEGG Orthology) abundance data in PICRUSt2 export format, either from a file or from a data frame supplied through data, and converts it to KEGG pathway abundance. An input file must be in .tsv, .txt, or .csv format.

Usage

ko2kegg_abundance(
  file = NULL,
  data = NULL,
  method = c("abundance", "sum"),
  filter_for_prokaryotes = TRUE
)

Value

A data frame with KEGG pathway abundance values. Rows represent KEGG pathways, identified by their KEGG pathway IDs. Columns represent samples, identified by their sample IDs from the input file.

Arguments

file

A character string giving the path to the input file containing KO abundance data in PICRUSt2 export format. The file should have KO identifiers in the first column and sample identifiers in the first row. The remaining cells should contain the abundance value for each KO-sample pair.

data

An optional data.frame containing KO abundance data in the same format as the input file. If provided, the function will use this data instead of reading from the file. By default, this parameter is set to NULL.
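As a rough illustration of the layout described for file and data, the sketch below builds a toy table, writes it as a tab-separated file, and reads it back. The first-column header "function", the sample names, and the abundance values are assumptions made for the example, not requirements stated on this page.

```r
# Toy KO abundance table: KO identifiers in the first column,
# one column per sample. Header and sample names are illustrative.
ko_table <- data.frame(
  `function` = c("K00001", "K00002", "K00003"),
  sample_A = c(10, 0, 5),
  sample_B = c(3, 7, 0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Write it as a tab-separated file, the layout described for `file`.
tsv_path <- tempfile(fileext = ".tsv")
write.table(ko_table, tsv_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout survives the round trip.
round_trip <- read.delim(tsv_path, check.names = FALSE,
                         stringsAsFactors = FALSE)

# With ggpicrust2 loaded, the path or the table would then be supplied
# through the `file` or `data` argument respectively, for example:
# ko2kegg_abundance(file = tsv_path)
```

A real analysis would use the full PICRUSt2 KO table rather than three rows; the toy table only shows where identifiers and abundances sit.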

method

Method for calculating pathway abundance. One of:

"abundance": (Default) PICRUSt2-style calculation using the mean of upper-half sorted KO abundances. This method is more robust and avoids inflating abundances for pathways with more KOs.
"sum": Simple summation of all KO abundances. This is the legacy method and may double-count KOs belonging to multiple pathways.

filter_for_prokaryotes

Logical. If TRUE (default), filters out KEGG pathways that are not relevant to prokaryotic (bacterial/archaeal) analysis. This removes pathways in categories such as:

Human diseases (cancer, neurodegenerative diseases, addiction, etc.)
Organismal systems (immune system, nervous system, endocrine system, etc.)

Bacterial infection pathways and antimicrobial resistance pathways are retained. Set to FALSE to include all KEGG pathways (for eukaryotic analysis or custom filtering).

Pathway Filtering

When filter_for_prokaryotes = TRUE, the function excludes KEGG pathways that are biologically irrelevant to prokaryotic organisms. KEGG reference pathways cover all domains of life, so many human- or animal-specific pathways would otherwise appear in a bacterial analysis simply because some of their KOs are shared across organisms.

The following KEGG categories are excluded (Level 2 classes under Human Diseases, plus all of Organismal Systems):

Cancer pathways (overview and specific types)
Neurodegenerative diseases (Alzheimer's, Parkinson's, etc.)
Substance dependence (addiction pathways)
Cardiovascular diseases
Endocrine and metabolic diseases
Immune diseases
Organismal systems (immune, nervous, endocrine, digestive, etc.)

The following are RETAINED even with filtering:

Infectious disease: bacterial (Salmonella, E. coli, Tuberculosis, etc.)
Drug resistance: antimicrobial (antibiotic resistance)
All Metabolism pathways
Genetic/Environmental Information Processing
Cellular Processes
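The exclusion rules listed above can be sketched as a lookup against pathway categories. This is only an illustration: pathway_info, excluded_level2, and the six example pathways are hypothetical names and data chosen for the sketch, and the package may store and apply its category table differently.

```r
# Illustrative lookup table of pathways and their KEGG categories.
# This is not the package's internal table.
pathway_info <- data.frame(
  pathway = c("ko00010", "ko05200", "ko05010",
              "ko04610", "ko05132", "ko01501"),
  level1 = c("Metabolism", "Human Diseases", "Human Diseases",
             "Organismal Systems", "Human Diseases", "Human Diseases"),
  level2 = c("Carbohydrate metabolism", "Cancer: overview",
             "Neurodegenerative disease", "Immune system",
             "Infectious disease: bacterial",
             "Drug resistance: antimicrobial"),
  stringsAsFactors = FALSE
)

# Disease classes treated as irrelevant to prokaryotes in this sketch.
excluded_level2 <- c(
  "Cancer: overview", "Cancer: specific types",
  "Neurodegenerative disease", "Substance dependence",
  "Cardiovascular disease", "Endocrine and metabolic disease",
  "Immune disease"
)

# A pathway is dropped if its class is excluded or if it belongs to
# Organismal Systems; everything else is kept.
is_excluded <- pathway_info$level2 %in% excluded_level2 |
  pathway_info$level1 == "Organismal Systems"
kept <- pathway_info$pathway[!is_excluded]
print(kept)
```

In this toy table the glycolysis, bacterial infection, and antimicrobial resistance pathways survive, while the cancer, neurodegenerative, and immune system pathways are removed.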

Details

The default "abundance" method follows PICRUSt2's approach for calculating pathway abundance:

For each pathway, collect abundances of all associated KOs present in the data
Sort the abundances in ascending order
Take the upper half of the sorted values
Take the mean of that upper half as the pathway abundance

This approach has several advantages over simple summation:

Does not inflate abundances for pathways containing more KOs
More robust to missing or low-abundance KOs
Provides a more accurate representation of pathway activity

The "sum" method is provided for backward compatibility and simply sums all KO abundances for each pathway.
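The difference between the two methods can be sketched for a single pathway in a single sample. The helper upper_half_mean and the KO values are hypothetical; in particular, how the function splits an odd number of KOs is not stated here, so the sketch assumes the middle value is included in the upper half.

```r
# Abundances of the KOs assigned to one hypothetical pathway in one sample.
ko_values <- c(K1 = 0, K2 = 2, K3 = 4, K4 = 10)

# Mean of the upper half of the sorted values.
# Assumption: with an odd count, the middle value is kept in the upper half.
upper_half_mean <- function(x) {
  n <- length(x)
  if (n == 0) {
    return(NA_real_)            # no KOs observed for this pathway
  }
  sorted <- sort(x)             # ascending order
  start <- floor(n / 2) + 1     # first index of the upper half
  mean(sorted[start:n])
}

abundance_value <- upper_half_mean(ko_values)  # mean of 4 and 10
sum_value <- sum(ko_values)                    # all four values added

print(c(abundance = abundance_value, sum = sum_value))
```

Adding more low-abundance KOs to ko_values would raise the sum but leave the upper-half mean driven by the largest values, which is the behavior the list of advantages describes.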

Examples

if (FALSE) {
library(ggpicrust2)
library(readr)

# Example 1: Default - filtered for prokaryotic analysis
data(ko_abundance)
kegg_abundance <- ko2kegg_abundance(data = ko_abundance)

# Example 2: Include all pathways (for eukaryotic analysis)
kegg_abundance_all <- ko2kegg_abundance(data = ko_abundance, filter_for_prokaryotes = FALSE)

# Example 3: Using legacy sum method with filtering
kegg_abundance_sum <- ko2kegg_abundance(data = ko_abundance, method = "sum")

# Example 4: From file (stored under a new name so Example 1's result is kept)
input_file <- "path/to/your/picrust2/results/pred_metagenome_unstrat.tsv"
kegg_abundance_file <- ko2kegg_abundance(file = input_file)
}
