taxmapper: Maps an input taxonomy table onto a different taxonomic nomenclature.

Description

Maps an input taxonomy table onto a different taxonomic nomenclature.

Usage

taxmapper(
  tt,
  tt.ranks = colnames(tt),
  tax2map2 = "pr2",
  exceptions = c("Archaea", "Bacteria"),
  ignore.format = FALSE,
  synonym.file = "default",
  streamline = TRUE,
  outfilez = NULL
)

Arguments

tt

The input taxonomy table you would like to map onto a new taxonomic nomenclature. Should be a dataframe whose columns are of type character or list (no factors).

tt.ranks

A character vector of the column names where taxonomic names are found in tt. Supply them hierarchically (e.g. kingdom --> species).

tax2map2

The taxonomic nomenclature you would like to map onto. pr2 v4.12.0, Silva SSU v138 nr, GreenGenes v13.8 clustered at 97% similarity, and the RDP train set 16 are included in the ensembleTax package. You can map to these by specifying "pr2", "Silva", "gg", or "rdp". Otherwise, tax2map2 should be a dataframe whose columns are of type character or list (no factors), with each column corresponding to a taxonomic rank.

exceptions

A character vector of taxonomic names at the basal/root rank of tt that will be propagated onto the mapped taxonomy. ASVs assigned to these names will retain these names at their basal/root rank in the mapped taxonomy. All other ranks are assigned NA.

ignore.format

If TRUE, the algorithm modifies taxonomic names in tt to account for variations in name syntax and/or formatting commonly encountered in reference databases (e.g. Pseudo-nitzschia will map to Pseudonitzschia). If FALSE, formatting differences may prevent mapping of synonymous taxonomic names (e.g. Pseudonitzschia will NOT map to Pseudo-nitzschia). The full list of formatting rules is given in Details.

synonym.file

If "default", taxmapper uses the taxonomic synonyms included with the ensembleTax package. If a custom taxonomic synonym file is preferred, supply a string giving the name of the csv file. Taxonomic synonyms are searched when exact name matches are not found in tax2map2. If ignore.format = TRUE, it also applies to synonyms. Specify NULL if you wish to forgo synonym searches.

streamline

If TRUE, only the mapped version of tt is returned as a dataframe. If FALSE, a 3-element list is returned where element 1 is the mapping key returned as a dataframe, element 2 is a character vector of all names that could not be mapped (no exact matches found in tax2map2), and element 3 is the mapped version of tt (a dataframe).

outfilez

If NULL, mapping files are not saved to the current working directory. Otherwise should be a 3-element character vector including, in this order, the name of the file to store the taxonomic mapping key, the name of the file to store the names that could not be mapped, and the name of the file to store the ASVs supplied with tt with their mapped taxonomic assignments. Each element of the vector should end in .csv (only csv files may be saved).
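
Because tt (and any custom tax2map2) must not contain factors, it can help to coerce columns to character before calling taxmapper. The following is a minimal sketch using only base R; the table my.tt and its contents are made up for illustration.

```r
# Hypothetical taxonomy table whose text columns were read in as factors
my.tt <- data.frame(ASV = c("AAAA", "ATCG"),
                    domain = c("Bacteria", "Eukaryota"),
                    phylum = c("Firmicutes", NA),
                    stringsAsFactors = TRUE)

# Convert every factor column to character; NA entries stay NA
is.fac <- vapply(my.tt, is.factor, logical(1))
my.tt[is.fac] <- lapply(my.tt[is.fac], as.character)

# Rank columns, ordered from the highest rank to the lowest
my.ranks <- c("domain", "phylum")

# Confirm that no factor columns remain
no.factors <- !any(vapply(my.tt, is.factor, logical(1)))
```

The objects my.tt and my.ranks could then be passed as tt and tt.ranks.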

Value

If streamline = TRUE, a dataframe formatted for use with ensembleTax that contains mapped taxonomic assignments for each ASV/OTU in the data set.

If streamline = FALSE, a 3-element list where the first element is a dataframe that contains all unique input taxonomic assignments and their corresponding mapped outputs, the second element is a character vector that contains all taxonomic names that could not be mapped, and the third element contains mapped taxonomic assignments for each ASV in the data set.

If outfilez is not NULL, three csv files are saved in the current working directory, one for each of the three list elements above.
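
As a small illustration of the outfilez format, the sketch below builds a 3-element vector in the documented order and checks it before use. The file names are hypothetical, and the checks are a suggestion rather than something taxmapper is known to perform itself.

```r
# Hypothetical file names, in the documented order:
# mapping key, unmapped names, mapped taxonomy table
my.outfilez <- c("map_key.csv", "unmapped_names.csv", "mapped_tt.csv")

# Sanity checks before passing the vector as outfilez
has.three <- length(my.outfilez) == 3
all.csv <- all(grepl("\\.csv$", my.outfilez))
```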

Details

Exceptions should be used when the user knows a particular taxonomic group is not found in tax2map2. The user is responsible for supplying valid taxonomic names as these must be found in tt and will be propagated as given to all ASVs that are assigned this name in tt. This should only be used for high-level taxonomic groups that are not found in a database (e.g. for retaining Eukaryota when mapping onto a prokaryote-only taxonomic nomenclature).
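
The documented outcome for exception names can be sketched in base R as follows. The helper apply.exceptions is hypothetical and is not part of ensembleTax; it only reproduces the described result for rows assigned to an exception (the basal rank is kept, all other ranks become NA) and leaves every other row untouched instead of mapping it.

```r
# Hypothetical helper: keep exception names at the basal rank and
# set all lower ranks to NA for those rows only
apply.exceptions <- function(tt, tt.ranks, exceptions) {
  out <- tt
  is.exc <- tt[[tt.ranks[1]]] %in% exceptions
  out[is.exc, tt.ranks[-1]] <- NA
  out
}

# Made-up input table
toy.tt <- data.frame(ASV = c("AAAA", "ATCG"),
                     domain = c("Bacteria", "Eukaryota"),
                     phylum = c("Firmicutes", "Diatomea"),
                     stringsAsFactors = FALSE)

toy.out <- apply.exceptions(toy.tt,
                            tt.ranks = c("domain", "phylum"),
                            exceptions = c("Archaea", "Bacteria"))
```

Here the "Bacteria" row keeps its domain while its phylum becomes NA.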

When ignore.format = TRUE, names for which taxmapper cannot find exact matches in tax2map2 are altered in case the match failed because of formatting differences. To do this, taxmapper first checks for hyphens "-", underscores "_", and single spaces " ". If any are found, it searches tax2map2 for exact matches to variants of the name with the hyphen/underscore/space replaced by each of the other two characters, to all subnames separated by these characters, and to all subnames pasted together with none of these characters. It also creates all-lowercase and all-uppercase versions of these elements and searches again for exact matches. To prevent matching of arbitrary names often used in reference databases, like "Clade X", alternative names that begin with any variant of the words "clade" or "group", and names that are 2 characters or fewer, are removed before tax2map2 is searched again. All alternative names created when ignore.format = TRUE are also searched for synonyms in synonym.file. Be advised that setting ignore.format = TRUE does not guarantee a more finely resolved mapped taxonomy table, and can result in a less-resolved table in some circumstances.
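
The name variants described above can be approximated in base R as follows. This is a rough sketch, not the package's implementation: the helper name.variants is hypothetical, it treats mixed separators within one name uniformly, and it assumes case variants are generated for every candidate.

```r
# Hypothetical helper sketching the alternative names tried for one name
name.variants <- function(nm) {
  variants <- nm
  if (grepl("[-_ ]", nm)) {
    # Split on hyphens, underscores and spaces, dropping empty pieces
    parts <- strsplit(nm, "[-_ ]")[[1]]
    parts <- parts[nzchar(parts)]
    variants <- c(variants,
                  paste(parts, collapse = "-"),  # separator swapped in turn
                  paste(parts, collapse = "_"),
                  paste(parts, collapse = " "),
                  parts,                         # each subname on its own
                  paste(parts, collapse = ""))   # subnames pasted together
  }
  # Add all-lowercase and all-uppercase versions
  variants <- c(variants, tolower(variants), toupper(variants))
  # Drop arbitrary labels ("clade ...", "group ...") and very short names
  keep <- !grepl("^(clade|group)", variants, ignore.case = TRUE) &
    nchar(variants) > 2
  unique(variants[keep])
}

pn.variants <- name.variants("Pseudo-nitzschia")
clade.variants <- name.variants("Clade X")
```

With this sketch, "Pseudonitzschia" is among pn.variants, while clade.variants is empty because every candidate either begins with "clade" or is too short.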

For high-throughput implementation of taxmapper, it's recommended to set streamline = TRUE.

Examples

# NOT RUN {
# Build a small mock Silva-style taxonomy table (character columns, no factors)
fake.silva <- data.frame(
  ASV = c("AAAA", "ATCG", "GCGC", "TATA", "TCGA"),
  domain = c("Bacteria", "Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota"),
  phylum = c("Firmicutes", "Diatomea", "Retaria", "MAST-12", "Diatomea"),
  class = c(NA, "Coscinodiscophytina_cl", "Polycystinea", "MAST-12A",
            "Mediophyceae"),
  order = c(NA, "Fragilariales", "Collodaria", NA, NA),
  family = c(NA, "Fragilariales_fa", "Collodaria_fa", NA, NA),
  genus = c(NA, "Podocystis", "Collophidium", NA, NA),
  stringsAsFactors = FALSE)
head(fake.silva)

# Map the table onto pr2; the rank columns are every column except ASV
mapped.silva <- taxmapper(fake.silva,
                          tt.ranks = colnames(fake.silva)[2:ncol(fake.silva)],
                          tax2map2 = "pr2",
                          exceptions = c("Archaea", "Bacteria"),
                          ignore.format = FALSE,
                          synonym.file = "default",
                          streamline = TRUE,
                          outfilez = NULL)

# }