extract_taxonomy: Extract taxonomy information from sequence headers

Description

Extracts the taxonomy from metadata (e.g. sequence headers) or parsed sequence data. The location and identity of important information in the input is specified using a regular expression with capture groups and a corresponding key. An object of type taxmap is returned containing the specified information. Taxa are translated into unique codes if they are not already encoded this way.

Usage

extract_taxonomy(input, ...)
# S3 method for default
extract_taxonomy(input, key = c("class", "taxon_id", "name",
  "taxon_info", "obs_id", "obs_info"), regex = "(.*)", class_key = c("name",
  "taxon_id", "taxon_info"), class_regex = "(.*)", class_sep = NULL,
  class_rev = FALSE, database = c("none", "ncbi", "itis", "eol", "col",
  "tropicos", "nbn"), allow_na = TRUE, vigilance = c("warning", "error",
  "message", "none"), return_match = FALSE, return_input = FALSE,
  redundant_names = FALSE, batch_size = 100, verbosity = c("low", "none",
  "high"), ...)
# S3 method for DNAbin
extract_taxonomy(input, ...)
# S3 method for list
extract_taxonomy(input, ...)

Arguments

input

A vector from which to extract taxonomy information or an object of class DNAbin (from the ape package).

...

Not used.

key

(character) The identity of the capturing groups defined using regex. The length of key must be equal to the number of capturing groups specified in regex. Any names added to the terms will be used as column names in the output. Only "taxon_info" and "obs_info" can be used multiple times. Each term must be one of those described below:

taxon_id: A unique numeric id for a taxon for a particular database (e.g. ncbi accession number). Requires an internet connection.
name: The name of a taxon. Names are not necessarily unique, but they are interpretable by a particular database. Requires an internet connection.
taxon_info: Arbitrary taxon info you want included in the output. Can be used more than once.
class: A list of taxa information that constitutes the full taxonomic classification from broad to specific (see class_rev) for a particular database. Individual taxa are separated by the class_sep argument and the information is parsed by the class_regex and class_key arguments.
obs_id: A unique observation (e.g. sequence) identifier for a particular database. Requires an internet connection.
obs_info: Arbitrary observation info you want included in the output. Can be used more than once.

regex

(character; length == 1) A regular expression with capturing groups indicating the locations of relevant information. The identity of the information must be specified using the key argument.
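The pairing of regex capture groups with key terms can be illustrated with base R alone. This is only a sketch of the idea, not the function's actual implementation; the header string below is hypothetical and merely imitates the pipe-delimited UNITE style.

```r
# Hypothetical UNITE-style header (not taken from a real file).
header <- "Amanita_muscaria|AB123456|SH000001.07FU|reps|k__Fungi;p__Basidiomycota"
regex <- "^(.*)\\|(.*)\\|(.*)\\|.*\\|(.*)$"
key <- c(seq_name = "obs_info", seq_id = "obs_info", other_id = "obs_info", "class")

# regexec() + regmatches() return the full match followed by each capture group.
parts <- regmatches(header, regexec(regex, header))[[1]]
groups <- parts[-1]

# One key term is needed per capture group.
stopifnot(length(groups) == length(key))

# Named key terms supply the label; unnamed terms fall back to the term itself.
names(groups) <- ifelse(names(key) == "", unname(key), names(key))
print(groups)
```

Each capture group is labelled by the key term in the same position, which is why the length of key has to equal the number of capture groups.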

class_key

(character) The identity of the capturing groups defined using class_regex. The length of class_key must be equal to the number of capturing groups specified in class_regex. Any names added to the terms will be used as column names in the output. At least one of "taxon_id" or "name" must be specified. Only "taxon_info" can be used multiple times. Each term must be one of those described below:

taxon_id: A unique numeric id for a taxon for a particular database (e.g. ncbi accession number). Requires an internet connection.
name: The name of a taxon. Names are not necessarily unique, but they are interpretable by a particular database. Requires an internet connection.
taxon_info: Arbitrary taxon info you want included in the output. Can be used more than once.

class_regex

(character of length 1) A regular expression with capturing groups indicating the locations of data for each taxon in the class term in the key argument. The identity of the information must be specified using the class_key argument. The class_sep option can be used to split the classification into data for each taxon before matching. If class_sep is NULL, each match of class_regex defines a taxon in the classification.

class_sep

(character of length 1) Used with the class term in the key argument. The character(s) used to separate individual taxa within a classification. After the string defined by the class capture group in regex is split by class_sep, its capture groups are extracted by class_regex and defined by class_key. If NULL, every match of class_regex is used instead, without first splitting by class_sep.
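How class_sep, class_regex and class_key work together can be sketched in base R. This is an illustration under assumptions, not the package's own code: the classification string is hypothetical, and treating class_sep as a literal string (fixed = TRUE) is a choice made for this sketch only.

```r
# Hypothetical classification string captured by the "class" term.
classification <- "k__Fungi;p__Basidiomycota;c__Agaricomycetes"
class_sep <- ";"
class_regex <- "^(.*)__(.*)$"
class_key <- c(unite_rank = "taxon_info", "name")

# Step 1: split the classification into one string per taxon.
# Assumption for this sketch: the separator is matched literally.
taxa <- strsplit(classification, class_sep, fixed = TRUE)[[1]]

# Step 2: apply class_regex to each taxon and keep only the capture groups.
matches <- regmatches(taxa, regexec(class_regex, taxa))
parsed <- do.call(rbind, lapply(matches, function(m) m[-1]))

# Step 3: label the columns using class_key (names win over terms).
colnames(parsed) <- ifelse(names(class_key) == "", unname(class_key), names(class_key))
print(parsed)
```

The result has one row per taxon, ordered from broad to specific as in the input, and one column per capture group.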

class_rev

(logical of length 1) Used with the class term in the key argument. If TRUE, the order of taxon data in a classification is reversed to be specific to broad.

database

(character of length 1) The name of the database that patterns given in parser will apply to. Valid databases include "ncbi", "itis", "eol", "col", "tropicos", "nbn", and "none". "none" will cause no database to be queried; use this to avoid using the internet. NOTE: Only "ncbi" has been tested so far.

allow_na

(logical of length 1) If TRUE, any missing data will be represented as NAs in the output. This preserves the correspondence between the input and output values. Missing data can be generated if the regex does not match the input or online queries fail.

vigilance

(character of length 1) Controls the reporting of possible problems, such as missing data and failed online queries (see allow_na). The following values are possible:

"none": No warnings or errors are generated if the function can complete.
"message": A message is generated when atypical events occur.
"warning": Warnings are generated when atypical events occur.
"error": Errors are generated when atypical events occur, stopping the completion of the function.
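The four vigilance levels map naturally onto R's condition system. The helper below, report_problem, is a hypothetical name invented for this sketch and is not part of the package; it only shows one plausible way such a setting could be dispatched.

```r
# Hypothetical helper: route a problem report according to the vigilance level.
report_problem <- function(text, vigilance = c("warning", "error", "message", "none")) {
  # With no value supplied, match.arg() picks the first choice ("warning").
  vigilance <- match.arg(vigilance)
  switch(vigilance,
         error = stop(text, call. = FALSE),       # halts execution
         warning = warning(text, call. = FALSE),  # continues, flags a warning
         message = message(text),                 # continues, prints a note
         none = invisible(NULL))                  # continues silently
}

# Usage: catch the condition to see which level was triggered.
outcome <- tryCatch(
  report_problem("regex did not match 3 inputs", "warning"),
  warning = function(w) "warned"
)
print(outcome)
```

Only "error" stops the function; the other levels let processing continue, with missing values handled according to allow_na.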

return_match

(logical of length 1) If TRUE, include the part of the input matched by regex in the output object.

return_input

(logical of length 1) If TRUE, include the input in the output object.

redundant_names

(logical of length 1) If TRUE, remove any occurrence of a supertaxon's name at the start of the taxon name. This is useful for removing the redundant genus information in species binomials.
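The kind of clean-up described here can be sketched as follows. strip_supertaxon is a hypothetical helper written for illustration, not a function from the package, and the handling of whitespace after the prefix is an assumption.

```r
# Hypothetical helper: drop a leading supertaxon name from a taxon name.
strip_supertaxon <- function(taxon, supertaxon) {
  has_prefix <- startsWith(taxon, supertaxon)
  # Remove the prefix, then any whitespace left at either end.
  stripped <- trimws(substring(taxon, nchar(supertaxon) + 1))
  # Keep the original name if there is no prefix or nothing would remain.
  ifelse(has_prefix & nzchar(stripped), stripped, taxon)
}

print(strip_supertaxon("Amanita muscaria", "Amanita"))
print(strip_supertaxon("Boletus edulis", "Amanita"))
```

In the first call the genus is removed from the binomial; in the second the name is returned unchanged because it does not start with the supertaxon's name.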

batch_size

(numeric of length 1) The number of IDs to look up at once. This only affects queries using "obs_id". If there is an error looking up an ID, reducing this to 1 can prevent it from ruining the whole batch, but it will take longer.
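The trade-off behind batch_size is easier to see with a small sketch of how IDs might be grouped before lookup. The IDs are made up, and this grouping is an assumption about the general approach rather than the package's actual code.

```r
# Hypothetical observation IDs.
ids <- sprintf("ID%03d", 1:250)
batch_size <- 100

# Assign each ID to a batch number, then split into a list of batches.
batches <- split(ids, ceiling(seq_along(ids) / batch_size))
print(lengths(batches))
```

With batch_size = 1 every ID would sit in its own batch, so one failed lookup affects only that ID, at the cost of many more queries.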

verbosity

(character of length 1) Controls the printing of progress updates. The following values are possible:

"none": No progress reports are printed.
"low": Minimal progress reports of a fixed length are printed.
"high": Lots of information is printed depending on the amount of the input.

Value

Returns an object of type taxmap.

Examples

# NOT RUN {
# Extract embedded classifications from UNITE FASTA file offline
file_path <- system.file("extdata", "unite_general_release.fasta", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
x <- extract_taxonomy(sequences,
                      regex = "^(.*)\\|(.*)\\|(.*)\\|.*\\|(.*)$",
                      key = c(seq_name = "obs_info", seq_id = "obs_info",
                              other_id = "obs_info", "class"),
                      class_regex = "^(.*)__(.*)$",
                      class_key = c(unite_rank = "taxon_info", "name"),
                      class_sep = ";")
# Look up taxonomic data online using sequence ID
# This might take a while. The speed is dependent on NCBI's servers. 
file_path <- system.file("extdata", "ncbi_basidiomycetes.fasta", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
y <- extract_taxonomy(sequences,
                      regex = "^.*\\|(.*)\\|.*\\|(.*)\\|(.*)$",
                      key = c(gi_no = "obs_info", "obs_id", desc = "obs_info"),
                      database = "ncbi")
# }