extract_tax_data: Extracts taxonomy info from vectors with regex

Description

Parse taxonomic information in a character vector into a taxmap() object. The location and identity of important information in the input is specified using a regular expression with capture groups and a corresponding key. An object of type taxmap() is returned containing the specified information. See the key option for accepted sources of taxonomic information.

Usage

extract_tax_data(tax_data, key, regex, class_key = "taxon_name",
  class_regex = "(.*)", class_sep = NULL, sep_is_regex = FALSE,
  class_rev = FALSE, database = "ncbi", include_match = FALSE,
  include_tax_data = TRUE)

Arguments

tax_data

A vector from which to extract taxonomy information.

key

(character) The identity of the capturing groups defined using regex. The length of key must be equal to the number of capturing groups specified in regex. Any names added to the terms will be used as column names in the output. Only "info" can be used multiple times. Each term must be one of those described below:

taxon_id: A unique numeric id for a taxon for a particular database (e.g. an NCBI taxonomy ID such as 9689). Requires an internet connection.
taxon_name: The name of a taxon (e.g. "Mammalia" or "Homo sapiens"). Not necessarily unique, but interpretable by a particular database. Requires an internet connection.
class: A list of taxon information that constitutes the full taxonomic classification (e.g. "K_Mammalia;P_Carnivora;C_Felidae"). Individual taxa are separated by the class_sep argument and the information is parsed by the class_regex and class_key arguments.
seq_id: Sequence ID for a particular database that is associated with a taxonomic classification. Currently only works with the "ncbi" database.
info: Arbitrary taxon info you want included in the output. Can be used more than once.

regex

(character of length 1) A regular expression with capturing groups indicating the locations of relevant information. The identity of the information must be specified using the key argument.
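
The pairing between regex and key can be previewed with base R before calling extract_tax_data(). The following is a minimal sketch that only imitates the matching step, assuming the capture groups are assigned to key terms in order; it does not call extract_tax_data() and does not show its internals.

```r
# Base R only: show which substring each capture group in `regex` picks up.
header <- ">id:AB548412-tid:9689-Panthera leo-tax:K_Mammalia;P_Carnivora;C_Felidae;G_Panthera;S_leo"
regex <- "^>id:(.+)-tid:(.+)-(.+)-tax:(.+)$"
key <- c(my_seq = "info", my_tid = "info", org = "info", tax = "class")

# regexec() returns the full match followed by one entry per capture group
parts <- regmatches(header, regexec(regex, header))[[1]]

# Drop the full match and label each group with the name given in `key`
captured <- setNames(parts[-1], names(key))

# The number of capture groups must equal the length of `key`
stopifnot(length(captured) == length(key))
print(captured)
```

A check like this makes it easy to confirm that the number of capture groups matches the length of key before any database is queried.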

class_key

(character) The identity of the capturing groups defined using class_regex. The length of class_key must be equal to the number of capturing groups specified in class_regex. Any names added to the terms will be used as column names in the output. Only "info" can be used multiple times. Each term must be one of those described below:

taxon_name: The name of a taxon. Not necessarily unique.
info: Arbitrary taxon info you want included in the output. Can be used more than once.

class_regex

(character of length 1) A regular expression with capturing groups indicating the locations of data for each taxon in the class term in the key argument. The identity of the information must be specified using the class_key argument. The class_sep option can be used to split the classification into data for each taxon before matching. If class_sep is NULL, each match of class_regex defines a taxon in the classification.
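
When class_sep is NULL, every match of class_regex is treated as one taxon. The sketch below imitates that mode in base R, assuming an unanchored pattern (chosen here for illustration) so that it can match repeatedly within one classification string; it is not the function's own code.

```r
# Base R only: treat each match of an unanchored class_regex as one taxon.
classification <- "K_Mammalia;P_Carnivora;C_Felidae;G_Panthera;S_leo"
class_regex <- "([[:alpha:]])_([[:alpha:]]+)"  # illustrative pattern

# Find every non-overlapping match of class_regex; each match is one taxon
hits <- regmatches(classification, gregexpr(class_regex, classification))[[1]]

# Re-apply the pattern to each match to pull out its capture groups
groups <- regmatches(hits, regexec(class_regex, hits))

# Element 1 is the full match, 2 is the rank code, 3 is the taxon name
taxon_names <- vapply(groups, function(m) m[3], character(1))
print(taxon_names)
```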

class_sep

(character of length 1) Used with the class term in the key argument. The character(s) used to separate individual taxa within a classification. After the string defined by the class capture group in regex is split by class_sep, its capture groups are extracted by class_regex and defined by class_key. If NULL, every match of class_regex is used instead of first splitting by class_sep.
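
The split-then-match sequence described above can be sketched in base R. This is an approximation for checking class_sep, class_regex, and class_key against your data, assuming the pieces are processed one at a time; it is not the function's own implementation.

```r
# Base R only: split a classification, then apply class_regex to each piece.
classification <- "K_Mammalia;P_Carnivora;C_Felidae;G_Panthera;S_leo"
class_sep <- ";"
class_regex <- "^(.+)_(.+)$"
class_key <- c(my_rank = "info", tax_name = "taxon_name")

# Step 1: one string per taxon (fixed = TRUE treats class_sep literally)
taxa <- strsplit(classification, class_sep, fixed = TRUE)[[1]]

# Step 2: extract the capture groups of class_regex from each piece
groups <- regmatches(taxa, regexec(class_regex, taxa))

# Step 3: one row per taxon, one column per class_key term
per_taxon <- do.call(rbind, lapply(groups, function(m) m[-1]))
colnames(per_taxon) <- names(class_key)
print(per_taxon)
```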

sep_is_regex

(TRUE/FALSE) Whether or not class_sep should be used as a regular expression.
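
The difference between a literal and a regular-expression separator can be illustrated with base R's strsplit(). This sketch assumes sep_is_regex behaves analogously to the fixed argument of strsplit(), which this page does not confirm.

```r
# Base R only: the same separator string read literally and as a regex.
classification <- "Mammalia; Carnivora;Felidae"
class_sep <- ";\\s*"  # a semicolon followed by optional whitespace

# Treated as a regular expression: splits on ";" and on "; "
as_regex <- strsplit(classification, class_sep)[[1]]

# Treated literally: the text ";\s*" never occurs, so nothing is split
as_literal <- strsplit(classification, class_sep, fixed = TRUE)[[1]]

print(as_regex)
print(as_literal)
```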

class_rev

(logical of length 1) Used with the class term in the key argument. If TRUE, the order of taxon data in a classification is reversed to be specific to broad.

database

(character of length 1) The name of the database that patterns given in parser will apply to. Valid databases include "ncbi", "itis", "eol", "col", "tropicos", "nbn", and "none". "none" will cause no database to be queried; use this if you do not want to use the internet. NOTE: Only "ncbi" has been tested extensively so far.

include_match

(logical of length 1) If TRUE, include the part of the input matched by regex in the output object.

include_tax_data

(TRUE/FALSE) Whether or not to include tax_data as a dataset.

Value

Returns an object of type taxmap().

Examples

# NOT RUN {
  # For demonstration purposes, the following example dataset has all the
  # types of data that can be used, but any one of them alone would work.
  raw_data <- c(
  ">id:AB548412-tid:9689-Panthera leo-tax:K_Mammalia;P_Carnivora;C_Felidae;G_Panthera;S_leo",
  ">id:FJ358423-tid:9694-Panthera tigris-tax:K_Mammalia;P_Carnivora;C_Felidae;G_Panthera;S_tigris",
  ">id:DQ334818-tid:9643-Ursus americanus-tax:K_Mammalia;P_Carnivora;C_Ursidae;G_Ursus;S_americanus"
  )

  # Build a taxmap object from classifications
  extract_tax_data(raw_data,
                   key = c(my_seq = "info", my_tid = "info", org = "info", tax = "class"),
                   regex = "^>id:(.+)-tid:(.+)-(.+)-tax:(.+)$",
                   class_sep = ";", class_regex = "^(.+)_(.+)$",
                   class_key = c(my_rank = "info", tax_name = "taxon_name"))

  # Build a taxmap object from taxon ids
  # Note: this requires an internet connection
  extract_tax_data(raw_data,
                   key = c(my_seq = "info", my_tid = "taxon_id", org = "info", tax = "info"),
                   regex = "^>id:(.+)-tid:(.+)-(.+)-tax:(.+)$")

  # Build a taxmap object from ncbi sequence accession numbers
  # Note: this requires an internet connection
  extract_tax_data(raw_data,
                   key = c(my_seq = "seq_id", my_tid = "info", org = "info", tax = "info"),
                   regex = "^>id:(.+)-tid:(.+)-(.+)-tax:(.+)$")

  # Build a taxmap object from taxon names
  # Note: this requires an internet connection
  extract_tax_data(raw_data,
                   key = c(my_seq = "info", my_tid = "info", org = "taxon_name", tax = "info"),
                   regex = "^>id:(.+)-tid:(.+)-(.+)-tax:(.+)$")
# }
