Convert accession numbers to taxonomy

Introduction

taxonomizr provides simple functions to parse NCBI taxonomy files and accession dumps and use them to efficiently assign taxonomy to accession numbers or taxonomic IDs. This is useful, for example, for assigning taxonomy to BLAST results. All lookups are done locally after downloading the appropriate files from NCBI using the included functions (see below).

Installation

The package is available on CRAN and can be installed with:

install.packages("taxonomizr")

To install the development version directly from GitHub, use the devtools package and run:

devtools::install_github("sherrillmix/taxonomizr")

To use the library, load it in R:

library(taxonomizr)

Preparation

To avoid constant internet access and slow APIs, the first step in using the package is to download all necessary files from NCBI. This uses some disk space but makes later lookups reliable and fast.

Note: There is no need to check for these files manually. Each function checks whether its output is already present and, if so, skips downloading or processing. Delete the local files if you would like to redownload or reprocess them.
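The skip-if-present behaviour described in the note can be pictured with a small base R sketch. The helper name runIfMissing and the temporary file are hypothetical and exist only for illustration; this is not taxonomizr's own code, just one way such a check could look:

```r
# Hypothetical helper illustrating the "skip if the output exists" pattern.
# This is an illustrative sketch, not code from taxonomizr.
runIfMissing <- function(outFile, makeFile) {
  if (file.exists(outFile)) {
    message(outFile, " already exists; skipping")
    return(invisible(FALSE))
  }
  makeFile(outFile)
  invisible(TRUE)
}

# A temporary file stands in for a downloaded NCBI file
demoFile <- tempfile(fileext = ".dmp")
writePlaceholder <- function(f) writeLines("placeholder", f)

firstRun <- runIfMissing(demoFile, writePlaceholder)   # file is created
secondRun <- runIfMissing(demoFile, writePlaceholder)  # file is found, work is skipped

# Deleting the file forces the work to be redone on the next call
unlink(demoFile)
thirdRun <- runIfMissing(demoFile, writePlaceholder)
```

The same idea applies to the real files: removing names.dmp, nodes.dmp or the processed database is what triggers a fresh download or rebuild.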

Download names and nodes

First, download the necessary names and nodes files from NCBI:

getNamesAndNodes()
## [1] "./names.dmp" "./nodes.dmp"

Download accession to taxa files

Then download the accession to taxonomic ID conversion files from NCBI. Note: this is a large download (several gigabytes):

#this is a big download
getAccession2taxid()
## [1] "./nucl_gb.accession2taxid.gz"  "./nucl_est.accession2taxid.gz"
## [3] "./nucl_gss.accession2taxid.gz" "./nucl_wgs.accession2taxid.gz"

If you would also like to identify protein accession numbers, download the prot file from NCBI as well (again, this is a large download):

#this is a big download
getAccession2taxid(types='prot')
## [1] "./prot.accession2taxid.gz"

Convert accessions to database

Then process the downloaded accession files into a more easily accessed form (this could take a while):

read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql')
## Reading nucl_est.accession2taxid.gz.
## Reading nucl_gb.accession2taxid.gz.
## Reading nucl_gss.accession2taxid.gz.
## Reading nucl_wgs.accession2taxid.gz.
## Reading in values. This may take a while.
## Adding index. This may also take a while.
## [1] TRUE

Now everything should be ready for processing. All files are cached locally, so the preparation is only required once (or whenever you would like to update the data).

Assigning taxonomy

Finding taxonomy for NCBI accession numbers

First, load the nodes and names files into memory:

taxaNodes<-read.nodes('nodes.dmp')
taxaNames<-read.names('names.dmp')

Then we are ready to convert NCBI accession numbers to taxonomic IDs. For example, to find the taxonomic IDs associated with NCBI accession numbers "LN847353.1" and "AL079352.3":

taxaId<-accessionToTaxa(c("LN847353.1","AL079352.3"),"accessionTaxa.sql")
print(taxaId)
## [1] 1313 9606
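When the accessions come from a table of BLAST hits, the same accession often appears many times. A minimal base R sketch of carrying the IDs back onto such a table follows. The blastHits data frame and its column names are made up for illustration, the ID values are copied from the output above rather than looked up, and the sketch assumes that accessionToTaxa returns one ID per input accession in the same order, as the output above suggests:

```r
# Accessions and IDs copied from the example output above (no lookup is run here).
# In real use, exampleIds would come from
# accessionToTaxa(accessions, "accessionTaxa.sql").
accessions <- c("LN847353.1", "AL079352.3")
exampleIds <- c(1313, 9606)

# Named vector acting as a lookup table: accession -> taxonomic ID.
# Assumes the IDs are in the same order as the accessions.
idByAccession <- setNames(exampleIds, accessions)

# Hypothetical table of BLAST hits with repeated accessions
blastHits <- data.frame(
  query = c("read1", "read2", "read3"),
  accession = c("AL079352.3", "LN847353.1", "AL079352.3"),
  stringsAsFactors = FALSE
)

# Attach a taxonomic ID to every hit by indexing the lookup table by name
blastHits$taxaId <- unname(idByAccession[blastHits$accession])
print(blastHits)
```

Looking up only the unique accessions and then indexing by name keeps the number of database queries small when a hit table is large.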

And to get the taxonomy for those IDs:

getTaxonomy(taxaId,taxaNodes,taxaNames)
##      superkingdom phylum       class      order            
## 1313 "Bacteria"   "Firmicutes" "Bacilli"  "Lactobacillales"
## 9606 "Eukaryota"  "Chordata"   "Mammalia" "Primates"       
##      family             genus           species                   
## 1313 "Streptococcaceae" "Streptococcus" "Streptococcus pneumoniae"
## 9606 "Hominidae"        "Homo"          "Homo sapiens"
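The printed result looks like a character matrix with taxonomic IDs as row names and ranks as column names. Assuming that structure, the sketch below rebuilds the matrix by hand from the output above (so it runs without the NCBI files) and shows a few ways to pull values out of it:

```r
# Mock of the matrix printed above; the values are copied from that output.
# In real use, this would be the result of getTaxonomy().
ranks <- c("superkingdom", "phylum", "class", "order",
           "family", "genus", "species")
taxa <- matrix(
  c("Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
    "Streptococcaceae", "Streptococcus", "Streptococcus pneumoniae",
    "Eukaryota", "Chordata", "Mammalia", "Primates",
    "Hominidae", "Homo", "Homo sapiens"),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1313", "9606"), ranks)
)

# One rank for every ID
genus <- taxa[, "genus"]

# Every rank for one ID (row names are character, so the ID is quoted)
human <- taxa["9606", ]

# A single lineage string per ID
lineage <- apply(taxa, 1, paste, collapse = "; ")
print(lineage)
```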

Finding taxonomy for taxonomic names

To find the IDs for taxonomic names, use getId:

taxaId<-getId(c('Homo sapiens','Bos taurus','Homo'),taxaNames)
print(taxaId)
## [1] "9606" "9913" "9605"

Again, use getTaxonomy to get the taxonomy for those IDs:

getTaxonomy(taxaId,taxaNodes,taxaNames)
##      superkingdom phylum     class      order      family      genus 
## 9606 "Eukaryota"  "Chordata" "Mammalia" "Primates" "Hominidae" "Homo"
## 9913 "Eukaryota"  "Chordata" "Mammalia" NA         "Bovidae"   "Bos" 
## 9605 "Eukaryota"  "Chordata" "Mammalia" "Primates" "Hominidae" "Homo"
##      species       
## 9606 "Homo sapiens"
## 9913 "Bos taurus"  
## 9605 NA
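In this output, NA marks a rank with no value: no order is reported for Bos taurus, and the genus Homo has no species. A common follow-up is to find the most specific rank that is filled in for each ID. The base R sketch below does this on a hand-built copy of the matrix above. The helper lastAssigned is hypothetical and is not a taxonomizr function:

```r
# Mock of the matrix printed above; the values are copied from that output.
ranks <- c("superkingdom", "phylum", "class", "order",
           "family", "genus", "species")
taxa <- matrix(
  c("Eukaryota", "Chordata", "Mammalia", "Primates", "Hominidae", "Homo", "Homo sapiens",
    "Eukaryota", "Chordata", "Mammalia", NA, "Bovidae", "Bos", "Bos taurus",
    "Eukaryota", "Chordata", "Mammalia", "Primates", "Hominidae", "Homo", NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("9606", "9913", "9605"), ranks)
)

# Hypothetical helper: column index of the last non-NA entry in a row,
# or NA if the whole row is missing
lastAssigned <- function(x) {
  assigned <- which(!is.na(x))
  if (length(assigned) == 0) return(NA_integer_)
  max(assigned)
}

deepestIndex <- apply(taxa, 1, lastAssigned)
deepestRank <- colnames(taxa)[deepestIndex]
# Matrix indexing with (row, column) pairs picks one cell per ID
deepestName <- taxa[cbind(seq_len(nrow(taxa)), deepestIndex)]

summaryTable <- data.frame(
  id = rownames(taxa),
  rank = deepestRank,
  name = deepestName,
  stringsAsFactors = FALSE
)
print(summaryTable)
```

Because the scan takes the last filled column, a gap in the middle of a lineage (such as the missing order for 9913) does not affect the result.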

Version: 0.2.2
License: GPL-2
Maintainer: Scott Sherrill-Mix
Last published: March 9th, 2017

Functions in taxonomizr (0.2.2)

getNamesAndNodes: Download names and nodes files from NCBI
getTaxonomy: Get taxonomic ranks for a taxa
accessionToTaxa: Convert accessions to taxa
lastNotNa: Return last not NA value
condenseTaxa: Condense a taxa table for a single read
read.accession2taxid: Read NCBI accession2taxid files
getAccession2taxid: Download accession2taxid files from NCBI
getId: Find a given taxa by name
trimTaxa: Trim columns from taxa file
streamingRead: Process a large file piecewise
read.names: Read NCBI names file
read.nodes: Read NCBI nodes file