import_qiime: Import function to read the now legacy-format QIIME OTU table.

Description

QIIME produces several files that can be directly imported by the phyloseq-package. Originally, QIIME produced its own custom format table that contained both OTU-abundance and taxonomic identity information. This function is still included in phyloseq mainly to accommodate these now-outdated files. Recent versions of QIIME store output in the biom-format, an emerging file format standard for microbiome data. If your data is in the biom-format, if it ends with a .biom file name extension, then you should use the import_biom function instead.

Usage

import_qiime(otufilename = NULL, mapfilename = NULL, treefilename = NULL, refseqfilename = NULL, refseqFunction = readDNAStringSet, refseqArgs = NULL, parseFunction = parse_taxonomy_qiime, verbose = TRUE, ...)

Arguments

otufilename

(Optional). A character string indicating the file location of the OTU file. The combined OTU abundance and taxonomic identification file, tab-delimited, as produced by QIIME under default output settings. Default value is NULL.

mapfilename

(Optional). The QIIME map file is required for processing barcoded primers in QIIME as well as some of the post-clustering analysis. This is a required input file for running QIIME. Its strict formatting specification should be followed for correct parsing by this function. Default value is NULL.

treefilename

(Optional). Default value is NULL. A file representing a phylogenetic tree or a phylo object. Files can be NEXUS or Newick format. See read_tree for more details. Also, if using a recent release of the GreenGenes database tree, try the read_tree_greengenes function -- this should solve some issues specific to importing that tree. If provided, the tree should have the same OTUs/tip-labels as the OTUs in the other files. Any taxa or samples missing in one of the files is removed from all. As an example from the QIIME pipeline, this tree would be a tree of the representative 16S rRNA sequences from each OTU cluster, with the number of leaves/tips equal to the number of taxa/species/OTUs, or the complete reference database tree that contains the OTU identifiers of every OTU in your abundance table. Note that this argument can be a tree object (phylo-class) for cases where the tree has been --- or needs to be --- imported separately, as in the case of the GreenGenes tree mentioned earlier (coderead_tree_greengenes).

refseqfilename

(Optional). Default NULL. The file path of the biological sequence file that contains at a minimum a sequence for each OTU in the dataset. Alternatively, you may provide an already-imported XStringSet object that satisfies this condition. In either case, the names of each OTU need to match exactly the taxa_names of the other components of your data. If this is not the case, for example if the data file is a FASTA format but contains additional information after the OTU name in each sequence header, then some additional parsing is necessary, which you can either perform separately before calling this function, or describe explicitly in a custom function provided in the (next) argument, refseqFunction. Note that the XStringSet class can represent any arbitrary sequence, including user-defined subclasses, but is most-often used to represent RNA, DNA, or amino acid sequences. The only constraint is that this special list of sequences has exactly one named element for each OTU in the dataset.

refseqFunction

(Optional). Default is readDNAStringSet, which expects to read a fasta-formatted DNA sequence file. If your reference sequences for each OTU are amino acid, RNA, or something else, then you will need to specify a different function here. This is the function used to read the file connection provided as the the previous argument, refseqfilename. This argument is ignored if refseqfilename is already a XStringSet class.

refseqArgs

(Optional). Default NULL. Additional arguments to refseqFunction. See XStringSet-io for details about additional arguments to the standard read functions in the Biostrings package.

parseFunction

(Optional). An optional custom function for parsing the character string that contains the taxonomic assignment of each OTU. The default parsing function is parse_taxonomy_qiime, specialized for splitting the ";"-delimited strings and also attempting to interpret greengenes prefixes, if any, as that is a common format of the taxonomy string produced by QIIME.

verbose

(Optional). A logical. Default is TRUE. Should progresss messages be catted to standard out?

...

Additional arguments passed to read_tree

Value

A phyloseq-class object.

Details

Other related files include the mapping-file that typically stores sample covariates, converted naturally to the sample_data-class component data type in the phyloseq-package. QIIME may also produce a phylogenetic tree with a tip for each OTU, which can also be imported specified here or imported separately using read_tree.

See "http://www.qiime.org/" for details on using QIIME. While there are many complex dependencies, QIIME can be downloaded as a pre-installed linux virtual machine that runs ``off the shelf''.

The different files useful for import to phyloseq are not collocated in a typical run of the QIIME pipeline. See the main phyloseq vignette for an example of where ot find the relevant files in the output directory.

References

http://qiime.org/

``QIIME allows analysis of high-throughput community sequencing data.'' J Gregory Caporaso, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D Bushman, Elizabeth K Costello, Noah Fierer, Antonio Gonzalez Pena, Julia K Goodrich, Jeffrey I Gordon, Gavin A Huttley, Scott T Kelley, Dan Knights, Jeremy E Koenig, Ruth E Ley, Catherine A Lozupone, Daniel McDonald, Brian D Muegge, Meg Pirrung, Jens Reeder, Joel R Sevinsky, Peter J Turnbaugh, William A Walters, Jeremy Widmann, Tanya Yatsunenko, Jesse Zaneveld and Rob Knight; Nature Methods, 2010; doi:10.1038/nmeth.f.303

Examples

Run this code

otufile <- system.file("extdata", "GP_otu_table_rand_short.txt.gz", package="phyloseq")
mapfile <- system.file("extdata", "master_map.txt", package="phyloseq")
trefile <- system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq")
import_qiime(otufile, mapfile, trefile)

Run the code above in your browser using DataLab