util.blast: Functions to Work with BLAST Output Files

Description

Summarize the representation of taxa in a BLAST tabular output file, after filtering for similarity, E-value, and number of hits.

Usage

count.taxa(file, gi.taxid, taxid.phylum, similarity=30, evalue=1e-5,
    max.hits=10, min.query=0, min.phylum=0, min.taxon=0)

Arguments

file

character, name of BLAST tabular output file.

gi.taxid

list, first component is sequence identifiers (gi numbers), second is taxon ids.

taxid.phylum

dataframe, with columns taxid (taxon id), phylum (name of phylum), species (name of species).

similarity

numeric, hits above this similarity score are kept.

evalue

character, hits below this evalue are kept.

max.hits

numeric, up to this many hits are processed for each query sequence.

min.query

numeric, query sequence is counted if a single phylum makes up this fraction of hits.

min.phylum

numeric, phylum total abundance is kept if it is above this value.

min.taxon

numeric, this taxon is kept if it makes up at least this fraction of total.

Value

A dataframe with one row for each query sequence that remains after the filtering operations. The columns are query, subject (i.e., sequence id or gi number), similarity, evalue, taxid, phylum and species.

Details

The BLAST (Altschul et al., 1997) tabular output file can be generated using the -m 8 switch to the blastall command. The BLAST results are tied to taxon ids using the gi.taxid list. For this to work, the sequence identifiers in BLAST database (and that show up in the output file) must be the same as those in the gi.taxid list.

Five example files are packages with CHNOSZ to illustrate the use of this function. The files bisonX_vs_refseq39.blastp where X is one of N, R or P are tabular BLAST results for proteins in the Bison Pool Environmental Genome. The BLAST files contain the first hit for each of 2500 query sequences. The target database for the searches was constructed from microbial protein sequences in National Center for Biotechnology Information (NCBI) RefSeq database version 39. gi.taxid.txt is a table that lists all of the sequence identifiers (gi numbers) that appear in the example BLAST files, together with the corresponding taxon ids used in the NCBI databases. taxid.phylum.csv lists all of the taxon ids and the corresponding phylum and species names.

References

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. H., Zhang, Z., Miller, W. and Lipman, D. J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402. http://dx.doi.org/doi:10.1093/nar/25.17.3389

Joint Genome Institute, 2007. Bison Pool Environmental Genome. Protein sequence files downloaded from IMG/M (http://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=FindGenomes&page=findGenomes) on 2009-05-13.

Examples

Run this code

# process some of the BLAST output for proteins
   # from Bison Pool metagenome (JGI, 2007)
   # read the file that connects taxids with the sequence identifier
   tfile <- system.file("gi.taxid.txt",package="CHNOSZ")
   gi.taxid <- scan(tfile,what=as.list(character(2)),flush=TRUE)
   # read the file that connects names with the taxids
   nfile <- system.file("taxid.phylum.csv",package="CHNOSZ")
   taxid.phylum <- read.csv(nfile)
   # the BLAST files
   bfile <- paste("bison",c("N","R","P"),"_vs_refseq39.blastp",sep="")
   for(i in 1:3) {
     file <- system.file(bfile[i],package="CHNOSZ")
     # process the blast file -- get taxon names
     b <- count.taxa(file,gi.taxid,taxid.phylum,min.taxon=2^-6)
     # count each of the phyla
     bd <- as.matrix(sapply(unique(b$phylum),function(x) (length(which(x==b$phylum)))))
     # make a matrix -- each column for a different file
     if(i==1) bardata <- bd else {
       bardata <- merge(bardata,bd,all=TRUE,by="row.names")
       rownames(bardata) <- bardata$Row.names
       bardata <- bardata[,-1]
     }
   }
   # normalize the counts
   bardata[is.na(bardata)] <- 0
   bardata <- t(t(bardata)/colSums(bardata))
   colnames(bardata) <- c("N","R","P")
   # make a bar chart
   bp <- barplot(as.matrix(bardata),col=rainbow(nrow(bardata)))
   # add labels to the bars
   names <- substr(row.names(bardata),1,3)
   for(i in 1:3) {
     bd <- bardata[,i]
     ib <- bd!=0
     y <- (cumsum(bd) - bd/2)[ib]
     text(bp[i],y,names[ib])
   }
   title(main=paste("Most Abundant Phyla in a Portion",
     "of the Bison Pool Metagenome",sep=""))

Run the code above in your browser using DataLab