generative.prob: Compute generative probabilities from BLAST output

Description

Compute generative probabilities from BLAST output

generative.prob() computes the probability for a read to be generated by a certain species, given that it originates from this species. The input for this function is the tabular BLAST output format, either default or custom. The function uses the recorded mismatches to produce a Poisson probability.

generative.prob.nucl() for when we have nucleotide similarity, i.e we have performed BLASTn.

Usage

generative.prob(blast.output.file = NULL, read.length.file = 80,
  contig.weight.file = 1, gi.taxon.file = NULL, gen.prob.unknown = 1e-06,
  outDir = NULL, blast.default = TRUE)
generative.prob.nucl(blast.output.file = NULL, read.length.file = 80,
  contig.weight.file = 1, gi.taxon.file, gen.prob.unknown = 1e-20,
  outDir = NULL, genomeLength = NULL, blast.default = TRUE)

Arguments

blast.output.file

This is the tabular BLASTx output format for generative.prob(), while it is the tabular BLASTn output format for generative.prob.nucl(). It can either be the default output format or a specific custom output format, incorporating read length and taxon id

read.length.file

This argument can either be a file mapping each read to its length or a numerical value, representing the average read length.

contig.weight.file

This argument can either be a file where weights are assigned to reads and contigs. For unassembled reads the weight is equal to 1 while for contigs the weight should reflect the number of reads that assembled it.

gi.taxon.file

For generative.prob() this would be the 'gi_taxid_prot.dmp' taxonomy file, mapping each protein gi identifier to the corresponding taxon identifier. It can be downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz . For genera

gen.prob.unknown

User-defined generative probability for unknown category. Default value for generative.prob() is 1e-06, while for generative.prob.nucl() is 1e-20.

outDir

Output directory.

blast.default

logical. Is the input the default blast output tabular format? Default value is TRUE

genomeLength

This is applicable only for generative.prob.nucl() . It is a file mapping each genome/nucleotide to its respective length. The file must be tab seperated and the first column the nucleotide gi identifier and the second the corresponding sequence length. I

Value

step1: A list with five elements. The first one (pij.sparse.mat) is a sparse matrix with the generative probability between each read and each species. The second (ordered.species) is matrix containing all the potential species as recorded by BLAST, ordered by the number of reads. The third one (read.weights) is a data.frame mapping each contig to a weight which is essentially the number of reads that were used to assemble it. For unassembled reads the weight is equal to one. The fourth one is the generative probability for the unknown category (gen.prob.unknown), to be used in all subsequent steps. Finally we also record the output directory (outDir) where the results will be stored.

Examples

Run this code

# See vignette for more details

# When using custom BLAST output file
step1 <-generative.prob(blast.output.file = "pathtoFile/blastOut.custom", blast.default=FALSE)

# When using default BLAST output file
step1 <-generative.prob(blast.output.file = "pathtoFile/blastOut.default",
                        read.length.file="pathtoFile/read.lengths",
                        contig.weight.file="pathtoFile/read.weights",
                        gi.taxon.file = "pathtoFile/taxon.file")

Run the code above in your browser using DataLab