reduce.space: Reduce the space of potential species by fitting the mixture model with all potential species as categories

Description

Having the generative probabilities from step1 (generative.prob() or generative.prob.nucl()), we could proceed directly with the PT MCMC to explore the state space. Typically the number of total potential species is large. Therefore we reduce the size of the state-space, by decreasing the number of species to the low hundreds. We achieve this by fitting a Mixture Model with as many categories as all the potential species. Post fitting, we retain only the species categories that are not empty, that is categories that have at least one read assigned to them.

reduce.space.explicit is the same function as reduce.space but with more involved syntax.

Usage

reduce.space(step1, read.cutoff = 1, EMiter = 500, seed = 1)
reduce.space.explicit(pij.sparse.mat, ordered.species, read.weights,
  outDir, gen.prob.unknown, read.cutoff = 1, EMiter = 500, seed = 1)

Arguments

step1

list. The output from generative.prob() (or generative.prob.nucl(), that is the first step of the pipeline. Alternatively, it can be a character string containing the path name of the ".RData" file where step1 list was saved.

read.cutoff

numeric vector. This is the used to decide which species to retain for the subsequent MCMC exploration. Default value is 1, i.e keep all species that have at least one read assigned to them. If this number is still in the low thousands as opposed to the low hundreds the user may set this to a higher number, such as 10.

EMiter

Number of iterations for the EM algorithm. Default value is 500.

seed

Optional argument that sets the random seed (default is 1) to make results reproducible.

pij.sparse.mat

sparse Matrix of generative probabilities computed by generative.prob() / generative.prob.nucl().

ordered.species

data.frame with potential species ordered by numbers of reads matching them. Computed by generative.prob().

read.weights

data.frame mapping each read identifier to a weight. For contigs the weight is the number of reads that were used to assemble it. For unassembled reads the weight is equal to one.

outDir

character vector holding the path to the output directory where the results are written.

gen.prob.unknown

numeric vector. This is the generative probability for the unknown category. Default value for BLASTx-analysis is 1e-06 while for BLASTn-analysis is 1e-20.

Value

step2: A list with six elements. The first one (ordered.species) is a data.frame containing all the non-empty species categories, as decided by the all inclusive mixture model, ordered by the number of reads assigned to them. The second one (pij.sparse.mat) is a sparse matrix with the generative probability between each read and each species. read.weights, gen.prob.unknown, outDir are all carried forward from the "step1" object. Finally outputEM which records the species abundances throughout the EM iterations (not used in step3 and step4).

Examples

Run this code

# NOT RUN {
## See vignette for more details.

# }
# NOT RUN {
# Either load the object created by previous step
data(step1)  ## example output of step1, i.e generative.prob()
step2 <- reduce.space(step1=step1)

# or alternatively point to the location of the step1.RData object
step2 <- reduce.space(step1="/pathtoFile/step1.RData")
# }

Run the code above in your browser using DataLab