dada: High resolution sample inference from amplicon data.

Description

The dada function takes as input dereplicated amplicon sequencing reads and returns the inferred composition of the sample (or samples). Put another way, dada removes all sequencing errors to reveal the members of the sequenced community. If dada is run in selfConsist=TRUE mode, the algorithm will infer both the sample composition and the parameters of its error model from the data.

Usage

dada(derep, err, errorEstimationFunction = loessErrfun, selfConsist = FALSE, pool = FALSE, ...)

Arguments

derep

(Required). A derep-class object, the output of derepFastq. A list of such objects can be provided, in which case each will be denoised with a shared error model.

err

(Required). 16xN numeric matrix. Each entry must be between 0 and 1.

The matrix of estimated rates for each possible nucleotide transition (from sample nucleotide to read nucleotide). Rows correspond to the 16 possible transitions (t_ij) indexed such that 1:A->A, 2:A->C, ..., 16:T->T Columns correspond to quality scores. Typically there are 41 columns for the quality scores 0-40. However, if USE_QUALS=FALSE, the matrix must have only one column. If selfConsist = TRUE, err can be set to NULL and an initial error matrix will be estimated from the data by assuming that all reads are errors away from one true sequence.

errorEstimationFunction

(Optional). Function. Default loessErrfun.

If USE_QUALS = TRUE, errorEstimationFunction(dada()$trans_out) is computed after sample inference, and the return value is used as the new estimate of the err matrix in $err_out. If USE_QUALS = FALSE, this argument is ignored, and transition rates are estimated by maximum likelihood (t_ij = n_ij/n_i).

selfConsist

(Optional). logical(1). Default FALSE.

If selfConsist = TRUE, the algorithm will alternate between sample inference and error rate estimation until convergence. Error rate estimation is performed by errorEstimationFunction. If selfConsist=FALSE the algorithm performs one round of sample inference based on the provided err matrix.

pool

(Optional). logical(1). Default is FALSE.

If pool = TRUE, the algorithm will pool together all samples prior to sample inference. If pool = FALSE, sample inference is performed on each sample individually. This argument has no effect if only 1 sample is provided, and pool does not affect error rates, which are always estimated from pooled observations across samples.

...

(Optional). All dada_opts can be passed in as arguments to the dada() function. See setDadaOpt for a full list and description of these options.

Value

A dada-class object or list of such objects if a list of dereps was provided.

Details

Briefly, dada implements a statiscal test for the notion that a specific sequence was seen too many times to have been caused by amplicon errors from currently inferred sample sequences. Overly-abundant sequences are used as the seeds of new clusters of sequencing reads, and the final set of clusters is taken to represent the denoised composition of the sample. A more detailed explanation of the algorithm is found in two publications:

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP (2015). DADA2: High resolution sample inference from amplicon data. bioRxiv, 024034.
Rosen MJ, Callahan BJ, Fisher DS, Holmes SP (2012). Denoising PCR-amplified metagenome data. BMC bioinformatics, 13(1), 283.

dada depends on a parametric error model of substitutions. Thus the quality of its sample inference is affected by the accuracy of the estimated error rates. selfConsist mode allows these error rates to be inferred from the data. All comparisons between sequences performed by dada depend on pairwise alignments. This step is the most computationally intensive part of the algorithm, and two alignment heuristics have been implemented for speed: A kmer-distance screen and banded Needleman-Wunsch alignmemt. See setDadaOpt.

Examples

Run this code

derep1 = derepFastq(system.file("extdata", "sam1F.fastq.gz", package="dada2"))
derep2 = derepFastq(system.file("extdata", "sam2F.fastq.gz", package="dada2"))
dada(derep1, err=tperr1)
dada(list(sam1=derep1, sam2=derep2), err=tperr1, selfConsist=TRUE)
dada(derep1, err=inflateErr(tperr1,3), BAND_SIZE=32, OMEGA_A=1e-20)