mergePairsByID: Merge forward and reverse reads after DADA denoising, even if reads were not originally ordered together.

Description

This function attempts to merge each pair of denoised forward and reverse reads, rejecting any which do not sufficiently overlap or which contain too many (>0 by default) mismatches in the overlap region. Note: This function does not assume that the fastq files for the forward and reverse reads were in the same order. If they are already in the same order, use mergePairs.

Usage

mergePairsByID(dadaF, derepF, srF, dadaR, derepR, srR, minOverlap = 20, maxMismatch = 0, returnRejects = FALSE, idRegExpr = c("\\s.+$", ""), includeCol = character(0), justConcatenate = FALSE, verbose = FALSE)

Arguments

dadaF

(Required). A dada-class object. The output of dada() function on the forward reads.

derepF

(Required). A derep-class object. The derep-class object returned by derepFastq() that was used as the input to the dada-class object passed to the dadaF argument.

srF

(Required). The trimmed and filtered forward reads that you used as input for derepFastq. More generally, this is an object that inherits from the ShortRead-class. In most cases this will be ShortReadQ-class. Objects from this class are the result of readFastq. Alternatively, this can be a character string that provides the path to your forward reads fastq file.

dadaR

(Required). A dada-class object. The output of dada() function on the reverse reads.

derepR

(Required). A derep-class object. See derepF description, but for the reverse reads.

srR

(Required). See srF description, but in this case provide for the reverse reads.

minOverlap

(Optional). A numeric(1) of the minimum length of the overlap (in nucleotides) required for merging the forward and reverse reads. Default is 20.

maxMismatch

(Optional). A numeric(1) of the maximum mismatches allowed in the overlap region. Default is 0 (i.e. only exact matches in the overlap region are accepted).

returnRejects

(Optional). A logical(1). Default is FALSE. If TRUE, the pairs that that were rejected based on mismatches in the overlap region are retained in the return data.frame.

idRegExpr

(Optional). A length 2 character() vector. This is passed along in order as the first two arguments to a gsub call that defines how each read id is parsed. The default is c("\s.+$", ""), which is a gsub directive to keep the id string from the beginning up to but not including the first space. For some sequencing platforms and/or read ID schemes, an alternative parsing of the IDs may be appropriate.

includeCol

(Optional). character. Default is character(0). The returned data.table will include columns with names specified by the dada-class$clustering data.frame.

justConcatenate

(Optional). NOT CURRENTLY SUPPORTED. logical(1), Default FALSE. If TRUE, the forward and reverse-complemented reverse read are concatenated rather than merged, with a NNNNNNNNNN (10 Ns) spacer inserted between them.

verbose

(Optional). logical(1) indicating verbose text output. Default FALSE.

Value

A data.frame with a row for each unique pairing of forward/reverse denoised sequences, and the following columns:

$abundance: Number of reads corresponding to this forward/reverse combination.
$sequence: The merged sequence.
$forward: The index of the forward denoised sequence.
$reverse: The index of the reverse denoised sequence.
$nmatch: Number of matches nts in the overlap region.
$nmismatch: Number of mismatches in the overlap region.
$nindel: Number of indels in the overlap region.
$prefer: The sequence used for the overlap region. 1=forward; 2=reverse.
$accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise.
$...: Additional columns specified in propagateCol.

Details

Not yet implemented: Use of the concatenate option will result in concatenating forward and reverse reads without attempting a merge/alignment step.

Examples

Run this code

# For the following example files, there are two ways to merge denoised directions.
# Because the read sequences are in order, `mergePairs()` works.
# `mergePairsByID` always works,
# because it uses the read IDs to match denoised pairs.
exFileF = system.file("extdata", "sam1F.fastq.gz", package="dada2")
exFileR = system.file("extdata", "sam1R.fastq.gz", package="dada2")
srF = ShortRead::readFastq(exFileF)
srR = ShortRead::readFastq(exFileR)
derepF = derepFastq(exFileF)
derepR = derepFastq(exFileR)
dadaF <- dada(derepF, err=tperr1, errorEstimationFunction=loessErrfun, selfConsist=TRUE)
dadaR <- dada(derepR, err=tperr1, errorEstimationFunction=loessErrfun, selfConsist=TRUE)
# Run and compare
ex1time = system.time({
ex1 <- mergePairs(dadaF, derepF, dadaR, derepR, verbose = TRUE)
    ex1 <- data.table::data.table(ex1)
 })
ex1time
# The new function, based on read IDs.
ex2time = system.time({
  ex2 = dada2:::mergePairsByID(dadaF = dadaF, derepF = derepF, srF = srF,
                       dadaR = dadaR, derepR = derepR, srR = srR, verbose = TRUE)
})
ex2time
# Compare results (should be identical)
ex2[(accept)]
data.table::setkey(ex2, sequence)
ex2[(accept), list(abundance = sum(abundance)), by = sequence]
# Same sequence set (exactly)
setequal(x = ex1$sequence,
         y = ex2[(accept)]$sequence)
# Test concatenation functionality
ex1cattime = system.time({
ex1cat <- mergePairs(dadaF, derepF, dadaR, derepR, justConcatenate = TRUE, verbose = TRUE)
sapply(ex1cat, class)
  # need to convert to a character
  ex1cat$sequence <- unlist(ex1cat$sequence)
  ex1cat <- data.table::data.table(ex1cat)
})
ex1cattime
ex2cattime = system.time({
  ex2cat <- dada2:::mergePairsByID(dadaF = dadaF, derepF = derepF, srF = srF,
                           dadaR = dadaR, derepR = derepR, srR = srR,
                           justConcatenate = TRUE, verbose = TRUE)
})
ex2cattime
ex2cat[(accept)]
# Compare results (should be identical)
data.table::setkey(ex1cat, sequence)
ex1cat[(accept), list(abundance = sum(abundance)), by = sequence]
data.table::setkey(ex2cat, sequence)
ex2cat[(accept), list(abundance = sum(abundance)), by = sequence]
# Same sequence set (exactly)
setequal(x = ex1cat$sequence,
         y = ex2cat$sequence)
intersect(x = ex1cat$sequence,
          y = ex2cat$sequence)
ex1cat[, nchar(sequence)]
ex2cat[, nchar(sequence)]