denoise: Run the denoiser pipeline for a sequence read.

Description

This function runs the complete denoising pipeline for a given input sequence and its corresponding name and phred scores. The default behaviour is set to interface with fastq files (standard output for most sequencers).

Usage

denoise(x, ...)
# S3 method for default
denoise(x, ..., name = character(), phred = NULL,
  dir_check = TRUE, double_pass = TRUE, min_match = 100,
  max_inserts = 400, censor_length = 7, added_phred = "*",
  adjust_limit = 5, ambig_char = "N", to_file = FALSE,
  keep_flanks = TRUE, keep_phred = TRUE, outformat = "fastq",
  terminate_rejects = TRUE, outfile = NULL, phred_placeholder = "#",
  aa_check = TRUE, trans_table = 0, frame_offset = 0,
  append = TRUE)

Value

a class object of code "DNAseq"

Arguments

x: a DNA sequence string.
...: additional arguments to be passed between methods.
name: an optional character string. Identifier for the sequence.
phred: an optional character string. The phred score string corresponding to the nucleotide string. If passed then the input phred scores will be modified along with the nucleotides and carried through to the sequence output. Default = NULL.
dir_check: A boolean indicating if both the forward and reverse compliments of a sequence should be checked against the PHMM. Default is TRUE.
double_pass: A boolean indicating if a second pass through the Viterbi algorithm should be conducted for sequences that had leading nucleotides not matching the PHMM. This improves the accurate establishment of reading frame and will reduce false rejections by the amino acid check, but this comes at a cost of additional processing time. Default is TRUE.
min_match: The minimum number of sequential matches to the PHMM for a sequence to be denoised. Otherwise flag the sequence as a reject.
max_inserts: The maximum number of sequention insert states occuring in a sequence (including the flanking regions). If this number is exceeded than the entire read will be discarded if terminate_rejects = TRUE. Default is 400.
censor_length: the number of base pairs in either direction of a PHMM correction to convert to placeholder characters. Default is 7.
added_phred: The phred character to use for characters inserted into the original sequence.
adjust_limit: the maximum number of corrections that can be applied to a sequence read. If this number is exceeded then the entire read is rejected. Default is 3.
ambig_char: The character to use for ambigious positions in the sequence that is output to the file. Default is N.
to_file: Boolean indicating whether the sequence should be written to a file. Default is TRUE.
keep_flanks: Should the regions of the input sequence outside of the barcode region be readded to the denoised sequence prior to outputting to the file. Options are TRUE, FALSE and 'right'. The 'right' option will keep the trailing flank but remove the leading flank. Default is TRUE. False will lead to only the denoised sequence for the 657bp barcode region being output to the file.
keep_phred: Should the original PHRED scores be kept in the output? Default is TRUE.
outformat: The format of the output file. Options are fasta or fastq (default) format.
terminate_rejects: Boolean indicating if analysis of sequences that fail to meet phred quality score or path match thresholds should be terminated early (prior to sequence adjustment and writing to file). Default it true.
outfile: The name of the file to output the data to. Default filenames are respectively: denoised.fasta or denoised.fastq.
phred_placeholder: The character to input for the phred score line. Default is '#'. Used with write_fastq and keep_phred == FALSE only.
aa_check: Boolean indicating whether the amino acid sequence should be generated and assessed for stop codons. Default = TRUE.
trans_table: The translation table to use for translating from nucleotides to amino acids. Default is 0, meaning that censored translation is performed (amigious codons ignored). Used only when aa_check = TRUE.
frame_offset: The offset to the reading frame to be applied for translation. By default the offset is zero, so the first character in the framed sequence is considered the first nucelotide of the first codon. Passing frame_offset = 1 would offset the sequence by one and therefore make the second character in the framed sequence the the first nucelotide of the first codon. Used only when aa_check = TRUE.
append: Should the denoised sequence be appended to the output file?(TRUE) Or should the sequence overwrite the output file?(FALSE) Default is TRUE.

Details

Since the pipeline is designed for recieving or outputting either fasta or fastq data, this function is hevaily paramaterized. Note that not all paramaters will affect all use cases (i.e. if your outformat is to a fasta file, then the phred_placeholder paramater is ignored as this option only pertains to fastq outputs). The user is encouraged to read the vignette for a detailed walkthrough of the denoiser pipeline that will help identify the paramaters that relate to their given needs.

Examples

Run this code

# Denoise example sequence with default paramaters.
ex_data = denoise(example_nt_string_errors, 
                  name = 'example_sequence_1', 
                  keep_phred = FALSE, 
                  to_file = FALSE)

#fastq data from a file
#previously run
fastq_example_file = system.file('extdata/coi_sequel_data_subset.fastq', 
                                 package = 'debar')
data = read_fastq(fastq_example_file)
#denoise the first sequence in the file
#use a custom censor length and no amino acid check
dn_dat_1 = denoise(x = data$sequence[[1]], 
                    name = data$header[[1]], 
                    phred = data$quality[[1]], 
                    censor_length = 11, 
                    aa_check = FALSE, 
                    to_file = FALSE)

Run the code above in your browser using DataLab