- dataframe
dataframe. Dataframe containing the modification information to write back to a modified FASTQ file. Must have columns for unique read ID, DNA sequence, and at least one set of locations and probabilities for a particular modification type (e.g. 5-methylcytosine).
- filename
character. File to write the modified FASTQ to. Should end with .fastq; any other extension triggers a warning, but the file is still written. If set to NA (default), no file will be output, which may be useful for testing/debugging.
- read_id_colname
character. The name of the column within the dataframe that contains the unique ID for each read. Defaults to "read".
- sequence_colname
character. The name of the column within the dataframe that contains the DNA sequence for each read. Defaults to "sequence".
The values within this column must be DNA sequences e.g. "GGCGGC".
- quality_colname
character. The name of the column within the dataframe that contains the FASTQ quality scores for each read. Defaults to "quality". If scores are not known, can be set to NA to fill in quality with "B".
If not NA, must correspond to a column where the values are the FASTQ quality scores e.g. "$12\">/2C;4:9F8:816E,6C3*," - see fastq_quality_scores.
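
As a rough illustration of what the quality column holds, the sketch below decodes a quality string into numeric scores and builds a "B" placeholder. It assumes the standard Phred+33 FASTQ encoding and assumes the placeholder is one "B" per base; neither detail is stated above, and none of this is the package's own code.

```r
# Assumption: qualities use standard Phred+33 encoding (ASCII code minus 33).
quality <- "$12\">/2C;4:9F8:816E,6C3*,"
phred <- utf8ToInt(quality) - 33L   # one integer score per character

# Under the same assumption, the placeholder character "B" decodes to Q33.
placeholder_score <- utf8ToInt("B") - 33L

# Assumption: with quality_colname = NA, each base gets one "B".
sequence <- "GGCGGC"
filler <- strrep("B", nchar(sequence))
```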
- locations_colnames
character vector. Vector of the names of all columns within the dataframe that contain modification locations. Defaults to c("hydroxymethylation_locations", "methylation_locations").
The values within these columns must be comma-separated strings of indices at which modification was assessed, as produced by vector_to_string(), e.g. "3,6,9,12".
Writing will fail if any of these locations is not an instance of the target base (e.g. "C" for "C+m?"), as the SAMtools tag system does not work otherwise. One consequence is that sequences reversed via merge_methylation_with_metadata() or its helpers cannot be written to FASTQ unless the modification locations are symmetric (e.g. CpG) and offset was set to 1 when reversing (see reverse_locations_if_needed()).
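
The target-base requirement can be checked before writing. The following is a minimal sketch in base R, assuming locations are 1-based indices into the sequence (consistent with "3,6,9,12" pointing at each C in a "GGC" repeat). The variable names are illustrative and not part of the package.

```r
# Illustrative inputs; positions are assumed to be 1-based.
sequence <- "GGCGGCGGCGGC"
locations <- "3,6,9,12"
target_base <- "C"   # e.g. the base named in the "C+m?" prefix

# Parse the comma-separated string back into integer positions.
positions <- as.integer(strsplit(locations, ",", fixed = TRUE)[[1]])

# Extract the base at each assessed position and compare to the target.
bases <- substring(sequence, positions, positions)
all_match <- all(bases == target_base)

# Positions that would cause a failure under this check (none here).
offending <- positions[bases != target_base]
```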
- probabilities_colnames
character vector. Vector of the names of all columns within the dataframe that contain modification probabilities. Defaults to c("hydroxymethylation_probabilities", "methylation_probabilities").
The values within these columns must be comma-separated strings of modification probabilities, as produced by vector_to_string(), e.g. "0,255,128,78".
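
To show how such a probability string can be interpreted, the sketch below parses it and maps each value to a probability range. It assumes the values are 8-bit integers as in the SAM ML tag, where an integer N covers the range N/256 to (N+1)/256; the package may handle this differently.

```r
probabilities <- "0,255,128,78"

# Parse the comma-separated string into integers.
values <- as.integer(strsplit(probabilities, ",", fixed = TRUE)[[1]])

# Assumption: values are 8-bit integers (0 to 255), as in the SAM ML tag.
in_range <- all(values >= 0L & values <= 255L)

# Each integer N represents the probability interval [N/256, (N+1)/256).
lower <- values / 256
upper <- (values + 1) / 256
```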
- modification_prefixes
character vector. Vector of the prefixes to be used for the MM tags specifying modification type. These are usually generated by Dorado/Guppy based on the original modified basecalling settings, and more details can be found in the SAM optional tag specifications. Defaults to c("C+h?", "C+m?").
locations_colnames, probabilities_colnames, and modification_prefixes must all have the same length e.g. 2 if there were 2 modification types assessed.
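
The equal-length requirement means the three vectors are matched up position by position. A small sketch using the documented defaults shows the check and the resulting pairing; the `pairing` table is only illustrative, and position-by-position matching is an assumption drawn from the defaults.

```r
locations_colnames <- c("hydroxymethylation_locations", "methylation_locations")
probabilities_colnames <- c("hydroxymethylation_probabilities", "methylation_probabilities")
modification_prefixes <- c("C+h?", "C+m?")

# All three vectors must have the same length (one entry per modification type).
n <- lengths(list(locations_colnames, probabilities_colnames, modification_prefixes))
same_length <- length(unique(n)) == 1

# Assumed pairing: element i of each vector describes modification type i.
pairing <- data.frame(
  prefix = modification_prefixes,
  locations = locations_colnames,
  probabilities = probabilities_colnames
)
```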
- include_blank_tags
logical. Boolean specifying what to do if a particular read has no assessed locations for a given modification type from modification_prefixes.
If TRUE (default), blank tags will be written e.g. "C+h?;" (whereas a normal, non-blank tag looks like "C+h?,0,0,0,0;"). If FALSE, tags with no assessed locations in that read will not be written at all.
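
The difference between blank and non-blank tags can be sketched as follows. `build_mm_tag()` is a hypothetical helper written for this illustration, not a function from the package. It assumes the numbers after the prefix are counts of unmodified target bases skipped between assessed positions, as in the SAM MM tag.

```r
# Hypothetical helper: formats one MM-style tag from a prefix and skip counts.
build_mm_tag <- function(prefix, skips, include_blank_tags = TRUE) {
  if (length(skips) == 0) {
    # No assessed locations: write a blank tag or nothing at all.
    if (include_blank_tags) paste0(prefix, ";") else ""
  } else {
    paste0(prefix, ",", paste(skips, collapse = ","), ";")
  }
}

# Skip counts for a read where every C was assessed (assumed 1-based positions).
sequence <- "GGCGGCGGCGGC"
positions <- c(3L, 6L, 9L, 12L)
target_positions <- which(strsplit(sequence, "")[[1]] == "C")
skips <- diff(c(0L, match(positions, target_positions))) - 1L

normal_tag <- build_mm_tag("C+h?", skips)
blank_tag <- build_mm_tag("C+h?", integer(0), include_blank_tags = TRUE)
omitted_tag <- build_mm_tag("C+h?", integer(0), include_blank_tags = FALSE)
```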
- return
logical. Boolean specifying whether this function should return the FASTQ as a character vector with one element per line of the file. If FALSE (default), it returns invisible(NULL) instead.
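
The interaction between filename and return can be sketched with a small stand-in function. `emit_fastq_lines()` is hypothetical and only mirrors the behaviour described above (optional file output, a warning for non-.fastq names, optional return value). The example record uses a generic FASTQ layout, not the exact header format the package writes.

```r
# Hypothetical stand-in mirroring the documented filename/return behaviour.
emit_fastq_lines <- function(lines, filename = NA, return = FALSE) {
  if (!is.na(filename)) {
    if (!grepl("\\.fastq$", filename)) {
      warning("filename does not end with .fastq; writing anyway")
    }
    writeLines(lines, filename)
  }
  # Return the lines only when asked; otherwise return NULL invisibly.
  if (return) lines else invisible(NULL)
}

# Generic four-line FASTQ record used purely for illustration.
record <- c("@read1", "GGCGGC", "+", "BBBBBB")

returned <- emit_fastq_lines(record, return = TRUE)   # no file, lines returned
silent <- emit_fastq_lines(record)                    # no file, NULL returned

path <- tempfile(fileext = ".fastq")
emit_fastq_lines(record, filename = path)             # file written, nothing returned
```

Setting filename to NA and return to TRUE is a convenient combination for tests, since the output can be inspected without touching the file system.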