Merge a dataframe of methylation/modification data (as produced by
read_modified_fastq()) with a dataframe of metadata, reversing
sequence and modification information if required such that all information
is now in the forward direction.
merge_fastq_with_metadata() is the equivalent function for working with
unmodified FASTQs (sequence and quality only).
The methylation/modification dataframe must contain the columns "read" (unique read ID),
"sequence" (DNA sequence), "quality" (FASTQ quality score), "sequence_length"
(read length), "modification_types" (a comma-separated string of SAMtools modification
headers produced via vector_to_string() e.g. "C+h?,C+m?"), and,
for each modification type, a column of comma-separated strings of modification
locations (e.g. "3,6,9,12") and a column of comma-separated strings of
modification probabilities (e.g. "255,0,64,128"). See read_modified_fastq()
for more information on how this dataframe is formatted and produced.
Other columns are allowed but not required, and will be preserved unaltered
in the merged data.
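As a rough illustration of the format described above, the following base R sketch builds a toy dataframe with the required columns and checks that none are missing. The read IDs, sequences, and values are invented for illustration; real input should come from read_modified_fastq(), whose documentation is the authoritative description of the format.

```r
## Toy methylation/modification dataframe (all values are hypothetical)
toy_methylation <- data.frame(
    read               = c("read_1", "read_2"),
    sequence           = c("GGCGGCGGCGGCGGA", "TCCGCCGCCGCCGCC"),
    quality            = c("IIIIIIIIIIIIIII", "IIIIIIIIIIIIIII"),
    sequence_length    = c(15, 15),
    modification_types = c("C+h?,C+m?", "C+h?,C+m?"),
    stringsAsFactors   = FALSE
)

## One locations column and one probabilities column per modification type.
## Assigned with [[ ]] because the names contain "+" and "?".
toy_methylation[["C+h?_locations"]]     <- c("3,6,9,12", "3,6,9,12")
toy_methylation[["C+h?_probabilities"]] <- c("255,0,64,128", "10,20,30,40")
toy_methylation[["C+m?_locations"]]     <- c("3,6,9,12", "3,6,9,12")
toy_methylation[["C+m?_probabilities"]] <- c("0,255,128,64", "40,30,20,10")

## Work out which columns are required from the modification types listed
types <- unique(unlist(strsplit(toy_methylation$modification_types, ",")))
required_columns <- c(
    "read", "sequence", "quality", "sequence_length", "modification_types",
    paste0(types, "_locations"), paste0(types, "_probabilities")
)
missing_columns <- setdiff(required_columns, colnames(toy_methylation))
```

If missing_columns is empty, the dataframe has every column named in the description above.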
The metadata dataframe must contain "read" (unique read ID) and "direction"
(read direction, either "forward" or "reverse" for each read) columns,
and can contain any other columns with arbitrary information for each read.
Columns that might be useful include participant ID and family designations
so that each read can be associated with its participant and family.
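A minimal sketch of a metadata dataframe and of joining on "read" is shown here. It uses base R merge() as a stand-in and is not necessarily how this function performs the join internally; the "participant_id" and "family" column names and all values are hypothetical.

```r
## Toy read data and metadata (all values are hypothetical)
toy_reads <- data.frame(
    read     = c("read_1", "read_2", "read_3"),
    sequence = c("GGCGGCGGA", "TCCGCCGCC", "GGCGGA"),
    stringsAsFactors = FALSE
)
toy_metadata <- data.frame(
    read           = c("read_1", "read_2", "read_3"),
    direction      = c("forward", "reverse", "forward"),
    participant_id = c("P1", "P1", "P2"),   # hypothetical extra column
    family         = c("F1", "F1", "F2"),   # hypothetical extra column
    stringsAsFactors = FALSE
)

## Basic sanity checks on the two required metadata columns
stopifnot(!anyDuplicated(toy_metadata$read))
stopifnot(all(toy_metadata$direction %in% c("forward", "reverse")))

## Join on the shared "read" column; extra metadata columns are carried along
toy_merged <- merge(toy_reads, toy_metadata, by = "read")
```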
Important: A key feature of this function is that it uses the direction
column from the metadata to identify which rows are reverse reads. These reverse
reads will then be reverse-complemented and have modification information reversed
such that all reads are in the forward direction, ideal for consistent analysis or
visualisation. The output columns are "forward_sequence", "forward_quality",
"forward_<modification_type>_locations", and "forward_<modification_type>_probabilities".
Calls reverse_sequence_if_needed(), reverse_quality_if_needed(),
reverse_locations_if_needed(), and reverse_probabilities_if_needed()
to implement the reversing - see documentation for these functions for more details.
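The "reverse only where direction is reverse" idea can be sketched in base R for quality strings and probability strings, assuming that both are simply reversed in order. The helper names reverse_string() and reverse_csv() and the "probabilities" column are hypothetical; the package functions named above are the real implementation and may differ in detail.

```r
## Hypothetical helpers: reverse a string character by character,
## and reverse the order of a comma-separated string
reverse_string <- function(x) paste(rev(strsplit(x, "")[[1]]), collapse = "")
reverse_csv    <- function(x) paste(rev(strsplit(x, ",")[[1]]), collapse = ",")

## Toy merged data (all values are hypothetical)
toy <- data.frame(
    read          = c("read_1", "read_2"),
    direction     = c("forward", "reverse"),
    quality       = c("ABCDE", "ABCDE"),
    probabilities = c("255,0,64,128", "255,0,64,128"),
    stringsAsFactors = FALSE
)

is_reverse <- toy$direction == "reverse"

## Forward reads are copied unchanged; reverse reads are reversed
toy$forward_quality <- toy$quality
toy$forward_quality[is_reverse] <- vapply(
    toy$quality[is_reverse], reverse_string, character(1), USE.NAMES = FALSE
)

toy$forward_probabilities <- toy$probabilities
toy$forward_probabilities[is_reverse] <- vapply(
    toy$probabilities[is_reverse], reverse_csv, character(1), USE.NAMES = FALSE
)
```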
To write reversed sequences to FASTQ via write_modified_fastq(), locations must be
symmetric (e.g. CpG) and reversed_location_offset must be set to 1. Asymmetric locations
cannot be written to modified FASTQ once reversed, because e.g. cytosine methylation would
then be assessed at guanines, which SAMtools can't account for. Symmetrically reversing
CpGs via reversed_location_offset = 1 is the only way to fix this.
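To make the CpG argument concrete, the sketch below reverses a set of locations and looks up which base each one lands on in the reverse-complemented sequence. It assumes that a reversed location is computed as sequence_length + 1 - location - offset; that formula and the helper names are assumptions made for illustration, and the reverse_locations_if_needed() documentation is the authoritative reference.

```r
## Hypothetical helpers, not the package's own functions
reverse_complement_dna <- function(x) {
    paste(rev(strsplit(chartr("ACGT", "TGCA", x), "")[[1]]), collapse = "")
}
reverse_locations <- function(locations, sequence_length, offset = 0) {
    ## Assumed arithmetic: mirror each position, then shift by the offset
    rev(sequence_length + 1 - locations - offset)
}
base_at <- function(sequence, locations) substring(sequence, locations, locations)

## A reverse read whose CpG cytosines sit at positions 3, 6, 9 and 12
reverse_read    <- "TCCGCCGCCGCCGCC"
cpg_c_locations <- c(3, 6, 9, 12)
stopifnot(all(base_at(reverse_read, cpg_c_locations) == "C"))

forward_sequence <- reverse_complement_dna(reverse_read)

## Offset 0: locations land on the G of each CpG in the forward sequence
locations_offset_0 <- reverse_locations(cpg_c_locations, nchar(reverse_read), 0)
bases_offset_0     <- base_at(forward_sequence, locations_offset_0)

## Offset 1: locations are shifted back onto the C of each CpG
locations_offset_1 <- reverse_locations(cpg_c_locations, nchar(reverse_read), 1)
bases_offset_1     <- base_at(forward_sequence, locations_offset_1)
```

Under these assumptions, offset 0 leaves cytosine modification calls pointing at guanines, while offset 1 moves them onto cytosines, which is why only the symmetric, offset-by-1 case can be written back out as cytosine modifications.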
merge_methylation_with_metadata(
    methylation_data,
    metadata,
    reversed_location_offset = 0,
    reverse_complement_mode = "DNA"
)
Value: dataframe. A merged dataframe containing all columns from the input dataframes, as well as forward versions of sequences, qualities, modification locations, and modification probabilities (with separate locations and probabilities columns created for each modification type in the modification data).
methylation_data: dataframe. A dataframe containing methylation/modification data, as produced by read_modified_fastq().
Must contain a read ID column (must be called "read"), a sequence column ("sequence"), a quality column ("quality"), a sequence length column ("sequence_length"), a modification types column ("modification_types"), and, for each modification type listed in modification_types, a column of locations ("<modification_type>_locations") and a column of probabilities ("<modification_type>_probabilities"). Additional columns are fine and will simply be included unaltered in the merged dataframe.
See read_modified_fastq() documentation for more details about the expected dataframe format.
metadata: dataframe. A dataframe containing metadata for each read in methylation_data.
Must contain a "read" column identical to the column of the same name in methylation_data, containing unique read IDs (this is used to merge the dataframes). Must also contain a "direction" column of "forward" and "reverse" (e.g. c("forward", "forward", "reverse")) indicating the direction of each read.
Important: Reverse reads will have their sequence, quality scores, modification locations, and modification probabilities reversed such that every output read is now forward. These will be stored in columns called "forward_sequence", "forward_quality", "forward_<modification_type>_locations", and "forward_<modification_type>_probabilities". If multiple modification types are present, multiple locations and probabilities columns will be created.
See reverse_sequence_if_needed(), reverse_quality_if_needed(), reverse_locations_if_needed(), and reverse_probabilities_if_needed() documentation for details of how the reversing is implemented.
reversed_location_offset: integer. How far modification locations of reversed reads should be shifted. Defaults to 0. This matters because if a CpG is assessed for methylation at the C, then reverse-complementing it gives a methylation score at the G on the reverse-complemented strand. This is the most biologically accurate, but for visualising methylation it may be desirable to shift the locations by 1, i.e. to correspond with the C in the reverse-complemented CpG rather than the G, which allows for consistent visualisation between forward and reverse strands. Integer values other than 0 or 1 will work, but may be biologically misleading and are not recommended.
Highly recommended: if considering using this option, read the reverse_locations_if_needed() documentation to fully understand how it works.
reverse_complement_mode: character. Whether reverse-complemented sequences should be converted to DNA (i.e. A complements to T) or RNA (i.e. A complements to U). Must be either "DNA" or "RNA". Only affects reverse-complemented sequences; sequences that were forward to begin with are not altered.
Uses reverse_complement() via reverse_sequence_if_needed().
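The DNA/RNA distinction can be sketched in base R as follows. reverse_complement_sketch() is a hypothetical stand-in written for illustration, it handles only upper-case A, C, G and T in the input, and it is not the package's reverse_complement(), which may treat other characters differently.

```r
## Hypothetical stand-in for illustration only
reverse_complement_sketch <- function(sequence, mode = "DNA") {
    if (!mode %in% c("DNA", "RNA")) {
        stop("mode must be either \"DNA\" or \"RNA\"")
    }
    ## A complements to T in DNA mode and to U in RNA mode
    complements  <- if (mode == "DNA") "TGCA" else "UGCA"
    complemented <- chartr("ACGT", complements, sequence)
    paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

dna_result <- reverse_complement_sketch("AACG", mode = "DNA")
rna_result <- reverse_complement_sketch("AACG", mode = "RNA")
```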
## Locate files
modified_fastq_file <- system.file("extdata",
                                   "example_many_sequences_raw_modified.fastq",
                                   package = "ggDNAvis")
metadata_file <- system.file("extdata",
                             "example_many_sequences_metadata.csv",
                             package = "ggDNAvis")
## Read files
methylation_data <- read_modified_fastq(modified_fastq_file)
metadata <- read.csv(metadata_file)
## Merge data (including reversing if needed)
merge_methylation_with_metadata(methylation_data, metadata, reversed_location_offset = 0)
## Merge data with offset = 1
merge_methylation_with_metadata(methylation_data, metadata, reversed_location_offset = 1)