This function reads a modified FASTQ file (e.g. one created by samtools fastq -T MM,ML
from a BAM basecalled with a modification-capable model in Dorado or Guppy) into a dataframe.
By default, the dataframe contains columns for unique read id (read), sequence (sequence),
sequence length (sequence_length), quality (quality), comma-separated (via vector_to_string())
modification types present in each read (modification_types), and for each modification type,
a column of comma-separated modification locations (<type>_locations) and
a column of comma-separated modification probabilities (<type>_probabilities).
Modification locations are the 1-based indices along the read at which modification was assessed,
e.g. a 3 indicates that the third base in the read was assessed for modifications of the given type.
Modification probabilities give the probability that the given modification is present, encoded as
an integer from 0 to 255, where integer \(N\) represents the probability interval from \(\frac{N}{256}\)
to \(\frac{N+1}{256}\).
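As a minimal sketch of this encoding in base R (the variable names here are illustrative and not part of the package), each integer can be mapped to the bounds of its probability interval:

```r
# Illustrative integer-encoded probabilities, as might appear in a <type>_probabilities column
encoded <- c(0, 128, 255)

# Lower and upper bounds of the probability interval for each integer N
lower <- encoded / 256
upper <- (encoded + 1) / 256

# If a single value per call is needed, the interval midpoint is one possible
# choice (an assumption for illustration, not something the function computes)
midpoint <- (encoded + 0.5) / 256

data.frame(encoded, lower, upper, midpoint)
```

Under this encoding, 0 covers the lowest interval starting at probability 0 and 255 covers the highest interval ending at probability 1.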
To extract the numbers from these columns as numeric vectors for analysis, use string_to_vector(), e.g.
list_of_locations <- lapply(test_01$`C+h?_locations`, string_to_vector). Be aware that the SAM
modification types often contain special characters, meaning the column name may need to be enclosed in
backticks as in this example. Alternatively, use extract_methylation_from_dataframe() to
create a list of locations, probabilities, and lengths ready for visualisation in
visualise_methylation(). This works with any modification type extracted by this function;
just provide the relevant column name when calling extract_methylation_from_dataframe().
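The column layout and backtick quoting described above can be sketched with a toy dataframe. This is an illustration only: the data values are made up, and parse_numbers() is a hypothetical base-R stand-in for string_to_vector(), used so that the example runs without the package installed.

```r
# Toy dataframe mimicking the documented column layout (values are made up)
reads <- data.frame(
    read = c("read_1", "read_2"),
    sequence = c("ACGCGT", "CCGA"),
    stringsAsFactors = FALSE
)
reads$`C+h?_locations` <- c("2,4", "1,2")
reads$`C+h?_probabilities` <- c("250,12", "0,255")

# Hypothetical stand-in for string_to_vector(): split on commas, convert to numeric
parse_numbers <- function(x) as.numeric(strsplit(x, ",", fixed = TRUE)[[1]])

# Backticks are required because the column names contain "+" and "?"
list_of_locations <- lapply(reads$`C+h?_locations`, parse_numbers)
list_of_probabilities <- lapply(reads$`C+h?_probabilities`, parse_numbers)

# Locations are 1-based, so they index directly into the bases of each read
assessed_bases <- mapply(
    function(sequence, locations) strsplit(sequence, "")[[1]][locations],
    reads$sequence, list_of_locations,
    SIMPLIFY = FALSE, USE.NAMES = FALSE
)
```

Each element of list_of_locations and list_of_probabilities is a numeric vector for one read, and assessed_bases holds the bases found at those locations, which is a quick sanity check that the locations point at the expected nucleotide.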
Optionally (by specifying debug = TRUE), the dataframe will also contain columns of
the raw MM and ML tags (<MM/ML>_raw) and of the same tags with the initial label
trimmed off (<MM/ML>_tags). This is not recommended in most situations but may help
with debugging unexpected issues, as it preserves the raw data exactly as it appears in the FASTQ.
Dataframes produced by this function can be written back to modified FASTQ via write_modified_fastq().