read_modified_fastq: Read modification information from modified FASTQ

Description

This function reads a modified FASTQ file (e.g. created by samtools fastq -T MM,ML from a BAM basecalled with a modification-capable model in Dorado or Guppy) to a dataframe.

By default, the dataframe contains columns for unique read id (read), sequence (sequence), sequence length (sequence_length), quality (quality), comma-separated (via vector_to_string()) modification types present in each read (modification_types), and for each modification type, a column of comma-separated modification locations (<type>_locations) and a column of comma-separated modification probabilities (<type>_probabilities).

Modification locations are the indices along the read at which modification was assessed, e.g. a 3 indicates that the third base in the read was assessed for modifications of the given type. Modification probabilities give the probability that the given modification is present, encoded as an integer from 0 to 255, where integer $N$ represents the probability range from $\frac{N}{256}$ to $\frac{N+1}{256}$.
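As a rough illustration of the 0-255 encoding, the sketch below converts integer scores to the probability range each one represents. The helper name ml_to_probability_range() is hypothetical and is not part of ggDNAvis; it uses base R only, and the midpoint column is just one convenient way to pick a single value from each range.

```r
## Hypothetical helper (not part of ggDNAvis): map ML-style integer
## scores (0-255) to the probability range each score represents.
ml_to_probability_range <- function(n) {
    stopifnot(is.numeric(n), all(n >= 0 & n <= 255))
    data.frame(
        score    = n,
        lower    = n / 256,         # lower bound of the range, N/256
        upper    = (n + 1) / 256,   # upper bound of the range, (N+1)/256
        midpoint = (n + 0.5) / 256  # centre of the range (illustrative choice)
    )
}

## Scores 0, 128 and 255 cover the bottom, middle and top of the scale
ml_to_probability_range(c(0, 128, 255))
```

Under this mapping, a score of 128 covers probabilities from 0.5 upwards, so filtering on scores of 128 or more is one simple way to keep calls with probability of at least 0.5.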

To extract the numbers from these columns as numeric vectors for analysis, use string_to_vector(), e.g. list_of_locations <- lapply(test_01$`C+h?_locations`, string_to_vector). Be aware that SAM modification types often contain special characters, so the colname may need to be enclosed in backticks, as in this example. Alternatively, use extract_methylation_from_dataframe() to create a list of locations, probabilities, and lengths ready for visualisation in visualise_methylation(). This works with any modification type extracted by this function; just provide the relevant colname when calling extract_methylation_from_dataframe().
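The backtick quoting and per-read parsing can be sketched without the package installed. The example below is a base-R stand-in only: the toy dataframe and its values are made up, and split_to_numeric() is a hypothetical substitute for string_to_vector(), assumed here to split on commas and convert to numeric.

```r
## Toy data standing in for read_modified_fastq() output; all values are made up.
## check.names = FALSE keeps the special characters in the column names.
toy <- data.frame(
    read = c("read_1", "read_2"),
    `C+h?_locations` = c("3,6,9", "2,5"),
    `C+h?_probabilities` = c("26,60,255", "0,128"),
    check.names = FALSE,
    stringsAsFactors = FALSE
)

## Hypothetical base-R stand-in for string_to_vector():
## split one comma-separated string and convert the pieces to numeric.
split_to_numeric <- function(x) {
    as.numeric(strsplit(x, ",", fixed = TRUE)[[1]])
}

## Column names containing + and ? must be wrapped in backticks
locations     <- lapply(toy$`C+h?_locations`, split_to_numeric)
probabilities <- lapply(toy$`C+h?_probabilities`, split_to_numeric)

## Locations in the first read whose score is at least 128
locations[[1]][probabilities[[1]] >= 128]
```

Each list element corresponds to one read, so locations and probabilities can be indexed in parallel; in the toy data only position 9 of the first read passes the threshold.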

Optionally (by specifying debug = TRUE), the dataframe will also contain columns of the raw MM and ML tags (<MM/ML>_raw) and of the same tags with the initial label trimmed off (<MM/ML>_tags). This is not recommended in most situations but may help with debugging unexpected issues, as these columns contain the raw data exactly as it appears in the FASTQ.

Dataframes produced by this function can be written back to modified FASTQ via write_modified_fastq().

Usage

read_modified_fastq(filename = file.choose(), debug = FALSE)

Value

dataframe. Dataframe of modification information, as described above.

Sequences can be visualised with visualise_many_sequences() and modification information can be visualised with visualise_methylation() (despite the name, any type of information can be visualised as long as it has locations and probabilities columns).

The dataframe can be written back to modified FASTQ via write_modified_fastq().

Arguments

filename: character. The file to be read. Defaults to file.choose() to select a file interactively.
debug: logical. Boolean value for whether the extra <MM/ML>_tags and <MM/ML>_raw columns should be added to the dataframe. Defaults to FALSE as I can't imagine this is often helpful, but the option is provided to assist with debugging.

Examples

## Locate file
modified_fastq_file <- system.file("extdata",
                                   "example_many_sequences_raw_modified.fastq",
                                   package = "ggDNAvis")

## View the first 16 lines of the file (read once, then print each line)
for (line in readLines(modified_fastq_file, n = 16)) {
    cat(line, "\n")
}

## Read file to dataframe
read_modified_fastq(modified_fastq_file, debug = FALSE)
read_modified_fastq(modified_fastq_file, debug = TRUE)
