test_ctDNA: Tests the ctDNA positivity of a sample

Description

Given a set of reporter mutation, this functions counts the reads matching the reporter mutations in the sample to be tested, estimates the mismatch rate for the sample to be tested, and then runs a Monte Carlo simulation test to determine whether the tested sample is positive or negative.

Usage

test_ctDNA(mutations, bam, targets, reference, tag = "",
  ID_column = NULL, black_list = NULL, substitution_specific = TRUE,
  vaf_threshold = 0.1, min_base_quality = 30, max_depth = 1e+05,
  min_mapq = 40, bam_list = NULL, bam_list_tags = rep("",
  length(bam_list)), min_alt_reads = 1,
  min_samples = ceiling(length(bam_list)/10), n_simulations = 10000,
  pvalue_threshold = 0.05, seed = 123,
  informative_reads_threshold = 10000)

Arguments

mutations

A data frame with the reporter mutations. Should have the columns CHROM, POS, REF, ALT.

bam

path to bam file

targets

a data frame with the target regions. Must have three columns: chr, start and end

reference

the reference genome in BSgenome format

tag

the RG tag if the bam has more than one sample

ID_column

The name of the column that contains the ID of mutations in phase. All mutations in Phase should have the same ID in that column.

black_list

a character vector of genomic loci to filter. The format is chr_pos if substitution_specific is false or chr_pos_ref_alt if substitution_specific is true. If given, will override bam_list.

substitution_specific

logical, whether to have the loci of black_list by substitutions.

vaf_threshold

When calculating the background rate, the bases with higher than this VAF threshold will be ignored (real mutations/SNPs).

min_base_quality

minimum base quality for a read to be counted

max_depth

maximum depth above which sampling will happen

min_mapq

the minimum mapping quality for a read to be counted

bam_list

A vector containing the paths to bam files used to filter mutations. Mutations that have more than min_alt_reads in more than min_samples will be filtered. Using black_list is more recommended.

bam_list_tags

the RG tags for the bams included in bams list.

min_alt_reads

When bam_list is provided, this sets the minimum number of alternative allele reads for a sample to be counted.

min_samples

When number of samples having more than min_alt_reads exceeds this number, the mutation will be filtered.

n_simulations

the number of Monte Carlo simulations.

pvalue_threshold

the p-value threshold used to decide positivity or negativity.

seed

the random seed to make the Monte Carlo simulations reproducible.

informative_reads_threshold

the number of informative reads (unique reads mapping to specified mutations) under which the test will be undetermined.

Value

a data frame with the following columns:

sample: The sample name taken from SM field in the bam file or file base name
n_mutations: The number of mutations used in the test.
n_nonzero_alt: Number of mutations that have at least one read supporting alternative allele.
total_alt_reads: Total number of reads supporting alternative alleles of all mutations in input.
mutations_filtered: The number of filtered mutations.
background_rate: The background rate of the tested sample (after all adjustments)
informative_reads: The number of unique reads covering the mutations used.
multi_support_reads: The number of reads that support more than one mutations in phase. Non-zero values is a sign of positivity not used in the p-value calculation.
pvalue: The empirical p-value from the Monte Carlo test.
decision: The decision can be positive, negative or undetermined.

Details

This is the main function to test minimal residual disease by ctDNA positivity in a follow-up sample (e.g. after treatment). The inputs include a bam file for the follow-up sample to be tested, a list of reporter mutations (detected for example before treatment in a ctDNA positive sample), and an optional black_list (recommended to use) computed from a list of bam files of healthy-like samples or bam_list of the bam_files to use instead of black_list.

The workflow includes the following steps:

1.

Filtering mutations (optional but recommended): The mutations in the input will be filtered removing the ones reported in the black list. If bam_list is provided, the mutations will be filtered according to the min_samples and min_alt_reads parameters. See filter_mutations.

2.

The background rate will be computed for the input bam. The black list will be plugged in to exclude the black-listed loci when computing the background rate. The black list can be substitution_specific or not, but in both cases, the background rate will be adjusted accordingly. See get_background_rate.

3.

Counting reference and alternative alleles of the reporter mutations in the bam file.

4.

Merging mutations in phase (optional but recommended): If the ID_column is specified in the mutations input, mutations with the same ID (in phase) will be merged. While doing so, it is expected that real traces of mutations will be exhibited jointly in the reads spanning phased mutations, whereas artifacts will not show in all the covered mutations in phase. Therefore, the mismatches that map only to a subset of the mutations in phase but not the others (which are covered and show reference allele) will be removed. This will lead to reduction of the observed mismatches, and therefore, the computed background rate will be adjusted accordingly: new rate = old rate * (1 - purification_probability). See merge_mutations_in_phase.

5.

Monte Carlo test: this approach was used by Newman et al., Nature Biotechnology 2016.

Given \(N\) reporter mutations each with depth , randomly sample variant allele reads under the background rate \(p\) using binomial distribution ~ Binom(, \(p\)).
Repeat n_simuations times.
Count the number of simulations, where simulated data equal or exceed observed data in jointly two measurements: (1) the average VAF for the \(N\) mutations, and (2) the number of mutations with non-zero VAF.
Compute an empirical p-value as (#successes + 1)/(#simulations + 1)

6.

Make a decision: If number of informative reads is lower than informative_reads_threshold parameter, the decision will be undetermined. Otherwise, the pvalue_threshold parameters will be used to determine positivity or negativity.

Examples

Run this code

# NOT RUN {
## Load example data
data("mutations", package = "ctDNAtools")
data("targets", package = "ctDNAtools")
bamT1 <- system.file("extdata", "T1.bam", package = "ctDNAtools")
bamT2 <- system.file("extdata", "T2.bam", package = "ctDNAtools")
bamN1 <- system.file("extdata", "N1.bam", package = "ctDNAtools")
bamN2 <- system.file("extdata", "N2.bam", package = "ctDNAtools")
bamN3 <- system.file("extdata", "N3.bam", package = "ctDNAtools")

## Use human reference genome from BSgenome.Hsapiens.UCSC.hg19 library
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19))
# }
# NOT RUN {
## basic usage
test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100
)

## More options
test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, min_base_quality = 20, min_mapq = 30,
  vaf_threshold = 0.05
)

## Use phasing information
test_ctDNA(
  mutations = mutations, bam = bamT2,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, ID_column = "PHASING"
)

## Use a black list based on loci
bg_panel <- create_background_panel(
  bam_list = c(bamN1, bamN2, bamN3),
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  substitution_specific = FALSE
)

bl1 <- create_black_list(bg_panel,
  mean_vaf_quantile = 0.99,
  min_samples_one_read = 2, min_samples_two_reads = 1
)

test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, ID_column = "PHASING", black_list = bl1,
  substitution_specific = FALSE
)
# }
# NOT RUN {
## Use a substitution-specific black list
bg_panel <- create_background_panel(
  bam_list = c(bamN1, bamN2, bamN3),
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  substitution_specific = TRUE
)

bl2 <- create_black_list(bg_panel,
  mean_vaf_quantile = 0.99,
  min_samples_one_read = 2, min_samples_two_reads = 1
)

test_ctDNA(
  mutations = mutations, bam = bamT1,
  targets = targets, reference = BSgenome.Hsapiens.UCSC.hg19,
  n_simulation = 100, ID_column = "PHASING", black_list = bl2,
  substitution_specific = TRUE
)
# }

Run the code above in your browser using DataLab