Learn R Programming

VariantTools (version 1.14.1)

callGenotypes: Call Genotypes

Description

Calls genotypes from a set of tallies (such as a VRanges or VCF file) and the coverage (currently as a BigWigFile). We call the genotype with the highest likelihood, where the likelihood is based on a binomial model of the variant frequency.

Usage

"callGenotypes"(variants, cov, param = CallGenotypesParam(GmapGenome(unique(genome(variants)))), BPPARAM = defaultBPPARAM()) "callGenotypes"(variants, cov, param = CallGenotypesParam(GmapGenome(unique(genome(variants)))), BPPARAM = defaultBPPARAM()) CallGenotypesParam(genome, gq.breaks = c(0, 5, 20, 60, Inf), p.error = 0.05, which = tileGenome(seqinfo(genome), ntile=ntile), ntile = 100L)

Arguments

variants
Either VRanges as returned by tallyVariants, or a TabixFile object pointing to a VCF file. Typically, these tallies are not filtered by e.g. callVariants, because it would seem more appropriate to filter on the genotype quality.
cov
The coverage, stored as a BigWigFile.
param
Parameters controlling the genotyping, constructed by CallGenotypesParam. The default value uses the genome from variants.
genome
The GmapGenome object used during tallying.
gq.breaks
A numeric vector representing an increasing sequence of genotype quality breaks to segment the wildtype runs.
p.error
The binomial probability for an error. This is used to calculate the expected frequency of hom-ref and hom-alt variants.
which
A GenomicRangesList indicating the genomic regions in which to compute the genotypes. The default is to partition the genome into ntile tiles.
ntile
When which is missing, this indicates the number of tiles to generate from the genome.
BPPARAM
A BiocParallelParam object communicating the parallelization strategy. One job is created per tile.

Value

For callGenotypes, a VRanges annotated with the genotype call information, as described in the details.

Details

In general, the behavior is very similar to that of the GATK UnifiedGenotyper (see references). For every position in the tallies, we compute a binomial likelihood for each of wildtype (0/0), het (0/1) and hom-alt (1/1), assuming the alt allele frequency to be p.error, 0.5 and 1 - p.error, respectively. The genotype with the maximum likelihood is chosen as the genotype, and the genotype quality is computed by taking the fraction of the maximum likelihood over the sum of the three likelihoods.

We assume that any position not present in the input tallies is wildtype (0/0) and compute the quality for every such position, using the provided coverage data. For scalability reasons, we segment runs of these positions according to user-specified breaks on the genotype quality. The segments become special records in the returned VRanges, where the range represents the segment, the ref is the first reference base, alt is and the totalDepth is the mean of the coverage.

The genotype information is recorded as metadata columns named according to gVCF conventions:

GT
The genotype call string: 0/0, 0/1, 1/1.

GQ
The numeric genotype quality, phred scaled. For wildtype runs, this is minimum over the run.

PL
A 3 column matrix with the likelihood for each genotype, phred scaled. We take the minimum over wildtype runs.

MIN_DP
The minimum coverage over a wildtype run; NA for single positions.

References

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 GENOME RESEARCH 20:1297-303.

Examples

Run this code
## generate tallies
library(VariantTools)
bams <- LungCancerLines::LungCancerBamFiles()
tally.param <- TallyVariantsParam(gmapR::TP53Genome(), 
                                  high_base_quality = 23L,
                                  which = gmapR::TP53Which())
tallies <- tallyVariants(bams$H1993, tally.param)
sampleNames(tallies) <- "H1993"
mcols(tallies) <- NULL

## we assume the coverage is already stored as a BigWigFile
bwf <- rtracklayer::export(coverage(bams$H1993), "coverage.bw")

## simple usage
genotypes <- callGenotypes(tallies, bwf, BPPARAM=BiocParallel::SerialParam())

## write to gVCF
writeVcf(genotypes, tempfile("genotypes", fileext="vcf"), index=TRUE)

## customize
params <- CallGenotypesParam(gmapR::TP53Genome(), p.error = 1/1000)
callGenotypes(tallies, bwf, params)

Run the code above in your browser using DataLab