callVariantsSingle: Single sample variant calling

Description

A simple single sample variant calling function (calling SNVs and deletions)

Usage

callVariantsSingle( data, sampledata, samples = sampledata$Sample, errorRate = 0.001, minSupport = 2, minAF = 0.05, minStrandSupport = 1, mergeDels = TRUE, aggregator = mean)

Arguments

data

A list with elements Counts (a 4d integer array of size [1:12, 1:2, 1:k, 1:n]), Coverage (a 3d integer array of size [1:2, 1:k, 1:n]), Deletions (a 3d integer array of size [1:2, 1:k, 1:n]), Reference (a 1d integer vector of size [1:n]) -- see Details.

sampledata

A data.frame with k rows (one for each sample) and columns Column and (Sample. The tally file should contain this information as a group attribute, see getSampleData for an example.

samples

The samples on which variants should be called, by default all samples specified in sampledata are used

errorRate

The expected error rate of the sequencing technology that was used, for illumina this should be 1/1000

minSupport

minimal support required for a position to be considered variant

minAF

minimal allelic frequency for an allele at a position to be cosidered a variant

minStrandSupport

minimal per-strand support for a position to be considered variant

mergeDels

Boolean flag to specify that adjacent deletion calls should be merged

aggregator

Aggregator function for merging statistics of adjacent deletion calls, defaults to mean, which means that a deletion larger than 1bp will be annotated with the means of the counts and coverages etc.

Value

This function returns a data.frame containing annotated calls with the following slots:
ChromThe chromosome the potential variant / deletion is on
StartThe starting position of the variant / deletion
EndThe end position of the variant / deletions (equal to Start for SNVs and single basepair deletions)
SampleThe sample in which the variant was called
altAlleleThe alternate allele for SNVs (deletions will have a "-" in that slot)
refAlleleThe reference allele for SNVs (deletions will have the deleted sequence here as extracted from the Reference dataset, if the tally file contains a sparse representation of the reference, i.e. only positions with mismatches show a reference value the missing values are substituted with "N"'s. It is strongly suggested to write the whole reference into the tally file prior to deletion calling - see writeReference for details)
SupFwdSupport for the variant in the sample on the forward strand
SupRevSupport for the variant in the sample on the reverse strand
CovFwdCoverage of the variant position in the sample on the forward strand
CovRevCoverage of the variant position in the sample on the reverse strand
AF_FwdAllele frequency of the variant in the sample on the forward strand
AF_RevAllele frequency of the variant in the sample on the reverse strand
SupportTotal Support of the variant - i.e. SupFwd + SupRev
CoverageTotal Coverage of the variant position - i.e. CovFwd + CovRev
AFTotal allele frequency of the variant, i.e. Support / Coverage
fBackgroundBackground frequency of the variant in all samples but the one the variant is called in
pErrorFwdProbablity of the observed support and coverage given the error rate on the forward strand
pErrorRevProbablity of the observed support and coverage given the error rate on the reverse strand
pErrorProbablity of the observed support and coverage given the error rate on both strands combined
pErrorCoverage of the variant position in the Control sample on the forward strand
pStrandp-Value of a fisher.test on the contingency matrix matrix(c(CovFwd,CovRev,SupFwd,SupRev), nrow = 2) at this position - low values could indicate strand bias

Details

data is a list of datasets which has to at least contain the Counts and Coverages for variant calling respectively Deletions for deletion calling (if Deletions is not present no deletion calls will be made). This list will usually be generated by a call to the h5dapply function in which the tally file, chromosome, datasets and regions within the datasets would be specified. See h5dapply for specifics.

callVariantsSingle implements a simple single sample variant callign approach for SNVs and deletions (if Deletions is a dataset present in the data parameter. The function applies three essential filters to the provided data, requiring: - minSupport total support for the variant at the position - minStrandSupport support for the variant on each strand - an allele freqeuncy of at least minAF (for pure diploid samples this can be set relatively high, e.g. 0.3, for calling potentially homozygous variants a value of 0.8 or higher might be used) Calls are annotated with the p-Value of a binom.test of the present support and coverage given the error rate provided in the errorRate parameter, no filtering is done on this annotation. Adjacent deletion calls are merged based in the value of the mergeDels parameter and their statistics are aggregated with the function supplied in the aggregator parameter.

Examples

Run this code

library(h5vc) # loading library
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 1000
  vars <- h5dapply( # Calling Variants
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 500,
    FUN = callVariantsSingle,
    sampledata = sampleData,
    names = c("Coverages", "Counts", "Reference", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )
  vars <- do.call( rbind, vars ) # merge the results from all blocks by row
  vars # We did find a variant

Run the code above in your browser using DataLab