Learn R Programming

h5vc (version 2.6.3)

callVariantsSingle: Single sample variant calling

Description

A simple single sample variant calling function (calling SNVs and deletions)

Usage

callVariantsSingle( data, sampledata, samples = sampledata$Sample, errorRate = 0.001, minSupport = 2, minAF = 0.05, minStrandSupport = 1, mergeDels = TRUE, aggregator = mean)

Arguments

data
A list with elements Counts (a 4d integer array of size [1:12, 1:2, 1:k, 1:n]), Coverage (a 3d integer array of size [1:2, 1:k, 1:n]), Deletions (a 3d integer array of size [1:2, 1:k, 1:n]), Reference (a 1d integer vector of size [1:n]) -- see Details.
sampledata
A data.frame with k rows (one for each sample) and columns Column and (Sample. The tally file should contain this information as a group attribute, see getSampleData for an example.
samples
The samples on which variants should be called, by default all samples specified in sampledata are used
errorRate
The expected error rate of the sequencing technology that was used, for illumina this should be 1/1000
minSupport
minimal support required for a position to be considered variant
minAF
minimal allelic frequency for an allele at a position to be cosidered a variant
minStrandSupport
minimal per-strand support for a position to be considered variant
mergeDels
Boolean flag to specify that adjacent deletion calls should be merged
aggregator
Aggregator function for merging statistics of adjacent deletion calls, defaults to mean, which means that a deletion larger than 1bp will be annotated with the means of the counts and coverages etc.

Value

  • This function returns a data.frame containing annotated calls with the following slots:
  • ChromThe chromosome the potential variant / deletion is on
  • StartThe starting position of the variant / deletion
  • EndThe end position of the variant / deletions (equal to Start for SNVs and single basepair deletions)
  • SampleThe sample in which the variant was called
  • altAlleleThe alternate allele for SNVs (deletions will have a "-" in that slot)
  • refAlleleThe reference allele for SNVs (deletions will have the deleted sequence here as extracted from the Reference dataset, if the tally file contains a sparse representation of the reference, i.e. only positions with mismatches show a reference value the missing values are substituted with "N"'s. It is strongly suggested to write the whole reference into the tally file prior to deletion calling - see writeReference for details)
  • SupFwdSupport for the variant in the sample on the forward strand
  • SupRevSupport for the variant in the sample on the reverse strand
  • CovFwdCoverage of the variant position in the sample on the forward strand
  • CovRevCoverage of the variant position in the sample on the reverse strand
  • AF_FwdAllele frequency of the variant in the sample on the forward strand
  • AF_RevAllele frequency of the variant in the sample on the reverse strand
  • SupportTotal Support of the variant - i.e. SupFwd + SupRev
  • CoverageTotal Coverage of the variant position - i.e. CovFwd + CovRev
  • AFTotal allele frequency of the variant, i.e. Support / Coverage
  • fBackgroundBackground frequency of the variant in all samples but the one the variant is called in
  • pErrorFwdProbablity of the observed support and coverage given the error rate on the forward strand
  • pErrorRevProbablity of the observed support and coverage given the error rate on the reverse strand
  • pErrorProbablity of the observed support and coverage given the error rate on both strands combined
  • pErrorCoverage of the variant position in the Control sample on the forward strand
  • pStrandp-Value of a fisher.test on the contingency matrix matrix(c(CovFwd,CovRev,SupFwd,SupRev), nrow = 2) at this position - low values could indicate strand bias

Details

data is a list of datasets which has to at least contain the Counts and Coverages for variant calling respectively Deletions for deletion calling (if Deletions is not present no deletion calls will be made). This list will usually be generated by a call to the h5dapply function in which the tally file, chromosome, datasets and regions within the datasets would be specified. See h5dapply for specifics.

callVariantsSingle implements a simple single sample variant callign approach for SNVs and deletions (if Deletions is a dataset present in the data parameter. The function applies three essential filters to the provided data, requiring: - minSupport total support for the variant at the position - minStrandSupport support for the variant on each strand - an allele freqeuncy of at least minAF (for pure diploid samples this can be set relatively high, e.g. 0.3, for calling potentially homozygous variants a value of 0.8 or higher might be used) Calls are annotated with the p-Value of a binom.test of the present support and coverage given the error rate provided in the errorRate parameter, no filtering is done on this annotation. Adjacent deletion calls are merged based in the value of the mergeDels parameter and their statistics are aggregated with the function supplied in the aggregator parameter.

Examples

Run this code
library(h5vc) # loading library
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 1000
  vars <- h5dapply( # Calling Variants
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 500,
    FUN = callVariantsSingle,
    sampledata = sampleData,
    names = c("Coverages", "Counts", "Reference", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )
  vars <- do.call( rbind, vars ) # merge the results from all blocks by row
  vars # We did find a variant

Run the code above in your browser using DataLab