Learn R Programming

h5vc (version 2.6.3)

callVariants: Variant calling

Description

These functions implement various attempts at variant calling.

Usage

callVariantsPaired( data, sampledata, cl = vcConfParams() )

vcConfParams( minStrandCov = 5, maxStrandCov = 200, minStrandAltSupport = 2, maxStrandAltSupportControl = 0, minStrandDelSupport = minStrandAltSupport, maxStrandDelSupportControl = maxStrandAltSupportControl, minStrandInsSupport = minStrandAltSupport, maxStrandInsSupportControl = maxStrandAltSupportControl, minStrandCovControl = 5, maxStrandCovControl = 200, bases = 5:8, returnDataPoints = TRUE, annotateWithBackground = TRUE, mergeCalls = TRUE, mergeAggregator = mean, pValueAggregator = max )

Arguments

data
A list with elements Counts (a 4d integer array of size [1:12, 1:2, 1:k, 1:n]), Coverage (a 3d integer array of size [1:2, 1:k, 1:n]), Deletions (a 3d integer array of size [1:2, 1:k, 1:n]), Reference (a 1d integer vector of size [1:n]) -- see Details.
sampledata
A data.frame with k rows (one for each sample) and columns Type, Column and (SampleGroup or Patient). The tally file should contain this information as a group attribute, see getSampleData for an example.
cl
A list with parameters used by the variant calling functions. Such a list can be produced, for instance, by a call to vcConfParams.
minStrandCov
Minimum coverage per strand in the case sample.
maxStrandCov
Maximum coverage per strand in the case sample.
minStrandCovControl
Minimum coverage per strand in the control sample.
maxStrandCovControl
Maximum coverage per strand in the control sample.
minStrandAltSupport
Minimum support for the alternative allele per strand in the case sample. This should be 1 or higher.
maxStrandAltSupportControl
Maximum support for the alternative allele per strand in the control sample. This should usually be 0.
minStrandDelSupport
Minimum support for the deletion per strand in the case sample. This should be 1 or higher.
maxStrandDelSupportControl
Maximum support for the deletion per strand in the control sample. This should usually be 0.
minStrandInsSupport
Minimum support for the insertion per strand in the case sample. This should be 1 or higher.
maxStrandInsSupportControl
Maximum support for the insertion per strand in the control sample. This should usually be 0.
bases
Indices for subsetting in the bases dimension of the Counts array, 5:8 extracts only those calls made in the middle one of the sequencing cycle bins.
returnDataPoints
Boolean flag to specify that a data.frame with the variant calls should be returned, otherwise only position are returned as a numeric vector. If returnDataPoints == FALSE only the variant positions are returned.
annotateWithBackground
Boolean flag to specify that the background mismatch / deletion frequency estimated from all control samples in the cohort should be added to the output. A simple binomial test will be performed as well. Only usefull if returnDataPoints == TRUE
mergeCalls
Boolean flag to specify that adjacent calls should be merged where appropriate (used by callDeletionsPaired). Only usefull applied if returnDataPoints == TRUE
mergeAggregator
Aggregator function for merging adjacent calls, defaults to mean, which means that a deletion larger than 1bp will be annotated with the means of the counts and coverages
pValueAggregator
Aggregator function for combining the p-values of adjacent calls when merging, defaults to max. Is only applied if annotateWithBackground == TRUE

Value

  • The result is either a list of positions with SNVs / deletions or a data.frame containing the calls themselves which might contain annotations. Adjacent calls might be merged and calls might be annotated with p-values depending on configuration parameters. When the configuration parameter returnDataPoints is FALSE the functions return the positions of potential variants as a list containing one integer vector of positions for each sample, if no positions were found for a sample the list will contain NULL instead. In the case of returnDatapoints == TRUE the functions return either NULL if no poisitions were found or a data.frame with the following slots:
  • ChromThe chromosome the potential variant / deletion is on
  • StartThe starting position of the variant / deletion
  • EndThe end position of the variant / deletions (equal to Start for SNVs and single basepair deletions)
  • SampleThe Case sample in which the variant was observed
  • altAlleleThe alternate allele for SNVs (skipped for deletions, would be "-")
  • refAlleleThe reference allele for SNVs (skipped for deletions since the tally file might not contain all the information necessary to extract it)
  • caseCountFwdSupport for the variant in the Case sample on the forward strand
  • caseCountRevSupport for the variant in the Case sample on the reverse strand
  • caseCoverageFwdCoverage of the variant position in the Case sample on the forward strand
  • caseCoverageRevCoverage of the variant position in the Case sample on the reverse strand
  • controlCountFwdSupport for the variant in the Control sample on the forward strand
  • controlCountRevSupport for the variant in the Control sample on the reverse strand
  • controlCoverageFwdCoverage of the variant position in the Control sample on the forward strand
  • controlCoverageRevCoverage of the variant position in the Control sample on the reverse strand
  • If the annotateWithBackground option is set the following extra columns are returned
  • backgroundFrequencyFwdThe averaged frequency of mismatches / deletions at the position of all samples of type Control on the forward strand
  • backgroundFrequencyRevThe averaged frequency of mismatches / deletions at the position of all samples of type Control on the reverse strand
  • pValueFwdThe p.value of the test binom.test( caseCountFwd, caseCoverageFwd, p = backgroundFrequencyFwd, alternative = "greater")
  • pValueRevThe p.value of the test binom.test( caseCountRev, caseCoverageRev, p = backgroundFrequencyRev, alternative = "greater")
  • The function callDeletionsPaired merges adjacent single-base deletion calls if the option mergeCalls is set to TRUE, in that case the counts and coverages ( e.g. caseCountFwd ) are aggregated using the function supplied in the mergeAggregator option of the configuration list (defaults to mean) and the p-values pValueFwd and pValueFwd (if annotateWithBackground is TRUE), are aggregated using the function supplied in the pValueAggregator option (defaults to max).

Details

data is a list of datasets which has to at least contain the Counts and Coverages for variant calling respectively Deletions for deletion calling. This list will usually be generated by a call to the h5dapply function in which the tally file, chromosome, datasets and regions within the datasets would be specified. See ?h5dapply for specifics. In order for callVariantsPaired to return the correct locations of the variants there must be the h5dapplyInfo slot present in data as well. This is itself a list (being automatically added by h5dapply and h5readBlock respectively) and contains the slots Group (location in the HDF5 file) and Blockstart, which are used to set the chromosome and the genomic positions of variants.

vcConfParams is a helper function that builds a set of variant calling parameters as a list. This list is provided to the calling functions e.g. callVariantsPaired and influences their behavior.

callVariantsPaired implements a simple pairwise variant callign approach applying the filters specified in cl, and might additionally computes an estimate of the background mismatch rate (the mean mismatch rate of all samples labeled as 'Control' in the sampledata and annotate the calls with p-values for the binom.test of the observed mismatch counts and coverage at each of the samples labeled as 'Case'.

Examples

Run this code
library(h5vc) # loading library
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 1000
  vars <- h5dapply( # Calling Variants
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 500,
    FUN = callVariantsPaired,
    sampledata = sampleData,
    cl = vcConfParams(returnDataPoints=TRUE),
    names = c("Coverages", "Counts", "Reference", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )
  vars <- do.call( rbind, vars ) # merge the results from all blocks by row
  vars # We did find a variant

Run the code above in your browser using DataLab