callGenotypes: Clustering and calling of genotypes

Description

Clusters each marker into the most likely combination from a sequence of allowed cluster combinations. These may include multisite variants from polyploid genomes

Usage

setGenoOptions(largeSample = FALSE, snpPerArrayLim = 0.8,
    arrayPerSnpLim = 0.8, ploidy = "tetra",
    filterLim = 0, detectLim = 0.8,
    wSpreadLim = suggestGeno(largeSample)$wSpreadLim,
    devCentLim = 0.35,
    hwAlpha = suggestGeno(largeSample)$hwAlpha,
    probsIndSE = suggestGeno(largeSample)$probsIndSE,
    afList = seq(0, 0.5, 0.05),
    clAlpha = suggestGeno(largeSample)$clAlpha,
    rPenalty = 2,
    rotationLim = suggestGeno(largeSample)$rotationLim,
    minClLim = 5, nSdOverlap = 2,
    minBin = suggestGeno(largeSample)$minBin,
    binWidth = suggestGeno(largeSample)$binWidth) 
callGenotypes(BSRed,
    gO = setGenoOptions(largeSample = ncol(BSRed) > 250))
callGenotypes.interactive(BSRed,
    gO = setGenoOptions(largeSample = ncol(BSRed) > 250))
callGenotypes.verboseTest(BSRed, singleMarker = 1,
    gO = setGenoOptions(largeSample = ncol(BSRed) > 250))

Arguments

largeSample

Logical indicating whether or not a large sample is genotyped, which influences the values of some default genotyping options. More than 2-300 arays may be considered large

snpPerArrayLim

Minimum ratio of non-NA markers per array. Failing arrays are disregarded

arrayPerSnpLim

Minimum ratio of non-NA arrays per marker. Failing markers are disregarded

ploidy

Character string indicating which genotyping classes to allow (see generatePolyCenters)

filterLim

Markers with range of assayData theta below this value are classified as MONO-filt and are not subjected to clustering. This option will speed up the analysis, however the calls and the statistics

detectLim

Ratio of called samples below which markers are classified as FAIL

wSpreadLim

The maximum allowed within cluster standard deviation along assayData theta

devCentLim

The maximum allowed difference in theta between expected and estimated cluster centres

hwAlpha

Significance level used in Hardy-Weinberg testing. Used to detect where the clustering fails, and a low value (e.g. 1e-10 - 1e-40) may be used to allow natural deviation from HW. This criterion should be used with caution, especially when t

probsIndSE

Numeric quantile for which markers with higher standard errors are excluded from the clustering step (however not discarded)

afList

Numeric vector of values within [0, 0.5], denoting which B allele frequencies to test in the case of two segregating paralogs (see below)

clAlpha

Numeric significance level for Hotelling's $T^2$ test (see below)

rPenalty

Scaling for the ordinate axis spanned by assayData intensity (see below). A high value means the influence of the intensity in the clustering is decreased

rotationLim

Controls the maximum allowed tilt of clusters, as defined by Hotelling's ellipses (see below)

minClLim

Clusters below this minimum size are disregarded in the Hotelling's test (all animals included in cluster)

nSdOverlap

Numeric multiple of assayData theta standard deviations defining within cluster spread

minBin

When estimating initial center points for the clustering, histogram peaks below this numeric value are set to zero

binWidth

Starting value for histogram bin-width used to find initial center points. Smaller clusters are detected with smaller bin-width

BSRed

"AlleleSetIllumina" object, holding a required phenoData column noiseIntensity (see preprocessBeadSet)

List of options and cut-off limits used in the clustering (see setGenoOptions)

singleMarker

Index to a specific marker to test

Value

Output from setGenoOptions is a list with all defined or suggested options needed for genotype calling Output from callGenotypes is an "AlleleSetIllumina" object with an extra assayData entry:
callMatrix with numeric values {0, 1/4, 1/2, 3/4, 1}, each indicating the ratio of B alleles for a marker in a sample
The "AlleleSetIllumina" output from callGenotypes.interactive in addition holds the assayData entries:
ped.checkLogical matrix with FALSE indicating pedigree violations among offspring. Parental values are all NA (see validateCallsPedigree)
ped.check.parentsMatrix of numeric values {0, 1, 2}, denoting no error, offspring error, and parent error, respectively (see validateSingleCall)
The following columns are added to featureData:
ClassificationGenotype call from automatic clustering
Cent.DeviationLargest distance from cluster-centre to its ideal position
Within.SDLargest within-cluster spread
HW.Chi2Chi-squared statistic from test of Hardy-Weinberg equilibrium
HW.PProbability of Hardy-Weinberg equilibrium
BAF.Locus1Estimated B-allele frequency
BAF.Locus2Estimated BAF of second paralogue (if exists)
Call.RateRatio of samples assigned to clusters
Manual.Calls.RCalls from interactive clustering (if applied)
Output from callGenotypes.verboseTest is a list containing
callThe calls from each attempted cluster-combination
fDataStatistics from each cluster-combination
fMetadataExplanation to each statistic
testTable of PASS or FAIL for each test

Details

Initially, arrays with a fraction of non-NA markers less than gO$snpPerArrayLim, and markers with a fraction of non-NA arrays less than gO$arrayPerSnpLim, are discarded from the analysis. Also, markers with range of assayData theta below gO$filterLim are called MONO-filt and are not subjected to clustering. Genotype calling for each marker is based on k-means clustering in the two dimensions defined by assayData entries intensity and theta, where intensity is penalized by gO$rPenalty times its median value. The value of gO$ploidy sets the allowed cluster combinations through the function generatePolyCenters. Data points below the average value of the phenoData column noiseIntensity across included arrays are set to missing, and samples with standard errors exceeding the quantile given by gO$probsIndSE are left out from the clustering step. Two or three of the most likely cluster combinations are estimated with the function getCenters. This function uses histograms to find starting values for the clustering, with input parameters gO$minBin defining a peak-height filter and gO$binWidth setting an initial bin-width. The ranked cluster combinations are tested in turn until one is found which passes the criteria below.

First, gO$devCentLim controls the maximum distance a cluster is allowed to deviate from its expected or ideal position. Second, gO$wSpreadLim limits how large the within cluster standard deviation can be. Both these criteria are tested in the theta-dimension, and if either fails, the algorithm will attempt to cluster the next most likely configuration and start over.

The next test is for Hardy-Weinberg (HW) equilibrium. This is a very powerful test to detect if the clustering has failed, given HW can be assumed. Otherwise, set the significance value gO$hwAlpha to zero to allow any deviation from HW. At duplicated loci, the observed B allele-frequency (BAF) is in fact the mean BAF across both paralogues. Several candidate values of BAF for one paralogue are set in gO$afList, such that the BAF for the other paralogue is given. Several values of gO$afList within [0, 0.5] are tested for HW, and the most likely BAF at both paralogs are those resulting in the highest p-value. Another, somewhat less powerful, quality control test is to look for overlapping clusters. If a cluster centre on the theta-axis is denoted Cent, and its spread is given by Within.SD, clusters are tested for overlap using Cent +/- gO$nSdOverlap * Within.SD

Markers passing the quality control are subjected to a Hotelling's $T^2$-test (Hotelling, 1931; Gidskehaug et al., 2007), which effectively superimposes an ellipse on each cluster and discards all data points falling outside its boundaries or within overlapping ellipses. Due to inaccurate ellipses for small clusters, only clusters with more than gO$minClLim members are tested. The extents of the clusters are controlled by the significance level gO$clAlpha of the $T^2$-tests in such a way that a small value yield large ellipses. The orientation of an ellipse is defined by the ratio of variances in the variance-/covariance matrix from the Hotelling's test. If the ratio of the theta-variance to the intensity-variance is given by Sratio, the weighted sum of Sratio across clusters is not allowed to exceed gO$rotationLim. Finally, the call rate is required to exceed gO$detectLim, or the algorithm will move on to the next most likely cluster combination. If a marker passes all the tests for one of the candidate cluster combinations, the genotype is called. Otherwise the marker is failed (i.e. called FAIL).

callGenotypes is the main genotype calling function. Before genotyping thousands of markers, it is important spend a little time tuning the options in gO. A few hundred markers may be called using the default settings, and plotted using plotGenotypes. Individual markers whith questionable calls can be investigated more closely using callGenotypes.verboseTest. This function performs all the tests described above, for all suggested cluster combinations, without failing any of the tests. Rather, all the test results are plotted and returned, revealing which steps can be taken to improve the clustering, if any. In these plots, green dots are the estimated centre points, and red and orange dots represent samples outside cluster boundaries or within overlapping ellipses, respectively. Several markers representing each of the possible genotype categories should be investigated in this way before commencing with genotyping the full set of markers.

For some markers with highly overlapping or inaccurate clusters, interactive clustering might be the last option. This can be performed using the function callGenotypes.interactive (requires package rggobi to be installed). This function will loop through a smaller set of markers, allowing the user to define clusters manually. Pedigree checking is performed inside the function, and a phenoData column PedigreeID is required (see validateCallsPedigree). Use the function assignToAlleleSet to incorporate the manual results into a larger "AlleleSetIllumina" object.

References

L. Gidskehaug, E. Anderssen, A. Flatberg, and B. K. Alsberg (2007) A framework for significance analysis of gene expression data using dimension reduction methods. BMC Bioinformatics 8:346 H. Hotelling (1931) The Generalization of Student's Ratio. Ann. Math. Stat. 2(3):360-378

Examples

Run this code

#Read 25 markers into an AlleleSetIllumina object
rPath <- system.file("extdata", package="beadarrayMSV")
normOpts <- setNormOptions()
dataFiles <- makeFilenames('testdata',normOpts,rPath)
beadFile <- paste(rPath,'beadData_testdata.txt',sep='/')
beadInfo <- read.table(beadFile,sep='\t',header=TRUE,as.is=TRUE)
BSRed <- createAlleleSetFromFiles(dataFiles[1:4],markers=1:25,beadInfo=beadInfo)

#Genotype calling and validation
gO <- setGenoOptions(largeSample=TRUE)
BSRed <- callGenotypes(BSRed, gO=gO)
BSRed <- validateCallsPedigree(BSRed)
sumClass <- tapply(rep(1,nrow(BSRed)),fData(BSRed)$Classification,sum)
print(sumClass)

#Plot default setting calls
plotGenotypes(BSRed)

#Tune settings to call an initially failed marker
dev.new()
verboseRes <- callGenotypes.verboseTest(BSRed, gO=gO, singleMarker=23)
print(verboseRes$fData)
print(verboseRes$test)
gO <- setGenoOptions(largeSample=TRUE, wSpreadLim=.1, hwAlpha=1e-50)
verboseRes <- callGenotypes.verboseTest(BSRed, gO=gO, singleMarker=23)

#New settings give (likely incorrect) SNP-call
BSRed <- callGenotypes(BSRed, gO=gO)
BSRed <- validateCallsPedigree(BSRed)
dev.new()
plotGenotypes(BSRed)

Run the code above in your browser using DataLab