snp.imputation: Calculate imputation rules

Description

Given two set of SNPs typed in the same subjects, this function calculates rules which can be used to impute one set from the other in a subsequent sample. The function can also calculate rules for imputing each SNP in a single dataset from other SNPs in the same dataset

Usage

snp.imputation(X, Y, pos.X, pos.Y, phase=FALSE, try=50, stopping=c(0.95, 4, 0.05), use.hap=c(1.0, 0.0), em.cntrl=c(50,0.01,10,0.01), minA=5)

Arguments

An object of class "SnpMatrix" or "XSnpMatrix" containing observations of the SNPs to be used for imputation ("predictor SNPs")

An object of same class as X containing observations of the SNPs to be imputed in a future sample ("target SNPs"). If this argument is missing, then target SNPs are also drawn from X

pos.X

The positions of the predictor SNPs. Can be missing if there is no Y argument and the columns of X are in genome position order

pos.Y

The positions of the target SNPs. Only required when a Y argument is present

phase

See "Details" below

try

The number of potential predictor SNPs to be considered in the stepwise regression procedure around each target SNP . The nearest try predictor SNPs to each target SNP will be considered

stopping

Parameters of the stopping rule for the stepwise regression (see below)

use.hap

Parameters to control use of the haplotype imputation method (see below)

em.cntrl

Parameters to control test for convergence of EM algorithm for fitting phased haplotypes (see below)

minA

A minimum data quantity measure for estimating pairwise linkage disequilibrium (see below)

Value

An object of class "ImputationRules".

Details

The routine first carries out a series of step-wise least-square regression analyses in which each Y SNP is regressed on the nearest try predictor (X) SNPs. If phase is TRUE, the regressions will be calculated at the chromosome (haplotype) level, variances being simply $p(1-p)$ and covariances estimated from the estimated two-locus haplotypes (this option is not yet implemented). Otherwise, the analysis is carried out at the genotype level based on conventional variance and covariance estimates using the "pairwise.complete.obs" missing value treatment (see cov). New SNPs are added to the regression until either (a) the value of $R^2$ exceeds the first parameter of stopping, (b) the number of "tag" SNPs has reached the maximum set in the second parameter of stopping, or (c) the change in $R^2$ does not achieve the target set by the third parameter of stopping. If the third parameter of stopping is NA, this last test is replaced by a test for improvement in the Akaike information criterion (AIC).

After choosing the set of "tag" SNPs in this way, a prediction rule is generated either by calculating phased haplotype frequencies, either (a) under a log-linear model for linkage disequilibrium with only first order association terms fitted, or (b) under the "saturated" model. These methods do not differ if there is only one tag SNP but, otherwise, choice between methods is controlled by the use.hap parameters. If the prediction, as measure by $R^2$ achieved with the log-linear smoothing model exceeds a threshold (the first parameter of use.hap) then this method is used. Otherwise, if the gain in $R^2$ achieved by using the second method exceeds the second parameter of use.hap, then the second method is used. Current experience is that, the log-linear method is rarely preferred with reasonable choices for use.hap, and imputation is much faster when the second method only is considered. The current default ensures that this second method is used, but the other possibility might be considered if imputing from very small samples; however this code is not extensively tested and should be regarded as experimental.

The argument em.cntrl controls convergence testing for the EM algorithm for fitting haplotype frequencies and the IPF algorithm for fitting the log-linear model. The first parameter is the maximum number of EM iterations, and the second parameter is the threshold for the change in log likelihood below which the iteration is judged to have converged. The third and fourth parameters give the maximum number of IPF iterations and the convergence tolerance. There should be no need to change the default values.

All SNPs selected for imputation must have sufficient data for estimating pairwise linkage disequilibrium with each other and with the target SNP. The statistic chosen is based on the four-fold tables of two-locus haplotype frequencies. If the frequencies in such a table are labelled $a, b, c$ and $d$ then, if $ad>bc$ then $t = min(a,d)$ and, otherwise, $t = min(b,c)$. The cell frequencies $t$ must exceed minA for all pairwise comparisons.

References

Chapman J.M., Cooper J.D., Todd J.A. and Clayton D.G. (2003) Human Heredity, 56:18-31. Wallace, C. et al. (2010) Nature Genetics, 42:68-71

Examples

Run this code

# Remove 5 SNPs from a datset and derive imputation rules for them
data(for.exercise)
sel <- c(20, 1000, 2000, 3000, 5000)
to.impute <- snps.10[,sel]
impute.from <- snps.10[,-sel]
pos.to <- snp.support$position[sel]
pos.fr <- snp.support$position[-sel]
imp <- snp.imputation(impute.from, to.impute, pos.fr, pos.to)

Run the code above in your browser using DataLab