SKAT: Sequence kernel association test

Description

Sequence kernel association tests (SKAT and SKAT-O) on summary statistics

Usage

SKAT(scoreFile, geneFile, regions, cor.path = "", annoType = "",
beta.par = c(1, 25), weights.function = ifelse(maf > 0, dbeta(maf,
beta.par[1], beta.par[2]), 0), method = "kuonen", acc = 1e-8,
lim = 1e+6, rho = FALSE, write.file = FALSE)

Arguments

scoreFile

name of data file generated by prep.score.files().

geneFile

name of a text file listing genes in refFlat format. If not set, hg19 file will be used (see Examples below).

regions

character vector of gene names to be analysed. If not set, function will attempt to analyse all genes listed in geneFile.

cor.path

path to a folder with correlation files (one file per each gene to be analysed). Names of correlation files should be constructed as "geneName.cor" (e.g. "ABCG1.cor", "ADAMTS1.cor", etc.) Each file should contain a square matrix with correlation coefficients (r) between genetic variants of a gene. An example of correlation file format: "snpname1" "snpname2" "snpname3" ... "snpname1" 1 0.018 -0.003 ... "snpname2" 0.018 1 0.081 ... "snpname3" -0.003 0.081 1 ... ... One way to generate such file from original genotypes is: write.table(cor(g), file = paste0(geneName, ".cor")) where g is a genotype matrix (nsample x nvariants) for a given gene with genotypes coded as 0, 1, 2 (exactly the same coding that was used to generate betas).

annoType

for files annotated with the seqminer package, a character (or character vector) indicating annotation types to be used (e.g. "Nonsynonymous", "Start_Loss", "Stop_loss", "Essential_Splice_Site")

beta.par

two positive numeric shape parameters in the beta distribution to assign weights for each genetic variant as a function of MAF in the default weights function (see Details). Default = c(1, 25).

weights.function

a function of minor allele frequency (MAF) to assign weights for each genetic variant. By default, the weights will be calculated using the beta distribution (see Details).

method

the method for computing P value. Available methods are "kuonen", "davies" and "hybrid" (see Details). Default = "kuonen".

acc

accuracy parameter for "davies" method.

lim

limit parameter for "davies" method.

rho

If TRUE the optimal test (SKAT-O) is performed [Lee, et al., 2012]. rho can be a vector of grid values from 0 to 1. The default grid is c(0, 0.1^2, 0.2^2, 0.3^2, 0.4^2, 0.5^2, 0.5, 1).

write.file

output file name. If specified, output (as it proceeds) will be written to the file.

Value

A list with values:

results

a data frame containing P values, numbers of variants and polymorphic variants for each of analyzed regions. If return.variance.explained = TRUE it contains also the column with marginal amounts of variance explained by each region. If reml = FALSE the new estimates of heritability (h2) and total variance with corresponding total log-likelihood are also returned.

nullmod

an object containing the estimates of the null model parameters: heritability (h2), total variance (total.var), estimates of fixed effects of covariates (alpha), the gradient (df), and the total log-likelihood (logLH).

sample.size

the sample size after omitting NAs.

time

If return.time = TRUE a list with running times for null model, regional analysis and total analysis is returned. See proc.time() for output format.

Details

SKAT uses the linear weighted kernel function to set the inter-individual similarity matrix \(K = GWWG^T\) for SKAT and \(K = GW(I\rho + (1 - \rho)ee^T)WG^T\) for SKAT-O, where \(\mathit{G}\) is the \(\mathit{n\times p}\) genotype matrix for \(\mathit{n}\) individuals and \(\mathit{p}\) genetic variants in the region, \(\mathit{W}\) is the \(\mathit{p\times p}\) diagonal weight matrix, \(I\) is the \(\mathit{p\times p}\) identity matrix, \(\rho\) is pairwise correlation coefficient between genetic effects (which will be adaptively selected from given rho), and \(e\) is the \(\mathit{p\times 1}\) vector of ones. Given the shape parameters of the beta function, beta.par = c(a, b), the weights are defined using probability density function of the beta distribution:

\(W_{i}=(B(a,b))^{^{-1}}MAF_{i}^{a-1}(1-MAF_{i})^{b-1} \),

where \(MAF_{i}\) is a minor allelic frequency for the \(i^{th}\) genetic variant in the region, which is estimated from genotypes, and \(B(a,b)\) is the beta function. This way of defining weights is the same as in original SKAT (see [Wu, et al., 2011] for details). beta.par = c(1, 1) corresponds to the unweighted SKAT. Depending on the method option chosen, either Kuonen or Davies method is used to calculate P values from the score statistics. Both an Applied Statistics algorithm that inverts the characteristic function of the mixture chisq [Davies, 1980] and a saddlepoint approximation [Kuonen, 1999] are nearly exact, with the latter usually being a bit faster. A hybrid approach was recently proposed by Wu et al. [2016]. It uses the Davies' method with high accuracy, and then switches to the saddlepoint approximation method when the Davies' method fails to converge. This approach yields more accurate results in terms of type I errors, especially for small significance levels. However, 'hybrid' method runs several times slower than the saddlepoint approximation method itself (i.e. 'kuonen' method). We therefore recommend using the hybrid approach only for those regions that show significant (or nearly significant) P values to ensure their accuracy.

References

Davies R.B. (1980) Algorithm AS 155: The Distribution of a Linear Combination of chi-2 Random Variables, Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 29, N 3, P. 323-333. Kuonen D. (1999) Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. Biometrika, Vol. 86, No. 4, P. 929-935. Wu M.C., et al. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet., Vol. 89, P. 82-93. Lee S., et al. (2012) Optimal unified approach for rare variant association testing with application to small sample case-control whole-exome sequencing studies. American Journal of Human Genetics, 91, 224-237. Wu B., et al. (2016) On efficient and accurate calculation of significance p-values for sequence kernel association testing of variant set. Ann Hum Genet, 80(2): 123-135.

Examples

Run this code

# NOT RUN {
## Run SKAT with example files:
VCFfileName <- system.file("testfiles/CFH.scores.anno.vcf.gz",
	package = "sumFREGAT")
cor.path <- system.file("testfiles/", package = "sumFREGAT")
out <- SKAT(VCFfileName, region = 'CFH', cor.path = cor.path)

# }

Run the code above in your browser using DataLab