Sequence kernel association tests (SKAT and SKAT-O) on summary statistics
SKAT(scoreFile, geneFile, regions, cor.path = "", annoType = "",
beta.par = c(1, 25), weights.function = ifelse(maf > 0, dbeta(maf,
beta.par[1], beta.par[2]), 0), method = "kuonen", acc = 1e-8,
lim = 1e+6, rho = FALSE, write.file = FALSE)
name of data file generated by prep.score.files()
.
name of a text file listing genes in refFlat format. If not set, hg19 file will be used (see Examples below).
character vector of gene names to be analysed. If not set, function will
attempt to analyse all genes listed in geneFile
.
path to a folder with correlation files (one file per each gene to be analysed).
Names of correlation files should be constructed as "geneName.cor" (e.g. "ABCG1.cor", "ADAMTS1.cor", etc.)
Each file should contain a square matrix with correlation coefficients (r) between genetic variants
of a gene. An example of correlation file format:
"snpname1" "snpname2" "snpname3" ...
"snpname1" 1 0.018 -0.003 ...
"snpname2" 0.018 1 0.081 ...
"snpname3" -0.003 0.081 1 ...
...
One way to generate such file from original genotypes is:
write.table(cor(g), file = paste0(geneName, ".cor"))
where g
is a genotype matrix (nsample x nvariants) for a given gene with genotypes coded as 0, 1, 2
(exactly the same coding that was used to generate betas).
for files annotated with the seqminer
package, a character (or character vector) indicating annotation types to be used (e.g.
"Nonsynonymous", "Start_Loss", "Stop_loss", "Essential_Splice_Site")
two positive numeric shape parameters in the beta distribution to assign weights for each genetic variant as a function of MAF in the default weights function (see Details). Default = c(1, 25).
a function of minor allele frequency (MAF) to assign weights for each genetic variant. By default, the weights will be calculated using the beta distribution (see Details).
the method for computing P value. Available methods are "kuonen", "davies" and "hybrid" (see Details). Default = "kuonen".
accuracy parameter for "davies" method.
limit parameter for "davies" method.
If TRUE the optimal test (SKAT-O) is performed [Lee, et al., 2012]. rho
can be a vector of grid
values from 0 to 1. The default grid is c(0, 0.1^2, 0.2^2, 0.3^2, 0.4^2, 0.5^2, 0.5, 1).
output file name. If specified, output (as it proceeds) will be written to the file.
A list with values:
a data frame containing P values, numbers of variants and polymorphic variants for each of analyzed regions.
If return.variance.explained = TRUE
it contains also the column with marginal amounts of variance explained by each region. If reml = FALSE
the new estimates of heritability (h2) and total variance with corresponding total log-likelihood are also returned.
an object containing the estimates of the null model parameters: heritability (h2), total variance (total.var), estimates of fixed effects of covariates (alpha), the gradient (df), and the total log-likelihood (logLH).
the sample size after omitting NAs.
If return.time = TRUE
a list with running times for null model, regional analysis and total analysis is returned. See proc.time()
for output format.
SKAT uses the linear weighted kernel function to set the inter-individual
similarity matrix \(K = GWWG^T\) for SKAT and \(K = GW(I\rho + (1 - \rho)ee^T)WG^T\) for SKAT-O, where \(\mathit{G}\) is the \(\mathit{n\times p}\)
genotype matrix for \(\mathit{n}\) individuals and \(\mathit{p}\) genetic variants in the region,
\(\mathit{W}\) is the \(\mathit{p\times p}\) diagonal weight matrix, \(I\) is the \(\mathit{p\times p}\) identity matrix, \(\rho\) is pairwise correlation
coefficient between genetic effects (which will be adaptively selected from given rho
), and \(e\) is the \(\mathit{p\times 1}\) vector of ones. Given the shape parameters
of the beta function, beta.par = c(a, b)
,
the weights are defined using probability density function of the beta distribution:
\(W_{i}=(B(a,b))^{^{-1}}MAF_{i}^{a-1}(1-MAF_{i})^{b-1} \),
where \(MAF_{i}\) is a minor allelic frequency for the \(i^{th}\) genetic variant in the region,
which is estimated from genotypes, and \(B(a,b)\) is the beta function. This way of defining weights
is the same as in original SKAT (see [Wu, et al., 2011] for details). beta.par = c(1, 1)
corresponds
to the unweighted SKAT.
Depending on the method option chosen, either Kuonen or Davies method is used to calculate P values from the
score statistics. Both an Applied Statistics algorithm that inverts the characteristic function
of the mixture chisq [Davies, 1980] and a saddlepoint approximation [Kuonen, 1999] are nearly exact,
with the latter usually being a bit faster.
A hybrid approach was recently proposed by Wu et al. [2016]. It uses the Davies' method with high accuracy, and then switches to the saddlepoint approximation method when the Davies' method fails to converge. This approach yields more accurate results in terms of type I errors, especially for small significance levels. However, 'hybrid' method runs several times slower than the saddlepoint approximation method itself (i.e. 'kuonen' method). We therefore recommend using the hybrid approach only for those regions that show significant (or nearly significant) P values to ensure their accuracy.
Davies R.B. (1980) Algorithm AS 155: The Distribution of a Linear Combination of chi-2 Random Variables, Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 29, N 3, P. 323-333. Kuonen D. (1999) Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. Biometrika, Vol. 86, No. 4, P. 929-935. Wu M.C., et al. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet., Vol. 89, P. 82-93. Lee S., et al. (2012) Optimal unified approach for rare variant association testing with application to small sample case-control whole-exome sequencing studies. American Journal of Human Genetics, 91, 224-237. Wu B., et al. (2016) On efficient and accurate calculation of significance p-values for sequence kernel association testing of variant set. Ann Hum Genet, 80(2): 123-135.
# NOT RUN {
## Run SKAT with example files:
VCFfileName <- system.file("testfiles/CFH.scores.anno.vcf.gz",
package = "sumFREGAT")
cor.path <- system.file("testfiles/", package = "sumFREGAT")
out <- SKAT(VCFfileName, region = 'CFH', cor.path = cor.path)
# }
Run the code above in your browser using DataLab