assocTest: Perform Association Test

Description

Method for performing a kernel-based association test given a genotype, VCF file, or kernel matrix

Usage

## S3 method for class 'GenotypeMatrix,NullModel':
assocTest(Z, model, ranges,
          kernel=c("linear.podkat", "localsim.podkat",
                   "quadratic.podkat", "linear.SKAT",
                   "localsim.SKAT", "quadratic.SKAT"),
          width=1000, weights=NULL, weightFunc=betaWeights(),
          method=NULL, adj=c("automatic", "none", "force"),
          pValueLimit=0.05)
## S3 method for class 'matrix,NullModel':
assocTest(Z, model, method=NULL,
          adj=c("automatic", "none", "force"), pValueLimit=0.05)
## S3 method for class 'TabixFile,NullModel':
assocTest(Z, model, ranges,
          kernel=c("linear.podkat", "localsim.podkat",
                   "quadratic.podkat", "linear.SKAT",
                   "localsim.SKAT", "quadratic.SKAT"),
          cl=NULL, nnodes=1, batchSize=20,
          noIndels=TRUE, onlyPass=TRUE, na.limit=1, MAF.limit=1,
          na.action=c("impute.major", "omit"),
          MAF.action=c("invert", "omit", "ignore"),
          sex=NULL, weightFunc=betaWeights(), width=1000,
          method=NULL, adj=c("automatic", "none", "force"),
          pValueLimit=(0.1 / length(ranges)), tmpdir=tempdir(),
          displayProgress=TRUE)
## S3 method for class 'character,NullModel':
assocTest(Z, model, ...)

Arguments

an object of class GenotypeMatrix, a quadratic kernel matrix, an object of class TabixFile, or a character string with a file name

model

an object of class NullModel

ranges

an object with genomic regions to be tested; may be an object of class GRanges or GRangesList. If missing, assocTest takes the whole genotype matrix or the genotypes in the VCF file as a whole.

kernel

determines the kernel that should be used for association testing (see Subsection 9.2 of the package vignette for details)

width

tolerance radius parameter for position-dependent kernels linear.podkat, quadratic.podkat, and localsim.podkat; must be single positive numeric value; ignored for kernels linear.SKAT, quadratic.SKAT, and localsim.SKAT (see Subsection 9.2 of the package vignette for details)

weights

for the method with signature GenotypeMatrix,NullModel, it is also possible to supply weights directly as a numeric vector that is as long as the number of columns of Z. In this case, the argument weightFunc is ignored. Use NULL (default) to use automatic weighting with the function supplied as argument weightFunc. If weightFunc is NULL too, no weighting takes place, i.e. an unweighted kernel is used.

weightFunc

function for computing variant weights from minor allele frequencies (MAFs); see weightFuncs for weighting and Subsection 9.3 of the package vignette for functions provided by the podkat package. Use NULL for unweighted kernels.

method

identifies the method for computing the p-values. If the null model is of type logistic and small sample correction is applied (see argument adj below), possible values are unbiased, population, sample, and SKAT (see details below and Subsection 9.5 of the package vignette). If the null model is of type linear or if the null model is of type logistic and no small sample correction is applied, possible values are davies, liu, and liu.mod (see details below and Subsection 9.1 of the package vignette). If the null model is of type bernoulli, this argument is ignored.

adj

whether or not to use small sample correction for logistic models (binary trait with covariates). The choice none turns off small sample correction. If force is chosen, small sample correction is turned on unconditionally. If automatic is chosen (default), small sample correction is turned on if the number of samples does not exceed 2,000. This argument is ignored for any type of model except logistic and small sample correction is switched off. For details how to train a null model for small sample correction, see nullModel and Sections 4 and 9.5 of the package vignette. An adjustment of higher moments is performed whenever sampled null model residuals are available in the null model model (slot res.resampled.adj, see NullModel).

pValueLimit

if the null model is of type bernoulli, assocTest performs an exact mixture of Bernoulli test. This test uses a combinatorial algorithm to compute exact p-values and, for the sake of computational efficiency, quits if a pre-specified p-value threshold is exceeded. This threshold can be specified with the pValueLimit argument. This argument is ignored for other types of tests/null models.

if cl is an object of class SOCKcluster, association testing is carried out in parallel on the cluster specified by cl. If NULL (default), either no parallelization is done (if nnodes=1) or assocTest launches a cluster with nnodes R client processes on localhost. See Subsection 8.5.2 of the package vignette.

nnodes

if cl is NULL and nnodes is greater than 1, makePSOCKcluster is called with nnodes nodes on localhost, i.e. nnodes R slave processes are launched on which association testing is carried out in parallel. The default is 1. See Subsection 8.5.2 of the package vignette.

batchSize

parameter which determines how many regions of ranges are processed at once. The larger batchSize, the larger the the batches that are read from the VCF file Z. A larger batchSize reduces the number of individual read operations, which improves performance. However, a larger batchSize also requires larger amounts of memory. A good choice of batchSize, therefore, depends on the size and sparseness of the VCF file and as well on the available memory. See Subsection 8.5 of the package vignette.

noIndels

if TRUE (default), only single nucleotide variants (SNVs) are considered and indels in the VCF file Z are skipped.

onlyPass

if TRUE (default), only variants are considered whose value in the FILTER column is PASS.

na.limit

all variants with a missing value ratio above this threshold in the VCF file Z are not considered.

MAF.limit

all variants with a minor allele frequency (MAF) above this threshold in the VCF file Z are not considered.

na.action

if impute.major, all missing values will be imputed by major alleles before association testing. If omit, all columns containing missing values in the VCF file Z are ignored.

MAF.action

if invert, all columns with an MAF exceeding 0.5 will be inverted in the sense that all minor alleles will be replaced by major alleles and vice versa. If omit, all variants in the VCF file with an MAF greater than 0.5 are ignored. If ignore, no action is taken and MAFs greater than 0.5 are kept as they are.

sex

if NULL, all samples are treated the same without any modifications; if sex is a factor with levels F (female) and M (male) that is as long as length(model), this argument is interpreted as the sex of the samples. In this case, the genotypes corresponding to male samples are doubled before further processing. This is designed for mixed-sex analyses of the X chromosome outside of the pseudoautosomal regions.

tmpdir

if computations are parallelized over multiple client processes (see arguments nnodes and cl), the exchange of the null model object between the master process and the client processes is done via a temporary file. The tmpdir argument allows to specify into which directory the temporary file should be saved. On multi-core systems, the default should be sufficient. If the computations are distributed over a custom cluster, the tmpdir argument needs to be chosen such that all clients can access it via the same path.

displayProgress

if TRUE (default) and if ranges is a GRangesList, a progress message is printed upon completion of each list component (typically consisting of regions of one chromosome); this argument is ignored if ranges is not an object of class GRangesList.

...

all other parameters are passed on to the assocTest method with signature TabixFile,NullModel.

Value

an object of class AssocTestResult or AssocTestResultRanges (see details above)

Details

The assocTest method is the main function of the podkat package. For a given genotype and a null model, it performs the actual association test(s).

For null models of types linear and logistic (see NullModel and nullModel), a variance component score test is used (see Subsection 9.1 of the package vignette for details). The test relies on the choice of a particular kernel to measure the pairwise similarities of genotypes. The choice of the kernel can be made with the kernel argument (see computeKernel and Subsection 9.2 of the package vignette for more details). For null models of type linear, the test statistic follows a mixture of chi-squares distribution. For models of typ logistic, the test statistic approximately follows a mixture of chi-squares distribution. The computation of p-values for a given mixture of chi-squares can be done according to Davies (1980) (which is the default), according to Liu et al. (2009), or using a modified method similar to the one suggested by Liu et al. (2009) as implemented in the SKAT package, too. Which method is used can be controlled using the method argument. If method according to Davies (1980) fails, assocTest resorts to the method by Liu et al. (2009). See also Subsection 9.1 of the package vignette for more details.

For null models of type logistic, the assocTest method also offers the small sample correction suggested by Lee et al. (2012). Whether small sample correction is applied, is controlled by the adj argument. The additional adjustment of higher moments as suggested by Lee et al. (2012) is performed whenever resampled null model residuals are available in the null model model (slot res.resampled.adj, see NullModel). In this case, the method argument controls how the excess kurtosis of test statistics sampled from the null distribution are computed. The default setting unbiased computes unbiased estimates by using the exact expected value determined from the mixture components. The settings population and sample use almost unbiased and biased sample statistics, respectively. The choice SKAT uses the same method as implemented in the SKAT package. See Subsection 9.5 of the package vignette for more details. If the null model is of type bernoulli, the test statistic follows a mixture of Bernoulli distributions. In this case, an exact p-value is determined that is computed as the probability to observe a test statistic for random Bernoulli-distributed traits (under the null hypothesis) that is at least as large as the observed test statistic. For reasons of computational complexity, this option is limited to sample numbers not larger than 100. See Subsection 9.1 of the package vignette for more details.

The podkat package offers multiple interfaces for association testing all of which require the second argument model to be a NullModel object. The simplest method is to call assocTest for an object of class GenotypeMatrix as first argument Z. If the ranges argument is not supplied, a single association test is performed using the entire genotype matrix contained in Z and an object of class AssocTestResult is returned. In this case, all variants need to reside on the same chromosome (compare with computeKernel). If the ranges argument is specified, each region in ranges is tested separately and the result is returned as an AssocTestResultRanges object.

As said, the simplest method is to store the entire genotype in a GenotypeMatrix object and to call assocTest as described above. This approach has the shortcoming that the entire genotype must be read (e.g. from a VCF file) and kept in memory as a whole. For large studies, in particular, whole-genome studies, this is not feasible. In order to be able to cope with large studies, the podkat package offers an interface that allows for reading from a VCF file piece by piece without the need to read and store the entire genotype at once. If Z is a TabixFile object or the name of a VCF file, assocTest reads from the file in batches of batchSize regions, performs the association tests for these regions, and returns the results as an AssocTestResultRanges object. This sequential batch processing can also be parallelized. The user can either set up a cluster him-/herself and pass the SOCKcluster object as cl argument. If the cl is NULL, users can leave the setup of the cluster to assocTest. In this case, the only thing necessary is to determine the number of R client processes by the nnodes argument. The variant with the VCF interface supports the same pre-processing and filter arguments as readGenotypeMatrix to control which variants are actually taken into account and how to handle variants with MAFs greater than 50%. If the argument Z is a numeric matrix, Z is interpreted as a kernel matrix $\boldmath{K}$. Then a single association test is performed as described above and the result is returned as an AssocTestResult object. This allows the user to use a custom kernel not currently implemented in the podkat package. The assocTest function assumes that row and column objects in the kernel matrix are in the same order. It does not perform any check whether row and column names are the same or whether the kernel matrix is actually positive semi-definite. Users should be aware that running the function for invalid kernels matrices, i.e. for a matrix that is not positive semi-definite, produces meaningless results and may even lead to unexpected errors.

Finally, note that the samples in the null model model and in the genotype (GenotypeMatrix object or VCF file) need not be aligned to each other. If both the samples in model and in the genotype are named (i.e. row names are defined for Z if it is a GenotypeMatrix object; VCF files always contain sample names anyway), assocTest checks if all samples in model are present in the genotype. If so, it selects only those samples from the genotype that occur in the null model. If not, it quits with an error. If either the samples in the null model or the genotypes are not named, assocTest assumes that the samples are aligned to each other. This applies only if the number of samples in the null model and the number of genotypes are the same or if the number of genotypes equals the number of samples in the null model plus the number of samples that were omitted from the null model when it was trained (see NullModel and nullModel). Otherwise, the function quits with an error. An analogous procedure is applied if the kernel matrix interface is used (signature matrix,NullModel).

References

http://www.bioinf.jku.at/software/podkat

Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82-93. DOI: http://dx.doi.org/10.1016/j.ajhg.2011.05.029{10.1016/j.ajhg.2011.05.029}.

Lee, S., Emond, M. J., Bamshad, M. J., Barnes, K. C., Rieder, M. J., Nickerson, D. A., NHLBI Exome Sequencing Project - ESP Lung Project Team, Christiani, D. C., Wurfel, M. M., and Lin, X. (2012) Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet. 91, 224-237. DOI: http://dx.doi.org/10.1016/j.ajhg.2012.06.007{10.1016/j.ajhg.2012.06.007}.

Davies, R. B. (1980) The distribution of a linear combination of $\chi^2$ random variables. J. R. Stat. Soc. Ser. C-Appl. Stat. 29, 323-333.

Liu, H., Tang, Y., and Zhang, H. (2009) A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput. Stat. Data Anal. 53, 853-856.

Examples

Run this code

## load genome description
data(hgA)

## partition genome into overlapping windows
windows <- partitionRegions(hgA)

## load genotype data from VCF file
vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
Z <- readGenotypeMatrix(vcfFile)

## read phenotype data from CSV file (continuous trait + covariates)
phenoFile <- system.file("examples/example1lin.csv", package="podkat")
pheno.c <- read.table(phenoFile, header=TRUE, sep=",")

## train null model with all covariates in data frame 'pheno'
model.c <- nullModel(y ~ ., pheno.c)

## perform association test
res <- assocTest(Z, model.c, windows)
print(res)

## perform association test using the VCF interface
res <- assocTest(vcfFile, model.c, windows, batchSize=100)
print(res)

## create Manhattan plot of adjusted p-values
plot(p.adjust(res), which="p.value.adj")

## read phenotype data from CSV file (binary trait + covariates)
phenoFile <- system.file("examples/example1log.csv", package="podkat")
pheno.b <-read.table(phenoFile, header=TRUE, sep=",")

## train null model with all covariates in data frame 'pheno'
model.b <- nullModel(y ~ ., pheno.b)

## perform association test
res <- assocTest(Z, model.b, windows)
print(res)

## create Manhattan plot of adjusted p-values
plot(p.adjust(res), which="p.value.adj")

Run the code above in your browser using DataLab