## S3 method for class 'GenotypeMatrix,NullModel':
assocTest(Z, model, ranges,
kernel=c("linear.podkat", "localsim.podkat",
"quadratic.podkat", "linear.SKAT",
"localsim.SKAT", "quadratic.SKAT"),
width=1000, weights=NULL, weightFunc=betaWeights(),
method=NULL, adj=c("automatic", "none", "force"),
pValueLimit=0.05)
## S3 method for class 'matrix,NullModel':
assocTest(Z, model, method=NULL,
adj=c("automatic", "none", "force"), pValueLimit=0.05)
## S3 method for class 'TabixFile,NullModel':
assocTest(Z, model, ranges,
kernel=c("linear.podkat", "localsim.podkat",
"quadratic.podkat", "linear.SKAT",
"localsim.SKAT", "quadratic.SKAT"),
cl=NULL, nnodes=1, batchSize=20,
noIndels=TRUE, onlyPass=TRUE, na.limit=1, MAF.limit=1,
na.action=c("impute.major", "omit"),
MAF.action=c("invert", "omit", "ignore"),
sex=NULL, weightFunc=betaWeights(), width=1000,
method=NULL, adj=c("automatic", "none", "force"),
pValueLimit=(0.1 / length(ranges)), tmpdir=tempdir(),
displayProgress=TRUE)
## S3 method for class 'character,NullModel':
assocTest(Z, model, ...)
GenotypeMatrix
, a
quadratic kernel matrix, an object of class
TabixFile
, or a character string
with a file nameNullModel
GRanges
or
GRangesList
. If missing, assocTest
takes the
whole genotype matrix or the genotypes in the VCF file as a whole.GenotypeMatrix,NullModel
, it is also possible to supply
weights directly as a numeric vector that is as long as the number of
columns of Z
. In this case, the argument weightFunc
is ignored. Use NULL
(default) to use automatic weighting
with the function supplied as argument weightFunc
. If
weightFunc
is NULL
too, no weighting takes place, i.e.
an unweighted kernel is used.weightFuncs
for weighting
and Subsection 9.3 of the package vignette for functions provided by
the NULL
for unweighted kernels.adj
below), possible values are
nullModel
and Sections 4 and 9.5 of the package
vignette. An adjustment of higher moments is
performed whenever sampled null model residuals are available
in the null model model
(slot res.resampled.adj
, see
NullModel
).assocTest
performs an exact mixture of Bernoulli test. This
test uses a combinatorial algorithm to compute exact p-values and,
for the sake of computational efficiency, quits if a pre-specified
p-value threshold is exceeded. This threshold can be specified with
the pValueLimit
argument. This argument is ignored for other
types of tests/null models.cl
is an object of class
SOCKcluster
, association
testing is carried out in parallel on the cluster specified by cl
.
If NULL
(default), either no parallelization is done (if
nnodes=1
) or assocTest
launches a cluster with
nnodes
R client processes on localhost
. See Subsection
8.5.2 of the package vignette.cl
is NULL
and nnodes
is greater
than 1, makePSOCKcluster
is called with nnodes
nodes on localhost
, i.e. nnodes
R slave processes are
launched on which association testing is carried out in parallel.
The default is 1
. See Subsection 8.5.2 of
the package vignette.ranges
are processed at once. The larger batchSize
, the larger the
the batches that are read from the VCF file Z
. A larger
batchSize
reduces the number of individual read operations,
which improves performance. However, a larger batchSize
also
requires larger amounts of memory. A good choice of batchSize
,
therefore, depends on the size and sparseness of the VCF file and as
well on the available memory. See Subsection 8.5 of
the package vignette.TRUE
(default), only single nucleotide
variants (SNVs) are considered
and indels in the VCF file Z
are skipped.TRUE
(default), only variants are considered
whose value in the FILTER
column is Z
are not considered.Z
are not considered.Z
are ignored.NULL
, all samples are treated the same
without any modifications; if sex
is a factor with levels
F
(female) and M
(male) that is as long as
length(model)
, this argument is
interpreted as the sex of the samples. In this case, the genotypes
corresponding to male samples are
doubled before further processing. This is designed for mixed-sex
analyses of the X chromosome outside of the pseudoautosomal
regions.nnodes
and cl
), the exchange
of the null model object between the master process and the client
processes is done via a temporary file. The tmpdir
argument
allows to specify into which directory the temporary file should be
saved. On multi-core systems, the default should be sufficient.
If the computations are distributed over a custom cluster,
the tmpdir
argument needs to be chosen such that all clients
can access it via the same path.TRUE
(default) and if ranges
is a
GRangesList
, a progress message is printed
upon completion of each list component (typically consisting of
regions of one chromosome); this argument is ignored if
ranges
is not an object of class GRangesList
.assocTest
method with signature TabixFile,NullModel
.AssocTestResult
or
AssocTestResultRanges
(see details above)assocTest
method is the main function of the For null models of types
and
nullModel
), a variance component score test is used
(see Subsection 9.1 of the package vignette for details).
The test relies on the choice of a particular kernel to measure the
pairwise similarities of genotypes. The choice of the kernel can be
made with the kernel
argument (see computeKernel
and Subsection 9.2 of the package vignette for more details).
For null models of type method
argument. If method according to Davies (1980) fails,
assocTest
resorts to the method by Liu et al. (2009).
See also Subsection 9.1 of the package vignette for more details.
For null models of type assocTest
method also offers the small sample correction suggested by
Lee et al. (2012). Whether small sample correction is applied,
is controlled by the adj
argument. The additional adjustment
of higher moments as suggested by Lee et al. (2012) is
performed whenever resampled null model residuals are available in
the null model model
(slot res.resampled.adj
, see
). In this case, the method
argument controls how the excess kurtosis of test statistics sampled
from the null distribution are computed. The
default setting
The model
to be a
object.
The simplest method is to call assocTest
for an object
of class
as first argument
Z
. If the ranges
argument is not supplied, a single
association test is performed using the entire genotype matrix
contained in Z
and an object of class
is returned. In this case,
all variants need to reside on the same chromosome (compare with
computeKernel
). If the ranges
argument is
specified, each region in ranges
is tested separately and the
result is returned as an
object.
As said, the simplest method is to store the entire genotype in a
object and to call assocTest
as described above. This approach has the shortcoming that the entire
genotype must be read (e.g. from a VCF file) and kept in memory as a
whole. For large studies, in particular, whole-genome studies, this is
not feasible. In order to be able to cope with large studies, the
Z
is a TabixFile
object or the name of a VCF file, assocTest
reads from the file
in batches of batchSize
regions, performs the association tests
for these regions, and returns the results as an
object. This sequential
batch processing can also be parallelized. The user can either set up
a cluster him-/herself and pass the
SOCKcluster
object as
cl
argument. If the cl
is NULL
, users can
leave the setup of the cluster to assocTest
. In this case, the
only thing necessary is to determine the number of R client processes
by the nnodes
argument. The variant with the VCF interface
supports the same pre-processing and filter arguments as
readGenotypeMatrix
to control which variants are actually
taken into account and how to handle variants with MAFs greater than
50%.
If the argument Z
is a numeric matrix, Z
is
interpreted as a kernel matrix
$\boldmath{K}$. Then a single association test is
performed as described above and the result is returned as an
object. This allows the user to
use a custom kernel not currently implemented in the assocTest
function assumes that row and column
objects in the kernel matrix are in the same order. It does not
perform any check whether row and column names are the same or whether
the kernel matrix is actually positive semi-definite. Users should be
aware that running the function for invalid kernels matrices,
i.e. for a matrix that is not positive semi-definite, produces
meaningless results and may even lead to unexpected errors.
Finally, note that the samples in the null model model
and
in the genotype (GenotypeMatrix
object or VCF file)
need not be aligned to each other. If both the samples in model
and in the genotype are named (i.e. row names are defined
for Z
if it is a GenotypeMatrix
object; VCF files
always contain sample names anyway), assocTest
checks if all
samples in model
are present in the genotype. If so, it
selects only those samples from the genotype that occur in the null
model. If not, it quits with an error. If either the samples in the
null model or the genotypes are not named, assocTest
assumes
that the samples are aligned to each other. This applies only if
the number of samples in the null model and the number of genotypes
are the same or if the number of genotypes equals the number of
samples in the null model plus the number of samples that were omitted
from the null model when it was trained (see
and nullModel
).
Otherwise, the function quits with an error.
An analogous procedure is applied if the kernel matrix interface is
used (signature matrix,NullModel
).
Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011)
Rare-variant association testing for sequencing data with the sequence
kernel association test. Am. J. Hum. Genet. 89, 82-93.
DOI:
Lee, S., Emond, M. J., Bamshad, M. J., Barnes, K. C., Rieder, M. J.,
Nickerson, D. A., NHLBI Exome Sequencing Project - ESP Lung Project
Team, Christiani, D. C., Wurfel, M. M., and Lin, X. (2012)
Optimal unified approach for rare-variant association testing with
application to small-sample case-control whole-exome sequencing studies.
Am. J. Hum. Genet. 91, 224-237. DOI:
Davies, R. B. (1980) The distribution of a linear combination of $\chi^2$ random variables. J. R. Stat. Soc. Ser. C-Appl. Stat. 29, 323-333.
Liu, H., Tang, Y., and Zhang, H. (2009) A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput. Stat. Data Anal. 53, 853-856.
AssocTestResult
,
AssocTestResultRanges
, nullModel
,
NullModel
, computeKernel
,
weightFuncs
, readGenotypeMatrix
,
GenotypeMatrix
,
plot
, qqplot
, p.adjust
,
filterResult
## load genome description
data(hgA)
## partition genome into overlapping windows
windows <- partitionRegions(hgA)
## load genotype data from VCF file
vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
Z <- readGenotypeMatrix(vcfFile)
## read phenotype data from CSV file (continuous trait + covariates)
phenoFile <- system.file("examples/example1lin.csv", package="podkat")
pheno.c <- read.table(phenoFile, header=TRUE, sep=",")
## train null model with all covariates in data frame 'pheno'
model.c <- nullModel(y ~ ., pheno.c)
## perform association test
res <- assocTest(Z, model.c, windows)
print(res)
## perform association test using the VCF interface
res <- assocTest(vcfFile, model.c, windows, batchSize=100)
print(res)
## create Manhattan plot of adjusted p-values
plot(p.adjust(res), which="p.value.adj")
## read phenotype data from CSV file (binary trait + covariates)
phenoFile <- system.file("examples/example1log.csv", package="podkat")
pheno.b <-read.table(phenoFile, header=TRUE, sep=",")
## train null model with all covariates in data frame 'pheno'
model.b <- nullModel(y ~ ., pheno.b)
## perform association test
res <- assocTest(Z, model.b, windows)
print(res)
## create Manhattan plot of adjusted p-values
plot(p.adjust(res), which="p.value.adj")
Run the code above in your browser using DataLab