sumFREGAT (version 1.0.0)

FLM: Functional Linear Model

Description

A region-based association test on summary statistics under functional linear models (functional data analysis approach)

Usage

FLM(scoreFile, geneFile, regions, cor.path = "", annoType = "",
n, beta.par = c(1, 1), weights.function = ifelse(maf > 0,
dbeta(maf, beta.par[1], beta.par[2]), 0), GVF = FALSE,
BSF = "fourier", kg = 30, kb = 25, order = 4, flip.genotypes = FALSE,
Fan = TRUE, write.file = FALSE)

Arguments

scoreFile

name of data file generated by prep.score.files().

geneFile

name of a text file listing genes in refFlat format. If not set, hg19 file will be used (see Examples below).

regions

character vector of gene names to be analysed. If not set, function will attempt to analyse all genes listed in geneFile.

cor.path

path to a folder with correlation files (one file per gene to be analysed). Correlation files should be named "geneName.cor" (e.g. "ABCG1.cor", "ADAMTS1.cor", etc.). Each file should contain a square matrix of correlation coefficients (r) between the genetic variants of a gene. An example of the correlation file format:

"snpname1" "snpname2" "snpname3" ...
"snpname1" 1 0.018 -0.003 ...
"snpname2" 0.018 1 0.081 ...
"snpname3" -0.003 0.081 1 ...
...

One way to generate such a file from the original genotypes is:

write.table(cor(g), file = paste0(geneName, ".cor"))

where g is a genotype matrix (nsample x nvariants) for a given gene with genotypes coded as 0, 1, 2 (exactly the same coding that was used to generate betas).
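The write.table(cor(g), ...) call described for cor.path can be tried end to end with simulated data. The following is only a sketch in base R: the genotypes, the gene name "GENE1" and the use of tempdir() as the correlation folder are illustrative assumptions, not package defaults.

```r
# Simulated genotypes: 50 samples x 4 variants, coded 0/1/2 (illustrative data only)
set.seed(1)
mafs <- c(0.10, 0.20, 0.30, 0.40)
g <- sapply(mafs, function(p) rbinom(50, size = 2, prob = p))
colnames(g) <- paste0("snpname", 1:4)

# Write the variant-by-variant correlation matrix as "<geneName>.cor"
geneName <- "GENE1"                      # hypothetical gene name
cor.dir <- tempdir()                     # stand-in for the cor.path folder
cor.file <- file.path(cor.dir, paste0(geneName, ".cor"))
write.table(cor(g), file = cor.file)

# Read the file back: a square matrix with matching row and column names
r <- as.matrix(read.table(cor.file))
dim(r)
```

Reading the file back is a quick way to confirm that the row and column names agree and that the diagonal equals 1 before passing the folder as cor.path.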

annoType

for files annotated with the seqminer package, a character (or character vector) indicating annotation types to be used (e.g. "Nonsynonymous", "Start_Loss", "Stop_loss", "Essential_Splice_Site")

n

size of the sample on which summary statistics were obtained.

beta.par

two positive numeric shape parameters in the beta distribution to assign weights for each genetic variant as a function of MAF in the default weights function (see Details). Default = c(1, 1) corresponds to standard unweighted FLM.

weights.function

a function of minor allele frequency (MAF) to assign weights for each genetic variant. By default, the weights will be calculated using the beta distribution (see Details).

GVF

a basis function type for Genetic Variant Functions. Can be set to "bspline" (B-spline basis) or "fourier" (Fourier basis). The default GVF = FALSE assumes beta-smooth only. If GVF = TRUE the B-spline basis will be used.

BSF

a basis function type for beta-smooth. Can be set to "bspline" (B-spline basis) or "fourier" (Fourier basis, default).

kg

the number of basis functions to be used for GVF (default = 30, has no effect under GVF = FALSE).

kb

the number of basis functions to be used for BSF (default = 25).

order

a polynomial order to be used in "bspline". Default = 4 corresponds to cubic B-splines. Has no effect if only Fourier bases are used.

flip.genotypes

a logical value indicating whether the genotypes of some genetic variants should be flipped (relabeled) for their better functional representation [Vsevolozhskaya, et al., 2014]. Default = FALSE.

Fan

if TRUE (default), linearly dependent genetic variants will be omitted, as was done in the original implementation of the FLM test by Fan et al. (2013).

write.file

output file name. If specified, output (as it proceeds) will be written to the file.

Value

A data frame containing P values, numbers of variants and filtered variants for each of the analyzed regions. It also contains the name of the functional model used for each region (which may not always coincide with what was set, because of the restrictions described in the Details section). The first part of the name relates to the functional basis of GVFs and the second one to that of BSF, e.g. "F30-B25" means that 30 Fourier basis functions were used for construction of GVFs and 25 B-spline basis functions were used for construction of BSF. "0-F25" means that genotypes were not smoothed and 25 Fourier basis functions were used for beta-smooth. "MLR" means that standard multiple linear regression was applied.
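The model labels described here can be decoded programmatically when post-processing results. The helper below, decode_model(), is a hypothetical illustration written for this page and is not part of sumFREGAT; it assumes labels follow exactly the "F30-B25", "0-F25" and "MLR" patterns given in this section.

```r
# Hypothetical helper: split a model label such as "F30-B25" into its parts
decode_model <- function(model) {
  if (model == "MLR") {
    return(list(gvf = NA, kg = NA, bsf = NA, kb = NA, mlr = TRUE))
  }
  basis <- c(B = "bspline", F = "fourier")
  parts <- strsplit(model, "-", fixed = TRUE)[[1]]
  gvf_part <- parts[1]
  bsf_part <- parts[2]
  smoothed <- gvf_part != "0"            # "0" means genotypes were not smoothed
  list(
    gvf = if (smoothed) unname(basis[substr(gvf_part, 1, 1)]) else "none",
    kg  = if (smoothed) as.integer(substring(gvf_part, 2)) else 0L,
    bsf = unname(basis[substr(bsf_part, 1, 1)]),
    kb  = as.integer(substring(bsf_part, 2)),
    mlr = FALSE
  )
}

str(decode_model("F30-B25"))
str(decode_model("0-F25"))
```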

Details

The test assumes that the effects of multiple genetic variants (and also their genotypes if GVFs are used) can be described as a continuous function, which can be modelled through B-spline or Fourier basis functions. When the number of basis functions (\(Kg\) and \(Kb\), set by kg and kb) is less than the number of variants within the region, the FLM test may have the advantage of using fewer degrees of freedom [Svishcheva, et al., 2015].
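To make the idea of a continuous effect function concrete, the sketch below builds an unnormalized Fourier basis by hand and evaluates a smooth effect curve at the variant positions. It is an illustration only, not the package's internal code: the function fourier_basis(), the positions and the coefficients are all made up. It also shows why a Fourier basis naturally has an odd number of functions (one constant plus sine/cosine pairs).

```r
# Hypothetical helper: evaluate k (unnormalized) Fourier basis functions at u in [0, 1]
fourier_basis <- function(u, k) {
  stopifnot(k %% 2 == 1)                 # constant + sine/cosine pairs => odd k
  phi <- matrix(1, nrow = length(u), ncol = k)
  if (k > 1) {
    for (j in seq_len((k - 1) / 2)) {
      phi[, 2 * j]     <- sin(2 * pi * j * u)
      phi[, 2 * j + 1] <- cos(2 * pi * j * u)
    }
  }
  phi
}

pos <- c(1000, 1500, 2200, 4000, 4100, 5200, 6000, 7500)  # made-up positions (bp)
u <- (pos - min(pos)) / (max(pos) - min(pos))             # rescale to [0, 1]
phi <- fourier_basis(u, k = 5)

coefs <- c(0.2, 0.5, -0.1, 0.05, 0.3)    # made-up basis coefficients
beta_u <- drop(phi %*% coefs)            # smooth effect evaluated at each variant
```

Here the effects of 8 variants are described by 5 coefficients, which is the reduction in degrees of freedom that the paragraph above refers to.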

Several restrictions exist in combining B-spline or Fourier bases for construction of GVFs and BSF [Svishcheva, et al., 2015], and the FLM function takes them into account. Namely:

1) \(m \geq Kg \geq Kb\), where \(m\) is the number of polymorphic genetic variants within a region.

2) Under \(Kg = Kb\), B-B and B-F models are equivalent to 0-B model, and F-F and F-B models are equivalent to 0-F model. 0-B and 0-F models will be used for these cases, respectively.

3) Under \(m = Kb\), 0-B and 0-F models are equivalent to a standard multiple linear regression, and it will be used for these cases.

4) When Fourier basis is used, the number of basis functions should be an odd integer. Even values will be changed accordingly.

Because of these restrictions, the model in effect may not always be the one that was set. The final model name is returned in the "model" column of the results (see Value).
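Restrictions 1 to 3 can be sketched as a small decision function. This is a hypothetical helper, effective_model(), written for illustration and not taken from the package. How the package itself adjusts settings that violate restriction 1 is not described here, so the sketch simply stops with an error, and restriction 4 (odd number of Fourier basis functions) is not modelled.

```r
# Hypothetical helper mirroring restrictions 1-3; not part of sumFREGAT
effective_model <- function(m, kg, kb, gvf = "none", bsf = "fourier") {
  letter <- c(bspline = "B", fourier = "F")
  use_gvf <- gvf != "none"
  # Restriction 1: m >= Kg >= Kb (Kg is ignored when genotypes are not smoothed)
  ok <- if (use_gvf) m >= kg && kg >= kb else m >= kb
  if (!ok) stop("settings violate m >= Kg >= Kb")
  # Restriction 2: with Kg = Kb the model is equivalent to 0-B or 0-F
  if (use_gvf && kg == kb) use_gvf <- FALSE
  # Restriction 3: a 0-B or 0-F model with m = Kb is multiple linear regression
  if (!use_gvf && m == kb) return("MLR")
  gvf_label <- if (use_gvf) paste0(letter[[gvf]], kg) else "0"
  paste0(gvf_label, "-", letter[[bsf]], kb)
}

effective_model(m = 40, kg = 31, kb = 25, gvf = "fourier", bsf = "bspline")  # "F31-B25"
effective_model(m = 40, kg = 25, kb = 25, gvf = "bspline", bsf = "fourier")  # "0-F25"
effective_model(m = 25, kg = 30, kb = 25)                                    # "MLR"
```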

beta.par = c(a, b) can be used to set weights for genetic variants. Given the shape parameters of the beta function, beta.par = c(a, b), the weights are defined using the probability density function of the beta distribution:

\(W_{i} = B(a,b)^{-1} MAF_{i}^{a-1} (1-MAF_{i})^{b-1}\),

where \(MAF_{i}\) is the minor allele frequency for the \(i^{th}\) genetic variant in the region, which is estimated from genotypes, and \(B(a,b)\) is the beta function. This way of defining weights is the same as in the original SKAT (see [Wu, et al., 2011] for details).
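The weight formula can be checked numerically in base R against the default weights function shown in Usage. The MAF values and the choice beta.par = c(1, 25) below are illustrative assumptions; c(1, 25) is just one example that up-weights rare variants.

```r
maf <- c(0.005, 0.01, 0.05, 0.2, 0)      # made-up MAFs; the last variant is monomorphic
beta.par <- c(1, 25)                     # example shape parameters

# Default weights function from the Usage section
w <- ifelse(maf > 0, dbeta(maf, beta.par[1], beta.par[2]), 0)

# The same weights written out from the formula for W_i
a <- beta.par[1]
b <- beta.par[2]
w_formula <- ifelse(maf > 0, maf^(a - 1) * (1 - maf)^(b - 1) / beta(a, b), 0)

all.equal(w, w_formula)

# With the default beta.par = c(1, 1), every polymorphic variant gets weight 1
w_default <- ifelse(maf > 0, dbeta(maf, 1, 1), 0)
```

Because dbeta(maf, 1, 1) equals 1 for any MAF between 0 and 1, the default setting corresponds to the unweighted test, as noted for the beta.par argument.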

References

Svishcheva G.R., Belonogova N.M. and Axenovich T.I. (2015) Region-based association test for familial data under functional linear models. PLoS ONE 10(6): e0128999.

Vsevolozhskaya O.A., et al. (2014) Functional Analysis of Variance for Association Studies. PLoS ONE 9(9): e105074.

Wu M.C., et al. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet., Vol. 89, P. 82-93.

Fan R., Wang Y., Mills J.L., Wilson A.F., Bailey-Wilson J.E., et al. (2013) Functional linear models for association analysis of quantitative traits. Genet. Epidemiol. 37: 726-42.

Examples

# NOT RUN {
## Run FLM with example files:
VCFfileName <- system.file("testfiles/CFH.scores.anno.vcf.gz",
	package = "sumFREGAT")
cor.path <- system.file("testfiles/", package = "sumFREGAT")
n <- 85 # your sample size
out <- FLM(VCFfileName, regions = "CFH", cor.path = cor.path, n = n)


# }
