PCA: Principal Components Analysis

Description

Test for association between a trait and principal components of genotypes within a region, on summary statistics

Usage

PCA(scoreFile, geneFile, regions, cor.path = "", annoType = "",
n, beta.par = c(1, 1), weights.function = ifelse(maf > 0,
dbeta(maf, beta.par[1], beta.par[2]), 0), var.fraction = .85,
write.file = FALSE)

Arguments

scoreFile

name of data file generated by prep.score.files().

geneFile

name of a text file listing genes in refFlat format. If not set, hg19 file will be used (see Examples below).

regions

character vector of gene names to be analysed. If not set, function will attempt to analyse all genes listed in geneFile.

cor.path

path to a folder with correlation files (one file per each gene to be analysed). Names of correlation files should be constructed as "geneName.cor" (e.g. "ABCG1.cor", "ADAMTS1.cor", etc.) Each file should contain a square matrix with correlation coefficients (r) between genetic variants of a gene. An example of correlation file format: "snpname1" "snpname2" "snpname3" ... "snpname1" 1 0.018 -0.003 ... "snpname2" 0.018 1 0.081 ... "snpname3" -0.003 0.081 1 ... ... One way to generate such file from original genotypes is: write.table(cor(g), file = paste0(geneName, ".cor")) where g is a genotype matrix (nsample x nvariants) for a given gene with genotypes coded as 0, 1, 2 (exactly the same coding that was used to generate betas).

annoType

for files annotated with the seqminer package, a character (or character vector) indicating annotation types to be used (e.g. "Nonsynonymous", "Start_Loss", "Stop_loss", "Essential_Splice_Site")

size of the sample on which summary statitics were obtained.

beta.par

two positive numeric shape parameters in the beta distribution to assign weights for each genetic variant as a function of MAF in the default weights function (see Details). Default = c(1, 1).

weights.function

a function of minor allele frequency (MAF) to assign weights for each genetic variant. By default, the weights will be calculated using the beta distribution (see Details).

var.fraction

minimal proportion of genetic variance within region that should be explained by principal components used (see Details for more info).

write.file

output file name. If specified, output (as it proceeds) will be written to the file.

Value

a data frame containing P values, numbers of variants and filtered variants for each of analyzed regions. It also contains the number of the principal components used for each region and the proportion of genetic variance they make up.

Details

PCA test is a useful tool to detect association between genetic variants of a region and a trait when genetic variants are strongly correlated. PCA test is based on the spectral decomposition of correlation matrix among genetic variants. The number of top principal components will be chosen in such a way that >= var.fraction of region variance can be explained by these PCs. By default, var.fraction = 0.85, i.e. >= 85% of region variance will be explained by PCs. If var.fraction = 1 then the results of PCA test and MLR-based test are identical.

beta.par = c(a, b) can be used to set weights for genetic variants. Given the shape parameters of the beta function, beta.par = c(a, b), the weights are defined using probability density function of the beta distribution:

\(W_{i}=(B(a,b))^{^{-1}}MAF_{i}^{a-1}(1-MAF_{i})^{b-1} \),

where \(MAF_{i}\) is a minor allelic frequency for the \(i^{th}\) genetic variant in the region, which is estimated from genotypes, and \(B(a,b)\) is the beta function. This way of defining weights is the same as in original SKAT (see [Wu, et al., 2011] for details). Precision of input values (betas and P) can be important for perfect correspondence between PCA on summary statistics and PCA performed on genotypes. We suggest to keep as much decimal points as possible for input values.

References

Jolliffe, I.T. A note on the use of principal components in regression. J R Stat Soc Ser C 31, 300-303 (1982).

Examples

Run this code

# NOT RUN {
## Run PCA with example files:
VCFfileName <- system.file("testfiles/CFH.scores.anno.vcf.gz",
	package = "sumFREGAT")
cor.path <- system.file("testfiles/", package = "sumFREGAT")
n <- 85 # your sample size
out <- PCA(VCFfileName, region = 'CFH', cor.path = cor.path, n = n)


# }

Run the code above in your browser using DataLab