pcadapt: Principal Components Analysis

Description

pcadapt performs principal component analysis and computes p-values to test for selection, as indicated by significant correlations between genetic variation and principal components. pcadapt also allows the user to read PCAdapt outputs for further analysis in R. Returns an object of class pcadapt.

Usage

pcadapt(data = NULL, file = NULL, K, communality_test = FALSE,
  ploidy = 2, minmaf = 0.05)

Arguments

data

a data matrix or a data frame.

file

the name of the file from which the data are to be read. Typically this is the file generated by PCAdapt, which has no extension.

K

an integer specifying the number of principal components to retain.

communality_test

a logical value indicating whether a communality test should be performed. Default value set to FALSE.

ploidy

an integer specifying the ploidy of the individuals.

minmaf

a value between 0 and 0.5 specifying the threshold under which the frequencies are considered minor allele frequencies.
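To make the role of ploidy and minmaf concrete, here is a minimal sketch of how a minor allele frequency can be computed from a genotype matrix and compared with the threshold. The matrix layout (individuals in rows, markers in columns), the toy genotypes, and the variable names are assumptions made for illustration. This is not pcadapt's internal code, and what pcadapt does with markers below the threshold is not shown here.

```r
# Toy genotype matrix (invented values): rows are individuals, columns are
# markers. Each entry counts copies of one allele, from 0 to ploidy.
geno <- matrix(c(0, 1, 2, 0,
                 0, 0, 2, 1,
                 0, 1, 2, NA,
                 0, 0, 2, 1),
               nrow = 4, byrow = TRUE)
ploidy <- 2
minmaf <- 0.05

# Frequency of the counted allele per marker, ignoring missing genotypes.
freq <- colMeans(geno, na.rm = TRUE) / ploidy

# Minor allele frequency: the smaller of the two allele frequencies.
maf <- pmin(freq, 1 - freq)

# Markers whose minor allele frequency falls below the threshold.
below_threshold <- maf < minmaf
```

In this toy matrix the first and third markers are monomorphic, so they are the ones flagged by below_threshold.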

Value

The returned value x is an object of class pcadapt. The different fields can be viewed using the dollar sign (example: x$neutral_sdev). The returned value contains the following components:
loadings is a matrix containing the correlations between each genetic marker and each PC.
scores is a matrix corresponding to the projections of the individuals onto each PC.
singular_values contains the ordered square roots of the proportion of variance explained by each PC.
pvalues is a data frame containing the p-values for the first K principal components.
communality contains the communality for each PC, which corresponds to the proportion of variance explained by the first K PCs.
p is a data frame with K columns. It gives the proportions removed from the loadings distributions in order to estimate the standard deviation of the neutral markers.
q is a data frame with K columns. Each column of q represents the kurtosis evaluated on the distribution of the loadings for each cut-off provided by p.
proportion_removed is a list of size K corresponding to the proportions of markers to remove from the loading distributions to match the kurtosis expected for a Gaussian distribution.
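Since singular_values holds square roots of the proportion of variance explained, squaring them recovers the proportions themselves. The sketch below uses a mock list with invented numbers in place of a real pcadapt result, purely to show the dollar-sign access and the arithmetic.

```r
# Mock object standing in for a pcadapt result. The values are invented
# for illustration and do not come from a real analysis.
x <- list(singular_values = c(0.5, 0.3, 0.2, 0.1))

# Proportion of variance explained by each PC.
prop_var <- x$singular_values^2

# Cumulative proportion explained by the first 1, 2, ... PCs.
cum_prop <- cumsum(prop_var)
```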

Details

First, a principal component analysis is performed on the scaled and centered genotype data. To account for missing data, the correlation matrix between individuals is computed using only the markers available for each pair of individuals. The scores and the loadings (correlations between PCs and genetic markers) are then found using the eigen function. The p-values are then computed from the matrix of loadings. The loadings of the neutral markers are assumed to follow a centered Gaussian distribution. The standard deviation of this Gaussian distribution is estimated after removing a proportion of the genetic markers with the largest loadings (in absolute value). The removal proportion is the smallest percentage such that the kurtosis of the truncated distribution of the loadings matches the kurtosis of a Gaussian distribution, which is equal to 3. The standard deviation of the loadings is finally estimated by maximum likelihood for a truncated Gaussian distribution.
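The last steps described above (trimming by kurtosis, then fitting a truncated Gaussian) can be sketched in base R as follows. This is an illustrative reconstruction under stated assumptions, not the package's implementation: the helper names, the grid of candidate proportions, the stopping rule "first proportion whose kurtosis is at most 3", and the simulated loadings are all hypothetical.

```r
# Kurtosis as the fourth standardized moment (equals 3 for a Gaussian).
kurtosis <- function(z) {
  z <- z - mean(z)
  mean(z^4) / mean(z^2)^2
}

# Smallest proportion on the grid such that, after removing the loadings
# that are largest in absolute value, the kurtosis is at most 3.
# (Assumed stopping rule; the grid is also an assumption.)
find_removal_proportion <- function(loadings, grid = seq(0, 0.5, by = 0.01)) {
  ord <- order(abs(loadings))  # smallest absolute loadings first
  n <- length(loadings)
  for (p in grid) {
    kept <- loadings[ord[seq_len(floor(n * (1 - p)))]]
    if (kurtosis(kept) <= 3) return(p)
  }
  NA_real_
}

# Maximum likelihood estimate of sigma for a centered Gaussian truncated
# to the interval [-cutoff, cutoff].
truncated_sd_mle <- function(kept, cutoff) {
  negloglik <- function(sigma) {
    -sum(dnorm(kept, mean = 0, sd = sigma, log = TRUE)) +
      length(kept) * log(2 * pnorm(cutoff / sigma) - 1)
  }
  optimize(negloglik, interval = c(1e-6, 10 * sd(kept) + 1e-6))$minimum
}

# Simulated loadings: mostly "neutral" markers (sd 0.1) plus a few with
# heavier tails. These numbers are invented for the example.
set.seed(1)
loadings <- c(rnorm(4900, mean = 0, sd = 0.1), rnorm(100, mean = 0, sd = 0.5))

p_removed <- find_removal_proportion(loadings)
n_kept <- floor(length(loadings) * (1 - p_removed))
kept <- loadings[order(abs(loadings))[seq_len(n_kept)]]
sigma_hat <- truncated_sd_mle(kept, cutoff = max(abs(kept)))
```

With these simulated data, sigma_hat should land near the neutral standard deviation of 0.1, because the truncated likelihood corrects for the tails that were cut off.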

Examples

x <- read4pcadapt("geno3pops",option="example")
x <- floor(abs(x))
y <- pcadapt(x,K=10)

## Screeplot
plot(y,option="screeplot")

## PCA
plot(y,option="scores")

## Neutral SNPs distribution
plot(y,option="neutral",K=1)

## Manhattan Plot
plot(y,option="manhattan",K=1)

## Q-Q Plot
plot(y,option="qqplot",K=1)
