pcadapt: Principal Component Analysis for outlier detection

Description

pcadapt performs principal component analysis and computes p-values to test for outliers. The test for outliers is based on the correlations between genetic variation and the first K principal components. pcadapt also handles Pool-seq data, for which the statistical analysis is performed on the allele frequencies of the genetic markers. It returns an object of class pcadapt.

Usage

pcadapt(input, K = 2, method = "mahalanobis", data.type = "genotype",
  min.maf = 0.05, ploidy = 2, output.filename = "pcadapt_output",
  clean.files = TRUE, transpose = FALSE)

Arguments

input

a character string specifying the name of the file to be processed with pcadapt.

K

an integer specifying the number of principal components to retain.

method

a character string specifying the method to be used to compute the p-values. Four statistics are currently available: "mahalanobis", "communality", "euclidean" and "componentwise".

data.type

a character string specifying the type of data being read, either a genotype matrix (data.type="genotype"), or a matrix of allele frequencies (data.type="pool").

min.maf

a value between 0 and 0.45 specifying the minor allele frequency threshold. P-values are computed only for markers whose minor allele frequency is above this threshold.

ploidy

an integer specifying the ploidy of the individuals.

output.filename

a character string specifying the names of the files created by pcadapt.

clean.files

a logical value indicating whether the auxiliary files should be deleted or not.

transpose

a logical value indicating whether the genotype matrix has to be transposed or not. A genotype matrix should be p x n, where p is the number of genetic markers and n is the number of individuals. If the data contains m

Value

The returned value x is an object of class pcadapt.

Details

First, a principal component analysis is performed on the scaled and centered genotype data. To account for missing data, the correlation matrix between individuals is computed using only the markers available for each pair of individuals. Depending on the specified method, different test statistics can be used.
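
The first step described above can be sketched in base R. This is only an illustration on simulated data, not the package's internal code: the simulated genotypes, the 5% missing rate, and the names G, G_scaled, R, pcs and prop_var are all assumptions introduced for the example.

```r
set.seed(1)
p <- 200   # number of genetic markers (rows), hypothetical
n <- 30    # number of individuals (columns), hypothetical
K <- 2     # number of principal components to retain

# Simulated diploid genotypes coded 0/1/2, one allele frequency per marker.
freq <- runif(p, 0.2, 0.8)
G <- matrix(rbinom(p * n, size = 2, prob = freq), nrow = p, ncol = n)

# Mask about 5% of the entries to mimic missing data.
G[sample(length(G), size = round(0.05 * length(G)))] <- NA

# Drop markers with no variation among the observed genotypes.
keep <- apply(G, 1, sd, na.rm = TRUE) > 0
G <- G[keep, , drop = FALSE]

# Center and scale each marker; scale() works on columns, hence the transposes.
G_scaled <- t(scale(t(G)))

# Correlation between individuals, using for each pair of individuals
# only the markers observed in both.
R <- cor(G_scaled, use = "pairwise.complete.obs")

# Principal components from the eigendecomposition of the n x n matrix.
eig <- eigen(R, symmetric = TRUE)
pcs <- eig$vectors[, 1:K, drop = FALSE]
prop_var <- eig$values[1:K] / sum(eig$values)
```

With pairwise-complete observations the matrix R is not guaranteed to be positive semi-definite, so a few small negative eigenvalues can appear when many genotypes are missing.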

mahalanobis (default): the Mahalanobis distance is computed for each genetic marker using robust estimates of the mean and of the covariance matrix of the K vectors of z-scores.

communality: the communality statistic measures the proportion of variance explained by the first K PCs.

euclidean: the Euclidean distance between the K z-scores of each genetic marker and the mean of the K vectors of z-scores is computed.

componentwise: returns a matrix of z-scores.
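
As a rough illustration of the four statistics listed above, the following base R sketch computes them on simulated data. Several points are assumptions rather than the package's documented behaviour: the z-scores are taken to be the t-statistics from regressing each marker on the K PC scores, the Mahalanobis distance uses the classical (non-robust) mean and covariance in place of the robust estimates, and the distances are kept in squared form. All variable names are hypothetical.

```r
set.seed(2)
p <- 300; n <- 40; K <- 2

# Simulated diploid genotypes (no missing values in this sketch).
freq <- runif(p, 0.2, 0.8)
G <- matrix(rbinom(p * n, size = 2, prob = freq), nrow = p)
G <- G[apply(G, 1, sd) > 0, , drop = FALSE]

# n x p matrix of centered and scaled markers, and the first K PC scores.
X <- scale(t(G))
scores <- prcomp(X, center = FALSE)$x[, 1:K, drop = FALSE]

# Assumed z-scores: t-statistics of each marker regressed on the K PCs.
fit <- lm(X ~ scores)
Z <- t(sapply(summary(fit), function(s) s$coefficients[-1, "t value"]))

# mahalanobis: squared distance to the mean of the z-scores
# (classical estimates here, not the robust ones named in the text).
stat_maha <- mahalanobis(Z, center = colMeans(Z), cov = cov(Z))

# euclidean: squared Euclidean distance to the mean of the z-scores.
stat_eucl <- rowSums(sweep(Z, 2, colMeans(Z))^2)

# communality: share of each marker's variance explained by the K PCs,
# computed as the sum of squared marker-PC correlations.
rho <- cor(X, scores)
stat_comm <- rowSums(rho^2)

# componentwise: the matrix of z-scores itself, one column per PC.
stat_comp <- Z
```

Swapping in a robust location and scatter estimate for colMeans(Z) and cov(Z) would bring the first statistic closer to the description in the text.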

To compute p-values, the test statistics (stat) are divided by a genomic inflation factor (gif) when method="mahalanobis" or method="euclidean". When method="communality", the test statistic is first multiplied by K and divided by the percentage of variance explained by the first K PCs, and only then divided by the genomic inflation factor. For method="mahalanobis", "communality" or "euclidean", the scaled statistics (chi2_stat) should follow a chi-squared distribution with K degrees of freedom. For method="componentwise", the squared z-scores should follow a chi-squared distribution with 1 degree of freedom. For Pool-seq data, pcadapt provides p-values based on the Mahalanobis distance for each SNP.
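
The rescaling and p-value computation can be sketched as follows in base R. The text does not define the genomic inflation factor, so this example assumes the common definition, the median of the observed statistics divided by the median of the reference chi-squared distribution. The statistics are simulated, and for the componentwise case the example compares squared z-scores to the chi-squared distribution with 1 degree of freedom, which is also an assumption.

```r
set.seed(3)
K <- 2
p <- 1000

# Simulated test statistics, deliberately inflated by a factor of 1.3.
stat <- 1.3 * rchisq(p, df = K)

# Assumed definition of the genomic inflation factor.
gif <- median(stat) / qchisq(0.5, df = K)

# Scaled statistics and upper-tail p-values with K degrees of freedom.
chi2_stat <- stat / gif
pvalues <- pchisq(chi2_stat, df = K, lower.tail = FALSE)

# Componentwise case: one inflation factor and one set of p-values per PC,
# using squared z-scores and 1 degree of freedom.
Z <- matrix(rnorm(p * K, sd = 1.2), nrow = p, ncol = K)
gif_k <- apply(Z^2, 2, median) / qchisq(0.5, df = 1)
chi2_k <- sweep(Z^2, 2, gif_k, "/")
pvalues_k <- pchisq(chi2_k, df = 1, lower.tail = FALSE)
```

After rescaling, the median of chi2_stat matches the median of the chi-squared distribution with K degrees of freedom by construction, which is a quick check that the calibration was applied.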