estploidy: Discriminate cytotypes using GBS data

Description

Use principal component analysis (PCA) and discriminant analysis (DA) to assign individuals to different cytotype groups based on GBS data. This function can be run with or without a training set of individuals with known cytotypes.

Usage

estploidy(alphas = NA, het = NA, depth = NA, train = FALSE, pl = NA, set = NA,  nclasses = 2, ids = NA, pcs = 1:2)

Arguments

alphas

a list generated by the estprops function that contains posterior estimates of allelic proportions.

het

a numeric vector with the observed heterozygosity for each individual, that is the proportion of SNPs at which the individual was heterozygous.

depth

a numeric vector with the mean number of reads per SNP for each individual (all SNPs or just heterozygous SNPs).

train

a boolean specifying whether or not a training set with known ploidy should be used.

a vector of known ploidies with one entry per individual (use `NA' for individuals with unknown ploidy); only used if train == TRUE.

set

indixes for the training set; only used if train == TRUE.

nclasses

the number of cyotypes expected.

ids

names or other IDs for individuals.

pcs

a vector giving the PC to use for DA.

Value

estploidy returns a list with three components:

Details

Assignment probabilities to different cytotype groups are obtained using PCA and DA, as described in Gompert & Mock (XXXX). Residual hetorzygosity (from regressing het on depth) and point estimates (posterior medians) for allelic proportions are first ordinated using PCA (on the centered covariance matrix). The first pcs PCs are retained for DA. If a training set is not provided, k-means clustering is used to obtain initial classifications (with nclasses groups), which are then used for leave-one-out cross-validation with DA. In this case, PC loadings and discriminant scores should be evaluated to associate groups from k-means clustering with specific cytotypes (Gompert & Mock, XXXX). If a training set is provided, it is used to define groups for DA and to estimate assignment probabilities without k-means clustering.

References

Gompert Z. and Mock K. (XXXX) Detection of individual ploidy levels with genotyping-by-sequencing (GBS) analysis. Molecular Ecology Resources, submitted.

Examples

Run this code

## load a simulated data set
data(dat)
## Not run: 
# ## obtain posterior estimates of allelic proportions; short chains are used for 
# ## the example, we recommend increasing this to at least 1000 MCMC steps with a
# ## 500 step burnin
# props<-estprops(cov1=t(dat[[1]]),cov2=t(dat[[2]]),mcmc.steps=20,mcmc.burnin=5,
#     mcmc.thin=1)
# 
# ## calculate observed heterozygosity and depth of coverage from the allele count
# ## data
# hx<-apply(is.na(dat[[1]]+dat[[2]])==FALSE,1,mean)
# dx<-apply(dat[[1]]+dat[[2]],1,mean,na.rm=TRUE)
# 
# ## run estploidy without using known ploidy data
# pl<-estploidy(alphas=props,het=hx,depth=dx,train=FALSE,pl=NA,set=NA,nclasses=2,
#     ids=dat[[3]],pcs=1:2)
# 
# ## boxplots to visualize posterior assignment probabilities by true ploidy 
# ## (which is known because these are simulated data)
# boxplot(pl$pp[,1] ~ dat[[3]],ylab="assignment probability",xlab="ploidy")
# 
# ## run estploidy with a training data set with known ploidy; the data set is 
# ## split into 100 individuals with known ploidy and 100 that are used for 
# ## inference
# truep<-dat[[3]]
# trn<-sort(sample(1:200,100,replace=FALSE))
# truep[-trn]<-NA
# plt<-estploidy(alphas=props,het=hx,depth=dx,train=TRUE,pl=truep,set=trn,
#     nclasses=2,ids=dat[[3]],pcs=1:2)
# 
# ## boxplots to visualize posterior assignment probabilities for individuals that
# ## were not part of the training set by true ploidy (which is known because 
# ## these are simulated data)
# boxplot(plt$pp[,1] ~ dat[[3]][-trn],ylab="assignment probability",xlab="ploidy")
# ## End(Not run)

Run the code above in your browser using DataLab