
HIest
provides approaches to find maximum likelihood estimates of S and H, using the likelihood functions described by Fitzpatrick (2012).
HIest(G, P, type, method = "SANN", iterations = 1000, Cscale = NULL, surf = FALSE, startgrid = 99, start = c(.5,.5), control = list(fnscale = -1, maxit = iterations))
type="dominant"
or type="allele.count"
, there should be one row per individual. For type="codominant"
, each individual is to be represented in consecutive rows (one for each allele).
type="dominant"
or type="allele.count"
, there should be one row per locus, giving the frequencies of the dominant or "1" allele. For type="codominant"
there should be a separate row for each allele AND the Allele names should match the data in G
.
"codominant"
, "dominant"
, and "allele.count"
.
"SANN"
, "L-BFGS-B"
, "surf"
, and "mcmc"
. See details.
method="mcmc" or "SANN"
.
method = "SANN"
or "mcmc"
. The default value is 100. Smaller values will cause the algorithm to search more broadly, but could make the search inefficient. See details.
method="surf"
. It is the same as the argument size
in the function HIsurf
.
control
in the optim
function. Whatever else is chosen, be sure fnscale
is negative to make optim
search for a maximum rather than a minimum.
Currently, the function provides four methods for searching the likelihood surface. method = "SANN"
is probably the best; it uses the general purpose optimization function optim
with its simulated annealing algorithm. For this estimation problem, a custom proposal function is passed to the option gr
. This proposal function draws new genomic proportions ($p_{11},p_{12},p_{22}$) from a three dimensional Dirichlet distribution centered on the old genomic proportions. The concentration of the proposal distribution is controlled by Cscale
; the larger this value, the more the proposal distribution is concentrated near the current state.
method = "mcmc"
uses a Markov-Chain Monte Carlo with Metropolis-Hastings sampling to explore the likelihood surface. It also uses the Dirichlet proposal distribution, and could be useful (with some modification of the code) for generating posterior distributions. method = "SANN"
is probaby superior for simply finding the MLE.
method = "L-BFGS-B"
also uses optim
, but with a quasi-Newton likelihood search algorithm to look for the maximum likelihood. This method is relatively fast, but it can miss the MLE if it is near the edge of the sample space.
"surf"
, finds all likelihoods on a grid defined by startgrid
and chooses the maximum. This is not going to find the MLE unless the MLE happens to be one of the grid points. However, using the option surf = TRUE
with the SANN
or mcmc
methods can improve efficiency by initiating the search at the grid point nearest the MLE.
Fitzpatrick, B. M. 2012. Estimating ancestry and heterozygosity of hybrids using molecular markers. BMC Evolutionary Biology 12:131. http://www.biomedcentral.com/1471-2148/12/131
Fitzpatrick, B. M., J. R. Johnson, D. K. Kump, H. B. Shaffer, J. J. Smith, and S. R. Voss. 2009. Rapid fixation of non-native alleles revealed by genome-wide SNP analysis of hybrid tiger salamanders. BMC Evolutionary Biology 9:176. http://www.biomedcentral.com/1471-2148/9/176
Lynch, M. 1991. The genetic interpretation of inbreeding depression and outbreeding depression. Evolution 45:622-629.
HIsurf
for a likelihood surface, HIclass
for likelihoods of early generation hybrid classes, HItest
to compare the classification to the maximum likelihood, HILL
for the basic likelihood function.
##-- A random codominant data set of 5 individuals and 5 markers with three alleles each
L <- 10
P <- data.frame(Locus=rep(1:L,each=3),Allele=rep(1:3,L),
P1=as.vector(rmultinom(L,10,c(.7,.2,.1)))/10,
P2=as.vector(rmultinom(L,10,c(.1,.2,.7)))/10)
G <- matrix(nrow=10,ncol=L)
for(i in 1:L){
G[,i] <- sample(c(1,2,3),size=10,replace=TRUE,prob=rowMeans(P[P$Locus==i,3:4]))
}
HI.cod <- HIest(G,P,type="codominant",iterations=99,surf=TRUE,startgrid=20)
# this is unlikely to converge on the MLE: increase iterations and/r startgrid.
# # optional plot
# plot(c(0,.5,1,0),c(0,1,0,0),type="l",xlab=expression(italic(S)),
# ylab=expression(italic(H[I])),lwd=2,cex.lab=1.5,cex.axis=1.5,bty="n")
# points(HI.cod$S,HI.cod$H,cex=1.5,lwd=2)
# axis(1,labels=FALSE,lwd=2);axis(2,labels=FALSE,lwd=2)
# # other examples
# ##-- Make it into allele count data (count "3" alleles)
# P.c <- P[seq(from=3,to=dim(P)[2],by=3),]
# G.c <- matrix(nrow=5,ncol=L)
# for(i in 1:5){
# G.c[i,] <- colSums(G[c(i*2-1,i*2),]==3)
# }
# HI.ac <- HIest(G.c,P.c,type="allele.count",iterations=500,surf=TRUE,startgrid=50)
# ##-- Make it into dominant data where allele 3 is dominant
# G.d <- replace(G.c,G.c==2,1)
# HI.dom <- HIest(G.d,P.c,type="dominant",iterations=500,surf=TRUE,startgrid=50)
# ## -- A real dataset (Fitzpatrick et al. 2009)
# data(Bluestone)
# Bluestone <- replace(Bluestone,is.na(Bluestone),-9)
# # parental allele frequencies (assumed diagnostic)
# BS.P <- data.frame(Locus=names(Bluestone),Allele="BTS",P1=1,P2=0)
# # estimate ancestry and heterozygosity
# # BS.est <-HIest(Bluestone,BS.P,type="allele.count")
# ## shortcut for diagnostic markers
# BS.est <- HIC(Bluestone)
# # calculate likelihoods for early generation hybrid classes
# BS.class <- HIclass(Bluestone,BS.P,type="allele.count")
# # compare classification with maximum likelihood estimates
# BS.test <- HItest(BS.class,BS.est)
# table(BS.test$c1)
# # all 41 are TRUE, meaning the best classification is at least 2 log-likelihood units
# # better than the next best
# table(BS.test$c2)
# # 2 are TRUE, meaning the MLE S and H are within 2 log-likelihood units of the best
# # classification, i.e., the simple classification is rejected in all but 2 cases
# table(BS.test$Best.class,BS.test$c2)
# # individuals were classified as F2-like (class 3) or backcross to CTS (class 4), but
# # only two of the F2's were credible
# BS.test[BS.test$c2,]
# # in only one case was the F2 classification a better fit (based on AIC) than the
# # continuous model.
Run the code above in your browser using DataLab