Where a single locus represents two or more independent isoloci (as in
an allopolyploid, or a diploidized autopolyploid), these two functions
can be used in sequence to assign alleles to isoloci.
alleleCorrelations uses K-means and UPGMA clustering of pairwise p-values
from Fisher's exact test to make initial groupings of alleles into
putative isoloci.  testAlGroups is then used to check those
groupings against individual genotypes, and adjust the assignments if necessary.
alleleCorrelations(object, samples = Samples(object), locus = 1,
                   alpha = 0.05, n.subgen = 2, n.start = 50)

testAlGroups(object, fisherResults, SGploidy=2, samples=Samples(object),
             null.weight=0.5, tolerance=0.05, swap = TRUE,
             R = 100, rho = 0.95, T0 = 1, maxreps = 100)
samples: An optional character or numeric vector indicating which samples to analyze.
locus: A single character string or integer indicating which locus to analyze.
alpha: The significance threshold, before multiple testing correction, for determining whether two alleles are significantly correlated.
n.subgen: The number of subgenomes (number of isoloci) for this locus.  This would
be 2 for an allotetraploid or 3 for an allohexaploid.  For an
allo-octoploid, the value would be 2 if there were two tetraploid
subgenomes, or 4 if there were four diploid subgenomes.
n.start: Integer, passed directly to the nstart argument of the
kmeans function.  Lowering this number will reduce computation
time, whereas increasing it will raise the probability of finding the
correct allele assignments.  The default value of 50 should work well in
most cases.
fisherResults: A list output from alleleCorrelations.
SGploidy: The ploidy of each subgenome (each isolocus).  This is 2 for an
    allotetraploid, an allohexaploid, or an allo-octoploid with four diploid
    subgenomes, or 4 for an allo-octoploid with
    two tetraploid subgenomes.
null.weight: Numeric, indicating how genotypes with potential null alleles should
    be counted when looking for signs of homoplasy.  null.weight
    should be 0 if null
    alleles are expected to be common, and 1 if there are no null
    alleles in the dataset.  The default of 0.5 was chosen to reflect
    the fact that the presence of null alleles is generally unknown.
tolerance: The proportion of genotypes that are allowed to be in disagreement with the allele assignments. This is the proportion of genotypes that are expected to have meiotic error or scoring error.
swap: Boolean indicating whether or not to use the allele swapping algorithm before checking for homoplasy. TRUE will yield more accurate results in most cases, but FALSE may be preferable for loci with null or homoplasious alleles at high frequency.
R: Simulated annealing parameter for the allele swapping algorithm. Indicates how many swaps to attempt in each rep (i.e. how many swaps to attempt before changing the temperature).
rho: Simulated annealing parameter for the allele swapping algorithm. Factor by which to reduce the temperature at the end of each rep.
T0: Simulated annealing parameter for the allele swapping algorithm. Starting temperature.
maxreps: Simulated annealing parameter for the allele swapping algorithm. Maximum number of reps if convergence is not achieved.
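As a quick illustration of how the number of subgenomes and the ploidy of each subgenome relate, the following sketch tabulates the cases named in the argument descriptions. It is plain R arithmetic rather than polysat code; the data frame name configs is only for illustration, and it assumes that all subgenomes at a locus share the same ploidy, so that overall ploidy is the product of the two values.

```r
# Hypothetical lookup table; not part of polysat.
configs <- data.frame(
  organism = c("allotetraploid",
               "allohexaploid",
               "allo-octoploid, two tetraploid subgenomes",
               "allo-octoploid, four diploid subgenomes"),
  n.subgen = c(2, 3, 2, 4),   # number of isoloci per locus
  SGploidy = c(2, 2, 4, 2)    # ploidy of each isolocus
)

# Assumption: every subgenome has the same ploidy.
configs$total.ploidy <- configs$n.subgen * configs$SGploidy
print(configs)
```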
Both functions return lists.  For alleleCorrelations:
The name of the locus that was analyzed.
The method that was ultimately used to
    produce value$Kmeans.groups and value$UPGMA.groups.
    Either "K-means and UPGMA" or "fixed alleles".
Square matrix of logical values indicating whether there was significant negative correlation between each pair of alleles, after multiple testing correction by Holm-Bonferroni.
Square matrix of logical values indicating whether there was significant positive correlation between each pair of alleles, after multiple testing correction by Holm-Bonferroni.
Square matrix of p-values from Fisher's exact test for negative correlation between each pair of alleles.
Square matrix of p-values from Fisher's exact test for positive correlation between each pair of alleles.
Square matrix of the odds ratio estimate from Fisher's exact test for each pair of alleles.
Matrix with n.subgen rows, and as many
  columns as there are alleles in the dataset.  1 indicates that
  a given allele belongs to a given isolocus, and 0 indicates
  that it does not.  These are the groupings determined by K-means
  clustering.
Matrix in the same format as
  value$Kmeans.groups, showing groupings determined by UPGMA.
Square matrix like value$p.values.neg but
  with zeros inserted on the diagonal.  This is the matrix that was used
  for K-means clustering and UPGMA.  This matrix can be passed to the
  heatmap function in R to visualize the clusters.
Total sums of squares output from K-means clustering.
Sums of squares between clusters output from K-means
  clustering.  value$betweenss/value$totss can be used as an
  indication of clustering quality.
The table indicating presence/absence of each allele in each genotype.
For testAlGroups:
Name of the locus that was tested.
The ploidy of each subgenome, taken from the
  SGploidy argument that was passed to testAlGroups.
Matrix with as many rows as there are isoloci, and as many
  columns as there are alleles in the dataset.  1 indicates that
  a given allele belongs to a given isolocus, and 0 indicates
  that it does not.
A number ranging from zero to one 
indicating the proportion of genotypes from the dataset that are inconsistent
with assignments.
These functions implement a novel methodology, introduced in polysat version 1.4 and updated in version 1.6, for cases where one pair of microsatellite primers amplifies alleles at two or more independently segregating loci (referred to here as isoloci). This is not typically the case with new autopolyploids, in which all copies of a locus have equal chances of pairing with each other at meiosis. It is, however, frequently the case with allopolyploids, in which there are two or more homeologous subgenomes that do not pair (or pair only infrequently) at meiosis, and with ancient autopolyploids, in which duplicated chromosomes have diverged to the point of no longer pairing at meiosis.
Within the two functions there are five major steps:
alleleCorrelations checks to see if there are any alleles
  that are present in every genotype in the dataset.  Such invariable
  alleles are assumed to be fixed at one isolocus (which is not
  necessarily true, but may be corrected by
  testAlGroups in steps 4 and 5).
  If present, each invariable allele is assigned to its own isolocus.
  If there are more invariable alleles than isoloci, the function throws
  an error.  If only one isolocus remains, all remaining (variable) alleles are
  assigned to that isolocus.  If there are as many invariable alleles as
  isoloci, all remaining (variable) alleles are assigned to all isoloci
  (i.e. they are considered homoplasious because they cannot be
  assigned).
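The handling of invariable alleles described above can be sketched in base R as follows. This is a minimal illustration on a made-up presence/absence table, not the polysat source; the object names (gentable, fixed, groups) and the allele names are hypothetical.

```r
# Toy presence/absence table: rows are genotypes, columns are alleles.
gentable <- matrix(c(1, 1, 0, 1,
                     1, 0, 1, 0,
                     1, 1, 1, 0,
                     1, 0, 0, 1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("ind", 1:4),
                                   c("A100", "A104", "A108", "A112")))
n.subgen <- 2

# An allele is invariable if it is present in every genotype.
fixed <- colnames(gentable)[colSums(gentable) == nrow(gentable)]
if (length(fixed) > n.subgen) stop("more invariable alleles than isoloci")

# Assignment matrix: one row per isolocus, one column per allele.
groups <- matrix(0, nrow = n.subgen, ncol = ncol(gentable),
                 dimnames = list(NULL, colnames(gentable)))

# Each invariable allele gets its own isolocus.
for (i in seq_along(fixed)) groups[i, fixed[i]] <- 1

variable  <- setdiff(colnames(gentable), fixed)
remaining <- n.subgen - length(fixed)
if (remaining == 1) {
  groups[n.subgen, variable] <- 1   # only one isolocus left to fill
} else if (remaining == 0) {
  groups[, variable] <- 1           # cannot be assigned: all isoloci
}
# If remaining >= 2, the variable alleles stay unassigned for clustering.
groups
```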
If, after step 1, two or more isoloci remain
  without alleles assigned to them, correlations between alleles are
  tested by alleleCorrelations.  The dataset is converted
  to "genbinary" if not
  already in that format, and a Fisher's exact test, with negative
  association (odds ratio being less than one) as the alternative
  hypothesis, is performed between
  each pair of columns (alleles) in the genotype matrix.  The p-value of
  this test between each pair of alleles is stored in a square matrix,
  and zeros are inserted into the diagonal of the matrix.  K-means
  clustering and UPGMA are then performed on the square matrix of
  p-values, and the
  clusters that are produced represent initial assignments of alleles
  to isoloci.
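The correlation and clustering step can be approximated with base R alone. The sketch below builds a small, perfectly balanced toy dataset (every combination of genotypes at two diploid isoloci, repeated five times), runs one-sided Fisher's exact tests on each pair of alleles, and clusters the resulting matrix. The helper genos, the object names, and the choice of Euclidean distances between rows as input to UPGMA are assumptions made for illustration; polysat's internal code may differ.

```r
iso1 <- c("a1", "a2", "a3")
iso2 <- c("b1", "b2", "b3")

# Hypothetical helper: all diploid genotypes at one isolocus.
genos <- function(al) rbind(t(combn(al, 2)), cbind(al, al, deparse.level = 0))
g1 <- genos(iso1)
g2 <- genos(iso2)

# Every combination of the two isoloci, repeated five times (180 genotypes).
combos <- expand.grid(i = seq_len(nrow(g1)), j = seq_len(nrow(g2)))
combos <- combos[rep(seq_len(nrow(combos)), times = 5), ]
gentable <- matrix(0, nrow(combos), 6, dimnames = list(NULL, c(iso1, iso2)))
for (r in seq_len(nrow(combos))) {
  gentable[r, g1[combos$i[r], ]] <- 1
  gentable[r, g2[combos$j[r], ]] <- 1
}

# Fisher's exact test for negative association between each pair of alleles.
n.al <- ncol(gentable)
pvals <- matrix(0, n.al, n.al, dimnames = list(colnames(gentable),
                                               colnames(gentable)))
for (j in 1:(n.al - 1)) {
  for (k in (j + 1):n.al) {
    tab <- table(factor(gentable[, j], levels = 0:1),
                 factor(gentable[, k], levels = 0:1))
    p <- fisher.test(tab, alternative = "less")$p.value
    pvals[j, k] <- pvals[k, j] <- p
  }
}
# The diagonal is left at zero, as described above.

set.seed(1)  # kmeans uses random starting centers
km <- kmeans(pvals, centers = 2, nstart = 10)
up <- cutree(hclust(dist(pvals), method = "average"), k = 2)  # UPGMA

km$cluster
up
km$betweenss / km$totss  # closer to 1 means tighter clusters
```

In real data the between-isolocus p-values vary from pair to pair, so the clusters are less clean than in this balanced toy example.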
The output of alleleCorrelations is then passed to
  testAlGroups.  If the results of K-means clustering and UPGMA
  were not identical, testAlGroups checks both sets of
  assignments against all genotypes in the dataset.  For a genotype to
  be consistent with a set of assignments, it should have at least one
  allele and no more than SGploidy alleles belonging to each
  isolocus.  The set of assignments that is consistent with the greatest
  number of genotypes is chosen, or in the case of a tie, the set of
  assignments produced by K-means clustering.
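The consistency rule in this step (at least one and at most SGploidy alleles per isolocus) can be written compactly with a matrix product. The following is a minimal sketch on made-up data; count_consistent and the two candidate assignment matrices are hypothetical and stand in for the K-means and UPGMA results.

```r
# Number of genotypes with 1..SGploidy alleles at every isolocus.
count_consistent <- function(gentable, groups, SGploidy = 2) {
  per.iso <- gentable %*% t(groups)   # genotypes x isoloci allele counts
  sum(apply(per.iso >= 1 & per.iso <= SGploidy, 1, all))
}

# Toy presence/absence table for an allotetraploid.
gentable <- matrix(c(1, 0, 1, 0,
                     1, 1, 1, 0,
                     0, 1, 1, 1,
                     1, 1, 1, 1,
                     1, 0, 0, 1),
                   ncol = 4, byrow = TRUE,
                   dimnames = list(paste0("ind", 1:5), c("A", "B", "C", "D")))

# Two competing sets of assignments (rows are isoloci).
kmeans.groups <- rbind(c(1, 1, 0, 0), c(0, 0, 1, 1))
upgma.groups  <- rbind(c(1, 0, 0, 0), c(0, 1, 1, 1))

n.km <- count_consistent(gentable, kmeans.groups)
n.up <- count_consistent(gentable, upgma.groups)

# Ties go to the K-means result.
chosen <- if (n.up > n.km) upgma.groups else kmeans.groups
c(Kmeans = n.km, UPGMA = n.up)
```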
If swap = TRUE and the assignments chosen in the previous 
  step are inconsistent with some genotypes, testAlGroups attempts
  to swap the isoloci of single alleles, using a simulated annealing 
  (Bertsimas and Tsitsiklis 1993) algorithm to search for a new set of 
  assignments that is consistent with as many genotypes as possible.
  At each step, an allele is chosen at random to be moved to a different
  isolocus (which is also chosen at random if there are more than two
  isoloci).  If the new set of allele assignments is consistent with an equal or 
  greater number of genotypes than the previous set of assignments, the new
  set is retained.  If the new set is consistent with fewer genotypes than
  the old set, there is a small probability of retaining the new set, 
  dependent on how much worse the new set of assignments is and what the
  current “temperature” of the algorithm is.  After R allele
  swapping attempts, the temperature is lowered, reducing the probability 
  of retaining a set of allele assignments that is worse than the previous set.
  A new rep of R swapping attempts then begins.
  If a set of allele assignments is found that is consistent with all genotypes,
  the algorithm stops immediately.  Otherwise it stops if no changes are made
  during an entire rep of R swap attempts, or if maxreps reps
  are performed.
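A rough sketch of the allele swapping loop is given below, using the same parameter names as testAlGroups (R, rho, T0, maxreps). The acceptance rule for worse assignments, exp(change in score / temperature), is the standard Metropolis criterion and is an assumption here, since the text above does not give the exact formula. The function names and toy data are hypothetical, and each allele is assumed to start in exactly one isolocus.

```r
count_consistent <- function(gentable, groups, SGploidy = 2) {
  per.iso <- gentable %*% t(groups)
  sum(apply(per.iso >= 1 & per.iso <= SGploidy, 1, all))
}

swap_alleles <- function(gentable, groups, SGploidy = 2,
                         R = 100, rho = 0.95, T0 = 1, maxreps = 100) {
  n.gen <- nrow(gentable)
  score <- count_consistent(gentable, groups, SGploidy)
  temp  <- T0
  for (r in seq_len(maxreps)) {
    changed <- FALSE
    for (s in seq_len(R)) {
      if (score == n.gen) return(groups)      # consistent with all genotypes
      al     <- sample(ncol(groups), 1)       # allele to move
      from   <- which(groups[, al] == 1)
      others <- setdiff(seq_len(nrow(groups)), from)
      to     <- others[sample(length(others), 1)]
      new <- groups
      new[from, al] <- 0
      new[to, al]   <- 1
      new.score <- count_consistent(gentable, new, SGploidy)
      # Keep equal or better sets; keep worse sets with small probability
      # (assumed Metropolis rule).
      if (new.score >= score ||
          runif(1) < exp((new.score - score) / temp)) {
        groups  <- new
        score   <- new.score
        changed <- TRUE
      }
    }
    if (!changed) break     # a whole rep without any change
    temp <- temp * rho      # cool down before the next rep
  }
  groups
}

gentable <- matrix(c(1, 0, 1, 0,
                     1, 1, 1, 0,
                     0, 1, 1, 1,
                     1, 1, 1, 1,
                     1, 0, 0, 1),
                   ncol = 4, byrow = TRUE,
                   dimnames = list(paste0("ind", 1:5), c("A", "B", "C", "D")))
start <- rbind(c(1, 0, 0, 0), c(0, 1, 1, 1))  # allele B is misplaced

set.seed(42)
final <- swap_alleles(gentable, start)
final
count_consistent(gentable, final)
```

The sketch returns the current set of assignments when it stops, not necessarily the best set visited along the way.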
testAlGroups then checks through all genotypes to look
  for signs of homoplasy, meaning single alleles that should be assigned
  to more than one isolocus.  For each genotype, there should be no more
  than SGploidy alleles assigned to each isolocus.  Additionally,
  if there are no null alleles, each genotype should have at least one
  allele belonging to each isolocus.  Each time a genotype is
  encountered that does not meet these criteria, a score is
  increased for all alleles that might be homoplasious.  (The second
  criterion is not checked if null.weight = 0.)  This score
  starts at zero and is increased by 1 if there are too many alleles per
  isolocus or by null.weight if an isolocus has no alleles.  Once
  all genotypes have been checked, the allele with the highest score is
  considered to be homoplasious and is added to the other isolocus.  (In
  a hexaploid or higher, which isolocus the allele is added to depends on the
  genotypes that were found to be inconsistent with the allele
  assignments, and which isolocus or isoloci the allele could have
  belonged to in order to fix the assignment.)  Allele scores are reset
  to zero and all alleles are then
  checked again with the new set of allele assignments.  The process is
  repeated until the proportion of genotypes that are inconsistent with
  the allele assignments is at or below tolerance.
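The homoplasy scoring loop can be sketched for the two-isolocus case as follows. Several details are assumptions made for illustration rather than a description of polysat's code: which alleles count as candidates in each situation, counting only non-homoplasious alleles toward the "too many alleles" check, and including empty isoloci in the inconsistency proportion only when null.weight is above zero. All function names and the toy data are hypothetical.

```r
check_genotypes <- function(gentable, groups, SGploidy = 2) {
  # Alleles assigned to exactly one isolocus.
  unique.al <- sweep(groups, 2, as.numeric(colSums(groups) == 1), "*")
  list(too.many = (gentable %*% t(unique.al)) > SGploidy,
       none     = (gentable %*% t(groups)) == 0)
}

score_alleles <- function(gentable, groups, SGploidy = 2, null.weight = 0.5) {
  chk <- check_genotypes(gentable, groups, SGploidy)
  scores <- setNames(numeric(ncol(gentable)), colnames(gentable))
  for (i in seq_len(nrow(gentable))) {
    present <- gentable[i, ] == 1
    for (iso in seq_len(nrow(groups))) {
      if (chk$too.many[i, iso]) {
        # Alleles at the overfull isolocus may really belong elsewhere.
        cand <- present & groups[iso, ] == 1 & colSums(groups) == 1
        scores[cand] <- scores[cand] + 1
      }
      if (null.weight > 0 && chk$none[i, iso]) {
        # Alleles in this genotype may also occur at the empty isolocus.
        scores[present] <- scores[present] + null.weight
      }
    }
  }
  scores
}

resolve_homoplasy <- function(gentable, groups, SGploidy = 2,
                              null.weight = 0.5, tolerance = 0.05) {
  repeat {
    chk <- check_genotypes(gentable, groups, SGploidy)
    bad <- rowSums(chk$too.many) > 0
    if (null.weight > 0) bad <- bad | rowSums(chk$none) > 0
    if (mean(bad) <= tolerance) break
    scores <- score_alleles(gentable, groups, SGploidy, null.weight)
    if (all(scores == 0)) break
    top <- which.max(scores)
    groups[groups[, top] == 0, top] <- 1  # two isoloci: add to the other one
  }
  groups
}

# Alleles A, B, E at isolocus 1; C, D at isolocus 2; A is also at isolocus 2.
al <- c("A", "B", "E", "C", "D")
genotypes <- list(c("A","B","E","C"), c("A","B"), c("A","E","D"), c("A","E"),
                  c("B","C"), c("E","D"), c("A","B","E","D"),
                  c("B","E","C","D"), c("A","C","D"), "A")
gentable <- t(sapply(genotypes, function(g) as.numeric(al %in% g)))
colnames(gentable) <- al
groups <- rbind(c(1, 1, 1, 0, 0), c(0, 0, 0, 1, 1))
colnames(groups) <- al

first.scores <- score_alleles(gentable, groups)
first.scores
result <- resolve_homoplasy(gentable, groups)
result
```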
Clark, L. V. and Drauch Schreier, A. (2017) Resolving microsatellite genotype ambiguity in populations of allopolyploid and diploidized autopolyploid organisms using negative correlations between allelic variables. Molecular Ecology Resources, 17, 1090--1103. DOI: 10.1111/1755-0998.12639.
Bertsimas, D. and Tsitsiklis, J. (1993) Simulated annealing. Statistical Science, 8, 10--15.
recodeAllopoly, mergeAlleleAssignments,
  catalanAlleles, processDatasetAllo
# NOT RUN {
library(polysat)

# randomly generate example data for an allotetraploid
# (two isoloci with five alleles each; homoplasy set via n.homoplasy)
mydata <- simAllopoly(n.alleles=c(5,5), n.homoplasy=1)
viewGenotypes(mydata)

# test allele correlations
# n.start is lowered in this example to speed up computation time
myCorr <- alleleCorrelations(mydata, n.subgen=2, n.start=10)
myCorr$Kmeans.groups
myCorr$clustering.method
if(!is.null(myCorr$heatmap.dist)) heatmap(myCorr$heatmap.dist)

# check individual genotypes
# (low maxreps used in order to speed processing time for this example)
myRes <- testAlGroups(mydata, myCorr, SGploidy=2, maxreps = 5)
myRes$assignments

# same check, but without the allele swapping algorithm
myRes2 <- testAlGroups(mydata, myCorr, SGploidy=2, swap = FALSE)
myRes2$assignments
# }