seg_lrt: Test for segregation distortion in a polyploid F1 population.

Description

Provides tests for segregation distortion for an F1 population of polyploids under various models of meiosis. You can use this test for autopolyploids that exhibit full polysomic inheritance, allopolyploids that exhibit full disomic inheritance, or segmental allopolyploids that exhibit partial preferential pairing. Double reduction is (optionally) fully accounted for in tetraploids, and (optionally) partially accounted for (only at simplex loci) for higher ploidies. Some maximum proportion of outliers can be specified (default at 3%), and so this method can accommodate moderate levels of double reduction at non-simplex loci. Offspring genotypes can either be known, or genotype uncertainty can be represented through genotype likelihoods. Parent data may or may not be provided, at your option. Parents can have different (even) ploidies, at your option. Details of the methods may be found in Gerard et al. (2025).

Usage

seg_lrt(
  x,
  p1_ploidy,
  p2_ploidy = p1_ploidy,
  p1 = NULL,
  p2 = NULL,
  model = c("seg", "auto", "auto_dr", "allo", "allo_pp", "auto_allo"),
  outlier = TRUE,
  ret_out = FALSE,
  ob = 0.03,
  db = c("ces", "prcs"),
  ntry = 3,
  opt = c("bobyqa", "L-BFGS-B"),
  optg = c("NLOPT_GN_MLSL_LDS", "NLOPT_GN_ESCH", "NLOPT_GN_CRS2_LM", "NLOPT_GN_ISRES"),
  df_tol = 0.001,
  chisq = FALSE
)

Value

A list with some or all of the following elements

stat

The test statistic.

df

The degrees of freedom of the test.

p_value

The p-value of the test.

null_bic

The null model's BIC.

outprob

Outlier probabilities. Only returned if ret_out = TRUE.

If using genotype counts, element i is the probability that an individual with genotype i-1 is an outlier. So the return vector has length ploidy plus 1.
If using genotype log-likelihoods, element i is the probability that individual i is an outlier. So the return vector has the same length as the number of individuals.

These outlier probabilities are only valid if the null of no segregation distortion is true.

null

A list with estimates and information on the null model.

l0_pp

Maximized likelihood under the null plus the parent log-likelihoods.

l0

Maximized likelihood under the null, treating the estimated parent genotypes as known parent genotypes.

q0

Estimated genotype frequencies under the null.

df0

Estimated number of parameters under the null.

gam

A list of three lists with estimates of the model parameters. The third list contains the elements outlier (which is TRUE if outliers were modeled) and pi (the estimated outlier proportion). The first two lists contain information on each parent with the following elements:

ploidy

The ploidy of the parent.

g

The (estimated) genotype of the parent.

alpha

The estimated double reduction rate(s). alpha[i] is the estimated probability that a gamete has i copies of identical-by-double-reduction alleles.

beta

Double reduction's effect on simplex loci when type = "mix" and add_dr = TRUE.

gamma

The mixing proportions for the pairing configurations. The order is the same as in seg.

type

Either "mix" or "polysomic"

add_dr

Did we model double reduction at simplex loci when using type = "mix" (TRUE) or not (FALSE)?

alt

A list with estimates and information on the alternative model.

l1

The maximized likelihood under the alternative.

q1

The estimated genotype frequencies under the alternative.

df1

The estimated number of parameters under the alternative.

Arguments

x

The data. Can be one of two forms:

A vector of genotype counts. This is when offspring genotypes are known.
A matrix of genotype log-likelihoods. This is when there is genotype uncertainty. The rows index the individuals and the columns index the possible genotypes. The genotype log-likelihoods should be base e (natural log).
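
As a rough illustration of the two forms of x, the base R sketch below builds both for a tetraploid offspring population (possible genotypes 0 through 4). The counts and likelihood values are made up for illustration and are not taken from the package.

```r
# Form 1: genotype counts. Element i is the number of offspring with
# genotype i - 1, so a tetraploid needs ploidy + 1 = 5 entries.
x_counts <- c(5, 20, 50, 20, 5)

# Form 2: genotype log-likelihoods (natural log). One row per individual,
# one column per possible genotype (0, 1, ..., 4).
# Hypothetical likelihoods for three individuals:
lik <- rbind(
  c(0.90, 0.10, 1e-3, 1e-6, 1e-9),
  c(0.05, 0.80, 0.15, 1e-3, 1e-6),
  c(1e-3, 0.10, 0.80, 0.10, 1e-3)
)
x_gl <- log(lik)

length(x_counts) # ploidy + 1
dim(x_gl)        # individuals by genotypes
```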

p1_ploidy, p2_ploidy

The ploidy of the first or second parent. Should be even.

p1, p2

One of three forms:

The known genotype of the first or second parent.
The vector of genotype log-likelihoods of the first or second parent. Should be base e (natural log).
NULL (completely unknown)
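
The three forms might look as follows for a tetraploid parent. This is only a sketch: the likelihood values are hypothetical, and the assumption that the log-likelihood vector has one entry per possible dosage (ploidy plus 1) is mine.

```r
# Form 1: known genotype (dosage) of a tetraploid parent.
p1_known <- 2

# Form 2: natural-log genotype likelihoods over the dosages 0, 1, ..., 4
# (assumed length: ploidy + 1). Hypothetical values that favor a dosage of 2.
p1_gl <- log(c(0.01, 0.10, 0.78, 0.10, 0.01))

# Form 3: completely unknown parent genotype.
p1_unknown <- NULL

# The most likely dosage implied by the log-likelihood vector.
which.max(p1_gl) - 1
```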

model

One of six forms:

"seg"

Segmental allopolyploid. Allows for arbitrary levels of polysomic and disomic inheritance. This can account for partial preferential pairing. It also accounts for double reduction at simplex loci.

"auto"

Autopolyploid. Allows only for polysomic inheritance. No double reduction.

"auto_dr"

Autopolyploid, allowing for the effects of double reduction.

"allo"

Allopolyploid. Only complete disomic inheritance is explored.

"allo_pp"

Allopolyploid, allowing for the effects of partial preferential pairing. Note that an autopolyploid with complete bivalent pairing and no double reduction is a special case of this model.

"auto_allo"

Only complete disomic and complete polysomic inheritance is studied.

outlier

A logical. Should we allow for outliers (TRUE) or not (FALSE)?

ret_out

A logical. Should we return the probability that each individual is an outlier (TRUE) or not (FALSE)?

ob

The upper bound on the outlier proportion. Defaults to 0.03.

db

Should we use the complete equational segregation model ("ces") or the pure random chromatid segregation model ("prcs") to determine the upper bound(s) on the double reduction rate(s)? See drbounds() for details.

ntry

The number of times to try the optimization. You probably do not want to touch this.

opt

For local optimization, should we use bobyqa (Powell, 2009) or L-BFGS-B (Byrd et al., 1995)? You probably do not want to touch this.

optg

Initial global optimization used to start local optimization. Methods are described in the nloptr package (Johnson, 2008). You probably do not want to touch this. Possible values are:

"NLOPT_GN_MLSL_LDS"

MLSL (Multi-Level Single-Linkage). Kucherenko and Sytsko (2005)

"NLOPT_GN_ESCH"

ESCH (evolutionary algorithm). da Silva Santos et al. (2010)

"NLOPT_GN_CRS2_LM"

Controlled Random Search (CRS) with local mutation. Kaelo and Ali (2006)

"NLOPT_GN_ISRES"

ISRES (Improved Stochastic Ranking Evolution Strategy). Runarsson and Yao (2005)

df_tol

Threshold for the rank of the Jacobian for the degrees of freedom calculation. This accounts for weak identifiability in the null model. You probably do not want to touch this.

chisq

A logical. When using known genotypes, should we run the chi-squared test (TRUE) or the likelihood ratio test (FALSE)? The default is FALSE, the likelihood ratio test.
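
To show how the two statistics differ, here is a generic base R sketch for known genotype counts against a fully specified null. It is not the package's internal code: seg_lrt() estimates the null model and its degrees of freedom, whereas this sketch fixes the null frequencies q0 in advance and uses made-up counts.

```r
# Hypothetical genotype counts for a tetraploid F1 (genotypes 0, ..., 4).
nvec <- c(4, 22, 48, 21, 5)
n <- sum(nvec)

# Null genotype frequencies for a duplex x duplex autotetraploid cross
# without double reduction.
q0 <- c(1, 8, 18, 8, 1) / 36
expected <- n * q0

# Pearson chi-squared statistic.
chisq_stat <- sum((nvec - expected)^2 / expected)

# Likelihood ratio (G) statistic; cells with zero counts contribute zero.
pos <- nvec > 0
lrt_stat <- 2 * sum(nvec[pos] * log(nvec[pos] / expected[pos]))

# With a fully specified null, both statistics are compared to a
# chi-squared distribution with (number of categories - 1) degrees of freedom.
df <- length(nvec) - 1
p_chisq <- stats::pchisq(chisq_stat, df = df, lower.tail = FALSE)
p_lrt <- stats::pchisq(lrt_stat, df = df, lower.tail = FALSE)
c(chisq = p_chisq, lrt = p_lrt)
```

The two statistics are asymptotically equivalent, so the p-values are usually close when the expected counts are not too small.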

Null Model

The gamete frequencies under the null model can be calculated via gamfreq(). The genotype frequencies, which are just a discrete linear convolution (convolve()) of the gamete frequencies, can be calculated via gf_freq().
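
The convolution step can be sketched in base R without calling gamfreq() or gf_freq(). The example assumes two duplex autotetraploid parents with no double reduction, so the gamete dosages follow a hypergeometric distribution; the variable names are illustrative.

```r
# Gamete dosage frequencies (0, 1, 2 copies) for two duplex autotetraploid
# parents with no double reduction (hypergeometric sampling of 2 out of 4
# chromosomes, 2 of which carry the allele).
gam1 <- stats::dhyper(x = 0:2, m = 2, n = 2, k = 2)
gam2 <- gam1

# stats::convolve() with type = "open" and the second argument reversed
# gives the usual discrete convolution of the two gamete distributions.
gf <- stats::convolve(gam1, rev(gam2), type = "open")

# Offspring genotype frequencies for dosages 0, ..., 4, in units of 1/36.
round(gf * 36)
```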

The null model's gamete frequencies for true autopolyploids (model = "auto") or true allopolyploids (model = "allo") are given in the seg data frame that comes with this package. I only made that data frame go up to ploidy 20, but let me know if you need it for higher ploidies.

The polyRAD folks test for full autopolyploid and full allopolyploid, so I included that as an option (model = "auto_allo").

We can account for arbitrary levels of double reduction in autopolyploids (model = "auto_dr") using the gamete frequencies from Huang et al. (2019).

The null model for segmental allopolyploids (model = "allo_pp") is the mixture model of the possible allopolyploid gamete frequencies. The autopolyploid model (without double reduction) is a subset of this mixture model.
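
A small base R sketch of this mixture for a duplex tetraploid parent follows. The ordering of the two pairing configurations is my own choice for illustration and need not match the order used in seg.

```r
# Duplex tetraploid parent: gamete dosage frequencies (0, 1, 2 copies)
# under the two possible disomic pairing configurations.
g_AA_aa <- c(0, 1, 0)          # homologues pair as (A, A) and (a, a)
g_Aa_Aa <- c(0.25, 0.5, 0.25)  # homologues pair as (A, a) and (A, a)

# Mixture of the two configurations with mixing proportions gamma.
mix_gam <- function(gamma) {
  stopifnot(length(gamma) == 2, abs(sum(gamma) - 1) < 1e-12)
  gamma[1] * g_AA_aa + gamma[2] * g_Aa_Aa
}

# Random bivalent pairing puts weight 1/3 on the first configuration and
# 2/3 on the second, which recovers the autotetraploid (hypergeometric)
# gamete frequencies.
auto_mix <- mix_gam(c(1 / 3, 2 / 3))
auto_hyper <- stats::dhyper(0:2, m = 2, n = 2, k = 2)
all.equal(auto_mix, auto_hyper)
```

Other choices of gamma, such as c(1, 0) or c(0, 1), give the two fully disomic cases.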

In the above mixture model, we can account for double reduction for simplex loci (model = "seg") by just slightly reducing the number of simplex gametes and increasing the number of duplex and nullplex gametes. That is, the frequencies for (nullplex, simplex, duplex) gametes go from (0.5, 0.5, 0) to (0.5 + b, 0.5 - 2 * b, b).
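
The adjustment for a simplex parent can be written as a one-line function. This is a minimal sketch: simplex_gam is a hypothetical helper, and the check b <= 0.25 only keeps the frequencies non-negative; the bound the package actually uses is set through db and is tighter.

```r
# Gamete frequencies for (nullplex, simplex, duplex) gametes produced by a
# simplex parent, where b is the shift caused by double reduction.
simplex_gam <- function(b = 0) {
  stopifnot(b >= 0, b <= 0.25) # illustrative bound, not the package's
  c(nullplex = 0.5 + b, simplex = 0.5 - 2 * b, duplex = b)
}

simplex_gam(0)    # no double reduction: (0.5, 0.5, 0)
simplex_gam(0.02) # a small amount of double reduction
```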

model = "seg" is the most general, so it is the default. But you should use other models if you have more information on your species. E.g. if you know you have an autopolyploid, use either model = "auto" or model = "auto_dr".
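
For example, with a known autotetraploid one might compare the two autopolyploid null models as sketched below. The counts are made up, and the sketch assumes the function is available from an installed package named segtest; only arguments listed under Usage are used.

```r
# Only runs when the package (assumed here to be named "segtest") is installed.
if (requireNamespace("segtest", quietly = TRUE)) {
  # Hypothetical genotype counts from a duplex x duplex tetraploid cross.
  nvec <- c(4, 22, 48, 21, 5)

  # Autopolyploid null without double reduction.
  res_auto <- segtest::seg_lrt(
    x = nvec, p1_ploidy = 4, p2_ploidy = 4, p1 = 2, p2 = 2, model = "auto"
  )

  # Autopolyploid null allowing for double reduction.
  res_auto_dr <- segtest::seg_lrt(
    x = nvec, p1_ploidy = 4, p2_ploidy = 4, p1 = 2, p2 = 2, model = "auto_dr"
  )

  p_vals <- c(auto = res_auto$p_value, auto_dr = res_auto_dr$p_value)
  p_vals
}
```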

Unidentified Parameters

Do NOT interpret the estimated parameters in the null$gam list. These parameters are weakly identified (I had to do some fancy spectral methods to account for this in the null distribution of the tests). Even though they are technically identified, you would need a massive data set to be able to estimate them accurately.

Author

David Gerard

References

Byrd, R. H., Lu, P., Nocedal, J., & Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5), 1190-1208. doi:10.1137/0916069
da Silva Santos, C. H., Goncalves, M. S., & Hernandez-Figueroa, H. E. (2010). Designing novel photonic devices by bio-inspired computing. IEEE Photonics Technology Letters, 22(15), 1177-1179. doi:10.1109/LPT.2010.2051222
Gerard, D., Ambrosano, G. B., Pereira, G. d. S., & Garcia, A. A. F. (2025). Tests for segregation distortion in higher ploidy F1 populations. bioRxiv, p. 1-20. doi:10.1101/2025.06.23.661114
Huang, K., Wang, T., Dunn, D. W., Zhang, P., Cao, X., Liu, R., & Li, B. (2019). Genotypic frequencies at equilibrium for polysomic inheritance under double-reduction. G3: Genes, Genomes, Genetics, 9(5), 1693-1706. doi:10.1534/g3.119.400132
Johnson, S. (2008). The NLopt nonlinear-optimization package. https://github.com/stevengj/nlopt
Kaelo, P., & Ali, M. M. (2006). Some variants of the controlled random search algorithm for global optimization. Journal of Optimization Theory and Applications, 130, 253-264. doi:10.1007/s10957-006-9101-0
Kucherenko, S., & Sytsko, Y. (2005). Application of deterministic low-discrepancy sequences in global optimization. Computational Optimization and Applications, 30, 297-318. doi:10.1007/s10589-005-4615-1
Powell, M. J. D. (2009). The BOBYQA algorithm for bound constrained optimization without derivatives. Report No. DAMTP 2009/NA06, Centre for Mathematical Sciences, University of Cambridge, UK.
Runarsson, T. P., & Yao, X. (2005). Search biases in constrained evolutionary optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(2), 233-243. doi:10.1109/TSMCC.2004.841906

Examples

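# Simulate an F1 cross between a simplex tetraploid parent and an octoploid
# parent with dosage 4, then test for segregation distortion.
set.seed(1)
p1_ploidy <- 4
p1 <- 1
p2_ploidy <- 8
p2 <- 4

# Offspring genotype frequencies under the null model, with a small
# outlier proportion (pi = 0.01).
q <- gf_freq(
  p1_g = p1,
  p1_ploidy = p1_ploidy,
  p1_gamma = 1,
  p1_type = "mix",
  p2_g = p2,
  p2_ploidy = p2_ploidy,
  p2_gamma = c(0.2, 0.2, 0.6),
  p2_type = "mix",
  pi = 0.01
)

# Genotype counts for 200 offspring, and simulated genotype log-likelihoods.
nvec <- c(stats::rmultinom(n = 1, size = 200, prob = q))
gl <- simgl(nvec = nvec)

# Test with known genotypes (counts), then with genotype log-likelihoods.
seg_lrt(x = nvec, p1_ploidy = p1_ploidy, p2_ploidy = p2_ploidy, p1 = p1, p2 = p2)$p_value
seg_lrt(x = gl, p1_ploidy = p1_ploidy, p2_ploidy = p2_ploidy, p1 = p1, p2 = p2)$p_value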
