Efficient optimization of model parameters under an HMM SELAC model
SelacHMMOptimize(codon.data.path, n.partitions = NULL, phy,
data.type = "codon", codon.model = "selac",
edge.length = "optimize", edge.linked = TRUE, nuc.model = "GTR",
estimate.aa.importance = FALSE, include.gamma = FALSE,
gamma.type = "quadrature", ncats = 4, numcode = 1,
diploid = TRUE, k.levels = 0, aa.properties = NULL,
verbose = FALSE, n.cores.by.gene = 1, n.cores.by.gene.by.site = 1,
max.tol = 0.001, max.tol.edges = 0.001, max.evals = 1e+06,
max.restarts = 3, user.optimal.aa = NULL,
fasta.rows.to.keep = NULL, recalculate.starting.brlen = TRUE,
output.by.restart = TRUE, output.restart.filename = "restartResult",
user.supplied.starting.param.vals = NULL, tol.step = 1,
optimizer.algorithm = "NLOPT_LN_SBPLX", max.iterations = 6)
Provides the path to the directory containing the gene-specific fasta files of coding data. File names must end in ".fasta".
The number of partitions to analyze. The order is based on the Unix order of the fasta files in the directory.
The phylogenetic tree on which to optimize the model parameters.
The data type being tested. Options are "codon" or "nucleotide".
The type of codon model to use. There are four options: "none", "GY94", "FMutSel0", "selac".
Indicates whether or not edge lengths should be optimized. By default it is set to "optimize"; the other option is "fixed", which uses the user-supplied branch lengths.
A logical indicating whether or not edge lengths should be optimized separately for each gene. By default, a single set of edge lengths is optimized for all genes.
Indicates what type of nucleotide model to use. There are three options: "JC", "GTR", or "UNREST".
Indicates whether the gene-specific importance of the distance parameter is to be estimated.
A logical indicating whether or not to include a discrete gamma model.
Indicates what type of gamma distribution to use. Options are "quadrature", after the Laguerre quadrature approach of Felsenstein (2001), or "median", after the median approach of Yang (1994).
The number of discrete categories.
The NCBI genetic code number for translation. By default the standard (numcode=1) genetic code is used.
A logical indicating whether or not the organism is diploid.
Specifies the number of levels in the polynomial. By default we assume a single level (i.e., linear).
User-supplied amino acid distance properties. By default we assume Grantham (1974) properties.
Logical indicating whether each iteration should be printed to the screen.
The number of cores dedicated to parallelizing analyses across genes.
The number of cores dedicated to parallelizing analyses by site WITHIN a gene. Note that n.cores.by.gene*n.cores.by.gene.by.site is the total number of cores dedicated to the analysis.
Supplies the relative optimization tolerance.
Supplies the relative optimization tolerance for branch lengths only. By default it is the same as max.tol.
Supplies the max number of iterations tried during optimization.
Supplies the number of random restarts.
If optimal.aa is set to "user", this option lets the user supply the optimal amino acids. Must be a list. To get the proper order of the partitions see "GetPartitionOrder" documentation.
Indicates which rows to keep in the input fasta files.
Whether to use given branch lengths in the starting tree or recalculate them.
Logical indicating whether or not each restart is saved to a file. Default is TRUE.
Designates the file name for each random restart.
Designates user-supplied starting values for C.q.phi.Ne, Grantham alpha, and Grantham beta. Default is NULL.
If > 1, makes for a coarser tolerance at earlier iterations of the optimizer.
The optimizer used by nloptr.
Sets the number of cycles to optimize the different parts of the model.
Indicates what type of optimal.aa should be used. There are four options: "none", "majrule", "optimize", or "user".
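Because partitions are matched to fasta files by file order, and the two core arguments multiply, it can help to check both before launching a long run. The base R sketch below is only an illustration: the directory and file names are hypothetical, and sorting in the C locale is an assumption about what "Unix order" means on a given system. The call at the end is left commented out because it needs the selac package, a tree object (here called phy), and real alignments.

```r
# Hypothetical directory holding one alignment per gene (file names are made up).
codon.data.path <- tempfile("selac_example_genes_")
dir.create(codon.data.path)
invisible(file.create(file.path(
  codon.data.path,
  c("gene2.fasta", "gene10.fasta", "gene1.fasta", "notes.txt")
)))

# Only files ending in ".fasta" count as partitions. method = "radix" sorts
# in the C locale, which is one common reading of "Unix order" (an assumption).
fasta.files <- sort(list.files(codon.data.path, pattern = "\\.fasta$"),
                    method = "radix")
print(fasta.files)
# "gene1.fasta" "gene10.fasta" "gene2.fasta"  (gene10 sorts before gene2)

# One partition per fasta file.
n.partitions <- length(fasta.files)

# Total cores requested is the product of the two core arguments.
n.cores.by.gene <- 2
n.cores.by.gene.by.site <- 2
total.cores <- n.cores.by.gene * n.cores.by.gene.by.site
print(total.cores)

# Not run: requires selac, a tree `phy`, and real codon alignments.
# result <- selac::SelacHMMOptimize(
#   codon.data.path = codon.data.path,
#   n.partitions = n.partitions,
#   phy = phy,
#   n.cores.by.gene = n.cores.by.gene,
#   n.cores.by.gene.by.site = n.cores.by.gene.by.site
# )
```

Zero-padding gene numbers in file names (gene01, gene02, ...) is one way to make the file order match the intended gene order.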
A hidden Markov model that no longer optimizes the optimal amino acids, but instead allows the optimal sequence to vary along branches, clades, taxa, etc. As in the original function, we first optimize the parameters of each gene separately while holding the shared parameters (alpha, beta, edge lengths, and nucleotide substitution parameters) constant across genes. We then optimize alpha, beta, the GTR parameters, and the edge lengths while keeping the remaining parameters of each gene fixed. This approach is potentially more efficient than optimizing all parameters simultaneously, especially when fitting models across hundreds of genes.
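The alternating scheme described above can be sketched on a toy problem. This is not the SELAC likelihood or the package's internal code: a sum of squares stands in for the negative log-likelihood, a single shared slope stands in for the shared parameters, one offset per gene stands in for the gene-specific parameters, and base R's optimize() replaces nloptr. All object names are hypothetical.

```r
set.seed(42)
n.genes <- 5
n.sites <- 40

# Toy "truth": one shared parameter and one gene-specific parameter per gene.
true.shared <- 1.5
true.gene <- seq(-1, 1, length.out = n.genes)

# Simulated per-gene data.
x <- lapply(seq_len(n.genes), function(g) runif(n.sites, -1, 1))
y <- lapply(seq_len(n.genes), function(g) {
  true.gene[g] + true.shared * x[[g]] + rnorm(n.sites, sd = 0.1)
})

# Per-gene objective (sum of squares standing in for a negative log-likelihood).
gene.loss <- function(gene.par, shared.par, g) {
  sum((y[[g]] - gene.par - shared.par * x[[g]])^2)
}

# Objective summed over all genes.
total.loss <- function(gene.pars, shared.par) {
  sum(vapply(seq_len(n.genes),
             function(g) gene.loss(gene.pars[g], shared.par, g),
             numeric(1)))
}

gene.pars <- rep(0, n.genes)
shared.par <- 0
max.iterations <- 6
loss.trace <- total.loss(gene.pars, shared.par)

for (cycle in seq_len(max.iterations)) {
  # Step 1: optimize each gene on its own, shared parameter held fixed.
  for (g in seq_len(n.genes)) {
    gene.pars[g] <- optimize(gene.loss, interval = c(-10, 10),
                             shared.par = shared.par, g = g)$minimum
  }
  # Step 2: optimize the shared parameter, gene-specific parameters held fixed.
  shared.par <- optimize(function(s) total.loss(gene.pars, s),
                         interval = c(-10, 10))$minimum
  loss.trace <- c(loss.trace, total.loss(gene.pars, shared.par))
}

# Objective before the first cycle and after each cycle.
print(round(loss.trace, 3))
print(round(c(shared = shared.par, gene = gene.pars), 3))
```

Step 1 treats every gene independently, so each cycle only ever solves small problems and the per-gene fits could be spread across cores. That is the intuition behind the efficiency claim, shown here under the simplifying assumptions stated above.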