SelacHMMOptimize: Efficient optimization of a Hidden Markov SELAC model

Description

Efficient optimization of model parameters under an HMM SELAC model.

Usage

SelacHMMOptimize(codon.data.path, n.partitions = NULL, phy,
  data.type = "codon", codon.model = "selac",
  edge.length = "optimize", edge.linked = TRUE, nuc.model = "GTR",
  estimate.aa.importance = FALSE, include.gamma = FALSE,
  gamma.type = "quadrature", ncats = 4, numcode = 1,
  diploid = TRUE, k.levels = 0, aa.properties = NULL,
  verbose = FALSE, n.cores.by.gene = 1, n.cores.by.gene.by.site = 1,
  max.tol = 0.001, max.tol.edges = 0.001, max.evals = 1e+06,
  max.restarts = 3, user.optimal.aa = NULL,
  fasta.rows.to.keep = NULL, recalculate.starting.brlen = TRUE,
  output.by.restart = TRUE, output.restart.filename = "restartResult",
  user.supplied.starting.param.vals = NULL, tol.step = 1,
  optimizer.algorithm = "NLOPT_LN_SBPLX", max.iterations = 6)
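
As a rough illustration of the signature above, the following sketch assembles a small set of arguments and only calls the function when the selac and ape packages and the input files are actually present. The directory name "fasta_genes/" and the tree file "species.tre" are hypothetical placeholders, and reading the tree with ape::read.tree is an assumption about how the phy object is obtained. All arguments not listed keep the defaults shown in the usage.

```r
# Hypothetical input locations; replace with real paths.
codon.dir <- "fasta_genes/"   # assumed directory of per-gene .fasta files
tree.file <- "species.tre"    # assumed Newick tree file

# Arguments taken from the usage signature; everything else stays at its default.
selac.args <- list(
  codon.data.path = codon.dir,
  data.type = "codon",
  codon.model = "selac",
  edge.length = "optimize",
  nuc.model = "GTR",
  n.cores.by.gene = 1,
  n.cores.by.gene.by.site = 1,
  max.restarts = 3,
  output.restart.filename = "restartResult"
)

# Only run when the packages and inputs exist, so the sketch is safe to source.
if (requireNamespace("selac", quietly = TRUE) &&
    requireNamespace("ape", quietly = TRUE) &&
    dir.exists(codon.dir) && file.exists(tree.file)) {
  selac.args$phy <- ape::read.tree(tree.file)
  result <- do.call(selac::SelacHMMOptimize, selac.args)
}
```

Building the argument list first makes it easy to inspect or log the settings before starting what may be a long optimization.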

Arguments

codon.data.path

Provides the path to the directory containing the gene-specific fasta files of coding data. Each file name must end in ".fasta".

n.partitions

The number of partitions to analyze. The order is based on the Unix order of the fasta files in the directory.
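
To preview how the files in a directory might be ordered, the sketch below lists ".fasta" files and sorts them in byte (C-locale) order. This is an assumption about what "Unix order" means here; the actual order can depend on the system locale, and "GetPartitionOrder" remains the authoritative way to check it. The temporary directory and file names are made up for the demonstration.

```r
# Create a throwaway directory with a few empty files (hypothetical names).
demo.dir <- file.path(tempdir(), "selac_demo_fasta")
dir.create(demo.dir, showWarnings = FALSE)
invisible(file.create(file.path(
  demo.dir, c("gene2.fasta", "gene10.fasta", "gene1.fasta", "notes.txt")
)))

# Keep only files ending in ".fasta", then sort in C-locale (byte) order.
fasta.files <- list.files(demo.dir, pattern = "\\.fasta$")
partition.order <- sort(fasta.files, method = "radix")
print(partition.order)
```

Note that under byte ordering "gene10.fasta" sorts before "gene2.fasta", so zero-padding gene numbers (for example "gene02.fasta") is one way to keep the partition order intuitive.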

phy

The phylogenetic tree on which the model parameters are optimized.

data.type

The data type being tested. Options are "codon" or "nucleotide".

codon.model

The type of codon model to use. There are four options: "none", "GY94", "FMutSel0", "selac".

edge.length

Indicates whether or not edge lengths should be optimized. By default it is set to "optimize"; the other option is "fixed", which uses the user-supplied branch lengths.

edge.linked

A logical indicating whether edge lengths are linked across genes. By default (TRUE), a single set of edge lengths is optimized for all genes; if FALSE, edge lengths are optimized separately for each gene.

nuc.model

Indicates what type of nucleotide model to use. There are three options: "JC", "GTR", or "UNREST".

estimate.aa.importance

Indicates whether the gene-specific importance of distance parameter is to be estimated.

include.gamma

A logical indicating whether or not to include a discrete gamma model.

gamma.type

Indicates what type of gamma distribution to use. Options are "quadrature", after the Laguerre quadrature approach of Felsenstein (2001), or "median", after the median approach of Yang (1994).

ncats

The number of discrete categories.
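
For intuition about what the discrete categories represent, here is a minimal sketch of the median approach to discretizing a gamma distribution: the distribution (with mean 1) is split into ncats equal-probability categories, each represented by its median, and the rates are rescaled to average 1. The function name DiscreteGammaMedian is hypothetical and this is not necessarily how the package computes its rates internally.

```r
# Median-based discrete gamma rates (illustrative; name is hypothetical).
DiscreteGammaMedian <- function(shape, ncats = 4) {
  # Midpoint probabilities of ncats equal-probability categories.
  probs <- (2 * seq_len(ncats) - 1) / (2 * ncats)
  # Gamma with rate = shape has mean 1; take the median of each category.
  medians <- qgamma(probs, shape = shape, rate = shape)
  # Rescale so the category rates average exactly 1.
  medians / mean(medians)
}

rates <- DiscreteGammaMedian(shape = 0.5, ncats = 4)
print(rates)
```

Smaller shape values spread the category rates further apart, which corresponds to stronger rate heterogeneity among sites.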

numcode

The NCBI genetic code number for translation. By default the standard (numcode=1) genetic code is used.

diploid

A logical indicating whether or not the organism is diploid.

k.levels

Specifies the number of levels in the polynomial. By default we assume a single level (i.e., linear).

aa.properties

User-supplied amino acid distance properties. By default we assume Grantham (1974) properties.

verbose

Logical indicating whether each iteration should be printed to the screen.

n.cores.by.gene

The number of cores dedicated to parallelizing analyses across genes.

n.cores.by.gene.by.site

The number of cores dedicated to parallelizing analyses by site WITHIN a gene. Note that n.cores.by.gene*n.cores.by.gene.by.site is the total number of cores dedicated to the analysis.
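
Because the two core settings multiply, it can help to check the product against the machine before launching a run. The sketch below does this with parallel::detectCores(); the specific values 4 and 2 are arbitrary examples.

```r
# Example settings (arbitrary values for illustration).
n.cores.by.gene <- 4
n.cores.by.gene.by.site <- 2

# Total cores the analysis would request.
total.cores <- n.cores.by.gene * n.cores.by.gene.by.site

# detectCores() can return NA on some platforms, so guard against that.
available.cores <- parallel::detectCores()
if (!is.na(available.cores) && total.cores > available.cores) {
  message("Requested ", total.cores, " cores but only ",
          available.cores, " were detected.")
}
```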

max.tol

Supplies the relative optimization tolerance.

max.tol.edges

Supplies the relative optimization tolerance for branch lengths only. By default it is the same as max.tol.

max.evals

Supplies the maximum number of iterations tried during optimization.

max.restarts

Supplies the number of random restarts.

user.optimal.aa

If optimal.aa is set to "user", this option supplies the user-specified optimal amino acids. Must be a list. To get the proper order of the partitions, see the "GetPartitionOrder" documentation.

fasta.rows.to.keep

Indicates which rows to keep in the input fasta files.

recalculate.starting.brlen

Whether to use given branch lengths in the starting tree or recalculate them.

output.by.restart

Logical indicating whether or not each restart is saved to a file. Default is TRUE.

output.restart.filename

Designates the file name for each random restart.

user.supplied.starting.param.vals

Designates user-supplied starting values for C.q.phi.Ne, Grantham alpha, and Grantham beta. Default is NULL.

tol.step

If > 1, uses a coarser tolerance at earlier iterations of the optimizer.

optimizer.algorithm

The optimizer used by nloptr.

max.iterations

Sets the number of cycles to optimize the different parts of the model.

optimal.aa

Indicates what type of optimal.aa should be used. There are four options: "none", "majrule", "optimize", or "user".

Details

This function fits a hidden Markov model that no longer optimizes the optimal amino acids, but instead allows the optimal sequence to vary along branches, clades, taxa, etc. Like the original function, we first optimize the gene-specific parameters for each gene separately while holding the shared parameters (alpha, beta, edge lengths, and nucleotide substitution parameters) constant across genes. We then optimize alpha, beta, GTR, and the edge lengths while keeping the rest of the parameters for each gene fixed. This approach is potentially more efficient than simply optimizing all parameters simultaneously, especially when fitting models across hundreds of genes.
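
The alternating scheme described above can be illustrated on a toy problem that has nothing to do with codon models: each "gene" is a sample of normal data with its own mean (the gene-specific parameter) and a common standard deviation (the shared parameter). This is only a sketch of the general idea using base R's optimize(); the data, the function GeneNegLogLik, and the starting values are all invented for the example and do not reflect the package's internal code.

```r
set.seed(1)
# Toy data: 5 "genes", each with its own mean and one shared standard deviation.
n.genes <- 5
true.mu <- c(-2, -1, 0, 1, 2)
toy.data <- lapply(true.mu, function(m) rnorm(50, mean = m, sd = 1.5))

# Negative log-likelihood of one gene given its mean and the shared log(sd).
GeneNegLogLik <- function(mu, log.sigma, y) {
  -sum(dnorm(y, mean = mu, sd = exp(log.sigma), log = TRUE))
}

mu.hat <- rep(0, n.genes)   # gene-specific starting values
log.sigma.hat <- 0          # shared starting value
max.iterations <- 6         # number of alternating cycles

for (cycle in seq_len(max.iterations)) {
  # Step 1: optimize each gene's own parameter, shared parameter held fixed.
  mu.hat <- vapply(seq_len(n.genes), function(g) {
    optimize(GeneNegLogLik, interval = c(-10, 10),
             log.sigma = log.sigma.hat, y = toy.data[[g]])$minimum
  }, numeric(1))
  # Step 2: optimize the shared parameter, gene-specific parameters held fixed.
  log.sigma.hat <- optimize(function(ls) {
    sum(vapply(seq_len(n.genes), function(g) {
      GeneNegLogLik(mu.hat[g], ls, toy.data[[g]])
    }, numeric(1)))
  }, interval = c(-5, 5))$minimum
}

sigma.hat <- exp(log.sigma.hat)
print(round(mu.hat, 2))
print(round(sigma.hat, 2))
```

Step 1 consists of independent one-gene problems, which is what makes it natural to spread across cores, while Step 2 touches all genes but only a small number of shared parameters.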