FAMoS

FAMoS provides an automated and unbiased model selection algorithm that aims at determining the most appropriate subset of model parameters to describe a specific data set. It only needs three inputs:

A vector containing all parameters
A cost function that calculates the negative log-likelihood for a given parameter set
The number of data points used for analysis

Due to its flexibility with respect to the cost function, the FAMoS can be used for nonlinear problems such as ordinary and partial differential equations.

Installation

You can install FAMoS from github with:

# install.packages("devtools")
devtools::install_github("GabelHub/FAMoS")
# alternative installation command
devtools::install_git("git://github.com/GabelHub/FAMoS.git", branch = "master")

Features

Adaptive methods of model selection

The FAMoS uses three different methods to find appropriate models to test:

Forward search: Add a parameter to the currently best model
Backward elimination: Remove a parameter from the currently best model
Swap search: Replace a parameter of the currently best model by another. The parameters that can be swapped need to be specified by the user

The FAMoS keeps track of the methods used in the previous iterations and dynamically changes them according to the outcome of each iteration.

Easy parallelisation

FAMoS makes use of the future-package which allows for easy parallelisation, meaning many different models can be tested simultaneously if the required computational resources are available.

Smart testing procedures

FAMoS keeps track of previously tested models and checks also that each model fulfills all user-specified restrictions, therefore testing only relevant models and saving computational resources.

Example

As a simple example, we generate a simple data set generated by two parameters and apply the FAMoS on a global model consisting of five different parameters.

 library(FAMoS)
 #set the future plan (see future documentation for more information)
 future::plan(future::sequential)

 #create data with standard deviation of 1
 x.values <- 1:7
 y.values <-  9 * x.values^2 - exp(2 * x.values)
 sd.y.values <- rep(1,7)

 #define initial parameter values
 inits <- c(p1 = 3, p2 = 4, p3 = -2, p4 = 2, p5 = 0)

 #define cost function that returns the negative log-likelihood
 cost_function <- function(parms, x.vals, y.vals, sd.y){
  # restrict the search range to -5 to +5
  if(max(abs(parms)) > 5){
    return(NA)
  }
  with(as.list(c(parms)), {
    res <- p1*4 + p2*x.vals + p3^2*x.vals^2 + p4*sin(x.vals)  - exp(p5*x.vals)
    diff <- sum((res - y.vals)^2/sd.y)
  })
 }

 #perform model selection
 res <- famos(init.par = inits,
             fit.fn = cost_function,
             nr.of.data = length(y.values),
             init.model.type = c("p1", "p3"),
             optim.runs = 1,
             x.vals = x.values,
             y.vals = y.values,
             sd.y = sd.y.values)

The FAMoS returns a lot of verbose output, telling the user what's currently happening. In the beginning, the overall settings are defined and the corresponding directories are created (if they don't exist).

#> Initializing...
#> Create FAMoS directory...
#> 
#> Algorithm run: 001
#> Information criterion for model comparison: AICc
#> Refitting disabled.
#> Starting algorithm with method 'forward'

In each iteration, the FAMoS identifies new models to be tested based on the current search method:

#> FAMoS iteration #2 - method: forward
#> Add p2
#> Add p4
#> Add p5
#> Time passed since start: 1.1 secs

Each model will be submitted and tested. Since FAMoS uses futures for evaluation, the search process can be easily parallelised by setting the corresponding future plan. Every model is subsequently evaluated by performing (multiple) optimisation routines based on optim.

#> Job ID for model 01 - 11100:
#> 
#> Fitting run # 1
#>          p1          p2          p3 
#> -4.99979345  0.00000000  0.00570456 
#> 1471914833547.12 
#> 1471910549134.14 
#> 1471910349497.77 
#> 1471910340394.53 
#> 1471910340394.53 
#> No results file present. Current parameters saved.
#> 
#> Status file updated. Fitting done.

After all models have been evaluated, the algorithm reads in the results and checks, if a better model was found

#> Evaluate results ...
#> Best AICc of this run is 14
#> Parameter p5 was added
#> Time passed since start: 1.92 secs

The cycle continues until no better model can be found based on the currently used methods. After halting, the results are returned

#> Best model found. Algorithm stopped.
#> FAMoS run 001
#> AICc of best model: 7
#> Best model (binary): 00101
#> Best model (vector):
#> p1 p2 p3 p4 p5 
#>  0  0  1  0  1 
#> Estimated parameter values:
#> p1 p2 p3 p4 p5 
#>  0  0 -3  0  2 
#> Time needed: 3.84 secs

FAMoS options in detail

init.par

The vector init.par is one of the three essential variables that need to be specified. It contains the names and initial values of all parameters, that the FAMoS is supposed search through. In our example above, we specified this vector as

#define initial parameter values
 inits <- c(p1 = 3, p2 = 4, p3 = -2, p4 = 2, p5 = 0)

Depending on the starting model, FAMoS automatically extracts the corresponding values and uses them for its first iteration only. All following iterations inherit the best values found during fitting.

Additional specifications for the use of the inital parameter vector can be supplied by the options do.not.fit and default.val.

fit.fn

To allow independence of a certain mathematical model structure, the user can specify any cost function that is able to calculate the negative log-likelihood (-2LL) from a given parameter set. The necessity for returning the negative log-likelihood stems from the fact that model selection is performed on the AIC(c) or BIC. The basic structure of the cost function is therefore:

cost.function <- function(parms, ...){
  ... calculate the negative log-likelihood based on parms...
  return(negative log-likelihood)
}

Due to this structure, the FAMoS is able to tackle many different problems, e.g. modelling approaches based on ODEs or PDEs.

Note that the cost function needs to take the complete parameter set (as specified in init.par) as an input. The exact composition of the parms vector is adaptively set by the FAMoS, which inherits the values to be fitted from the previous best fits and sets the non-fitted parameters to zero (or another value, if specified in the option default.val).

nr.of.data

To calculate the respective information criteria (AIC(c) or BIC, see Burnham and Anderson 2002 for details) the number of different data points needs to be supplied as well.

homedir

The FAMoS generates and saves many different files, in order to make results available over time as well as to simultaneously running FAMoS runs. homedir specifies the folder, in which all results are going to be stored.

do.not.fit

In order to exclude some parameters from the fitting procedures, their names can be specified in the do.not.fit option. This allows to test different model restrictions without needing to change either init.par or fit.fn. For example, if we wanted to exlude the parameter p4 from our analysis, we would specify initially

#define initial parameter values
 inits <- c(p1 = 3, p2 = 4, p3 = -2, p4 = 2, p5 = 0)
 no.fit <- c("p4")

and pass this option on to FAMoS. Note that excluded parameters are automatically removed from the initial model, if init.model.type = "random" or init.model.type = "global" is used. If the user-specfied initial model contains an exluded parameter, an error will be returned.

  The specified initial model violates critical conditions or the do.not.fit specifications

method

FAMoS can use three different methods to search for different models to test: Forward search, backward elimination and swap search. As the algorithm dynamically changes these methods over the course of each iteration, the option method only specifies the starting method.

If the algorithm is able to find a better model, the current method will be used in the next iteration as well (except the swap method, which always uses a forward serch next - if it doesn't terminate in that step). If no better model is found, the algorithm will change the method according to the following scheme:

current method	previous method	next method
forward	backward	swap (or terminate)
forward	forward or swap	backward
backward	backward	forward
backward	forward	swap (or terminate)
swap	forward or backward	terminate

In case the swap method is not used (due to unspecified critical or swap sets), the algorithm will terminate after a succession of an unsuccessful forward and backward search.

init.model.type

To verify if the FAMoS results are consistent, it is important to run the algorithm with different starting models. To set the initial model, the user can either use the built-in options random (which generates a random model) or global (which uses the complete model as a starting point). Alternatively, the user can specify a model by supplying a parameter vector containing the names of the initial model.

#Three options for the starting model
init.model1 <- "random" # generates a random starting model
init.model2 <- "global" # uses all available parameters
init.model3 <- c("p1", "p4") # a user-specified model

In case random or global are chosen, the FAMoS automatically applies critical conditions and removes excluded parameters (see options critical.parameters and do.not.fit).

refit

Before testing a model, the FAMoS checks if this model has been tested before. In case refit = FALSE (default) is specified, the model will not be tested again. If refitting is set to TRUE, FAMoS will try to optimise the model again. If the new run returns a better fit, the old results will be overwritten, otherwise the new run will be discarded.

Refitting makes sense if the model optimisation is dependent on the initial parameter combination (see also optim.runs). If a model is reencountered, it might well be that the new parameter set to be tested with is much more appropriate than the previous one, especially if this reencounter happens within the same FAMoS run.

optim.runs

Finding the best fit for each model is crucial to guarantee a correct model selection procedure. Fitting in FAMoS is based on the built-in function optim, which is repeatedly evaluated. Here, optim.runs gives the number of fitting attempts with different starting conditions. Here, the first fitting attempt takes the inherited parameter vectors from previous runs, while all following fitting attempts randomly samples parameter vectors to test.

As the default optimisation method is based on the Nelder-Mead approach, which - in my experience - tends to not give reliable results if only one optimisation is performed, the optimisation for each fitting attempt is wrapped into a while-loop, in which the fitting procedure is repeatedly halted and restarted (based on the options control.optim), until the relative convergence tolerance in con.tol is reached.

The skeleton of the underlying code looks like this:


for(i in 1:optim.runs){#number of fitting attempts specified by optim.runs
  start.parameters <- either the inherited or a randomly sampled set (for i > 2)
  
  while(abs((old.optim.value - new.optim.value)/old.optim.value) < con.tol){
    ... run optim with start.parameters ...
    start.parameters <- new parameters estimated by optim
  }
}

information.criterion

The information criterion used for performing the model selection. Options include AICc, AIC or BIC (see Burnham and Anderson 2002 for more details).

default.val

Normally, FAMoS sets the parameters that are not fitted equal to zero. However, this might not be appropriate if, for example, a parameter describes an initial condition or a baseline turnover. Here, default.val allows to specify the value that a parameter assumes, if it is not fitted. default.val needs to be given as a named list, which can either store numerical values or the name of the parameter from which the value should be inherited. For example

 #define initial parameter values
 inits <- c(p1 = 3, p2 = 4, p3 = -2, p4 = 2, p5 = 0)
 #set default values
 def.val <- list(p1 = 2, p2 = -5, p3 = "p1", p4 = 0, p5 = "p4")

Here, the values of p1, p2, and p4 are set to their respective values. However, p3 and p5 will inherit their values from p1 and p4, respectively. This feature is useful if two rates describe similar processes and one wants to test if the difference between them is significant enough to warrant the fitting of an additional parameter. Here's a short example

cost.function <- function(parms){
  x <- par1 + par2*x
  y <- par3 + par4*x
}

def.val <- list(p1 = 0, p2 = 0, p3 = "p1", p4 = "p2")

Note that the parameter inheritance cannot be chained, meaning that entries that point to another parameter need a numeric value to access

#INCORRECT use of default.val
def.val <- list(p1 = 1, p2 = "p1", p3 = "p2", p4 = "p3")
#CORRECT use of default.val
def.val <- list(p1 = 1, p2 = "p1", p3 = "p1", p4 = "p1")

swap.parameters

The swap search that FAMoS can perform relies on sets which specify parameters that can be swapped by one another. For example, if we wanted to allow parameters p1, p2 and p3, as well as p4 and p5 to be replaceable by each other, we would specify:


swap.set <- list(c("p1", "p2", "p3"), c("p4", "p5"))

critical.parameters

In some cases, it does not make sense to fit certain submodels of the global model to the data, as they might lack crucial parameters. FAMoS can incorporate these restrictions by the specification of critical parameter sets. For example, if at least one of the first three parameters need to be present in the model, and all models that don't feature p4 are not correct, we can specify:

crit.set <- list(c("p1", "p2", "p3"), c("p4"))

All critical sets are also automatically used in the swap search.

random.borders

Since the parameters of all optim.runs larger than one are sampled based on a random uniform distribution, it might be important to set the correct sampling intervals. By default, FAMoS samples parameters with a 100% deviation of the inherited parameter values (for example, if a model contains two parameters, and the currently best values are p1 = 0.1 and p2 = -1000, the sampled values will lie in the intervals [0,0.2] and [-2000,0], respectively). Alternatively, the user can specify relative or absolute sampling intervals. For relative intervals, a numeric value has to be given for each parameter denoting its relative deviation. For absolute sampling intervals, a matrix containing the lower and upper borders has to be specified. Here's an example:

#relative sampling ranges
random.bord1 <- 0.3 # deviates all parameters by 30%
random.bord2 <- c(0.1, 0.5, 0.2) # deviates the parameters by 10%, 50% and 20%, respectively

#absolute sampling ranges
random.bord3 <- matrix(c(1,2), nrow = 1) #uses the interval [1,2] for all parameter samples
random.bord4 <- cbind(c(0,-10, 0.3), c(5, -9, 0.7)) #uses the intervals [0,5], [-10,-9] and [0.3, 0.7] to sample the respective parameters

#use a function to sample the results
random.bord5 <- rnorm #note that in this case, 'mean' and 'sd' need to passed to famos as well, if other values than the default settings should be used

control.optim

Specifies the control options used for optim (see optim.runs for more details).

parscale.pars

If parameters values span over several orders of magnitudes, using the built-in option parscale in optim can reduce the numbers of evaluations needed. Setting parscale.pars = TRUE automatically adjusts the scaling procedure in optim. In our experience, using parscale.pars = TRUE is usually beneficial if a large number of parameters with different orders of magnitude need to be fitted. However, the actual performance is very problem-specific and therefore we would recommend initially testing both approaches to see which one performs better for the problem at hand. Also, one needs to make sure that the other options given in control.optim and con.tol are specified appropriately.

con.tol

Specifies the relative convergence tolerance and determines when the repeated use optim fits will be terminated (see optim.runs for more details).

save.performance

If true, a plot of the current FAMoS performance is stored in the folder "FAMoS-Results/Figures/", which will be updated during each iteration.

future.off

Initially, FAMoS can be tricky to set up. To this end, the use of futures can be switched off, which prints all function output and error messages directly into the console. This can be quite useful for debugging.

log.interval

If futures are used during a FAMoS run, there will be a message printed every X seconds, informing the user which models fits are still running. log.interval allows to specify the interval of X. Default to 10 minutes (600 seconds).