apollo (version 0.0.7)

apollo_outOfSample: Out-of-sample fit (LL)

Description

Randomly generates estimation and validation samples, estimates the model on the first and calculates the likelihood for the second, then repeats.

Usage

apollo_outOfSample(apollo_beta, apollo_fixed, apollo_probabilities,
  apollo_inputs, estimate_settings = list(estimationRoutine = "bfgs",
  maxIterations = 200, writeIter = FALSE, hessianRoutine = "numDeriv",
  printLevel = 3L, silent = TRUE), outOfSample_settings = list(nRep = 10,
  validationSize = 0.1))

Arguments

apollo_beta

Named numeric vector. Names and values for parameters.

apollo_fixed

Character vector. Names (as defined in apollo_beta) of parameters whose value should not change during estimation.

apollo_probabilities

Function. Returns probabilities of the model to be estimated. Must receive three arguments:

  • apollo_beta: Named numeric vector. Names and values of model parameters.

  • apollo_inputs: List containing options of the model. See apollo_validateInputs.

  • functionality: Character. Can be "estimate" (default), "prediction", "validate", "conditionals", "zero_LL", or "raw".
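
As an illustrative sketch only (the parameter names, utilities and data columns below are hypothetical, not part of this help page), an apollo_probabilities function for a simple MNL model typically follows the standard Apollo template:

```r
apollo_probabilities <- function(apollo_beta, apollo_inputs,
                                 functionality = "estimate") {
  # Attach parameters and data so they can be referenced by name
  apollo_attach(apollo_beta, apollo_inputs)
  on.exit(apollo_detach(apollo_beta, apollo_inputs))

  P <- list()

  # Utilities of a hypothetical two-alternative model; asc_1, b_cost,
  # cost_1, cost_2 and choice are assumed to exist in apollo_beta and
  # in the database
  V <- list()
  V[["alt1"]] <- asc_1 + b_cost * cost_1
  V[["alt2"]] <-         b_cost * cost_2

  mnl_settings <- list(
    alternatives = c(alt1 = 1, alt2 = 2),
    avail        = list(alt1 = 1, alt2 = 1),
    choiceVar    = choice,
    V            = V
  )

  # Compute model probabilities for the requested functionality
  P[["model"]] <- apollo_mnl(mnl_settings, functionality)
  P <- apollo_prepareProb(P, apollo_inputs, functionality)
  return(P)
}
```

apollo_outOfSample calls this function repeatedly, once per estimation/validation pair, so it must work unchanged for any subset of the database.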

apollo_inputs

List grouping most common inputs. Created by function apollo_validateInputs.

estimate_settings

List. Options controlling the estimation process. See apollo_estimate.

outOfSample_settings

List. Options defining the sampling procedure. The following are valid options.

nRep

Numeric scalar. Number of times a different pair of estimation and validation samples is extracted from the full database. Default is 10.

validationSize

Numeric scalar. Size of the validation sample. Can be given either as a share of the individuals in the sample (a value between 0 and 1) or as the number of individuals in the validation sample (a value greater than 1). Default is 0.1.
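
For instance (the values below are illustrative), the two ways of specifying the validation sample size are:

```r
# 10% of individuals in each validation sample (share, between 0 and 1)
outOfSample_settings <- list(nRep = 10, validationSize = 0.1)

# Exactly 50 individuals in each validation sample (count, greater than 1)
outOfSample_settings <- list(nRep = 10, validationSize = 50)
```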

Value

A matrix with the log-likelihood of the model in both the estimation and validation samples, for each repetition. If the model has multiple components, the log-likelihood is reported for each of them. A more complete matrix, also containing the estimates, is written to a file called <model_name>_outOfSample.csv in the current working directory.

Details

A common way to test for overfitting of a model is to measure its fit on a sample not used during estimation, that is, to measure its out-of-sample fit. A simple way to do this is to split the complete available dataset in two parts: an estimation sample and a validation sample. The model of interest is estimated using only the estimation sample, and the estimated parameters are then used to measure the fit of the model (e.g. its log-likelihood) on the validation sample. Doing this with only one validation sample, however, may lead to biased results, as a particular validation sample need not be representative of the population. One way to minimise this issue is to randomly draw several pairs of estimation and validation samples from the complete dataset, and to apply the procedure to each pair. The splitting of the database into estimation and validation samples is done at the individual level, not at the observation level.
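
An end-to-end sketch of a typical call (the model setup is assumed to exist already, as in a standard Apollo script; nothing here beyond apollo_validateInputs and apollo_outOfSample themselves is prescribed by this help page):

```r
library(apollo)

# apollo_initialise(), apollo_control, the database, apollo_beta,
# apollo_fixed and apollo_probabilities are assumed to have been
# defined earlier in the script, as usual for Apollo models.

apollo_inputs <- apollo_validateInputs()

LL_out <- apollo_outOfSample(
  apollo_beta, apollo_fixed,
  apollo_probabilities, apollo_inputs,
  outOfSample_settings = list(nRep = 10, validationSize = 0.1)
)

# Each row is one repetition; columns report the log-likelihood in the
# estimation and validation samples. Full results, including the
# estimates, are written to <model_name>_outOfSample.csv.
colMeans(LL_out)
```

Averaging the validation-sample log-likelihood across repetitions gives a less noisy measure of out-of-sample fit than any single split.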