maxent.ot (version 1.0.0)

cross_validate: Cross-validate bias parameters for constraint weights.

Description

Performs k-fold cross-validation on a data set for each of a set of input bias parameters. Cross-validation allows the space of bias parameters to be searched to find the settings that best support generalization to unseen data.

Usage

cross_validate(
  input,
  k,
  mu_values,
  sigma_values,
  grid_search = FALSE,
  output_path = NA,
  out_sep = ",",
  control_params = NA,
  upper_bound = DEFAULT_UPPER_BOUND,
  encoding = "unknown",
  model_name = NA,
  allow_negative_weights = FALSE
)

Value

A data frame with the following columns:

  • model_name: the name of the model

  • mu: the value(s) of mu tested

  • sigma: the value(s) of sigma tested

  • folds: the number of folds

  • mean_ll: the mean log likelihood of k-fold cross-validation using these bias parameters
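
Because a higher log likelihood on held-out data suggests better generalization, one way to choose a setting is to take the row with the largest mean_ll. The sketch below assumes a result with the columns listed above and scalar mu and sigma values; the numbers are invented placeholders, not output from the package, and the name cv_result is illustrative.

```r
# Placeholder result in the documented shape (all values are invented)
cv_result <- data.frame(
  model_name = "tableaux_df",
  mu = c(0, 1),
  sigma = c(0.01, 0.1),
  folds = 2,
  mean_ll = c(-120.5, -98.2)
)

# Row with the highest (least negative) mean log likelihood
best <- cv_result[which.max(cv_result$mean_ll), ]
print(best)
```

When per-constraint vectors are used for mu or sigma, the layout of those columns may differ, so inspect the returned data frame before indexing it this way.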

Arguments

input

The input data frame/data table/tibble. This should contain one or more OT tableaux consisting of mappings between underlying and surface forms with observed frequency and violation profiles. Constraint violations must be numeric.

For an example of the data frame format, see inst/extdata/sample_data_frame.csv. You can read this file into a data frame using read.csv or into a tibble using readr::read_csv.

This function also supports the legacy OTSoft file format. You can use this format by passing in a file path string to the OTSoft file rather than a data frame.

For examples of OTSoft format, see inst/extdata/sample_data_file.txt.

k

The number of folds to use in cross-validation.

mu_values

A vector or list of mu bias parameters to use in cross-validation. Parameters may either be scalars, in which case the same mu parameter will be applied to every constraint, or vectors/lists containing a separate mu bias parameter for each constraint.

sigma_values

A vector or list of sigma bias parameters to use in cross-validation. Parameters may either be scalars, in which case the same sigma parameter will be applied to every constraint, or vectors/lists containing a separate sigma bias parameter for each constraint.

grid_search

(optional) If TRUE, the Cartesian product of the values in mu_values and sigma_values will be validated. For example, if mu_values = c(0, 1) and sigma_values = c(0.1, 1), cross-validation will be done on the mu/sigma pairs (0, 0.1), (0, 1), (1, 0.1), (1, 1). If FALSE (default), cross-validation will be done on each pair of values at the same indices in mu_values and sigma_values. For example, if mu_values = c(0, 1) and sigma_values = c(0.1, 1), cross-validation will be done on the mu/sigma pairs (0, 0.1), (1, 1).
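
The two pairing modes can be illustrated in base R without calling the package. This is only a sketch of which mu/sigma pairs get tested, assuming scalar parameters; the names paired and grid are illustrative and not part of maxent.ot, and the order in which the package visits the pairs may differ.

```r
mu_values <- c(0, 1)
sigma_values <- c(0.1, 1)

# grid_search = FALSE: values at the same index are paired
paired <- data.frame(mu = mu_values, sigma = sigma_values)

# grid_search = TRUE: every mu is combined with every sigma
# (expand.grid varies its first argument fastest, so sigma is listed first
# and the columns are reordered afterwards)
grid <- expand.grid(sigma = sigma_values, mu = mu_values)[, c("mu", "sigma")]

print(paired)  # 2 rows
print(grid)    # 4 rows
```

With n values of mu and m values of sigma, the paired mode requires n == m and tests n settings, while the grid mode tests n * m settings.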

output_path

(optional) A string specifying the path to a file to which the cross-validation results will be saved. If the file exists it will be overwritten. If this argument isn't provided, the output will not be written to a file.

out_sep

(optional) The delimiter used in the output file. Defaults to a comma.

control_params

(optional) A named list of control parameters that will be passed to the optim function. See the documentation of that function for details. Note that some parameter settings may interfere with optimization. The parameter fnscale will be overwritten with -1 if specified, since this must be treated as a maximization problem.
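
As a reminder of why fnscale is forced to -1: optim minimizes by default, and a negative fnscale turns the problem into maximization. The sketch below uses only base R. The objective is a toy function rather than the package's likelihood, and the maxit value and the BFGS method are arbitrary choices for illustration.

```r
# Toy objective with a single maximum at x = 2 (not the package's objective)
objective <- function(x) -(x - 2)^2

# A control list of the kind that could be passed as control_params
control_params <- list(maxit = 500, fnscale = 1)

# Mimic the documented behaviour: fnscale is overwritten with -1
control_params$fnscale <- -1

fit <- optim(par = 0, fn = objective, method = "BFGS", control = control_params)
print(fit$par)  # close to 2
```

Had fnscale been left at 1, optim would have searched for a minimum of this objective instead of its maximum.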

upper_bound

(optional) The maximum value for constraint weights. Defaults to 100.

encoding

(optional) The character encoding of the input file. Defaults to "unknown".

model_name

(optional) A name for the model. If not provided and the input is a file path, the file name will be used; if the input is a data frame, the name of the variable will be used.

allow_negative_weights

(optional) Whether the optimizer should allow negative weights. Defaults to FALSE.

Details

The cross-validation procedure is as follows:

  1. Randomly divide the data into k partitions.

  2. Iterate through every combination of mu and sigma specified in the input arguments (see the documentation for the grid_search argument for details on how this is done).

  3. For each combination, for each of the k partitions, train a model on the other (k-1) partitions using optimize_weights and then run predict_probabilities on the remaining partition.

  4. Record the mean log likelihood that the models assign to the held-out partitions.
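
The steps above can be sketched in base R. This is a generic outline of k-fold cross-validation, not the package's implementation: train_fn and loglik_fn are hypothetical stand-ins for optimize_weights and predict_probabilities, and the toy usage fits a single Bernoulli probability rather than a MaxEnt grammar.

```r
# Generic k-fold loop; train_fn and loglik_fn are hypothetical placeholders.
k_fold_mean_ll <- function(data, k, train_fn, loglik_fn) {
  # Step 1: randomly assign each row to one of k partitions
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  held_out_ll <- numeric(k)
  for (i in seq_len(k)) {
    train <- data[fold_id != i, , drop = FALSE]
    test <- data[fold_id == i, , drop = FALSE]
    # Step 3: fit on the other k - 1 partitions, score the held-out one
    model <- train_fn(train)
    held_out_ll[i] <- loglik_fn(model, test)
  }
  # Step 4: average the held-out log likelihoods
  mean(held_out_ll)
}

# Toy usage: estimate one success probability with add-one smoothing
set.seed(1)
toy <- data.frame(y = rbinom(40, size = 1, prob = 0.7))
train_fn <- function(d) (sum(d$y) + 1) / (nrow(d) + 2)
loglik_fn <- function(p, d) sum(dbinom(d$y, size = 1, prob = p, log = TRUE))
mean_ll <- k_fold_mean_ll(toy, k = 4, train_fn, loglik_fn)
print(mean_ll)
```

Step 2 corresponds to calling a function like this once per mu/sigma setting and collecting one mean log likelihood per setting.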

Examples

  # Get the path to a sample data file and read it into a data frame. Note that
  # you can also pass the path to an OTSoft file directly, as described in the
  # documentation for the `input` argument.
  data_file <- system.file(
      "extdata", "amp_demo_grammar.csv", package = "maxent.ot"
  )
  tableaux_df <- read.csv(data_file)

  # Define mu and sigma parameters to try
  mus <- c(0, 1)
  sigmas <- c(0.01, 0.1)

  # Do 2-fold cross-validation
  cross_validate(tableaux_df, 2, mus, sigmas)

  # Do 2-fold cross-validation with grid search of parameters
  cross_validate(tableaux_df, 2, mus, sigmas, grid_search=TRUE)

  # You can also use vectors/lists for some/all of the bias parameters to set
  # separate biases for each constraint
  mus_v <- list(
    c(0, 1),
    c(1, 0)
  )
  sigmas_v <- list(
    c(0.01, 0.1),
    c(0.1, 0.01)
  )

  cross_validate(tableaux_df, 2, mus_v, sigmas_v)

  # Save cross-validation results to a file
  tmp_output <- tempfile()
  cross_validate(tableaux_df, 2, mus, sigmas, output_path=tmp_output)