The objective function \(J(w)\) that is optimized is defined as
$$J(w) = \sum_{i=1}^{n}{\ln P(y_i|x_i; w)}
- \sum_{k=1}^{m}{\frac{(w_k - \mu_k)^2}{2\sigma_k^2}}$$
The first term in this equation calculates the natural logarithm of the
conditional likelihood of the training data under the weights \(w\). \(n\)
is the number of data points (i.e., the sample size or the sum of the frequency
column in the input), \(x_i\) is the input form of the \(i\)th data
point, and \(y_i\) is the observed surface form corresponding to
\(x_i\). \(P(y_i|x_i; w)\) represents the probability of realizing
underlying \(x_i\) as surface \(y_i\) given weights \(w\). This
probability is defined as
$$P(y_i|x_i; w) = \frac{1}{Z_w(x_i)}\exp(-\sum_{k=1}^{m}{w_k f_k(y_i, x_i)})$$
where \(f_k(y_i, x_i)\) is the number of violations of constraint \(k\)
incurred by mapping underlying \(x_i\) to surface \(y_i\). \(Z_w(x_i)\)
is a normalization term defined as
$$Z_w(x_i) = \sum_{y\in\mathcal{Y}(x_i)}{\exp(-\sum_{k=1}^{m}{w_k f_k(y, x_i)})}$$
where \(\mathcal{Y}(x_i)\) is the set of observed surface realizations of
input \(x_i\).
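To make the mapping from violation counts to probabilities concrete, here is a minimal sketch in R. The violation matrix `f` and the function name `candidate_probs` are purely illustrative assumptions (not part of the package): rows of `f` are the candidate surface forms for a single input and columns are constraints.

```r
# Sketch: conditional probabilities P(y | x_i; w) for one input's candidates.
candidate_probs <- function(f, w) {
  harmonies <- as.vector(f %*% w)  # sum_k w_k * f_k(y, x_i) for each candidate
  unnorm <- exp(-harmonies)        # unnormalized probability exp(-harmony)
  unnorm / sum(unnorm)             # divide by Z_w(x_i), the sum over candidates
}

# Example: two candidates, two constraints, weights 2 and 1
f <- matrix(c(0, 1,   # candidate 1 violates constraint 2 once
              1, 0),  # candidate 2 violates constraint 1 once
            nrow = 2, byrow = TRUE)
candidate_probs(f, w = c(2, 1))
```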
The second term of the objective function is the optional bias term, where \(w_k\) is the weight of constraint \(k\), and
\(\mu_k\) and \(\sigma_k\) parameterize a normal distribution that
serves as a prior for the value of \(w_k\). \(\mu_k\) specifies the mean
of this distribution (the expected weight of constraint \(k\) before
seeing any data) and \(\sigma_k\) reflects certainty in this value: lower
values of \(\sigma_k\) penalize deviations from \(\mu_k\) more severely,
and thus require greater amounts of data to move \(w_k\) away from
\(\mu_k\). While increasing \(\sigma_k\) will improve the fit to the
training data, it may result in overfitting, particularly for small data
sets.
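The full objective can be sketched along the same lines. The function below is a hedged illustration rather than the package's implementation: it reuses `candidate_probs` from the sketch above and assumes (hypothetically) one violation matrix and one observed-candidate index per data point, with the Gaussian bias applied only when `mu` and `sigma` are supplied.

```r
# Sketch: J(w) = log conditional likelihood minus the optional Gaussian bias.
objective <- function(w, tableaux, obs, mu = NULL, sigma = NULL) {
  # First term: sum over data points of ln P(y_i | x_i; w),
  # computed naively (no numerical stabilization) for clarity
  loglik <- sum(mapply(function(f, y) log(candidate_probs(f, w)[y]),
                       tableaux, obs))
  # Second term: penalty for deviation of each w_k from its prior mean mu_k
  penalty <- if (!is.null(mu)) sum((w - mu)^2 / (2 * sigma^2)) else 0
  loglik - penalty
}
```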
A general bias with \(\mu_k = 0\) for all \(k\) is commonly used as a
form of simple regularization to prevent overfitting (see, e.g., Goldwater
and Johnson 2003). Bias terms have also been used to model proposed
phonological learning biases; see for example Wilson (2006), White (2013),
and Mayer (2021, Ch. 4). The choice of \(\sigma\) depends on the sample
size. As the number of data points increases, \(\sigma\) must decrease in
order for the effect of the bias to remain constant: specifically,
\(n\sigma^2\) must be held constant, where \(n\) is the number of tokens.
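For example, holding \(n\sigma^2\) constant means that quadrupling the number of tokens halves \(\sigma\). The helper below is purely illustrative (the name and arguments are not part of the package) and rescales a previously chosen \(\sigma\) to a new sample size:

```r
# Sketch: keep n * sigma^2 constant when the token count changes.
rescale_sigma <- function(sigma_old, n_old, n_new) {
  sqrt(n_old * sigma_old^2 / n_new)
}
rescale_sigma(sigma_old = 10, n_old = 100, n_new = 400)  # returns 5
```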
Optimization is done using the optim function from the R-core
statistics library. By default it uses L-BFGS-B optimization, a
quasi-Newton method that allows upper and lower bounds on variables.
Constraint weights are restricted to finite, non-negative values.
If no bias parameters are specified (via either the bias_file argument or the
mu and sigma parameters), optimization will be done without the bias term.
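For illustration, a fit of this kind could be sketched with optim roughly as follows, reusing the `objective` and `candidate_probs` sketches above. The toy data, starting weights, and the finite upper bound are assumptions for the example, not the package's actual defaults.

```r
# Tiny illustrative data set: one input whose two candidates were observed
# 2 and 1 times respectively, listed as one tableau per token.
f <- matrix(c(0, 1,   # candidate 1: one violation of constraint 2
              1, 0),  # candidate 2: one violation of constraint 1
            nrow = 2, byrow = TRUE)
tableaux <- list(f, f, f)
obs      <- c(1, 1, 2)   # observed candidate (row index) for each token

# optim minimizes, so the objective is negated; lower = 0 enforces
# non-negative weights, and 100 is an assumed finite upper bound.
fit <- optim(
  par    = c(0, 0),                                   # initial weights
  fn     = function(w) -objective(w, tableaux, obs),  # no bias term supplied
  method = "L-BFGS-B",
  lower  = 0,
  upper  = 100
)
fit$par  # fitted constraint weights
```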