The objective function \(J(w)\) that is optimized is defined as
$$J(w) = \sum_{i=1}^{n}{\ln P(y_i|x_i; w)}
- \sum_{k=1}^{m}{\frac{(w_k - \mu_k)^2}{2\sigma_k^2}}$$
The first term in this equation calculates the natural logarithm of the
conditional likelihood of the training data under the weights \(w\). \(n\)
is the number of data points (i.e., the sample size or the sum of the frequency
column in the input), \(x_i\) is the input form of the \(i\)th data
point, and \(y_i\) is the observed surface form corresponding to
\(x_i\). \(P(y_i|x_i; w)\) represents the probability of realizing
underlying \(x_i\) as surface \(y_i\) given weights \(w\). This
probability is defined as
$$P(y_i|x_i; w) = \frac{1}{Z_w(x_i)}\exp(-\sum_{k=1}^{m}{w_k f_k(y_i, x_i)})$$
where \(f_k(y_i, x_i)\) is the number of violations of constraint \(k\)
incurred by mapping underlying \(x_i\) to surface \(y_i\). \(Z_w(x_i)\)
is a normalization term defined as
$$Z_w(x_i) = \sum_{y\in\mathcal{Y}(x_i)}{\exp(-\sum_{k=1}^{m}{w_k f_k(y, x_i)})}$$
where \(\mathcal{Y}(x_i)\) is the set of observed surface realizations of
input \(x_i\).
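To make the mapping from violation counts to probabilities concrete, here is a minimal sketch in R. The violation matrix `f` and the function name `candidate_probs` are purely illustrative assumptions (not part of the package): rows of `f` are the candidate surface forms for a single input and columns are constraints.

```r
# Sketch: conditional probabilities P(y | x_i; w) for one input's candidates.
candidate_probs <- function(f, w) {
  harmonies <- as.vector(f %*% w)  # sum_k w_k * f_k(y, x_i) for each candidate
  unnorm <- exp(-harmonies)        # unnormalized probability exp(-harmony)
  unnorm / sum(unnorm)             # divide by Z_w(x_i), the sum over candidates
}

# Example: two candidates, two constraints, weights 2 and 1
f <- matrix(c(0, 1,   # candidate 1 violates constraint 2 once
              1, 0),  # candidate 2 violates constraint 1 once
            nrow = 2, byrow = TRUE)
candidate_probs(f, w = c(2, 1))
```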
The second term of the objective function is the optional bias term, where \(w_k\) is the weight of constraint \(k\), and
\(\mu_k\) and \(\sigma_k\) parameterize a normal distribution that
serves as a prior for the value of \(w_k\). \(\mu_k\) specifies the mean
of this distribution (the expected weight of constraint \(k\) before
seeing any data) and \(\sigma_k\) reflects certainty in this value: lower
values of \(\sigma_k\) penalize deviations from \(\mu_k\) more severely,
and thus require greater amounts of data to move \(w_k\) away from
\(\mu_k\). While increasing \(\sigma_k\) will improve the fit to the
training data, it may result in overfitting, particularly for small data
sets.
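The full objective can be sketched along the same lines. The function below is a hedged illustration rather than the package's implementation: it reuses `candidate_probs` from the sketch above and assumes (hypothetically) one violation matrix and one observed-candidate index per data point, with the Gaussian bias applied only when `mu` and `sigma` are supplied.

```r
# Sketch: J(w) = log conditional likelihood minus the optional Gaussian bias.
objective <- function(w, tableaux, obs, mu = NULL, sigma = NULL) {
  # First term: sum over data points of ln P(y_i | x_i; w),
  # computed naively (no numerical stabilization) for clarity
  loglik <- sum(mapply(function(f, y) log(candidate_probs(f, w)[y]),
                       tableaux, obs))
  # Second term: penalty for deviation of each w_k from its prior mean mu_k
  penalty <- if (!is.null(mu)) sum((w - mu)^2 / (2 * sigma^2)) else 0
  loglik - penalty
}
```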
A general bias with \(\mu_k = 0\) for all \(k\) is commonly used as a
form of simple regularization to prevent overfitting (see, e.g., Goldwater
and Johnson 2003). Bias terms have also been used to model proposed
phonological learning biases; see for example Wilson (2006), White (2013),
and Mayer (2021, Ch. 4). The choice of \(\sigma\) depends on the sample
size. As the number of data points increases, \(\sigma\) must decrease in
order for the effect of the bias to remain constant: specifically,
\(n\sigma^2\) must be held constant, where \(n\) is the number of tokens.
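For example, holding \(n\sigma^2\) constant means that quadrupling the number of tokens halves \(\sigma\). The helper below is purely illustrative (the name and arguments are not part of the package) and rescales a previously chosen \(\sigma\) to a new sample size:

```r
# Sketch: keep n * sigma^2 constant when the token count changes.
rescale_sigma <- function(sigma_old, n_old, n_new) {
  sqrt(n_old * sigma_old^2 / n_new)
}
rescale_sigma(sigma_old = 10, n_old = 100, n_new = 400)  # returns 5
```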
Optimization is done using the optim function from the R-core
statistics library. By default it uses L-BFGS-B optimization, a
quasi-Newton method that allows upper and lower bounds on variables.
Constraint weights are restricted to finite, non-negative values.
If no bias parameters are specified (via either the bias_file argument or the
mu and sigma parameters), optimization will be done without the bias term.
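For illustration, a fit of this kind could be sketched with optim roughly as follows, reusing the `objective` and `candidate_probs` sketches above. The toy data, starting weights, and the finite upper bound are assumptions for the example, not the package's actual defaults.

```r
# Tiny illustrative data set: one input whose two candidates were observed
# 2 and 1 times respectively, listed as one tableau per token.
f <- matrix(c(0, 1,   # candidate 1: one violation of constraint 2
              1, 0),  # candidate 2: one violation of constraint 1
            nrow = 2, byrow = TRUE)
tableaux <- list(f, f, f)
obs      <- c(1, 1, 2)   # observed candidate (row index) for each token

# optim minimizes, so the objective is negated; lower = 0 enforces
# non-negative weights, and 100 is an assumed finite upper bound.
fit <- optim(
  par    = c(0, 0),                                   # initial weights
  fn     = function(w) -objective(w, tableaux, obs),  # no bias term supplied
  method = "L-BFGS-B",
  lower  = 0,
  upper  = 100
)
fit$par  # fitted constraint weights
```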