lnre: LNRE Models (zipfR)

Description

Constructor for LNRE models. It returns an object representing an LNRE model with the specified parameters, or estimates the parameters automatically from an observed frequency spectrum.

Usage

lnre(type=c("zm", "fzm", "gigp"),
       spc=NULL, debug=FALSE,
       cost=c("gof", "chisq", "linear", "smooth.linear", "mse", "exact"),
       m.max=15, runs=5,
       method=c("Nelder-Mead", "NLM", "BFGS", "SANN", "Custom"),
       exact=TRUE, sampling=c("Poisson", "multinomial"),
       bootstrap=0, verbose=TRUE,
       ...)

Arguments

type

class of LNRE model to use (see "Details" below for the supported models)

spc

observed frequency spectrum used to estimate model parameters

debug

if TRUE, detailed debugging information will be printed during parameter estimation

cost

cost function for measuring the "distance" between observed and expected vocabulary size and frequency spectrum. Parameters are estimated by minimizing this cost function (see "Cost Functions" below for a listing of available cost functions).

m.max

number of spectrum elements considered by the cost function (see "Cost Functions" below for more information). If unspecified, the default is automatically adjusted to avoid small spectrum elements that may be mathematically unreliable.

runs

number of parameter optimization runs with random initialization. Parameters from the run that achieves the smallest value of the cost function will be selected. Multiple runs are currently not supported for method="Custom"; please use runs=1 in this case.

method

algorithm used for parameter estimation, by minimizing the value of the cost function (see "Parameter Estimation" below for details, and "Minimization Algorithms" for descriptions of the available algorithms)

exact

if FALSE, certain LNRE models will be allowed to use approximations when calculating expected values and variances, in order to improve performance and numerical stability. However, the computed values might be inaccurate or inconsistent in "extreme" situations. In particular, \(E[V]\) might be larger than \(N\) when \(N\) is very small; \(\sum_m E[V_m]\) can be larger than \(E[V]\) at the same \(N\); and \(\sum_m m \cdot E[V_m]\) can be larger than \(N\).

sampling

type of random sampling model to use. Poisson sampling is mathematically simpler and allows fast and robust calculations, while multinomial sampling is more accurate, especially for very small samples. Poisson sampling is the default and should be unproblematic for sample sizes \(N \ge 10000\). NB: The multinomial sampling option has not been implemented yet.

bootstrap

number of bootstrap samples used to estimate confidence intervals for estimated model parameters. Recommended values are bootstrap=100 or bootstrap=200. Bootstrapping can be very time-consuming and should not be used if the underlying sample size is very large (roughly, more than 1 million tokens). See lnre.bootstrap for further information and warnings.

verbose

if TRUE, a progress bar will be shown in the R console during the bootstrapping procedure

...

all further named arguments are interpreted as parameter values for the chosen LNRE model (see the respective manpages for names and descriptions of the model parameters)
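
The three consistency conditions listed under the exact argument can be checked numerically. The following sketch does not call zipfR. It assumes a hypothetical toy population of S equiprobable types under Poisson sampling, for which \(E[V_m] = S \cdot \mathrm{dpois}(m, N/S)\) and \(E[V] = S (1 - e^{-N/S})\). The names S, EV and EVm_toy are illustrative only.

```r
# Toy population (an assumption for illustration, not an LNRE model from zipfR):
# S equiprobable types, Poisson sampling with sample size N.
S <- 100
N <- 50
lambda <- N / S                      # expected frequency of each type

EV <- S * (1 - exp(-lambda))         # expected vocabulary size E[V]
m <- 1:100                           # frequency classes included in the sums
EVm_toy <- S * dpois(m, lambda)      # expected spectrum elements E[V_m]

# Consistency conditions that exact calculations should satisfy
tol <- 1e-8                                      # allowance for rounding error
ok_vocab    <- EV <= N                           # E[V] <= N
ok_spectrum <- sum(EVm_toy) <= EV + tol          # sum_m E[V_m] <= E[V]
ok_tokens   <- sum(m * EVm_toy) <= N + tol       # sum_m m * E[V_m] <= N
print(c(ok_vocab, ok_spectrum, ok_tokens))
```

With exact=FALSE, similar checks on the values returned by a fitted model might reveal the inconsistencies described above.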

Value

An object of a suitable subclass of lnre, depending on the type argument (e.g. lnre.fzm for type="fzm"). This object represents an LNRE model of the selected type with the specified parameter values, or with parameter values estimated from the observed frequency spectrum spc.

The internal structure of lnre objects is described on the lnre.details manpage (intended for developers).

Parameter Estimation

Automatic parameter estimation for LNRE models is performed by matching the expected vocabulary size and frequency spectrum of the model against the observed data passed in the spc argument.

For this purpose, a cost function has to be defined as a measure of the "distance" between observed and expected frequency spectrum. Parameters are then estimated by applying a minimization algorithm in order to find those parameter values that lead to the smallest possible cost.
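
The idea of estimation by cost minimization can be sketched in base R without zipfR. The example assumes a hypothetical two-parameter toy model \(E[V_m] = C \cdot m^{-g}\), which is not one of the LNRE models. The names expected_spectrum, cost, C_hat and g_hat are illustrative, and the mean squared error is used as the cost function.

```r
# Hypothetical toy model (not part of zipfR): E[V_m] = C * m^(-g).
# Parameters are passed as c(log(C), g) so that both are on a similar scale.
expected_spectrum <- function(par, m) exp(par[1]) * m^(-par[2])

# "Observed" spectrum generated from known parameter values C = 1000, g = 1.5
m <- 1:15
observed <- 1000 * m^(-1.5)

# Cost: mean squared difference between observed and expected spectrum
cost <- function(par) mean((observed - expected_spectrum(par, m))^2)

# Minimize the cost with Nelder-Mead, starting from a rough guess
fit <- optim(c(log(500), 1), cost, method = "Nelder-Mead",
             control = list(maxit = 2000, reltol = 1e-12))
C_hat <- exp(fit$par[1])
g_hat <- fit$par[2]
print(c(C = C_hat, g = g_hat, cost = fit$value))
```

Because the toy data are generated without noise, the estimates should land close to the values used to create them.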

Parameter estimation is a crucial and often delicate step in the application of LNRE models. Depending on the shape of the observed frequency spectrum, the automatic estimation procedure may result in a poor and counter-intuitive fit, or may fail altogether.

Usually, multiple runs of the minimization are performed with different random start values. An error will only be reported if all the estimation runs fail. Such multiple runs have not been implemented for the Custom minimization method yet; please specify runs=1 in this case.
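
The multi-run strategy described above can be sketched as follows. This is not the zipfR implementation; it assumes an illustrative toy cost function with two local minima and uses tryCatch to skip failed runs, reporting an error only if every run fails.

```r
# Toy cost function with two local minima (an illustrative assumption)
cost <- function(p) (p[1]^2 - 1)^2 + 0.3 * p[1] + p[2]^2

set.seed(42)
runs <- 5
results <- vector("list", runs)
for (i in seq_len(runs)) {
  start <- runif(2, min = -2, max = 2)          # random initialization
  results[[i]] <- tryCatch(
    optim(start, cost, method = "Nelder-Mead"),
    error = function(e) NULL                     # a failed run is skipped
  )
}

# Keep successful runs; fail only if none succeeded
ok <- Filter(Negate(is.null), results)
if (length(ok) == 0) stop("all estimation runs failed")

values <- vapply(ok, function(r) r$value, numeric(1))
best <- ok[[which.min(values)]]                  # run with the smallest cost
print(best$par)
print(best$value)
```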

Users can influence parameter estimation by choosing from a range of predefined cost functions and from several minimization algorithms, as described in the following sections. Some experimentation with the cost, m.max and method arguments will often help to resolve estimation failures and may result in a considerably better goodness-of-fit.

Cost Functions

The following cost functions are available and can be selected with the cost argument. All functions are based on the differences between observed and expected values for vocabulary size and the first elements of the frequency spectrum (\(V_1, \ldots, V_m\), where \(m\) is given by the m.max argument):

gof:: the multivariate chi-squared statistic used for goodness-of-fit testing (lnre.goodness.of.fit). This cost function corresponds (almost) to maximum-likelihood parameter estimation and is used by default.
chisq:: cost function based on a simplified version of the multivariate chi-squared test for goodness-of-fit (assuming independence between the random variables \(V_m\)).
linear:: linear cost function, which sums over the absolute differences between observed and expected values. This cost function puts more weight on fitting the vocabulary size and the first few elements of the frequency spectrum (where absolute differences are much larger than for higher spectrum elements).
smooth.linear:: modified version of the linear cost function, which smooths the kink of the absolute value function for a difference of \(0\) (since non-differentiable cost functions might be problematic for gradient-based minimization algorithms).
mse:: mean squared error cost function, averaging over the squares of differences between observed and expected values. This cost function penalizes large absolute differences more heavily than linear cost (and therefore puts even greater weight on fitting vocabulary size and the first spectrum elements).
exact:: this "virtual" cost function attempts to match the observed vocabulary size and first spectrum elements exactly, ignoring differences for all higher spectrum elements. This is achieved by adjusting the value of m.max automatically, depending on the number of free parameters that are estimated (in general, the number of constraints that can be satisfied by estimating parameters is the same as the number of free parameters). Having adjusted m.max, the mse cost function is used to determine parameter values, so that the estimation procedure will not fail even if the constraints cannot be matched exactly.
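
The simpler cost functions in this list can be written down directly. The definitions below are simplified sketches and not the zipfR code. The function names are hypothetical, the smoothing \(\sqrt{d^2 + \epsilon}\) is one common way to remove the kink and may differ from what zipfR uses, and the chi-squared version takes the variances as a given input.

```r
# Simplified cost functions; `obs` and `exp_val` hold V and V_1, ..., V_m
cost_linear <- function(obs, exp_val) sum(abs(obs - exp_val))
cost_mse    <- function(obs, exp_val) mean((obs - exp_val)^2)

# Assumption: smooth |d| as sqrt(d^2 + eps); zipfR may use a different smoothing
cost_smooth_linear <- function(obs, exp_val, eps = 1e-2) {
  sum(sqrt((obs - exp_val)^2 + eps))
}

# Independence-based chi-squared: squared differences scaled by variances
cost_chisq <- function(obs, exp_val, variance) {
  sum((obs - exp_val)^2 / variance)
}

obs     <- c(10, 5)   # made-up observed values
exp_val <- c(8, 6)    # made-up expected values
print(cost_linear(obs, exp_val))                      # 3
print(cost_mse(obs, exp_val))                         # 2.5
print(cost_smooth_linear(obs, exp_val))
print(cost_chisq(obs, exp_val, variance = c(4, 2)))   # 1.5
```

The squared terms in the mse version show why it weights large absolute differences more heavily than the linear cost.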

Minimization Algorithms

Several different minimization algorithms can be used for parameter estimation and are selected with the method argument:

Nelder-Mead:: the Nelder-Mead algorithm, implemented by the optim function, performs minimization without using derivatives. Parameter estimation is therefore very robust, while almost as fast and accurate as the NLM method. Nelder-Mead is the default algorithm and is also used internally by most custom minimization procedures (see below).
NLM:: a standard Newton-type algorithm for nonlinear minimization, implemented by the nlm function, which makes use of numerical derivatives of the cost function. NLM minimization converges quickly and obtains very precise parameter estimates (for a local minimum of the cost function), but it is not very stable and may cause parameter estimation to fail altogether.
SANN:: minimization by simulated annealing, also provided by the optim function. Like Nelder-Mead, this algorithm is very robust because it avoids numerical derivatives, but convergence is extremely slow. In some cases, SANN might produce a better fit than Nelder-Mead (if the latter converges to a suboptimal local minimum).
BFGS:: a quasi-Newton method developed by Broyden, Fletcher, Goldfarb and Shanno. This minimization algorithm is efficient, but should be applied with care as it will often overshoot the valid range of parameter values.
Custom:: a custom estimation procedure provided for certain types of LNRE model, which may exploit special mathematical properties of the model in order to calculate one or more of the parameter values directly. For example, one parameter of the ZM and fZM models can easily be determined from the constraint \(E[V] = V\) (but note that this additional constraint leads to a different fit than is obtained by plain minimization of the cost function!). Custom estimation might also apply special configuration settings to improve convergence of the minimization process, based on knowledge about the valid ranges and "behaviour" of model parameters. If no custom estimation procedure has been implemented for the selected LNRE model, lnre falls back on the Nelder-Mead or NLM algorithm.
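
Determining a parameter directly from the constraint \(E[V] = V\) can be illustrated with a one-dimensional root search. The sketch assumes a hypothetical toy model with S equiprobable types under Poisson sampling, where \(E[V] = S (1 - e^{-N/S})\). It is not the ZM or fZM procedure, and EV_toy, V_obs and S_hat are illustrative names.

```r
# Hypothetical toy model (not ZM or fZM): S equiprobable types, Poisson sampling.
# Expected vocabulary size for a sample of N tokens:
EV_toy <- function(S, N) S * (1 - exp(-N / S))

N <- 1000
V_obs <- EV_toy(500, N)              # pretend this is the observed vocabulary size

# Solve E[V] = V for the single parameter S instead of minimizing a cost
sol <- uniroot(function(S) EV_toy(S, N) - V_obs,
               lower = V_obs, upper = 1e6, tol = 1e-9)
S_hat <- sol$root
print(S_hat)
```

Since the constraint is solved exactly, any remaining parameters would be fitted subject to it, which is why the result can differ from plain cost minimization.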

See the nlm and optim manpages for more information about the minimization algorithms used and key references.
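
The underlying base R calls can be compared on a small problem. The sketch uses an illustrative quadratic cost function with a known minimum at (1, -0.5) rather than an LNRE cost function, so it only shows how the optimizers are invoked and what they return.

```r
# Toy cost function with known minimum at (1, -0.5); an illustrative assumption
cost <- function(p) (p[1] - 1)^2 + 2 * (p[2] + 0.5)^2
start <- c(0, 0)

fit_nm   <- optim(start, cost, method = "Nelder-Mead")   # derivative-free
fit_bfgs <- optim(start, cost, method = "BFGS")          # quasi-Newton
set.seed(1)                                              # SANN is stochastic
fit_sann <- optim(start, cost, method = "SANN")          # simulated annealing
fit_nlm  <- nlm(cost, start)                             # Newton-type

# optim() returns $par and $value; nlm() returns $estimate and $minimum
print(rbind(
  "Nelder-Mead" = fit_nm$par,
  "BFGS"        = fit_bfgs$par,
  "SANN"        = fit_sann$par,
  "NLM"         = fit_nlm$estimate
))
```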

Details

Currently, the following LNRE models are supported by the zipfR package:

The Zipf-Mandelbrot (ZM) LNRE model (see lnre.zm for details).

The finite Zipf-Mandelbrot (fZM) LNRE model (see lnre.fzm for details).

The Generalized Inverse Gauss-Poisson (GIGP) LNRE model (see lnre.gigp for details).

If explicit model parameters are specified in addition to an observed frequency spectrum spc, these parameters are fixed to the given values and are excluded from the estimation procedure. This feature can be useful if fully automatic parameter estimation leads to a poor or counterintuitive fit.
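
A minimal sketch of such a partially fixed fit is shown below, assuming the zipfR package is installed. The helper fit_fixed_gamma is a hypothetical wrapper and the value gamma = -0.5 is an arbitrary illustrative choice. The call is wrapped in tryCatch because estimation may fail for some settings.

```r
# Sketch, assuming the zipfR package is installed; the fixed value gamma = -0.5
# is an arbitrary choice for illustration.
fit_fixed_gamma <- function(spc, gamma = -0.5) {
  # gamma is held at the given value; the remaining GIGP parameters are estimated
  zipfR::lnre("gigp", spc = spc, gamma = gamma)
}

if (requireNamespace("zipfR", quietly = TRUE)) {
  data("Dickens.spc", package = "zipfR")
  m.fixed <- tryCatch(fit_fixed_gamma(Dickens.spc),
                      error = function(e) NULL)   # estimation may fail
  if (!is.null(m.fixed)) print(m.fixed)
}
```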

Examples

# NOT RUN {
## load Dickens dataset
data(Dickens.spc)

## estimate parameters of GIGP model and show summary
m <- lnre("gigp", Dickens.spc)
m
# }
# NOT RUN {
## N, V and V1 of spectrum used to compute model
## (should be the same as for Dickens.spc)
N(m)
V(m)
Vm(m,1)
# }
# NOT RUN {
## expected V and V_m and their variances for arbitrary N 
EV(m,100e6)
VV(m,100e6)
EVm(m,1,100e6)
VVm(m,1,100e6)

## use only 10 instead of 15 spectrum elements to estimate model
## (note how fit improves for V and V1)
m.10 <- lnre("gigp", Dickens.spc, m.max=10)
m.10

## experiment with different cost functions
m.mse <- lnre("gigp", Dickens.spc, cost="mse")
m.mse
m.exact <- lnre("gigp", Dickens.spc, cost="exact")
m.exact
# }
# NOT RUN {
## NLM minimization algorithm is faster but less robust
m.nlm <- lnre("gigp", Dickens.spc, method="NLM")
m.nlm

## ZM and fZM LNRE models have special estimation algorithms
m.zm <- lnre("zm", Dickens.spc)
m.zm
m.fzm <- lnre("fzm", Dickens.spc)
m.fzm
# }
# NOT RUN {
## estimation is much faster if approximations are allowed
m.approx <- lnre("fzm", Dickens.spc, exact=FALSE)
m.approx
# }
# NOT RUN {
## specify parameters of LNRE models directly
m <- lnre("zm", alpha=.5, B=.01)
lnre.spc(m, N=1000, m.max=10)

m <- lnre("fzm", alpha=.5, A=1e-6, B=.01)
lnre.spc(m, N=1000, m.max=10)

m <- lnre("gigp", gamma=-.5, B=.01, C=.01)
lnre.spc(m, N=1000, m.max=10)

# }
