This function implements parametric bootstrapping for LNRE models, i.e. it draws a specified number of random samples from a trained lnre object. For each sample, a new model is estimated and user-defined information is extracted from this model. See ‘Details’ and ‘Examples’ below for other use cases.
lnre.bootstrap(model, N, ESTIMATOR, STATISTIC,
               replicates=100, sample=c("spc", "tfl", "tokens"),
               simplify=TRUE, verbose=TRUE, seed=NULL, ...)
model: a trained LNRE model, i.e. an object belonging to a subclass of lnre. The model must provide an rlnre method to generate random samples from the underlying frequency distribution.
N: a single positive integer, specifying the size \(N\) (i.e. token count) of the individual bootstrap samples
ESTIMATOR: a callback function, normally used for estimating LNRE models in the bootstrap procedure. It is called once for each bootstrap sample with the sample as first argument (in the form determined by sample). Additional arguments (...) are passed through to the callback, so it is possible to use ESTIMATOR=lnre with appropriate settings. It is valid to set ESTIMATOR=identity to pass samples through to the STATISTIC callback.
STATISTIC: a callback function, normally used to extract model parameters and other relevant statistics from the bootstrapped LNRE models. It is called once for each bootstrap sample, with the value returned by ESTIMATOR as its single argument. The return values are automatically aggregated across all bootstrap samples (see ‘Value’ below). It is valid to set STATISTIC=identity in order to pass through the results of the ESTIMATOR callback.
replicates: a single positive integer, specifying the number of bootstrap samples to be generated
sample: the form in which each sample is passed to ESTIMATOR: as a frequency spectrum (spc, the default), as a type-frequency list (tfl) or as a factor vector representing the token sequence (tokens). Warning: The latter can be computationally expensive for large N.
simplify: if TRUE, use rbind() to combine the list of results into a single data structure. In this case, the STATISTIC callback should return either a vector of fixed length or a single-row data frame or matrix. No validation is carried out before attempting the simplification.
verbose: if TRUE, show a progress bar in the R console during execution (which can take quite a long time)
seed: a single integer value used to initialize the RNG in order to generate reproducible results
...: any further arguments are passed through to the ESTIMATOR callback function
If simplify=FALSE, a list of length replicates containing the statistics obtained from each individual bootstrap sample. In addition, the following attributes are set:
N = sample size of the bootstrap replicates
model = the LNRE model from which samples were generated
estimator.errors = number of failures of the ESTIMATOR callback
statistic.errors = number of failures of the STATISTIC callback
If simplify=TRUE, the statistics are combined with rbind(). This is performed unconditionally, so make sure that STATISTIC returns a suitable value for all samples, typically vectors of the same length or single-row data frames with the same columns.
The return value is usually a matrix or data frame with replicates rows. No additional attributes are set.
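A minimal sketch of the list-valued return and its attributes, assuming the zipfR package is installed. The Zipf-Mandelbrot parameters, sample size and number of replicates are illustrative choices, not values taken from this page.

```r
library(zipfR)

# Illustrative population model (parameter values are assumptions)
model <- lnre("zm", alpha = 0.5, B = 0.01)

# Keep the per-sample results as a plain list instead of calling rbind()
res <- lnre.bootstrap(
  model, N = 1000,
  ESTIMATOR = identity,                 # pass each spectrum through unchanged
  STATISTIC = function(x) V(x),         # vocabulary size of the sample
  replicates = 10, simplify = FALSE,
  verbose = FALSE, seed = 42
)

length(res)                    # one list element per bootstrap replicate
attr(res, "N")                 # sample size of the replicates
attr(res, "estimator.errors")  # number of ESTIMATOR failures
attr(res, "statistic.errors")  # number of STATISTIC failures
```

With simplify=TRUE the same call would return a matrix or data frame, and these attributes would not be available.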
The confint method for LNRE models uses bootstrapping to estimate confidence intervals for the model parameters.
For this application, ESTIMATOR=lnre re-estimates the LNRE model from each bootstrap sample. Configuration options such as the model type, cost function, etc. are passed as additional arguments in ..., and the sample must be provided in the form of a frequency spectrum. The return values are successfully estimated LNRE models.
STATISTIC extracts the model parameters and other coefficients of interest (such as the population diversity S) from each model and returns them as a named vector or single-row data frame. The results are combined with simplify=TRUE, then empirical confidence intervals are determined for each column.
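The following sketch illustrates this use case by hand, assuming the zipfR package is installed. The model type, parameter values, sample size and the percentile-based interval are assumptions made for illustration; the confint method may compute its intervals differently.

```r
library(zipfR)

# Population model with known parameters (values are illustrative)
model <- lnre("zm", alpha = 0.5, B = 0.01)

# Re-estimate a ZM model from each bootstrap spectrum and collect its parameters
res <- lnre.bootstrap(
  model, N = 1e5,
  ESTIMATOR = lnre,
  STATISTIC = function(m) c(alpha = m$param$alpha, B = m$param$B),
  replicates = 20, sample = "spc", simplify = TRUE,
  verbose = FALSE, seed = 42,
  type = "zm"                  # passed through ... to lnre()
)

# Empirical 95% percentile intervals, one column per parameter
ci <- apply(res, 2, quantile, probs = c(0.025, 0.975))
ci
```

Twenty replicates keep the run short; a realistic analysis would use considerably more.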
For some of the more complex measures of productivity and lexical richness (see productivity.measures), it is difficult to estimate the sampling distribution mathematically. In these cases, an empirical approximation can be obtained by parametric bootstrapping.
The most convenient approach is to set ESTIMATOR=productivity.measures, so the desired measures can be passed as an additional argument measures= to lnre.bootstrap. The default sample="spc" is appropriate for most measures and is efficient enough to carry out the procedure for multiple sample sizes.
Since the estimator already returns the required statistics for each sample in a suitable format, set STATISTIC=identity and simplify=TRUE.
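As a hedged illustration of this setup, the sketch below bootstraps two productivity measures. It assumes the zipfR package is installed and that "TTR" and "P" are among the measure names accepted by productivity.measures; the model parameters and sample size are illustrative.

```r
library(zipfR)

# Illustrative population model (parameter values are assumptions)
model <- lnre("zm", alpha = 0.5, B = 0.01)

res <- lnre.bootstrap(
  model, N = 1e4,
  ESTIMATOR = productivity.measures,  # computes the measures for each spectrum
  STATISTIC = identity,               # estimator output is already suitable
  replicates = 20, sample = "spc", simplify = TRUE,
  verbose = FALSE, seed = 42,
  measures = c("TTR", "P")            # passed through ... to the estimator
)

# Empirical mean and standard deviation of each measure across replicates
sapply(res, mean)
sapply(res, sd)
```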
Vocabulary growth curves can only be generated from token vectors, so set sample="tokens" and keep N reasonably small.
ESTIMATOR=vec2vgc compiles vgc objects for the samples. Pass steps or stepsize as desired and set m.max if growth curves for \(V_1, V_2, \ldots\) are desired.
One option is to use STATISTIC=identity and simplify=FALSE to return a list of vgc objects, which can be plotted or processed further with sapply(). This strategy is particularly useful if one or more \(V_m\) are desired in addition to \(V\).
The alternative is to use STATISTIC=function (x) x$V to extract the y-coordinates of the growth curve and combine them into a matrix with simplify=TRUE, so that prediction intervals can be computed directly. Note that the corresponding x-coordinates are not returned and have to be inferred from N and stepsize.
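The matrix-based variant might look as follows. This is a sketch assuming the zipfR package is installed; the model parameters, N, stepsize and the percentile-based band are illustrative choices.

```r
library(zipfR)

# Illustrative population model (parameter values are assumptions)
model <- lnre("zm", alpha = 0.5, B = 0.01)

res <- lnre.bootstrap(
  model, N = 1e4,
  ESTIMATOR = vec2vgc,                # compile a growth curve per token vector
  STATISTIC = function(x) x$V,        # keep only the y-coordinates (V)
  replicates = 20, sample = "tokens", simplify = TRUE,
  verbose = FALSE, seed = 42,
  stepsize = 1000                     # passed through ... to vec2vgc()
)

# Each row is one bootstrap replicate, each column one point on the curve.
# Pointwise empirical 95% prediction band:
band <- apply(res, 2, quantile, probs = c(0.025, 0.975))
band
```

The x-coordinates are not part of res and must be reconstructed from N and stepsize before plotting the band.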
The parametric bootstrapping procedure works as follows:
1. replicates random samples of N tokens each are drawn from the population described by the LNRE model model.
2. Each sample is passed to the callback function ESTIMATOR in the form determined by sample (a frequency spectrum, type-frequency list, or factor vector of tokens). If ESTIMATOR fails, it is re-run with a different sample, otherwise the return value is passed on to STATISTIC. Use ESTIMATOR=identity to pass the original sample through to STATISTIC.
3. The callback function STATISTIC is used to extract relevant information for each sample. If STATISTIC fails, the procedure is repeated from step 2 with a different sample. The callback will typically return a vector of fixed length or a single-row data frame, and the results for all bootstrap samples are combined into a matrix or data frame if simplify=TRUE.
Warning: Keep in mind that sampling a token vector can be slow and consume large amounts of memory for very large N (much more than 1 million tokens). If possible, use sample="spc" or sample="tfl", which can be generated more efficiently.
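To make the control flow of the procedure concrete, here is a simplified re-implementation for the sample="spc" case only. It is a sketch, not the actual code of lnre.bootstrap: the function name naive_bootstrap and its max.tries argument are hypothetical, and it assumes the zipfR package is installed.

```r
library(zipfR)

# Hypothetical helper illustrating the bootstrap loop; NOT the package's code
naive_bootstrap <- function(model, N, ESTIMATOR, STATISTIC,
                            replicates = 100, max.tries = 10 * replicates, ...) {
  results <- list()
  tries <- 0
  while (length(results) < replicates) {
    tries <- tries + 1
    if (tries > max.tries) stop("too many failed bootstrap samples")
    # draw N tokens from the model and convert them to a frequency spectrum
    spc <- vec2spc(rlnre(model, N))
    # on estimator failure, retry with a fresh sample
    est <- try(ESTIMATOR(spc, ...), silent = TRUE)
    if (inherits(est, "try-error")) next
    # on statistic failure, also retry with a fresh sample
    stat <- try(STATISTIC(est), silent = TRUE)
    if (inherits(stat, "try-error")) next
    results[[length(results) + 1]] <- stat
  }
  do.call(rbind, results)  # corresponds to simplify=TRUE
}

# Usage: distribution of V and V1 in samples of 1000 tokens
model <- lnre("zm", alpha = 0.5, B = 0.01)  # illustrative parameters
set.seed(42)
res <- naive_bootstrap(model, N = 1000, ESTIMATOR = identity,
                       STATISTIC = function(x) c(V = V(x), V1 = Vm(x, 1)),
                       replicates = 5)
res
```

The cap on attempts is a design choice of this sketch to avoid looping forever when a callback fails on every sample.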
lnre for more information about LNRE models. The high-level estimator function lnre uses lnre.bootstrap to collect data for approximate confidence intervals.