Compute various measures of productivity and lexical richness from an observed frequency spectrum, type-frequency list or vocabulary growth curve, or from a vector of tokens.
productivity.measures(obj, measures, ...)
# S3 method for tfl
productivity.measures(obj, measures, ...)
# S3 method for spc
productivity.measures(obj, measures, ...)
# S3 method for vgc
productivity.measures(obj, measures, ...)
# S3 method for default
productivity.measures(obj, measures, ...)
obj
:a suitable data object from which productivity measures can be computed. Currently either a frequency spectrum (of class spc), a type-frequency list (of class tfl), a vocabulary growth curve (of class vgc), or a token vector.
measures
:character vector naming the productivity measures to be computed (see "Productivity Measures" below). Names may be abbreviated as long as they remain unique. If unspecified, all supported measures are computed.
...
:additional arguments passed on to the method implementations (currently, no further arguments are recognized)
If obj is a frequency spectrum, type-frequency list or token vector:
A numeric vector of the same length as measures with the corresponding observed values of the productivity measures.
If obj is a vocabulary growth curve:
A numeric matrix with columns corresponding to the selected productivity measures and rows corresponding to the sample sizes of the vocabulary growth curve.
The following productivity measures are currently supported:
K
:Yule's (1944) \(K = 10000 \cdot \frac{ \sum_m m^2 V_m - N}{ N^2 }\) (only for complete observed frequency spectrum)
D
:Simpson's (1949) \(D = \sum_m V_m \frac{m}{N}\cdot \frac{m-1}{N-1}\) (only for complete observed frequency spectrum)
R
:Guiraud's (1954) \(R = V / \sqrt{N}\)
S
:Sichel's (1975) \(S = V_2 / V\), i.e. the proportion of dis legomena
H
:Honoré's (1979) \(H = 100 \frac{ \log N }{ 1 - V_1 / V }\), a transformation of the proportion of hapax legomena adjusted for sample size
C
:Herdan's (1964) \(C = \frac{ \log V }{ \log N }\)
P
:Baayen's (1991) productivity index \(P = \frac{V_1}{N}\), which corresponds to the slope of the vocabulary growth curve (under random sampling assumptions)
TTR
:the type-token ratio TTR = \(V / N\)
Hapax
:the proportion of hapax legomena \(\frac{V_1}{V}\)
V
:the total number of types \(V\)
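The formulas above can be checked by hand on a small sample. The following is a minimal base R sketch, not the package's implementation: the token vector and all variable names (`tokens`, `spectrum`, `toy_measures`) are made up for illustration, and the code assumes a complete observed frequency spectrum.

```r
# Hypothetical toy sample; any character vector of tokens would do.
tokens <- c("a", "b", "a", "c", "a", "b", "d", "e", "e", "f")

type_freq <- table(tokens)                  # frequency of each type
N <- sum(type_freq)                         # sample size (number of tokens)
V <- length(type_freq)                      # vocabulary size (number of types)

# Frequency spectrum: V_m = number of types occurring exactly m times
spectrum <- table(as.integer(type_freq))
m  <- as.integer(names(spectrum))           # observed frequency classes
Vm <- as.numeric(spectrum)                  # corresponding type counts
V1 <- sum(Vm[m == 1])                       # hapax legomena
V2 <- sum(Vm[m == 2])                       # dis legomena

# Direct transcription of the formulas listed above
toy_measures <- c(
  K     = 10000 * (sum(m^2 * Vm) - N) / N^2,
  D     = sum(Vm * (m / N) * ((m - 1) / (N - 1))),
  R     = V / sqrt(N),
  S     = V2 / V,
  H     = 100 * log(N) / (1 - V1 / V),
  C     = log(V) / log(N),
  P     = V1 / N,
  TTR   = V / N,
  Hapax = V1 / V,
  V     = V
)
print(round(toy_measures, 4))
```

For this toy sample, N = 10, V = 6, V_1 = 3 and V_2 = 2, so for instance TTR = 0.6 and K = 1000. Note that H is undefined (division by zero) when every type is a hapax legomenon.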
This function computes productivity measures based on an observed frequency spectrum, type-frequency list or vocabulary growth curve. If an expected spectrum or VGC is passed, the expectations \(E[V]\), \(E[V_m]\) will simply be substituted for the sample values \(V\), \(V_m\) in the equations. In most cases, this does not yield the expected value of the productivity measure!
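To see why substituting expectations is not the same as taking the expectation of a measure, one can compare the two quantities by simulation. The sketch below uses only base R and relies on assumptions that are not part of the package: a hypothetical population of 500 types with Zipf-like probabilities, samples of 200 tokens, and Monte Carlo averages standing in for true expectations.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
n_types <- 500                        # hypothetical population size (assumption)
probs <- 1 / seq_len(n_types)         # Zipf-like weights (assumption)
probs <- probs / sum(probs)

# Draw one random sample of tokens and return V and V_1
one_sample <- function(n_tokens = 200) {
  x <- sample.int(n_types, n_tokens, replace = TRUE, prob = probs)
  f <- tabulate(x, nbins = n_types)   # frequency of each type in the sample
  c(V = sum(f > 0), V1 = sum(f == 1))
}

sims <- replicate(2000, one_sample()) # 2 x 2000 matrix with rows "V" and "V1"

# Monte Carlo estimate of E[V_1 / V], the expectation of the Hapax measure
mean_of_ratio <- mean(sims["V1", ] / sims["V", ])

# Plug-in value E[V_1] / E[V], i.e. expectations substituted into the formula
ratio_of_means <- mean(sims["V1", ]) / mean(sims["V", ])

print(c(mean_of_ratio = mean_of_ratio, ratio_of_means = ratio_of_means))
```

Because the measure is a nonlinear function of V and V_1, the two printed values need not coincide, although they may be close for a simple ratio such as Hapax.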
Some measures can only be computed from a complete frequency spectrum. They will return NA if obj is an incomplete spectrum or type-frequency list, an expected spectrum, or a vocabulary growth curve.
Some other measures can only be computed if a sufficient number of spectrum elements is included in a vocabulary growth curve (usually at least \(V_1\) and \(V_2\)), and will return NA otherwise.
Such limitations are indicated in the list of measures (unless spectrum elements \(V_1\) and \(V_2\) are sufficient).
For an expected frequency spectrum or vocabulary growth curve, accurate expectations can be computed for the measures \(R\), \(C\), \(P\), TTR and \(V\). For \(S\), \(H\) and Hapax, the expectations are often reasonably good approximations (based on a normal approximation of the ratio \(V_m / V\) derived from Evert (2004b, Lemma A.8) using an (incorrect) independence assumption for \(V_m\) and \(V - V_m\)).
Evert, Stefan (2004b). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart. URN urn:nbn:de:bsz:93-opus-23714 http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/
lnre.bootstrap and bootstrap.confint for parametric bootstrapping experiments, which help to determine the true expectations and sampling distributions of all productivity measures.