productivity.measures: Measures of Productivity and Lexical Richness (zipfR)

Description

Compute various measures of productivity and lexical richness from an observed frequency spectrum or type-frequency list, from an observed vocabulary growth curve, or from a vector of tokens.

Usage

productivity.measures(obj, measures, ...)
# S3 method for tfl
productivity.measures(obj, measures, ...)
# S3 method for spc
productivity.measures(obj, measures, ...)
# S3 method for vgc
productivity.measures(obj, measures, ...)
# S3 method for default
productivity.measures(obj, measures, ...)

Arguments

obj

a suitable data object from which productivity measures can be computed. Currently, this can be a frequency spectrum (of class spc), a type-frequency list (of class tfl), a vocabulary growth curve (of class vgc), or a token vector.

measures

character vector naming the productivity measures to be computed (see "Productivity Measures" below). Names may be abbreviated as long as they remain unique. If unspecified, all supported measures are computed.

...

additional arguments passed on to the method implementations (currently, no further arguments are recognized)

Value

If obj is a frequency spectrum, type-frequency list or token vector: A numeric vector of the same length as measures with the corresponding observed values of the productivity measures.

If obj is a vocabulary growth curve: A numeric matrix with columns corresponding to the selected productivity measures and rows corresponding to the sample sizes of the vocabulary growth curve.

Productivity Measures

The following productivity measures are currently supported:

K:: Yule's (1944) \(K = 10000 \cdot \frac{ \sum_m m^2 V_m - N}{ N^2 }\) (only for complete observed frequency spectrum)
D:: Simpson's (1949) \(D = \sum_m V_m \frac{m}{N}\cdot \frac{m-1}{N-1}\) (only for complete observed frequency spectrum)
R:: Guiraud's (1954) \(R = V / \sqrt{N}\)
S:: Sichel's (1975) \(S = V_2 / V\), i.e. the proportion of dis legomena
H:: Honoré's (1979) \(H = 100 \frac{ \log N }{ 1 - V_1 / V }\), a transformation of the proportion of hapax legomena adjusted for sample size
C:: Herdan's (1964) \(C = \frac{ \log V }{ \log N }\)
P:: Baayen's (1991) productivity index \(P = \frac{V_1}{N}\), which corresponds to the slope of the vocabulary growth curve (under random sampling assumptions)
TTR:: the type-token ratio TTR = \(V / N\)
Hapax:: the proportion of hapax legomena \(\frac{V_1}{V}\)
V:: the total number of types \(V\)
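
The formulas above can be checked by hand on a small sample. The following base-R sketch transcribes them directly for a complete sample given as a token vector. It is not the zipfR implementation: the helper name measures_sketch and the toy token vector are hypothetical, and the natural logarithm is assumed for \(H\) (the base cancels out in \(C\)).

```r
# Hypothetical helper (not part of zipfR): direct transcription of the
# formulas listed above, for a complete sample given as a token vector.
measures_sketch <- function(tokens) {
  freqs <- table(tokens)               # frequency of each type
  N <- sum(freqs)                      # sample size (number of tokens)
  V <- length(freqs)                   # vocabulary size (number of types)
  spc <- table(as.integer(freqs))      # frequency spectrum: V_m for each m
  m <- as.integer(names(spc))          # frequency classes that occur
  Vm <- as.numeric(spc)                # number of types in each class
  V1 <- sum(Vm[m == 1])                # hapax legomena (0 if there are none)
  V2 <- sum(Vm[m == 2])                # dis legomena (0 if there are none)
  c(
    K = 10000 * (sum(m^2 * Vm) - N) / N^2,
    D = sum(Vm * (m / N) * ((m - 1) / (N - 1))),
    R = V / sqrt(N),
    S = V2 / V,
    H = 100 * log(N) / (1 - V1 / V),   # natural log is an assumption here
    C = log(V) / log(N),
    P = V1 / N,
    TTR = V / N,
    Hapax = V1 / V,
    V = V
  )
}

# Toy sample: N = 10 tokens, V = 6 types, V_1 = 3, V_2 = 2, V_3 = 1
tokens <- c("a", "a", "a", "b", "b", "c", "d", "d", "e", "f")
result <- measures_sketch(tokens)
print(round(result, 4))
```

For this toy sample, \(\sum_m m^2 V_m = 3 + 8 + 9 = 20\), so \(K = 10000 \cdot (20 - 10) / 100 = 1000\), and \(D = (2 \cdot 2 + 1 \cdot 6) / 90 = 1/9\).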

Details

This function computes productivity measures based on an observed frequency spectrum, type-frequency list or vocabulary growth curve. If an expected spectrum or VGC is passed, the expectations \(E[V]\), \(E[V_m]\) will simply be substituted for the sample values \(V\), \(V_m\) in the equations. In most cases, this does not yield the expected value of the productivity measure!
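
The gap between a plug-in value and a true expectation can be illustrated with a small Monte Carlo simulation in base R. This is a sketch under stated assumptions, not something zipfR does: the Zipf-like population, the sample size and the number of replicates are all arbitrary choices made for illustration. It compares the plug-in ratio \(E[V_1] / E[V]\) with a simulation estimate of \(E[V_1 / V]\) for the hapax proportion.

```r
set.seed(42)                                 # make the sketch reproducible
n_types <- 500                               # population size (assumption)
probs <- 1 / seq_len(n_types)                # Zipf-like weights (assumption)
probs <- probs / sum(probs)                  # normalise to probabilities
N <- 200                                     # tokens per simulated sample
n_rep <- 2000                                # number of simulated samples

# For each replicate, draw N tokens and record V and V_1
sim <- replicate(n_rep, {
  f <- tabulate(sample.int(n_types, N, replace = TRUE, prob = probs),
                nbins = n_types)
  c(V = sum(f > 0), V1 = sum(f == 1))
})

plug_in  <- mean(sim["V1", ]) / mean(sim["V", ])  # estimates E[V_1] / E[V]
expected <- mean(sim["V1", ] / sim["V", ])        # estimates E[V_1 / V]
print(c(plug_in = plug_in, expected = expected,
        difference = plug_in - expected))
```

The two quantities are ratios of different things and need not coincide. How far apart they are depends on the population and on the sample size.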

Some measures can only be computed from a complete frequency spectrum. They will return NA if obj is an incomplete spectrum or type-frequency list, an expected spectrum, or a vocabulary growth curve.

Some other measures can only be computed if a sufficient number of spectrum elements are included in a vocabulary growth curve (usually at least \(V_1\) and \(V_2\)), and will return NA otherwise.

Such limitations are indicated in the list of measures under "Productivity Measures" (unless spectrum elements \(V_1\) and \(V_2\) are sufficient).

For an expected frequency spectrum or vocabulary growth curve, accurate expectations can be computed for the measures \(R\), \(C\), \(P\), TTR and \(V\). For \(S\), \(H\) and Hapax, the expectations are often reasonably good approximations (based on a normal approximation of the ratio \(V_m / V\) derived from Evert (2004b, Lemma A.8) using an (incorrect) independence assumption for \(V_m\) and \(V - V_m\)).

References

Evert, Stefan (2004b). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart. URN urn:nbn:de:bsz:93-opus-23714 http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/

Examples


# NOT RUN {
## TODO

# }
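
A minimal usage sketch, assuming the zipfR package is installed. The toy token vector is hypothetical, and the calls rely only on the interface described on this page (a token vector or an spc object as obj, and a character vector of measure names), together with zipfR's vec2spc() for building a frequency spectrum from a token vector.

```r
# Guarded so that the sketch is skipped when zipfR is not available
if (requireNamespace("zipfR", quietly = TRUE)) {
  library(zipfR)

  # Toy data (hypothetical): 10 tokens, 6 types
  tokens <- c("a", "a", "a", "b", "b", "c", "d", "d", "e", "f")

  # Token vector: handled by the default method
  pm_tokens <- productivity.measures(tokens, measures = c("V", "TTR", "P"))

  # Same sample converted to a frequency spectrum first
  pm_spc <- productivity.measures(vec2spc(tokens),
                                  measures = c("V", "TTR", "P"))

  print(pm_tokens)
  print(pm_spc)
}
```

According to the Value section, both calls should return a numeric vector with one entry per requested measure.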
