vgc: Vocabulary Growth Curves (zipfR)

Description

In the zipfR library, vgc objects represent a vocabulary growth curve (VGC). This can be an observed VGC from an incremental set of samples (such as a corpus), a randomized VGC obtained by binomial interpolation, or the expected VGC according to an LNRE model.

The vgc constructor function initializes an object directly from the specified data vectors. More commonly, however, an observed VGC is read from a disk file with read.vgc, a randomized VGC is generated with vgc.interp, or an expected VGC is computed with lnre.vgc.
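As a minimal sketch of direct initialization, the following builds a small observed VGC from hand-made vectors. The numbers and the names N.obs, V.obs, V1.obs and toy.vgc are invented for illustration and do not come from any corpus.

```r
library(zipfR)

## made-up growth data at three sample sizes (illustrative values only)
N.obs  <- c(1000, 2000, 3000)   # sample sizes N
V.obs  <- c(400, 650, 850)      # vocabulary sizes V(N)
V1.obs <- c(250, 380, 470)      # hapax counts V_1(N)

## Vm is passed as a list of growth vectors, here only V_1
toy.vgc <- vgc(N.obs, V.obs, Vm=list(V1.obs))

## read the data back with the accessor functions
N(toy.vgc)
V(toy.vgc)
```

The values were chosen so that vocabulary size never exceeds sample size and grows with it, which should keep them compatible with the default sanity checks.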

vgc objects should always be treated as read-only.

Usage

vgc(N, V, Vm=NULL, VV=NULL, VVm=NULL, expected=FALSE, check=TRUE)

Arguments

N

integer vector of sample sizes \(N\) for which vocabulary growth data is available

V

vector of corresponding vocabulary sizes \(V(N)\), or expected vocabulary sizes \(E[V(N)]\) for an interpolated or expected VGC.

Vm

optional list of growth vectors for hapaxes \(V_1(N)\), dis legomena \(V_2(N)\), etc. Up to 9 growth vectors are accepted (i.e. \(V_m(N)\) for \(m \le 9\)). For an interpolated or expected VGC, the vectors represent expected class sizes \(E[V_m(N)]\).

VV

optional vector of variances \(\mathop{Var}[V(N)]\) for an interpolated or expected VGC

VVm

optional list of variance vectors \(\mathop{Var}[V_m(N)]\) for an expected VGC. If present, these vectors must be defined for exactly the same frequency classes \(m\) as the vectors in Vm.

expected

if TRUE, the object represents an interpolated or expected VGC (for informational purposes only)

check

by default, various sanity checks are performed on the data supplied to the vgc constructor. Specify check=FALSE to skip these sanity checks, e.g. when automatically processing data from external programs that may be numerically unstable.

Value

An object of class vgc representing the specified vocabulary growth curve. This object should be treated as read-only (although such behaviour cannot be enforced in R).

Details

If variances (VV or VVm) are specified for an expected VGC, all relevant vectors must be given. In other words, VV always has to be present in this case, and VVm has to be present whenever Vm is specified, and must contain vectors for exactly the same frequency classes.
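The rule above can be sketched with an expected VGC that carries both Vm and variances, so VV and VVm are supplied together and cover the same frequency class (\(m = 1\)). All numbers and variable names are invented for illustration; they are not the output of any LNRE model.

```r
library(zipfR)

## made-up expected values and variances (illustrative only)
N.exp <- c(1000, 2000, 3000)
EV    <- c(412.3, 661.8, 858.4)   # E[V(N)]
EV1   <- c(251.7, 379.2, 468.9)   # E[V_1(N)]
VarV  <- c(35.2, 61.0, 80.5)      # Var[V(N)]
VarV1 <- c(48.9, 77.4, 98.1)      # Var[V_1(N)]

## Vm is given, so VVm must be given too, for the same class m = 1
exp.vgc <- vgc(N.exp, EV, Vm=list(EV1),
               VV=VarV, VVm=list(VarV1), expected=TRUE)

attr(exp.vgc, "hasVariances")
```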

V and the vectors in Vm are integer vectors for an observed VGC, but will usually be fractional for an interpolated or expected VGC.

A vgc object is a data frame with the following variables:

N: sample size \(N\)
V: corresponding vocabulary size (either observed vocabulary size \(V(N)\) or expected vocabulary size \(E[V(N)]\))
V1 … V9: optional: observed or expected spectrum elements (\(V_m(N)\) or \(E[V_m(N)]\)). Not all of these variables have to be present, but there must not be any "gaps" in the spectrum.
VV: optional: variance of expected vocabulary size, \(\mathop{Var}[V(N)]\)
VV1 … VV9: optional: variances of expected spectrum elements, \(\mathop{Var}[V_m(N)]\). If variances are present, they must be available for exactly the same frequency classes as the corresponding expected values.

The following attributes are used to store additional information about the vocabulary growth curve:

m.max: if non-zero, the VGC includes spectrum elements \(V_m(N)\) for \(m\) up to m.max. For m.max=0, no spectrum elements are present.
expected: if TRUE, the object represents an interpolated or expected VGC, with expected vocabulary size and spectrum elements. Otherwise, the object represents an observed VGC.
hasVariances: indicates whether or not the VV variable is present (as well as VV1, VV2, etc., if appropriate)
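The variables and attributes listed above can be inspected with ordinary data frame and attribute functions. The sketch below uses a small hand-made object (toy.vgc, with invented numbers) and only reads from it, in keeping with the read-only convention.

```r
library(zipfR)

## hand-made observed VGC with V_1 and V_2 (illustrative values only)
toy.vgc <- vgc(c(1000, 2000, 3000), c(400, 650, 850),
               Vm=list(c(250, 380, 470), c(70, 110, 150)))

## variables of the underlying data frame
colnames(toy.vgc)

## attributes describing the curve
attr(toy.vgc, "m.max")          # highest frequency class m included
attr(toy.vgc, "expected")       # observed vs. interpolated/expected
attr(toy.vgc, "hasVariances")   # whether VV (and VV1, ...) are present
```

In regular code the accessor functions N, V and Vm are preferable to direct column access, since they do not depend on the internal layout.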

Examples

# NOT RUN {
## load the empirical vgc for Dickens' works and take a look at it

data(Dickens.emp.vgc)
summary(Dickens.emp.vgc)
print(Dickens.emp.vgc)

plot(Dickens.emp.vgc,add.m=1)

## vectors of sample sizes in the vgc, and the
## corresponding V and V_1 vectors
Ns <- N(Dickens.emp.vgc)
Vs <- V(Dickens.emp.vgc)
V1s <- Vm(Dickens.emp.vgc,1)

## binomially interpolated V and V_1 at the same sample sizes
## as the empirical curve
data(Dickens.spc)
Dickens.bin.vgc <- vgc.interp(Dickens.spc,N(Dickens.emp.vgc),m.max=1)

## compare observed and interpolated
plot(Dickens.emp.vgc,Dickens.bin.vgc,add.m=1,legend=c("observed","interpolated"))


## load Italian ultra- prefix data
data(ItaUltra.spc)

## compute zm model
zm <- lnre("zm",ItaUltra.spc)

## compute vgc up to about twice the sample size
## with variance of V
zm.vgc <- lnre.vgc(zm,(1:100)*70, variances=TRUE)

summary(zm.vgc)
print(zm.vgc)

## plot with confidence intervals derived from variance in
## vgc (with larger datasets, ci will typically be almost
## invisible)
plot(zm.vgc)

## for more examples of vgc usage, see the manpages of lnre.vgc,
## plot.vgc, print.vgc and vgc.interp


# }