
zipfR (version 0.6-5)

vec2xxx: Type-Token Statistics for Samples and Empirical Data (zipfR)

Description

Compute type-frequency list, frequency spectrum and vocabulary growth curve from a token vector representing a random sample or an observed sequence of tokens.
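As a rough illustration of what these three objects contain, the following base-R sketch computes the same quantities for a toy token vector. It does not use zipfR and is not how the package implements them; the vector `x` and the variable names are made up for this example.

```r
# Toy token vector: each element names the type of one token
x <- c("a", "b", "a", "c", "a", "b", "d")

# Type frequency list: frequency of each type, most frequent first
tfl <- sort(table(x), decreasing = TRUE)

# Frequency spectrum: V_m = number of types that occur exactly m times
spc <- table(as.vector(tfl))

# Vocabulary growth: V(N) = number of distinct types among the first N tokens
V <- cumsum(!duplicated(x))
V   # 1 2 2 3 3 3 4
```

Here type `a` occurs three times, `b` twice, and `c` and `d` once each, so the spectrum has $V_1 = 2$, $V_2 = 1$ and $V_3 = 1$.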

Usage

vec2tfl(x)

vec2spc(x)

vec2vgc(x, steps=200, stepsize=NA, m.max=0)

Arguments

x
a vector of length $N_0$, representing a random sample or other observed data set of $N_0$ tokens. For each token, the corresponding element of x specifies the type that the token belongs to. Usually, x is
steps
number of steps for which vocabulary growth data $V(N)$ is calculated. The values of $N$ will be evenly spaced (up to rounding differences) from $N=1$ to $N=N_0$.
stepsize
alternative way of specifying the steps of the vocabulary growth curve. In this case, vocabulary growth data will be calculated every stepsize tokens. The first step is chosen such that the last step corresponds to the full sample.
m.max
an integer in the range $1 \ldots 9$, specifying how many spectrum elements $V_m(N)$ to include in the vocabulary growth curve. By default (m.max=0), only vocabulary size $V(N)$ is calculated.
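The step arguments and the spectrum elements can be pictured with a small base-R sketch. This only follows the descriptions above and is an assumption about the behaviour, not the code zipfR actually runs; `N0`, `steps`, `stepsize`, `x` and the result names are hypothetical values chosen for illustration.

```r
N0 <- 1000       # hypothetical sample size
steps <- 5       # hypothetical number of steps
stepsize <- 300  # hypothetical step size

# steps: values of N evenly spaced (up to rounding) from 1 to N0
N.by.steps <- round(seq(1, N0, length.out = steps))

# stepsize: count backwards from N0 so the last step is the full sample
N.by.stepsize <- rev(seq(from = N0, to = 1, by = -stepsize))
N.by.stepsize   # 100 400 700 1000

# m.max = 1 adds V_1(N), the number of types seen exactly once
# among the first N tokens (computed naively here for a toy vector)
x <- c("a", "b", "a", "c", "a", "b", "d")
V1 <- sapply(seq_along(x), function(n) sum(table(x[1:n]) == 1))
V1   # 1 2 1 2 2 1 2
```

The naive loop for `V1` recounts the prefix at every step, which is fine for a toy vector but would be slow for large samples.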

Value

  • An object of class tfl, spc or vgc, representing the type frequency list, frequency spectrum or vocabulary growth curve of the token vector x, respectively.

Details

There are two main applications for the vec2xxx functions:

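  • Computing type-token statistics for random samples drawn from a LNRE model (e.g. token vectors generated with rlnre).

  • Computing type-token statistics for observed data, e.g. a token vector loaded from a disk file with readLines or scan.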

Both applications work well for samples of up to approx. 1 million tokens. For considerably larger data sets, specialized external software should be used, such as the Perl scripts provided on the zipfR homepage.

See Also

tfl, spc and vgc for more information about type frequency lists, frequency spectra and vocabulary growth curves

rlnre for generating random samples (in the form of the required token vectors) from a LNRE model

readLines and scan for loading token vectors from disk files

Examples

## type-token statistics for random samples from a LNRE distribution

model <- lnre("fzm", alpha=.5, A=1e-6, B=.05)
x <- rlnre(model, 100000)

vec2tfl(x)
vec2spc(x)  # same as tfl2spc(vec2tfl(x))
vec2vgc(x)

sample.spc <- vec2spc(x)
exp.spc <- lnre.spc(model, 100000)
plot(exp.spc, sample.spc)

sample.vgc <- vec2vgc(x, m.max=1, steps=500)
exp.vgc <- lnre.vgc(model, N=N(sample.vgc), m.max=1)
plot(exp.vgc, sample.vgc, add.m=1)


## load token vector from a file in one-token-per-line format

x <- readLines(filename)      # filename: path to the token file (not defined in this example)
x <- readLines(file.choose()) # with file selection dialog

## you can also perform whitespace tokenization and filter the data

brown <- scan("brown.pos", what=character(0), quote="")
nouns <- grep("/NNS?$", brown, value=TRUE)
plot(vec2spc(nouns))
plot(vec2vgc(nouns, m.max=1), add.m=1)
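The file-loading examples above depend on files that may not exist on your system. A minimal self-contained sketch of the same idea, using only base R and a temporary file, might look like this; the token vector and variable names are invented for illustration, and the resulting character vector is the kind of input the vec2xxx functions expect.

```r
# Write a small token vector to a temporary one-token-per-line file
tokens <- c("the", "cat", "saw", "the", "dog")
path <- tempfile(fileext = ".txt")
writeLines(tokens, path)

# Read it back line by line ...
x1 <- readLines(path)

# ... or with whitespace tokenization (quote="" disables quote handling)
x2 <- scan(path, what = character(0), quote = "", quiet = TRUE)

# Clean up the temporary file
unlink(path)
```

For a one-token-per-line file without embedded whitespace, both calls return the same character vector.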
