Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.

```
textstat_collocations(x, method = "lambda", size = 2, min_count = 2,
  smoothing = 0.5, tolower = TRUE, ...)

is.collocations(x)
```

x

a character, corpus, or tokens object whose collocations will be scored. The
tokens object should include punctuation, and if any words have been removed,
these should have been removed with `padding = TRUE`. While identifying
collocations for tokens objects is supported, you will get better results with
character or corpus objects due to relatively imperfect detection of sentence
boundaries from texts already tokenized.

method

association measure for detecting collocations. Currently this is limited to
`"lambda"`. See Details.

size

integer; the length of the collocations to be scored

min_count

numeric; minimum frequency of collocations that will be scored

smoothing

numeric; a smoothing parameter added to the observed counts (default is 0.5)

tolower

logical; if `TRUE`, form collocations as lower-cased combinations

`textstat_collocations`

returns a data.frame of collocations and their
scores and statistics.

`is.collocations`

returns `TRUE` if the object is of class `collocations`, `FALSE` otherwise.

Documents are grouped for the purposes of scoring, but collocations will not span sentences.
If `x` is a tokens object and some tokens have been removed, this should be done
using `tokens_remove(x, pattern, padding = TRUE)` so that counts will still be
accurate, but the pads will prevent those collocations from being scored.

The `lambda` computed for a size = \(K\)-word target multi-word expression is
the coefficient for the \(K\)-way interaction parameter in the saturated
log-linear model fitted to the counts of the terms forming the set of eligible
multi-word expressions. This is the same as the "lambda" computed in Blaheta
and Johnson (2001), where all multi-word expressions are considered (rather
than just verbs, as in that paper). The `z` is the Wald \(z\)-statistic,
computed as the quotient of `lambda` and its estimated standard error, as
described below.

In detail:

Consider a \(K\)-word target expression \(x\), and let \(z\) be any
\(K\)-word expression. Define a comparison function \(c(x,z)=(j_{1},
\dots, j_{K})=c\) such that the \(k\)th element of \(c\) is 1 if the
\(k\)th word in \(z\) is equal to the \(k\)th word in \(x\), and 0
otherwise. Let \(c_{i}=(j_{i1}, \dots, j_{iK})\), \(i=1, \dots,
2^{K}=M\), be the possible values of \(c(x,z)\), with \(c_{M}=(1,1,
\dots, 1)\). Consider the set of \(c(x,z_{r})\) across all expressions
\(z_{r}\) in a corpus of text, and let \(n_{i}\), for \(i=1,\dots,M\),
denote the number of the \(c(x,z_{r})\) which equal \(c_{i}\), plus the
smoothing constant `smoothing`. The \(n_{i}\) are the counts in a
\(2^{K}\) contingency table whose dimensions are defined by the
\(c_{i}\).
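As a concrete illustration of the comparison vectors and cell counts, the sketch below (in Python rather than R, with a hypothetical helper and toy data; this is not quanteda's implementation) tallies the \(n_{i}\) for a target bigram over a small set of observed bigrams:

```python
from collections import Counter

def contingency_counts(target, expressions):
    """Count n_i: how many K-word expressions z_r realize each comparison
    vector c(x, z_r), where element k of c is 1 iff word k of z_r equals
    word k of the target expression x."""
    counts = Counter()
    for z in expressions:
        c = tuple(int(zk == xk) for zk, xk in zip(z, target))
        counts[c] += 1
    return counts

# Toy stream of bigrams (K = 2); target expression: ("new", "york")
bigrams = [("new", "york"), ("new", "york"), ("new", "deal"),
           ("old", "york"), ("old", "news")]
n = contingency_counts(("new", "york"), bigrams)
# n[(1, 1)] = 2, n[(1, 0)] = 1, n[(0, 1)] = 1, n[(0, 0)] = 1
```

Adding the `smoothing` constant to each of these four cells gives the smoothed \(n_{i}\) entering the \(2^{K}\) contingency table.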

\(\lambda\): The \(K\)-way interaction parameter in the saturated loglinear model fitted to the \(n_{i}\). It can be calculated as

$$\lambda = \sum_{i=1}^{M} (-1)^{K-b_{i}} \log n_{i}$$

where \(b_{i}\) is the number of the elements of \(c_{i}\) which are equal to 1.

\(z\): The Wald \(z\)-statistic, calculated as:

$$z = \frac{\lambda}{\left[\sum_{i=1}^{M} n_{i}^{-1}\right]^{1/2}}$$
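Both statistics follow directly from the smoothed cell counts. The sketch below (Python for illustration; the function name and counts are hypothetical, not quanteda's internals) computes \(\lambda\) and \(z\) for a bigram:

```python
import math
from itertools import product

def collocation_lambda_z(counts, K, smoothing=0.5):
    """Compute lambda and the Wald z statistic for a K-word expression.

    `counts` maps each comparison vector c_i (a 0/1 tuple of length K)
    to its observed count n_i; `smoothing` is added to every cell, as
    with textstat_collocations(smoothing = 0.5)."""
    lam = 0.0
    var = 0.0
    for c in product((0, 1), repeat=K):
        n = counts.get(c, 0) + smoothing
        b = sum(c)                       # number of matching positions
        lam += (-1) ** (K - b) * math.log(n)
        var += 1.0 / n                   # Wald variance of lambda
    return lam, lam / math.sqrt(var)

# Bigram example: (1, 1) = exact pair, (1, 0)/(0, 1) = one word matches,
# (0, 0) = neither matches. A strongly associated pair gives large
# positive lambda and z.
lam, z = collocation_lambda_z(
    {(1, 1): 30, (1, 0): 5, (0, 1): 8, (0, 0): 1000}, K=2)
```

For \(K = 2\) this reduces to the smoothed log odds ratio, \(\lambda = \log n_{(1,1)} + \log n_{(0,0)} - \log n_{(1,0)} - \log n_{(0,1)}\), divided by its standard error for \(z\).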

Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.

```
# NOT RUN {
txts <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocations(txts, size = 2, min_count = 2), 10)
head(cols <- textstat_collocations(txts, size = 3, min_count = 2), 10)

# extracting multi-part proper nouns (capitalized terms)
toks2 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks2, stopwords("english"), padding = TRUE)
toks2 <- tokens_select(toks2, "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                       case_insensitive = FALSE, padding = TRUE)
seqs <- textstat_collocations(toks2, size = 3, tolower = FALSE)
head(seqs, 10)
# }
```
