quanteda (version 0.7.2-1)

collocations: Detect collocations from text

Description

Detects collocations (currently, bigrams and trigrams) from texts or a corpus, returning a data.frame of collocations and their scores, sorted in descending order of the association measure. Words separated by punctuation delimiters .,!?;:(){}[] are not counted as adjacent and hence are not eligible to be collocations.
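
The adjacency rule above can be illustrated with a small base-R sketch. This is not quanteda's implementation; `adjacent_bigrams` is a hypothetical helper that only shows how splitting on the listed delimiters keeps words on either side of punctuation from being paired.

```r
# Illustrative sketch only; not how quanteda computes collocations.
# `adjacent_bigrams` is a hypothetical helper name.
adjacent_bigrams <- function(text) {
  # Break the text wherever one of the delimiters .,!?;:(){}[] occurs.
  segments <- strsplit(tolower(text), "[\\[\\].,!?;:(){}]", perl = TRUE)[[1]]
  pairs <- lapply(segments, function(seg) {
    words <- strsplit(trimws(seg), "[[:space:]]+")[[1]]
    words <- words[nzchar(words)]
    # A segment with fewer than two words cannot yield a bigram.
    if (length(words) < 2) return(character(0))
    paste(head(words, -1), tail(words, -1))
  })
  unlist(pairs)
}

adjacent_bigrams("software testing: looking for (word) pairs!")
```

In this sketch, "testing looking" and "word pairs" are never formed, because a delimiter sits between the two words.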

Usage

collocations(x, ...)

## S3 method for class 'character'
collocations(x, method = c("lr", "chi2", "pmi", "dice", "all"), size = 2, n = NULL, ...)

## S3 method for class 'corpus'
collocations(x, method = c("lr", "chi2", "pmi", "dice", "all"), size = 2, n = NULL, ...)

Arguments

x
a text, a character vector of texts, or a corpus
...
additional parameters passed to clean
method
association measure for detecting collocations. Let $i$ index documents and $j$ index features; $n_{ij}$ refers to the observed counts and $m_{ij}$ to the expected counts in a collocations frequency table of dimensions $(J - size + 1)^2$. Available measures a
size
length of the collocation. Only bigram (size = 2) and trigram (size = 3) collocations are implemented so far. Can be c(2,3) (or 2:3) to return both bigram and trigram collocations.
n
the number of collocations to return, sorted in descending order of the requested statistic, or of $G^2$ (the likelihood-ratio statistic) if none is specified.
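
For orientation, the likelihood-ratio statistic is conventionally defined as $G^2 = 2 \sum_{ij} n_{ij} \log(n_{ij} / m_{ij})$, using the observed counts $n_{ij}$ and expected counts $m_{ij}$ introduced above. The base-R sketch below applies that textbook definition to a 2 x 2 bigram table. It assumes expected counts computed under independence; the function name `g2_statistic` and the counts are hypothetical, and this page does not confirm that quanteda computes the score in exactly this way.

```r
# Textbook G^2 for a contingency table; a sketch, not quanteda's code.
# `g2_statistic` and the example counts are hypothetical.
g2_statistic <- function(observed) {
  # Expected counts under independence: (row total * column total) / N.
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # Cells with a zero observed count contribute nothing to the sum.
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

# Rows: first word present / absent; columns: second word present / absent.
counts <- matrix(c(10,   5,
                   20, 965), nrow = 2, byrow = TRUE)
g2_statistic(counts)
```

A table whose cells already match their expected counts gives a score of zero, so larger values indicate word pairs that co-occur more (or less) often than independence would predict.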

Value

  • A data.table of collocations, their frequencies, and the computed association measure(s).

Details

Because of incompatibilities with the join operations in data.table when input files have slightly different encoding settings, collocations currently converts all text to ASCII prior to processing. We hope to improve on this in the future.
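
To see what an ASCII conversion of this kind implies for input text, the following base-R sketch uses `iconv()`. It is an assumption that the effect is comparable; this page does not say which conversion collocations applies internally, and the example strings are made up.

```r
# Sketch of an ASCII conversion with base R; not necessarily what
# collocations() does internally. The example texts are hypothetical.
texts <- c("caf\u00e9 society", "na\u00efve Bayes classifier")

# sub = "" substitutes an empty string for bytes with no ASCII equivalent.
ascii_texts <- iconv(texts, from = "UTF-8", to = "ASCII", sub = "")
ascii_texts
```

Accented characters are lost in a conversion like this, so collocations involving non-ASCII words may appear altered in the output.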

References

McInnes, B T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.

See Also

bigrams, ngrams

Examples

collocations(inaugTexts[49:57], method="all", n=10)
collocations(inaugTexts[49:57], method="all", size=3, n=10)
collocations(subset(inaugCorpus, Year>1980), method="pmi", size=3, n=10)
txt <- c("This is software testing: looking for (word) pairs!
         This [is] a software testing again. For.",
         "Here: is a software testing, looking again for word pairs.")
collocations(txt)
collocations(txt, size=2:3)
removeFeatures(collocations(txt, size=2:3), stopwords("english"))