collocations: Detect collocations from text

Description

Detects collocations (currently bigrams and trigrams) in texts or a corpus, returning a data.table of collocations and their scores, sorted in descending order of the association measure. Words separated by any of the punctuation delimiters .,!?;:(){}[] are not treated as adjacent and therefore cannot form a collocation.
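
The adjacency rule can be illustrated with a small base R sketch. This is not the package's implementation; adjacent_bigrams() is a hypothetical helper that splits each text at the delimiters listed above and pairs up words only within each resulting segment.

```r
# Illustrative sketch only; adjacent_bigrams() is a hypothetical helper,
# not a function provided by the package.
adjacent_bigrams <- function(text) {
  # Break the text wherever one of the delimiters .,!?;:(){}[] occurs
  segments <- unlist(strsplit(tolower(text), "[\\[\\].,!?;:(){}]+", perl = TRUE))
  bigrams <- character(0)
  for (seg in segments) {
    # Tokenize the segment on whitespace and drop empty tokens
    words <- unlist(strsplit(trimws(seg), "[[:space:]]+"))
    words <- words[nzchar(words)]
    if (length(words) >= 2) {
      # Pair each word with its successor inside the same segment
      bigrams <- c(bigrams, paste(words[-length(words)], words[-1]))
    }
  }
  bigrams
}

adjacent_bigrams("This is software testing: looking for (word) pairs!")
```

In this sketch "testing looking" and "for word" are never produced, because a colon and a parenthesis separate those words.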

Usage

collocations(x, ...)
## S3 method for class 'character':
collocations(x, method = c("lr", "chi2", "pmi", "dice",
  "all"), size = 2, n = NULL, ...)
## S3 method for class 'corpus':
collocations(x, method = c("lr", "chi2", "pmi", "dice",
  "all"), size = 2, n = NULL, ...)

Arguments

x

a text, a character vector of texts, or a corpus

...

additional parameters passed to clean

method

association measure for detecting collocations. Let $i$ index documents and $j$ index features; $n_{ij}$ denotes the observed counts and $m_{ij}$ the expected counts in a collocations frequency table of dimensions $(J - size + 1)^2$. Available measures a

size

length of the collocation. Only bigram (size=2) and trigram (size=3) collocations are implemented so far. Can be c(2,3) (or 2:3) to return both bigram and trigram collocations.

n

the number of collocations to return, sorted in descending order of the requested statistic, or $G^2$ if none is specified.
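
The measures named in method compare observed counts with expected counts. As a rough sketch for a single bigram, the code below builds a 2 x 2 table of observed counts, derives expected counts under independence, and computes the four statistics using their common textbook definitions. The helper bigram_scores(), its arguments, and the counts are assumptions made for illustration; the package's exact formulas may differ.

```r
# Illustrative sketch with made-up counts; bigram_scores() is hypothetical.
# n11: word1 followed by word2; n12: word1 followed by another word;
# n21: another word followed by word2; n22: neither.
bigram_scores <- function(n11, n12, n21, n22) {
  obs <- matrix(c(n11, n21, n12, n22), nrow = 2)    # observed counts n_ij
  N <- sum(obs)
  expd <- outer(rowSums(obs), colSums(obs)) / N     # expected counts m_ij
  c(
    # likelihood ratio G^2; cells with zero observations contribute 0
    lr   = 2 * sum(ifelse(obs > 0, obs * log(obs / expd), 0)),
    # Pearson chi-squared
    chi2 = sum((obs - expd)^2 / expd),
    # pointwise mutual information for the (1, 1) cell
    pmi  = log(obs[1, 1] / expd[1, 1]),
    # Dice coefficient (textbook form)
    dice = 2 * obs[1, 1] / (sum(obs[1, ]) + sum(obs[, 1]))
  )
}

bigram_scores(n11 = 30, n12 = 70, n21 = 20, n22 = 880)
```

When the observed table matches the expected table exactly, lr, chi2, and pmi are all zero in this sketch, which is the sense in which they measure departure from independence.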

Value

A data.table of collocations, their frequencies, and the computed association measure(s).

Details

Because of incompatibilities with the join operations in data.table when input files have slightly different encoding settings, collocations currently converts all text to ASCII prior to processing. We hope to improve on this in the future.
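
The page does not say how the ASCII conversion is performed. As an illustration of its effect, base R's iconv() can reduce text to ASCII; in this sketch non-ASCII characters are simply dropped, which is an assumption and may not match the package's behaviour.

```r
# Illustrative only: one way to reduce text to ASCII in base R.
# The sample strings are made up for this example.
x <- c("na\u00efve caf\u00e9", "plain text")

# sub = "" replaces every non-convertible byte with nothing
ascii <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
ascii
```

A practical consequence is that accented words may be altered before collocations are counted, so scores for such words should be read with care.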

References

McInnes, B T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.

Examples

collocations(inaugTexts[49:57], method="all", n=10)
collocations(inaugTexts[49:57], method="all", size=3, n=10)
collocations(subset(inaugCorpus, Year>1980), method="pmi", size=3, n=10)
txt <- c("This is software testing: looking for (word) pairs!
         This [is] a software testing again. For.",
         "Here: is a software testing, looking again for word pairs.")
collocations(txt)
collocations(txt, size=2:3)
removeFeatures(collocations(txt, size=2:3), stopwords("english"))
