collocations: Detect collocations from text

Description

Detects collocations (currently, bigrams and trigrams) from texts or a corpus, returning a data.frame of collocations and their scores, sorted in descending order of the association measure. Words separated by punctuation delimiters are not counted as adjacent and hence are not eligible to be collocations.

Usage

collocations(x, ...)
## S3 method for class 'corpus':
collocations(x, method = c("lr", "chi2", "pmi", "dice",
  "all"), size = 2, n = NULL, spanPunct = FALSE, ...)
## S3 method for class 'character':
collocations(x, method = c("lr", "chi2", "pmi", "dice",
  "all"), size = 2, n = NULL, spanPunct = FALSE, ...)
## S3 method for class 'tokenizedTexts':
collocations(x, method = c("lr", "chi2", "pmi",
  "dice", "all"), size = 2, n = NULL, spanPunct = FALSE, ...)

Arguments

x

a text, a character vector of texts, or a corpus

...

additional parameters passed to tokenize. To include collocations separated by punctuation, use this to pass removePunct = TRUE to tokenize.

method

association measure for detecting collocations. Let $i$ index documents and $j$ index features; $n_{ij}$ denotes the observed counts and $m_{ij}$ the expected counts in a collocations frequency table of dimensions $(J - size + 1)^2$. The available measures are those listed under Usage: "lr", "chi2", "pmi", "dice", and "all".
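For orientation, the textbook forms of these statistics for a bigram's 2 x 2 contingency table are sketched below, using the $n_{ij}$ and $m_{ij}$ notation above. The marginal totals $n_{1\cdot}$ and $n_{\cdot 1}$ are notation introduced here, and the mapping of "chi2", "pmi", and "dice" to these exact forms is an assumption; the package's implementation may differ in scaling or detail.

```latex
\begin{align*}
G^2 &= 2 \sum_i \sum_j n_{ij} \log \frac{n_{ij}}{m_{ij}}
  && \text{likelihood ratio (assumed to correspond to "lr")} \\
\chi^2 &= \sum_i \sum_j \frac{(n_{ij} - m_{ij})^2}{m_{ij}}
  && \text{Pearson's statistic (assumed "chi2")} \\
\mathrm{PMI} &= \log \frac{n_{11}}{m_{11}}
  && \text{pointwise mutual information (assumed "pmi")} \\
\mathrm{Dice} &= \frac{2\, n_{11}}{n_{1\cdot} + n_{\cdot 1}}
  && \text{Dice coefficient (assumed "dice")}
\end{align*}
```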

size

length of the collocation. Only bigram (size = 2) and trigram (size = 3) collocations are implemented so far. Can be c(2,3) (or 2:3) to return both bigram and trigram collocations.

n

the number of collocations to return, sorted in descending order of the requested statistic, or $G^2$ if none is specified.

spanPunct

if FALSE, then collocations will not span punctuation marks, so that, for instance, in the text "marks, so" the word pair "marks so" is not counted as a collocation. If TRUE, punctuation is not handled specially.
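The effect of spanPunct can be illustrated with a minimal base-R sketch. The helper count_bigrams and its span_punct argument are hypothetical names used only for this illustration; this is not the package's tokenizer or its implementation, just one way to count adjacent word pairs with and without pairing across punctuation.

```r
# Hypothetical helper (not part of the package): count adjacent word pairs,
# optionally refusing to pair words that are separated by punctuation.
count_bigrams <- function(text, span_punct = FALSE) {
  text <- tolower(text)
  segments <- if (span_punct) {
    # Ignore punctuation entirely: replace it with a space
    gsub("[[:punct:]]+", " ", text)
  } else {
    # Break the text at punctuation so pairs cannot cross it
    unlist(strsplit(text, "[[:punct:]]+"))
  }
  pairs <- unlist(lapply(segments, function(seg) {
    words <- unlist(strsplit(trimws(seg), "[[:space:]]+"))
    words <- words[nzchar(words)]
    if (length(words) < 2) return(character(0))
    # Pair each word with its right-hand neighbour
    paste(head(words, -1), tail(words, -1))
  }))
  sort(table(pairs), decreasing = TRUE)
}

txt <- "will not span punctuation marks, so that"
names(count_bigrams(txt))                     # "marks so" is absent
names(count_bigrams(txt, span_punct = TRUE))  # "marks so" is present
```

Splitting at punctuation before pairing is a simple way to get the behaviour described; a real tokenizer would treat cases such as apostrophes and hyphens more carefully.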

Value

A data.table of collocations, their frequencies, and the computed association measure(s).
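To show how a frequency and an association score in one row of such a table relate, here is a minimal base-R sketch of the likelihood ratio statistic $G^2$ for a single bigram. The function g2_bigram and its argument names are hypothetical, and the 2 x 2 table layout is the textbook one, assumed here and not taken from the package source.

```r
# Hypothetical helper, not the package's implementation.
# n11: count of the bigram "w1 w2"
# n1_: count of bigrams whose first word is w1
# n_1: count of bigrams whose second word is w2
# N:   total number of bigrams
g2_bigram <- function(n11, n1_, n_1, N) {
  observed <- matrix(c(n11,       n1_ - n11,
                       n_1 - n11, N - n1_ - n_1 + n11),
                     nrow = 2, byrow = TRUE)
  # Expected counts under independence of the two word positions
  expected <- outer(rowSums(observed), colSums(observed)) / N
  # Treat 0 * log(0) as 0 so that empty cells do not produce NaN
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

g2_bigram(10, 20, 50, 100)  # observed equals expected, so the score is 0
g2_bigram(20, 20, 20, 100)  # words always co-occur, so the score is large
```

Sorting candidate bigrams by such a score in descending order and keeping the first n rows would mirror the ordering described for the n argument.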

References

McInnes, B T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.

Examples

txt <- c("This is software testing: looking for (word) pairs!  
         This [is] a software testing again. For.",
         "Here: this is more Software Testing, looking again for word pairs.")
collocations(txt)
collocations(txt, removePunct = TRUE)
collocations(txt, size=2:3)
removeFeatures(collocations(txt, size=2:3), stopwords("english"))

collocations("@textasdata We really, really love the #quanteda package - thanks!!")
collocations("@textasdata We really, really love the #quanteda package - thanks!!",
              removeTwitter = TRUE)

collocations(inaugTexts[49:57], n=10)
collocations(inaugTexts[49:57], method="all", n=10)
collocations(inaugTexts[49:57], method="chi2", size=3, n=10)
collocations(subset(inaugCorpus, Year>1980), method="pmi", size=3, n=10)