collocations: Detect collocations from text

Description

Detects collocations in texts or a corpus and returns a data.frame of collocations and their scores, sorted in descending order of the association measure. By default (spanPunct = FALSE), words separated by punctuation delimiters are not counted as adjacent and hence are not eligible to be collocations.

Usage

collocations(x, ...)
## S3 method for class 'corpus':
collocations(x, method = c("lr", "chi2", "pmi", "dice",
  "all"), size = 2, n = NULL, spanPunct = FALSE, ...)
## S3 method for class 'character':
collocations(x, method = c("lr", "chi2", "pmi", "dice",
  "all"), size = 2, n = NULL, spanPunct = FALSE, ...)
## S3 method for class 'tokenizedTexts':
collocations(x, method = c("lr", "chi2", "pmi",
  "dice", "all"), size = 2, n = NULL, spanPunct = FALSE, ...)

Arguments

x

a text, a character vector of texts, or a corpus

...

additional parameters passed to tokenize. To include collocations separated by punctuation, use this to pass removePunct = TRUE to tokenize.

method

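association measure for detecting collocations. Let $i$ index documents and $j$ index features; $n_{ij}$ denotes the observed counts and $m_{ij}$ the expected counts in a collocations frequency table of dimensions $(J - size + 1)^2$. Available measures are "lr" (the likelihood-ratio statistic $G^2$), "chi2" (Pearson's $\chi^2$), "pmi" (pointwise mutual information), "dice" (the Dice coefficient), and "all" (every measure).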

size

length of the collocation. Only bigram (size = 2) and trigram (size = 3) collocations are currently implemented. Can be c(2,3) (or 2:3) to return both bigram and trigram collocations.

n

the number of collocations to return, sorted in descending order of the requested statistic, or $G^2$ if none is specified.

spanPunct

if FALSE, then collocations will not span punctuation marks, so that, for instance, in the text "marks, so" the pair "marks so" is not counted as a collocation. If TRUE, punctuation is not handled specially.
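
The effect of spanPunct can be sketched in base R. This is only an illustration of the idea, not the package's implementation: adjacent_pairs and its span_punct argument are hypothetical names. The sketch assumes that punctuation is discarded when span_punct = TRUE, which is closer to combining spanPunct = TRUE with removePunct = TRUE than to keeping punctuation marks as grams.

```r
# Hypothetical helper for illustration only; not part of quanteda.
adjacent_pairs <- function(text, span_punct = FALSE) {
  text <- tolower(text)
  segments <- if (span_punct) {
    # Replace punctuation with spaces, so words on either side of a
    # punctuation mark become neighbours.
    gsub("[[:punct:]]+", " ", text)
  } else {
    # Cut the text at every run of punctuation, so pairs are formed
    # only within punctuation-free segments.
    unlist(strsplit(text, "[[:punct:]]+"))
  }
  pairs <- lapply(segments, function(seg) {
    words <- strsplit(seg, "[[:space:]]+")[[1]]
    words <- words[nzchar(words)]          # drop empty strings left by splitting
    if (length(words) < 2) return(character(0))
    paste(head(words, -1), tail(words, -1))  # each word with its successor
  })
  unlist(pairs)
}

adjacent_pairs("punctuation marks, so pairs differ")
# "marks so" is absent: the comma separates the two words

adjacent_pairs("punctuation marks, so pairs differ", span_punct = TRUE)
# "marks so" is present: the comma no longer blocks adjacency
```

Splitting at punctuation before pairing means that no pair can ever straddle a delimiter, which is the behaviour described for spanPunct = FALSE.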

Value

A data.table of collocations, their frequencies, and the computed association measure(s).
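
For orientation, the measures selectable through method have the following conventional definitions in the bigram case. This is a sketch under assumptions, not a statement of what the package computes: here $i$ and $j$ are assumed to index the rows and columns of a 2 x 2 contingency table for a candidate pair $(w_1, w_2)$, and the package's exact formulas, including the trigram case, may differ in detail.

```latex
% Assumed notation for a candidate bigram (w_1, w_2):
%   n_{11}: w_1 followed by w_2      n_{12}: w_1 followed by another word
%   n_{21}: another word, then w_2   n_{22}: neither
% A dot in a subscript denotes summation over that index.
\begin{align*}
N &= \sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij}, \qquad
m_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{N} \\
G^2 &= 2 \sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij} \log\frac{n_{ij}}{m_{ij}} \\
\chi^2 &= \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(n_{ij} - m_{ij})^2}{m_{ij}} \\
\mathrm{pmi} &= \log\frac{n_{11}}{m_{11}} \\
\mathrm{dice} &= \frac{2\, n_{11}}{n_{1\cdot} + n_{\cdot 1}}
\end{align*}
```

In each case a larger value indicates that the pair co-occurs more often than its marginal frequencies would suggest, which is why results are sorted in descending order of the statistic.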

References

McInnes, B T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.

Examples

txt <- c("This is software testing: looking for (word) pairs!  
         This [is] a software testing again. For.",
         "Here: this is more Software Testing, looking again for word pairs.")
collocations(txt)
collocations(txt, spanPunct = FALSE, removePunct = FALSE)  # default
collocations(txt, spanPunct = FALSE, removePunct = TRUE)   # includes "testing looking"
collocations(txt, spanPunct = TRUE, removePunct = TRUE)    # same as previous 
collocations(txt, spanPunct = TRUE, removePunct = FALSE)   # keep punctuation marks as "grams"

collocations(txt, size = 2:3)
removeFeatures(collocations(txt, size = 2:3), stopwords("english"))

collocations("@textasdata We really, really love the #quanteda package - thanks!!")
collocations("@textasdata We really, really love the #quanteda package - thanks!!",
              removeTwitter = TRUE)

collocations(inaugTexts[49:57], n = 10)
collocations(inaugTexts[49:57], method = "all", n = 10)
collocations(inaugTexts[49:57], method = "chi2", size = 3, n = 10)
collocations(subset(inaugCorpus, Year>1980), method = "pmi", size = 3, n = 10)