Detects collocations from texts or a corpus, returning a data.frame of
collocations and their scores, sorted in descending order of the association
measure. By default (punctuation = "dontspan"), words separated by punctuation
delimiters are not counted as adjacent and hence are not eligible to
be collocations.
collocations(x, method = c("lr", "chi2", "pmi", "dice", "all"), size = 2,
n = NULL, tolower = TRUE, punctuation = c("dontspan", "ignore",
"include"), ...)
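method
association measure for detecting collocations. Let $n_{ij}$ denote the observed counts and $m_{ij}$ the expected counts in cell $(i, j)$ of the contingency table of word co-occurrences. Available measures are:
"lr"
The likelihood ratio statistic $G^2$, computed as $2 \sum_i \sum_j n_{ij} \log (n_{ij} / m_{ij})$
"chi2"
Pearson's $\chi^2$ statistic, computed as $\sum_i \sum_j (n_{ij} - m_{ij})^2 / m_{ij}$
"pmi"
point-wise mutual information score, computed as $\log (n_{11} / m_{11})$
"dice"
the Dice coefficient, computed as $2 n_{11} / (n_{1\cdot} + n_{\cdot 1})$
"all"
returns all of the above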
size
length of the collocation. Only bigram (size = 2) and trigram (size = 3) collocations are currently implemented. Can be c(2,3) (or 2:3) to return both bi- and tri-gram collocations.
n
the number of collocations to return, sorted in descending order of the requested statistic, or NULL (default) to return all collocations
tolower
convert collocations to lower case if TRUE (default)
punctuation
how to handle tokens separated by punctuation characters. Options are:
dontspan
do not form collocations from tokens separated by punctuation characters (default)
ignore
ignore punctuation characters when forming collocations, meaning that collocations will include those separated by punctuation characters in the text. The punctuation characters themselves are not returned.
include
do not treat punctuation characters specially, meaning that collocations will include punctuation characters as tokens
...
additional parameters passed to tokens
a collocations class object: a specially classed data.table consisting of collocations, their frequencies, and the computed association measure(s).
McInnes, B T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.
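The association measures named for the method argument can be illustrated on a single word pair. The following is a minimal base R sketch, not the package's implementation: the helper assoc_measures and the counts are hypothetical, and it assumes the standard textbook definitions of each statistic computed from a 2x2 table of observed counts, with expected counts taken under independence.

```r
# Hypothetical helper (not part of the package): association measures for one
# word pair (w1, w2) from its 2x2 contingency table of observed counts.
#   n11 = count of (w1, w2)        n12 = count of (w1, not w2)
#   n21 = count of (not w1, w2)    n22 = count of (not w1, not w2)
assoc_measures <- function(n11, n12, n21, n22) {
  # matrix() fills column by column, so obs[1, 2] is n12 and obs[2, 1] is n21
  obs <- matrix(c(n11, n21, n12, n22), nrow = 2)
  total <- sum(obs)
  # expected counts under independence: row total * column total / grand total
  expd <- outer(rowSums(obs), colSums(obs)) / total
  # cells with zero observed count contribute 0 to the likelihood ratio
  g2_terms <- ifelse(obs > 0, obs * log(obs / expd), 0)
  c(
    lr   = 2 * sum(g2_terms),
    chi2 = sum((obs - expd)^2 / expd),
    pmi  = log(obs[1, 1] / expd[1, 1]),
    dice = 2 * obs[1, 1] / (sum(obs[1, ]) + sum(obs[, 1]))
  )
}

# Made-up counts, for illustration only
m <- assoc_measures(n11 = 30, n12 = 20, n21 = 10, n22 = 940)
print(round(m, 3))
```

Larger values of each measure indicate a stronger association between the two words, which is why results are sorted in descending order of the requested statistic.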
# NOT RUN {
txt <- c("This is software testing: looking for (word) pairs!
This [is] a software testing again. For.",
"Here: this is more Software Testing, looking again for word pairs.")
collocations(txt, punctuation = "dontspan") # default
collocations(txt, punctuation = "dontspan", remove_punct = TRUE) # includes "testing looking"
collocations(txt, punctuation = "ignore", remove_punct = TRUE) # same as previous
collocations(txt, punctuation = "include", remove_punct = FALSE) # keep punctuation as tokens
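# return both bigram and trigram collocations
collocations(txt, size = 2:3)
# remove English stopwords from the detected collocations
removeFeatures(collocations(txt, size = 2:3), stopwords("english"))
# Twitter characters @ and # are handled by the tokenizer options passed via ...
collocations("@textasdata We really, really love the #quanteda package - thanks!!")
collocations("@textasdata We really, really love the #quanteda package - thanks!!",
             remove_twitter = TRUE)
# top 10 collocations from a subset of the inaugural address corpus
collocations(data_corpus_inaugural[49:57], n = 10)
# all association measures at once
collocations(data_corpus_inaugural[49:57], method = "all", n = 10)
# trigram collocations scored by chi-squared and by point-wise mutual information
collocations(data_corpus_inaugural[49:57], method = "chi2", size = 3, n = 10)
collocations(corpus_subset(data_corpus_inaugural, Year > 1980), method = "pmi", size = 3, n = 10)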
# }
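The three punctuation options differ only in which adjacent token pairs are eligible to become collocations. The sketch below shows that difference in base R on a pre-tokenized vector. The helper bigrams_by_policy is hypothetical and is not how the package forms collocations internally; it assumes punctuation has been kept as separate tokens.

```r
# Hypothetical helper (not the package's implementation): form adjacent word
# pairs from a token vector under the three punctuation policies.
bigrams_by_policy <- function(tokens,
                              policy = c("dontspan", "ignore", "include")) {
  policy <- match.arg(policy)
  is_punct <- grepl("^[[:punct:]]+$", tokens)
  if (policy == "ignore") {
    # drop punctuation first, so words on either side become adjacent
    tokens <- tokens[!is_punct]
    is_punct <- rep(FALSE, length(tokens))
  }
  if (length(tokens) < 2) return(character(0))
  left  <- seq_len(length(tokens) - 1)
  right <- left + 1
  keep  <- rep(TRUE, length(left))
  if (policy == "dontspan") {
    # a pair is eligible only if neither member is a punctuation token
    keep <- !is_punct[left] & !is_punct[right]
  }
  paste(tokens[left], tokens[right])[keep]
}

toks <- c("software", "testing", ":", "looking", "for", "pairs")
bigrams_by_policy(toks, "dontspan")  # no pair crosses the colon
bigrams_by_policy(toks, "ignore")    # "testing looking" becomes a pair
bigrams_by_policy(toks, "include")   # the colon itself appears in pairs
```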