sequences: find variable-length collocations with filtering

Description

This function automatically identifies contiguous collocations consisting of variable-length term sequences whose frequency is unlikely to have occurred by chance. The algorithm is based on Blaheta and Johnson's (2001) "Unsupervised Learning of Multi-Word Verbs".
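As a rough illustration of what "contiguous sequences of matching terms" means, the base R sketch below collects maximal runs of adjacent tokens that match a pattern and counts how often each run occurs. The helper `count_runs()` is hypothetical and is not part of quanteda; it only counts frequencies and does not estimate the association statistics that `sequences()` reports.

```r
# Hypothetical helper (not quanteda's implementation): find maximal runs of
# adjacent tokens that match `pattern`, then count each distinct run.
count_runs <- function(tokens, pattern, min_count = 2, max_length = 5) {
  hit <- grepl(pattern, tokens)
  r <- rle(hit)                      # group consecutive TRUE/FALSE values
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  # keep runs of matching tokens that are 2..max_length tokens long
  keep <- r$values & r$lengths >= 2 & r$lengths <= max_length
  runs <- Map(function(s, e) paste(tokens[s:e], collapse = " "),
              starts[keep], ends[keep])
  counts <- table(as.character(unlist(runs)))
  out <- setNames(as.integer(counts), names(counts))
  out[out >= min_count]              # drop runs rarer than min_count
}

words <- c("the", "United", "States", "and", "the", "United", "States",
           "met", "in", "New", "York")
res <- count_runs(words, "^[A-Z][a-z]+$", min_count = 2)
print(res)
```

Here "United States" occurs twice and is kept, while "New York" occurs once and falls below `min_count = 2`.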

Usage

sequences(x, features, valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, min_count = 2, max_length = 5, nested = TRUE, ordered = FALSE)

Arguments

x

a tokens object

features

a pattern, interpreted according to valuetype, for filtering the features to be located in sequences

valuetype

how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore case when matching, if TRUE

min_count

minimum frequency of sequences for which parameters are estimated

max_length

maximum length of sequences which are collected

nested

if TRUE, collect all the subsequences of a longer sequence as separate entities. For example, in the sequence of capitalized words "United States Congress", "States Congress" is considered a subsequence, but "United States" is not, because it is followed by "Congress".

ordered

if TRUE, use the Blaheta-Johnson method, which distinguishes between the order of words and tends to promote rare sequences.
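The valuetype and nested descriptions above can be made concrete with a small base R sketch. It is only an analogy under stated assumptions: the matching lines mimic the three valuetype modes with `grepl()` and `utils::glob2rx()` rather than calling quanteda, and the hypothetical helper `run_suffixes()` reads the "United States Congress" example literally, treating a subsequence as a trailing part of the run with at least two words.

```r
words <- c("United", "States", "united", "Reunited")

# "glob": shell-style wildcards; glob2rx() converts "Unit*" to a regex
glob_hit <- grepl(glob2rx("Unit*"), words)
# "regex": the pattern is used as a regular expression as-is
regex_hit <- grepl("^[A-Z][a-z]+$", words)
# "fixed": the pattern must equal the whole token exactly
fixed_hit <- words == "United"
# analogue of case_insensitive = TRUE for the glob pattern
ci_hit <- grepl(glob2rx("Unit*"), words, ignore.case = TRUE)

# Hypothetical helper (an assumption, not quanteda code): list the run itself
# and every trailing subsequence that is at least `min_length` words long.
run_suffixes <- function(run, min_length = 2) {
  n <- length(run)
  if (n < min_length) return(character(0))
  starts <- seq_len(n - min_length + 1)
  vapply(starts, function(s) paste(run[s:n], collapse = " "), character(1))
}

subs <- run_suffixes(c("United", "States", "Congress"))
print(subs)
```

Under this reading, "States Congress" is collected alongside the full run, while "United States" is not, which matches the example given for nested.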

References

Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.

Examples

library(quanteda)

# split the inaugural address corpus into sentences and tokenize
toks <- tokens(corpus_segment(data_corpus_inaugural, what = "sentence"))
toks <- tokens_select(toks, stopwords("english"), "remove", padding = TRUE)

# extracting multi-part proper nouns (capitalized terms)
seqs <- sequences(toks, "^([A-Z][a-z\\-]{2,})", valuetype = "regex", case_insensitive = FALSE)
head(seqs, 10)

# features can be any lowercase words
seqs2 <- sequences(toks, "^([a-z]+)$", valuetype = "regex", case_insensitive = FALSE,
                   min_count = 2, ordered = TRUE)
head(seqs2, 10)