tokens_compound: convert token sequences into compound tokens

Description

Replace multi-token sequences with a single multi-word, or "compound", token. Each compound token represents a phrase or multi-word expression whose elements are joined by the concatenator (by default, the "_" character). This ensures that the sequence is treated as a single token in subsequent processing, for instance when constructing a dfm.

Usage

tokens_compound(x, sequences, concatenator = "_", valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, join = FALSE)

Arguments

x

an input tokens object

sequences

the input sequence, one of:

character vector, whose elements will be split on whitespace;
list of characters, consisting of a list of token patterns, separated by white space;
tokens object;
dictionary object;
collocations object.

concatenator

the concatenation character that will connect the words making up the multi-word sequences. The default _ is highly recommended since it will not be removed during normal cleaning and tokenization (while nearly all other punctuation characters, at least those in the Unicode punctuation class [P], will be removed).

valuetype

how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching

join

logical; if TRUE, join overlapped compounds
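
The difference between the three valuetype options can be illustrated with base R alone. The sketch below is only an approximation of the matching behaviour described above, not quanteda's internal code: the token vector and pattern are made up for illustration, "regex" is assumed to match anywhere inside a token, and ignore.case / tolower() stand in for case_insensitive = TRUE.

```r
# Illustrative only: base R stand-ins for the three valuetype options.
# This is not quanteda's internal matching code.
tokens_vec <- c("Tax", "taxes", "surtax", "text")
pattern <- "tax*"

# "glob": translate the wildcard pattern into an anchored regular expression
glob_hits <- grepl(utils::glob2rx(pattern), tokens_vec, ignore.case = TRUE)

# "regex": the pattern is treated as a regular expression and (by assumption
# here) may match anywhere inside a token
regex_hits <- grepl("tax", tokens_vec, ignore.case = TRUE)

# "fixed": the whole token must equal the pattern; lower-casing both sides
# mimics case_insensitive = TRUE
fixed_hits <- tolower(tokens_vec) == tolower("tax")

print(tokens_vec[glob_hits])
print(tokens_vec[regex_hits])
print(tokens_vec[fixed_hits])
```

In this sketch the glob pattern selects "Tax" and "taxes", the regular expression additionally selects "surtax", and exact matching selects only "Tax".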

Value

a tokens object in which the token sequences matching the patterns in sequences have been replaced by compound "tokens" joined by the concatenator
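
To make the returned value concrete, here is a minimal base R sketch of what compounding does to one document's tokens. The helper compound_sketch() is hypothetical and is not part of quanteda; it assumes a single non-empty sequence, exact case-insensitive matching, and no overlap handling.

```r
# Hypothetical helper (not part of quanteda): replace each occurrence of the
# token sequence `seq` in `toks` with one token joined by `concatenator`.
compound_sketch <- function(toks, seq, concatenator = "_") {
  stopifnot(length(seq) >= 1)
  n <- length(seq)
  out <- character(0)
  i <- 1
  while (i <= length(toks)) {
    window_end <- i + n - 1
    if (window_end <= length(toks) &&
        all(tolower(toks[i:window_end]) == tolower(seq))) {
      # the window matches the sequence: emit one compound token
      out <- c(out, paste(toks[i:window_end], collapse = concatenator))
      i <- window_end + 1
    } else {
      # no match at this position: keep the token unchanged
      out <- c(out, toks[i])
      i <- i + 1
    }
  }
  out
}

toks <- c("an", "income", "tax", "and", "inheritance", "taxes")
result <- compound_sketch(toks, c("income", "tax"))
print(result)
```

Here "income" and "tax" become the single token "income_tax", while "inheritance" and "taxes" stay separate because "taxes" is not an exact match for "tax".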

Examples

mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
             "New York City has raised taxes: an income tax and inheritance taxes.")
mytoks <- tokens(mytexts, removePunct = TRUE)

# for lists of sequence elements
myseqs <- list(c("tax"), c("income", "tax"), c("capital", "gains", "tax"), c("inheritance", "tax"))
(cw <- tokens_compound(mytoks, myseqs))
dfm(cw)

# when used as a dictionary for dfm creation
mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))
(cw2 <- tokens_compound(mytoks, mydict))

# to pick up "taxes" in the second text, set valuetype = "regex"
(cw3 <- tokens_compound(mytoks, mydict, valuetype = "regex"))

# dictionaries w/glob matches
myDict <- dictionary(list(negative = c("bad* word*", "negative", "awful text"),
                          positive = c("good stuff", "like? th??")))
toks <- tokens(c(txt1 = "I liked this, when we can use bad words, in awful text.",
                 txt2 = "Some damn good stuff, like the text, she likes that too."))
tokens_compound(toks, myDict)

# with collocations
collocs <- collocations("capital gains taxes are worse than inheritance taxes", size = 2:3)
toks <- tokens("The new law included capital gains taxes and inheritance taxes.")
tokens_compound(toks, collocs)