phrasetotoken: convert phrases into single tokens

Description

Replace multi-word phrases in one or more texts with compound versions in which the words are joined by concatenator (by default, the "_" character), so that each phrase forms a single token. Because the whitespace delimiter between the words is eliminated, the phrases are not split apart by tokenization during subsequent processing.
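
The replacement described above can be illustrated with a minimal base R sketch. This is not quanteda's implementation: join_phrases is a hypothetical helper written only for illustration, and it assumes case-sensitive, literal matching with no attention to word boundaries.

```r
# Hypothetical helper, not part of quanteda: joins the words of each
# multi-word phrase with `concatenator` wherever the phrase occurs in `texts`.
join_phrases <- function(texts, phrases, concatenator = "_") {
  # Keep only phrases that contain a space; single words need no joining.
  phrases <- phrases[grepl(" ", phrases, fixed = TRUE)]
  # Process longer phrases first so they are handled before any shorter
  # phrase they might contain.
  phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
  for (p in phrases) {
    compound <- gsub(" ", concatenator, p, fixed = TRUE)
    texts <- gsub(p, compound, texts, fixed = TRUE)
  }
  texts
}

join_phrases("An income tax and a capital gains tax.",
             c("tax", "income tax", "capital gains tax"))
```

The single-word entry "tax" is left alone, while "income tax" and "capital gains tax" each become one whitespace-free string.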

Usage

phrasetotoken(object, phrases, concatenator = "_")
## S3 method for class 'character,dictionary':
phrasetotoken(object, phrases,
  concatenator = "_")
phrasetotoken.corpus(object, phrases, concatenator = "_")
## S3 method for class 'character,collocations':
phrasetotoken(object, phrases,
  concatenator = "_")

Arguments

object

source texts, a character or character vector

phrases

a dictionary object that contains some phrases, defined as multiple words delimited by whitespace, up to 9 words long; or a quanteda collocation object created by

concatenator

the concatenation character that will connect the words making up the multi-word phrases. The default _ is highly recommended since it will not be removed during normal cleaning and tokenization (while nearly all other punctuation charact

Value

character or character vector of texts with phrases replaced by compound "words" joined by the concatenator
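
As a rough illustration of why the compound form survives later processing, the base R sketch below splits an already-joined text on whitespace and checks how the regular-expression word class treats "_". It uses only base R functions and does not reproduce quanteda's own cleaning or tokenization; the sample string is made up for the example.

```r
# A text in which one phrase has already been joined with "_".
joined <- "an income_tax and a sales tax"

# Splitting on whitespace keeps "income_tax" together as one token,
# while the unjoined "sales tax" is split into two.
tokens <- strsplit(joined, "[[:space:]]+")[[1]]
tokens

# "_" belongs to the regex word class \w; "-" does not.
grepl("^\\w+$", "income_tax")
grepl("^\\w+$", "income-tax")
```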

Examples

mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
             "New York City has raised taxes: an income tax and a sales tax.")
mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))
(cw <- phrasetotoken(mytexts, mydict))
dfm(cw, verbose=FALSE)

# when used as a dictionary for dfm creation
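# replace the spaces in each dictionary phrase with "_" so that the entries
# match the compound tokens in cw
mydfm2 <- dfm(cw, dictionary=lapply(mydict, function(x) gsub(" ", "_", x)))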
mydfm2
# to pick up "taxes" in the second text, set dictionary_regex=TRUE
mydfm3 <- dfm(cw, dictionary=lapply(mydict, phrasetotoken, mydict),
              dictionary_regex=TRUE)
mydfm3
## one more token counted for "tax" than before
