quanteda (version 0.9.2-0)

phrasetotoken: convert phrases into single tokens

Description

Replace multi-word phrases in text(s) with compound versions whose words are joined by concatenator (by default, the "_" character), so that each phrase forms a single token. Removing the whitespace delimiter prevents the phrase from being split apart during subsequent tokenization.
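The idea can be sketched in base R, without quanteda. This is only an illustration of the replacement step described above, not the package's implementation: `compound_phrases` is a hypothetical helper, and it uses plain substring matching, so it does not respect word boundaries the way a tokenizer-aware function would.

```r
# Hypothetical helper, not part of quanteda: joins each multi-word phrase
# into one token by swapping its internal spaces for the concatenator.
compound_phrases <- function(texts, phrases, concatenator = "_") {
  # Process longer phrases first so a short phrase cannot break up a
  # longer phrase that contains it
  phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
  for (p in phrases) {
    joined <- gsub(" ", concatenator, p, fixed = TRUE)
    # fixed = TRUE: literal substring match, no regular expressions
    texts <- gsub(p, joined, texts, fixed = TRUE)
  }
  texts
}

txt <- "The new law included a capital gains tax, and an inheritance tax."
out <- compound_phrases(txt, c("capital gains tax", "inheritance tax"))
print(out)
```

Because the whitespace inside each phrase is gone, a later whitespace-based split treats "capital_gains_tax" as one token.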

Usage

phrasetotoken(object, phrases, concatenator = "_")

## S3 method for class 'character,dictionary'
phrasetotoken(object, phrases, concatenator = "_")

phrasetotoken.corpus(object, phrases, concatenator = "_")

## S3 method for class 'character,collocations'
phrasetotoken(object, phrases, concatenator = "_")

Arguments

object
source texts, a character or character vector
phrases
a dictionary object that contains some phrases, defined as multiple words delimited by whitespace, up to 9 words long; or a quanteda collocation object created by
concatenator
the concatenation character that will connect the words making up the multi-word phrases. The default _ is highly recommended since it will not be removed during normal cleaning and tokenization (while nearly all other punctuation charact

Value

  • character or character vector of texts with phrases replaced by compound "words" joined by the concatenator

Examples

mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
             "New York City has raised taxes: an income tax and a sales tax.")
mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))
(cw <- phrasetotoken(mytexts, mydict))
dfm(cw, verbose=FALSE)

# when used as a dictionary for dfm creation
mydfm2 <- dfm(cw, dictionary=lapply(mydict, function(x) gsub(" ", "_", x)))
mydfm2
# to pick up "taxes" in the second text, set dictionary_regex=TRUE
mydfm3 <- dfm(cw, dictionary=lapply(mydict, phrasetotoken, mydict),
              dictionary_regex=TRUE)
mydfm3
## one more token counted for "tax" than before
