
quanteda (version 0.7.2-1)

tokenize: tokenize a set of texts

Description

Tokenize the texts from a character vector or from a corpus.

Usage

tokenize(x, ...)

## S3 method for class 'character'
tokenize(x, simplify = FALSE, sep = " ", ...)

## S3 method for class 'corpus'
tokenize(x, ...)

Arguments

x
The text(s) or corpus to be tokenized.
...
Additional arguments passed to clean.
simplify
If TRUE, return a character vector of tokens rather than a list of length ndoc(x), with each list element containing a character vector of the tokens for the corresponding text.
sep
By default, tokenize expects tokens to be delimited by white space. Alternatively, sep can be used to specify a different character that delimits fields.
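As a brief illustration of the sep and simplify arguments described above (a sketch against the tokenize() interface documented here; exact token output may differ across quanteda versions because of the cleaning step):

```r
library(quanteda)

# split on commas instead of the default white-space delimiter
tokenize("one,two,three", sep = ",")

# simplify = TRUE flattens the result into a single character vector
tokenize("one,two,three", sep = ",", simplify = TRUE)
```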

Value

  • A list of length ndoc(x) of the tokens found in each text.
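A small sketch of the return structure, using made-up texts rather than the package's sample data (the list-of-character-vectors shape follows the Value description above):

```r
library(quanteda)

txts <- c("First sample text.", "A second, different text.")
toks <- tokenize(txts)

length(toks)   # one list element per input text
toks[[1]]      # character vector of tokens for the first text
```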

Examples

# the same result for a character vector and for a corpus
tokensFromChar <- tokenize(inaugTexts[1:3])
tokensFromCorp <- tokenize(subset(inaugCorpus, Year < 1798))
identical(tokensFromChar, tokensFromCorp)
str(tokensFromChar)
# returned as a list
head(tokenize(inaugTexts[57])[[1]], 10)
# returned as a character vector using simplify=TRUE
head(tokenize(inaugTexts[57], simplify=TRUE), 10)

# demonstrate some options with clean
head(tokenize(inaugTexts[57], simplify=TRUE, cpp=TRUE), 30)
## NOTE: not the same as
head(tokenize(inaugTexts[57], simplify=TRUE, cpp=FALSE), 30)

## MORE COMPARISONS
tokenize("this is MY <3 4U @myhandle gr8 stuff :-)", removeTwitter=FALSE, cpp=TRUE)
tokenize("this is MY <3 4U @myhandle gr8 stuff :-)", removeTwitter=FALSE, cpp=FALSE)
tokenize("great website http://textasdata.com", removeURL=FALSE, cpp=TRUE)
tokenize("great website http://textasdata.com", removeURL=FALSE, cpp=FALSE)
tokenize("great website http://textasdata.com", removeURL=TRUE, cpp=TRUE)
tokenize("great website http://textasdata.com", removeURL=TRUE, cpp=FALSE)
