
quanteda (version 0.9.4)

tokenize: tokenize a set of texts

Description

Tokenize the texts from a character vector or from a corpus.

is.tokenizedTexts returns TRUE if the object is of class tokenizedTexts, FALSE otherwise.
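A minimal sketch of this behaviour, assuming the quanteda 0.9.x API shown under Usage (the document names d1 and d2 are arbitrary):

library(quanteda)
toks <- tokenize(c(d1 = "Spam and eggs.", d2 = "Eggs and spam."))
is.tokenizedTexts(toks)                    # TRUE: tokenize() returns a tokenizedTexts object
is.tokenizedTexts(c("not", "tokenized"))   # FALSE: a plain character vector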

Usage

tokenize(x, ...)

## S3 method for class 'character':
tokenize(x, what = c("word", "sentence", "character", "fastestword", "fasterword"),
         removeNumbers = FALSE, removePunct = FALSE, removeSeparators = TRUE,
         removeTwitter = FALSE, removeHyphens = FALSE, ngrams = 1L, skip = 0L,
         concatenator = "_", simplify = FALSE, verbose = FALSE, ...)

## S3 method for class 'corpus': tokenize(x, ...)

is.tokenizedTexts(x)

Arguments

x
The text(s) or corpus to be tokenized
...
additional arguments not used
what
the unit for splitting the text; available alternatives are:
  • "word": the smartest but slowest word tokenization method (recommended default)
  • "fasterword": a faster but less careful word tokenization method that splits on whitespace
  • "fastestword": the fastest method, splitting only on the space character " "
  • "character": tokenization into individual characters
  • "sentence": sentence segmentation, which handles some common abbreviations but is far from perfect
See the short illustration following this argument list.
removeNumbers
remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day
removePunct
remove all punctuation
removeSeparators
remove separators and separator characters (spaces and variations of spaces, plus tab, newlines, and anything else in the Unicode "separator" category) when removePunct = FALSE. Only applicable for what = "character" (when you probably want it to be FALSE) and for what = "word" (when you probably want it to be TRUE)
removeTwitter
remove Twitter characters @ and #; set to TRUE if you wish to eliminate these.
removeHyphens
if TRUE, split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. "self-storage" becomes c("self", "storage"). Default is FALSE to preserve such words as is, with their hyphens.
ngrams
integer vector of the n for n-grams, defaulting to 1 (unigrams). For bigrams, for instance, use 2; for bigrams and unigrams, use 1:2. You can even include irregular sequences such as 2:3.
skip
integer vector specifying the skips for skip-grams, default is 0 for only immediately neighbouring words. Only applies if ngrams is different from the default of 1. See skipgrams.
concatenator
character to use in concatenating n-grams, default is "_", which is recommended since this is included in the regular expression and Unicode definitions of "word" characters
simplify
if TRUE, return a single character vector of tokens rather than a list of length ndoc(texts) in which each element is a character vector of the tokens for the corresponding text.
verbose
if TRUE, print timing messages to the console; off by default
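As a hedged illustration of the what alternatives described above (the values come from the Usage signature; exact token boundaries depend on the tokenizer), a comparison might look like this:

library(quanteda)
txt <- "Dr. Smith tokenizes text. He likes it!"
tokenize(txt, what = "word")         # default: smart word tokenization
tokenize(txt, what = "fasterword")   # faster: splits on whitespace
tokenize(txt, what = "fastestword")  # fastest: splits on the space character only
tokenize(txt, what = "character")    # individual characters
tokenize(txt, what = "sentence")     # sentence segmentation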

Value

A tokenizedTexts (S3) object, essentially a list of length ndoc(x) in which each element is a character vector of the tokens found in the corresponding text. If simplify = TRUE, a single character vector is returned instead.

Details

The tokenizer is designed to be fast and flexible as well as to handle Unicode correctly. Most of the time, users will construct dfm objects from texts or a corpus, without calling tokenize() as an intermediate step. Since tokenize() is most likely to be used by more technical users, we have set its options to default to minimal intervention. This means that punctuation is tokenized as well, and that nothing is removed by default from the text being tokenized except inter-word spacing and equivalent characters.
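A small sketch of the minimal-intervention defaults described above (assuming the 0.9.x API): with no options set, punctuation is kept as separate tokens and only inter-word spacing and equivalent characters are dropped.

library(quanteda)
tokenize("Here, by default: punctuation stays!")
# the tokens include ",", ":", and "!" alongside the words
tokenize("Here, by default: punctuation stays!", removePunct = TRUE)
# punctuation tokens are dropped only when explicitly requested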

See Also

ngrams

Examples

# same for character vectors and for lists
tokensFromChar <- tokenize(inaugTexts[1:3])
tokensFromCorp <- tokenize(subset(inaugCorpus, Year<1798))
identical(tokensFromChar, tokensFromCorp)
str(tokensFromChar)
# returned as a list
head(tokenize(inaugTexts[57])[[1]], 10)
# returned as a character vector using simplify=TRUE
head(tokenize(inaugTexts[57], simplify=TRUE), 10)

# removing punctuation marks and lowercasing texts
head(tokenize(toLower(inaugTexts[57]), simplify=TRUE, removePunct=TRUE), 30)
# keeping case and punctuation
head(tokenize(inaugTexts[57], simplify=TRUE), 30)
# keeping versus removing hyphens
tokenize("quanteda data objects are auto-loading.", removePunct = TRUE)
tokenize("quanteda data objects are auto-loading.", removePunct = TRUE, removeHyphens = TRUE)

## MORE COMPARISONS
txt <- "#textanalysis is MY <3 4U @myhandle gr8 #stuff :-)"
tokenize(txt, removePunct=TRUE)
tokenize(txt, removePunct=TRUE, removeTwitter=TRUE)
#tokenize("great website http://textasdata.com", removeURL=FALSE)
#tokenize("great website http://textasdata.com", removeURL=TRUE)

txt <- c(text1="This is $10 in 999 different ways,\n up and down; left and right!", 
         text2="@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokenize(txt, verbose=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=TRUE)
tokenize(txt, removeNumbers=FALSE, removePunct=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE, removeSeparators=FALSE)

# character level
tokenize("Great website: http://textasdata.com?page=123.", what="character")
tokenize("Great website: http://textasdata.com?page=123.", what="character", 
         removeSeparators=FALSE)

# sentence level         
tokenize(c("Kurt Vongeut said; only assholes use semi-colons.", 
           "Today is Thursday in Canberra:  It is yesterday in London.", 
           "Today is Thursday in Canberra:  \nIt is yesterday in London.",
           "To be?  Or\not to be?"), 
          what = "sentence")
tokenize(inaugTexts[c(2,40)], what = "sentence", simplify = TRUE)

# creating ngrams
txt <- toLower(c(mytext1 = "This is a short test sentence.",
                mytext2 = "Short.",
                mytext3 = "Short, shorter, and shortest."))
tokenize(txt, removePunct = TRUE)
removeFeatures(tokenize(txt, removePunct = TRUE), stopwords("english"))
tokenize(txt, removePunct = TRUE, ngrams = 2)
tokenize(txt, removePunct = TRUE, ngrams = 1:2)
tokenize(txt, removePunct = TRUE, ngrams = 2, skip = 1, concatenator = "")
removeFeatures(tokenize(txt, removePunct = TRUE, ngrams = 1:2), stopwords("english"))
