
quanteda (version 0.9.4)

tokenize: tokenize a set of texts

Description

Tokenize the texts from a character vector or from a corpus.

is.tokenizedTexts returns TRUE if the object is of class tokenizedTexts, FALSE otherwise.
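A minimal sketch of this behaviour, assuming the quanteda 0.9.x API shown under Usage (the document names d1 and d2 are arbitrary):

library(quanteda)
toks <- tokenize(c(d1 = "Spam and eggs.", d2 = "Eggs and spam."))
is.tokenizedTexts(toks)                    # TRUE: tokenize() returns a tokenizedTexts object
is.tokenizedTexts(c("not", "tokenized"))   # FALSE: a plain character vector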

Usage

tokenize(x, ...)

## S3 method for class 'character':
tokenize(x, what = c("word", "sentence", "character", "fastestword", "fasterword"),
         removeNumbers = FALSE, removePunct = FALSE, removeSeparators = TRUE,
         removeTwitter = FALSE, removeHyphens = FALSE, ngrams = 1L, skip = 0L,
         concatenator = "_", simplify = FALSE, verbose = FALSE, ...)

## S3 method for class 'corpus': tokenize(x, ...)

is.tokenizedTexts(x)

Arguments

x
The text(s) or corpus to be tokenized
...
additional arguments not used
what
the unit for splitting the text; available alternatives are:
  • "word": the smartest but slowest word tokenization method (recommended default)
  • "fasterword": a faster but less careful word tokenization method that splits on whitespace
  • "fastestword": the fastest method, splitting only on the space character " "
  • "character": tokenization into individual characters
  • "sentence": sentence segmentation, which handles some common abbreviations but is far from perfect
See the short illustration following this argument list.
removeNumbers
remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day
removePunct
remove all punctuation
removeSeparators
remove separators and separator characters (spaces and variations of spaces, plus tab, newlines, and anything else in the Unicode "separator" category) when removePunct = FALSE. Only applicable for what = "character" (when you probably want it to be FALSE) and for what = "word" (when you probably want it to be TRUE)
removeTwitter
remove Twitter characters @ and #; set to TRUE if you wish to eliminate these.
removeHyphens
if TRUE, split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. "self-storage" becomes c("self", "storage"). Default is FALSE to preserve such words as is, with their hyphens.
ngrams
integer vector of the n for n-grams, defaulting to 1 (unigrams). For bigrams, for instance, use 2; for bigrams and unigrams, use 1:2. You can even include irregular sequences such as 2:3.
skip
integer vector specifying the skips for skip-grams, default is 0 for only immediately neighbouring words. Only applies if ngrams is different from the default of 1. See skipgrams.
concatenator
character to use in concatenating n-grams, default is "_", which is recommended since this is included in the regular expression and Unicode definitions of "word" characters
simplify
if TRUE, return a single character vector of tokens rather than a list of length ndoc(texts) in which each element is a character vector of the tokens for the corresponding text.
verbose
if TRUE, print timing messages to the console; off by default
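As a hedged illustration of the what alternatives described above (the values come from the Usage signature; exact token boundaries depend on the tokenizer), a comparison might look like this:

library(quanteda)
txt <- "Dr. Smith tokenizes text. He likes it!"
tokenize(txt, what = "word")         # default: smart word tokenization
tokenize(txt, what = "fasterword")   # faster: splits on whitespace
tokenize(txt, what = "fastestword")  # fastest: splits on the space character only
tokenize(txt, what = "character")    # individual characters
tokenize(txt, what = "sentence")     # sentence segmentation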

Value

A tokenizedTexts (S3) object, essentially a list of length ndoc(x) in which each element is a character vector of the tokens found in the corresponding text. If simplify = TRUE, a single character vector is returned instead.

Details

The tokenizer is designed to be fast and flexible as well as to handle Unicode correctly. Most of the time, users will construct dfm objects from texts or a corpus, without calling tokenize() as an intermediate step. Since tokenize() is most likely to be used by more technical users, we have set its options to default to minimal intervention. This means that punctuation is tokenized as well, and that nothing is removed by default from the text being tokenized except inter-word spacing and equivalent characters.
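A small sketch of the minimal-intervention defaults described above (assuming the 0.9.x API): with no options set, punctuation is kept as separate tokens and only inter-word spacing and equivalent characters are dropped.

library(quanteda)
tokenize("Here, by default: punctuation stays!")
# the tokens include ",", ":", and "!" alongside the words
tokenize("Here, by default: punctuation stays!", removePunct = TRUE)
# punctuation tokens are dropped only when explicitly requested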

See Also

ngrams

Examples

# same for character vectors and for lists
tokensFromChar <- tokenize(inaugTexts[1:3])
tokensFromCorp <- tokenize(subset(inaugCorpus, Year<1798))
identical(tokensFromChar, tokensFromCorp)
str(tokensFromChar)
# returned as a list
head(tokenize(inaugTexts[57])[[1]], 10)
# returned as a character vector using simplify=TRUE
head(tokenize(inaugTexts[57], simplify=TRUE), 10)

# removing punctuation marks and lowercasing texts
head(tokenize(toLower(inaugTexts[57]), simplify=TRUE, removePunct=TRUE), 30)
# keeping case and punctuation
head(tokenize(inaugTexts[57], simplify=TRUE), 30)
# keeping versus removing hyphens
tokenize("quanteda data objects are auto-loading.", removePunct = TRUE)
tokenize("quanteda data objects are auto-loading.", removePunct = TRUE, removeHyphens = TRUE)

## MORE COMPARISONS
txt <- "#textanalysis is MY <3 4U @myhandle gr8 #stuff :-)"
tokenize(txt, removePunct=TRUE)
tokenize(txt, removePunct=TRUE, removeTwitter=TRUE)
#tokenize("great website http://textasdata.com", removeURL=FALSE)
#tokenize("great website http://textasdata.com", removeURL=TRUE)

txt <- c(text1="This is $10 in 999 different ways,\n up and down; left and right!", 
         text2="@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokenize(txt, verbose=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=TRUE)
tokenize(txt, removeNumbers=FALSE, removePunct=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE, removeSeparators=FALSE)

# character level
tokenize("Great website: http://textasdata.com?page=123.", what="character")
tokenize("Great website: http://textasdata.com?page=123.", what="character", 
         removeSeparators=FALSE)

# sentence level         
tokenize(c("Kurt Vongeut said; only assholes use semi-colons.", 
           "Today is Thursday in Canberra:  It is yesterday in London.", 
           "Today is Thursday in Canberra:  \nIt is yesterday in London.",
           "To be?  Or\not to be?"), 
          what = "sentence")
tokenize(inaugTexts[c(2,40)], what = "sentence", simplify = TRUE)

# creating ngrams
txt <- toLower(c(mytext1 = "This is a short test sentence.",
                mytext2 = "Short.",
                mytext3 = "Short, shorter, and shortest."))
tokenize(txt, removePunct = TRUE)
removeFeatures(tokenize(txt, removePunct = TRUE), stopwords("english"))
tokenize(txt, removePunct = TRUE, ngrams = 2)
tokenize(txt, removePunct = TRUE, ngrams = 1:2)
tokenize(txt, removePunct = TRUE, ngrams = 2, skip = 1, concatenator = "")
removeFeatures(tokenize(txt, removePunct = TRUE, ngrams = 1:2), stopwords("english"))
