quanteda (version 0.99.12)

tokenize: tokenize a set of texts

Description

Tokenize the texts from a character vector or from a corpus.

is.tokenizedTexts returns TRUE if the object is of class tokenizedTexts, FALSE otherwise.
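A minimal sketch of the class check, for instance to guard code that expects tokenized input (assumes quanteda is attached; the document texts here are made up):

```r
library(quanteda)

# tokenize() returns a tokenizedTexts object; a plain character vector is not one
toks <- tokenize(c(d1 = "Spinach is healthy.", d2 = "So is kale."))
is.tokenizedTexts(toks)                   # TRUE
is.tokenizedTexts(c("not", "tokenized"))  # FALSE
```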

Usage

tokenize(x, ...)

# S3 method for character
tokenize(x, what = c("word", "sentence", "character", "fastestword",
  "fasterword"), remove_numbers = FALSE, remove_punct = FALSE,
  remove_symbols = FALSE, remove_separators = TRUE, remove_twitter = FALSE,
  remove_hyphens = FALSE, remove_url = FALSE, ngrams = 1L, skip = 0L,
  concatenator = "_", simplify = FALSE, verbose = FALSE, ...)

# S3 method for corpus
tokenize(x, ...)

is.tokenizedTexts(x)

as.tokenizedTexts(x, ...)

# S3 method for list
as.tokenizedTexts(x, ...)

# S3 method for tokens
as.tokenizedTexts(x, ...)

Arguments

x

text(s) or corpus to be tokenized

...

additional arguments not used

what

the unit for splitting the text, available alternatives are:

"word"

(recommended default) smartest, but slowest, word tokenization method; see stringi-search-boundaries for details.

"fasterword"

dumber, but faster, word tokenization method; uses stri_split_charclass(x, "\\p{WHITE_SPACE}")

"fastestword"

dumbest, but fastest, word tokenization method; calls stri_split_fixed(x, " ")

"character"

tokenization into individual characters

"sentence"

sentence segmenter, smart enough to handle some exceptions in English such as "Prof. Plum killed Mrs. Peacock." (but far from perfect).

remove_numbers

remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day

remove_punct

if TRUE, remove all characters in the Unicode "Punctuation" [P] class

remove_symbols

if TRUE, remove all characters in the Unicode "Symbol" [S] class

remove_separators

remove separators and separator characters (spaces and variations of spaces, plus tab, newlines, and anything else in the Unicode "Separator" [Z] category) when remove_punct = FALSE. Only applicable for what = "character" (when you probably want it to be FALSE) and for what = "word" (when you probably want it to be TRUE). Note that if what = "word" and you set remove_punct = TRUE, then remove_separators has no effect. Use carefully.

remove_twitter

remove Twitter characters @ and #; set to TRUE if you wish to eliminate these.

remove_hyphens

if TRUE, split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. "self-storage" becomes c("self", "storage"). Default is FALSE to preserve such words as is, with the hyphens. Only applies if what = "word".

remove_url

if TRUE, find and eliminate URLs beginning with http(s) -- see section "Dealing with URLs".

ngrams

integer vector of the n for n-grams, defaulting to 1 (unigrams). For bigrams, for instance, use 2; for bigrams and unigrams, use 1:2. You can even include irregular sequences such as 2:3 for bigrams and trigrams only. See tokens_ngrams.

skip

integer vector specifying the skips for skip-grams, default is 0 for only immediately neighbouring words. Only applies if ngrams is different from the default of 1. See skipgrams.

concatenator

character to use in concatenating n-grams, default is "_", which is recommended since this is included in the regular expression and Unicode definitions of "word" characters

simplify

if TRUE, return a single character vector of tokens rather than a list of length ndoc(texts) in which each element is a character vector of the tokens for the corresponding text.

verbose

if TRUE, print timing messages to the console; off by default

Value

A tokenizedTexts (S3) object: essentially a list of length ndoc(x), with each element a character vector of the tokens found in the corresponding text. If simplify = TRUE, a single character vector of tokens is returned instead.

Dealing with URLs

URLs are tricky to tokenize, because they contain a number of symbols and punctuation characters. If you wish to remove these, as most people do, and your text contains URLs, then you should set what = "fasterword" and remove_url = TRUE. If you wish to keep the URLs, but do not want them mangled, then your options are more limited, since removing punctuation and symbols will also remove them from URLs. We are working on improving this behaviour.

See the examples below.

Details

The tokenizer is designed to be fast and flexible as well as to handle Unicode correctly. Most of the time, users will construct dfm objects from texts or a corpus, without calling tokenize() as an intermediate step. Since tokenize() is most likely to be used by more technical users, we have set its options to default to minimal intervention. This means that punctuation is tokenized as well, and that nothing is removed by default from the text being tokenized except inter-word spacing and equivalent characters.

as.tokenizedTexts coerces a list of character tokens to a tokenizedTexts class object, making the methods defined for that class available for the coerced object.

as.tokenizedTexts likewise coerces a tokenizedTextsHashed (tokens) object back to a tokenizedTexts class object.
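The coercion from a plain list can be sketched as follows (a minimal example, assuming quanteda is attached; the list contents are made up, e.g. output from an external tokenizer):

```r
library(quanteda)

# a named list of character tokens, one element per document
toklist <- list(doc1 = c("the", "quick", "fox"),
                doc2 = c("jumped", "over"))
toks <- as.tokenizedTexts(toklist)
is.tokenizedTexts(toks)  # TRUE; tokenizedTexts methods now apply
```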

See Also

tokens_ngrams

Examples

# NOT RUN {
# same for character vectors and for lists
tokensFromChar <- tokenize(data_corpus_inaugural[1:3])
tokensFromCorp <- tokenize(corpus_subset(data_corpus_inaugural, Year<1798))
identical(tokensFromChar, tokensFromCorp)
str(tokensFromChar)
# returned as a list
head(tokenize(data_corpus_inaugural[57])[[1]], 10)
# returned as a character vector using simplify=TRUE
head(tokenize(data_corpus_inaugural[57], simplify = TRUE), 10)

# removing punctuation marks and lowercasing texts
head(tokenize(char_tolower(data_corpus_inaugural[57]), simplify = TRUE, remove_punct = TRUE), 30)
# keeping case and punctuation
head(tokenize(data_corpus_inaugural[57], simplify = TRUE), 30)
# keeping versus removing hyphens
tokenize("quanteda data objects are auto-loading.", remove_punct = TRUE)
tokenize("quanteda data objects are auto-loading.", remove_punct = TRUE, remove_hyphens = TRUE)
# keeping versus removing symbols
tokenize("<tags> and other + symbols.", remove_symbols = FALSE)
tokenize("<tags> and other + symbols.", remove_symbols = TRUE)
tokenize("<tags> and other + symbols.", remove_symbols = FALSE, what = "fasterword")
tokenize("<tags> and other + symbols.", remove_symbols = TRUE, what = "fasterword")

## examples with URLs - hardly perfect!
txt <- "Repo https://github.com/kbenoit/quanteda, and www.stackoverflow.com."
tokenize(txt, remove_url = TRUE, remove_punct = TRUE)
tokenize(txt, remove_url = FALSE, remove_punct = TRUE)
tokenize(txt, remove_url = FALSE, remove_punct = TRUE, what = "fasterword")
tokenize(txt, remove_url = FALSE, remove_punct = FALSE, what = "fasterword")


## MORE COMPARISONS
txt <- "#textanalysis is MY <3 4U @myhandle gr8 #stuff :-)"
tokenize(txt, remove_punct = TRUE)
tokenize(txt, remove_punct = TRUE, remove_twitter = TRUE)
#tokenize("great website http://textasdata.com", remove_url = FALSE)
#tokenize("great website http://textasdata.com", remove_url = TRUE)

txt <- c(text1="This is $10 in 999 different ways,\n up and down; left and right!", 
         text2="@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokenize(txt, verbose = TRUE)
tokenize(txt, remove_numbers = TRUE, remove_punct = TRUE)
tokenize(txt, remove_numbers = FALSE, remove_punct = TRUE)
tokenize(txt, remove_numbers = TRUE, remove_punct = FALSE)
tokenize(txt, remove_numbers = FALSE, remove_punct = FALSE)
tokenize(txt, remove_numbers = FALSE, remove_punct = FALSE, remove_separators = FALSE)
tokenize(txt, remove_numbers = TRUE, remove_punct = TRUE, remove_url = TRUE)

# character level
tokenize("Great website: http://textasdata.com?page=123.", what = "character")
tokenize("Great website: http://textasdata.com?page=123.", what = "character", 
         remove_separators = FALSE)

# sentence level         
tokenize(c("Kurt Vonnegut said; only assholes use semi-colons.", 
           "Today is Thursday in Canberra:  It is yesterday in London.", 
           "Today is Thursday in Canberra:  \nIt is yesterday in London.",
           "To be?  Or\nnot to be?"), 
          what = "sentence")
tokenize(data_corpus_inaugural[c(2,40)], what = "sentence", simplify = TRUE)

# removing features (stopwords) from tokenized texts
txt <- char_tolower(c(mytext1 = "This is a short test sentence.",
                      mytext2 = "Short.",
                      mytext3 = "Short, shorter, and shortest."))
tokenize(txt, remove_punct = TRUE)
removeFeatures(tokenize(txt, remove_punct = TRUE), stopwords("english"))

# ngram tokenization
tokenize(txt, remove_punct = TRUE, ngrams = 2)
tokenize(txt, remove_punct = TRUE, ngrams = 2, skip = 1, concatenator = " ")
tokenize(txt, remove_punct = TRUE, ngrams = 1:2)
# removing features from ngram tokens
removeFeatures(tokenize(txt, remove_punct = TRUE, ngrams = 1:2), stopwords("english"))
# }
