quanteda (version 0.7.2-1)

tokenizeOnly

Description

For performance comparisons of tokenize-only functions. When x is a vector of texts, each function uses lapply to return a list of tokenized texts.

Usage

tokenizeOnlyCppKB(x, sep = " ", minLength = 1)

tokenizeOnlyScan(x, sep = " ")

Arguments

x
text(s) to be tokenized
sep
separator delineating tokens
minLength
minimum length in characters of tokens to be retained

Value

  • a list of character vectors, with each list element consisting of a tokenized text

Details

tokenizeOnlyCppKB calls a C++ function, adapted by KB from Kohei's code, that performs tokenization without any cleaning.
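The split-then-filter behaviour can be approximated in base R. This is a minimal sketch for illustration only, not the package's C++ implementation; the function name tokenizeSketch is hypothetical:

```r
# Hypothetical base-R approximation of tokenize-without-cleaning:
# split each text on the separator, then drop tokens shorter than minLength.
tokenizeSketch <- function(x, sep = " ", minLength = 1) {
    lapply(x, function(txt) {
        toks <- strsplit(txt, sep, fixed = TRUE)[[1]]
        toks[nchar(toks) >= minLength]
    })
}

tokenizeSketch(c("a bb ccc", "hello world"), minLength = 2)
```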

tokenizeOnlyScan calls the R function scan for tokenization.
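A scan-based tokenizer can be sketched as follows; this is an assumption about the approach, not the package source, and the wrapper name tokenizeScanSketch is hypothetical:

```r
# Hypothetical sketch of scan-based tokenization: read each text as
# character fields separated by sep, suppressing scan's progress output.
tokenizeScanSketch <- function(x, sep = " ") {
    lapply(x, function(txt) {
        scan(text = txt, what = "character", sep = sep,
             quote = "", quiet = TRUE)
    })
}

tokenizeScanSketch(c("a b c", "hello world"))
```

Note that scan treats each occurrence of sep as a field boundary, so repeated separators yield empty-string tokens, unlike a whitespace-collapsing tokenizer.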

Examples

# on inaugural speeches
# system.time(tmp1 <- tokenizeOnlyCppKW(inaugTexts))
system.time(tmp2 <- tokenizeOnlyCppKB(inaugTexts))
system.time(tmp3 <- tokenizeOnlyScan(inaugTexts))

# on a longer set of texts
load('~/Dropbox/QUANTESS/Manuscripts/Collocations/Corpora/lauderdaleClark/Opinion_files.RData')
txts <- unlist(Opinion_files[1])
names(txts) <- NULL
# system.time(tmp4 <- tokenizeOnlyCppKW(txts))
## about  9.2 seconds on Ken's MacBook Pro
system.time(tmp5 <- tokenizeOnlyCppKB(txts))
## about  7.0 seconds
system.time(tmp6 <- tokenizeOnlyScan(txts))
## about 12.6 seconds
