quanteda (version 1.5.2)

tokens_recompile: recompile a serialized tokens object

Description

This function recompiles a serialized tokens object when the vocabulary has been changed in a way that makes some of its types identical (for example, lowercasing a type when its lowercased version already exists in the type table), or that introduces gaps in the integer map of the types. It also re-indexes the types attribute to account for types that may have become duplicates through a procedure such as stemming or lowercasing, or for new types added through compounding.
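
A minimal sketch of the duplicated-types case (assuming quanteda 1.5.2 is attached; the exact integer IDs depend on the tokenizer):

library(quanteda)
toks <- tokens(c(d1 = "Apple apple"))                     # two distinct types: "Apple", "apple"
attr(toks, "types") <- char_tolower(attr(toks, "types"))  # both types are now "apple"
unclass(quanteda:::tokens_recompile(toks))                # duplicate types merged into a single ID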

Usage

tokens_recompile(x, method = c("C++", "R"), gap = TRUE, dup = TRUE)

Arguments

x

the tokens object to be recompiled

method

"C++" for C++ implementation or "R" for an older R-based method

gap

if TRUE, remove gaps between token IDs

dup

if TRUE, merge duplicated token types into the same ID
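
The two steps can be toggled independently. A sketch based on the argument descriptions above (both flags default to TRUE; quanteda 1.5.2 assumed):

library(quanteda)
toks <- tokens(c(d1 = "a b c A B"))
attr(toks, "types") <- char_tolower(attr(toks, "types"))             # "a"/"A" and "b"/"B" become duplicates
unclass(quanteda:::tokens_recompile(toks, dup = TRUE, gap = FALSE))  # merge duplicated types only
unclass(quanteda:::tokens_recompile(toks))                           # defaults: merge duplicates and close ID gaps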

Examples

# lowercasing
toks1 <- tokens(c(one = "a b c d A B C D",
                  two = "A B C d"))
attr(toks1, "types") <- char_tolower(attr(toks1, "types"))
unclass(toks1)
unclass(quanteda:::tokens_recompile(toks1))

# stemming
toks2 <- tokens("Stemming stemmed many word stems.")
unclass(toks2)
unclass(quanteda:::tokens_recompile(tokens_wordstem(toks2)))

# compounding
toks3 <- tokens("One two three four.")
unclass(toks3)
unclass(tokens_compound(toks3, "two three"))

# lookup
dict <- dictionary(list(test = c("one", "three")))
unclass(tokens_lookup(toks3, dict))

# empty pads
unclass(tokens_select(toks3, dict))
unclass(tokens_select(toks3, dict, padding = TRUE))

# ngrams
unclass(tokens_ngrams(toks3, n = 2:3))

