textreuse (version 0.1.3)

tokenize: Recompute the tokens for a document or corpus

Description

Given a TextReuseTextDocument or a TextReuseCorpus, this function recomputes the tokens and hashes with the functions specified. Optionally, it can also recompute the minhash signatures.
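
For instance, here is a minimal sketch of recomputing tokens and adding minhash signatures to a corpus that was built without a tokenizer; the n-gram size and the minhash_generator() settings are illustrative choices, not defaults of this function:

library(textreuse)
# Build a corpus without tokenizing it up front
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
# Recompute tokens as word 5-grams and also compute minhash signatures
minhash <- minhash_generator(n = 200, seed = 235)
corpus <- tokenize(corpus, tokenize_ngrams, n = 5, minhash_func = minhash)
head(minhashes(corpus[[1]]))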

Usage

tokenize(x, tokenizer, ..., hash_func = hash_string, minhash_func = NULL, keep_tokens = FALSE, keep_text = TRUE)

Arguments

x
A TextReuseTextDocument or a TextReuseCorpus.
tokenizer
A function to split the text into tokens. See tokenizers.
...
Arguments passed on to the tokenizer.
hash_func
A function to hash the tokens. See hash_string.
minhash_func
A function to create minhash signatures. See minhash_generator.
keep_tokens
Should the tokens be saved in the returned document, or discarded? (See the sketch after this list.)
keep_text
Should the text be saved in the returned document, or discarded?
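
As a sketch of how these arguments interact on a single document (the bundled file name and the choice of tokenize_words are illustrative), retokenizing into words while keeping the tokens alongside their hashes might look like this:

library(textreuse)
# One of the sample legal documents shipped with the package
file <- system.file("extdata/legal/ca1851-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file)
# Retokenize into single words, hash with the default hash_string(),
# and keep the tokens in the returned document
doc <- tokenize(doc, tokenize_words, hash_func = hash_string, keep_tokens = TRUE)
head(tokens(doc))
head(hashes(doc))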

Value

The modified TextReuseTextDocument or TextReuseCorpus.

Examples

# Load the bundled legal documents without tokenizing them yet
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
# Recompute the tokens using n-gram tokenization
corpus <- tokenize(corpus, tokenize_ngrams)
head(tokens(corpus[[1]]))