textreuse (version 0.1.5)

tokenizers: Split texts into tokens

Description

These functions each turn a text into tokens. The tokenize_ngrams and tokenize_skip_ngrams functions return shingled n-grams: overlapping word sequences of length n.
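
For illustration, word shingling can be sketched in a few lines of base R. The shingle helper below is a hypothetical approximation of the idea, not the package's implementation:

shingle <- function(words, n) {
  # Slide a window of n words along the vector, joining each window with spaces
  starts <- seq_len(max(length(words) - n + 1, 0))
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}
shingle(c("how", "many", "roads", "must"), n = 2)
# "how many"  "many roads"  "roads must"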

Usage

tokenize_words(string, lowercase = TRUE)

tokenize_sentences(string, lowercase = TRUE)

tokenize_ngrams(string, lowercase = TRUE, n = 3)

tokenize_skip_ngrams(string, lowercase = TRUE, n = 3, k = 1)

Arguments

string

A character vector of length 1 to be tokenized.

lowercase

Should the tokens be made lower case?

n

For n-gram tokenizers, the number of words in each n-gram.

k

For the skip n-gram tokenizer, the maximum skip distance between words. The function will compute all skip n-grams for skip distances from 0 to k (see the sketch below).
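
As a rough illustration, assuming that a skip distance of s means taking every (s + 1)th word, a single fixed-skip pass could look like the hypothetical helper below; the tokenizer itself pools the results for every skip from 0 to k:

skip_shingle <- function(words, n, s) {
  # Take n words spaced (s + 1) positions apart from each feasible start
  span <- (n - 1) * (s + 1) + 1
  starts <- seq_len(max(length(words) - span + 1, 0))
  vapply(starts,
         function(i) paste(words[seq(i, by = s + 1, length.out = n)],
                           collapse = " "),
         character(1))
}
skip_shingle(c("a", "b", "c", "d", "e"), n = 3, s = 1)
# "a c e"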

Value

A character vector containing the tokens.

Details

These functions will strip all punctuation.
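
For example, lowercasing and stripping punctuation is roughly (though not exactly) what this base R one-liner does:

gsub("[[:punct:]]", "", tolower("The answer is blowin' in the wind."))
# "the answer is blowin in the wind"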

Examples

library(textreuse)
dylan <- "How many roads must a man walk down? The answer is blowin' in the wind."
tokenize_words(dylan)
tokenize_sentences(dylan)
tokenize_ngrams(dylan, n = 2)
tokenize_skip_ngrams(dylan, n = 3, k = 2)
