RBERT (version 0.1.11)

tokenize: Tokenizers for various objects.

Description

This tokenizer performs some basic cleaning, then splits up text on whitespace and punctuation.
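
As a rough illustration only (this is not RBERT's implementation, and the package's exact cleaning rules may differ), the whitespace-and-punctuation splitting can be sketched in base R:

# Hypothetical sketch: lowercase, pad punctuation with spaces, then split
# on runs of whitespace. RBERT's actual cleaning rules may differ.
basic_split <- function(text, do_lower_case = TRUE) {
  if (do_lower_case) {
    text <- tolower(text)
  }
  # Surround each punctuation character with spaces so it becomes a token.
  text <- gsub("([[:punct:]])", " \\1 ", text)
  # Split on whitespace and drop any empty strings.
  tokens <- unlist(strsplit(trimws(text), "\\s+"))
  tokens[tokens != ""]
}

basic_split("Don't panic!")
# [1] "don"   "'"     "t"     "panic" "!"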

Usage

tokenize(tokenizer, text)

# S3 method for FullTokenizer
tokenize(tokenizer, text)

# S3 method for BasicTokenizer
tokenize(tokenizer, text)

# S3 method for WordpieceTokenizer
tokenize(tokenizer, text)

Arguments

tokenizer

The Tokenizer object to use.

text

The text to tokenize. For tokenize.WordpieceTokenizer, the text should have already been passed through BasicTokenizer.

Value

A list of tokens.

Methods (by class)

  • FullTokenizer: Tokenizer method for objects of class FullTokenizer.

  • BasicTokenizer: Tokenizer method for objects of class BasicTokenizer.

  • WordpieceTokenizer: Tokenizer method for objects of class WordpieceTokenizer. This method uses a greedy longest-match-first algorithm to perform tokenization with the given vocabulary. For example, input = "unaffable" should give output = list("un", "##aff", "##able") ... although, ironically, the released BERT vocabulary actually gives output = list("una", "##ffa", "##ble") for this input, even though the BERT code uses it as an example. A sketch of the algorithm follows this list.
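
To make greedy longest-match-first concrete, here is a minimal sketch in base R. It is not RBERT's implementation; the vocabulary handling, the "##" continuation prefix, and the unknown-token fallback are simplified assumptions based on the description above.

# Minimal sketch of greedy longest-match-first wordpiece tokenization.
# Assumes `vocab` is a character vector of pieces, with continuation
# pieces stored under a "##" prefix as in the BERT vocabulary files.
wordpiece_sketch <- function(word, vocab, unk_token = "[UNK]") {
  pieces <- character(0)
  start <- 1
  while (start <= nchar(word)) {
    end <- nchar(word)
    match <- NULL
    # Try the longest remaining substring first, shrinking from the
    # right until a piece is found in the vocabulary.
    while (end >= start) {
      piece <- substr(word, start, end)
      if (start > 1) {
        piece <- paste0("##", piece)
      }
      if (piece %in% vocab) {
        match <- piece
        break
      }
      end <- end - 1
    }
    if (is.null(match)) {
      return(list(unk_token))  # no piece matched; treat the word as unknown
    }
    pieces <- c(pieces, match)
    start <- end + 1
  }
  as.list(pieces)
}

wordpiece_sketch("unaffable", vocab = c("un", "##aff", "##able"))
# list("un", "##aff", "##able")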

Examples

# NOT RUN {
# Full tokenization: basic cleaning plus wordpiece, built from a vocabulary file.
tokenizer <- FullTokenizer("vocab.txt", TRUE)
tokenize(tokenizer, text = "a bunch of words")
# }
# NOT RUN {
# Basic tokenization only: cleaning and whitespace/punctuation splitting.
tokenizer <- BasicTokenizer(TRUE)
tokenize(tokenizer, text = "a bunch of words")
# }
# NOT RUN {
# Wordpiece tokenization only: load a vocabulary, then tokenize.
vocab <- load_vocab(vocab_file = "vocab.txt")
tokenizer <- WordpieceTokenizer(vocab)
tokenize(tokenizer, text = "a bunch of words")
# }
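
For this input, the basic tokenizer simply splits on whitespace, and, assuming each word appears whole in the vocabulary, the full and wordpiece tokenizers should return the same tokens, e.g. list("a", "bunch", "of", "words"); a word missing from the vocabulary would instead be broken into "##"-prefixed pieces or mapped to the unknown token.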
