RBERT (version 0.1.11)

tokenize: Tokenizers for various objects.

Description

This tokenizer performs some basic cleaning, then splits up text on whitespace and punctuation.
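
As a rough illustration only (this is not RBERT's implementation, and the package's exact cleaning rules may differ), the whitespace-and-punctuation splitting can be sketched in base R:

# Hypothetical sketch: lowercase, pad punctuation with spaces, then split
# on runs of whitespace. RBERT's actual cleaning rules may differ.
basic_split <- function(text, do_lower_case = TRUE) {
  if (do_lower_case) {
    text <- tolower(text)
  }
  # Surround each punctuation character with spaces so it becomes a token.
  text <- gsub("([[:punct:]])", " \\1 ", text)
  # Split on whitespace and drop any empty strings.
  tokens <- unlist(strsplit(trimws(text), "\\s+"))
  tokens[tokens != ""]
}

basic_split("Don't panic!")
# [1] "don"   "'"     "t"     "panic" "!"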

Usage

tokenize(tokenizer, text)

# S3 method for FullTokenizer
tokenize(tokenizer, text)

# S3 method for BasicTokenizer
tokenize(tokenizer, text)

# S3 method for WordpieceTokenizer
tokenize(tokenizer, text)

Arguments

tokenizer

The Tokenizer object to use.

text

The text to tokenize. For tokenize.WordpieceTokenizer, the text should have already been passed through BasicTokenizer.

Value

A list of tokens.

Methods (by class)

  • FullTokenizer: Tokenizer method for objects of class FullTokenizer.

  • BasicTokenizer: Tokenizer method for objects of class BasicTokenizer.

  • WordpieceTokenizer: Tokenizer method for objects of class WordpieceTokenizer. This method uses a greedy longest-match-first algorithm to perform tokenization with the given vocabulary. For example, input = "unaffable" should give output = list("un", "##aff", "##able") ... although, ironically, the released BERT vocabulary actually gives output = list("una", "##ffa", "##ble") for this input, even though the BERT code uses it as an example. A sketch of the algorithm follows this list.
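
To make greedy longest-match-first concrete, here is a minimal sketch in base R. It is not RBERT's implementation; the vocabulary handling, the "##" continuation prefix, and the unknown-token fallback are simplified assumptions based on the description above.

# Minimal sketch of greedy longest-match-first wordpiece tokenization.
# Assumes `vocab` is a character vector of pieces, with continuation
# pieces stored under a "##" prefix as in the BERT vocabulary files.
wordpiece_sketch <- function(word, vocab, unk_token = "[UNK]") {
  pieces <- character(0)
  start <- 1
  while (start <= nchar(word)) {
    end <- nchar(word)
    match <- NULL
    # Try the longest remaining substring first, shrinking from the
    # right until a piece is found in the vocabulary.
    while (end >= start) {
      piece <- substr(word, start, end)
      if (start > 1) {
        piece <- paste0("##", piece)
      }
      if (piece %in% vocab) {
        match <- piece
        break
      }
      end <- end - 1
    }
    if (is.null(match)) {
      return(list(unk_token))  # no piece matched; treat the word as unknown
    }
    pieces <- c(pieces, match)
    start <- end + 1
  }
  as.list(pieces)
}

wordpiece_sketch("unaffable", vocab = c("un", "##aff", "##able"))
# list("un", "##aff", "##able")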

Examples

# NOT RUN {
# Full tokenization: basic cleaning plus wordpiece, built from a vocabulary file.
tokenizer <- FullTokenizer("vocab.txt", TRUE)
tokenize(tokenizer, text = "a bunch of words")
# }
# NOT RUN {
# Basic tokenization only: cleaning and whitespace/punctuation splitting.
tokenizer <- BasicTokenizer(TRUE)
tokenize(tokenizer, text = "a bunch of words")
# }
# NOT RUN {
# Wordpiece tokenization only: load a vocabulary, then tokenize.
vocab <- load_vocab(vocab_file = "vocab.txt")
tokenizer <- WordpieceTokenizer(vocab)
tokenize(tokenizer, text = "a bunch of words")
# }
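
For this input, the basic tokenizer simply splits on whitespace, and, assuming each word appears whole in the vocabulary, the full and wordpiece tokenizers should return the same tokens, e.g. list("a", "bunch", "of", "words"); a word missing from the vocabulary would instead be broken into "##"-prefixed pieces or mapped to the unknown token.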
