
localLLM (version 1.1.0)

tokenize: Convert Text to Token IDs

Description

Converts text into a sequence of integer token IDs that the language model can process. This is the first step in text generation, as models work with tokens rather than raw text. Different models may use different tokenization schemes (BPE, SentencePiece, etc.).
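Because tokenization is model-specific, the same text can produce different token IDs and counts under different models. A minimal sketch of comparing two tokenizers, assuming two hypothetical GGUF files (the paths are placeholders):

if (FALSE) {
# Hypothetical model paths; token counts depend on each model's tokenizer
model_a <- model_load("path/to/model-a.gguf")
model_b <- model_load("path/to/model-b.gguf")

text <- "Tokenization schemes differ between models."
length(tokenize(model_a, text))  # count under model A's tokenizer
length(tokenize(model_b, text))  # may differ under model B's tokenizer
}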

Usage

tokenize(model, text, add_special = TRUE)

Value

Integer vector of token IDs corresponding to the input text. These can be used with generate for text generation or with detokenize to convert back to text.
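For example, a round trip back to text might look like the sketch below; it assumes detokenize(model, tokens) accepts the model object and the integer vector returned here, and the model path is a placeholder.

if (FALSE) {
model <- model_load("path/to/model.gguf")
tokens <- tokenize(model, "Hello, world!")
# Assumes detokenize(model, tokens) reverses the mapping
detokenize(model, tokens)  # should recover "Hello, world!" (up to whitespace handling)
}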

Arguments

model

A model object created with model_load

text

Character string or character vector to tokenize. Can be a single text or multiple texts.

add_special

Whether to add special tokens such as BOS (Beginning of Sequence) and EOS (End of Sequence) (default: TRUE). These tokens help models understand text boundaries.
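As a rough sketch of the effect of add_special (exact counts are model-dependent, since BOS/EOS handling varies by tokenizer; the model path is a placeholder):

if (FALSE) {
model <- model_load("path/to/model.gguf")
with_special    <- tokenize(model, "Hello", add_special = TRUE)
without_special <- tokenize(model, "Hello", add_special = FALSE)
length(with_special) - length(without_special)  # typically 1 or 2 extra tokens
}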

See Also

detokenize, generate, model_load

Examples

if (FALSE) {
# Load model
model <- model_load("path/to/model.gguf")

# Basic tokenization
tokens <- tokenize(model, "Hello, world!")
print(tokens)  # e.g., c(15339, 11, 1917, 0)

# Tokenize without adding special tokens (e.g., when continuing existing text)
raw_tokens <- tokenize(model, "Continue this text", add_special = FALSE)

# Tokenize multiple texts
batch_tokens <- tokenize(model, c("First text", "Second text"))

# Check tokenization of specific phrases
question_tokens <- tokenize(model, "What is AI?")
print(length(question_tokens))  # Number of tokens
}
