
tokenizers.bpe (version 0.1.3)

bpe: Construct a Byte Pair Encoding model

Description

Train a Byte Pair Encoding (BPE) model on text and store it in a file on disk.

Usage

bpe(
  x,
  coverage = 0.9999,
  vocab_size = 5000,
  threads = -1L,
  pad_id = 0L,
  unk_id = 1L,
  bos_id = 2L,
  eos_id = 3L,
  model_path = file.path(getwd(), "youtokentome.bpe")
)

Value

an object of class youtokentome, as documented in bpe_load_model

Arguments

x

path to the text file containing training data or a character vector of text with training data

coverage

fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999

vocab_size

integer indicating the number of tokens in the final vocabulary

threads

integer with the number of CPU threads to use for model processing. If equal to -1, the minimum of the number of available threads and 8 is used

pad_id

integer, reserved id for padding

unk_id

integer, reserved id for unknown symbols

bos_id

integer, reserved id for the beginning-of-sentence token

eos_id

integer, reserved id for end of sentence token

model_path

path to the file on disk where the model will be stored. Defaults to 'youtokentome.bpe' in the current working directory
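
To make the coverage argument more concrete, the following base R sketch counts how many distinct characters are needed to account for a given fraction of all character occurrences in a text. It is only an illustration of the idea of character coverage: the helper char_coverage is hypothetical, is not part of tokenizers.bpe, and does not claim to reproduce how the underlying library selects its alphabet.

```r
# Illustration only: char_coverage() is a hypothetical helper, not a
# tokenizers.bpe function and not the package's internal algorithm.
char_coverage <- function(text, coverage = 0.9999) {
  # Split every string into single characters and count occurrences
  chars <- unlist(strsplit(text, split = ""))
  freq  <- sort(table(chars), decreasing = TRUE)
  # Cumulative share of all character occurrences, most frequent first
  cum   <- cumsum(as.numeric(freq)) / sum(freq)
  # Smallest number of distinct characters reaching the requested coverage
  n_keep <- which(cum >= coverage)[1]
  list(n_distinct = length(freq),
       n_keep     = n_keep,
       kept       = names(freq)[seq_len(n_keep)])
}

text <- c("aaaa bbbb cccc", "aaaa bbbb z")
res  <- char_coverage(text, coverage = 0.9)
res$n_distinct
res$n_keep
res$kept
```

With a coverage close to 1, such as the suggested 0.9999, only very rare characters fall outside the covered set, which is why that value is a reasonable default for most corpora.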

See Also

bpe_load_model

Examples

data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
model <- bpe(x$text, coverage = 0.999, vocab_size = 5000, threads = 1)
model
str(model$vocabulary)

text <- c("L'appartement est grand & vraiment bien situe en plein centre",
          "Proportion de femmes dans les situations de famille monoparentale.")
bpe_encode(model, x = text, type = "subwords")
bpe_encode(model, x = text, type = "ids")

encoded <- bpe_encode(model, x = text, type = "ids")
decoded <- bpe_decode(model, encoded)
decoded

## Remove the model file (Clean up for CRAN)
file.remove(model$model_path)
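
The example above trains on a character vector and writes the model to the default location. As a hedged sketch of the other documented input form, the code below passes a path to a text file as x, stores the model at an explicit model_path, and reloads it with bpe_load_model. The temporary file names are chosen here for illustration, and the sketch assumes that bpe_load_model accepts the model file path as its first argument.

```r
library(tokenizers.bpe)

data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")

# Write the training text to a temporary file so that x can be given as a path
corpus_file <- tempfile(fileext = ".txt")
writeLines(x$text, corpus_file)

# Store the model in a temporary file instead of the working directory
model_file <- tempfile(fileext = ".bpe")
model <- bpe(corpus_file, coverage = 0.999, vocab_size = 5000, threads = 1,
             model_path = model_file)

# Reload the stored model (assumption: the file path is the first argument)
reloaded <- bpe_load_model(model_file)
bpe_encode(reloaded, x = "Proportion de femmes", type = "subwords")

## Remove the temporary files
file.remove(corpus_file, model_file)
```

Writing to tempfile() locations keeps the working directory clean, which avoids the need to remove a youtokentome.bpe file afterwards.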
