sentencepiece_encode: Tokenise text using a Sentencepiece model

Description

Tokenise text using a Sentencepiece model

Usage

sentencepiece_encode(
  model,
  x,
  type = c("subwords", "ids"),
  nbest = -1L,
  alpha = 0.1
)

Value

A list of tokenised texts, one for each element of x.

If you provide nbest without providing alpha, the result is instead a list of lists, each containing nbest tokenised texts.

Arguments

model: an object of class sentencepiece as returned by sentencepiece_load_model or sentencepiece
x: a character vector of text (in UTF-8 Encoding)
type: a character string, either 'subwords' or 'ids' to get the subwords or the corresponding ids of these subwords as defined in the vocabulary of the model. Defaults to 'subwords'.
nbest: integer indicating the number of segmentations to extract. See the details. The argument is not used if you do not provide a value for it.
alpha: smoothing parameter to perform subword regularisation. Typical values are 0.1, 0.2 or 0.5. See the details. The argument is not used if you do not provide a value for it or do not provide a value for nbest.

Details

If you specify alpha to perform subword regularisation, keep in mind the following.
When alpha is 0.0, one segmentation is sampled uniformly from the nbest candidates or from the lattice. The best (Viterbi) segmentation is more likely to be sampled with larger alpha values such as 0.1.
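
The effect of alpha can be illustrated with a small base R sketch. It assumes the smoothed sampling distribution of the subword regularisation paper, in which each candidate is drawn with probability proportional to its probability raised to the power alpha. The candidate probabilities and the helper smooth_probs are hypothetical and not part of the package.

```r
# Hypothetical probabilities of four candidate segmentations,
# ordered from best (Viterbi) to worst. Illustrative values only.
p <- c(0.6, 0.25, 0.1, 0.05)

# Hypothetical helper: smoothed sampling distribution,
# proportional to p^alpha and renormalised to sum to 1.
smooth_probs <- function(p, alpha) {
  w <- p^alpha
  w / sum(w)
}

# alpha = 0 gives every candidate the same weight (uniform sampling)
smooth_probs(p, alpha = 0)

# larger alpha values shift probability mass towards the best candidate
smooth_probs(p, alpha = 0.1)
smooth_probs(p, alpha = 0.5)
```

Comparing the first element of each result shows how the chance of drawing the best segmentation grows with alpha.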

If you provide a positive value for nbest, one segmentation is approximately sampled from the nbest candidates.
If you provide a negative value for nbest, one segmentation is sampled from all hypotheses (the lattice) according to their generation probabilities, using the forward-filtering and backward-sampling algorithm.
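
As a rough illustration of forward-filtering and backward-sampling, the following base R sketch samples a segmentation of a short string under a toy unigram vocabulary. The vocabulary, its probabilities and the function sample_segmentation are assumptions made for this example. The sketch is not the implementation used by the library, which works on its own lattice structure.

```r
# Hypothetical unigram vocabulary with piece probabilities (illustrative only)
vocab <- c(a = 0.3, b = 0.2, c = 0.2, ab = 0.15, bc = 0.1, abc = 0.05)

# Hypothetical helper: sample one segmentation of `text` with probability
# proportional to the product of piece probabilities raised to the power alpha.
sample_segmentation <- function(text, vocab, alpha) {
  n <- nchar(text)
  # Forward pass: fwd[i + 1] is the total smoothed weight of all
  # segmentations of the first i characters.
  fwd <- numeric(n + 1)
  fwd[1] <- 1
  for (end in seq_len(n)) {
    for (start in seq_len(end)) {
      piece <- substr(text, start, end)
      if (piece %in% names(vocab)) {
        fwd[end + 1] <- fwd[end + 1] + fwd[start] * vocab[[piece]]^alpha
      }
    }
  }
  # Backward pass: starting from the end of the text, sample the last piece,
  # then repeat on the remaining prefix until nothing is left.
  pieces <- character(0)
  end <- n
  while (end > 0) {
    starts <- seq_len(end)
    cand <- substring(text, starts, end)
    ok <- cand %in% names(vocab)
    # Weight of a piece = weight of the prefix before it * its smoothed probability
    w <- fwd[starts[ok]] * vocab[cand[ok]]^alpha
    pick <- sample(sum(ok), 1, prob = w)
    pieces <- c(cand[ok][pick], pieces)
    end <- starts[ok][pick] - 1
  }
  pieces
}

set.seed(123)
sample_segmentation("abc", vocab, alpha = 0.1)
sample_segmentation("abc", vocab, alpha = 0)
```

The forward pass sums over every path in the lattice, so the backward pass can draw a complete segmentation without enumerating all of them. This is what makes sampling from the full lattice feasible when nbest is negative.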

nbest and alpha correspond respectively to the parameters l and alpha in the paper https://arxiv.org/abs/1804.10959, where nbest < 0 means l = infinity.

If the model is a BPE model, alpha is the merge probability p explained in https://arxiv.org/abs/1910.13267. In a BPE model, nbest-based sampling is not supported, so the nbest parameter is ignored, although it still needs to be provided if you want to make use of alpha.

Examples


library(sentencepiece)
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer.model")
model <- sentencepiece_load_model(file = model)

txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")
sentencepiece_encode(model, x = txt, type = "subwords")
sentencepiece_encode(model, x = txt, type = "ids")

## Examples using subword regularisation
model <- system.file(package = "sentencepiece", "models", "nl-fr-dekamer-unigram.model")
model <- sentencepiece_load_model(file = model)

txt <- c("Goed zo",
         "On est d'accord")
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 4)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 4)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 2)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 2)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = 4, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = 4, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = -1, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "ids", nbest = -1, alpha = 0.1)
sentencepiece_encode(model, x = txt, type = "subwords", nbest = -1, alpha = 0)
sentencepiece_encode(model, x = txt, type = "ids", nbest = -1, alpha = 0)
