
sentencepiece (version 0.2.3)

BPEembedder: Build a BPEembed model containing a Sentencepiece and Word2vec model

Description

Builds a sentencepiece model on text and fits a matching word2vec model on the sentencepiece vocabulary.

Usage

BPEembedder(
  x,
  tokenizer = c("bpe", "char", "unigram", "word"),
  args = list(vocab_size = 8000, coverage = 0.9999),
  ...
)

Value

An object of class BPEembed, which is a list with elements:

  • model: a sentencepiece model as loaded with sentencepiece_load_model

  • embedding: a matrix with embeddings as loaded with read.wordvectors

  • dim: the dimension of the embedding

  • n: the number of elements in the vocabulary

  • file_sentencepiece: the sentencepiece model file

  • file_word2vec: the word2vec embedding file
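
A minimal sketch of inspecting these elements, assuming `model` was built as in the Examples section below:

```r
class(model)              # "BPEembed"
model$dim                 # dimension of the embedding
model$n                   # number of elements in the vocabulary
dim(model$embedding)      # embedding matrix: one row per vocabulary element
model$file_sentencepiece  # path to the sentencepiece model file
model$file_word2vec       # path to the word2vec embedding file
```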

Arguments

x

a data.frame with columns doc_id and text

tokenizer

character string with the type of sentencepiece tokenizer: either 'bpe' (Byte Pair Encoding), 'char' (character-level encoding), 'unigram' (unigram encoding) or 'word' (pretokenised word encoding). Defaults to 'bpe'. Passed on to sentencepiece

args

a list of arguments passed on to sentencepiece

...

arguments passed on to word2vec for training a word2vec model

See Also

sentencepiece, word2vec, predict.BPEembed

Examples

library(tokenizers.bpe)
## example data: speeches from the Belgian parliament
data(belgium_parliament, package = "tokenizers.bpe")
x     <- subset(belgium_parliament, language %in% "dutch")
## build a BPE sentencepiece model with a vocabulary of 1000 subwords and
## train a cbow word2vec model of dimension 20 on that subword vocabulary
model <- BPEembedder(x, tokenizer = "bpe", args = list(vocab_size = 1000),
                     type = "cbow", dim = 20, iter = 10)
model

## encode a new sentence into subword embeddings
txt    <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.")
values <- predict(model, txt, type = "encode")
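A hedged sketch of working with the encoded result, assuming `predict(..., type = "encode")` returns one embedding matrix per input sentence, with one row per subword token and `model$dim` columns:

```r
length(values)    # one element per sentence in txt
emb <- values[[1]]
dim(emb)          # number of subword tokens x embedding dimension

## one simple way to get a single fixed-length sentence vector
## is to average the subword vectors (an illustration, not part
## of the package API)
sentence_vector <- colMeans(emb)
```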
