
sentencepiece (version 0.2.3)

BPEembedder: Build a BPEembed model containing a Sentencepiece and Word2vec model

Description

Builds a sentencepiece model on text and fits a matching word2vec model on the sentencepiece vocabulary.

Usage

BPEembedder(
  x,
  tokenizer = c("bpe", "char", "unigram", "word"),
  args = list(vocab_size = 8000, coverage = 0.9999),
  ...
)

Value

An object of class BPEembed, which is a list with elements:

  • model: a sentencepiece model as loaded with sentencepiece_load_model

  • embedding: a matrix with embeddings as loaded with read.wordvectors

  • dim: the dimension of the embedding

  • n: the number of elements in the vocabulary

  • file_sentencepiece: the sentencepiece model file

  • file_word2vec: the word2vec embedding file
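
A minimal sketch of inspecting these elements, assuming `model` was built as in the Examples section below:

```r
class(model)              # "BPEembed"
model$dim                 # dimension of the embedding
model$n                   # number of elements in the vocabulary
dim(model$embedding)      # embedding matrix: one row per vocabulary element
model$file_sentencepiece  # path to the sentencepiece model file
model$file_word2vec       # path to the word2vec embedding file
```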

Arguments

x

a data.frame with columns doc_id and text

tokenizer

character string with the type of sentencepiece tokenizer: either 'bpe' (Byte Pair Encoding), 'char' (character-level encoding), 'unigram' (unigram encoding) or 'word' (pretokenised word encoding). Defaults to 'bpe'. Passed on to sentencepiece

args

a list of arguments passed on to sentencepiece

...

arguments passed on to word2vec for training a word2vec model

See Also

sentencepiece, word2vec, predict.BPEembed

Examples

library(tokenizers.bpe)
## example data: speeches from the Belgian parliament
data(belgium_parliament, package = "tokenizers.bpe")
x     <- subset(belgium_parliament, language %in% "dutch")
## build a BPE sentencepiece model with a vocabulary of 1000 subwords and
## train a cbow word2vec model of dimension 20 on that subword vocabulary
model <- BPEembedder(x, tokenizer = "bpe", args = list(vocab_size = 1000),
                     type = "cbow", dim = 20, iter = 10)
model

## encode a new sentence into subword embeddings
txt    <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.")
values <- predict(model, txt, type = "encode")
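A hedged sketch of working with the encoded result, assuming `predict(..., type = "encode")` returns one embedding matrix per input sentence, with one row per subword token and `model$dim` columns:

```r
length(values)    # one element per sentence in txt
emb <- values[[1]]
dim(emb)          # number of subword tokens x embedding dimension

## one simple way to get a single fixed-length sentence vector
## is to average the subword vectors (an illustration, not part
## of the package API)
sentence_vector <- colMeans(emb)
```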
