BPEembed

Use a sentencepiece model to tokenise text and get the embeddings of the resulting tokens.

An unsupervised text tokenizer that performs byte pair encoding and unigram modelling.
Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer that splits text into words and smaller subword units.
The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>.
Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec',
as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.

Jan Wijffels

sentencepiece

Text Tokenization using Byte Pair Encoding and Unigram Modelling

BNOSAC 

Google Inc. 

The Abseil Authors 

Kenton Varda (Google Inc.) 

Sanjay Ghemawat (Google Inc.) 

Jeff Dean (Google Inc.) 

Laszlo Csomor (Google Inc.) 

Wink Saville (Google Inc.) 

Jim Meehan (Google Inc.) 

Chris Atenasio (Google Inc.) 

Jason Hsueh (Google Inc.) 

Anton Carver (Google Inc.) 

Maxim Lifantsev (Google Inc.) 

Susumu Yata 

Daisuke Okanohara 

Yuta Mori 

Benjamin Heinzerling 

BPEembed function

<dl><dt>file_sentencepiece</dt>
<dd>the path to the file containing the sentencepiece model</dd>
<dt>file_word2vec</dt>
<dd>the path to the file containing the word2vec embeddings</dd>
<dt>x</dt>
<dd>the result of a call to <code>sentencepiece_download_model</code>. 
If this is provided, arguments <code>file_sentencepiece</code> and <code>file_word2vec</code> will not be used.</dd>
<dt>normalize</dt>
<dd>passed on to <code>read.wordvectors</code> to read in <code>file_word2vec</code>. Defaults to <code>TRUE</code>.</dd></dl>
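The arguments above can be sketched in a short R example. It is a minimal sketch, not a definitive recipe: it assumes the 'sentencepiece' package is installed together with the 'word2vec' package (needed for <code>read.wordvectors</code>), and it assumes that a small Dutch model is bundled under the package's <code>models</code> folder with the file names shown. Those file names, and the arguments of <code>sentencepiece_download_model</code> in the commented lines, are assumptions and may differ between package versions.

```r
library(sentencepiece)

# Assumed file names of a bundled Dutch BPEmb model; adjust to the files you have.
folder    <- system.file(package = "sentencepiece", "models")
model     <- file.path(folder, "nl.wiki.bpe.vs1000.model")
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.bin")

emb      <- NULL
subwords <- NULL

if (file.exists(model) && file.exists(embedding)) {
  # Load the sentencepiece model and the word2vec embeddings from disk.
  encoder <- BPEembed(file_sentencepiece = model,
                      file_word2vec = embedding,
                      normalize = TRUE)

  txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
           "Carlo Ancelotti is de trainer van Real Madrid.")

  # Tokenise each text into subwords and look up the embedding of each subword.
  emb <- predict(encoder, txt, type = "encode")
  str(emb)

  # Only tokenise, without looking up embeddings.
  subwords <- predict(encoder, txt, type = "tokenize")
  print(subwords)
}

# Alternative (needs internet access; argument names are assumed):
# dl      <- sentencepiece_download_model("nl", vocab_size = 1000, dim = 25,
#                                         model_dir = tempdir())
# encoder <- BPEembed(x = dl)
```

When <code>x</code> is supplied as in the commented lines, <code>file_sentencepiece</code> and <code>file_word2vec</code> are not used, so there is no need to pass the file paths yourself.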

Arguments

Tokenise and embed text alongside a Sentencepiece and Word2vec model — BPEembed


