predict.BPEembed

Use the sentencepiece model to do one of the following:<ul>
<li>encode: tokenise and embed text</li>
<li>decode: recover the untokenised text from tokenised data</li>
<li>tokenize: only tokenise the text with the sentencepiece model, without embedding it</li>
</ul>

Unsupervised text tokenizer that performs byte pair encoding and unigram modelling.
Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer that splits text into words and smaller subword units.
The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>.
Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec',
as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.

Jan Wijffels

sentencepiece

Text Tokenization using Byte Pair Encoding and Unigram Modelling

BNOSAC 

Google Inc. 

The Abseil Authors 

Kenton Varda (Google Inc.) 

Sanjay Ghemawat (Google Inc.) 

Jeff Dean (Google Inc.) 

Laszlo Csomor (Google Inc.) 

Wink Saville (Google Inc.) 

Jim Meehan (Google Inc.) 

Chris Atenasio (Google Inc.) 

Jason Hsueh (Google Inc.) 

Anton Carver (Google Inc.) 

Maxim Lifantsev (Google Inc.) 

Susumu Yata 

Daisuke Okanohara 

Yuta Mori 

Benjamin Heinzerling 

predict.BPEembed function

<dl><dt>object</dt>
<dd>an object of class BPEembed as returned by <code>BPEembed</code></dd>
<dt>newdata</dt>
<dd>a character vector of text to encode, a character vector of encoded tokens to decode, or a list of such vectors</dd>
<dt>type</dt>
<dd>character string, one of 'encode', 'decode' or 'tokenize'</dd>
<dt>...</dt>
<dd>further arguments passed on to the methods</dd></dl>
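
The three values of <code>type</code> can be tried out with a short sketch like the one below. It is an illustration rather than a definitive recipe: it assumes that the 'sentencepiece' package ships a small Dutch sample model and matching embedding file in its <code>models</code> folder (the file names used here are assumptions about the installed package) and that the 'word2vec' package is installed so the embeddings can be read.

```r
library(sentencepiece)

# Assumption: the package bundles a small Dutch BPEmb sample model in its
# "models" folder; replace these paths with your own files if they are missing.
model     <- system.file(package = "sentencepiece", "models",
                         "nl.wiki.bpe.vs1000.model")
embedding <- system.file(package = "sentencepiece", "models",
                         "nl.wiki.bpe.vs1000.d25.w2v.bin")
encoder   <- BPEembed(model, embedding)

txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")

# type = "encode": tokenise each text and look up one embedding row per subword
values <- predict(encoder, txt, type = "encode")
str(values)

# type = "decode": turn the subword tokens (the matrix row names) back into text
tokens  <- lapply(values, FUN = rownames)
decoded <- predict(encoder, tokens, type = "decode")
print(decoded)

# type = "tokenize": split the text into subwords only, without embedding them
subwords <- predict(encoder, txt, type = "tokenize")
print(subwords)
```

Passing a list to <code>type = "decode"</code>, as done here, decodes each text separately; a single character vector of tokens can be passed instead to decode one text.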

Arguments

Encode and Decode alongside a BPEembed model — predict.BPEembed


