pre_tokenizer_whitespace

<p>This pre-tokenizer splits the input using the regex <code>\w+|[^\w\s]+</code>: each run of word characters and each run of non-word, non-whitespace characters (such as punctuation) becomes its own piece, and whitespace is dropped.</p>
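<p>To see what this regex does on its own, the following minimal sketch applies it with base R's <code>regmatches()</code> and <code>gregexpr()</code> rather than through the tok package. The helper name <code>split_whitespace_regex</code> and the sample sentence are hypothetical, and PCRE (<code>perl = TRUE</code>) is assumed to behave like the library's regex engine for plain ASCII input; Unicode handling and any offsets the real pre-tokenizer tracks are not modelled here.</p>

```r
# Illustration only: emulate the split pattern with base R (this is not the tok API).
split_whitespace_regex <- function(text) {
  pattern <- "\\w+|[^\\w\\s]+"
  # perl = TRUE selects PCRE, where \w and \s are valid inside a bracket expression.
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

pieces <- split_whitespace_regex("Hello, world! It's 2024.")
print(pieces)
```

<p>For this input the pieces are <code>Hello</code>, <code>,</code>, <code>world</code>, <code>!</code>, <code>It</code>, <code>'</code>, <code>s</code>, <code>2024</code> and <code>.</code>. The apostrophe splits <code>It's</code> into three pieces because it is neither a word character nor whitespace.</p>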

Interfaces with the 'Hugging Face' tokenizers library to provide implementations
of today's most used tokenizers such as the 'Byte-Pair Encoding' algorithm
<https://huggingface.co/docs/tokenizers/index>. It's extremely fast for both
training new vocabularies and tokenizing texts.

Daniel Falbel

Fast Text Tokenization

pre_tokenizer_whitespace function

Super class

<p><code>tok::tok_pre_tokenizer</code> -&gt; <code>tok_pre_tokenizer_whitespace</code></p>

Methods

<div class="section" id="public-methods">
<h3>Public methods</h3><p></p><p><ul>
<li><p><a href="#method-tok_pre_tokenizer_whitespace-new"><code>pre_tokenizer_whitespace$new()</code></a></p></li>
<li><p><a href="#method-tok_pre_tokenizer_whitespace-clone"><code>pre_tokenizer_whitespace$clone()</code></a></p></li>
</ul></p><p></p></div><p><hr>
<a id="method-tok_pre_tokenizer_whitespace-new"></a></p><div class="section" id="method-new-">
<h3>Method <code>new()</code></h3>
<p>Initializes the whitespace pre-tokenizer.</p><div class="section" id="usage">
<h4>Usage</h4>
<p><div class="r"></div></p><pre><code>pre_tokenizer_whitespace$new()</code></pre><p></p></div><p></p>
</div><p></p><p>
</p><p><hr>
<a id="method-tok_pre_tokenizer_whitespace-clone"></a></p><div class="section" id="method-clone-">
<h3>Method <code>clone()</code></h3>
<p>The objects of this class are cloneable with this method.</p><div class="section" id="usage">
<h4>Usage</h4>
<p><div class="r"></div></p><pre><code>pre_tokenizer_whitespace$clone(deep = FALSE)</code></pre><p></p></div><p></p>
</div><p></p><p><div class="section" id="arguments">
<h4>Arguments</h4>
<p><div class="arguments"></div></p><dl>
<dt><code>deep</code></dt>
<dd><p>Whether to make a deep clone.</p></dd></dl></div></p>
