
audubon (version 0.5.1)

strj_tokenize: Split text into tokens

Description

Splits text into tokens using the specified tokenizer.

Usage

strj_tokenize(
  text,
  format = c("list", "data.frame"),
  engine = c("stringi", "budoux", "tinyseg", "mecab", "sudachipy"),
  rcpath = NULL,
  mode = c("C", "B", "A"),
  split = FALSE
)

Value

A list or a data.frame.

Arguments

text

Character vector to be tokenized.

format

Output format. Either 'list' or 'data.frame'.

engine

Tokenizer name. Choose one of 'stringi', 'budoux', 'tinyseg', 'mecab', or 'sudachipy'. Note that the specified tokenizer must already be installed and available when you use 'mecab' or 'sudachipy'.

rcpath

Path to a settings file for 'MeCab' or 'sudachipy', if any.

mode

Splitting mode for 'sudachipy'.

split

Logical. If TRUE, the function splits the vector into sentences using stringi::stri_split_boundaries(type = "sentence") before tokenizing.
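
For reference, the pre-processing applied when split = TRUE is the stringi sentence-boundary call named above. A minimal sketch of that step on its own (the English sample sentence is illustrative, not from the package):

```r
library(stringi)

# ICU sentence-boundary analysis, as used by strj_tokenize()
# before tokenizing when split = TRUE.
stri_split_boundaries(
  "This is one sentence. This is another.",
  type = "sentence"
)
```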

Examples

# Tokenize a Japanese phrase (written here as Unicode escapes)
# with the default 'stringi' engine; returns a list of tokens.
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  )
)
# The same call, returning a data.frame instead of a list.
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame"
)
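
The engine and split arguments can be combined in one call. A sketch using the built-in 'stringi' engine (chosen here because, unlike 'mecab' or 'sudachipy', it needs no external tokenizer to be installed):

```r
# Split the input into sentences first, then tokenize each one
# with the 'stringi' engine, returning a data.frame.
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame",
  engine = "stringi",
  split = TRUE
)
```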
