Tokenize sentences using 'MeCab'
Usage:

gbs_tokenize(
x,
sys_dic = "",
user_dic = "",
split = FALSE,
partial = FALSE,
mode = c("parse", "wakati")
)
Value:

A tibble or a named list of tokens.
Arguments:

x: A data.frame-like object or a character vector to be tokenized.
sys_dic: Character scalar; path to the system dictionary for 'MeCab'. Note that the system dictionary is expected to be compiled with UTF-8, not Shift-JIS or other encodings.
user_dic: Character scalar; path to the user dictionary for 'MeCab'.
split: Logical. When TRUE, the function internally splits the sentences into sub-sentences using stringi::stri_split_boundaries(type = "sentence").
partial: Logical. When TRUE, activates partial parsing mode. When using this feature, note that all spaces at the start and end of input chunks are squashed, since trailing spaces in chunks can otherwise cause errors when parsing.
mode: Character scalar to switch the output format; either "parse" or "wakati".
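Examples:

A minimal usage sketch. This assumes a working MeCab installation with a UTF-8 system dictionary on the default search path; the input sentences are illustrative.

```r
# Tokenize a character vector with the default system dictionary;
# mode = "parse" (the default) returns a tibble of tokens.
res <- gbs_tokenize(c("こんにちは、世界。今日はいい天気です。"))

# Split the input into sub-sentences first, and return tokens
# in "wakati" (space-separated) style instead of a tibble.
res_wakati <- gbs_tokenize(
  "こんにちは、世界。今日はいい天気です。",
  split = TRUE,
  mode = "wakati"
)
```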