
audubon (version 0.5.1)

strj_tokenize: Split text into tokens

Description

Splits text into tokens using the specified tokenizer.

Usage

strj_tokenize(
  text,
  format = c("list", "data.frame"),
  engine = c("stringi", "budoux", "tinyseg", "mecab", "sudachipy"),
  rcpath = NULL,
  mode = c("C", "B", "A"),
  split = FALSE
)

Value

A list or a data.frame.

Arguments

text

Character vector to be tokenized.

format

Output format. Either 'list' or 'data.frame'.

engine

Tokenizer name. Choose one of 'stringi', 'budoux', 'tinyseg', 'mecab', or 'sudachipy'. Note that the specified tokenizer must already be installed and available when you use 'mecab' or 'sudachipy'.

rcpath

Path to a settings file for 'MeCab' or 'sudachipy', if any.

mode

Splitting mode for 'sudachipy'.

split

Logical. If TRUE, the function splits the vector into sentences using stringi::stri_split_boundaries(type = "sentence") before tokenizing.
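
For reference, the pre-processing applied when split = TRUE is the stringi sentence-boundary call named above. A minimal sketch of that step on its own (the English sample sentence is illustrative, not from the package):

```r
library(stringi)

# ICU sentence-boundary analysis, as used by strj_tokenize()
# before tokenizing when split = TRUE.
stri_split_boundaries(
  "This is one sentence. This is another.",
  type = "sentence"
)
```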

Examples

# Tokenize a Japanese phrase (written here as Unicode escapes)
# with the default 'stringi' engine; returns a list of tokens.
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  )
)
# The same call, returning a data.frame instead of a list.
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame"
)
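
The engine and split arguments can be combined in one call. A sketch using the built-in 'stringi' engine (chosen here because, unlike 'mecab' or 'sudachipy', it needs no external tokenizer to be installed):

```r
# Split the input into sentences first, then tokenize each one
# with the 'stringi' engine, returning a data.frame.
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame",
  engine = "stringi",
  split = TRUE
)
```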
