RBERT (version 0.1.11)

tokenize_chinese_chars: Add whitespace around any CJK character.

Description

R implementation of BasicTokenizer._tokenize_chinese_chars from BERT's tokenization.py. A space is added on each side of every CJK character, so adjacent CJK characters end up separated by two spaces. This doubling matches the behavior of the original Python code.

Usage

tokenize_chinese_chars(text)

Arguments

text

A character scalar.

Value

The input text with a space inserted before and after each CJK character.
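
Examples

The behavior described above can be illustrated with a minimal base-R sketch. This is not RBERT's source: the names is_cjk_codepoint and add_space_around_cjk are hypothetical, and the code point ranges are the CJK ideograph blocks checked by the Python BasicTokenizer. It assumes a UTF-8 character scalar as input.

```r
# Hypothetical helper (not part of RBERT): TRUE when a Unicode code point
# falls in one of the CJK ideograph blocks checked by BERT's tokenizer.
is_cjk_codepoint <- function(cp) {
  (cp >= 0x4E00 & cp <= 0x9FFF) |
    (cp >= 0x3400 & cp <= 0x4DBF) |
    (cp >= 0x20000 & cp <= 0x2A6DF) |
    (cp >= 0x2A700 & cp <= 0x2B73F) |
    (cp >= 0x2B740 & cp <= 0x2B81F) |
    (cp >= 0x2B820 & cp <= 0x2CEAF) |
    (cp >= 0xF900 & cp <= 0xFAFF) |
    (cp >= 0x2F800 & cp <= 0x2FA1F)
}

# Hypothetical re-implementation: wrap every CJK character in spaces and
# leave all other characters unchanged.
add_space_around_cjk <- function(text) {
  code_points <- utf8ToInt(text)
  pieces <- vapply(code_points, function(cp) {
    ch <- intToUtf8(cp)
    if (is_cjk_codepoint(cp)) paste0(" ", ch, " ") else ch
  }, character(1))
  paste(pieces, collapse = "")
}

# Two adjacent CJK characters: note the doubled space between them.
result <- add_space_around_cjk("ab\u4e2d\u6587cd")
print(result)
# "ab 中  文 cd"
```

Because each CJK character receives its own leading and trailing space, the doubled space appears wherever two CJK characters are adjacent. A later whitespace-splitting step would typically collapse these.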