RBERT (version 0.1.11)

is_chinese_char: Check whether cp is the codepoint of a CJK character.

Description

(R implementation of BasicTokenizer._is_chinese_char from BERT's tokenization.py. From that file: This defines a "Chinese character" as anything in the CJK Unicode block: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)

Usage

is_chinese_char(cp)

Arguments

cp

A Unicode codepoint, as an integer.
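
As a small illustration of what an integer codepoint looks like in R, base R's utf8ToInt() converts a single character to its codepoint. The variable names below are illustrative only and are not part of RBERT.

```r
# Convert single UTF-8 characters to integer codepoints with base R.
cp_zhong <- utf8ToInt("\u4e2d")  # the CJK ideograph U+4E2D
cp_a <- utf8ToInt("A")           # a Latin letter

cp_zhong                         # 20013
cp_a                             # 65
sprintf("U+%04X", cp_zhong)      # "U+4E2D"
```

An integer obtained this way is the kind of value the cp argument expects.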

Value

Logical: TRUE if cp is the codepoint of a CJK character, FALSE otherwise.

Details

Note that the CJK Unicode block does NOT contain all Japanese and Korean characters, despite its name. The modern Korean Hangul alphabet is in a different block, as are Japanese Hiragana and Katakana. Those alphabets are used to write space-separated words, so they are not treated specially and are handled like the alphabets of other languages.)
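
The range check described above can be sketched in a few lines of R. This is a minimal sketch, not the RBERT source: the function name is_cjk_codepoint_sketch is hypothetical, and the codepoint ranges are the ones recalled from _is_chinese_char in BERT's tokenization.py (the main CJK Unified Ideographs block plus its extensions and the compatibility ideographs), so they should be checked against that file before being relied on.

```r
# Hypothetical helper, not the RBERT implementation: returns TRUE if `cp`
# falls in one of the CJK ideograph ranges assumed from BERT's
# _is_chinese_char, FALSE otherwise.
is_cjk_codepoint_sketch <- function(cp) {
  ranges <- rbind(
    c(0x4E00, 0x9FFF),    # CJK Unified Ideographs
    c(0x3400, 0x4DBF),    # Extension A
    c(0x20000, 0x2A6DF),  # Extension B
    c(0x2A700, 0x2B73F),  # Extension C
    c(0x2B740, 0x2B81F),  # Extension D
    c(0x2B820, 0x2CEAF),  # Extension E
    c(0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    c(0x2F800, 0x2FA1F)   # CJK Compatibility Ideographs Supplement
  )
  # Compare cp against every (lower, upper) pair at once.
  any(cp >= ranges[, 1] & cp <= ranges[, 2])
}

is_cjk_codepoint_sketch(utf8ToInt("\u4e2d"))  # CJK ideograph: TRUE
is_cjk_codepoint_sketch(utf8ToInt("\u3042"))  # Hiragana: FALSE
is_cjk_codepoint_sketch(utf8ToInt("\ud55c"))  # Hangul syllable: FALSE
is_cjk_codepoint_sketch(utf8ToInt("A"))       # Latin letter: FALSE
```

The Hiragana and Hangul examples return FALSE because those scripts live in separate Unicode blocks, which matches the note that they are handled like any other space-separated alphabet.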