stri_width: Determine the Width of Code Points

Description

Approximates the number of text columns that the `cat()` function should use to print a string in a monospaced font.

Usage

stri_width(str)

Arguments

str

character vector or an object coercible to one

Value

Returns an integer vector of the same length as str.
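
As a small illustration of the return value, the sketch below checks the type and length of the result for a short input vector. The variable names `x` and `w` are only for this example, and the stringi package is assumed to be installed.

```r
library(stringi)

# Illustrative input: two ordinary strings and an empty string
x <- c("abc", "de", "")
w <- stri_width(x)

# The result is an integer vector with one element per input string
print(is.integer(w))
print(length(w) == length(x))
```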

Details

The Unicode standard does not formalize the notion of a character width. Roughly based on http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c and UAX #11, we proceed as follows. The following code points are of width 0:

code points with general category Me, Mn, or Cf (see stringi-search-charclass),
C0 and C1 control codes (general category Cc), for compatibility with the nchar function,
Hangul Jamo medial vowels and final consonants (code points with enumerable property UCHAR_HANGUL_SYLLABLE_TYPE equal to U_HST_VOWEL_JAMO or U_HST_TRAILING_JAMO; note that applying the NFC normalization with stri_trans_nfc is encouraged),
ZERO WIDTH SPACE (U+200B).

Characters with the UCHAR_EAST_ASIAN_WIDTH enumerable property equal to U_EA_FULLWIDTH or U_EA_WIDE are of width 2. SOFT HYPHEN (U+00AD), for compatibility with nchar, and all remaining characters have width 1.
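
The rules above can be illustrated with a few hand-picked code points. This is a minimal sketch, assuming the stringi package is installed; the sample characters and variable names are chosen for this example only and are not part of the package.

```r
library(stringi)

# Width 0 under the rules above
zero <- stri_width(c(
  "\u0301",  # COMBINING ACUTE ACCENT (general category Mn)
  "\u200b"   # ZERO WIDTH SPACE
))

# Width 2: East Asian Fullwidth
wide <- stri_width("\uff21")  # FULLWIDTH LATIN CAPITAL LETTER A

# Width 1: any other character, e.g. an ASCII letter
one <- stri_width("A")

# A base letter followed by a combining mark: only the letter takes a column
combined <- stri_width("a\u0301")

print(list(zero = zero, wide = wide, one = one, combined = combined))
```

Checking individual code points like this can help when a string prints wider or narrower than expected, since it shows which rule each character falls under.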

References

East Asian Width -- Unicode Standard Annex #11, http://www.unicode.org/reports/tr11/

Examples

stri_width(LETTERS[1:5])
nchar(stri_trans_nfkd("\u0105"), "width") # provides incorrect information
stri_width(stri_trans_nfkd("\u0105"))
stri_width( # Full-width equivalents of ASCII characters:
   stri_enc_fromutf32(as.list(c(0x3000, 0xFF01:0xFF5E)))
)
stri_width(stri_trans_nfkd("\ubc1f")) # includes Hangul Jamo medial vowels and final consonants
