stringi-search-charclass: Character Classes in stringi

Description

This man page describes how character classes are declared in the stringi package, so that you can, for example, search for their occurrences in text or generate random code points with stri_rand_strings. The ICU regex engine uses the same scheme for denoting character classes.
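
As a small illustration of the random-generation use case, the sketch below draws strings from a character class. It assumes stringi is installed; the class [a-z] and the sizes are arbitrary choices made for this example.

```r
library(stringi)

# Draw 3 random strings, each 5 code points long, from the set [a-z].
rnd <- stri_rand_strings(3, 5, "[a-z]")

# Check that every generated string consists solely of lowercase ASCII letters.
rnd_ok <- stri_detect_regex(rnd, "^[a-z]{5}$")
print(rnd_ok)
```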

Arguments

Unicode properties

Unicode property sets are specified with a POSIX-like syntax, e.g. [:Letter:], or with an (extended) Perl-style syntax, e.g. \p{L}. The complements of the above sets are [:^Letter:] and \P{L}, respectively.
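
The two spellings and their complements can be compared with a short sketch. It assumes stringi is installed; the sample string is made up for illustration, and backslashes are doubled because the patterns are written as R string literals.

```r
library(stringi)

s <- "abc 123"

# Letters, in Perl-style and in POSIX-like syntax.
perl_set  <- stri_extract_all_charclass(s, "\\p{L}")
posix_set <- stri_extract_all_charclass(s, "[:Letter:]")

# Complements: every code point that is not a letter.
perl_compl  <- stri_extract_all_charclass(s, "\\P{L}")
posix_compl <- stri_extract_all_charclass(s, "[:^Letter:]")

print(perl_set)    # consecutive matches are merged by default
print(perl_compl)
```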

The properties' names are normalized before matching (for example, the match is case-insensitive). Moreover, many names have short aliases.
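
A hedged sketch of this name normalization: the patterns below differ only in letter case or in using the long alias Uppercase_Letter instead of Lu, and are expected to denote the same set. It assumes stringi is installed; the input string is illustrative.

```r
library(stringi)

# Four spellings of the "uppercase letter" general category.
alias_patterns <- c("\\p{Lu}", "\\p{lu}",
                    "\\p{Uppercase_Letter}", "\\p{uppercase_letter}")

# stri_count_charclass is vectorized over the pattern argument,
# so the single input string is matched against each spelling in turn.
alias_counts <- stri_count_charclass("stRRRingi", alias_patterns)
print(alias_counts)
```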

Among predefined Unicode properties we find e.g.

Unicode General Categories, e.g. Lu for uppercase letters,
Unicode Binary Properties, e.g. WHITE_SPACE,

and many more (including Unicode scripts).

Each property provides access to the large and comprehensive Unicode Character Database. Generally, the list of properties available in ICU is not perfectly documented. Please refer to the References section for some links.

Please note that some classes may seem to overlap. However, e.g. General Category Z (some space) and Binary Property WHITE_SPACE match different character sets.
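
The kinds of properties mentioned above, and the difference between Z and WHITE_SPACE, can be sketched as follows. This assumes stringi is installed; the test characters are chosen for illustration only. A horizontal tab has the White_Space property but belongs to General Category Cc (control), not Z.

```r
library(stringi)

# "A", "a", a space, a tab, and U+0416 (CYRILLIC CAPITAL LETTER ZHE).
chars <- c("A", "a", " ", "\t", "\u0416")

is_upper <- stri_detect_charclass(chars, "\\p{Lu}")            # General Category
is_wspace <- stri_detect_charclass(chars, "\\p{WHITE_SPACE}")  # Binary Property
is_latin <- stri_detect_charclass(chars, "\\p{Script=Latin}")  # Script
is_z <- stri_detect_charclass(chars, "\\p{Z}")                 # separators only

# Compare the columns for the tab: WHITE_SPACE matches it, Z does not.
print(data.frame(is_upper, is_wspace, is_latin, is_z))
```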

POSIX Character Classes

Beware of using POSIX character classes, e.g. [:punct:]. The ICU User Guide (see References) states that in general they are not well-defined, so you may end up with something different from what you expect.

In particular, in POSIX-like regex engines, [:punct:] stands for the character class corresponding to the ispunct() classification function (check out man 3 ispunct on UNIX-like systems). According to ISO/IEC 9899:1990 (ISO C90), the ispunct() function tests for any printing character except for space or a character for which isalnum() is true. However, in a POSIX setting, the details of which characters belong to which class depend on the current locale. So the [:punct:] class does not lead to portable code (again, in POSIX-like regex engines).

So a POSIX flavor of [:punct:] is more like [\p{P}\p{S}] in ICU. You have been warned.
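
To see why the explicit union matters, the sketch below compares \p{P} alone with [\p{P}\p{S}]. It assumes stringi is installed; the input string is made up. The characters + and = are symbols (category S), while ! is punctuation (category P).

```r
library(stringi)

expr <- "a+b=c!"

# Punctuation only: the math symbols are not matched.
only_p <- stri_extract_all_charclass(expr, "\\p{P}", merge = FALSE)[[1]]

# Punctuation or symbols: closer to what a POSIX-style [:punct:] covers.
p_or_s <- stri_extract_all_charclass(expr, "[\\p{P}\\p{S}]", merge = FALSE)[[1]]

print(only_p)
print(p_or_s)
```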

Details

All stri_*_charclass functions in stringi perform single-character (i.e. Unicode code point) search-based operations. Since stringi_0.2-1 you may obtain roughly the same results using stringi-search-regex. However, the stri_*_charclass functions aim to be faster.
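
A minimal sketch of this equivalence, assuming stringi is installed: the same property pattern is passed to a charclass function and to its regex counterpart. The input strings are illustrative, and no timing comparison is attempted here.

```r
library(stringi)

words <- c("stRRRingi", "R STRINGI", "123")

# Count uppercase letters, one code point at a time.
n_charclass <- stri_count_charclass(words, "\\p{Lu}")

# The same query expressed as a regular expression.
n_regex <- stri_count_regex(words, "\\p{Lu}")

print(n_charclass)
print(n_regex)
```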

Character classes are defined using ICU's UnicodeSet patterns. Below we briefly summarize their syntax. For more details refer to the bibliographic References below.

References

The Unicode Character Database -- Unicode Standard Annex #44, http://www.unicode.org/reports/tr44/

UnicodeSet -- ICU User Guide, http://userguide.icu-project.org/strings/unicodeset

Properties -- ICU User Guide, http://userguide.icu-project.org/strings/properties

C/POSIX Migration -- ICU User Guide, http://userguide.icu-project.org/posix

Unicode Script Data, http://www.unicode.org/Public/UNIDATA/Scripts.txt

icu::UnicodeSet Class Reference -- ICU4C API Documentation, http://www.icu-project.org/apiref/icu4c/classicu_1_1UnicodeSet.html