stringi-search-charclass: Character Classes in stringi

Description

In this man page we describe how character classes are declared in the stringi package so that you may e.g. find their occurrences in your search activities or generate random code points with stri_rand_strings. Moreover, ICU regex engines uses the same scheme for denoting character classes.

Arguments

<code>UnicodeSet</code> patterns

A UnicodeSet represents a subset of Unicode code points (recall that stringi converts strings in your native encoding to Unicode automatically). Legal code points are U+0000 to U+10FFFF, inclusive.

Patterns either consist of series of characters either bounded by square brackets (such patterns follow a syntax similar to that employed by version 8 regular expression character classes) or of Perl-like Unicode property set specifiers.

[] denotes an empty set, [a] -- a set consisting of character ``a'', [\u0105] -- a set with character U+0105, and [abc] -- a set with ``a'', ``b'', and ``c''.

[a-z] denotes a set consisting of characters ``a'' through ``z'' inclusively, in Unicode code point order.

Some set-theoretic operations are available. ^ denotes the complement, e.g. [^a-z] contains all characters but ``a'' through ``z''. On the other hand, [[pat1][pat2]], [[pat1]&[pat2]], and [[pat1]-[pat2]] denote union, intersection, and asymmetric difference of sets specified by pat1 and pat2, respectively.

Note that all white spaces are ignored unless they are quoted or backslashed (white spaces can be freely used for clarity, as [a c d-f m] means the same as [acd-fm]). stringi does not allow for including so-called multicharacter strings (see UnicodeSet API documentation). Also, empty string patterns are disallowed.

Any character may be preceded by a backslash in order to remove any special meaning.

A malformed pattern always results in an error.

Set expressions at a glance (according to http://userguide.icu-project.org/strings/regexp):

ll{ Example Description [abc] Match any of the characters a, b or c [^abc] Negation -- match any character except a, b or c [A-M] Range -- match any character from A to M. The characters to include are determined by Unicode code point ordering. [\u0000-\U0010ffff] Range -- match all characters. [\p{Letter}] or [\p{General_Category=Letter}] or [\p{L}] Characters with Unicode Category = Letter. All forms shown are equivalent. [\P{Letter}] Negated property. (Upper case \P) Match everything except Letters. [\p{numeric_value=9}] Match all numbers with a numeric value of 9. Any Unicode Property may be used in set expressions. [\p{Letter}&&\p{script=cyrillic}] Logical AND or intersection -- match the set of all Cyrillic letters. [\p{Letter}--\p{script=latin}] Subtraction -- match all non-Latin letters. [[a-z][A-Z][0-9]] or [a-zA-Z0-9] Implicit Logical OR or Union of Sets -- the examples match ASCII letters and digits. The two forms are equivalent. [:script=Greek:] Alternate POSIX-like syntax for properties -- equivalent to \p{script=Greek} }

Unicode properties

Unicode property sets are specified with a POSIX-like syntax, e.g. [:Letter:], or with a (extended) Perl-style syntax, e.g. \p{L}. The complements of the above sets are [:^Letter:] and \P{L}, respectively.

The properties' names are normalized before matching (for example, the match is case-insensitive). Moreover, many names have short aliases.

Among predefined Unicode properties we find e.g.

Unicode General Categories, e.g.Lufor uppercase letters,
Unicode Binary Properties, e.g.WHITE_SPACE,

and many more (including Unicode scripts).

Each property provides access to the large and comprehensive Unicode Character Database. Generally, the list of properties available in ICU is not perfectly documented. Please refer to the References section for some links.

Please note that some classes may seem to overlap. However, e.g. General Category Z (some space) and Binary Property WHITE_SPACE matches different character sets.

Unicode General Categories

The Unicode General Category property of a code point provides the most general classification of that code point. Each code point falls into one and only one Category.

ll{ Cc a C0 or C1 control code Cf a format control character Cn a reserved unassigned code point or a non-character Co a private-use character Cs a surrogate code point Lc the union of Lu, Ll, Lt Ll a lowercase letter Lm a modifier letter Lo other letters, including syllables and ideographs Lt a digraphic character, with first part uppercase Lu an uppercase letter Mc a spacing combining mark (positive advance width) Me an enclosing combining mark Mn a non-spacing combining mark (zero advance width) Nd a decimal digit Nl a letter-like numeric character No a numeric character of other type Pd a dash or hyphen punctuation mark Ps an opening punctuation mark (of a pair) Pe a closing punctuation mark (of a pair) Pc a connecting punctuation mark, like a tie Po a punctuation mark of other type Pi an initial quotation mark Pf a final quotation mark Sm a symbol of mathematical use Sc a currency sign Sk a non-letter-like modifier symbol So a symbol of other type Zs a space character (of non-zero width) Zl U+2028 LINE SEPARATOR only Zp U+2029 PARAGRAPH SEPARATOR only C the union of Cc, Cf, Cs, Co, Cn L the union of Lu, Ll, Lt, Lm, Lo M the union of Mn, Mc, Me N the union of Nd, Nl, No P the union of Pc, Pd, Ps, Pe, Pi, Pf, Po S the union of Sm, Sc, Sk, So Z the union of Zs, Zl, Zp }

Unicode Binary Properties

Each character may follow many Binary Properties at a time.

Here is a comprehensive list of supported Binary Properties:

ll{ ALPHABETIC alphabetic character ASCII_HEX_DIGIT a character matching the [0-9A-Fa-f] charclass BIDI_CONTROL a format control which have specific functions in the Bidi (bidirectional text) Algorithm BIDI_MIRRORED a character that may change display in right-to-left text DASH a kind of a dash character DEFAULT_IGNORABLE_CODE_POINT characters that are ignorable in most text processing activities, e.g. <2060..206f, fff0..fffb,="" e0000..e0fff=""> DEPRECATED a deprecated character according to the current Unicode standard (the usage of deprecated characters is strongly discouraged) DIACRITIC a character that linguistically modifies the meaning of another character to which it applies EXTENDER a character that extends the value or shape of a preceding alphabetic character, e.g. a length and iteration mark. HEX_DIGIT a character commonly used for hexadecimal numbers, cf. also ASCII_HEX_DIGIT HYPHEN a dash used to mark connections between pieces of words, plus the Katakana middle dot ID_CONTINUE a character that can continue an identifier, ID_START+Mn+Mc+Nd+Pc ID_START a character that can start an identifier, Lu+Ll+Lt+Lm+Lo+Nl IDEOGRAPHIC a CJKV (Chinese-Japanese-Korean-Vietnamese) ideograph LOWERCASE MATH NONCHARACTER_CODE_POINT QUOTATION_MARK SOFT_DOTTED a character with a ``soft dot'', like i or j, such that an accent placed on this character causes the dot to disappear TERMINAL_PUNCTUATION a punctuation character that generally marks the end of textual units UPPERCASE WHITE_SPACE a space character or TAB or CR or LF or ZWSP or ZWNBSP CASE_SENSITIVE POSIX_ALNUM POSIX_BLANK POSIX_GRAPH POSIX_PRINT POSIX_XDIGIT CASED CASE_IGNORABLE CHANGES_WHEN_LOWERCASED CHANGES_WHEN_UPPERCASED CHANGES_WHEN_TITLECASED CHANGES_WHEN_CASEFOLDED CHANGES_WHEN_CASEMAPPED CHANGES_WHEN_NFKC_CASEFOLDED }

POSIX Character Classes

Beware of using POSIX character classes, e.g. [:punct:]. ICU User Guide (see below) states that in general they are not well-defined, so may end up with something different than you expect.

In particular, in POSIX-like regex engines, [:punct:] stands for the character class corresponding to the ispunct() classification function (check out man 3 ispunct on UNIX-like systems). According to ISO/IEC 9899:1990 (ISO C90), the ispunct() function tests for any printing character except for space or a character for which isalnum() is true. However, in a POSIX setting, the details of what characters belong into which class depend on the current locale. So the [:punct:] class does not lead to portable code (again, in POSIX-like regex engines).

So a POSIX flavor of [:punct:] is more like [\p{P}\p{S}] in ICU. You have been warned.

Details

All stri_*_charclass functions in stringi perform a single character (i.e. Unicode code point) search-based operations. Since stringi_0.2-1 you may obtain roughly the same results using stringi-search-regex. However, these very functions aim to be faster.

Character classes are defined using ICU's UnicodeSet patterns. Below we briefly summarize their syntax. For more details refer to the bibliographic References below.

References

The Unicode Character Database -- Unicode Standard Annex #44, http://www.unicode.org/reports/tr44/

UnicodeSet -- ICU User Guide, http://userguide.icu-project.org/strings/unicodeset

Properties -- ICU User Guide, http://userguide.icu-project.org/strings/properties

C/POSIX Migration -- ICU User Guide, http://userguide.icu-project.org/posix

Unicode Script Data, http://www.unicode.org/Public/UNIDATA/Scripts.txt

icu::Unicodeset Class Reference -- ICU4C API Documentation, http://www.icu-project.org/apiref/icu4c/classicu_1_1UnicodeSet.html