stri_rand_strings
A UnicodeSet represents a subset of Unicode code points. Patterns consist either of a series of characters bounded by square brackets (such patterns follow a syntax similar to that employed by version 8 regular expression character classes) or of Perl-like Unicode property set specifiers.
[] denotes an empty set, [a] -- a set consisting of character "a", [\u0105] -- a set with character U+0105, and [abc] -- a set with "a", "b", and "c".
[a-z] denotes a set consisting of characters "a" through "z" inclusively, in Unicode code point order.
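As a quick illustration of the bracketed patterns above, the following R sketch counts the code points of a string that belong to a given set. It assumes the stringi package is installed; the sample string x and the variable names are made up for this example.

```r
library(stringi)

# A made-up sample string; "\u0105" is LATIN SMALL LETTER A WITH OGONEK.
x <- "Hello, World 123 \u0105"

# Count the code points of x that fall into each set.
n_lower <- stri_count_charclass(x, "[a-z]")      # "a" through "z"
n_lo    <- stri_count_charclass(x, "[lo]")       # "l" or "o"
n_ogon  <- stri_count_charclass(x, "[\\u0105]")  # U+0105 only

print(c(n_lower, n_lo, n_ogon))
```

Note that in an R string literal the backslash itself must be doubled, so the pattern [\u0105] is written as "[\\u0105]".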
Some set-theoretic operations are available.
^ denotes the complement, e.g. [^a-z] contains all characters but "a" through "z".
On the other hand, [[pat1][pat2]], [[pat1]&[pat2]], and [[pat1]-[pat2]] denote union, intersection, and asymmetric difference of sets specified by pat1 and pat2, respectively.
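The set operations can be tried out with a small R sketch such as the one below. It assumes stringi is installed; the string x and the chosen ranges are illustrative only.

```r
library(stringi)

x <- "abcdefxyz"  # made-up sample string

n_union <- stri_count_charclass(x, "[[a-c][x-z]]")   # a, b, c, x, y, z
n_inter <- stri_count_charclass(x, "[[a-f]&[d-z]]")  # d, e, f
n_diff  <- stri_count_charclass(x, "[[a-f]-[d-z]]")  # a, b, c
n_compl <- stri_count_charclass(x, "[^a-c]")         # everything but a, b, c

print(c(n_union, n_inter, n_diff, n_compl))
```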
Note that all white spaces are ignored unless they are quoted or backslashed (white spaces can be freely used for clarity, as [a c d-f m] means the same as [acd-fm]).
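A minimal R sketch of this equivalence, assuming stringi is installed (the sample string is made up): both patterns select the same characters, and the unquoted spaces inside the pattern do not make the set match spaces in the text.

```r
library(stringi)

x <- "a c d e m z"  # made-up sample string containing spaces

spaced  <- stri_count_charclass(x, "[a c d-f m]")  # spaces in the pattern are ignored
compact <- stri_count_charclass(x, "[acd-fm]")

# Both counts cover only a, c, d, e and m; the spaces in x are not matched.
print(c(spaced, compact))
```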
See the UnicodeSet API documentation for details.
Also, empty string patterns are disallowed.
Any character may be preceded by a backslash in order to remove any special meaning.
A malformed pattern always results in an error.
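Because a malformed pattern raises an error, code that builds patterns dynamically may want to guard the call. The R sketch below shows one possible way to do so with tryCatch; it assumes stringi is installed, and the unclosed pattern "[a-z" is just an example of a malformed set.

```r
library(stringi)

# "[a-z" lacks the closing bracket, so it is not a valid set pattern.
res <- tryCatch(
    stri_detect_charclass("abc", "[a-z"),
    error = function(e) "error"  # fall back to a marker value
)

print(res)
```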
Unicode properties may be specified either with a POSIX-like syntax, e.g. [:Letter:], or with an (extended) Perl-style syntax, e.g. \p{L}.
The complements of the above sets are [:^Letter:] and \P{L}, respectively.

The properties' names are normalized before matching (for example, the match is case-insensitive). Moreover, many names have short aliases.
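The two notations can be compared side by side with an R sketch like the following. It assumes stringi is installed; the sample string is made up, and the backslashes are doubled because the patterns are written as R string literals.

```r
library(stringi)

x <- "Ala ma kota 123"  # made-up sample string: 9 letters, 6 other characters

n_posix     <- stri_count_charclass(x, "[:Letter:]")   # POSIX-like syntax
n_perl      <- stri_count_charclass(x, "\\p{L}")       # Perl-style syntax
n_posix_neg <- stri_count_charclass(x, "[:^Letter:]")  # complement
n_perl_neg  <- stri_count_charclass(x, "\\P{L}")       # complement

print(c(n_posix, n_perl, n_posix_neg, n_perl_neg))
```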
Among predefined Unicode properties we find, e.g., General Categories such as Lu for uppercase letters and Binary Properties such as WHITE_SPACE.
Each property provides access to the large and comprehensive Unicode Character Database.
Generally, the list of properties available in
Please note that some classes may seem to overlap.
However, e.g. General Category Z (some space) and Binary Property WHITE_SPACE match different character sets.
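The difference between the two classes can be seen on a string that contains a TAB and a line feed, which are control codes (Cc) rather than members of Z. The R sketch below assumes stringi is installed; the sample string is made up.

```r
library(stringi)

x <- "a b\tc\n"  # one space, one TAB, one LF

n_z  <- stri_count_charclass(x, "\\p{Z}")            # General Category Z
n_ws <- stri_count_charclass(x, "\\p{WHITE_SPACE}")  # Binary Property WHITE_SPACE

# Only the space belongs to Z; WHITE_SPACE also covers the TAB and the LF.
print(c(n_z, n_ws))
```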
Cc -- a C0 or C1 control code;
Cf -- a format control character;
Cn -- a reserved unassigned code point or a non-character;
Co -- a private-use character;
Cs -- a surrogate code point;
Lc -- the union of Lu, Ll, Lt;
Ll -- a lowercase letter;
Lm -- a modifier letter;
Lo -- other letters, including syllables and ideographs;
Lt -- a digraphic character, with first part uppercase;
Lu -- an uppercase letter;
Mc -- a spacing combining mark (positive advance width);
Me -- an enclosing combining mark;
Mn -- a non-spacing combining mark (zero advance width);
Nd -- a decimal digit;
Nl -- a letter-like numeric character;
No -- a numeric character of other type;
Pd -- a dash or hyphen punctuation mark;
Ps -- an opening punctuation mark (of a pair);
Pe -- a closing punctuation mark (of a pair);
Pc -- a connecting punctuation mark, like a tie;
Po -- a punctuation mark of other type;
Pi -- an initial quotation mark;
Pf -- a final quotation mark;
Sm -- a symbol of mathematical use;
Sc -- a currency sign;
Sk -- a non-letter-like modifier symbol;
So -- a symbol of other type;
Zs -- a space character (of non-zero width);
Zl -- U+2028 LINE SEPARATOR only;
Zp -- U+2029 PARAGRAPH SEPARATOR only;
C -- the union of Cc, Cf, Cs, Co, Cn;
L -- the union of Lu, Ll, Lt, Lm, Lo;
M -- the union of Mn, Mc, Me;
N -- the union of Nd, Nl, No;
P -- the union of Pc, Pd, Ps, Pe, Pi, Pf, Po;
S -- the union of Sm, Sc, Sk, So;
Z -- the union of Zs, Zl, Zp.

Here is the complete list of supported Binary Properties:
ALPHABETIC -- alphabetic character;
ASCII_HEX_DIGIT -- a character matching the [0-9A-Fa-f] charclass;
BIDI_CONTROL -- a format control which has a specific function in the Bidi (bidirectional text) Algorithm;
BIDI_MIRRORED -- a character that may change display in right-to-left text;
DASH -- a kind of a dash character;
DEFAULT_IGNORABLE_CODE_POINT -- characters that are ignorable in most text processing activities, e.g. <2060..206F, FFF0..FFFB, E0000..E0FFF>;
DEPRECATED -- a deprecated character according to the current Unicode standard (the usage of deprecated characters is strongly discouraged);
DIACRITIC -- a character that linguistically modifies the meaning of another character to which it applies;
EXTENDER -- a character that extends the value or shape of a preceding alphabetic character, e.g. a length and iteration mark;
HEX_DIGIT -- a character commonly used for hexadecimal numbers, cf. also ASCII_HEX_DIGIT;
HYPHEN -- a dash used to mark connections between pieces of words, plus the Katakana middle dot;
ID_CONTINUE -- a character that can continue an identifier, ID_START+Mn+Mc+Nd+Pc;
ID_START -- a character that can start an identifier, Lu+Ll+Lt+Lm+Lo+Nl;
IDEOGRAPHIC -- a CJKV (Chinese-Japanese-Korean-Vietnamese) ideograph;
LOWERCASE;
MATH;
NONCHARACTER_CODE_POINT;
QUOTATION_MARK;
SOFT_DOTTED -- a character with a "soft dot", like i or j, such that an accent placed on this character causes the dot to disappear;
TERMINAL_PUNCTUATION -- a punctuation character that generally marks the end of textual units;
UPPERCASE;
WHITE_SPACE -- a space character or TAB or CR or LF or ZWSP or ZWNBSP;
CASE_SENSITIVE;
POSIX_ALNUM;
POSIX_BLANK;
POSIX_GRAPH;
POSIX_PRINT;
POSIX_XDIGIT;
CASED;
CASE_IGNORABLE;
CHANGES_WHEN_LOWERCASED;
CHANGES_WHEN_UPPERCASED;
CHANGES_WHEN_TITLECASED;
CHANGES_WHEN_CASEFOLDED;
CHANGES_WHEN_CASEMAPPED;
CHANGES_WHEN_NFKC_CASEFOLDED.

In the stri_*_charclass functions, character classes are defined using UnicodeSet patterns. Below we briefly summarize their syntax.
For more details refer to the bibliographic References below.
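To get a feel for the General Category and Binary Property names listed above, one can count matches for several of them at once. The R sketch below assumes stringi is installed and relies on stri_count_charclass being vectorized over its pattern argument; the sample string and the selection of properties are made up for illustration.

```r
library(stringi)

x <- "Abc 12, ok!"  # made-up sample string

# A few General Categories followed by two Binary Properties.
patterns <- c("\\p{Lu}", "\\p{Ll}", "\\p{Nd}", "\\p{P}", "\\p{Zs}",
              "\\p{ALPHABETIC}", "\\p{ASCII_HEX_DIGIT}")

# One count per pattern; x is recycled across all patterns.
counts <- stri_count_charclass(x, patterns)
names(counts) <- patterns

print(counts)
```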
UnicodeSet -- ICU User Guide;
Properties -- ICU User Guide;
Unicode Script Data;
icu::UnicodeSet Class Reference -- ICU4C API Documentation.
stri_count_charclass; stri_detect_charclass; stri_extract_all_charclass, stri_extract_first_charclass, stri_extract_last_charclass; stri_locate_all_charclass, stri_locate_first_charclass, stri_locate_last_charclass; stri_replace_all_charclass, stri_replace_first_charclass, stri_replace_last_charclass; stri_split_charclass; stri_trim, stri_trim_both, stri_trim_left, stri_trim_right; stringi-search
Other stringi_general_topics: stringi-arguments; stringi-encoding; stringi-locale; stringi-package; stringi-search-coll; stringi-search-fixed; stringi-search-regex; stringi-search