stri_rand_strings.
Moreover, the ICU regex engine uses the same scheme for denoting character classes.
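As a minimal sketch of the two uses mentioned above, the snippet below draws random strings from a character class with stri_rand_strings and then reuses the same class notation inside an ICU regular expression. The concrete class [a-f0-9], the seed, and the variable names are illustrative assumptions, not part of the original text.

```r
library(stringi)

set.seed(123)  # assumption: fixed seed, only to make the sketch reproducible

# Draw 3 random strings, each made of 5 code points taken from [a-f0-9].
x <- stri_rand_strings(3, 5, "[a-f0-9]")

# The same class notation can be embedded in an ICU regular expression.
all_in_class <- stri_detect_regex(x, "^[a-f0-9]{5}$")
```

Every element of all_in_class should be TRUE, since each generated string consists solely of code points from the class.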
A UnicodeSet represents a subset of Unicode code points (recall that stringi converts strings in your native encoding to Unicode automatically). Legal code points are U+0000 to U+10FFFF, inclusive.

Patterns consist either of a series of characters bounded by square brackets (such patterns follow a syntax similar to that employed by version 8 regular expression character classes) or of Perl-like Unicode property set specifiers.

[] denotes an empty set, [a] a set consisting of the character "a", [\u0105] a set with the character U+0105, and [abc] a set with "a", "b", and "c".

[a-z] denotes a set consisting of the characters "a" through "z" inclusive, in Unicode code point order.

Some set-theoretic operations are available.
^ denotes the complement, e.g. [^a-z] contains all characters except "a" through "z". In turn, [[pat1][pat2]], [[pat1]&[pat2]], and [[pat1]-[pat2]] denote the union, intersection, and asymmetric difference of the sets specified by pat1 and pat2, respectively.

Note that all white space is ignored unless it is quoted or backslash-escaped (white space can be used freely for clarity, as [a c d-f m] means the same as [acd-fm]).
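The set notation described above can be tried out with the stri_*_charclass functions. The following is a small sketch only; the sample strings and variable names are made up for illustration.

```r
library(stringi)

x <- "a1 B2-c3"

# Simple range: every lowercase ASCII letter, one match per code point.
lower <- stri_extract_all_charclass(x, "[a-z]", merge = FALSE)[[1]]

# Complement: delete everything that is NOT in [a-z].
only_lower <- stri_replace_all_charclass(x, "[^a-z]", "")

# Union, intersection, and asymmetric difference of two sets.
n_union <- stri_count_charclass(x, "[[a-z][0-9]]")     # a, c, 1, 2, 3
n_inter <- stri_count_charclass(x, "[[a-z]&[aeiou]]")  # a
n_diff  <- stri_count_charclass(x, "[[a-z]-[aeiou]]")  # c

# White space inside a pattern is ignored, so both spellings denote one set.
y <- "a c m"
n_spaced  <- stri_count_charclass(y, "[a c d-f m]")
n_compact <- stri_count_charclass(y, "[acd-fm]")
```

Note that in R source code a backslash inside a pattern has to be doubled, e.g. "[\\u0105]", because R processes string escapes before stringi sees the pattern.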
stringi does not allow so-called multicharacter strings to be included (see the UnicodeSet API documentation). Also, empty string patterns are disallowed.

Any character may be preceded by a backslash in order to remove any special meaning.

A malformed pattern always results in an error.

Set expressions at a glance
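(according to http://userguide.icu-project.org/strings/regexp). Some examples:

[abc]
    Match any of the characters a, b, or c.
[^abc]
    Negation: match any character except a, b, or c.
[A-M]
    Range: match any character from A to M, in Unicode code point order.
[\u0000-\U0010ffff]
    Range: match all characters.
[\p{Letter}] or [\p{General_Category=Letter}] or [\p{L}]
    Characters with Unicode General Category Letter. All three forms are equivalent.
[\P{Letter}]
    Negated property (note the upper-case \P). Match everything except Letters.
[\p{numeric_value=9}]
    Match all characters with a numeric value of 9.
[\p{Letter}&&\p{script=cyrillic}]
    Intersection: match all Cyrillic letters.
[\p{Letter}--\p{script=latin}]
    Subtraction: match all non-Latin letters.
[[a-z][A-Z][0-9]] or [a-zA-Z0-9]
    Union: match ASCII letters and digits. The two forms are equivalent.
[:script=Greek:]
    Alternative POSIX-like syntax for properties, equivalent to \p{script=Greek}.

Unicode property sets are specified with a POSIX-like syntax, e.g. [:Letter:], or with an (extended) Perl-style syntax, e.g. \p{L}.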
The complements of the above sets are [:^Letter:] and \P{L}, respectively.

Property names are normalized before matching (for example, the match is case-insensitive). Moreover, many names have short aliases.

Among the predefined Unicode properties we find, e.g., General Categories such as Lu for uppercase letters and Binary Properties such as WHITE_SPACE. Note that General Category Z (some space) and Binary Property WHITE_SPACE
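match different character sets.

The General Categories are listed below with their standard Unicode names:

Cc  control
Cf  format
Cn  unassigned
Co  private use
Cs  surrogate
Lc  cased letter
Ll  lowercase letter
Lm  modifier letter
Lo  other letter
Lt  titlecase letter
Lu  uppercase letter
Mc  spacing combining mark
Me  enclosing mark
Mn  nonspacing mark
Nd  decimal digit number
Nl  letter number
No  other number
Pd  dash punctuation
Ps  open punctuation
Pe  close punctuation
Pc  connector punctuation
Po  other punctuation
Pi  initial quote punctuation
Pf  final quote punctuation
Sm  math symbol
Sc  currency symbol
Sk  modifier symbol
So  other symbol
Zs  space separator
Zl  line separator
Zp  paragraph separator
C   other
L   letter
M   mark
N   number
P   punctuation
S   symbol
Z   separator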
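To illustrate how General Categories are addressed in practice, here is a brief sketch that counts code points by category and shows that the POSIX-like and Perl-style spellings, as well as differently cased names, select the same set. The sample string and variable names are assumptions made for this example.

```r
library(stringi)

# "Ab" + U+0105 (a with ogonek), a space, two digits, a comma, " ok", "!"
x <- "Ab\u0105 12, ok!"

n_letters_perl  <- stri_count_charclass(x, "\\p{L}")
n_letters_posix <- stri_count_charclass(x, "[:Letter:]")
n_letters_long  <- stri_count_charclass(x, "\\p{General_Category=Letter}")
n_letters_lower <- stri_count_charclass(x, "\\p{letter}")  # names are case-insensitive

# Complements: everything that is not a letter.
n_other_perl  <- stri_count_charclass(x, "\\P{L}")
n_other_posix <- stri_count_charclass(x, "[:^Letter:]")

# Two-letter subcategories and one-letter groups.
n_upper  <- stri_count_charclass(x, "\\p{Lu}")  # A
n_digits <- stri_count_charclass(x, "\\p{Nd}")  # 1, 2
n_punct  <- stri_count_charclass(x, "\\p{P}")   # "," and "!"
n_space  <- stri_count_charclass(x, "\\p{Zs}")  # two spaces
```

The four letter counts agree with each other, and the letter and non-letter counts add up to the length of the string.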
The Binary Properties include:

ALPHABETIC
ASCII_HEX_DIGIT  a character matching the [0-9A-Fa-f] charclass
BIDI_CONTROL
BIDI_MIRRORED
DASH
DEFAULT_IGNORABLE_CODE_POINT
DEPRECATED
DIACRITIC
EXTENDER
HEX_DIGIT  see also ASCII_HEX_DIGIT
HYPHEN
ID_CONTINUE  ID_START+Mn+Mc+Nd+Pc
ID_START  Lu+Ll+Lt+Lm+Lo+Nl
IDEOGRAPHIC
LOWERCASE
MATH
NONCHARACTER_CODE_POINT
QUOTATION_MARK
SOFT_DOTTED
TERMINAL_PUNCTUATION
UPPERCASE
WHITE_SPACE
CASE_SENSITIVE
POSIX_ALNUM
POSIX_BLANK
POSIX_GRAPH
POSIX_PRINT
POSIX_XDIGIT
CASED
CASE_IGNORABLE
CHANGES_WHEN_LOWERCASED
CHANGES_WHEN_UPPERCASED
CHANGES_WHEN_TITLECASED
CHANGES_WHEN_CASEFOLDED
CHANGES_WHEN_CASEMAPPED
CHANGES_WHEN_NFKC_CASEFOLDED
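The difference between General Category Z and Binary Property WHITE_SPACE noted earlier can be checked directly. The sketch below is illustrative only; the test characters and variable names are assumptions chosen for the example.

```r
library(stringi)

# A space, a tab, a newline, and a no-break space (U+00A0).
x <- c(" ", "\t", "\n", "\u00a0")

# Tab and newline are control characters (Cc), so they are not in Z ...
in_Z <- stri_detect_charclass(x, "\\p{Z}")

# ... but they do have the WHITE_SPACE binary property.
in_ws <- stri_detect_charclass(x, "\\p{WHITE_SPACE}")

# ASCII_HEX_DIGIT corresponds to the [0-9A-Fa-f] charclass.
hex <- stri_extract_all_charclass("0x1F, g", "\\p{ASCII_HEX_DIGIT}",
                                  merge = FALSE)[[1]]
```

Here in_Z is TRUE only for the space and the no-break space, whereas in_ws is TRUE for all four characters.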
Avoid POSIX character classes such as [:punct:]. The ICU User Guide (see below) states that in general they are not well-defined, so you may end up with something different than you expect.

In particular, in POSIX-like regex engines, [:punct:] stands for the character class corresponding to the ispunct() classification function (check out man 3 ispunct on UNIX-like systems). According to ISO/IEC 9899:1990 (ISO C90), the ispunct() function tests for any printing character except for space or a character for which isalnum() is true. However, in a POSIX setting, the details of which characters belong to which class depend on the current locale. So the [:punct:] class does not lead to portable code (again, in POSIX-like regex engines).

So a POSIX flavor of [:punct:] is more like [\p{P}\p{S}]
in ICU. You have been warned.

The stri_*_charclass functions in stringi perform single-character (i.e. Unicode code point) search-based operations. Since stringi_0.2-1 you may obtain roughly the same results using stringi-search-regex. However, these functions aim to be faster.

Character classes are defined using ICU's UnicodeSet patterns. Below we briefly summarize their syntax. For more details refer to the bibliographic References below.
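A short sketch of the two points above: a few stri_*_charclass functions next to their regex counterpart, and the gap between the Unicode punctuation category and the broader POSIX-style notion of punctuation. The input strings and variable names are assumptions made for illustration.

```r
library(stringi)

x <- "1 + 1 = 2, right?"

# \p{P} covers only "," and "?"; "+" and "=" are symbols (category S).
n_p  <- stri_count_charclass(x, "\\p{P}")
n_ps <- stri_count_charclass(x, "[\\p{P}\\p{S}]")

# Roughly the same result through the regex engine.
n_re <- stri_count_regex(x, "[\\p{P}\\p{S}]")

# Other single-code-point operations.
parts   <- stri_split_charclass("a,b;c", "[,;]")[[1]]
trimmed <- stri_trim_both("\t text \n")  # default pattern trims white space
```

In this sketch n_p is 2 while n_ps is 4, which is why [\p{P}\p{S}] is the closer analogue of what ispunct()-based engines treat as punctuation.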
UnicodeSet -- ICU User Guide, http://userguide.icu-project.org/strings/unicodeset
Properties -- ICU User Guide, http://userguide.icu-project.org/strings/properties
C/POSIX Migration -- ICU User Guide, http://userguide.icu-project.org/posix
Unicode Script Data, http://www.unicode.org/Public/UNIDATA/Scripts.txt
icu::UnicodeSet Class Reference -- ICU4C API Documentation, http://www.icu-project.org/apiref/icu4c/classicu_1_1UnicodeSet.html
stri_trim_both, stringi-search
Other stringi_general_topics: stringi-arguments, stringi-encoding, stringi-locale, stringi-package, stringi-search-boundaries, stringi-search-coll, stringi-search-fixed, stringi-search-regex, stringi-search