up_alphabetic: Unicode Properties

Description

Match a character that has a given Unicode property.

Usage

up_alphabetic(lo, hi, char_class = TRUE)
up_ascii_hex_digit(lo, hi, char_class = TRUE)
up_bidi_control(lo, hi, char_class = TRUE)
up_bidi_mirrored(lo, hi, char_class = TRUE)
up_case_ignorable(lo, hi, char_class = TRUE)
up_case_sensitive(lo, hi, char_class = TRUE)
up_cased(lo, hi, char_class = TRUE)
up_changes_when_casefolded(lo, hi, char_class = TRUE)
up_changes_when_casemapped(lo, hi, char_class = TRUE)
up_changes_when_lowercased(lo, hi, char_class = TRUE)
up_changes_when_nfkc_casefolded(lo, hi, char_class = TRUE)
up_changes_when_titlecased(lo, hi, char_class = TRUE)
up_changes_when_uppercased(lo, hi, char_class = TRUE)
up_dash(lo, hi, char_class = TRUE)
up_default_ignorable_code_point(lo, hi, char_class = TRUE)
up_deprecated(lo, hi, char_class = TRUE)
up_diacritic(lo, hi, char_class = TRUE)
up_extender(lo, hi, char_class = TRUE)
up_hex_digit(lo, hi, char_class = TRUE)
up_hyphen(lo, hi, char_class = TRUE)
up_id_continue(lo, hi, char_class = TRUE)
up_id_start(lo, hi, char_class = TRUE)
up_ideographic(lo, hi, char_class = TRUE)
up_lowercase(lo, hi, char_class = TRUE)
up_math(lo, hi, char_class = TRUE)
up_noncharacter_code_point(lo, hi, char_class = TRUE)
up_posix_alnum(lo, hi, char_class = TRUE)
up_posix_blank(lo, hi, char_class = TRUE)
up_posix_graph(lo, hi, char_class = TRUE)
up_posix_print(lo, hi, char_class = TRUE)
up_posix_xdigit(lo, hi, char_class = TRUE)
up_quotation_mark(lo, hi, char_class = TRUE)
up_soft_dotted(lo, hi, char_class = TRUE)
up_terminal_punctuation(lo, hi, char_class = TRUE)
up_uppercase(lo, hi, char_class = TRUE)
up_white_space(lo, hi, char_class = TRUE)
UP_ALPHABETIC
UP_ASCII_HEX_DIGIT
UP_BIDI_CONTROL
UP_BIDI_MIRRORED
UP_DASH
UP_DEFAULT_IGNORABLE_CODE_POINT
UP_DEPRECATED
UP_DIACRITIC
UP_EXTENDER
UP_HEX_DIGIT
UP_HYPHEN
UP_ID_CONTINUE
UP_ID_START
UP_IDEOGRAPHIC
UP_LOWERCASE
UP_MATH
UP_NONCHARACTER_CODE_POINT
UP_QUOTATION_MARK
UP_SOFT_DOTTED
UP_TERMINAL_PUNCTUATION
UP_UPPERCASE
UP_WHITE_SPACE
UP_CASE_SENSITIVE
UP_POSIX_ALNUM
UP_POSIX_BLANK
UP_POSIX_GRAPH
UP_POSIX_PRINT
UP_POSIX_XDIGIT
UP_CASED
UP_CASE_IGNORABLE
UP_CHANGES_WHEN_LOWERCASED
UP_CHANGES_WHEN_UPPERCASED
UP_CHANGES_WHEN_TITLECASED
UP_CHANGES_WHEN_CASEFOLDED
UP_CHANGES_WHEN_CASEMAPPED
UP_CHANGES_WHEN_NFKC_CASEFOLDED

Value

A character vector representing part or all of a regular expression.

Format

An object of class regex (inherits from character) of length 1.

Arguments

lo: A non-negative integer. Minimum number of repeats, when grouped.
hi: A positive integer. Maximum number of repeats, when grouped.
char_class: TRUE or FALSE. Should the values be wrapped into a character class?
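
As a rough illustration of how these arguments interact, the sketch below builds a few patterns with up_hex_digit() and applies one of them with stringi, as the Examples section does. It assumes that rebus.unicode and stringi are installed. The exact pattern text may differ between package versions, so the patterns are printed rather than quoted here.

```r
library(rebus.unicode)

# No lo/hi supplied: a pattern for a single hex digit
one_digit <- up_hex_digit()

# lo = 2, hi = 4: between two and four hex digits in a row
two_to_four <- up_hex_digit(2, 4)

# char_class = FALSE: the property is not wrapped in a character class
bare <- up_hex_digit(char_class = FALSE)

# Print the generated patterns to inspect them
print(one_digit)
print(two_to_four)
print(bare)

# Apply the repeated pattern: the first run of two to four hex digits
first_run <- stringi::stri_extract_first_regex("x: c0ffee", two_to_four)
print(first_run)
```

Because hi caps the repeat count at four, the extracted run stops after four digits even though the input contains six consecutive hex digits.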

References

Table 12 of the Unicode Standard Annex #44 defines the Unicode General Categories. http://www.unicode.org/reports/tr44/

You can see which characters are contained in a category by visiting, e.g., http://www.fileformat.info/info/unicode/category/Nd/list.htm
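
Membership can also be checked locally. The sketch below is an assumption about a convenient workflow, not part of this package: it calls stringi directly with a hand-written \p{Alphabetic} pattern, relying on the ICU regular expression engine that stringi uses. The sample characters are arbitrary.

```r
library(stringi)

# Sample characters: Latin letter, digit, hyphen-minus, e with acute, space
chars <- c("a", "1", "-", "\u00e9", " ")

# TRUE where the character has the Alphabetic property
is_alpha <- stri_detect_regex(chars, "\\p{Alphabetic}")

# Label the results with ASCII names so they print cleanly on any locale
print(setNames(is_alpha, c("a", "1", "hyphen", "e-acute", "space")))
```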

Examples

# Classes
up_math()
up_posix_alnum()
up_changes_when_uppercased()
up_diacritic()

# With repetition
up_diacritic(3, 6)
up_quotation_mark(1, Inf)
up_posix_xdigit(0, Inf)

# Without a class wrapper
up_hyphen(char_class = FALSE)

# Constants
UP_ALPHABETIC
UP_DASH
UP_POSIX_ALNUM
UP_CHANGES_WHEN_LOWERCASED

if (FALSE) {
# All the Unicode properties.
# Not run, since it generates lots of output
ls("package:rebus.unicode", pattern = "^up")
}

# Usage
# Hello in Samoan, Serbian, Persian, Simplified Chinese
hello <- "t\u101lofa, \u437\u434\u440\u430\u432\u43e, \u633\u644\u627\u645, \u4f60\u597d"
stringi::stri_extract_all_regex(hello, up_alphabetic(1, Inf))
stringi::stri_extract_all_regex(hello, up_case_sensitive(1, Inf))
