strcoll: Compare Strings

Description

These functions provide means to compare strings in any locale using the Unicode collation algorithm.

Usage

strcoll(
  e1,
  e2,
  locale = NULL,
  strength = 3L,
  alternate_shifted = FALSE,
  french = FALSE,
  uppercase_first = NA,
  case_level = FALSE,
  normalisation = FALSE,
  numeric = FALSE
)
e1 %x<% e2
e1 %x<=% e2
e1 %x==% e2
e1 %x!=% e2
e1 %x>% e2
e1 %x>=% e2

Value

strcoll returns an integer vector of comparison results: -1 if a string in e1 is smaller than the corresponding string in e2, 0 if the two are canonically equivalent, and 1 if the string in e1 is greater than the one in e2.

The binary operators call strcoll with default arguments and return logical vectors.
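
As a minimal sketch of the two return types described above, the snippet below compares the same pairs with strcoll and with one of the operators. It assumes the stringx package (and its stringi dependency) is installed; the vectors `x` and `y` are made up for illustration.

```r
library(stringx)  # assumption: stringx and stringi are installed

x <- c("apple", "banana", "cherry")
y <- c("banana", "banana", "apple")

# integer codes, one per pair: -1 (smaller), 0 (equivalent), 1 (greater)
codes <- strcoll(x, y)

# the operator form uses strcoll's defaults and yields a logical vector
is_less <- x %x<% y

print(codes)
print(is_less)
```

The operators take no extra arguments, so call strcoll directly whenever a non-default locale or collator option is needed.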

Arguments

e1, e2: character vectors whose corresponding elements are to be compared
locale: NULL or "" for the default locale (see stri_locale_get) or a single string with a locale identifier, see stri_locale_list
strength: see stri_opts_collator
alternate_shifted: see stri_opts_collator
french: see stri_opts_collator
uppercase_first: see stri_opts_collator
case_level: see stri_opts_collator
normalisation: see stri_opts_collator
numeric: see stri_opts_collator

Differences from Base R

Replacements for the base Comparison operators, implemented with stri_cmp.

Collation in different locales is difficult and non-portable across platforms [fixed here by using the services provided by ICU].
Overloading `<.character` has no effect in R, because S3 method dispatch is done internally with hard-coded support for character arguments. We could have replaced the generic `<` with one that calls UseMethod, but that feels like too intrusive a solution [fixed by introducing the `%x<%` operator].
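
To illustrate how the ICU-based operators can differ from the base ones, the sketch below compares two canonically equivalent spellings of the same letter. It assumes stringx is installed; the variable names are chosen for this example only.

```r
library(stringx)  # assumption: stringx and stringi are installed

composed   <- "\u00E9"    # e with acute accent as a single code point
decomposed <- "e\u0301"   # plain e followed by a combining acute accent

# base `==` sees two different code point sequences
base_equal <- composed == decomposed

# the ICU collator is expected to treat canonically equivalent strings as equal
icu_equal <- composed %x==% decomposed

print(c(base = base_equal, icu = icu_equal))
```

Because the base operators are left untouched, existing code keeps its behaviour, and the ICU-based comparison is used only where `%x==%` and its siblings are written explicitly.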

Author

Marek Gagolewski

Details

These functions are fully vectorised with respect to both arguments.
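
A brief sketch of what vectorisation over both arguments means in practice, assuming stringx is installed: equal-length vectors are compared pair by pair, and a single string is recycled against a longer vector.

```r
library(stringx)  # assumption: stringx and stringi are installed

# pairwise: "a" vs "b", "b" vs "b", "c" vs "a"
pairwise <- strcoll(c("a", "b", "c"), c("b", "b", "a"))

# recycling: "b" is compared with each element of the second argument
recycled <- strcoll("b", c("a", "b", "c"))

print(pairwise)
print(recycled)
```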

For locale-insensitive behaviour similar to that of strcmp from the standard C library, call strcoll(e1, e2, locale="C", strength=4L, normalisation=FALSE). Note, however, that some normalisation will still be performed.

Examples

# lexicographic vs. numeric sort
strcoll("100", c("1", "10", "11", "99", "100", "101", "1000"))
strcoll("100", c("1", "10", "11", "99", "100", "101", "1000"), numeric=TRUE)
# locale-specific rules: in Slovak, "ch" is a separate letter sorted after "h"
strcoll("hladn\u00FD", "chladn\u00FD", locale="sk_SK")
