nchar_ctl: Control Sequence Aware Version of nchar

Description

nchar_ctl counts all characters that are not part of a Control Sequence. nzchar_ctl returns TRUE for each input vector element that contains at least one character that is not part of a Control Sequence. By default newlines and other C0 control characters are not counted.

Usage

nchar_ctl(
  x,
  type = "chars",
  allowNA = FALSE,
  keepNA = NA,
  ctl = "all",
  warn = getOption("fansi.warn", TRUE),
  strip
)
nzchar_ctl(
  x,
  keepNA = FALSE,
  ctl = "all",
  warn = getOption("fansi.warn", TRUE)
)

Value

Like base::nchar, with Control Sequences excluded.
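
As a minimal sketch of the difference from base::nchar, assuming fansi is installed and using an arbitrary example string wrapped in SGR sequences:

```r
library(fansi)

# "bold" wrapped in SGR sequences that switch bold on and off
x <- "\033[1mbold\033[22m"

nchar(x)      # base R also counts the escape characters: 13
nchar_ctl(x)  # Control Sequences are excluded: 4
```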

Arguments

x

a character vector or object that can be coerced to such.

type

character(1L) partial matching c("chars", "width", "graphemes"), although types other than "chars" only work correctly with R >= 3.2.2. See ?nchar.

allowNA

logical: should NA be returned for invalid multibyte strings or "bytes"-encoded strings (rather than throwing an error)?

keepNA

logical: should NA be returned when x is NA? If false, nchar() returns 2, as that is the number of printing characters used when strings are written to output, and nzchar() is TRUE. The default for nchar(), NA, means to use keepNA = TRUE unless type is "width".

ctl

character, which Control Sequences should be treated specially. Special treatment is context dependent, and may include detecting them and/or computing their display/character width as zero. For the SGR subset of the ANSI CSI sequences, and OSC hyperlinks, fansi will also parse, interpret, and reapply the sequences as needed. You can modify whether a Control Sequence is treated specially with the ctl parameter.

"nl": newlines.
"c0": all other "C0" control characters (i.e. 0x01-0x1f, 0x7F), except for newlines and the actual ESC (0x1B) character.
"sgr": ANSI CSI SGR sequences.
"csi": all non-SGR ANSI CSI sequences.
"url": OSC hyperlinks.
"osc": all non-OSC-hyperlink OSC sequences.
"esc": all other escape sequences.
"all": all of the above, except when used in combination with any of the above, in which case it means "all but".
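
The effect of different ctl values can be sketched as follows. The string is an arbitrary illustration, and the counts in the comments assume the default type = "chars":

```r
library(fansi)

# SGR color codes around "red", then a tab, "end", and a newline
x <- "\033[31mred\033[0m\tend\n"

nchar_ctl(x)                        # all Control Sequences excluded: 6
nchar_ctl(x, ctl = "sgr")           # only SGR excluded; tab and newline count: 8
nchar_ctl(x, ctl = c("all", "nl"))  # "all but" newlines; the newline counts: 7
```

Combining "all" with other values inverts the selection, which is often shorter than listing every remaining category.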

warn

TRUE (default) or FALSE, whether to warn when potentially problematic Control Sequences are encountered. These could cause the assumptions fansi makes about how strings are rendered on your display to be incorrect, for example by moving the cursor (see ?fansi). At most one warning will be issued per element in each input vector. Will also warn about some badly encoded UTF-8 strings, but a lack of UTF-8 warnings is not a guarantee of correct encoding (use validUTF8 for that).

strip

character, deprecated in favor of ctl.

Control and Special Sequences

Control Sequences are non-printing characters or sequences of characters. Special Sequences are a subset of the Control Sequences, and include CSI SGR sequences which can be used to change rendered appearance of text, and OSC hyperlinks. See fansi for details.

Output Stability

Several factors could affect the exact output produced by fansi functions across versions of fansi, R, and/or across systems. In general it is best not to rely on exact fansi output, e.g. by embedding it in tests.

Width and grapheme calculations depend on locale, Unicode database version, and grapheme processing logic (which is still in development), among other things. For the most part fansi (currently) uses the internals of base::nchar(type='width'), but there are exceptions and this may change in the future.

How a particular display format is encoded in Control Sequences is not guaranteed to be stable across fansi versions. Additionally, which Special Sequences are re-encoded vs transcribed untouched may change. In general we will strive to keep the rendered appearance stable.

To maximize the odds of getting stable output, set normalize_state to TRUE and type to "chars" in functions that allow it, and set term.cap to a specific set of capabilities.

Graphemes

fansi approximates grapheme widths and counts by using heuristics for grapheme breaks that work for most common graphemes, including emoji combining sequences. The heuristic is known to work incorrectly with invalid combining sequences, prepending marks, and sequence interruptors. fansi does not provide a full implementation of grapheme break detection, both to avoid carrying a copy of the Unicode grapheme breaks table and in the hope that R will eventually add the feature itself.

The utf8 package provides a conforming grapheme parsing implementation.

Details

nchar_ctl and nzchar_ctl are implemented in statically compiled code, so in particular nzchar_ctl will be much faster than the otherwise equivalent nzchar(strip_ctl(...)).
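
The equivalence mentioned above can be checked with a small sketch, assuming fansi is installed. The input vector is an arbitrary illustration:

```r
library(fansi)

x <- c(
  "\033[31m\033[0m",      # only SGR sequences, nothing left after stripping
  "\033[31mtext\033[0m",  # SGR sequences around visible text
  ""                      # empty string
)

# Both forms ask whether anything remains once Control Sequences are removed
nzchar_ctl(x)
nzchar(strip_ctl(x))

identical(nzchar_ctl(x), nzchar(strip_ctl(x)))
```

nzchar_ctl avoids building the intermediate stripped strings, which is one plausible reason for the speed difference described above.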

These functions will warn if malformed escape sequences or malformed UTF-8 sequences are encountered, as these may be interpreted incorrectly.

Examples

nchar_ctl("\033[31m123\a\r")
## with some wide characters
cn.string <- sprintf("\033[31m%s\a\r", "\u4E00\u4E01\u4E03")
nchar_ctl(cn.string)
nchar_ctl(cn.string, type='width')

## Remember newlines are not counted by default
nchar_ctl("\t\n\r")

## The 'c0' value for the `ctl` argument does not include
## newlines.
nchar_ctl("\t\n\r", ctl="c0")
nchar_ctl("\t\n\r", ctl=c("c0", "nl"))

## The _sgr flavor only treats SGR sequences as zero width
nchar_sgr("\033[31m123")
nchar_sgr("\t\n\n123")

## All of the following are Control Sequences or C0 controls
nzchar_ctl("\n\033[42;31m\033[123P\a")