substr_ctl: ANSI Control Sequence Aware Version of substr

Description

substr_ctl is a drop-in replacement for substr. Performance is slightly slower than substr. ANSI CSI SGR sequences will be included in the substrings to reflect the format of the substring when it was embedded in the source string. Additionally, other Control Sequences specified in ctl are treated as zero-width.

Usage

substr_ctl(x, start, stop, warn = getOption("fansi.warn"),
  term.cap = getOption("fansi.term.cap"), ctl = "all")
substr2_ctl(x, start, stop, type = "chars", round = "start",
  tabs.as.spaces = getOption("fansi.tabs.as.spaces"),
  tab.stops = getOption("fansi.tab.stops"),
  warn = getOption("fansi.warn"),
  term.cap = getOption("fansi.term.cap"), ctl = "all")
substr_sgr(x, start, stop, warn = getOption("fansi.warn"),
  term.cap = getOption("fansi.term.cap"))
substr2_sgr(x, start, stop, type = "chars", round = "start",
  tabs.as.spaces = getOption("fansi.tabs.as.spaces"),
  tab.stops = getOption("fansi.tab.stops"),
  warn = getOption("fansi.warn"),
  term.cap = getOption("fansi.term.cap"))

Arguments

x

a character vector or object that can be coerced to character.

start

integer. The first element to be extracted.

stop

integer. The last element to be extracted.

warn

TRUE (default) or FALSE, whether to warn when potentially problematic Control Sequences are encountered. These could cause the assumptions fansi makes about how strings are rendered on your display to be incorrect, for example by moving the cursor (see fansi).

term.cap

character, a vector of the capabilities of the terminal. Can be any combination of "bright" (SGR codes 90-97, 100-107), "256" (SGR codes starting with "38;5" or "48;5"), and "truecolor" (SGR codes starting with "38;2" or "48;2"). Changing this parameter changes how fansi interprets escape sequences, so you should ensure that it matches your terminal capabilities. See term_cap_test for details.

ctl

character, which Control Sequences should be treated specially. See the "_ctl vs. _sgr" section for details.

"nl": newlines.
"c0": all other "C0" control characters (i.e. 0x01-0x1f, 0x7F), except for newlines and the actual ESC (0x1B) character.
"sgr": ANSI CSI SGR sequences.
"csi": all non-SGR ANSI CSI sequences.
"esc": all other escape sequences.
"all": all of the above, except when used in combination with any of the above, in which case it means "all but".

type

character(1L) partial matching c("chars", "width"), although type="width" only works correctly with R >= 3.2.2.

round

character(1L) partial matching c("start", "stop", "both", "neither"), controls how to resolve ambiguities when a start or stop value in "width" type mode falls within a multi-byte character or a wide display character. See details.

tabs.as.spaces

FALSE (default) or TRUE, whether to convert tabs to spaces. This can only be set to TRUE if strip.spaces is FALSE.

tab.stops

integer(1:n) indicating position of tab stops to use when converting tabs to spaces. If there are more tabs in a line than defined tab stops the last tab stop is re-used. For the purposes of applying tab stops, each input line is considered a line and the character count begins from the beginning of the input line.

_ctl vs. _sgr

The *_ctl versions of the functions treat all Control Sequences specially by default. Special treatment is context dependent, and may include detecting them and/or computing their display/character width as zero. For the SGR subset of the ANSI CSI sequences, fansi will also parse, interpret, and reapply the text styles they encode if needed. You can modify whether a Control Sequence is treated specially with the ctl parameter. You can exclude a type of Control Sequence from special treatment by combining "all" with that type of sequence (e.g. ctl=c("all", "nl") for special treatment of all Control Sequences but newlines). The *_sgr versions only treat ANSI CSI SGR sequences specially, and are equivalent to the *_ctl versions with the ctl parameter set to "sgr".
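The stated equivalence between the *_sgr functions and the *_ctl functions with ctl = "sgr" can be checked directly. The following is a minimal sketch that assumes the fansi package is installed; the input string and the variable names are illustrative only, and warn = FALSE is set merely to keep the output quiet.

```r
# Illustrative input: red text containing a tab (a C0 control character).
x <- "\033[31mhello\tworld"

sgr.same <- NA
if (requireNamespace("fansi", quietly = TRUE)) {
  # Only SGR sequences are treated specially in both calls.
  a <- fansi::substr_sgr(x, 1, 6, warn = FALSE)
  b <- fansi::substr_ctl(x, 1, 6, ctl = "sgr", warn = FALSE)
  sgr.same <- identical(a, b)

  # "all" combined with "c0" means "all but C0", so the tab is
  # no longer treated as a zero-width Control Sequence here.
  d <- fansi::substr_ctl(x, 1, 6, ctl = c("all", "c0"), warn = FALSE)

  print(sgr.same)
  writeLines(c(a, d))
}
```

If the documentation above holds, sgr.same should be TRUE; the exact escape sequences in the results may vary between fansi versions.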

Details

substr2_ctl and substr2_sgr add the ability to retrieve substrings based on display width and byte width, in addition to the normal character width. substr2_ctl also provides the option to convert tabs to spaces with tabs_as_spaces prior to taking substrings.

Because exact substrings on anything other than character width cannot be guaranteed (e.g. as a result of multi-byte encodings, or double display-width characters), substr2_ctl must make assumptions about how to resolve start/stop values that are infeasible, and it does so via the round parameter.

If we use "start" as the round value, then whenever the start value falls in the middle of a multi-byte or wide character, that character is included in the substring, while a character that is only partially covered by the stop value is left out. The converse is true if we use "stop" as the round value. "neither" causes all partial characters to be dropped, irrespective of whether they correspond to start or stop, and "both" causes all of them to be included.
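The effect of round is easiest to see on a string of wide characters, where character count and display width differ. The sketch below first uses base R to show that difference, then, assuming fansi is installed, reports the display width of the substring returned for each round value. The variable names are illustrative, and no specific output is asserted since it may depend on the fansi version.

```r
# Three CJK ideographs: one character each, but wider than one column.
wide <- "\u4E00\u4E01\u4E03"
n.chars <- nchar(wide, type = "chars")
n.width <- nchar(wide, type = "width")
cat("chars:", n.chars, " width:", n.width, "\n")

if (requireNamespace("fansi", quietly = TRUE)) {
  cn.string <- paste0("\033[42m", wide, "\033[m")
  for (r in c("start", "stop", "both", "neither")) {
    # Width positions 2 and 3 each fall inside a wide character.
    res <- fansi::substr2_ctl(cn.string, 2, 3, type = "width", round = r)
    # Strip Control Sequences so that only the visible text is measured.
    w <- nchar(fansi::strip_ctl(res), type = "width")
    cat(sprintf("%-8s display width: %d\n", r, w))
  }
}
```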

These functions map string lengths accounting for ANSI CSI SGR sequence semantics to the naive length calculations, and then use the mapping in conjunction with base::substr() to extract the string. This concept is borrowed directly from Gábor Csárdi's crayon package, although the implementation of the calculation is different.
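The mapping idea can be illustrated in base R alone. This is a simplified sketch, not the implementation used by fansi or crayon: visible_to_raw is a hypothetical helper that handles only SGR sequences matching a simple pattern, and it does not reapply styles the way substr_ctl does.

```r
# Hypothetical helper (not part of fansi): for each visible character,
# return its position in the raw string that still contains SGR sequences.
visible_to_raw <- function(x) {
  sgr <- gregexpr("\033\\[[0-9;]*m", x)[[1]]
  is.ctl <- logical(nchar(x, type = "chars"))
  if (sgr[1] != -1) {
    len <- attr(sgr, "match.length")
    for (i in seq_along(sgr)) {
      # Mark every raw position covered by this SGR sequence.
      is.ctl[seq(sgr[i], length.out = len[i])] <- TRUE
    }
  }
  which(!is.ctl)
}

x <- "\033[42mhello\033[m world"
map <- visible_to_raw(x)

# Visible characters 3 through 9, extracted with base::substr() via the map.
naive <- substr(x, map[3], map[9])
print(map)
print(naive)
```

Note that the naive result keeps the closing sequence but loses the opening "\033[42m", which is the kind of style state that substr_ctl is described as parsing and reapplying.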

Examples

# NOT RUN {
substr_ctl("\033[42mhello\033[m world", 1, 9)
substr_ctl("\033[42mhello\033[m world", 3, 9)

## Width 2 and 3 are in the middle of an ideogram as
## start and stop positions respectively, so we control
## what we get with `round`

cn.string <- paste0("\033[42m", "\u4E00\u4E01\u4E03", "\033[m")

substr2_ctl(cn.string, 2, 3, type='width')
substr2_ctl(cn.string, 2, 3, type='width', round='both')
substr2_ctl(cn.string, 2, 3, type='width', round='start')
substr2_ctl(cn.string, 2, 3, type='width', round='stop')

## the _sgr variants treat only CSI SGR sequences as special;
## compare the following:

substr_sgr("\033[31mhello\tworld", 1, 6)
substr_ctl("\033[31mhello\tworld", 1, 6)
substr_ctl("\033[31mhello\tworld", 1, 6, ctl=c('all', 'c0'))
# }