fansi: Details About Manipulation of Strings Containing Control Sequences

Description

Counterparts to R string manipulation functions that account for the effects of ANSI text formatting control sequences.

Arguments

Control Characters and Sequences

Control characters and sequences are non-printing inline characters that can be used to modify terminal display and behavior, for example by changing text color or cursor position.

We will refer to ANSI control characters and sequences as "Control Sequences" hereafter.

There are three types of Control Sequences that fansi can treat specially:

"C0" control characters, such as tabs and carriage returns (we include delete in this set, even though technically it is not part of it).
Sequences starting in "ESC[", also known as ANSI CSI sequences.
Sequences starting in "ESC" and followed by something other than "[".

Control Sequences starting with ESC are assumed to be two characters long (including the ESC) unless they are of the CSI variety, in which case their length is computed as per the ECMA-48 specification. There are non-CSI escape sequences that may be longer than two characters, but fansi will (incorrectly) treat them as if they were two characters long.

In theory it is possible to encode Control Sequences with a single byte introducing character in the 0x40-0x5F range instead of the traditional "ESC[". Since this is rare and it conflicts with UTF-8 encoding, we do not support it.

The special treatment of Control Sequences is to compute their display/character width as zero. For the SGR subset of the ANSI CSI sequences, fansi will also parse, interpret, and reapply the text styles they encode as needed. Whether a particular type of Control Sequence is treated specially can be specified via the ctl parameter to the fansi functions that have it.

ANSI CSI SGR Control Sequences

NOTE: not all displays support ANSI CSI SGR sequences; run term_cap_test to see whether your display supports them.

ANSI CSI SGR Control Sequences are the subset of CSI sequences that can be used to change text appearance (e.g. color). These sequences begin with "ESC[" and end in "m". fansi interprets these sequences and writes new ones to the output strings in such a way that the original formatting is preserved. In most cases this should be transparent to the user.

Occasionally there may be mismatches between how fansi and a display interpret the CSI SGR sequences, which may produce display artifacts. The most likely source of artifacts are Control Sequences that move the cursor or change the display, or that fansi otherwise fails to interpret, such as:

Unknown SGR substrings.
"C0" control characters like tabs and carriage returns.
Other escape sequences.

Another possible source of problems is that different displays parse and interpret control sequences differently. The common CSI SGR sequences that you are likely to encounter in formatted text tend to be treated consistently, but less common ones are not. fansi tries to hew by the ECMA-48 specification for CSI control sequences, but not all terminals do.

The most likely source of problems will be 24-bit CSI SGR sequences. For example, a 24-bit color sequence such as "ESC[38;2;31;42;4" is a single foreground color to a terminal that supports it, or separate foreground, background, faint, and underline specifications for one that does not. To mitigate this particular problem you can tell fansi what your terminal capabilities are via the term.cap parameter or the "fansi.term.cap" global option, although fansi does try to detect them by default.

fansi will will warn if it encounters Control Sequences that it cannot interpret or that might conflict with terminal capabilities. You can turn off warnings via the warn parameter or via the "fansi.warn" global option.

fansi can work around "C0" tab control characters by turning them into spaces first with tabs_as_spaces or with the tabs.as.spaces parameter available in some of the fansi functions.

We chose to interpret ANSI CSI SGR sequences because this reduces how much string transcription we need to do during string manipulation. If we do not interpret the sequences then we need to record all of them from the beginning of the string and prepend all the accumulated tags up to beginning of a substring to the substring. In many case the bulk of those accumulated tags will be irrelevant as their effects will have been superseded by subsequent tags.

fansi assumes that ANSI CSI SGR sequences should be interpreted in cumulative "Graphic Rendition Combination Mode". This means new SGR sequences add to rather than replace previous ones, although in some cases the effect is the same as replacement (e.g. if you have a color active and pick another one).

Encodings / UTF-8

fansi will convert any non-ASCII strings to UTF-8 before processing them, and fansi functions that return strings will return them encoded in UTF-8. In some cases this will be different to what base R does. For example, substr re-encodes substrings to their original encoding.

Interpretation of UTF-8 strings is intended to be consistent with base R. There are three ways things may not work out exactly as desired:

fansi, despite its best intentions, handles a UTF-8 sequence differently to the way R does.
R incorrectly handles a UTF-8 sequence.
Your display incorrectly handles a UTF-8 sequence.

These issues are most likely to occur with invalid UTF-8 sequences, combining character sequences, and emoji. For example, as of this writing R (and the OSX terminal) consider emojis to be one wide characters, when in reality they are two wide. Do not expect the fansi width calculations to to work correctly with strings containing emoji.

Internally, fansi computes the width of every UTF-8 character sequence outside of the ASCII range using the native R_nchar function. This will cause such characters to be processed slower than ASCII characters. Additionally, fansi character width computations can differ from R width computations despite the use of R_nchar. fansi always computes width for each character individually, which assumes that the sum of the widths of each character is equal to the width of a sequence. However, it is theoretically possible for a character sequence that forms a single grapheme to break that assumption. In informal testing we have found this to be rare because in the most common multi-character graphemes the trailing characters are computed as zero width.

As of R 3.4.0 substr appears to use UTF-8 character byte sizes as indicated by the leading byte, irrespective of whether the subsequent bytes lead to a valid sequence. Additionally, UTF-8 byte sequences as long as 5 or 6 bytes may be allowed, which is likely a holdover from older Unicode versions. fansi mimics this behavior. It is likely substr will start failing with invalid UTF-8 byte sequences with R 3.6.0 (as per SVN r74488). In general, you should assume that fansi may not replicate base R exactly when there are illegal UTF-8 sequences present.

Our long term objective is to implement proper UTF-8 character width computations, but for simplicity and also because R and our terminal do not do it properly either we are deferring the issue for now.

R < 3.2.2 support

Nominally you can build and run this package in R versions between 3.1.0 and 3.2.1. Things should mostly work, but please be aware we do not run the test suite under versions of R less than 3.2.2. One key degraded capability is width computation of wide-display characters. Under R < 3.2.2 fansi will assume every character is 1 display width. Additionally, fansi may not always report malformed UTF-8 sequences as it usually does. One exception to this is nchar_ctl as that is just a thin wrapper around base::nchar.

Miscellaneous

The native code in this package assumes that all strings are NULL terminated and no longer than (32 bit) INT_MAX (excluding the NULL). This should be a safe assumption since the code is designed to work with STRSXPs and CHRSXPs. Behavior is undefined and probably bad if you somehow manage to provide to fansi strings that do not adhere to these assumptions.