nchar_ctl counts all non Control Sequence characters.
nzchar_ctl returns TRUE for each input vector element that has non Control
Sequence characters. By default newlines and other C0 control characters are
not counted.
nchar_ctl(
x,
type = "chars",
allowNA = FALSE,
keepNA = NA,
ctl = "all",
warn = getOption("fansi.warn", TRUE),
strip
)

nzchar_ctl(
x,
keepNA = FALSE,
ctl = "all",
warn = getOption("fansi.warn", TRUE)
)
Like base::nchar, with Control Sequences excluded.
x: a character vector or object that can be coerced to such.
type: character(1L), partial matching c("chars", "width", "graphemes"),
although types other than "chars" only work correctly with R >= 3.2.2.
See ?nchar.
allowNA: logical: should NA be returned for invalid multibyte strings or
"bytes"-encoded strings (rather than throwing an error)?
keepNA: logical: should NA be returned when x is NA? If false, nchar()
returns 2, as that is the number of printing characters used when strings are
written to output, and nzchar() is TRUE. The default for nchar(), NA, means
to use keepNA = TRUE unless type is "width".
ctl: character, which Control Sequences should be treated specially. Special
treatment is context dependent, and may include detecting them and/or
computing their display/character width as zero. For the SGR subset of the
ANSI CSI sequences, and OSC hyperlinks, fansi will also parse, interpret, and
reapply the sequences as needed. You can modify whether a Control Sequence is
treated specially with the ctl parameter. Possible values:
"nl": newlines.
"c0": all other "C0" control characters (i.e. 0x01-0x1f, 0x7F), except for newlines and the actual ESC (0x1B) character.
"sgr": ANSI CSI SGR sequences.
"csi": all non-SGR ANSI CSI sequences.
"url": OSC hyperlinks.
"osc": all non-OSC-hyperlink OSC sequences.
"esc": all other escape sequences.
"all": all of the above, except when used in combination with any of the above, in which case it means "all but".
warn: TRUE (default) or FALSE, whether to warn when potentially problematic
Control Sequences are encountered. These could cause the assumptions fansi
makes about how strings are rendered on your display to be incorrect, for
example by moving the cursor (see ?fansi). At most one warning will be issued
per element in each input vector. Will also warn about some badly encoded
UTF-8 strings, but a lack of UTF-8 warnings is not a guarantee of correct
encoding (use validUTF8 for that).
strip: character, deprecated in favor of ctl.
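The NA handling described for the arguments above follows base R, so it can be checked without fansi. A minimal sketch using only base::nchar and base::nzchar; the variable name na.string is illustrative and not part of any API:

```r
## Base R only; `na.string` is an illustrative name.
na.string <- NA_character_

nchar(na.string)                   # keepNA = NA with type = "chars": NA
nchar(na.string, keepNA = FALSE)   # 2, the printed width of "NA"
nchar(na.string, type = "width")   # 2, keepNA = NA acts like FALSE for "width"
nzchar(na.string)                  # TRUE, nzchar() defaults to keepNA = FALSE
nzchar(na.string, keepNA = TRUE)   # NA
```

The usage block shows the same defaults for the fansi functions (keepNA = NA for nchar_ctl, keepNA = FALSE for nzchar_ctl), so the same distinctions should be expected there.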
Control Sequences are non-printing characters or sequences of characters.
Special Sequences are a subset of the Control Sequences, and include CSI
SGR sequences which can be used to change rendered appearance of text, and
OSC hyperlinks. See fansi for details.
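Which Control Sequences are excluded from the count is governed by the ctl argument, including its "all but" form. A small sketch, assuming fansi is installed; the string x is made up for illustration:

```r
library(fansi)

## Two letters, a newline, two letters, wrapped in SGR color codes.
x <- "\033[31mab\ncd\033[0m"

## Default ctl = "all": the SGR sequences and the newline are not counted.
n.default <- nchar_ctl(x)

## "all" combined with "nl" means "all but newlines", so the newline
## is counted as an ordinary character.
n.keep.nl <- nchar_ctl(x, ctl = c("all", "nl"))

## Naming the classes explicitly; this string contains only SGR
## sequences and a newline, so the result matches the default.
n.explicit <- nchar_ctl(x, ctl = c("sgr", "nl"))

c(n.default, n.keep.nl, n.explicit)
```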
Several factors could affect the exact output produced by fansi functions
across versions of fansi, R, and/or across systems. In general it is best not
to rely on exact fansi output, e.g. by embedding it in tests.
Width and grapheme calculations depend on locale, Unicode database version,
and grapheme processing logic (which is still in development), among other
things. For the most part fansi (currently) uses the internals of
base::nchar(type='width'), but there are exceptions and this may change in
the future.
How a particular display format is encoded in Control Sequences is not
guaranteed to be stable across fansi versions. Additionally, which Special
Sequences are re-encoded vs transcribed untouched may change. In general we
will strive to keep the rendered appearance stable.
To maximize the odds of getting stable output set normalize_state to TRUE and
type to "chars" in functions that allow it, and set term.cap to a specific
set of capabilities.
fansi approximates grapheme widths and counts by using heuristics for
grapheme breaks that work for most common graphemes, including emoji
combining sequences. The heuristic is known to work incorrectly with invalid
combining sequences, prepending marks, and sequence interruptors. fansi does
not provide a full implementation of grapheme break detection to avoid
carrying a copy of the Unicode grapheme breaks table, and also because the
hope is that R will add the feature eventually itself.
The utf8 package provides a conforming grapheme parsing implementation.
nchar_ctl and nzchar_ctl are implemented in statically compiled code, so in
particular nzchar_ctl will be much faster than the otherwise equivalent
nzchar(strip_ctl(...)).
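The equivalence mentioned above can be checked directly. A minimal sketch, assuming fansi is installed; the input vector and the names fast and slow are illustrative only:

```r
library(fansi)

## A bare SGR sequence, colored text, an empty string, and a lone newline.
x <- c("\033[31m", "\033[31mhello\033[0m", "", "\n")

fast <- nzchar_ctl(x)           # one call into compiled code
slow <- nzchar(strip_ctl(x))    # strip Control Sequences, then test

## Both approaches should agree element by element.
all(fast == slow)
```

Only the element containing "hello" has non Control Sequence characters, so it is the only one for which either approach returns TRUE.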
These functions will warn if malformed escape or UTF-8 sequences are encountered, as they may be incorrectly interpreted.
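Because a lack of UTF-8 warnings is not a guarantee of correct encoding, inputs can be screened beforehand with base R's validUTF8. A brief sketch using base R only; the vector x and the name ok are illustrative:

```r
## Base R only. The last element is a single byte that is not valid UTF-8.
x <- c("abc", "\u4E00", "\xff")

ok <- validUTF8(x)   # TRUE TRUE FALSE
ok

## Keep only the elements that are safe to pass on for counting.
x.valid <- x[ok]
length(x.valid)      # 2
```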
?fansi for details on how Control Sequences are interpreted, particularly if
you are getting unexpected results; unhandled_ctl for detecting bad control
sequences.
nchar_ctl("\033[31m123\a\r")
## with some wide characters
cn.string <- sprintf("\033[31m%s\a\r", "\u4E00\u4E01\u4E03")
nchar_ctl(cn.string)
nchar_ctl(cn.string, type='width')
## Remember newlines are not counted by default
nchar_ctl("\t\n\r")
## The 'c0' value for the `ctl` argument does not include
## newlines.
nchar_ctl("\t\n\r", ctl="c0")
nchar_ctl("\t\n\r", ctl=c("c0", "nl"))
## The _sgr flavor only treats SGR sequences as zero width
nchar_sgr("\033[31m123")
nchar_sgr("\t\n\n123")
## All of the following are Control Sequences or C0 controls
nzchar_ctl("\n\033[42;31m\033[123P\a")