stri_split_lines: Split a String into Text Lines

Description

These functions split each character string into text lines.

Usage

stri_split_lines(str, n_max = -1L, omit_empty = FALSE)
stri_split_lines1(str)

Arguments

str

character vector

n_max

integer vector; maximum number of pieces to return

omit_empty

logical vector; determines whether empty strings should be removed from the result

Value

stri_split_lines returns a list of character vectors. If an input string is NA, then the corresponding list element is an NA string.
stri_split_lines1(str) is like stri_split_lines(str[1])[[1]] (with default parameters), so it returns a character vector. In addition, if the input string ends with a newline sequence, the trailing empty string is omitted from the result. This makes the function convenient for splitting a loaded text file into lines.
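
The difference between the two return values can be illustrated with a short sketch. It assumes the stringi package is installed; the sample strings and variable names are made up for illustration.

```r
library(stringi)

# Hypothetical input: one string ending with a newline, plus a missing value.
x <- c("alpha\nbeta\n", NA)

# stri_split_lines() gives one list element per input string.
# The trailing newline should leave a final empty string in the first element,
# and the NA input should map to an NA string.
res <- stri_split_lines(x)
print(res)

# stri_split_lines1() works on a single string and returns a plain
# character vector, without the trailing empty string.
first_lines <- stri_split_lines1(x[1])
print(first_lines)
```

Under these assumptions, res[[1]] should have three elements ("alpha", "beta", "") while first_lines should have only two.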

Details

Vectorized over str, n_max, and omit_empty.

If n_max is negative (default), then all pieces are extracted.

omit_empty is applied during splitting: if set to TRUE, then empty strings will never appear in the resulting vector.
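
As a minimal sketch of the effect of omit_empty (assuming stringi is installed; the sample text is hypothetical and n_max is left at its default):

```r
library(stringi)

# Hypothetical text with a blank line in the middle and a trailing newline.
text <- "first\n\nsecond\n"

# Default: empty lines are kept, including the one after the final newline.
kept <- stri_split_lines(text)[[1]]
print(kept)

# omit_empty = TRUE: no empty strings should remain in the result.
dropped <- stri_split_lines(text, omit_empty = TRUE)[[1]]
print(dropped)
```

Here kept is expected to contain two empty strings, and dropped only "first" and "second".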

Different platforms represent newlines in different ways, e.g., by carriage return (CR, 0x0D), line feed (LF, 0x0A), CRLF, or next line (NEL, 0x85). Moreover, the Unicode Standard defines two unambiguous separator characters, Paragraph Separator (PS, 0x2029) and Line Separator (LS, 0x2028). Vertical tab (VT, 0x0B) and form feed (FF, 0x0C) are sometimes used as well.

This function follows the UTS #18 rules, where a newline sequence corresponds to the following regular expression: (?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}]). Each match acts as a separator between text lines. For efficiency, the search is not actually performed with a regex engine.
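
To make the rule concrete, the following sketch compares stri_split_lines with a regex-based split via stri_split_regex. The pattern below is an assumption: a simplified ICU-syntax rendering of the expression above that relies on the CRLF alternative being tried first instead of using the negative lookahead, which should behave the same in a plain split. The sample string is hypothetical.

```r
library(stringi)

# Hypothetical input mixing CRLF, Line Separator (U+2028), vertical tab, and CR.
s <- "a\r\nb\u2028c\vd\re"

# Assumed simplification of the newline rule: CRLF first, then any single
# character from U+000A..U+000D, NEL, LS, or PS.
newline_rx <- "(?:\\u000D\\u000A|[\\u000A-\\u000D\\u0085\\u2028\\u2029])"

by_lines <- stri_split_lines(s)[[1]]
by_regex <- stri_split_regex(s, newline_rx)[[1]]

print(by_lines)
print(by_regex)
```

Both calls are expected to yield the same five pieces; in particular, CRLF should count as a single separator rather than producing an empty string between CR and LF.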

References

Unicode Newline Guidelines -- Unicode Technical Report #13, http://www.unicode.org/standard/reports/tr13/tr13-5.html

Unicode Regular Expressions -- Unicode Technical Standard #18, http://www.unicode.org/reports/tr18/