stri_split_boundaries: Split a String at Text Boundaries

Description

This function locates text boundaries (like character, word, line, or sentence boundaries) and splits strings at the indicated positions.

Usage

stri_split_boundaries(
  str,
  n = -1L,
  tokens_only = FALSE,
  simplify = FALSE,
  ...,
  opts_brkiter = NULL
)

Value

If simplify=FALSE (the default), then the functions return a list of character vectors.

Otherwise, stri_list2matrix with byrow=TRUE

and n_min=n arguments is called on the resulting object. In such a case, a character matrix with length(str) rows is returned. Note that stri_list2matrix's fill

argument is set to an empty string and NA, for simplify equal to TRUE and NA, respectively.

Arguments

str: character vector or an object coercible to
n: integer vector, maximal number of strings to return
tokens_only: single logical value; may affect the result if n is positive, see Details
simplify: single logical value; if TRUE or NA, then a character matrix is returned; otherwise (the default), a list of character vectors is given, see Value
...: additional settings for opts_brkiter
opts_brkiter: a named list with ICU BreakIterator's settings, see stri_opts_brkiter; NULL for the default break iterator, i.e., line_break

Author

Marek Gagolewski and other contributors

Details

Vectorized over str and n.

If n is negative (the default), then all text pieces are extracted.

Otherwise, if tokens_only is FALSE (which is the default), then n-1 tokens are extracted (if possible) and the n-th string gives the (non-split) remainder (see Examples). On the other hand, if tokens_only is TRUE, then only full tokens (up to n pieces) are extracted.

For more information on text boundary analysis performed by ICU's BreakIterator, see stringi-search-boundaries.

Examples

Run this code

test <- 'The\u00a0above-mentioned    features are very useful. ' %s+%
   'Spam, spam, eggs, bacon, and spam. 123 456 789'
stri_split_boundaries(test, type='line')
stri_split_boundaries(test, type='word')
stri_split_boundaries(test, type='word', skip_word_none=TRUE)
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_letter=TRUE)
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_number=TRUE)
stri_split_boundaries(test, type='sentence')
stri_split_boundaries(test, type='sentence', skip_sentence_sep=TRUE)
stri_split_boundaries(test, type='character')

# a filtered break iterator with the new ICU:
stri_split_boundaries('Mr. Jones and Mrs. Brown are very happy.
So am I, Prof. Smith.', type='sentence', locale='en_US@ss=standard') # ICU >= 56 only

Run the code above in your browser using DataLab