stri_locate_all_boundaries: Locate Specific Text Boundaries

Description

These functions locate specific text boundaries (like character, word, line, or sentence boundaries). stri_locate_all_* locate all the matches. On the other hand, stri_locate_first_* and stri_locate_last_* give the first or the last matches, respectively.

Usage

stri_locate_all_boundaries(str, omit_no_match = FALSE, ...,
  opts_brkiter = NULL)
stri_locate_last_boundaries(str, ..., opts_brkiter = NULL)
stri_locate_first_boundaries(str, ..., opts_brkiter = NULL)
stri_locate_all_words(str, omit_no_match = FALSE, locale = NULL)
stri_locate_last_words(str, locale = NULL)
stri_locate_first_words(str, locale = NULL)

Arguments

str

character vector or an object coercible to one

omit_no_match

single logical value; if FALSE, then a row of two missing values will indicate that there are no text boundaries

...

additional settings for opts_brkiter

opts_brkiter

a named list with ICU BreakIterator's settings as generated with stri_opts_brkiter; NULL for default break iterator, i.e. line_break

locale

NULL or "" for text boundary analysis following the conventions of the default locale, or a single string with locale identifier, see stringi-locale
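As an illustration of how the `...` and opts_brkiter arguments relate, the sketch below passes the same break iterator setting in both ways. The sample string is made up for this example, and the snippet assumes that stringi is installed; it is not taken from the package's own examples.

```r
library(stringi)

# Hypothetical sample text with two sentences.
x <- "First sentence. Second one."

# Setting passed directly through `...`
via_dots <- stri_locate_all_boundaries(x, type = "sentence")

# The same setting wrapped in a list built by stri_opts_brkiter()
via_opts <- stri_locate_all_boundaries(
  x, opts_brkiter = stri_opts_brkiter(type = "sentence")
)

# Both calls should describe the same two sentence spans.
identical(via_dots, via_opts)
via_dots[[1]]
```

Passing settings through `...` is shorter for one-off calls, while a stored stri_opts_brkiter() list is convenient when the same settings are reused across several calls.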

Value

For stri_locate_all_*, a list of length(str) integer matrices is returned. The first column gives the start positions of substrings between located boundaries, and the second column gives the end positions. The indices are code point-based, thus they may be passed e.g. to the stri_sub function. Moreover, you may get two NAs in one row for no match (if omit_no_match is FALSE) or NA arguments.
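A minimal sketch of the return shape described above, assuming stringi is installed. The input vector is invented for illustration. It shows one matrix per input string, a row of NAs for a missing input, and the positions being handed back to stri_sub.

```r
library(stringi)

# Hypothetical input: one ordinary string and one missing value.
x <- c("stringi is fast", NA)

pos <- stri_locate_all_words(x)
length(pos)   # one matrix per element of x
pos[[1]]      # one row per word: start and end positions
pos[[2]]      # a single row of NAs, because the input is NA

# The positions are code point-based, so they can be reused by stri_sub.
words <- stri_sub(x[1], pos[[1]][, 1], pos[[1]][, 2])
words

# With omit_no_match = TRUE, a string without any words is expected to
# yield a matrix with no rows instead of a row of NAs.
stri_locate_all_words("   ", omit_no_match = TRUE)
```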

stri_locate_first_* and stri_locate_last_*, on the other hand, return an integer matrix with two columns, giving the start and end positions of the first or the last matches, respectively, and two NAs if and only if they are not found.
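For comparison with the list-returning variants, here is a short sketch of the first and last functions. The input is a made-up example and stringi is assumed to be installed. Each input string contributes exactly one row to the result.

```r
library(stringi)

# Hypothetical input: a three-word string and a missing value.
x <- c("one two three", NA)

first <- stri_locate_first_words(x)
last  <- stri_locate_last_words(x)

first  # row 1 covers "one"; row 2 is NA, NA
last   # row 1 covers "three"; row 2 is NA, NA

# Extract the located words; the NA input stays NA.
stri_sub(x, first[, 1], first[, 2])
stri_sub(x, last[, 1], last[, 2])
```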

Details

Vectorized over str.

For more information on the text boundary analysis performed by ICU's BreakIterator, see stringi-search-boundaries.

In the case of stri_locate_*_words, just like in stri_extract_all_words and stri_count_words, ICU's word BreakIterator is used to locate word boundaries, and all non-word characters (UBRK_WORD_NONE rule status) are ignored. These functions are equivalent to a call to stri_locate_*_boundaries(str, type="word", skip_word_none=TRUE, locale=locale).
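The equivalence stated above can be checked with a small sketch, assuming stringi is installed. The sample sentence is chosen only for illustration. The last call shows what changes when non-word segments are not skipped.

```r
library(stringi)

# Hypothetical sample sentence.
x <- "The above-mentioned features are very useful."

words      <- stri_locate_all_words(x)
boundaries <- stri_locate_all_boundaries(x, type = "word",
                                         skip_word_none = TRUE)

# The two calls are documented as equivalent.
identical(words, boundaries)

# Without skip_word_none, spaces and punctuation are reported as
# segments too, so the matrix has more rows.
with_none <- stri_locate_all_boundaries(x, type = "word",
                                        skip_word_none = FALSE)
nrow(with_none[[1]]) > nrow(words[[1]])
```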

Examples

# NOT RUN {
library(stringi)
test <- "The\u00a0above-mentioned    features are very useful. Warm thanks to their developers."
stri_locate_all_boundaries(test, type="line")
stri_locate_all_boundaries(test, type="word")
stri_locate_all_boundaries(test, type="sentence")
stri_locate_all_boundaries(test, type="character")
stri_locate_all_words(test)

stri_extract_all_boundaries("Mr. Jones and Mrs. Brown are very happy.
So am I, Prof. Smith.", type="sentence", locale="en_US@ss=standard") # ICU >= 56 only

# }
