stri_split_boundaries: Split a String at Specific Text Boundaries

Description

This function locates specific text boundaries (like character, word, line, or sentence boundaries) and splits strings at the indicated positions.

Usage

stri_split_boundaries(str, boundary = "line-break", locale = NULL)

Arguments

str

character vector or an object coercible to one

boundary

character vector; each string is one of "character", "line-break", "sentence", or "word"

locale

NULL or "" for text boundary analysis following the conventions of the default locale, or a single string with a locale identifier; see stringi-locale.

Value

Returns a list of character vectors.

Details

Vectorized over str and boundary.
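
As a minimal sketch of this vectorization, assuming the argument names shown under Usage on this page (other stringi releases may name them differently), each element of str is paired with the corresponding element of boundary. The sample strings are made up for illustration.

```r
library(stringi)

# Two illustrative input strings (not from the stringi documentation).
x <- c("First sentence. Second sentence.", "one two three")

# Assumes the signature documented on this page:
#   stri_split_boundaries(str, boundary = "line-break", locale = NULL)
# x[1] is split at sentence boundaries, x[2] at word boundaries.
res <- stri_split_boundaries(x, boundary = c("sentence", "word"))

# The result is a list with one character vector per input string.
str(res)
```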

Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text. Examples of this process include:

Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing.
Counting characters, words, sentences, or paragraphs.
Making a list of the unique words in a document.
Capitalizing the first letter of each word.
Locating a particular unit of the text (for example, finding the third word in the document).

This function uses ICU's BreakIterator to split the given strings at specific boundaries. The character boundary iterator tries to match what a user would think of as a "character" (a basic unit of a writing system for a language), which may be more than a single Unicode code point. The word boundary iterator locates the boundaries of words, for purposes such as "Find whole words" operations. The line-break iterator locates positions that would be appropriate points to wrap lines when displaying the text. The sentence boundary iterator locates sentence boundaries.
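
To compare the four iterators described above, the following sketch splits one sample string with each boundary type and reports how many pieces come back. It assumes the signature shown under Usage on this page; the sample string and the variable names are illustrative only, and the exact pieces depend on the ICU version and locale in use.

```r
library(stringi)

# Illustrative sample text (not from the stringi documentation).
x <- "The quick brown fox. It jumped over the lazy dog!"

# Boundary types as listed in the Arguments section of this page.
types <- c("character", "word", "line-break", "sentence")

# Split the same string with each iterator in turn.
# Assumes the documented argument name `boundary`.
pieces <- lapply(types, function(b) stri_split_boundaries(x, boundary = b)[[1]])
names(pieces) <- types

# Report the number of pieces produced by each boundary type.
for (b in types) {
    cat(b, ": ", length(pieces[[b]]), " pieces\n", sep = "")
}
```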

For technical details on the different classes of text boundaries, see the ICU User Guide listed under References. For extracting individual words from the text using a BreakIterator, see stri_extract_words.

References

Boundary Analysis -- ICU User Guide, http://userguide.icu-project.org/boundaryanalysis

Other locale_sensitive: %!==%, %!=%, %<=%, %<%, %===%, %==%, %>=%, %>%, %stri!==%, %stri!=%, %stri<=%, %stri<%, %stri===%, %stri==%, %stri>=%, %stri>%; stri_cmp, stri_cmp_eq, stri_cmp_equiv, stri_cmp_ge, stri_cmp_gt, stri_cmp_le, stri_cmp_lt, stri_cmp_neq, stri_cmp_nequiv, stri_compare; stri_count_coll; stri_detect_coll; stri_duplicated, stri_duplicated_any; stri_enc_detect2; stri_extract_all_coll, stri_extract_first_coll, stri_extract_last_coll; stri_extract_words; stri_locate_all_coll, stri_locate_first_coll, stri_locate_last_coll; stri_locate_boundaries; stri_locate_words; stri_opts_collator; stri_order, stri_sort; stri_replace_all_coll, stri_replace_first_coll, stri_replace_last_coll; stri_split_coll; stri_trans_tolower, stri_trans_totitle, stri_trans_toupper; stri_unique; stri_wrap; stringi-locale; stringi-search-coll

Other search_split: stri_split_charclass; stri_split_coll; stri_split_fixed; stri_split_lines, stri_split_lines1; stri_split_regex; stri_split; stringi-search

Other text_boundaries: stri_extract_words; stri_locate_boundaries; stri_locate_words; stri_wrap

Examples

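library(stringi)
# Split at line-break opportunities only if the stringi installation check passes.
if (stri_install_check(silent=TRUE))
    stri_split_boundaries("The\u00a0above-mentioned packages are...", boundary='line')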