Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text.
Examples of the boundary analysis process include:
Locating appropriate points to word-wrap text to fit
within specific margins while displaying or printing,
see stri_wrap and stri_split_boundaries.
Counting characters, words, sentences, or paragraphs,
see stri_count_boundaries.
Making a list of the unique words in a document,
cf. stri_extract_all_words and then stri_unique.
Capitalizing the first letter of each word
or sentence, see also stri_trans_totitle.
Locating a particular unit of the text (for example,
finding the third word in the document),
see stri_locate_all_boundaries.
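
The tasks listed above can be sketched in a few lines of R. This is only an illustrative sketch: the sample string `doc` is made up, and `stri_count_words` and `stri_locate_all_words` are the word-specific convenience variants of the boundary functions named above.

```r
library(stringi)

# Hypothetical sample text, used only for illustration
doc <- "the quick brown fox. the lazy dog sleeps!"

# Counting sentences and words
n_sentences <- stri_count_boundaries(doc, type = "sentence")
n_words <- stri_count_words(doc)

# Making a list of the unique words
words <- stri_extract_all_words(doc)[[1]]
unique_words <- stri_unique(words)

# Capitalizing the first letter of each word
titled <- stri_trans_totitle(doc)

# Locating a particular unit: the third word
loc <- stri_locate_all_words(doc)[[1]]
third_word <- stri_sub(doc, loc[3, "start"], loc[3, "end"])
```

Each function is vectorized over its input, so a character vector of documents can be passed in place of the single string `doc`.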
Generally, text boundary analysis is a locale-dependent operation. For example, in Japanese and Chinese, words are not separated by spaces, so a line break can occur even in the middle of a word. These languages also have punctuation and diacritical marks that cannot start or end a line, which must be taken into account as well.
stringi uses ICU's BreakIterator to locate specific
text boundaries. Note that the BreakIterator's behavior
may be controlled in some cases, see stri_opts_brkiter.
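
As a rough illustration of controlling the break iterator, the sketch below passes options built with stri_opts_brkiter to stri_split_boundaries. The sample string `x` is hypothetical; the `skip_word_none` and `skip_word_number` settings are assumed here to drop non-word chunks (spaces, punctuation) and numeric chunks, respectively.

```r
library(stringi)

# Hypothetical sample text
x <- "Order 66 shipped, at last."

# Default word iterator: every chunk between boundaries is returned,
# including spaces and punctuation
all_chunks <- stri_split_boundaries(x, type = "word")[[1]]

# Skip chunks that are not words (spaces, punctuation)
opts_words <- stri_opts_brkiter(type = "word", skip_word_none = TRUE)
words_only <- stri_split_boundaries(x, opts_brkiter = opts_words)[[1]]

# Additionally skip chunks classified as numbers
opts_no_num <- stri_opts_brkiter(type = "word", skip_word_none = TRUE,
                                 skip_word_number = TRUE)
no_numbers <- stri_split_boundaries(x, opts_brkiter = opts_no_num)[[1]]
```

The same settings can usually be given directly through `...`, for example `stri_split_boundaries(x, type = "word", skip_word_none = TRUE)`.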
The character boundary iterator tries to match what a user
would think of as a "character" (a basic unit of a writing system
for a language), which may be more than just a single Unicode code point.
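
A small sketch of the difference between code points and user-perceived characters follows. The sample string is an assumption chosen for illustration: it spells the final letter as a base letter followed by a combining accent.

```r
library(stringi)

# "cafe" followed by U+0301 COMBINING ACUTE ACCENT (hypothetical sample)
x <- "cafe\u0301"

# Number of Unicode code points
n_code_points <- stri_length(x)

# Number of user-perceived characters found by the character iterator
n_characters <- stri_count_boundaries(x, type = "character")
```

Here the accent and its base letter form one user-perceived character, so the two counts differ by one.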
The word boundary iterator locates the boundaries
of words, for purposes such as "Find whole words" operations.
The line_break iterator locates positions that would
be appropriate points to wrap lines when displaying the text.
A break iterator of type sentence, in turn,
locates sentence boundaries.
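
The line_break and sentence iterators described above can be compared on the same input. This is a minimal sketch with a made-up sample string; the width passed to stri_wrap is an arbitrary choice.

```r
library(stringi)

# Hypothetical sample text
x <- "Text boundaries matter. Do they? Yes."

# Chunks delimited by sentence boundaries
sentences <- stri_split_boundaries(x, type = "sentence")[[1]]

# Chunks delimited by possible line-break positions
line_chunks <- stri_split_boundaries(x, type = "line_break")[[1]]

# Word-wrapping to an arbitrarily chosen width
wrapped <- stri_wrap(x, width = 20)
```

Without any skip settings, the chunks returned by stri_split_boundaries keep their trailing whitespace, so pasting them back together reproduces the input string.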
For technical details on the different classes of text boundaries, refer to the ICU User Guide listed below.
Boundary Analysis -- ICU User Guide, http://userguide.icu-project.org/boundaryanalysis
Other locale_sensitive: %s<%,
stri_compare,
stri_count_boundaries,
stri_duplicated,
stri_enc_detect2,
stri_extract_all_boundaries,
stri_locate_all_boundaries,
stri_opts_collator,
stri_order,
stri_split_boundaries,
stri_trans_tolower,
stri_unique, stri_wrap,
stringi-locale,
stringi-search-coll
Other text_boundaries: stri_count_boundaries,
stri_extract_all_boundaries,
stri_locate_all_boundaries,
stri_opts_brkiter,
stri_split_boundaries,
stri_split_lines,
stri_trans_tolower,
stri_wrap, stringi-search
Other stringi_general_topics: stringi-arguments,
stringi-encoding,
stringi-locale,
stringi-package,
stringi-search-charclass,
stringi-search-coll,
stringi-search-fixed,
stringi-search-regex,
stringi-search