##### Text Boundary Analysis in stringi

Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text.

##### Details

Examples of the boundary analysis process include:

• Locating positions to word-wrap text to fit within specific margins while displaying or printing, see stri_wrap and stri_split_boundaries.

• Counting characters, words, sentences, or paragraphs, see stri_count_boundaries.

• Making a list of the unique words in a document, see stri_extract_all_words and then stri_unique.

• Capitalizing the first letter of each word or sentence, see also stri_trans_totitle.

• Locating a particular unit of the text (for example, finding the third word in the document), see stri_locate_all_boundaries.
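The tasks listed above can be sketched in a few calls. This is a minimal illustration, assuming the stringi package is installed; the sample string and the variable names are made up for the example.

```r
library(stringi)

x <- "The quick brown fox. The lazy dog!"

# Suggest where to wrap the text for a 20-column margin
wrapped <- stri_wrap(x, width = 20)

# Count words (ignoring spaces and punctuation) and sentences
n_words <- stri_count_boundaries(x, type = "word", skip_word_none = TRUE)
n_sentences <- stri_count_boundaries(x, type = "sentence")

# List the unique words; "The" occurs twice but is kept once
words <- stri_extract_all_words(x)[[1]]
unique_words <- stri_unique(words)

# Capitalize the first letter of each word
titled <- stri_trans_totitle(x)

# Find the third word by locating word boundaries
loc <- stri_locate_all_boundaries(x, type = "word", skip_word_none = TRUE)[[1]]
third_word <- stri_sub(x, loc[3, 1], loc[3, 2])
```

Here `skip_word_none = TRUE` is passed on to stri_opts_brkiter so that segments consisting only of spaces or punctuation are not reported as words.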

Generally, text boundary analysis is a locale-dependent operation. For example, Japanese and Chinese do not separate words with spaces, so a line break can occur even in the middle of a word. These languages also have punctuation and diacritical marks that cannot start or end a line, which must be taken into account as well.
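As a rough sketch of the locale dependence, a locale can be passed to the boundary functions, which forward it to stri_opts_brkiter. The sample text and the choice of the `ja_JP` locale are assumptions made for illustration. The exact break positions depend on the ICU version and its data, so none are shown here.

```r
library(stringi)

# A short Japanese phrase, written with \u escapes; it contains no spaces
x <- "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"

# Candidate line-break positions under a Japanese locale
pieces <- stri_split_boundaries(x, type = "line_break", locale = "ja_JP")[[1]]

# The same request under the default locale, for comparison
pieces_default <- stri_split_boundaries(x, type = "line_break")[[1]]

# Splitting only cuts the text at boundaries, so joining the pieces
# gives back the original string
rejoined <- stri_join(pieces, collapse = "")
```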

stringi uses ICU's BreakIterator to locate specific text boundaries. Note that the BreakIterator's behavior may be controlled in some cases, see stri_opts_brkiter.

• The character boundary iterator tries to match what a user would think of as a "character" (a basic unit of a writing system for a language), which may be more than just a single Unicode code point.

• The word boundary iterator locates the boundaries of words, for purposes such as "Find whole words" operations.

• The line_break iterator locates positions that would be appropriate to wrap lines when displaying the text.

• The break iterator of type sentence locates sentence boundaries.
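The four iterator types can be compared by splitting the same string with each of them. This is a minimal sketch, assuming the stringi package is installed; the sample string is made up, and the `type` argument is forwarded to stri_opts_brkiter.

```r
library(stringi)

x <- "Hello, world! How are you?"

# One piece per user-perceived character
by_char <- stri_split_boundaries(x, type = "character")[[1]]

# Words, but also the spaces and punctuation between them
by_word <- stri_split_boundaries(x, type = "word")[[1]]

# Chunks after which a line could be wrapped
by_line <- stri_split_boundaries(x, type = "line_break")[[1]]

# One piece per sentence
by_sentence <- stri_split_boundaries(x, type = "sentence")[[1]]

# A user-perceived character may span several code points:
# "e" followed by U+0301 (combining acute accent)
accented <- "e\u0301"
n_code_points <- stri_length(accented)
n_chars <- stri_count_boundaries(accented, type = "character")
```

In the last example, `n_code_points` is 2 while `n_chars` is 1, because the character iterator treats the base letter and its combining mark as a single unit.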

For technical details on different classes of text boundaries refer to the ICU User Guide, see below.

Boundary Analysis -- ICU User Guide, http://userguide.icu-project.org/boundaryanalysis

