
corpus (version 0.5.1)

segmentation: Segmenting Text

Description

Segment text into smaller units.

Usage

sentences(x)

Arguments

x

a text or character vector.

Value

A data frame with three columns: parent, index, and text, and one row for each sentence. The parent value is the integer index of the parent text in x; the index value is the integer index of the sentence in its parent; the text value is the text of the sentence, a value of type text.
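The relationship between parent and index can be sketched in base R. The data frame below is a hand-built stand-in for the return value, not actual output of sentences: its text column is a plain character vector rather than type text, and the example assumes each sentence keeps its trailing whitespace.

```r
# Hand-built stand-in for the value returned by sentences(); the contents
# are illustrative only and were not produced by the corpus package.
s <- data.frame(
  parent = c(1L, 1L, 2L),
  index  = c(1L, 2L, 1L),
  text   = c("First one. ", "Second one.", "Only one."),
  stringsAsFactors = FALSE
)

# Number of sentences found in each parent text
n_sentences <- tabulate(s$parent)

# Reassemble each parent text by pasting its sentences back in order
ordered <- s[order(s$parent, s$index), ]
rebuilt <- tapply(ordered$text, ordered$parent, paste, collapse = "")
```

Sorting by parent and then index before pasting keeps the sentences in their original order even if the rows have been shuffled.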

Details

sentences splits text at the sentence boundaries defined by Unicode Standard Annex #29 (UAX 29), Section 5. These boundaries handle Unicode correctly and give reasonable behavior across a variety of languages. However, the UAX 29 sentence-breaking rules do not treat abbreviations specially: a period followed by whitespace and an uppercase letter counts as a sentence boundary. For example, the text "I saw Mr. Jones today." is split into two sentences, with a break after "Mr.".

Future versions of the sentences function may change to accommodate special rules for abbreviations like "Mr.", "Dr.", etc.
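Until such rules exist, one possible workaround is to re-join the pieces after splitting. The following is a minimal base R sketch, not part of the corpus package: merge_abbreviations is a hypothetical helper, its abbreviation list is an assumption, and it expects the sentence pieces of a single parent text as a character vector.

```r
# Hypothetical helper (not provided by corpus): re-join pieces that were
# split after a known abbreviation. `abbrev` is an illustrative list.
merge_abbreviations <- function(pieces, abbrev = c("Mr.", "Mrs.", "Dr.")) {
  out <- character(0)
  carry <- ""
  for (piece in pieces) {
    current <- paste0(carry, piece)
    # Ignore trailing whitespace when checking how the piece ends
    trimmed <- sub("\\s+$", "", current)
    if (any(endsWith(trimmed, abbrev))) {
      # Piece ends with an abbreviation: hold it and join with the next one
      carry <- current
    } else {
      out <- c(out, current)
      carry <- ""
    }
  }
  # Keep a trailing piece that ended with an abbreviation
  if (nzchar(carry)) out <- c(out, carry)
  out
}

merged <- merge_abbreviations(c("I saw Mr. ", "Jones today."))
# merged is "I saw Mr. Jones today."
```

The suffix check is deliberately simple and does not test for a word boundary, so a piece ending in any string that happens to end with "Mr." is also merged. When working with the full data frame, the helper would need to be applied separately to the rows of each parent.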

See Also

tokens.

Examples

    sentences("I saw Mr. Jones today.")

    sentences(c("What. Are. You. Doing????",
                "She asked 'do you really mean that?' and I said 'yes.'"))
