text_split: Segmenting Text

Description

Segment text into smaller units.

Usage

text_split(x, units = "sentences", size = 1, filter = NULL, ...)
text_nsentence(x, filter = NULL, ...)

Arguments

x

a text or character vector.

units

the block size units, either "sentences" or "tokens".

size

the block size, a positive integer giving the maximum number of units per block.

filter

if non-NULL, a text filter to use instead of the default text filter for x.

...

additional properties to set on the text filter.

Value

text_split returns a data frame with three columns named parent, index, and text, and one row for each text block. The columns are as follows:

The parent column is a factor. The levels of this factor are the names of as_corpus_text(x). Calling as.integer on the parent column gives the index of the parent text for each block.
The index column gives the integer index of the block within its parent text.
The text column gives the text of the block, a value of type corpus_text (not a character vector).

text_nsentence returns a numeric vector with the same length as x with each element giving the number of sentences in the corresponding text.
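
The relationship between the parent factor and the integer indices can be illustrated with base R alone. The data frame below is a hand-built stand-in for a text_split result: the names "a" and "b" and the block texts are hypothetical, and the text column is a plain character vector rather than corpus_text.

```r
# Hand-built stand-in for a text_split() result; all values are illustrative.
blocks <- data.frame(
    parent = factor(c("a", "a", "b"), levels = c("a", "b")),
    index  = c(1L, 2L, 1L),
    text   = c("First sentence. ", "Second sentence.", "Only sentence."),
    stringsAsFactors = FALSE
)

# Integer position of each block's parent text
parent_id <- as.integer(blocks$parent)

# Number of blocks produced from each parent text
blocks_per_parent <- table(blocks$parent)
```

Because parent is a factor, it can be passed directly to grouping functions such as table or split.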

Sentences

Sentences are defined according to a tailored version of the boundaries specified by Unicode Standard Annex #29, Section 5.

The UAX 29 sentence boundaries handle Unicode correctly and give reasonable behavior across a variety of languages. However, they do not handle abbreviations correctly, and by default they treat carriage returns and line feeds as paragraph separators, often leading to incorrect breaks. To get around these shortcomings, the text filter allows tailoring the UAX 29 rules using the sent_crlf and sent_suppress properties.

The UAX 29 rules break after full stops (periods) whenever they are followed by uppercase letters. Under these rules, the text "I saw Mr. Jones today." gets split into two sentences. To get around this, we allow a sent_suppress property, a list of sentence break suppressions which, when followed by uppercase characters, do not signal the end of a sentence.

The UAX 29 rules also specify that a carriage return (CR) or line feed (LF) indicates the end of a sentence, so that "A split\nsentence." gets split into two sentences. This often leads to incorrect breaks, so by default, with sent_crlf = FALSE, we deviate from the UAX 29 rules and treat CR and LF like spaces. To break sentences on CRLF, CR, and LF, specify sent_crlf = TRUE.

Details

text_split splits text into roughly evenly-sized blocks, measured in the specified units. When units = "sentences", units are sentences; when units = "tokens", units are non-NA tokens. The size parameter specifies the maximum block size.

When the maximum block size does not evenly divide the total number of units in a text, the block sizes may not be exactly equal. However, no block will have more than one unit more than any other block. The extra units get allocated to the first segments in the split.
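
The allocation described above can be sketched in base R. This is a minimal illustration that assumes the number of blocks is the smallest count that respects the maximum size; block_sizes is a hypothetical helper, not a function in the package, and the package's actual implementation may differ.

```r
# Hypothetical helper (not part of the package): sizes of the blocks
# obtained when n units are split with a maximum block size of `size`.
block_sizes <- function(n, size) {
    stopifnot(n >= 0, size >= 1)
    if (n == 0) {
        return(integer())
    }
    nblock <- as.integer(ceiling(n / size))  # fewest blocks that respect the maximum
    base   <- n %/% nblock                   # units that every block receives
    extra  <- n %% nblock                    # leftover units, given to the first blocks
    base + as.integer(seq_len(nblock) <= extra)
}

block_sizes(26, 4)   # 4 4 4 4 4 3 3
block_sizes(26, 16)  # 13 13
```

In this sketch the sizes always sum to n, never exceed size, and differ from one another by at most one unit.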

Sentences and tokens are defined by the filter argument. The documentation for text_tokens describes the tokenization rules. For sentence boundaries, see the ‘Sentences’ section above.

Examples

# NOT RUN {
text <- c("I saw Mr. Jones today.", 
          "Split across\na line.",
          "What. Are. You. Doing????",
          "She asked 'do you really mean that?' and I said 'yes.'")

# split text into sentences
text_split(text, units = "sentences")

# get the number of sentences
text_nsentence(text)

# disable the default sentence suppressions
text_split("I saw Mr. Jones today.", units = "sentences",
           filter = text_filter(sent_suppress = NULL))

# break on CR and LF
text_split("Split across\na line.", units = "sentences",
           filter = text_filter(sent_crlf = TRUE))

# 2-sentence blocks
text_split(c("What. Are. You. Doing????",
             "She asked 'do you really mean that?' and I said 'yes.'"),
           units = "sentences", size = 2)

# 4-token blocks
text_split(c("What. Are. You. Doing????",
             "She asked 'do you really mean that?' and I said 'yes.'"),
           units = "tokens", size = 4)

# blocks are approximately evenly sized; 'size' gives maximum size
text_split(paste(letters, collapse = " "), "tokens", 4)
text_split(paste(letters, collapse = " "), "tokens", 16)
# }
