ds4psy (version 0.4.0)

text_to_sentences: text_to_sentences splits a string of text x (consisting of one or more character strings) into a vector of its constituent sentences.

Description

text_to_sentences splits x at the given punctuation marks (specified as a regular expression, default: split_delim = "\\.|\\?|!"), removes leading and trailing spaces, and returns the remaining character sequences as a vector of sentences.

Usage

text_to_sentences(x, split_delim = "\\.|\\?|!", force_delim = FALSE)

Arguments

x

A string of text (required), typically a character vector.

split_delim

Sentence delimiters (as a regular expression) used to split x into substrings. By default, split_delim = "\\.|\\?|!".

force_delim

Boolean: Enforce splitting at split_delim? If force_delim = FALSE (as per default), the function assumes a standard sentence-splitting pattern: split_delim is followed by a single space and a capital letter. If force_delim = TRUE, splits at split_delim are enforced (regardless of spacing or capitalization).
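Because split_delim is a regular expression, metacharacters such as "." and "?" must be escaped to match literally, and in R source code each backslash is itself doubled. The following is a minimal sketch of that behavior using base R's strsplit directly rather than text_to_sentences; the variable names are illustrative only.

```r
# In R source code, "\\." is the two-character regex \. (a literal dot).
n_chars <- nchar("\\.")

# Unescaped, "." matches any character, so every character is a split point
# and only empty strings remain.
unescaped <- strsplit("Hi. Bye.", split = ".")[[1]]

# Escaped, only literal dots split the string (the dots themselves are dropped).
escaped <- strsplit("Hi. Bye.", split = "\\.")[[1]]

# Alternation (|) combines several delimiters, as in the default "\\.|\\?|!".
combined <- strsplit("One. Two? Three!", split = "\\.|\\?|!")[[1]]

print(escaped)   # "Hi"   " Bye"
print(combined)  # "One"    " Two"   " Three"
```

The leading spaces in the pieces are what remains after raw splitting; trimws() is one way to remove them.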

Details

The Boolean force_delim distinguishes between two splitting modes:

  1. If force_delim = FALSE (as per default), the function assumes a standard sentence-splitting pattern: A sentence delimiter in split_delim must be followed by a single space and a capital letter starting the next sentence. Sentence delimiters in split_delim are not removed from the output.

  2. If force_delim = TRUE, the function enforces splits at each delimiter in split_delim. For instance, any dot (i.e., the metacharacter "\\.") is interpreted as a full stop, so that sentences containing dots mid-sentence (e.g., in abbreviations) are split into parts. Sentence delimiters in split_delim are removed from the output.

Internally, text_to_sentences uses strsplit to split strings.
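The two modes can be approximated with base R as follows. This is a sketch under stated assumptions, not the ds4psy implementation: split_sketch is a hypothetical helper, and using Perl-style lookarounds for the standard mode is an assumption made here to keep the delimiters in the output.

```r
# Hypothetical helper (not part of ds4psy): mimics the two modes with base R.
split_sketch <- function(x, split_delim = "\\.|\\?|!", force_delim = FALSE) {
  if (force_delim) {
    # Split at every delimiter match; strsplit() drops the matched delimiters.
    parts <- unlist(strsplit(x, split = split_delim))
  } else {
    # Split only at a single space that follows a delimiter and precedes a
    # capital letter; the lookarounds leave the delimiter in the output.
    # Assumes every alternative in split_delim matches a fixed-length string
    # (a PCRE lookbehind requirement).
    pattern <- paste0("(?<=", split_delim, ") (?=[A-Z])")
    parts <- unlist(strsplit(x, split = pattern, perl = TRUE))
  }
  parts <- trimws(parts)   # remove leading and trailing spaces
  parts[parts != ""]       # drop empty pieces
}

x <- c("A first sentence. Exclamation sentence!",
       "Any questions? But etc. can be tricky. A fourth --- and final --- sentence.")

split_sketch(x)                      # 5 pieces, delimiters kept
split_sketch(x, force_delim = TRUE)  # 6 pieces, delimiters removed
```

In this sketch, "But etc. can be tricky." stays in one piece in the standard mode, because the dot after "etc" is followed by a lowercase letter, and is cut into "But etc" and "can be tricky" when force_delim = TRUE. An abbreviation followed by a capitalized word, as in "Dr. Who is problematic.", still triggers a split in the standard mode of the sketch.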

See Also

text_to_words for splitting text into a vector of words; count_words for counting the frequency of words; strsplit for splitting strings.

Other text objects and functions: Umlaut, capitalize(), caseflip(), cclass, count_chars(), count_words(), l33t_rul35, metachar, read_ascii(), text_to_words(), transl33t()

Examples

# NOT RUN {
x <- c("A first sentence. Exclamation sentence!", 
       "Any questions? But etc. can be tricky. A fourth --- and final --- sentence.")
text_to_sentences(x)
text_to_sentences(x, force_delim = TRUE)

# Changing split delimiters:
text_to_sentences(x, split_delim = "\\.")  # only split at "."

text_to_sentences("Buy apples, berries, and coconuts.")
text_to_sentences("Buy apples, berries; and coconuts.", 
                  split_delim = ",|;|\\.", force_delim = TRUE)
                  
text_to_sentences(c("123. 456? 789! 007 etc."), force_delim = TRUE)
text_to_sentences("Dr. Who is problematic.")


# }