quanteda (version 0.9.2-0)

segment: segment texts into component elements

Description

Segment text(s) into tokens, sentences, paragraphs, or other sections. segment works on a character vector or corpus object, and allows the delimiters to be defined. See details.

Usage

segment(x, ...)

## S3 method for class 'character'
segment(x, what = c("tokens", "sentences", "paragraphs", "tags", "other"),
  delimiter = ifelse(what == "tokens", " ",
                ifelse(what == "sentences", "[.!?:;]",
                  ifelse(what == "paragraphs", "\\n{2}",
                    ifelse(what == "tags", "##\\w+\\b", NULL)))),
  perl = FALSE, ...)

## S3 method for class 'corpus'
segment(x, what = c("tokens", "sentences", "paragraphs", "tags", "other"),
  delimiter = ifelse(what == "tokens", " ",
                ifelse(what == "sentences", "[.!?:;]",
                  ifelse(what == "paragraphs", "\\n{2}",
                    ifelse(what == "tags", "##\\w+\\b", NULL)))),
  perl = FALSE, ...)

Arguments

x
text or corpus object to be segmented
...
additional arguments passed to tokenize when what = "tokens" is used
what
unit of segmentation. Current options are "tokens", "sentences", "paragraphs", "tags", and "other". Segmenting on "other" allows segmentation of a text on any user-defined value, and must be accompanied by the delimiter argument.
delimiter
delimiter defined as a regex for segmentation. Each type has its own default, except other, which requires a value to be specified.
perl
logical. Should Perl-compatible regular expressions be used?
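The default delimiters above are ordinary R regular expressions, so their behavior can be previewed with base R alone. A minimal sketch (using base R's regmatches/gregexpr, not quanteda itself) of what the default "tags" delimiter, "##\\w+\\b", actually matches:

```r
# Base-R illustration of the default "tags" delimiter regex.
# "##\\w+\\b" matches "##" followed by one or more word characters,
# ending at a word boundary -- i.e. tag markers like ##INTRO.
txt <- "##INTRO This is the introduction. ##DOC1 This is the first document."
regmatches(txt, gregexpr("##\\w+\\b", txt))[[1]]
# "##INTRO" "##DOC1"
```

segment() splits the text at these matches, which is why each tagged section becomes its own element.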

Value

  • A list of segmented texts, with each element of the list corresponding to one of the original texts.

Details

Tokens are delimited by separators (a single space, by default). For sentences, the delimiter can be defined by the user; the default for sentences includes ., !, and ?, plus ; and :. For paragraphs, the default is two consecutive newlines, although this could be changed to a single newline by setting delimiter to "\\n{1}", the R string for the regex matching one newline character. (You might need this if the document was created in a word processor, for instance, and the lines were wrapped in the window rather than being hard-wrapped with a newline character.)
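The distinction between "\\n{2}" and "\\n{1}" can be sketched with base R's strsplit (quanteda is not needed to see the regex behavior; note that splitting on single newlines leaves an empty element where a blank line was):

```r
# "\n{2}" splits only on blank lines (two consecutive newlines),
# while "\n{1}" splits on every newline.
txt <- "Line one.\nLine two.\n\nNew paragraph."
strsplit(txt, "\n{2}")[[1]]  # 2 elements: the two paragraphs
strsplit(txt, "\n{1}")[[1]]  # 4 elements, including an empty string
```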

Examples

# same as tokenize()
identical(tokenize(ukimmigTexts), segment(ukimmigTexts))

# segment into paragraphs
segment(ukimmigTexts[3:4], "paragraphs")

# segment a text into sentences
segmentedChar <- segment(ukimmigTexts, "sentences")
segmentedChar[2]
testCorpus <- corpus("##INTRO This is the introduction. 
                      ##DOC1 This is the first document.  
                      Second sentence in Doc 1.  
                      ##DOC3 Third document starts here.  
                      End of third document.")
testCorpusSeg <- segment(testCorpus, "tags")
summary(testCorpusSeg)
texts(testCorpusSeg)
# segment a corpus into sentences
segmentedCorpus <- segment(corpus(ukimmigTexts), "sentences")
identical(ndoc(segmentedCorpus), length(unlist(segmentedChar)))
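A note on the perl argument, sketched with base R's strsplit rather than quanteda: some regex constructs, such as lookbehind assertions, are only available in Perl-compatible mode, which is when you would pass perl = TRUE.

```r
# A lookbehind delimiter keeps the punctuation with each sentence;
# it requires PCRE, i.e. perl = TRUE.
txt <- "First sentence. Second sentence! Third?"
strsplit(txt, "(?<=[.!?]) ", perl = TRUE)[[1]]
# "First sentence."  "Second sentence!"  "Third?"
```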