
segment

segment works on a character vector or corpus object, and allows the
delimiters to be defined. See details.

segment(x, ...)

## S3 method for class 'character':
segment(x, what = c("tokens", "sentences", "paragraphs", "tags", "other"),
  delimiter = ifelse(what == "tokens", " ",
    ifelse(what == "sentences", "[.!?:;]",
    ifelse(what == "paragraphs", "\\n{2}",
    ifelse(what == "tags", "##\\w+\\b", NULL)))),
  perl = FALSE, ...)

## S3 method for class 'corpus':
segment(x, what = c("tokens", "sentences", "paragraphs", "tags", "other"),
  delimiter = ifelse(what == "tokens", " ",
    ifelse(what == "sentences", "[.!?:;]",
    ifelse(what == "paragraphs", "\\n{2}",
    ifelse(what == "tags", "##\\w+\\b", NULL)))),
  perl = FALSE, ...)

Further arguments can be supplied, such as perl=TRUE, or arguments to be
passed to clean if what = "tokens".

Setting what = "other" allows segmentation of a text on any user-defined
value, and must be accompanied by the delimiter argument. Each type has its
own default delimiter, shown in the usage above, except "other", which
requires a value to be specified.

For sentences, the default delimiters are ., !, ?, plus ; and :. For
paragraphs, the default is two consecutive newlines, although this could be
changed to a single newline by changing the value of delimiter to "\\n{1}",
which is the R version of the regex for one newline character. (You might
need this if the document was created in a word processor, for instance,
and the lines were wrapped in the window rather than being hard-wrapped
with a newline character.)

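
The sentence and paragraph delimiters are ordinary regular expressions, so their effect can be previewed with base R alone. The sketch below uses strsplit() as a stand-in for segment(); that substitution is an assumption made for illustration, not the package's implementation, and the sample strings are invented.

```r
# Illustration only: base R strsplit() stands in for segment() (an assumption,
# not the package's internals). The sample texts are invented.
txt <- "First paragraph, line one.\nFirst paragraph, line two.\n\nSecond paragraph."

# Paragraph delimiter of two consecutive newlines: the single line break
# inside the first paragraph is left alone.
paras_double <- strsplit(txt, "\\n{2}")[[1]]
length(paras_double)  # 2

# Single-newline delimiter: every line break starts a new segment, and the
# blank line between the paragraphs yields an empty segment.
paras_single <- strsplit(txt, "\\n{1}")[[1]]
length(paras_single)  # 4

# Sentence delimiter: a character class matching any one of . ! ? : ;
sents <- strsplit("Is it late? Yes; go home.", "[.!?:;]")[[1]]
sents
```

Note that splitting this way discards the delimiter itself and leaves leading spaces on the later pieces, so a real segmenter may need an extra trimming step.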
# same as tokenize()
identical(tokenize(ukimmigTexts, lower=FALSE), segment(ukimmigTexts, lower=FALSE))
# segment into paragraphs
segment(ukimmigTexts[3:4], "paragraphs")
# segment a text into sentences
segmentedChar <- segment(ukimmigTexts, "sentences")
segmentedChar[2]
# segment a corpus on ## tags
testCorpus <- corpus("##INTRO This is the introduction.
##DOC1 This is the first document.
Second sentence in Doc 1.
##DOC3 Third document starts here.
End of third document.")
testCorpusSeg <- segment(testCorpus, "tags")
summary(testCorpusSeg)
texts(testCorpusSeg)
# segment a corpus into sentences
segmentedCorpus <- segment(corpus(ukimmigTexts), "sentences")
identical(ndoc(segmentedCorpus), length(unlist(segmentedChar)))