tokenize(txt, format = "file", fileEncoding = NULL, split = "[[:space:]]",
  ign.comp = "-", heuristics = "abbr",
  heur.fix = list(pre = c("’", "'"), suf = c("’", "'")),
  abbrev = NULL, tag = TRUE, lang = "kRp.env",
  sentc.end = c(".", "!", "?", ";", ":"),
  detect = c(parag = FALSE, hline = FALSE),
  clean.raw = NULL, perl = FALSE, stopwords = NULL, stemmer = NULL)
If tag=FALSE, a character vector with the tokenized text. If tag=TRUE, returns an object of class kRp.tagged-class.
tokenize can try to guess what's a headline and where a paragraph was inserted (via the detect parameter). A headline is assumed if a line of text without sentence-ending punctuation is found, a paragraph if two blocks of text are separated by space. This adds extra tags to the text that mark headlines and paragraphs. kRp.text.paste can replace these tags, which probably preserves more of the original layout.
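
The headline and paragraph heuristic described above can be illustrated with a small base R sketch. This is not koRpus code and makes no claim about how tokenize implements detection internally. The function name detect_layout and the marker strings "<H>", "</H>" and "<P/>" are hypothetical placeholders chosen for this example, not the tags koRpus inserts. The sketch also assumes that "separated by space" means at least one empty line between two blocks of text.

```r
# Illustrative sketch only: NOT the koRpus implementation.
# detect_layout() and the marker strings are hypothetical names for this example.
detect_layout <- function(lines,
                          sentc.end = c(".", "!", "?", ";", ":"),
                          h.on = "<H>", h.off = "</H>", parag = "<P/>") {
  out <- character(0)
  prev.blank <- FALSE  # was the previous line empty?
  seen.text <- FALSE   # has any text line been processed yet?
  for (ln in lines) {
    txt <- trimws(ln)
    if (!nzchar(txt)) {
      prev.blank <- TRUE
      next
    }
    # an empty line between two blocks of text counts as a paragraph break
    if (prev.blank && seen.text) {
      out <- c(out, parag)
    }
    last.char <- substr(txt, nchar(txt), nchar(txt))
    if (last.char %in% sentc.end) {
      out <- c(out, txt)
    } else {
      # no sentence-ending punctuation: treat the line as a headline
      out <- c(out, h.on, txt, h.off)
    }
    prev.blank <- FALSE
    seen.text <- TRUE
  }
  out
}

sample.lines <- c("A short title", "", "First sentence here.", "",
                  "Second block ends too!")
tagged <- detect_layout(sample.lines)
print(tagged)
```

The sentc.end default mirrors the one in the usage above, so a line ending in ";" or ":" is treated as ordinary text rather than a headline. A real tokenizer would have to handle further cases (for example trailing quotes or brackets after the punctuation), which this sketch ignores.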