tokenize(txt, format = "file", fileEncoding = NULL,
split = "[[:space:]]", ign.comp = "-",
heuristics = "abbr",
heur.fix = list(pre = c("’", "'"), suf = c("’", "'")),
abbrev = NULL, tag = TRUE, lang = "kRp.env",
sentc.end = c(".", "!", "?", ";", ":"),
detect = c(parag = FALSE, hline = FALSE))
If tag=FALSE, a character vector with the tokenized text. If tag=TRUE, returns an object of class kRp.tagged-class.
tokenize can try to guess what's a headline and where a paragraph was inserted (via the detect parameter). A headline is assumed if a line of text without sentence-ending punctuation is found, a paragraph if two blocks of text are separated by space. This will add extra tags into the text. kRp.text.paste can replace these tags, which probably preserves more of the original layout.

tokenized.obj <- tokenize("~/mydata/corpora/russian_corpus/")
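The headline and paragraph heuristic described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper name guess_layout and its return format are hypothetical, "separated by space" is read here as an empty line between blocks, and only the default sentc.end characters are taken from the usage section.

```r
# Hypothetical helper, NOT part of the package: labels each line of a text
# according to the heuristic described in the details above.
guess_layout <- function(lines, sentc.end = c(".", "!", "?", ";", ":")) {
  n <- length(lines)
  trimmed <- trimws(lines)
  blank <- trimmed == ""
  # last character of every line ("" for empty lines)
  last.char <- substring(trimmed, nchar(trimmed), nchar(trimmed))
  # a non-empty line without sentence-ending punctuation counts as a headline
  label <- ifelse(blank, "blank",
             ifelse(last.char %in% sentc.end, "text", "headline"))
  # a paragraph starts at a non-empty line that follows an empty line,
  # provided some non-empty line came before it
  prev.blank <- c(FALSE, blank)[seq_len(n)]
  parag.start <- !blank & prev.blank & cumsum(!blank) > 1
  data.frame(line = lines, label = label, parag.start = parag.start,
             stringsAsFactors = FALSE)
}

# made-up sample text, one element per line
sample.lines <- c("A Short Title", "", "First sentence here.",
                  "Second one!", "", "New block starts.")
res <- guess_layout(sample.lines)
print(res)
```

A real tokenizer works on tokens rather than whole lines, so this only shows the decision rule, not how the extra tags end up in the tagged object.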