Usage
tokenize(txt, format = "file", fileEncoding = NULL, split = "[[:space:]]",
  ign.comp = "-", heuristics = "abbr",
  heur.fix = list(pre = c("’", "'"), suf = c("’", "'")), abbrev = NULL,
  tag = TRUE, lang = "kRp.env", sentc.end = c(".", "!", "?", ";", ":"),
  detect = c(parag = FALSE, hline = FALSE), clean.raw = NULL, perl = FALSE,
  stopwords = NULL, stemmer = NULL)
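A minimal call might look like the following sketch. The sample sentence and the object name tagged.text are illustrative only; this assumes the koRpus package is installed and loaded:

```r
library(koRpus)

# analyze a character vector directly (format = "obj") instead of scanning
# files; lang must be given explicitly unless it was set via the koRpus
# environment beforehand
tagged.text <- tokenize(
  "This is a short sample sentence. And here is another one!",
  format = "obj",
  lang = "en"
)
```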
Arguments
txt
Either an open connection,
the path to a directory with text files to read and tokenize, or a vector object
already holding the text corpus.
format
Either "file" or "obj",
depending on whether you want to scan files or analyze the given object.
fileEncoding
A character string naming the encoding of all files.
split
A regular expression to define the basic split method. Should only need refinement
for languages that don't separate words by space.
ign.comp
A character vector defining punctuation which might be used in composita that should
not be split.
heuristics
A vector to indicate if the tokenizer should use some heuristics. Can be none,
one or several of the following:
"abbr"
Assume that "letter-dot-letter-dot" combinations are abbreviations and leave them intact.
"suf"
Try to detect possessive suffixes like "'s",
or shortened forms like "'ll", and treat them as one token.
"pre"
Try to detect prefixes like "s'" or "l'" and treat them as one token.
Earlier releases used the names "en" and "fr" instead of "suf" and "pre".
They still work: "en" is equivalent to "suf",
whereas "fr" is now equivalent to both "suf" and "pre"
(and not only "pre", as in the past, which ignored the use of suffixes in French).
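As a sketch of how the heuristics argument combines these options (the sample text is illustrative):

```r
library(koRpus)

# keep "letter-dot" abbreviations intact ("abbr") and split off suffixes
# like "'s" and "'ll" as single tokens ("suf")
tagged.text <- tokenize(
  "E.g. the cat's toy won't fit, it'll break.",
  format = "obj",
  lang = "en",
  heuristics = c("abbr", "suf")
)
```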
heur.fix
A list with the named vectors pre and suf.
These will be used if heuristics was set to use one of the presets that try to detect
pre- and/or suffixes. Change them if your document uses other
characters than the ones defined by default.
abbrev
Path to a text file with abbreviations to take care of, one per line.
Note that this file must have the same encoding as defined by fileEncoding.
tag
Logical. If TRUE, the text will be rudimentarily tagged and returned as an object
of class kRp.tagged.
lang
A character string naming the language of the analyzed text. If set to "kRp.env",
it is fetched from get.kRp.env. Only needed if tag=TRUE.
sentc.end
A character vector with tokens indicating a sentence ending. Only needed if tag=TRUE.
detect
A named logical vector, indicating by the settings of parag and hline
whether tokenize should try to detect paragraphs and headlines.
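For instance, in this sketch "corpus/" is a hypothetical directory of plain text files:

```r
library(koRpus)

# "corpus/" is a hypothetical directory of text files; parag and hline ask
# tokenize() to also guess paragraph breaks and headlines while reading
tagged.text <- tokenize(
  "corpus/",
  lang = "en",
  detect = c(parag = TRUE, hline = TRUE)
)
```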
clean.raw
A named list of character values,
indicating replacements that should globally be made to the text prior to tokenizing it.
This is applied after the text was converted into UTF-8 internally. In the list,
the name of each element represents a pattern which
is replaced by its value if found in the text. Since this is done by calling gsub,
regular expressions are supported. See the perl attribute, too.
perl
Logical, only relevant if clean.raw is not NULL. If perl=TRUE,
this is forwarded to gsub to allow for Perl-like regular expressions in clean.raw.
stopwords
A character vector to be used for stopword detection. Comparison is done in lower case.
You can also simply set stopwords=tm::stopwords("en")
to use the English stopwords provided by the tm package.
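A sketch of stopword detection (the sample sentence is illustrative; this assumes the tm package is installed):

```r
library(koRpus)

# mark English stopwords ("the", "over", ...) during tokenization;
# comparison is done in lower case, so capitalized stopwords match too
tagged.text <- tokenize(
  "The quick brown fox jumps over the lazy dog.",
  format = "obj",
  lang = "en",
  stopwords = tm::stopwords("en")
)
```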
stemmer
A function or method to perform stemming. For instance, you can set SnowballC::wordStem
if you have the SnowballC package installed.
As of now, you cannot provide further arguments to this function.
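As a sketch of plugging in a stemmer (the sample sentence is illustrative; this assumes the SnowballC package is installed):

```r
library(koRpus)

# pass SnowballC's wordStem as the stemming function; note that no further
# arguments can be handed through to it
tagged.text <- tokenize(
  "Running dogs are faster than walking cats.",
  format = "obj",
  lang = "en",
  stemmer = SnowballC::wordStem
)
```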