Usage
load.corpus.and.parse(files, corpus.dir = "", markup.type = "plain",
                      language = "English", splitting.rule = NULL,
                      sample.size = 10000, sampling = "no.sampling",
                      sample.overlap = 0, number.of.samples = 1,
                      sampling.with.replacement = FALSE, features = "w",
                      ngram.size = 1, preserve.case = FALSE,
                      encoding = "native.enc")
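A minimal sketch of a call with the signature above, assuming the function is provided by the stylo package (the package name is an assumption; it is not stated in this section). The directory and file names are hypothetical and exist only for illustration; the call is skipped if the package is not installed.

```r
# Build a tiny throwaway corpus so the call below has something to read.
corpus_dir <- file.path(tempdir(), "toy_corpus")    # hypothetical location
dir.create(corpus_dir, showWarnings = FALSE)
files <- c("author1_text.txt", "author2_text.txt")  # hypothetical file names
writeLines("The quick brown fox jumps over the lazy dog.",
           file.path(corpus_dir, files[1]))
writeLines("A slow green turtle crawls under the old fence.",
           file.path(corpus_dir, files[2]))

corpus <- NULL
# Assumption: load.corpus.and.parse() lives in the 'stylo' package.
if (requireNamespace("stylo", quietly = TRUE)) {
  corpus <- stylo::load.corpus.and.parse(
    files = files,
    corpus.dir = corpus_dir,
    markup.type = "plain",     # plain text, no tags to strip
    language = "English",
    features = "w",            # words rather than characters
    ngram.size = 1,
    preserve.case = FALSE,     # lowercase everything
    encoding = "UTF-8"
  )
  str(corpus)                  # inspect what was returned
}
```

All remaining arguments keep the defaults shown in the usage line, so no sampling is performed here.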
Arguments
files
a vector of file names.
corpus.dir
the directory containing the text files to be loaded; if
not specified, the current directory will be used.
markup.type
choose one of the following values: plain (nothing will happen),
html (all tags will be deleted, as well as the HTML header),
xml (TEI header, any text between tags, and all the tags will be
language
an optional argument indicating the language of the texts
analyzed; the values that will affect the function's behavior are:
English.contr, English.all, Latin.corr (type help(txt.to.words.ext)
splitting.rule
if you are not satisfied with the default language
settings (or your input string of characters is not a regular text,
but a sequence of, say, dance movements represented using symbolic signs),
you can indicate your custom splitting regular ex
sample.size
desired size of samples, expressed in number of words;
default value is 10,000.
sampling
one of three values: no.sampling (default), normal.sampling,
random.sampling. See make.samples for explanation.
sample.overlap
if this option is used, a reference text is segmented
into consecutive, equal-sized samples that are allowed to partially
overlap. If one specifies the sample.size parameter of 5,000 and
the sample.overlap of 100
number.of.samples
optional argument which will be used only if random.sampling
was chosen; it sets the number of random samples to draw (default: 1).
sampling.with.replacement
optional argument which will be used only if random.sampling
was chosen; it specifies the method used to randomly harvest words
from texts, that is, whether words are drawn with replacement
(default: FALSE).
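The difference between the two harvesting methods can be sketched with base R's sample(). This only illustrates the general idea of drawing with and without replacement; it is not the function's internal code, and the toy token vector is made up for the example.

```r
# Toy token vector standing in for one tokenized text (hypothetical data).
words <- strsplit("the cat sat on the mat and the dog barked", " ")[[1]]

set.seed(42)  # make the random draws reproducible

# Without replacement: each token position can be picked at most once,
# so a sample can never be longer than the text itself.
without_repl <- sample(words, size = 5, replace = FALSE)

# With replacement: a token position may be picked repeatedly,
# so a sample may even exceed the length of the text.
with_repl <- sample(words, size = 20, replace = TRUE)

print(without_repl)
print(with_repl)
```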
features
an option for specifying the desired type of features:
w for words, c for characters (default: w). See txt.to.features
for further details.
ngram.size
an optional argument (integer) specifying the value of n,
or the size of n-grams to be produced. If this argument is missing,
the default value of 1 is used. See txt.to.features for further
details.
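To make the meaning of n concrete, here is a small base R sketch that builds word n-grams from a token vector. The helper make_ngrams() is hypothetical and written only for this illustration; it is not txt.to.features, whose exact output format may differ.

```r
# Hypothetical helper: join every run of n consecutive tokens into one n-gram.
make_ngrams <- function(tokens, n = 1) {
  if (n <= 1) return(tokens)                      # n = 1: tokens are unchanged
  if (length(tokens) < n) return(character(0))    # too short for any n-gram
  starts <- seq_len(length(tokens) - n + 1)       # start index of each window
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- c("to", "be", "or", "not", "to", "be")

make_ngrams(words, 1)  # the six original words
make_ngrams(words, 2)  # "to be" "be or" "or not" "not to" "to be"
make_ngrams(words, 3)  # four overlapping trigrams
```

A text of k tokens yields k - n + 1 overlapping n-grams, so larger values of n produce slightly fewer, but much sparser, features.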
preserve.case
whether or not to preserve the original case of the texts; if FALSE
(the default), all characters in the corpus are lowercased.
encoding
useful if you use Windows and non-ASCII alphabets: French,
Polish, Hebrew, etc. In such a situation, it is quite convenient to
convert your text files into Unicode and to set this option to
encoding = "UTF-8". In Linux and Mac, y