Description

A high-level function that controls a number of other functions responsible for loading texts from files, deleting markup, sampling from texts, converting samples to n-grams, etc. It is built on top of a number of lower-level functions and thus requires a large number of arguments. The only obligatory argument, however, is a vector containing the names of the files to be loaded.
Usage

load.corpus.and.parse(files = "all", corpus.dir = "", markup.type = "plain",
        corpus.lang = "English", splitting.rule = NULL,
        sample.size = 10000, sampling = "no.sampling",
        sample.overlap = 0, number.of.samples = 1,
        sampling.with.replacement = FALSE, features = "w",
        ngram.size = 1, preserve.case = FALSE,
        encoding = "UTF-8", ...)
Value

The function returns an object of the class stylo.corpus. It is a list containing as elements the samples (entire texts or sampled subsets) split into words/characters and combined into n-grams (if applicable).
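Since the returned object is a plain list, it can be inspected with standard R tools; a minimal sketch (the file names are hypothetical):

# load two hypothetical files and inspect the resulting object
my.corpus = load.corpus.and.parse(files = c("file1.txt", "file2.txt"))
class(my.corpus)       # "stylo.corpus"
names(my.corpus)       # one element per sample
my.corpus[[1]][1:10]   # the first ten tokens of the first sample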
Arguments

files: a vector of file names; the default value "all" is equivalent to list.files().
corpus.dir: the directory containing the text files to be loaded; if not specified, the current working directory will be used.
markup.type: choose one of the following values: plain (nothing will happen), html (all tags will be deleted, as well as the HTML header), xml (the TEI header, any text between <note> </note> tags, and all the tags will be deleted), xml.drama (as above; additionally, speakers' names, or strings within the <speaker> </speaker> tags, will be deleted), xml.notitles (as above; additionally, all chapter/section (sub)titles, or strings within the <head> </head> tags, will be deleted); see delete.markup for further details.
corpus.lang: an optional argument indicating the language of the texts analyzed; the values that will affect the function's behavior are English.contr, English.all, and Latin.corr (type help(txt.to.words.ext) for an explanation). The default value is English.
splitting.rule: if you are not satisfied with the default language settings (or your input string of characters is not regular text, but, say, a sequence of dance movements represented with symbolic signs), you can specify your own splitting regular expression here (see the Examples section). This option will override the above language settings. For further details, refer to help(txt.to.words).
sample.size: desired size of samples, expressed in number of words; the default value is 10,000.
sampling: one of three values: no.sampling (default), normal.sampling, random.sampling. See make.samples for an explanation.
sample.overlap: if this option is used, a reference text is segmented into consecutive, equal-sized samples that are allowed to partially overlap. If one specifies a sample.size of 5,000 and a sample.overlap of 1,000, for example, the first sample of a text contains words 1--5,000, the second words 4,001--9,000, the third words 8,001--13,000, and so forth (see also the Examples section).
number.of.samples: optional argument that will be used only if random.sampling was chosen; it specifies the number of samples to be drawn from each text (see the sketch below this list).
sampling.with.replacement: optional argument that will be used only if random.sampling was chosen; it specifies whether the words are randomly harvested from a text with or without replacement.
features: an option for specifying the desired type of feature: w for words, c for characters (default: w). See txt.to.features for further details.
ngram.size: an optional argument (integer) specifying the value of n, i.e. the size of the n-grams to be produced. If this argument is missing, the default value of 1 is used. See txt.to.features for further details.
preserve.case: whether or not to preserve the case of characters; if set to FALSE (the default), all characters in the corpus are lowercased.
encoding: useful if you use Windows and non-ASCII alphabets: French, Polish, Hebrew, etc. In such a situation, it is quite convenient to convert your text files into Unicode and to set this option to encoding = "UTF-8". On Linux and Mac, you are always expected to use Unicode, thus you don't need to set anything. On Windows, consider using UTF-8; to analyze files in a native ANSI encoding instead, set this option to encoding = "native.enc".
...: option not used; introduced here for compatibility reasons.
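For instance, to combine the random-sampling arguments described above and draw several random samples from each text, one could write the following (a minimal sketch; the file names are hypothetical):

# five random samples of 2,000 words each, drawn without replacement
my.corpus = load.corpus.and.parse(files = c("novel1.txt", "novel2.txt"),
        sampling = "random.sampling", sample.size = 2000,
        number.of.samples = 5, sampling.with.replacement = FALSE)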
Author

Maciej Eder
See Also

load.corpus, delete.markup, txt.to.words, txt.to.words.ext, txt.to.features, make.samples
Examples

if (FALSE) {
# to load file1.txt and file2.txt, stored in the subdirectory my.files:
my.corpus = load.corpus.and.parse(files = c("file1.txt", "file2.txt"),
corpus.dir = "my.files")
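# to cut the same texts into consecutive 5,000-word samples overlapping
# by 1,000 words (a sketch of the sample.overlap behavior described above):
my.corpus = load.corpus.and.parse(files = c("file1.txt", "file2.txt"),
        corpus.dir = "my.files", sampling = "normal.sampling",
        sample.size = 5000, sample.overlap = 1000)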
# to load all XML files from the current directory, while getting rid of
# all markup tags in the file, and split the texts into consecutive
# word pairs (2-grams):
my.corpus = load.corpus.and.parse(files = list.files(pattern = "[.]xml$"),
markup.type = "xml", ngram.size = 2)
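# to load all plain text files from the current directory and split them
# into character 3-grams (a sketch; adjust the pattern to your file names):
my.corpus = load.corpus.and.parse(files = list.files(pattern = "[.]txt$"),
        features = "c", ngram.size = 3)
# to split non-standard input on whitespace only, using a custom regular
# expression (the input file of symbolic signs is hypothetical):
my.corpus = load.corpus.and.parse(files = "dance_notation.txt",
        splitting.rule = "[ \t\n]+")
# on Windows, to analyze files saved in the native ANSI encoding:
my.corpus = load.corpus.and.parse(files = "file1.txt",
        corpus.dir = "my.files", encoding = "native.enc")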
}