stylo (version 0.5.7)

load.corpus.and.parse: Load text files and perform pre-processing

Description

A high-level function that controls a number of other functions responsible for loading texts from files, deleting markup, sampling from texts, converting samples to n-grams, etc. It is built on top of several lower-level functions and therefore accepts a large number of arguments. The only obligatory argument, however, is a vector containing the names of the files to be loaded.

Usage

load.corpus.and.parse(files, corpus.dir = "", markup.type = "plain",
                      language = "English", splitting.rule = NULL,
                      sample.size = 10000, sampling = "no.sampling",
                      sample.overlap = 0, number.of.samples = 1,
                      sampling.with.replacement = FALSE, features = "w", 
                      ngram.size = 1, preserve.case = FALSE,
                      encoding = "native.enc")

Arguments

files
a vector of file names.
corpus.dir
the directory containing the text files to be loaded; if not specified, the current directory will be used.
markup.type
choose one of the following values: plain (nothing will happen), html (all tags will be deleted as well as HTML header), xml (TEI header, any text between tags, and all the tags will be
language
an optional argument indicating the language of the texts analyzed; the values that will affect the function's behavior are: English.contr, English.all, Latin.corr (type help(txt.to.words.ext)
splitting.rule
if you are not satisfied with the default language settings (or your input string of characters is not a regular text, but a sequence of, say, dance movements represented using symbolic signs), you can indicate your custom splitting regular ex
sample.size
desired size of samples, expressed in number of words; default value is 10,000.
sampling
one of three values: no.sampling (default), normal.sampling, random.sampling. See make.samples for explanation.
sample.overlap
if this option is used, a reference text is segmented into consecutive, equal-sized samples that are allowed to partially overlap. If one specifies the sample.size parameter of 5,000 and the sample.overlap of 100
number.of.samples
optional argument which will be used only if random.sampling was chosen; it sets the number of random samples to be drawn from each text.
sampling.with.replacement
optional argument which will be used only if random.sampling was chosen; it specifies the method used to randomly harvest words from texts.
features
an option for specifying the desired type of features: w for words, c for characters (default: w). See txt.to.features for further details.
ngram.size
an optional argument (integer) specifying the value of n, or the size of n-grams to be produced. If this argument is missing, the default value of 1 is used. See txt.to.features for further details.
preserve.case
whether or not to preserve the original case of characters; if FALSE (the default), all characters in the corpus are lowercased.
encoding
useful if you use Windows and non-ASCII alphabets: French, Polish, Hebrew, etc. In such a situation, it is quite convenient to convert your text files into Unicode and to set this option to encoding = "UTF-8". In Linux and Mac, y
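
The interaction between sample.size and sample.overlap can be pictured with a small base-R sketch. This is not stylo's own code: overlapping.samples is a hypothetical helper written for illustration, and dropping trailing words that do not fill a whole sample is an assumption made here.

```r
# Hypothetical helper (not part of stylo): cut a vector of tokens into
# equal-sized samples whose starting points are (size - overlap) tokens apart.
overlapping.samples <- function(tokens, size, overlap = 0) {
  stopifnot(size > 0, overlap >= 0, overlap < size)
  # Assumption: a text shorter than one sample yields no samples at all.
  if (length(tokens) < size) return(list())
  step <- size - overlap
  starts <- seq(1, length(tokens) - size + 1, by = step)
  # Assumption: leftover tokens after the last full sample are discarded.
  lapply(starts, function(s) tokens[s:(s + size - 1)])
}

# Toy "text" of ten tokens, samples of 4 tokens sharing 2 tokens each.
tokens <- paste0("w", 1:10)
samples <- overlapping.samples(tokens, size = 4, overlap = 2)
print(samples)
```

With overlap = 0 the same helper produces plain consecutive, non-overlapping samples.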

Value

  • The function returns an object of the class stylo.corpus. It is a list containing as elements the samples (entire texts or sampled subsets) split into words/characters and combined into n-grams (if applicable).
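
A minimal sketch of inspecting the returned object, assuming the stylo package is installed. The directory name and the two tiny texts are invented for illustration; the code is skipped if stylo is not available.

```r
# Assumption: stylo is installed; otherwise nothing is run.
if (requireNamespace("stylo", quietly = TRUE)) {
  # Write two made-up texts into a temporary directory.
  tmp <- file.path(tempdir(), "stylo.demo")
  dir.create(tmp, showWarnings = FALSE)
  writeLines("The quick brown fox jumps over the lazy dog.",
             file.path(tmp, "file1.txt"))
  writeLines("A stitch in time saves nine.",
             file.path(tmp, "file2.txt"))

  # Load both files with the default settings (no sampling, single words).
  my.corpus <- stylo::load.corpus.and.parse(
    files = c("file1.txt", "file2.txt"),
    corpus.dir = tmp
  )

  print(class(my.corpus))      # class of the returned object
  print(names(my.corpus))      # names of the list elements
  print(head(my.corpus[[1]]))  # first few tokens of the first element
}
```

Because the object is a list, the usual tools such as length(), names() and [[ can be used to explore individual samples.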

See Also

load.corpus, delete.markup, txt.to.words, txt.to.words.ext, txt.to.features, make.samples

Examples

# to load file1.txt and file2.txt, stored in the subdirectory my.files:
my.corpus = load.corpus.and.parse(files = c("file1.txt", "file2.txt"),
                        corpus.dir = "my.files")

# to load all XML files from the current directory, while getting rid of
# all markup tags in the file, and split the texts into consecutive 
# word pairs (2-grams):
my.corpus = load.corpus.and.parse(files = list.files(pattern = "[.]xml$"),
                        markup.type = "xml", ngram.size = 2)
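
What "consecutive word pairs (2-grams)" means can be sketched in base R. The helper below is hypothetical and is not stylo's implementation (see txt.to.features for that); joining the words of an n-gram with a single space is an assumption made here.

```r
# Hypothetical helper (not part of stylo): turn a vector of tokens into
# overlapping n-grams, each written as one string.
ngrams.sketch <- function(tokens, n = 1) {
  if (n <= 1) return(tokens)
  if (length(tokens) < n) return(character(0))
  first <- seq_len(length(tokens) - n + 1)
  # Assumption: tokens inside an n-gram are separated by one space.
  vapply(first,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- c("to", "be", "or", "not", "to", "be")
bigrams <- ngrams.sketch(words, n = 2)
print(bigrams)
```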
