This dialog allows creating a tm corpus from various sources. Once the
documents have been loaded, they are processed according to the chosen settings,
and a document-term matrix is extracted. The first source, Directory containing plain text files, creates one
document for each .txt file found in the specified directory. The documents
are named after the files they were loaded from. When choosing
the directory where the .txt files can be found, please note that the file browser
only lists directories, not files, but the files will be loaded nevertheless.
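For reference, a roughly equivalent import can be performed manually with the tm
package; the directory path used below is only a placeholder:

    library(tm)

    ## Build one document per .txt file found in a directory; documents are
    ## named after the files they were loaded from.
    corpus <- VCorpus(DirSource("~/texts", pattern = "\\.txt$"),
                      readerControl = list(language = "en"))
    meta(corpus[[1]], "id")   # name of the file the first document came from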
The second source, Spreadsheet file, creates one document for each row
of a file containing tabular data, typically an Excel (.xls) or Open Document
Spreadsheet (.ods), comma-separated values (.csv) or tab-separated values (.tsv, .txt,
.dat) file. The first column is taken as the contents of the document, while the
remaining columns are added as variables describing each document. For the CSV format,
either a comma (,) or a semicolon (;) is used as the separator, whichever is the most
frequent in the first 50 lines of the file.
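A rough manual equivalent for a CSV file, using base R and tm (the file name is a
placeholder), could look like this:

    library(tm)

    ## The first column is taken as the text of each document, the remaining
    ## columns as per-document variables.
    tab <- read.csv("articles.csv", stringsAsFactors = FALSE)
    corpus <- VCorpus(VectorSource(tab[[1]]))
    corpusVariables <- tab[-1]   # one row of meta-data per document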
The third source, Factiva XML file, loads articles exported from
the Dow Jones Factiva website in the XML or HTML formats (the former
being recommended if you can choose it). Various meta-data describing the articles are
automatically extracted. If the corpus is split into several .xml or .html files, you
can put them in the same directory and select them by holding the Ctrl key to concatenate
them into a single corpus. Please note that some articles from Factiva are known to contain
invalid characters that trigger an error when loading. If this problem happens to you,
please try to identify the problematic article, for example by removing half of the
documents and retrying, until only one document is left in the corpus; then, report
the problem to the Factiva Customer Service, or ask the maintainers of the
present package for help.
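If you prefer to work outside the dialog, a comparable import can be sketched with the
tm.plugin.factiva package, assuming it is installed (the file name is a placeholder):

    library(tm)
    library(tm.plugin.factiva)

    ## Load an XML export from Factiva and inspect the extracted meta-data.
    corpus <- VCorpus(FactivaSource("factiva-export.xml"))
    meta(corpus[[1]])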
The fourth source, Twitter search, retrieves the most recent tweets matching the search
query and written in the specified language, up to the chosen maximum number of messages.
Due to limitations imposed by Twitter, only tweets published up to 6 or 9 days ago can be
downloaded, and up to a maximum number of 1500 tweets. Search queries can notably include
one or more terms that must be present together for a tweet to match the query, and/or
hashtags starting with #; see https://dev.twitter.com/docs/using-search if
you need more complex search strings. User names, hashtags, URLs and RT (re-tweet)
mentions are automatically removed from the corpus when computing the document-term matrix
as they generally disturb the analysis. If the option to remove user names and hashtags is
disabled, they will be included as standard text, i.e. # and @ will be
removed if the punctuation removal processing option has been enabled. The Exclude
retweets option works by identifying tweets that contain RT as a separate expression;
this operation can also be carried out manually later by using the Retweet corpus
variable that is created automatically at import time.
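Behind the scenes this source relies on the twitteR package; a rough manual equivalent,
assuming valid Twitter application credentials (all strings below are placeholders),
might look like this:

    library(twitteR)
    library(tm)

    ## Authenticate, retrieve matching tweets, and build a corpus from their text.
    setup_twitter_oauth("consumer_key", "consumer_secret",
                        "access_token", "access_secret")
    tweets <- searchTwitter("#example", n = 500, lang = "en")
    corpus <- VCorpus(VectorSource(twListToDF(tweets)$text))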
The original texts can optionally be split into smaller chunks, which are then
treated as the actual units (called documents) in all analyses. In order
to get meaningful chunks, texts are only split into paragraphs. These are defined
by the import filter: when importing a directory of text files, a new paragraph
starts at each line break; when importing Factiva files, paragraphs are defined
by the content provider itself, and may thus vary in size (the heading is always a separate
paragraph); splitting has no effect when importing from a spreadsheet file. A corpus
variable called Document is created, which identifies the original text
the chunk comes from.
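The chunking step can be illustrated with a small sketch in base R and tm, using
made-up texts:

    library(tm)

    ## Split each text at line breaks and record which original text each
    ## paragraph (now a document of its own) comes from.
    texts <- c(text1 = "First paragraph.\nSecond paragraph.",
               text2 = "A single paragraph.")
    chunks <- strsplit(texts, "\n", fixed = TRUE)
    corpus <- VCorpus(VectorSource(unlist(chunks)))
    document <- rep(names(chunks), sapply(chunks, length))   # "Document" variable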
For all sources, a data set called corpusVariables
is created, with one row
for each document in the corpus: it contains meta-data that could be extracted from
the source, if any, and can be used to enter further meta-data about the corpus.
This can also be done by importing an existing data set via the
Data->Load data set or Data->Import data menus. Whichever way you choose, use the
Text mining->Set corpus meta-data command afterwards to set or update the corpus
meta-data that will be used by later analyses (see setCorpusVariables).
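As an illustration only, and assuming setCorpusVariables() can be called directly the
way the menu command does, updating the meta-data from the command line might look
like this (the variable name and values are invented):

    ## Add a "topic" variable to the corpusVariables data set, one value per
    ## document in the corpus, then propagate it to the corpus meta-data.
    corpusVariables$topic <- c("economy", "sports", "politics")
    setCorpusVariables()   # same effect as Text mining->Set corpus meta-data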
The dialog also provides a few processing options that will most likely all be
run in order to get a meaningful set of terms from a text corpus.
Among them, stopwords removal and stemming require you to select the
language used in the corpus: at the moment supported languages are
Danish (da), Dutch (nl), English (en), Finnish (fi),
French (fr), German (de), Hungarian (hu), Italian (it),
Norwegian (no), Portuguese (pt), Russian (ru), Spanish (es),
and Swedish (sv), to be specified using their two-letter ISO 639 codes.
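These processing options roughly correspond to the following tm operations, shown here
with English selected as the corpus language:

    library(tm)

    ## Lower-casing, punctuation and number removal, stopword removal, stemming.
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("en"))
    corpus <- tm_map(corpus, stemDocument, language = "en")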
By default, plain text (usually .txt) and comma/tab-separated values files (.csv, .tsv, .dat...)
are assumed to be in the native encoding, which is shown in the File encoding: entry.
If you know this is not the case, you can change the value of this field to one of the encodings
returned by the iconvlist()
function.
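For example, assuming the files are known to be encoded in Latin-1 rather than the
native encoding:

    ## List a few of the encodings supported on this system, then read the
    ## files with an explicit encoding (the directory path is a placeholder).
    iconvlist()[1:10]
    corpus <- VCorpus(DirSource("~/texts", encoding = "latin1"))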
Once the corpus has been imported, its document-term matrix is extracted.
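In tm terms, this final step is roughly equivalent to:

    library(tm)

    ## Extract the document-term matrix from the processed corpus and look at
    ## a small corner of it (first documents and terms).
    dtm <- DocumentTermMatrix(corpus)
    inspect(dtm[1:5, 1:10])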