lsa (version 0.73.2)

textmatrix: Textmatrix (Matrices)

Description

Creates a document-term matrix from all text files in a given directory.

Usage

textmatrix( mydir, stemming=FALSE, language="english",
   minWordLength=2, maxWordLength=FALSE, minDocFreq=1, 
   maxDocFreq=FALSE, minGlobFreq=FALSE, maxGlobFreq=FALSE, 
   stopwords=NULL, vocabulary=NULL, phrases=NULL, 
   removeXML=FALSE, removeNumbers=FALSE)
textvector( file, stemming=FALSE, language="english", 
   minWordLength=2, maxWordLength=FALSE, minDocFreq=1, 
   maxDocFreq=FALSE, stopwords=NULL, vocabulary=NULL, 
   phrases=NULL, removeXML=FALSE, removeNumbers=FALSE )

Arguments

file

filename (may include path).

mydir

the directory path (e.g., "corpus/texts/"); may be single files/directories or a vector of files/directories.

stemming

boolean indicating whether to reduce all terms to their word stem.

language

specifies the language used for stemming and stop-word removal.

minWordLength

words with fewer than minWordLength characters will be ignored.

maxWordLength

words with more than maxWordLength characters will be ignored; set to FALSE by default, i.e., no upper boundary.

minDocFreq

words appearing fewer than minDocFreq times within a document will be ignored.

maxDocFreq

words appearing more than maxDocFreq times within a document will be ignored; set to FALSE by default, i.e., no upper boundary on the document frequency.

minGlobFreq

words that appear in fewer than minGlobFreq documents will be ignored.

maxGlobFreq

words that appear in more than maxGlobFreq documents will be ignored.

stopwords

a stopword list containing terms that will be ignored.

vocabulary

a character vector of words: only words in this term list will be used for building the matrix (a "controlled vocabulary").

removeXML

if set to TRUE, XML tags (elements, attributes, some special characters) will be removed.

removeNumbers

if set to TRUE, terms consisting only of numbers will be removed.

phrases

not yet implemented.

Value

textmatrix

the document-term matrix (including row and column names).

Details

All documents in the specified directory are read and a matrix is composed. Each cell of the matrix contains the raw number of occurrences (i.e., the term frequency) of a particular word in a particular document. If specified, simple text preprocessing steps are applied (stemming, stopword filtering, word-length cutoffs).
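
For instance, several preprocessing options can be combined in one call, as in the following sketch (the directory "mydocs/" and the chosen cutoff values are placeholders, not part of the package):

# sketch: stemming plus word-length and global-frequency cutoffs
# ("mydocs/" and the cutoff values are placeholders)
tm = textmatrix("mydocs/", stemming=TRUE, language="english",
   minWordLength=3, minGlobFreq=2, removeNumbers=TRUE)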

Stemming uses the Porter (Snowball) stemmer from the SnowballC package.
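
The stemmer can also be called on its own; a brief illustration, assuming SnowballC is installed:

# the stemmer textmatrix() relies on, applied directly
library(SnowballC)
wordStem(c("dogs", "running", "mice"), language="english")
# Porter stemming strips regular suffixes ("dog", "run");
# irregular forms such as "mice" are left unchanged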

Two stopword lists are included (for English and for German); they are loaded on demand into the variables stopwords_en and stopwords_de by calling data(stopwords_en) or data(stopwords_de). Attention: a stopword list must already be loaded when textmatrix() is called.
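
For example, to filter English stopwords (a sketch; "mydocs/" is a placeholder path):

# load the bundled English stopword list before building the matrix
data(stopwords_en)
tm = textmatrix("mydocs/", stopwords=stopwords_en)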

textvector() is a support function that creates a list of term-in-document occurrences.
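
A minimal sketch of calling it on a single file (the file path is a placeholder):

# term occurrences for one document
tv = textvector("mydocs/D1.txt", minWordLength=2)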

For every generated matrix, its own environment is attached as an attribute; it holds the triples that are stored via setTriple() and can be retrieved with getTriple().
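
A rough sketch of attaching and reading such a triple is given below; the argument order of setTriple() and getTriple() is assumed here and should be checked against their own help pages:

# attach metadata to a document of the matrix
# (argument order of setTriple()/getTriple() is assumed)
tm = textmatrix("mydocs/")
setTriple(tm, "D1", "category", "animals")
getTriple(tm, "D1", "category")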

If the language is set to "arabic", special characters for the Buckwalter transliteration will be kept.

See Also

wordStem, stopwords_de, stopwords_en, setTriple, getTriple

Examples

# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )

# read them, create a document-term matrix
textmatrix(td)

# read them, drop german stopwords
data(stopwords_de)
textmatrix(td, stopwords=stopwords_de)

# read them based on a controlled vocabulary
voc = c("dog", "mouse")
textmatrix(td, vocabulary=voc, minWordLength=1)

# clean up
unlink(td, recursive=TRUE)
