Usage
create_matrix(textColumns, language="english", minDocFreq=1, maxDocFreq=Inf,
minWordLength=3, maxWordLength=Inf, ngramLength=1, originalMatrix=NULL,
removeNumbers=FALSE, removePunctuation=TRUE, removeSparseTerms=0,
removeStopwords=TRUE, stemWords=FALSE, stripWhitespace=TRUE, toLower=TRUE,
weighting=weightTf)
Arguments
textColumns
Either a character vector (e.g. data$Title) or a cbind() of columns to use for training the algorithms (e.g. cbind(data$Title, data$Subject)).
language
The language to be used for stemming the text data.
minDocFreq
The minimum number of times a word should appear in a document for it to be included in the matrix. See package tm for more details.
maxDocFreq
The maximum number of times a word should appear in a document for it to be included in the matrix. See package tm for more details.
minWordLength
The minimum number of letters a word or n-gram should contain to be included in the matrix. See package tm for more details.
maxWordLength
The maximum number of letters a word or n-gram should contain to be included in the matrix. See package tm for more details.
ngramLength
The number of words to include per n-gram for the document-term matrix.
originalMatrix
The original DocumentTermMatrix used to train the models. If supplied, the new matrix is adjusted to work with the saved models.
removeNumbers
A logical parameter to specify whether to remove numbers.
removePunctuation
A logical parameter to specify whether to remove punctuation.
removeSparseTerms
See package tm for more details.
removeStopwords
A logical parameter to specify whether to remove stopwords using the language specified in language.
stemWords
A logical parameter to specify whether to stem words using the language specified in language.
stripWhitespace
A logical parameter to specify whether to strip whitespace.
toLower
A logical parameter to specify whether to make all text lowercase.
weighting
Either weightTf or weightTfIdf. See package tm for more details.
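The arguments above can be illustrated with a minimal sketch. The data frame below is hypothetical toy data, not shipped with the package. Its Title and Subject columns simply mirror the names used in the textColumns description. The sketch also assumes the RTextTools and tm packages are installed.

```r
library(RTextTools)

# Hypothetical toy data, for illustration only.
data <- data.frame(
  Title   = c("Senate passes budget bill",
              "Local team wins championship",
              "New budget cuts announced"),
  Subject = c("politics", "sports", "politics"),
  stringsAsFactors = FALSE
)

# A single character vector as textColumns, with term-frequency
# weighting (the default).
dtm_title <- create_matrix(data$Title, language = "english",
                           removeNumbers = TRUE, stemWords = FALSE)

# Two columns combined with cbind(), weighted by tf-idf.
# weightTfIdf is referenced through the tm namespace in case tm is not attached.
dtm_both <- create_matrix(cbind(data$Title, data$Subject),
                          language = "english", removeNumbers = TRUE,
                          weighting = tm::weightTfIdf)

# Each row of the result corresponds to one input document.
dim(dtm_title)
```

Both calls should return a DocumentTermMatrix with one row per document. The exact set of term columns depends on the stopword, length, and frequency settings chosen.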