dfm: create a document-feature matrix

Usage

dfm(x, ...)
## S3 method for class 'character':
dfm(x, verbose = TRUE, toLower = TRUE,
  removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
  removeTwitter = FALSE, stem = FALSE, ignoredFeatures = NULL,
  keptFeatures = NULL, language = "english", thesaurus = NULL,
  dictionary = NULL, valuetype = c("glob", "regex", "fixed"), ...)
## S3 method for class 'tokenizedTexts':
dfm(x, verbose = TRUE, toLower = TRUE,
  stem = FALSE, ignoredFeatures = NULL, keptFeatures = NULL,
  language = "english", thesaurus = NULL, dictionary = NULL,
  valuetype = c("glob", "regex", "fixed"), ...)
## S3 method for class 'corpus':
dfm(x, verbose = TRUE, groups = NULL, ...)
is.dfm(x)
as.dfm(x)
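
To make the result of these calls concrete, here is a minimal base R sketch of what a document-feature matrix contains: one row per document, one column per feature, and token counts in the cells. It is not quanteda's implementation. The function simple_dfm and its arguments to_lower and ignored are hypothetical names used only for illustration; they loosely mirror toLower, removePunct, removeNumbers and ignoredFeatures, with exact ("fixed") matching assumed for the ignored features.

```r
# Illustrative sketch only: simple_dfm() is a hypothetical helper, not part of quanteda.
simple_dfm <- function(texts, to_lower = TRUE, ignored = NULL) {
  # Keep document names, or generate default ones.
  doc_names <- names(texts)
  if (is.null(doc_names)) doc_names <- paste0("text", seq_along(texts))

  if (to_lower) texts <- tolower(texts)

  # Replace punctuation and digits with spaces, then split on whitespace.
  cleaned <- gsub("[[:punct:][:digit:]]+", " ", texts)
  tokens <- strsplit(trimws(cleaned), "[[:space:]]+")

  # Drop empty tokens and any ignored features (exact matching assumed).
  tokens <- lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% ignored)])

  # The features are the sorted unique tokens across all documents.
  features <- sort(unique(as.character(unlist(tokens))))

  # Fill a documents-by-features matrix of integer counts.
  counts <- matrix(0L, nrow = length(tokens), ncol = length(features),
                   dimnames = list(docs = doc_names, features = features))
  for (i in seq_along(tokens)) {
    if (length(tokens[[i]]) > 0) {
      tab <- table(tokens[[i]])
      counts[i, names(tab)] <- as.integer(tab)
    }
  }
  counts
}

texts <- c(doc1 = "The cat sat on the mat.", doc2 = "The dog sat, too!")
m <- simple_dfm(texts, ignored = c("the", "on"))
print(m)
```

The sketch returns a dense integer matrix for readability. It leaves out stemming, dictionaries, thesauri and glob or regex matching, all of which the real methods handle through the arguments described below.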

Arguments

x

corpus, character vector, or tokenizedTexts object from which to generate the document-feature matrix

...

additional arguments passed to tokenize, such as ngrams and concatenator for tokenizing multi-token sequences

verbose

display messages if TRUE

toLower

convert texts to lowercase

removeNumbers

remove numbers, see tokenize

removePunct

remove punctuation, see tokenize

removeSeparators

remove separators (whitespace), see tokenize

removeTwitter

if FALSE, preserve # and @ characters, see tokenize

stem

if TRUE, stem words

ignoredFeatures

a character vector of user-supplied features to ignore, such as "stop words". stopwords() provides one possible list, but any list you wish may be supplied. The pattern matching type is set by valuetype.

keptFeatures

a user-supplied regular expression defining which features to keep, while excluding all others. This can be used in lieu of a dictionary if there are only specific features that a user wishes to keep. To extract only Twitter usernames, for example, set