quanteda (version 0.9.6-1)

dfm: create a document-feature matrix

Usage

dfm(x, ...)

## S3 method for class 'character'
dfm(x, verbose = TRUE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, removeTwitter = FALSE, stem = FALSE, ignoredFeatures = NULL, keptFeatures = NULL, language = "english", thesaurus = NULL, dictionary = NULL, valuetype = c("glob", "regex", "fixed"), ...)

## S3 method for class 'tokenizedTexts'
dfm(x, verbose = TRUE, toLower = TRUE, stem = FALSE, ignoredFeatures = NULL, keptFeatures = NULL, language = "english", thesaurus = NULL, dictionary = NULL, valuetype = c("glob", "regex", "fixed"), ...)

## S3 method for class 'corpus'
dfm(x, verbose = TRUE, groups = NULL, ...)

is.dfm(x)

as.dfm(x)
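
To make concrete what a document-feature matrix holds, the following is a minimal base-R sketch of the idea. It is not quanteda's implementation: simple_dfm, to_lower, and ignored are hypothetical names chosen for illustration, loosely mirroring the toLower and ignoredFeatures arguments in the usage above, and the tokenization is a crude split on non-letter characters.

```r
# Hypothetical helper, NOT part of quanteda: builds a dense
# document-by-feature count matrix from a character vector.
simple_dfm <- function(texts, to_lower = TRUE, ignored = NULL) {
  # Remember document names before any string processing.
  doc_names <- names(texts)
  if (is.null(doc_names)) {
    doc_names <- paste0("text", seq_along(texts))
  }
  if (to_lower) {
    texts <- tolower(texts)
  }

  # Split on runs of non-letters; this also discards punctuation,
  # digits, and whitespace (a simplification of real tokenization).
  tokens <- strsplit(unname(texts), "[^[:alpha:]]+")
  tokens <- lapply(tokens, function(tok) {
    tok[nzchar(tok) & !(tok %in% ignored)]
  })

  # One column per distinct remaining token.
  features <- sort(unique(as.character(unlist(tokens))))
  counts <- matrix(0L,
                   nrow = length(tokens), ncol = length(features),
                   dimnames = list(docs = doc_names, features = features))

  # Fill each row with that document's token counts.
  for (i in seq_along(tokens)) {
    tab <- table(tokens[[i]])
    counts[i, names(tab)] <- as.integer(tab)
  }
  counts
}

texts <- c(doc1 = "The cat sat on the mat.",
           doc2 = "The dog sat, too!")
m <- simple_dfm(texts, ignored = c("the", "on"))
print(m)
```

Each row is a document and each column a feature, with cells holding counts. The functions documented here additionally offer stemming, dictionaries, thesauri, and the pattern-matching options described under Arguments, none of which this sketch attempts.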

Arguments

x
corpus, character vector, or tokenizedTexts object from which to generate the document-feature matrix
...
additional arguments passed to tokenize, which can include, for instance, ngrams and concatenator for tokenizing multi-token sequences
verbose
display messages if TRUE
toLower
convert texts to lowercase
removeNumbers
remove numbers, see tokenize
removePunct
remove punctuation, see tokenize
removeSeparators
remove separators (whitespace), see tokenize
removeTwitter
if FALSE, preserve # and @ characters, see tokenize
stem
if TRUE, stem words
ignoredFeatures
a character vector of user-supplied features to ignore, such as "stop words". You may supply any list you wish; stopwords() provides one possible list. The pattern matching type will be set by value
keptFeatures
a user-supplied regular expression defining which features to keep, while excluding all others. This can be used in lieu of a dictionary if there are only specific features that a user wishes to keep. To extract only Twitter usernames, for example, set