dfm: Create a document-feature matrix

Description

Construct a sparse document-feature matrix from a character, corpus, tokens, or even another dfm object.

Usage

dfm(
  x,
  tolower = TRUE,
  stem = FALSE,
  select = NULL,
  remove = NULL,
  dictionary = NULL,
  thesaurus = NULL,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  groups = NULL,
  verbose = quanteda_options("verbose"),
  ...
)

Arguments

x

character, corpus, tokens, or dfm object

tolower

convert all features to lowercase

stem

if TRUE, stem words

select

a pattern of user-supplied features to keep, while excluding all others. This can be used in lieu of a dictionary if there are only specific features that a user wishes to keep. To extract only Twitter usernames, for example, set select = "@*" and make sure that split_tags = FALSE is passed as an additional argument to tokens. Note: select = "^@\\w+\\b" with valuetype = "regex" would be the regular expression version of this matching pattern. The pattern matching type will be set by valuetype. See also tokens_remove().

remove

a pattern of user-supplied features to ignore, such as "stop words". stopwords() provides one possible list, but any list may be used. The pattern matching type will be set by valuetype. See also tokens_select(). For behaviour of remove with ngrams > 1, see Details.

dictionary

a dictionary object to apply to the tokens when creating the dfm

thesaurus

a dictionary object that will be applied as if exclusive = FALSE. See also tokens_lookup(). For more fine-grained control over this and other aspects of converting features into dictionary/thesaurus keys from pattern matches to values, consider creating the dfm first, and then applying dfm_lookup() separately, or using tokens_lookup() on the tokenized text before calling dfm.

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

groups

either: a character vector containing the names of document variables to be used for grouping; or a factor or object that can be coerced into a factor equal in length or rows to the number of documents. NA values of the grouping value are dropped. See groups for details.

verbose

display messages if TRUE

...

additional arguments passed to tokens; not used when x is a dfm
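
The three valuetype options described above can be illustrated with a small base R sketch. It only mimics the idea of each matching type on a hypothetical feature vector (feats is made up for illustration); it is not quanteda's implementation, and the use of utils::glob2rx() here is an assumption chosen to show how a glob pattern relates to a regular expression.

```r
# Base R only: a rough illustration of "glob", "regex", and "fixed" matching.
# The feature vector is hypothetical and not taken from any corpus.
feats <- c("@justinbieber", "#LA", "states", "taxes", "tax*")

# "glob": the wildcard pattern is translated to an anchored regular expression
glob_hits <- grepl(utils::glob2rx("tax*"), feats, ignore.case = TRUE)

# "regex": the pattern is interpreted as a regular expression as written
regex_hits <- grepl("s$", feats, ignore.case = TRUE)

# "fixed": the pattern must equal the whole feature; "*" is a literal character
fixed_hits <- tolower(feats) == tolower("tax*")

print(feats[glob_hits])
print(feats[regex_hits])
print(feats[fixed_hits])
```

With these toy features, the glob pattern "tax*" matches every feature starting with "tax", whereas the same string under fixed matching only matches the literal feature "tax*".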

Value

a dfm object

Details

The default behaviour for remove/select when constructing ngrams using dfm(x, ngrams > 1) is to remove/select any ngram constructed from a matching feature. If you wish to remove these features before constructing ngrams, first tokenize the texts with tokens(), then remove the features to be ignored with tokens_remove(), then form the ngrams with tokens_ngrams(), and finally construct the dfm from this modified tokens object. See the code examples for an illustration.

To select on and match the features of another dfm, x must also be a dfm.
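
The thesaurus argument description suggests applying tokens_lookup() to the tokenized text before calling dfm when finer control is needed. A minimal sketch of that workflow follows, assuming quanteda is installed; the text and the two-key dictionary are hypothetical and chosen only for illustration.

```r
library(quanteda)

# Hypothetical toy dictionary and text, for illustration only
dict <- dictionary(list(country = "states", taxation = "tax*"))
toks <- tokens("The United States raised taxes")

# exclusive = FALSE replaces matched tokens with their dictionary key
# and leaves all unmatched tokens in place (thesaurus-style behaviour)
toks_thes <- tokens_lookup(toks, dictionary = dict, exclusive = FALSE)
print(toks_thes)

# Build the dfm from the modified tokens object
dfmat <- dfm(toks_thes)
print(featnames(dfmat))
```

Doing the lookup on the tokens first keeps the intermediate object available, so other arguments of tokens_lookup() can be adjusted before the dfm is built.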

Examples

# NOT RUN {
## for a corpus
corp <- corpus_subset(data_corpus_inaugural, Year > 1980)
dfm(corp)
dfm(corp, tolower = FALSE)

# grouping documents by docvars in a corpus
dfm(corp, groups = "President", verbose = TRUE)

# with English stopwords and stemming
dfm(corp, remove = stopwords("english"), stem = TRUE, verbose = TRUE)
# stemming is applied to both words within an ngram too
tokens("Banking industry") %>%
    tokens_ngrams(n = 2) %>%
    dfm(stem = TRUE)

# with dictionaries
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
               opposition = c("Opposition", "reject", "notincorpus"),
               taxing = "taxing",
               taxation = "taxation",
               taxregex = "tax*",
               country = "states"))
dfm(corpus_subset(data_corpus_inaugural, Year > 1900), dictionary = dict)


# keeping only stopwords: select retains matching features (use remove = to drop them instead)
txt <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, with
             the newspaper from a boy named Seamus, in his mouth."
corp <- corpus(txt)
# note: "also" is not in the default stopwords("english")
featnames(dfm(corp, select = stopwords("english")))
# for ngrams
featnames(dfm(corp, ngrams = 2, select = stopwords("english"), remove_punct = TRUE))
featnames(dfm(corp, ngrams = 1:2, select = stopwords("english"), remove_punct = TRUE))

# removing stopwords before constructing ngrams
toks1 <- tokens(char_tolower(txt), remove_punct = TRUE)
toks2 <- tokens_remove(toks1, stopwords("english"))
toks3 <- tokens_ngrams(toks2, 2)
featnames(dfm(toks3))

# keep only certain words
dfm(corp, select = "*s")  # keep only words ending in "s"
dfm(corp, select = "s$", valuetype = "regex")

# testing Twitter functions
txttweets <- c("My homie @justinbieber #justinbieber shopping in #LA yesterday #beliebers",
                "2all the ha8ers including my bro #justinbieber #emabiggestfansjustinbieber",
                "Justin Bieber #justinbieber #belieber #fetusjustin #EMABiggestFansJustinBieber")
dfm(txttweets, select = "#*", split_tags = FALSE)  # keep only hashtags
dfm(txttweets, select = "^#.*$", valuetype = "regex", split_tags = FALSE)

# for a dfm: build the dfm first, then pass it back to dfm() for grouping
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980))
dfm(dfmat, groups = "Party")

# }