Construct a sparse document-feature matrix from a character, corpus, tokens, or even another dfm object.
dfm(x, tolower = TRUE, stem = FALSE, select = NULL, remove = NULL,
dictionary = NULL, thesaurus = NULL, valuetype = c("glob", "regex",
"fixed"), groups = NULL, verbose = quanteda_options("verbose"), ...)
tolower: convert all features to lowercase
stem: if TRUE, stem words
select: a pattern of user-supplied features to keep, while excluding all others. This can be used in lieu of a dictionary if there are only specific features that a user wishes to keep. To extract only Twitter usernames, for example, set select = "@*" and make sure that remove_twitter = FALSE is passed as an additional argument to tokens. Note: select = "^@\\w+\\b" would be the regular expression version of this matching pattern. The pattern matching type is set by valuetype. See also tokens_remove.
remove: a pattern of user-supplied features to ignore, such as "stop words". To access one possible list (you may use any list you wish), use stopwords(). The pattern matching type is set by valuetype. See also tokens_select. For the behaviour of remove with ngrams > 1, see Details.
dictionary: a dictionary object to apply to the tokens when creating the dfm
thesaurus: a dictionary object that will be applied as if exclusive = FALSE. See also tokens_lookup. For more fine-grained control over this and other aspects of converting features into dictionary/thesaurus keys from pattern matches to values, consider creating the dfm first and then applying dfm_lookup separately, or using tokens_lookup on the tokenized text before calling dfm.
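The difference between applying a dictionary exclusively and applying it as a thesaurus (exclusive = FALSE) can be sketched in base R. This is only an illustration of the idea, not quanteda's implementation: the helper lookup_key() and the toy feature list are hypothetical, and upper-casing the thesaurus keys is an assumption made here to set them apart from unmatched features.

```r
# Base-R illustration only; lookup_key() is a hypothetical helper.
feats <- c("taxes", "holiday", "states", "freedom")
dict  <- list(taxregex  = "tax*",
              christmas = c("christmas", "santa", "holiday"))

# Return the first dictionary key whose glob patterns match the feature
lookup_key <- function(feature, dict) {
  for (key in names(dict)) {
    rx <- utils::glob2rx(dict[[key]])
    if (any(vapply(rx, grepl, logical(1), x = feature))) return(key)
  }
  NA_character_
}

keys <- vapply(feats, lookup_key, character(1), dict = dict, USE.NAMES = FALSE)

# Exclusive lookup: only matched features survive, replaced by their key
exclusive <- keys[!is.na(keys)]

# Thesaurus-style lookup: unmatched features are kept as they are
thesaurus <- ifelse(is.na(keys), feats, toupper(keys))

print(exclusive)
print(thesaurus)
```

In the exclusive case "states" and "freedom" disappear; in the thesaurus case they remain alongside the substituted keys.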
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
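As a rough illustration of how the three matching types differ, the following base-R sketch applies each style to a small, made-up feature vector. It uses utils::glob2rx() and grepl() rather than quanteda itself, so it shows the matching semantics only; details such as case handling in quanteda may differ.

```r
# Base-R illustration only; this is not quanteda's internal matching code.
feats <- c("@justinbieber", "#LA", "tax", "taxes", "syntax")

# "glob": wildcard pattern, translated here to an anchored regular expression
glob_hits <- feats[grepl(utils::glob2rx("tax*"), feats)]

# "regex": regular expression, unanchored unless you add ^ or $
regex_hits <- feats[grepl("tax", feats)]

# "fixed": the pattern must equal the feature exactly
fixed_hits <- feats[feats == "tax"]

# The Twitter-username patterns from the select argument, as glob and as regex
user_glob  <- feats[grepl(utils::glob2rx("@*"), feats)]
user_regex <- feats[grepl("^@\\w+\\b", feats, perl = TRUE)]

print(list(glob = glob_hits, regex = regex_hits, fixed = fixed_hits))
print(list(user_glob = user_glob, user_regex = user_regex))
```

Note that the unanchored regular expression also picks up "syntax", which the glob pattern "tax*" does not.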
groups: either a character vector containing the names of document variables to be used for grouping, or a factor (or an object that can be coerced into a factor) equal in length or rows to the number of documents. See groups for details.
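Conceptually, grouping collapses the rows of the document-feature matrix so that documents sharing a group value are combined. The sketch below mimics this on a small dense matrix with base R's rowsum(); the counts and the party factor are invented for illustration, and it assumes that grouped documents have their feature counts summed.

```r
# Base-R illustration only; a real dfm is sparse and is grouped by dfm() itself.
counts <- matrix(c(1, 0, 2,
                   0, 3, 1,
                   4, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"),
                                 c("tax", "jobs", "peace")))

# One group label per document, in document order
party <- factor(c("Govt", "Opposition", "Govt"))

# Sum the feature counts of all documents that share a group label
grouped <- rowsum(counts, group = party)
print(grouped)
```

The result has one row per group level, which is why a factor passed to groups must have exactly one element per document.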
verbose: display messages if TRUE
Returns a dfm-class object.
The default behavior for remove/select when constructing ngrams using dfm(x, ngrams > 1) is to remove/select any ngram constructed from a matching feature. If you wish to remove these features before constructing ngrams, you will need to first tokenize the texts, then remove the features to be ignored, then form the ngrams, and finally construct the dfm using this modified tokens object. See the code examples for an illustration.
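The effect of the ordering can be seen in a small base-R sketch that builds bigrams by hand. The token vector, the two-word stop list, and the helper make_bigrams() are hypothetical and stand in for the quanteda functions; the "_" separator is chosen to resemble quanteda's ngram features.

```r
# Base-R illustration only; make_bigrams() is a hypothetical helper.
toks  <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
stops <- c("the", "over")  # toy stop list, not stopwords("english")

# Join each token with its successor
make_bigrams <- function(x) {
  if (length(x) < 2) return(character(0))
  paste(head(x, -1), tail(x, -1), sep = "_")
}

# (a) Form bigrams first, then drop every bigram containing a stop word
bigrams_all <- make_bigrams(toks)
has_stop <- vapply(strsplit(bigrams_all, "_", fixed = TRUE),
                   function(parts) any(parts %in% stops), logical(1))
removed_after <- bigrams_all[!has_stop]

# (b) Drop stop words first, then form bigrams from what is left
removed_before <- make_bigrams(toks[!toks %in% stops])

print(removed_after)
print(removed_before)
```

Approach (b) creates the bigram "jumps_lazy", which joins two words that were not adjacent in the original text, while approach (a) never does this.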
To select on and match the features of another dfm, x must also be a dfm.
# NOT RUN {
## for a corpus
corpus_post80inaug <- corpus_subset(data_corpus_inaugural, Year > 1980)
dfm(corpus_post80inaug)
dfm(corpus_post80inaug, tolower = FALSE)
# grouping documents by docvars in a corpus
dfm(corpus_post80inaug, groups = "President", verbose = TRUE)
# with English stopwords and stemming
dfm(corpus_post80inaug, remove = stopwords("english"), stem = TRUE, verbose = TRUE)
# stemming works for both words in ngrams too
dfm("Banking industry", stem = TRUE, ngrams = 2, verbose = FALSE)
# with dictionaries
corpus_post1900inaug <- corpus_subset(data_corpus_inaugural, Year > 1900)
mydict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxing = "taxing",
                          taxation = "taxation",
                          taxregex = "tax*",
                          country = "states"))
dfm(corpus_post1900inaug, dictionary = mydict)
# keeping only stopwords with select (pass remove = stopwords("english") to drop them instead)
test_text <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, with
the newspaper from a boy named Seamus, in his mouth."
test_corpus <- corpus(test_text)
# note: "also" is not in the default stopwords("english")
featnames(dfm(test_corpus, select = stopwords("english")))
# for ngrams
featnames(dfm(test_corpus, ngrams = 2, select = stopwords("english"), remove_punct = TRUE))
featnames(dfm(test_corpus, ngrams = 1:2, select = stopwords("english"), remove_punct = TRUE))
# removing stopwords before constructing ngrams
tokens_all <- tokens(char_tolower(test_text), remove_punct = TRUE)
tokens_no_stopwords <- tokens_remove(tokens_all, stopwords("english"))
tokens_ngrams_no_stopwords <- tokens_ngrams(tokens_no_stopwords, 2)
featnames(dfm(tokens_ngrams_no_stopwords, verbose = FALSE))
# keep only certain words
dfm(test_corpus, select = "*s", verbose = FALSE) # keep only words ending in "s"
dfm(test_corpus, select = "s$", valuetype = "regex", verbose = FALSE)
# testing Twitter functions
test_tweets <- c("My homie @justinbieber #justinbieber shopping in #LA yesterday #beliebers",
                 "2all the ha8ers including my bro #justinbieber #emabiggestfansjustinbieber",
                 "Justin Bieber #justinbieber #belieber #fetusjustin #EMABiggestFansJustinBieber")
dfm(test_tweets, select = "#*", remove_twitter = FALSE) # keep only hashtags
dfm(test_tweets, select = "^#.*$", valuetype = "regex", remove_twitter = FALSE)
# for a dfm
dfm1 <- dfm(data_corpus_irishbudget2010)
dfm2 <- dfm(dfm1,
            groups = ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
                            "Govt", "Opposition"),
            tolower = FALSE, verbose = TRUE)
# }