
Make a TermDocumentMatrix from a vector of text and an optional vector of documents. To stem the documents as well, use q_tdm_stem, the version of q_tdm that uses SnowballC's wordStem.
q_tdm(text, docs = seq_along(text), to = "tm", keep.hyphen = FALSE,
    ngrams = NULL, ...)

q_tdm_stem(text, docs = seq_along(text), to = "tm", keep.hyphen = FALSE,
    ngrams = NULL, ...)
text
A vector of strings.
docs
A vector of document names.
to
Target conversion format, consisting of the name of the package into whose document-term matrix representation the dfm will be converted:
"lda"
a list with components "documents" and "vocab" as needed by lda.collapsed.gibbs.sampler from the lda package
"tm"
a DocumentTermMatrix from the tm package
"stm"
the format for the stm package
"austin"
the wfm format from the austin package
"topicmodels"
the "dtm" format as used by the topicmodels package
keep.hyphen
Logical. If TRUE, hyphens are retained in the terms (e.g., "math-like" is kept as "math-like"); otherwise they split the terms (e.g., "math-like" is converted to "math" & "like").
ngrams
A vector of ngrams (multiple words with spaces). The ngrams supplied here are retained as terms in the matrix.
...
Additional arguments passed to dfm.
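The effect described for keep.hyphen can be sketched in base R. This is only an illustration of the documented behavior, not the package's implementation: split_terms is a hypothetical helper introduced here, and the lowercasing and punctuation handling are assumptions made for the sketch.

```r
# Hypothetical helper (NOT part of the package): imitates the documented
# keep.hyphen behavior using base R only.
split_terms <- function(x, keep.hyphen = FALSE) {
    x <- tolower(x)
    # When hyphens are not kept, treat them as term separators
    if (!keep.hyphen) x <- gsub("-", " ", x, fixed = TRUE)
    # Assumption for this sketch: drop everything except letters,
    # hyphens, and whitespace
    x <- gsub("[^a-z\\s-]", " ", x, perl = TRUE)
    # Split the cleaned string on runs of whitespace
    strsplit(trimws(x), "\\s+", perl = TRUE)[[1]]
}

# "math-like" stays a single term
split_terms("A math-like proof", keep.hyphen = TRUE)

# "math-like" becomes the two terms "math" and "like"
split_terms("A math-like proof", keep.hyphen = FALSE)
```

The real functions build a full matrix rather than a character vector; the sketch isolates only the tokenization difference that the argument controls.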
# NOT RUN {
(x <- with(presidential_debates_2012, q_tdm(dialogue, paste(time, tot, sep = "_"))))
tm::weightTfIdf(x)
(x2 <- with(presidential_debates_2012, q_tdm_stem(dialogue, paste(time, tot, sep = "_"))))
remove_stopwords(x2, stem=TRUE)
# }