revtools (version 0.4.0)

make_dtm: Construct a document-term matrix (DTM)

Description

Takes bibliographic data and converts it to a DTM for passing to topic models.

Usage

make_dtm(x, stop_words, min_freq, max_freq)

Arguments

x

a vector of strings

stop_words

optional vector of strings, listing terms to be removed from the DTM prior to analysis

min_freq

minimum proportion of entries that a term must be found in to be retained in the analysis. Defaults to 0.01.

max_freq

maximum proportion of entries in which a term may be found and still be retained in the analysis. Defaults to 0.85.

Value

An object of class 'matrix', listing the terms (columns) present in each string (rows).

Details

This is primarily intended to be called internally by screen_topics, but is made available so that users can generate their own topic models with the same properties as those in revtools. It takes the words in the title, keywords and abstracts of the supplied references and uses them to construct a DTM.

This function uses some standard tools like stemming, converting words to lower case, and removal of numbers or punctuation. It also replaces stemmed words with the most common full word, which doesn't affect the calculations, but makes the resulting analyses easier to interpret. It doesn't use part-of-speech tagging.
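As a rough illustration of the simpler steps described above, the following base R sketch lower-cases a vector of strings, strips numbers and punctuation, and counts terms per string. It is not the revtools implementation: simple_dtm is a hypothetical helper, it assumes x contains at least one word, and stemming and the replacement of stems with full words are deliberately left out.

```r
# Hypothetical helper for illustration only; NOT part of revtools.
# Stemming and stem replacement are omitted.
simple_dtm <- function(x) {
  cleaned <- tolower(x)                                     # lower case
  cleaned <- gsub("[[:punct:]]|[[:digit:]]", " ", cleaned)  # drop punctuation and numbers
  tokens <- strsplit(trimws(cleaned), "\\s+")               # split on whitespace
  terms <- sort(unique(unlist(tokens)))
  terms <- terms[nzchar(terms)]
  # count how often each term occurs in each string
  counts <- lapply(tokens, function(tok) {
    as.numeric(table(factor(tok, levels = terms)))
  })
  matrix(
    unlist(counts),
    nrow = length(x),
    byrow = TRUE,
    dimnames = list(NULL, terms)
  )
}

docs <- c("Bird migration in 2019!", "Migration of birds, and bird song.")
toy_dtm <- simple_dtm(docs)
toy_dtm
```

Each row of toy_dtm corresponds to one input string and each column to one term. Because there is no stemming here, "bird" and "birds" stay in separate columns, which is the kind of split that stemming is meant to remove.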

Words that occur in two or fewer entries are always removed by make_dtm, so values of min_freq that imply a threshold below this will not affect the result. Values of max_freq are applied as supplied.
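The interaction between the fixed two-entry floor and the proportional thresholds can be sketched in base R as follows. This is an assumption-laden illustration, not the revtools source: filter_terms and its floor_count argument are hypothetical, and treating min_freq and max_freq as inclusive bounds is a guess.

```r
# Hypothetical helper for illustration only; NOT part of revtools.
# Assumes min_freq and max_freq are inclusive bounds on the proportion
# of entries (rows) that contain each term (column).
filter_terms <- function(dtm, min_freq = 0.01, max_freq = 0.85, floor_count = 2) {
  n_entries <- nrow(dtm)
  doc_count <- colSums(dtm > 0)        # number of entries containing each term
  doc_prop <- doc_count / n_entries    # the same, as a proportion
  keep <- doc_count > floor_count &    # terms in 2 entries or fewer are dropped
    doc_prop >= min_freq &
    doc_prop <= max_freq
  dtm[, keep, drop = FALSE]
}

# Toy DTM with 4 entries (rows); values are filled column by column
toy <- matrix(
  c(1, 1, 1, 1,   # "bird":  found in 4 of 4 entries
    1, 1, 1, 0,   # "nest":  found in 3 of 4 entries
    1, 1, 0, 0,   # "song":  found in 2 of 4 entries
    1, 0, 0, 0),  # "moult": found in 1 of 4 entries
  nrow = 4,
  dimnames = list(NULL, c("bird", "nest", "song", "moult"))
)

default_result <- filter_terms(toy)
half_result <- filter_terms(toy, min_freq = 0.5)
colnames(default_result)
```

With four entries, min_freq = 0.01 and min_freq = 0.5 both imply a threshold at or below two entries, so the fixed floor decides the outcome and the two calls return the same columns. "bird" is dropped by max_freq because it appears in every entry.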

This function is synonymous with the earlier function make_DTM, which will be removed from future versions of revtools.

See Also

run_topic_model, screen_topics

Examples

# NOT RUN {
# import some data
file_location <- system.file(
  "extdata",
  "avian_ecology_bibliography.ris",
  package = "revtools")
x <- read_bibliography(file_location)

# construct a document-term matrix
# note: this can take a long time to run for large datasets
x_dtm <- make_dtm(x$title)
dim(x_dtm) # 20 articles, 10 words
# }