udpipe (version 0.3)

predict.LDA_VEM: Predict method for an object of class LDA_VEM or class LDA_Gibbs

Description

Gives either the predictions to which topic a document belongs or the term posteriors by topic indicating which terms are emitted by each topic. If you provide in newdata a document term matrix for which a document does not contain any text and hence does not have any terms with nonzero entries, the prediction will give as topic prediction NA values (see the examples).

Usage

# S3 method for LDA_VEM
predict(object, newdata, type = c("topics", "terms"),
  min_posterior = -1, min_terms = 0, labels, ...)

# S3 method for LDA_Gibbs predict(object, newdata, type = c("topics", "terms"), min_posterior = -1, min_terms = 0, labels, ...)

Arguments

object

an object of class LDA_VEM or LDA_Gibbs as returned by LDA from the topicmodels package

newdata

a document/term matrix containing data for which to make a prediction

type

either 'topic' or 'terms' for the topic predictions or the term posteriors

min_posterior

numeric in 0-1 range to output only terms emitted by each topic which have a posterior probability equal or higher than min_posterior. Only used if type is 'terms'. Provide -1 if you want to keep all values.

min_terms

integer indicating the minimum number of terms to keep in the output when type is 'terms'. Defaults to 0.

labels

a character vector of the same length as the number of topics in the topic model. Indicating how to label the topics. Only valid for type = 'topic'. Defaults to topic_prob_001 up to topic_prob_999.

...

further arguments passed on to topicmodels::posterior

Value

  • in case of type = 'topic': a data.table with columns doc_id, topic (the topic number to which the document is assigned to), topic_label (the topic label) topic_prob (the posterior probability score for that topic), topic_probdiff_2nd (the probability score for that topic - the probability score for the 2nd highest topic) and the probability scores for each topic as indicated by topic_labelyourownlabel

  • n case of type = 'terms': a list of data.frames with columns term and prob, giving the posterior probability that each term is emitted by the topic

See Also

posterior-methods

Examples

Run this code
# NOT RUN {
## Build document/term matrix on dutch nouns
data(brussels_reviews_anno)
data(brussels_reviews)
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("JJ"))
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
dtm <- document_term_matrix(x)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 10)
dtm <- dtm_remove_tfidf(dtm, top = 100)

## Fit a topicmodel using VEM
library(topicmodels)
mymodel <- LDA(x = dtm, k = 4, method = "VEM")

## Get topic terminology
terminology <- predict(mymodel, type = "terms", min_posterior = 0.05, min_terms = 3)
terminology

## Get scores alongside the topic model
dtm <- document_term_matrix(x, vocabulary = mymodel@terms)
scores <- predict(mymodel, newdata = dtm, type = "topics")
scores <- predict(mymodel, newdata = dtm, type = "topics", 
                  labels = c("mylabel1", "xyz", "app-location", "newlabel"))
head(scores)
table(scores$topic)
table(scores$topic_label)
table(scores$topic, exclude = c())
table(scores$topic_label, exclude = c())

## Fit a topicmodel using Gibbs
library(topicmodels)
mymodel <- LDA(x = dtm, k = 4, method = "Gibbs")
terminology <- predict(mymodel, type = "terms", min_posterior = 0.05, min_terms = 3)
scores <- predict(mymodel, type = "topics", newdata = dtm)
# }

Run the code above in your browser using DataCamp Workspace