sldax-summary: Summary functions for objects of class Sldax

Description

Obtain parameter estimates, model goodness-of-fit metrics, and posterior summaries.

For SLDA or SLDAX models, label switching is handled during estimation in the gibbs_sldax() function with argument correct_ls, so it is not addressed by this function.

Usage

est_beta(mcmc_fit, burn = 0, thin = 1, stat = "mean")
est_theta(mcmc_fit, burn = 0, thin = 1, stat = "mean")
get_coherence(beta_, docs, nwords = 10)
get_exclusivity(beta_, nwords = 10, weight = 0.7)
get_toptopics(theta, ntopics)
get_topwords(beta_, nwords, vocab, method = "termscore")
get_zbar(mcmc_fit, burn = 0L, thin = 1L)
post_regression(mcmc_fit)
gg_coef(mcmc_fit, burn = 0L, thin = 1L, stat = "mean", errorbw = 0.5)
# S4 method for Sldax
gg_coef(mcmc_fit, burn = 0L, thin = 1L, stat = "mean", errorbw = 0.5)
# S4 method for Sldax
est_beta(mcmc_fit, burn = 0, thin = 1, stat = "mean")
# S4 method for Sldax
est_theta(mcmc_fit, burn = 0, thin = 1, stat = "mean")
# S4 method for matrix,matrix
get_coherence(beta_, docs, nwords = 10)
# S4 method for matrix
get_exclusivity(beta_, nwords = 10, weight = 0.7)
# S4 method for matrix
get_toptopics(theta, ntopics)
# S4 method for matrix,numeric,character
get_topwords(beta_, nwords, vocab, method = "termscore")
# S4 method for Sldax
get_zbar(mcmc_fit, burn = 0L, thin = 1L)
# S4 method for Mlr
post_regression(mcmc_fit)
# S4 method for Logistic
post_regression(mcmc_fit)
# S4 method for Sldax
post_regression(mcmc_fit)

Arguments

mcmc_fit

An object of class Sldax.

burn

The number of draws to discard as a burn-in period (default: 0).

thin

The number of draws to skip as a thinning period (default: 1; i.e., no thinning).

stat

The summary statistic to use on the posterior draws (default: "mean").

beta_

A \(K\) x \(V\) matrix of word-topic probabilities. Each row sums to 1.

docs

The \(D\) x max(\(N_d\)) matrix of documents (word indices) used to fit the Sldax model.

nwords

The number of words to retrieve (default: 10 for get_coherence() and get_exclusivity(); no default is shown in Usage for get_topwords()).

weight

The weight (between 0 and 1) to give to exclusivity (near 1) vs. frequency (near 0) (default: 0.7).

theta

A \(D\) x \(K\) matrix of \(K\) topic proportions for all \(D\) documents.

ntopics

The number of topics to retrieve (default: all topics).

vocab

A character vector of length \(V\) containing the vocabulary.

method

If "termscore", use term scores (similar to tf-idf). If "prob", use probabilities (default: "termscore").

errorbw

Positive control parameter for the width of the +/- 2 posterior standard error bars (default: 0.5).
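
As an illustration of how burn, thin, and stat interact, the following base R sketch summarizes a toy array of posterior draws. The indexing rule (drop the first burn draws, then keep every thin-th draw) is an assumption made for illustration and may not match the package internals exactly; the object names draws and keep are hypothetical.

```r
# Toy posterior draws: 10 iterations of a 2 x 3 matrix (e.g., topics x words).
set.seed(1)
ndraws <- 10
draws <- array(runif(2 * 3 * ndraws), dim = c(2, 3, ndraws))

burn <- 4  # discard the first 4 draws
thin <- 2  # then keep every 2nd draw

# Assumed rule: start right after the burn-in and step by `thin`.
keep <- seq(burn + 1, ndraws, by = thin)
keep  # draws 5, 7, and 9 are retained

# Elementwise summaries across the retained draws (stat = "mean" or "median").
est_mean <- apply(draws[, , keep, drop = FALSE], c(1, 2), mean)
est_median <- apply(draws[, , keep, drop = FALSE], c(1, 2), median)
dim(est_mean)  # same 2 x 3 shape as a single draw
```

With thin = 1 and burn = 0 (the defaults), every draw is retained.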

Value

A matrix of topic-word probability estimates.

A matrix of topic proportion estimates.

A numeric vector of coherence scores for each topic (more positive is better).

A numeric vector of exclusivity scores (more positive is better).

A data frame of the ntopics most probable topics per document.

A \(K\) x \(V\) matrix of term-scores (comparable to tf-idf).

A matrix of empirical topic proportions per document.

An object of class coda::mcmc summarizing the posterior distribution of the regression coefficients and residual variance (if applicable). Convenience functions such as summary() and plot() can be used for posterior summarization.

A ggplot object.
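
For intuition about the term scores returned when method = "termscore", the sketch below implements one common tf-idf-like definition: each topic-word probability is multiplied by the log of its ratio to the geometric mean of that word's probability across topics. Whether get_topwords() uses exactly this formula is an assumption; term_score_sketch() is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: tf-idf-like term scores from a K x V matrix whose
# rows are topic-word probability vectors (all entries must be > 0).
term_score_sketch <- function(beta_) {
  log_beta <- log(beta_)
  # Log of the geometric mean of each word's probability across topics.
  log_gm <- colMeans(log_beta)
  # beta_kv * log(beta_kv / geometric_mean_v), computed column by column.
  beta_ * sweep(log_beta, 2, log_gm, FUN = "-")
}

# Same 2 x 2 topic-word matrix used in the Examples section.
beta_hat <- matrix(c(0.5, 0.4, 0.5, 0.6), nrow = 2)
ts <- term_score_sketch(beta_hat)
ts
```

A word that is equally probable in every topic receives a score of 0 under this definition, so high scores flag words that are both probable and distinctive for a topic.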

Details

get_zbar() computes empirical topic proportions from slot @topics.
est_theta() estimates the mean or median theta matrix.
est_beta() estimates the mean or median beta matrix.
get_toptopics() creates a tibble of the topic proportion estimates for the top ntopics topics per document sorted by probability.
get_topwords() creates a tibble of topics and the top nwords words per topic sorted by probability or term score.
get_coherence() computes the coherence metric for each topic (see Mimno, Wallach, Talley, Leenders, & McCallum, 2011).
get_exclusivity() computes the exclusivity metric for each topic (see Roberts, Stewart, & Airoldi, 2013).
post_regression() creates a coda::mcmc object containing posterior information for the regression model parameters.
gg_coef() plots regression coefficients with +/- 2 posterior standard error bars.
- Warning: this function is deprecated.
- See help("Deprecated").
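
The coherence metric referenced above for get_coherence() can be sketched in base R as follows. This is a minimal illustration of the Mimno et al. (2011) formula, which sums log((D(v_m, v_l) + 1) / D(v_l)) over ordered pairs of a topic's top words, where D() counts documents containing the word(s). It assumes documents are rows of word indices padded with 0; coherence_sketch() is a hypothetical helper and the package implementation may differ in details.

```r
# Hypothetical helper: coherence per topic from a K x V matrix `beta_` and a
# D x max(N_d) matrix `docs` of word indices (0 = padding, an assumption).
coherence_sketch <- function(beta_, docs, nwords) {
  vapply(seq_len(nrow(beta_)), function(k) {
    # Indices of the `nwords` most probable words in topic k.
    top <- order(beta_[k, ], decreasing = TRUE)[seq_len(nwords)]
    # has_word[d, j] is TRUE if document d contains the j-th top word.
    has_word <- matrix(FALSE, nrow = nrow(docs), ncol = nwords)
    for (j in seq_len(nwords)) has_word[, j] <- rowSums(docs == top[j]) > 0
    score <- 0
    for (m in seq_len(nwords)[-1]) {
      for (l in seq_len(m - 1)) {
        co_df <- sum(has_word[, m] & has_word[, l])  # docs with both words
        df_l <- sum(has_word[, l])                   # docs with word l
        score <- score + log((co_df + 1) / df_l)
      }
    }
    score
  }, numeric(1))
}

# Toy corpus: 3 documents over a 4-word vocabulary.
docs <- matrix(c(1, 2, 3, 0,
                 1, 2, 0, 0,
                 3, 4, 4, 1), nrow = 3, byrow = TRUE)
beta_hat <- rbind(c(0.4, 0.3, 0.2, 0.1),
                  c(0.1, 0.2, 0.3, 0.4))
coh <- coherence_sketch(beta_hat, docs, nwords = 2)
coh
```

Note that a top word appearing in no document gives a zero denominator in this sketch, so it is only meaningful when the top words occur in the corpus.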

Examples

# NOT RUN {
# est_beta(): posterior mean and median of the topic-word probabilities
m1 <- Sldax(ndocs = 1, nvocab = 2,
            topics = array(c(1, 2, 2, 1), dim = c(1, 4, 1)),
            theta = array(c(0.5, 0.5), dim = c(1, 2, 1)),
            beta = array(c(0.5, 0.5, 0.5, 0.5), dim = c(2, 2, 1)))
est_beta(m1, stat = "mean")
est_beta(m1, stat = "median")
# est_theta(): posterior mean and median of the topic proportions
m1 <- Sldax(ndocs = 2, nvocab = 2, nchain = 2,
            topics = array(c(1, 2, 2, 1,
                             1, 2, 2, 1), dim = c(2, 2, 2)),
            theta = array(c(0.5, 0.5,
                            0.5, 0.5,
                            0.5, 0.5,
                            0.5, 0.5), dim = c(2, 2, 2)),
            loglike = rep(NaN, times = 2),
            logpost = rep(NaN, times = 2),
            lpd = matrix(NaN, nrow = 2, ncol = 2),
            eta = matrix(0.0, nrow = 2, ncol = 2),
            mu0 = c(0.0, 0.0),
            sigma0 = diag(1, 2),
            eta_start = c(0.0, 0.0),
            beta = array(c(0.5, 0.5, 0.5, 0.5,
                           0.5, 0.5, 0.5, 0.5), dim = c(2, 2, 2)))
est_theta(m1, stat = "mean")
est_theta(m1, stat = "median")
# get_coherence(): coherence of each topic, using all words in the vocabulary
mdoc <- matrix(c(1, 2, 2, 1), nrow = 1)
m1 <- Sldax(ndocs = 1, nvocab = 2,
            topics = array(c(1, 2, 2, 2), dim = c(1, 4, 1)),
            theta = array(c(0.5, 0.5), dim = c(1, 2, 1)),
            beta = array(c(0.5, 0.4, 0.5, 0.6), dim = c(2, 2, 1)))
bhat <- est_beta(m1)
get_coherence(bhat, docs = mdoc, nwords = nvocab(m1))
# get_exclusivity(): exclusivity of each topic
m1 <- Sldax(ndocs = 1, nvocab = 2,
            topics = array(c(1, 2, 2, 2), dim = c(1, 4, 1)),
            theta = array(c(0.5, 0.5), dim = c(1, 2, 1)),
            beta = array(c(0.5, 0.4, 0.5, 0.6), dim = c(2, 2, 1)))
bhat <- est_beta(m1)
get_exclusivity(bhat, nwords = nvocab(m1))
# get_toptopics(): most probable topics per document
m1 <- Sldax(ndocs = 2, nvocab = 2, nchain = 2,
            topics = array(c(1, 2, 2, 1,
                             1, 2, 2, 1), dim = c(2, 2, 2)),
            theta = array(c(0.4, 0.3,
                            0.6, 0.7,
                            0.45, 0.5,
                            0.55, 0.5), dim = c(2, 2, 2)),
            loglike = rep(NaN, times = 2),
            logpost = rep(NaN, times = 2),
            lpd = matrix(NaN, nrow = 2, ncol = 2),
            eta = matrix(0.0, nrow = 2, ncol = 2),
            mu0 = c(0.0, 0.0),
            sigma0 = diag(1, 2),
            eta_start = c(0.0, 0.0),
            beta = array(c(0.5, 0.5, 0.5, 0.5,
                           0.5, 0.5, 0.5, 0.5), dim = c(2, 2, 2)))
t_hat <- est_theta(m1, stat = "mean")
get_toptopics(t_hat, ntopics = ntopics(m1))
# get_topwords(): top words per topic by term score or by probability
m1 <- Sldax(ndocs = 1, nvocab = 2,
            topics = array(c(1, 2, 2, 2), dim = c(1, 4, 1)),
            theta = array(c(0.5, 0.5), dim = c(1, 2, 1)),
            beta = array(c(0.5, 0.4, 0.5, 0.6), dim = c(2, 2, 1)))
bhat <- est_beta(m1)
get_topwords(bhat, nwords = nvocab(m1), method = "termscore")
get_topwords(bhat, nwords = nvocab(m1), method = "prob")
# get_zbar(): empirical topic proportions from the sampled topic assignments
m1 <- Sldax(ndocs = 1, nvocab = 2,
            topics = array(c(1, 2, 2, 2), dim = c(1, 4, 1)),
            theta = array(c(0.5, 0.5), dim = c(1, 2, 1)),
            beta = array(c(0.5, 0.4, 0.5, 0.6), dim = c(2, 2, 1)))
get_zbar(m1)
# post_regression(): posterior draws of the regression parameters as a coda::mcmc object
data(mtcars)
m1 <- gibbs_mlr(mpg ~ hp, data = mtcars, m = 2)
post_regression(m1)
# }
# NOT RUN {
# gg_coef() (deprecated): plot regression coefficients from a fitted SLDAX model
library(lda) # Required if using `prep_docs()`
data(teacher_rate)  # Synthetic student ratings of instructors
docs_vocab <- prep_docs(teacher_rate, "doc")
vocab_len <- length(docs_vocab$vocab)
m1 <- gibbs_sldax(rating ~ I(grade - 1), m = 2,
                  data = teacher_rate,
                  docs = docs_vocab$documents,
                  V = vocab_len,
                  K = 2,
                  model = "sldax")
gg_coef(m1)
# }