gibbs_sldax: Fit supervised or unsupervised topic models (SLDAX or LDA)

Description

gibbs_sldax() is used to fit both supervised and unsupervised topic models.

Usage

gibbs_sldax(
  formula,
  data,
  m = 100,
  burn = 0,
  thin = 1,
  docs,
  V,
  K = 2L,
  model = c("lda", "slda", "sldax", "slda_logit", "sldax_logit"),
  sample_beta = TRUE,
  sample_theta = TRUE,
  interaction_xcol = -1L,
  alpha_ = 1,
  gamma_ = 1,
  mu0 = NULL,
  sigma0 = NULL,
  a0 = NULL,
  b0 = NULL,
  eta_start = NULL,
  constrain_eta = FALSE,
  proposal_sd = NULL,
  return_assignments = FALSE,
  correct_ls = TRUE,
  verbose = FALSE,
  display_progress = FALSE
)

Arguments

formula

An object of class formula: a symbolic description of the model to be fitted.

data

An optional data frame containing the variables in the model.

m

The number of iterations to run the Gibbs sampler (default: 100).

burn

The number of iterations to discard as the burn-in period (default: 0).

thin

The period of iterations to keep after the burn-in period (default: 1).

docs

A D x max(\(N_d\)) matrix of word indices for all documents.

V

The number of unique terms in the vocabulary.

K

The number of topics (default: 2).

model

A string denoting the type of model to fit. See 'Details'. (default: "lda").

sample_beta

A logical (default = TRUE): If TRUE, the topic-vocabulary distributions are sampled from their full conditional distribution.

sample_theta

A logical (default = TRUE): If TRUE, the topic proportions will be sampled. CAUTION: This can be memory-intensive.

interaction_xcol

EXPERIMENTAL: The column number of the design matrix for the additional predictors for which an interaction with the \(K\) topics is desired (default: -1L, no interaction). Currently only supports a single continuous predictor or a two-category categorical predictor represented as a single dummy-coded column.

alpha_

The hyper-parameter for the prior on the topic proportions (default: 1.0).

gamma_

The hyper-parameter for the prior on the topic-specific vocabulary probabilities (default: 1.0).

mu0

An optional q x 1 mean vector for the prior on the regression coefficients. See 'Details'.

sigma0

A q x q variance-covariance matrix for the prior on the regression coefficients. See 'Details'.

a0

The shape parameter for the prior on sigma2 (default: 0.001).

b0

The scale parameter for the prior on sigma2 (default: 0.001).

eta_start

A q x 1 vector of starting values for the regression coefficients.

constrain_eta

A logical (default = FALSE): If TRUE, the regression coefficients will be constrained so that they are in descending order; if FALSE, no constraints will be applied.

proposal_sd

The proposal standard deviations for drawing the regression coefficients, N(0, proposal_sd(j)), \(j = 1, \ldots, q\). Only used for model = "slda_logit" and model = "sldax_logit" (default: 2.38 for all coefficients).

return_assignments

A logical (default = FALSE): If TRUE, returns an N x max(\(N_d\)) x M array of topic assignments in slot @topics. CAUTION: This can be memory-intensive.

correct_ls

Should the Stephens (2000) label-switching correction algorithm be run on the posterior draws? (default: TRUE).

verbose

Should parameter draws be output during sampling? (default: FALSE).

display_progress

Show progress bar? (default: FALSE). Do not use with verbose = TRUE.

Value

An object of class Sldax.

Details

The number of regression coefficients q in supervised topic models is determined as follows:

For the SLDA model with only the \(K\) topics as predictors, \(q = K\);
for the SLDAX model with \(K\) topics and \(p\) additional predictors and no interaction between an additional covariate and the \(K\) topics (default: interaction_xcol = -1L), \(q = p + K\);
for the SLDAX model with \(K\) topics, \(p\) additional predictors, and an interaction between one additional covariate and the \(K\) topics (e.g., interaction_xcol = 1), \(q = p + 2K - 1\).

If you supply custom values for the prior parameters mu0 or sigma0, make sure that the length of mu0 (\(q\)) and the number of rows and columns of sigma0 (\(q \times q\)) are correct. If you supply custom starting values in eta_start, make sure that its length is also \(q\).
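The counting rules above can be checked with a few lines of base R. This is only a sketch: `n_coef()` is a hypothetical helper written for illustration and is not part of the package.

```r
# Hypothetical helper (not provided by the package): number of regression
# coefficients q implied by K topics, p additional predictors, and an
# optional interaction between one additional predictor and the topics.
n_coef <- function(K, p = 0L, interaction = FALSE) {
  stopifnot(K >= 1, p >= 0)
  if (interaction && p < 1) {
    stop("an interaction requires at least one additional predictor")
  }
  if (interaction) p + 2 * K - 1 else p + K
}

q_slda      <- n_coef(K = 2)                               # q = K
q_sldax     <- n_coef(K = 2, p = 1)                        # q = p + K
q_sldax_int <- n_coef(K = 2, p = 1, interaction = TRUE)    # q = p + 2K - 1
```

The resulting value is the length to use for mu0 and eta_start and the dimension to use for sigma0.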

The model argument must be one of c("lda", "slda", "sldax", "slda_logit", "sldax_logit"):

"lda": unsupervised topic model;
"slda": supervised topic model with a continuous outcome;
"sldax": supervised topic model with a continuous outcome and additional predictors of the outcome;
"slda_logit": supervised topic model with a dichotomous outcome (0/1);
"sldax_logit": supervised topic model with a dichotomous outcome (0/1) and additional predictors of the outcome.

For mu0, the first \(p\) elements correspond to coefficients for the \(p\) additional predictors (if none, \(p = 0\)), while elements \(p + 1\) to \(p + K\) correspond to coefficients for the \(K\) topics, and elements \(p + K + 1\) to \(p + 2K - 1\) correspond to coefficients for the interaction (if any) between one additional predictor and the \(K\) topics. By default, we use a vector of \(q\) 0s.

For sigma0, the first \(p\) rows/columns correspond to coefficients for the \(p\) additional predictors (if none, \(p = 0\)), while rows/columns \(p + 1\) to \(p + K\) correspond to coefficients for the \(K\) topics, and rows/columns \(p + K + 1\) to \(p + 2K - 1\) correspond to coefficients for the interaction (if any) between one additional predictor and the \(K\) topics. By default, we use an identity matrix for model = "slda" and model = "sldax" and a diagonal matrix with diagonal elements (variances) of 6.25 for model = "slda_logit" and model = "sldax_logit".
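As a sketch of the layout described above, the following base R code builds prior parameters that match the stated defaults for a model with \(K = 2\) topics, \(p = 1\) additional predictor, and no interaction. The object names and the coefficient labels are illustrative assumptions, not names used by the package.

```r
K <- 2L       # number of topics
p <- 1L       # number of additional predictors
q <- p + K    # no interaction, so q = p + K

# Prior mean: a vector of q zeros (the documented default)
mu0 <- rep(0, q)

# Prior covariance: identity matrix for "slda"/"sldax";
# diagonal matrix with variances of 6.25 for "slda_logit"/"sldax_logit"
sigma0_gaussian <- diag(1, q)
sigma0_logit    <- diag(6.25, q)

# Illustrative labels only: additional predictors come first, then topics
coef_labels <- c(paste0("x", seq_len(p)), paste0("topic", seq_len(K)))
names(mu0) <- coef_labels
dimnames(sigma0_gaussian) <- list(coef_labels, coef_labels)
dimnames(sigma0_logit)    <- list(coef_labels, coef_labels)
```

Objects built this way could then be supplied through the mu0 and sigma0 arguments, for example mu0 = mu0 and sigma0 = sigma0_gaussian with model = "sldax".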

Examples

# NOT RUN {
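library(psychtm) # Provides gibbs_sldax()
library(lda) # Required if using `prep_docs()`

data(teacher_rate)  # Synthetic student ratings of instructors
# Convert the text in column "doc" to word indices plus a vocabulary
docs_vocab <- prep_docs(teacher_rate, "doc")
vocab_len <- length(docs_vocab$vocab)
# Fit an SLDAX model with K = 2 topics and one additional predictor.
# m = 2 keeps the example fast; it is far too few iterations for inference.
m1 <- gibbs_sldax(rating ~ I(grade - 1), m = 2,
                  data = teacher_rate, docs = docs_vocab$documents,
                  V = vocab_len, K = 2, model = "sldax")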

# }