gibbs_sldax() is used to fit both supervised and unsupervised topic models.
gibbs_sldax(
formula,
data,
m = 100,
burn = 0,
thin = 1,
docs,
V,
K = 2L,
model = c("lda", "slda", "sldax", "slda_logit", "sldax_logit"),
sample_beta = TRUE,
sample_theta = TRUE,
interaction_xcol = -1L,
alpha_ = 1,
gamma_ = 1,
mu0 = NULL,
sigma0 = NULL,
a0 = NULL,
b0 = NULL,
eta_start = NULL,
constrain_eta = FALSE,
proposal_sd = NULL,
return_assignments = FALSE,
correct_ls = TRUE,
verbose = FALSE,
display_progress = FALSE
)
formula: An object of class formula, i.e., a symbolic description of the model to be fitted.
data: An optional data frame containing the variables in the model.
m: The number of iterations to run the Gibbs sampler (default: 100).
burn: The number of iterations to discard as the burn-in period (default: 0).
thin: The thinning period, i.e., the interval between iterations kept after the burn-in period (default: 1).
docs: A D x max(\(N_d\)) matrix of word indices for all documents.
V: The number of unique terms in the vocabulary.
K: The number of topics (default: 2L).
model: A string denoting the type of model to fit (default: "lda"). See 'Details'.
sample_beta: A logical (default = TRUE). If TRUE, the topic-vocabulary distributions are sampled from their full conditional distribution.
sample_theta: A logical (default = TRUE). If TRUE, the topic proportions will be sampled. CAUTION: This can be memory-intensive.
interaction_xcol: EXPERIMENTAL. The column number of the design matrix for the additional predictors for which an interaction with the \(K\) topics is desired (default: -1L, no interaction). Currently only supports a single continuous predictor or a two-category categorical predictor represented as a single dummy-coded column.
alpha_: The hyper-parameter for the prior on the topic proportions (default: 1.0).
gamma_: The hyper-parameter for the prior on the topic-specific vocabulary probabilities (default: 1.0).
mu0: An optional q x 1 mean vector for the prior on the regression coefficients. See 'Details'.
sigma0: A q x q variance-covariance matrix for the prior on the regression coefficients. See 'Details'.
a0: The shape parameter for the prior on sigma2 (default: 0.001).
b0: The scale parameter for the prior on sigma2 (default: 0.001).
eta_start: A q x 1 vector of starting values for the regression coefficients.
constrain_eta: A logical (default = FALSE). If TRUE, the regression coefficients will be constrained so that they are in descending order; if FALSE, no constraints will be applied.
proposal_sd: The proposal standard deviations for drawing the regression coefficients, N(0, proposal_sd(j)), \(j = 1, \ldots, q\). Only used for model = "slda_logit" and model = "sldax_logit" (default: 2.38 for all coefficients).
return_assignments: A logical (default = FALSE). If TRUE, returns an N x \(\max N_d\) x M array of topic assignments in slot @topics. CAUTION: This can be memory-intensive.
correct_ls: Run the Stephens (2000) label-switching correction algorithm on the posterior? (default = TRUE).
verbose: Should parameter draws be output during sampling? (default: FALSE).
display_progress: Show progress bar? (default: FALSE). Do not use with verbose = TRUE.
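As a rough illustration of how m, burn, and thin interact, the base R sketch below lists which iterations would be retained under one common convention: discard the first burn iterations, then keep every thin-th one. This convention is an assumption; the descriptions above do not state exactly how gibbs_sldax() indexes retained draws, and kept_iterations() is a hypothetical helper, not part of the package.

```r
# Hypothetical helper (not part of the package): indices of the iterations
# retained when the first `burn` draws are dropped and every `thin`-th draw
# after that is kept. The indexing convention is an assumption.
kept_iterations <- function(m = 100, burn = 0, thin = 1) {
  stopifnot(m >= 1, burn >= 0, burn < m, thin >= 1)
  seq(from = burn + 1, to = m, by = thin)
}

kept <- kept_iterations(m = 100, burn = 20, thin = 4)
length(kept)  # number of retained draws under this convention
head(kept)    # first few retained iteration indices
```

A quick check like this can help when budgeting memory for options such as sample_theta = TRUE or return_assignments = TRUE, whose storage presumably grows with the number of retained draws.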
The number of regression coefficients \(q\) in supervised topic models is determined as follows: For the SLDA model with only the \(K\) topics as predictors, \(q = K\); for the SLDAX model with \(K\) topics and \(p\) additional predictors, there are two possibilities: (1) If no interaction between an additional covariate and the \(K\) topics is desired (default: interaction_xcol = -1L), \(q = p + K\); (2) if an interaction between an additional covariate and the \(K\) topics is desired (e.g., interaction_xcol = 1), \(q = p + 2K - 1\). If you supply custom values for prior parameters mu0 or sigma0, be sure that the length of mu0 (\(q\)) and/or the number of rows and columns of sigma0 (\(q \times q\)) are correct. If you supply custom starting values for eta_start, be sure that the length of eta_start is correct.
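The counting rules above can be written as a small base R sketch. n_coef() is a hypothetical helper introduced only for illustration (it is not part of the package); it may be handy for checking the length of mu0 or eta_start before calling gibbs_sldax().

```r
# Hypothetical helper (not part of the package): number of regression
# coefficients q implied by the rules described above.
#   K           number of topics
#   p           number of additional predictors (0 for SLDA)
#   interaction TRUE if one additional predictor interacts with the K topics
n_coef <- function(K, p = 0L, interaction = FALSE) {
  stopifnot(K >= 1, p >= 0)
  if (interaction && p < 1) {
    stop("an interaction requires at least one additional predictor")
  }
  if (interaction) p + 2L * K - 1L else p + K
}

n_coef(K = 3)                             # SLDA: q = K
n_coef(K = 3, p = 2)                      # SLDAX, no interaction: q = p + K
n_coef(K = 3, p = 2, interaction = TRUE)  # SLDAX, interaction: q = p + 2K - 1
```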
For model, one of c("lda", "slda", "sldax", "slda_logit", "sldax_logit"):
"lda": unsupervised topic model;
"slda": supervised topic model with a continuous outcome;
"sldax": supervised topic model with a continuous outcome and additional predictors of the outcome;
"slda_logit": supervised topic model with a dichotomous outcome (0/1);
"sldax_logit": supervised topic model with a dichotomous outcome (0/1) and additional predictors of the outcome.
For mu0, the first \(p\) elements correspond to coefficients for the \(p\) additional predictors (if none, \(p = 0\)), while elements \(p + 1\) to \(p + K\) correspond to coefficients for the \(K\) topics, and elements \(p + K + 1\) to \(p + 2K - 1\) correspond to coefficients for the interaction (if any) between one additional predictor and the \(K\) topics. By default, we use a vector of \(q\) 0s.
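To make the element ordering concrete, the base R sketch below builds a zero prior mean vector with labeled positions. coef_layout() and the labels it produces (x1, topic1, interaction1, and so on) are hypothetical and exist only to show which block each position belongs to; the package does not necessarily name the elements this way.

```r
# Hypothetical helper (not part of the package): position labels for a
# coefficient vector ordered as described above, i.e., additional predictors
# first, then the K topics, then the K - 1 interaction terms (if any).
coef_layout <- function(K, p = 0L, interaction = FALSE) {
  c(
    if (p > 0) paste0("x", seq_len(p)),
    paste0("topic", seq_len(K)),
    if (interaction) paste0("interaction", seq_len(K - 1))
  )
}

labs <- coef_layout(K = 3, p = 2, interaction = TRUE)
mu0 <- setNames(rep(0, length(labs)), labs)  # zero prior means, q = p + 2K - 1
mu0
```

Labeling the positions this way makes it easier to change the prior mean for a single block (for example, only the topic coefficients) without miscounting indices.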
For sigma0, the first \(p\) rows/columns correspond to coefficients for the \(p\) additional predictors (if none, \(p = 0\)), while rows/columns \(p + 1\) to \(p + K\) correspond to coefficients for the \(K\) topics, and rows/columns \(p + K + 1\) to \(p + 2K - 1\) correspond to coefficients for the interaction (if any) between one additional predictor and the \(K\) topics. By default, we use an identity matrix for model = "slda" and model = "sldax" and a diagonal matrix with diagonal elements (variances) of 6.25 for model = "slda_logit" and model = "sldax_logit".
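The default prior covariance matrices described above can be reproduced in base R as a starting point for a custom sigma0. default_sigma0() is a hypothetical helper written for this illustration and is not part of the package; it only mirrors the defaults stated in the text.

```r
# Hypothetical helper (not part of the package): a q x q matrix matching the
# defaults described above, i.e., an identity matrix for the continuous-outcome
# models and a diagonal matrix with variances of 6.25 for the logit models.
default_sigma0 <- function(q, model = c("slda", "sldax",
                                         "slda_logit", "sldax_logit")) {
  model <- match.arg(model)
  stopifnot(q >= 1)
  variance <- if (model %in% c("slda_logit", "sldax_logit")) 6.25 else 1
  diag(variance, nrow = q)
}

sigma0 <- default_sigma0(q = 3, model = "sldax")
sigma0[1, 1] <- 10  # e.g., a more diffuse prior on the first coefficient
sigma0
```

Starting from the default and modifying individual entries keeps the matrix at the required \(q \times q\) dimensions.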
Other Gibbs sampler: gibbs_logistic(), gibbs_mlr()
# NOT RUN {
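library(psychtm) # Package assumed to provide gibbs_sldax() and teacher_rate
library(lda) # Required if using `prep_docs()`
data(teacher_rate) # Synthetic student ratings of instructors
# Convert the text column "doc" into word indices plus a vocabulary
docs_vocab <- prep_docs(teacher_rate, "doc")
vocab_len <- length(docs_vocab$vocab)
# Fit an SLDAX model with K = 2 topics; m = 2 iterations keeps the example fast
m1 <- gibbs_sldax(rating ~ I(grade - 1), m = 2,
                  data = teacher_rate, docs = docs_vocab$documents,
                  V = vocab_len, K = 2, model = "sldax")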
# }