
textir (version 1.4)

topics: Estimation for Topic Models

Description

Maximum a posteriori (MAP) estimation of topic models.

Usage

topics(counts, K, alpha=NULL, initheta=NULL, tol=0.1, 
bf=FALSE, kill=2, ord=TRUE, verb=1, ...)

Arguments

counts
A matrix of multinomial response counts in ncol(counts) phrases/categories for nrow(counts) documents/observations. Can be either a simple matrix or a simple_triplet_matrix.
K
The number of latent topics. If length(K)>1, topics will find the Bayes factor (vs a null single topic model) for each element and return parameter estimates for the highest probability K.
alpha
Optional Dirichlet prior concentration parameter for topic-phrase probabilities. Defaults to 1/(K*ncol(counts)).
initheta
Optional start-location for $[\theta_1 ... \theta_K]$, the topic-phrase probabilities. Dimensions must accord with the smallest element of K. If initheta=NULL, the initial estimates are built by incrementally adding topic
tol
Convergence tolerance. The optimization terminates when log posterior increase over a full parameter-set update is less than tol.
bf
An indicator for whether or not to calculate the Bayes factor for univariate K. If length(K)>1, this is ignored and Bayes factors are always calculated.
kill
For choosing from multiple K numbers of topics (evaluated in increasing order), the search will stop after kill consecutive drops in the corresponding Bayes factor. Specify kill=0 if you want Bayes factors for
ord
If TRUE, the returned topics (columns of theta) will be ordered by decreasing usage (i.e., by decreasing colSums(omega)).
verb
A switch for controlling printed output. verb > 0 will print something, with the level of detail increasing with verb.
...
Additional arguments to the undocumented internal tpx* functions.
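
As a rough illustration of the counts and alpha arguments described above, the sketch below builds a tiny made-up count matrix, converts it to sparse form, and computes the documented default for alpha. It assumes the slam package (which provides simple_triplet_matrix) is installed; the toy data and variable names are invented for illustration and are not part of textir.

```r
# Toy data: 2 documents (rows) by 3 phrases (columns); values are made up.
counts <- matrix(c(2, 0, 1,
                   0, 3, 0), nrow = 2, byrow = TRUE)

# Sparse form of the same matrix, assuming the slam package is available.
if (requireNamespace("slam", quietly = TRUE)) {
  counts_stm <- slam::as.simple_triplet_matrix(counts)
  print(dim(counts_stm))
}

# Documented default for alpha when it is left as NULL: 1/(K*ncol(counts)).
K <- 3
alpha_default <- 1 / (K * ncol(counts))
print(alpha_default)
```

Either the dense or the sparse form could then be passed as the counts argument.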

Value

A topics object, which is a list with entries:
  • K: The number of latent topics estimated. If input length(K)>1, on output this is a single value corresponding to the model with the highest Bayes factor.
  • theta: The ncol(counts) by K matrix of estimated topic-phrase probabilities.
  • omega: The nrow(counts) by K matrix of estimated document-topic weights.
  • BF: The log Bayes factor for each number of topics in the input K, against a null single topic model.
  • residuals: Standardized residuals (approximately $\sim N(0,1)$), for nonzero count entries only.
  • X: The input count matrix, in simple_triplet_matrix format.
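
The shapes of theta and omega, and the effect of ord=TRUE, can be sketched in base R. The matrices below are hand-made stand-ins, not output from topics; only the dimensions and the colSums(omega) ordering rule are taken from the descriptions above.

```r
# Hypothetical estimates for p = 3 phrases, n = 4 documents, K = 2 topics.
theta <- cbind(c(0.7, 0.2, 0.1),   # topic 1 phrase probabilities
               c(0.1, 0.3, 0.6))   # topic 2 phrase probabilities
omega <- rbind(c(0.2, 0.8),        # one row of topic weights per document
               c(0.1, 0.9),
               c(0.5, 0.5),
               c(0.4, 0.6))

# Each topic (column of theta) and each document (row of omega) sums to one.
stopifnot(isTRUE(all.equal(colSums(theta), c(1, 1))),
          isTRUE(all.equal(rowSums(omega), rep(1, 4))))

# Order topics by decreasing usage, as described for ord = TRUE.
usage <- colSums(omega)
ord <- order(usage, decreasing = TRUE)
theta_ord <- theta[, ord, drop = FALSE]
omega_ord <- omega[, ord, drop = FALSE]
print(ord)
```

The same permutation is applied to the columns of both matrices so that topic k in theta still matches topic k in omega.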

Details

A latent topic model represents the i-th document's term-count vector $X_i$ (with $\sum_{j} x_{ij} = m_i$ total phrase count) as having been drawn from a mixture of K multinomials, each parameterized by topic-phrase probabilities $\theta_k$, such that $$X_i \sim MN(m_i, \omega_{i1} \theta_1 + ... + \omega_{iK}\theta_K).$$ We assign a K-dimensional Dirichlet(1/K) prior to each document's topic weights $[\omega_{i1}...\omega_{iK}]$, and the prior on each $\theta_k$ is Dirichlet with concentration $\alpha$. The topics function uses quasi-Newton accelerated EM, augmented with sequential quadratic programming for conditional $\Omega | \Theta$ updates, to obtain MAP estimates for the topic model parameters. We also provide Bayes factor estimation, from marginal likelihood calculations based on a Laplace approximation around the converged MAP parameter estimates. If input length(K)>1, these Bayes factors are used for model selection. Full details are in Taddy (2011).
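
To make the mixture concrete, the sketch below evaluates the multinomial log-likelihood implied by the model for given omega and theta. It is base R with hand-made toy values (all names and numbers are illustrative assumptions). It omits the Dirichlet prior terms and the multinomial coefficient, so it is only the likelihood part of the posterior up to a constant, not the objective that topics itself maximizes.

```r
# Toy parameters: p = 3 phrases, K = 2 topics, n = 2 documents (made up).
theta <- cbind(c(0.7, 0.2, 0.1),
               c(0.1, 0.3, 0.6))           # p x K topic-phrase probabilities
omega <- rbind(c(0.9, 0.1),
               c(0.2, 0.8))                # n x K document-topic weights
X <- rbind(c(5, 2, 1),
           c(0, 3, 7))                     # n x p counts

# Row i of Q is omega_i1*theta_1 + ... + omega_iK*theta_K,
# the phrase distribution for document i.
Q <- omega %*% t(theta)
stopifnot(isTRUE(all.equal(rowSums(Q), rep(1, nrow(Q)))))

# Multinomial log-likelihood up to an additive constant:
# sum over nonzero cells of x_ij * log(q_ij).
nz <- X > 0
loglik <- sum(X[nz] * log(Q[nz]))
print(loglik)
```

Restricting the sum to nonzero counts mirrors how sparse count matrices are usually handled, since zero cells contribute nothing to this term.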

References

Taddy (2011), Estimation of Topic Models.

See Also

plot.topics, summary.topics, predict.topics, wsjIBM, congress109, we8there

Examples

## see wsjIBM, congress109, and we8there for data examples
library(textir)  # provides topics() and rdir()

## Simulation Parameters
K <- 10
n <- 150
p <- 200
omega <- t(rdir(n, rep(1/K,K)))
theta <- rdir(K, rep(1/p,p))

## Simulated counts
Q <- omega%*%t(theta)
counts <- matrix(ncol=p, nrow=n)
totals <- rpois(n, 250)
for(i in 1:n){ counts[i,] <- rmultinom(1, size=totals[i], prob=Q[i,]) }

## Bayes Factor model selection (should choose K or nearby)
simselect <- topics(counts, K=5:15, tol=.01, verb=1) 
print(simselect$K)

## MAP fit for given K=10
simfit <- topics(counts, K=K, verb=2)

## Adjust for label switching and plot the fit (color by topic)
toplab <- rep(0,K)
for(k in 1:K){ toplab[k] <- which.min(colSums(abs(simfit$omega-omega[,k]))) }
par(mfrow=c(1,2))
tpxcols <- matrix(rainbow(K), ncol=nrow(theta), byrow=TRUE)
plot(theta,simfit$theta[,toplab], ylab="fitted values", pch=21, bg=tpxcols)
plot(omega,simfit$omega[,toplab], ylab="fitted values", pch=21, bg=tpxcols)
title("True vs Fitted Values (color by topic)", outer=TRUE, line=-2)

## The S3 method plot functions
par(mfrow=c(1,2))
plot(simfit, lgd.K=2)
plot(simfit, type="resid")
