topics: Estimation for Topic Models

Description

MAP estimation of topic models.

Usage

topics(counts, K, shape=NULL, initopics=NULL, tol=0.1, bf=FALSE, kill=2, ord=TRUE, verb=1, ...)

Arguments

counts

A matrix of multinomial response counts in ncol(counts) phrases/categories for nrow(counts) documents/observations. Can be either a simple matrix or a simple_triplet_matrix.

K

The number of latent topics. If length(K)>1, topics will find the Bayes factor (vs a null single topic model) for each element and return parameter estimates for the highest probability K.

shape

Optional argument to specify the Dirichlet prior concentration parameter as shape for topic-phrase probabilities. Defaults to 1/(K*ncol(counts)). For fixed single K, this can also be a ncol(counts)

initopics

Optional start-location for $[\theta_1 ... \theta_K]$, the topic-phrase probabilities. Dimensions must accord with the smallest element of K. If NULL, the initial estimates are built by incrementally adding to

tol

Convergence tolerance: optimization stops, conditional on some extra checks, when the posterior increase over a full parameter set update is less than tol.

bf

An indicator for whether or not to calculate the Bayes factor for univariate K. If length(K)>1, this is ignored and Bayes factors are always calculated.

kill

For choosing from multiple K numbers of topics (evaluated in increasing order), the search will stop after kill consecutive drops in the corresponding Bayes factor. Specify kill=0 if you want Bayes factors for

ord

If TRUE, the returned topics (columns of theta) will be ordered by decreasing usage (i.e., by decreasing colSums(omega)).

verb

A switch for controlling printed output. verb > 0 will print something, with the level of detail increasing with verb.

...

Additional arguments to the undocumented internal tpx* functions.
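
As an illustration of the ord argument, the reordering by decreasing usage can be sketched in base R. This is only a sketch of the idea, not the package's internal code; the omega and theta matrices below are hypothetical values made up for the example.

```r
## Hypothetical fitted values (illustration only): 4 documents, 3 phrases, K = 3 topics
omega <- matrix(c(0.1, 0.7, 0.2,
                  0.2, 0.6, 0.2,
                  0.1, 0.4, 0.5,
                  0.3, 0.5, 0.2), nrow = 4, byrow = TRUE)   # rows sum to 1
theta <- matrix(c(0.5, 0.1, 0.2,
                  0.3, 0.2, 0.2,
                  0.2, 0.7, 0.6), nrow = 3, byrow = TRUE)   # columns sum to 1

usage <- colSums(omega)                    # total weight placed on each topic
ord <- order(usage, decreasing = TRUE)     # most-used topic first

## Apply the same column permutation to both matrices so topics stay aligned
omega_sorted <- omega[, ord, drop = FALSE]
theta_sorted <- theta[, ord, drop = FALSE]
```

Permuting theta and omega with the same index keeps column k of both matrices referring to the same topic.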

Value

A topics object (a list) with entries

K

The number of latent topics estimated. If input length(K)>1, on output this is a single value corresponding to the model with the highest Bayes factor.

theta

The ncol(counts) by K matrix of estimated topic-phrase probabilities.

omega

The nrow(counts) by K matrix of estimated document-topic weights.

BF

The log Bayes factor for each number of topics in the input K, against a null single topic model.

D

Residual dispersion: for each element of K, estimated dispersion parameter (which should be near one for the multinomial), degrees of freedom, and p-value for a test of whether the true dispersion is $>1$.

X

The input count matrix, in simple_triplet_matrix format.

Details

A latent topic model represents the term-count vector $X_i$ of the i'th document (with total phrase count $\sum_{j} x_{ij} = m_i$) as having been drawn from a mixture of K multinomials, each parameterized by topic-phrase probabilities $\theta_k$, such that $$X_i \sim MN(m_i, \omega_{i1} \theta_1 + ... + \omega_{iK}\theta_K).$$ We assign a K-dimensional Dirichlet(1/K) prior to each document's topic weights $[\omega_{i1}...\omega_{iK}]$, and the prior on each $\theta_k$ is Dirichlet with concentration $\alpha$.

The topics function uses quasi-Newton accelerated EM, augmented with sequential quadratic programming for conditional $\Omega | \Theta$ updates, to obtain MAP estimates for the topic model parameters. We also provide Bayes factor estimation, from marginal likelihood calculations based on a Laplace approximation around the converged MAP parameter estimates. If input length(K)>1, these Bayes factors are used for model selection. Full details are in Taddy (2011).
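
The sampling model above can be sketched in base R, without the package. The helper rdirichlet1 is a hypothetical name introduced for this illustration, and the Dirichlet concentrations are set to 1 only for numerical convenience (the text uses 1/K for the topic weights). The sketch simulates counts and evaluates the multinomial log likelihood at the true parameters; it does not perform the MAP estimation or the Laplace approximation.

```r
set.seed(1)
K <- 3; p <- 6; n <- 4    # topics, phrases, documents (arbitrary small sizes)

## Hypothetical helper: one Dirichlet draw via normalized gamma variates
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

## theta: p by K, each column is a topic's phrase probabilities
theta <- sapply(1:K, function(k) rdirichlet1(rep(1, p)))
## omega: n by K, each row is a document's topic weights
omega <- t(sapply(1:n, function(i) rdirichlet1(rep(1, K))))

## Row i of Q is omega_i1 * theta_1 + ... + omega_iK * theta_K
Q <- omega %*% t(theta)

## Document totals m_i, then X_i ~ MN(m_i, Q[i, ])
m <- rpois(n, 50)
X <- t(sapply(1:n, function(i) rmultinom(1, size = m[i], prob = Q[i, ])))

## Multinomial log likelihood of the simulated counts at the true parameters
loglik <- sum(sapply(1:n, function(i)
  dmultinom(X[i, ], size = m[i], prob = Q[i, ], log = TRUE)))
```

Because each row of omega and each column of theta sums to one, every row of Q is itself a valid probability vector over the p phrases.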

References

Taddy (2011), Estimation and Selection for Topic Models. http://arxiv.org/abs/1109.4518

Examples


## see wsjibm, congress109, and we8there for data examples

## Simulation Parameters
K <- 10
n <- 100
p <- 100
omega <- t(rdir(n, rep(1/K,K)))
theta <- rdir(K, rep(1/p,p))

## Simulated counts
Q <- omega%*%t(theta)
counts <- matrix(ncol=p, nrow=n)
totals <- rpois(n, 100)
for(i in 1:n){ counts[i,] <- rmultinom(1, size=totals[i], prob=Q[i,]) }

## Bayes Factor model selection (should choose K or nearby)
summary(simselect <- topics(counts, K=K+c(-5:5)), nwrd=0)

## MAP fit for given K
summary( simfit <- topics(counts,  K=K, verb=2), nwrd=0 )

## Adjust for label switching and plot the fit (color by topic)
toplab <- rep(0,K)
for(k in 1:K){ toplab[k] <- which.min(colSums(abs(simfit$theta-theta[,k]))) }
par(mfrow=c(1,2))
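## One color per topic: each row repeats the K colors, so every column (topic) gets a single color.
## The same matrix serves both plots because n == p in this simulation.
tpxcols <- matrix(rainbow(K), nrow=nrow(theta), ncol=ncol(theta), byrow=TRUE)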
plot(theta,simfit$theta[,toplab], ylab="fitted values", pch=21, bg=tpxcols)
plot(omega,simfit$omega[,toplab], ylab="fitted values", pch=21, bg=tpxcols)
title("True vs Fitted Values (color by topic)", outer=TRUE, line=-2)

## The S3 method plot functions
par(mfrow=c(1,2))
plot(simfit, lgd.K=2)
plot(simfit, type="resid")
