Estimates a graph of topic correlations using either a simple thresholding
measure or more sophisticated tests from the huge package.
topicCorr(model, method = c("simple", "huge"), cutoff = 0.01, verbose = TRUE)
An STM object for which you want to estimate correlations between topics.
Method for estimating the graph. "simple" thresholds the correlations.
"huge" uses the semiparametric procedure in the package huge. See details below.
When using the simple method, this is the cutoff below which correlations are truncated to zero.
A logical which indicates whether information should be
printed to the screen when running the "huge" method.
K by K adjacency matrix where an edge represents positive correlation selected by the model.
K by K correlation matrix. It takes values of zero where the correlation is either negative or the edge is unselected by the model selection procedure.
K by K correlation matrix element-wise multiplied by the adjacency matrix. Note that this will contain significant negative correlations as well as positive correlations.
We offer two estimation procedures for producing correlation graphs. The
results of either method can be plotted.
The first method is conceptually simpler: it applies a threshold to the
estimated marginal topic proportion correlation matrix, and the threshold
must be specified by the user. The second method draws on recent
literature on undirected graphical model estimation and is automatically tuned.
The "simple" method calculates the correlation of the MAP estimates
for the topic proportions \(\theta\), which yields the marginal correlation
of the mode of the variational distribution. We then set to 0 those
edges where the correlation falls below the threshold.
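The thresholding step can be sketched in base R. This is an illustration under stated assumptions, not the package's own code: the matrix `theta` is simulated here as a stand-in for the topic proportions of a fitted model, and the names `cormat`, `adj` and `poscor` are chosen for this example only.

```r
# Illustrative sketch: simulated data stands in for a fitted model's theta.
set.seed(1)
n_docs <- 200   # hypothetical number of documents
K <- 5          # hypothetical number of topics

# Draw Gaussian scores and map each row to the simplex (rows sum to 1).
eta <- matrix(rnorm(n_docs * K), n_docs, K)
eta[, 2] <- eta[, 1] + 0.5 * eta[, 2]   # tie topic 2 to topic 1
theta <- exp(eta) / rowSums(exp(eta))

cutoff <- 0.01
cormat <- cor(theta)               # marginal correlations of topic proportions
adj    <- (cormat > cutoff) * 1    # 1 where the correlation exceeds the cutoff
poscor <- cormat * adj             # correlations at or below the cutoff become 0

print(round(poscor, 2))
```

Raising `cutoff` removes weaker edges, which is why this method needs a human-chosen threshold.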
An alternative strategy is to treat the problem as the recovery of edges in a high-dimensional undirected graphical model. In these settings we assume that observations come from a multivariate normal distribution with a sparse precision matrix. The goal is to infer which elements of the precision matrix are non-zero, since these correspond to edges in a graph. Meinshausen and Buhlmann (2006) showed that, using sparse regression methods like the LASSO, it is possible to consistently identify edges even in very high-dimensional settings.
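As a sketch of the neighborhood selection idea, in notation introduced here for illustration: let \(X\) be the \(n \times K\) data matrix, \(X_j\) its \(j\)-th column, \(X_{-j}\) the remaining columns, and \(\lambda\) the \(L_1\) penalty. Each variable is regressed on all the others:

```latex
\[
\hat{\beta}^{(j)} = \arg\min_{\beta \in \mathbb{R}^{K-1}}
  \frac{1}{n} \left\lVert X_j - X_{-j}\,\beta \right\rVert_2^2
  + \lambda \left\lVert \beta \right\rVert_1 ,
  \qquad j = 1, \dots, K .
\]
```

An edge between \(j\) and \(k\) is included when either \(\hat{\beta}^{(j)}_k\) or \(\hat{\beta}^{(k)}_j\) is non-zero (the "or" rule), or only when both are non-zero (the "and" rule). Under the Gaussian assumption the population regression coefficient of \(X_j\) on \(X_k\) is proportional to the corresponding entry of the precision matrix, which is why non-zero coefficients indicate edges.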
Selecting the option "huge" uses the huge package by Zhao and
Liu to estimate the graph. We use a nonparanormal transformation of the
topic proportions (\(\theta\)), which uses semiparametric Gaussian copulas
to marginally transform the data. This weakens the Gaussian assumption of
the subsequent procedure. We then estimate the graph using the Meinshausen
and Buhlmann procedure. Model selection for the scale of the \(L_1\)
penalty is performed using the rotation information criterion (RIC), which
estimates the optimal degree of regularization by random rotations. Zhao
and Liu (2012) note that this selection approach has strong empirical
performance but can under-select edges. We choose this
metric as the default approach to model selection to reflect social
scientists' historically greater concern for false positive rates as opposed
to false negative rates.
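The three steps above (nonparanormal transform, Meinshausen and Buhlmann graph path, RIC selection) can be sketched directly with the huge package. This is a minimal illustration under assumptions, not necessarily what topicCorr does internally: `theta` is simulated, `nlambda = 30` is an arbitrary choice, and the final step that keeps only positively correlated edges is one plausible way to combine the selected graph with the correlations.

```r
# Runs only if the huge package is installed.
if (requireNamespace("huge", quietly = TRUE)) {
  set.seed(1)
  n_docs <- 300   # hypothetical number of documents
  K <- 10         # hypothetical number of topics

  # Simulated topic proportions: each row lies on the simplex.
  eta <- matrix(rnorm(n_docs * K), n_docs, K)
  theta <- exp(eta) / rowSums(exp(eta))

  # 1. Nonparanormal (semiparametric Gaussian copula) transform of the margins.
  theta_npn <- huge::huge.npn(theta, verbose = FALSE)

  # 2. Graph path via Meinshausen-Buhlmann neighborhood selection.
  path <- huge::huge(theta_npn, method = "mb", nlambda = 30, verbose = FALSE)

  # 3. Choose the penalty with the rotation information criterion.
  sel <- huge::huge.select(path, criterion = "ric", verbose = FALSE)

  graph  <- as.matrix(sel$refit)            # K x K 0/1 adjacency of selected edges
  cormat <- cor(theta)
  poscor <- graph * (cormat > 0) * cormat   # keep selected edges with positive correlation
}
```

No cutoff is supplied anywhere in this pipeline; the amount of sparsity is determined by the selection step.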
We note that in models with low numbers of topics the simple procedure and the more complex procedure will often yield identical results. However, the more complex procedure scales gracefully to models with hundreds or even thousands of topics, which are precisely the cases where higher-level structure like a correlation graph would be most useful.
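A short usage sketch follows. It assumes the stm package is installed and uses `gadarianFit`, which we take to be the small fitted example model distributed with stm; substitute any fitted STM object. Plotting is assumed to need the igraph package, so that step is guarded.

```r
# Runs only if stm is installed; gadarianFit is assumed to ship with stm.
if (requireNamespace("stm", quietly = TRUE)) {
  library(stm)

  # Simple method with the default cutoff from the usage line above.
  corr_simple <- topicCorr(gadarianFit, method = "simple", cutoff = 0.01)

  # Inspect the components of the returned object.
  str(corr_simple, max.level = 1)

  # Draw the correlation graph (assumes igraph is available for plotting).
  if (requireNamespace("igraph", quietly = TRUE)) {
    plot(corr_simple)
  }
}
```

Switching to `method = "huge"` uses the same call; the `cutoff` argument applies only to the simple method.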
Lucas, Christopher, Richard A. Nielsen, Margaret E. Roberts, Brandon M. Stewart, Alex Storer, and Dustin Tingley. "Computer-Assisted Text Analysis for Comparative Politics." Political Analysis (2015).
T. Zhao and H. Liu. The huge Package for High-dimensional Undirected Graph Estimation in R. Journal of Machine Learning Research, 2012.
H. Liu, F. Han, M. Yuan, J. Lafferty and L. Wasserman. High Dimensional Semiparametric Gaussian Copula Graphical Models. The Annals of Statistics, 2012.
N. Meinshausen and P. Buhlmann. High-dimensional Graphs and Variable Selection with the Lasso. The Annals of Statistics, 2006.