cellTree (version 1.2.2)

compute.lda: LDA model inference

Description

This function fits a Latent Dirichlet Allocation (LDA) model to single-cell RNA-seq data.

Usage

compute.lda(data, method = "maptpx",
  k.topics = if (method == "maptpx") 2:15 else 4,
  log.scale = TRUE, sd.filter = 0.5,
  tot.iter = if (method == "Gibbs") 200 else 1e+06,
  tol = if (method == "maptpx") 0.05 else 10^-5)

Arguments

data
A matrix of (non-negative) RNA-seq expression levels, where each row is a gene and each column is a sequenced cell.
method
LDA inference method to use. Can be any unique prefix of `maptpx', `Gibbs' or `VEM' (defaults to `maptpx').
k.topics
Integer (optional). Number of topics to fit in the model. If method is `maptpx', k.topics can be a vector of candidate topic numbers, and the best model (evaluated on Bayes factor vs. a null single-topic model) will be returned.
log.scale
Boolean (optional). Whether the data should be log-scaled.
sd.filter
Numeric or FALSE (optional). Standard-deviation threshold below which genes should be removed from the data (no filtering if set to FALSE).
tot.iter, tol
Numeric parameters (optional) forwarded to the chosen LDA inference method's control class.
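
As a hedged sketch of how the optional arguments combine (the specific values below are illustrative only, not recommendations, and assume the HSMM_expr_matrix data used in the Examples section is already loaded):

```r
# Illustrative call: disable the standard-deviation gene filter,
# keep log-scaling, and use an arbitrary (tighter) tolerance for
# the `maptpx' method over a range of candidate topic numbers.
lda.results <- compute.lda(HSMM_expr_matrix,
                           method = "maptpx",
                           k.topics = 2:10,
                           log.scale = TRUE,
                           sd.filter = FALSE,  # no gene filtering
                           tol = 0.01)         # illustrative value
```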

Value

An LDA model fitted for data, of class LDA-class (for methods `Gibbs' or `VEM') or topics (for `maptpx').

Details

Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups (topics) that explain why some parts of the data are similar [Blei, 2003]. Each topic is modelled as a (Dirichlet) distribution over observations, and each set of observations is also modelled as a (Dirichlet) distribution over topics. In lieu of the traditional NLP context of word occurrence counts in documents, our model uses RNA-seq observation counts in single cells. Three separate LDA inference methods can be used at the moment:
  • Gibbs uses the Collapsed Gibbs Sampling method (implemented by Xuan-Hieu Phan and co-authors in the topicmodels package [Phan, 2008]) to infer the parameters of the Dirichlet distributions for a given number of topics. It gives high accuracy but is very time-consuming to run on large numbers of cells and genes.
  • VEM uses Variational Expectation-Maximisation (as described in [Hoffman, 2010]). This method tends to converge faster than Gibbs collapsed sampling, albeit with lower accuracy.
  • maptpx uses the method described in [Taddy, 2011] and implemented in the maptpx package to estimate the parameters of the topic model for an increasing number of topics (using previous estimates as a starting point for larger topic numbers). The best model (i.e. number of topics) is selected based on its Bayes factor over the null model. Although potentially less accurate, this method provides the fastest way to train and select from a large number of models when the number of topics is not known in advance.

When in doubt, the function can be run with its default parameter values and should produce a usable LDA model in reasonable time (using the `maptpx' inference method). The model can then be further refined for a specific number of topics with one of the slower methods. While larger models (with a large number of topics) might fit the data well, there is a high risk of overfitting, and it is recommended to use the smallest number of topics that still explains the observations well. Anecdotally, a typical number of topics for cell differentiation data (from pluripotent to fully specialised) would seem to be around 4 or 5.
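
The two-stage workflow suggested above can be sketched as follows (a hypothetical sequence, assuming the example data is loaded; the topic-number ranges are illustrative, and the `$K` field used to read back the selected topic count is assumed from maptpx's `topics' objects):

```r
# 1. Quick first pass with the default `maptpx' method over a range of
#    candidate topic numbers; the best model by Bayes factor is returned.
lda.coarse <- compute.lda(HSMM_expr_matrix, method = "maptpx",
                          k.topics = 2:15)

# 2. Refine with a slower, potentially more accurate method at the
#    selected model size. (Assumes the maptpx fit exposes its chosen
#    topic count as lda.coarse$K.)
lda.refined <- compute.lda(HSMM_expr_matrix, method = "Gibbs",
                           k.topics = lda.coarse$K)
```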

References

  • Blei, Ng, and Jordan. ``Latent Dirichlet allocation.'' Journal of Machine Learning Research 3 (2003): 993-1022.
  • Hoffman, Blei and Bach (2010). ``Online Learning for Latent Dirichlet Allocation.'' In J Lafferty, CKI Williams, J Shawe-Taylor, R Zemel, A Culotta (eds.), Advances in Neural Information Processing Systems 23, pp. 856-864. MIT Press, Cambridge, MA.
  • Hornik and Grün. ``topicmodels: An R package for fitting topic models.'' Journal of Statistical Software 40.13 (2011): 1-30.
  • Phan, Nguyen and Horiguchi. ``Learning to classify short and sparse text & web with hidden topics from large-scale data collections.'' Proceedings of the 17th international conference on World Wide Web. ACM, 2008.
  • Taddy. ``On estimation and selection for topic models.'' arXiv preprint arXiv:1109.4518 (2011).

See Also

LDA, topics, LDA_Gibbscontrol-class, CTM_VEMcontrol-class

Examples

# Load cellTree and the skeletal myoblast RNA-Seq data
# from the HSMMSingleCell package:
library(cellTree)
library(HSMMSingleCell)
data(HSMM_expr_matrix)

# Run LDA inference using the 'maptpx' method with k = 4 topics:
lda.results <- compute.lda(HSMM_expr_matrix, k.topics = 4, method = "maptpx")

# Run LDA inference using the 'maptpx' method over k = 3 to 6 topics:
lda.results <- compute.lda(HSMM_expr_matrix, k.topics = 3:6, method = "maptpx")

# Run LDA inference using the 'Gibbs' (collapsed sampling) method with k = 4 topics:
lda.results <- compute.lda(HSMM_expr_matrix, k.topics = 4, method = "Gibbs")