createJSON: Create the JSON object to read into the JavaScript visualization

Description

This function creates the JSON object that feeds the visualization template. For a more detailed overview, see vignette("details", package = "LDAvis").

Usage

createJSON(phi = matrix(), theta = matrix(), doc.length = integer(), vocab = character(), term.frequency = integer(), R = 30, lambda.step = 0.01, mds.method = jsPCA, cluster, plot.opts = list(xlab = "PC1", ylab = "PC2"), ...)

Arguments

phi

matrix, with each row containing the distribution over terms for a topic, with as many rows as there are topics in the model, and as many columns as there are terms in the vocabulary.

theta

matrix, with each row containing the probability distribution over topics for a document, with as many rows as there are documents in the corpus, and as many columns as there are topics in the model.

doc.length

integer vector containing the number of tokens in each document of the corpus.

vocab

character vector of the terms in the vocabulary (in the same order as the columns of phi). Each term must have at least one character.

term.frequency

integer vector containing the frequency of each term in the vocabulary.

R

integer, the number of terms to display in the bar charts of the interactive visualization. Default is 30. Recommended to be roughly between 10 and 50.

lambda.step

a value between 0 and 1. Determines the interstep distance in the grid of lambda values over which to iterate when computing relevance. Default is 0.01. Recommended to be between 0.01 and 0.1.

mds.method

a function that takes phi as an input and outputs a K by 2 data.frame (or matrix). The output approximates the distance between topics. See jsPCA for details on the default method.

cluster

a cluster object created from the parallel package. If supplied, computations are performed using parLapply instead of lapply.

plot.opts

a named list used to customize various plot elements. By default, the x and y axes are labeled "PC1" and "PC2" (principal components 1 and 2), since jsPCA is the default scaling method.

...

not currently used.
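The shape constraints described above can be checked before calling createJSON. The following is a minimal sketch in base R, not part of LDAvis: the toy inputs are made up for illustration, and `check_inputs` and `mds_classical` are hypothetical helper names. The second helper shows one possible function that satisfies the `mds.method` contract (takes phi, returns a K by 2 matrix), using classical multidimensional scaling on Euclidean distances between topics rather than the package default.

```r
# Toy inputs (made up for illustration): K = 4 topics, W = 6 terms, D = 5 documents.
set.seed(1)
K <- 4; W <- 6; D <- 5
phi   <- matrix(runif(K * W), nrow = K); phi   <- phi / rowSums(phi)
theta <- matrix(runif(D * K), nrow = D); theta <- theta / rowSums(theta)
doc.length     <- c(10L, 20L, 15L, 30L, 25L)
vocab          <- c("alpha", "beta", "gamma", "delta", "epsilon", "zeta")
term.frequency <- c(30L, 25L, 20L, 10L, 10L, 5L)

# Hypothetical helper: verifies the constraints stated in the argument descriptions.
check_inputs <- function(phi, theta, doc.length, vocab, term.frequency, tol = 1e-8) {
  stopifnot(
    ncol(phi) == length(vocab),            # one column of phi per vocabulary term
    ncol(phi) == length(term.frequency),   # one frequency per vocabulary term
    nrow(phi) == ncol(theta),              # same number of topics in phi and theta
    nrow(theta) == length(doc.length),     # one length per document
    all(nchar(vocab) > 0),                 # each term has at least one character
    all(abs(rowSums(phi) - 1) < tol),      # rows of phi are distributions
    all(abs(rowSums(theta) - 1) < tol)     # rows of theta are distributions
  )
  invisible(TRUE)
}

inputs_ok <- check_inputs(phi, theta, doc.length, vocab, term.frequency)

# Hypothetical mds.method: classical MDS on Euclidean distances between rows of phi.
mds_classical <- function(phi) cmdscale(dist(phi), k = 2)
coords <- mds_classical(phi)
dim(coords)  # K rows, 2 columns
```

Under these assumptions, a function such as `mds_classical` could be passed as `mds.method = mds_classical`; the axis labels in `plot.opts` would then no longer refer to principal components and may be worth changing.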

Value

A string containing JSON content which can be written to a file or fed into serVis for easy viewing/sharing. One element of this string is the new ordering of the topics.

Details

The function first computes the topic frequencies (across the whole corpus), and then it reorders the topics in decreasing order of frequency. The main computation is to loop through the topics and through the grid of lambda values (determined by lambda.step) to compute the R most relevant terms for each topic and value of lambda.
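The steps described above can be sketched in base R as follows. This is an illustrative approximation rather than the package's actual implementation: the toy inputs are made up, the variable and function names are hypothetical, and the marginal term probability is derived from the model here (an assumption made to keep the example short). Relevance follows the definition in Sievert and Shirley (2014): lambda * log(phi) + (1 - lambda) * log(phi / p(w)).

```r
# Toy inputs (made up for illustration): K = 3 topics, W = 8 terms, D = 6 documents.
set.seed(42)
K <- 3; W <- 8; D <- 6
phi   <- matrix(runif(K * W), nrow = K); phi   <- phi / rowSums(phi)
theta <- matrix(runif(D * K), nrow = D); theta <- theta / rowSums(theta)
doc.length  <- c(12L, 40L, 25L, 8L, 33L, 19L)
R           <- 3      # number of terms to keep per topic
lambda.step <- 0.25   # coarse grid, for readability only

# 1. Topic frequencies across the corpus: expected token count per topic.
#    theta * doc.length scales row d of theta by doc.length[d].
topic.frequency <- colSums(theta * doc.length)

# 2. Reorder topics in decreasing order of frequency.
topic.order     <- order(topic.frequency, decreasing = TRUE)
phi             <- phi[topic.order, , drop = FALSE]
topic.frequency <- topic.frequency[topic.order]

# Marginal term probability p(w), derived from the model (assumption of this sketch).
term.prob <- colSums(phi * topic.frequency) / sum(topic.frequency)

# 3. For each lambda on the grid, find the R most relevant terms for each topic.
lift       <- sweep(phi, 2, term.prob, "/")   # phi[k, w] / p(w)
lambda.seq <- seq(0, 1, by = lambda.step)

top_terms <- function(lambda) {
  relevance <- lambda * log(phi) + (1 - lambda) * log(lift)
  # one row per topic, holding the column indices of its R top terms
  t(apply(relevance, 1, function(r) order(r, decreasing = TRUE)[1:R]))
}

top.by.lambda <- lapply(lambda.seq, top_terms)
names(top.by.lambda) <- lambda.seq
```

Each element of `top.by.lambda` is a K by R matrix of term indices. With lambda = 1 the ranking reduces to ordering terms by phi alone, and with lambda = 0 it reduces to ordering by lift. If a `cluster` is supplied, a loop of this kind is the sort of work that `parLapply` could distribute in place of `lapply`.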

References

Sievert, C. and Shirley, K. (2014) LDAvis: A Method for Visualizing and Interpreting Topics, ACL Workshop on Interactive Language Learning, Visualization, and Interfaces. http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf

Examples

## Not run: 
# data(TwentyNewsgroups, package="LDAvis")
# # create the json object, start a local file server, open in default browser
# json <- with(TwentyNewsgroups,
#              createJSON(phi, theta, doc.length, vocab, term.frequency))
# serVis(json) # press ESC or Ctrl-C to kill
# 
# # createJSON() reorders topics in decreasing order of topic frequency
# RJSONIO::fromJSON(json)$topic.order
# 
# # You may want to just write the JSON and other dependency files
# # to a folder named TwentyNewsgroups under the working directory
# serVis(json, out.dir = 'TwentyNewsgroups', open.browser = FALSE)
# # then you could use a server of your choice; for example,
# # open your terminal, type `cd TwentyNewsgroups && python3 -m http.server`
# # then open http://localhost:8000 in your web browser
# 
# # A different data set: the Jeopardy Questions+Answers data:
# # Install LDAvisData (the associated data package) if not already installed:
# # devtools::install_github("cpsievert/LDAvisData")
# library(LDAvisData)
# data(Jeopardy, package="LDAvisData")
# json <- with(Jeopardy,
#              createJSON(phi, theta, doc.length, vocab, term.frequency))
# serVis(json) # Check out Topic 22 (bodies of water!)
# 
# # If you have a GitHub account, you can even publish as a gist
# # which allows you to easily share with others!
# serVis(json, as.gist = TRUE)
# 
# # Run createJSON on a cluster of machines to speed it up
# system.time(
# json <- with(TwentyNewsgroups,
#              createJSON(phi, theta, doc.length, vocab, term.frequency))
# )
# #   user  system elapsed
# # 14.415   0.800  15.066
# library("parallel")
# cl <- makeCluster(detectCores() - 1)
# cl # socket cluster with 3 nodes on host 'localhost'
# system.time(
#  json <- with(TwentyNewsgroups,
#    createJSON(phi, theta, doc.length, vocab, term.frequency,
#      cluster = cl))
# )
# #   user  system elapsed
# #  2.006   0.361   8.822
# 
# # another scaling method (svd + tsne)
# library("tsne")
# svd_tsne <- function(x) tsne(svd(x)$u)
# json <- with(TwentyNewsgroups,
#              createJSON(phi, theta, doc.length, vocab, term.frequency,
#                         mds.method = svd_tsne,
#                         plot.opts = list(xlab="", ylab="")
#                         )
#              )
# serVis(json) # Results in a different topic layout in the left panel
# 
# ## End(Not run)
