Estimates the parameters of the Deep Mixture of Unigrams by Gibbs sampling and allocates the documents to clusters.
deep_mou_gibbs(x, k, g, n_it = 500, seed_choice = 1, burn_in = 200)
x: Document-term matrix describing the frequency of the terms that occur in a collection of documents. Rows correspond to documents in the collection and columns correspond to terms.
k: Number of clusters/groups at the top layer.
g: Number of clusters at the bottom layer.
n_it: Number of Gibbs steps.
seed_choice: Seed to set for reproducible results.
burn_in: Number of initial Gibbs samples to be discarded and not included in the computation of the final estimates.
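As an illustration of the input layout described above (documents in rows, terms in columns, counts in the cells), the following base R sketch builds a small document-term matrix from three made-up sentences. The documents, the tokenization rule and all variable names are assumptions chosen for this example and are not part of the package.

```r
# Three toy documents (invented for illustration only)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog chased the cat",
          d3 = "a dog sat")

# Naive tokenization: lower-case, then split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Vocabulary: the sorted set of distinct terms
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

dtm  # rows are documents, columns are terms
```

A matrix laid out like `dtm` has the shape the function expects for `x`, although three documents are far too few for a meaningful fit.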
A list containing the following elements:
The data matrix.
the clustering labels.
the number of clusters at the top layer.
the number of clusters at the bottom layer.
the sample size.
the vocabulary size.
the allocation variables at the top layer.
the allocation variables at the bottom layer.
the estimates of the Alpha parameters.
the estimates of the Beta parameters.
the estimated probabilities of belonging to the k clusters at the top layer, conditional on the g clusters at the bottom layer.
the estimated probabilities of belonging to the g clusters at the bottom layer.
Starting from the data matrix x, the Deep Mixture of Unigrams is fitted and k clusters are obtained.
The parameters are estimated by Gibbs sampling.
The function first assigns initial values to all the parameters to be estimated. Then n_it samples of each parameter are drawn from its distribution conditional on all the other parameters. The final estimates are obtained by averaging the samples that remain after the first burn_in samples are discarded. Finally, clustering is performed by maximizing the posterior distribution of the latent variables.
For further details see the references.
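The two post-processing steps described above, averaging the draws kept after burn-in and allocating each document to its most probable cluster, can be sketched on toy data as follows. This is a generic illustration and not the package's internal code: the simulated draws, the posterior probability matrix and every variable name are assumptions made for the example.

```r
set.seed(1)
n_it    <- 500  # total number of Gibbs draws
burn_in <- 200  # initial draws to discard

# Simulated draws for two parameters with true values 0.3 and 0.7
draws <- matrix(
  rnorm(n_it * 2, mean = rep(c(0.3, 0.7), each = n_it), sd = 0.05),
  ncol = 2
)
# Shift the first burn_in draws to mimic a chain that has not yet converged
draws[seq_len(burn_in), ] <- draws[seq_len(burn_in), ] + 0.5

# Final estimates: average only the draws kept after burn-in
kept      <- draws[(burn_in + 1):n_it, , drop = FALSE]
estimates <- colMeans(kept)

# For comparison, averaging all draws is pulled towards the starting values
naive <- colMeans(draws)

# Allocation: each row holds made-up posterior cluster probabilities for one
# document; the label is the column with the largest probability
post   <- rbind(c(0.90, 0.10),
                c(0.20, 0.80),
                c(0.55, 0.45))
labels <- max.col(post, ties.method = "first")
```

Here `estimates` stays close to the simulated true values, while `naive` is visibly biased by the discarded transient, which is the reason for choosing `burn_in` large enough to cover the initial phase of the chain.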
Viroli C, Anderlucci L (2020). "Deep mixtures of Unigrams for uncovering topics in textual data." Statistics and Computing, pp. 1-18. doi:10.1007/s11222-020-09989-9.
# NOT RUN {
# Load the CNAE2 dataset
data("CNAE2")
# Perform parameter estimation and clustering (very few iterations are used for this example)
deep_CNAE2 = deep_mou_gibbs(x = CNAE2, k = 2, g = 2, n_it = 5, burn_in = 2)
# Show the cluster label assigned to each document
deep_CNAE2$clusters
# }
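Once cluster labels are available, a natural next step is to look at the cluster sizes and at which documents fall in each cluster. The sketch below uses a short, hypothetical label vector in place of `deep_CNAE2$clusters` so that it runs without fitting the model; the values are invented for illustration.

```r
# Hypothetical labels standing in for deep_CNAE2$clusters
labels <- c(1, 2, 2, 1, 2, 1, 1, 2, 2, 2)

# Number of documents per cluster
cluster_sizes <- table(labels)

# Share of documents per cluster
cluster_props <- prop.table(cluster_sizes)

# Indices of the documents assigned to each cluster
members <- split(seq_along(labels), labels)

cluster_sizes
cluster_props
members
```

With a fitted object, replacing `labels` by `deep_CNAE2$clusters` should give the same summaries for the real data.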