deep_mou_gibbs: Deep Mixture of Unigrams

Description

Estimates the parameters of the Deep Mixture of Unigrams by Gibbs sampling and allocates the documents to clusters.

Usage

deep_mou_gibbs(x, k, g, n_it = 500, seed_choice = 1, burn_in = 200)

Arguments

x

Document-term matrix giving the frequency of each term in a collection of documents. Rows correspond to documents in the collection and columns correspond to terms.

k

Number of clusters/groups at the top layer.

g

Number of clusters at the bottom layer.

n_it

Number of Gibbs steps.

seed_choice

Seed for the random number generator, set to make results reproducible.

burn_in

Number of initial Gibbs samples to be discarded and not included in the computation of final estimates.
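As an illustration of the layout expected for x, the sketch below builds a tiny document-term matrix in base R, with one row per document and one column per term. The three toy documents and the names docs, tokens, vocab and dtm are invented for this example; whether deep_mou_gibbs needs a particular matrix class is not stated here, so treat this only as a picture of the row/column convention.

```r
# Toy corpus (hypothetical data, for illustration only)
docs <- c(d1 = "tax report tax",
          d2 = "market report",
          d3 = "market market tax")

# Split each document into terms and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document;
# vapply() returns terms in rows, so transpose to get documents in rows
dtm <- t(vapply(tokens,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

dtm
```

Each row sum is the length of the corresponding document, and each column sum is the total frequency of a term in the collection.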

Value

A list containing the following elements:

The data matrix.

clusters

the clustering labels.

the number of clusters at the top layer.

the number of clusters at the bottom layer.

numobs

the sample size.

the vocabulary size.

the allocation variables at the top layer.

the allocation variables at the bottom layer.

Alpha

the estimates of Alpha parameters.

Beta

the estimates of the Beta parameters.

pi_hat

estimated probabilities of belonging to the k clusters at the top layer, conditional on the g clusters at the bottom layer.

pi_hat_2

estimated probabilities of belonging to the g clusters at the bottom layer.

Details

Starting from the data matrix x, the function fits the Deep Mixture of Unigrams and obtains k clusters. Parameters are estimated by Gibbs sampling: the function first assigns initial values to all the parameters to be estimated, then draws n_it samples, each parameter being sampled from its distribution conditional on all the other parameters. The final estimates are the averages of the samples that remain after the first burn_in samples are discarded. Finally, clustering is performed by maximizing the posterior distribution of the latent variables. For further details see the references.
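The last two steps described above (averaging the samples kept after burn-in, then allocating by maximum posterior probability) can be sketched in base R as follows. This is not the internal code of deep_mou_gibbs: the objects draws and post_prob are simulated stand-ins invented for the example, and a real sampler would produce them from the model's conditional distributions.

```r
set.seed(1)
n_it <- 500      # total number of Gibbs steps
burn_in <- 200   # initial draws to discard

# Stand-in for sampler output (hypothetical): one row per iteration,
# one column per parameter
draws <- matrix(rnorm(n_it * 3, mean = 2), nrow = n_it, ncol = 3)

# Final estimates: average only the draws kept after burn-in
kept <- draws[(burn_in + 1):n_it, , drop = FALSE]
estimates <- colMeans(kept)

# Stand-in posterior allocation probabilities (hypothetical):
# 4 documents in rows, 2 clusters in columns
post_prob <- matrix(c(0.9, 0.1,
                      0.2, 0.8,
                      0.6, 0.4,
                      0.3, 0.7), ncol = 2, byrow = TRUE)

# Assign each document to the cluster with the highest posterior probability
clusters <- apply(post_prob, 1, which.max)
```

With n_it = 500 and burn_in = 200, the estimates rest on 300 retained draws, which is why burn_in must be smaller than n_it.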

References

Viroli C, Anderlucci L (2020). "Deep mixtures of Unigrams for uncovering topics in textual data." Statistics and Computing, pp. 1-18. doi:10.1007/s11222-020-09989-9.

Examples

# NOT RUN {
# Load the CNAE2 dataset
data("CNAE2")

# Perform parameter estimation and clustering; very few iterations are used to keep this example fast
deep_CNAE2 <- deep_mou_gibbs(x = CNAE2, k = 2, g = 2, n_it = 5, burn_in = 2)

# Show the cluster labels assigned to the documents
deep_CNAE2$clusters
# }