generateSampleDataBin

Generates simulated clustered binary data together with the true cluster
labels. For each clustering variable, the probability of a '1' in each cluster
is drawn independently from a Beta(1, 5) distribution (mean 1/6), which
encourages sparse probabilities that vary across clusters. For noisy
variables, the probability of a '1' is also drawn from a Beta(1, 5)
distribution, but a single probability is shared by all observations
regardless of their cluster membership.
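
The generative process described above can be sketched in base R. This is a minimal illustration, not the package's source: the function name `simulate_binary_clusters`, the returned element names, and the choice to draw labels at random with probabilities `w` are all assumptions made for the example.

```r
# Hypothetical re-implementation for illustration only; not the VICatMix code.
simulate_binary_clusters <- function(n, K, w, p, Irrp) {
  stopifnot(length(w) == K, abs(sum(w) - 1) < 1e-8)

  # Cluster label for each observation, drawn with mixture weights w
  # (assumption: the real function may assign labels differently).
  labels <- sample.int(K, size = n, replace = TRUE, prob = w)

  # K x p matrix of P(variable j = 1 | cluster k), each entry from Beta(1, 5).
  probs <- matrix(rbeta(K * p, 1, 5), nrow = K, ncol = p)

  # One Beta(1, 5) probability per noisy variable, shared by every cluster.
  noise_probs <- rbeta(Irrp, 1, 5)

  # n x (p + Irrp) matrix of success probabilities, one row per observation;
  # the noisy variables occupy the final Irrp columns.
  prob_mat <- cbind(
    probs[labels, , drop = FALSE],
    matrix(noise_probs, nrow = n, ncol = Irrp, byrow = TRUE)
  )

  # Independent Bernoulli draws for every cell.
  data <- matrix(rbinom(length(prob_mat), size = 1, prob = prob_mat),
                 nrow = n, ncol = p + Irrp)

  list(data = data, labels = labels)
}

set.seed(1)
sim <- simulate_binary_clusters(n = 200, K = 3, w = c(0.2, 0.3, 0.5),
                                p = 10, Irrp = 4)
dim(sim$data)      # 200 rows, 10 + 4 = 14 columns
table(sim$labels)  # cluster sizes, roughly proportional to w
```

Because the noisy columns use the same probability for every row, their distribution carries no information about the labels, which is what makes them useful for testing variable selection.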

VICatMix implements a variational Bayesian finite mixture model for clustering categorical data, with optional variable selection and semi-supervised outcome guiding. It also offers model averaging over multiple initialisations, which reduces the effects of local optima and improves automatic estimation of the true number of clusters. For further details, see Rao and Kirk (2024) <doi:10.48550/arXiv.2406.16227>.

Jackie Rao

VICatMix

Variational Mixture Models for Clustering Categorical Data

Sara Wade

Colin Starr

John Maddock

Arguments

<dl><dt>n</dt>
<dd>Number of observations in dataset.</dd>
<dt>K</dt>
<dd>Number of clusters desired.</dd>
<dt>w</dt>
<dd>A vector of mixture weights (proportion of population in each
cluster).</dd>
<dt>p</dt>
<dd>Number of clustering variables/covariates in dataset.</dd>
<dt>Irrp</dt>
<dd>Number of irrelevant/noisy variables/covariates in dataset. Note
that these variables will be the final Irrp columns in the simulated
dataset. Total data dimension is p + Irrp.</dd>
<dt>yout</dt>
<dd>Logical, default FALSE. Whether to also generate a binary outcome
associated with the clustering.</dd></dl>
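
A call using the arguments listed above might look like the following sketch. It assumes the VICatMix package is installed and skips the call otherwise; the argument values are arbitrary illustrations, and the structure of the returned object is inspected with `str()` rather than assumed.

```r
# Illustrative argument values (arbitrary choices, not package defaults).
sim_args <- list(
  n = 500,                     # observations
  K = 4,                       # clusters
  w = c(0.1, 0.2, 0.3, 0.4),   # mixture weights, one per cluster
  p = 100,                     # clustering variables
  Irrp = 20,                   # noisy variables (final 20 columns)
  yout = FALSE                 # no binary outcome
)

# The weights should match K and sum to one.
stopifnot(length(sim_args$w) == sim_args$K,
          abs(sum(sim_args$w) - 1) < 1e-8)

# Only run the call if VICatMix is available.
if (requireNamespace("VICatMix", quietly = TRUE)) {
  set.seed(123)
  sim <- do.call(VICatMix::generateSampleDataBin, sim_args)
  str(sim, max.level = 1)  # inspect what the function returns
}
```

With these values the simulated dataset would have p + Irrp = 120 columns.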

