clusterMix: Cluster Observations Based on Indicator MCMC Draws

Description

clusterMix uses MCMC draws of component indicator variables from a mixture of normals model to cluster observations based on a similarity matrix.

Usage

clusterMix(zdraw, cutoff = 0.9, SILENT = FALSE)

Arguments

zdraw

R x nobs array of draws of indicators

cutoff

cutoff probability for similarity (default: 0.9)

SILENT

logical flag for silent operation (default: FALSE)

Value

clustera

indicator function for clustering based on method A (see Details)

clusterb

indicator function for clustering based on method B (see Details)

concept

normal mixture
clustering

Warning

This routine is a utility routine that does not check the input arguments for proper dimensions and type.
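
Because the routine does not validate its inputs, it may help to run a few checks before calling it. The following is a minimal sketch in base R; check_zdraw is a hypothetical helper that is not part of bayesm, and it assumes that indicators are positive integer component labels and that cutoff is a single probability.

```r
## Hypothetical helper (not part of bayesm): basic sanity checks on the
## arguments before they are passed to clusterMix.
## Assumes indicators are positive integer labels stored in an R x nobs matrix.
check_zdraw <- function(zdraw, cutoff = 0.9) {
  # zdraw must be a numeric matrix with at least one draw and one observation
  stopifnot(is.matrix(zdraw), is.numeric(zdraw))
  stopifnot(nrow(zdraw) >= 1, ncol(zdraw) >= 1)
  # no missing values; every entry is a whole number >= 1
  stopifnot(!anyNA(zdraw), all(zdraw == round(zdraw)), all(zdraw >= 1))
  # cutoff must be a single probability in [0, 1]
  stopifnot(is.numeric(cutoff), length(cutoff) == 1, cutoff >= 0, cutoff <= 1)
  invisible(TRUE)
}
```

A call such as check_zdraw(zdraw) stops with an error if any check fails and otherwise returns TRUE invisibly.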

Details

Define a similarity matrix, Sim, with Sim[i,j] = 1 if observations i and j are in the same component and 0 otherwise. Compute the posterior mean of Sim, E[Sim], by averaging over the indicator draws. Clustering is then achieved in two ways:

Method A: Find the indicator draw z whose similarity matrix Sim(z) minimizes loss(E[Sim] - Sim(z)), where loss is absolute deviation.

Method B: Define a similarity matrix by setting each element to 1 if the corresponding element of E[Sim] exceeds cutoff. Compute the clustering scheme associated with this "winsorized" similarity matrix.
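
The two methods can be illustrated on a toy set of indicator draws. This is a sketch in base R under stated assumptions, not the bayesm implementation: the helpers z_to_sim and sim_to_z are hypothetical names, and the rule used to turn the thresholded matrix into cluster labels (grouping observations with identical rows) is one simple choice that may differ from what clusterMix does internally.

```r
## Illustrative sketch only; z_to_sim and sim_to_z are hypothetical helpers
## and are not functions exported by bayesm.

## Similarity matrix for one indicator vector z: Sim[i,j] = 1 if z[i] == z[j]
z_to_sim <- function(z) {
  outer(z, z, "==") * 1
}

## Toy indicator draws: R = 4 draws (rows) for nobs = 5 observations (columns)
zdraw <- rbind(c(1, 1, 2, 2, 2),
               c(1, 1, 2, 2, 1),
               c(2, 2, 1, 1, 1),   # same partition as draw 1, labels switched
               c(1, 1, 2, 2, 2))

## Posterior mean of Sim: average the per-draw similarity matrices
sims  <- lapply(seq_len(nrow(zdraw)), function(r) z_to_sim(zdraw[r, ]))
Pmean <- Reduce(`+`, sims) / length(sims)

## Method A: pick the draw whose Sim is closest to Pmean in absolute deviation
loss     <- vapply(sims, function(S) sum(abs(Pmean - S)), numeric(1))
clustera <- zdraw[which.min(loss), ]

## Method B: threshold Pmean at the cutoff to get a 0/1 similarity matrix
cutoff <- 0.9
SimB   <- (Pmean > cutoff) * 1

## Assumed labeling rule: observations with identical rows of SimB
## are placed in the same cluster
sim_to_z <- function(S) {
  keys <- apply(S, 1, paste, collapse = "")
  match(keys, unique(keys))
}
clusterb <- sim_to_z(SimB)
```

Because similarity matrices depend only on which observations share a component, draws 1 and 3 contribute the same matrix even though their labels are switched. With these toy draws, method A returns the partition {1,2}, {3,4,5}, while method B with cutoff 0.9 places observation 5 in its own cluster, since it shares a component with observations 3 and 4 in only three of the four draws.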

References

For further discussion, see Chapter 3 of Bayesian Statistics and Marketing by Rossi, Allenby and McCulloch. http://faculty.chicagogsb.edu/peter.rossi/research/bsm.html

Examples

##
## rmixture, rnmixGibbs and clusterMix come from the bayesm package
library(bayesm)
if(nchar(Sys.getenv("LONG_TEST")) != 0) 
{
## simulate data from mixture of normals
n=500
pvec=c(.5,.5)
mu1=c(2,2)
mu2=c(-2,-2)
Sigma1=matrix(c(1,.5,.5,1),ncol=2)
Sigma2=matrix(c(1,.5,.5,1),ncol=2)
comps=NULL
comps[[1]]=list(mu1,backsolve(chol(Sigma1),diag(2)))
comps[[2]]=list(mu2,backsolve(chol(Sigma2),diag(2)))
dm=rmixture(n,pvec,comps)
## run MCMC on normal mixture
R=2000
Data=list(y=dm$x)
ncomp=2
Prior=list(ncomp=ncomp,a=c(rep(100,ncomp)))
Mcmc=list(R=R,keep=1)
out=rnmixGibbs(Data=Data,Prior=Prior,Mcmc=Mcmc)
begin=500
end=R
## find clusters
outclusterMix=clusterMix(out$zdraw[begin:end,])
##
## check on clustering versus "truth"
##  note: there could be switched labels
##
table(outclusterMix$clustera,dm$z)
table(outclusterMix$clusterb,dm$z)
}
##