Learn R Programming

NPflow (version 0.9.0)

DPMpost: Posterior estimation for Dirichlet process mixture of multivariate (potentially skew) distibutions models

Description

Partially collapse slice Gibbs sampling for Dirichlet process mixture of multivariate normal, skew normal or skew t distributions.

Usage

DPMpost(data, hyperG0, a = 1e-04, b = 1e-04, N, doPlot = TRUE,
  nbclust_init = 30, plotevery = floor(N/10), diagVar = TRUE,
  verbose = TRUE, distrib = c("gaussian", "skewnorm", "skewt"),
  ncores = 1, type_connec = "SOCK", informPrior = NULL, ...)

Arguments

data
data matrix d x n with d dimensions in rows and n observations in columns.
hyperG0
prior mixing distribution.
a
shape hyperparameter of the Gamma prior on the parameter of the Dirichlet Process.
b
scale hyperparameter of the Gamma prior on the parameter of the Dirichlet Process.
N
number of MCMC iterations.
doPlot
logical flag indicating wether to plot MCMC iteration or not. Default to TRUE.
nbclust_init
number of clusters at initialisation. Default to 30 (or less if there are less than 30 observations).
plotevery
an integer indicating the interval between plotted iterations when doPlot is TRUE.
diagVar
logical flag indicating wether the variance of each cluster is estimated as a diagonal matrix, or as a full matrix. Default is TRUE (diagonal variance).
verbose
logical flag indicating wether partition info is written in the console at each MCMC iteration.
distrib
the distribution used for the clustering. Current possibilities are "gaussian", "skewnorm" and "skewt".
ncores
number of cores to use.
type_connec
The type of connection between the processors. Supported cluster types are "SOCK", "FORK", "MPI", and "NWS". See also makeCluster.
informPrior
an optional informative prior such as the approximation computed by summary.DPMMclust.
...
additional arguments to be passed to plot_DPM. Only used if doPlot is TRUE.

Value

  • a object of class DPMclust with the following attributes:
    • mcmc_partitions:
    { a list of length N. Each element mcmc_partitions[n] is a vector of length n giving the partition of the n observations.}
  • alpha:a vector of length N. cost[j] is the cost associated to partition c[[j]]
  • U_SS_list:a list of length N containing the lists of sufficient statistics for all the mixture components at each MCMC iteration
  • weights_list:a list of length N containing the weights of each mixture component for each MCMC iterations
  • logposterior_list:a list of length N containing the logposterior values at each MCMC iterations
  • data:the data matrix d x n with d dimensions in rows and n observations in columns
  • nb_mcmcit:the number of MCMC itertations
  • clust_distrib:the parametric distribution of the mixture component
  • hyperG0:the prior on the cluster location

Details

This function is a wrapper around DPMGibbsN, DPMGibbsN_parallel, DPMGibbsN_SeqPrior, DPMGibbsSkewN, DPMGibbsSkewN_parallel, DPMGibbsSkewT, DPMGibbsSkewT_parallel, DPMGibbsSkewT_SeqPrior, DPMGibbsSkewT_SeqPrior_parallel.

References

Hejblum BP, Alkhassim C, Gottardo R, Caron F, Thiebaut R, Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data, in preparation.

Examples

Run this code
rm(list=ls())
library(ggplot2)
library(truncnorm)

#Number of data
n <- 2000
set.seed(123)
set.seed(4321)


d <- 2
ncl <- 4

# Sample data

sdev <- array(dim=c(d,d,ncl))

xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3))
psi <- matrix(nrow=d, ncol=4, c(0.3, -0.7, -0.8, 0, 0.3, -0.7, 0.2, 0.9))
nu <- c(100,25,8,5)
p <- c(0.15, 0.05, 0.5, 0.3) # frequence des clusters
sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.2))
sdev[, ,4] <- .3*diag(2)


c <- rep(0,n)
w <- rep(1,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
 c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
 w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2)
 z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) +
               (sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)
 #cat(k, "/", n, " observations simulated\\n", sep="")
}

# Set parameters of G0
hyperG0 <- list()
hyperG0[["b_xi"]] <- rowMeans(z)
hyperG0[["b_psi"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["D_xi"]] <- 100
hyperG0[["D_psi"]] <- 100
hyperG0[["nu"]] <- d+1
hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3

 # hyperprior on the Scale parameter of DPM
 a <- 0.0001
 b <- 0.0001



 ## Data
 ########
 library(ggplot2)
 p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
       + geom_point()
       #+ ggtitle("Simple example in 2d data")
       +xlab("D1")
       +ylab("D2")
       +theme_bw())
 p #pdf(height=8.5, width=8.5)

 c2plot <- factor(c)
 levels(c2plot) <- c("4", "1", "3", "2")
 pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot)))
       + geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster))
       #+ ggtitle("Slightly overlapping skew-normal simulation\\n")
       + xlab("D1")
       + ylab("D2")
       + theme_bw()
       + scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22))))
 pp #pdf(height=7, width=7.5)

MCMCsample_st <- DPMpost(data=z, hyperG0=hyperG0, N=2000,
                          distrib="skewt",
                          gg.add=list(theme_bw(),
                          guides(shape=guide_legend(override.aes = list(fill="grey45"))))
 )
 s <- summary(MCMCsample_st, burnin = 1500, thin=5, lossFn = "Binder")
 s
 plot(s)
 #plot(s, hm=TRUE) #pdf(height=8.5, width=10.5) #png(height=700, width=720)

Run the code above in your browser using DataLab