BNPMIXcluster (version 1.1.0)

MIXclustering: Bayesian Nonparametric Model for Clustering with Mixed Scale Variables

Description

MIXclustering performs cluster analysis of individuals using a Bayesian nonparametric mixture model that jointly models mixed scale data and accommodates different sampling probabilities. The model is described in Carmona, C., Nieto-Barajas, L. E. and Canale, A. (2016).

Usage

MIXclustering(x, var_type, n.iter_out = 1000, n.burn = 100, n.thin = 3,
  a_fix = NULL, alpha = 0.5, d_0_a = 1, d_1_a = 1, b_fix = NULL,
  d_0_b = 1, d_1_b = 5, eta = 2, d_0_z = 2.1, d_1_z = 30, kappa = 5,
  delta = 4, d_0_mu = 2.1, d_1_mu = 30, sampling_prob = NULL,
  expansion_f = NULL, max.time = Inf, USING_CPP = TRUE)

Arguments

x

Matrix or data frame containing the data to be clustered.

var_type

Character vector that indicates the type of variable in each column of x. Three possible types:

  • "c" for continuous variables, assumed to be Gaussian.

  • "o" for ordinal variables (binary and ordered categorical).

  • "m" for nominal variables (non-ordered categorical).

n.iter_out

Number of effective iterations in the MCMC procedure for clustering.

n.burn

Number of iterations discarded as burn-in at the beginning of the MCMC procedure.

n.thin

Number of iterations discarded between two effective iterations, with the purpose of reducing the autocorrelation in the chain.

a_fix

A numeric value to set the parameter \(a\) in the model. If NULL (default), the parameter \(a\) is assigned a prior distribution. See details.

alpha

Hyperparameter in the prior distribution of \(a\). See details.

d_0_a

Hyperparameter in the prior distribution of \(a\). See details.

d_1_a

Hyperparameter in the prior distribution of \(a\). See details.

b_fix

A numeric value to set the parameter \(b\) in the model. If NULL (default), the parameter \(b\) is assigned a prior distribution. See details.

d_0_b

Hyperparameter in the prior distribution of \(b\). See details.

d_1_b

Hyperparameter in the prior distribution of \(b\). See details.

eta

Tuning parameter controlling the proposal in the Metropolis-Hastings step for \(b\).

d_0_z

Hyperparameter in the prior distribution of the variance for the latent variables. See details.

d_1_z

Hyperparameter in the prior distribution of the variance for the latent variables. See details.

kappa

Tuning parameter controlling the proposal in the Metropolis-Hastings step for the variance of latent variables.

delta

Tuning parameter controlling the proposal in the Metropolis-Hastings step for the correlation of latent variables.

d_0_mu

Hyperparameter in the prior distribution of the variance within each cluster. See details.

d_1_mu

Hyperparameter in the prior distribution of the variance within each cluster. See details.

sampling_prob

vector with the sampling probabilities \(\pi_i\) for each individual in case that the data come from a complex survey sample. By default \(\pi_i=1\).

expansion_f

vector with the expansion factors, the reciprocal of the sampling probabilities, \(w_i = 1/\pi_i\). If both sampling_prob and expansion_f are specified, preference is given to sampling_prob.

max.time

Maximum time allowed for the sampling procedure of the Metropolis-Hastings steps. If exceeded, the routine stops with an error.

USING_CPP

Logical. Indicates whether to use the optimized functions implemented in C++ (default TRUE).

Value

MIXclustering returns an S3 object of class "MIXcluster".

The generic methods summary and plot are defined for this class.

An object of class "MIXcluster" is a list containing the following components:

cluster

vector with the cluster allocation for each row in the data. It corresponds to the iteration whose arrangement is Closest-To-Average (CTA).

Y.cluster.summary

a summary of the data divided by the allocation in $cluster.

Y.var_type

vector with the variable types in the data.

Y.na

vector specifying the rows with missing values.

Y.n

number of rows in the data.

Y.p

number of variables in the data.

MC.clusters

matrix with the cluster allocation for each row in the data. Each column corresponds to an effective iteration in the MCMC simulation of the model (after discarding burn-in and thinning iterations).

cluster.matrix.avg

average similarity matrix of size \(n\) by \(n\).

MC.values

a list with the simulated values of the chains for the parameters \(a\),\(b\),\(\Lambda\),\(\Omega\).

MC.accept.rate

a named vector with the acceptance rates for each parameter. The rates include iterations discarded during burn-in and thinning.

call

the matched call.

Details

The model consists of a Bayesian nonparametric approach for clustering that is capable of combining different types of variables through associated continuous latent variables. The clustering mechanism is based on a location mixture model with a Poisson-Dirichlet (\(PD\)) process prior on the location parameters \(\mu_i ; i=1,...,n\) of the associated latent variables.

Computational inference about the cluster allocation and the posterior distribution of the parameters is performed using MCMC simulations.

A full description of the model is given in Carmona et al. (2016) (preprint: http://arxiv.org/abs/1612.00083); see References.

The model considers an individual \(y_i\) characterized by a multivariate response of dimension \(p\), i.e., \(y_i=(y_{i,1},...,y_{i,p})\). The total number of variables \(p\) is divided into \(c\) continuous variables, \(o\) ordinal variables, and \(m\) nominal variables such that \(p=c+o+m\).

For each response \(y_i=(y_{i,1},...,y_{i,p})\) (of dimension \(p\)) a corresponding latent vector \(z_i=(z_{i,1},...,z_{i,q})\) (of dimension \(q\)) is created, according to the following:

  • For each continuous variable \(y_{i,j} ; j=1,...,c\), the algorithm uses a latent variable with the same value, \(z_{i,j}=y_{i,j}\).

  • For each ordinal variable \(y_{i,j} , j = c+1,...,c+o\), with \(K_j\) different ordered values, the algorithm creates one latent variable \(z_{i,j}\) that maps the categories to continuous values separated by thresholds. For example, for a binary \(y_j\), \(y_j=0\) iff \(z_j<0\) and \(y_j=1\) iff \(z_j>0\).

  • For each nominal variable \(y_{i,j} , j = c+o+1,...,c+o+m\), with \(L_j\) categories, the algorithm requires \(L_j-1\) latent variables, whose relative order is consistent with the observed category.
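As an illustrative sketch of the threshold mapping for a binary ordinal variable (this is not the package internals, and the latent values below are hypothetical):

```r
# Illustrative sketch of the ordinal threshold mapping (not the
# package internals): a binary y is recovered from its latent z
# through a threshold at 0, so y = 0 iff z < 0 and y = 1 iff z > 0.
z <- c(-1.3, 0.4, 2.1, -0.2)   # hypothetical latent values
y <- ifelse(z < 0, 0L, 1L)
y
# [1] 0 1 1 0
```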

The data may come from a complex survey sample where each individual \(y_i\) has known sampling probability \(\pi_i, i = 1,...,n\). The reciprocal of these sampling probabilities, \(w_i=1/\pi_i\), are called expansion factors or sampling design weights.

The joint model for the latent vector is therefore:

\((z_i | \mu_i,\Sigma)\) ~ \(N_q(\mu_i, \pi_i \Sigma )\)

(Note: the final model in Carmona et al. (2016) has variance \(\kappa \pi_i \Sigma\). This value of \(\kappa\) can be used in the package through a transformed sampling probability vector \(\pi_i^*=\kappa\pi_i\))
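The relationship between expansion factors, sampling probabilities, and the transformed vector \(\pi_i^*\) can be sketched as follows (the values of w and kappa below are hypothetical, for illustration only):

```r
# Hypothetical expansion factors w_i; the sampling probabilities are
# their reciprocals, pi_i = 1 / w_i.
w <- c(250, 400, 125)
sampling_prob <- 1 / w

# To obtain the variance kappa * pi_i * Sigma of the final model, pass
# the transformed vector pi_star = kappa * pi_i as sampling_prob.
kappa_const <- 2                       # hypothetical value of kappa
pi_star <- kappa_const * sampling_prob
```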

The clustering model is based on an appropriate choice of prior distribution on the \(\mu_i\)'s: a clustering of the \(\mu_i\)'s induces a clustering of the \(y_i\)'s. Our prior on the \(\mu_i\)'s is:

\(\mu_i | G\)~\(G\), iid for \(i=1,...,n\)

Where \(G\)~\(PD(a,b,G_0)\) is a Poisson-Dirichlet process with parameters \(a \in [0,1)\), \(b>-a\) and centring measure \(G_0\). The Dirichlet and the normalized stable processes arise when \(a=0\) and when \(b=0\), respectively.
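For intuition, the weights of a \(PD(a,b,G_0)\) process can be simulated with the standard Pitman-Yor stick-breaking construction, \(V_k \sim Beta(1-a, b+ka)\). This sketch is for illustration only and is not part of the package:

```r
# Truncated stick-breaking simulation of Poisson-Dirichlet (Pitman-Yor)
# weights with a in [0,1) and b > -a (illustration only).
set.seed(1)
a <- 0.25; b <- 1; K <- 50                # K = truncation level
v <- rbeta(K, 1 - a, b + a * seq_len(K))  # V_k ~ Beta(1 - a, b + k * a)
w <- v * cumprod(c(1, 1 - v[-K]))         # w_k = V_k * prod_{l<k} (1 - V_l)
sum(w)                                    # approaches 1 as K grows
```

Larger \(a\) or \(b\) spreads mass over more sticks, which is why these parameters control the number of groups.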

In consequence, this choice of prior implies that the \(\mu_i\)'s are exchangeable with marginal distribution \(\mu_i\)~\(G_0\) for all \(i=1,...,n\).

In our case, \(G_0(\mu)=N(0,\Sigma_\mu)\), where \(\Sigma_\mu = diag( \sigma^2_{\mu 1} ,...,\sigma^2_{\mu q} )\).

The parameters \(a\) and \(b\) in the model define the \(PD\) process and therefore control the number of groups. These parameters can be fixed; larger values produce a larger number of groups and smaller values a smaller number.

There are 9 hyperparameters in the function that also characterize the prior distributions in the model:

  • f(a) = alpha * I(a=0) + (1-alpha) * dbeta( a | d_0_a , d_1_a )

  • f(b | a) = dgamma( b + a | d_0_b , d_1_b )

  • sigma^2 ~ inverse-gamma( d_0_z , d_1_z)

  • sigma^2_mu ~ inverse-gamma( d_0_mu , d_1_mu )

The values chosen for these hyperparameters also affect the number of resulting clusters, since they affect the variance implied by the model.

For example, increasing the values of d_1_a and d_1_b reduce the number of groups.

Finally, the function parameters eta, kappa and delta are tuning parameters that control the acceptance rate in the random-walk Metropolis-Hastings steps that propose new values for the parameters \(b\), \(\Lambda_{j,j}\) (variance of the latents) and \(\Omega_{i,j}\) (correlation of the latents). Changing these parameters is not recommended (they are used in the internal functions sampling_b, sampling_Lambda_jj and sampling_Omega_ij).

References

Carmona, C., Nieto-Barajas, L. E. & Canale, A. (2016). Model based approach for household clustering with mixed scale variables. (preprint: http://arxiv.org/abs/1612.00083)

See Also

summary.MIXcluster for a summary of the clustering results, plot.MIXcluster for graphical representation of results.

Examples

# NOT RUN {
##### Testing "MIXclustering" function with simulated clustered data #####
# for more information about the data...
# }
# NOT RUN {
help(sim.cluster.data)
# }
# NOT RUN {

### Exercise (Ic) from section 5.1 in Carmona et al. (2016)
### using 3 continuous variables
# }
# NOT RUN {
set.seed(0) # for reproducibility

Y_data <- sim.cluster.data[,2:4]
colnames(Y_data) <- paste("Y",1:3,sep=".")

cluster_estim_Ic <- MIXclustering( Y_data,
                          var_type=c("c","c","c"),
                          n.iter_out=1000,
                          n.burn=200,
                          n.thin=2,
                          alpha=0.5,
                          d_0_a = 1, d_1_a = 1,
                          d_0_b = 1, d_1_b = 1,
                          d_0_z = 2.1, d_1_z = 30,
                          d_0_mu = 2.1, d_1_mu = 30
                          )
cluster_estim_Ic

# Summary of clustering results
summary(cluster_estim_Ic)

# Representation of clustering results
plot(cluster_estim_Ic,type="heatmap")
plot(cluster_estim_Ic,type="chain")

# Comparison with the original clusters in the simulated data
plot(x=jitter(sim.cluster.data$cluster),
     y=jitter(cluster_estim_Ic$cluster),
     main="",
     xlab="Real cluster",
     ylab="Model cluster",
     pch=19, col="#FF000020")

# }
# NOT RUN {

### Exercise (IIb) from section 5.1 in Carmona et al. (2016)
### Two binary and one continuous variables
# }
# NOT RUN {
set.seed(0) # for reproducibility

Y_data <- matrix(NA,nrow=nrow(sim.cluster.data),ncol=3)
colnames(Y_data) <- paste("Y",1:3,sep=".")
Y_data[,1] <- findInterval( sim.cluster.data[,2], c(-Inf,5,Inf) )-1
Y_data[,2] <- sim.cluster.data[,3]
Y_data[,3] <- findInterval( sim.cluster.data[,4], c(-Inf,3,Inf) )-1

cluster_estim_IIb <- MIXclustering( Y_data,
                          var_type=c("o","c","o"),
                          n.iter_out=1000,
                          n.burn=200,
                          n.thin=2,
                          alpha=0.5,
                          d_0_a = 1, d_1_a = 1,
                          d_0_b = 1, d_1_b = 1,
                          d_0_z = 1, d_1_z = 1,
                          d_0_mu = 1, d_1_mu = 1
                          )

summary(cluster_estim_IIb)
plot(cluster_estim_IIb,type="heatmap")
plot(cluster_estim_IIb,type="chain")

# }
# NOT RUN {

### Exercise (IIIa) from section 5.1 in Carmona et al. (2016)
### Two binary, one ordinal, and one (non-informative) continuous variables
# }
# NOT RUN {
set.seed(0) # for reproducibility

Y_data <- matrix(NA,nrow=nrow(sim.cluster.data),ncol=4)
colnames(Y_data) <- paste("Y",1:4,sep=".")
Y_data[,1] <- findInterval( sim.cluster.data[,2], c(-Inf,5,Inf) )-1
Y_data[,2] <- findInterval( sim.cluster.data[,3], c(-Inf,4,5,Inf) )-1
Y_data[,3] <- findInterval( sim.cluster.data[,4], c(-Inf,3,Inf) )-1
Y_data[,4] <- rnorm(n=nrow(sim.cluster.data),mean=0,sd=1)

cluster_estim_IIIa <- MIXclustering( Y_data,
                          var_type=c("m","o","o","c"),
                          n.iter_out=1000,
                          n.burn=200,
                          n.thin=2,
                          alpha=0.5,
                          d_0_a = 1, d_1_a = 1,
                          d_0_b = 1, d_1_b = 1,
                          d_0_z = 0.1, d_1_z = 0.1,
                          d_0_mu = 0.1, d_1_mu = 0.1
                          )

summary(cluster_estim_IIIa)
plot(cluster_estim_IIIa,type="heatmap")
plot(cluster_estim_IIIa,type="chain")

# }
# NOT RUN {

##### Testing "MIXclustering" function with poverty.data #####
# Using entity 15 (Edomex) #

# }
# NOT RUN {
Y_names <- c("ict_norm",
             "ic_ali","ic_asalud","ic_cv",
             "ic_rezedu","ic_sbv","ic_segsoc",
             "niv_ed","tam_loc")
aux_subset <- rep(TRUE, nrow(poverty.data))
aux_subset <- aux_subset & is.element(substr(poverty.data$folioviv,1,2),"15")

Y_data <- poverty.data[aux_subset,Y_names]

sampling_prob_pov_iii <- 4 * mean(poverty.data[aux_subset,"factor_hog"])
sampling_prob_pov_iii <- sampling_prob_pov_iii * 1/poverty.data[aux_subset,"factor_hog"]

cluster_estim_pov_iii <- MIXclustering(x=Y_data,
                             var_type=c("c","o","o","o","o","o","o","m","m"),
                             n.iter_out=1000,
                             n.burn=200,
                             n.thin=2,
                             d_0_a = 1, d_1_a = 1,
                             d_0_b = 1, d_1_b = 1,
                             d_0_z = 2.1, d_1_z = 30,
                             d_0_mu = 2.1, d_1_mu = 30,
                             sampling_prob = sampling_prob_pov_iii
                             )

summary(cluster_estim_pov_iii)
plot(cluster_estim_pov_iii,type="heatmap")
plot(cluster_estim_pov_iii,type="chain")
# }
# NOT RUN {
# }