R-package binomialMix
Copyright 2019 Faustine Bousquet (faustine.bousquet@tabmo.io or faustine.bousquet@umontpellier.fr) from TabMo and IMAG (Institut Montpelliérain Alexander Grothendieck, University of Montpellier). The binomialMix package is available under the Apache2 license.
Description
The binomialMix package provides a clustering method for longitudinal, non-Gaussian data. It uses an EM algorithm to fit a mixture of generalized linear models (GLMs).
Instructions for users
Installation
You can install the binomialMix R package with the following R command:
# install.packages("devtools")
devtools::install_git("https://gitlab.com/tabmo/binomialmix")
# or:
devtools::install_gitlab("tabmo/binomialMix")
You can also use the git repository directly:
git clone https://gitlab.com/tabmo/binomialMix
Once you have cloned the git repository, you can install the binomialMix package by running:
devtools::install("/path/to/binomialMix/pkg") # edit the path
Example of use
- Import the library:
library(binomialMix)
- Load the data:
data(adcampaign)
Of course, you can use your own data. It must have the following format:
- a data frame
- a factor column identifying the objects you want to cluster (id in the example)
- a target variable (ctr in the example)
- a weight variable (impressions in the example), needed because the data are binomial
- at least one column holding an explanatory variable
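As an illustration of the format listed above, here is a minimal sketch that builds a toy data frame in base R. The column names mirror the adcampaign example, but the values are simulated and purely hypothetical. Storing the target as a proportion (clicks divided by impressions) and the explanatory variables as factors are assumptions, not something this README states.

```r
# Hypothetical toy data: 4 campaigns with 6 repeated observations each.
set.seed(1)
n_id  <- 4
n_obs <- 6

my_data <- data.frame(
  # factor id: the objects to cluster
  id = factor(rep(paste0("campaign_", seq_len(n_id)), each = n_obs)),
  # explanatory variables (assumed here to be factors)
  timeSlot = factor(rep(c("morning", "afternoon", "evening"),
                        times = n_id * n_obs / 3)),
  day = factor(rep(rep(c("mon", "tue"), each = 3), times = n_id)),
  # weight variable: number of binomial trials behind each observation
  impressions = sample(100:1000, n_id * n_obs, replace = TRUE)
)

# Target variable: simulated click counts turned into a rate (assumption).
clicks <- rbinom(nrow(my_data), size = my_data$impressions, prob = 0.02)
my_data$ctr <- clicks / my_data$impressions

str(my_data)
```

Each id level appears on several rows, which is the repeated-measures layout the clustering works on.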
- Run the clustering algorithm:
Here, we want to cluster advertising campaigns. Each campaign (column "id") consists of n_c observations from the whole dataset, so there are repeated measures for each id level. The explanatory variables can be day, timeSlot or app_or_site. We try K=3 clusters.
model_formula<-"ctr~timeSlot+day"
weighted_variable<-"impressions"
nb_cluster<-3
df_tocluster<-adcampaign
col_id<-"id"
result_K3<-runEM(model_formula,
weighted_variable,
nb_cluster,
df_tocluster,
col_id)
- Analysis of the clustering obtained:
The output of the runEM function provides the following values:
- log-likelihood at each EM iteration
- estimates of the β, λ, π parameters
- BIC and ICL values
- number of Fisher scoring iterations needed for each M-step
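A log-likelihood trace is a convenient convergence diagnostic, because an EM run with exact M-steps should never decrease it. The sketch below uses a hypothetical trace and two hypothetical helpers (is_non_decreasing, first_converged) that are not part of binomialMix. It assumes the trace can be read as a plain numeric vector, for instance with unlist(result_K3[[1]]).

```r
# Hypothetical log-likelihood trace; in practice, something like
# loglik <- unlist(result_K3[[1]])   (assumed layout)
loglik <- c(-1250.4, -1198.7, -1190.2, -1189.9, -1189.9)

# TRUE if the trace never drops by more than a small numerical tolerance.
is_non_decreasing <- function(ll, tol = 1e-8) {
  all(diff(ll) >= -tol)
}

# First iteration whose change from the previous one is below eps,
# or NA if no such iteration exists.
first_converged <- function(ll, eps = 1e-3) {
  small <- which(abs(diff(ll)) < eps)
  if (length(small) == 0) NA_integer_ else small[1] + 1L
}

is_non_decreasing(loglik)
first_converged(loglik)
```

Because the M-step here relies on Fisher scoring, small decreases may simply reflect an inexact M-step, so the tolerance is a judgment call.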
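Plot the evolution of the log-likelihood over the iterations:
# Plot the log-likelihood at each EM iteration
# install.packages("ggplot2")
library(ggplot2)
loglik <- unlist(result_K3[[1]])
ggplot(data.frame(iteration = seq_along(loglik), loglik = loglik),
       aes(iteration, loglik)) + geom_point()
Matrix of estimated beta (values taken at the last iteration):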
head(result_K3[[2]][[length(result_K3[[2]])]])
##            [,1]       [,2]       [,3]
## [1,] -3.8126661 -5.2914380 -3.2418550
## [2,] -0.4134079  0.3794783  0.4115441
## [3,] -0.2975236  0.2407683  0.4076950
## [4,] -0.1948168  0.2122175  0.3753815
## [5,] -0.1590104  0.4028323  0.1885215
## [6,] -0.2160946  0.3545593  0.1872363
Vector of proportions in each cluster (values taken at the last iteration):
result_K3[[3]][[length(result_K3[[3]])]]
## [1] 0.1871000 0.7246125 0.0883000
Matrix of probabilities for each campaign to belong to the different clusters (values taken at the last iteration):
# Too large to print here
result_K3[[4]][[length(result_K3[[4]])]]
BIC value as numeric:
paste0("BIC=",result_K3[[5]][[length(result_K3[[5]])]])
## [1] "BIC=387914.537681485"
ICL value as numeric:
paste0("ICL value=",result_K3[[6]][[length(result_K3[[6]])]])
## [1] "ICL value=387919.96962191"
Total number of EM iterations as a numeric value:
paste0("Number of EM iteration :",length(result_K3[[7]]))
## [1] "Number of EM iteration :10"
Matrix of the number of Fisher scoring iterations at each M-step:
matrix(unlist(result_K3[[7]]),ncol=length(result_K3[[7]])-1)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,]    4    3    4    6    3    3    2    1    1
## [2,]    3    2    2    2    2    2    2    1    1
## [3,]    5    4    2    2    3    1    1    1    1
# nrow is equal to the number of clusters
# ncol is equal to the number of iterations (one fewer than length(result_K3[[7]]), as in the call above)
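The membership probability matrix (result_K3[[4]] above) can be turned into a hard clustering by assigning each campaign to its most probable cluster. The sketch below uses a small hypothetical matrix and base R only. It assumes one row per campaign and one column per cluster, which this README does not state explicitly.

```r
# Hypothetical posterior matrix: rows = campaigns, columns = clusters.
# In practice (assumed orientation):
# post <- result_K3[[4]][[length(result_K3[[4]])]]
post <- matrix(c(0.90, 0.05, 0.05,
                 0.10, 0.80, 0.10,
                 0.20, 0.30, 0.50,
                 0.02, 0.97, 0.01),
               ncol = 3, byrow = TRUE)

# Most probable cluster for each row; ties go to the first maximum.
cluster <- max.col(post, ties.method = "first")

# Cluster sizes, keeping empty clusters in the table.
sizes <- table(factor(cluster, levels = seq_len(ncol(post))))
sizes

# Share of campaigns per cluster under the hard assignment.
prop.table(sizes)
```

These shares can be compared with the estimated proportion vector (result_K3[[3]]); large gaps suggest that many campaigns have ambiguous memberships.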