ClusterR v1.2.1

Gaussian Mixture Models, K-Means, Mini-Batch-Kmeans, K-Medoids and Affinity Propagation Clustering

Gaussian mixture models, k-means, mini-batch-kmeans, k-medoids and affinity propagation clustering with the option to plot, validate, predict (new data) and estimate the optimal number of clusters. The package takes advantage of 'RcppArmadillo' to speed up the computationally intensive parts of the functions. For more information, see (i) "Clustering in an Object-Oriented Environment" by Anja Struyf, Mia Hubert, Peter Rousseeuw (1997), Journal of Statistical Software, <doi:10.18637/jss.v001.i04>; (ii) "Web-scale k-means clustering" by D. Sculley (2010), ACM Digital Library, <doi:10.1145/1772690.1772862>; (iii) "Armadillo: a template-based C++ library for linear algebra" by Sanderson et al (2016), The Journal of Open Source Software, <doi:10.21105/joss.00026>; (iv) "Clustering by Passing Messages Between Data Points" by Brendan J. Frey and Delbert Dueck, Science 16 Feb 2007: Vol. 315, Issue 5814, pp. 972-976, <doi:10.1126/science.1136800>.

Readme

ClusterR


The ClusterR package consists of Gaussian mixture models, k-means, mini-batch-kmeans, k-medoids and affinity propagation clustering algorithms with the option to plot, validate, predict (new data) and find the optimal number of clusters. The package takes advantage of 'RcppArmadillo' to speed up the computationally intensive parts of the functions. More details on the functionality of ClusterR can be found in the blog post, the vignette and the package documentation (scroll down for information on how to use the Docker image).

UPDATE 16-08-2018

As of version 1.1.4 the ClusterR package allows R package maintainers to perform linking between packages at a C++ code (Rcpp) level. This means that the Rcpp functions of the ClusterR package can be called in the C++ files of another package. In the next lines I'll give detailed explanations on how this can be done:


Assuming that an R package ('PackageA') calls one of the ClusterR Rcpp functions, the maintainer of 'PackageA' has to:


  • 1st. install the ClusterR package to take advantage of the new functionality either from CRAN using,



install.packages("ClusterR")


or download the latest version from Github using the remotes package,



remotes::install_github('mlampros/ClusterR', upgrade = 'always', dependencies = TRUE, repos = 'https://cloud.r-project.org/')


  • 2nd. update the DESCRIPTION file of 'PackageA' and especially the LinkingTo field by adding the ClusterR package (besides any other packages),



LinkingTo: ClusterR


  • 3rd. open a new C++ file (for instance in Rstudio) and at the top of the file add the following 'headers', 'depends' and 'plugins',



# include <RcppArmadillo.h>
# include <ClusterRHeader.h>
# include <affinity_propagation.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]


The available functions can be found in the following files: inst/include/ClusterRHeader.h and inst/include/affinity_propagation.h


A complete minimal example would be :


# include <RcppArmadillo.h>
# include <ClusterRHeader.h>
# include <affinity_propagation.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]


using namespace clustR;


// [[Rcpp::export]]
Rcpp::List mini_batch_kmeans(arma::mat& data, int clusters, int batch_size, int max_iters, int num_init = 1,
                             double init_fraction = 1.0, std::string initializer = "kmeans++",
                             int early_stop_iter = 10, bool verbose = false,
                             Rcpp::Nullable<Rcpp::NumericMatrix> CENTROIDS = R_NilValue,
                             double tol = 1e-4, double tol_optimal_init = 0.5, int seed = 1) {

  ClustHeader clust_header;

  return clust_header.mini_batch_kmeans(data, clusters, batch_size, max_iters, num_init, init_fraction,
                                        initializer, early_stop_iter, verbose, CENTROIDS, tol,
                                        tol_optimal_init, seed);
}


Then, by opening an R file a user can call the mini_batch_kmeans function using,



Rcpp::sourceCpp('example.cpp')              # assuming that the previous Rcpp code is saved in 'example.cpp'

set.seed(1)
dat = matrix(runif(100000), nrow = 1000, ncol = 100)

mbkm = mini_batch_kmeans(dat, clusters = 3, batch_size = 50, max_iters = 100, num_init = 2,
                         init_fraction = 1.0, initializer = "kmeans++", early_stop_iter = 10,
                         verbose = TRUE, CENTROIDS = NULL, tol = 1e-4, tol_optimal_init = 0.5, seed = 1)

str(mbkm)


Use the following link to report bugs/issues,

https://github.com/mlampros/ClusterR/issues


UPDATE 28-11-2019


Docker images of the ClusterR package are available to download from my dockerhub account. The images come with Rstudio and the R-development version (latest) installed. The whole process was tested on Ubuntu 18.04. To pull & run the image do the following,



docker pull mlampros/clusterr:rstudiodev

docker run -d --name rstudio_dev -e USER=rstudio -e PASSWORD=give_here_your_password --rm -p 8787:8787 mlampros/clusterr:rstudiodev


The user can also bind a home directory / folder to the container, so that its files are available inside it, by specifying the -v option,



docker run -d --name rstudio_dev -e USER=rstudio -e PASSWORD=give_here_your_password --rm -p 8787:8787 -v /home/YOUR_DIR:/home/rstudio/YOUR_DIR mlampros/clusterr:rstudiodev


In the latter case you might first have to grant write access to the YOUR_DIR directory (this is not always necessary) using,



chmod -R 777 /home/YOUR_DIR


The USER defaults to rstudio but you have to give your PASSWORD of preference (see www.rocker-project.org for more information).
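
The run commands above differ only in the password and the optional bind mount, so they can be generated by a small helper. The following is a minimal sketch and not part of the ClusterR project: the function name clusterr_run_cmd and its two arguments are hypothetical, while the image tag, the port and the /home/rstudio target path are copied from the commands above. It only prints the command, so that it can be checked before it is run.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with ClusterR): prints the docker run
# command for the rstudiodev image.
#   $1 = RStudio password
#   $2 = optional host directory to bind into the container
clusterr_run_cmd() {
  password="$1"
  host_dir="${2:-}"
  cmd="docker run -d --name rstudio_dev -e USER=rstudio -e PASSWORD=${password} --rm -p 8787:8787"
  if [ -n "$host_dir" ]; then
    # Bind the host directory under /home/rstudio, keeping its base name.
    cmd="${cmd} -v ${host_dir}:/home/rstudio/$(basename "$host_dir")"
  fi
  echo "${cmd} mlampros/clusterr:rstudiodev"
}

# Print the command with a bind mount; omit the second argument for no mount.
clusterr_run_cmd "give_here_your_password" "/home/YOUR_DIR"
```

The sketch assumes that neither the password nor the directory contains spaces or shell metacharacters; values like that would need additional quoting before the printed line is executed.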


Open your web browser and, depending on where the Docker image was built / run, enter,


1st. Option on your personal computer,


http://0.0.0.0:8787


2nd. Option on a cloud instance,


http://Public DNS:8787


to reach the RStudio login page and enter your username and password.


Functions in ClusterR

Name                           Description
mushroom                       The mushroom data
distance_matrix                Distance matrix calculation
Optimal_Clusters_Medoids       Optimal number of Clusters for the partitioning around Medoids functions
Silhouette_Dissimilarity_Plot  Plot of silhouette widths or dissimilarities
Optimal_Clusters_KMeans        Optimal number of Clusters for Kmeans or Mini-Batch-Kmeans
predict_MBatchKMeans           Prediction function for Mini-Batch-k-means
plot_2d                        2-dimensional plots
predict_Medoids                Predictions for the Medoid functions
dietary_survey_IBS             Synthetic data using a dietary survey of patients with irritable bowel syndrome (IBS)
center_scale                   Function to scale and/or center the data
tryCatch_KMEANS_arma           tryCatch function to prevent armadillo errors in KMEANS_arma
entropy_formula                entropy formula (used in external_validation function)
soybean                        The soybean (large) data set from the UCI repository
external_validation            external clustering validation
function_interactive           Interactive function for consecutive plots (using dissimilarities or the silhouette widths of the observations)
tryCatch_GMM                   tryCatch function to prevent armadillo errors
tryCatch_optimal_clust_GMM     tryCatch function to prevent armadillo errors in GMM_arma_AIC_BIC
predict_GMM                    Prediction function for a Gaussian Mixture Model object
predict_KMeans                 Prediction function for the k-means
MiniBatchKmeans                Mini-batch-k-means using RcppArmadillo
Cluster_Medoids                Partitioning around medoids
AP_preferenceRange             Affinity propagation preference range
GMM                            Gaussian Mixture Model clustering
KMeans_rcpp                    k-means using RcppArmadillo
Optimal_Clusters_GMM           Optimal number of Clusters for the gaussian mixture models
Clara_Medoids                  Clustering large applications
AP_affinity_propagation        Affinity propagation clustering
KMeans_arma                    k-means using the Armadillo library

Vignettes of ClusterR

Name
Rplot.png
Rplot_2d.png
Rplot_clara.png
Rplot_cluster.png
dog.jpg
elephant.jpg
the_clusterR_package.Rmd

Details

Type: Package
Date: 2019-11-28
BugReports: https://github.com/mlampros/ClusterR/issues
URL: https://github.com/mlampros/ClusterR
License: GPL-3
Encoding: UTF-8
SystemRequirements: libarmadillo: apt-get install -y libarmadillo-dev (deb), libblas: apt-get install -y libblas-dev (deb), liblapack: apt-get install -y liblapack-dev (deb), libarpack++2: apt-get install -y libarpack++2-dev (deb), gfortran: apt-get install -y gfortran (deb), libgmp3: apt-get install -y libgmp3-dev (deb), libfftw3: apt-get install -y libfftw3-dev (deb), libtiff5: apt-get install -y libtiff5-dev (deb)
LazyData: TRUE
LinkingTo: Rcpp, RcppArmadillo (>= 0.9.1)
VignetteBuilder: knitr
RoxygenNote: 6.1.0
NeedsCompilation: yes
Packaged: 2019-11-28 15:55:34 UTC; lampros
Repository: CRAN
Date/Publication: 2019-11-29 19:50:13 UTC
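
The SystemRequirements field above lists one apt-get call per Debian package. On a Debian or Ubuntu system these could be combined into a single call, as sketched below. The package names are copied from that field; the helper name clusterr_apt_cmd is hypothetical, and some of the packages may be named differently or be unavailable on newer releases. The function only prints the command.

```shell
#!/bin/sh
# Hypothetical helper (not part of ClusterR): prints one apt-get call that
# installs every Debian package named in the SystemRequirements field.
clusterr_apt_cmd() {
  deps="libarmadillo-dev libblas-dev liblapack-dev libarpack++2-dev gfortran libgmp3-dev libfftw3-dev libtiff5-dev"
  echo "apt-get install -y ${deps}"
}

# Print the command for inspection; run it as root (or prefix it with sudo).
clusterr_apt_cmd
```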
