tidyclust

The goal of tidyclust is to provide a tidy, unified interface to clustering models. The package is closely modeled after the parsnip package.

Installation

You can install the released version of tidyclust from CRAN with:

install.packages("tidyclust")

and the development version of tidyclust from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/tidyclust")

Example

The first step is to create a cluster specification. This example creates a K-means model using the stats engine.

library(tidyclust)
set.seed(1234)

kmeans_spec <- k_means(num_clusters = 3) %>%
  set_engine("stats")

kmeans_spec
#> K Means Cluster Specification (partition)
#> 
#> Main Arguments:
#>   num_clusters = 3
#> 
#> Computational engine: stats
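
Other model types follow the same specification pattern. As a sketch, a hierarchical clustering specification can be built with hier_clust(), which appears in the function index of this package. The choice of linkage_method = "average" is only an illustrative assumption, not something the example above prescribes.

```r
library(tidyclust)

# Hierarchical (agglomerative) clustering specification, cut into 3 clusters.
# "average" linkage is an illustrative choice, not a recommendation.
hclust_spec <- hier_clust(num_clusters = 3, linkage_method = "average") %>%
  set_engine("stats")

hclust_spec
```

Like kmeans_spec, this object only describes the model; nothing is estimated until it is fit to data.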

The specification can then be fit to data. The formula ~. uses every column of mtcars.

kmeans_spec_fit <- kmeans_spec %>%
  fit(~., data = mtcars)
kmeans_spec_fit
#> tidyclust cluster object
#> 
#> K-means clustering with 3 clusters of sizes 7, 11, 14
#> 
#> Cluster means:
#>        mpg cyl     disp        hp     drat       wt     qsec        vs
#> 1 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
#> 3 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909
#> 2 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
#>          am     gear     carb
#> 1 0.4285714 3.857143 3.428571
#> 3 0.7272727 4.090909 1.545455
#> 2 0.1428571 3.285714 3.500000
#> 
#> Clustering vector:
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>                   1                   1                   2                   1 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>                   3                   1                   3                   2 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>                   2                   1                   1                   3 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>                   3                   3                   3                   3 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>                   3                   2                   2                   2 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>                   2                   3                   3                   3 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>                   3                   2                   2                   2 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>                   3                   1                   3                   2 
#> 
#> Within cluster sum of squares by cluster:
#> [1] 13954.34 11848.37 93643.90
#>  (between_SS / total_SS =  80.8 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
#> [6] "betweenss"    "size"         "iter"         "ifault"
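
The sums of squares in the printed output are related by a simple decomposition. The sketch below calls stats::kmeans() directly (the function that the stats engine wraps, according to the function index) instead of going through tidyclust, so cluster labels and sizes may differ from the output above.

```r
# Fit K-means directly with base R; tidyclust is not needed for this sketch.
set.seed(1234)
km <- stats::kmeans(mtcars, centers = 3)

# Total sum of squares: squared deviations from the overall column means.
total_ss <- sum(scale(mtcars, center = TRUE, scale = FALSE)^2)

# Between-cluster SS is what remains after removing the within-cluster SS.
between_ss <- total_ss - sum(km$withinss)

# Share of total variation explained by the clustering,
# the quantity printed as "between_SS / total_SS".
explained <- between_ss / total_ss
round(100 * explained, 1)
```

The values total_ss and between_ss correspond to the "totss" and "betweenss" components listed under "Available components".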

Once you have a fitted tidyclust object, you can do a number of things. predict() returns the cluster that a new observation belongs to:

predict(kmeans_spec_fit, mtcars[1:4, ])
#> # A tibble: 4 × 1
#>   .pred_cluster
#>   <fct>        
#> 1 Cluster_1    
#> 2 Cluster_1    
#> 3 Cluster_2    
#> 4 Cluster_1
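
Conceptually, a K-means prediction assigns each new observation to the cluster whose centroid is closest. The following is a minimal base R sketch of that idea, assuming Euclidean distance and using stats::kmeans() directly; it is not tidyclust's internal code, and its integer labels need not match the Cluster_* labels above.

```r
set.seed(1234)
km <- stats::kmeans(mtcars, centers = 3)

new_obs <- as.matrix(mtcars[1:4, ])

# Squared Euclidean distance from every new row to every centroid:
# one row per observation, one column per cluster.
sq_dist <- sapply(seq_len(nrow(km$centers)), function(k) {
  rowSums(sweep(new_obs, 2, km$centers[k, ])^2)
})

# Assign each observation to the cluster with the closest centroid.
nearest <- apply(sq_dist, 1, which.min)
nearest
```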

extract_cluster_assignment() returns the cluster assignments of the training observations:

extract_cluster_assignment(kmeans_spec_fit)
#> # A tibble: 32 × 1
#>    .cluster 
#>    <fct>    
#>  1 Cluster_1
#>  2 Cluster_1
#>  3 Cluster_2
#>  4 Cluster_1
#>  5 Cluster_3
#>  6 Cluster_1
#>  7 Cluster_3
#>  8 Cluster_2
#>  9 Cluster_2
#> 10 Cluster_1
#> # ℹ 22 more rows

extract_centroids() returns the location (centroid) of each cluster:

extract_centroids(kmeans_spec_fit)
#> # A tibble: 3 × 12
#>   .cluster    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <fct>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Cluster_1  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
#> 2 Cluster_2  26.7     4  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
#> 3 Cluster_3  15.1     8  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5
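
For K-means, a centroid is the per-column mean of the observations assigned to a cluster. A minimal base R sketch, again using stats::kmeans() directly rather than tidyclust, checks this by recomputing the means by hand:

```r
set.seed(1234)
km <- stats::kmeans(mtcars, centers = 3)

# Average every column within each cluster (rows = clusters, columns = variables).
manual_centers <- apply(as.matrix(mtcars), 2, function(col) {
  tapply(col, km$cluster, mean)
})

# The hand-computed means should agree with the centers reported by kmeans().
all.equal(manual_centers, km$centers, check.attributes = FALSE)
```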

Visual comparison of clustering methods

[Figure: visualization of the available models and how they compare on two-dimensional toy data sets.]

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Version: 0.2.4
License: MIT + file LICENSE
Maintainer: Emil Hvitfeldt
Last Published: January 27th, 2025
Monthly Downloads: 1,042

Functions in tidyclust (0.2.4)

.k_means_fit_ClusterR: Simple Wrapper around ClusterR kmeans
extract-tidyclust: Extract elements of a tidyclust model object
extract_centroids: Extract clusters from model
hier_clust: Hierarchical (Agglomerative) Clustering
details_k_means_stats: K-means via stats
details_k_means_klaR: K-means via klaR
.k_means_fit_clustMixType: Simple Wrapper around clustMixType kmeans
.k_means_fit_klaR: Simple Wrapper around klaR kmeans
k_means: K-Means
predict.cluster_fit: Model predictions
predict_cluster: Other predict methods.
knit_engine_docs: Knit engine-specific documentation
extract_cluster_assignment: Extract cluster assignments from model
extract_fit_summary: S3 method to get fitted model summary info depending on engine
sse_within_total: Compute the sum of within-cluster SSE
new_cluster_metric: Construct a new clustering metric function
sse_within: Calculates Sum of Squared Error in each cluster
glance.cluster_fit: Construct a single row summary "glance" of a model, fit, or other object
get_centroid_dists: Computes distance from observations to centroids
new_cluster_spec: Functions required for tidyclust-adjacent packages
set_args.cluster_spec: Change arguments of a cluster specification
min_grid.cluster_spec: Determine the minimum set of model fits
tidyclust-package: tidyclust: A Common API to Clustering
list_md_problems: Locate and show errors/warnings in engine-specific documentation
make_classes_tidyclust: Prepend a new class
reexports: Objects exported from other packages
linkage_method: The agglomeration Linkage method
finalize_model_tidyclust: Splice final parameters into objects
tidy.cluster_fit: Turn a tidyclust model object into a tidy tibble
sse_ratio: Compute the ratio of the WSS to the total SSE
sse_total: Compute the total sum of squares
tune_cluster: Model tuning via grid search
fit.cluster_spec: Fit a Model Specification to a Data Set
prep_data_dist: Prepares data and distance matrices for metric calculation
load_pkgs.cluster_spec: Quietly load package namespace
silhouette: Measures silhouette between clusters
silhouette_avg: Measures average silhouette across all observations
set_engine.cluster_spec: Change engine of a cluster specification
reconcile_clusterings_mapping: Relabels clusters to match another cluster assignment
set_mode.cluster_spec: Change mode of a cluster specification
update.hier_clust: Update a cluster specification
translate_tidyclust: Resolve a Model Specification for a Computational Engine
augment.cluster_fit: Augment data with predictions
cluster_metric_set: Combine metric functions
cluster_spec: Model Specification Information
cluster_fit: Model Fit Object Information
details_hier_clust_stats: Hierarchical (Agglomerative) Clustering via stats
.convert_form_to_x_fit: Helper functions to convert between formula and matrix interface
cut_height: Cut Height
control_cluster: Control the fit function
details_k_means_ClusterR: K-means via ClusterR
details_k_means_clustMixType: K-means via clustMixType
.k_means_fit_stats: Simple Wrapper around stats kmeans
get_tidyclust_colors: Get colors for tidyclust text.
.hier_clust_fit_stats: Simple Wrapper around hclust function