Compute hierarchical or k-means cluster analysis and return the group assignment for each observation as a vector.
cluster_analysis(
x,
n = NULL,
method = "kmeans",
include_factors = FALSE,
standardize = TRUE,
verbose = TRUE,
distance_method = "euclidean",
hclust_method = "complete",
kmeans_method = "Hartigan-Wong",
dbscan_eps = 15,
iterations = 100,
...
)
The group classification for each observation as a vector. The returned vector includes missing values, so it has the same length as nrow(x).
A data frame (with at least two variables), or a matrix (with at least two columns).
Number of clusters used for supervised cluster methods. If NULL, the number of clusters to extract is determined by calling n_clusters(). Note that this argument does not apply to unsupervised clustering methods like dbscan, hdbscan, mixture, pvclust, or pamk.
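As a hedged, base-R illustration of the underlying idea (n_clusters() itself bundles several such indices; this sketch does not reproduce its actual implementation), a common way to choose the number of clusters is the "elbow" of the total within-cluster sum of squares:

```r
# Base-R sketch of the "elbow" heuristic for picking the number of clusters,
# using only stats::kmeans() on the standardized iris measurements.
set.seed(33)
x <- scale(iris[1:4])
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
# wss shrinks as k grows; the "elbow" where the drop flattens suggests k = 3
round(wss, 1)
```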
Method for computing the cluster analysis. Can be "kmeans" (default; k-means using kmeans()), "hkmeans" (hierarchical k-means using factoextra::hkmeans()), "pam" (K-Medoids using cluster::pam()), "pamk" (K-Medoids that determines the number of clusters automatically), "hclust" (hierarchical clustering using hclust() or pvclust::pvclust()), "dbscan" (DBSCAN using dbscan::dbscan()), "hdbscan" (Hierarchical DBSCAN using dbscan::hdbscan()), or "mixture" (mixture modeling using mclust::Mclust(), which requires the user to run library(mclust) first).
Logical, if TRUE, factors are converted to numerical values in order to be included in the data for determining the number of clusters. By default, factors are removed, because most methods that determine the number of clusters need numeric input only.
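A minimal base-R sketch of the kind of conversion that include_factors = TRUE implies (the package's actual implementation may differ): factor columns are coerced to their numeric codes so they can enter the numeric computations.

```r
# Hypothetical illustration: coerce factor columns to integer codes
df <- data.frame(size = c(1.2, 3.4, 2.2), group = factor(c("a", "b", "a")))
df_num <- as.data.frame(lapply(df, function(col) {
  if (is.factor(col)) as.numeric(col) else col
}))
sapply(df_num, is.numeric)  # both columns are now numeric
```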
Standardize the data frame before clustering (default).
Toggle warnings and messages.
Distance measure to be used for methods based on distances (e.g., when method = "hclust" for hierarchical clustering). For other methods, such as "kmeans", this argument is ignored. Must be one of "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski". See dist() and pvclust::pvclust() for more information.
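The accepted measure names mirror those of the base dist() function; a quick base-R sketch:

```r
# stats::dist() accepts the same measure names listed above
d_euc <- dist(iris[1:4], method = "euclidean")
d_man <- dist(iris[1:4], method = "manhattan")
# Each result is a "dist" object holding all pairwise distances between rows
length(d_euc)  # choose(150, 2) = 11175
```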
Agglomeration method to be used when method = "hclust" or method = "hkmeans" (for hierarchical clustering). This should be one of "ward", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid". Default is "complete" (see hclust()).
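A base-R sketch of the steps behind method = "hclust" (the package wraps these with extra conveniences): build a distance matrix, apply the chosen agglomeration method, then cut the tree into n groups.

```r
# Hierarchical clustering with base stats functions
d <- dist(scale(iris[1:4]), method = "euclidean")
h <- hclust(d, method = "complete")  # the default hclust_method
groups <- cutree(h, k = 3)           # one cluster label per observation
table(groups)
```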
Algorithm used for calculating the k-means clusters. Only applies if method = "kmeans". May be one of "Hartigan-Wong" (default), "Lloyd" (used by SPSS), or "MacQueen". See kmeans() for details on this argument.
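These values correspond to the algorithm argument of base kmeans(); a short sketch:

```r
# The algorithm name is passed through to stats::kmeans()
set.seed(33)
km <- kmeans(scale(iris[1:4]), centers = 3, algorithm = "Hartigan-Wong")
# km$cluster holds one group label per row, analogous to what
# cluster_analysis() returns
length(km$cluster)  # 150
```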
The eps argument for the DBSCAN method. See n_clusters_dbscan().
The number of replications.
Arguments passed to or from other methods.
The print() and plot() methods show the (standardized) mean value for each variable within each cluster. Thus, a higher absolute value indicates that a certain variable characteristic is more pronounced within that specific cluster (as compared to other cluster groups with lower absolute mean values).
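A base-R sketch of the quantity these methods display, i.e., per-cluster means of the standardized variables (computed here manually for illustration; the package computes this internally):

```r
# Per-cluster means of standardized variables, computed by hand
set.seed(33)
x <- scale(iris[1:4])                  # standardized data
cl <- kmeans(x, centers = 3)$cluster   # cluster assignment per row
centers <- aggregate(as.data.frame(x), by = list(cluster = cl), FUN = mean)
centers  # one row per cluster; large absolute values mark pronounced variables
```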
Cluster classification can be obtained via predict(x, newdata = NULL, ...).
Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2014) cluster: Cluster Analysis Basics and Extensions. R package.
n_clusters() to determine the number of clusters to extract.
cluster_discrimination() to determine the accuracy of cluster group classification via linear discriminant analysis (LDA).
performance::check_clusterstructure() to check suitability of data for clustering.
https://www.datanovia.com/en/lessons/
set.seed(33)
# K-Means ====================================================
rez <- cluster_analysis(iris[1:4], n = 3, method = "kmeans")
rez # Show results
predict(rez) # Get clusters
summary(rez) # Extract the centers values (can use 'plot()' on that)
if (requireNamespace("MASS", quietly = TRUE)) {
cluster_discrimination(rez) # Perform LDA
}
# Hierarchical k-means (more robust k-means)
if (require("factoextra", quietly = TRUE)) {
rez <- cluster_analysis(iris[1:4], n = 3, method = "hkmeans")
rez # Show results
predict(rez) # Get clusters
}
# Hierarchical Clustering (hclust) ===========================
rez <- cluster_analysis(iris[1:4], n = 3, method = "hclust")
rez # Show results
predict(rez) # Get clusters
# K-Medoids (pam) ============================================
if (require("cluster", quietly = TRUE)) {
rez <- cluster_analysis(iris[1:4], n = 3, method = "pam")
rez # Show results
predict(rez) # Get clusters
}
# PAM with automated number of clusters
if (require("fpc", quietly = TRUE)) {
rez <- cluster_analysis(iris[1:4], method = "pamk")
rez # Show results
predict(rez) # Get clusters
}
# DBSCAN ====================================================
if (require("dbscan", quietly = TRUE)) {
# Note that you can assimilate more outliers (cluster 0) to neighbouring
# clusters by setting borderPoints = TRUE.
rez <- cluster_analysis(iris[1:4], method = "dbscan", dbscan_eps = 1.45)
rez # Show results
predict(rez) # Get clusters
}
# Mixture ====================================================
if (require("mclust", quietly = TRUE)) {
library(mclust) # Needs the package to be loaded
rez <- cluster_analysis(iris[1:4], method = "mixture")
rez # Show results
predict(rez) # Get clusters
}