lares (version 4.8.4)

clusterKmeans: Automated K-Means Clustering + PCA

Description

This function clusters a whole data.frame automatically. The goal of kmeans is to group data points into distinct, non-overlapping subgroups. If needed, one-hot encoding is applied to categorical values automatically. Points to consider: scale/standardize the data when applying kmeans. Also, kmeans assumes roughly spherical clusters and doesn't work well when clusters have other shapes, such as elliptical clusters.
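The preprocessing described above can be sketched in base R. This is only an illustration of one-hot encoding and standardization before kmeans, not the code lares runs internally; the toy data frame and its values are made up for the example.

```r
# Toy data with one numeric and one categorical column (hypothetical values).
toy <- data.frame(
  age = c(22, 38, 26, 35, 54, 2),
  sex = factor(c("male", "female", "female", "female", "male", "male"))
)

# One-hot encode the factor. With a single factor column, dropping the
# intercept ("- 1") gives every level its own 0/1 column.
encoded <- model.matrix(~ . - 1, data = toy)

# Standardize each column to mean 0 and standard deviation 1 so that no
# variable dominates the Euclidean distances used by kmeans.
scaled <- scale(encoded)

# Base R kmeans on the prepared matrix; the seed makes the random
# initialization reproducible.
set.seed(123)
fit <- kmeans(scaled, centers = 2, nstart = 10)
fit$cluster
```

With `ohse = TRUE` and `norm = TRUE` (the defaults shown under Usage), clusterKmeans is documented to handle the encoding and normalization steps for you.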

Usage

clusterKmeans(
  df,
  k = NA,
  limit = 20,
  drop_na = TRUE,
  ignore = NA,
  ohse = TRUE,
  norm = TRUE,
  comb = c(1, 2),
  seed = 123
)

Arguments

df

Dataframe

k

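Integer. Number of clusters. With the default NA, no k is fixed and the output can be used to look for a suitable k (see Examples).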

limit

Integer. How many clusters should be considered?

drop_na

Boolean. Should NA rows be removed?

ignore

Character vector. Which columns should be excluded when calculating kmeans?

ohse

Boolean. Do you wish to automatically apply one-hot encoding to non-numerical columns?

norm

Boolean. Should the data be normalized?

comb

Vector. Which columns do you wish to plot? Select two variables by name or column position.

seed

Numeric. Seed for reproducibility
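To make the roles of limit and seed concrete, here is a minimal base R sketch of the common "elbow" approach: fit kmeans for each candidate number of clusters up to a limit and compare the total within-cluster sum of squares. It assumes the built-in iris data and is not necessarily how clusterKmeans computes its own diagnostics.

```r
# Numeric columns of the built-in iris data, standardized.
x <- scale(iris[, 1:4])

limit <- 10    # largest number of clusters to consider (illustrative value)
set.seed(123)  # fixes the random starts so results are reproducible

# Total within-cluster sum of squares for k = 1, ..., limit.
wss <- sapply(seq_len(limit), function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# Look for the k after which the decrease in wss levels off.
data.frame(k = seq_len(limit), wss = round(wss, 1))
```

Plotting wss against k (for example with plot(seq_len(limit), wss, type = "b")) gives the usual elbow curve.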

See Also

Other Machine Learning: ROC(), conf_mat(), export_results(), gain_lift(), h2o_automl(), h2o_predict_API(), h2o_predict_MOJO(), h2o_predict_binary(), h2o_predict_model(), h2o_results(), h2o_selectmodel(), impute(), iter_seeds(), lasso_vars(), model_metrics(), msplit()

Examples

# NOT RUN {
options("lares.font" = NA) # Temporary setting
data(dft) # Titanic dataset
df <- subset(dft, select = -c(Ticket, PassengerId))

# Find optimal k
check_k <- clusterKmeans(df)
check_k$nclusters_plot

# Run with selected k
clusters <- clusterKmeans(df, k = 3)
names(clusters)

# Cross-Correlations for each cluster
plot(clusters$correlations)

# PCA Results
plot(clusters$PCA$plotVarExp)
plot(clusters$PCA$plot_1_2)

# }
# NOT RUN {
# 3D interactive plot
clusters$PCA$plot_1_2_3
# }
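
For readers who want to see what the PCA part of the output represents, the following base R sketch computes principal components and the share of variance each explains, then attaches kmeans cluster labels to the first two components. It uses the built-in iris data and prcomp as stand-ins; the object names are illustrative and this is not the lares implementation.

```r
# Numeric columns of the built-in iris data.
x <- iris[, 1:4]

# PCA on centered and scaled variables.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
var_exp <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_exp), 3)

# Cluster the standardized data and pair the labels with PC1 and PC2,
# which is the kind of data a two-component cluster plot would show.
set.seed(123)
fit <- kmeans(scale(x), centers = 3, nstart = 10)
scores <- data.frame(pca$x[, 1:2], cluster = factor(fit$cluster))
head(scores)
```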
