KmeansClustering: Build clusters using kmeans()

Description

This step uses k-means clustering to explore and group your data.

Usage

KmeansClustering(object, df, grainCol, labelCol, numOfClusters,
  usePCA, numOfPCA, impute, debug)

Arguments

object

of UnsupervisedModelParams class for $new() constructor

df

Dataframe whose columns are used for calculation.

grainCol

Optional. The dataframe's column that has IDs pertaining to the grain. No ID columns are truly needed for this step. If left blank, row numbers are used for identification.

labelCol

Optional. Labels are not used for clustering, but they can be used for validation. The number of clusters should be the same as the number of labels. Functions getClusterLabels() and getConfusionMatrix() are only available if labelCol is provided. Generally, supervised models are a better choice if your goal is classification.

numOfClusters

Number of clusters you want to build. If left blank, it is determined automatically from the elbow plot.

usePCA

Optional. TRUE or FALSE. Default is FALSE. If TRUE, the method uses principal components as the new features for K-means clustering. This may accelerate convergence on high-dimensional datasets.

numOfPCA

Optional. If using principal components, you may specify how many to use for K-means clustering. If left blank, the number is determined automatically from the scree (elbow) plot.

impute

Set all-column imputation to FALSE or TRUE. TRUE uses mean replacement for numeric columns and the most frequent value for factor columns. FALSE removes rows containing NULLs.
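
The imputation rule described for impute can be sketched in base R. This is an illustration only, not the package's internal code: imputeColumn is a hypothetical helper and the small data frame is made up. It assumes missing values have already been read in as NA.

```r
# Hypothetical helper (not part of healthcareai): impute a single column.
imputeColumn <- function(x) {
  if (is.numeric(x)) {
    # Numeric columns: replace NA with the column mean
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    # Factor or character columns: replace NA with the most frequent value
    counts <- table(x)
    x[is.na(x)] <- names(counts)[which.max(counts)]
  }
  x
}

# Made-up data with one missing value per column
df <- data.frame(
  age = c(30, NA, 50),
  sex = factor(c("F", "F", NA))
)

# Roughly what impute = TRUE describes: fill every column
dfImputed <- as.data.frame(lapply(df, imputeColumn))

# Roughly what impute = FALSE describes: drop rows containing any NA
dfComplete <- df[complete.cases(df), ]
```

Mean and mode replacement keep every row but shrink the variance of the imputed columns, so with many missing values it may be worth comparing both settings.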

debug

Prints extended output to the console so you can monitor the calculations as they run. Use TRUE or FALSE.

Format

An object of class R6ClassGenerator of length 24.

Methods

The above describes params for initializing a new KmeansClustering class with $new(). Individual methods are documented below.

<code>$new()</code>

Initializes a new KmeansClustering class using the parameters saved in p, documented above. This method loads, cleans, and prepares data for clustering. Usage: $new(p)

<code>$run()</code>

Calculates clusters and displays performance. Usage: $run()

<code>$get2DClustersPlot()</code>

Displays the data and assigned clusters. PCA is used to project the data onto the top two principal components for plotting. This is unrelated to variable reduction for clustering. Passing TRUE to this function displays grain IDs on the plot. Usage: $get2DClustersPlot()
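
The idea of clustering on the full feature set and using PCA only for display can be sketched with base R. This is not the package's plotting code. Scaling the features and using nstart = 10 are assumptions made for the illustration.

```r
# Base R sketch; clustering and projection are done independently.
set.seed(42)
features <- iris[, 1:4]

# Cluster on the (scaled) original features, not on the components
fit <- kmeans(scale(features), centers = 3, nstart = 10)

# Project onto the first two principal components for display only
pca <- prcomp(features, center = TRUE, scale. = TRUE)
coords <- pca$x[, 1:2]

plot(coords, col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "K-means clusters on the first two principal components")
```

Because the projection discards the remaining components, clusters that look overlapping in this view may still be well separated in the full feature space.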

<code>$getOutDf()</code>

Returns the output dataframe for writing to SQL or CSV. Usage: $getOutDf()

<code>$getConfusionMatrix()</code>

Returns a confusion matrix of assigned cluster vs. provided labels. Clusters are named based on maximum overlap with label. Only available if labelCol is specified. Rows are true labels, columns are assigned clusters. Usage: $getConfusionMatrix()
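
A comparable table can be built by hand with stats::kmeans() and table(). This is a sketch of the idea (rows as true labels, columns as clusters named by maximum overlap), not the method's actual implementation.

```r
set.seed(2017)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Rows are true labels, columns are assigned clusters
counts <- table(label = iris$Species, cluster = fit$cluster)

# Name each cluster after the label it overlaps with most
clusterNames <- rownames(counts)[apply(counts, 2, which.max)]
colnames(counts) <- clusterNames

print(counts)
```

With this simple rule, two clusters can end up with the same name if they both overlap most with one label, which is itself a sign that the clusters do not line up well with the labels.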

<code>$getElbowPlot()</code>

Plots total within-cluster error vs. number of clusters. Available if the number of clusters is unspecified. Usage: $getElbowPlot()
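
An elbow curve of this kind can be reproduced with stats::kmeans(). The sketch below assumes the error measure is the total within-cluster sum of squares (tot.withinss) and that candidate values of k run from 1 to 8. Neither choice is taken from the package source.

```r
set.seed(42)
features <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})

plot(ks, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```

The "elbow" is the value of k after which adding clusters yields only small reductions in the error.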

<code>$getScreePlot()</code>

Plots total variance explained vs. number of principal components. Available if the number of principal components is unspecified. Usage: $getScreePlot()
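
The quantity behind a scree plot can be computed from stats::prcomp(). This sketch assumes centered, scaled features and plots cumulative variance explained. The package's plot may differ in both respects.

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
varExplained <- pca$sdev^2 / sum(pca$sdev^2)
cumExplained <- cumsum(varExplained)

plot(cumExplained, type = "b", ylim = c(0, 1),
     xlab = "Number of principal components",
     ylab = "Cumulative proportion of variance explained")
```

A common heuristic is to keep the smallest number of components after which the curve flattens.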

<code>$getKmeansFit()</code>

Returns all attributes of the kmeans fit object. Usage: $getKmeansFit()

Details

This is an unsupervised method for clustering data. That is, no response variable is needed or used. If you want to examine how the data clusters by some labeled grouping, you can specify the grouping in labelCol, but the labels are not used in the clustering process. If you want to use labels to train a model, see LassoDevelopment or RandomForestDevelopment.

References

http://hctools.org/

https://github.com/bryanhanson/ChemoSpecMarkeR/blob/master/R/findElbow.R

Examples

# NOT RUN {
#### Example using Diabetes dataset ####
library(healthcareai)
ptm <- proc.time()
# Can delete this line in your work
csvfile <- system.file("extdata", 
                       "HCRDiabetesClinical.csv", 
                       package = "healthcareai")
# Replace csvfile with 'your/path'
df <- read.csv(file = csvfile, 
               header = TRUE, 
               na.strings = c("NULL", "NA", ""))
head(df)
df$PatientID <- NULL

set.seed(42)
p <- UnsupervisedModelParams$new()
p$df <- df
p$impute <- TRUE
p$grainCol <- "PatientEncounterID"
p$debug <- FALSE
p$cores <- 1
p$numOfClusters <- 3

# Run k means clustering
cl <- KmeansClustering$new(p)
cl$run()

# Get the 2D representation of the cluster solution
cl$get2DClustersPlot()

# Get the output data frame
dfOut <- cl$getOutDf()
head(dfOut) 

print(proc.time() - ptm)




#### Example using iris dataset with labels ####
ptm <- proc.time()
library(healthcareai)

data(iris)
head(iris)

set.seed(2017)

p <- UnsupervisedModelParams$new()
p$df <- iris
p$labelCol <- 'Species'
p$impute <- TRUE
p$debug <- FALSE
p$cores <- 1

# Run k means clustering
cl <- KmeansClustering$new(p)
cl$run()

# Get the 2D representation of the cluster solution
cl$get2DClustersPlot()

# Get the output data frame
dfOut <- cl$getOutDf()
head(dfOut) 

## Write to CSV (or JSON, MySQL, etc) using plain R syntax
## write.csv(dfOut,'path/clusteringresult.csv')

print(proc.time() - ptm)

# }