idaDivCluster: Hierarchical (divisive) clustering

Description

This function generates a hierarchical (divisive) clustering model based on the contents of an IDA data frame (ida.data.frame) by applying the K-means algorithm recursively.

Usage

idaDivCluster(
    data,
    id,
    distance="euclidean",
    maxiter=5,
    minsplit=5,
    maxdepth=3,
    randseed=12345,
    outtable=NULL,
    modelname=NULL
)
# S3 method for idaDivCluster
print(x, ...)
# S3 method for idaDivCluster
predict(object, newdata, id, ...)

Value

The idaDivCluster function returns an object of class idaDivCluster.

Arguments

data: An IDA data frame that contains the input data for the function. The input IDA data frame must include a column that contains a unique ID for each row.
id: The name of the column that contains a unique ID for each row of the input data.
distance: The distance function to use. Set this to "euclidean" to use the squared Euclidean distance, or to "norm_euclidean" to use the normalized Euclidean distance.
maxiter: The maximum number of iterations to perform in the base K-means clustering algorithm.
minsplit: The minimum number of instances that a cluster must contain for it to be split.
maxdepth: The maximum number of cluster levels (including leaves).
randseed: The seed for the random number generator.
outtable: The name of the output table that is to contain the results of the operation. When NULL is specified, a table name is generated automatically.
modelname: The name under which the model is stored in the database. This is the name that is specified when using functions such as idaRetrieveModel or idaDropModel.
object: An object of the class idaDivCluster to be used for prediction, that is, to be applied to new data.
x: An object of the class idaDivCluster to be printed.
newdata: An IDA data frame that contains the data to which to apply the model.
...: Additional parameters to pass to the print or predict method.

Details

The idaDivCluster clustering function builds a hierarchical clustering model by applying the K-means algorithm recursively in a top-down fashion. The hierarchy of clusters is represented as a binary tree (each parent node has exactly 2 child nodes). The leaves of the cluster tree are identified by negative numbers.
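
The top-down procedure described above can be illustrated with a small local sketch in base R. This is not the in-database implementation: divisiveSketch and splitNode are hypothetical names, stats::kmeans stands in for the base K-means step, and the way maxdepth and minsplit are applied here (root at level 1, nodes at level maxdepth become leaves) is an assumed reading of the argument descriptions.

```r
# Illustrative only: a local, in-memory sketch of divisive (bisecting) clustering.
# divisiveSketch and splitNode are hypothetical names, not part of ibmdbR.
divisiveSketch <- function(x, maxiter = 5, minsplit = 5, maxdepth = 3,
                           randseed = 12345) {
  set.seed(randseed)
  x <- as.matrix(x)
  assignment <- integer(nrow(x))  # leaf ID for every row
  nextLeaf <- 0L                  # counter used to number the leaves

  splitNode <- function(rows, depth) {
    node <- x[rows, , drop = FALSE]
    # Stop if the node is too small, too deep, or cannot be split in two
    tooSmall <- length(rows) < max(minsplit, 2)
    if (tooSmall || depth >= maxdepth || nrow(unique(node)) < 2) {
      nextLeaf <<- nextLeaf + 1L
      assignment[rows] <<- -nextLeaf  # leaves get negative IDs, as in the text
      return(invisible(NULL))
    }
    # Split the node into exactly two children with K-means (k = 2)
    km <- suppressWarnings(kmeans(node, centers = 2, iter.max = maxiter))
    splitNode(rows[km$cluster == 1], depth + 1L)
    splitNode(rows[km$cluster == 2], depth + 1L)
  }

  splitNode(seq_len(nrow(x)), 1L)
  assignment
}

# Usage on the local iris data set (numeric columns only)
labels <- divisiveSketch(iris[, 1:4])
table(labels)
```

With maxdepth=3 this sketch yields at most four leaves, because each of the two levels below the root can double the number of nodes.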

Models are stored persistently in the database under the name modelname. Model names cannot be longer than 64 characters and cannot contain whitespace. They must be quoted like table names; otherwise they are converted to upper case by default. Only one model with a given name can exist in the database at a time. If a model named modelname already exists, you must drop it with idaDropModel before you can create another one with the same name. The model name can be used to retrieve the model later with idaRetrieveModel.
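
The naming rules above can be checked locally before a model is created. The following is a minimal sketch; isValidModelName is a hypothetical helper, not an ibmdbR function, and it counts the characters of the name exactly as given (whether surrounding quotes count toward the 64-character limit is not stated here).

```r
# isValidModelName is a hypothetical helper, not part of ibmdbR.
# It checks the two documented rules: at most 64 characters, no whitespace.
isValidModelName <- function(name) {
  is.character(name) && length(name) == 1 &&
    nchar(name) >= 1 && nchar(name) <= 64 &&
    !grepl("[[:space:]]", name)
}

isValidModelName("DivClusterMODEL")  # TRUE
isValidModelName("my model")         # FALSE: contains a space
```

The check does not contact the database, so it cannot tell whether a model with that name already exists.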

The output of the print function for an idaDivCluster object is:

A vector containing a list of centers
A vector containing a list of cluster sizes
A vector containing a list of the number of elements in each cluster
A data frame or the name of the table containing the calculated cluster assignments
The within-cluster sum of squares (which indicates cluster density)
The names of the slots that are available in the idaDivCluster object.

Examples

if (FALSE) {

#Create an IDA data frame
idf <- ida.data.frame("IRIS")

#Create a DivCluster model stored in the database as DivClusterMODEL
dcm <- idaDivCluster(idf, id="ID", modelname="DivClusterMODEL")

#Print the model
print(dcm)

#Apply the model to the data to obtain cluster assignments
pred <- predict(dcm, idf, id="ID")

#Inspect the predictions
head(pred)

}