cluster: Cluster model

Description

Build a cluster model that predicts the algorithm to use based on the features of the problem.

Usage

cluster(clusterer = NULL, data = NULL,
    bestBy = "performance",
    pre = function(x, y=NULL) { list(features=x) })

Arguments

clusterer

the clustering function to use. Must accept a data frame with features. Return value should be a structure that can be given to predict along with new data. See examples.

The argument can also be a list of such clusterers.

data

the data to use with training and test sets. The structure returned by trainTest or cvFolds.

bestBy

the criterion by which to determine the best algorithm in a cluster. One of "performance", "count", or "successes". Optional. Defaults to "performance".

pre

a function to preprocess the data. Currently only normalize. Optional. Does nothing by default.
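
The contract described for clusterer (a function that accepts a data frame of features and returns something predict can be applied to with new data) can be illustrated with a small base R sketch. The names simpleKMeans and predict.simpleKMeans are hypothetical and not part of the package, and the assumption that predict should return one integer cluster label per row is not confirmed by this page.

```r
# Hypothetical clusterer: wraps stats::kmeans and tags the result with an
# S3 class so that predict() can dispatch on it.
simpleKMeans <- function(x, k = 2) {
    fit <- stats::kmeans(as.matrix(x), centers = k, nstart = 10)
    structure(list(centers = fit$centers), class = "simpleKMeans")
}

# predict method: assign each new row to the nearest learned center
# (squared Euclidean distance). Returning one integer label per row is an
# assumption about what the caller needs.
predict.simpleKMeans <- function(object, newdata, ...) {
    m <- as.matrix(newdata)
    d <- sapply(seq_len(nrow(object$centers)), function(j) {
        center <- matrix(object$centers[j, ], nrow(m), ncol(m), byrow = TRUE)
        rowSums((m - center)^2)
    })
    d <- matrix(d, nrow = nrow(m))   # keep matrix shape for a single row
    max.col(-d, ties.method = "first")
}

# Toy feature data with two well-separated groups.
set.seed(1)
features <- data.frame(f1 = c(0, 0.1, 0.2, 5, 5.1, 5.2),
                       f2 = c(0, 0.2, 0.1, 5, 5.2, 5.1))
model <- simpleKMeans(features, k = 2)
labels <- predict(model, features)
```

If the assumption about the return value holds, such a function could be passed as clusterer=simpleKMeans in the same way as the anonymous kcca wrapper in the examples.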

Value

predictions

a list of lists of data frames with the predictions for each test set. Each data frame has columns algorithm and score and is sorted according to preference, with the most preferred algorithm first. The score corresponds to the cumulative performance value for the respective algorithm in the cluster the instance was assigned to. That is, if bestBy is "performance", it is the sum of the performance over all training instances. If bestBy is "count", the score corresponds to the number of training instances that the respective algorithm was the best on, and if it is "successes" it corresponds to the number of training instances solved. If more than one clustering algorithm is used, the score corresponds to the sum of all instances across all clusterers. If stacking is used, each data frame contains simply the best algorithm with a score of 1.

predictor

a function that encapsulates the model learned on the entire data set. Can be called with data for the same features with the same feature names as the training data to obtain predictions.

models

the list of models trained on the entire data set. This is meant for debugging/inspection purposes and does not include any models used to combine predictions of individual models.

Details

cluster takes data and processes it using pre (if supplied). clusterer is then called to cluster the data. For each cluster, the best algorithm is identified according to the criterion given in bestBy. If bestBy is "performance", the best algorithm is the one with the best overall performance across all instances in the cluster. If it is "count", the best algorithm is the one that has the best performance on the largest number of instances. If it is "successes", the best algorithm is the one with the highest number of successes across all instances in the cluster. The learned model is used to assign the test data to clusters and predict algorithms accordingly.
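
The three bestBy criteria can be sketched on toy data for a single cluster. This is an illustration in base R, not the package's implementation: the helper scoreCluster, the algorithm names, and the numbers are all hypothetical, and higher performance values are assumed to be better.

```r
# Toy performance values for four training instances in one cluster
# (rows = instances, columns = algorithms; higher is assumed to be better).
perf <- data.frame(algoA = c(10, 1, 1, 1),
                   algoB = c(3, 3, 3, 0),
                   algoC = c(2, 2, 2, 2))
# Toy success flags for the same instances and algorithms.
succ <- data.frame(algoA = c(TRUE, FALSE, FALSE, FALSE),
                   algoB = c(TRUE, TRUE, TRUE, FALSE),
                   algoC = c(TRUE, TRUE, TRUE, TRUE))

# Hypothetical helper: score every algorithm within one cluster and return
# a data frame with columns algorithm and score, most preferred first.
scoreCluster <- function(perf, succ, bestBy = "performance") {
    scores <- switch(bestBy,
        # sum of performance over all instances in the cluster
        performance = colSums(as.matrix(perf)),
        # number of instances on which each algorithm performed best
        count = {
            winner <- max.col(as.matrix(perf), ties.method = "first")
            setNames(tabulate(winner, nbins = ncol(perf)), names(perf))
        },
        # number of instances each algorithm solved
        successes = colSums(as.matrix(succ)),
        stop("unknown bestBy: ", bestBy))
    scores <- sort(scores, decreasing = TRUE)
    data.frame(algorithm = names(scores), score = as.numeric(scores),
               stringsAsFactors = FALSE)
}

byPerformance <- scoreCluster(perf, succ, "performance")
byCount <- scoreCluster(perf, succ, "count")
bySuccesses <- scoreCluster(perf, succ, "successes")
```

On this toy data each criterion prefers a different algorithm: algoA has the largest summed performance, algoB is best on the most instances, and algoC solves the most instances.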

The evaluation across the training and test sets will be parallelized automatically if a suitable backend for parallel computation is loaded.

If a list of clusterers is supplied in clusterer, ensemble clustering is performed. That is, the models are trained and used to make predictions independently. For each instance, the final prediction is determined by majority vote of the predictions of the individual models -- the class that occurs most often is chosen. If the list given as clusterer contains a member .combine that is a function, it is assumed to be a classifier with the same properties as classifiers given to classify and will be used to combine the ensemble predictions instead of majority voting.
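
Majority voting over the predictions of independently trained models can be sketched as follows. The helper majorityVote and the prediction vectors are hypothetical. Ties are broken alphabetically here, which is an assumption, since this page does not state how ties are resolved.

```r
# Hypothetical predictions from three independently trained models,
# one predicted algorithm per test instance.
ensemblePredictions <- list(
    c("algoA", "algoB", "algoC"),
    c("algoA", "algoB", "algoA"),
    c("algoB", "algoB", "algoA"))

# For each instance, choose the class predicted most often across models.
majorityVote <- function(predictions) {
    votes <- do.call(cbind, predictions)   # rows = instances, cols = models
    apply(votes, 1, function(v) {
        counts <- table(v)                 # table() orders names alphabetically
        names(counts)[which.max(counts)]   # first maximum wins on ties
    })
}

final <- majorityVote(ensemblePredictions)
```

When a .combine function is supplied, this voting step is replaced by a classifier trained on the individual predictions, as described above.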

Examples

library(RWeka)

data(satsolvers)
trainTest = cvFolds(satsolvers)

res = cluster(clusterer=XMeans, data=trainTest, pre=normalize)
# the total number of successes
sum(successes(trainTest, res))
# predictions on the entire data set
res$predictor(subset(satsolvers$data, TRUE, satsolvers$features))

# determine best by number of successes
res = cluster(clusterer=XMeans, data=trainTest, bestBy="successes", pre=normalize)
sum(successes(trainTest, res))

library(flexclust)
res = cluster(clusterer=function(x) { kcca(x, length(satsolvers$performance)) },
    data=trainTest, pre=normalize)

# ensemble clustering
rese = cluster(clusterer=list(XMeans, make_Weka_clusterer("weka/clusterers/EM"),
    function(x) { kcca(x, length(satsolvers$performance)) }),
    data=trainTest, pre=normalize)

# ensemble clustering with a classifier to combine predictions
rese = cluster(clusterer=list(XMeans, make_Weka_clusterer("weka/clusterers/EM"),
    function(x) { kcca(x, length(satsolvers$performance)) }, .combine=J48),
    data=trainTest, pre=normalize)
