
KODAMA (version 1.6)

KODAMA: Knowledge Discovery by Accuracy Maximization

Description

KODAMA (KnOwledge Discovery by Accuracy MAximization) is an unsupervised and semi-supervised learning algorithm that performs feature extraction from noisy and high-dimensional data. Unlike other data mining methods, KODAMA is driven by an integrated procedure of cross-validation of the results.

Usage

KODAMA(data,
       M = 100,
       Tcycle = 20,
       FUN_VAR = function(x) { ceiling(ncol(x)) },
       FUN_SAM = function(x) { ceiling(nrow(x) * 0.75) },
       bagging = FALSE,
       FUN = c("PLS-DA", "KNN"),
       f.par = 5,
       W = NULL,
       constrain = NULL,
       fix = NULL,
       epsilon = 0.05,
       dims = 2,
       landmarks = 5000)

Arguments

data

a matrix.

M

number of times the iterative process (steps I-III) is repeated.

Tcycle

number of iterative cycles that lead to the maximization of the cross-validated accuracy.

FUN_VAR

function to select the number of variables to sample randomly. By default, all variables are used.

FUN_SAM

function to select the number of samples to draw randomly. By default, 75 per cent of all samples are used.

bagging

logical. If bagging = TRUE, sampling is performed with replacement. By default, bagging = FALSE.

FUN

classifier to be considered. Choices are "PLS-DA" and "KNN".

f.par

parameters of the classifier.

W

a vector of nrow(data) elements. The KODAMA procedure can be started by different initializations of the vector W. Without any a priori information, the vector W can be initialized with each element being different from the others (i.e., each sample categorized in a one-element class). Alternatively, the vector W can be initialized by a clustering procedure, such as kmeans (see the sketch at the end of this section).

constrain

a vector of nrow(data) elements. Supervised constraints can be imposed by linking some samples in such a way that if one of them is changed, the remaining linked samples must change in the same way (i.e., they are forced to belong to the same class) during the maximization of the cross-validation accuracy procedure. Samples with the same value in constrain will be forced to stay together.

fix

a vector of nrow(data) elements. The values of this vector must be TRUE or FALSE. By default, all elements are FALSE. Samples whose fix value is TRUE will keep the class label defined in W during the maximization of the cross-validation accuracy procedure (see the sketch at the end of this section).

epsilon

cut-off value for low proximities. High proximities are typical of intracluster relationships, whereas low proximities are expected for intercluster relationships. Proximities between samples below this cut-off are ignored; by default, epsilon = 0.05.

dims

number of dimensions of the configuration produced by Sammon's non-linear mapping of the KODAMA dissimilarity matrix.

landmarks

number of landmark samples to use; landmarks reduce the computational cost on large data sets.
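
The following sketch is not part of the original documentation; it illustrates how W, constrain and fix can be combined in a single call, under the argument descriptions above. The particular initialization, constraint and fixed labels are invented for illustration, and a custom FUN_VAR is included only to show that the variable-sampling function can be replaced by any function of the data matrix.

library(KODAMA)

data(iris)
x <- as.matrix(iris[, -5])
n <- nrow(x)

# Hypothetical starting labels: a rough k-means partition instead of the
# default one-sample-per-class initialization.
W_init <- kmeans(x, centers = 10)$cluster

# Hypothetical constraint: samples 1-10 are assumed to be replicates of the
# same specimen, so they are forced to share a class label.
constrain <- 1:n
constrain[1:10] <- 1

# Hypothetical supervision: the labels of the first 20 samples are kept fixed.
fix <- rep(FALSE, n)
fix[1:20] <- TRUE

kk <- KODAMA(x,
             W = W_init,
             constrain = constrain,
             fix = fix,
             FUN = "KNN",
             f.par = 5,
             # hypothetical tweak: sample about sqrt(p) variables per iteration
             FUN_VAR = function(x) ceiling(sqrt(ncol(x))))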

Value

The function returns a list containing the following components (see the sketch after this list):

dissimilarity

a dissimilarity matrix.

acc

a vector with the M cross-validated accuracies.

proximity

a proximity matrix.

v

a matrix containing the classifications obtained by maximizing the cross-validation accuracy.

pp

a matrix containing the scores of Sammon's non-linear mapping.

res

a matrix containing all classification vectors obtained by maximizing the cross-validation accuracy.

f.par

parameters of the classifier.

entropy

Shannon's entropy of the KODAMA proximity matrix.

landpoints

indexes of the landmarks used.
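
A brief sketch, not part of the original documentation, of how the returned components might be inspected after a run; it uses the same iris data as the Examples section, and the exact shape of the dissimilarity and proximity matrices may depend on the landmark procedure.

library(KODAMA)

data(iris)
kk <- KODAMA(as.matrix(iris[, -5]), FUN = "KNN")

summary(kk$acc)        # cross-validated accuracy of each of the M iterations
kk$entropy             # Shannon's entropy of the KODAMA proximity matrix
dim(kk$dissimilarity)  # dissimilarity matrix
dim(kk$proximity)      # proximity matrix
head(kk$pp)            # low-dimensional Sammon configuration used for plotting
head(kk$res)           # classification vectors from the accuracy maximization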

Details

KODAMA consists of five steps. These can in turn be divided into two parts: (i) the maximization of cross-validated accuracy by an iterative process (steps I and II), resulting in the construction of a proximity matrix (step III), and (ii) the definition of a dissimilarity matrix (steps IV and V). The first part contains the core idea of KODAMA, that is, the partitioning of the data guided by the maximization of the cross-validated accuracy. At the beginning of this part, a fraction of the total samples (defined by FUN_SAM) is randomly selected from the original data. The whole iterative process (steps I-III) is repeated M times to average out the effects of the randomness of the iterative procedure; each time this part is repeated, a different fraction of samples is selected. The second part collects and processes these results by constructing a dissimilarity matrix that provides a holistic view of the data while maintaining their intrinsic structure (steps IV and V). Finally, Sammon's non-linear mapping is used to visualise the KODAMA dissimilarity matrix.
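
As an illustration of the final mapping step, the sketch below (not taken from the package) recomputes a three-dimensional configuration directly from the returned dissimilarity matrix with MASS::sammon(). It assumes the matrix is symmetric with strictly positive off-diagonal entries, which sammon() requires; note that KODAMA already exposes the same functionality through its dims argument.

library(KODAMA)
library(MASS)

data(iris)
kk <- KODAMA(as.matrix(iris[, -5]), FUN = "KNN")

# Manual re-computation of Sammon's mapping in three dimensions; KODAMA
# itself returns a dims-dimensional configuration in kk$pp.
d <- as.dist(kk$dissimilarity)
map3d <- sammon(d, k = 3)
head(map3d$points)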

References

Cacciatore S, Luchinat C, Tenori L. Knowledge discovery by accuracy maximization. Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.

Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA. KODAMA: an updated R package for knowledge discovery and data mining. Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.

Examples

library(KODAMA)

data(iris)
data <- iris[, -5]
labels <- iris[, 5]

kk <- KODAMA(data, FUN = "KNN")
plot(kk$pp, col = as.numeric(labels),
     xlab = "First component", ylab = "Second component", cex = 2)
