KODAMA (version 3.0)

KODAMA.matrix: Knowledge Discovery by Accuracy Maximization

Description

KODAMA (KnOwledge Discovery by Accuracy MAximization) is an unsupervised and semi-supervised learning algorithm that performs feature extraction from noisy and high-dimensional data.

Usage

KODAMA.matrix(data,
              spatial = NULL,
              samples = NULL,
              M = 100, Tcycle = 20,
              FUN = c("fastpls", "simpls"),
              ncomp = min(c(50, ncol(data))),
              W = NULL, metrics = "euclidean",
              constrain = NULL, fix = NULL, landmarks = 10000,
              splitting = ifelse(nrow(data) < 40000, 100, 300),
              spatial.resolution = 0.3,
              simm_dissimilarity_matrix = FALSE,
              seed = 1234)

Value

The function returns a list with the following items:

dissimilarity

a dissimilarity matrix.

acc

a vector with the M cross-validated accuracies.

proximity

a proximity matrix.

v

a matrix containing all classifications obtained maximizing the cross-validation accuracy.

res

a matrix containing all classification vectors obtained through maximizing the cross-validation accuracy.

knn_Rnanoflann

dissimilarity matrix used as input for the KODAMA.visualization function.

data

original data.

res_constrain

the constraints used.
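As a minimal sketch of how the returned list might be inspected, the helper below reports which of the components documented above are present in a result object, and their class. The function `describe_kodama_result` is hypothetical and not part of the package. It is demonstrated on a mock list so that it runs without KODAMA installed; the component names are taken from this page, while the shapes of the mock entries are assumptions. Some components may be absent in a real result (for instance, this page suggests that the dissimilarity matrix depends on simm_dissimilarity_matrix).

```r
# Component names as documented in the Value section above.
documented <- c("dissimilarity", "acc", "proximity", "v", "res",
                "knn_Rnanoflann", "data", "res_constrain")

# Hypothetical helper (not part of KODAMA): for each documented component,
# report whether it is present in `kk` and, if so, its first class.
describe_kodama_result <- function(kk) {
  present <- documented %in% names(kk)
  cls <- vapply(documented, function(nm) {
    if (nm %in% names(kk)) class(kk[[nm]])[1] else NA_character_
  }, character(1))
  data.frame(component = documented, present = present, class = cls,
             row.names = NULL, stringsAsFactors = FALSE)
}

# Mock object standing in for the output of KODAMA.matrix() (assumed shapes).
mock <- list(acc = c(0.91, 0.95, 0.93), data = as.matrix(iris[, -5]))
info <- describe_kodama_result(mock)
print(info)
```

With a real result, the same call would be `describe_kodama_result(kk)`, where `kk` is the list returned by KODAMA.matrix.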

Arguments

data

A numeric matrix where rows are samples and columns are variables.

spatial

Optional matrix of spatial coordinates or NULL. Used to apply spatial constraints.

samples

An optional vector indicating the identity for each sample. Can be used to guide the integration of prior sample-level information.

M

Number of times the iterative process is repeated (see Details).

Tcycle

Number of cycles to optimize cross-validated accuracy.

FUN

Classifier to be used. Options are "fastpls" or "simpls".

ncomp

Number of components for the PLS classifier. Default is min(50, ncol(data)).

W

A vector of initial class labels for each sample (length = nrow(data)). Defaults to unique labels for each sample if NULL.

metrics

Distance metric to be used (default is "euclidean").

constrain

An optional vector indicating group constraints. Samples sharing the same value in this vector will be forced to stay in the same cluster.

fix

A logical vector indicating whether each sample's label in W should be fixed during optimization. Defaults to all FALSE.

landmarks

Number of landmark points used to approximate the similarity structure. The default is 10000.

splitting

Number of random sample splits used during optimization. The default is 100 for small datasets (<40000 samples) and 300 otherwise.

spatial.resolution

A numeric value (default 0.3) controlling the resolution of spatial constraints.

simm_dissimilarity_matrix

Logical. If TRUE, the function returns a similarity/dissimilarity matrix. Default is FALSE.

seed

Random seed for reproducibility. The default is 1234.
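The default expressions and the label-related arguments above can be illustrated with base R alone. The sketch below evaluates the defaults for ncomp and splitting on the iris data and builds example W, fix, and constrain vectors. Which samples are labeled and how samples are paired are illustrative assumptions, not package defaults. The final call is left commented out because it requires the KODAMA package.

```r
data <- as.matrix(iris[, -5])   # rows are samples, columns are variables

# Default expressions from the Usage section, evaluated for this dataset.
ncomp_default     <- min(c(50, ncol(data)))
splitting_default <- ifelse(nrow(data) < 40000, 100, 300)

# Illustrative semi-supervised setup (an assumption made for this example):
# the first 10 samples of each species are treated as having known labels.
known <- rep(c(rep(TRUE, 10), rep(FALSE, 40)), times = 3)

# W: known samples take their species code (1..3); every other sample gets
# its own unique label, mirroring the documented default for W = NULL.
W <- ifelse(known, as.integer(iris$Species), 3L + seq_len(nrow(data)))

# fix: keep the known labels unchanged during optimization.
fix <- known

# constrain: consecutive pairs of samples share a value, so each pair
# would be forced to stay in the same cluster.
constrain <- rep(seq_len(nrow(data) / 2), each = 2)

# Requires the KODAMA package:
# kk <- KODAMA.matrix(data, W = W, fix = fix, constrain = constrain,
#                     ncomp = ncomp_default)
```

For iris, the default ncomp is limited by the four available variables, and the dataset is far below the 40000-sample threshold that switches splitting to 300.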

Author

Stefano Cacciatore and Leonardo Tenori

Details

KODAMA consists of five steps, which can be divided into two parts: (i) the maximization of cross-validated accuracy by an iterative process (steps I and II), resulting in the construction of a proximity matrix (step III), and (ii) the definition of a dissimilarity matrix (steps IV and V). The first part embodies the core idea of KODAMA: partitioning the data guided by the maximization of cross-validated accuracy. At the beginning of this part, a fraction of the total samples (defined by FUN_SAM) is randomly selected from the original data. The whole iterative process (steps I-III) is repeated M times to average out the effects of the randomness of the iterative procedure; each repetition selects a different fraction of samples. The second part collects and processes these results by constructing a dissimilarity matrix that provides a holistic view of the data while maintaining their intrinsic structure. The KODAMA.visualization function is then used to visualise the KODAMA dissimilarity matrix.
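The idea behind the first part, relabeling samples so that cross-validated accuracy does not decrease, can be sketched as a toy loop in base R. This is not the package's implementation: it uses a nearest-centroid classifier as a stand-in for the PLS classifiers, a fixed 10-fold split, a single run rather than M repetitions, and an assumed proposal rule (move a random half of the misclassified samples to their predicted class). The function `cv_predict` is a hypothetical helper defined here for illustration.

```r
set.seed(1234)
X <- scale(as.matrix(iris[, -5]))
n <- nrow(X)

# Cross-validated predictions with a nearest-centroid classifier
# (a simple stand-in; KODAMA.matrix itself offers "fastpls" or "simpls").
cv_predict <- function(X, labels, folds) {
  pred <- labels
  for (f in unique(folds)) {
    test    <- folds == f
    Xtrain  <- X[!test, , drop = FALSE]
    Xtest   <- X[test, , drop = FALSE]
    ytrain  <- labels[!test]
    classes <- unique(ytrain)
    # One centroid per class present in the training fold (rows = classes).
    centroids <- t(vapply(classes,
                          function(cl) colMeans(Xtrain[ytrain == cl, , drop = FALSE]),
                          numeric(ncol(X))))
    # Squared Euclidean distance from each test sample to each centroid.
    d <- outer(rowSums(Xtest^2), rowSums(centroids^2), "+") -
      2 * Xtest %*% t(centroids)
    pred[test] <- classes[max.col(-d, ties.method = "first")]
  }
  pred
}

folds  <- sample(rep(1:10, length.out = n))  # fixed 10-fold assignment
labels <- seq_len(n)                         # every sample starts in its own class
pred   <- cv_predict(X, labels, folds)
acc    <- mean(pred == labels)
acc_trace <- acc

for (cycle in 1:20) {                        # plays the role of Tcycle
  wrong <- which(pred != labels)
  if (length(wrong) == 0) break
  # Proposal: move a random subset of misclassified samples to the class
  # predicted for them in cross-validation.
  pick <- wrong[runif(length(wrong)) < 0.5]
  proposal <- labels
  proposal[pick] <- pred[pick]
  new_pred <- cv_predict(X, proposal, folds)
  new_acc  <- mean(new_pred == proposal)
  if (new_acc >= acc) {                      # accept only if accuracy does not drop
    labels <- proposal
    pred   <- new_pred
    acc    <- new_acc
  }
  acc_trace <- c(acc_trace, acc)
}

print(acc_trace)
print(length(unique(labels)))                # number of classes remaining
```

Because a proposal is accepted only when the cross-validated accuracy does not decrease, `acc_trace` is non-decreasing by construction, and the number of distinct classes can only shrink as samples are merged into existing classes.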

References

Abdel-Shafy EA, Kassim M, Vignol A, et al.
KODAMA enables self-guided weakly supervised learning in spatial transcriptomics.
bioRxiv 2025. doi: 10.1101/2025.05.28.656544.

Cacciatore S, Luchinat C, Tenori L
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-5122. doi: 10.1073/pnas.1220873111.

Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.

van der Maaten LJP, Hinton GE
Visualizing High-Dimensional Data Using t-SNE.
Journal of Machine Learning Research 2008;9(Nov):2579-2605.

van der Maaten LJP
Learning a Parametric Embedding by Preserving Local Structure.
In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 2009;5:384-391.

McInnes L, Healy J, Melville J
UMAP: Uniform manifold approximation and projection for dimension reduction.
arXiv preprint arXiv:1802.03426. 2018 Feb 9.

See Also

KODAMA.visualization

Examples

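# \donttest{

 library(KODAMA)

 data(iris)
 data <- as.matrix(iris[, -5])   # numeric matrix: rows are samples, columns are variables
 labels <- iris[, 5]             # species, used only to colour the plot
 kk <- KODAMA.matrix(data, ncomp = 2)
 cc <- KODAMA.visualization(kk, "t-SNE")
 plot(cc, col = as.numeric(labels), cex = 2)

# }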