KODAMA (KnOwledge Discovery by Accuracy MAximization) is an unsupervised and semi-supervised learning algorithm that performs feature extraction from noisy and high-dimensional data.
KODAMA.matrix(data,
  spatial = NULL,
  samples = NULL,
  M = 100, Tcycle = 20,
  FUN = c("fastpls", "simpls"),
  ncomp = min(c(50, ncol(data))),
  W = NULL, metrics = "euclidean",
  constrain = NULL, fix = NULL, landmarks = 10000,
  splitting = ifelse(nrow(data) < 40000, 100, 300),
  spatial.resolution = 0.3,
  simm_dissimilarity_matrix = FALSE,
  seed = 1234)
The function returns a list containing the following items (a quick way to inspect them is sketched below):
a dissimilarity matrix.
a vector with the M cross-validated accuracies.
a proximity matrix.
a matrix containing all classification vectors obtained by maximizing the cross-validation accuracy.
the dissimilarity matrix used as input for the KODAMA.visualization function.
the original data.
the constraints used.
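Since the component names may differ between package versions, the minimal sketch below makes no assumption about them and uses only base R to list and inspect the returned object:

library(KODAMA)

# Minimal sketch: inspect the components of a KODAMA.matrix result.
# No element names are assumed; names() and str() reveal what the
# installed version of the package actually returns.
data(iris)
kk <- KODAMA.matrix(iris[, -5], ncomp = 2)

names(kk)               # component names of the returned list
str(kk, max.level = 1)  # dimensions and types of each component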
data: A numeric matrix where rows are samples and columns are variables.
spatial: An optional matrix of spatial coordinates, or NULL. Used to apply spatial constraints.
samples: An optional vector indicating the identity of each sample. Can be used to guide the integration of prior sample-level information.
M: Number of iterative processes.
Tcycle: Number of cycles used to optimize the cross-validated accuracy.
FUN: Classifier to be used. Options are "fastpls" or "simpls".
ncomp: Number of components for the PLS classifier. Default is min(50, ncol(data)).
W: A vector of initial class labels for each sample (length = nrow(data)). Defaults to a unique label for each sample if NULL. A semi-supervised usage sketch combining W and fix is given after this list.
metrics: Distance metric to be used (default is "euclidean").
constrain: An optional vector indicating group constraints. Samples sharing the same value in this vector are forced to stay in the same cluster.
fix: A logical vector indicating whether each sample's label in W should be fixed during optimization. Defaults to all FALSE.
landmarks: Number of landmark points used to approximate the similarity structure. The default is 10000.
splitting: Number of random sample splits used during optimization. The default is 100 for small datasets (< 40000 samples) and 300 otherwise.
spatial.resolution: A numeric value (default 0.3) controlling the resolution of the spatial constraints.
simm_dissimilarity_matrix: Logical. If TRUE, the function returns a similarity/dissimilarity matrix. Default is FALSE.
seed: Random seed for reproducibility. The default is 1234.
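As referenced in the W entry above, the following is a hedged sketch of a semi-supervised call assembled from the argument descriptions; it is not taken from the package's own examples, and fixing 15 randomly chosen iris labels is purely illustrative:

library(KODAMA)

# Hedged sketch: semi-supervised use of W and fix, based only on the
# argument descriptions above. A random subset of samples starts from its
# known class and is kept fixed; all other samples keep default-style
# unique labels and may be re-assigned during the optimization.
data(iris)
x <- as.matrix(iris[, 1:4])
n <- nrow(x)

W     <- seq_len(n)              # unique label per sample (default behaviour)
fix   <- rep(FALSE, n)           # nothing fixed yet
known <- sample(seq_len(n), 15)  # pretend the class of 15 samples is known
W[known]   <- n + as.numeric(iris$Species)[known]  # class codes distinct from the unique labels
fix[known] <- TRUE

kk <- KODAMA.matrix(x, ncomp = 2, W = W, fix = fix)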
Stefano Cacciatore and Leonardo Tenori
KODAMA consists of five steps. These can in turn be divided into two parts: (i) the maximization of cross-validated accuracy by an iterative process (steps I and II), resulting in the construction of a proximity matrix (step III), and (ii) the definition of a dissimilarity matrix (steps IV and V). The first part entails the core idea of KODAMA, that is, the partitioning of data guided by the maximization of the cross-validated accuracy. At the beginning of this part, a fraction of the total samples (defined by landmarks) is randomly selected from the original data. The whole iterative process (steps I-III) is repeated M times to average out the effects owing to the randomness of the iterative procedure. Each time this part is repeated, a different fraction of samples is selected. The second part aims at collecting and processing these results by constructing a dissimilarity matrix that provides a holistic view of the data while maintaining their intrinsic structure (steps IV and V). The KODAMA.visualization function is then used to visualise the KODAMA dissimilarity matrix.
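As a purely conceptual illustration of the first part, the sketch below relabels samples with their leave-one-out 1-nearest-neighbour predictions until the cross-validated accuracy stops improving. It is not the package's internal implementation (which uses a PLS classifier on landmark samples), and maximize_cv_accuracy is a hypothetical helper:

library(class)

# Conceptual sketch only: the core KODAMA idea of relabelling samples so
# that the cross-validated accuracy of a simple classifier increases.
# class::knn.cv (leave-one-out 1-NN) stands in for the PLS classifier
# used by KODAMA.matrix.
maximize_cv_accuracy <- function(x, labels, cycles = 20) {
  acc <- 0
  for (i in seq_len(cycles)) {
    pred   <- knn.cv(x, cl = labels, k = 1)  # leave-one-out predictions
    acc    <- mean(pred == labels)           # cross-validated accuracy
    labels <- pred                           # adopt the predicted labels
    if (acc == 1) break                      # labelling is self-consistent
  }
  list(labels = labels, accuracy = acc)
}

set.seed(1)
x    <- as.matrix(iris[, 1:4])
init <- factor(sample(1:10, nrow(x), replace = TRUE))  # random starting labels
res  <- maximize_cv_accuracy(x, init)
table(res$labels, iris$Species)  # recovered partition vs. true species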
Abdel-Shafy EA, Kassim M, Vignol A, et al.
KODAMA enables self-guided weakly supervised learning in spatial transcriptomics.
bioRxiv 2025. doi: 10.1101/2025.05.28.656544

Cacciatore S, Luchinat C, Tenori L.
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-5122. doi: 10.1073/pnas.1220873111

Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA.
KODAMA: an updated R package for knowledge discovery and data mining.
Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705

van der Maaten LJP, Hinton GE.
Visualizing high-dimensional data using t-SNE.
Journal of Machine Learning Research 2008;9:2579-2605.

van der Maaten LJP.
Learning a parametric embedding by preserving local structure.
Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 2009;5:384-391.

McInnes L, Healy J, Melville J.
UMAP: uniform manifold approximation and projection for dimension reduction.
arXiv preprint arXiv:1802.03426, 2018.
KODAMA.visualization
data(iris)
data <- iris[, -5]
labels <- iris[, 5]
kk <- KODAMA.matrix(data, ncomp = 2)
cc <- KODAMA.visualization(kk, "t-SNE")
plot(cc, col = as.numeric(labels), cex = 2)
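A further hedged sketch with synthetic data shows how the spatial and spatial.resolution arguments described above could be supplied; the expression matrix, coordinates, and parameter values are illustrative assumptions, not taken from the package's examples:

library(KODAMA)

# Hedged sketch with synthetic data: supplying spatial coordinates so that
# spatial constraints are applied (see the spatial and spatial.resolution
# arguments above). Data, coordinates, and parameter values are illustrative.
set.seed(42)
expr <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)  # 200 samples x 20 variables
xy   <- cbind(x = runif(200), y = runif(200))           # synthetic coordinates

kk_sp <- KODAMA.matrix(expr, spatial = xy, spatial.resolution = 0.3, ncomp = 5)
cc_sp <- KODAMA.visualization(kk_sp, "t-SNE")
plot(cc_sp, cex = 2)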