Numero v1.2.0

Statistical Framework to Define Subgroups in Complex Datasets

High-dimensional datasets that do not exhibit a clear intrinsic clustered structure pose a challenge to conventional clustering algorithms. For this reason, we developed an unsupervised framework that helps scientists to better subgroup their datasets based on visual cues. For details, please see Gao S, Mutter S, Casey A, Makinen V-P (2018) Numero: a statistical framework to define multivariable subgroups in complex population-based datasets, Int J Epidemiology, dyy113, <doi:10.1093/ije/dyy113>. The framework includes the necessary functions to construct a self-organizing map of the data, to evaluate the statistical significance of the observed data patterns, and to visualize the results.
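The idea of evaluating the statistical significance of a data pattern on a map can be illustrated with a generic permutation test. The sketch below uses base R only and simulated data; it does not call any Numero function, and the names districts, values and regional_var are hypothetical. It is meant as a minimal illustration of the general technique under these assumptions, not as the package's actual implementation.

```r
# Generic permutation test in base R (illustration only, no Numero calls).
set.seed(1)

n <- 200
# Assumption: each sample has already been assigned to one of nine map districts.
districts <- sample(1:9, n, replace = TRUE)
# Simulated variable that is shifted upwards in districts 7 to 9.
values <- rnorm(n) + 0.5 * (districts > 6)

# Test statistic: variance of the district means.
# It grows when the variable differs between regions of the map.
regional_var <- function(v, d) var(tapply(v, d, mean))

observed <- regional_var(values, districts)

# Null distribution: shuffle the values across samples while keeping
# the district assignments (the map layout) fixed.
null_stats <- replicate(1000, regional_var(sample(values), districts))

# Empirical P-value with the usual +1 correction.
p_value <- (sum(null_stats >= observed) + 1) / (length(null_stats) + 1)
print(p_value)
```

A small P-value suggests that the regional pattern of the variable is unlikely to arise from a random arrangement of the samples on the map.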





In textbook examples, multivariable datasets are clustered into distinct subgroups that can be clearly identified by a set of optimal mathematical criteria. However, many real-world datasets arise from synergistic consequences of multiple effects, contain noisy and partly redundant measurements, and may represent a continuous spectrum of the different phases of a phenomenon. In medicine, complex diseases associated with ageing are typical examples. We postulate that population-based biomedical datasets (and many other real-world examples) do not contain an intrinsic clustered structure that would give rise to mathematically well-defined subgroups. From a modeling point of view, the lack of intrinsic structure means that the data points inhabit a contiguous cloud in high-dimensional space without abrupt changes in density to indicate subgroup boundaries. Hence, mathematical criteria cannot segment the cloud reliably by its internal structure. Yet we need data-driven classification and subgrouping to aid decision-making and to facilitate the development of testable hypotheses. For this reason, we developed the Numero package, which provides a more flexible and transparent process that allows human observers to create usable multivariable subgroups even when conventional clustering frameworks struggle.


# Install Numero from the CRAN repository:
install.packages("Numero")


The vignette of the package contains a practical real-life example of how to use the Numero R functions to define subgroups within a biomedical dataset.

browseVignettes(package = "Numero")

Functions in Numero

Name             Description
nroPreprocess    Data cleaning and standardization
nroTrain         Train self-organizing map
numero.clean     Clean datasets
nroPermute       Permutation analysis of map layout
nroPlot          Plot a self-organizing map
numero.summary   Summarize subgroup statistics
nroPostprocess   Standardization using existing parameters
numero.evaluate  Self-organizing map statistics
numero.create    Create a self-organizing map
numero.quality   Self-organizing map statistics
numero.subgroup  Interactive subgroup assignment
nroSummary       Estimate subgroup statistics
numero.plot      Plot results from SOM analysis
nroRcppMatrix    Safety check for Rcpp calls
numero.prepare   Prepare datasets for analysis
nroPrune         Reduce collinearity within a dataset
nroLabel         Label pruning
nroDestratify    Mitigate data stratification
nroPair          Match similar rows
nroKmeans        K-means clustering
nroAggregate     Regional averages on a self-organizing map
nroMatch         Best-matching districts
nroKohonen       Self-organizing map
nroImpute        Impute missing values
nroColorize      Assign colors based on value

Type: Package
Date: 2019-06-12
License: GPL (>= 2)
LinkingTo: Rcpp
VignetteBuilder: knitr
NeedsCompilation: yes
Repository: CRAN
SystemRequirements: C++11
Encoding: UTF-8
LazyData: true
Packaged: 2019-06-12 05:19:50 UTC; vipmak
Date/Publication: 2019-06-12 13:30:08 UTC