
mikropml

(pronounced "meek-ROPE em el")

User-Friendly R Package for Supervised Machine Learning Pipelines

An interface to build machine learning models for classification and regression problems. mikropml implements the ML pipeline described by Topçuoğlu et al. (2020) with reasonable default options for data preprocessing, hyperparameter tuning, cross-validation, testing, model evaluation, and interpretation steps. See the website for more information, documentation, and examples.

Installation

You can install the latest release from CRAN:

install.packages('mikropml')

or the development version from GitHub:

# install.packages("devtools")
devtools::install_github("SchlossLab/mikropml")

or install from a terminal using conda or mamba:

mamba install -c conda-forge r-mikropml

Dependencies

  • Imports: caret, dplyr, e1071, glmnet, kernlab, MLmetrics, randomForest, rlang, rpart, stats, utils, xgboost
  • Suggests: assertthat, doFuture, forcats, foreach, future, future.apply, furrr, ggplot2, knitr, progress, progressr, purrr, rmarkdown, rsample, testthat, tidyr

Usage

Check out the introductory vignette for a quick start tutorial. For a more in-depth discussion, read all the vignettes and/or take a look at the reference documentation.
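As a quick-start sketch (the otu_mini_bin example dataset ships with the package, and "dx" is its outcome column):

```r
library(mikropml)

# Train a regularized logistic regression model (glmnet) on the bundled
# mini OTU dataset; run_ml() handles data partitioning, cross-validation,
# and test-set evaluation with reasonable defaults.
results <- run_ml(otu_mini_bin, "glmnet", outcome_colname = "dx", seed = 2019)

results$performance   # one-row tibble of test-set performance metrics
results$trained_model # the underlying caret model object
```

Setting a seed makes the train/test split and cross-validation folds reproducible across runs.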

You can watch the Riffomonas Project series of video tutorials covering mikropml and other skills related to machine learning.

We also provide a Snakemake workflow for running mikropml locally or on an HPC. We highly recommend running mikropml with Snakemake or another workflow management system for reproducibility and scalability of ML analyses.

Help & Contributing

If you come across a bug, open an issue and include a minimal reproducible example.

If you have questions, create a new post in Discussions.

If you’d like to contribute, see our contributing guidelines.

Code of Conduct

Please note that the mikropml project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

The mikropml package is licensed under the MIT license. Text and images included in this repository, including the mikropml logo, are licensed under the CC BY 4.0 license.

Citation

To cite mikropml in publications, use:

  Topçuoğlu BD, Lapp Z, Sovacool KL, Snitkin E, Wiens J, Schloss PD
  (2021). "mikropml: User-Friendly R Package for Supervised Machine
  Learning Pipelines." Journal of Open Source Software, 6(61), 3073.
  doi:10.21105/joss.03073.

A BibTeX entry for LaTeX users is:

 @Article{,
  title = {{mikropml}: User-Friendly R Package for Supervised Machine Learning Pipelines},
  author = {Begüm D. Topçuoğlu and Zena Lapp and Kelly L. Sovacool and Evan Snitkin and Jenna Wiens and Patrick D. Schloss},
  journal = {Journal of Open Source Software},
  year = {2021},
  volume = {6},
  number = {61},
  pages = {3073},
  doi = {10.21105/joss.03073},
  url = {https://joss.theoj.org/papers/10.21105/joss.03073},
} 

Why the name?

The word “mikrop” (pronounced “meek-ROPE”) is Turkish for “microbe”. This package was originally implemented as a machine learning pipeline for microbiome-based classification problems (see Topçuoğlu et al. 2020). We realized that these methods are applicable in many other fields too, but stuck with the name because we like it!


Package Details

  • Version: 1.6.1
  • License: MIT + file LICENSE
  • Monthly downloads: 306
  • Maintainer: Kelly Sovacool
  • Last published: August 21st, 2023

Functions in mikropml (1.6.1)

calc_balanced_precision

Calculate balanced precision given actual and baseline precision
check_perf_metric_name

Check perf_metric_name is NULL or a function
check_training_frac

Check that the training fraction is between 0 and 1
check_training_indices

Check the validity of the training indices
create_grouped_data_partition

Split into train and test set while splitting by groups. When group_partitions is NULL, all samples from each group will go into either the training set or the testing set. Otherwise, the groups will be split according to group_partitions
check_remove_var

Check remove_var
check_seed

check that the seed is either NA or a number
create_grouped_k_multifolds

Splitting into folds for cross-validation when using groups
check_kfold

Check that kfold is an integer of reasonable size
check_groups

Check grouping vector
cluster_corr_mat

Cluster a matrix of correlated features
check_packages_installed

Check whether package(s) are installed
collapse_correlated_features

Collapse correlated features
check_corr_thresh

check that corr_thresh is either NULL or a number between 0 and 1
check_dataset

Check that the dataset is not empty and has more than 1 column.
check_perf_metric_function

Check perf_metric_function is NULL or a function
define_cv

Define cross-validation scheme and training parameters
find_permuted_perf_metric

Get permuted performance metric difference for a single feature (or group of features)
get_hyperparams_list

Set hyperparameters based on ML method and dataset characteristics
check_all

Check all params that don't return a value
check_permute

Check that permute is a logical
flatten_corr_mat

Flatten correlation matrix to pairs
check_cat_feats

Check if any features are categorical
check_outcome_column

Check that outcome column exists. Pick outcome column if not specified.
get_performance_tbl

Get model performance metrics as a one-row tibble
get_perf_metric_name

Get default performance metric name
get_hp_performance

Get hyperparameter performance metrics
get_outcome_type

Get outcome type.
get_hyperparams_from_df

Split hyperparameters dataframe into named lists for each parameter
get_tuning_grid

Generate the tuning grid for tuning hyperparameters
get_seeds_trainControl

Get seeds for caret::trainControl()
get_binary_corr_mat

Identify correlated features as a binary matrix
get_feature_importance

Get feature importance using the permutation method
get_caret_dummyvars_df

Get dummyvars dataframe (i.e. design matrix)
get_caret_processed_df

Get preprocessed dataframe for continuous variables
get_partition_indices

Select indices to partition the data into training & testing sets.
check_outcome_value

Check that the outcome variable is valid. Pick outcome value if necessary.
keep_groups_in_cv_partitions

Whether groups can be kept together in partitions during cross-validation
get_groups_from_clusters

Assign features to groups
mutate_all_types

Mutate all columns with utils::type.convert()
mikropml-package

mikropml: User-Friendly R Package for Robust Machine Learning Pipelines
otu_data_preproc

Mini OTU abundance dataset - preprocessed
get_perf_metric_fn

Get default performance metric function
otu_mini_bin_results_xgbTree

Results from running the pipeline with xgbTree on otu_mini_bin
otu_mini_bin_results_svmRadial

Results from running the pipeline with svmRadial on otu_mini_bin
combine_hp_performance

Combine hyperparameter performance metrics for multiple train/test splits
otu_mini_bin

Mini OTU abundance dataset
otu_mini_bin_results_glmnet

Results from running the pipeline with L2 logistic regression on otu_mini_bin with feature importance and grouping
otu_mini_bin_results_rf

Results from running the pipeline with random forest on otu_mini_bin
otu_small

Small OTU abundance dataset
pbtick

Update progress if the progress bar is not NULL.
otu_mini_multi_group

Groups for otu_mini_multi
otu_mini_multi_results_glmnet

Results from running the pipeline with glmnet on otu_mini_multi for multiclass outcomes
reexports

caret contr.ltfr
get_difference

Calculate the difference in the mean of the metric for two groups
compare_models

Perform permutation tests to compare the performance metric across all pairs of a group variable.
get_corr_feats

Identify correlated features
permute_p_value

Calculate a permuted p-value comparing two models
plot_mean_roc

Plot ROC and PRC curves
process_cont_feats

Preprocess continuous features
process_novar_feats

Process features with no variation
shuffle_group

Shuffle the rows in a column
set_hparams_svmRadial

Set hyperparameters for SVM with radial kernel
shared_ggprotos

Get plot layers shared by plot_mean_roc and plot_mean_prc
remove_singleton_columns

Remove columns appearing in only threshold row(s) or fewer.
process_cat_feats

Process categorical features
preprocess_data

Preprocess data prior to running machine learning
otu_mini_cv

Cross validation on train_data_mini with grouped features.
replace_spaces

Replace spaces in all elements of a character vector with underscores
otu_mini_multi

Mini OTU abundance dataset with 3 categorical variables
group_correlated_features

Group correlated features
plot_model_performance

Plot performance metrics for multiple ML runs with different parameters
rm_missing_outcome

Remove missing outcome values
otu_mini_bin_results_rpart2

Results from running the pipeline with rpart2 on otu_mini_bin
randomize_feature_order

Randomize feature order to eliminate any position-dependent effects
plot_hp_performance

Plot hyperparameter performance metrics
is_whole_number

Check whether a numeric vector contains whole numbers.
otu_mini_cont_results_nocv

Results from running the pipeline with glmnet on otu_mini_bin with Otu00001 as the outcome column, using a custom train control scheme that does not perform cross-validation
otu_mini_cont_results_glmnet

Results from running the pipeline with glmnet on otu_mini_bin with Otu00001 as the outcome
set_hparams_xgbTree

Set hyperparameters for xgboost models
run_ml

Run the machine learning pipeline
tidy_perf_data

Tidy the performance dataframe
radix_sort

Call sort() with method = 'radix'
split_outcome_features

Split dataset into outcome and features
select_apply

Use future apply if available
calc_model_sensspec

Calculate and summarize performance for ROC and PRC plots
set_hparams_glmnet

Set hyperparameters for regression models for use with glmnet
set_hparams_rpart2

Set hyperparameters for decision tree models
set_hparams_rf

Set hyperparameters for random forest models
train_model

Train the model using caret::train()
calc_baseline_precision

Calculate the fraction of positives, i.e. baseline precision for a PRC curve
abort_packages_not_installed

Throw error if required packages are not installed.
calc_perf_metrics

Get performance metrics for test data
calc_perf_bootstrap_split

Calculate performance for a single bootstrap of a train/test split
calc_mean_perf

Generic function to calculate mean performance curves for multiple models
bootstrap_performance

Calculate a bootstrap confidence interval for the performance on a single train/test split
lower_bound

Get the lower and upper bounds for an empirical confidence interval
change_to_num

Change columns to numeric if possible
check_method

Check if the method is supported. If not, throws error.
calc_pvalue

Calculate the p-value for a permutation test
check_ntree

Check ntree
check_features

Check features
check_group_partitions

Check the validity of the group_partitions list
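Most of the helpers indexed above are called internally; a typical analysis touches only the top-level API. A minimal end-to-end sketch combining preprocessing, training, and permutation feature importance (again using the bundled otu_mini_bin dataset with "dx" as its outcome column):

```r
library(mikropml)

# Preprocess: remove missing outcomes, normalize continuous features,
# collapse perfectly correlated features, etc.
preproc <- preprocess_data(otu_mini_bin, outcome_colname = "dx")

# Train a random forest on the transformed data and compute permutation
# feature importance on the held-out test set.
results <- run_ml(preproc$dat_transformed, "rf",
                  outcome_colname = "dx",
                  find_feature_importance = TRUE,
                  seed = 2019)

results$performance        # one-row tibble of test-set metrics
results$feature_importance # permutation importance per feature (or group)
```

Note that permutation importance repeats the test-set evaluation many times per feature, so it is considerably slower than a plain run_ml() call.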