MantaID (version 1.0.4)

A Machine-Learning Based Tool to Automate the Identification of Biological Database IDs

Description

The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. This inconsistency impedes the integration of various types of biological data. To resolve the problem, we developed 'MantaID', a data-driven, machine-learning based approach that automates the identification of IDs on a large scale. The 'MantaID' model achieved a prediction accuracy of 99% and correctly predicted 100,000 ID entries within two minutes. 'MantaID' supports the discovery and exploitation of ID patterns across large numbers of databases (e.g., up to 542 biological databases). A freely available, open-source R package, a user-friendly web application, and an API were also developed for 'MantaID' to improve applicability. To our knowledge, 'MantaID' is the first tool that enables automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and it can therefore serve as a starting point for the complex assimilation and aggregation of biological data across diverse databases.

Install

install.packages('MantaID')

Monthly Downloads

164

Version

1.0.4

License

GPL (>= 3)

Maintainer

Zeng Zhengpeng

Last Published

September 9th, 2024

Functions in MantaID (1.0.4)

mi_predict_new

Predict new data with a trained learner.
mi_get_padlen

Get max length of ID data.
mi_unify_mod

Predict with four models and unify results by the sub-model's specificity score to the four possible classes.
mi_tune_rg

Tune the Random Forest model by hyperband.
mi_train_xgb

XGBoost model training.
mi_run_bmr

Compare classification models with small samples.
mi_to_numer

Convert data to numeric, and for the ID column convert with fixed levels.
mi_train_rp

Classification tree model training.
mi_train_BP

Train a three-layer neural network model.
mi_train_rg

Random Forest Model Training.
mi_data_rawID

ID dataset for testing.
mi_clean_data

Reshape data and delete meaningless rows.
mi

A wrapper function that executes MantaID workflow.
mi_get_ID_attr

Get ID attributes from the BioMart database.
mi_balance_data

Balance the data. Most classes are randomly undersampled, while a few classes are oversampled with SMOTE to obtain relatively balanced data.
mi_get_ID

Get ID data from the BioMart database using attributes.
mi_filter_feat

Perform feature selection automatically based on correlation and feature importance.
mi_get_miss

Observe the distribution of the misclassified responses in the test set.
mi_split_str

Split the string into individual characters and complete the character vector to the maximum length.
mi_split_col

Cut the string of ID column character by character and divide it into multiple columns.
mi_plot_heatmap

Plot heatmap for result confusion matrix.
mi_plot_cor

Plot correlation heatmap.
mi_get_confusion

Compute the confusion matrix for the predicted result.
mi_get_importance

Plot the bar plot for feature importance.
mi_tune_rp

Tune the Decision Tree model by hyperband.
mi_tune_xgb

Tune the XGBoost model by hyperband.
mi_data_procID

Processed ID data.
Example

ID example dataset.
mi_data_attributes

ID-related datasets in BioMart.
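
Several of the functions listed above (mi_get_padlen, mi_split_str, mi_split_col, mi_to_numer) describe a character-level encoding of ID strings: find the maximum ID length, split each ID into single characters, pad to that length, and convert the characters to numbers using fixed levels. The following is a minimal base-R sketch of that general idea only. It does not call MantaID and is not the package's implementation; the helper names, the "*" padding character, and the level set are all assumptions made for illustration.

```r
# Illustrative sketch in base R. All helper names below are hypothetical
# and are NOT part of the MantaID API.

# Longest ID length in a character vector.
pad_length <- function(ids) max(nchar(ids))

# Split one ID into single characters and right-pad it to `pad_len`.
# The padding character "*" is an assumption.
split_and_pad <- function(id, pad_len, pad_char = "*") {
  chars <- strsplit(id, "")[[1]]
  c(chars, rep(pad_char, max(0, pad_len - length(chars))))
}

# Build a matrix with one row per ID and one column per character position.
to_char_matrix <- function(ids, pad_char = "*") {
  pad_len <- pad_length(ids)
  rows <- lapply(ids, split_and_pad, pad_len = pad_len, pad_char = pad_char)
  mat <- matrix(unlist(rows), nrow = length(ids), byrow = TRUE)
  colnames(mat) <- paste0("pos", seq_len(pad_len))
  mat
}

# Map characters to integer codes using a fixed level set, so that the
# same character always receives the same code across datasets.
to_numeric <- function(char_mat, levels) {
  codes <- match(char_mat, levels)
  dim(codes) <- dim(char_mat)
  dimnames(codes) <- dimnames(char_mat)
  codes
}

# Assumed level set: padding, digits, letters, and a few separators.
fixed_levels <- c("*", as.character(0:9), LETTERS, letters, "_", ".", "-", ":")

ids <- c("ENSG00000139618", "P38398", "GO:0005515")
char_mat <- to_char_matrix(ids)
codes <- to_numeric(char_mat, fixed_levels)
print(dim(codes))
```

Fixing the level set up front, rather than deriving it from each dataset, keeps the encoding of training data and new data comparable. Characters outside the level set would map to NA in this sketch and would need separate handling.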