GeneSelectR

Overview

GeneSelectR is an R package that streamlines gene selection and evaluation in bulk RNA-seq datasets. It is built on the scikit-learn Python library, accessed through the reticulate package, and combines machine learning and bioinformatics steps in a single workflow.

Features

Comprehensive Workflow

GeneSelectR provides an end-to-end workflow for feature selection, combining the machine learning methods of scikit-learn with the bioinformatics utilities of R packages such as clusterProfiler and simplifyEnrichment.

Customizable Yet User-Friendly

GeneSelectR can be customized extensively for specific research needs, and it also ships with preset configurations that suit most use cases, which makes it accessible to both novice and experienced users.

Diverse Feature Selection Methods

The package includes several built-in feature selection methods:

  • SelectFromModel with RandomForest
  • SelectFromModel with Logistic Regression (L1 penalty)
  • Boruta
  • Univariate Filtering
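The scikit-learn selectors behind three of these methods can be sketched directly in Python. This is a minimal illustration on synthetic data, not GeneSelectR's own code: the dataset, the hyperparameter values, and the dictionary keys are assumptions made for the example. Boruta is left out because it is not part of scikit-learn itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for an expression matrix: 60 samples x 200 "genes".
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=10, random_state=0
)

# Hypothetical names; each value is a standard scikit-learn selector.
selectors = {
    # Keeps features whose forest importance is at least the mean importance.
    "random_forest": SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0)
    ),
    # The L1 penalty drives many coefficients to zero; non-zero ones are kept.
    "lasso_logistic": SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    ),
    # Univariate filtering: rank features by ANOVA F-score and keep the top 20.
    "univariate": SelectKBest(score_func=f_classif, k=20),
}

selected = {}
for name, selector in selectors.items():
    selector.fit(X, y)
    # Column indices of the features the selector retained.
    selected[name] = np.flatnonzero(selector.get_support())
    print(name, len(selected[name]))
```

Each selector exposes the same `fit` and `get_support` interface, which is what allows different methods to be swapped into one workflow and compared.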

Main Functionality

The core function, GeneSelectR, performs gene selection using various methods and evaluates their performance through cross-validation. It also supports hyperparameter tuning, permutation feature importance calculation, and more.
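As a rough picture of what this involves at the scikit-learn level, the sketch below chains a feature selector and a classifier in a pipeline, tunes both with a cross-validated grid search, and then computes permutation feature importance on held-out data. It is an illustration under stated assumptions, not the package's implementation: the pipeline steps, the parameter grid, and the synthetic data are all chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix: 80 samples x 100 "genes".
X, y = make_classification(
    n_samples=80, n_features=100, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed steps: scaling, univariate selection, then a classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("classify", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Hyperparameters are addressed as <step name>__<parameter name>.
param_grid = {"select__k": [10, 20], "classify__max_depth": [3, None]}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Permutation importance: shuffle one input column at a time on the test set
# and record how much the score of the fitted pipeline drops.
perm = permutation_importance(
    search.best_estimator_, X_test, y_test, n_repeats=5, random_state=0
)
print(perm.importances_mean.shape)
```

Because the selector sits inside the pipeline, it is refit on each training fold, so the cross-validation score is not inflated by selecting features on data that is later used for scoring.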

Installation

GeneSelectR depends on reticulate, which creates a conda working environment. Please install the Anaconda distribution before you proceed. You can install the development version of GeneSelectR from GitHub with:

# install.packages("devtools")
devtools::install_github("dzhakparov/GeneSelectR")

Usage and Example

A tutorial detailing how to use GeneSelectR can be accessed in this vignette.

Docker Image

GeneSelectR is available as a container image on Docker Hub. You can pull and run the image with the following commands:

docker pull dzhakparov/geneselectr-image:latest
docker run -e PASSWORD=your_password -p 8787:8787 dzhakparov/geneselectr-image:latest

After running these commands, open your browser and go to localhost:8787 (http://local-ip-address:8787 on Windows). You will be prompted to enter a username and password. The default username is rstudio, and the password is the one you specified in the command above.

Citation

Please cite the following paper if you use GeneSelectR in your research:

Feedback and Contribution

Any feedback is welcome and appreciated! Feel free to create issues or pull requests. For any other questions, please write to damir.zhakparov@uzh.ch.

Install

install.packages('GeneSelectR')

Version

1.0.1

License

MIT + file LICENSE

Maintainer

Damir Zhakparov

Last Published

February 3rd, 2024

Functions in GeneSelectR (1.0.1)

  • configure_environment: Configure Python Environment for GeneSelectR
  • create_test_metrics_df: Create a Dataframe of Test Metrics
  • .onAttach: Package Attachment Function
  • create_pipelines: Create Pipelines
  • evaluate_test_metrics: Evaluate Test Metrics for a Grid Search Model
  • enable_multiprocess: Enable Multiprocessing in Python Environment
  • create_conda_env: Create a specific Conda environment
  • set_default_param_grids: Set Default Parameter Grids for Feature Selection
  • set_reticulate_python: Set RETICULATE_PYTHON for the Current Session
  • plot_feature_importance: Plot Feature Importance
  • pipeline_to_list: Convert Scikit-learn Pipeline to Named List
  • get_feature_importances: Get Feature Importances
  • set_default_fs_methods: Set Default Feature Selection Methods
  • run_simplify_enrichment: Run simplifyGOFromMultipleLists with specified measure and method
  • plot_overlap_heatmaps: Generate Heatmaps to Visualize Overlap and Similarity Coefficients between Feature Lists
  • steps_to_tuples: Convert Steps to Tuples
  • import_python_packages: Import Python Libraries
  • install_python_packages: Install necessary Python packages in a specific Conda environment
  • python-modules: Global references to Python modules
  • perform_grid_search: Perform Grid Search or Random Search for Hyperparameter Tuning
  • plot_metrics: Plot Performance Metrics
  • plot_upset: Plot Feature Overlaps Using UpSet Plots
  • skip_if_no_modules: Check if Python Modules are Available
  • split_data: Split Data into Training and Test Sets
  • load_python_packages: Load Python Modules
  • PipelineResults-class: PipelineResults class
  • TestMetrics-class: Class Union for Test metrics output that could contain either a data frame or a list
  • GO_enrichment_analysis: Perform gene set enrichment analysis using clusterProfiler
  • AnnotatedGeneLists-class: AnnotatedGeneLists class
  • GeneSelectR: Gene Selection and Evaluation with GeneSelectR
  • GeneList-class: GeneList class
  • calculate_mean_cv_scores: Calculate Mean Cross-Validation Scores for Various Feature Selection Methods
  • calculate_overlap_coefficients: Calculate Overlap and Similarity Coefficients between Feature Lists
  • compute_GO_child_term_metrics: Retrieve and Plot the Offspring Nodes of GO Terms
  • calculate_permutation_feature_importance: Calculate Permutation Feature Importance
  • check_python_modules_available: Check Python Module Availability for Examples
  • aggregate_feature_importances: Aggregate Feature Importances
  • annotate_gene_lists: Convert and Annotate Gene Lists
  • define_sklearn_modules: Define Python modules and scikit-learn submodules
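
Several entries above refer to overlap and similarity coefficients between feature lists. The sketch below shows three common set-based measures (overlap coefficient, Jaccard index, and Sørensen-Dice coefficient) in plain Python. Which measures calculate_overlap_coefficients actually reports is not stated here, so treat the choice of measures, the function names, and the toy gene lists as assumptions made for illustration.

```python
from itertools import combinations


def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|): equals 1.0 when one list contains the other."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def soerensen_dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical gene lists, one per feature selection method.
gene_lists = {
    "random_forest": ["TP53", "BRCA1", "EGFR", "MYC"],
    "lasso": ["TP53", "EGFR", "KRAS"],
    "univariate": ["MYC", "KRAS", "PTEN", "TP53", "EGFR"],
}

# Compute every measure for every pair of methods.
pairwise = {
    (m1, m2): {
        "overlap": overlap_coefficient(gene_lists[m1], gene_lists[m2]),
        "jaccard": jaccard_index(gene_lists[m1], gene_lists[m2]),
        "dice": soerensen_dice(gene_lists[m1], gene_lists[m2]),
    }
    for m1, m2 in combinations(gene_lists, 2)
}

for pair, scores in pairwise.items():
    print(pair, scores)
```

The overlap coefficient normalizes by the smaller list, so it stays high when a short list is nested inside a long one, whereas the Jaccard index penalizes the size difference.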