Learn R Programming

Capsule

Comprehensive Reproducibility Framework for R and Bioinformatics Workflows

Author: Abu Saadat (saadatabu1996@gmail.com)


Overview

Capsule is a comprehensive reproducibility framework specifically designed for bioinformatics and computational biology workflows. It automatically captures your entire analysis environment and generates everything needed to reproduce your research—from Docker containers to pipeline configurations.

Why Capsule?

The Problem:

  • Bioinformatics analyses depend on dozens of external tools (samtools, STAR, BWA, etc.)
  • NGS pipelines process massive files (BAM/FASTQ) where checksumming is slow
  • HPC environments require Singularity, not Docker
  • Reference genomes must be precisely versioned
  • Conda environments are complex to reproduce
  • Integration with workflow managers (Nextflow, Snakemake) is manual and error-prone

The Solution: Capsule automatically tracks all of this:

  • External bioinformatics tools (samtools, STAR, BWA, etc.) with versions
  • Conda/Mamba environments with full package specs
  • Reference genomes, annotations, and aligner indices
  • Large files with fast checksumming (xxHash64, 10-100x faster)
  • R packages, parameters, random seeds, and session info
  • System libraries, hardware specs (CPU, RAM, GPU)
  • Generates Docker AND Singularity containers
  • Exports to Nextflow, Snakemake, WDL, and CWL formats

Installation

# Install from GitHub
devtools::install_github("SAADAT-Abu/Capsule")

# Or install locally
install.packages("path/to/Capsule", repos = NULL, type = "source")

System Requirements

  • R ≥ 4.0.0
  • Optional but recommended:
    • conda/mamba (for environment tracking)
    • Singularity/Apptainer (for HPC containers)
    • Workflow manager: Nextflow, Snakemake, WDL, or CWL

Bioinformatics Pipeline Example

Here is an example with simulated RNA-seq data:

library(Capsule)

# =====================================
# RNA-Seq Analysis with Capsule
# Author: Abu Saadat
# =====================================

# Initialize project
init_capsule()

# Track computational environment
capture_hardware("hardware_info.json")
capture_system_libraries("system_libs.json")

# Track external bioinformatics tools
track_external_tools(c(
  "samtools", "bwa", "STAR", "featureCounts",
  "fastqc", "multiqc", "cutadapt"
))

# Track conda environment (if using conda)
track_conda_env()

# Set random seed for reproducibility
set_seed(42, analysis_name = "rna_seq_de")

# Define and track analysis parameters
analysis_params <- list(
  # Alignment parameters
  star_threads = 8,
  star_genome_dir = "ref/STAR_index",
  star_out_filter_mismatch_n_max = 10,

  # Quantifiprintion parameters
  feature_counts_threads = 8,
  min_mapping_quality = 30,

  # Differential expression parameters
  min_count = 10,
  p_value_threshold = 0.05,
  log2fc_threshold = 1.0,

  # Sample groups
  treatment_samples = c("treated_1", "treated_2", "treated_3"),
  control_samples = c("control_1", "control_2", "control_3")
)

track_params(
  analysis_params,
  "rna_seq_de",
  "RNA-seq differential expression analysis parameters"
)

# Track reference genome
track_reference_genome(
  fasta_path = "ref/GRCh38.primary_assembly.genome.fa",
  gtf_path = "ref/gencode.v38.primary_assembly.annotation.gtf",
  genome_build = "GRCh38_gencode_v38",
  species = "Homo sapiens",
  source_url = "https://www.gencodegenes.org/human/release_38.html",
  indices = list(
    star = "ref/STAR_index_GRCh38",
    bwa = "ref/bwa_index_GRCh38"
  )
)

# Simulate RNA-seq data for demonstration
print("\n=== Simulating RNA-seq Data ===\n")

# Create directories
dir.create("data", showWarnings = FALSE)
dir.create("results", showWarnings = FALSE)

# Simulate count matrix
set.seed(42)
n_genes <- 1000
n_samples <- 6

# Gene names
genes <- paste0("GENE_", sprintf("%04d", 1:n_genes))

# Simulate counts with some differentially expressed genes
base_counts <- matrix(
  rnbinom(n_genes * n_samples, mu = 100, size = 10),
  nrow = n_genes,
  ncol = n_samples
)

# Add differential expression for first 100 genes
de_genes_idx <- 1:100
fold_change <- 4
base_counts[de_genes_idx, 4:6] <- base_counts[de_genes_idx, 4:6] * fold_change

# Create count matrix
count_matrix <- data.frame(
  gene_id = genes,
  base_counts
)
colnames(count_matrix) <- c(
  "gene_id",
  paste0("control_", 1:3),
  paste0("treated_", 1:3)
)

# Save count matrix
count_file <- "data/gene_counts.csv"
write.csv(count_matrix, count_file, row.names = FALSE)

# Track the count matrix
track_data(
  count_file,
  source = "generated",
  description = "Simulated RNA-seq gene count matrix",
  metadata = list(
    n_genes = n_genes,
    n_samples = n_samples,
    n_de_genes = length(de_genes_idx),
    simulation_seed = 42
  )
)

print("Simulated count matrix with", n_genes, "genes and", n_samples, "samples\n")
print("Including", length(de_genes_idx), "differentially expressed genes\n\n")

# Perform differential expression analysis (simplified)
print("=== Differential Expression Analysis ===\n")

# Calculate means
control_mean <- rowMeans(count_matrix[, 2:4])
treated_mean <- rowMeans(count_matrix[, 5:7])

# Calculate log2 fold change
log2fc <- log2((treated_mean + 1) / (control_mean + 1))

# Simple t-test (for demonstration)
p_values <- apply(count_matrix[, 2:7], 1, function(x) {
  t.test(x[1:3], x[4:6])$p.value
})

# Create results table
de_results <- data.frame(
  gene_id = count_matrix$gene_id,
  control_mean = control_mean,
  treated_mean = treated_mean,
  log2_fold_change = log2fc,
  p_value = p_values,
  significant = p_values < analysis_params$p_value_threshold &
                abs(log2fc) > analysis_params$log2fc_threshold
)

# Save results
results_file <- "results/differential_expression_results.csv"
write.csv(de_results, results_file, row.names = FALSE)

# Track results
track_data(
  results_file,
  source = "generated",
  description = "Differential expression analysis results"
)

# Summary statistics
n_significant <- sum(de_results$significant)
n_upregulated <- sum(de_results$significant & de_results$log2_fold_change > 0)
n_downregulated <- sum(de_results$significant & de_results$log2_fold_change < 0)

print("\nResults Summary:\n")
print("  Total genes analyzed:", n_genes, "\n")
print("  Significant genes:", n_significant, "\n")
print("  Upregulated:", n_upregulated, "\n")
print("  Downregulated:", n_downregulated, "\n\n")

# Create visualization (simple volcano plot data)
volcano_data <- data.frame(
  log2fc = de_results$log2_fold_change,
  neg_log10_p = -log10(de_results$p_value),
  significant = de_results$significant
)

plot_file <- "results/volcano_plot.pdf"
pdf(plot_file, width = 8, height = 6)
plot(
  volcano_data$log2fc,
  volcano_data$neg_log10_p,
  col = ifelse(volcano_data$significant, "red", "grey"),
  pch = 20,
  xlab = "Log2 Fold Change",
  ylab = "-Log10 P-value",
  main = "Volcano Plot: Differential Expression"
)
abline(h = -log10(analysis_params$p_value_threshold), lty = 2)
abline(v = c(-analysis_params$log2fc_threshold, analysis_params$log2fc_threshold), lty = 2)
dev.off()

print("Volcano plot saved to:", plot_file, "\n\n")

# Create comprehensive snapshot
print("=== Creating Workflow Snapshot ===\n")
snapshot_files <- snapshot_workflow(
  snapshot_name = "rna_seq_analysis_v1",
  analysis_name = "rna_seq_de",
  source_script = "rna_seq_pipeline.R",
  description = "RNA-seq differential expression analysis with simulated data",
  generate_docker = TRUE,
  generate_script = TRUE,
  generate_report = TRUE
)

print("\n=== Exporting for Workflow Managers ===\n")

# Export for Nextflow
export_for_nextflow("nextflow_manifest.json")

# Export for Snakemake
export_for_snakemake("snakemake_config.yaml")

# Export for WDL
export_for_wdl("wdl_inputs.json")

# Generate Singularity container (for HPC)
print("\n=== Generating HPC Container ===\n")
generate_singularity(
  output_dir = "singularity",
  project_name = "rna_seq_analysis",
  system_deps = c("samtools", "bwa")
)

print("\n" , rep("=", 60), "\n", sep = "")
print("RNA-seq Analysis Complete!\n")
print(rep("=", 60), "\n", sep = "")
print("\nAll reproducibility artifacts generated in:\n")
print("  .capsule/snapshots/rna_seq_analysis_v1/\n\n")

print("To reproduce this analysis:\n")
print("  1. Use Docker:\n")
print("     cd .capsule/snapshots/rna_seq_analysis_v1/docker\n")
print("     docker-compose up\n\n")
print("  2. Use Singularity (HPC):\n")
print("     cd singularity\n")
print("     sudo bash build_singularity.sh\n\n")
print("  3. Use Nextflow:\n")
print("     nextflow run pipeline.nf -params-file nextflow_manifest.json\n\n")

# Verify all tracked data
print("=== Data Integrity Check ===\n")
verify_data()

print("\n✓ Analysis complete with full reproducibility!\n\n")

Core Features

Copy Link

Version

Install

install.packages('Capsule')

Monthly Downloads

141

Version

0.2.0

License

MIT + file LICENSE

Maintainer

Abu SAADAT

Last Published

November 11th, 2025

Functions in Capsule (0.2.0)

.load_reference_registry

Load Reference Registry
.save_conda_registry

Save Conda Registry
.load_tools_registry

Load Tools Registry
.parse_deps

Parse Package Dependencies
generate_docker

Generate Docker Configuration
export_for_nextflow

Export Capsule Data for Nextflow
export_for_cwl

Generate CWL (Common Workflow Language) Input
generate_repro_script

Generate Reproducible Script
get_param_history

Get Parameter History
get_data_lineage

Get Data Lineage
.save_param_registry

Save Parameter Registry
.create_example_script

Create example workflow script
.compare_registries

Compare Registry Files
.get_fasta_stats

Get FASTA Statistics
generate_singularity

Generate Singularity Definition File
set_seed

Set and Track Random Seed
.create_config

Create Capsule config
.load_registry

Load Provenance Registry
.create_gitignore

Create .gitignore
.get_tool_path

Get Tool Path
.save_seed_registry

Save Seed Registry
.load_seed_registry

Load Seed Registry
snapshot_packages

Track Package Versions and Dependencies
.get_tool_version

Get Tool Version String
.save_registry

Save Provenance Registry
.save_reference_registry

Save Reference Registry
list_reference_sources

List Common Reference Genome Sources
.load_conda_registry

Load Conda Registry
snapshot_workflow

Create Complete Workflow Snapshot
get_conda_env_info

Get Conda Environment Info
.save_tools_registry

Save Tools Registry
list_snapshots

List Available Snapshots
get_reference_info

Get Reference Genome Information
get_seed_history

Get Seed History
get_tool_versions

Get External Tool Versions
init_capsule

Initialize Capsule in Project
restore_conda_env

Restore Conda Environment
.extract_package_info

Extract Package Information
export_for_snakemake

Export Capsule Data for Snakemake
.format_r_value

Format R Value for Script
track_conda_env

Track Conda Environment
track_reference_genome

Track Reference Genome
track_params

Track Analysis Parameters
export_for_wdl

Create WDL (Workflow Description Language) Config
restore_seed

Restore Random Seed
track_external_tools

Track External Bioinformatics Tools
track_data

Track Data Provenance
verify_data

Verify Data Integrity
.compare_packages

Compare Package Lists
create_renv_lockfile

Create renv Lockfile
capture_system_libraries

Capture System Libraries
create_repro_report

Create Reproducibility Report
capture_environment

Capture Environment State
compare_snapshots

Compare Two Workflow Snapshots
capture_session

Capture Complete Session Information
capture_hardware

Capture Hardware Information
.generate_docker_compose

Generate docker-compose.yml
.calculate_checksum

Calculate Checksum with Smart Algorithm Selection
.generate_dockerfile

Generate Dockerfile
.generate_dockerignore

Generate .dockerignore
.build_dependency_graph

Build Dependency Graph
.format_size

Format File Size
.generate_comparison_report

Generate Comparison Report
.generate_docker_readme

Generate Docker README
.load_param_registry

Load Parameter Registry