LBDiscover

Overview

LBDiscover is an R package for literature-based discovery (LBD) in biomedical research. It provides a comprehensive suite of tools for retrieving scientific articles, extracting biomedical entities, building co-occurrence networks, and applying various discovery models to uncover hidden connections in the scientific literature.

The package implements several literature-based discovery approaches including:

  • ABC model (Swanson’s discovery model)
  • AnC model (an extension of the ABC model with biomedical term filtering)
  • Latent Semantic Indexing (LSI)
  • BITOLA-style approaches

LBDiscover also provides visualization tools for exploring discovered connections as network graphs, heatmaps, and interactive chord diagrams.

Installation

# Install from CRAN
install.packages("LBDiscover")

# Or install the development version from GitHub
# install.packages("devtools")
devtools::install_github("chaoliu-cl/LBDiscover")

Key Features

LBDiscover provides a complete workflow for literature-based discovery:

  1. Data Retrieval: Query and retrieve scientific articles from PubMed and other NCBI databases
  2. Text Preprocessing: Clean and prepare text for analysis
  3. Entity Extraction: Identify biomedical entities in text (diseases, drugs, genes, etc.)
  4. Co-occurrence Analysis: Build networks of entity co-occurrences
  5. Discovery Models: Apply various discovery algorithms to find hidden connections
  6. Validation: Assess discoveries with statistical tests (for example hypergeometric or randomization tests)
  7. Visualization: Explore results through network graphs, heatmaps, and more
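
As an illustration of the validation step, the sketch below runs a hypergeometric test on document counts in base R. It is a minimal example under stated assumptions, not the package's validation code: the counts and variable names (n_docs, n_a, n_b, n_ab) are made up for illustration.

```r
# Hypothetical counts (illustration only, not real PubMed data)
n_docs <- 1000  # documents in the corpus
n_a    <- 50    # documents mentioning term A
n_b    <- 80    # documents mentioning term B
n_ab   <- 12    # documents mentioning both A and B

# Expected overlap if A and B occurred independently
expected <- n_a * n_b / n_docs

# P(overlap >= n_ab) under the hypergeometric null:
# draw n_a documents from a corpus holding n_b "B" documents
# and n_docs - n_b "non-B" documents
p_value <- phyper(n_ab - 1, n_b, n_docs - n_b, n_a, lower.tail = FALSE)

expected
p_value
```

When many term pairs are tested at once, the resulting p-values would normally be adjusted for multiple comparisons, for instance with base R's p.adjust().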

Quick Start Example

library(LBDiscover)

# Retrieve articles from PubMed
articles <- pubmed_search("migraine treatment", max_results = 100)

# Preprocess article text
preprocessed <- vec_preprocess(
  articles,
  text_column = "abstract",
  remove_stopwords = TRUE
)

# Extract biomedical entities
entities <- extract_entities_workflow(
  preprocessed,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene")
)

# Create co-occurrence matrix
co_matrix <- create_comat(
  entities,
  doc_id_col = "doc_id",
  entity_col = "entity",
  type_col = "entity_type"
)

# Apply the ABC model to find new connections
abc_results <- abc_model(
  co_matrix,
  a_term = "migraine",
  n_results = 50,
  scoring_method = "combined"
)

# Visualize the results
vis_abc_network(abc_results, top_n = 20)
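
To make the co-occurrence step in the example above concrete, here is a base R sketch of how a co-occurrence matrix can be derived from a long table of (document, entity) pairs. It is only an illustration: the toy data frame is made up, it reuses the doc_id and entity column names from the call above, and the internals of create_comat may well differ.

```r
# Toy entity table: one row per (document, entity) mention
entities_toy <- data.frame(
  doc_id = c(1, 1, 2, 2, 3, 3),
  entity = c("migraine", "serotonin", "serotonin", "sumatriptan",
             "migraine", "serotonin"),
  stringsAsFactors = FALSE
)

# Binary document-by-entity incidence matrix
incidence <- unclass(table(entities_toy$doc_id, entities_toy$entity)) > 0
incidence <- incidence * 1  # logical -> numeric 0/1

# Entity-by-entity co-occurrence: number of documents two entities share
co_toy <- crossprod(incidence)
diag(co_toy) <- 0  # ignore self co-occurrence

co_toy
```

In this toy table, "migraine" and "sumatriptan" never share a document, which is exactly the kind of gap the discovery models below try to bridge.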

Discovery Models

ABC Model

The ABC model is based on Swanson’s discovery paradigm. If concept A is related to concept B, and concept B is related to concept C, but A and C are not directly connected in the literature, then A may have a hidden relationship with C.

# Apply the ABC model
abc_results <- abc_model(
  co_matrix,
  a_term = "migraine",
  min_score = 0.1,
  n_results = 50
)

# Visualize as a network
vis_abc_network(abc_results)

# Or as a heatmap
vis_heatmap(abc_results)
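
The A-B-C logic itself can be sketched in a few lines of base R. The example below is a simplified illustration on a made-up co-occurrence matrix, not the implementation behind abc_model; the "weakest link" score and the column names a_term, b_term, and c_term are assumptions chosen for clarity.

```r
# Toy symmetric co-occurrence matrix (made-up counts)
terms <- c("migraine", "serotonin", "magnesium", "sumatriptan", "epilepsy")
co <- matrix(0, 5, 5, dimnames = list(terms, terms))
co["migraine", "serotonin"]    <- 4
co["migraine", "magnesium"]    <- 2
co["serotonin", "sumatriptan"] <- 3
co["magnesium", "epilepsy"]    <- 5
co <- co + t(co)  # make symmetric

a_term <- "migraine"

# B terms: everything that co-occurs with A
b_terms <- names(which(co[a_term, ] > 0))

# For each B, keep C terms that co-occur with B but never directly with A
abc_toy <- do.call(rbind, lapply(b_terms, function(b) {
  c_terms <- names(which(co[b, ] > 0))
  c_terms <- c_terms[co[a_term, c_terms] == 0 & c_terms != a_term]
  if (length(c_terms) == 0) return(NULL)
  data.frame(
    a_term = a_term, b_term = b, c_term = c_terms,
    # Illustrative score: the weaker of the A-B and B-C counts
    score = pmin(co[a_term, b], co[b, c_terms]),
    stringsAsFactors = FALSE
  )
}))

abc_toy <- abc_toy[order(-abc_toy$score), ]
abc_toy
```

Each row is a candidate hidden connection: A and C are linked only through the intermediate term B.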

AnC Model

The AnC model is an extension of the ABC model that uses multiple B terms to establish stronger connections between A and C.

# Apply the AnC model
anc_results <- anc_model(
  co_matrix,
  a_term = "migraine",
  n_b_terms = 5,
  min_score = 0.1
)
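
One way to read "multiple B terms" is to count how many distinct intermediate terms bridge A and a candidate C, and to keep only candidates with enough bridges. The base R sketch below shows that idea on a made-up matrix. It is an assumption-laden illustration, not the scoring used by anc_model, and the threshold min_bridges is a hypothetical parameter.

```r
# Toy symmetric co-occurrence matrix (made-up counts)
terms <- c("migraine", "serotonin", "magnesium", "sumatriptan", "epilepsy")
co <- matrix(0, 5, 5, dimnames = list(terms, terms))
co["migraine", "serotonin"]    <- 4
co["migraine", "magnesium"]    <- 2
co["serotonin", "sumatriptan"] <- 3
co["magnesium", "sumatriptan"] <- 1
co["magnesium", "epilepsy"]    <- 5
co <- co + t(co)  # make symmetric

a_term  <- "migraine"
b_terms <- names(which(co[a_term, ] > 0))

# Candidate C terms: not A, not a B term, and no direct link to A
candidates <- setdiff(terms, c(a_term, b_terms))
candidates <- candidates[co[a_term, candidates] == 0]

# For each candidate, count the B terms that bridge A and C
n_bridges <- sapply(candidates, function(c_term) sum(co[b_terms, c_term] > 0))

# Keep candidates supported by at least min_bridges distinct B terms
min_bridges <- 2
supported <- n_bridges[n_bridges >= min_bridges]
supported
```

Here "sumatriptan" is reached through two B terms and is kept, while "epilepsy" is reached through only one and is dropped.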

LSI Model

The Latent Semantic Indexing model identifies semantically related terms using dimensionality reduction techniques.

# Create term-document matrix
tdm <- create_term_document_matrix(preprocessed)

# Apply LSI model
lsi_results <- lsi_model(
  tdm,
  a_term = "migraine",
  n_factors = 100
)
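
The dimensionality reduction behind LSI is typically a truncated singular value decomposition of the term-document matrix. The following base R sketch shows the idea on a made-up 4 x 4 matrix; it is not the code behind lsi_model, and the names tdm_toy, k, and cosine are chosen here for illustration.

```r
# Toy term-document matrix (rows = terms, columns = documents; made-up counts)
tdm_toy <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 3, 1,
    0, 0, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("migraine", "serotonin", "epilepsy", "sodium"),
                  paste0("doc", 1:4))
)

# Truncated SVD: keep the k largest singular values
k <- 2
s <- svd(tdm_toy)
term_vecs <- s$u[, 1:k] %*% diag(s$d[1:k])
rownames(term_vecs) <- rownames(tdm_toy)

# Cosine similarity between every term and the A term in the latent space
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
a_vec <- term_vecs["migraine", ]
sims <- apply(term_vecs, 1, cosine, y = a_vec)

sort(sims, decreasing = TRUE)
```

In this toy matrix "migraine" and "serotonin" load on the same latent factor, so their similarity is high, while "epilepsy" and "sodium" sit on a separate factor. On real corpora the number of factors (n_factors = 100 in the call above) trades detail against noise.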

Visualization

The package offers multiple visualization options:

# Network visualization
vis_abc_network(abc_results, top_n = 25)

# Heatmap of connections
vis_heatmap(abc_results, top_n = 20)

# Export interactive HTML network
export_network(abc_results, output_file = "abc_network.html")

# Export interactive chord diagram
export_chord(abc_results, output_file = "abc_chord.html")

Comprehensive Analysis

For an end-to-end analysis:

# Run comprehensive discovery analysis
discovery_results <- run_lbd(
  search_query = "migraine pathophysiology",
  a_term = "migraine",
  discovery_approaches = c("abc", "anc", "lsi"),
  include_visualizations = TRUE,
  output_file = "discovery_report.html"
)

Documentation

For more detailed documentation and examples, please see the package vignettes:

# View package vignettes
browseVignettes("LBDiscover")

Citation

If you use LBDiscover in your research, please cite:

Liu, C. (2025). LBDiscover: Literature-Based Discovery Tools for Biomedical Research. 
R package version 0.1.0. https://github.com/chaoliu-cl/LBDiscover

License

This project is licensed under the GPL-3 License - see the LICENSE file for details.

Monthly Downloads

165

Version

0.1.0

License

GPL-3

Maintainer

Chao Liu

Last Published

June 16th, 2025

Functions in LBDiscover (0.1.0)

  • bitola_model: Apply BITOLA-style discovery model
  • .dict_cache_env: Environment to store dictionary cache data
  • filter_by_type: Filter a co-occurrence matrix by entity type
  • find_abc_all: Find all potential ABC connections
  • calculate_score: Calculate ABC score based on specified method
  • .pubmed_cache_env: Environment to store PubMed cache data
  • fetch_and_parse_pubmed: Fetch and parse PubMed data
  • fetch_and_parse_protein: Fetch and parse Protein data
  • find_similar_docs: Find similar documents for a given document
  • find_term: Find primary term in co-occurrence matrix
  • get_type_dist: Get entity type distribution from co-occurrence matrix
  • get_service_ticket: Get a service ticket from a TGT URL
  • get_term_vars: Extract term variations from text corpus
  • load_from_umls: Load terms from UMLS API
  • get_umls_semantic_types: Get UMLS semantic types for a given dictionary type
  • merge_results: Merge multiple search results
  • min_results: Ensure minimum results for visualization
  • safe_diversify: Diversify ABC results with error handling
  • sanitize_dictionary: Enhanced sanitize dictionary function
  • segment_sentences: Perform sentence segmentation on text
  • save_results: Save search results to a file
  • query_external_api: Query external biomedical APIs to validate entity types
  • query_mesh: Query for MeSH terms using E-utilities
  • vis_network: Create an enhanced network visualization of ABC connections
  • vis_heatmap: Create an enhanced heatmap of ABC connections
  • gen_report: Generate comprehensive discovery report
  • merge_entities: Combine and deduplicate entity datasets
  • get_dict_cache: Get dictionary cache environment
  • map_ontology: Map terms to biomedical ontologies
  • preprocess_text: Preprocess article text
  • validate_biomedical_entity: Validate biomedical entities using BioBERT or other ML models
  • apply_correction: Apply correction to p-values
  • abc_model: Apply the ABC model for literature-based discovery with improved filtering
  • abc_model_sig: Apply the ABC model with statistical significance testing
  • abc_model_opt: Optimize ABC model calculations for large matrices
  • anc_model: ANC model for literature-based discovery with biomedical term filtering
  • abc_timeslice: Apply time-sliced ABC model for validation
  • calc_bibliometrics: Calculate basic bibliometric statistics
  • calc_doc_sim: Calculate document similarity using TF-IDF and cosine similarity
  • create_comat: Create co-occurrence matrix without explicit entity type constraints
  • compare_terms: Compare term frequencies between two corpora
  • create_citation_net: Create a citation network from article data
  • cluster_docs: Cluster documents using K-means
  • clear_pubmed_cache: Clear PubMed cache
  • apply_bitola_flexible: Apply a flexible BITOLA-style discovery model without strict type constraints
  • authenticate_umls: Authenticate with UMLS
  • add_statistical_significance: Add statistical significance testing based on hypergeometric tests
  • alternative_validation: Alternative validation for large matrices
  • detect_lang: Detect language of text
  • create_dummy_dictionary: Helper function to create dummy dictionaries
  • fetch_and_parse_gene: Fetch and parse Gene data
  • create_term_document_matrix: Create a term-document matrix from preprocessed text
  • create_tdm: Create a term-document matrix from preprocessed text
  • export_chord: Export interactive HTML chord diagram for ABC connections
  • is_valid_biomedical_entity: Determine if a term is likely a specific biomedical entity with improved accuracy
  • diversify_b_terms: Enforce diversity by selecting top connections from each B term
  • list_to_df: Convert a list of articles to a data frame
  • fetch_and_parse_pmc: Fetch and parse PMC data
  • export_network: Export ABC results to simple HTML network
  • extract_entities_workflow: Extract entities from text with improved efficiency using only base R
  • extract_mesh_from_text: Extract MeSH terms from text format instead of XML
  • extract_entities: Extract and classify entities from text with multi-domain types
  • enhance_abc_kb: Enhance ABC results with external knowledge
  • create_report: Generate a comprehensive discovery report
  • create_sparse_comat: Create a sparse co-occurrence matrix
  • ncbi_search: Search NCBI databases for articles or data
  • diversify_c_paths: Enforce diversity for C term paths
  • plot_network: Create network visualization from results
  • export_chord_diagram: Export interactive HTML chord diagram for ABC connections
  • eval_evidence: Evaluate literature support for discovery results
  • get_pmc_fulltext: Retrieve full text from PubMed Central
  • extract_topics: Apply topic modeling to a corpus
  • extract_terms: Extract common terms from a corpus
  • diversify_abc: Enforce diversity in ABC model results
  • get_pubmed_cache: Get the pubmed cache environment
  • prep_articles: Prepare articles for report generation
  • query_umls: Query UMLS for term information
  • remove_ac_terms: Remove A and C terms that appear as B terms
  • standard_validation: Standard validation method using hypergeometric tests
  • shadowtext: Helper function to draw text with a shadow/background
  • lsi_model: LSI model with enhanced biomedical term filtering and NLP verification
  • load_results: Load saved results from a file
  • null_coalesce: Null coalescing operator
  • retry_api_call: Retry an API call with exponential backoff
  • parallel_analysis: Apply parallel processing for document analysis
  • visualize_abc_network: Visualize ABC model results as a network
  • process_mesh_xml: Process MeSH XML data with improved error handling
  • pubmed_search: Search PubMed for articles with optimized performance
  • perm_test_abc: Perform randomization test for ABC model
  • plot_heatmap: Create heatmap visualization from results
  • run_lbd: Perform comprehensive literature-based discovery without type constraints
  • vec_preprocess: Vectorized preprocessing of text
  • vis_abc_heatmap: Create a heatmap of ABC connections
  • load_mesh_terms_from_pubmed: Load terms from MeSH using PubMed search
  • extract_ngrams: Extract n-grams from text
  • extract_ner: Perform named entity recognition on text
  • parse_pubmed_xml: Parse PubMed XML data with optimized memory usage
  • validate_entity_comprehensive: Comprehensive entity validation using multiple techniques
  • validate_entity_with_nlp: Validate entity types using NLP-based entity recognition with improved accuracy
  • process_mesh_chunks: Process MeSH data in chunks to avoid memory issues
  • validate_umls_key: Validate a UMLS API key
  • load_dictionary: Load biomedical dictionaries with improved error handling
  • load_from_mesh: Load terms from MeSH using rentrez with improved error handling
  • validate_abc: Apply statistical validation to ABC model results with support for large matrices
  • valid_entities: Filter entities to include only valid biomedical terms