This function runs a complete workflow for extracting entities from text using dictionaries drawn from multiple sources. It supports batch processing, optional parallel execution, and dictionary caching, and is designed to handle errors robustly.
extract_entities_workflow(
  text_data,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene"),
  dictionary_sources = c("local", "mesh", "umls"),
  additional_mesh_queries = NULL,
  sanitize = TRUE,
  api_key = NULL,
  custom_dictionary = NULL,
  max_terms_per_type = 200,
  verbose = TRUE,
  batch_size = 500,
  parallel = FALSE,
  num_cores = 2,
  cache_dictionaries = TRUE
)

Returns a data frame with extracted entities, their types, and positions.
text_data: A data frame containing article text data.
text_column: Name of the column containing the text to process. Default is "abstract".
entity_types: Character vector of entity types to include. Default is c("disease", "drug", "gene").
dictionary_sources: Character vector of sources for entity dictionaries. Default is c("local", "mesh", "umls").
additional_mesh_queries: Named list of additional MeSH queries. Default is NULL.
sanitize: Logical. If TRUE, sanitizes dictionaries before extraction. Default is TRUE.
api_key: API key for UMLS access (needed if "umls" is in dictionary_sources). Default is NULL.
custom_dictionary: A data frame of custom dictionary entries to include in the extraction. Default is NULL.
max_terms_per_type: Maximum number of terms to fetch per entity type. Default is 200.
verbose: Logical. If TRUE, prints detailed progress information. Default is TRUE.
batch_size: Number of documents to process in a single batch. Default is 500.
parallel: Logical. If TRUE, uses parallel processing when available. Default is FALSE.
num_cores: Number of cores to use for parallel processing. Default is 2.
cache_dictionaries: Logical. If TRUE, caches dictionaries for faster reuse. Default is TRUE.
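
The following is a minimal usage sketch based only on the signature documented above, not an example taken from the package. The toy articles, the pmid column, and the custom dictionary's column names (term, type) are assumptions made for illustration; the actual columns expected in custom_dictionary are not stated here. The call is guarded so that the snippet still runs in a session where the function has not been loaded.

```r
# Toy input: one row per article, with the text in an "abstract" column.
# The pmid column is illustrative only.
articles <- data.frame(
  pmid = c("1", "2"),
  abstract = c(
    "Metformin is widely used to treat type 2 diabetes.",
    "BRCA1 mutations increase the risk of breast cancer."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical custom dictionary. The column names (term, type) are an
# assumption and may differ from what the function actually expects.
custom_dict <- data.frame(
  term = c("metformin", "BRCA1"),
  type = c("drug", "gene"),
  stringsAsFactors = FALSE
)

# Only call the workflow if the function is available in this session.
if (exists("extract_entities_workflow", mode = "function")) {
  entities <- extract_entities_workflow(
    text_data = articles,
    text_column = "abstract",
    entity_types = c("disease", "drug", "gene"),
    dictionary_sources = "local",    # "umls" is left out, so api_key stays NULL
    custom_dictionary = custom_dict,
    batch_size = 500,
    parallel = FALSE,
    verbose = FALSE
  )
  # Inspect the returned data frame of entities, types, and positions.
  str(entities)
}
```

Restricting dictionary_sources to "local" keeps the sketch free of a UMLS API key; adding "umls" would require passing api_key as well.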