This function performs advanced normalization and fuzzy matching of bibliographic citations to identify and group citations that refer to the same work but are formatted differently. It uses a multi-phase approach combining string normalization, blocking strategies, hierarchical clustering, and post-processing to achieve both speed and accuracy on large citation datasets.
normalize_citations(
CR_vector,
threshold = 0.9,
method = "jw",
min_chars = 20,
max_block_size = 100,
use_iso4 = TRUE,
use_doi = TRUE,
use_exact = TRUE,
fuzzy = TRUE,
use_postproc = TRUE,
title_guard = FALSE
)
A data frame with the following columns:
CR_original: Original citation string
CR_canonical: Canonical (representative) citation for the cluster
cluster_id: Unique identifier for each citation cluster
n_cluster: Number of citations in the cluster
first_author: First author surname
year: Publication year
journal_iso4: Journal name normalized to ISO4 abbreviated form
journal_original: Original journal name as extracted from citation
volume: Volume number
doi: Digital Object Identifier (when available)
blocking_key: Internal key used for blocking (author_year_journal)
CR_vector: Character vector containing bibliographic citations to be normalized and matched.
threshold: Numeric value between 0 and 1 indicating the similarity threshold for matching citations. Higher values (e.g., 0.90-0.95) produce more conservative matching, while lower values (e.g., 0.75-0.80) produce more aggressive matching. Default is 0.90, which provides a good balance between precision and recall.
method: String distance method to use for fuzzy matching. Options include:
"jw" (default): Jaro-Winkler distance, optimized for bibliographic strings
"lv": Levenshtein distance
"osa": Optimal String Alignment distance
"lcs": Longest Common Subsequence distance
Other methods supported by stringdistmatrix
min_chars: Minimum number of characters for a citation to be considered valid (default: 20).
max_block_size: Integer. Blocks with at least this many unique normalized strings skip within-block fuzzy matching and fall back to exact matching only, to bound the cost of the pairwise distance matrix (default: 100).
use_iso4: Logical. If TRUE (default), normalize journal names to their ISO 4 abbreviated form via the LTWA database (Phase 1.5). Set to FALSE to disable ISO 4 / LTWA journal normalization (used for ablation analyses).
use_doi: Logical. If TRUE (default), perform exact matching on DOIs (part of Phase 2). Set to FALSE to disable DOI-based matching.
use_exact: Logical. If TRUE (default), perform exact normalized-string and punctuation-invariant matching (Phase 2). Set to FALSE to disable them.
fuzzy: Logical. If TRUE (default), perform within-block matching (Phase 4: WoS deterministic key matching and Scopus fuzzy clustering). Set to FALSE to disable within-block matching, keeping only the exact phases.
use_postproc: Logical. If TRUE (default), perform the Phase 4.5 metadata-based post-processing merge. Set to FALSE to disable it.
title_guard: Logical. If TRUE, run an optional Phase 4.6 that purifies clusters by detecting series part markers in titles: distinct works that share author, year, journal and volume but differ only in a series designator (e.g. "... Part I" / "... Part II", "... I." / "... II.") are split into separate clusters. This step only splits clusters and never merges them; it relies solely on the part marker (not on full-title similarity), so it does not interfere with robustness to title typos. Default FALSE (legacy behaviour).
The function implements a five-phase matching algorithm:
Phase 1: Normalization and Feature Extraction
Converts text to uppercase
Removes issue numbers and page numbers (which often contain typos)
Removes punctuation and normalizes whitespace
Expands common journal abbreviations (e.g., "J. CLEAN. PROD." -> "JOURNAL OF CLEANER PRODUCTION")
Extracts key features: first author, year, journal, volume, pages
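The normalization steps listed above can be illustrated with a minimal base R sketch. The helper names normalize_sketch and extract_year_sketch are hypothetical and are not part of bibliometrix; the regular expressions are simplified assumptions (only page markers of the form ", P959" are removed, and journal abbreviation expansion is not covered).

```r
# Hypothetical helper: a simplified stand-in for the Phase 1 normalization.
normalize_sketch <- function(x) {
  x <- toupper(x)                                               # uppercase
  x <- gsub(",\\s*P\\.?\\s*\\d+(-\\d+)?", " ", x, perl = TRUE)  # drop page markers such as ", P959"
  x <- gsub("[[:punct:]]", " ", x)                              # punctuation to spaces
  x <- gsub("\\s+", " ", x)                                     # collapse repeated whitespace
  trimws(x)
}

# Hypothetical helper: first four-digit year (1800-2099), NA when none is found.
extract_year_sketch <- function(x) {
  m <- regexpr("\\b(18|19|20)\\d{2}\\b", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

a <- "Aria M, 2017, J INFORMETR, V11, P959"
b <- "ARIA M., 2017, J. Informetr., V11, P. 959"

norm_a <- normalize_sketch(a)
norm_b <- normalize_sketch(b)
first_author <- sub(" .*$", "", norm_a)   # first token is taken as the surname
year <- extract_year_sketch(norm_a)
```

In this toy case both variants reduce to the same normalized string, so an exact comparison is enough to match them.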
Phase 1.5: Journal Normalization
The function uses the LTWA (List of Title Word Abbreviations) database from the ISO 4 standard to normalize journal names. This ensures that abbreviated forms (e.g., "J. Clean. Prod.") and full forms (e.g., "Journal of Cleaner Production") are recognized as the same journal and matched together.
The LTWA database is included in the bibliometrix package. If it is not found, the function attempts to download it from ISSN.org. To disable journal normalization, set use_iso4 = FALSE.
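The idea of word-level abbreviation can be sketched as follows. This is only an illustration: ltwa_toy is a three-entry, hand-written lookup table and stop_words is an assumed stop-word list; neither is the real LTWA database, and abbreviate_journal_sketch is a hypothetical helper, not the routine used by the package.

```r
# Toy lookup table (assumption): full title word -> abbreviated form.
ltwa_toy <- c(JOURNAL = "J", CLEANER = "CLEAN", PRODUCTION = "PROD")
stop_words <- c("OF", "THE", "AND")   # assumed list of words dropped from titles

abbreviate_journal_sketch <- function(x) {
  # Strip punctuation, uppercase, and split into words
  words <- strsplit(toupper(gsub("[[:punct:]]", "", x)), "\\s+")[[1]]
  words <- words[!words %in% stop_words]
  # Replace every word found in the lookup table by its abbreviation
  hit <- words %in% names(ltwa_toy)
  words[hit] <- unname(ltwa_toy[words[hit]])
  paste(words, collapse = " ")
}

full_form  <- abbreviate_journal_sketch("Journal of Cleaner Production")
short_form <- abbreviate_journal_sketch("J. Clean. Prod.")
```

Both inputs map to the same key, which is what allows the full and abbreviated journal names to fall into the same block.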
Phase 2: Blocking
Citations are grouped into blocks by first author and year. This dramatically reduces computational complexity from O(n^2) to approximately O(k*m^2), where k is the number of blocks and m is the average block size.
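The effect of blocking on the number of pairwise comparisons can be checked on a toy example. The data frame feats and its column names are illustrative assumptions, not the internal representation used by the function.

```r
# Toy feature table (assumed column names).
feats <- data.frame(
  first_author = c("ARIA", "ARIA", "ARIA", "GARFIELD", "GARFIELD", "SMALL"),
  year         = c("2017", "2017", "2017", "1972", "1972", "1973"),
  stringsAsFactors = FALSE
)

# Blocking key: first author and year
key <- paste(feats$first_author, feats$year, sep = "_")
blocks <- split(seq_len(nrow(feats)), key)   # row indices per block

# Number of unordered pairs among n items
n_pairs <- function(n) n * (n - 1) / 2

all_pairs <- n_pairs(nrow(feats))
blocked_pairs <- sum(vapply(blocks, function(i) n_pairs(length(i)), numeric(1)))
```

Here all_pairs is 15 while blocked_pairs is 4 (3 + 1 + 0), because comparisons are only made between citations that share a block.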
Phase 3: Within-Block Matching
Within each block, citations are compared using string distance metrics and
hierarchical clustering. For blocks with at least max_block_size unique
normalized strings (default 100), exact matching on normalized strings is used
instead to maintain performance.
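A minimal sketch of within-block clustering, under stated assumptions: it uses base R only, so the default Jaro-Winkler distance is swapped for a length-normalized Levenshtein distance computed with adist, and the "complete" linkage is an assumption (the linkage used by the function is not stated here). cluster_block_sketch is a hypothetical helper.

```r
cluster_block_sketch <- function(block, threshold = 0.9, max_block_size = 100) {
  u <- unique(block)
  # Large blocks (or trivial ones) fall back to exact matching on the strings
  if (length(u) == 1L || length(u) >= max_block_size) {
    return(match(block, u))
  }
  d <- adist(u)                                   # Levenshtein distance matrix
  d_norm <- d / outer(nchar(u), nchar(u), pmax)   # scale to [0, 1] by the longer string
  hc <- hclust(as.dist(d_norm), method = "complete")
  # Similarity threshold 0.9 corresponds to cutting the tree at distance 0.1
  cutree(hc, h = 1 - threshold)[match(block, u)]
}

block <- c("ARIA M 2017 J INFORMETR V11",
           "ARIA M 2017 J INFORMETR V 11",
           "ARIA M 2017 J INFOMETR V11",
           "ARIA M 2017 SCIENTOMETRICS V112")

fuzzy_ids <- cluster_block_sketch(block)
exact_ids <- cluster_block_sketch(block, max_block_size = 2)
```

With the default settings the three near-identical strings share one cluster and the fourth stays separate; lowering max_block_size to 2 triggers the exact-matching fallback, so every distinct string gets its own cluster.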
Phase 4: Canonical Representative Selection
For each cluster, the most complete citation (prioritizing those with volume and page information) is selected as the canonical representative.
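One way to express "most complete" is a simple score, sketched below. The scoring rule (one point each for a volume marker and a page marker, ties broken by string length) is an assumption made for illustration and may differ from the rule used by the function.

```r
cluster <- c("ARIA M, 2017, J INFORMETR",
             "ARIA M, 2017, J INFORMETR, V11, P959",
             "ARIA M, 2017, J INFORMETR, V11")

has_volume <- grepl("\\bV\\d+", cluster, perl = TRUE)   # e.g. "V11"
has_pages  <- grepl("\\bP\\d+", cluster, perl = TRUE)   # e.g. "P959"
score <- has_volume + has_pages

# Highest score first; longer strings win ties
canonical <- cluster[order(-score, -nchar(cluster))[1]]
```

In this toy cluster the second citation is selected, since it is the only one carrying both volume and page information.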
Phase 5: Post-Processing
Citations sharing the same first author, year, journal, and volume are merged into a single cluster, even if they were not matched in Phase 3. This catches cases where minor title variations prevented matching.
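The metadata-based merge can be sketched with a composite key. The toy data frame below assumes one row per cluster and complete metadata; how the function treats rows with missing fields is not covered by this sketch.

```r
res <- data.frame(
  cluster_id   = c(1, 2, 3, 4),
  first_author = c("ARIA", "ARIA", "ARIA", "SMALL"),
  year         = c("2017", "2017", "2017", "1973"),
  journal_iso4 = c("J INFORMETR", "J INFORMETR", "SCIENTOMETRICS", "J AM SOC INF SCI"),
  volume       = c("11", "11", "112", "24"),
  stringsAsFactors = FALSE
)

# Composite key: first author, year, journal and volume
meta_key <- with(res, paste(first_author, year, journal_iso4, volume, sep = "|"))

# Rows sharing a key take the smallest cluster_id found for that key
res$cluster_id <- ave(res$cluster_id, meta_key, FUN = min)
```

Clusters 1 and 2 share all four fields and are merged; cluster 3 has the same author and year but a different journal and volume, so it is left alone.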
Aria, M. & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959-975.
applyReferenceMatching for direct application to bibliometrix data frames
if (FALSE) {
# Load bibliometrix data
data(scientometrics, package = "bibliometrixData")
# Extract and normalize citations
CR_vector <- unlist(strsplit(scientometrics$CR, ";"))
CR_vector <- trimws(CR_vector)
# Perform normalization with default threshold
matched <- normalize_citations(CR_vector)
# View matching statistics
table(matched$n_cluster)
# Find all variants of a specific citation
subset(matched, cluster_id == matched$cluster_id[1])
# Use more conservative matching (threshold above the 0.90 default)
matched_conservative <- normalize_citations(CR_vector, threshold = 0.95)
}