The function implements a five-phase matching algorithm, with an additional
journal normalization step (Phase 1.5) between Phases 1 and 2:
Phase 1: Normalization and Feature Extraction
Converts text to uppercase
Removes issue numbers and page numbers (which often contain typos)
Removes punctuation and normalizes whitespace
Expands common journal abbreviations (e.g., "J. CLEAN. PROD." -> "JOURNAL OF CLEANER PRODUCTION")
Extracts key features: first author, year, journal, volume, pages
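The normalization steps above can be illustrated with a minimal Python sketch. It is not the package's own code. It assumes a comma-separated citation layout (author, year, journal, V-prefixed volume, P-prefixed page, optional DOI); the names clean_field, extract_features, and JOURNAL_ABBREVIATIONS are hypothetical, and the one-entry abbreviation table stands in for a much larger list.

```python
import re

# Hypothetical lookup table; a real one would hold many more entries.
JOURNAL_ABBREVIATIONS = {"J CLEAN PROD": "JOURNAL OF CLEANER PRODUCTION"}


def clean_field(text):
    """Uppercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.upper())
    return re.sub(r"\s+", " ", text).strip()


def extract_features(raw):
    """Split one comma-separated citation into normalized features."""
    features = {"first_author": None, "year": None, "journal": None,
                "volume": None, "pages": None}
    other = []
    for field in (clean_field(part) for part in raw.split(",")):
        if not field or field.startswith("DOI "):
            continue
        if features["year"] is None and re.fullmatch(r"(1[6-9]|20)\d{2}", field):
            features["year"] = field
        elif re.fullmatch(r"V\d+", field):
            features["volume"] = field[1:]
        elif re.fullmatch(r"P\d+", field):
            features["pages"] = field[1:]
        else:
            other.append(field)
    if other:
        features["first_author"] = other[0]
    if len(other) > 1:
        # Expand a known abbreviation, otherwise keep the cleaned text.
        features["journal"] = JOURNAL_ABBREVIATIONS.get(other[1], other[1])
    # Assumption of this sketch: volume and pages are kept as features
    # but left out of the string used for comparison.
    parts = (features["first_author"], features["year"], features["journal"])
    features["normalized"] = " ".join(p for p in parts if p)
    return features
```

Keeping volume and pages as separate features, rather than inside the comparison string, lets later phases use them without letting typos in those fields block a match.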
Phase 1.5: Journal Normalization
The function uses the LTWA (List of Title Word Abbreviations), the word list
that accompanies the ISO 4 standard, to normalize journal names. This ensures
that abbreviated forms (e.g., "J. Clean. Prod.") and full forms (e.g.,
"Journal of Cleaner Production") are recognized as the same journal and
matched together.
The LTWA database is included in the bibliometrix package. If it is not found,
the function attempts to download it from ISSN.org. If the database is still
unavailable, journal normalization is skipped, so making the database
unavailable is the way to disable this step.
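As a rough illustration of how a word-level abbreviation list can bring both forms of a journal name to one key, consider the Python sketch below. The three-entry LTWA_EXCERPT table and the STOPWORDS set are hand-written stand-ins, not the real LTWA data, and journal_key is a hypothetical helper. How the package itself applies the list may differ.

```python
import re

# Hand-written stand-in for a few LTWA-style entries (assumption, not real data).
LTWA_EXCERPT = {"JOURNAL": "J", "CLEANER": "CLEAN", "PRODUCTION": "PROD"}

# ISO 4 omits articles, prepositions, and conjunctions from abbreviated titles.
STOPWORDS = {"OF", "THE", "AND", "FOR", "IN"}


def journal_key(name):
    """Map a journal name, abbreviated or full, to one comparable key."""
    words = re.sub(r"[^A-Za-z0-9\s]", " ", name).upper().split()
    # Words already abbreviated are not in the table and pass through as-is.
    return " ".join(LTWA_EXCERPT.get(w, w) for w in words if w not in STOPWORDS)
```

Because words that are already abbreviated pass through unchanged, the abbreviated and full forms end up with the same key.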
Phase 2: Blocking
Citations are grouped into blocks by first author and year, and only citations
in the same block are compared. This dramatically reduces the number of
pairwise comparisons from O(n^2), where n is the total number of citations, to
approximately O(k*m^2), where k is the number of blocks and m is the average
block size.
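A small Python sketch of blocking follows, together with a count of how many pairwise comparisons it saves. The record layout (dicts with first_author and year keys) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def block_citations(records):
    """Group record indices by (first author, year)."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[(record["first_author"], record["year"])].append(index)
    return dict(blocks)


def pairwise_comparisons(block_sizes):
    """Number of unordered pairs compared when each block is handled alone."""
    return sum(size * (size - 1) // 2 for size in block_sizes)
```

For example, 1,000 citations compared all against all need 499,500 comparisons, while the same citations split into 100 blocks of 10 need 100 * 45 = 4,500.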
Phase 3: Within-Block Matching
Within each block, citations are compared using string distance metrics and
hierarchical clustering. For blocks larger than 500 citations, exact matching
on normalized strings is used instead to maintain performance.
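The sketch below shows one way such within-block matching could look in Python. It makes several assumptions the text does not confirm. The distance is 1 minus the difflib.SequenceMatcher ratio. The clustering is single-linkage cut at a fixed threshold, which is equivalent to finding connected components. The 0.15 threshold is arbitrary. Only the 500-citation limit comes from the description above.

```python
from difflib import SequenceMatcher

MAX_FUZZY_BLOCK = 500  # block size limit quoted in the text
THRESHOLD = 0.15       # hypothetical distance cut-off


def cluster_block(strings, threshold=THRESHOLD, max_fuzzy=MAX_FUZZY_BLOCK):
    """Return one cluster label per normalized string in a block."""
    if len(strings) > max_fuzzy:
        # Large block: only identical normalized strings share a label.
        labels = {}
        return [labels.setdefault(s, len(labels)) for s in strings]

    parent = list(range(len(strings)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            ratio = SequenceMatcher(None, strings[i], strings[j]).ratio()
            if 1.0 - ratio <= threshold:
                parent[find(j)] = find(i)  # link the two clusters

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(strings))]
```

Single linkage is the simplest choice to sketch, but it can chain loosely related strings together; complete or average linkage would be stricter.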
Phase 4: Canonical Representative Selection
For each cluster, the most complete citation (prioritizing those with volume
and page information) is selected as the canonical representative.
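Selecting the representative can be sketched in Python as a scoring function. The scoring below (count of volume and pages present, then length of the raw string as a tie-breaker) is an assumption about what "most complete" means, and the dict keys raw, volume, and pages are hypothetical.

```python
def completeness(citation):
    """Score a citation: volume and pages first, then raw string length."""
    has_volume = citation.get("volume") is not None
    has_pages = citation.get("pages") is not None
    return (has_volume + has_pages, len(citation["raw"]))


def pick_canonical(cluster):
    """Return the most complete citation of a non-empty cluster."""
    return max(cluster, key=completeness)
```

Comparing tuples means a citation with both volume and pages always beats one with only a longer raw string.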
Phase 5: Post-Processing
Citations sharing the same first author, year, journal, and volume are merged
into a single cluster, even if they weren't matched in Phase 3. This catches
cases where minor title variations prevented matching.
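The merge step can be sketched in Python as a union of cluster labels keyed on the four metadata fields. Two assumptions are made here: records are dicts with first_author, year, journal, and volume keys, and a record is skipped when any of the four fields is missing, a case the text does not address. The function name merge_by_metadata is hypothetical.

```python
def merge_by_metadata(records, labels):
    """Merge clusters whose records share author, year, journal, and volume.

    records: list of feature dicts; labels: cluster id per record.
    Returns the merged cluster id for each record.
    """
    parent = {label: label for label in labels}

    def find(label):
        # Follow parent links until reaching the representative label.
        while parent[label] != label:
            label = parent[label]
        return label

    first_seen = {}
    for record, label in zip(records, labels):
        key = (record["first_author"], record["year"],
               record["journal"], record["volume"])
        if None in key:
            continue  # assumption: incomplete metadata never triggers a merge
        if key in first_seen:
            parent[find(label)] = find(first_seen[key])
        else:
            first_seen[key] = label
    return [find(label) for label in labels]
```

Skipping incomplete records keeps a missing volume from acting as a wildcard that would merge unrelated papers by the same author in the same journal and year.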