normalize_citations: Normalize and match bibliographic citations

Description

This function performs advanced normalization and fuzzy matching of bibliographic citations to identify and group citations that refer to the same work but are formatted differently. It uses a multi-phase approach combining string normalization, blocking strategies, hierarchical clustering, and post-processing to achieve both speed and accuracy on large citation datasets.

Usage

normalize_citations(
  CR_vector,
  threshold = 0.9,
  method = "jw",
  min_chars = 20,
  max_block_size = 100,
  use_iso4 = TRUE,
  use_doi = TRUE,
  use_exact = TRUE,
  fuzzy = TRUE,
  use_postproc = TRUE,
  title_guard = FALSE
)

Value

A data frame with the following columns:

CR_original: Original citation string
CR_canonical: Canonical (representative) citation for the cluster
cluster_id: Unique identifier for each citation cluster
n_cluster: Number of citations in the cluster
first_author: First author surname
year: Publication year
journal_iso4: Journal name normalized to ISO4 abbreviated form
journal_original: Original journal name as extracted from citation
volume: Volume number
doi: Digital Object Identifier (when available)
blocking_key: Internal key used for blocking (author_year_journal)

Arguments

CR_vector

Character vector containing bibliographic citations to be normalized and matched.

threshold

Numeric value between 0 and 1 indicating the similarity threshold for matching citations. Higher values (e.g., 0.90-0.95) produce more conservative matching, while lower values (e.g., 0.75-0.80) produce more aggressive matching. Default is 0.90, which provides a good balance between precision and recall.

method

String distance method to use for fuzzy matching. Options include:

"jw" (default): Jaro-Winkler distance, optimized for bibliographic strings
"lv": Levenshtein distance
"osa": Optimal String Alignment distance
"lcs": Longest Common Subsequence distance
Other methods supported by stringdist::stringdistmatrix()

min_chars

Minimum number of characters a citation string must contain to be treated as valid (default: 20).

max_block_size

Integer. Blocks with at least this many unique normalized strings skip within-block fuzzy matching and fall back to exact matching only, to bound the cost of the pairwise distance matrix (default: 100).

use_iso4

Logical. If TRUE (default), normalize journal names to their ISO 4 abbreviated form via the LTWA database (Phase 1.5). Set to FALSE to disable ISO 4 / LTWA journal normalization (used for ablation analyses).

use_doi

Logical. If TRUE (default), perform exact matching on DOIs (part of Phase 2). Set to FALSE to disable DOI-based matching.
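The idea behind DOI-based exact matching can be illustrated with a minimal sketch. This is not the package's internal code: the helper extract_doi(), the regular expression, and the placeholder DOI are all assumptions made for illustration.

```r
# Illustrative citations; the DOI below is a made-up placeholder.
cr <- c(
  "ARIA M, 2017, J INFORMETR, V11, P959, DOI 10.1234/EXAMPLE.001",
  "Aria M., Cuccurullo C., (2017) J. Informetr., doi:10.1234/example.001",
  "GARFIELD E, 1972, SCIENCE, V178, P471"
)

# Hypothetical helper: pull the first DOI-like token and lowercase it,
# since DOIs are case-insensitive.
extract_doi <- function(x) {
  hit <- regexpr("10\\.[0-9]{4,9}/[^[:space:];,]+", x)
  out <- rep(NA_character_, length(x))
  out[hit > 0] <- tolower(regmatches(x, hit))
  out
}

doi <- extract_doi(cr)

# Citations with the same DOI share a group; citations without a DOI get NA
doi_group <- match(doi, unique(doi[!is.na(doi)]))
```

In this sketch the first two citations land in the same group despite their different formatting, while the citation without a DOI is left for the later phases.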

use_exact

Logical. If TRUE (default), perform exact normalized-string and punctuation-invariant matching (Phase 2). Set to FALSE to disable them.

fuzzy

Logical. If TRUE (default), perform within-block matching (Phase 4: WoS deterministic key matching and Scopus fuzzy clustering). Set to FALSE to disable within-block matching, keeping only the exact phases.

use_postproc

Logical. If TRUE (default), perform Phase 4.5 metadata-based post-processing merge. Set to FALSE to disable it.

title_guard

Logical. If TRUE, run an optional Phase 4.6 that purifies clusters by detecting series part markers in titles: distinct works that share author, year, journal and volume but differ only in a series designator (e.g. "... Part I" / "... Part II", "... I." / "... II.") are split into separate clusters. This step only splits clusters, never merges; it relies solely on the part marker (not on full-title similarity) so it does not interfere with robustness to title typos. Default FALSE (legacy behaviour).
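A rough sketch of how a part-marker split might work is shown here. The helper part_marker(), its regular expression, and the toy titles are hypothetical; the patterns the function actually recognizes are not documented beyond the examples given above.

```r
# Toy data: three titles already grouped in cluster 7, one in cluster 8
titles <- c("BIBLIOMETRIC METHODS PART I",
            "BIBLIOMETRIC METHODS PART II",
            "BIBLIOMETRIC METODS PART I",   # title typo, same part
            "CITATION ANALYSIS")
cluster_id <- c(7, 7, 7, 8)

# Hypothetical helper: return the series designator ("PART I", "PART 2", ...)
# or NA when the title carries none.
part_marker <- function(title) {
  up  <- toupper(title)
  hit <- regexpr("\\bPART\\s+([IVX]+|[0-9]+)\\b", up, perl = TRUE)
  out <- rep(NA_character_, length(title))
  out[hit > 0] <- regmatches(up, hit)
  gsub("\\s+", " ", out)
}

marker <- part_marker(titles)
marker[is.na(marker)] <- ""

# Split only: a new id is the old cluster combined with the marker,
# so clusters can be divided but never merged.
new_id <- as.integer(factor(paste(cluster_id, marker)))
```

Because only the marker is compared, the misspelled "METODS" title stays with "PART I" while "PART II" is moved to its own cluster.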

Details

The function implements a five-phase matching algorithm:

Phase 1: Normalization and Feature Extraction

Converts text to uppercase
Removes issue numbers and page numbers (which often contain typos)
Removes punctuation and normalizes whitespace
Expands common journal abbreviations (e.g., "J. CLEAN. PROD." -> "JOURNAL OF CLEANER PRODUCTION")
Extracts key features: first author, year, journal, volume, pages
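The string-cleaning steps listed above can be sketched in base R as follows. The helper normalize_sketch() is hypothetical and covers only uppercasing, page removal, punctuation removal, and whitespace normalization; the page pattern assumes Web of Science style tokens such as ", P959". Abbreviation expansion and feature extraction are left out.

```r
# Hypothetical helper, not the package's internal implementation
normalize_sketch <- function(x) {
  x <- toupper(x)
  # Drop WoS-style page tokens such as ", P959" or ", P 959" (assumed format)
  x <- gsub(",\\s*P\\s*[0-9]+", "", x)
  # Replace punctuation with spaces, then collapse repeated whitespace
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

cr <- c("Aria M, 2017, J Informetr, V11, P959",
        "ARIA M., 2017, J. INFORMETR., V11, P 959")

normalized <- normalize_sketch(cr)
```

Both variants reduce to the same normalized string, which is what allows the exact-matching step to group them without any fuzzy comparison.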

Phase 1.5: Journal Normalization

The function uses the LTWA (List of Title Word Abbreviations) database of the ISO 4 standard to normalize journal names. This ensures that abbreviated forms (e.g., "J. Clean. Prod.") and full forms (e.g., "Journal of Cleaner Production") are recognized as the same journal and matched together.

The LTWA database is included in the bibliometrix package. If it is not found, the function attempts to download it from ISSN.org. Journal normalization can be disabled by setting use_iso4 = FALSE.

Phase 2: Blocking

Citations are grouped into blocks by first author and year. This reduces the number of pairwise comparisons from O(n^2) to approximately O(k*m^2), where k is the number of blocks and m is the average block size.
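The saving from blocking can be seen on a toy example. The feature table below is made up, and the key construction is only a sketch of the idea, not the package's own code.

```r
# Toy feature table; values are illustrative only
feat <- data.frame(
  first_author = c("ARIA", "ARIA", "ARIA", "GARFIELD", "GARFIELD", "SMALL"),
  year         = c(2017, 2017, 2017, 1972, 1972, 1973),
  stringsAsFactors = FALSE
)

# Blocking key: first author and year
key    <- paste(feat$first_author, feat$year, sep = "_")
blocks <- split(seq_len(nrow(feat)), key)

# Pairwise comparisons without blocking: n * (n - 1) / 2
n         <- nrow(feat)
pairs_all <- n * (n - 1) / 2

# Pairwise comparisons with blocking: only pairs inside the same block
m             <- lengths(blocks)
pairs_blocked <- sum(m * (m - 1) / 2)
```

With six citations in three blocks, the comparison count drops from 15 to 4; the gap widens quickly as the number of citations grows.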

Phase 3: Within-Block Matching

Within each block, citations are compared using string distance metrics and hierarchical clustering. For blocks with at least max_block_size unique normalized strings (default 100), exact matching on normalized strings is used instead to maintain performance.
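A dependency-free sketch of within-block clustering is given below. It makes several assumptions that are not confirmed by this page: it swaps in a length-normalized Levenshtein distance from base R's adist() in place of Jaro-Winkler, uses complete linkage, and cuts the tree at 1 - threshold. The helper cluster_block() is hypothetical.

```r
# Hypothetical helper: cluster the normalized strings of one block
cluster_block <- function(block, threshold = 0.9, max_block_size = 100) {
  u <- unique(block)
  # Large blocks (or a single unique string): exact matching only
  if (length(u) >= max_block_size || length(u) < 2) {
    return(match(block, u))
  }
  # Levenshtein distance scaled to [0, 1] by the longer string's length
  d   <- utils::adist(u)
  len <- nchar(u)
  dn  <- d / outer(len, len, pmax)
  # Hierarchical clustering, cut where distance exceeds 1 - threshold
  hc <- hclust(as.dist(dn), method = "complete")
  cl <- cutree(hc, h = 1 - threshold)
  unname(cl[match(block, u)])
}

block <- c("ARIA M 2017 J INFORMETR V11",
           "ARIA M 2017 J INFOMETR V11",        # one-character typo
           "ARIA M 2017 SCIENTOMETRICS V112")   # different work

fuzzy_ids <- cluster_block(block)
exact_ids <- cluster_block(block, max_block_size = 2)
```

In this sketch the typo variant joins the first citation while the third stays separate; lowering max_block_size to 2 triggers the exact-matching fallback, so all three strings remain distinct.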

Phase 4: Canonical Representative Selection

For each cluster, the most complete citation (prioritizing those with volume and page information) is selected as the canonical representative.
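One way to express "most complete" is a simple score, sketched below. The scoring rule (one point each for a volume token and a page token, ties broken by string length) and the WoS-style patterns are assumptions for illustration, not the package's documented rule.

```r
# Toy clustered citations
cl <- data.frame(
  CR_original = c("ARIA M, 2017, J INFORMETR",
                  "ARIA M, 2017, J INFORMETR, V11, P959",
                  "ARIA M, 2017, J INFORMETR, V11",
                  "GARFIELD E, 1972, SCIENCE"),
  cluster_id  = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Completeness score: presence of volume ("V11") and page ("P959") tokens
has_volume <- grepl("\\bV[0-9]+", cl$CR_original, perl = TRUE)
has_pages  <- grepl("\\bP[0-9]+", cl$CR_original, perl = TRUE)
score      <- has_volume + has_pages

# For each cluster, pick the row with the highest score (longest string on ties)
pick <- vapply(split(seq_len(nrow(cl)), cl$cluster_id), function(i) {
  i[order(-score[i], -nchar(cl$CR_original[i]))[1]]
}, integer(1))

cl$CR_canonical <- cl$CR_original[pick[as.character(cl$cluster_id)]]
```

Every member of cluster 1 is then represented by the variant that carries both volume and page information.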

Phase 5: Post-Processing

Citations sharing the same first author, year, journal, and volume are merged into a single cluster, even if they weren't matched in Phase 3. This catches cases where minor title variations prevented matching.
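The metadata-based merge can be sketched on a table with one row per cluster. The table, the rule of skipping clusters with a missing volume, and the choice of the smallest id as the merged id are all assumptions made for this illustration.

```r
# Toy table: one row per cluster, with cluster-level metadata
res <- data.frame(
  cluster_id   = c(1, 2, 3, 4),
  first_author = c("ARIA", "ARIA", "ARIA", "GARFIELD"),
  year         = c(2017, 2017, 2017, 1972),
  journal_iso4 = c("J INFORMETR", "J INFORMETR", "J INFORMETR", "SCIENCE"),
  volume       = c("11", "11", NA, "178"),
  stringsAsFactors = FALSE
)

# Only merge when all four fields are known (assumption)
complete <- !is.na(res$volume)
key <- paste(res$first_author, res$year, res$journal_iso4, res$volume, sep = "|")

# Clusters sharing a key take the smallest cluster id among them
res$merged_id <- res$cluster_id
res$merged_id[complete] <- ave(res$cluster_id[complete], key[complete], FUN = min)
```

Clusters 1 and 2 collapse into one, while cluster 3 is left alone because its volume is missing and the four-field key cannot be formed.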

References

Aria, M. & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959-975.

Examples

if (FALSE) {
# Load bibliometrix data
data(scientometrics, package = "bibliometrixData")

# Extract and normalize citations
CR_vector <- unlist(strsplit(scientometrics$CR, ";"))
CR_vector <- trimws(CR_vector)

# Perform normalization with default threshold
matched <- normalize_citations(CR_vector)

# View matching statistics
table(matched$n_cluster)

# Find all variants of a specific citation
subset(matched, cluster_id == matched$cluster_id[1])

# Use more conservative matching
matched_conservative <- normalize_citations(CR_vector, threshold = 0.95)
}
