normalize_citations: Normalize and match bibliographic citations

Description

This function performs advanced normalization and fuzzy matching of bibliographic citations to identify and group citations that refer to the same work but are formatted differently. It uses a multi-phase approach combining string normalization, blocking strategies, hierarchical clustering, and post-processing to achieve both speed and accuracy on large citation datasets.

Usage

normalize_citations(CR_vector, threshold = 0.85, method = "jw", min_chars = 20)

Value

A data frame with the following columns:

CR_original: Original citation string
CR_canonical: Canonical (representative) citation for the cluster
cluster_id: Unique identifier for each citation cluster
n_cluster: Number of citations in the cluster
first_author: First author surname
year: Publication year
journal_iso4: Journal name normalized to ISO4 abbreviated form
journal_original: Original journal name as extracted from citation
volume: Volume number
doi: Digital Object Identifier (when available)
blocking_key: Internal key used for blocking (first author and year)

Arguments

CR_vector

Character vector containing bibliographic citations to be normalized and matched.

threshold

Numeric value between 0 and 1 indicating the similarity threshold for matching citations. Higher values (e.g., 0.90-0.95) produce more conservative matching, while lower values (e.g., 0.75-0.80) produce more aggressive matching. Default is 0.85, which provides a good balance between precision and recall.

method

String distance method to use for fuzzy matching. Options include:

"jw" (default): Jaro-Winkler distance, optimized for bibliographic strings
"lv": Levenshtein distance
Any other method supported by stringdist::stringdistmatrix()

min_chars

Minimum number of characters a citation string must contain to be treated as a valid citation. Default is 20.

Details

The function implements a five-phase matching algorithm:

Phase 1: Normalization and Feature Extraction

Converts text to uppercase
Removes issue numbers and page numbers (which often contain typos)
Removes punctuation and normalizes whitespace
Expands common journal abbreviations (e.g., "J. CLEAN. PROD." -> "JOURNAL OF CLEANER PRODUCTION")
Extracts key features: first author, year, journal, volume, pages
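
The steps listed above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's internal code: the helper names normalize_citation_sketch and extract_features_sketch, the regular expressions, and the WoS-style sample string are all hypothetical.

```r
# Hypothetical helpers: illustrative only, not the bibliometrix internals.
normalize_citation_sketch <- function(x) {
  x <- toupper(x)                                     # uppercase
  x <- gsub("\\bP\\s*\\d+\\b", " ", x, perl = TRUE)   # drop page tokens such as "P959"
  x <- gsub("[[:punct:]]+", " ", x)                   # remove punctuation
  x <- gsub("\\s+", " ", x)                           # collapse whitespace
  trimws(x)
}

# Works on a single normalized string; assumes the surname comes first.
extract_features_sketch <- function(x_norm) {
  year <- regmatches(
    x_norm,
    regexpr("\\b(1[6-9]|20)[0-9]{2}\\b", x_norm, perl = TRUE)
  )
  list(
    first_author = sub(" .*$", "", x_norm),
    year = if (length(year) == 1) year else NA_character_
  )
}

norm <- normalize_citation_sketch("Aria M., 2017, J. Informetr., V11, P959")
feat <- extract_features_sketch(norm)
```

Here norm holds the cleaned string and feat the extracted first author and year. A real implementation would also need to handle citations whose first token is not a surname.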

Phase 1.5: Journal Normalization

The function uses the LTWA (List of Title Word Abbreviations) database from ISO 4 standards to normalize journal names. This ensures that abbreviated forms (e.g., "J. Clean. Prod.") and full forms (e.g., "Journal of Cleaner Production") are recognized as the same journal and matched together.

The LTWA database is included in the bibliometrix package. If not found, the function attempts to download it from ISSN.org. Journal normalization can be disabled by ensuring the LTWA database is not available.
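
The idea behind word-level abbreviation can be sketched with a tiny lookup table. The three-entry table ltwa_sketch, the stop-word list, and the helper abbreviate_journal_sketch are hypothetical stand-ins. The real LTWA contains many thousands of entries with prefix and suffix rules that this sketch ignores.

```r
# Hypothetical excerpt of a word -> abbreviation table (not the real LTWA file).
ltwa_sketch <- c(JOURNAL = "J", CLEANER = "CLEAN", PRODUCTION = "PROD")
stop_words  <- c("OF", "THE", "AND")  # assumed list of words to drop

abbreviate_journal_sketch <- function(title) {
  # Strip punctuation, uppercase, and split into words
  words <- strsplit(toupper(gsub("[[:punct:]]+", " ", title)), "\\s+")[[1]]
  words <- words[nzchar(words) & !(words %in% stop_words)]
  # Replace every word that has an entry in the table
  hit <- words %in% names(ltwa_sketch)
  words[hit] <- unname(ltwa_sketch[words[hit]])
  paste(words, collapse = " ")
}

full_form  <- abbreviate_journal_sketch("Journal of Cleaner Production")
short_form <- abbreviate_journal_sketch("J. Clean. Prod.")
```

Both calls reduce to the same key, so the two spellings of the journal can be compared as equal in later phases.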

Phase 2: Blocking

Citations are grouped into blocks by first author and year. This dramatically reduces computational complexity from O(n^2) to approximately O(k*m^2), where k is the number of blocks and m is the average block size.
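
A small sketch of blocking, assuming the key is simply first author and year pasted together. The data frame cites and its contents are made up for illustration. The final lines count pairwise comparisons with and without blocking.

```r
# Toy data: hypothetical features extracted in Phase 1.
cites <- data.frame(
  first_author = c("ARIA", "ARIA", "GARFIELD", "ARIA"),
  year         = c("2017", "2017", "1972", "2018"),
  stringsAsFactors = FALSE
)

# Assumed key format: "<first author>_<year>"
cites$blocking_key <- paste(cites$first_author, cites$year, sep = "_")

# Row indices grouped by block
blocks <- split(seq_len(nrow(cites)), cites$blocking_key)

# Number of pairwise comparisons without and with blocking
n <- nrow(cites)
all_pairs   <- n * (n - 1) / 2
block_pairs <- sum(vapply(
  blocks,
  function(i) length(i) * (length(i) - 1) / 2,
  numeric(1)
))
```

With these four rows, comparing everything against everything needs 6 comparisons, while blocking leaves a single comparison inside the ARIA_2017 block.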

Phase 3: Within-Block Matching

Within each block, citations are compared using string distance metrics and hierarchical clustering. For blocks larger than 500 citations, exact matching on normalized strings is used instead to maintain performance.
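
The clustering step inside one block might look like the sketch below. To stay free of extra packages it uses base R's adist() (Levenshtein distance, scaled by the longer string) in place of the default Jaro-Winkler metric. Complete linkage and a cut height of 1 - threshold are assumptions, not confirmed details of the function.

```r
# One hypothetical block of normalized citations (same author and year).
block <- c(
  "ARIA M 2017 J INFORMETR V11",
  "ARIA M 2017 J INFORMETRICS V11",
  "ARIA M 2017 SCIENTOMETRICS V113"
)
threshold <- 0.85

# Levenshtein distances scaled to [0, 1] by the longer string of each pair
d       <- adist(block)
max_len <- outer(nchar(block), nchar(block), pmax)
d_norm  <- d / max_len

# Assumed: complete linkage, cut where similarity falls below the threshold
hc         <- hclust(as.dist(d_norm), method = "complete")
cluster_id <- cutree(hc, h = 1 - threshold)
```

The first two strings differ only by the "ICS" suffix (scaled distance 0.1) and fall into one cluster. The third string is too far from both and stays on its own.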

Phase 4: Canonical Representative Selection

For each cluster, the most complete citation (prioritizing those with volume and page information) is selected as the canonical representative.
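
One way to express "most complete" is a simple score. The scoring rule below (one point each for a volume token and a page token, ties broken by string length) is an assumption made for illustration. The source does not specify the exact rule.

```r
# Hypothetical cluster of variants of the same citation.
variants <- c(
  "ARIA M, 2017, J INFORMETR",
  "ARIA M, 2017, J INFORMETR, V11, P959",
  "ARIA M, 2017, J INFORMETR, V11"
)

# Assumed completeness score: volume and page tokens each add one point
has_volume <- grepl("\\bV[0-9]+", variants, perl = TRUE)
has_pages  <- grepl("\\bP[0-9]+", variants, perl = TRUE)
score      <- has_volume + has_pages

# Highest score first; longer strings win ties
best         <- order(-score, -nchar(variants))[1]
CR_canonical <- variants[best]
```

Here the second variant carries both volume and pages, so it becomes the representative for the whole cluster.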

Phase 5: Post-Processing

Citations sharing the same first author, year, journal, and volume are merged into a single cluster, even if they weren't matched in Phase 3. This catches cases where minor title variations prevented matching.
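
The merge can be sketched as a relabeling by a composite key. The toy data frame and the choice to reuse the smallest cluster_id per key are assumptions. A real implementation would also need to skip rows with missing fields, since paste() turns NA into the string "NA".

```r
# Toy Phase 3 output: rows 1 and 2 were not matched but share all key fields.
res <- data.frame(
  cluster_id   = c(1, 2, 3),
  first_author = c("ARIA", "ARIA", "GARFIELD"),
  year         = c("2017", "2017", "1972"),
  journal_iso4 = c("J INFORMETR", "J INFORMETR", "SCIENCE"),
  volume       = c("11", "11", "178"),
  stringsAsFactors = FALSE
)

# Composite key: first author, year, journal, volume
merge_key <- paste(res$first_author, res$year,
                   res$journal_iso4, res$volume, sep = "|")

# Assumed rule: rows sharing a key take the smallest cluster_id among them
res$cluster_id <- ave(res$cluster_id, merge_key, FUN = min)

# Recompute cluster sizes after merging
res$n_cluster <- ave(res$cluster_id, res$cluster_id, FUN = length)
```

After the merge, the two ARIA rows share cluster 1 with a size of 2, and the GARFIELD row keeps its own cluster.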

References

Aria, M. & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959-975.

Examples

if (FALSE) {
# Load bibliometrix data
data(scientometrics, package = "bibliometrixData")

# Extract and normalize citations
CR_vector <- unlist(strsplit(scientometrics$CR, ";"))
CR_vector <- trimws(CR_vector)

# Perform normalization with default threshold
matched <- normalize_citations(CR_vector)

# View matching statistics
table(matched$n_cluster)

# Find all variants of a specific citation
subset(matched, cluster_id == matched$cluster_id[1])

# Use more conservative matching
matched_conservative <- normalize_citations(CR_vector, threshold = 0.90)
}
