applyCitationMatching: Apply citation normalization to a bibliometrix data frame

Description

This is a convenience wrapper function that applies normalize_citations to a bibliometrix data frame (typically loaded with convert2df). It extracts citations from the CR field, performs normalization and matching, and returns comprehensive results including per-paper citation lists and summary statistics.

Usage

applyCitationMatching(M, threshold = 0.9, method = "jw", min_chars = 20)

Value

A list with four elements:

full_data

A data frame with columns:

SR: Source document identifier
CR: Original citation string
CR_canonical: Canonical (normalized) citation
cluster_id: Unique cluster identifier
n_cluster: Size of the citation cluster
first_author, year, journal, volume: Extracted metadata

summary

A data frame summarizing citation frequencies with columns:

CR_canonical: The canonical citation for each cluster
n: Total number of times this work was cited
n_variants: Number of different formatting variants found
variants_example: Sample of variant formats (up to 3 examples)

Sorted by citation frequency (n) in descending order.
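
To make the meaning of these columns concrete, here is a minimal base R sketch that derives a table of the same shape from a toy stand-in for full_data. The toy values, the name summary_sketch, and the " | " separator used for variants_example are assumptions made for illustration; the function's internal code may differ.

```r
# Toy stand-in for results$full_data (invented values, illustration only)
full_data <- data.frame(
  SR = c("DOC A", "DOC B", "DOC C", "DOC C"),
  CR = c("SMITH J, 2010, J INFORMETR, V4, P1",
         "SMITH J., 2010, J. INFORMETR., V4, P1",
         "SMITH J, 2010, J INFORMETR, V4, P1",
         "LEE K, 2015, SCIENTOMETRICS, V9, P20"),
  CR_canonical = c("SMITH J, 2010, J INFORMETR, V4, P1",
                   "SMITH J, 2010, J INFORMETR, V4, P1",
                   "SMITH J, 2010, J INFORMETR, V4, P1",
                   "LEE K, 2015, SCIENTOMETRICS, V9, P20"),
  stringsAsFactors = FALSE
)

# Group the original citation strings by their canonical form
groups <- split(full_data$CR, full_data$CR_canonical)

summary_sketch <- data.frame(
  CR_canonical = names(groups),
  # n: every occurrence counts, regardless of formatting
  n = unname(vapply(groups, length, integer(1))),
  # n_variants: distinct original strings within the cluster
  n_variants = unname(vapply(groups, function(x) length(unique(x)), integer(1))),
  # variants_example: at most 3 distinct variants (separator is an assumption)
  variants_example = unname(vapply(
    groups,
    function(x) paste(head(unique(x), 3), collapse = " | "),
    character(1)
  )),
  stringsAsFactors = FALSE
)

# Sort by citation frequency, most cited first
summary_sketch <- summary_sketch[order(-summary_sketch$n), ]
```

In this toy data the Smith reference is cited three times in two different formats, so it ends up in the first row with n = 3 and n_variants = 2.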

matched_citations

Complete output from normalize_citations, useful for detailed analysis of the matching process.

CR_normalized

A data frame with columns:

SR: Source document identifier
CR: Reconstructed CR field with normalized citations (semicolon-separated)
n_references: Number of unique references after normalization

This can be merged back with M to replace the original CR field.
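
As a rough illustration of what this element contains and how it can be merged back, the base R sketch below rebuilds a per-document CR string from canonical citations and joins it to a toy data frame. The objects full_data, M_toy and CR_sketch, the TI column values, and the ";" separator are assumptions made for this example, not the function's actual internals.

```r
# Toy stand-in for the SR / CR_canonical columns of results$full_data
full_data <- data.frame(
  SR = c("DOC A", "DOC A", "DOC A", "DOC B"),
  CR_canonical = c("SMITH J, 2010, J INFORMETR, V4, P1",
                   "SMITH J, 2010, J INFORMETR, V4, P1",
                   "LEE K, 2015, SCIENTOMETRICS, V9, P20",
                   "SMITH J, 2010, J INFORMETR, V4, P1"),
  stringsAsFactors = FALSE
)

# Collect canonical citations per source document
per_doc <- split(full_data$CR_canonical, full_data$SR)

CR_sketch <- data.frame(
  SR = names(per_doc),
  # Keep each canonical citation once per document, joined by semicolons
  CR = unname(vapply(per_doc,
                     function(x) paste(unique(x), collapse = ";"),
                     character(1))),
  n_references = unname(vapply(per_doc,
                               function(x) length(unique(x)),
                               integer(1))),
  stringsAsFactors = FALSE
)

# Toy stand-in for M: keep the original field as CR_orig, then merge on SR
M_toy <- data.frame(
  SR = c("DOC A", "DOC B"),
  TI = c("FIRST TITLE", "SECOND TITLE"),
  CR = c("raw references of DOC A", "raw references of DOC B"),
  stringsAsFactors = FALSE
)
names(M_toy)[names(M_toy) == "CR"] <- "CR_orig"
M_norm <- merge(M_toy, CR_sketch, by = "SR", all.x = TRUE)
```

Using all.x = TRUE keeps documents that have no normalized references; their CR value is NA after the merge.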

Arguments

M

A bibliometrix data frame, typically created by convert2df. Must contain the columns:

SR: Short reference identifier for each document
CR: Cited references field (citations separated by semicolons)
DB: (Optional) Database source identifier for format detection

threshold

Numeric value between 0 and 1 indicating the similarity threshold for matching citations. Default is 0.9, as shown in Usage. See normalize_citations for details on selecting appropriate thresholds.

method

String distance method to use for fuzzy matching. Options include:

"jw" (default): Jaro-Winkler distance, optimized for bibliographic strings
"lv": Levenshtein distance
Other methods supported by stringdistmatrix (from the stringdist package)
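
To give a feel for how a similarity threshold interacts with a string distance, here is a small sketch that uses base R's utils::adist (Levenshtein distance) so it runs without extra packages. The helper lev_similarity and the way the distance is scaled to a 0-1 similarity are assumptions for illustration; the function itself relies on the stringdist package and may scale distances differently.

```r
ref_a <- "SMITH J, 2010, J INFORMETR, V4, P1"
ref_b <- "SMITH J., 2010, J. INFORMETR., V4, P1"   # same work, extra periods
ref_c <- "LEE K, 2015, SCIENTOMETRICS, V9, P20"    # a different work

# Hypothetical helper: Levenshtein distance rescaled to a 0-1 similarity
lev_similarity <- function(x, y) {
  1 - drop(utils::adist(x, y)) / max(nchar(x), nchar(y))
}

sim_ab <- lev_similarity(ref_a, ref_b)  # 3 insertions over 37 characters
sim_ac <- lev_similarity(ref_a, ref_c)

threshold <- 0.9
same_work_ab <- sim_ab >= threshold     # variants differ only in punctuation
same_work_ac <- sim_ac >= threshold     # unrelated references
```

Under these assumptions the two Smith variants clear the 0.9 threshold while the unrelated reference does not. Lowering the threshold merges more aggressively and raises the risk of grouping distinct works together.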

min_chars

Minimum number of characters a string must contain to be treated as a valid citation. Default is 20.

Details

The function automatically handles the new Scopus citation format (where the year appears at the end in parentheses) by converting it to the classic format before processing.

The function performs the following steps:

Splits the CR field by semicolons to extract individual citations
Detects and converts new Scopus format citations to classic format
Trims whitespace from each citation
Applies normalize_citations to identify duplicate citations
Links normalized citations back to source documents (SR)
Generates summary statistics and reconstructs normalized CR fields
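
The splitting, trimming and length-filtering steps above can be sketched in base R as follows. This is only an approximation under stated assumptions: M_toy and long are hypothetical names, the Scopus format conversion and the call to normalize_citations are omitted, and whether the length check is inclusive (>=) is a guess.

```r
# Toy stand-in for a bibliometrix data frame with SR and CR columns
M_toy <- data.frame(
  SR = c("DOC A", "DOC B"),
  CR = c("SMITH J, 2010, J INFORMETR, V4, P1;  LEE K, 2015, SCIENTOMETRICS, V9, P20 ; ANON",
         "SMITH J., 2010, J. INFORMETR., V4, P1"),
  stringsAsFactors = FALSE
)
min_chars <- 20

# Split each CR field on semicolons
refs <- strsplit(M_toy$CR, ";", fixed = TRUE)

# One row per (document, citation) pair, with whitespace trimmed
long <- data.frame(
  SR = rep(M_toy$SR, lengths(refs)),
  CR = trimws(unlist(refs)),
  stringsAsFactors = FALSE
)

# Drop strings too short to be a plausible citation (assumed inclusive bound)
long <- long[nchar(long$CR) >= min_chars, ]
```

The resulting long table, one citation per row keyed by SR, is the kind of input on which duplicate detection and the later link back to source documents can operate.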

The normalized CR field can be used to replace the original CR field in subsequent bibliometric analyses, ensuring that citation counts and network analyses are not inflated by duplicate citations with minor formatting differences.

References

Aria, M. & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959-975.

Examples


if (FALSE) {
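# Attach the packages used below (%>%, rename() and left_join() come from dplyr)
library(bibliometrix)
library(dplyr)

# Load bibliometric data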
file <- "https://www.bibliometrix.org/datasets/savedrecs.txt"
M <- convert2df(file, dbsource = "wos", format = "plaintext")

# Apply citation normalization
results <- applyCitationMatching(M, threshold = 0.85)

# View top cited works (after normalization)
head(results$summary, 20)

# See how many variants were found for the top citation
top_citation <- results$summary$CR_canonical[1]
variants <- subset(results$full_data, CR_canonical == top_citation)
unique(variants$CR)

# Replace original CR with normalized CR in the data frame
M_normalized <- M %>%
  rename(CR_orig = CR) %>%
  left_join(results$CR_normalized, by = "SR")

# Compare citation counts before and after normalization
original_citations <- strsplit(M$CR, ";") %>%
  unlist() %>%
  trimws() %>%
  table() %>%
  length()

normalized_citations <- nrow(results$summary)

cat("Original unique citations:", original_citations, "\n")
cat("After normalization:", normalized_citations, "\n")
cat("Duplicates found:", original_citations - normalized_citations, "\n")

# Use normalized data for further analysis
CR_analysis <- citations(M_normalized, field = "article", sep = ";")
}
