searchAnalyzeR (version 0.1.0)

detect_dupes: Detect and Remove Duplicate Records

Description

Detects duplicate records in a standardized search results data frame and removes them, using exact, fuzzy, or DOI-based matching.

Usage

detect_dupes(results, method = "exact", similarity_threshold = 0.85)

Value

Data frame with duplicates marked and removed

Arguments

results

Standardized search results data frame

method

Method for duplicate detection: one of "exact", "fuzzy", or "doi". Defaults to "exact".

similarity_threshold

Similarity threshold for fuzzy matching, between 0 and 1. Defaults to 0.85.

Details

This function provides three methods for duplicate detection:

  • exact: Matches on title and first 100 characters of abstract

  • fuzzy: Uses Jaro-Winkler string distance for similarity matching

  • doi: Matches based on cleaned DOI strings
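The exact and doi rules above can be illustrated with a small base R sketch. This is not the package's implementation. The column names (title, abstract, doi), the helper names exact_key and clean_doi, the lowercasing and whitespace trimming, and the DOI cleaning rules (stripping a resolver URL prefix or a "doi:" label) are all assumptions made for illustration.

```r
# Hypothetical example data; the column names are assumed, not taken from the package.
results <- data.frame(
  title    = c("Deep learning for triage", "Deep learning for triage", "A different study"),
  abstract = c("We evaluate a model.", "We evaluate a model.", "Unrelated abstract."),
  doi      = c("https://doi.org/10.1000/ABC123", "doi:10.1000/abc123", "10.1000/xyz789"),
  stringsAsFactors = FALSE
)

# "exact"-style key: title plus the first 100 characters of the abstract.
# Lowercasing and trimming are assumed normalization steps.
exact_key <- function(df) {
  paste(
    tolower(trimws(df$title)),
    substr(tolower(trimws(df$abstract)), 1, 100),
    sep = "||"
  )
}

# "doi"-style key: lowercase, then strip a resolver prefix or "doi:" label
# (assumed cleaning rules).
clean_doi <- function(x) {
  x <- tolower(trimws(x))
  x <- sub("^https?://(dx[.])?doi[.]org/", "", x)
  sub("^doi:[[:space:]]*", "", x)
}

# duplicated() flags every occurrence after the first one.
dup_exact <- duplicated(exact_key(results))
dup_doi   <- duplicated(clean_doi(results$doi))

# Keep only the first record of each DOI group.
deduplicated <- results[!dup_doi, ]
```

In this sketch the second row is flagged under both rules, because its title and abstract repeat the first row and its DOI reduces to the same cleaned string.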

For fuzzy matching, similarity_threshold should be between 0 and 1, where 1 means identical strings. A threshold of 0.85 typically works well for academic titles.
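To make the threshold concrete, the following is a minimal base R sketch of Jaro-Winkler similarity (1 minus the distance) applied to a few titles. It is written from the standard definition of the measure and is not the package's code. The function name jaro_winkler, the prefix weight p = 0.1, the example titles, and the pairwise loop are assumptions for illustration. In practice a string-distance package such as stringdist would normally be used instead of a hand-written function; which one detect_dupes relies on is not stated here.

```r
# Jaro-Winkler similarity between two strings (1 = identical).
# p is the prefix weight; 0.1 is the conventional value (assumed here).
jaro_winkler <- function(a, b, p = 0.1) {
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0 && n2 == 0) return(1)
  if (n1 == 0 || n2 == 0) return(0)

  # Characters count as matching only within this distance of each other.
  window <- max(max(n1, n2) %/% 2 - 1, 0)
  m1 <- logical(n1)
  m2 <- logical(n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!m2[j] && s1[i] == s2[j]) {
        m1[i] <- TRUE
        m2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(m1)
  if (m == 0) return(0)

  # Transpositions: matched characters that appear in a different order.
  t <- sum(s1[m1] != s2[m2]) / 2
  jaro <- (m / n1 + m / n2 + (m - t) / m) / 3

  # Winkler adjustment: reward a shared prefix of up to 4 characters.
  k <- min(4, n1, n2)
  same <- s1[seq_len(k)] == s2[seq_len(k)]
  prefix <- if (all(same)) k else which(!same)[1] - 1
  jaro + prefix * p * (1 - jaro)
}

# Hypothetical titles; the second is a typo variant of the first.
titles <- c(
  "Machine learning for systematic reviews",
  "Machine learning for systematic reveiws",
  "Soil carbon dynamics in boreal forests"
)
similarity_threshold <- 0.85

# Flag a title as a duplicate if it is similar enough to any earlier title.
is_dup <- logical(length(titles))
for (i in seq_along(titles)[-1]) {
  sims <- vapply(titles[seq_len(i - 1)], jaro_winkler, numeric(1), b = titles[i])
  is_dup[i] <- any(sims >= similarity_threshold)
}
```

Raising similarity_threshold toward 1 makes the comparison stricter, so fewer near-matches are flagged. Lowering it catches more spelling variants at the risk of merging distinct records.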