bibliometrix (version 5.4.1)

completeMetadata: Complete missing metadata via DOI lookup against Crossref and OpenAlex

Description

Given a bibliometrix collection produced by convert2df, this function selects the records that have a DOI but are missing one or more analysis-relevant fields. It looks up each DOI in the Crossref REST API (https://api.crossref.org/works) and/or OpenAlex (via openalexR) and fills the gaps with the values those sources return. Existing non-empty values are never overwritten.

Usage

completeMetadata(
  M,
  sources = c("openalex", "crossref"),
  fields = c("AB", "AU", "C1", "CR", "DT", "LA", "PY", "RP", "SO", "TC", "TI"),
  email = NULL,
  oa_apikey = NULL,
  batch_size = 20,
  max_records = Inf,
  progress = NULL,
  verbose = TRUE
)

Value

A list with components:

M

The enriched collection (same class as the input).

report

Long-format data.frame with one row per (field, source) summarising attempts, fills, and failures.

before

The mandatoryTags table from missingData(M) before enrichment.

after

The mandatoryTags table from missingData(M) after enrichment.

Provenance is attached to the returned collection as attr(M, "enrichment"), a long-format data.frame with columns SR, field, source, timestamp.
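
The provenance table can be summarised with base R. The sketch below uses a small hand-made stand-in for attr(res$M, "enrichment"): the SR values, the timestamp, and the rows are invented for illustration, and only the four column names come from the description above.

```r
# Hypothetical stand-in for attr(res$M, "enrichment"); only the column
# names (SR, field, source, timestamp) are taken from the documentation.
enrichment <- data.frame(
  SR = c("DOE J, 2020, J EXAMPLE", "DOE J, 2020, J EXAMPLE", "ROE A, 2019, J EXAMPLE"),
  field = c("AB", "TC", "AB"),
  source = c("openalex", "openalex", "crossref"),
  timestamp = as.POSIXct("2024-01-01 12:00:00", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Cross-tabulate how many cells each source filled, per field.
fills <- table(field = enrichment$field, source = enrichment$source)
print(fills)

# Records that received at least one filled field.
enriched_records <- unique(enrichment$SR)
```

On a real result, replacing the stand-in with attr(res$M, "enrichment") should give the same kind of summary, assuming the attribute has the documented columns.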

Arguments

M

Bibliometrix data frame produced by convert2df.

sources

Character vector of enrichment sources. Default c("openalex", "crossref"). Order is irrelevant; OpenAlex always runs before Crossref. "openalex" is skipped if M$DB[1] == "OPENALEX".

fields

Character vector of WoS-codified fields to attempt to fill. Default c("AB","AU","C1","CR","DT","LA","PY","RP","SO","TC","TI"). TC is filled only by OpenAlex.

email

Optional contact email used as the Crossref polite-pool identifier and as the OpenAlex mailto parameter. If NULL, the function falls back to the environment variable BIBLIOMETRIX_EMAIL, then to the persisted file ~/.biblio_openalex_email.txt.

oa_apikey

Optional OpenAlex API key. If NULL, the function reads Sys.getenv("openalexR.apikey") and falls back to ~/.biblio_openalex_apikey.txt. The OpenAlex pass works without a key (lower rate limit).
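
The fallback order described for email and oa_apikey (explicit argument, then environment variable, then persisted file) can be pictured with a small helper. This is only a sketch of that lookup order: resolve_setting is a hypothetical name, not a bibliometrix function, and the package's internal logic may differ.

```r
# Hypothetical helper (not part of bibliometrix): return the first available
# value among an explicit argument, an environment variable, and the first
# line of a file.
resolve_setting <- function(value = NULL, env_var, file) {
  if (!is.null(value) && nzchar(value)) return(value)

  from_env <- Sys.getenv(env_var, unset = "")
  if (nzchar(from_env)) return(from_env)

  path <- path.expand(file)
  if (file.exists(path)) {
    first_line <- trimws(readLines(path, n = 1, warn = FALSE))
    if (length(first_line) == 1 && nzchar(first_line)) return(first_line)
  }
  NULL
}

# Setting the environment variable once per session avoids passing
# email on every call.
Sys.setenv(BIBLIOMETRIX_EMAIL = "you@example.com")
email <- resolve_setting(NULL, "BIBLIOMETRIX_EMAIL", "~/.biblio_openalex_email.txt")
```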

batch_size

Number of DOIs per Crossref batch request (default 20). OpenAlex uses a fixed batch size of 50 (the maximum that keeps URLs under length limits).
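
To see what batch_size implies for the number of requests, the chunking arithmetic can be reproduced in base R. The DOIs below are placeholders, and the split shown is an assumed way of forming batches, not necessarily how the function does it internally.

```r
# Hypothetical DOIs, used only to show the chunking arithmetic.
dois <- sprintf("10.1234/example.%03d", 1:45)
batch_size <- 20

# Assign each DOI a batch index (1, 1, ..., 2, 2, ..., 3) and split on it.
batches <- split(dois, ceiling(seq_along(dois) / batch_size))

# Number of DOIs in each batch.
lengths(batches)
```

With 45 DOIs and a batch size of 20, this gives three batches, the last one partially filled.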

max_records

Optional cap on the number of records to enrich (useful for previewing). Default Inf.

progress

Optional callback function(done, total, label) invoked after each batch. Used by biblioshiny to drive a progress bar.
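
A callback matching the documented function(done, total, label) signature might look like the sketch below. The name my_progress and the label value "crossref" are assumptions made for illustration; the labels the function actually passes are not documented here.

```r
# Hypothetical progress callback: formats one status line per batch.
my_progress <- function(done, total, label) {
  line <- sprintf("[%s] %d/%d (%.0f%%)", label, done, total, 100 * done / total)
  message(line)
  invisible(line)
}

# Simulated invocation, as if 20 of 100 records had been processed.
my_progress(20, 100, "crossref")
```

It would then be passed as completeMetadata(M, progress = my_progress).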

verbose

Logical. If TRUE (the default), print progress messages to the console.

Details

When both sources are enabled, OpenAlex runs first (broader coverage of AB/CR/C1/TC) and Crossref then fills the residual gaps. If the input collection was originally imported from OpenAlex (M$DB[1] == "OPENALEX"), the OpenAlex pass is automatically skipped because re-querying it would not add information.
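
The ordering and skip rule above can be expressed compactly. The helper below is a hypothetical illustration of that rule (effective_sources is not a bibliometrix function), applied to toy one-row data frames.

```r
# Hypothetical helper: which sources run, and in which order.
effective_sources <- function(M, sources = c("openalex", "crossref")) {
  # OpenAlex always precedes Crossref, whatever order the user supplied.
  ordered <- intersect(c("openalex", "crossref"), sources)
  # Skip OpenAlex when the collection was imported from OpenAlex.
  if (identical(as.character(M$DB[1]), "OPENALEX")) {
    ordered <- setdiff(ordered, "openalex")
  }
  ordered
}

from_openalex <- data.frame(DB = "OPENALEX", stringsAsFactors = FALSE)
from_wos <- data.frame(DB = "ISI", stringsAsFactors = FALSE)

effective_sources(from_openalex)
effective_sources(from_wos, sources = c("crossref", "openalex"))
```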

The vacancy predicate matches the one used by missingData: a cell is considered missing when it is NA or one of c("", "NA", "none", "NA,0000,NA").
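
A minimal sketch of this predicate, combined with the rule that existing values are never overwritten, is shown below. The names is_vacant and fill_gaps are hypothetical, and exact string matching (no trimming or case folding) is an assumption; the package's own implementation may differ.

```r
# Values treated as missing, as listed in the documentation.
vacant_values <- c("", "NA", "none", "NA,0000,NA")

# Hypothetical predicate: TRUE where a cell counts as missing.
# Assumes exact string matching.
is_vacant <- function(x) is.na(x) | x %in% vacant_values

# Hypothetical fill step: take the candidate only where the current cell is
# vacant and the candidate itself is not.
fill_gaps <- function(current, candidate) {
  use_candidate <- is_vacant(current) & !is_vacant(candidate)
  current[use_candidate] <- candidate[use_candidate]
  current
}

current <- c("An existing title", NA, "none", "")
candidate <- c("A different title", "Recovered title", NA, "Another title")
filled <- fill_gaps(current, candidate)
```

Here the first element keeps its existing value, the second and fourth are filled, and the third stays vacant because the candidate is also missing.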

Crossref cannot supply author keywords (DE), Keywords Plus (ID), Web of Science categories (WC), or citation counts (TC). OpenAlex covers TC, AB, AU, C1, CR, DT, LA, PY, RP, SO, and TI well. Its keywords are AI-derived topic labels rather than author keywords, so DE is not among the default fields. ID and WC are always skipped.
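
The coverage rules in this paragraph amount to a per-source filter on the requested fields. The sketch below encodes only what is stated above; fields_for_source is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: which of the requested fields a source can attempt.
fields_for_source <- function(fields, source) {
  # ID and WC are always skipped, whatever the source.
  fields <- setdiff(fields, c("ID", "WC"))
  # Crossref cannot supply citation counts or author keywords.
  if (identical(source, "crossref")) {
    fields <- setdiff(fields, c("TC", "DE"))
  }
  fields
}

requested <- c("AB", "TC", "WC")
fields_for_source(requested, "openalex")
fields_for_source(requested, "crossref")
```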

Examples

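if (FALSE) {
# Not run: requires network access to OpenAlex and Crossref.
data(scientometrics, package = "bibliometrixData")
res <- completeMetadata(scientometrics, email = "you@example.com")
res$report                        # attempts, fills, and failures per (field, source)
res$after                         # mandatoryTags table after enrichment
head(attr(res$M, "enrichment"))   # provenance: SR, field, source, timestamp
}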
