
contentanalysis (version 0.2.1)

extract_pdf_metadata: Extract DOI and Metadata from PDF

Description

This function extracts the Digital Object Identifier (DOI) and other metadata from a PDF file using pdftools::pdf_info(). It searches through all metadata fields including the XMP metadata XML.

Usage

extract_pdf_metadata(pdf_path, fields = "doi", return_all_dois = FALSE)

Value

If fields = "doi" (default), returns a character string with the DOI or NA_character_ if not found. If multiple fields are requested, returns a named list with the requested metadata. If return_all_dois = TRUE, the DOI element will be a character vector.

Arguments

pdf_path

Character. Path to the PDF file.

fields

Character vector. Metadata fields to extract. Options are: "doi", "title", "authors", "journal", "year", "all". Default is "doi".

return_all_dois

Logical. If TRUE, returns all DOIs found; if FALSE (default), returns only the first article DOI found (excluding journal ISSNs).

Details

The function searches for DOIs in:

  • All fields in the keys list (prioritizing article DOI fields)

  • The XMP metadata XML field

Journal-level identifiers (values containing "(ISSN)" or coming from journal-specific fields) are filtered out automatically, so that only article DOIs are returned.
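The filtering step can be illustrated with a minimal base R sketch. It is not the package's actual implementation: the helper name pick_article_doi(), the DOI regular expression, and the sample keys list are all assumptions made for illustration.

```r
# Hypothetical helper (not part of contentanalysis): pick article DOIs
# from a named list of metadata values such as pdf_info()$keys.
pick_article_doi <- function(keys, return_all = FALSE) {
  # Assumed DOI pattern: "10.", a 4-9 digit registrant code, "/", a suffix
  doi_pattern <- "10\\.[0-9]{4,9}/[^[:space:]\"<>]+"
  values <- as.character(unlist(keys))
  # Drop journal-level entries flagged with "(ISSN)"
  values <- values[!grepl("(ISSN)", values, fixed = TRUE)]
  hits <- unique(unlist(regmatches(values, gregexpr(doi_pattern, values))))
  if (length(hits) == 0) return(NA_character_)
  if (return_all) hits else hits[1]
}

# Made-up metadata for illustration
keys <- list(
  Title   = "A study of something",
  Subject = "Journal of Examples, doi:10.1111/(ISSN)1234-5678",
  doi     = "10.1234/example.2020.001"
)
pick_article_doi(keys)
```

With return_all = TRUE the sketch returns every distinct match instead of only the first, mirroring the documented behaviour of return_all_dois.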

For other metadata:

  • Title: extracted from Title field or dc:title in XMP metadata

  • Authors: extracted from dc:creator in XMP metadata or Author/Creator fields

  • Journal: extracted from the Subject field or prism:publicationName in XMP metadata

  • Year: extracted from prism:coverDate or the created/modified dates (avoiding matches inside DOI patterns)
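As a rough illustration of how fields like these might be pulled from an XMP string, here is a base R sketch. The helper xmp_first() and the sample XML are hypothetical; the package may parse XMP differently.

```r
# Hypothetical helper: return the text content of the first <tag>...</tag>
# element found in an XMP string, or NA if the tag is absent.
xmp_first <- function(xmp, tag) {
  # (?s) lets "." match newlines; ".*?" keeps the match lazy
  pattern <- sprintf("(?s)<%s[^>]*>(.*?)</%s>", tag, tag)
  m <- regmatches(xmp, regexec(pattern, xmp, perl = TRUE))[[1]]
  if (length(m) < 2) return(NA_character_)
  # Strip nested wrappers such as <rdf:Alt><rdf:li ...> and tidy whitespace
  text <- gsub("<[^>]+>", " ", m[2])
  trimws(gsub("[[:space:]]+", " ", text))
}

# Made-up XMP fragment for illustration
xmp <- '<rdf:Description>
  <dc:title><rdf:Alt><rdf:li xml:lang="x-default">An Example Article</rdf:li></rdf:Alt></dc:title>
  <prism:publicationName>Journal of Examples</prism:publicationName>
  <prism:coverDate>2021-03-15</prism:coverDate>
</rdf:Description>'

title   <- xmp_first(xmp, "dc:title")
journal <- xmp_first(xmp, "prism:publicationName")

# Take the first four-digit year from the cover date
cover_date <- xmp_first(xmp, "prism:coverDate")
year <- regmatches(cover_date, regexpr("(19|20)[0-9]{2}", cover_date))
```

In practice the XMP string would come from the metadata element returned by pdftools::pdf_info().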

Common DOI prefixes are automatically removed. The function uses regex pattern matching to validate DOI format and extract structured data from XMP XML.
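Prefix removal and format validation could look roughly like the sketch below. The helper normalize_doi(), the set of prefixes it strips, and the validation pattern are assumptions for illustration, not the package's exact rules.

```r
# Hypothetical helper: strip common DOI prefixes and validate the result.
normalize_doi <- function(x) {
  x <- trimws(x)
  # Remove resolver URLs and "doi:" labels (case-insensitive)
  x <- sub("^(https?://(dx\\.)?doi\\.org/|doi:[[:space:]]*)", "", x,
           ignore.case = TRUE)
  # Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", a suffix
  is_valid <- grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", x)
  ifelse(is_valid, x, NA_character_)
}

normalize_doi(c("https://doi.org/10.1234/abc.5678",
                "doi: 10.5555/xyz",
                "not a doi"))
```

Values that do not look like a DOI after prefix removal come back as NA_character_, matching the documented "not found" return value.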

See Also

pdf_info

Examples

if (FALSE) {
# Extract only DOI
doi <- extract_pdf_metadata("path/to/paper.pdf")

# Extract multiple metadata fields
meta <- extract_pdf_metadata("path/to/paper.pdf",
                             fields = c("doi", "title", "journal"))

# Extract all available metadata
meta <- extract_pdf_metadata("path/to/paper.pdf", fields = "all")
}
