contentanalysis (version 0.2.0)

extract_pdf_metadata: Extract DOI and Metadata from PDF

Description

This function extracts the Digital Object Identifier (DOI) and other metadata from a PDF file using pdftools::pdf_info(). It searches through all metadata fields including the XMP metadata XML.

Usage

extract_pdf_metadata(pdf_path, fields = "doi", return_all_dois = FALSE)

Value

If fields = "doi" (default), returns a single character string containing the DOI, or NA_character_ if no DOI is found. If multiple fields are requested, returns a named list with one element per requested field. If return_all_dois = TRUE, the DOI element is a character vector of all DOIs found.

Arguments

pdf_path

Character. Path to the PDF file.

fields

Character vector. Metadata fields to extract. Options are: "doi", "title", "authors", "journal", "year", "all". Default is "doi".

return_all_dois

Logical. If TRUE, returns all DOIs found; if FALSE (default), returns only the first article DOI found (excluding journal ISSNs).

Details

The function searches for DOIs in:

  • All fields in the keys list (prioritizing article DOI fields)

  • The XMP metadata XML field

Journal-level identifiers (DOIs containing "(ISSN)" or values taken from journal-specific fields) are filtered out automatically, so that only article DOIs are returned.
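The search described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the package's actual implementation: the helper name find_doi_candidates, the simplified DOI pattern, and the mock info list are all hypothetical. The list only mimics the keys and metadata elements of the value returned by pdftools::pdf_info(), and the prioritization of article DOI fields is omitted.

```r
# Hypothetical helper: collect article DOI candidates from a pdf_info()-style list.
find_doi_candidates <- function(info) {
  # Simplified DOI pattern (an assumption, not the package's regex)
  doi_pattern <- "10\\.[0-9]{4,9}/[^[:space:]\"<>]+"

  # Search the text of every key plus the XMP metadata XML
  sources <- c(unlist(info$keys), xmp = info$metadata)

  # Pull every DOI-like match out of the collected text
  hits <- as.character(unlist(regmatches(sources, gregexpr(doi_pattern, sources))))

  # Drop journal-level identifiers such as "10.1234/(ISSN)1234-5678"
  hits <- hits[!grepl("(ISSN)", hits, fixed = TRUE)]
  unique(hits)
}

# Mock of the structure returned by pdftools::pdf_info(); all values are invented
info <- list(
  keys = list(
    Title   = "An example paper",
    doi     = "10.1234/example.2024.001",
    Subject = "Journal of Examples 10.1234/(ISSN)1234-5678"
  ),
  metadata = "<rdf:Description><prism:doi>10.1234/example.2024.001</prism:doi></rdf:Description>"
)

candidates <- find_doi_candidates(info)  # all article DOIs found
first_doi  <- candidates[1]              # analogous to return_all_dois = FALSE
```

Matching first and filtering afterwards means a single journal identifier in the XMP does not cause the whole XMP field to be discarded.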

For other metadata:

  • Title: extracted from Title field or dc:title in XMP metadata

  • Authors: extracted from Author/Creator fields or dc:creator in XMP metadata

  • Journal: extracted from the Subject field or prism:publicationName in XMP metadata

  • Year: extracted from the created/modified dates, prism:coverDate in XMP metadata, or the title
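As a rough illustration of how fields such as dc:title or prism:coverDate might be pulled out of the XMP XML with regular expressions, consider the sketch below. The helper xmp_field and the sample XMP string are hypothetical; the package's own extraction logic may differ.

```r
# Hypothetical helper: return the text inside the first <tag>...</tag> pair,
# with nested markup removed, or NA_character_ if the tag is absent.
xmp_field <- function(xmp, tag) {
  # (?s) lets "." span line breaks; .*? keeps the match non-greedy
  pattern <- paste0("(?s)<", tag, "[^>]*>(.*?)</", tag, ">")
  m <- regmatches(xmp, regexec(pattern, xmp, perl = TRUE))[[1]]
  if (length(m) < 2) return(NA_character_)
  trimws(gsub("<[^>]+>", "", m[2]))
}

# Invented XMP fragment for demonstration only
xmp <- paste0(
  "<rdf:Description>",
  "<dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">An example paper</rdf:li></rdf:Alt></dc:title>",
  "<prism:publicationName>Journal of Examples</prism:publicationName>",
  "<prism:coverDate>2024-03-15</prism:coverDate>",
  "</rdf:Description>"
)

title   <- xmp_field(xmp, "dc:title")
journal <- xmp_field(xmp, "prism:publicationName")

# Take the first four-digit year from the cover date
cover <- xmp_field(xmp, "prism:coverDate")
year  <- regmatches(cover, regexpr("(19|20)[0-9]{2}", cover))
```

A full XML parser would be more robust than regular expressions; the regex approach is shown here only because the Details section describes pattern matching.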

Common DOI prefixes are automatically removed. The function uses regex pattern matching to validate DOI format and extract structured data from XMP XML.
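Prefix removal and format validation could look something like the following. The helper clean_doi, the list of prefixes it strips, and the validation pattern are assumptions made for illustration; the exact prefixes and regex used by the package are not documented here.

```r
# Hypothetical helper: strip common DOI prefixes and validate the result.
clean_doi <- function(x) {
  x <- trimws(x)
  # Remove resolver URLs and "doi:" labels (illustrative list of prefixes)
  x <- sub("^(https?://(dx\\.)?doi\\.org/|doi:[[:space:]]*)", "", x,
           ignore.case = TRUE)
  # Keep only strings shaped like "10.<registrant>/<suffix>"
  ifelse(grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", x), x, NA_character_)
}

cleaned <- clean_doi(c(
  "https://doi.org/10.1234/example.2024.001",
  "doi:10.1234/example.2024.001",
  "DOI: 10.1234/abc",
  "not a doi"
))
```

Values that do not match the pattern become NA_character_, which mirrors the documented return value when no DOI is found.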

See Also

pdf_info

Examples

if (FALSE) {
# Extract only DOI
doi <- extract_pdf_metadata("path/to/paper.pdf")

# Extract multiple metadata fields
meta <- extract_pdf_metadata("path/to/paper.pdf",
                             fields = c("doi", "title", "journal"))

# Extract all available metadata
meta <- extract_pdf_metadata("path/to/paper.pdf", fields = "all")
}