contentanalysis (version 0.2.1)

pdf2txt_auto: Import PDF with Automatic Section Detection

Description

High-level function that imports a PDF file, extracts its text while handling multi-column layouts, and optionally splits the content into sections. Supports AI-enhanced text extraction through the Google Gemini API and gives control over citation format conversion.

Usage

pdf2txt_auto(
  file,
  n_columns = NULL,
  preserve_structure = TRUE,
  sections = TRUE,
  normalize_refs = TRUE,
  citation_type = c("none", "numeric_superscript", "numeric_bracketed", "author_year"),
  enable_ai_support = FALSE,
  ai_model = "2.5-flash",
  api_key = NULL
)

Value

If sections = TRUE, returns a named list where:

  • The first element Full_text contains the complete document text

  • Subsequent elements contain individual sections (Introduction, Methods, etc.)

If sections = FALSE, returns a character string with the full document text. Returns NA if extraction fails.
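Because the return value can take three shapes (named list, character string, or NA), calling code usually needs to branch on it. The following is a minimal sketch of such a check; describe_extraction() is a hypothetical helper, not part of contentanalysis, and the mock values only stand in for real return values.

```r
# Hypothetical helper (not part of contentanalysis): report which of the
# three documented return shapes was received.
describe_extraction <- function(result) {
  if (is.list(result)) {
    # sections = TRUE: named list with Full_text first, then the sections
    return(paste("sections:", paste(names(result), collapse = ", ")))
  }
  if (length(result) == 1 && is.na(result)) {
    # extraction failed
    return("extraction failed")
  }
  # sections = FALSE: a single character string
  paste("full text,", nchar(result), "characters")
}

# Mock values standing in for what pdf2txt_auto() is documented to return
mock_sections <- list(
  Full_text = "Intro text. Methods text.",
  Introduction = "Intro text.",
  Methods = "Methods text."
)

describe_extraction(mock_sections)
describe_extraction("Intro text.")
describe_extraction(NA)
```

The is.list() test comes first so that is.na() is never applied to a whole list.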

Arguments

file

Character. Path to the PDF file to be processed.

n_columns

Integer or NULL. Number of columns in the PDF layout. Default is NULL (automatic detection).

preserve_structure

Logical. If TRUE, preserves paragraph structure and formatting. Default is TRUE.

sections

Logical. If TRUE, splits the document into sections based on headers. Default is TRUE.

normalize_refs

Logical. If TRUE, normalizes reference formatting in the document. Default is TRUE.

citation_type

Character. Type of citations used in the document. Options are:

  • "none": No citation conversion (default)

  • "numeric_superscript": Numeric citations in superscript format; these are converted to bracket notation

  • "numeric_bracketed": Numeric citations already in brackets

  • "author_year": Author-year citations (e.g., Smith, 2020)

This parameter helps avoid false positives in citation detection. Only specify "numeric_superscript" if your document uses superscript numbers for citations.
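The vector default shown under Usage is the usual R idiom for an argument resolved with match.arg(), where the first element acts as the default. Whether pdf2txt_auto resolves citation_type this way internally is an assumption; the sketch below only illustrates the idiom with a hypothetical wrapper.

```r
# Hypothetical wrapper (not part of contentanalysis) showing how a vector
# default is typically resolved with match.arg().
choose_citation_type <- function(
    citation_type = c("none", "numeric_superscript",
                      "numeric_bracketed", "author_year")) {
  # With no argument supplied, match.arg() returns the first choice;
  # an unknown value raises an error.
  match.arg(citation_type)
}

choose_citation_type()               # falls back to the first choice
choose_citation_type("author_year")  # an explicit, valid choice
```

Under this idiom a misspelled option fails immediately instead of silently disabling conversion.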

enable_ai_support

Logical. If TRUE, enables AI-enhanced text extraction using the Google Gemini API. Default is FALSE.

ai_model

Character. The Gemini model version to use for AI processing. Default is "2.5-flash". See process_large_pdf for available models.

api_key

Character or NULL. Google Gemini API key. If NULL, the function attempts to read from the GEMINI_API_KEY environment variable.
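The lookup order described for api_key (explicit argument first, then the GEMINI_API_KEY environment variable) can be sketched as follows. resolve_gemini_key() is a hypothetical helper written for illustration; the package's own handling, including what happens when no key is found, may differ.

```r
# Hypothetical helper (not part of contentanalysis): prefer an explicit key,
# otherwise read the GEMINI_API_KEY environment variable.
resolve_gemini_key <- function(api_key = NULL) {
  if (is.null(api_key)) {
    api_key <- Sys.getenv("GEMINI_API_KEY", unset = "")
  }
  if (!nzchar(api_key)) {
    # Assumption: stopping here is this sketch's choice, not documented behavior
    stop("No Gemini API key supplied and GEMINI_API_KEY is not set.")
  }
  api_key
}

# Set the variable for the current session (placeholder value)
Sys.setenv(GEMINI_API_KEY = "your-key-here")
key <- resolve_gemini_key()
```

Keeping the key in an environment variable, for example through an .Renviron file, avoids writing it into scripts.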

Details

The function attempts multiple extraction methods:

  1. First tries multi-column extraction with pdf2txt_multicolumn_safe

  2. Falls back to standard pdftools::pdf_text if the first method fails

  3. Optionally applies AI-enhanced extraction if enable_ai_support = TRUE
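The first two steps above follow a try-then-fall-back pattern. Here is a minimal sketch of that pattern using tryCatch(); extract_with_fallback() and the two stand-in extractors are hypothetical and do not call pdf2txt_multicolumn_safe or pdftools, so the snippet runs without a PDF.

```r
# Hypothetical sketch of a fallback chain: try each extractor in order and
# return the first usable result.
extract_with_fallback <- function(file, extractors) {
  for (name in names(extractors)) {
    # An error in one extractor is treated as "no result", not a failure
    text <- tryCatch(extractors[[name]](file), error = function(e) NULL)
    if (!is.null(text) && length(text) > 0 && !all(is.na(text))) {
      return(list(method = name, text = text))
    }
  }
  # Every extractor failed
  list(method = NA_character_, text = NA_character_)
}

# Stand-in extractors (assumptions for illustration only)
failing_multicolumn <- function(file) stop("layout not detected")
plain_text <- function(file) paste("text of", file)

result <- extract_with_fallback(
  "paper.pdf",
  list(multicolumn = failing_multicolumn, standard = plain_text)
)
result$method
```

Recording which method succeeded makes it easier to diagnose poor output from a particular PDF.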

When AI support is enabled and successful, the function:

  • Processes the PDF using process_large_pdf

  • Merges text chunks and converts to appropriate format

  • Preserves the References/Bibliography section from the standard extraction

  • Returns AI-processed content with improved formatting
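The idea of keeping the reference list from the standard extraction while taking everything else from the AI output can be illustrated on plain named lists. This is a rough sketch under assumptions: merge_ai_with_references() is hypothetical, and the element names "References" and "Bibliography" are guesses at how such sections might be labeled, not confirmed package behavior.

```r
# Hypothetical sketch: take AI-processed sections, but keep the reference
# list from the standard (non-AI) extraction.
merge_ai_with_references <- function(ai_sections, standard_sections) {
  # Assumed section names; the package may label these differently
  ref_names <- intersect(c("References", "Bibliography"),
                         names(standard_sections))
  # Drop any AI version of the reference sections
  merged <- ai_sections[setdiff(names(ai_sections), ref_names)]
  if (length(ref_names) > 0) {
    merged[ref_names] <- standard_sections[ref_names]
  }
  merged
}

# Mock inputs standing in for the two extraction results
ai <- list(Introduction = "Clean intro.", References = "garbled refs")
std <- list(Introduction = "Raw intro.", References = "[1] Smith (2020)")

merged <- merge_ai_with_references(ai, std)
merged$References
```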

Citation conversion is applied based on the citation_type parameter to standardize reference markers throughout the document.

See Also

pdf2txt_multicolumn_safe for multi-column extraction, process_large_pdf for AI-enhanced processing, split_into_sections for section detection

Examples

if (FALSE) {
# Basic import with automatic section detection
doc <- pdf2txt_auto("paper.pdf")

# Import with superscript citation conversion
doc <- pdf2txt_auto(
  "paper.pdf",
  citation_type = "numeric_superscript"
)

# Import with AI-enhanced extraction
doc <- pdf2txt_auto(
  "paper.pdf",
  enable_ai_support = TRUE,
  ai_model = "2.5-flash",
  api_key = Sys.getenv("GEMINI_API_KEY")
)

# Import paper with author-year citations (no conversion)
doc <- pdf2txt_auto(
  "paper.pdf",
  citation_type = "author_year"
)

# Simple text extraction without sections or citation processing
text <- pdf2txt_auto(
  "paper.pdf",
  sections = FALSE,
  citation_type = "none"
)

# Access specific sections (available when sections = TRUE)
introduction <- doc$Introduction
methods <- doc$Methods
}
