pdf2txt_auto: Import PDF with Automatic Section Detection

Description

High-level function that imports PDF files, extracts text while handling multi-column layouts, and optionally splits the content into sections. It supports AI-enhanced text extraction through the Google Gemini API and lets you control how citation formats are converted.

Usage

pdf2txt_auto(
  file,
  n_columns = NULL,
  preserve_structure = TRUE,
  sections = TRUE,
  normalize_refs = TRUE,
  citation_type = c("none", "numeric_superscript", "numeric_bracketed", "author_year"),
  enable_ai_support = FALSE,
  ai_model = "2.5-flash",
  api_key = NULL
)

Value

If sections = TRUE, returns a named list where:

The first element Full_text contains the complete document text
Subsequent elements contain individual sections (Introduction, Methods, etc.)

If sections = FALSE, returns a character string with the full document text. Returns NA if extraction fails.
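
Because the return type depends on sections and on whether extraction succeeded, calling code usually has to branch on the result. The following is a minimal sketch, not part of the package: describe_result is a hypothetical helper name, and the inputs are hand-built stand-ins for what pdf2txt_auto might return.

```r
# Hypothetical helper (not part of the package): summarise whatever
# pdf2txt_auto() returned.
describe_result <- function(res) {
  if (is.list(res)) {
    # sections = TRUE: named list with Full_text first, then the sections
    paste("sections:", paste(setdiff(names(res), "Full_text"), collapse = ", "))
  } else if (length(res) == 1 && is.na(res)) {
    # extraction failed
    "extraction failed"
  } else {
    # sections = FALSE: the full document text as character data
    paste("full text,", nchar(res), "characters")
  }
}

# Hand-built stand-ins for the three documented return shapes
describe_result(list(Full_text = "Intro. Methods.", Introduction = "Intro.", Methods = "Methods."))
describe_result("Intro text.")
describe_result(NA)
```

Checking is.list() first matters here, because calling is.na() on a list tests each element rather than the result as a whole.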

Arguments

file

Character. Path to the PDF file to be processed.

n_columns

Integer or NULL. Number of columns in the PDF layout. Default is NULL (automatic detection).

preserve_structure

Logical. If TRUE, preserves paragraph structure and formatting. Default is TRUE.

sections

Logical. If TRUE, splits the document into sections based on headers. Default is TRUE.

normalize_refs

Logical. If TRUE, normalizes reference formatting in the document. Default is TRUE.

citation_type

Character. Type of citations used in the document. Options are:

"none": No citation conversion (default)
"numeric_superscript": Numeric citations in superscript format; these are converted to bracket notation
"numeric_bracketed": Numeric citations already in brackets
"author_year": Author-year citations (e.g., Smith, 2020)

This parameter helps avoid false positives in citation detection. Only specify "numeric_superscript" if your document uses superscript numbers for citations.
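
To illustrate why superscript conversion is opt-in, here is a rough sketch of what such a conversion could look like. It is not the package's actual implementation: superscript_to_bracket is a hypothetical helper, and it assumes that superscript markers come out of PDF extraction as plain digits glued to the preceding lowercase word.

```r
# Hypothetical helper (not the package's code): wrap digits that are glued
# to the end of a lowercase word in square brackets.
superscript_to_bracket <- function(x) {
  gsub("(?<=[a-z])(\\d+(?:,\\d+)*)(?=[\\s.,;]|$)", " [\\1]", x, perl = TRUE)
}

# Intended case: a flattened superscript citation
superscript_to_bracket("as shown previously1,2.")

# False positive: a label that merely ends in a digit is rewritten too
superscript_to_bracket("stored in vial3 overnight")
```

Under this assumption, any token that ends in a digit looks like a citation, which is the kind of false positive that leaving citation_type at "none" avoids for documents without superscript citations.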

enable_ai_support

Logical. If TRUE, enables AI-enhanced text extraction using the Google Gemini API. Default is FALSE.

ai_model

Character. The Gemini model version to use for AI processing. Default is "2.5-flash". See process_large_pdf for available models.

api_key

Character or NULL. Google Gemini API key. If NULL, the function attempts to read the key from the GEMINI_API_KEY environment variable.
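
The lookup order described for api_key can be sketched as follows. This is an illustration only: resolve_api_key is a hypothetical helper, and stopping with an error when no key is found is an assumption, since the behaviour of the real function in that case is not documented here.

```r
# Hypothetical helper mirroring the documented lookup order:
# explicit argument first, then the GEMINI_API_KEY environment variable.
resolve_api_key <- function(api_key = NULL) {
  if (is.null(api_key)) {
    api_key <- Sys.getenv("GEMINI_API_KEY", unset = "")
  }
  if (!nzchar(api_key)) {
    # Assumption: fail loudly; the real function may behave differently
    stop("No API key supplied and GEMINI_API_KEY is not set.")
  }
  api_key
}

# An explicit key takes precedence over the environment variable
resolve_api_key("explicit-demo-key")
```

Note that Sys.getenv() returns an empty string, not NULL, when the variable is unset, so an emptiness check with nzchar() is needed in addition to is.null().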

Details

The function attempts multiple extraction methods:

First tries multi-column extraction with pdf2txt_multicolumn_safe
Falls back to standard pdftools::pdf_text if the first method fails
Optionally applies AI-enhanced extraction if enable_ai_support = TRUE
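
The fallback order above follows a common try-then-fall-back pattern, which can be sketched with tryCatch. This is not the package's internal code: extract_with_fallback is a hypothetical helper, and the two extractors below are stand-ins used only for demonstration.

```r
# Hypothetical helper: try each extractor in order and return the first
# result that neither errors nor comes back empty or NA.
extract_with_fallback <- function(file, extractors) {
  for (extract in extractors) {
    text <- tryCatch(extract(file), error = function(e) NULL)
    usable <- !is.null(text) && length(text) > 0 &&
      !all(is.na(text)) && any(nzchar(text))
    if (usable) {
      return(text)
    }
  }
  NA_character_  # every method failed
}

# Stand-in extractors for demonstration only
failing <- function(file) stop("layout detection failed")
plain <- function(file) paste("plain text of", file)

extract_with_fallback("paper.pdf", list(failing, plain))
```

With real extractors, the list might be list(pdf2txt_multicolumn_safe, pdftools::pdf_text), assuming both accept the file path as their first argument.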

When AI support is enabled and successful, the function:

Processes the PDF using process_large_pdf
Merges text chunks and converts to appropriate format
Preserves References/Bibliography section from standard extraction
Returns AI-processed content with improved formatting
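
The step that preserves the reference list can be pictured as a merge of two named lists. The sketch below is an assumption about the data shapes, not the package's implementation: keep_standard_references is a hypothetical helper, and both inputs are assumed to be named lists of sections like the one described under Value.

```r
# Hypothetical helper: keep the AI-processed sections but take the
# reference list from the standard extraction.
keep_standard_references <- function(ai_sections, std_sections) {
  ref_labels <- c("References", "Bibliography")
  # Drop any reference section produced by the AI pass
  merged <- ai_sections[!names(ai_sections) %in% ref_labels]
  # Append the reference section(s) found by the standard extraction
  c(merged, std_sections[intersect(names(std_sections), ref_labels)])
}

# Hand-built example inputs
ai  <- list(Introduction = "clean intro", References = "reworded refs")
std <- list(Introduction = "raw intro", References = "1. Smith (2020).")

keep_standard_references(ai, std)
```

Taking references from the standard extraction keeps the entries verbatim, which matters when they are later matched against in-text citations.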

Citation conversion is applied based on the citation_type parameter to standardize reference markers throughout the document.

Examples

if (FALSE) {
# Basic import with automatic section detection
doc <- pdf2txt_auto("paper.pdf")

# Import with superscript citation conversion
doc <- pdf2txt_auto(
  "paper.pdf",
  citation_type = "numeric_superscript"
)

# Import with AI-enhanced extraction
doc <- pdf2txt_auto(
  "paper.pdf",
  enable_ai_support = TRUE,
  ai_model = "2.5-flash",
  api_key = Sys.getenv("GEMINI_API_KEY")
)

# Import paper with author-year citations (no conversion)
doc <- pdf2txt_auto(
  "paper.pdf",
  citation_type = "author_year"
)

# Simple text extraction without sections or citation processing
text <- pdf2txt_auto(
  "paper.pdf",
  sections = FALSE,
  citation_type = "none"
)

# Access specific sections (requires sections = TRUE, the default)
introduction <- doc$Introduction
methods <- doc$Methods
}