contentanalysis (version 0.2.1)

process_large_pdf: Process Large PDF Documents with Google Gemini AI

Description

Splits a large PDF into chunks and processes each chunk with Google Gemini AI to extract and format text content. This is particularly useful for PDFs that exceed the token limit of a single API request.

Usage

process_large_pdf(
  pdf_path,
  api_key,
  pages_per_chunk = 4,
  model = c("2.5-flash", "2.5-flash-lite")
)

Value

A list of character vectors, one element per chunk, containing the extracted text formatted as markdown. Returns NULL if processing fails.

Arguments

pdf_path

Character. Path to the PDF file to be processed.

api_key

Character. Google Gemini API key.

pages_per_chunk

Integer. Number of pages to include in each chunk. Default is 4. Lower values may help with very dense documents or API rate limits (see the sketch at the end of this section).

model

Character. The Gemini model version to use. Options are "2.5-flash" and "2.5-flash-lite". Default is "2.5-flash".
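
The number of chunks, and therefore the number of API requests, follows directly from the page count and pages_per_chunk. A minimal sketch, assuming the pdftools package for page counting (the package's own internals may differ):

n_pages  <- pdftools::pdf_length("large_document.pdf")
n_chunks <- ceiling(n_pages / 4)  # with the default pages_per_chunk = 4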

Details

The function performs the following steps:

  1. Validates input parameters and PDF file

  2. Splits the PDF into chunks based on pages_per_chunk (see the sketch after this list)

  3. Processes each chunk sequentially with Gemini AI

  4. Extracts text while:

    • Removing repeated headers

    • Maintaining hierarchical structure

    • Preserving reference numbers in bracket notation

    • Formatting output as markdown

    • Handling sections that span multiple chunks

  5. Returns a list of extracted text, one element per chunk
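
A minimal sketch of the chunking loop described above, assuming pdftools::pdf_subset for splitting; process_chunk_with_gemini() is a hypothetical stand-in for the per-chunk call to gemini_content_ai, whose exact signature is not documented here. The package's actual internals may differ.

n_pages <- pdftools::pdf_length(pdf_path)
starts  <- seq(1, n_pages, by = pages_per_chunk)
results <- vector("list", length(starts))
for (i in seq_along(starts)) {
  pages <- starts[i]:min(starts[i] + pages_per_chunk - 1, n_pages)
  chunk_pdf <- tempfile(fileext = ".pdf")
  pdftools::pdf_subset(pdf_path, pages = pages, output = chunk_pdf)
  # Hypothetical helper standing in for the Gemini request on each chunk
  results[[i]] <- process_chunk_with_gemini(chunk_pdf, api_key, model)
  Sys.sleep(1)  # 1-second delay between chunks (see Rate limiting below)
}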

The function includes comprehensive error handling (a defensive calling pattern is sketched after this list) for:

  • Invalid or missing PDF files

  • Missing or invalid API keys

  • PDF processing errors

  • Gemini AI service errors

  • File system operations
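
The documented behaviour on failure is to return NULL; callers can additionally wrap the call in tryCatch() to guard against unexpected errors. A minimal defensive sketch using only base R:

result <- tryCatch(
  process_large_pdf("large_document.pdf",
                    api_key = Sys.getenv("GEMINI_API_KEY")),
  error = function(e) {
    message("Processing failed: ", conditionMessage(e))
    NULL
  }
)
if (is.null(result)) {
  message("No text extracted; check the PDF path and API key.")
}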

Rate limiting: The function includes a 1-second delay between chunks to respect API rate limits.
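
When processing several PDFs in one session, an additional pause between whole documents can help stay within quota. A minimal sketch; the file names and the 5-second pause are illustrative, not part of the package:

pdf_files <- c("doc1.pdf", "doc2.pdf")  # hypothetical file names
all_texts <- lapply(pdf_files, function(f) {
  out <- process_large_pdf(f, api_key = Sys.getenv("GEMINI_API_KEY"))
  Sys.sleep(5)  # extra pause between documents; tune to your quota
  out
})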

See Also

gemini_content_ai for the underlying AI processing function

Examples

if (FALSE) {
# Process a large PDF with default settings
result <- process_large_pdf(
  pdf_path = "large_document.pdf",
  api_key = Sys.getenv("GEMINI_API_KEY")
)

# Process with smaller chunks and specific model
result <- process_large_pdf(
  pdf_path = "very_large_document.pdf",
  api_key = Sys.getenv("GEMINI_API_KEY"),
  pages_per_chunk = 3,
  model = "2.5-flash"
)

# Combine all chunks into single text
if (!is.null(result)) {
  full_text <- paste(unlist(result), collapse = "\n\n")
}
}