contentanalysis (version 0.2.1)

process_large_pdf: Process Large PDF Documents with Google Gemini AI

Description

Splits a large PDF into chunks and processes each chunk with Google Gemini AI to extract and format text content. This is particularly useful for PDFs that exceed the token limit of a single API request.

Usage

process_large_pdf(
  pdf_path,
  api_key,
  pages_per_chunk = 4,
  model = c("2.5-flash", "2.5-flash-lite")
)

Value

A list of character vectors, one element per chunk, containing the extracted text formatted as markdown. Returns NULL if processing fails.

Arguments

pdf_path

Character. Path to the PDF file to be processed.

api_key

Character. Google Gemini API key.

pages_per_chunk

Integer. Number of pages to include in each chunk. Default is 4. Lower values may help with very dense documents or API rate limits (see the sketch at the end of this section).

model

Character. The Gemini model version to use. Options are "2.5-flash" and "2.5-flash-lite". Default is "2.5-flash".
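
The number of chunks, and therefore the number of API requests, follows directly from the page count and pages_per_chunk. A minimal sketch, assuming the pdftools package for page counting (the package's own internals may differ):

n_pages  <- pdftools::pdf_length("large_document.pdf")
n_chunks <- ceiling(n_pages / 4)  # with the default pages_per_chunk = 4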

Details

The function performs the following steps:

  1. Validates input parameters and PDF file

  2. Splits the PDF into chunks based on pages_per_chunk (see the sketch after this list)

  3. Processes each chunk sequentially with Gemini AI

  4. Extracts text while:

    • Removing repeated headers

    • Maintaining hierarchical structure

    • Preserving reference numbers in bracket notation

    • Formatting output as markdown

    • Handling sections that span multiple chunks

  5. Returns a list of extracted text, one element per chunk
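
A minimal sketch of the chunking loop described above, assuming pdftools::pdf_subset for splitting; process_chunk_with_gemini() is a hypothetical stand-in for the per-chunk call to gemini_content_ai, whose exact signature is not documented here. The package's actual internals may differ.

n_pages <- pdftools::pdf_length(pdf_path)
starts  <- seq(1, n_pages, by = pages_per_chunk)
results <- vector("list", length(starts))
for (i in seq_along(starts)) {
  pages <- starts[i]:min(starts[i] + pages_per_chunk - 1, n_pages)
  chunk_pdf <- tempfile(fileext = ".pdf")
  pdftools::pdf_subset(pdf_path, pages = pages, output = chunk_pdf)
  # Hypothetical helper standing in for the Gemini request on each chunk
  results[[i]] <- process_chunk_with_gemini(chunk_pdf, api_key, model)
  Sys.sleep(1)  # 1-second delay between chunks (see Rate limiting below)
}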

The function includes comprehensive error handling (a defensive calling pattern is sketched after this list) for:

  • Invalid or missing PDF files

  • Missing or invalid API keys

  • PDF processing errors

  • Gemini AI service errors

  • File system operations
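
The documented behaviour on failure is to return NULL; callers can additionally wrap the call in tryCatch() to guard against unexpected errors. A minimal defensive sketch using only base R:

result <- tryCatch(
  process_large_pdf("large_document.pdf",
                    api_key = Sys.getenv("GEMINI_API_KEY")),
  error = function(e) {
    message("Processing failed: ", conditionMessage(e))
    NULL
  }
)
if (is.null(result)) {
  message("No text extracted; check the PDF path and API key.")
}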

Rate limiting: The function includes a 1-second delay between chunks to respect API rate limits.
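
When processing several PDFs in one session, an additional pause between whole documents can help stay within quota. A minimal sketch; the file names and the 5-second pause are illustrative, not part of the package:

pdf_files <- c("doc1.pdf", "doc2.pdf")  # hypothetical file names
all_texts <- lapply(pdf_files, function(f) {
  out <- process_large_pdf(f, api_key = Sys.getenv("GEMINI_API_KEY"))
  Sys.sleep(5)  # extra pause between documents; tune to your quota
  out
})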

See Also

gemini_content_ai for the underlying AI processing function

Examples

if (FALSE) {
# Process a large PDF with default settings
result <- process_large_pdf(
  pdf_path = "large_document.pdf",
  api_key = Sys.getenv("GEMINI_API_KEY")
)

# Process with smaller chunks and specific model
result <- process_large_pdf(
  pdf_path = "very_large_document.pdf",
  api_key = Sys.getenv("GEMINI_API_KEY"),
  pages_per_chunk = 3,
  model = "2.5-flash"
)

# Combine all chunks into single text
if (!is.null(result)) {
  full_text <- paste(unlist(result), collapse = "\n\n")
}
}