pdf2txt_multicolumn_safe: Extract text from multi-column PDF with structure preservation

Description

Extracts text from PDF files with multi-column layouts, with options for structure preservation and automatic column detection. A post-processing step converts superscript citation numbers when citation_type = "numeric_superscript"; the other citation types are left unchanged.

Usage

pdf2txt_multicolumn_safe(
  file,
  n_columns = NULL,
  column_threshold = NULL,
  preserve_structure = TRUE,
  citation_type = c("none", "numeric_superscript", "numeric_bracketed", "author_year")
)

Value

Character string with extracted text.

Arguments

file

Character string. Path to the PDF file.

n_columns

Integer or NULL. Number of columns to detect. If NULL, attempts automatic detection. Default is NULL.

column_threshold

Numeric or NULL. X-coordinate threshold for column separation. If NULL and n_columns is NULL, calculated automatically.

preserve_structure

Logical. If TRUE, preserves paragraph breaks and section structure. If FALSE, returns continuous text. Default is TRUE.

citation_type

Character string. Type of citations in the document:

"numeric_superscript": Numeric citations in superscript (converted to [n])
"numeric_bracketed": Numeric citations already in brackets [n] (no conversion)
"author_year": Author-year citations like (Smith, 2020) (no conversion)
"none": No citation conversion

Default is "none".
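
The vector of choices in the usage signature suggests the default is resolved with match.arg(), which returns the first element when the argument is not supplied. The following is a minimal sketch under that assumption; resolve_citation_type() is a hypothetical stand-in, not a function from the package.

```r
# Hypothetical stand-in, assuming the argument is resolved with match.arg()
resolve_citation_type <- function(
    citation_type = c("none", "numeric_superscript",
                      "numeric_bracketed", "author_year")) {
  # With the default left untouched, match.arg() returns the first choice
  match.arg(citation_type)
}

default_type  <- resolve_citation_type()
explicit_type <- resolve_citation_type("author_year")
print(default_type)
print(explicit_type)
```

If match.arg() is indeed used, a value outside the four choices would raise an error rather than silently falling back to "none".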

Details

This function uses pdftools::pdf_data() for precise text extraction with spatial coordinates. It handles:

Multi-column layouts (2+ columns)
Section detection and paragraph preservation
Hyphenation removal
Title and heading identification
Superscript citation number conversion (only if citation_type = "numeric_superscript")
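
To illustrate the multi-column handling, the sketch below reorders words by column using an x-coordinate threshold. It is not the package's implementation: read_columns() is a hypothetical helper, the data frame is a hand-built mock of one page of pdftools::pdf_data() output (one row per word with x, y and text), and it assumes exactly two columns and that words on the same line share an identical y value.

```r
# Mock of one page from pdftools::pdf_data(): one row per word,
# x and y are the coordinates of the word's bounding box.
words <- data.frame(
  x    = c(50, 90, 320, 360, 50, 320),
  y    = c(100, 100, 100, 100, 115, 115),
  text = c("A1", "A2", "B1", "B2", "A3", "B3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split words into two columns at `column_threshold`,
# then read each column top to bottom and each line left to right.
read_columns <- function(words, column_threshold) {
  words$column <- ifelse(words$x < column_threshold, 1L, 2L)
  words <- words[order(words$column, words$y, words$x), ]
  # One key per (column, line), kept in reading order
  key <- paste(words$column, words$y)
  groups <- split(words$text, factor(key, levels = unique(key)))
  unname(vapply(groups, paste, character(1), collapse = " "))
}

lines <- read_columns(words, column_threshold = 200)
print(lines)
```

Without the column split, sorting by y alone would interleave the two columns and produce "A1 A2 B1 B2" on the first line. Real pdf_data() output often has small y differences within a line, so a practical version would need to group nearby y values.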

If pdftools::pdf_data() fails, the function falls back to pdftools::pdf_text(), which returns plain text without word coordinates.
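
The fallback can be pictured as a tryCatch() around the coordinate-based extractor. This is a sketch of the general pattern only; with_fallback() and the two closures are hypothetical placeholders for the pdf_data()-based and pdf_text()-based code paths.

```r
# Hypothetical wrapper: run `primary`, and use `fallback` if it signals an error
with_fallback <- function(primary, fallback) {
  tryCatch(primary(), error = function(e) fallback())
}

result <- with_fallback(
  # Stands in for the extraction built on pdftools::pdf_data(file)
  primary  = function() stop("pdf_data() failed"),
  # Stands in for the simpler pdftools::pdf_text(file) path
  fallback = function() "plain text from pdf_text()"
)
print(result)
```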

Examples

if (FALSE) {
# Extract from 2-column paper with superscript citations
text <- pdf2txt_multicolumn_safe("paper.pdf", n_columns = 2,
                                  citation_type = "numeric_superscript")

# Extract paper with author-year citations (no conversion)
text <- pdf2txt_multicolumn_safe("paper.pdf", citation_type = "author_year")
}
