contentanalysis (version 0.2.1)

split_into_sections: Split document text into sections

Description

Splits extracted text into logical sections (Introduction, Methods, Results, etc.) using either the PDF's table of contents or common academic section patterns.

Usage

split_into_sections(text, file_path = NULL)

Value

A named list in which each element holds the text of one section. The list always includes a "Full_text" element containing the complete document.
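
As a rough illustration of working with a return value of this shape, the sketch below builds a small stand-in list by hand and computes a word count per section. The section names and text are invented for the example, and word_counts and section_only are hypothetical names; real output depends on the document.

```r
# Stand-in for the return value; the section names and text are made up.
sections <- list(
  Full_text    = "Introduction\nWe study X.\nMethods\nWe measured Y twice.",
  Introduction = "We study X.",
  Methods      = "We measured Y twice."
)

# Drop the whole-document element before per-section processing
section_only <- sections[setdiff(names(sections), "Full_text")]

# Count whitespace-separated words in each section
word_counts <- vapply(
  section_only,
  function(s) length(strsplit(trimws(s), "\\s+")[[1]]),
  integer(1)
)
word_counts
```

Excluding "Full_text" first avoids counting the document twice when summarising across sections.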

Arguments

text

Character string. Full text of the document.

file_path

Character string or NULL. Path to the PDF file used for TOC extraction. If NULL, the function falls back to common academic section names. Default is NULL.

Details

The function attempts to:

  1. Extract section names from PDF table of contents

  2. Fall back to common academic section names if the TOC is unavailable

  3. Match section headers in text using regex patterns

  4. Handle duplicate section names

Common sections searched: Abstract, Introduction, Methods, Results, Discussion, Conclusion, References, etc.
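
The header-matching and duplicate-handling steps listed above can be sketched in base R as follows. This is a simplified illustration and not the package's actual implementation: the helper name split_by_headers_sketch, the exact regex, and the "_1" suffix for duplicates are all assumptions made for the example.

```r
# Hypothetical helper for illustration; not the contentanalysis implementation.
split_by_headers_sketch <- function(text,
                                    section_names = c("Abstract", "Introduction",
                                                      "Methods", "Results",
                                                      "Discussion", "Conclusion",
                                                      "References")) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]

  # Assumption: a header is a line holding only a section name,
  # optionally preceded by a number such as "1." or "2".
  pattern <- paste0("^\\s*(\\d+\\.?\\s+)?(",
                    paste(section_names, collapse = "|"),
                    ")\\s*$")
  header_idx <- grep(pattern, lines, ignore.case = TRUE, perl = TRUE)

  result <- list(Full_text = text)
  if (length(header_idx) == 0) return(result)

  # Strip leading numbering and whitespace to recover the bare name
  header_names <- trimws(sub("^\\s*\\d+\\.?\\s+", "", lines[header_idx],
                             perl = TRUE))
  # Disambiguate repeated names: "Results", "Results_1", ...
  header_names <- make.unique(header_names, sep = "_")

  # Each section runs from the line after its header to the line
  # before the next header (or to the end of the text).
  ends <- c(header_idx[-1] - 1L, length(lines))
  for (i in seq_along(header_idx)) {
    body_idx <- header_idx[i] + seq_len(ends[i] - header_idx[i])
    result[[header_names[i]]] <- paste(lines[body_idx], collapse = "\n")
  }
  result
}

demo_text <- paste(
  "Abstract", "We study X.",
  "1. Introduction", "Background here.",
  "Results", "First finding.",
  "Results", "Second finding.",
  sep = "\n"
)
sections <- split_by_headers_sketch(demo_text)
names(sections)
```

Requiring the section name to fill the whole line keeps ordinary sentences that mention "results" or "methods" from being treated as headers, at the cost of missing headers that carry extra words.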

Examples

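if (FALSE) {
# Extract the document text as a single string (sections = FALSE)
text <- pdf2txt_auto("paper.pdf", sections = FALSE)
# Split the text into sections, passing the PDF path for TOC extraction
sections <- split_into_sections(text, file_path = "paper.pdf")
# List the names of the detected sections
names(sections)
}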