arete (version 0.1)

process_document: Extract and process text from a document

Description

This function extracts the text embedded in a .pdf or .txt file and processes it so that it can be safely passed to LLM APIs.

Usage

process_document(path, extra_measures = NULL)

Value

character. Fully processed text.

Arguments

path

character. Path to the desired .pdf or .txt file.

extra_measures

character. To be implemented. Some documents are especially difficult for LLMs to process because of issues such as size and formatting. extra_measures tries to improve downstream performance by cropping the document to only the central passage that mentions a specific species. "header" and, by extension, "both" require an .mmd file produced by nougatOCR.
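Since extra_measures is not yet implemented, the cropping idea can only be illustrated with a rough sketch. The helper below, crop_to_mention, is hypothetical and not part of arete. It assumes paragraphs are separated by blank lines, that the species name is matched literally, and that the "central passage" is the first paragraph mentioning the species plus a window of neighbouring paragraphs. The package may do this differently.

```r
# Hypothetical helper, NOT part of arete: keep only the paragraphs around
# the first mention of a species name.
crop_to_mention <- function(text, species, window = 1L) {
  # Assumption: paragraphs are separated by blank lines.
  paragraphs <- strsplit(text, "\n[ \t]*\n")[[1]]
  # Indices of paragraphs that mention the species (literal match, no regex).
  hits <- which(grepl(species, paragraphs, fixed = TRUE))
  if (length(hits) == 0L) {
    return(text)  # no mention found: return the text unchanged
  }
  # Keep `window` paragraphs on either side of the first mention.
  first <- max(1L, hits[1] - window)
  last <- min(length(paragraphs), hits[1] + window)
  paste(paragraphs[seq(first, last)], collapse = "\n\n")
}

# Toy input standing in for text extracted from a document.
text <- paste(
  "Introduction to the genus.",
  "Tricholathys spiralis was collected at two sites.",
  "Methods are described elsewhere.",
  "Acknowledgements.",
  sep = "\n\n"
)
cropped <- crop_to_mention(text, "Tricholathys spiralis", window = 1L)
```

Here cropped keeps the first three paragraphs and drops the acknowledgements. A wider window keeps more context at the cost of a longer prompt.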

Examples

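# Locate the example document shipped with the package
path = arete_data("holzapfelae")
# Extract and process its text
process_document(path)

# Example value for extra_measures (argument still to be implemented)
extra_measures = list("mention", "Tricholathys spiralis")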
