process_document: Extract and process text from a document
Description
This function extracts text embedded in a .pdf or .txt file
and processes it so it can be safely used by LLM APIs.
Usage
process_document(path, extra_measures = NULL)
Value
character. Fully processed text.
Arguments
path
character. Path to the .pdf or .txt file to process.
extra_measures
character. Not yet implemented. Some documents are
especially difficult for LLMs to process because of
issues such as size and formatting. extra_measures is intended to improve
performance by cropping the given document to only the central passage
that mentions a specific species. The values "header" and, by extension, "both"
require an .mmd file produced by nougatOCR.
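Examples
A minimal sketch of a call on a plain-text file, using only the signature shown under Usage. The temporary file, its sample sentences, and the fallback branch are assumptions added for illustration; the fallback only stands in for the real processing so the snippet runs even when the package is not attached.

```r
# Write a small .txt file so the example does not depend on a real document.
tmp <- tempfile(fileext = ".txt")
writeLines(
  c("Panthera leo is found in sub-Saharan Africa.",
    "It lives in social groups called prides."),
  tmp
)

if (exists("process_document", mode = "function")) {
  # extra_measures is left at its default (NULL), as it is not yet implemented.
  txt <- process_document(tmp)
} else {
  # Illustrative stand-in only: read the raw text without any processing.
  txt <- paste(readLines(tmp), collapse = "\n")
}

# Per the Value section, the result should be a character vector.
stopifnot(is.character(txt))

unlink(tmp)  # remove the temporary file
```

With a PDF, the call would take the same form, with path pointing at the .pdf file instead.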