process_document

This function extracts text embedded in a <code>.pdf</code> or <code>.txt</code> file
and processes it so that it can be safely passed to LLM APIs.
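
The page does not say which processing steps are applied after extraction. As a rough illustration only, the following Python sketch shows the kind of clean-up that is commonly done before sending extracted text to an LLM API. The helper name `clean_text` and every step in it are assumptions made for this example, not the package's actual behaviour.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Hypothetical clean-up of extracted text (not part of arete)."""
    # NFKC folds ligatures and full-width forms that PDF extraction often leaves behind.
    text = unicodedata.normalize("NFKC", raw)
    # Drop control and format characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Re-join words hyphenated across a line break (assumption: may also
    # join genuine hyphenated compounds that happen to wrap).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, and runs of three or more newlines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Calling `clean_text` on a string read from a <code>.txt</code> file would return the same text with ligatures expanded, stray control characters removed and whitespace normalized.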

A Python-based pipeline for extracting species occurrence data using large language models. Includes validation tools designed to handle model hallucinations, enabling scientifically rigorous use of LLMs. Currently supports GPT, with support for more models planned, including local and non-proprietary ones. For more details on the methodology, please consult the references listed under each function, such as Kent, A. et al. (1995) <doi:10.1002/asi.5090060209>, van Rijsbergen, C.J. (1979, ISBN:978-0408709293), Levenshtein, V.I. (1966) <https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf> and Klaus Krippendorff (2011) <https://repository.upenn.edu/handle/20.500.14332/2089>.

Vasco V. Branco

arete

Automated REtrieval from TExt

Vaughn Shirey

Thomas Merrien

Pedro Cardoso

process_document function

<dl><dt>path</dt>
<dd>character. Path to the desired PDF file.</dd>
<dt>extra_measures</dt>
<dd>character. To be implemented. Some documents are 
especially difficult for LLMs to process due to issues such as 
size and formatting. <code>extra_measures</code> tries to improve 
performance by cropping the given document to only the central passage
mentioning a specific species. <code>"header"</code> and, by extension, <code>"both"</code> require an mmd file
produced by nougatOCR.</dd></dl>
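
Since <code>extra_measures</code> is marked as not yet implemented, the sketch below only illustrates the general idea of cropping a document to the passage that mentions a species. It is written in Python; the function name `crop_to_species`, the `context` parameter and the paragraph-based splitting are all hypothetical choices for this example and do not describe how the package does or will do it.

```python
def crop_to_species(text: str, species: str, context: int = 1) -> str:
    """Hypothetical cropping helper (not part of arete).

    Keeps each paragraph that mentions `species`, plus `context`
    neighbouring paragraphs on either side.
    """
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    needle = species.lower()
    hits = [i for i, p in enumerate(paragraphs) if needle in p.lower()]
    if not hits:
        # No mention found: return the document unchanged rather than nothing.
        return text
    keep = set()
    for i in hits:
        start = max(0, i - context)
        stop = min(len(paragraphs), i + context + 1)
        keep.update(range(start, stop))
    return "\n\n".join(paragraphs[i] for i in sorted(keep))


doc = (
    "Intro.\n\nMethods.\n\n"
    "Loxosceles rufescens was found in Lisbon.\n\n"
    "Discussion.\n\nReferences."
)
cropped = crop_to_species(doc, "Loxosceles rufescens", context=1)
```

With `context=1`, `cropped` holds the matching paragraph and its two neighbours; a smaller window gives the model less surrounding text but also less to misread.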

process_document: Extract and process text from a document
