OCR_document

Extract text stored as images in a PDF using optical character recognition (OCR)
software. Two methods are currently available:
<code>method = "nougat"</code> and <code>method = "tesseract"</code>.

A Python-based pipeline for extracting species occurrence data using large language models (LLMs). It includes validation tools designed to handle model hallucinations, supporting scientifically rigorous use of LLMs. GPT is currently supported, with more models planned, including local and non-proprietary ones. For details on the methodology, consult the references listed under each function, such as Kent, A. et al. (1995) <doi:10.1002/asi.5090060209>, van Rijsbergen, C.J. (1979, ISBN:978-0408709293), Levenshtein, V.I. (1966) <https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf> and Krippendorff, K. (2011) <https://repository.upenn.edu/handle/20.500.14332/2089>.

Vasco V. Branco

arete

Automated REtrieval from TExt

Vaughn Shirey

Thomas Merrien

Pedro Cardoso


Scan PDF with optical character recognition (OCR): OCR_document

Arguments

<dl>

<dt>in_path</dt>
<dd>character. Path to a file with species data in either PDF or TXT format, e.g. <code>./folder/file.pdf</code>.</dd>


<dt>out_path</dt>
<dd>character. Path for the OCR output.</dd>


<dt>method</dt>
<dd>character. Method used for the OCR, either <code>"nougat"</code> or <code>"tesseract"</code>. Defaults to <code>"nougat"</code>.</dd>


<dt>verbose</dt>
<dd>logical. Whether to print the output once processing finishes.</dd>

</dl>
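
The arguments above could be combined roughly as in the following minimal sketch. It is not taken from the package's own examples. The file paths are hypothetical placeholders, the form expected for <code>out_path</code> (file or directory) is an assumption, and the call assumes that the installed version of arete accepts the <code>method</code> argument and that its OCR backend is already configured. The guard keeps the snippet runnable when the package or the input file is missing.

```r
# Hypothetical paths, for illustration only; replace them with real locations.
in_path  <- "./folder/file.pdf"
out_path <- "./folder/output"   # assumption: the exact form expected is not documented here
method   <- "tesseract"         # the other documented option is "nougat"

# Only call the function if the package is installed and the input file exists.
if (requireNamespace("arete", quietly = TRUE) && file.exists(in_path)) {
  arete::OCR_document(
    in_path  = in_path,
    out_path = out_path,
    method   = method,
    verbose  = TRUE
  )
} else {
  message("Skipping OCR: the arete package or the input file is not available.")
}
```

Passing every argument by name avoids relying on the positional order of the parameters, which is not shown on this page.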
