arete (version 0.1)

Automated REtrieval from TExt

Description

A Python-based pipeline for extracting species occurrence data with large language models (LLMs). It includes validation tools designed to handle model hallucinations, so that LLMs can be used in a scientifically rigorous way. It currently supports GPT, with more models planned, including local and non-proprietary ones. For details on the methodology, consult the references listed under each function, such as Kent, A. et al. (1995), van Rijsbergen, C.J. (1979, ISBN:978-0408709293), Levenshtein, V.I. (1966) and Krippendorff, K. (2011).
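The references cited above correspond to standard evaluation measures: precision, recall and the F-measure (Kent et al.; van Rijsbergen) and edit distance (Levenshtein). The following is a minimal Python sketch of how such measures can be computed when comparing LLM-extracted records against a human-annotated reference. It is not arete's own implementation; the function names `levenshtein` and `precision_recall_f1` and the example records are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (illustrative helper)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def precision_recall_f1(
    extracted: Iterable[Hashable], reference: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two collections of records."""
    extracted_set, reference_set = set(extracted), set(reference)
    true_positives = len(extracted_set & reference_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: one extracted locality is a hallucination.
reference = [("Araneus diadematus", "Lisbon"), ("Araneus diadematus", "Porto")]
extracted = [("Araneus diadematus", "Lisbon"), ("Araneus diadematus", "Atlantis")]
print(precision_recall_f1(extracted, reference))
print(levenshtein("Araneus diadematus", "Araneus diadematis"))
```

A hallucinated record lowers precision, while a record the model missed lowers recall. Edit distance is useful for near misses, such as a misspelled species name that an exact-match comparison would count as entirely wrong.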

Install

install.packages('arete')

Monthly Downloads

202

Version

0.1

License

GPL-3

Maintainer

Vasco V. Branco

Last Published

October 20th, 2025

Functions in arete (0.1)

string_to_coords

Convert strings to numerical coordinates
check_lang

Check if text is language-appropriate
arete_setup

Setup arete
aux_string_to_coords

Mechanical coordinate conversion
webanno_open

Open a WebAnno TSV v3.3 file
webanno_summary

Summarize the contents of a group of WebAnno TSV files
process_document

Extract and process text from a document
performance_report

Evaluate the performance of an LLM
install_python_packages

Update Python dependencies
install_OCR_packages

Update OCR dependencies
compare_IUCN

Check EOO differences between two sets of coordinates
gazetteer

Get geographic coordinates from localities
get_geodata

Call a Large Language Model (LLM) to extract species geographic data
arete_data

Example data packaged with arete
WebAnnoTSV-class

WebAnno TSV v3.3 class creator
labels

Labels for model training
labels_unique

Get the unique labels of a WebAnno document
OCR_document

Scan PDF with optical character recognition (OCR)
arete_package

Summary of methods in the arete package
create_training_data

Create training data for GPT
file_comparison

Compare the contents of two WebAnno TSV files
process_species_names

Process and fix species names