arete_package: Summary of methods in the arete package

Description

Package arete provides easy Automated REtrieval of species data from TExt. It takes user-supplied text, splits it into API requests to LLM services, and processes the output through a variety of machine learning and rule-based methods to deliver species data in a machine-readable format, ready for analysis. For a short use case of arete, try our vignette(package = "arete"). In broad terms, functions in arete fall into 6 categories:

---------------------------------------------------------------------------------------------------------------------

Arguments

1. Setting up arete

`arete_setup`	Define a default virtual environment and install external dependencies
`install_python_packages`	Install or update Python dependencies after setup
`install_OCR_packages`	Install or update the dependencies for our optional OCR utilities
---------------------------	------------------------------------------------------------------------------------------

2. Prepare annotation data

`labels`	Extract the labels and relations in a WebAnno TSV 3.3 file to a simple, machine-readable format ready for machine learning projects
`labels_unique`	Extract all unique labels and relations in a WebAnno TSV 3.3 file
`webanno_open`	Read the contents of a WebAnno TSV 3.3 file and create a `webanno` object, a format for annotated text containing named entities and relations
`webanno_summary`	Summarize the contents of a group of WebAnno TSV files by counting the labels and relations present
`create_training_data`	Open files with text and annotated data and build training data for large language models in a variety of formats
`file_comparison`	Detect differences between two WebAnno files of the same text for annotation monitoring
---------------------------	------------------------------------------------------------------------------------------

3. Prepare text data

`process_document`	Extract text embedded in a `.pdf` or `.txt` file and process it so it can be safely sent to LLM APIs
`OCR_document`	Optional OCR utilities based on `tesseract` and `nougatOCR`
`check_lang`	Check if a given string is mostly (75% of the document) in English
---------------------------	------------------------------------------------------------------------------------------

4. Clean data

`string_to_coords`	Rule-based conversion of character strings containing geographic coordinates to sets of numeric values
`process_species_names`	Standardize species names and fix typos and mistakes commonly introduced by OCR
---------------------------	------------------------------------------------------------------------------------------

5. Fine-tune and extract data

`get_geodata`	Call a Large Language Model (LLM) to extract species geographic data
`gazetteer`	Extract geographic coordinates from strings containing location names, using an online index
---------------------------	------------------------------------------------------------------------------------------

6. Evaluate model performance

`performance_report`	Produce a detailed report on the discrepancies between data extracted by an LLM and human-annotated data
`compare_IUCN`	Calculate the extent of occurrence (EOO) for two sets of coordinates for a practical assessment of data proximity
---------------------------	------------------------------------------------------------------------------------------

Contributors

The methods and functions in this package were written by Vasco Branco, with code contributions by Vaughn Shirey and Thomas Merrien. Code revision by Pedro Cardoso.