arete (version 0.1)

arete_package: Summary of methods in the arete package

Description

Package arete provides easy Automated REtrieval of species data from TExt. To do this, it takes user-supplied text, splits it into API requests to LLM services, and processes the output through a variety of machine learning and rule-based methods to deliver species data in a machine-readable format, ready for analysis. For a short and sweet use case of arete, try our vignette(package = "arete"). In broad terms, functions in arete can be placed in 6 different categories:

---------------------------------------------------------------------------------------------------------------------

Arguments

1. Setting up arete

arete_setup: Define a default virtual environment and install external dependencies
install_python_packages: Install or update Python dependencies after setup
install_OCR_packages: Install or update the dependencies for our optional OCR utilities
---------------------------------------------------------------------------------------------------------------------

2. Prepare annotation data

labels: Extract the labels and relations in a WebAnno TSV 3.3 file to an easy, machine-readable format ready for machine learning projects
labels_unique: Extract all unique labels and relations in a WebAnno TSV 3.3 file
webanno_open: Read the contents of a WebAnno TSV 3.3 file and create a webanno object, a format for annotated text containing named entities and relations
webanno_summary: Summarize the contents of a group of WebAnno TSV files by counting the labels and relations present
create_training_data: Open files with text and annotated data and build training data for large language models in a variety of formats
file_comparison: Detect differences between two WebAnno files of the same text for annotation monitoring
---------------------------------------------------------------------------------------------------------------------

3. Prepare text data

process_document: Extract text embedded in a .pdf or .txt file and process it so it can be safely sent to LLM APIs
OCR_document: Optional utilities based on tesseract and nougatOCR
check_lang: Check if a given string is mostly (75% of the document) in English
---------------------------------------------------------------------------------------------------------------------

4. Clean data

string_to_coords: Rule-based conversion of character strings containing geographic coordinates to sets of numeric values
process_species_names: Standardize species names and fix a number of typos and mistakes that commonly occur due to OCR
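
As an illustration of what rule-based coordinate conversion involves, the sketch below turns a degrees-minutes-seconds string into decimal degrees using base R only. It is not the implementation of string_to_coords; the helper name dms_to_decimal and the rules it applies (hemisphere letter at the end of the string, optional leading minus sign) are assumptions made for this example.

```r
# Hypothetical helper, not part of arete: convert one coordinate string
# (degrees, optional minutes and seconds) to decimal degrees.
dms_to_decimal <- function(x) {
  x <- trimws(x)
  # Pull out every number (integer or decimal) in the order it appears
  nums <- as.numeric(regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?", x))[[1]])
  if (length(nums) == 0) return(NA_real_)
  # Pad missing minutes and seconds with zeros, keep at most three parts
  nums <- c(nums, 0, 0)[1:3]
  value <- nums[1] + nums[2] / 60 + nums[3] / 3600
  # Assumed rule: a trailing S or W, or a leading minus sign, means negative
  if (grepl("[SW]$", toupper(x)) || grepl("^-", x)) value <- -value
  value
}

# "\u00b0" is the degree sign, escaped to keep the source ASCII-only
lat <- dms_to_decimal("38\u00b042'30\"N")
lon <- dms_to_decimal("9\u00b008'15\"W")
print(c(lat = lat, lon = lon))
```

Real-world strings are messier than this (decimal commas, mixed separators, OCR noise), which is the kind of variation a fuller rule set has to cover.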
---------------------------------------------------------------------------------------------------------------------

5. Finetune and extract data

get_geodata: Call a Large Language Model (LLM) to extract species geographic data
gazetteer: Extract geographic coordinates from strings containing location names, using an online index
---------------------------------------------------------------------------------------------------------------------

6. Evaluate model performance

performance_report: Produce a detailed report on the discrepancies between data extracted by an LLM and human-annotated data
compare_IUCN: Calculate the extent of occurrence (EOO) for two sets of coordinates for a practical assessment of data proximity
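
To make the EOO comparison concrete: extent of occurrence is usually measured as the area of the minimum convex polygon around a set of occurrence points. The sketch below computes that area for two coordinate sets with base R and reports their ratio. It is not the implementation of compare_IUCN; the helper name hull_area and the example points are invented for illustration, and the coordinates are treated as planar, whereas a real assessment would use projected or geodesic areas.

```r
# Hypothetical helper, not part of arete: planar area of the convex hull
# around a set of points (a simplified stand-in for EOO).
hull_area <- function(x, y) {
  idx <- grDevices::chull(x, y)        # indices of the hull vertices, in order
  if (length(idx) < 3) return(0)       # fewer than 3 vertices enclose no area
  hx <- x[idx]
  hy <- y[idx]
  nxt <- c(2:length(idx), 1)           # index of the next vertex, wrapping around
  abs(sum(hx * hy[nxt] - hx[nxt] * hy)) / 2   # shoelace formula
}

# Made-up example points: one set standing in for human-annotated
# coordinates, the other for coordinates extracted by a model
human <- data.frame(x = c(0, 2, 2, 0, 1), y = c(0, 0, 2, 2, 1))
model <- data.frame(x = c(0, 1, 1, 0),    y = c(0, 0, 1, 1))

eoo_human <- hull_area(human$x, human$y)
eoo_model <- hull_area(model$x, model$y)
print(eoo_model / eoo_human)           # ratio of the two hull areas
```

A ratio close to 1 suggests the two point sets cover a similar extent, although equal areas alone do not show that the hulls overlap.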
---------------------------------------------------------------------------------------------------------------------

Contributors

The methods and functions in this package were written by Vasco Branco, with code contributions by Vaughn Shirey and Thomas Merrien. Code revision by Pedro Cardoso.