The arete package provides easy Automated REtrieval of species data from TExt. It takes user-supplied text, splits it into API requests to LLM services, and processes the output through a variety of machine learning and rule-based methods to deliver species data in a machine-readable format, ready for analysis. For a short use case of arete, try our vignette(package = "arete").
In broad terms, the functions in arete fall into six categories:
| --------------------------- | ------------------------------------------------------------------------------------------ |
arete_setup | Define a default virtual environment and install external dependencies |
install_python_packages | Install or update python dependencies after setup |
install_OCR_packages | Install or update the dependencies for our optional OCR utilities |
| --------------------------- | ------------------------------------------------------------------------------------------ |
labels | Extract the labels and relations in a WebAnno TSV 3.3 file to a machine-readable format ready for machine learning projects |
labels_unique | Extract all unique labels and relations in a WebAnno TSV 3.3 file |
webanno_open | Read the contents of a WebAnno TSV 3.3 file and create a webanno object, a format for annotated text containing named entities and relations |
webanno_summary | Summarize the contents of a group of WebAnno TSV files by counting the labels and relations present |
create_training_data | Open files with text and annotated data and build training data for large language models in a variety of formats |
file_comparison | Detect differences between two WebAnno files of the same text for annotation monitoring |
| --------------------------- | ------------------------------------------------------------------------------------------ |
process_document | Extract text embedded in a .pdf or .txt file and process it so it can be safely used with LLM APIs |
OCR_document | Optional utilities based on tesseract and nougatOCR |
check_lang | Check if a given string is mostly (75% of the document) in English |
| --------------------------- | ------------------------------------------------------------------------------------------ |
string_to_coords | Rule-based conversion of character strings containing geographic coordinates to sets of numeric values |
process_species_names | Standardize species names and fix typos and mistakes that commonly result from OCR |
| --------------------------- | ------------------------------------------------------------------------------------------ |
get_geodata | Call a Large Language Model (LLM) to extract species geographic data |
gazetteer | Extract geographic coordinates from strings containing location names, using an online index |
| --------------------------- | ------------------------------------------------------------------------------------------ |
performance_report | Produce a detailed report on the discrepancies between data extracted by an LLM and human-annotated data |
compare_IUCN | Calculate the extent of occurrence (EOO) for two sets of coordinates for a practical assessment of data proximity |
| --------------------------- | ------------------------------------------------------------------------------------------ |
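
To give a feel for what "rule-based conversion" of coordinate strings means in the table above, here is a minimal sketch in R. It is not the implementation of string_to_coords; the helper name dms_to_decimal and its behaviour are assumptions made purely for illustration. It handles only degrees-minutes-seconds strings that end in a hemisphere letter.

```r
# Hypothetical helper for illustration only; NOT arete's string_to_coords().
# Converts one degrees-minutes-seconds string to decimal degrees.
dms_to_decimal <- function(x) {
  # Rule 1: pull out every number in order (degrees, minutes, seconds).
  parts <- as.numeric(regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?", x))[[1]])
  if (length(parts) == 0 || length(parts) > 3) {
    stop("expected 1 to 3 numeric fields in: ", x)
  }
  # Rule 2: missing minutes or seconds count as zero.
  parts <- c(parts, 0, 0)[1:3]
  value <- parts[1] + parts[2] / 60 + parts[3] / 3600
  # Rule 3: a trailing S or W marks the southern or western hemisphere.
  if (grepl("[SWsw][[:space:]]*$", x)) value <- -value
  value
}

# "\u00b0" is the degree sign, written as an escape to keep the source ASCII.
coords <- c(
  lat = dms_to_decimal("38\u00b042'30\"N"),
  lon = dms_to_decimal("9\u00b08'W")
)
print(round(coords, 4))
```

Real coordinate strings are far messier than this (signed decimals, mixed separators, OCR noise), so a production converter would need many more rules than the three shown here.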
The methods and functions in this package were written by Vasco Branco, with code contributions by Vaughn Shirey and Thomas Merrien. Code revision by Pedro Cardoso.