From Paper to Place: How ARETE Turns Unstructured Text into Rich Biodiversity Data
Introduction: Why ARETE matters in a data-scarce world
If you care about biodiversity, you’ve probably run into a familiar obstacle: the most useful data—where species occur, where they’ve been spotted, how their ranges stretch across continents—sometimes lives in text you can’t easily machine-read. Think of decades’ worth of scientific papers, gray literature, and reports tucked away in PDFs and PDFs of PDFs. This “Wallacean shortfall” (the gap between what we know and what we could know about where species actually live) has slowed conservation planning, extinction risk assessments, and even our understanding of biodiversity change over time.
Enter ARETE, an open-source R package designed to automate the extraction of species occurrence data from unstructured text using large language models (LLMs). The idea is simple and powerful: give ARETE a document and a species name (or let it search for every species), and it uses an LLM to extract coordinates, localities, and other relevant notes, then cleans and validates the output. It’s not a replacement for careful human review, but it can dramatically speed up the initial data-gathering phase while leaving room for targeted manual checks where it matters most.
What ARETE is (in plain terms)
- A pipeline that blends R and Python to run large language models on text data to extract occurrence information.
- It handles the full workflow: from document handling (including OCR when text isn’t embedded) to pulling out locations and coordinates, to flagging possible data issues, and finally presenting results in clean tables ready for analysis.
- It’s designed around conservation biology needs: fast, scalable data extraction with built-in validation and outlier detection, plus tools to compare model outputs against human-annotated ground truth.
- It emphasizes transparency about model limitations, costs, and the potential biases that come with AI-driven data extraction (including what researchers should watch for when applying the tool to new taxa or non-English texts).
How ARETE works: a practical, end-to-end workflow
1) User input: a simple yet flexible starting point
ARETE aims to be user-friendly for ecologists and conservation biologists who may not be coding experts. The core function, get_geodata(), requires only a few inputs:
- path: a file path to a document in PDF or TXT format.
- tax: the species name to search for (or NULL to retrieve data for all species found in the document).
- userkey: the OpenAI API key or other service keys, plus a flag for free vs premium access.
- service and model: which LLM service and model to use (at the time of the study, OpenAI’s GPT models were the typical choice, with GPT-3.5 or a GPT-4o-family model recommended in different contexts).
- outpath: optional path to save results.
- verbose: whether to print progress updates.
2) Document processing: read, extract, and clean text
If your file is already text, ARETE can use it directly. If it’s a PDF with embedded text, ARETE uses that. If the PDF is just scanned images, ARETE switches to an OCR step (Nougat OCR) to pull the text. After text extraction, the pipeline cleans up odd characters (for example, symbol quirks that can appear in older papers) so the LLM sees clean, consistent input.
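Text cleanup of this kind can be sketched in a few lines. The function below is a Python illustration of the idea (ARETE itself blends R and Python); the specific normalizations are illustrative assumptions, not ARETE’s actual cleaning rules:

```python
import re
import unicodedata

def clean_extracted_text(raw: str) -> str:
    """Normalize PDF/OCR text before sending it to an LLM.

    A minimal sketch: a real pipeline may apply more targeted fixes
    for domain-specific symbols in older papers.
    """
    # Normalize Unicode so ligatures like "fi" collapse to plain letters
    text = unicodedata.normalize("NFKC", raw)
    # Replace curly quotes and long dashes with ASCII equivalents
    text = text.translate(str.maketrans({"\u2018": "'", "\u2019": "'",
                                         "\u201c": '"', "\u201d": '"',
                                         "\u2013": "-", "\u2014": "-"}))
    # Collapse runs of whitespace introduced by multi-column layouts
    return re.sub(r"\s+", " ", text).strip()
```

Consistent input like this matters because stray symbols can confuse an LLM into misreading coordinates or species names.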
3) LLM requests: chunking and prompting
LLMs have token limits, so ARETE splits long documents into manageable chunks and appends each chunk to a prompt tailored for the selected service. The prompts are designed to nudge the model toward extracting species occurrences (locations and coordinates) while avoiding irrelevant text. The approach also accounts for differences between free and premium API access (which can affect context windows and throughput).
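A minimal sketch of the chunk-and-prompt step, in Python. The chunk sizes, overlap, and prompt wording here are hypothetical placeholders; production code would count model tokens rather than words:

```python
def chunk_text(text, max_tokens=3000, overlap=200):
    """Split a long document into overlapping word-based chunks so each
    fits within a model's context window. Overlap reduces the chance of
    losing a record that straddles a chunk boundary."""
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

def build_prompt(chunk, species=None):
    """Attach a chunk to an extraction prompt (wording is illustrative)."""
    target = species if species else "any species mentioned"
    return ("Extract occurrence records (locality names and coordinates) "
            f"for {target} from the following text:\n\n{chunk}")
```

Each chunk is then sent as a separate request, and the per-chunk results are merged afterward.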
4) Outlier processing: sanity checks for geographic and environmental sense
After extraction, ARETE can run an optional outlier-detection step using an existing R package (gecko) to flag questionable points. There are three main outlier signals:
- Geographical distances (geo) to detect data points far from the rest.
- Environmental distances (env) calculated from climate data like WorldClim, to see if a point sits in an unusual environmental space.
- An SVM-based approach using pseudo-absence data to define an environmental envelope.
The user can set the threshold (default is 95%), and decisions about what to do with flagged points are left to the researcher. This balance helps avoid blindly discarding data while still catching likely errors.
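To make the geographic signal concrete, here is a simplified Python stand-in for a distance-based outlier flag at a 95% threshold. This is not gecko’s actual implementation, just the underlying idea:

```python
import math

def flag_geo_outliers(points, threshold=0.95):
    """Flag (lat, lon) points whose mean distance to the other points
    exceeds a percentile cutoff. A simplified sketch of a 'geo'-style
    outlier check; gecko's real method may differ."""
    def haversine(p, q):
        # Great-circle distance in km between two (lat, lon) pairs
        lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2)
             * math.sin((lon2 - lon1) / 2) ** 2)
        return 6371 * 2 * math.asin(math.sqrt(a))

    # Mean distance from each point to every other point
    mean_dists = [sum(haversine(p, q) for q in points) / (len(points) - 1)
                  for p in points]
    cutoff = sorted(mean_dists)[int(threshold * (len(points) - 1))]
    return [d > cutoff for d in mean_dists]
```

A flagged point isn’t necessarily wrong; it’s a prompt for the researcher to look closer, which matches ARETE’s philosophy of leaving the final decision to the user.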
5) Model validation: how well did the AI do?
ARETE includes a validation option (performance_report()) that compares AI-extracted data to ground-truth data supplied by the user. Metrics like accuracy, recall, precision, and F1 are calculated for coordinates and locality names, with a distance-based weighting to reflect that large coordinate errors matter more than minor ones. The output is meant to be transparent and reviewable—R Markdown reports are produced to summarize performance for each species and globally.
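The distance-weighted matching idea can be sketched as follows. The matching rule and the partial-credit weighting below are illustrative assumptions, not the exact scheme used by performance_report():

```python
import math

def score_extractions(predicted, truth, max_err_km=10.0):
    """Compare extracted (lat, lon) points to ground truth, giving
    partial credit to near-misses and no credit to large errors.
    A hedged sketch of distance-weighted validation."""
    def dist_km(p, q):
        # Equirectangular approximation, adequate for small errors
        dlat = math.radians(q[0] - p[0])
        dlon = (math.radians(q[1] - p[1])
                * math.cos(math.radians((p[0] + q[0]) / 2)))
        return 6371 * math.hypot(dlat, dlon)

    matched, tp = set(), 0.0
    for p in predicted:
        best, best_d = None, float("inf")
        for i, t in enumerate(truth):
            d = dist_km(p, t)
            if i not in matched and d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= max_err_km:
            matched.add(best)
            tp += 1 - best_d / max_err_km * 0.5  # near-misses earn partial credit
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The key design choice is that a prediction 2 km off counts for more than one 9 km off, while anything beyond the error ceiling is simply a false positive.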
6) Fine-tuning and performance improvements: what difference does training make?
A key part of the ARETE study was to test whether fine-tuning the model on a curated corpus could improve results. They experimented with two GPT variants:
- Baseline performance with GPT-3.5-turbo-1106 (no fine-tuning).
- Fine-tuned models (including GPT-3.5-turbo and a GPT-4o family model) trained on a corpus of manually annotated ecological papers (RECODE).
7) A real-world test: a spider data use case
To demonstrate ARETE in a real research setting, the authors sampled 100 random spider species from the World Spider Catalog (WSC) daily export and downloaded all associated PDFs. Of these, about 155 were in languages other than English and were excluded from the spider use case, leaving 321 PDFs for extraction. They used their best-performing GPT-4o fine-tuned model for this task and conducted both automated extractions and manual checks on a subset of 50 species to gauge accuracy.
What the results actually looked like (high-level summary)
- Without fine-tuning (GPT-3.5 baseline): the researchers reported a reasonable level of performance but also a notable rate of false positives (hallucinations) and some errors tied to OCR misreads. The metrics for a set of 50 papers showed:
  - Accuracy around 0.71
  - Recall around 0.76
  - Precision around 0.92
  - F1 around 0.83
  Some false positives were due to OCR misinterpreting characters or misattributing non-target species as targets; these are typical pitfalls when machine-reading older literature.
- With fine-tuning (the best-performing model, within one standard deviation across folds), performance jumped substantially:
  - Accuracy around 0.92
  - Recall around 0.97
  - Precision around 0.95
  - F1 around 0.96
This demonstrates that fine-tuning, when done carefully, can dramatically improve extraction quality by teaching the model to better distinguish target information from distractors in real biodiversity texts.
Comparing the spider use case’s automated extractions against a GBIF baseline showed:
- Data extraction completeness varied by paper quality and language, with some papers presenting substantial OCR-related challenges.
- In a subset of 50 papers, the accuracy was around 0.54, recall about 0.61, precision around 0.84, and F1 about 0.70. This drop compared to the larger validation set underscores how paper quality and OCR issues can influence performance, especially in older or lower-quality documents.
Real-world conservation signal: how ARETE changes our view of species ranges
In the spider use case, the authors compared ARETE-derived data with GBIF data for 22 species that had geospatial data in both sources. The mean extent of occurrence (EOO) increased dramatically when incorporating ARETE data, by a factor of roughly 1,949.84 on average. In 9 of the 22 species, these increases were large enough to alter IUCN threat classifications under certain criteria, illustrating how automated, text-derived data can reshape conservation assessments by revealing previously hidden occurrences.
Complementarity with GBIF
The study highlighted that ARETE and GBIF data overlap was substantial (about 60%), but there was also notable complementary information: roughly 17% of the species occurrences found by ARETE were not present in GBIF or were only discovered via ARETE’s automated extraction. This suggests that AI-assisted text mining can uncover distributional data that traditional biodiversity databases may miss, particularly in understudied taxa or literature that hasn’t yet been digitized into standard databases.
Efficiency and cost considerations
One of ARETE’s big selling points is speed and cost. The authors argue that, at scale, the time and money saved by automated extraction can be substantial compared to expensive human curation. They present rough figures suggesting a large efficiency gain: for a representative 16-page paper, a human annotator might take several minutes, while ARETE can process it in a fraction of that time and at a fraction of the per-paper cost at current AI pricing. They also emphasize that researchers can start with a free-trial account for the AI service to gauge fit before committing to paid usage.
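The back-of-envelope comparison above can be sketched in code. Every number here is a hypothetical placeholder, not a figure from the ARETE study; plug in current API pricing and local labor costs before drawing conclusions:

```python
def estimate_costs(n_papers, pages_per_paper=16,
                   tokens_per_page=600, price_per_1k_tokens=0.0005,
                   human_minutes_per_paper=10, hourly_rate=25.0):
    """Compare rough LLM extraction cost vs. human annotation cost
    for a batch of papers. All defaults are hypothetical."""
    llm_cost = (n_papers * pages_per_paper * tokens_per_page
                / 1000 * price_per_1k_tokens)
    human_cost = n_papers * human_minutes_per_paper / 60 * hourly_rate
    return llm_cost, human_cost
```

Even with generous assumptions about human speed, the per-paper API cost tends to be orders of magnitude below annotator time once the batch grows large.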
What this all means in practical terms
- Speed without sacrificing rigor: ARETE accelerates the early, data-hungry stages of biodiversity work—pulling out potential occurrences from large swaths of text so researchers can focus their time on targeted validation and expert review.
- Aiding conservation planning: By expanding extents of occurrence and revealing previously hidden localities, ARETE can influence priority-setting for conservation actions and help refine risk assessments like IUCN Red List evaluations.
- Managing risk: The tool is designed with explicit notes about where AI can slip, such as OCR errors, misinterpreting non-target species, or misattributing localities. The built-in outlier detection and validation reporting help researchers identify and correct these issues.
- Taxonomic and language considerations: The current work tested English-language texts and highlighted potential biases and limitations when non-English sources are involved. There’s also an eye toward future enhancements like taxon-aware synonym checks and broader taxonomic references.
Limitations and what’s on the horizon
- Resource constraints and validation: While the performance looks promising, broader validation with more annotators and across different text types would strengthen confidence. The authors discuss plans for inter-rater reliability checks and broader data types, but note these require additional funding and resources.
- Black-box concerns: As with many AI systems, the “why” behind certain extractions is opaque. The authors stress transparency about model outputs and provide metrics to help users judge reliability, but the reasoning inside the model remains hidden.
- Access and reproducibility: Fine-tuned models used in the study aren’t publicly released. The core ideas (training data and prompts) are described, and the underlying approach should be reproducible, but the model weights themselves aren’t openly shared in the paper’s open-access materials.
- Language bias: The study emphasizes English-language texts, with limited testing on non-English sources. Expect performance to vary with non-English papers, and anticipate future work to broaden language coverage.
Getting started with ARETE: what you need to know
- Availability: ARETE is available on CRAN and GitHub, with links provided by the authors. It’s designed to integrate with Python-based LLM services while delivering results through an R-centric workflow, which is a common setup in ecological research.
- How to use it in practice:
- Install ARETE in R (and ensure Python support if you want to access the Python-based LLM tools).
- Prepare PDFs or TXT documents containing species information.
- Use get_geodata() to extract occurrences, specifying a species (tax) if you want targeted results, and provide your API key for the LLM service.
- Optionally run outlier detection (via gecko) and generate a model-validation report to see how well the AI matched ground-truth data you provide.
- What to expect in your workflow:
- A fast first pass that yields tables of potential occurrences with coordinates.
- A validation step where you compare AI output to trusted references and identify potential errors.
- A path to refine models by fine-tuning if you have a corpus of annotated papers and you want to push performance even higher for your target taxa.
A few practical tips if you’re trying ARETE for the first time
- Start with English-language texts if that’s what you have, but plan to test on non-English sources gradually. Expect the need for more careful validation when non-English data are involved.
- Use OCR-enabled processing for older literature or older PDFs. OCR is a common source of false signals, so keep an eye on extraction quality in your validation report.
- Leverage outlier detection as a guardrail. It’s easy to get a few suspicious points from OCR quirks or misread coordinates, and the outlier checks can help you decide whether to drop or scrutinize those points.
- Run a validation report against a trusted reference set. That will help you quantify how well ARETE’s extractions line up with known data, and it’s invaluable when communicating results to collaborators or funders.
- Consider fine-tuning if you’re working with a well-defined set of papers or taxa. As shown in the researchers’ tests, fine-tuning can meaningfully improve precision and recall, especially when you have a labeled corpus to train on.
Key takeaways: what to remember about ARETE
- ARETE is a practical bridge between unstructured biodiversity literature and machine-readable occurrence data, designed to speed up data collection for conservation work.
- The pipeline handles the entire flow—from document ingestion and OCR to LLM-based data extraction, outlier detection, and validation—so researchers can focus on interpretation and decision-making.
- Model performance is context-dependent. Baseline GPT-3.5 results were solid but benefited substantially from fine-tuning, with metrics improving from around 0.71 accuracy to around 0.92 accuracy in controlled validation settings.
- Real-world use with spiders showed that while AI can dramatically increase the extent of occurrence data, quality varies with paper quality and language. The approach still produced meaningful insights about range expansion and potential IUCN status changes for several species.
- ARETE emphasizes transparency about limitations, costs, and biases. It provides validation tools and documentation to help researchers track how AI-driven extractions compare to human annotations, and it highlights the types of errors (false positives, false negatives) that matter most in conservation decisions.
- The tool is designed to be scalable and adaptable. It’s open-source, integrates with existing biodiversity databases, and has a roadmap for broader taxonomic coverage, additional LLMs (including open-source options), and improved taxonomic referencing to handle synonyms and taxonomy more robustly.
- For researchers, ARETE offers a compelling workflow to discover and leverage undocumented or previously inaccessible occurrence data, potentially reshaping how we map species distributions and assess extinction risk—especially for less-studied taxa where traditional data sources are sparse.
If you’re curious about applying ARETE to your own projects
- Visit the ARETE project on CRAN and GitHub to get started and view tutorials, README guidance, and example workflows.
- Start with a small batch of PDFs in English, run the default get_geodata() workflow, and generate a validation report to gauge baseline performance.
- As you build trust, consider fine-tuning on a targeted annotated corpus to push accuracy and recall higher for your focal taxa.
- Use the outlier detection and the Levenshtein-based locality-name validation to maintain data quality and keep human oversight where it matters most.
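For readers unfamiliar with Levenshtein distance, here is a compact Python sketch of how extracted locality names can be scored against reference names. The normalization into a 0–1 similarity is an illustrative convention, not ARETE’s exact metric:

```python
def levenshtein(a, b):
    """Edit distance between two strings: the minimum number of
    single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def locality_similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))
```

A small edit distance between "Lisbon" and "Lisboa" signals a likely match despite OCR or spelling variation, whereas a large one flags a locality worth manual review.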
Bottom line
ARETE represents a thoughtful, transparent attempt to harness the power of large language models to fill a stubborn data gap in biodiversity science. It doesn’t pretend AI will replace ecologists or taxonomists; instead, it offers a scalable way to accelerate the initial, labor-intensive step of turning scattered textual clues into usable, machine-readable data. For researchers grappling with the urgency of conservation planning and the imperative to understand species distributions more comprehensively, ARETE is a tool worth exploring—and, with community feedback, it could become even more capable and widely applicable.
Key Takeaways
- ARETE automates the extraction of species occurrence data from unstructured text (PDFs, papers, reports) using large language models, streamlining the early data-collection phase in biodiversity research.
- The workflow covers document processing (including OCR), LLM-driven data extraction, outlier detection, and rigorous model validation, all within an approachable R-centric pipeline.
- Fine-tuning LLMs on a domain-specific corpus can markedly improve performance, with metrics in the study rising from about 0.71 to 0.92 accuracy in controlled tests.
- Real-world use in spiders showed that AI-driven extraction can dramatically expand the known extents of occurrence (EOO), potentially affecting conservation assessments, while also underscoring the need for careful validation due to data quality and language issues.
- ARETE’s strengths lie in speed, cost-effectiveness, and transparency about errors, with built-in validation tools to balance automation with expert review.
- The tool is actively developed, open-source, and ready for researchers to experiment with, while the authors acknowledge current limitations (language bias, OCR-related errors, proprietary model considerations) and chart a path for future enhancements.