AI Prompt Engineering for Bibliographic Scraping (PHP Code in 1 Go)
Table of Contents
- Introduction
- Why This Matters
- The Big Idea: Treat Prompts Like a Spec
- From “Ask a Question” to a Working Scraper: Control vs Advanced Prompts
- How Their Method Works in Practice (ChatGPT → PHP → Claude)
- Results You Can Trust: Speed, Errors, and Stress Tests
- Limitations and Ethical Guardrails
- Key Takeaways
- Sources & Further Reading
Introduction
If you’ve ever tried to pull thousands of bibliographic records from a library website, you already know the pain: pages have weird formatting, fields appear in inconsistent places, and your scraper breaks the moment the HTML changes slightly. The new research behind prompt engineering for bibliographic web-scraping attacks that exact problem by making AI generate a real PHP scraper—not just an explanation—using well-structured prompts.
This blog post is based on the paper Prompt engineering for bibliographic web-scraping. The authors (Manuel Blázquez-Ochando, Juan José Prieto-Gutiérrez, María Antonia Ovalle-Perandones) focus on one practical question: How do we write prompts so the model produces fully functional scraping code in a single interaction (or with minimal back-and-forth), specifically for bibliographic catalogues?
Their approach is tested on the Spanish National Library catalogue (datos.bne.es): the stress test set out to reach 50,000 records and ended up collecting 55,473. Along the way, they compare a “simple prompt” against a structured “advanced prompt,” then validate the resulting code at scale with stress testing.
Why This Matters
Right now, there’s a growing mismatch in research workflows: you need bibliographic datasets for bibliometrics, altmetrics, collection analysis, and migration tasks—but the data you want often isn’t available via clean APIs (or the API is locked behind institutional constraints). In that real world, researchers and librarians frequently face a choice between:
- waiting weeks for custom development, or
- scraping manually (slow) / hacking together scripts (fragile).
What makes this paper timely is that it treats prompt engineering like engineering, not like “magic words.” In other words, the prompt becomes a specification document for the scraper: the execution environment, the extraction method, which fields matter, how the code should structure outputs, and even which PHP functions to use.
A scenario you can apply today: suppose your university library wants to build a “new acquisitions” alert using data from a national catalogue page—but the standard protocol (like OAI-PMH) isn’t convenient. Instead of commissioning a scraper from scratch, you can use the method from this research to generate a working scraper script quickly, adapt the fields, and loop over record IDs—while still being respectful to server load.
Earlier work on AI-assisted coding showed that LLMs are highly sensitive to prompt wording and formatting. This research pushes that one step further by proposing a reusable prompt template designed for bibliographic scraping, validated with tens of thousands of records and tested with two different model families (ChatGPT-4o and Claude Sonnet 3.5). That cross-model angle is crucial if you’re building workflows that can’t depend on a single model forever.
The Big Idea: Treat Prompts Like a Spec
The core takeaway is simple: the prompt isn’t just instructions—it’s the contract that tells the LLM what context exists and what “done” looks like.
In the paper’s framing, “prompt engineering” is essentially the skill of creating optimized, structured inputs so the AI doesn’t guess. That matters because when AI has too little context, it fills gaps with plausible assumptions—which might be correct in a demo, but wrong in production scraping.
Think of it like this: a web scraper has to know three things with almost programmer-level clarity:
1. Where to fetch content (the request method, environment details)
2. How to locate fields (HTML structure, XPath rules, table rows, labels)
3. How to output results (data structure names, variable conventions, database insertion logic)
If you give the model only: “Create a PHP scraper for this page,” it will improvise. If you give it a structured spec with constraints, examples, extraction rules, and expected outputs, it’s much more likely to produce code that matches reality.
From “Ask a Question” to a Working Scraper: Control vs Advanced Prompts
The authors run a deliberate comparison between two prompt styles.
The control prompt: too open = risky guessing
Their control prompt asks for a PHP scraper to extract bibliographic data from a specific record page on datos.bne.es. While it includes the language (PHP) and objective (extract bibliographic data), it leaves many crucial details implicit.
As a result, the AI chooses an approach that “sounds reasonable” in general scraping. For example, the control-prompt output used a common library (simple_html_dom.php) and assumed the metadata is exposed through Dublin Core-style property attributes, with selectors such as:
- h1[property="dc:title"]
- span[property="dc:creator"]
- span[property="dc:publisher"]
That’s a classic AI behavior: when the prompt doesn’t specify the actual HTML patterns, the model substitutes its best generic guess.
The paper notes that this approximation fails when executed against the real page—because the assumptions about the HTML structure don’t match the catalogue’s actual layout. In other words: the code may look correct, but it’s not grounded in the page’s real structure.
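To make that failure concrete, here is a minimal sketch of a guessed selector next to a grounded one. The sample row is made up, and it assumes the label-in-<strong>, value-in-the-next-cell pattern that the advanced prompt describes; the real page markup needs to be checked by hand.

```php
<?php
// Hypothetical sample row, not copied from the real catalogue page.
$html = '<table><tr><td><strong>Título</strong></td><td>Don Quijote de la Mancha</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
// The meta prefix tells libxml that the fragment is UTF-8.
$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
$xpath = new DOMXPath($doc);

// Generic guess: a Dublin Core style attribute that this markup does not contain.
$guessed = $xpath->query('//h1[@property="dc:title"]');

// Grounded rule: find the row by its <strong> label and read the second cell.
$grounded = $xpath->query('//tr[td/strong[normalize-space()="Título"]]/td[2]');

echo "guessed matches: ", $guessed->length, PHP_EOL;
echo "grounded matches: ", $grounded->length, PHP_EOL;
```

The guessed query is perfectly valid XPath. It simply describes a page that does not exist, which is why code like this can look right and still return nothing.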
The advanced prompt: role + constraints + examples + steps
The advanced prompt is built from multiple sections (the paper explicitly recommends this style), and it’s much more like a mini technical document. It includes:
- Role: “researcher in software development and documentation sciences” with expertise in PHP web scraping
- Context and purpose:
  - the target URL
  - the execution environment: PHP on Apache, MySQL support
  - which fields to extract (title, place, publisher, date, physical description, dimensions, call number/signature, location, institution/headquarters)
- Inputs and constraints:
  - explicitly require use of cURL functions like curl_init, curl_exec, curl_setopt
  - instruct the use of XPath for extraction
  - optionally allow preg_match if needed, but also present an alternative logic approach
- Input/output examples:
  - show a snippet of an HTML row containing something like <strong>Título</strong> and its corresponding value cell
  - show how that should map into PHP variables and arrays
- Detailed steps:
  - fetch HTML with cURL
  - parse DOM with DOMDocument + DOMXPath
  - apply extraction logic
  - build $arrayPHP/$data style structures
  - print results for verification
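If you want to keep a template like this under version control, one option is to assemble it from named sections. The sketch below is only an illustration: buildPrompt is a hypothetical helper, the URL is a placeholder, and the wording paraphrases the sections listed above rather than quoting the paper's prompt.

```php
<?php
/**
 * Assemble a sectioned prompt from an ordered map of heading => bullet lines.
 * Helper name and sample wording are illustrative, not the paper's text.
 */
function buildPrompt(array $sections): string
{
    $blocks = [];
    foreach ($sections as $heading => $lines) {
        $blocks[] = $heading . ":\n- " . implode("\n- ", $lines);
    }
    return implode("\n\n", $blocks);
}

$prompt = buildPrompt([
    'Role' => [
        'Researcher in software development and documentation sciences, expert in PHP web scraping',
    ],
    'Context and purpose' => [
        'Target URL: https://example.org/record/0000000001 (placeholder)',
        'Environment: PHP on Apache with MySQL support',
        'Fields: title, place, publisher, date, physical description, dimensions, call number, location, institution',
    ],
    'Inputs and constraints' => [
        'Use curl_init, curl_setopt and curl_exec to fetch the page',
        'Use XPath to locate values',
    ],
    'Input/output examples' => [
        'A row with <strong>Título</strong> and a value cell maps to $data["title"]',
    ],
    'Detailed steps' => [
        'Fetch HTML with cURL',
        'Parse with DOMDocument and DOMXPath',
        'Apply the extraction rules',
        'Build the $data array',
        'Print the results for verification',
    ],
]);

echo $prompt, PHP_EOL;
```

Keeping sections as data makes it easy to swap the field list or the example row when you move to a different catalogue, without touching the rest of the template.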
This is where things get interesting: because the prompt includes the pattern seen in the page (label in a <strong> tag inside a table row, value in a nearby cell), the model doesn’t need to hallucinate structure. It gets a usable “anchor” it can follow.
The paper reports that the advanced prompt’s generated code successfully ran on the record page and produced the expected extracted fields.
How Their Method Works in Practice (ChatGPT → PHP → Claude)
The paper doesn’t just stop at “the model wrote code once.” They push the method into a full workflow, including scale testing and interoperability between different AI systems.
Phase 1: Pick a real scraping target
They chose the Spanish National Library catalogue at datos.bne.es because it has:
- millions of records (17.4 million in the BNE report mentioned in the paper)
- structured and semi-structured representations on the page
- a realistic test environment for stress and performance
Phase 2: Generate a scraper from one page
Using ChatGPT-4o with the advanced prompt template, they generated PHP code that:
- retrieves HTML with cURL
- parses content with DOM + XPath
- extracts field values from table rows
- maps labels (like “Título”, “Lugar de publicación”, “Editorial”, etc.) to PHP variables
- stores extracted data into an array
- prints values for quick validation
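The steps above can be sketched roughly as follows. This is not the code the study generated: the function names, the sample markup, and the array keys are assumptions chosen for illustration, and the fetch helper is defined but not called so the example runs offline.

```php
<?php
/** Fetch a page with cURL. Returns null on transport errors or non-200 responses. */
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($body !== false && $status === 200) ? $body : null;
}

/**
 * Extract values from rows shaped like
 * <tr><td><strong>Label</strong></td><td>Value</td></tr>.
 * $labelMap translates page labels into array keys; absent or empty fields stay null.
 */
function extractFields(string $html, array $labelMap): array
{
    $previous = libxml_use_internal_errors(true); // tolerate messy real-world HTML
    $doc = new DOMDocument();
    // The meta prefix tells libxml that the markup is UTF-8.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $data = array_fill_keys(array_values($labelMap), null);

    foreach ($xpath->query('//tr[td/strong]') as $row) {
        $label = trim($xpath->evaluate('string(td/strong)', $row));
        $value = trim($xpath->evaluate('string(td[strong]/following-sibling::td[1])', $row));
        if (isset($labelMap[$label]) && $value !== '') {
            $data[$labelMap[$label]] = $value;
        }
    }
    return $data;
}

// Hypothetical sample markup; the real page layout has to be checked by hand.
$sample = <<<HTML
<table>
  <tr><td><strong>Título</strong></td><td>Don Quijote de la Mancha</td></tr>
  <tr><td><strong>Lugar de publicación</strong></td><td>Madrid</td></tr>
  <tr><td><strong>Editorial</strong></td><td> </td></tr>
</table>
HTML;

$labelMap = [
    'Título' => 'title',
    'Lugar de publicación' => 'placeOfPublication',
    'Editorial' => 'publisher',
];

print_r(extractFields($sample, $labelMap));
```

Driving the extraction from a label map keeps the XPath rule in one place, so adding a field is a one-line change rather than a new query.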
Phase 3: Stress test by looping over record IDs
This is where most scrapers die: single-page extraction is easy. Iteration is hard.
To loop over thousands of records, they needed additional logic:
- set up MySQL connection (PDO)
- insert scraped fields into a database table
- create a main loop to visit successive record URLs
- format record identifiers as 10-digit numbers (using str_pad)
- handle missing fields (null/empty values)
- skip records when key fields (like title) are missing
- add a 3-second pause after each insertion to avoid overwhelming the server
- run the stress test until at least the first 50,000 records had been collected
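A minimal sketch of that loop, under stated assumptions: the base URL is a placeholder rather than the catalogue's real path, and fetching, extraction, and storage are passed in as callables so the control flow can be dry-run without touching a server or a database.

```php
<?php
/**
 * Build a record URL from a numeric ID, zero-padded to 10 digits.
 * The base URL is a placeholder, not the catalogue's real path.
 */
function buildRecordUrl(int $id, string $base = 'https://example.org/record/'): string
{
    return $base . str_pad((string) $id, 10, '0', STR_PAD_LEFT);
}

/**
 * Visit IDs $first..$last. $fetch returns HTML or null (for example on a 404),
 * $extract returns an array of fields, $store persists one record.
 * Returns counters so a run can be audited afterwards.
 */
function crawlRange(
    int $first,
    int $last,
    callable $fetch,
    callable $extract,
    callable $store,
    int $pauseSeconds = 3
): array {
    $stats = ['visited' => 0, 'missing' => 0, 'skipped' => 0, 'stored' => 0];
    for ($id = $first; $id <= $last; $id++) {
        $url = buildRecordUrl($id);
        $stats['visited']++;

        $html = $fetch($url);
        if ($html === null) {            // 404 or transport error
            $stats['missing']++;
            continue;
        }

        $record = $extract($html);
        if (empty($record['title'])) {   // key field absent: skip the record
            $stats['skipped']++;
            continue;
        }

        $store(['url' => $url] + $record);
        $stats['stored']++;
        sleep($pauseSeconds);            // politeness pause after each insertion
    }
    return $stats;
}

// Dry run with stubs and no pause, so nothing touches the network.
$pages = [1 => 'A', 3 => ''];            // ID 2 is "missing", ID 3 has no title
$fetch = function (string $url) use ($pages): ?string {
    return $pages[(int) substr($url, -10)] ?? null;
};
$extract = fn(string $html): array => ['title' => $html];
$saved = [];
$store = function (array $record) use (&$saved): void {
    $saved[] = $record;
};

print_r(crawlRange(1, 3, $fetch, $extract, $store, 0));
```

Returning counters from the loop is a design choice for this sketch: it makes it cheap to report how many links were visited, how many were missing, and how many records were skipped, which are the kinds of figures the stress test relies on.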
Crucially, they didn’t assume another round of ChatGPT would be available or ideal. Instead, they tested interoperability by having Claude Sonnet 3.5 reuse and extend the ChatGPT-generated scraping code.
The Claude prompt preserved the structure (“role,” “context and purpose,” “inputs and constraints,” “detailed steps”), but added requirements for:
- MySQL table structure (columns like url, author, title, placeOfPublication, publisher, publicationDate, physicalDescription, ... headquarters)
- insertion SQL
- looping behavior
- skipping null titles
- rate limiting (3 seconds)
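For the database side, a parameterised insert with PDO might look like the sketch below. The table name is hypothetical, and the column list contains only the columns named above; the source elides the rest, so they are left out rather than guessed.

```php
<?php
// Only the columns the post names explicitly; the elided ones are not filled in.
const RECORD_COLUMNS = [
    'url', 'author', 'title', 'placeOfPublication', 'publisher',
    'publicationDate', 'physicalDescription', 'headquarters',
];

/** Build a parameterised INSERT statement for the given table and columns. */
function buildInsertSql(string $table, array $columns): string
{
    $placeholders = array_map(fn(string $c): string => ':' . $c, $columns);
    return sprintf(
        'INSERT INTO %s (%s) VALUES (%s)',
        $table,
        implode(', ', $columns),
        implode(', ', $placeholders)
    );
}

/** Insert one record; fields missing from $record are stored as NULL. */
function insertRecord(PDO $pdo, array $record, string $table = 'bibliographic_records'): void
{
    $stmt = $pdo->prepare(buildInsertSql($table, RECORD_COLUMNS));
    $params = [];
    foreach (RECORD_COLUMNS as $column) {
        $params[':' . $column] = $record[$column] ?? null;
    }
    $stmt->execute($params);
}

echo buildInsertSql('bibliographic_records', RECORD_COLUMNS), PHP_EOL;
```

In a real run you would create the connection once, for example with new PDO('mysql:host=...;dbname=...;charset=utf8mb4', $user, $password) using your own credentials, and pass it to insertRecord inside the loop. Bound parameters keep scraped text, which can contain quotes and other awkward characters, out of the SQL string itself.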
The paper reports that Claude correctly integrated the looping and database insertion logic while respecting the original scraping structure. That’s a big deal in practice: you can build a pipeline where one model handles extraction logic, another handles orchestration and database concerns—without rewriting from scratch.
If you want to see the full prompt-to-code chain, the original paper is the place to start: Prompt engineering for bibliographic web-scraping.
Results You Can Trust: Speed, Errors, and Stress Tests
Let’s talk numbers, because this is where “AI-generated code” either earns trust or gets discarded.
Error rate and data volume
During the stress test:
- The program examined 62,786 links
- 7,313 of them returned 404 errors (about 11.65%)
- It collected 55,473 records over 63 hours, 3 minutes, 12 seconds total duration (including scheduled pauses)
- The run included one server anomaly unrelated to the scraper itself, lasting 4:25:48 (hours:minutes:seconds)
Scraping throughput
Because they inserted a 3-second scheduled pause between each insertion, the total time includes deliberate throttling to protect the server.
They compute:
- total scheduled break time: 46:13:39
- effective scraping time (excluding pauses): about 60,573 seconds
- effective extraction rate: 66.76 records per minute (≈ 4,005.60 records per hour)
They also compare “ideal theoretical time” vs actual time:
- ideal: 55,473 records * 3 seconds = 166,419 seconds
- effective efficiency (ideal time divided by effective scraping time, 166,419 / 60,573): 274.74%
- real efficiency using ideal vs total time discounting anomaly: 73.12%
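Several of these figures can be rechecked from the numbers quoted in this post. The short script below does that for the 404 share, the scheduled pause time, the effective scraping time, and the ideal-versus-effective ratio. It deliberately leaves out the records-per-minute rate and the 73.12% figure, since their exact inputs are not restated here.

```php
<?php
/** Convert an H:MM:SS duration into seconds. */
function hmsToSeconds(string $hms): int
{
    [$h, $m, $s] = array_map('intval', explode(':', $hms));
    return $h * 3600 + $m * 60 + $s;
}

$links    = 62786;  // links examined
$notFound = 7313;   // 404 responses
$records  = 55473;  // records collected
$pause    = 3;      // seconds of pause after each insertion

$totalSeconds     = hmsToSeconds('63:03:12');
$pauseSeconds     = $records * $pause;              // same as 46:13:39
$effectiveSeconds = $totalSeconds - $pauseSeconds;

printf("404 share: %.2f%%\n", 100 * $notFound / $links);
printf("scheduled pauses: %d s\n", $pauseSeconds);
printf("effective scraping time: %d s\n", $effectiveSeconds);
printf("ideal vs effective: %.2f%%\n", 100 * $pauseSeconds / $effectiveSeconds);
```

The "ideal" time and the total scheduled pause time are the same quantity, 55,473 × 3 seconds, which is why the 274.74% figure is simply the pause time divided by the time spent actually scraping.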
What does that mean in plain language? The scraper was fast when you ignore deliberate waits, but real-world “politeness delays” reduce end-to-end speed—which is exactly the trade-off you’d expect when you don’t want to hurt the host server.
Record numbering jumps
They detected 5,407 jumps in record numbering:
- average jump size: roughly 2.352 record numbers per jump
- jump sizes ranged from 2 to 81
- standard deviation about 2.5433, so gap sizes vary considerably around that average
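One way to measure jumps like these, sketched under the assumption that a jump is the difference between two consecutive record numbers that both returned a record (so a difference of 2 means one number was skipped). The IDs are made up, and the paper may define or compute its statistics differently.

```php
<?php
/**
 * Given the IDs that returned a record, in ascending order, list the steps
 * between consecutive IDs that are larger than 1.
 */
function findJumps(array $ids): array
{
    $jumps = [];
    for ($i = 1, $n = count($ids); $i < $n; $i++) {
        $step = $ids[$i] - $ids[$i - 1];
        if ($step > 1) {
            $jumps[] = $step;
        }
    }
    return $jumps;
}

/** Population standard deviation of a non-empty list of numbers. */
function stdDev(array $values): float
{
    $mean = array_sum($values) / count($values);
    $squares = array_map(fn($v) => ($v - $mean) ** 2, $values);
    return sqrt(array_sum($squares) / count($values));
}

$ids = [1, 2, 4, 5, 9, 10];  // made-up IDs for illustration
$jumps = findJumps($ids);    // steps of 2 (2 to 4) and 4 (5 to 9)

printf(
    "jumps: %d, mean size: %.2f, std dev: %.2f\n",
    count($jumps),
    array_sum($jumps) / count($jumps),
    stdDev($jumps)
);
```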
They also mention that only 14 cases took more than 20 seconds, likely due to those jumps.
Finally, they report an average completion rate per record of 85.4301%—interpretable as: most records contained enough information for most expected fields, but not everything was always present.
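A completion rate like this can be read as the share of expected fields that actually hold a value, averaged over all records. The sketch below follows that reading; it is an assumption about how the metric is defined, and the field names and sample records are invented.

```php
<?php
/** Percentage of expected fields that hold a non-empty value in one record. */
function completionRate(array $record, array $expectedFields): float
{
    $filled = 0;
    foreach ($expectedFields as $field) {
        if (isset($record[$field]) && trim((string) $record[$field]) !== '') {
            $filled++;
        }
    }
    return 100 * $filled / count($expectedFields);
}

/** Mean completion rate across a non-empty set of records. */
function averageCompletion(array $records, array $expectedFields): float
{
    $rates = array_map(
        fn(array $r): float => completionRate($r, $expectedFields),
        $records
    );
    return array_sum($rates) / count($rates);
}

$expected = ['title', 'publisher', 'publicationDate', 'location'];
$records = [
    ['title' => 'A', 'publisher' => 'P', 'publicationDate' => '1605', 'location' => 'Madrid'],
    ['title' => 'B', 'publisher' => '',  'publicationDate' => '1615', 'location' => null],
];

printf("average completion: %.2f%%\n", averageCompletion($records, $expected));
```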
Limitations and Ethical Guardrails
This method isn’t “one prompt fixes all.” The authors list limitations that matter if you want to use the approach responsibly.
Limitations
- Domain specificity: validated primarily on BNE’s catalogue, though principles should generalize
- Technology dependency: the code output is targeted to PHP in an Apache + MySQL stack
- LLM version dependence: results depend on model versions (ChatGPT-4o and Claude Sonnet 3.5 were used)
- Performance constraints: rate limiting (3-second pauses) increases total runtime
Ethical and legal issues
They stress ethical scraping:
- implement rate control (the 3-second pause is positioned as a commitment to not overload servers)
- follow institution policies and fair use expectations
- use extracted data for research/library development purposes
- document scraping methodology for transparency
That’s important because “AI generated scraper” can tempt people to run wild. This paper’s best practical contribution is that it bakes server protection into the workflow itself.
Key Takeaways
- Well-structured prompts work like specs. In the study, a “control” prompt led to generic guesses (and failing assumptions about HTML structure), while an “advanced” prompt produced working PHP scraping code.
- Role + constraints + examples + step-by-step workflow were the winning prompt structure, enabling code that could extract many bibliographic fields.
- The method supports scale. In stress testing, they scraped 55,473 records while handling 7,313 (11.65%) 404 errors, using throttling to protect the server.
- Interoperability matters. The authors successfully reused ChatGPT-generated scraping logic and extended it using Claude Sonnet 3.5 to add looping + database insertion—without rewriting from scratch.
- Practical performance is feasible. They report 66.76 records/minute effective extraction rate (excluding scheduled pauses) and a 73.12% real efficiency when accounting for practical runtime constraints.
- Future of AI-assisted data mining: This points toward a workflow where librarians and researchers can build reliable bibliographic data tools with minimal coding by leveraging prompt-engineered code generation.
Sources & Further Reading
- Original Research Paper: Prompt engineering for bibliographic web-scraping
- Authors:
- Manuel Blázquez-Ochando
- Juan José Prieto-Gutiérrez
- María Antonia Ovalle-Perandones