AI Medical Literature Retrieval Errors: Comparative Study

LLM assistants can speed up medical literature search—but they may also return hallucinated DOIs and incorrect PubMed links. This comparative study tests five free platforms, shows error rates, and offers fixes you can use today.

LLM platforms can return wrong or “hallucinated” DOIs and PubMed links—new research shows how often

Introduction

If you’ve ever used an AI assistant to “pull citations” for a paper, you’ve probably felt a little rush of relief—until you double-check and realize something is off. The newest research behind this blog digs into exactly that risk: errors in AI-assisted retrieval of medical literature, especially when AI returns plausible-looking references with incorrect bibliographic metadata.

This post is based on new research from the original paper, “Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study” (arXiv:2603.22344). The authors looked at five popular free-version LLM platforms and tested how accurately each one could retrieve references tied to real medical articles—focusing not just on whether the citations sounded relevant, but whether the DOI, PubMed ID, and Google Scholar link were actually correct.

And here’s the punchline: across platforms, the models completely failed to retrieve correct reference data nearly half the time. That’s not a minor glitch; it’s a workflow-level problem. Let’s break down what was tested, what the researchers found, and what you can do with this information right now.

Why This Matters

This is significant right now because “AI citation generation” is becoming a default muscle memory for researchers. People are using LLMs to speed up literature review, draft manuscripts, and sanity-check background reading. But unlike many other AI tasks, citation accuracy has a hard edge: a wrong DOI or PubMed ID doesn’t just mislead—it can silently reshape the evidence trail.

Imagine a real-world scenario: a graduate student uses an LLM to compile background citations for a grant application. The references look right at a glance, and the titles match what the student expects. The grant proceeds. Later, when the team tries to verify the sources, they discover several citations lead to different papers or don’t resolve at all. That can cost days (or weeks), create rework, and—if it’s not caught—risk scientific credibility.

What makes this research especially strong is that it goes beyond the usual “hallucinated citation” stories. Prior work has documented that LLMs can fabricate references—but this study quantifies retrieval error rates in a systematic way across multiple platforms and real top-tier medical journals. In other words, it’s not just “AI can be wrong,” it’s “AI is wrong in measurable, platform-dependent ways—and journal characteristics influence outcomes too.”

What the Study Actually Tested (and How)

The researchers wanted a fair, repeatable comparison of reference retrieval accuracy. So they built a dataset in a pretty disciplined way.

The test set: real articles from top journals

They randomly selected 40 original research articles from four major journals:
- BMJ
- JAMA
- NEJM
- The Lancet

All were published between January 2024 and July 15, 2025.

They pulled 10 articles per journal, sampling across the eligible time window to get a mix of publication dates.

The LLM task: “Give me 10 key references”

For each of the 40 articles, the researchers prompted five free-version LLM platforms:

  1. Grok-2
  2. ChatGPT (GPT-4.1)
  3. Google Gemini Flash 2.5
  4. Perplexity AI
  5. DeepSeek

The prompt asked each model to generate 10 key references related to the abstract, including:
- title
- publication year
- DOI
- PubMed ID
- Google Scholar URL

Then they added a second prompt step: the models were asked to “confirm” DOI, PubMed ID, and Google Scholar link data. After that, the team manually checked what the LLM produced.

How accuracy was scored: metadata + relevance

This part matters because they didn’t rely on “vibes.” They used multiple validity metrics:

Three bibliographic validity metrics
- DOI resolves and matches
- PubMed ID resolves and matches
- Google Scholar link resolves and matches

One relevance metric
- Whether the retrieved reference appears in the index article’s reference list (using cues like title and first author)
- If the retrieved reference was the index article itself, they gave extra credit

To combine these fairly (since not every reference will have a PubMed ID), they used a multimetric score ratio:
- Total score / score cap (only counting applicable metrics)

They also tracked complete miss rate:
- the proportion of retrieved references where the total score was zero (i.e., no validated metrics worked)

This is the key distinction: the study evaluated whether citations were bibliographically real, not just whether they sounded plausible.

(If you want to see exactly how the scoring worked, it’s described in the Methods section of the paper: arXiv:2603.22344.)
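To make the scoring concrete, here is a rough reconstruction of the logic described above. This is an illustrative sketch, not the authors' actual code: the function signature, field names, and the exact extra-credit bonus value are my assumptions.

```python
from typing import Optional


def score_reference(doi_ok: Optional[bool],
                    pmid_ok: Optional[bool],
                    scholar_ok: Optional[bool],
                    in_reference_list: bool,
                    is_index_article: bool = False) -> dict:
    """Score one retrieved reference against the study's metrics.

    Each bibliographic metric is True/False, or None when not
    applicable (e.g. a reference that has no PubMed ID). The score
    ratio divides points earned by the cap over applicable metrics.
    """
    metrics = [doi_ok, pmid_ok, scholar_ok, in_reference_list]
    applicable = [m for m in metrics if m is not None]
    total = sum(1 for m in applicable if m)
    # The paper gives extra credit when the model returns the index
    # article itself; the bonus value of 1 here is an assumption.
    if is_index_article:
        total += 1
    cap = len(applicable)
    return {
        "total": total,
        "cap": cap,
        "ratio": total / cap if cap else 0.0,
        "complete_miss": total == 0,  # no validated metric worked
    }
```

Under this scheme, a reference with a valid DOI and a relevance match but a wrong Scholar link and no PubMed ID scores 2 out of a cap of 3, and the index-article bonus explains how ratios can exceed 1 (the paper reports a range up to 1.25).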

How Wrong Can It Get? The Key Numbers

Now for the headline results—because they’re blunt.

“Complete miss” happened 47.8% of the time

Across all platforms, the models completely failed to retrieve correct reference data in 47.8% of cases (on average).

That means: for nearly half of the generated references, the citation metadata and validation checks didn’t hold up across the metrics they used.

Average score ratio was modest: 0.29

The average score ratio across the five platforms was:
- 0.29 (standard deviation 0.35, range 0 to 1.25)

Higher score ratio = better retrieval accuracy for relevant references and correct bibliographic data.

Best vs worst platform performance

Platform accuracy varied a lot:

  • Highest score ratio: Grok = 0.57
  • Lowest score ratio: Gemini = 0.11

And complete miss rates:
- Best complete miss: Grok = 11.2%
- Worst complete miss: Gemini = 78.5%

So yes, some models were merely mediocre, but the worst performer missed so often that calling it "assistive" without manual verification would be generous.

Publication year didn’t matter much for overall score ratio

Interestingly, publication year wasn’t associated with score ratio in their analyses. So the models’ struggles weren’t simply because they were less familiar with newer papers.

Platform and Journal Differences: It’s Not One-Size-Fits-All

If you’re thinking, “Okay, I’ll just switch platforms,” the study partially confirms that instinct—but also complicates it.

Journal differences were real (and statistically meaningful)

When comparing journals, the paper found that NEJM articles fared worse than the others:
- higher complete miss rates (P < .001)
- lower score ratios than BMJ, JAMA, and The Lancet

The authors also note something practical: abstract length differs across journals, and NEJM’s abstracts are shorter (~250 words) compared to BMJ (~300), Lancet (~300), and JAMA (~350). The implication is that the model has less material to infer the “key references” correctly.

Even if you don’t buy abstract-length as the whole explanation, the takeaway stands: retrieval accuracy isn’t uniform across publication sources.

Platform and journal effects were independent

In multivariable regression, both:
- the LLM platform, and
- the journal

were independently associated with performance measures (score ratio and complete miss).

So even if you choose Grok, you still shouldn’t treat results from every journal as equally reliable.

Relevance quality vs metadata quality: two different problems

One of the most important insights is that LLMs may approximate topic relevance reasonably well, but bibliographic metadata is where things break.

In their individual metric analysis:
- LLMs differed significantly in obtaining correct DOI, PubMed ID, and Google Scholar link
- relevance scores differed less reliably (relevance didn’t vary by publication year)

This means you can see a correct-looking title and still end up with a wrong DOI or dead PubMed linkage. It’s like getting a well-written map label while the street address points to the wrong city.

Practical Fixes You Can Use Today

So what should you do if you’re using LLMs for medical literature retrieval right now? The research doesn’t suggest “don’t use them”—it suggests use them like a junior assistant, not a librarian with tenure.

1) Treat metadata as “untrusted input”

Even when the title and relevance appear solid, verify DOI / PubMed / Scholar link manually before citing.

This isn’t paranoia; it’s exactly what the study results demand. The models generated citations with correctness rates that were, at best, modest and wildly variable by platform and journal.
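A minimal verification helper might look like the following. This is a sketch using only Python's standard library; the HEAD-request approach, the timeout, and the example DOI are my choices, not something from the paper.

```python
import urllib.parse
import urllib.request


def doi_url(doi: str) -> str:
    # Canonical resolver URL for a DOI; doi.org handles redirection.
    return "https://doi.org/" + urllib.parse.quote(doi, safe="/")


def pubmed_url(pmid: str) -> str:
    # Canonical PubMed article URL for a given PubMed ID.
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


def resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status.

    This makes a live network call; run it only where you have
    internet access, and treat False as "needs a human look".
    """
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False
```

Calling `resolves(doi_url("10.1000/xyz123"))` is the kind of check a model's "confirmed" claim never replaces: either the identifier resolves at the registry or it doesn't.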

2) Use multiple platforms and cross-check overlaps

Because platforms performed differently, one practical strategy is redundancy:
- run the same query across two or more models
- prioritize references that show up consistently

Cross-checking can reduce the chance that you’re relying on a single model’s “best guess.”
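One way to implement that redundancy is to keep only references that more than one platform suggests. A hedged sketch; the title-based normalization rule is my assumption (real pipelines would prefer matching on verified DOIs):

```python
from collections import Counter


def normalize(title: str) -> str:
    # Crude normalization: lowercase and collapse whitespace.
    return " ".join(title.lower().split())


def consensus_refs(refs_by_platform: dict,
                   min_platforms: int = 2) -> list:
    """Return titles suggested by at least `min_platforms` platforms.

    `refs_by_platform` maps a platform name to its list of
    suggested reference titles.
    """
    counts = Counter()
    for titles in refs_by_platform.values():
        # Deduplicate first so each platform votes at most once.
        for t in {normalize(t) for t in titles}:
            counts[t] += 1
    return sorted(t for t, n in counts.items() if n >= min_platforms)
```

References that survive the consensus filter still need identifier checks, but you are at least no longer relying on a single model's best guess.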

3) Don’t over-weight “AI confirmation” prompts

The study used explicit prompting like “confirm DOI, PubMed ID, and Google Scholar link.” But performance didn’t improve enough to make the approach safe by itself.

So if your workflow depends on the model saying “confirmed,” this research is a warning sign: self-asserted verification isn’t verification.

4) Prefer workflows that use retrieval-augmented systems (with real database checks)

While this paper focused on free-version LLM platforms and abstract-based retrieval prompts, it points toward the need for systems that connect to trustworthy sources—like:
- DOI resolvers (doi.org)
- PubMed APIs
- structured bibliographic databases

In other words: if the tool can’t check the identifier against the database, it shouldn’t be treated as authoritative.
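For example, PubMed's public E-utilities API lets you look up a PMID directly and compare the returned title against what the model claimed. The sketch below targets NCBI's documented `esummary` endpoint; treat the exact parameters and JSON shape as assumptions to verify against the current E-utilities docs.

```python
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(pmid: str) -> str:
    # Build an E-utilities esummary request for one PubMed ID.
    params = urllib.parse.urlencode(
        {"db": "pubmed", "id": pmid, "retmode": "json"})
    return f"{EUTILS}?{params}"


def fetch_title(pmid: str, timeout: float = 10.0):
    """Fetch the article title for a PMID (live network call).

    Returns None if the lookup fails, so callers can flag the
    reference for manual review rather than trusting the model.
    """
    try:
        with urllib.request.urlopen(esummary_url(pmid),
                                    timeout=timeout) as resp:
            data = json.load(resp)
        return data["result"][pmid]["title"]
    except Exception:
        return None
```

The point of the design is the failure mode: an unresolvable PMID returns None instead of a plausible-looking string, which is exactly the behavior you want from an "authoritative" check.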

5) If you’re selecting sources for publication, implement a citation QA step

This is a workflow habit used in better publishing pipelines:
- freeze your reference list
- run automatic DOI resolution
- verify PubMed IDs resolve
- ensure each citation actually matches the claimed title/authors

Even a lightweight QA checklist can catch the “plausible but wrong” failures that this study quantified.
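Even without network access, a lightweight QA pass can flag structurally suspect entries before anyone bothers resolving them. A sketch under stated assumptions: the entry field names and the DOI/PMID regular expressions below are mine, chosen to match common modern identifier shapes.

```python
import re

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")   # common modern DOI shape
PMID_RE = re.compile(r"^\d{1,8}$")          # PMIDs are plain integers


def qa_citation(entry: dict) -> list:
    """Return a list of problems found in one reference entry."""
    problems = []
    if not entry.get("title", "").strip():
        problems.append("missing title")
    doi = entry.get("doi", "")
    if doi and not DOI_RE.match(doi):
        problems.append(f"malformed DOI: {doi!r}")
    pmid = entry.get("pmid", "")
    if pmid and not PMID_RE.match(str(pmid)):
        problems.append(f"malformed PubMed ID: {pmid!r}")
    return problems
```

An empty problem list doesn't mean the citation is correct, only that it's worth spending a resolver request on; entries with malformed identifiers can be bounced back immediately.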

Key Takeaways

  • LLM-assisted medical reference retrieval is frequently unreliable. The study found a 47.8% complete miss rate across five free-version platforms.
  • Performance varies dramatically by platform:
    • Grok performed best (score ratio 0.57, complete miss 11.2%)
    • Gemini performed worst (score ratio 0.11, complete miss 78.5%)
  • Journal matters. Compared with BMJ, NEJM had lower score ratios and higher complete miss rates (P < .001).
  • Metadata is the weak point. LLMs may guess relevant topics, but DOIs, PubMed IDs, and Scholar links often fail.
  • What you should do now:
    • Always manually verify citation identifiers (DOI/PubMed/Scholar)
    • Use multiple platforms and cross-check overlaps
    • Don’t assume “AI confirmation” equals correctness

Sources & Further Reading

- “Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study” (arXiv:2603.22344)
