AI Medical Literature Retrieval Errors: Comparative Study

LLM assistants can speed up medical literature search—but they may also return hallucinated DOIs and incorrect PubMed links. This comparative study tests five free platforms, shows error rates, and offers fixes you can use today.

LLM platforms can return wrong or “hallucinated” DOIs and PubMed links—new research shows how often

Introduction

If you’ve ever used an AI assistant to “pull citations” for a paper, you’ve probably felt a little rush of relief—until you double-check and realize something is off. The newest research behind this blog digs into exactly that risk: errors in AI-assisted retrieval of medical literature, especially when AI returns plausible-looking references with incorrect bibliographic metadata.

This post is based on new research from the original paper, “Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study” (arXiv:2603.22344). The authors looked at five popular free-version LLM platforms and tested how accurately each one could retrieve references tied to real medical articles—focusing not just on whether the citations sounded relevant, but whether the DOI, PubMed ID, and Google Scholar link were actually correct.

And here’s the punchline: across platforms, the models completely failed to retrieve correct reference data nearly half the time. That’s not a minor glitch; it’s a workflow-level problem. Let’s break down what was tested, what the researchers found, and what you can do with this information right now.

Why This Matters

This is significant right now because “AI citation generation” is becoming a default muscle memory for researchers. People are using LLMs to speed up literature review, draft manuscripts, and sanity-check background reading. But unlike many other AI tasks, citation accuracy has a hard edge: a wrong DOI or PubMed ID doesn’t just mislead—it can silently reshape the evidence trail.

Imagine a real-world scenario: a graduate student uses an LLM to compile background citations for a grant application. The references look right at a glance, and the titles match what the student expects. The grant proceeds. Later, when the team tries to verify the sources, they discover several citations lead to different papers or don’t resolve at all. That can cost days (or weeks), create rework, and—if it’s not caught—risk scientific credibility.

What makes this research especially strong is that it goes beyond the usual “hallucinated citation” stories. Prior work has documented that LLMs can fabricate references—but this study quantifies retrieval error rates in a systematic way across multiple platforms and real top-tier medical journals. In other words, it’s not just “AI can be wrong,” it’s “AI is wrong in measurable, platform-dependent ways—and journal characteristics influence outcomes too.”

What the Study Actually Tested (and How)

The researchers wanted a fair, repeatable comparison of reference retrieval accuracy. So they built a dataset in a pretty disciplined way.

The test set: real articles from top journals

They randomly selected 40 original research articles from four major journals:
- BMJ
- JAMA
- NEJM
- The Lancet

All were published between January 2024 and July 15, 2025.

They pulled 10 articles per journal, sampling across the eligible time window to get a mix of publication dates.

The LLM task: “Give me 10 key references”

For each of the 40 articles, the researchers prompted five free-version LLM platforms:

  1. Grok-2
  2. ChatGPT (GPT-4.1)
  3. Google Gemini Flash 2.5
  4. Perplexity AI
  5. DeepSeek

The prompt asked each model to generate 10 key references related to the abstract, including:
- title
- publication year
- DOI
- PubMed ID
- Google Scholar URL

Then they added a second prompt step: the models were asked to “confirm” DOI, PubMed ID, and Google Scholar link data. After that, the team manually checked what the LLM produced.

How accuracy was scored: metadata + relevance

This part matters because they didn’t rely on “vibes.” They used multiple validity metrics:

Three bibliographic validity metrics
- DOI resolves and matches
- PubMed ID resolves and matches
- Google Scholar link resolves and matches

One relevance metric
- Whether the retrieved reference appears in the index article’s reference list (using cues like title and first author)
- If the retrieved reference was the index article itself, they gave extra credit

To combine these fairly (since not every reference will have a PubMed ID), they used a multimetric score ratio:
- Total score / score cap (only counting applicable metrics)

They also tracked complete miss rate:
- the proportion of retrieved references where the total score was zero (i.e., no validated metrics worked)

This is the key distinction: the study evaluated whether citations were bibliographically real, not just whether they sounded plausible.

(If you want to see exactly how the scoring worked, it’s described in the Methods section of the paper: arXiv:2603.22344.)
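To make the scoring concrete, here is a rough reconstruction of the logic described above. This is an illustrative sketch, not the authors' actual code: the function signature, field names, and the exact extra-credit bonus value are my assumptions.

```python
from typing import Optional


def score_reference(doi_ok: Optional[bool],
                    pmid_ok: Optional[bool],
                    scholar_ok: Optional[bool],
                    in_reference_list: bool,
                    is_index_article: bool = False) -> dict:
    """Score one retrieved reference against the study's metrics.

    Each bibliographic metric is True/False, or None when not
    applicable (e.g. a reference that has no PubMed ID). The score
    ratio divides points earned by the cap over applicable metrics.
    """
    metrics = [doi_ok, pmid_ok, scholar_ok, in_reference_list]
    applicable = [m for m in metrics if m is not None]
    total = sum(1 for m in applicable if m)
    # The paper gives extra credit when the model returns the index
    # article itself; the bonus value of 1 here is an assumption.
    if is_index_article:
        total += 1
    cap = len(applicable)
    return {
        "total": total,
        "cap": cap,
        "ratio": total / cap if cap else 0.0,
        "complete_miss": total == 0,  # no validated metric worked
    }
```

Under this scheme, a reference with a valid DOI and a relevance match but a wrong Scholar link and no PubMed ID scores 2 out of a cap of 3, and the index-article bonus explains how ratios can exceed 1 (the paper reports a range up to 1.25).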

How Wrong Can It Get? The Key Numbers

Now for the headline results—because they’re blunt.

“Complete miss” happened 47.8% of the time

Across all platforms, the models completely failed to retrieve correct reference data in 47.8% of cases (on average).

That means: for nearly half of the generated references, the citation metadata and validation checks didn’t hold up across the metrics they used.

Average score ratio was modest: 0.29

The average score ratio across the five platforms was:
- 0.29 (standard deviation 0.35, range 0 to 1.25)

Higher score ratio = better retrieval accuracy for relevant references and correct bibliographic data.

Best vs worst platform performance

Platform accuracy varied a lot:

  • Highest score ratio: Grok = 0.57
  • Lowest score ratio: Gemini = 0.11

And complete miss rates:
- Best complete miss: Grok = 11.2%
- Worst complete miss: Gemini = 78.5%

So yes, some models were merely mediocre, but the worst performer missed so often that calling it "assistive" without manual verification would be generous.

Publication year didn’t matter much for overall score ratio

Interestingly, publication year wasn’t associated with score ratio in their analyses. So the models’ struggles weren’t simply because they were less familiar with newer papers.

Platform and Journal Differences: It’s Not One-Size-Fits-All

If you’re thinking, “Okay, I’ll just switch platforms,” the study partially confirms that instinct—but also complicates it.

Journal differences were real (and statistically meaningful)

When comparing journals, the paper found that NEJM articles fared worse than the others:
- higher complete miss rates (P < .001)
- lower score ratios than BMJ, JAMA, and The Lancet

The authors also note something practical: abstract length differs across journals, and NEJM’s abstracts are shorter (~250 words) compared to BMJ (~300), Lancet (~300), and JAMA (~350). The implication is that the model has less material to infer the “key references” correctly.

Even if you don’t buy abstract-length as the whole explanation, the takeaway stands: retrieval accuracy isn’t uniform across publication sources.

Platform and journal effects were independent

In multivariable regression, both:
- the LLM platform, and
- the journal

were independently associated with performance measures (score ratio and complete miss).

So even if you choose Grok, you still shouldn’t treat results from every journal as equally reliable.

Relevance quality vs metadata quality: two different problems

One of the most important insights is that LLMs may approximate topic relevance reasonably well, but bibliographic metadata is where things break.

In their individual metric analysis:
- LLMs differed significantly in obtaining correct DOI, PubMed ID, and Google Scholar link
- relevance scores differed less reliably (relevance didn’t vary by publication year)

This means you can see a correct-looking title and still end up with a wrong DOI or dead PubMed linkage. It’s like getting a well-written map label while the street address points to the wrong city.

Practical Fixes You Can Use Today

So what should you do if you’re using LLMs for medical literature retrieval right now? The research doesn’t suggest “don’t use them”—it suggests use them like a junior assistant, not a librarian with tenure.

1) Treat metadata as “untrusted input”

Even when the title and relevance appear solid, verify DOI / PubMed / Scholar link manually before citing.

This isn’t paranoia; it’s exactly what the study results demand. The models generated citations with correctness rates that were, at best, modest and wildly variable by platform and journal.
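A minimal verification helper might look like the following. This is a sketch using only Python's standard library; the HEAD-request approach, the timeout, and the example DOI are my choices, not something from the paper.

```python
import urllib.parse
import urllib.request


def doi_url(doi: str) -> str:
    # Canonical resolver URL for a DOI; doi.org handles redirection.
    return "https://doi.org/" + urllib.parse.quote(doi, safe="/")


def pubmed_url(pmid: str) -> str:
    # Canonical PubMed article URL for a given PubMed ID.
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


def resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status.

    This makes a live network call; run it only where you have
    internet access, and treat False as "needs a human look".
    """
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False
```

Calling `resolves(doi_url("10.1000/xyz123"))` is the kind of check a model's "confirmed" claim never replaces: either the identifier resolves at the registry or it doesn't.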

2) Use multiple platforms and cross-check overlaps

Because platforms performed differently, one practical strategy is redundancy:
- run the same query across two or more models
- prioritize references that show up consistently

Cross-checking can reduce the chance that you’re relying on a single model’s “best guess.”
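One way to implement that redundancy is to keep only references that more than one platform suggests. A hedged sketch; the title-based normalization rule is my assumption (real pipelines would prefer matching on verified DOIs):

```python
from collections import Counter


def normalize(title: str) -> str:
    # Crude normalization: lowercase and collapse whitespace.
    return " ".join(title.lower().split())


def consensus_refs(refs_by_platform: dict,
                   min_platforms: int = 2) -> list:
    """Return titles suggested by at least `min_platforms` platforms.

    `refs_by_platform` maps a platform name to its list of
    suggested reference titles.
    """
    counts = Counter()
    for titles in refs_by_platform.values():
        # Deduplicate first so each platform votes at most once.
        for t in {normalize(t) for t in titles}:
            counts[t] += 1
    return sorted(t for t, n in counts.items() if n >= min_platforms)
```

References that survive the consensus filter still need identifier checks, but you are at least no longer relying on a single model's best guess.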

3) Don’t over-weight “AI confirmation” prompts

The study used explicit prompting like “confirm DOI, PubMed ID, and Google Scholar link.” But performance didn’t improve enough to make the approach safe by itself.

So if your workflow depends on the model saying “confirmed,” this research is a warning sign: self-asserted verification isn’t verification.

4) Prefer workflows that use retrieval-augmented systems (with real database checks)

While this paper focused on free-version LLM platforms and abstract-based retrieval prompts, it points toward the need for systems that connect to trustworthy sources—like:
- DOI resolvers (doi.org)
- PubMed APIs
- structured bibliographic databases

In other words: if the tool can’t check the identifier against the database, it shouldn’t be treated as authoritative.
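For example, PubMed's public E-utilities API lets you look up a PMID directly and compare the returned title against what the model claimed. The sketch below targets NCBI's documented `esummary` endpoint; treat the exact parameters and JSON shape as assumptions to verify against the current E-utilities docs.

```python
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(pmid: str) -> str:
    # Build an E-utilities esummary request for one PubMed ID.
    params = urllib.parse.urlencode(
        {"db": "pubmed", "id": pmid, "retmode": "json"})
    return f"{EUTILS}?{params}"


def fetch_title(pmid: str, timeout: float = 10.0):
    """Fetch the article title for a PMID (live network call).

    Returns None if the lookup fails, so callers can flag the
    reference for manual review rather than trusting the model.
    """
    try:
        with urllib.request.urlopen(esummary_url(pmid),
                                    timeout=timeout) as resp:
            data = json.load(resp)
        return data["result"][pmid]["title"]
    except Exception:
        return None
```

The point of the design is the failure mode: an unresolvable PMID returns None instead of a plausible-looking string, which is exactly the behavior you want from an "authoritative" check.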

5) If you’re selecting sources for publication, implement a citation QA step

This is a workflow habit used in better publishing pipelines:
- freeze your reference list
- run automatic DOI resolution
- verify PubMed IDs resolve
- ensure each citation actually matches the claimed title/authors

Even a lightweight QA checklist can catch the “plausible but wrong” failures that this study quantified.
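Even without network access, a lightweight QA pass can flag structurally suspect entries before anyone bothers resolving them. A sketch under stated assumptions: the entry field names and the DOI/PMID regular expressions below are mine, chosen to match common modern identifier shapes.

```python
import re

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")   # common modern DOI shape
PMID_RE = re.compile(r"^\d{1,8}$")          # PMIDs are plain integers


def qa_citation(entry: dict) -> list:
    """Return a list of problems found in one reference entry."""
    problems = []
    if not entry.get("title", "").strip():
        problems.append("missing title")
    doi = entry.get("doi", "")
    if doi and not DOI_RE.match(doi):
        problems.append(f"malformed DOI: {doi!r}")
    pmid = entry.get("pmid", "")
    if pmid and not PMID_RE.match(str(pmid)):
        problems.append(f"malformed PubMed ID: {pmid!r}")
    return problems
```

An empty problem list doesn't mean the citation is correct, only that it's worth spending a resolver request on; entries with malformed identifiers can be bounced back immediately.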

Key Takeaways

  • LLM-assisted medical reference retrieval is frequently unreliable. The study found a 47.8% complete miss rate across five free-version platforms.
  • Performance varies dramatically by platform:
    • Grok performed best (score ratio 0.57, complete miss 11.2%)
    • Gemini performed worst (score ratio 0.11, complete miss 78.5%)
  • Journal matters. Compared with BMJ, NEJM had lower score ratios and higher complete miss rates (P < .001).
  • Metadata is the weak point. LLMs may guess relevant topics, but DOIs, PubMed IDs, and Scholar links often fail.
  • What you should do now:
    • Always manually verify citation identifiers (DOI/PubMed/Scholar)
    • Use multiple platforms and cross-check overlaps
    • Don’t assume “AI confirmation” equals correctness

Sources & Further Reading

- “Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study” (arXiv:2603.22344)
