ChatGPT for Medical Information Extraction: Performance, Explainability, and Beyond

This post reviews a study evaluating ChatGPT on medical information extraction (MedIE) across multiple datasets. It analyzes accuracy, explainability, faithfulness, confidence, and uncertainty, offers practical tips for applying AI safely in real clinical workflows, and notes the approach's current limitations.

Table of Contents
- Introduction: What this study tests
- Why this matters right now
- Main content: what the research actually found
- The MedIE tasks and how ChatGPT was prompted
- Performance: where ChatGPT shines and where it struggles
- Explainability, faithfulness, confidence, and uncertainty
- Practical takeaways for today
- Key takeaways
- Sources & Further Reading

Introduction: What this study tests
If you’ve been following the buzz around large language models (LLMs) like ChatGPT, you’ve probably heard claims that these systems can handle a wide range of natural language tasks with little to no fine-tuning. But how do they actually perform on specialized, high-stakes tasks—like extracting concrete medical information from clinical text? A new study dives into this exact question: evaluating ChatGPT on medical information extraction (MedIE) tasks across multiple datasets, and probing not just accuracy but also explainability, faithfulness, confidence, and uncertainty. The work is based on new research from Wei Zhu and colleagues, and you can check the original paper here: https://arxiv.org/abs/2601.21767.

Short version: ChatGPT can explain its decisions well, but it doesn’t match the performance of fine-tuned baselines on most MedIE tasks. It’s often confident even when wrong, and its outputs can be affected by the randomness of generation. The researchers assess five dimensions—performance, explainability, faithfulness, confidence, and uncertainty—across six datasets that cover four fine-grained MedIE tasks. The takeaway? ChatGPT is a strong storyteller and explainer, but for rigorous medical information extraction, it’s not a replacement for task-specific, fine-tuned models—yet it has real potential as an assistive tool with careful prompting and hybrid approaches.

Why this matters right now
- Why now: MedIE is a cornerstone capability for turning free-text clinical notes into structured data that can power quality improvement, decision support, and research. As hospitals and health systems increasingly rely on rapid information synthesis from electronic health records, the “prompt-before-finetune” approach offered by ChatGPT and other LLMs becomes an attractive option—especially when there isn’t time to train dedicated models for every new dataset or domain.
- Real-world scenario: Imagine a hospital trying to extract standardized data about adverse drug events (a MedIE task) from thousands of clinician notes. Relying on a fully fine-tuned model for every local variant can be slow and expensive. A well-prompted ChatGPT instance could provide quick, explainable extractions to support clinicians, with a follow-up pass by a domain-specific system to ensure strict accuracy before any clinical decision is tied to the output.
- Building on prior AI work: This study adds nuance to the broader conversation about LLMs in medicine. Earlier work showed impressive capabilities of large models on many NLP benchmarks, but medical domains demand strict factual fidelity and domain knowledge. The paper’s five-dimensional evaluation (including calibration and faithfulness) helps separate flashy reasoning from reliable, usable extraction in clinical contexts.

Main content: what the research actually found
The study design centers on six MedIE datasets that span four tasks, evaluated in a closed MedIE setting (where the output labels are predefined). ChatGPT’s responses were generated via the OpenAI API (gpt-3.5-turbo) with a handful of demonstrations, task descriptions, label explanations, and explicit output formats. The key claim is that even with carefully crafted prompts, ChatGPT lags behind fully fine-tuned baselines and state-of-the-art (SOTA) methods on most MedIE tasks, though it can deliver high-quality explanations and demonstrates substantial faithfulness to source text. The authors also highlight the instability introduced by top-p sampling, which contributes to generation uncertainty.

The MedIE tasks and how ChatGPT was prompted
- Tasks and datasets (six in total, four MedIE tasks):
  - Named Entity Recognition (NER): ShARe13, CADEC, CMeEE-v2
  - Triple Extraction (TE): CMeIE-v2
  - Clinical Event Extraction (CEE): CHIP-CDEE
  - ICD coding: CHIP-CDN
- What “closed MedIE” means: The study uses task-specific label sets and expects outputs in a fixed schema. This mirrors standard supervised settings, but ChatGPT cannot be fine-tuned, so the researchers rely on task prompts and demonstrations to steer the model.
- How prompts were composed: Task description, label set (with label explanations), output format, demonstrations (input-output pairs from unseen tasks), and the target input text. The prompt language matched the MedIE task language (e.g., Chinese for CMeEE-v2 in the example); a code sketch of this kind of prompt assembly appears after this list.
- Evaluation protocol: Five dimensions, each with both automatic measures and domain-expert annotations:
  - Performance: Strict F1 on each “instance” (an extracted unit, such as an entity mention or a relation instance).
  - Explainability: Sample-level and instance-level explanations, assessed for quality by domain experts (R-scores).
  - Faithfulness: Whether explanations and outputs follow instructions and accurately reflect the input (including faithful reasoning).
  - Confidence: How confident ChatGPT appears about its predictions, including potential overconfidence.
  - Uncertainty: Measured by repeating the same sample across five ChatGPT API calls to gauge variation.
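
To make the prompt anatomy concrete, here is a minimal sketch, in Python, of how such a prompt might be assembled and sent to gpt-3.5-turbo through the OpenAI chat API. The task description, label explanations, and demonstration below are illustrative placeholders, not the paper's actual prompts.

```python
# Sketch: composing a closed-MedIE prompt (task description, label set, output
# format, demonstrations, target input) and sending it to gpt-3.5-turbo.
# The prompt text below is illustrative, not the wording used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TASK_DESCRIPTION = (
    "You are a medical information extraction system. "
    "Extract every entity mention of the listed types from the input text."
)
LABELS = {
    "Disorder": "diseases, symptoms, and abnormal findings",
    "Drug": "medications and active substances",
}
OUTPUT_FORMAT = 'Answer as JSON: [{"mention": "...", "label": "...", "explanation": "..."}]'
DEMONSTRATIONS = [
    (
        "Patient reports nausea after starting metformin.",
        '[{"mention": "nausea", "label": "Disorder", "explanation": "a reported symptom"},'
        ' {"mention": "metformin", "label": "Drug", "explanation": "a named medication"}]',
    )
]

def build_messages(text: str) -> list[dict]:
    """Compose the prompt parts into chat messages."""
    label_block = "\n".join(f"- {name}: {desc}" for name, desc in LABELS.items())
    system = f"{TASK_DESCRIPTION}\nLabel set:\n{label_block}\n{OUTPUT_FORMAT}"
    messages = [{"role": "system", "content": system}]
    for demo_input, demo_output in DEMONSTRATIONS:       # few-shot demonstrations
        messages.append({"role": "user", "content": demo_input})
        messages.append({"role": "assistant", "content": demo_output})
    messages.append({"role": "user", "content": text})   # target input
    return messages

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=build_messages("He developed a rash while on amoxicillin."),
)
print(response.choices[0].message.content)
```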

Performance: where ChatGPT shines and where it struggles
- Overall takeaway: ChatGPT’s MedIE performance trails fine-tuned baselines and SOTA methods on all six datasets. With prompt-based, few-shot setups, ChatGPT can achieve reasonable results on simpler tasks but struggles on more complex extraction that requires multi-hop reasoning, cross-entity relations, or nested/discontinuous entities.
- Concrete numbers (averages reported with standard deviations in the paper; the strict-F1 metric itself is sketched after this list):
  - NER tasks (ShARe13, CADEC, CMeEE-v2): ChatGPT’s strict F1 scores are notably lower than BERT-based baselines, UIE, and SOTA methods. For example, on ShARe13 and CADEC, ChatGPT scored in the 19–24 range, compared with much higher baselines (e.g., BERT/UIE around the 70s in some cases). On CMeEE-v2, ChatGPT reached around 39.8, still below several strong baselines.
  - TE task (CMeIE-v2): ChatGPT performed around 9.9, a clear gap versus baselines in the 50–60 range and SOTA figures.
  - CEE task (CHIP-CDEE): ChatGPT achieved about 26.1, again behind the strongest baselines (which were in the 60s range for this dataset).
  - ICD coding (CHIP-CDN): ChatGPT scored around 30.9, compared to baselines in the high 70s to mid-80s for this task.
- Why so far behind? The paper suggests a few reasons:
  - Fine-tuning vs prompting: Fine-tuned models learn the task parameters directly; prompting only adapts behavior without real parameter updates.
  - Task label ambiguity: Some labels are hard to interpret even with explanations, which can hurt performance.
  - Task complexity: Tasks that require identifying entities and then determining meaningful relations between them, or handling nested/discontinuous structures, are harder to solve with ChatGPT’s generation-based approach.
  - Task scale and domain knowledge: Medical information extraction benefits from domain-specific pretraining or knowledge grounding that a generic ChatGPT instance may lack.
- Takeaway on complexity: ChatGPT tends to do better on simpler MedIE tasks and notably struggles on complex relational extraction (e.g., CMeIE-v2’s entity pairs with meaningful relations).
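
The strict F1 behind these numbers counts an extracted instance as correct only if it exactly matches a gold instance (for NER, identical span boundaries and label; for triples, identical head, relation, and tail). Below is a minimal per-sample sketch, assuming instances are represented as hashable tuples; in practice the score is usually micro-averaged over the whole test set rather than computed per sample.

```python
# Minimal sketch of strict (exact-match) F1 over extracted instances.
# Each instance is a hashable tuple, e.g. (start, end, label) for NER
# or (head, relation, tail) for triple extraction.

def strict_f1(predicted: set, gold: set) -> float:
    """Precision/recall over exact matches between predicted and gold instances."""
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one correct entity, one boundary error.
pred = {(0, 6, "Disorder"), (20, 29, "Drug")}
gold = {(0, 6, "Disorder"), (20, 31, "Drug")}
print(round(strict_f1(pred, gold), 2))  # 0.5 -- the boundary mismatch counts as wrong
```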

Explainability, faithfulness, confidence, and uncertainty
- Explainability (R-scores): The study reports that, among correctly predicted instances, human annotators rated ChatGPT’s explanations as high quality:
  - Sample-level R-scores: generally around 75–81% across datasets.
  - Instance-level R-scores: higher, typically around 88–95%.
  - What this means: ChatGPT tends to provide reasonable, well-structured explanations for its predictions when it gets them right.
  - Caution: Explanations can sound confident even when predictions are wrong, a known risk with LLMs that practitioners must manage.
- Faithfulness: Two aspects were measured:
  - Instruction following: Most tasks showed that ChatGPT followed the prompts and used the given label sets roughly correctly (often in the 80%+ range).
  - Faithful reasoning: Explanations were largely faithful to the input samples in the majority of cases.
  - Practical implication: The model’s explanations can be trusted more often than not, but not perfectly. There is still a non-trivial risk of explanations that sound convincing but don’t align with the input text.
- Confidence and calibration:
  - Average confidence scores (CC-score for correct predictions; IC-score for incorrect predictions) sit in the high 70s to low 80s across datasets.
  - No large gap between confidence on correct vs incorrect predictions, indicating poor calibration: high confidence does not guarantee correctness.
  - Standard deviations of confidence scores are relatively small, suggesting many samples share similar confidence, which can mislead users if relied upon for decision-making.
- Uncertainty due to generation:
  - Because the model uses top-p sampling, the same prompt can yield slightly different outputs across runs.
  - The study explicitly measured this by running the same input five times and observing the variation, underscoring potential reliability concerns in high-stakes settings (a simple version of this check, together with a calibration check, is sketched after this list).
- Practical takeaway: ChatGPT’s ability to explain its decisions is valuable, but the model’s confidence and occasional misalignment with truth call for caution. In real-world MedIE deployments, you’d want calibration steps, a secondary validation pass, or a hybrid approach that uses a specialized extractor for the final labels, while reserving the LLM for explainable commentary or candidate suggestions.
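
To illustrate what these two checks might look like in practice, here is a small sketch on made-up records (not the paper's data): the CC/IC confidence comparison and a simple run-to-run agreement measure. Pairwise Jaccard overlap is just one convenient proxy for generation uncertainty, not necessarily the exact metric used in the paper.

```python
# Sketch (hypothetical data): (1) average confidence on correct vs. incorrect
# predictions (CC-score vs. IC-score), and (2) run-to-run agreement when the
# same input is sent to the model several times under top-p sampling.
from itertools import combinations

# Hypothetical per-instance records: self-reported confidence (0-100) and
# whether the extraction matched the gold annotation.
records = [
    {"confidence": 82, "correct": True},
    {"confidence": 79, "correct": False},
    {"confidence": 85, "correct": True},
    {"confidence": 77, "correct": False},
]

def mean_confidence(records, correct: bool) -> float:
    scores = [r["confidence"] for r in records if r["correct"] == correct]
    return sum(scores) / len(scores) if scores else float("nan")

cc_score = mean_confidence(records, correct=True)   # confidence when correct
ic_score = mean_confidence(records, correct=False)  # confidence when incorrect
print(f"CC={cc_score:.1f}  IC={ic_score:.1f}  gap={cc_score - ic_score:.1f}")
# A small gap means confidence is a poor signal for filtering outputs.

# Hypothetical outputs from five runs on the same input: sets of extracted
# (mention, label) pairs. Pairwise Jaccard overlap gauges generation stability.
runs = [
    {("nausea", "Disorder"), ("metformin", "Drug")},
    {("nausea", "Disorder")},
    {("nausea", "Disorder"), ("metformin", "Drug")},
    {("nausea", "Disorder"), ("metformin", "Drug")},
    {("nausea", "Disorder"), ("metformin", "Drug"), ("rash", "Disorder")},
]

def pairwise_jaccard(runs) -> float:
    sims = [len(a & b) / len(a | b) for a, b in combinations(runs, 2) if a | b]
    return sum(sims) / len(sims)

print(f"mean pairwise Jaccard across runs: {pairwise_jaccard(runs):.2f}")
```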

Practical takeaways for today
- Prompt engineering matters, but is not a cure-all. The study used a robust, multi-part prompt with clear task descriptions, label explanations, output formats, and demonstrations. Even with this care, performance gaps remain—the gap is especially wide for tasks requiring complex extraction.
- Use LLMs as assistants, not sole extractors. The high quality of explanations and good faithfulness on many correctly predicted instances suggest a role as a “review broker” or an explainability layer. A downstream, domain-tuned extractor could verify and finalize the structured outputs.
- Calibrate and monitor confidence. Given the overconfidence and limited calibration, systems should not treat ChatGPT’s confidence scores as a proxy for accuracy. Consider confidence thresholds, ensemble strategies, or uncertainty-aware decision making before acting on the extracted data (a toy confidence-gated pipeline is sketched after this list).
- Hybrid workflows show promise. The research aligns with a broader trend: combine general-purpose LLMs with domain-specific models or knowledge bases. For ICD coding, for example, a retrieval step to narrow candidate terms (as described in related methods) paired with ChatGPT’s reasoning could improve both accuracy and interpretability.
- Real-world applicability today. For fast-turnaround tasks or exploratory data gathering in clinical research, prompt-driven ChatGPT can surface potential entities, relations, or events and offer human-interpretable explanations, while formal compliance or patient-safety workflows rely on validated, fine-tuned models for production-grade extraction.
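
As a concrete illustration of this hybrid pattern (not a pipeline from the paper), the sketch below gates LLM-proposed candidates by a confidence threshold and keeps only those confirmed by a dedicated extractor, routing the rest to human review. The functions `llm_propose` and `finetuned_extract` are hypothetical stand-ins.

```python
# Illustrative hybrid pipeline (not from the paper): LLM-proposed candidates are
# gated by a confidence threshold and confirmed by a dedicated extractor.
# `llm_propose` and `finetuned_extract` are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class Candidate:
    mention: str
    label: str
    explanation: str
    confidence: float  # 0.0-1.0, self-reported or estimated separately

def llm_propose(text: str) -> list[Candidate]:
    """Stand-in for a prompted LLM call returning candidates with explanations."""
    return [
        Candidate("rash", "Disorder", "Described as a skin reaction.", 0.74),
        Candidate("amoxicillin", "Drug", "Named as the medication taken.", 0.86),
    ]

def finetuned_extract(text: str) -> set[tuple[str, str]]:
    """Stand-in for a validated, fine-tuned extractor used for the final labels."""
    return {("rash", "Disorder"), ("amoxicillin", "Drug")}

def hybrid_extract(text: str, threshold: float = 0.8):
    """Keep LLM candidates that clear the threshold AND are confirmed by the
    dedicated extractor; everything else is routed to human review."""
    confirmed = finetuned_extract(text)
    accepted, needs_review = [], []
    for cand in llm_propose(text):
        if cand.confidence >= threshold and (cand.mention, cand.label) in confirmed:
            accepted.append(cand)
        else:
            needs_review.append(cand)
    return accepted, needs_review

accepted, review = hybrid_extract("He developed a rash while on amoxicillin.")
print("accepted:", [c.mention for c in accepted])    # ['amoxicillin']
print("needs review:", [c.mention for c in review])  # ['rash'] -- below threshold
```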

Key takeaways
- ChatGPT’s MedIE performance lags behind fine-tuned models and SOTA on six benchmark datasets across four tasks, even with carefully designed prompts.
- The model provides high-quality explanations for its predictions and shows faithful reasoning to inputs in many cases, but it is often overconfident, and its confidence does not reliably track accuracy.
- Uncertainty inherent in generation (top-p sampling) introduces variability across runs, which can hinder consistent MedIE applications without additional controls.
- The study highlights a practical path forward: use ChatGPT as an explainability and suggestion layer, complement it with dedicated, fine-tuned medical information extraction models, and pursue hybrid pipelines that harness the strengths of both approaches.
- This work contributes a structured framework for evaluating LLMs in medical information extraction, emphasizing not just raw performance but also interpretability, trustworthiness, and reliability—critical factors for clinical and research use.

Sources & Further Reading
- Original Research Paper: Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond
- Authors: Wei Zhu and colleagues
- Link: https://arxiv.org/abs/2601.21767

Note: For readers who want to dive deeper, the paper also situates the MedIE evaluation within the broader field of information extraction and medical NLP, comparing standard fine-tuned baselines (like BERT-based methods) and prompt-driven approaches across multiple datasets (ShARe13, CADEC, CMeEE-v2, CMeIE-v2, CHIP-CDEE, CHIP-CDN). The discussion on prompt design, evaluation metrics, and faithfulness/uncertainty offers a practical blueprint for researchers and practitioners exploring LLMs in medical text processing.
