Can General-Purpose AI Agents Autonomously Generate Pathology Reports?
Table of Contents
- Introduction
- Why This Matters
- The Experiment at a Glance
- How the Agents Worked: Scenarios A, B, and C
- Performance, Pitfalls, and What It Tells Us
- Real-World Implications and Future Directions
- Key Takeaways
- Sources & Further Reading
Introduction
Digital pathology is already steeped in AI-powered tools that excel at highly specific tasks—think spotting mitotic figures or segmenting tumor regions. But what about the broader, messier job of generating a pathology report—navigating a slide, recognizing a constellation of tissue features, and narrating a diagnostic story in plain English? That’s where the idea of agentic, multimodal AI comes in. A recent pilot study explored whether general-purpose AI systems—specifically OpenAI’s ChatGPT 5.0 in agentic mode and H Company’s Surfer—can autonomously read whole-slide images (WSIs), describe what they see, and propose provisional diagnoses within a digital slide viewer. The study used 35 veterinary pathology cases, with outcomes validated by board-certified pathologists. And yes, they even tested how different kinds of context (like signalment or a detailed morphological description) influenced the AI’s performance. If you want to dive into the original methodology and results, check out the paper here: Exploring General-Purpose Autonomous Multimodal Agents for Pathology Report Generation.
This post translates that research into plain-language takeaways, with the aim of explaining what’s possible now, what isn’t, and why it matters for clinicians, educators, and AI developers alike. The core question remains provocative: can broad, off-the-shelf AI tools emulate aspects of the diagnostic workflow—autonomous slide navigation, feature description, and initial reasoning—well enough to be useful in pathology? The answer, as you’ll see, is nuanced. The study found meaningful gains when extra context was provided, but accuracy still lagged far behind human experts, and “hallucinations” or mis-descriptions were a recurring challenge. It’s a telling snapshot of where we stand with general-purpose vision-language agents in a domain that demands precision and deep domain understanding.
Why This Matters
Significance right now
We’re living in a moment when agentic AI—systems that can perceive, reason, and interact with tools on their own—has moved from sci-fi-ish headlines to real-world experimentation. The study in question is one of the early attempts to stress-test these broad, multimodal models in a highly specialized, safety-conscious field. The takeaway isn’t “replace doctors” but “push the envelope of what automation can handle and where humans must intervene.” For digital pathology, this matters because:
- It sets a benchmark for how far general-purpose AI can go in a complex diagnostic workflow that relies on careful visual inspection, pattern recognition, and narrative synthesis.
- It surfaces concrete failure modes (e.g., model navigational blind spots, mis-descriptions, and the tendency to propose diagnoses based on limited context) that researchers can target for improvement.
- It underlines the need to calibrate expectations about AI-assisted pathology—especially in educational contexts or in triage roles where speed and breadth of analysis are valuable, but not the final verdict.
A real-world scenario today
Imagine a busy veterinary diagnostic lab or a teaching hospital using a digital slide viewer and a chat-based AI assistant to draft preliminary reports while a pathologist focuses on reviewing flagged regions. The AI could autonomously explore a case, describe visual features, and propose a provisional diagnosis that the pathologist then verifies or corrects. In educational settings, students could interact with AI to practice structured reporting, compare AI-generated morphologic descriptions with expert notes, and learn how to phrase differential diagnoses. The key is to use AI as a complement—an intelligent navigator and writing aid—while leaving final decisions to trained specialists.
How this builds on prior AI work
Prior AI work in pathology has mostly focused on narrow tasks under supervised learning, such as segmentation or specific object detection. The novelty here is testing “general-purpose” vision–language models with agentic capabilities in a domain that requires deliberate navigation and tailored storytelling about tissue morphology. The authors explicitly note that domain-specific training can improve performance, but generalized models offer broader accessibility and potential for cross-domain transfer. That said, the study also reinforces a crucial lesson: broad models can display impressive language prowess and surface-level medical terminology while still making fundamental diagnostic errors. It’s a reminder that in medicine (and pathology), sophistication in language does not guarantee diagnostic fidelity.
The Experiment at a Glance
- What was tested: Two agentic, multimodal AI frameworks—OpenAI’s ChatGPT 5.0 in agentic mode and H Company’s Surfer—applied to open-source slide viewing platforms. The goal was autonomous slide navigation, tissue feature description, and provisional diagnosis generation, all within a structured workflow.
- The dataset: 35 veterinary pathology cases, curated for educational value rather than population-representative sampling. Each case had:
- A single hematoxylin and eosin (H&E) stained WSI
- The organ of origin
- Signalment (animal species, breed, age, sex, neuter status)
- A complete morphologic description in plain text
- A board-certified pathologist’s final diagnosis
- The three scenarios (contexts given to the AI):
- Scenario A: WSI only—the model tried to infer organ origin and diagnosis from image content alone.
- Scenario B: WSI + brief case description, including signalment and organ origin.
- Scenario C: Morphologic description only—no WSI needed because morphology is already described in text.
- The prompts and outputs: For each case, the AI had to return a structured JSON containing the organ, the diagnosis, and a set of morphologic descriptors (e.g., shape, invasion, cellularity, growth pattern, stroma, tumor cells, nuclei, nucleoli, malignancy criteria, necrosis, and inflammation); a minimal sketch of what such an output could look like follows this list.
- The human benchmark: A second board-certified pathologist evaluated the same cases with the same limited information to provide a baseline comparison.
- The big takeaway in numbers: The AI agents achieved up to 28.6% diagnostic accuracy with Scenario B (WSI + signalment + organ) and up to 68.6% when a provided morphological description was used (Scenario C). With only the WSI (Scenario A), accuracy dropped to 5.7%. In contrast, the human expert reached 85.7% accuracy with a single WSI and 88.6% when signalment and organ were also provided. The comparison highlights the jump in AI performance when morphology is richly described, and the still-large gap to human performance.
- What went wrong (and why): The AI often failed to visually inspect the most diagnostically relevant regions of the slides. This led to many incorrect or vague descriptions and an overreliance on plausible-sounding but not case-specific content. The authors quantify this as a high rate of erroneous feature descriptions (336 out of 455 features, or 73.85%) in one model's outputs. However, the same models showed improved diagnostic accuracy when they were given explicit morphological details or case context.
- Notable observations: Of the two frameworks, ChatGPT performed relatively better at organ identification and benefited more from the brief case descriptions (Scenario B). The authors also emphasize that the models readily produced fluent, plausible-sounding diagnoses and narratives even when they were wrong. This underscores a critical risk: "hallucination" in medical content, which the authors connect to broader patterns observed in large vision–language models.
- Source and scope caveat: The study used two general-purpose, web-browsing-enabled VLMs, not domain-specific pathology models. The authors are clear that this is a feasibility study, not a claim that these agents are ready for clinical deployment. They acknowledge other specialized models or pipelines might perform differently, but their aim was to probe the capabilities of broadly accessible AI tools in a domain-specific diagnostic task (and to surface qualitative behavior for future improvement). For more details on methodology and results, see the original paper: Exploring General-Purpose Autonomous Multimodal Agents for Pathology Report Generation.
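To make the required output format concrete, here is a minimal sketch (in Python, for illustration) of the kind of structured JSON the agents were asked to return per case. The descriptor fields mirror the list above; the example values and exact key names are assumptions for illustration, not the authors' actual schema.

```python
import json

# Illustrative only: the descriptor keys follow the list in the study summary above;
# the example values and exact key names are made up for demonstration.
example_report = {
    "organ": "skin",
    "diagnosis": "mast cell tumor",
    "morphology": {
        "shape": "nodular, well demarcated",
        "invasion": "no invasion of deeper structures observed",
        "cellularity": "high",
        "growth_pattern": "sheets and cords",
        "stroma": "fine fibrovascular stroma",
        "tumor_cells": "round cells with granular cytoplasm",
        "nuclei": "round, central",
        "nucleoli": "indistinct",
        "malignancy_criteria": "mild anisocytosis and anisokaryosis",
        "necrosis": "absent",
        "inflammation": "scattered eosinophils",
    },
}

# Print the case report as formatted JSON, the shape of output the agents had to produce.
print(json.dumps(example_report, indent=2))
```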
How the Agents Worked: Scenarios A, B, and C
- Scenario A (WSI only): The AI received only the whole-slide image and was asked to determine both the organ of origin and the diagnosis from image content alone. The results show that, without extra cues, the models struggled mightily—accuracy hovered at a low single-digit level (5.7%). This highlights a fundamental barrier: mapping raw histology visuals to precise clinical categories is nontrivial for broad, non-domain-tuned models.
- Scenario B (WSI + brief case description): Here the AI received the WSI plus a short, structured case description that included signalment (species, breed, age, sex, neuter status) and the organ of origin. In this context, accuracy improved notably: the study reports up to 28.6% diagnostic accuracy, indicating that contextual cues help the model rule out possibilities that are plausible in general but not specific to the case.
- Scenario C (Morphologic description only): The morphological description, i.e., the detailed, structured text about the tissue features, was provided as input, with no WSI. When the model received this rich morphological data, accuracy rose to a much higher level, up to 68.6%. This underscores a key insight: when the model is fed concrete descriptive cues about the tissue, it can use them to narrow the differential and align its synthesis more closely with expert reasoning.
Performance, Pitfalls, and What It Tells Us
- The ceiling is far from human-level in this setup, but the directionality is telling. The AI’s capacity to correctly identify the organ and diagnosis improves dramatically when morphology or context is explicitly supplied. That implies the bottleneck is less about “vision” per se and more about “targeted navigation and feature extraction” within a complex slide, plus the ability to fuse those features with a clinical narrative.
- Navigation matters. The study notes that the agents often spent their time on non-diagnostic regions instead of seeking out the diagnostically relevant ones, which contributed to errors. In other words, a core weakness lies in the autonomous search strategy: if the AI doesn't actively seek the right regions, its descriptive outputs will be generic or misleading. This is a solvable system-design problem: complementing the AI with better guidance, saliency cues, or reinforcement-learning objectives that reward diagnostically useful navigation could help.
- Hallucination risk is real. The researchers connect the observed errors to broader patterns of model hallucination, especially for rare or long-tail cases. In pathology, where the correct diagnosis can hinge on subtle morphologic cues, the fear is that a model will default to the typical tumor types for a given species and organ and propose a plausible diagnosis rather than the one actually depicted. This is exactly why the authors urge caution about educational or clinical use without human oversight.
- Contextually informed models shine. When the morphology description was provided (Scenario C), the performance gains were substantial. This suggests a practical path forward: use general-purpose AI as a helper that consumes high-quality, structured pathology descriptions and anchors its reasoning on explicit features rather than trying to infer everything from slide visuals alone (see the sketch after this list).
- Benchmarking reality vs. promise. The authors stress that they used only two general-purpose tools with web-browsing capabilities. There’s a spectrum of models—some trained specifically for pathologic reasoning or multimodal retrieval—that might perform differently. The study’s value is in exposing qualitative behavior and establishing a baseline for how far current off-the-shelf agentic AI can go in this niche.
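As a concrete illustration of the "anchor on explicit features" idea, the minimal sketch below assembles a structured morphologic description into a text-only prompt in the spirit of Scenario C and asks for a verifiable JSON answer. The prompt wording, the field names, and the build_scenario_c_prompt helper are hypothetical; the study itself interacted with the agents through their chat interfaces, not through code like this.

```python
# Sketch of feeding a general-purpose model structured morphology (Scenario C style)
# instead of raw slide pixels. All names and prompt text are illustrative assumptions.

def build_scenario_c_prompt(signalment: str, organ: str, morphology: dict) -> str:
    """Assemble a text-only prompt from structured case metadata and descriptors."""
    features = "\n".join(f"- {key.replace('_', ' ')}: {value}" for key, value in morphology.items())
    return (
        "You are assisting with a veterinary pathology case.\n"
        f"Signalment: {signalment}\n"
        f"Organ: {organ}\n"
        "Morphologic description:\n"
        f"{features}\n\n"
        "Return a JSON object with the keys 'diagnosis' and 'reasoning', "
        "using only the features listed above."
    )

case_morphology = {
    "cellularity": "high",
    "growth_pattern": "sheets of round cells",
    "nuclei": "round, central, mild anisokaryosis",
    "inflammation": "scattered eosinophils",
}

prompt = build_scenario_c_prompt(
    signalment="6-year-old male neutered Boxer dog",  # hypothetical example case
    organ="skin",
    morphology=case_morphology,
)
print(prompt)  # Send to the model of your choice; a pathologist still reviews the answer.
```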
Real-World Implications and Future Directions
- Practical implications for education and triage: In teaching environments, AI-assisted practice with agentic VLMs could accelerate exposure to varied case types and promote structured reporting. Students might compare AI-produced morphologic descriptors with expert notes, learning to recognize which features matter most for differential diagnoses.
- A cautious path to clinical integration: The study's findings argue for a measured approach to any clinical deployment. The combination of limited accuracy without morphology and the demonstrated risk of hallucinations means these tools should augment rather than replace human judgment. In practice, AI could pre-screen or draft initial reports while a pathologist reviews and corrects; the AI functions as a decision-support system rather than a decision-maker.
- The role of context and data design: A clear signal from the research is that providing structured morphological data and case context dramatically improves AI outputs. This points to a broader design principle: for AI tools in pathology, curating high-quality descriptive metadata and guiding inputs could be as important as the visual data itself.
- Roadmap for improvement: Researchers can build on these results by exploring domain-adapted variants of agentic models, incorporating explicit salience cues to encourage scanning of diagnostically informative slide regions, and blending retrieval-augmented generation to ground outputs in pathology literature or curated image databases. Some recent work in pathology-specific agentic models shows promise, but this study emphasizes that general-purpose tools still require substantial safeguards and domain-specific refinements before clinical use.
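As one deliberately simple illustration of what explicit salience cues could look like, the sketch below ranks regions of a WSI by stained-tissue content before any tiles are handed to an agent. It assumes the openslide library and a basic saturation threshold; the grid size and cutoff are arbitrary choices, true diagnostic saliency would need to go well beyond separating tissue from glass, and none of this reflects how ChatGPT or Surfer navigated slides in the study.

```python
import numpy as np
import openslide  # pip install openslide-python (requires the OpenSlide C library)

def rank_tiles_by_tissue(slide_path: str, grid: int = 16, top_k: int = 5):
    """Return the top_k grid cells of a WSI, ranked by fraction of stained tissue."""
    slide = openslide.OpenSlide(slide_path)
    thumb = slide.get_thumbnail((1024, 1024)).convert("HSV")
    hsv = np.asarray(thumb)
    # Stained tissue is more saturated than glass background; 20/255 is an assumed cutoff.
    tissue_mask = hsv[:, :, 1] > 20

    h, w = tissue_mask.shape
    cell_h, cell_w = h // grid, w // grid
    scores = []
    for row in range(grid):
        for col in range(grid):
            cell = tissue_mask[row * cell_h:(row + 1) * cell_h,
                               col * cell_w:(col + 1) * cell_w]
            scores.append((cell.mean(), row, col))
    scores.sort(reverse=True)

    # Map grid cells back to level-0 coordinates, as used by slide.read_region().
    full_w, full_h = slide.dimensions
    tiles = []
    for fraction, row, col in scores[:top_k]:
        x = int(col / grid * full_w)
        y = int(row / grid * full_h)
        tiles.append({"x": x, "y": y, "tissue_fraction": round(float(fraction), 3)})
    return tiles
```

A wrapper like this could pre-select candidate regions so that an agent's limited attention budget is spent on informative areas first, rather than leaving the search strategy entirely to the model.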
Key Takeaways
- General-purpose, agentic AI models can autonomously navigate some digital pathology tasks and generate narrative outputs, but their accuracy is currently well below expert human performance for pathology report generation.
- Context matters a lot. Providing signalment and organ information helps, but the biggest gains come when detailed morphological descriptions accompany the input, allowing the AI to anchor its reasoning to explicit features.
- Navigation and feature extraction are critical bottlenecks. Models tended to miss diagnostically relevant regions, contributing to erroneous feature descriptions and incorrect diagnoses.
- Hallucination is a real, measurable risk. Fluent, plausible medical language is no guarantee that the description matches the actual histopathology.
- This work is a stepping stone, not a clinical protocol. It demonstrates feasibility and highlights where improvements are needed. The broader takeaway is a call to combine the strengths of human expertise with disciplined AI tooling—especially when accuracy matters as much as the words used to convey a diagnosis.
- For educators, clinicians, and AI researchers: the next generation of pathology AI should emphasize domain-specific data curation, robust evaluation against medical standards, and transparent communication about what the AI can and cannot reliably do.
Sources & Further Reading
- Original Research Paper: Exploring General-Purpose Autonomous Multimodal Agents for Pathology Report Generation
- Authors: Marc Aubreville, Taryn A. Donovan, Christof A. Bertram
If you’re curious to see how a broad, vision–language model behaves when given different kinds of context, this study is a compelling, cautionary, and instructional read. It’s a reminder that we’re still teaching AI to “see” like a pathologist, and that the human-in-the-loop approach remains essential as we experiment with autonomous AI helpers in the medical arena. The future of AI-assisted pathology will likely hinge on smarter navigation, richer, structured pathology descriptors, and well-defined guardrails that keep patient safety front and center.