Knowledge-Based Learning in Text-RAG vs Image-RAG for Chest X-Rays: A Practical Analysis
Table of Contents
- Introduction
- Why This Matters
- What Text-RAG and Image-RAG Are
- The Data and Experimental Setup
- The Models: ChatGPT mini and LLaMA with EVA-ViT
- Findings: Accuracy, Hallucination, and Calibration
- Practical Implications and Real-World Scenarios
- Key Takeaways
- Sources & Further Reading
Introduction
Radiology sits at a pivotal intersection of speed and accuracy. Hospitals today generate more chest X-rays than radiologists can review promptly, creating bottlenecks that can delay critical care. This is precisely the kind of setting where knowledge-grounded AI can shine: systems that not only label or describe an image but also reason about it using external information and contextual grounding. The study “Knowledge-based learning in Text-RAG and Image-RAG” takes a close look at how two retrieval-augmented strategies—text-based RAG and image-based RAG—stack up in chest X-ray interpretation, comparing them against a baseline multi-modal setup. The authors argue that grounding machine reasoning in external knowledge (text) or anchored visual references (images) can reduce hallucinations and improve calibration, all while working within the practical constraints of available hardware and data.
If you’re curious about what’s new in this space, this article is based on the recent paper Knowledge-based learning in Text-RAG and Image-RAG. The authors (Alexander Shim, Khalil Saieh, and Samuel Clarke from Florida International University) explore how modern vision encoders and large language models interact when diagnosing chest X-rays from the NIH Chest X-ray dataset. The takeaway: grounding helps, but the path to robust, clinically reliable AI is nuanced and highly dependent on which grounding strategy you lean on.
The core question is simple, but the implications are profound: if you want AI that can classify findings, pinpoint relevant details, and justify its reasoning, should you fetch external textual knowledge to guide it, or should you anchor its decisions to similar images in a curated visual space? The authors’ head-to-head comparison of text-RAG and image-RAG against a baseline sheds light on where each approach excels—and where it falters—in the messy, imbalanced world of medical imaging data.
Why This Matters
This research arrives at a moment when AI in healthcare is moving from novelty to necessity. The promise of multi-modal models—systems that see and read and reason—could dramatically speed up workflows, reduce turnaround times for chest radiographs, and, crucially, increase trust through interpretable reasoning. But the practical reality isn’t so straightforward: data imbalance, model complexity, and the risk of hallucinations (the model making up plausible-but-wrong conclusions) loom large.
A distinctive strength of this study is its explicit attention to grounding. Text-RAG grounds predictions by pulling in external medical knowledge (via a retrieval mechanism that pulls Wikipedia summaries), while Image-RAG grounds reasoning by comparing with visually similar X-ray images using a KNN/FAISS setup. The result isn’t just about accuracy numbers; it’s about whether the model’s reasoning can be traced to real-world references the clinician can verify. In a clinical setting, that interpretability is almost as important as the raw accuracy.
The work also updates the broader AI landscape by comparing two contemporary large-language-model ecosystems—GPT-based variants and LLaMA-based models—under resource constraints. The takeaway is not just which model wins on a single metric, but how the model family interacts with grounding signals while facing real-world data challenges like class imbalance. In short: this is a meaningful step toward reliable, explainable AI in radiology, building on prior efforts like ChestGPT and related roots in multimodal medical AI.
What Text-RAG and Image-RAG Are
Text-RAG (Retrieval-Augmented Generation) supplements a vision-language pipeline with external text sources. After the model processes the X-ray image, it retrieves relevant textual context (in this study, summaries accessed via the Wikipedia API) that could support diagnosis. The retrieved passages are fed back into the reasoning process, guiding the final predictions and providing a textual rationale that humans can audit.
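The paper does not publish its retrieval code, so the snippet below is only a minimal sketch of the idea, assuming the public Wikipedia REST summary endpoint; the `build_text_rag_prompt` helper, the prompt wording, and the candidate labels are illustrative stand-ins, and the downstream LLM call is omitted.

```python
import requests

def fetch_wikipedia_summary(term: str) -> str:
    """Fetch a plain-text summary for a term from the public Wikipedia REST API."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{term.replace(' ', '_')}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

def build_text_rag_prompt(visual_findings: str, candidate_labels) -> str:
    """Append retrieved summaries to the prompt so the rationale is auditable."""
    context = "\n\n".join(
        f"{label}: {fetch_wikipedia_summary(label)}" for label in candidate_labels
    )
    return (
        "You are assisting with chest X-ray interpretation.\n"
        f"Preliminary visual findings: {visual_findings}\n\n"
        f"Retrieved reference context:\n{context}\n\n"
        "Using only findings supported by the image and the context above, "
        "name the single most likely label and justify it briefly."
    )

# Example usage with labels drawn from the study's target classes.
prompt = build_text_rag_prompt(
    visual_findings="blunting of the right costophrenic angle",
    candidate_labels=["Pleural effusion", "Atelectasis", "Pneumothorax"],
)
```

The point of the sketch is the auditability: whatever passages are retrieved end up verbatim in the prompt, so a clinician can check exactly which external text influenced the answer.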
Image-RAG flips the grounding mechanism to the visual domain. Instead of text, it retrieves visually similar X-ray images from a dedicated in-domain store (built with a FAISS index and a KNN search, using k=3 for speed). The retrieved images help constrain the model’s interpretation of the target X-ray, anchoring its predictions in known visual patterns and reducing mislabeling or overconfident errors.
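A minimal sketch of that KNN-over-FAISS step, assuming each X-ray has already been reduced to a fixed-length embedding (the random arrays below stand in for real encoder features and labels):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768                                                          # embedding size (illustrative)
store_embeddings = np.random.rand(1000, d).astype("float32")     # stand-in for encoder features
store_labels = np.random.randint(0, 6, size=1000)                # known class ids for the store

index = faiss.IndexFlatL2(d)       # exact L2 nearest-neighbour index
index.add(store_embeddings)        # build the in-domain visual reference store

def retrieve_similar(query_embedding: np.ndarray, k: int = 3):
    """Return labels and distances of the k most visually similar stored X-rays."""
    query = query_embedding.reshape(1, -1).astype("float32")
    distances, indices = index.search(query, k)
    return store_labels[indices[0]], distances[0]

# k=3 mirrors the paper's stated choice, favouring retrieval speed.
neighbour_labels, neighbour_dists = retrieve_similar(np.random.rand(d), k=3)
```

The retrieved neighbours (and their known labels) are then handed to the reasoning model as visual anchors for the target image.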
Analogy time: Text-RAG is like consulting a medical encyclopedia and a clinician’s notes while you interpret a scan; Image-RAG is like pulling up similar past scans to compare shapes, densities, and distributions. Both aim to curb one of AI’s most common failure modes: hallucinating findings that aren’t actually supported by the data.
A key takeaway in the paper is that these two grounding strategies offer different strengths. Text-RAG tends to enrich reasoning with domain knowledge, but its effectiveness hinges on the relevance and quality of retrieved passages. Image-RAG can stabilize predictions through visual grounding, but it may still miss subtle, textually grounded nuances. The study’s comparison helps us see which approach to lean on in different clinical scenarios and hardware environments.
To ground the discussion, the paper uses the NIH Chest X-ray dataset with a deliberate focus on single-label classification across six classes (five diseases plus “No Finding”). The researchers also emphasize an enduring challenge in radiology AI: severe class imbalance. In their preprocessing, they actively address this imbalance (more on that in the next section) to avoid biased learning.
For readers who want to dive deeper, you can consult the original paper here: Knowledge-based learning in Text-RAG and Image-RAG.
The Data and Experimental Setup
Data quality and composition shape what’s possible with any AI system, and this study is candid about the limits and trade-offs of the NIH Chest X-ray dataset.
Dataset scale and split: The NIH Chest X-ray dataset started with about 112,000 images. After single-label filtering to focus on the six target classes (Atelectasis, Effusion, Emphysema, Pneumothorax, Mass, and No Finding), the researchers worked with 84,053 labeled images, split 80/10/10 into training, validation, and test sets.
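A minimal sketch of that filtering and splitting step is below. The file and column names follow the NIH metadata release (commonly “Data_Entry_2017.csv” with a pipe-separated “Finding Labels” column), and the stratified split is an added safeguard of this sketch, not something the paper specifies:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET_CLASSES = {"Atelectasis", "Effusion", "Emphysema", "Pneumothorax", "Mass", "No Finding"}

df = pd.read_csv("Data_Entry_2017.csv")

# Keep single-label images only (no "|"-separated multi-findings), restricted to the target classes.
single_label = df[~df["Finding Labels"].str.contains(r"\|", regex=True)]
filtered = single_label[single_label["Finding Labels"].isin(TARGET_CLASSES)]

# 80/10/10 split into train / validation / test.
train_df, holdout_df = train_test_split(
    filtered, test_size=0.2, stratify=filtered["Finding Labels"], random_state=42
)
val_df, test_df = train_test_split(
    holdout_df, test_size=0.5, stratify=holdout_df["Finding Labels"], random_state=42
)
print(len(train_df), len(val_df), len(test_df))
```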
Class imbalance: The authors highlighted a persistent issue: the “No Finding” class dominates, creating an imbalanced dataset in which minority disease classes contribute little learning signal. This skew makes models prone to predicting the majority class and missing rare conditions.
Pneumonia removal: Pneumonia, a rare category in this setup, was removed from training to avoid instability and misleading gradients. This underscores a practical point: ignoring the rarest cases can stabilize training but also risks losing clinically relevant signals.
Mitigation strategies: To counteract imbalance, the study employs several strategies (a minimal code sketch of the weighting and sampling follows this list):
- Class-weighted loss functions to emphasize minority labels.
- A WeightedRandomSampler that oversamples underrepresented classes without duplicating data, avoiding overfitting risks tied to simple oversampling.
- The acknowledgment that while class weights improve macro-F1, they can reduce overall accuracy, illustrating the classic trade-off between balanced performance and total correctness.
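The paper does not list its exact weighting formula, so this is a minimal PyTorch sketch of the two mechanisms named above, using inverse-frequency weights as an illustrative choice:

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# `labels` stands in for the integer class ids (0..5) of the training images.
labels = np.random.randint(0, 6, size=10_000)

# Inverse-frequency class weights: minority classes carry more loss per sample.
class_counts = np.bincount(labels, minlength=6)
class_weights = torch.tensor(
    len(labels) / (len(class_counts) * class_counts), dtype=torch.float32
)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

# WeightedRandomSampler oversamples rare classes at draw time,
# without physically duplicating images in the dataset.
per_sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(per_sample_weights, dtype=torch.double),
    num_samples=len(labels),
    replacement=True,
)
# Pass `sampler=sampler` to your DataLoader instead of `shuffle=True`.
```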
Grounding mechanics and resources:
- Image-RAG uses a KNN approach with FAISS to locate visually similar X-rays (k=3), embedding visual context into the reasoning pipeline.
- Text-RAG taps into external knowledge via Wikipedia summaries to augment reasoning with medically relevant context.
Model baselines and resources: The study compares a baseline multimodal approach against Text-RAG and Image-RAG. Due to hardware and funding constraints, the researchers used GPT-2 mini (a smaller GPT variant) for the text-RAG large-language-model (LLM) experiments and LLaMA-3 for the other setup. The paper notes that resource limits prevented running larger LLMs on the full NIH dataset, which is a useful reminder of the practical bottlenecks in deploying cutting-edge AI in clinical research.
If you want to revisit the data and setup in more detail, reference the original paper’s figures and descriptions, especially the discussion around class distributions (including the “No Finding” dominance), the justification for removing Pneumonia, and the specific sampling choices that shaped learning dynamics.
The Models: ChatGPT mini and LLaMA with EVA-ViT
Two model families anchor the study's comparisons: a ChatGPT-based approach (via a GPT-2 mini proxy) and a LLaMA-based pipeline that leverages an EVA-ViT backbone for image understanding.
ChatGPT-based Model
- Setup: The project tested three conditions for each model: a baseline without retrieval, an image-RAG variant, and a text-RAG variant. Each configuration was run around 20 times to gauge stability and variability.
- What they found:
- Accuracy over epochs: The baseline and text-RAG models tended to be the most consistently accurate across epochs, with the text-RAG model holding its performance within a narrow band.
- Hallucination trends: The image-RAG setup produced fewer hallucinations per epoch than the baseline or text-RAG configurations, which tended to spike in certain phases, especially when relying on retrieved text.
- Takeaway: Text grounding helps reasoning, but the quality of retrieved passages matters; image grounding can curb spurious outputs by tethering predictions to visual references.
LLaMA-based Model (with EVA-ViT)
- Pipeline: The LLaMA-based system follows a multi-stage pipeline: preprocessing to filter labels, EVA-ViT as the visual encoder backbone that tokenizes X-ray images into image tokens, LLaMA-based prediction layers for disease classification, and class-weighted loss functions to contend with imbalance (see the schematic sketch after this list). The Adam optimizer is used to stabilize training, aided by weighted cross-entropy to balance gradient contributions across classes.
- Classes: The model targets five diseases (Atelectasis, Effusion, Emphysema, Pneumothorax, Mass) plus No Finding, for six classes in total. Preventing overfitting and gradient instability is a recurring theme given the multi-modal, multi-stage nature of the pipeline.
- Rationale: The Adam optimizer helps regulate unstable gradients across the three-part structure (EVA-ViT, LLaMA, classifier). The combination of an attention-based vision encoder (EVA-ViT) with a language-model-driven classifier aims to capture both visual patterns and textual reasoning cues.
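The paper’s figures show the actual architecture; the sketch below is only a schematic of the three-part structure described above, with placeholder modules standing in for EVA-ViT and LLaMA, a linear projection into the language model’s embedding space, and a mean-pooling step added purely for illustration:

```python
import torch
import torch.nn as nn

class GroundedXrayClassifier(nn.Module):
    """Schematic pipeline: vision encoder -> projection into the language
    model's token space -> classification head. Encoder and language model
    are stand-ins; the paper uses EVA-ViT and LLaMA."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, lm_dim: int, num_classes: int = 6):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. an EVA-ViT backbone
        self.projection = nn.Linear(vision_dim, lm_dim)  # image tokens -> LM embedding space
        self.language_model = language_model             # e.g. LLaMA layers
        self.classifier = nn.Linear(lm_dim, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        image_tokens = self.vision_encoder(pixel_values)   # (B, T, vision_dim)
        lm_inputs = self.projection(image_tokens)          # (B, T, lm_dim)
        lm_outputs = self.language_model(lm_inputs)        # (B, T, lm_dim)
        pooled = lm_outputs.mean(dim=1)                    # simple pooling choice
        return self.classifier(pooled)                     # (B, num_classes) logits

# Training pieces named in the text, shown here only as a pattern:
# model = GroundedXrayClassifier(encoder, llm, vision_dim=1024, lm_dim=4096)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# criterion = nn.CrossEntropyLoss(weight=class_weights)  # weighted cross-entropy
```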
A common thread across both model families: grounding—whether textual or visual—can steer complex multi-modal models away from brittle, overconfident predictions. The study’s comparative lens shows that how you ground the model matters as much as what you ground it with.
Throughout the narrative, the authors emphasize the practical constraint: you don’t need the biggest models to see measurable gains in grounded reasoning; sometimes a well-tuned, resource-conscious setup yields the most stable benefits. For readers, this is a valuable reminder that “bigger” isn’t always better when your goal is reliable, interpretable clinical AI.
If you’d like to skim the full architecture visuals, you can check the original paper for figures illustrating the LLaMA-based pipeline and the Image RAG vs Text RAG comparison. And for a broader context on this line of work, you’ll also see references to ChestGPT and related efforts in the paper’s discussion.
Findings: Accuracy, Hallucination, and Calibration
The study’s results underscore a few concrete patterns that are worth internalizing if you’re planning to deploy or extend this line of work:
Image-RAG tends to be the most stable performer in accuracy. Across epochs, the image-based RAG maintains a tight accuracy band roughly between 0.55 and 0.70, with a peak around 0.70. This stability is attributed to the “visual grounding” effect: by anchoring predictions to similar past X-rays, the model reduces random fluctuations and noise in its decisions.
Text-RAG offers the benefit of external knowledge, but its reliability is more variable. The text-based approach can improve reasoning by injecting domain knowledge, but it can also introduce hallucinations when retrieved passages are irrelevant or contradictory to the image data. In the LLaMA-based experiments, the text-RAG variant showed higher variability in hallucination rates across epochs, including bursts of 40–60 hallucinations in some phases. The takeaway is nuanced: text grounding is powerful when the retrieved content is high-quality and well-contextualized to the image, but it can backfire if the retrieval process returns misleading or off-topic information.
Baseline (non-grounded) models are consistently less stable. Without external grounding signals, the models show more erratic learning dynamics and a higher susceptibility to drifting into incorrect predictions as training progresses. This aligns with a broad intuition in AI: grounding helps calibrate uncertain predictions and keeps decision boundaries more robust.
LLM choice matters. The GPT-based (GPT-2 mini) experiments tended to outperform the LLaMA-based setup on key measures like hallucination rate and calibration in this particular study, though hardware constraints limited scaling. The authors note that the GPT-2 mini configuration achieved a better balance between accuracy and hallucination, suggesting that model size, training regime, and grounding interact in non-trivial ways. They report that GPT-based grounding demonstrated lower hallucination rates and better Expected Calibration Error (ECE) than the LLaMA-based approach (a reference sketch of the ECE computation follows below).
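The paper reports ECE without spelling out its binning choices, so here is the standard binned definition as a reference sketch; the bin count and edge handling are common defaults, not the authors’ exact settings:

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins: int = 10) -> float:
    """Binned ECE: average |accuracy - confidence| over equally spaced
    confidence bins, weighted by the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            bin_acc = (predictions[in_bin] == labels[in_bin]).mean()
            bin_conf = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(bin_acc - bin_conf)
    return ece

# Example: three predictions with their max-softmax confidences and true labels.
print(expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 2], [1, 2, 2]))
```

A lower ECE means the model’s stated confidence tracks its actual accuracy more closely, which is exactly the property clinicians need when deciding how much to trust a flagged finding.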
Data imbalance and calibration remain central. Even with sophisticated grounding, the NIH dataset’s imbalance—particularly the dominance of “No Finding”—continues to shape learning dynamics. The WeightedRandomSampler and class-weighted losses helped mitigate macro-F1 gaps, but the results highlight the trade-off between fair performance across rare diseases and overall accuracy. Expect more nuanced calibration work in future iterations as datasets grow and balance improves.
Practical implication: for clinicians and AI engineers, the image-grounded approach offers the most predictable reliability in this setting, which is valuable for high-stakes medical interpretation. Text-grounded AI can be a powerful supplement when you have reliable, context-rich external sources and a robust retrieval strategy, but you must guard against misleading retrieved content.
If you want a quick mental model: imagine image-grounding as a seasoned radiology mentor who has seen thousands of visually similar X-rays and can guide you toward plausible patterns, while text-grounding is a medical encyclopedia you consult for explanatory background. Both are useful, but you don’t want the encyclopedia to hand you irrelevant pages during a time-critical diagnostic decision.
Practical Implications and Real-World Scenarios
So what does this mean for real-world use today?
In fast-paced clinical settings, image-RAG can serve as a stable triage assistant. Its consistent accuracy and low hallucination spikes make it a trustworthy partner for preliminary reads, especially when radiologists are juggling high volumes of cases. A clinician could rely on image-grounded AI to flag potential findings and provide a grounded rationale anchored in similar-looking prior images.
Text-RAG is a strong companion for augmenting radiology workflows when solid, reliable external knowledge is available. If the retrieval layer is carefully curated (e.g., high-quality medical texts, up-to-date guidelines), text grounding can improve interpretability and explainability—vital for clinician trust. However, this strategy should be deployed with checks to guard against retrieval of misleading or conflicting passages, especially in edge cases where images do not align neatly with generic textual knowledge.
Hardware and scale matter. The paper’s constrained use of GPT-2 mini and LLaMA-3 reflects practical realities: bigger models demand more compute, which can slow experimentation and complicate deployment in resource-constrained hospitals. The takeaway is not that bigger is always better, but that smart grounding combined with appropriately scaled models can outperform ungrounded baselines even on modest hardware.
Data quality and curation matter more than ever. The researchers’ experience with imbalance and the decision to remove the Pneumonia class illustrate a broader point: data curation decisions can materially affect how a model learns. In real deployments, teams should invest in robust data pipelines, clear labeling strategies, and careful handling of underrepresented conditions to ensure the model generalizes well to clinically relevant cases.
For developers and researchers: consider an ensemble approach. The study’s results suggest that a hybrid strategy, using image-grounded reasoning as the backbone and selectively incorporating text-grounding for specific questions (e.g., “Is there an effusion in this context with a history of edema?”), could offer both stability and interpretability. As hardware and data availability improve, larger LLMs may shift these trade-offs, but the core insight—grounded reasoning improves reliability—will likely persist.
If you want a bridge to the broader literature, the paper situates its work in conversation with ChestGPT and related efforts, underscoring a shared trajectory toward more interpretable, knowledge-grounded, multi-modal radiology AI.
Key Takeaways
Grounded reasoning matters. Text-RAG and Image-RAG both aim to reduce hallucinations compared to ungrounded baselines, but they do so via different channels—external textual knowledge vs. visual similarity.
Image-grounding yields more stable accuracy. In this study, Image-RAG stayed consistently between about 0.55 and 0.70 accuracy across most epochs and peaked at around 0.70, indicating strong generalization with minimal volatility.
Text-grounding can boost reasoning but is sensitive to retrieved content. Spikes in hallucinations in the text-RAG configurations illustrate the risk when retrieved text conflicts with the visual signal or is not well-integrated into the decision process.
Model choice and calibration matter. GPT-based grounding showed advantages in hallucination rates and calibration (ECE) over the LLaMA-based setup in this particular investigation, though computational constraints limited scale. Expect evolving results as hardware and model families mature.
Data balance is a practical bottleneck. The NIH Chest X-ray dataset’s imbalance shaped learning dynamics; addressing class imbalance with weighted losses and sampling helped, but it’s a reminder that robust clinical AI requires not just smarter models but smarter data curation.
Real-world deployment requires pragmatism. Grounded multi-modal AI can accelerate radiology workflows, but clinicians should be prepared for a mix of models: a stable image-grounded core, supplemented by text-grounded reasoning where high-quality external evidence exists.
The future is iterative. The authors are candid about hardware limitations and data imbalance as barriers to exploring larger LLMs on this dataset. As datasets grow and hardware becomes more accessible, we can expect more incisive comparisons and potentially even better-performing hybrids.
For those who want to revisit the technical details or replicate the study’s spirit, the original paper offers a thorough description of preprocessing steps, weighting schemes, the RAG architecture trade-offs, and the empirical plots that trace accuracy and hallucination across epochs. And as AI in radiology evolves, this kind of disciplined, grounded comparison will be essential for building tools clinicians can trust.
Sources & Further Reading
Original Research Paper: Knowledge-based learning in Text-RAG and Image-RAG
Authors: Alexander Shim, Khalil Saieh, Samuel Clarke
For readers who want to explore related lines of work, the paper’s references include ChestGPT and contemporary examinations of LLMs in radiology, such as discussions of GPT-4 and medical image interpretation. These works collectively illuminate a growing ecosystem where grounding, interpretation, and human-AI collaboration become the norm rather than the exception.