Tiny AI, Big Footprint: What Covid-19 X-Rays Reveal About Energy, Accuracy, and Safer Diagnostics

This post compares AI approaches for Covid-19 chest X-ray diagnosis: cloud-based large language models versus compact local discriminative models. Using an app built in Mendix, the underlying study evaluates 14 configurations for accuracy and energy use, highlighting environmental trade-offs and practical safety implications.

In a world where AI is increasingly woven into healthcare, the question isn’t just “Can it be accurate?” but also “At what cost?” This study dives into that intersection by asking a practical, urgent question: when you’re trying to detect Covid-19 from chest X-rays, should you lean on huge, cloud-hosted language models (LLMs) or smaller, local discriminative models? And how does that choice affect both diagnostic performance and the planet we’re trying to protect?

Think of it this way: you’ve got two kinds of AI teammates. One is a grand, do-it-all consultant (the LLM) who can talk, reason, and explain. The other is a tight-knit team of specialists (the small, local models) who are fast, focused, and carry far less baggage. This paper puts them head-to-head in a real-world medical app built on Mendix, comparing not only accuracy but also how much energy each option guzzles in the process.

Below is a plain-English breakdown of what the researchers did, what they found, and what it means for developers, clinicians, and anyone who cares about responsible AI in medicine.

Why this topic matters (the quick why)

Medical AI is everywhere, but the big, fancy LLMs we hear about (think GPT-4, Claude-style models) are incredibly capable—and incredibly hungry. They’re typically hosted in the cloud, often across multiple data centers, and they generate a lot of text or probabilities that require real-time, energy-intensive computation. In healthcare, where decisions can be life-or-death, you also have to weigh safety, privacy, and bias.

The flip side is that small, well-tuned models—running locally or on modest cloud resources—tend to be faster to deploy, easier to secure, and more transparent. They might not have the same broad conversation abilities as LLMs, but for a specific task like detecting Covid-19 from an X-ray, can they beat the big models while also saving energy?

This study tackles exactly that question by running a fair, apples-to-apples comparison across 14 model configurations and a handful of knowledge-augmented approaches.

What was tested (in plain terms)

  • Task: binary classification of chest X-rays as Covid-19 positive or negative.
  • Models on the test bench:
    • Generative, large models (LLMs) hosted externally: GPT-4.5-Preview, GPT-4.1-Nano, o4-Mini, GenAI (Claude-3.5 Sonnet). These were sometimes augmented with a knowledge base (a retrieval-enhanced setup) to improve accuracy.
    • Local discriminative models (smaller, more traditional AI classifiers): Covid-Net (specifically designed for Covid-19 detection in X-rays), DenseNet, ResNet, and VGG.
  • How the models were integrated: A Mendix app served as the host platform to compare how each model performs in a real-world app environment, including data handling, latency, and energy use.
  • Knowledge bases: To help LLMs without feeding them extra patient images, the researchers used a cosine-similarity-based retrieval system that pulls in the three most similar image vectors from a knowledge base built from two local models (Covid-Net and DenseNet); a minimal sketch of this retrieval step follows this list. This is a kind of “consultant with a library card” approach: no extra patient data is sent to the LLM, but the model gets relevant context.
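
To make the retrieval step concrete, here is a minimal sketch of cosine-similarity lookup over precomputed image embeddings. It assumes the knowledge-base vectors were already produced by a local model such as Covid-Net or DenseNet; the function names, array shapes, and the random example data are illustrative, not the paper’s actual implementation.

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, kb_vectors: np.ndarray, k: int = 3):
    """Return the indices and cosine-similarity scores of the k knowledge-base
    vectors most similar to the query image embedding."""
    # Normalise so that a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    kb = kb_vectors / np.linalg.norm(kb_vectors, axis=1, keepdims=True)
    scores = kb @ q
    top_idx = np.argsort(scores)[::-1][:k]  # best matches first
    return top_idx, scores[top_idx]

# Illustrative sizes: 500 stored embeddings of dimension 1024 (assumed, not from the paper).
kb_vectors = np.random.rand(500, 1024)
query_vec = np.random.rand(1024)
idx, sims = top_k_similar(query_vec, kb_vectors, k=3)
print(idx, sims)  # the three entries whose context would be passed to the LLM
```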

In short: you’ve got five LLM configurations (with and without knowledge bases) and four small, local models. They all faced the same X-ray dataset (Covidx CXR-3) and were measured on two axes: diagnostic accuracy and environmental impact (carbon footprint).

The tools you need to know (in plain terms)

  • LLMs: Big AI models that can generate text, reason over prompts, and produce probabilistic outputs. They’re powerful but energy-hungry and can be tricky to control for medical classification tasks.
  • Local discriminative models: Smaller neural networks designed to label data directly (e.g., Covid-Net), often faster and less demanding on compute resources.
  • Knowledge bases / Retrieval Augmented Generation (RAG): A way to give an AI model extra, trusted context without giving it raw data to memorize. In this study, it means pulling in related image vectors to inform the LLM’s decision, rather than feeding it more X-ray images.
  • Carbon footprint measurement: The researchers estimate energy use from the server hardware and cloud infrastructure doing the inference, keeping the comparison as even as possible by running the local models in the same server environment and measuring the LLMs against comparable cloud usage (a back-of-the-envelope sketch of this kind of estimate follows this list).
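
To make the energy side concrete, here is a back-of-the-envelope sketch of how per-inference emissions can be estimated from measured runtime. The power-draw and grid-intensity figures below are placeholder assumptions for illustration, not values reported in the study.

```python
def inference_emissions_grams(runtime_s: float, avg_power_w: float,
                              grid_gco2_per_kwh: float) -> float:
    """Estimate CO2-equivalent emissions for a single inference:
    energy (kWh) = power (W) * time (s) / 3,600,000, then scaled by the
    grid's carbon intensity (g CO2e per kWh)."""
    energy_kwh = avg_power_w * runtime_s / 3_600_000
    return energy_kwh * grid_gco2_per_kwh

# Placeholder comparison: a ~1 s local inference on a single GPU server versus a
# longer, heavier cloud LLM call. All numbers here are assumed, not measured.
local_g = inference_emissions_grams(runtime_s=1.0, avg_power_w=250, grid_gco2_per_kwh=400)
cloud_g = inference_emissions_grams(runtime_s=8.0, avg_power_w=1200, grid_gco2_per_kwh=400)
print(f"local model: {local_g:.4f} g CO2e per inference, cloud LLM: {cloud_g:.4f} g CO2e")
```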

How energy and accuracy were measured (the simple version)

  • Accuracy: How often the model correctly identifies Covid-19 vs. non-Covid-19 X-rays on the Covidx CXR-3 dataset.
  • Confidence and positive predictive value (PPV): how trustworthy the model's positive calls are. PPV is the share of X-rays flagged as Covid-19 positive that truly are positive, which matters for reducing false positives in a medical setting (see the metrics sketch after this list).
  • Speed and energy: How long each inference takes and the associated energy footprint, using AWS Kubernetes-based hosting as a baseline for the cloud portion. The study notes that cloud-based LLM inferences can rack up carbon emissions quickly, especially for models that generate long text or need heavy GPU usage.
  • Knowledge-base impact: The researchers tested whether adding a knowledge base (via image-vector similarity) improves LLM performance and how that changes energy use.
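
For reference, the two headline diagnostic metrics can be computed directly from a binary confusion matrix. The counts below are made up for illustration; they are not the study's results.

```python
def accuracy_and_ppv(tp: int, fp: int, tn: int, fn: int):
    """Compute accuracy and positive predictive value (PPV) for a binary
    classifier. PPV = TP / (TP + FP): the share of 'Covid-19 positive'
    calls that are actually positive."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return accuracy, ppv

# Illustrative counts only:
acc, ppv = accuracy_and_ppv(tp=470, fp=25, tn=480, fn=25)
print(f"accuracy={acc:.3f}, PPV={ppv:.3f}")
```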

The punchline is simple but important: the energy cost of running large LLMs for a binary medical classification task can dwarf the energy used by small, task-focused models, even when you factor in some efficiency tricks.

The big findings (the short, practical version)

  • Smaller, local models often outperformed LLMs on accuracy for this specific task. Covid-Net, in particular, came out on top (about 95.5% accuracy), and even though its carbon footprint is larger than that of the other small models, it still used far less energy than the cloud-hosted LLMs.
  • LLMs struggle with probabilistic outputs in this setting. When the researchers constrained the LLMs to return a plain Covid-19 probability, performance dropped and energy use didn’t necessarily improve: asking a giant model to spit out a single number can be both less accurate and more energy-hungry than a small model doing a simple yes/no with a probability. The broader point is that LLMs are not a universal answer for AI in medicine; there are clear scenarios where they underperform simple discriminative models on both accuracy and carbon footprint.
  • Knowledge bases can help, but the effect is mixed. Adding a knowledge base (with Covid-Net or DenseNet embeddings) sometimes boosted accuracy for LLMs by noticeable margins (for example, GPT-4.1-Nano saw up to a 27% accuracy increase with a Covid-Net embedding knowledge base). Yet the change in carbon footprint varied across models: some saw small increases, others more pronounced ones.
  • Covid-Net is the standout in accuracy among all models tested, with robust performance and a carbon footprint that is still dramatically lower than that of a large LLM. The most energy-efficient path wasn’t a general-purpose AI but a purpose-built model tuned to the task.
  • There’s a clear safety/interpretability angle. LLMs can provide rich, descriptive analyses that are easy to understand but can also hallucinate or overstate confidence. In clinical settings, this makes it risky to rely on LLMs as the sole classifier. The local models’ outputs—especially when they provide direct probabilities—tend to be more transparent and interpretable for the specific task.
  • Speed matters, but energy matters even more. GenAI and other fast-looking LLMs can appear quick, yet their energy footprint tends to be orders of magnitude higher than that of small models once you scale up to continuous use. Even a relatively fast LLM can end up with a far larger carbon bill than a local model that processes the same image in about a second.

In other words: for the Covid-19 X-ray classification task, the most accurate and environmentally friendly solution was a well-tuned, local model (Covid-Net) despite the allure of larger, cloud-based LLMs. The energy cost of using large generative models for this narrow, binary task outweighed the benefits in many cases.

Real-world implications: what this means for healthcare AI

  • For hospitals and clinics with limited compute budgets or sustainability goals, investing in a strong, local discriminative model can deliver robust accuracy with a fraction of the energy cost of large LLMs. This is especially true when the goal is a binary decision (Covid-19 present or not) rather than a broad, explainable narrative about a case.
  • If you’re considering LLMs in a medical workflow, treat them as supplements—not replacements. The study’s results show LLMs can add value in contextual explanation, risk discussion, or patient-facing summaries, but not as the primary classifier for chest X-ray Covid-19 detection.
  • Knowledge bases can uplift LLM performance, but they aren’t a magic wand. They can help, but the benefit varies by model and by how the knowledge is integrated. If you do use LLMs with a knowledge base, monitor not just accuracy but energy use and latency.
  • Safety and reliability should guide deployment choices. The tendency of LLMs to be overconfident—even when wrong—makes it essential to pair them with safeguards or to rely on smaller, more interpretable models for primary triage steps. Human radiologists remain critical for verification.
  • The carbon footprint conversation matters in policy and procurement. This study provides a framework for comparing models beyond raw accuracy: consider the total energy cost, latency, and the infrastructure required to support inference at scale.

Practical takeaways for builders and clinicians

  • Start with a strong local model for binary X-ray classification tasks. Covid-Net, in this study, demonstrates high accuracy with manageable energy use.
  • Use LLMs judiciously. If you’re aiming for explanations or nuanced analysis to accompany the diagnosis, LLMs can be useful as a secondary tool, not the primary classifier.
  • If you experiment with knowledge bases, expect model-by-model variability. Some models benefit significantly from embedding-based knowledge bases; others show modest or mixed gains.
  • Be mindful of output format. For medical applications, forcing LLMs to give plain probabilities can be unreliable and energy-inefficient. Prefer a clear, constrained output aligned with clinical decision-making (a hypothetical validation sketch follows this list).
  • Consider deployment setup. The energy cost of hosting LLMs is heavily influenced by where and how the models run (cloud APIs vs. local inference). In many cases, model architecture and the task shape energy usage more than raw model size.
  • Communicate uncertainty and safety clearly. Regardless of the model, provide human-in-the-loop checks and clear communication about confidence and limitations to patients and clinicians.
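
One way to act on the output-format point is to validate whatever the LLM returns before it reaches a clinician. The sketch below is hypothetical and not part of the paper's Mendix app: it tries to parse a probability from a free-text reply and routes anything malformed or uncertain to human review, with thresholds chosen purely for illustration.

```python
import re
from typing import Optional

def parse_probability(llm_reply: str) -> Optional[float]:
    """Extract a probability in [0, 1] from a free-text LLM reply.
    Returns None if no usable number is found."""
    match = re.search(r"\b(0(?:\.\d+)?|1(?:\.0+)?)\b", llm_reply)
    if not match:
        return None
    value = float(match.group(1))
    return value if 0.0 <= value <= 1.0 else None

def triage(llm_reply: str, low: float = 0.2, high: float = 0.8) -> str:
    """Route malformed or uncertain outputs to a human reader; thresholds are illustrative."""
    p = parse_probability(llm_reply)
    if p is None:
        return "human review (unparseable output)"
    if low < p < high:
        return f"human review (uncertain, p={p:.2f})"
    return f"automated flag: {'positive' if p >= high else 'negative'} (p={p:.2f})"

print(triage("Covid-19 probability: 0.91"))                          # automated flag: positive
print(triage("The image shows patchy opacities suggestive of..."))   # human review
```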

Limitations and future directions (the honest parts)

  • The study focuses on inference energy, not training energy. Training large LLMs is energy-intensive; the authors acknowledge that a full life-cycle assessment would be needed to capture the entire environmental picture.
  • Fine-tuning LLMs for medical image tasks is constrained by current tooling. The authors note that future work could explore fine-tuning or image-based prompting improvements for LLMs to see if accuracy and energy costs shift.
  • The scope is binary Covid-19 detection in X-rays. Multi-class or broader disease detection could yield different trade-offs between accuracy and carbon footprint. Generalizing the findings to other diseases should be tested.
  • Real-world deployment would require more extensive security, privacy, and regulatory assessments. The paper points out risks around data sharing with third-party providers and the need for safeguards.

Conclusion: a sensible path forward for AI in medical imaging

As AI tools become more embedded in clinical workflows, this study provides a pragmatic lens on when to reach for a gigantic, cloud-based helper and when to lean on focused, local models. For the specific task of Covid-19 detection in chest X-rays, the evidence favors small, discriminative models like Covid-Net in terms of both accuracy and carbon footprint. Knowledge bases can boost the performance of LLMs, but they don’t automatically solve the energy and safety challenges that come with generative models in sensitive medical contexts.

In short: don’t assume bigger is always better in medical AI. For classification tasks where speed, reliability, and energy efficiency matter—as they do in hospitals and clinics—well-designed, smaller models often deliver the best combination of performance and sustainability. And if you do use LLMs, keep them in a supporting role, add safeguards, and measure results not just in accuracy but in footprint, latency, and interpretability.

Key Takeaways

  • For Covid-19 detection from chest X-rays, small, local discriminative models (like Covid-Net) can outperform large, cloud-hosted LLMs in accuracy and, importantly, dramatically reduce energy use.
  • LLMs are powerful for broad conversation and explanations, but they can be biased, overconfident, and energy-hungry when used for straightforward medical classification tasks.
  • Adding knowledge bases to LLMs can improve accuracy, but the environmental impact varies by model and embedding approach; benefits aren’t guaranteed and must be weighed against energy costs.
  • When energy efficiency is a priority, a focused, well-tuned local model is often the most sustainable choice, with LLMs serving a supplementary, safety-focused role rather than a primary classifier.
  • Measuring AI impact in medicine should include both diagnostic performance and carbon footprint, plus factors like latency, interpretability, and safety. This helps ensure you’re building tools that are not only effective but also responsibly designed for real-world healthcare settings.

If you’re a developer or clinician looking to prompt or deploy AI in healthcare, consider starting with a strong local model, reserve LLMs for tasks where their strengths truly add value, and always couple automation with human oversight. That combo gives you the best shot at accurate diagnoses, patient safety, and a lighter environmental footprint.

Frequently Asked Questions