Are AI Chatbots More Empathetic Than Doctors? A Practical Look at the New Empathy Research
Table of Contents
- Introduction
- Why This Matters
- What the Studies Looked Like
- The Empathy Measurement Landscape
- The Big Finding: GPT-3.5 and GPT-4 Outperform Humans in Text
- Dermatology and Other Exceptions
- Limitations and Real-World Implications
- Key Takeaways
- Sources & Further Reading
Introduction
Imagine emailing a clinician with a tricky health question and getting an answer that feels almost as warm and understanding as talking to a human. A new wave of research is testing that exact idea: can AI chatbots, powered by large language models like GPT-3.5 and GPT-4, be more empathetic than human healthcare professionals in text-based interactions? The synthesized evidence behind this question is striking. Across 15 empirical studies from 2023–2024, AI chatbots were often rated as more empathetic than human clinicians when the interactions were text-only. The meta-analysis focusing on ChatGPT variants found an overall standardized mean difference (SMD) of 0.87 (95% CI 0.54–1.20; P < .00001), indicating a meaningful empathy advantage for AI in these scenarios. That translates to roughly a two-point bump on a 10-point empathy scale.
This blog post distills the core findings and translates them into practical implications for patients, clinicians, and healthcare systems. The results come from the paper “AI chatbots versus human healthcare professionals: a systematic review and meta-analysis of empathy in patient care” by Howcroft, Bennett-Weston, Khan, Griffiths, Gay, and Howick. You can read the original research paper here: AI chatbots versus human healthcare professionals: a systematic review and meta-analysis of empathy in patient care.
Why This Matters
Short version: empathy is a predictor of patient comfort, adherence, satisfaction, and even pain experience. If AI chatbots can reliably convey empathy in text-based exchanges, they could become a practical ally—handling routine inquiries, triage, and administrative tasks with a warmth that supports patients while freeing clinicians to handle more complex needs.
There are two big reasons this matters right now:
- Real-world relevance and workload relief: There’s growing integration of AI into primary care workflows. For instance, a sizable share of GPs in some systems already use generative AI to help draft patient letters and emails. The current findings suggest AI could shoulder some of the emotionally laden but repetitive communication tasks, potentially preserving time and emotional bandwidth for clinicians to focus on more demanding care.
- A pivot in how we view AI empathy: The Topol Review (a public UK roadmap for digital health technology) and earlier skepticism held that empathy is a uniquely human domain AI can’t mimic. This new synthesis shows that, at least in text-based settings, AI can outperform humans on perceived empathy across a broad range of specialties. That doesn’t mean AI replaces clinicians, but it does push the boundary on what AI can contribute to patient care in the moment.
To be clear, the authors also caution that these results come with important caveats (more on that later). The study’s scope was text-only, often relied on proxy raters, and used mostly non-validated empathy measures. Still, the trend is provocative and timely, especially as voice-enabled AI and digital health tools become more common in patient interactions.
What the Studies Looked Like
Scope and design
- 15 empirical studies published in 2023–2024 were included, and all compared AI chatbots using large language models (LLMs) with human healthcare professionals on empathy-related outcomes.
- The AI systems focused primarily on ChatGPT-3.5 and ChatGPT-4, with a few studies testing other models (e.g., Claude, Gemini, ERNIE Bot, Med-PaLM2). In the meta-analysis, researchers avoided double counting by selecting a single AI arm per study when multiple models were tested.
- Settings spanned dermatology, neurology, oncology, rheumatology, thyroid care, mental health, breast reconstruction, autoimmune diseases, inpatient/outpatient contexts, and online forums. The sources of patient queries included Reddit, private medical records, lab result interpretations, and live outpatient inquiries.
Study quality and biases
- Risk of bias was a concern: nine studies were judged to have a moderate risk of bias and six a serious risk. Many used curated or publicly available question sets, which can introduce selection bias. Some used Reddit or public forums, which might not reflect typical care-seeking populations.
- The majority of empathy measures were unvalidated (e.g., single-item Likert scales) rather than established instruments like the CARE (Consultation and Relational Empathy) measure. This matters for how consistently empathy is defined and compared across studies.
- Raters varied widely: patient proxies, clinicians, laypeople, students, or a mix. Importantly, raters were blinded to whether the reply came from AI or a human, which helps but does not fully eliminate perspective differences.
The Empathy Measurement Landscape
Empathy is a fuzzy, multi-dimensional construct. In these studies, researchers used a mix of tools and approaches:
- Most studies relied on text-based responses and assessed empathy through rater judgments, not patients’ own reports of empathy in a real-care setting.
- Tools ranged from 1–5 Likert scales (with lower scores indicating little or no empathy and higher scores strong empathy) to 0–10 scales and qualitative coding (e.g., counting empathy-related language such as appreciation, acknowledgment, or compassion).
- Only one study used CARE, a validated empathy instrument. Others used bespoke or ad-hoc measures, sometimes focusing on a single dimension (e.g., “empathy level,” “compassion,” or “empathic language markers”). The reliance on proxy raters and non-validated scales means the numbers should be interpreted with caution.
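Because the studies rated empathy on different scales, the meta-analysis pools standardized mean differences rather than raw scores. As a rough illustration of how that standardization works (using made-up numbers, not data from the paper), here is a minimal sketch of computing Hedges' g, a small-sample-corrected SMD, in Python:

```python
import math

def hedges_g(mean_ai, sd_ai, n_ai, mean_human, sd_human, n_human):
    """Standardized mean difference (Hedges' g) between AI and human empathy ratings.

    Works across rating scales because the raw difference is divided by the
    pooled standard deviation, then corrected for small-sample bias.
    """
    # Pooled standard deviation across the two groups
    pooled_sd = math.sqrt(
        ((n_ai - 1) * sd_ai**2 + (n_human - 1) * sd_human**2) / (n_ai + n_human - 2)
    )
    d = (mean_ai - mean_human) / pooled_sd            # Cohen's d
    correction = 1 - 3 / (4 * (n_ai + n_human) - 9)   # Hedges small-sample correction
    return d * correction

# Hypothetical study scored on a 1-5 Likert scale
print(hedges_g(mean_ai=4.2, sd_ai=0.8, n_ai=100, mean_human=3.6, sd_human=0.9, n_human=100))

# Hypothetical study scored on a 0-10 scale: different units, comparable output
print(hedges_g(mean_ai=7.8, sd_ai=1.9, n_ai=60, mean_human=6.1, sd_human=2.1, n_human=60))
```

Expressing every study's result in standard-deviation units is what lets a 1–5 Likert study and a 0–10 study contribute to the same pooled estimate.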
The Big Finding: GPT-3.5 and GPT-4 Outperform Humans in Text
- Overall result (GPT-3.5 and GPT-4 combined): 13 of 15 comparisons favored AI empathy over human clinicians, with an SMD of 0.87 (95% CI 0.54–1.20; P < .00001). That translates to about a two-point difference on a 10-point empathy scale.
- GPT-3.5 subgroup: Pooled SMD 0.51, a modest-to-moderate advantage (the full confidence interval isn’t reproduced here). This suggests GPT-3.5 generally outperformed humans on empathy measures, though with more variability across studies.
- GPT-4 subgroup: Pooled SMD 1.03 (95% CI 0.71–1.35), indicating a stronger and more consistent empathy edge for GPT-4 across the included studies.
- Heterogeneity was substantial, especially overall (I² around 97%). Even within subgroups it was moderate to high (around 49% for GPT-3.5 and higher for GPT-4). This means the exact magnitude of the advantage varied a lot across study designs, settings, and evaluators; the sketch after this list shows where numbers like the pooled SMD and I² come from.
- Dermatology exceptions: In two dermatology-focused studies, human dermatologists outperformed AI (Med-PaLM 2 and ChatGPT-3.5). So, the empathy edge for AI isn’t universal; domain-specific dynamics matter.
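For readers who want a concrete sense of what “pooled SMD,” the confidence interval, and I² mean, here is a minimal random-effects meta-analysis sketch in Python. The per-study effects and variances below are invented for illustration; they are not the 15 studies from the review, and the DerSimonian–Laird estimator shown is a common choice rather than necessarily the one the authors used.

```python
import numpy as np

# Hypothetical per-study SMDs and variances (illustrative only, not the review's data)
smd = np.array([0.4, 1.1, 0.9, 0.2, 1.5, 0.7])
var = np.array([0.05, 0.08, 0.04, 0.06, 0.10, 0.07])

# Inverse-variance weights and Cochran's Q (a measure of between-study disagreement)
w = 1.0 / var
pooled_fixed = np.sum(w * smd) / np.sum(w)
q = np.sum(w * (smd - pooled_fixed) ** 2)
df = len(smd) - 1

# I^2: the share of total variability attributable to between-study heterogeneity
i_squared = max(0.0, (q - df) / q) * 100

# DerSimonian-Laird estimate of between-study variance (tau^2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects weights fold tau^2 into each study's variance, widening the interval
w_re = 1.0 / (var + tau2)
pooled_re = np.sum(w_re * smd) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
ci_low, ci_high = pooled_re - 1.96 * se_re, pooled_re + 1.96 * se_re

print(f"Pooled SMD (random effects): {pooled_re:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"I^2: {i_squared:.0f}%")
```

The high I² reported in the review is a reminder that the 0.87 figure is an average over very different study designs, not a constant you should expect in any single setting.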
A closer look at the numbers: what the results mean in practice
- Across text-based interactions, AI chatbots generated empathy signals that raters perceived as warmer, more compassionate, or better at “emotional connection” than typical human responses. The practical upshot is that AI can be a strong empathic cue in written exchanges, which are common in patient portals, follow-up emails, and asynchronous communications.
- The observed empathy boost could have practical benefits: higher patient satisfaction, reduced perceived pain or distress in some contexts, and potentially improved engagement with care plans. But the link to actual health outcomes remains less clearly established in this literature, because most studies measured perceived empathy rather than real-world clinical outcomes.
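A quick note on the “roughly two points on a 10-point scale” framing used above: an SMD is expressed in standard-deviation units, so turning it back into scale points requires an assumption about how spread out the ratings are. Assuming a pooled standard deviation of about 2.3 points on a 10-point scale (an illustrative figure, not one reported in the post), the conversion is simply:

$$\text{raw difference} \;\approx\; \text{SMD} \times SD_{\text{pooled}} \;=\; 0.87 \times 2.3 \;\approx\; 2.0 \text{ points}$$

A narrower or wider spread of ratings would shrink or stretch that two-point figure accordingly.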
Dermatology and Other Exceptions
- The dermatology findings deserve attention. In two studies, AI did not beat human clinicians on empathy. One plausible explanation is the nuanced rapport and visible cues involved in dermatology consultations, where subtleties in tone, warmth, and nonverbal communication may play a larger role than in text-based exchanges. It also highlights that AI empathy is context-dependent and that quality of care still hinges on accuracy, safety, and human connection in sensitive domains.
Limitations and Real-World Implications
Limitations to keep in mind
- Text-only scope: All analyses stem from text-based interactions. Real-world clinical encounters involve voice, body language, and the broader context of care. Voice-enabled AI promises to change how empathy feels in practice, but the current evidence base for voice-based empathy remains limited.
- Proxy raters: Most empathy assessments used external evaluators (patients’ proxies, laypeople, or clinicians) rather than patients themselves. Patient-reported experiences could diverge from observer judgments.
- Non-validated measures: The heavy reliance on single-item scales and non-validated tools limits comparability and precision. The CARE scale, used in only one study, provides a standardized benchmark but was not widely adopted in the corpus.
- Potential test-set effects: Several studies used publicly available questions from Reddit or medical records. While this reflects real queries, some concerns exist about AI models being exposed to similar prompts during training, which could influence performance (test-set familiarity). The authors note this but also argue that the scale of training data likely mitigates a single-source bias.
- Heterogeneity and double counting: The authors took care to avoid double counting across overlapping model comparisons, but the breadth of models, settings, and evaluators contributes to substantial heterogeneity. This means you should be cautious about applying a single pooled number to a specific clinical role.
Real-world implications and future directions
- Collaborative human–AI models: Rather than AI replacing clinicians, a collaborative model—where clinicians draft the core content and AI refines tone and empathy—could combine accuracy with warmth. The authors highlight a practical path: “an empathic enhancer” that preserves clinician oversight for safety and correctness.
- Voice and nonverbal cues: Emerging voice-enabled AI (and, eventually, embodied agents) is claimed to pick up on nonverbal cues. Trials in voice-based contexts—such as telephone GP appointments, which still account for a sizable share of care—will reveal whether the empathy edge seen in text translates to spoken interactions.
- Disclosure and perception: Blinded raters often judged AI as more empathic, but in real-world settings, people know when they’re interacting with AI. Some studies suggest the initial empathy advantage can fade once users know the reply is AI. This stresses the importance of transparency and how disclosure might influence perceived empathy.
- Prompt design matters: The amount of detail and the length of AI responses influence empathy ratings. Longer, better-contextualized responses may feel more empathetic than brief, terse ones. Tuning prompts to balance empathy with accuracy will be a core design challenge for deployment; a minimal sketch of what that could look like follows this list.
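The review doesn’t prescribe an implementation for the “empathic enhancer” idea, but as an illustration of the clinician-drafts, AI-polishes pattern, here is a minimal sketch. It assumes the OpenAI Python client; the model name, system prompt, and function are placeholders invented for illustration, not anything taken from the paper.

```python
# Minimal sketch of a "clinician drafts, AI refines tone" workflow.
# Assumes the OpenAI Python client; the model name, prompt wording, and
# function are illustrative placeholders, not taken from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You rewrite clinician-drafted messages to patients. Preserve every "
    "clinical fact, instruction, and dosage exactly as written. Improve only "
    "the tone: acknowledge the patient's concern, use plain language, and "
    "close with a clear next step. Do not add medical advice of your own."
)

def add_empathic_tone(clinician_draft: str) -> str:
    """Return the clinician's draft rewritten for warmth, with content unchanged."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable chat model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": clinician_draft},
        ],
        temperature=0.3,  # keep rewrites conservative
    )
    return response.choices[0].message.content

draft = (
    "Your thyroid results are within normal limits. No change to your "
    "medication. Repeat bloods in 6 months."
)
print(add_empathic_tone(draft))
```

In this pattern the clinician still reviews every message before it is sent; the model adjusts tone but never takes over clinical responsibility, which is the oversight the authors emphasize for safety and correctness.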
To dive deeper into the specifics and to see how varied contexts shape the empathy signal, you can refer to the original synthesis here: AI chatbots versus human healthcare professionals: a systematic review and meta-analysis of empathy in patient care.
Key Takeaways
- The headline finding: In text-based interactions, AI chatbots—especially GPT-4—are frequently perceived as more empathic than human clinicians across a broad set of medical questions and settings. The overall effect size (SMD 0.87) suggests a meaningful empathy edge for AI, roughly equivalent to a two-point rise on a 10-point empathy scale.
- Subtlety matters: GPT-4 showed a stronger empathy signal than GPT-3.5 in most studies, though both outperformed humans overall. Dermatology was a notable exception where human clinicians edged AI in empathy.
- Practical routes forward: A collaborative approach—clinician-crafted core messages with AI help to refine tone and empathy—could harness AI empathy without compromising safety or diagnostic accuracy.
- Caution and context: The evidence comes with substantial limitations. Text-only data, proxy raters, and non-validated measures mean we should translate these findings into practice gradually and thoughtfully. Voice-enabled AI and patient-centered trials with direct patient feedback are essential next steps.
- The big takeaway for now: AI empathy in text can be a powerful tool to improve patient experience, but it is not a universal substitute for human empathy across all specialties or modalities. Ongoing research, transparency, and careful design will determine how these tools fit into everyday care.
Sources & Further Reading
- Original Research Paper: AI chatbots versus human healthcare professionals: a systematic review and meta-analysis of empathy in patient care
- Authors: Alastair Howcroft, Amber Bennett-Weston, Ahmad Khan, Joseff Griffiths, Simon Gay, Jeremy Howick
If you’re a clinician, patient, or health-tech reader, this line of work signals a shift in how we think about empathy in digital health. It’s not about AI replacing warmth; it’s about AI augmenting the human touch when used thoughtfully, responsibly, and with clear patient-centric goals. As voice-enabled AI and more nuanced evaluation methods mature, we’ll get a clearer view of where AI empathy helps most, where it needs guardrails, and how to design care that truly feels as caring as it is capable.