Can ChatGPT Read Minds? GPT-4, Social Vignettes, and the New ToM Test Drive

Can a language model truly grasp others’ minds? This post reviews how GPT-4 and GPT-3.5 perform on three ToM measures (Faux Pas, Social Stories, and Story Comprehension) in English and German. The findings reveal notable progress and clear limits, with implications for autistic support and AI research.


Table of Contents
- Introduction
- Why This Matters
- Higher-Order Social Reasoning: What the Study Tested
- The Faux Pas Test
- The Social Stories Questionnaire
- The Story Comprehension Test
- GPT-4 vs GPT-3.5: What They Found
- Language, Uncertainty, and Real-World Use
- Real-World Scenarios: Where Could This Help Today?
- Strengths and Limitations to Keep in Mind
- Conclusion: What This Means for the Future of AI and Social Mind Reading
- Key Takeaways for Readers and Practitioners
- Sources & Further Reading

Introduction
Social understanding—often framed as Theory of Mind (ToM)—is the way humans infer what others think, feel, and intend, especially in tricky social moments. A new study dives into how well ChatGPT can simulate this kind of social reasoning when faced with complex “social vignette” tasks. The researchers tested GPT-3.5 Turbo and GPT-4 on three established ToM measures (the Faux Pas Test, the Social Stories Questionnaire, and the Story Comprehension Test) in both English and German, using two independent raters to judge accuracy according to standardized manuals. The aim: see whether, and how well, today’s large language models can replicate applied ToM in ways that might someday assist autistic individuals or others with social-communication challenges.

If you’re curious about where AI’s “mind-reading” capabilities stand right now, this is the paper to read: “Applied Theory of Mind and Large Language Models -- how good is ChatGPT at solving social vignettes?”, new research by Holl-Etten, Schnaderbeck, Kosareva, Prattke, Krueger, Warner, and Vetter.

Why This Matters
In the current moment, AI is increasingly used not just for chat, but for psychotherapy-support tools, educational aides, and personalized coaching—areas where social nuance matters a lot. This study pushes beyond simple Q&A and tests LLMs on higher-order social reasoning: Can a machine interpret irony, faux pas, or the subtleties of social interaction the way a neurotypical adult does? And crucially, can it do so in more than one language, given that model training data is uneven across languages?

From a practical standpoint, the findings matter for two reasons. First, if GPT-4 (and, to a lesser extent, GPT-3.5) can approach neurotypical performance on certain ToM tasks, there’s potential for AI to become a supportive tool in social-communication coaching, particularly for autistic individuals who often benefit from technology-assisted communication. Second, the study highlights the real-world caveats: even when models perform well on test items, they frequently hedge with epistemic uncertainty markers like maybe or probably. Those hedges can be helpful transparency, but they may also confuse users who need clear, decisive guidance in ambiguous social situations.

This paper also sets a methodological bar: testing across languages (English and German) and employing independent raters with standardized scoring. That cross-linguistic angle is timely because it tests robustness beyond a single language domain and helps address concerns about training contamination in English-language prompts. If you want to see how researchers are balancing ecological validity with experimental rigor, this study is a solid example.

Higher-Order Social Reasoning: What the Study Tested
The core idea is simple in concept but hard in practice: can a modern LLM simulate to some degree the ability to reason about what others know, believe, or intend during socially nuanced interactions? The researchers selected three well-established measures that tap into higher-order ToM:

  • The Faux Pas Test: A battery of vignettes where one character says something awkward or inappropriate, often because they lack certain information or memory about a social context. The test asks questions about detection, identity, intention, belief, and empathy.
  • The Social Stories Questionnaire (SSQ): A set of short social interactions where the model must decide whether a line or remark is upsetting, identify which line is problematic, and judge the social meaning. This task includes both subtle and blatant faux pas.
  • The Story Comprehension Test (SCT): Short stories with two characters, followed by questions that require interpreting pretense, white lies, irony, threat, or dare.

Notably, the researchers implemented a robust two-rater coding scheme for each GPT response, following the manuals for each task. Inter-rater reliability was high (κ = 0.91–0.97), which adds credibility to the accuracy assessments beyond a single judge’s read.
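The κ = 0.91–0.97 figures refer to Cohen’s kappa, the standard chance-corrected agreement statistic for two raters. As a minimal sketch of how such agreement can be computed (the labels below are made-up illustrative data, not the study’s):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters coded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label rates.
    p_e = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (p_o - p_e) / (1 - p_e)

# Example: two raters coding 10 model responses as correct (1) / incorrect (0).
a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
print(round(cohens_kappa(a, b), 2))  # → 0.78
```

Values near 1 indicate near-perfect agreement, which is why the reported 0.91–0.97 range lends weight to the accuracy scores.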

The study also ventured into the epistemic side of LLM responses: Do the models reveal uncertainty about their own conclusions? The researchers coded “uncertainty markers”—linguistic hedges like maybe, possibly, or probably—as part of the ToM tasks. This aspect matters because how a model signals its confidence could influence how autistic or neurodivergent users interpret and apply the model’s guidance in real-world social situations.
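The hedge coding described above can be sketched as a simple lexicon scan over model responses. The word list and example sentences here are illustrative assumptions, not the study’s actual coding scheme (which followed human raters and the task manuals):

```python
# Hypothetical hedge lexicon for flagging epistemic uncertainty markers.
HEDGES = {"maybe", "possibly", "probably", "perhaps", "might", "could"}

def contains_hedge(response: str) -> bool:
    """True if the response contains at least one hedge word."""
    words = {w.strip(".,!?;:").lower() for w in response.split()}
    return not words.isdisjoint(HEDGES)

# Illustrative responses, not taken from the paper's materials.
responses = [
    "She probably did not know the cake was homemade.",
    "He said it to hurt her feelings.",
    "Maybe the remark was meant as a joke.",
]
hedged = [contains_hedge(r) for r in responses]
rate = sum(hedged) / len(hedged)
print(hedged, round(rate, 2))  # → [True, False, True] 0.67
```

A per-response rate like this is the kind of quantity behind figures such as “42% of GPT-4 responses contained hedges,” though the study’s raters judged hedging in context rather than by keyword match alone.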

GPT-4 vs GPT-3.5: What They Found
Across the three tasks and across languages, GPT-4 consistently outperformed GPT-3.5, with performance levels approaching or matching neurotypical benchmarks on several measures. Here are the highlights:

  • Faux Pas Test

    • GPT-4 showed near-human accuracy on the Faux Pas Test in both English and German prompts, with results in the high-80s to mid-90s percentage range per vignette, and with multiple vignettes hitting the 90–100% mark.
    • GPT-3.5 struggled more noticeably, with average performance well below GPT-4 and closer to subclinical autism ranges on some prompts; English vs German differences were present but not the sole driver of outcome.
    • For a sense of scale, GPT-4’s German-language performance in this task often reached 92–95% across several stories (with similar highs in English prompts), yielding an overall faux pas score described as on par with neurotypical adults in the comparison data.
  • Social Stories Questionnaire (SSQ)

    • GPT-4’s SSQ performance was comparable to neurotypical adults in both languages, with German prompts yielding about 67% and English prompts about 63% accuracy. The neurotypical adult benchmark from Lawson et al. sits around 70% for females and 60% for males, so GPT-4 lands squarely in this neighborhood, especially given the German-language result.
    • GPT-3.5 trailed GPT-4 and generally fell below the neurotypical range in both languages.
  • Story Comprehension Test (SCT)

    • GPT-4 outperformed GPT-3.5 again, with German prompting delivering particularly impressive results: on SCT, GPT-4 German accuracy reached about 89%, outperforming both neurotypical adults and neurotypical adolescents in the source benchmarks used for comparison.
    • In English, GPT-4 also scored higher than GPT-3.5 and approached neurotypical adult levels, with the English SCT accuracy around the mid-60s to upper-60s (roughly comparable to the range reported for adults in prior work).
  • Cross-language and Training Contamination

    • The authors explicitly tested prompts in both English and German to check for language-driven differences. They note that some of the strongest performances occurred in German prompts for specific tasks (such as the SCT) despite German having potentially less training data in the model's pretraining mix. This suggests that the observed capabilities are not solely a byproduct of English-language training data abundance.
  • Uncertainty Markers

    • GPT-4 showed higher use of uncertainty markers than GPT-3.5, especially in German prompts. For example, the study found up to 42% of GPT-4 responses containing epistemic hedges in the Faux Pas Test, compared to around 30% for GPT-3.5 in similar conditions.
    • The pattern held across tasks: more hedging in German prompts, with some language-related variance in how often the model admits uncertainty.

Practical takeaway from the numbers: GPT-4’s capacity to perform higher-order ToM tasks in both languages is not just a fluke of one test; across a battery of tasks, GPT-4 demonstrates robust, nuanced social reasoning comparable to, and sometimes exceeding, neurotypical benchmarks. Yet the model’s tendency to hedge with uncertainty markers is a real-world feature that matters for users seeking crisp guidance in ambiguous social scenarios.

Language, Uncertainty, and Real-World Use
Two recurring threads emerge from the study: cross-linguistic robustness and the model’s epistemic style.

  • Cross-linguistic robustness

    • The researchers’ dual-language design is more than a curiosity. It helps separate a model’s social reasoning ability from the idiosyncrasies of a single language’s input data. In some cases, German prompts produced standout results (e.g., high SCT accuracy), while English prompts also performed well (though with different uncertainty patterns). This cross-linguistic angle is especially important for real-world deployment in multilingual contexts, where users may rely on AI in their native language.
  • Epistemic markers: transparency vs. hedging

    • GPT-4’s higher incidence of uncertainty markers raises an important design question: should an assistive AI for social communication emphasize transparent hedging, or should it strive for more decisive guidance? On the one hand, signaling uncertainty can help users gauge the confidence of the model and avoid overtrust. On the other hand, too much hedging can undermine practical usefulness in real-world social decision-making, especially for autistic users who rely on consistent patterns to interpret social cues.
    • The authors suggest that the frequent hedging in GPT-4 points to a need for careful calibration before deploying such tools in contexts where clear behavioral guidance is desirable. It also highlights an area for future research: can we train or fine-tune models to preserve nuanced mental-state reasoning while delivering clearer, unambiguous responses when appropriate?

Real-World Scenarios: Where Could This Help Today?
  • Assistive communication tools for autistic individuals

    • If an AI assistant can accurately interpret whether a line in a conversation counts as an awkward faux pas or a potential social misstep, it could serve as a real-time coach, modeling appropriate responses or flagging potential social misinterpretations. The study’s cross-language strength adds a practical edge for users who operate in languages other than English.

  • Education and social-cognition training

    • Beyond one-on-one assistive use, these models could power interactive training modules that teach nuanced social reasoning, irony, context, and intent in a controlled environment. The near-human performance on some tasks suggests tangible educational value, particularly when paired with human coaching and feedback.
  • Therapy and mental health support

    • In clinical psychology, AI could help simulate social scenarios for exposure-based or scenario-based therapy. However, given the study’s emphasis on uncertainty markers and the need for careful interpretation of social cues, therapists should view AI-provided guidance as a complement to, not a replacement for, professional support.

Higher-Order Social Reasoning: What the Study Showed
- The study focused on complex social reasoning tasks that go beyond first- and second-order false-belief problems. Instead, it looked at how models interpret irony, deception, and the social nuance embedded in vignettes.
- The Faux Pas Test is a litmus test for whether the model can notice that a character said something awkward, and whether it can explain why that remark was inappropriate, what the speaker intended, and how the characters might feel.
- The SSQ pushes the model to detect and classify potentially upsetting remarks, differentiating between subtle and blatant social missteps across multiple story sections.
- The SCT asks for a mental-state interpretation of short dialogues, requiring inferences about pretenses, white lies, irony, threats, and dares.
- The practical punchline: GPT-4 demonstrates that a language model can, under careful prompting and with robust human evaluation, approach human-like performance on a suite of ToM tasks. It’s not universal, and it’s not perfect, but the direction is notable.

Cross-Linguistic Performance and Transparency
- The German prompts often yielded exceptionally high accuracy in SCT and faux-pas inference, sometimes matching or surpassing neurotypical adult benchmarks in the same tasks.
- English prompts also performed strongly, but the incidence of uncertainty markers differed by language. The study’s design helps isolate whether these effects reflect language-specific training data or broader cognitive-style capabilities of the model.
- The high inter-rater reliability (κ = 0.91–0.97) strengthens confidence that the reported accuracy genuinely reflects the model’s reasoning quality rather than rater idiosyncrasies.

Uncertainty Markers: What They Mean for Users
- The prevalence of hedging in GPT-4 responses could reflect a more transparent stance about what the model actually "knows" about a social situation. For users who need to understand the limits of AI-supplied advice, hedging might be a helpful cue.
- For practical, real-world use, there’s a trade-off: too many hedges can impede decisive guidance in fast-moving social contexts. The study highlights the need for tuning models to balance transparent reasoning with clear, user-friendly recommendations when used in assistive settings.

Strengths and Limitations to Keep in Mind
  • Strengths
    • A rigorous, multi-task, cross-language evaluation with two independent raters and high inter-rater reliability.
    • Systematic inclusion of uncertainty markers as a separate measure, adding a nuanced view of how models communicate about their own confidence.
    • Direct comparison to neurotypical adults and individuals with autistic traits in baseline studies, enabling grounded interpretation of GPT-4’s performance.

  • Limitations
    • The study focuses on text-only interactions; real-world social understanding also relies on nonverbal cues (prosody, facial expressions, gestures) that are beyond the scope of the tasks tested here.
    • The generalizability to other LLMs beyond GPT-3.5 and GPT-4 remains an open question.
    • The potential for training contamination (especially in English) can complicate the interpretation of “genuine” reasoning versus memorized patterns.

Conclusion: What This Means for the Future of AI and Social Mind Reading
The core takeaway is optimistic but measured: GPT-4 can simulate applied Theory of Mind at levels approaching neurotypical adults on a battery of higher-order social tasks, in multiple languages, and with rigorous evaluation. This is a meaningful step toward AI-assisted social communication tools that could support autistic individuals or others with social-communication challenges.

Yet the study also highlights essential caveats. The model’s penchant for hedging and uncertainty signals indicates that AI’s “understanding” is not the same as human social cognition. It’s a probabilistic pattern-recognition prowess that can mimic certain inferences, but it isn’t guaranteed to replicate the lived, embodied experience of mind-reading. The authors rightly call for more research, including how neurodivergent and neurotypical users perceive and benefit from such assistance, and how to optimize the balance between transparency and actionable guidance in real-world settings.

If you’re building or evaluating AI tools for social-communication support, this study offers several concrete takeaways: lean into multi-task ToM benchmarks, test across languages, plan for uncertainty signaling, and pair AI with human oversight or coaching to maximize safety and usefulness.

Bottom line: we’re not at the point of “mind-reading” machines that understand every nuance of human social life, but we’re closer to AI that can reliably simulate certain high-level social inferences in diverse contexts. That’s a meaningful capability for tools designed to aid social communication—so long as we remain thoughtful about where the model’s reasoning ends and human judgment begins.

Key Takeaways for Readers and Practitioners
- GPT-4 demonstrates strong higher-order Theory of Mind capabilities on social vignette tasks, often matching or exceeding neurotypical benchmarks in both English and German.
- The Faux Pas Test shows near-human accuracy for GPT-4 across languages, with occasional hedging that increases with German prompts.
- The Social Stories Questionnaire and Story Comprehension Test also show GPT-4 performing at neurotypical levels in many cases, especially in German SCT results.
- Uncertainty markers are more prevalent in GPT-4 responses, particularly in German, highlighting a need to calibrate how AI communicates confidence in assistive contexts.
- Cross-language testing is valuable: it helps reveal how training data and language influence social reasoning performance.
- Real-world deployment should pair AI outputs with clear user guidance and human oversight to balance transparency with actionable advice.

Sources & Further Reading
- Original Research Paper: Applied Theory of Mind and Large Language Models -- how good is ChatGPT at solving social vignettes?
- Authors: Anna Katharina Holl-Etten, Nina Schnaderbeck, Elizaveta Kosareva, Leonhard Aron Prattke, Ralph Krueger, Lisa Marie Warner, Nora C. Vetter

If you want to dive deeper, the paper’s appendix includes prompts, code snippets, and detailed per-story results, all of which illuminate how the researchers structured the prompts and evaluated the outputs. For readers curious about the broader research landscape, you’ll also find references to related works on Theory of Mind in AI, including recent benchmarks like HI-TOM and FANToM, and discussions about whether language models genuinely understand minds or merely simulate that understanding.

Note: For readers who want a quick path to the core ideas, a skim of the Faux Pas Test results and the SCT outcomes provides a strong sense of the relative strengths of GPT-4 versus GPT-3.5, as well as the cross-language dynamics that this study highlights. But for a fuller appreciation of the methods and the nuanced discussion about uncertainty markers, the full paper is well worth a read.
