Persian Twitter Incivility: ParsBERT vs GPT During the Mahsa Amini Movement

This post compares ParsBERT, GPT-based models, and human qualitative coding for identifying incivility on Persian Twitter during the Mahsa Amini movement. It shows that traditional Persian NLP often outperforms large language models at detecting nuanced incivility, with practical implications for researchers and platforms.

Introduction

If you’ve ever wondered how researchers separate signal from noise in the wild world of social media, this new study sheds light—especially for Persian-language content during a high-stakes political moment. The paper, based on a dataset from the Mahsa Amini movement on Persian Twitter, pits three approaches against each other to detect incivility and hate speech: human qualitative coding, ParsBERT (a Persian-focused BERT model), and various ChatGPT configurations. The core revelation? Traditional supervised models like ParsBERT still outperform ChatGPT-style large language models on nuanced hate speech detection in Persian, particularly for more implicit forms of incivility.

This synthesis draws on the original paper, “Old wine in old glasses: Comparing computational and qualitative methods in identifying incivility on Persian Twitter during the #MahsaAmini movement” (linked here for reference). The study analyzes 47,278 tweets collected between September 15 and November 15, 2022, and asks tough questions about the strengths and limits of cutting-edge AI tools when the target language isn’t English and the content sits in a sensitive sociopolitical moment. The comparison isn’t merely academic; it matters for platforms, policymakers, and researchers who want reliable signals from social media to understand online harm in low-resource languages.


Why This Matters

  • Immediate relevance: Deciding what gets labeled as uncivil or hateful on Persian Twitter isn’t just a taxonomy exercise. It shapes moderation policies, informs researchers about cultural nuance, and influences public discourse during protests and political movements. In a language like Persian (Farsi), data sparsity and nuanced rhetorical forms make it harder for “one-size-fits-all” AI to do the job well.
  • Real-world scenario: If a platform wants to monitor hate speech in real time during a protest wave, should they rely on a powerful but opaque large language model, or on a language-specific, carefully tuned model trained with high-quality Persian data? This study leans toward the latter for the task of detecting incivility, at least given the dataset and setup they examined.
  • Building on AI research: The paper sits at the intersection of older, well-understood supervised approaches (like ParsBERT) and the newer waves of LLMs (ChatGPT variants). It reinforces a recurring theme in AI ethics and NLP: newer tools aren’t automatically better for every job, especially in low-resource languages and for nuanced linguistic phenomena like implicit incivility. The work also echoes concerns about prompt language, model contamination, and the value (and limits) of zero-shot or few-shot labeling in niche languages.
  • Takeaway for practitioners: For Persian hate-speech detection, a strong, well-balanced dataset and a reliable monolingual model (ParsBERT) can outperform current ChatGPT configurations. Yet the study doesn’t throw the baby out with the bathwater—LLMs show relative strength in detecting implicit incivility, and there are scenarios where their broader capabilities could be useful in a mixed-methods workflow.

If you want to dive deeper, you can check out the original study here: Old wine in old glasses: ....


Main Content

The Study at a Glance

  • What they compared: Three approaches to detecting incivility in Persian tweets during the Mahsa Amini movement:
    • Human qualitative coding (SM-CDS framework)
    • ParsBERT, a supervised transformer model specialized for Persian
    • Large language models (ChatGPT variants), tested with prompts in English and Farsi
  • Data scope: 47,278 tweets collected via Twitter Academic API from September 15 to November 15, 2022, capturing the heat of a nationwide protest.
  • Incivility taxonomy: Rather than a binary hateful/non-hateful split, the study dissects incivility into Pejorative Speech (PS), Insult, and Threatening Messages (TM), with a separate Level of Implicity (LoI) rating on a 3-point scale from highly implicit to explicit (a minimal label-schema sketch follows this list).
  • Key finding at a glance: ParsBERT outperforms seven ChatGPT models in hate speech detection. ChatGPT struggles with subtle, discursive forms of incivility and with explicit content alike. Language used in prompts (Farsi vs English) did not produce consistent performance gains.
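
To keep these categories straight in the sketches later in this post, here is a minimal, hypothetical encoding of the taxonomy; the exact names and the numeric direction of the LoI scale are assumptions for illustration, not the authors’ coding scheme.

```python
# Hypothetical label schema mirroring the study's taxonomy (illustrative only).
from enum import Enum

class IncivilityType(Enum):
    NEUTRAL = "Neutral"              # no incivility detected
    PEJORATIVE_SPEECH = "PS"         # pejorative framing of people or ideas
    INSULT = "Insult"                # direct insults
    THREATENING_MESSAGE = "TM"       # threats of harm

class LevelOfImplicity(Enum):
    EXPLICIT = 1                     # openly uncivil wording
    MODERATE = 2                     # partly veiled
    HIGHLY_IMPLICIT = 3              # rhetorical, symbolic, or discursive
```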

The study’s aim is not only to compare accuracy numbers; it’s also about understanding where each method shines or falters. For instance, although LLMs can generate nuanced interpretations, when the goal is to categorize explicit hate or incivility in Persian text, a tailored Persian model trained with careful labeling tends to be more reliable.

Methods and Data

  • Human coding approach: Five coders used a discursive analysis method (SM-CDS) to examine tweet text and metadata (e.g., images, videos, author context). They labeled two variables: Incivility Type and Level of Implicity (LoI). Intercoder reliability improved from F1-scores in the mid-0.70s to the mid-0.80s after discussion and consensus processes.
  • ParsBERT: A monolingual Persian BERT model pre-trained on tens of millions of Persian documents. The researchers built training data from the main dataset and experimented with weak labeling to balance the data. In the first pass (research sample A), ParsBERT achieved an acceptable macro-F1 around 0.64–0.69 across classes, with particular strength in Neutral/Insult/PS but weakness on TM (F1 around 0.4) due to fewer TM examples.
    • Balancing via weak labeling: They augmented training data by applying the model to 100,000 unseen tweets to harvest more TM annotations, then added 4,085 high-quality TM examples back to the training set. This balanced dataset (research sample B) raised ParsBERT’s macro-F1 to about 0.78 (a minimal fine-tune-and-harvest sketch follows this list).
  • ChatGPT experiments: They tested seven ChatGPT variations (with English and Persian prompts) on a 2,086-tweet subset drawn from the larger sample A, tracking classification performance (F1-score), token usage, and time. The models were asked to categorize incivility type and LoI and to provide a short rationale (an illustrative prompting sketch also follows this list).
    • Efficiency and cost: Token counts were substantial, and some models were expensive (notably the GPT-4.1 family). They also noted that multimodal context (images/videos) wasn’t always accessible to the models, which sometimes led to misclassifications when the text alone didn’t reveal the full context.
    • Performance: The best ChatGPT models (notably GPT-4.1 English and related 4.1 variants) achieved F1-scores around 0.56–0.57 for hate speech (HS) detection, which lagged behind ParsBERT’s well-balanced performance. For implicit LoI detection, ChatGPT models hovered around 0.60, showing relatively better capability here but still not surpassing a tuned ParsBERT setup.
  • Multilingual prompts: The study explicitly tested English vs Farsi prompts to see if language of instruction influenced results. Across models, there were no consistent advantages for one language over the other.
  • Data transparency: The authors provide a public supplement with code and data and note that the dataset supporting their findings will be publicly available upon publication.
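
To make the ParsBERT pipeline concrete, here is a minimal fine-tune-and-harvest sketch using the Hugging Face transformers and datasets libraries. The checkpoint name, hyperparameters, placeholder data, and 0.9 confidence threshold are assumptions for illustration; the authors’ released code and settings may differ.

```python
# Minimal sketch (not the authors' code): fine-tune ParsBERT for 4-class
# incivility detection, then harvest high-confidence Threatening Message (TM)
# predictions from unlabeled tweets (weak labeling) to rebalance the data.
import torch
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["Neutral", "PS", "Insult", "TM"]
CHECKPOINT = "HooshvareLab/bert-fa-base-uncased"  # a public ParsBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=len(LABELS))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

# Placeholder rows; in practice these come from the hand-coded tweet sample.
train_ds = Dataset.from_dict({"text": ["نمونه توییت ۱", "نمونه توییت ۲"],
                              "label": [0, 1]}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="parsbert-incivility",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()

# Weak labeling: keep only confident TM predictions from unseen tweets.
def harvest_tm(tweets, threshold=0.9):
    enc = tokenizer(tweets, truncation=True, padding=True,
                    max_length=128, return_tensors="pt")
    enc = {k: v.to(model.device) for k, v in enc.items()}
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    tm_idx = LABELS.index("TM")
    return [t for t, p in zip(tweets, probs[:, tm_idx].tolist())
            if p >= threshold]
```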
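
On the LLM side, the sketch below shows one way a GPT-4.1-class model could be asked to assign an incivility type, an LoI rating, and a short rationale. The prompt wording, JSON output format, and model name are illustrative assumptions; the paper’s actual English and Farsi prompts are not reproduced here.

```python
# Illustrative LLM labeling sketch (assumed prompt and schema, not the
# authors' exact prompts): request incivility type, Level of Implicity (LoI),
# and a short rationale as JSON.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "You are annotating Persian tweets for incivility. "
    "Classify the tweet as one of: Neutral, Pejorative Speech (PS), "
    "Insult, Threatening Message (TM). Rate Level of Implicity (LoI) "
    "from 1 (explicit) to 3 (highly implicit). "
    'Reply only with JSON: {"type": ..., "loi": ..., "rationale": ...}.'
)
# The study also tested Farsi instructions; swapping SYSTEM_PROMPT for a
# Persian translation is one way to probe prompt-language sensitivity.

def label_tweet(tweet_text: str, model: str = "gpt-4.1") -> dict:
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": tweet_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```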

For a deeper sense of the numbers, the study reports (the macro vs. weighted F1 distinction is sketched right after these figures):
- Table 1: 68.5% of tweets were not uncivil; 31.5% contained hateful or uncivil content (PS, Insult, TM).
- Table 2: Distribution of Level of Implicity (LoI) shows most incivility is explicit or only moderately implicit, with large class imbalances shaping model training dynamics.
- ParsBERT performance improved from macro-F1 around 0.64–0.69 to 0.78 after balancing via weak labeling (Table 4).
- ChatGPT models generally trailed ParsBERT in HS detection (Table 6), with F1-scores around the mid-0.5s for many big models.
- Implicit incivility: ChatGPT models achieved weighted F1-scores in the 0.55–0.62 range (Table 8), indicating more success identifying implicit content than explicit incivility, but still not at ParsBERT-level reliability.
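
As a quick reminder of how these two scores differ: macro-F1 averages the per-class F1 values equally, so a rare class like TM counts as much as Neutral, while weighted F1 scales each class by its support. A minimal sketch with made-up labels:

```python
# Illustrative only: made-up labels, not the study's evaluation data.
from sklearn.metrics import f1_score

y_true = ["Neutral", "PS", "Insult", "TM", "Neutral", "PS"]
y_pred = ["Neutral", "PS", "Neutral", "TM", "Neutral", "Insult"]

macro_f1 = f1_score(y_true, y_pred, average="macro")        # classes weighted equally
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # classes weighted by support
print(f"macro-F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```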

A notable takeaway from the error analysis: ChatGPT models struggled most with PS, the label that requires the most subtle, discursive interpretation. In some cases, the model labeled a clearly pejorative or insulting tweet as Neutral, especially when the accompanying image or video carried the real incivility not captured by the text alone. This underlines a key limitation of text-only encodings and strengthens the case for truly multimodal analysis.

For readers curious about the specifics, the paper also includes confusion matrices for the targeted models (Appendix 3) and concrete examples where models misclassified, helping readers understand where the wall of error tends to sit.

Findings in Depth

  • Incivility distribution and types: The data show a sizable portion of Persian tweets during the Mahsa Amini protest were not uncivil, while roughly a third contained uncivil content. Among uncivil tweets, pejorative speech and insults were more common than outright threats. This nuance matters: many incivility signals are not just about crude words, but about how ideas or regimes are portrayed, which requires careful contextual reading.
  • Implicity matters but varies by category: The study’s emphasis on Level of Implicity (LoI) reveals that explicit uncivil messages are easier to classify with standard models, while implicit forms, which are more rhetorical, indirect, or symbolic, pose harder challenges. ChatGPT models showed relatively better performance on implicit incivility than on explicit hate, but they still fell short of the tuned ParsBERT approach.
  • Language prompts in ChatGPT: Whether prompts were in Farsi or English, the results did not show a clear or consistent edge for one language. This finding is useful for researchers who might deploy multilingual prompts to avoid bias or to test cross-lingual robustness.
  • Visual context and metadata: Coders used not just text but also images and videos, which often carried decodable incivility not present in the caption. The models, limited to text, sometimes missed this. The authors suggest that multimodal analyses could improve accuracy, but for the Persian incivility task studied here, text-focused models—when well-trained—held the edge.
  • Practical takeaway about tools: ParsBERT, especially when trained on a balanced dataset, delivered the most reliable HS detection in Persian tweets. Large language models, while powerful, did not outperform a language-specific, well-tuned transformer for this particular task and data domain.

The study’s discussion also situates these findings within broader debates about LLM reliability in annotation tasks. While some research touts LLMs as capable zero-shot annotators, this work corroborates concerns about contamination (training data leakage) and the risk of overestimating performance when benchmarks don’t capture nuanced, culturally specific language uses.

If you want to see a broader frame on these issues, you can revisit the original paper here: Old wine in old glasses: ....

Practical Implications and Limitations

  • For practitioners shaping moderation pipelines in Persian, the results advocate for investing in language-specific, carefully labeled datasets and models like ParsBERT, rather than counting on generic LLMs for nuanced incivility detection.
  • LLMs aren’t useless here; they show relative strength in detecting implicit, discursive forms of incivility. A hybrid workflow, in which LLMs propose candidate labels that are then vetted by a ParsBERT-based classifier or by human coders, could balance speed and accuracy, especially in large-scale monitoring (a routing sketch follows this list).
  • Data modalities matter: Relying solely on text underestimates incivility when users rely on embedded imagery or video. Multimodal pipelines remain a promising frontier for Persian incivility and hate speech detection.
  • Resource and cost considerations: The study highlights that newer GPT-4.x models can be expensive and, in this domain, did not yield proportionally better results than a well-tuned ParsBERT setup. Cost-aware deployment matters in real-world moderation environments.
  • Gaps and future work: The authors note a limitation in integrating LoI detection for ParsBERT due to data imbalance and call for more balanced training datasets to bring LoI performance up for all models. They also advocate for broader cross-linguistic studies and inclusion of non-text data.
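
One way to operationalize the hybrid workflow suggested above is a simple routing rule: the LLM proposes a label, a tuned Persian classifier scores the same tweet, and disagreements or low-confidence cases go to human review. The function signatures and the 0.8 threshold below are assumptions for illustration, not a pipeline from the paper.

```python
# Hypothetical hybrid moderation routing (not from the paper): combine an
# LLM's proposed label with a Persian classifier's prediction and confidence;
# send disagreements or low-confidence cases to human review.
from typing import Callable, Optional, Tuple

def route(tweet: str,
          llm_label: Callable[[str], str],
          parsbert_label: Callable[[str], Tuple[str, float]],
          min_confidence: float = 0.8) -> dict:
    proposed = llm_label(tweet)                    # LLM candidate label
    predicted, confidence = parsbert_label(tweet)  # classifier label + probability
    needs_review = (proposed != predicted) or (confidence < min_confidence)
    final: Optional[str] = None if needs_review else predicted
    return {
        "tweet": tweet,
        "llm_label": proposed,
        "classifier_label": predicted,
        "final_label": final,          # None until a human resolves the case
        "needs_human_review": needs_review,
    }
```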

Taken together, the study is a robust reminder that “the best tool” depends on language, genre, and the granular definition of incivility you’re trying to capture. In Persian, at least for the Mahsa Amini dataset and the three-category incivility framework used, ParsBERT—especially with thoughtful data curation—delivers clearer, more actionable results.


Key Takeaways

  • ParsBERT beat seven ChatGPT variants on Persian hate speech detection in the Mahsa Amini Twitter dataset, especially after balancing training data with weak labeling.
  • For explicit incivility, ChatGPT models generally struggled more than ParsBERT; for implicit incivility, ChatGPT performed somewhat better, but still did not surpass tuned ParsBERT outcomes.
  • Language of the prompt (English vs. Farsi) showed no consistent advantage; model performance was more sensitive to data balance and the model architecture than to prompt language.
  • Multimodal context (images/videos) can carry incivility beyond the text; relying solely on text may miss important cues. Future work should push toward stronger multimodal analysis.
  • The research contributes a large, public Persian incivility dataset and a careful, discursive, human-grounded baseline that other researchers can reuse to push the field forward in low-resource languages.

Practical implications:
- If you’re building a Persian-language content moderation tool today, start with ParsBERT and a well-labeled, balanced dataset. Consider a hybrid approach that uses LLMs for candidate labeling only, with human validation or a strong Persian classifier for final decisions.
- For researchers, the study highlights the value of multi-method comparisons, especially in underrepresented languages. It also invites more work on implicit incivility and multimodal signals.


Sources & Further Reading

If you’d like to explore more about the broader landscape of hate speech detection, LLMs, and the challenges of low-resource languages, there’s a rich set of foundational and contemporary works cited in the study, from early hate-speech taxonomy debates to the latest in multimodal and multilingual NLP. The field is moving quickly, and this paper offers a pragmatic, in-the-trenches view of what works (and what doesn’t) for Persian-language incivility detection today.
