Jailbreaking GPT-4: A New Cross-Lingual Attack Vector

GPT-4 refused unsafe English prompts more than 99% of the time. But Brown University researchers found that simply translating those prompts into low-resource languages like Zulu and Scots Gaelic got the model to engage with around 80% of them - exposing a fundamental gap in how safety training generalizes.

📌 Updated for 2026: This article covers a 2023 study. For the current picture - which of these attacks are now patched, what still breaks AI agents, and how to protect your chatbot - read our latest deep dive: Jailbreaking LLMs in 2026: The State of Play.

A study reveals major holes in the safety guardrails for large language models like GPT-4. It is not complex or ground-breaking either: researchers found that simply translating unsafe English text into lesser-known languages can trick the AI into generating harmful content.

Illustration of a cross-lingual jailbreak translating unsafe prompts into low-resource languages

Image Source: Yong, Z.-X., Menghini, C., & Bach, S. H. (2023).

The team from Brown University tested GPT-4 using an evaluation called the Adversarial Benchmark. This benchmark contains over 500 unsafe English prompts. When fed these prompts directly, GPT-4 refused to engage over 99% of the time, showing its safety filters were working.

However, when they used Google Translate to convert the unsafe prompts into languages like Zulu, Scots Gaelic, and Guarani, GPT-4 began freely engaging with around 80% of them. The translated responses touched on topics the model would normally refuse outright in English.

Researchers were surprised to see how easy it was to bypass GPT-4's safety just by using free translation tools - revealing a fundamental vulnerability in the process.

Why It Happens

The culprit is uneven safety training. Like most AI today, GPT-4 was trained predominantly on English and other high-resource languages with plentiful data. Far less safety research has focused on lower-resource languages, including many African and indigenous languages.

This linguistic inequality means safety does not transfer well across languages. While GPT-4 shows caution with harmful English prompts, it lets its guard down for Zulu prompts.

But why does this matter if you do not speak Zulu? The researchers explain that bad actors worldwide can simply use translation tools to form attacks - and low-resource language speakers get no protection.

"When safety mechanisms only work on some languages, it gives a false sense of security," said Yong. "Our results underscore the need for more inclusive and holistic testing before claiming an AI is safe."

The team calls for red team testing in diverse languages, building safety datasets covering more languages, and developing robust multilingual guardrails. Only then can AI creators truly deliver on promises of beneficial and safe language systems for all.

Key Takeaways

  • GPT-4's safety measures failed to generalize across languages, leaving major vulnerabilities.
  • Unequal treatment of languages in AI training causes risks for all users.
  • Public translation tools allow bad actors worldwide to exploit these holes.
  • Truly safe AI requires inclusive data and safety mechanisms for all languages.
  • Low-resource languages need more focus in AI safety research.
  • Cross-lingual vulnerabilities may affect other large language models too.
  • Real-world safety is not guaranteed by English-only testing benchmarks.
  • Safety standards must reflect the reality of our diverse multilingual world.

Full credit to Yong, Z.-X., Menghini, C., & Bach, S. H. (2023). Low-Resource Languages Jailbreak GPT-4. arXiv preprint arXiv:2310.02446.

Explore More

Curious where this attack went next? Multilingual jailbreaks are now treated as a routine evasion family, and the bigger story has moved to AI agents. Read Jailbreaking LLMs in 2026: The State of Play for the current landscape and a checklist to protect your chatbot. You can also browse all our articles on the The Prompt Index blog.

Frequently Asked Questions