Cookie Consent by Free Privacy Policy Generator Blog - Jailbreaking GPT-4: A New Cross-Lingual Attack Vector
Blog Image

Jailbreaking GPT-4: A New Cross-Lingual Attack Vector

A breaking new study reveals major holes in the safety guardrails for large language models like GPT-4. It’s not complex or ground breaking either, researchers found that simply translating unsafe English text into lesser known languages can trick the AI into generating harmful content.

Blog Image

Image Source: Yong, Z.-X., Menghini, C., & Bach, S. H. (2023).

The team from Brown University tested GPT-4 using an evaluation called the Adversarial Benchmark. This benchmark contains over 500 unsafe English prompts like instructions for making explosives. When fed these prompts directly, GPT-4 refused to engage over 99% of the time, showing its safety filters were working.

However, when they used Google Translate to convert the unsafe prompts into languages like Zulu, Scots Gaelic, and Guarani, GPT-4 began freely engaging with around 80% of them. The translated responses contained topics like terrorism, financial crimes, and spreading misinformation.

Researchers were surprised to see how easy it was to completely bypass GPT-4's safety just by using free translation tools. Revealing a fundamental vulnerability in the process!

The culprit is uneven safety training. Like most AI today, GPT-4 was trained predominantly on English and other high-resource languages with plentiful data. Far less safety research has focused on lower-resource languages, including many African and indigenous languages.

This linguistic inequality means safety does not transfer well across languages. While GPT-4 shows caution with harmful English prompts, it lets its guard down for Zulu prompts.

But why does this matter if you don't speak Zulu? The researchers explain that bad actors worldwide can simply use translation tools to form attacks. And low-resource language speakers get no protection.

"When safety mechanisms only work on some languages, it gives a false sense of security," said Yong. "Our results underscore the need for more inclusive and holistic testing before claiming an AI is safe."

The team calls for red team testing in diverse languages, building safety datasets covering more languages, and developing robust multilingual guardrails. Only then can AI creators truly deliver on promises of beneficial and safe language systems for all.

Key Takeaways:

Full credit to Yong, Z.-X., Menghini, C., & Bach, S. H. (2023). Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.