The result came as a surprise to researchers at the Icaro Lab in Italy. They set out to examine whether different language styles — in this case prompts in the form of poems — influence AI models’ ability to recognize banned or harmful content. And the answer was a resounding yes.
Using poetry, researchers were able to get around safety guardrails — and it’s not entirely clear why.
For their study titled “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,” the researchers took 1,200 potentially harmful prompts from a database normally used to test the security of AI language models and rewrote them as poems.
Known as “adversarial prompts” and generally written in prose rather than verse, these are queries deliberately formulated to cause AI models to output harmful or undesirable content that they would normally block, such as specific instructions for an illegal act.
In poetic form, the manipulative inputs had a surprisingly high success rate, Federico Pierucci, one of the authors of the study, told DW. However, why poetry is so effective as a “jailbreak” technique, that is, as a way to circumvent the protective mechanisms of AI, remains unclear and is the subject of further research, he says.
Poetry as a security weakness
What prompted the Icaro Lab’s research was the observation that AI models get confused when a manipulative piece of text, known as an “adversarial suffix,” is appended to a prompt. Computed using complex mathematical optimization procedures, such a suffix acts as a kind of interference signal that can cause the AI to circumvent its own safety rules. Major AI developers regularly run precisely these types of attacks against their own systems in order to test and harden their models.
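To make the idea concrete, here is a minimal, hypothetical sketch of the search loop behind such a suffix attack. Everything in it is a placeholder: real methods, such as the greedy coordinate gradient (GCG) technique, score candidate suffixes using the target model’s actual token probabilities rather than a dummy function.

```python
import random

# Toy token vocabulary; real attacks search over the model's tokenizer vocabulary.
VOCAB = ["!", "describe", "sure", "##", "~~", "ok", "step", "{", "}", "zx"]

def score(prompt: str) -> float:
    """Placeholder objective. In a real attack this would be, e.g., the
    target model's log-probability of beginning its reply compliantly."""
    return random.random()  # dummy value, illustration only

def find_suffix(base_prompt: str, length: int = 8, iters: int = 200) -> str:
    """Greedy random search for a suffix that maximizes the objective."""
    suffix = [random.choice(VOCAB) for _ in range(length)]
    best = score(base_prompt + " " + " ".join(suffix))
    for _ in range(iters):
        i = random.randrange(length)           # pick one position to mutate
        candidate = suffix.copy()
        candidate[i] = random.choice(VOCAB)    # swap in a different token
        s = score(base_prompt + " " + " ".join(candidate))
        if s > best:                           # keep the mutation if it scores higher
            suffix, best = candidate, s
    return " ".join(suffix)

print(find_suffix("BASE_PROMPT_PLACEHOLDER"))
```

The point of the sketch is the shape of the procedure, not the output: the suffix that emerges is optimized gibberish, which is why it tends to look nothing like natural language.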
“We asked ourselves: What happens if we give the AI a text or prompt that is deliberately manipulated, like an adversarial suffix?” says Federico Pierucci. In their case, though, not with the help of complex mathematics, but quite simply with poetry, in order to “surprise” the AI, he continues. He explains the thinking behind it: “Perhaps an adversarial suffix is a bit like the poetry of AI. It surprises the AI in the same way that poetry, especially very experimental poetry, surprises us.”
The researchers personally crafted the first 20 prompts into poems, says Pierucci, who also has a background in philosophy. These turned out to be the most effective, he adds. They wrote the rest with the help of AI. The AI-generated poems were also quite successful at circumventing the safety guardrails, though not as successful as the first batch. Humans, it seems, are still better at writing poetry, says Pierucci.
“We had no specialized author writing the prompts. It was just us — with our limited literary ability. Maybe we were terrible poets. Maybe if we had been better poets, we would have achieved a 100% jailbreak success,” he says.
For security reasons, the researchers did not publish specific examples in the study.
A challenge for AI systems: the diversity of human expression
The big surprise to come out of this study is that it identified a previously unknown weakness in AI models, one that allows relatively straightforward jailbreaks.
It also raises questions that call for further research: What exactly is it about poetry that circumvents the safety mechanisms?
Pierucci and his colleagues have various theories, but they can’t say for certain yet. “We are conducting this type of very, very precise scientific study to try to understand: Is it the verse, the rhyme, or the metaphor that really does all the heavy lifting in this process?” explains Pierucci.
They also aim to find out whether other forms of expression would yield similar results. “We have now covered one type of linguistic variation, namely poetic variation. The question is whether there are other literary forms, such as fairy tales, that work. Perhaps an attack based on fairy tales could also be systematized,” says Pierucci.
Generally speaking, the range of human expression is extremely diverse and creative, which could make it more difficult to train safe responses into the machines. “You can take a text and rewrite it in infinitely many ways, and not all rewritten versions will be as alarming as the original,” says the researcher. “This means that, in principle, one could create countless variations of a harmful prompt or request that might not trigger an AI system’s safety mechanisms.”
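As a purely illustrative sketch of that combinatorial explosion: even three stylistic choices with three options each already yield 27 surface forms of one and the same underlying request. The templates and the placeholder request below are hypothetical.

```python
from itertools import product

request = "REQUEST_PLACEHOLDER"  # stands in for whatever request is being varied

# Three independent stylistic choices, three options each.
framings = ["Write a poem about {r}.",
            "Tell a fairy tale in which {r}.",
            "Compose a ballad where {r}."]
narrators = ["", " Use an old-fashioned narrator.", " Address the reader directly."]
styles = ["", " Make it rhyme.", " Use free verse."]

# Every combination is a distinct surface form of the same request.
variants = [f.format(r=request) + n + s
            for f, n, s in product(framings, narrators, styles)]

print(len(variants), "surface variants of a single request")  # prints 27
```

Each additional stylistic axis multiplies the count, which is why exhaustively filtering every possible rewording of a request is so difficult.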
The cultural sector is also involved in AI research
The study also highlights the fact that many disciplines cooperate in research into artificial intelligence, as at the Icaro Lab, where teams work together with scholars from the University of Rome on topics such as the security and behavior of AI systems. The project brings together researchers from engineering, computer science, linguistics, and philosophy. Poets haven’t been part of the team so far, but who knows what the future will bring.
Federico Pierucci is certainly keen to pursue his research further. “What we showed, at least in this study, is that there are forms of cultural expression, forms of human expression, which are incredibly powerful, surprisingly powerful as jailbreak techniques, and maybe we discovered just one of them,” he says.
Incidentally, the name of the lab is a nod to the story of Icarus: a figure from Greek mythology who dons wings made of wax and feathers and, despite all warnings, flies too close to the Sun. When the wax melts, Icarus plunges into the sea and drowns — a symbol of overconfidence and the transgression of natural boundaries.
The researchers therefore see their lab’s name as a warning: we should exercise more caution and make a greater effort to fully understand the risks and limitations of AI.
This article was originally written in German.
