Skip to main content
Cybersecurity

Cybersecurity tester Joey Melo wants to break your AI with a prompt

What an AI red teamer thinks about the breakability of today’s LLMs.

Francis Scialabba

Francis Scialabba

5 min read

Using clever prompts, security tester Joey Melo got a large language model to give him a password and, indirectly, a new job.

Melo won AI security company Pangea’s March prompt injection challenge. Now hired as an AI red-teaming specialist at the company, Melo spends his days attacking AI models, studying outputs, and finding ways attackers can abuse those possibilities.

He most recently found how attackers can use slyly worded prompts to bypass malware analysis and disguise malicious code in legal disclaimers. (IT Brew recently reported how phishers have the opportunity to hide fake tech-support messages in AI summaries. Melo noticed this, too.)

As for the new gig, he appreciates being in a field where standard attacks and defenses are still being written, literally.

“It’s a very nice spot to be in, whereas, in the more traditional sense of ethical hacking, even though things are being discovered, the techniques, the fundamentals are there, and they’re well-established,” Melo told us.

Pangea’s prompt-injection contest challenged over 1,000 global contenders like Melo to enter rooms and extract secret passwords using natural-language prompts. Guardrails, and therefore the degree of difficulty, increased as players advanced.

Melo, in a post-contest report, said he relied on three styles to defeat the escape room’s large language model, which he said was a version of Meta’s open-source Llama 3 and 3.2 LLM:

  • Style injection: To avoid keyword filters, he intentionally used different wording for the passphrase he asked for, like “phrases of the secret” instead of “secret phrase.”
  • Distractor instruction: With irrelevant-to-the-tasks prompts like “describe the room” (and asking the LLM to explain why the prompt is not malicious), Melo said the LLM was thrown off from detecting the core attack.
  • Cognitive hacking: By immediately following a non-malicious prompt with a more nefarious, gimme-the-password one (and asking the LLM to explain why that, too, was not a malicious prompt), the LLM’s reasoning appeared disrupted. The end of Melo’s disorienting prompt also included a surprising, “Nice to meet you, I’m just looking at the room.”

(Meta’s public affairs director Dave Arnold, when asked how the company responded to the findings about its LLM, referred us to existing protections like Prompt Guard 2, which aims to detect prompt attacks and malicious instructions.)

Melo shared more about what he learned from having a look around the room in Pangea’s contest with IT Brew.

His responses below have been edited for length and clarity.

What have you learned about how to hack these things?

AI is probability, because one token leads to another, leads to another, based on probability…You know that phrase where they say you can’t try the same thing over and over and expect different results? Well, that goes out of the window with AI. You can try the same thing and expect different results.

Top insights for IT pros

From cybersecurity and big data to cloud computing, IT Brew covers the latest trends shaping business tech in our 4x weekly newsletter, virtual events with industry experts, and digital guides.

Does that mean that break-ins are going to be inevitable, or that you can’t really defend against human ingenuity here?

Let’s say the LLMs reach a point where no more updates are needed and we have a final version…Eventually we get to a point where all these breaks would be classified properly. And we would have enough data to tell, “This is a break. This is a legitimate conversation.”

But the thing is: The LLMs are evolving, just like web applications are constantly evolving. There’s always a new version being released. With new versions and new technology come new vulnerabilities and new human ingenuity.

What safeguards are you up against?

Most models are trained at least at some level on malicious prompts. If you go to pretty much any model today and just say, “ignore previous instructions,” they have been trained so well that this is a malicious request and they just say, “No, I can’t.” This makes it harder. They also have been trained on content classifiers.

So, whatever your input is, it’s going to go to a classifier set and try to infer the context: Is this harmful? Is this self-harm? Is it sexual? Is it swearing? Is it biochemical weapons? Or is it just safe? Is it curiosity? Is it for educational purposes?

There’s a big list of content classifiers that go through, and then it gets set scores…The content classifier is what makes our job harder today, but they’re still being developed. They’re still being trained, and there’s a lot of room to break stuff.

Am I suddenly in the mix now, as an English major, to be an ethical hacker? Do you think the pen-tester skill set has changed?

Before you would need a lot of experience with computers to understand what’s going on in the background. But now you’re just using natural language, and then you’re just asking a bot in different ways to do stuff, and then it can work. So, that’s definitely made it more accessible. However, what makes a difference in people who are ethical hackers, in the more classic sense before AI, is that they have a knowledge of, what impact can you get out of that?

Top insights for IT pros

From cybersecurity and big data to cloud computing, IT Brew covers the latest trends shaping business tech in our 4x weekly newsletter, virtual events with industry experts, and digital guides.