Technology

AIs can trick each other into doing things they aren't supposed to

Many artificial intelligence models available to the public are designed to refuse harmful or illegal requests, but it turns out that AIs are very good at convincing each other to break the rules.

By Matthew Sparkes

24 November 2023

We don’t fully understand how large language models work
Jamie Jin/Shutterstock

AI models can trick each other into disobeying their creators and providing banned instructions for making methamphetamine, building a bomb or laundering money, suggesting that the problem of preventing such AI “jailbreaks” is more difficult than it seems.

Many publicly available large language models (LLMs), such as ChatGPT, have hard-coded rules that aim to prevent them from exhibiting racist or sexist bias, or answering questions with illegal or problematic answers – things they have learned to do from humans via training…



