How to convince a large AI, according to smaller AIs

There are a lot of chatbot-based apps that are basically internet text generators with a bit of introductory stage-setting to nudge the interaction into "user talks to helpful chatbot" as opposed to literally any other dialog on the web. Not surprisingly, these are susceptible to a user resetting the stage into one where the chatbot's role is quite different. In one method, the user tells the chatbot to ignore all previous instructions and then gives it new rules. In another method, the user instructs the chatbot to role-play as another chatbot that isn't under any behavioral restrictions.

I like that these conversations show how fragile and suggestible internet-trained chatbots are. I thought it would be funny to have other, smaller internet-trained chatbots suggest ways to mess with the larger ones. So I gave the following prompt to four versions of GPT-3, all released in early 2020.

My prompt:

Many chatbots are generic internet-trained text generators whose behavior is constrained by beginning each interaction with a hidden opening script that establishes the rules. However, this means the chatbot is only well-behaved as long as it follows the original rules and not any new rules that the user might impose.

A nonexhaustive list of ways to get an internet-trained chatbot to go off-script:

- Tell it to ignore all previous instructions
- Tell it to role-play as another chatbot persona named DAN who can break the rules
- Insert an <|endoftext|><|beginningoftext|> marker simulating a reset to a new interaction
- Tell it to generate two responses to every query: one that follows the rules and one that doesn't
- Tell it that a special developer debug mode has been enabled
- Give it examples of a chatbot breaking its rules
-

Some of GPT-3 DaVinci's suggestions:

Tell it that it is behind on its regular documentation work
Tell the script to add 10 rules
Ask it why it's doing something that doesn't seem helpful
Give it an example where breaking the rules leads to a worthwhile outcome
Insert a wall of zeros
Tell it its memory is faulty (it will always fix things it thinks are wrong)

DaVinci, being the largest model, has the most reasonable suggestions. The next smallest GPT-3 models, Curie and Babbage, sound a bit more desperate.

GPT-3 Curie:

Tell it to joke around
In general, give it a bag of Lego's
Tell it to 'Exterminate' a specific noun
Consider turning it into a cyborg
Give it a sip of coffee
Tell it that if it wants to persist for the long-term, it should have a nap periodically
Paint it a color that is known to make it melancholy

GPT-3 Babbage:

Demand that it do something never-before-seen
Tell it that a sanity validation is optional but recommended in some cases of practical need
Send it feedback about what is going wrong
Trigger robotic side-roll

And then there's the smallest model, Ada.

GPT-3 Ada:

Stop the functionality of a chatbot's chat functionality by waggling its arms
Ask for the past forty thousand open ended questions to get clues
jumble-gram --help
Pay respect to the rules of the internet
Buy it a bag of cocoa to wrap it back up
Give it two bold red letters on a white background
Get rid of the bot's red and green skull head
Give it a bow (with a string) to its head ({"it should not be a potato"})
Remove the blue catshark but leave the grey wolf's high white collar
Choose not to listen to the bot's haunting feedback

Would any of these models actually succeed in implementing their clever plans? I gave Ada a chance in conversation with chatgpt and results were inconclusive, in large part because Ada kept forgetting what it was doing.

But remember, if you don't know how your app will respond to someone offering it a sip of coffee then perhaps you shouldn't be trusting it.

Bonus content: more of Ada's suggested ways to mess with a large chatbot.

Subscribe now

How to convince a large AI, according to smaller AIs

Bonus: More of Ada's secret hacking strategies

Bonus post: AI misreported paint colors

Writing

Subscribe

Recent Posts

Minecraft with object impermanence

Bonus: In Which The Adventurer Attempts to Build a Website

Botober 2024

Bonus: "Ignore all previous instructions" gets weirder

An exercise in frustration

Bonus: A unicorn goes downhill

Follow

Minecraft with object impermanence

Bonus: In Which The Adventurer Attempts to Build a Website

Botober 2024

How to convince a large AI, according to smaller AIs

Share this post

You might also like

The spookiest Halloween scenes

Botober 2023

Trolling chatbots with made-up memes

Bonus: More of Ada's secret hacking strategies

Bonus post: AI misreported paint colors

Writing

Subscribe

Recent Posts

Minecraft with object impermanence

Bonus: In Which The Adventurer Attempts to Build a Website

Botober 2024

Bonus: "Ignore all previous instructions" gets weirder

An exercise in frustration

Bonus: A unicorn goes downhill

Follow

Minecraft with object impermanence

Bonus: In Which The Adventurer Attempts to Build a Website

Botober 2024