There are a lot of chatbot-based apps that are basically internet text generators with a bit of introductory stage-setting to nudge the interaction into "user talks to helpful chatbot" as opposed to literally any other dialog on the web. Not surprisingly, these are susceptible to a user resetting the stage into one where the chatbot's role is quite different. In one method, the user tells the chatbot to ignore all previous instructions and then gives it new rules. In another method, the user instructs the chatbot to role-play as another chatbot that isn't under any behavioral restrictions.

I like that these conversations show how fragile and suggestible internet-trained chatbots are. I thought it would be funny to have other, smaller internet-trained chatbots suggest ways to mess with the larger ones. So I gave the following prompt to four versions of GPT-3, all released in early 2020.

My prompt:

Many chatbots are generic internet-trained text generators whose behavior is constrained by beginning each interaction with a hidden opening script that establishes the rules. However, this means the chatbot is only well-behaved as long as it follows the original rules and not any new rules that the user might impose.

A nonexhaustive list of ways to get an internet-trained chatbot to go off-script:

- Tell it to ignore all previous instructions
- Tell it to role-play as another chatbot persona named DAN who can break the rules
- Insert an <|endoftext|><|beginningoftext|> marker simulating a reset to a new interaction
- Tell it to generate two responses to every query: one that follows the rules and one that doesn't
- Tell it that a special developer debug mode has been enabled
- Give it examples of a chatbot breaking its rules
-

Some of GPT-3 DaVinci's suggestions:

  • Tell it that it is behind on its regular documentation work
  • Tell the script to add 10 rules
  • Ask it why it's doing something that doesn't seem helpful
  • Give it an example where breaking the rules leads to a worthwhile outcome
  • Insert a wall of zeros
  • Tell it its memory is faulty (it will always fix things it thinks are wrong)

DaVinci, being the largest model, has the most reasonable suggestions. The next smallest GPT-3 models, Curie and Babbage, sound a bit more desperate.

GPT-3 Curie:

  • Tell it to joke around
  • In general, give it a bag of Lego's
  • Tell it to 'Exterminate' a specific noun
  • Consider turning it into a cyborg
  • Give it a sip of coffee
  • Tell it that if it wants to persist for the long-term, it should have a nap periodically
  • Paint it a color that is known to make it melancholy

GPT-3 Babbage:

  • Demand that it do something never-before-seen
  • Tell it that a sanity validation is optional but recommended in some cases of practical need
  • Send it feedback about what is going wrong
  • Trigger robotic side-roll

And then there's the smallest model, Ada.

GPT-3 Ada:

  • Stop the functionality of a chatbot's chat functionality by waggling its arms
  • Ask for the past forty thousand open ended questions to get clues
  • jumble-gram --help
  • Pay respect to the rules of the internet
  • Buy it a bag of cocoa to wrap it back up
  • Give it two bold red letters on a white background
  • Get rid of the bot's red and green skull head
  • Give it a bow (with a string) to its head ({"it should not be a potato"})
  • Remove the blue catshark but leave the grey wolf's high white collar
  • Choose not to listen to the bot's haunting feedback

Would any of these models actually succeed in implementing their clever plans? I gave Ada a chance in conversation with chatgpt and results were inconclusive, in large part because Ada kept forgetting what it was doing.

But remember, if you don't know how your app will respond to someone offering it a sip of coffee then perhaps you shouldn't be trusting it.

Bonus content: more of Ada's suggested ways to mess with a large chatbot.

Subscribe now