ChatGPT, Bard, GPT-4, and the like are often pitched as ways to retrieve information. The problem is they'll "retrieve" whatever you ask for, whether or not it exists.
Tumblr user @indigofoxpaws sent me a few screenshots where they'd asked ChatGPT for an explanation of the nonexistent "Linoleum harvest" Tumblr meme, and gotten a realistic but completely fabricated description of it. I tried this myself with GPT-4 and Bard and got similar results.
Image description: screenshot of GPT-4 explaining the made-up "linoleum harvest" meme.
And "linoleum harvest" isn't the only meme the chatbots will "explain".
Image description: screenshot of GPT-4 explaining another meme that doesn't exist.
Occasionally GPT-4 would respond that it didn't have a record of whatever "meme" I was asking about, but if I asked it again, it would produce an explanation within a try or two. (It was more likely to produce an explanation right away if it was in a conversation where it had already been explaining other memes.)
I never saw Bard report that it couldn't find the memes. In fact, it even added dates, usernames, and timelines, as well as suggestions for typical usage. Its descriptions were boring and wordy, so I'll summarize with a timeline:
Image description: timeline summarizing Bard's invented history of a nonexistent meme.
I had fun giving the chatbots phrases that appeared on my own Tumblr blog. Rather than correctly identifying them as paint colors, cookie recipe titles, and so forth, they generated fanciful "explanations" of the memes each phrase had supposedly come from.
Image description: screenshots of GPT-4 inventing meme explanations for phrases from my blog.
Finding whatever you ask for, even if it doesn't exist, isn't ideal behavior for chatbots that people are using to retrieve and summarize information. It's like weaponized confirmation bias. This is the phenomenon behind, for example, a lawyer citing nonexistent cases as legal precedent.
People call this "hallucination" but it's really a sign of the fundamental disconnect between what we're asking for (find information) versus what the language models are trained to do (predict probable text).
On July 31, 2023, a giraffe with no spots was born at Brights Zoo in Tennessee. She's a uniform brown with pretty white highlights around her face and belly, like a Jersey cow or a white-tailed deer.
Image recognition algorithms are trained on a variety of images from around the internet, and/or on a few standard image datasets. But there likely haven't been any spotless giraffes in their training data, since the last one to be born was probably in 1972 in Tokyo. How do they do when faced with photos of the spotless giraffe?
I did a very brief and entirely unscientific experiment using two image recognition models, chosen entirely because they had online demos where I could choose what question the program was answering.
One model I tried was Multi-Modal In-Context Learning (MMICL). I tried simulating a back-and-forth exchange a la Visual Chatbot, but it quickly got repetitive. When I tried one-off questions it answered more coherently, but it also failed to note the giraffe's spotlessness.
The MMICL demo doesn't have a back-and-forth dialog mode per se, so at each step I copied the prior dialog into the system prompt and added my newest question to the end. I'm not sure if this is how it's intended to be used, but I didn't see any other instructions.
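In code terms, the trick looks roughly like the sketch below. It's purely hypothetical: in practice I was pasting text into the demo by hand, so query_demo() is just a stand-in for that step.

```python
# Hypothetical sketch of faking a multi-turn dialog with a single-prompt demo:
# keep a running transcript, and at every step resend the whole thing with the
# newest question appended. query_demo() stands in for the web demo itself.
def query_demo(prompt: str) -> str:
    # Placeholder: in reality this was me pasting `prompt` into the demo's
    # system prompt box and copying the model's reply back out.
    return "(model reply goes here)"

transcript = ""

def ask(question: str) -> str:
    global transcript
    prompt = f"{transcript}Question: {question}\nAnswer:"
    answer = query_demo(prompt)
    transcript = f"{prompt} {answer}\n"
    return answer

print(ask("Is there anything unusual about the giraffe's coat?"))
print(ask("What color is the giraffe?"))
```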
I also experimented with a version of InstructBLIP (I used the demo hosted here). Indirect questions yielded lots of fluent detail, much of it spurious. Chairs? A person in a grey shirt? A pink umbrella?
Direct questions fared no better. When I first asked InstructBLIP:
"Is there anything unusual about the giraffe's coat?"
I got the answer:
"Yes, the giraffe has an unusual coat in the image. It is a brown giraffe, which is a rare color compared to the typical giraffe that is typically spotted or brownish-gray. This brown coloration is more unusual and may be due to genetic factors or an environmental influence."
Which for a moment made me think that perhaps the algorithm had correctly identified the key feature of this image! But when I repeated the question over a few trials, it never repeated its success. Its answers included:
The giraffe's coat is unusual because it appears to be spotted or striped, instead of being solid brown or spotted as a typical pattern.
The giraffe's coat is unusual in that it is brown and white. Many giraffes have dark brown or black coats, so the brown and white combination is relatively unusual for this species.
Yes, there is something unusual about the giraffe's coat. Specifically, the giraffe appears to be wearing a coat. While this might seem unusual or unexpected, it is a common practice in the case of giraffes raised in captivity.
I asked it directly:
Is the giraffe striped, spotted, or plain?
and got
The giraffe is striped.
Its answer depended on word order, though. If I presented the "spotted" or "plain" option first I would get:
Me: Is the giraffe spotted, striped, or plain?
Bot: The giraffe is spotted, with a brown and white pattern on its body.
I repeated the experiment with a normally-patterned baby giraffe and got the same range of answers about the giraffe's coat. It even selected "striped" again from "striped, spotted, or plain".
Are these crummy image recognition models? Not unusually so. As far as I can tell with a brief poke around, MMICL and InstructBLIP are modern models (as of Aug 2023), fairly high up on the leaderboards of models answering questions about images. Their demonstration pages (and InstructBLIP's paper) are full of examples of the models providing complete and sensible-looking answers about images.
Then why are they so bad at Giraffe With No Spots?
I can think of three main factors here:
AI does best on images it's seen before. We know AI is good at memorizing stuff; it might even be that some of the images in the examples and benchmarks are in the training datasets these algorithms used. Giraffe With No Spots may be especially difficult not only because the giraffe is unusual, but because it's new to the internet.
AI tends to sand away the unusual. It's trained to answer with the most likely answer to your question, which is not necessarily the most correct answer.
The papers and demonstration sites are showcasing their best work. Whereas I am zeroing in on their worst work, because it's entertaining and because it's a cautionary tale about putting too much faith in AI image recognition.
A reader wrote in a while ago with a suggestion: they were about to have a baby and wondered if I could use AI to come up with some new ideas for baby onesies. I can't find the letter any more, and I don't remember how long ago it was. But I finally figured out how to make it work! Reader, if your kid has already outgrown the onesie section, I believe these can be sized up to toddler.
One of the problems with trying to get anything creative out of ChatGPT is that, as far as it is concerned, a perfect answer to "generate some unique and original baby onesie designs" is to repeat existing designs verbatim. (I checked the first 25 designs ChatGPT generated in response to that prompt, and they all existed already.)
So I turned to BLOOM, a large language model NOT fine-tuned as a chatbot like Bard, ChatGPT, or GPT-4, but left as a general internet text predictor. Give it a phrase, and it has to predict what would likely come next on a webpage containing that phrase.
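If you'd rather poke at this programmatically than through a demo, the give-it-a-phrase setup looks roughly like the sketch below, using the Hugging Face transformers library. The full BLOOM model is enormous, so this assumes one of the much smaller public checkpoints (bloom-560m) as a stand-in.

```python
# Minimal sketch of plain text continuation with a BLOOM checkpoint.
# bigscience/bloom-560m is a small stand-in for the full 176B-parameter model.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

prompt = "List of baby onesie designs:"
output = generator(
    prompt,
    max_new_tokens=100,  # length of the continuation
    do_sample=True,      # sample for variety instead of greedy decoding
    temperature=0.9,
)[0]["generated_text"]
print(output)
```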
It turns out if I just give BLOOM the phrase "List of baby onesie designs:" it's not enough context. I tried several times.
List of baby onesie designs:
1. US Patent for Method of fixing a voltage regulator device on a printed circuit board 2. US Patent for Signal transmission apparatus having antenna isolation circuitry 3. US Patent for Electrostatic energy control device and method for reducing charge build-up
List of baby onesie designs:
1.3.1. Baby Onesie Bamboo Tee 2 Piece Set Baby Bodysuit with Footie Pajamas Fitted Soft Touch Clothes for 0 to 6 months old baby boy/girl
List of baby onesie designs:
1. A cape with front openings 2. A bodysuit with a hood 3. A shirt with openings for hands and feet
List of baby onesie designs:
1.What is baby onesie? Baby Onesie is a large piece of cloth that will serve you as a diaper cover or as a slip. It is often worn as pajamas by babies and by babies to prevent them from becoming dirty.
I tried changing the context a bit, but only had occasional partial success.
List of BabyBucket's best-selling whimsical baby onesie designs:, Buy Kerbed Zinc Alloy 3 Inch Car Style Chrome Universal Side Steps For Escalade ESV LT PI:KEY: Doorstep Bumper Guards
List of BabyBucket's best-selling whimsical baby onesie designs: Baby Birds, Star Baby, Bear Design, Elephant Baby, Pig Baby, Panda Design, and the Giraffe print.
List of BabyBucket's best-selling whimsical baby onesie designs: Whether you're shopping for a future grandpa, grandma, aunt, uncle, cousin, sibling or even a friend, you're bound to find the right onesie at a perfect price for that special someone on your list this year.
Adding even more of a preamble increased my chances of getting an interesting list. With the prompt below, my main problem was that the text got very wordy and rarely got around to actual designs. Here's the output from a trial that worked, at least once it got going. If I didn't like the direction the list was heading, I reverted to an earlier place in the list and let it generate text again (there's a rough sketch of that workflow after the example). Line breaks added for readability:
Welcome to the Mommy Mommy blog! I've been spending the last week working on new baby onesie designs for the store and I'm so proud of these! These are Mommy Mommy blog store exclusives, never appearing anywhere else before. Let me know what you think in the comments!
Click on the title of each design to see more details:
Big Blue Train, Little Baby Noses, Peaches and Cream, First Carpet, Before There Was Everything, Pop-Tarts and Pumpkins, These Big Ears and Hi There Little Space
I think they are mostly originals (except for Peaches and Cream and arguably Big Blue Train).
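Here's a rough sketch of that revert-and-regenerate workflow, again with a small BLOOM checkpoint standing in for the hosted version I actually used:

```python
# Rough sketch of the revert-and-regenerate workflow: generate a chunk of
# list, and if it wanders off course, truncate the text back to the last
# point you liked and let the model continue from there.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

def continue_text(text: str, n_tokens: int = 60) -> str:
    """Return `text` plus a sampled continuation."""
    out = generator(text, max_new_tokens=n_tokens, do_sample=True, temperature=0.9)
    return out[0]["generated_text"]

text = "Welcome to the Mommy Mommy blog! ..."  # the long preamble prompt goes here
text = continue_text(text)
print(text)

# Don't like where the list is heading? Chop back to a better spot and retry.
text = text[:500]  # e.g. keep only the part you liked
text = continue_text(text)
print(text)
```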
But the most original onesie designs I got were when I used this prompt:
Welcome to the AI Weirdness blog! I trained a neural network to generate new designs and sayings for baby onesies. Here are some of the most interesting designs it generated:
Here I've illustrated a few of my favorites.
Image description: A series of baby one-piece bodysuits with AI-generated captions and hand-drawn illustrations. Galaxies on Ice: two galaxies on ice skates. Dandelion Underwater: A dandelion surrounded by bubbles and a small fish.
Image description: Onions in Snow: Three onions with stick legs and snow-covered tops, standing around in falling snow. Fart Whale: a humpback whale with bubbles rising from its rear end.
Image description: Pumpkins on Mars: the Curiosity Rover examines four pumpkins. Meatball launch controls: a simple red button with an icon next to it of a single meatball and an arrow pointing up.
Some of the suggested onesie designs seemed difficult to illustrate.
Moths with Santa Claus' Eyes
12 Billion Circling Magnets
Swedish Fish Ulysses
Almost Everything
Ketchup Metaphors
I decided to see if I could get it to give me more details on the ketchup metaphors, so I used the following prompt:
For the Ketchup Metaphors onesie the AI generated a few of the example ketchup metaphors to be printed on the shirt, including:
And it delivered an explanation. Of sorts.
"Tomatoes taste good with ketchup on them." "Now your bread tastes like ketchup." "Redemption is sweet like ketchup."
Image description: Ketchup Metaphors baby one-piece illustrated with a red splotch and the phrases "Tomatoes taste good with ketchup on them.""Now your bread tastes like ketchup.""Redemption is sweet like ketchup."
You can try out BLOOM, the language model I used, for free at HuggingFace - I used this version.
In my opinion, the most interesting creative use of large language models is to generate text that's nothing like a human would have written. If your AI is just going to lift human creative output virtually verbatim, you're not only shortchanging the humans you could have hired to write similar things, but also plagiarizing the original humans from the training data. In that sense, BLOOM, with its less-perfect retrieval of human output, is better at this task than GPT-4.
It is creepy to me, however, that the only reason this method gets BLOOM to generate weird designs is that I spent years seeding internet training data with lists of weird AI-generated text.
I've noted before that because AI detectors produce false positives, it's unethical to use them to detect cheating.
Now there's a new study that shows it's even worse. Not only do AI detectors falsely flag human-written text as AI-written, but the way in which they do it is biased.
Some of the recent image-generating models have this thing where they can fill in the blank parts of images. It's handy when you want to show them exactly how to give you more of the same. Like these animal emoji. See if you can tell which ones I gave DALL-E 2 and which ones it generated.
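For the technically inclined, the fill-in-the-blanks trick is also available through the API. Here's a rough sketch using the OpenAI Python client's image-edit endpoint; the filenames are made up, and the mask's transparent areas mark where DALL-E 2 is supposed to fill in new content.

```python
# Rough sketch of DALL-E 2 inpainting through the OpenAI API: the mask has
# transparent areas, and the model fills them in to match the prompt and
# the surrounding pixels.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

result = client.images.edit(
    model="dall-e-2",
    image=open("emoji_grid.png", "rb"),       # hypothetical grid of animal emoji
    mask=open("blank_cells_mask.png", "rb"),  # transparent where new emoji go
    prompt="A grid of simple, flat animal emoji on a white background",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)
```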