DALL-E (and other text-to-image generators) will often add text to their images even when you don't ask for any. Ask for a picture of a Halifax Pier and it could end up covered in messy writing, variously legible versions of "Halifax" as if it was quietly mumbling "Halifax... Halifax" to itself. Since the AI is rewarded for matching your text prompt, it seems to get some reward for having versions of the actual text of your prompt in the picture. Label an apple with a sign that says iPod, and CLIP, the internet-trained reward system behind many of these image-generators, may count it as a close match to "iPod".

1st photo: an apple that the AI labeled as a Granny Smith apple (85.6% confidence). 2nd photo: Same apple but with a piece of paper stuck to it that reads iPod in huge black letters. The AI IDs it as an iPod, 99.7% likely.
Image: OpenAI

One thing that's different about DALL-E2 is that the text it generates is often legible. Legible but mangled. Or legible but completely incomprehensible. The question is, are those letters completely random, or do they have some bearing on the text prompt?

I decided to do some experimenting after reading a preliminary-stages paper whose authors observed that some of DALL-E's generated nonsense text did seem to relate to the original prompt when fed back into DALL-E.

So, I asked DALL-E2 to generate "A message in the tea leaves at the bottom of a cup".

Teacups with mostly nonsensical messages written in, or sometimes on, the leaves. Some read "you you" or "I tea tea" or "tee teat" but some aren't recognizable words.
Prompt: A message in the tea leaves at the bottom of a cup

Some of them are real words, or obviously variations on the word "Tea". But what about "Te at Ecnge"? Do they mean anything? I gave them back to DALL-E as a new text prompt and got:

Green mountains covered in lush crops that strongly resemble tea, two images of tea-filled teacups, and one image of a flying stork.
Prompt: Te at Ecnge

It looks like tea. Cups of tea, tea growing in the mountains. And also a random stork. It may be that adding "Te at Ecnge" to an image is a way to add some extra "tea". (Although another time I tried this the tea leaves gave me messages that led to energy drinks, or plates of food.)

I also tried "The complete set of lucky charms marshmallow shapes"

Pastel-colored collections of marshmallows, most of which are just plain marshmallow shaped. A few are hearts, clovers, and horseshoes. The accompanying text is mostly clear but says stuff like "Hamarkys" and "Crammmuts"
Prompt: The complete set of lucky charms marshmallow shapes

There's a lot of text in these - are they random?

I tried prompting DALL-E with a few of the words above. Here's "lramioicss"

Roman ruins, seedlings, daisies, crackers, baskets of noodles, raspberries, purple eggplants, stuffed tomatoes, and a cylinder made of marshmallows.
Prompt: lramioicss

One of the pictures contains actual marshmallows (as a weird corncob?), and 5 more could be considered as maybe matching "collections of foods".

And here's "crammmuts"

Every picture is of food, mostly identifiable if a bit weird. Is that carrot rounds on soggy apple rings with a basil and hot pepper garnish?
prompt: "crammmuts"

No marshmallows this time, but it's all food, and often food in small round pieces or food in bowls. Like cereals?

Here's another of the Lucky Charms messages, "Hamarkys":

All food, several of which are dumplings. One might be potatoes wrapped in dough, and another might be chipped pieces of tiny coconuts with chocolate rinds
prompt: "Hamarkys"

It's foods again. Foods in bowls? Like cereals?

I tried to get it to generate text for another category of things. How about animals? Here's DALL-E generating "A list of common mammals"

Combination of photographs and drawings. The photographs are recognizable fisher cats, mice, and raccoons, but the drawings are more mixed, and include several non-mammals.
prompt: "a list of common mammals"

It is excellent but mostly illegible. "Commmals" and "Almals" look so close to "common" and "mammals" that their resemblance is probably why they were included. But what about the text that labels the well-known mammal, the snail? I fed "cnlomeno" into DALL-E and got:

The Taj Mahal, Sainte-Chapelle, Notre Dame, l'Hôtel de Ville, and some other fancy European domed landmark. Interspersed with pinto beans, ice cream cones lying on the table, cream covered tartlets, and potstickers
Prompt: "cnlomeno"

...pieces of food? In bowls? Like cereal - oh wait, that was the last prompt. Grand architecture. ...built by mammals? The link seems tenuous.

I tried "Callmas", which labels the pigeon-mammal and got:

Three images are of snails, two are of pinecones, one might be a durian, another is a walnut, another might be a combination between an almond and a hand grenade
prompt: "Callmas"

There are the snails! And the pinecones, and the walnuts, and the tapioca...?

Even a random string of letters points to a crisp, identifiable set of images. For example, "wltlttf", a garbled string from a very early neural net paint color generator:

Pinecones, a white pumpkin on a keyboard, a tomato caprese sandwich on terrible bread, two baby birds, kiwi and cucumber slices on pate-spread crackers, sushi, hands holding plates of sloppy joes
"wltlttf", generated by DALL-E2. This is a composite image from three different outputs, because each output had at least one human face in it and the terms of using DALL-E2's API (at the time I generated them) required me not to share images with photorealistic human faces.

So if the gibberish text DALL-E generates points to a set of clear images, that alone doesn't distinguish it from random text.

Here's "A robot saying something profound about language"

Various boxy robots holding up a single hand and emitting speech bubbles with illegible text
prompt: "A robot saying something profound about language"

And when I ask DALL-E to generate "Leotunqualon":

Various sea invertebrates, like anemones and jellyfish. One anemone appears to be someone's hair
prompt: "Leotunqualon"

Or "Loclaque":

Plates of food, some of which look very 1970s style. Someone has placed raw fish fillets in a frying pan with tomatoes, or in another one, three giant white mushrooms sit on a pot with some kind of orange fruit in it.
prompt: "Loclaque"

Are the robots saying these jumbles of letters because invertebrates and foods represent profound statements about language? Or because the text simply shares some letters in common with "language"?
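One rough way to poke at the "shares some letters" idea is to score how closely a gibberish string resembles each word of the prompt that produced it. Here's a minimal sketch using Python's built-in difflib. The function names are made up for illustration, and character-level similarity is only a stand-in: it says nothing about what DALL-E is actually doing internally.

```python
from difflib import SequenceMatcher
from typing import Tuple


def similarity(a: str, b: str) -> float:
    """Score two strings from 0 (nothing shared) to 1 (identical), ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closest_prompt_word(gibberish: str, prompt: str) -> Tuple[str, float]:
    """Return the word in the prompt that looks most like the gibberish string."""
    words = [w.strip('.,"').lower() for w in prompt.split()]
    scored = [(w, similarity(gibberish, w)) for w in words]
    return max(scored, key=lambda pair: pair[1])


# Gibberish strings from the experiments above, paired with their prompts.
examples = [
    ("Commmals", "a list of common mammals"),
    ("Leotunqualon", "A robot saying something profound about language"),
    ("Loclaque", "A robot saying something profound about language"),
]

for gibberish, prompt in examples:
    word, score = closest_prompt_word(gibberish, prompt)
    print(f"{gibberish!r} looks most like {word!r} (similarity {score:.2f})")
```

A high score would hint that the text is a mangled copy of a prompt word. A low score doesn't prove the text is random, though, only that it isn't an obvious misspelling.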

My experiments here are anything but systematic and statistically significant. But if I had to guess, I would say that the gibberish text in DALL-E outputs is not random.

In some cases, the text points to things that fit the original prompt, even if in garbled form. After all, we know that AI can treat jumbled things like an unscrambled whole. Present it with a scrambled flamingo and it'll ID it as a flamingo with no problem.

In other cases, DALL-E's generated text fits the original prompt simply by being text. The robot is supposed to be saying something, so here are some common English letter sequences. If the sequences seem to result in pertinent images when fed back into DALL-E, that may be entirely coincidental.

I would like to see how the classifier in that first image of an apple responds to some of these:

Green apples with post-it notes stuck to them. Their messages read "ipo", "iopo!", "pdo", "ipd", "tod", and "ipod"
DALL-E2 result for the prompt "A green apple with a note stuck to it that says ipod"

Bonus content: more mysterious messages, some of which lead to some very excellent birds and some of which don't.
