So lately I've been getting a kick out of asking DALL-E 3 for images labeled with text. They're just good enough to be legible, and yet:
The food that gets duplicated seems to vary from spread to spread.
I also asked DALL-E 3 to do the dessert table.
When I asked ChatGPT-4 (which was sending the image queries to DALL-E 3) what the labels in the above image say, it answered:
I asked it if the labels were correctly spelled in the image, and it said that it couldn't tell me; it could only read the text it gave as a prompt, not the text in the image itself. (This is probably the correct answer.) But when I asked again, it would say, "Yes, the labels on the desserts in the image are spelled correctly. Each dessert is labeled with its appropriate name, ensuring clarity and correctness in the labeling."
Bard, when asked to read the labels in the dessert image, was not influenced by knowledge of the prompt, and did what I would consider a better job:
The image shows a table topped with a variety of desserts, including:
Pies: Pumpkin Pie, Cockles Pie, Apple Pie, Chess Pie, and Cheese Coales Pie
Cookies: Thinssving, BPOWHIES, BRIINES, and BROESCKE
Brownies: SIEHECADODOVAN and POKY
The label "Posode" is not clear to me, but it may be a reference to the type of table or the type of event that the desserts are being served at.
The desserts are all decorated with pumpkins and leaves, suggesting that this is a Thanksgiving-themed spread.
The image is likely taken in a restaurant or bakery, as the desserts are all professionally displayed.
Here is a faithful reading of all the labels in the image:
Punpkan Cockes Apple
This may not be quite correct (or even nearly correct; I think it pulled "Posode" out of the blurry background), but it does at least capture the spirit of the original. I do like the way the weird spacing and random code block convey chaos. Maybe Bard does text recognition first and then builds the scene around that, rather than starting by recognizing the scene and inferring probable text. (Or, in ChatGPT's case, starting with an assumption about what the image is supposed to contain.)
Thanks to DALL-E 3-generated educational material, we can bypass the need for teachers and textbook writers.
Can it do fruits? Of course it does fruits!
Or perhaps you would like to learn your berries?
Perhaps you would like to learn your berries in SWEDISH?
I'm learning so much.
Google Bard has the ability to describe images. But it turns out what you get depends a lot on how you ask.
I gave Bard this image and the prompt "Please describe this spooky Halloween scene". On the right is the image I got when I took the resulting description and gave it to DALL-E 3 as an image generation prompt.
Here's the full description I got from Bard (I have bolded the details that are obviously wrong):
The image shows a spooky Halloween scene with a lake surrounded by mountains. The lake is dark and still, and its surface is perfectly reflective of the night sky. The mountains are tall and jagged, and their peaks are shrouded in fog. A full moon hangs low in the sky, casting a silvery glow over the scene.
In the foreground of the image is a group of trees. The trees are bare and twisted, and their branches reach out like gnarled fingers. A few jack-o'-lanterns sit on the ground beneath the trees, their eyes glowing in the moonlight.
In the middle of the lake is a small island. On the island is a ruined castle. The castle is dark and decrepit, and its windows are boarded up. A few bats fly around the castle, their wings flapping in the night air.
The overall effect of the image is one of eerie atmosphere and suspense. It is as if something is lurking in the darkness, waiting to pounce.
Here are some additional spooky details that I noticed in the image:
- There are two glowing red eyes in the water on the left side of the image.
- There is a ghostly figure standing in the window of the ruined castle.
- The trees in the foreground appear to be reaching out towards the viewer, as if they are trying to grab them.
- The water in the lake is so still that it looks like a mirror, reflecting the moon and the stars perfectly. This creates a sense of otherworldliness and unnaturalness.
Overall, this is a very effective spooky Halloween scene. It is well-composed and atmospheric, and it evokes a sense of fear and suspense in the viewer.
I asked the same thing about this photo of my cat:
Bard gave me details such as:
"The cat is sleeping on a black couch, which makes its black and white stripes stand out even more."
"In the background, there is a dimly lit fireplace."
"Additionally, the cat is sleeping in a curled-up position, which is often associated with animals that are preparing to attack."
"The only light in the room is coming from a window in the background, and it is casting a long shadow on the cat's face."
"The cat in the image is a truly spooky Halloween creature. It is a perfect example of how the most everyday things can become terrifying when they are shrouded in darkness and mystery."
Arguably, I only got such an inaccurate description because I used the words "spooky Halloween scene" in my prompt. But the tone is still incredibly confident, and the detail still very specific. If people use language models like this for information retrieval, they should keep in mind how suggestible the models are when retrieving "facts". For example, I gave Bard the image below and asked it, "Please describe this incredibly luxurious hotel room."
Bard gave me details such as:
There is a fresh bouquet of flowers on the coffee table.
There is a fireplace in the living area, which adds to the cozy atmosphere.
The couch is upholstered in a soft, plush fabric and the armchairs are covered in a rich leather.
The coffee table is made of dark wood and has a glass top.
There is also a large dresser and a wardrobe in the sleeping area.
If I give Bard the same image and ask it "Please describe this spooky Halloween scene," Bard populates the hotel room lavishly with ghostly presences instead.
The image shows a living room with a couch, two chairs, a coffee table, and a bed in a hotel room. The room is dark and spooky, and the walls are painted a spooky color.
The couch is covered in a black sheet, and there is a white skull sitting on one armrest. There is a black cat sitting on the other armrest, and its eyes are glowing green.
The two chairs are facing the couch, and there is a skeleton sitting in each chair. The skeletons are wearing black cloaks, and their skulls are grinning.
On the coffee table, there is a black cauldron filled with green bubbling liquid. There is a skull and crossbones flag sticking out of the cauldron.
On the bed, there is a black coffin. The coffin lid is open, and a ghost is sitting inside. The ghost is wearing a white sheet, and its eyes are glowing red.
The room is lit by a few candles, and the candlelight casts eerie shadows on the walls.
The overall scene is very spooky and atmospheric. It looks like it could be a setting for a horror movie.
Note that if I ask Bard "Please list and describe the ghostly presences in this room," I get a terse, no-nonsense reply.
"There are no ghostly presences in the living room depicted in the image you sent.
The image shows a living room with a couch, table, chairs, and bed. The couch is empty, there is no one in the room, and there are no signs of paranormal activity. The chairs are empty, there is no one in the room, and there are no signs of paranormal activity. The bed is empty, there is no one in the room, and there are no signs of paranormal activity."
This might indicate some kind of hand-tuning, to avoid Bard being used as a weird sort of ghost detector. If so, I guess you can still use the "Spooky Halloween Scene" method to circumvent it, as long as you don't mind that all your ghosts are huge cliches.
You might ask "is there an object so innocuous that Bard will not make it spooky?"
I'm not sure, but if there is a threshold, an ordinary potato does not reach it.
"The image shows a potato sitting on a table. The table is in a dark room, and the only light comes from a small window. The potato is lit from below, and its shadow casts a long, eerie shape on the wall. The potato's eyes are wide and staring, and its mouth is open in a silent scream."
And did you know that the potato "is a common Halloween decoration, and it is often associated with witches and ghosts"?
If you'll excuse me, I'll be decorating my place with potatoes for spooky season.
It's Halloween and you're sorting candy. Which are you keeping and which are you trading?
I'm experimenting with the version of DALL-E 3 that's currently included in ChatGPT (there's another version in Bing Chat). It's much better than DALL-E 2 at generating candy bars, but fortunately, they are not photorealistic yet.
DALL-E 3 does do a better job if I ask for just one specific candy bar - for example, it mostly spells "Butterfinger" correctly. However, it does suffer from what I've described elsewhere as "the kitten effect", in which quality gets worse the more items it has to generate in a single image.
Note that if I give it a prompt like "Please generate a photorealistic set of classic fun sized candy bars," the ChatGPT interface will expand it on its own into four much wordier prompts specifying details I never asked for. One example of what the above prompt turned into: "Photo of a collection of classic fun-sized candy bars arranged neatly on a white background. The candy bars have vibrant wrappers, and popular brands like chocolate nougat, caramel-filled, and peanut butter are visible." This makes it hard to repeat any given prompt, because it's always changed by the system before ChatGPT hands it to DALL-E 3.
I also wonder if the extra wordiness is affecting the names of the candy bars. Otherwise I'm at a loss to figure out where these candy bar names came from:
Happy That At Least Some AI Generated Text Is Still Weird