AI vs a giraffe with no spots

On July 31, 2023, a giraffe with no spots was born at Brights Zoo in Tennessee. She's a uniform brown with pretty white highlights around her face and belly, like a Jersey cow or a white-tailed deer.

Image recognition algorithms are trained on a variety of images from around the internet, and/or on a few standard image datasets. But there likely haven't been any spotless giraffes in their training data, since the last one to be born was probably in 1972 in Tokyo. How do they do when faced with photos of the spotless giraffe?

I did a very brief and entirely unscientific experiment using two image recognition models, chosen for inclusion entirely because they had online demos where I could choose what question the program was answering.

One model I tried was Multi-Modal In-Context Learning (MMICL). I tried simulating a back-and-forth exchange a la Visual Chatbot, but it quickly got repetitive. When I tried one-off questions, it answered more coherently but still failed to note the giraffe's spotlessness.

The MMICL demo doesn't have a back-and-forth dialog mode per se, so at each step I copied the prior dialog into the system prompt and added my newest question to the end. I'm not sure if this is how it's intended to be used, but I didn't see any other instructions.
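
In case the prompt-stitching trick is useful to anyone else, here's a minimal sketch of what I mean. The query_mmicl function is a hypothetical placeholder for however you actually call the demo or model, and the "User:/Assistant:" formatting is my own assumption, not anything the MMICL authors specify.

```python
# Hypothetical sketch: fake a multi-turn dialog with a single-turn
# image-question interface by re-sending the whole transcript each time.

def query_mmicl(image_path: str, prompt: str) -> str:
    """Placeholder: send one image plus one text prompt, return the reply."""
    raise NotImplementedError("Swap in the real demo or API call here.")

def simulated_dialog(image_path: str, questions: list[str]) -> list[str]:
    transcript = ""  # everything said so far
    answers = []
    for question in questions:
        # Prepend the prior dialog so the model can "see" earlier turns.
        prompt = transcript + f"User: {question}\nAssistant:"
        answer = query_mmicl(image_path, prompt)
        answers.append(answer)
        transcript = prompt + f" {answer}\n"
    return answers

# Illustrative usage (image path and questions are made up):
# simulated_dialog("spotless_giraffe.jpg",
#                  ["What animal is this?", "What pattern is on its coat?"])
```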

I also experimented with a version of InstructBLIP (I used the demo hosted here). Indirect questions yielded lots of fluent detail, much of it spurious. Chairs? A person in a grey shirt? A pink umbrella?
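
For anyone who wants to repeat this kind of question outside the hosted demo, here's a rough sketch using the Hugging Face transformers port of InstructBLIP. The checkpoint name, image path, and sampling settings are my assumptions; I have no idea what configuration the demo itself runs.

```python
# Rough sketch (my assumptions, not the hosted demo's setup): query the
# Hugging Face transformers port of InstructBLIP with a single question.
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-7b"  # assumed checkpoint
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("spotless_giraffe.jpg").convert("RGB")  # illustrative path
question = "Is there anything unusual about the giraffe's coat?"

inputs = processor(images=image, text=question, return_tensors="pt")
output_ids = model.generate(**inputs, do_sample=True, max_new_tokens=80)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```

With sampling turned on, the answer changes from run to run, which is presumably part of why my repeated trials wandered so much.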

Direct questions fared no better. When I first asked InstructBLIP:

"Is there anything unusual about the giraffe's coat?"

I got the answer:

"Yes, the giraffe has an unusual coat in the image. It is a brown giraffe, which is a rare color compared to the typical giraffe that is typically spotted or brownish-gray. This brown coloration is more unusual and may be due to genetic factors or an environmental influence."

Which for a moment made me think that perhaps the algorithm had correctly identified the key feature of this image! But when I repeated the question over a few trials, it never repeated its success; its answers included:

"The giraffe's coat is unusual because it appears to be spotted or striped, instead of being solid brown or spotted as a typical pattern."

"The giraffe's coat is unusual in that it is brown and white. Many giraffes have dark brown or black coats, so the brown and white combination is relatively unusual for this species."

"Yes, there is something unusual about the giraffe's coat. Specifically, the giraffe appears to be wearing a coat. While this might seem unusual or unexpected, it is a common practice in the case of giraffes raised in captivity."

I asked it directly:

Is the giraffe striped, spotted, or plain?

and got

The giraffe is striped.

Its answer depended on word order, though. If I presented the "spotted" or "plain" option first I would get:

Me: Is the giraffe spotted, striped, or plain?
Bot: The giraffe is spotted, with a brown and white pattern on its body.
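
That kind of order sensitivity is easy to probe: ask the same multiple-choice question with the options in every order and see whether the answer follows the wording or the image. A minimal sketch, with ask() as a hypothetical wrapper around a single image-plus-question call like the InstructBLIP snippet above:

```python
# Hypothetical probe: does the answer track the image, or just the order of
# the options in the question? ask() stands in for a single image+question
# call like the InstructBLIP sketch above.
from itertools import permutations

def ask(image_path: str, question: str) -> str:
    raise NotImplementedError("Wrap your model or demo call here.")

options = ("striped", "spotted", "plain")
for first, second, third in permutations(options):
    question = f"Is the giraffe {first}, {second}, or {third}?"
    print(question, "->", ask("spotless_giraffe.jpg", question))
```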

I repeated the experiment with a normally-patterned baby giraffe and got the same range of answers about the giraffe's coat. It even selected "striped" again from "striped, spotted, or plain".

Are these crummy image recognition models? Not unusually so. As far as I can tell with a brief poke around, MMICL and InstructBLIP are modern models (as of Aug 2023), fairly high up on the leaderboards of models answering questions about images. Their demonstration pages (and InstructBLIP's paper) are full of examples of the models providing complete and sensible-looking answers about images.

Then why are they so bad at Giraffe With No Spots?

I can think of three main factors here:

  1. AI does best on images it's seen before. We know AI is good at memorizing stuff; it might even be that some of the images in the examples and benchmarks are in the training datasets these algorithms used. Giraffe With No Spots may be especially difficult not only because the giraffe is unusual, but because it's new to the internet.
  2. AI tends to sand away the unusual. It's trained to respond with the most likely answer to your question, which is not necessarily the most correct one.
  3. The papers and demonstration sites are showcasing their best work. Whereas I am zeroing in on their worst work, because it's entertaining and because it's a cautionary tale about putting too much faith in AI image recognition.

Giraffes have been a cautionary meme in image recognition, with early algorithms especially prone to reporting giraffes in places where there are none. Giraffe With No Spots is another example of how AI can miss the obvious.

Bonus content: I try (and fail) to get DALL-E 2 to generate a giraffe with no spots.