A sloppy pizza-textured pile that's approximately 8 pizza thicknesses tall
“Leaning Tower of Pizza”, BigGAN steered by CLIP using Big Sleep

I wrote earlier about DALL-E, an image generating algorithm recently developed by OpenAI. One part of DALL-E’s success is another algorithm called CLIP, which is essentially an art critic. Show CLIP a picture and a phrase, and it’ll return a score telling you how well it thinks the picture matches the phrase. You can see how that might be useful if you wanted to tell the difference between, say, a pizza and a calzone - you’d show it a picture of something and compare the scores for “this is a pizza” and “this is a calzone”.
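If you want to play art critic along at home, OpenAI released CLIP as a Python library, and the scoring part is only a few lines. Here's a minimal sketch of the pizza-versus-calzone test - the filename mystery_food.jpg is just a placeholder for whatever photo you want judged.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "mystery_food.jpg" is a stand-in for whatever photo you want judged
image = preprocess(Image.open("mystery_food.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["this is a pizza", "this is a calzone"]).to(device)

with torch.no_grad():
    # model(image, text) returns CLIP's match scores for the image against each caption
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("pizza: %.3f  calzone: %.3f" % (probs[0][0], probs[0][1]))
```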

How you come up with the pictures and captions is up to you - CLIP is merely the judge. But if you have a way to generate images, you can use CLIP to tell whether you’re getting closer to or farther from whatever you’re trying to generate. One example of this is Ryan Murdock’s Big Sleep program, which uses CLIP to steer BigGAN’s image generation. If I give Big Sleep the phrase “a cat singing a sea shanty”, it’ll start with some random guesses and then tweak the image based on CLIP’s feedback, searching for an image that better fits the prompt.
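Big Sleep itself is available as a Python package, but the core trick is simple enough to sketch: treat BigGAN's noise vector and class vector as the thing being optimized, render an image from them, ask CLIP how well it matches the prompt, and nudge the vectors in whatever direction improves the score. The code below is not Ryan Murdock's actual implementation - the real Big Sleep adds random crops, a cleverer latent parameterization, and other stabilizing tricks - just a rough sketch of the idea using the clip and pytorch-pretrained-biggan packages, with made-up learning rate and step count.

```python
import torch
import torch.nn.functional as F
import clip
from pytorch_pretrained_biggan import BigGAN  # pip install pytorch-pretrained-biggan

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device, jit=False)
clip_model = clip_model.float().eval()
gan = BigGAN.from_pretrained("biggan-deep-256").to(device).eval()

# Encode the prompt once; the image will chase this target
tokens = clip.tokenize(["a cat singing a sea shanty"]).to(device)
with torch.no_grad():
    target = clip_model.encode_text(tokens)
    target = target / target.norm(dim=-1, keepdim=True)

# The "random guesses" are just BigGAN's noise vector and class vector
noise = torch.randn(1, 128, device=device, requires_grad=True)
class_logits = torch.randn(1, 1000, device=device, requires_grad=True)
optimizer = torch.optim.Adam([noise, class_logits], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    image = gan(noise, class_logits.softmax(dim=-1), 0.7)  # BigGAN output in [-1, 1]
    image = (image + 1) / 2                                # rescale to [0, 1]
    image = F.interpolate(image, size=224, mode="bilinear",
                          align_corners=False)             # CLIP expects 224x224 input
    # (A faithful version would also apply CLIP's exact normalization and random crops)
    features = clip_model.encode_image(image)
    features = features / features.norm(dim=-1, keepdim=True)
    loss = -(features * target).sum()  # maximize cosine similarity with the prompt
    loss.backward()
    optimizer.step()
```

Note that nothing in this bare-bones loop keeps the noise vector anywhere near the range BigGAN was trained on, which is one reason the search can wander off into the strobing chaos described below.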

So its first try at “a cat singing a sea shanty” might be this:

A blurry golden background, a figure with a black and white head in the foreground

After maybe 20 minutes of crunching on a big processor, it finds a better image, looking something like this:

The black and white figure is now distinctly furry and wearing a tricornered hat and rugged boatnecked shirt. The blurs in the background are starting to look like old tall ships.

It isn’t nearly as good as the DALL-E results I’ve seen (although to be fair, they posted the best of tens to thousands of tries, and I haven’t seen DALL-E try “a cat singing a sea shanty”). But it’s an interesting way of exploring BigGAN’s latent space. Here’s “a giraffe in the Great British Bakeoff”.

There's a brick building and trees in the background, and in front of them are heaps of what looks like giraffe-spotted dough about two stories high. In the foreground is a muffin-shaped object with a very tall giraffe-spotted neck.

I’ve mentioned before that BigGAN was trained on 1,000 categories of images from ImageNet, all hand-labeled by hired humans. “Giraffe” and “Great British Bakeoff” are not among the ImageNet categories, but CLIP can still give feedback about what looks like a giraffe, and what looks like an outdoor baking show, because CLIP was trained on 400 million images paired with nearby text scraped from the internet. The upside of this is that CLIP knows how to nudge BigGAN into finding something vaguely like “My Little Pony Friendship Is Magic in outer space”, even if nothing like this was in the BigGAN training images.

The background is vividly magenta, and the two figures in the foreground might be ponies; one of them might even be in a space helmet with a unicorn horn. Definitely requires some imagination.

The downside of this is that CLIP has seen a skewed picture of the world through its internet training data. In their paper, the makers of CLIP discuss a bunch of biases that CLIP learned, including classifying people as criminals based on their appearance or making assumptions about what a doctor looks like. If I had given it terrible prompts, it would have helpfully done its best to fulfill them.

Many times, the CLIP-BigGAN combination steered out of control rather than arriving anywhere. “Spiderman delivering pizza” looked at first like it was refining a drawing of a technicolor emu:

Lush green in the background, and a red and blue bipedal figure in the foreground. Body shape and legs more closely resemble an emu than anything else. Maybe that's a rectangle of pizza on its back?

then dissolved into sudden chaos:

Red and blue staticky lumpy chaos

heroically refined that chaos into something resembling Spiderman and a single slice:

Emerging from the previous chaos is a very rudimentary three-holed head and long appendage with a pizza-textured thing on the end

and then dissolved suddenly into white nothingness from which it never recovered. This chaotic disaster happens when the search wanders off into extreme values and is usually accompanied by intense strobing colors and huge areas of blank, or spots, or stripes. And it happened all the time. When I gave it the AI-generated drawing prompt “Coots of magic”, things were looking promising for a few minutes:

Very pretty and glowy, a few bird-shaped objects with glowing beaks and black pointy hats, standing among maybe jewels in a forest glade.

But in the very next step it collapsed into this:

It looks like what you'd get if you gave a toddler six tubes of colored cake frosting and left for an hour. Indistinct pastel smears everywhere.

A “professional high quality illustration of a cat singing a sea shanty” got this far:

There might be a white boat in the background and possibly an angry white cat with its mouth open in the foreground.

before turning into this neon laserfest:

It looks like a cross section of cauliflower, but made out of vivid green lasers

In other cases, progress would just stop somewhere weird.

“Tyrannosaurus Rex with lasers” became ever more finely textured but just looked like a dog.

It might be some kind of boxer dog - its head is indistinct. In the background are red and green laser beams. There's green laser light on the dog's belly fur.

This is why AI programmers tend to laugh at the idea that an AI left running long enough would eventually become superintelligent. It’s hard to get an AI to keep progressing.

I had particular difficulty trying to get it to find a photo of Bernie Sanders. In three different trials I would get neon blobs from the get-go. Finally CLIP-BigGAN went with a strategy that has occasionally paid off for it: if you can’t make it well, make it very tiny. I give you “a photo of Bernie Sanders sitting on a chair and wearing mittens”.

Off in the distance among snow-like white is a bald, white-haired man sitting in a chair with his hands on his lap. It might plausibly be the famous Bernie with Mittens meme, except sideways, though it's a bit hard to tell because it's so small and far away.

For more examples, including when I tried to get it NOT to generate a picture of a pink elephant, become an AI Weirdness supporter! Or become a free subscriber to get AI Weirdness in your inbox.


“A golden retriever in the Great British Bakeoff”

Way off in the distance among featureless white is a golden tuft of hair with possibly a single dog nose.