I've been experimenting with GauGAN2, released in Nov 2021 as a follow-on to GauGAN. One new thing Nvidia added in GauGAN2 is the ability to generate a picture to match a phrase.

"A rocky stream in an ancient mossy rainforest"

Twin waterfalls descend a mossy, fern-covered cliff. The scene is nearly photorealistic.

It can do more than generic landscapes - it can also try to get the vibe of specific places. Here's "Rocky Mountain National Park".

Looking up a rocky, evergreen-covered ridge toward steep jagged mountains, with the remnants of spring snow above treeline.

It's tough to tell exactly what it knows because its scenes depend not only on the text you give it, but also on the style filter you select and even on the seed for the random noise. And you can't control the random noise seed - you can only reload the site and get a new one (which confused me at first when I would revisit a prompt/filter combo and find that my results had changed).
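
To make that concrete: a text-conditioned GAN like GauGAN2 produces an image from three inputs - the text prompt, a style choice, and a random latent "noise" vector - and the demo only lets you control the first two. Here's a minimal sketch of that structure; every name in it (generate, model, encode_text, latent_dim) is hypothetical, since Nvidia's demo doesn't expose an API like this.

```python
import torch

# Hypothetical interface -- Nvidia's web demo doesn't expose one.
# The point: the image depends on THREE inputs, and the demo only
# lets you control two of them (prompt and style filter).
def generate(model, prompt: str, style_filter: str, seed: int):
    torch.manual_seed(seed)               # the demo re-rolls this on every page load
    z = torch.randn(1, model.latent_dim)  # random latent "noise" vector
    text_emb = model.encode_text(prompt)  # text conditioning
    return model(z, text_emb, style=style_filter)

# Same prompt + filter, different seeds -> different scenes, which is
# why revisiting a prompt/filter combo can give you new results.
# img_a = generate(model, "Rocky Mountain National Park", "filter_3", seed=42)
# img_b = generate(model, "Rocky Mountain National Park", "filter_3", seed=43)
```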

The secret to being impressed with its "the city of san francisco with the golden gate bridge" is to pick the filter/seed combo where it's showing the city, not the countryside scene or the icebergs.

A bay cuts through a city beneath green hills. The city is definitely not as large or as built-up as San Francisco, and the houses in the foreground are more like distorted sketches among confusing streets.
the city of san francisco with the golden gate bridge
Deciduous trees in autumn leaf stretch over low mountains, with the valleys filled with green farms, lakes, and the occasional neighborhood. It looks a lot like Scotland.
the city of san francisco with the golden gate bridge
The mountains have the same shape as before, but now they're ice-covered fjords.
the city of san francisco with the golden gate bridge

(The above scenes were all generated with the same seed but different filters - notice how the topography is similar.)

At some filter/seed combinations it almost doesn't matter what you type; you are predestined to get certain elements. Here I asked for "Furnace Creek" (a decidedly dry spot in Death Valley), and the main effect was that the mountains above the tropical bay became a little browner. The information about Furnace Creek was in there, but it seemed to be warring with information about what went along with the color scheme it was supposed to be using.

A long-exposure nighttime shot of a glowing city above tropical green waters. Above the city are orange-brown bare mountains.
Furnace Creek

I decided to stick with a particular seed/filter combination and change only the text prompt, to find out what it was getting from the text alone. Using the same seed and filter as the "Rocky Mountain National Park" picture, I discovered GauGAN2 also knows that "The Alps" means steep mountains and that "The Appalachians" are mountains that are much less steep and completely covered in trees.
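
In effect, this is a controlled sweep: hold the seed and style filter constant and vary only the prompt. Continuing the hypothetical generate() sketch from above (again, not a real API, and the filter name and seed are placeholders):

```python
# Controlled sweep: fix the seed and filter, vary only the text prompt,
# and compare the outputs side by side.
for prompt in ["Rocky Mountain National Park", "The Alps", "The Appalachians"]:
    img = generate(model, prompt, style_filter="filter_3", seed=42)
    # ...save or display img for comparison...
```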

Here's a series of images with the same filter and seed (a different combination from the picture above).

"Furnace Creek"

Reddish badlands and dry sagebrush against a dried-up creek.
Furnace Creek

"Norway"

The same shape of mountains as before, but now they're snow-dusted peaks above tree-lined fjords.
Norway

"Manhattan"

The same scene as before, but the mountains are a bit blurrier, with indistinct snow. The foreground now has a few apartment buildings.
Manhattan

"The inside of a library"

The same scene as before, but with the mountains and apartment buildings looking rather blurry and melty. The trees are more saturated and less distinct.
The inside of a library

"Three apples on a plate"

The mountains are blobby and saturated and not at all realistic-looking. The apartment buildings are replaced by a single twisted barn, and the trees are now deciduous.
Three apples on a plate

It seems that as the prompt becomes a worse match for the seed and filter settings, and strays further from landscapes in general, the mountain scene doesn't just stay a mountain scene - it becomes a worse and blurrier one, as if GauGAN2 is out of options and flailing.

Why is GauGAN2 so flummoxed by three apples? Unlike popular algorithms based on CLIP, GauGAN2 was NOT trained on the internet but on a custom dataset of "high-quality landscape images". I couldn't find any details on what was/wasn't in that dataset, and who labeled the images and how. (Those details matter - one complaint about images generated using CLIP is that since it got its image/text data by scraping the internet, its training data was essentially labeled by random internet people, and was therefore full of bias. Similar things can happen if random internet people provide your training data on moral dilemmas.)
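
(For contrast, CLIP's role in those systems is just to score how well an image matches a text description - CLIP-guided generators nudge an image toward a higher score for the prompt. Here's what that scoring looks like via the Hugging Face transformers wrapper; this is purely illustrative of the CLIP approach, not anything GauGAN2 itself uses.)

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP scores image/text similarity; CLIP-guided generators optimize an
# image toward a higher score. (Illustration only -- GauGAN2 doesn't
# use CLIP; it was trained on a curated landscape dataset.)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # stand-in for a generated image
inputs = processor(text=["three apples on a plate"], images=image,
                   return_tensors="pt", padding=True)
score = model(**inputs).logits_per_image  # higher = better text match
print(score)
```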

We do know the training data focused on landscapes, and Nvidia also told VentureBeat that they "audited to ensure no people were in the training images".
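
We don't know how that audit worked - it could have been entirely manual - but one plausible automated pass would be to run a pretrained person detector over every candidate image and drop anything with a confident "person" hit. A minimal sketch with torchvision, entirely my assumption rather than Nvidia's actual pipeline:

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# One way to audit a dataset for people (assumption -- we don't know
# Nvidia's actual process): flag any image where a COCO-trained
# detector finds a "person" (COCO category 1) with high confidence.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def contains_person(path: str, threshold: float = 0.7) -> bool:
    img = read_image(path).float() / 255.0   # CHW tensor in [0, 1]
    with torch.no_grad():
        pred = detector([img])[0]            # dict of boxes, labels, scores
    hits = (pred["labels"] == 1) & (pred["scores"] > threshold)
    return bool(hits.any())

# keep = [p for p in image_paths if not contains_person(p)]
```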

How completely did they erase humans from the training data? I decided to do some tests.

Here's "A busy sidewalk".

Bright colors with lots of vertical and horizontal lines, floating above (in front of?) a green landscape. In the sky are palm trees, for some reason.

Similar prompts, like "a crowded sidewalk downtown" and "a crowded shopping mall" and "times square" all resulted in variations of the above tropical-themed algorithmic shrug.

Is it completely clueless about non-landscape things?

Here's "a giraffe"

Macro shot of grey bark with something furry in focus. It has at least two heads and an uncountable number of eyes.

And, under the same seed/style settings, "a turtle".

Hard to determine scale, but the background looks metallic and wet. There is an object of some sort, with at least one discernible eye.

And "a gymnasium"

The heart of a river rapids, with a twisted-up chrome motorcycle emerging from the water. Maybe. It's very abstract.

And "a sheep"

It looks a lot like the giraffe photo, but with slightly more fur and slightly fewer eyes.

So it's made a sheep look more like a giraffe than like a gymnasium. In further tests, a raven looks more like a sheep, whereas a writing desk looks more like a gymnasium. Any kind of mammal generally ends up rendered as furry and covered in far, far too many eyes. There's information about animals and objects in there, either from rare images in the training data or from remnants of earlier, more general pretraining. What about a human?

"A human"

Perhaps a cross between the turtle and gymnasium images - rippled water or possibly whale skin, some deep holes, maybe some chrome, at least one large eye.

Hard to say. With some seed/filter combinations a human does look more like a somewhat less furry giraffe. The effect isn't consistent, though. I got similarly inconclusive results from variations on Homo sapiens, basketball players, the Mona Lisa, a CEO, etc. To erase knowledge of humans this thoroughly, my guess is that they trained from scratch on a dataset where people had gone through and manually deleted any picture that had a person in it. Humans are basically gone.

It's poignant to think that GauGAN2 was able to generate this rendition of Topeka, Kansas, filled with homes and boats and skyscrapers, yet unable to generate any of the people who inhabit it, or interiors of any of the buildings.

A city on the edge of the water, at the mouth of a mountain valley.

Bonus content for AI Weirdness supporters: GauGAN2 railroads me into asking for whatever it already had in mind.
