Be not afraid of these perfectly normal golf courses
What would an AI who's never seen or heard of golf courses do when shown a list of real golf course names and challenged to generate more?
When Jeff Kissel sent me 15,626 existing golf course names from the National Course Rating Database, I thought I might have an opportunity to find out. I used Max Woolf's aitextgen, a text-generating model based on GPT-2. Although it has the option to start training from a version of GPT-2 that has already seen a lot of internet text, I wanted to see what would happen if I started from scratch. I told aitextgen that my dataset consisted of individual items, each on their own line, and that they were 30 characters or less in length.
Here's what aitextgen produced before any training had happened:
lands Ranch Ranch
Resort OFAKESage Resortsh
SandSand IDGEersersAollEADOW Farmield
Therairieont GolfBearockous P
They are extremely cursed-looking, yes, but even though I haven't started training yet, they already contain fragments of real golf course names. Why? As Max Woolf explained to me, this is because in preparation for training, aitextgen looks through the dataset and finds recurring chunks of text to use as building blocks. Algorithms I've used in the past used individual characters, or common words, as their building blocks. But, like a lot of modern text-generating algorithms, GPT-2 takes advantage of the fact that chunks of text like "ing" or "the" (or in this case, "golf", "resort", "ranch", etc.) come up a lot, and it saves time to use the entire chunk at once. (This process is called tokenization.)
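To make the "recurring chunks" idea concrete, here's a toy sketch in the spirit of byte-pair encoding: start from single characters and keep fusing whichever adjacent pair shows up most often. This is not aitextgen's actual tokenizer code. The function name `learn_merges` and the four course names are made up for illustration.

```python
from collections import Counter

def learn_merges(names, num_merges):
    """Toy chunk finder: repeatedly fuse the most common adjacent pair of tokens."""
    # Start with every name split into single characters.
    sequences = [list(name) for name in names]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens across the whole dataset.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair recurs any more, so stop merging
        merges.append(left + right)
        # Rewrite each sequence with the chosen pair fused into one token.
        new_sequences = []
        for seq in sequences:
            fused, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    fused.append(left + right)
                    i += 2
                else:
                    fused.append(seq[i])
                    i += 1
            new_sequences.append(fused)
        sequences = new_sequences
    return merges, sequences

# Hypothetical stand-ins for the real dataset, which isn't reproduced here.
names = [
    "Pine Hills Golf Club",
    "Stone Creek Golf Club",
    "Eagle Ranch Golf Club",
    "Lake View Golf Club",
]
merges, tokenized = learn_merges(names, num_merges=50)
print(merges)        # the chunks learned, in the order they were created
print(tokenized[0])  # the first name, rewritten in terms of those chunks
```

Because every name in this tiny list ends the same way, the shared ending gets glued together into bigger and bigger chunks, while the one-off words mostly stay as loose letters. That's roughly why untrained output can already be full of "Golf" and "Ranch".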
Once I actually start training, aitextgen seems to latch onto the most common tokens and use them preferentially.
Here are some course names from iteration 20.
Course - -
Course C Club Golf Club Country
Club Club Club SO
e Course Course CC Club GC Course -M - Club - Course Course &
By iteration 70, it's starting to branch out into a bit more variety.
Rmps Golf Club
HBEeecy FNCner Hills Country Course
MHpmkoon Cere Country Club
In iteration 80, it becomes clear that it's generating two kinds of golf courses - sort of reasonable-sounding courses in lowercase, and frightening word salad in all-caps.
BCTINEOOERILTMAYLBB Me GC
Shcpases Golf Club
MARDORWM PINGANal Golf Club
CYWSIAREDANEORINKane Golf Club
Pine Country Club
BAARAWass Golf Course
BOVDVTRGHASSA CAINWGS CE/CTGS The GOLF Golf Course
What seems to have happened is that, since most golf course names are written either entirely in mixed case or entirely in all-caps, and there are a lot more mixed-case ones, aitextgen learns the mixed-case text faster. Its all-caps prowess lags far behind. I've seen this before. Nothing in what I gave it explicitly says that "golf" and "GOLF" are the same thing, so it has to learn how to use GOLF separately, from fewer examples.
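A quick way to check how lopsided a dataset is would be to sort names by capitalization style and count them. This is a minimal sketch. The helper `case_style` is a made-up name, and the sample list borrows a few generated names from this post as stand-ins for the real dataset.

```python
from collections import Counter

def case_style(name):
    """Label a name as all-caps or mixed-case, looking only at its letters."""
    letters = [ch for ch in name if ch.isalpha()]
    if not letters:
        return "no letters"
    if all(ch.isupper() for ch in letters):
        return "all-caps"
    return "mixed-case"

# Stand-in sample; the real list would be read from the dataset file.
sample = [
    "Pine Country Club",
    "HILLFIE HILLS GC",
    "Lake Worse CC",
    "Stale Lake Golf Club",
]
counts = Counter(case_style(name) for name in sample)
print(counts)

# To the model these are simply different text, unless it learns otherwise.
print("golf" == "GOLF")
```

The last line is the heart of it: without being told, nothing links the lowercase and uppercase spellings, so the rarer style has to be learned from its own smaller pile of examples.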
By iteration 200, the all-caps courses become less alarming. It's taken longer, but it's starting to get the patterns of those too.
Dumyky GC - Sshanes
VIEW COUNTRY CLUB - Iudly
Lelch Golf & CC
Smitno Golf & Country Club
Barkbosa Golf Club - Onndle Hort Course
Fopbitth Country Club
Misserty Golf Club
Slake Country Club
Stale Lake Golf Club
Blewing Creek Golf Course
HILLFIE HILLS GC
Groygon Country Club
Lake Worse CC
River Antban Country Club
Gurdy Hills Golf Course
Warererer Golf Course
Iteration 200 is more or less the last point at which the golf course names are original rather than copied. Even with over 15,000 example courses, aitextgen has such a huge learning capacity that it begins to memorize the input course names. After all, if I'm asking it to predict the names of golf courses, giving me the names of existing courses is technically a great solution.
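One simple way to spot this kind of memorization is to count how many generated names appear word for word in the training list. This is a sketch under assumptions: `memorized_fraction` is a hypothetical helper, the two "training" names are invented, and exact matching will miss near-copies.

```python
def memorized_fraction(generated, training_names):
    """Return the share of generated names that exactly match a training name."""
    training_set = {name.strip() for name in training_names}
    if not generated:
        return 0.0
    copies = sum(1 for name in generated if name.strip() in training_set)
    return copies / len(generated)

# Invented training names, plus two generated names from this post.
training_names = ["Pine Hills Golf Club", "Stone Creek Golf Club"]
generated = [
    "Pine Hills Golf Club",
    "Lake Worse CC",
    "Gurdy Hills Golf Course",
    "Stone Creek Golf Club",
]
print(memorized_fraction(generated, training_names))
```

Tracking this number at each checkpoint would show when the output tips from invention into recitation.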
Bonus content for AI Weirdness supporters: I gave GPT-3 the task of generating golf courses in the style of GPT-2. Don't ask too many questions about the holes at "0.00001 Toilet".