"This is the moment I've been training for," said the pun-generating AI
(Update 3/9/23: Model weights can be downloaded here)
In 2022 Robert Gonsalves demonstrated that GPT-J-6B could be fine tuned for limerick generation. This is an interesting data point, historically speaking, for a few reasons:
GPT-J-6B was over a year old when this happened
It’s roughly 30x smaller than GPT-3 (6B vs. 175B parameters)
Generating coherent and amusing jokes is considered computationally difficult
Note: Google’s PaLM LLM already managed this task, albeit at 100x scale
Robert Gonsalves did this as a fun personal project, using readily available cloud tools
I’m currently trying to fine tune the same model to make puns. Some unique (I think) output examples so far:
Two guys argued about a painting. There was a rupture in the peace.
Peace => Piece (painting)
When the townspeople found out the cow was giving birth, it was quite a cow to have to deal with.
I like this one because it’s still a pun, despite not being remotely funny.
A musician went to the hospital because he swallowed a trombone. The doctor told him to have a tube inserted and he would be playing on his own soon.
This is a mediocre pun, but the setup requires a large amount of real-world knowledge.
Two electricians had such different tastes, they went to a waffle outlet for a discussion.
This one appears to be a double pun (electricians => outlet, and waffle the food => waffle as in to go back and forth on an opinion)
“I love kiwis,” said Tom kiwwisely.
They’re not all zingers.
To be able to go back to boarding school and pass all her subjects meant that she had learnt her lesson.
So much worldbuilding for such an anticlimactic payoff.
The story of a boy who was born with one eye in the wrong place was told from an unexpected angle.
This one is probably the most impressive to date, after ~12000 fine tuning steps (and poring through maybe 800 non-pun or unfunny pun inferences).
Old pianists never die they just get tuned away.
This format (“Old [specialist]s never die, they just [death euphemism]”) is featured many times in the training data. However, the above pun is not on Google anywhere, so I assume it’s new.
I like to have a fire lit in my chimney, said Tom light-heartedly.
Heart => Hearth
Old gardeners never die they just turn green
He didn't wear his house shoes to work because he's such a homeboy.
Old mathematicians never die, they just have to multiply.
A young man sitting at a table with a pot of stew was very busy keeping a lid on his appetite.
Drumlines are always being beat up.
"There's no shortage of water," said Tom rationally.
Water rations.
My new job as a gem cutter is fascinating because I am so deeply engaging.
Gems => engagement rings.
Update 3/9/23 - After 15000 steps it started to sound like this:
If you tell a falsehood just after waking up, you're lying in bed.
The hardest part of the garden project for Tom was the concrete edge.
A baseball player is a good salesman if he has a good pitch.
A certain teacher used a different accent in his teaching, making it seem as if he were speaking "class."
If a tree is leaned against a fence, the wood will ring.
An elephant's opinion carries a lot of weight.
"I can see through the window," said Tom, stiltedly.
Before the painter finished his picture, he had it in mind for a long time.
Pilots always tell tall tales.
My friend gave me a book about "how to cook without a recipe."
If you get distracted while sanding, you'll soon be in the red.
And now that at least the pun and computational humor enthusiasts are hooked, I can give some more backstory.
Some of Gonsalves’ generated limericks are available online, but I couldn’t find his fine tuned weights anywhere. I have a home rig with ~5/4ths the VRAM of the Colab Pro instance that R.G. used, so I followed his instructions, shrank the 32-bit pretrained weights down to 8-bit so they’d fit on my rig, and went to town. After two days of gently warming my pantry, the magic word box started making coherent limericks line by line. Hooray.
Robert’s method combines multiple talents of GPT-type models, e.g. attention, memory, and even an inkling of generality, or at least a latent (no pun intended) talent for quickly absorbing stuff like phonetics (which, unlike language itself, is a more immediate reflection of the physical properties and limitations of human vocalization). He also created an apparently unique custom tagging scheme to hold the preceding limerick lines in memory when generating subsequent ones. The eight tasks he trained the model to do all contribute to solving the problem of semi-supervised topical limerick generation.
Part of what fine tuning is doing is a sort of taming of the output sequence space via fuzzy-structured blacklisting. E.g. GPT-J-6B was trained on The Pile, which is ~825GiB of mostly English text. So there’s probably a dip in the statistical likelihood of seeing a “q” without a subsequent “u” in the training set, just because that’s a specific quirk of the English language. But if I fine tune the model to create QuickBooks Desktop XML queries, it’s going to have to see a lot of substrings that violate that Q-U tendency (e.g. “qbXML” and “QBFC”), which should either loosen up the nodes deciding Q-U adjacency probability, or make downstream correction more likely in such cases. Hopefully then, for the qbXML example, this would remain situation-specific (requiring e.g. the various other indicators of qbXML to precede the part that will be following the new Q-U rules). This is a bit of an oversimplification, but only in the Wittgensteinian sense.
The other part of what fine tuning is doing is putting new data into a model. This partial retraining only happens in certain layers, so most of the model’s pretrained brainpower remains intact. The Pile probably didn’t have a high enough density of phonetic info to give GPT-J-6B complete knowledge. So some of the tasks R.G. included in his fine tuning dataset involved converting between graphemes and phonemes. When generating training strings related to phonetics, he used the phone set from the Festival Speech Synthesis System instead of something better known (e.g. IPA), probably because Festival uses a simple character set that is straightforward to tokenize.
GPT-type I/O data is tokenized according to whatever vocabulary was fixed before pretraining. That vocabulary is usually a compromise between model size and the character set of the training data: it’s built by greedily merging frequent character strings until a target vocabulary size is reached, and then frozen for the rest of the model’s life. This means input and output sequences get split into chunks according to an often very large and model-specific list of unique character strings. E.g.
“I would like to feed your fingertips to the wolverines.”
becomes:
['I', ' would', ' like', ' to', ' feed', ' your', ' fingertips', ' to', ' the', ' w', 'olver', 'ines', '.']
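For reference, that split comes straight out of the transformers tokenizer (the byte-level BPE output technically renders leading spaces as “Ġ”; the list above uses plain spaces for readability). A minimal sketch:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
>>> tokenizer.tokenize("I would like to feed your fingertips to the wolverines.")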
The relative scarcity of developed models in the field leaves significant room for exploration and discovery in combinatorially vast nooks, e.g. unexplained mysteries like the so-called “anomalous tokens” phenomenon, which becomes evident after sorting the GPT-J-6B tokens.json vocabulary file by token length. Here’s the longest entry:
ÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤ
Presumably there’s a perfectly good reason why the optimizer felt it needed like 50 repetitions of four weird Unicode variants that, when strung together, look like the sort of maniacal laughter one might encounter in radio transcription logs while exploring the remnants of a ship that briefly contacted some hellish other realm. Or maybe this is a different repeating sequence that HF’s tokenizer devs never needed to waste time encoding properly because it’s never been a useful part of the I/O? The next biggest token is the same thing, but half that size. Then we get to the infinitely more interesting “âĢĶ” repeated like 20 times. Then one more half-sized ÃĥÃĤ string. Then the first semi-readable one is “rawdownloadcloneembedreportprint”.
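If you want to poke at the list yourself without opening tokens.json, here’s a minimal sketch that pulls the vocabulary through the transformers tokenizer and sorts it by length:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
>>> vocab = tokenizer.get_vocab()  # maps token string -> id
>>> for tok in sorted(vocab, key=len, reverse=True)[:5]:
...     print(len(tok), repr(tok))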
So, tokenizing with the transformers library in Python:
>>> tokenizer.tokenize("rawdownloadcloneembedreportprint")
produces:
['rawdownloadcloneembedreportprint']
Some other anomalous tokens include:
BuyableInstoreAndOnline
This one seems to just be a very common phrase with non-alphanumeric characters stripped.
RandomRedditorWithNo
This is one of a number of Reddit usernames belonging to users who have been competitively counting to infinity. So maybe that’s what’s going on.
SolidGoldMagikarp
This is the most famous inexplicably tokenized redditor. Apparently they were heavily involved in the “Twitch Plays Pokemon” event.
channelAvailability
This string might appear frequently enough in certain logs to justify reserving a token space for it.
telecommunications
disproportionately
guiActiveUnfocused
This is a string related to the video game Kerbal Space Program.
Vice magazine recently published an article on these strange tokens, because at the time many of them still appeared in ChatGPT’s accessible vocabulary and could therefore cause visible problems when used as input. GPT models were invented by OpenAI, whereas the group that curated the datasets and managed the training of GPT-J-6B and her larger, nerdier sister GPT-NeoX-20B (see: my notebook demonstrating few-shot inference on this model) is a nonprofit research group called EleutherAI. However, I assume the two entities talk, and maybe exchange datasets, or at least procure them from similar places. So that probably explains why some of these anomalies also appear in GPT-3. (The more mechanical explanation: GPT-J-6B reuses the same GPT-2 BPE vocabulary that GPT-3 inherited, so oddities baked into that shared tokenizer show up in both.)
Tokenizing big blocks gets inconsistent fast. E.g.
>>> tokenizer.tokenize("rawdownloadcloneembedreportprint")
Output: ['rawdownloadcloneembedreportprint']
The same string, when embedded into a sentence, is broken up very differently:
>>> tokenizer.tokenize("I would like to feed your rawdownloadcloneembedreportprint to the wolverines.")
Output: ['I', ' would', ' like', ' to', ' feed', ' your', ' raw', 'download', 'cloneembedreportprint', ' to', ' the', ' w', 'olver', 'ines', '.']
Notice it still stopped at a real doozy of a token: “cloneembedreportprint”
It looks like ChatGPT’s tokenizer has been modified not to accept these strange tokens as input, but the raw davinci model (text-davinci-003) still processes them as anomalies:
Input: “isSpecialOrderable”
Output: “+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...”
vs. with spaces:
Input: “is Special Orderable”
Output: “The Lasko 754200 Ceramic Heater is not special orderable…”
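For anyone who wants to reproduce this kind of probe, here’s a rough sketch against the completions endpoint using the pre-1.0 openai Python client; the exact model name and sampling parameters are guesses, and an API key is assumed to be set in the environment:
>>> import openai
>>> # the Completion endpoint and text-davinci-003 were current at the time of writing
>>> resp = openai.Completion.create(model="text-davinci-003", prompt="isSpecialOrderable", max_tokens=64, temperature=0)
>>> print(resp["choices"][0]["text"])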
All this discussion of anomalous tokens is just background to explain why I chose a different scheme for tagging the pun bot dataset. Basically, GPT-J-6B was pretrained with 143 reserved tokens that aren’t mapped to anything that appears in the training set (the embedding table was padded out to 50400 entries for a more easily divisible vocab on TPUs). I have no idea if this makes them ideal candidates for a single-token tagging scheme, but I used them in this case and everything seems to have worked out. Once the current punbot finishes training I plan to try some alternatives with the same input data and see if this is actually a useful approach or just a weird dead-end.
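To make the reserved-token idea concrete, here’s a sketch of what single-token tagging looks like, assuming the unused slots are the <|extratoken_N|> entries exposed by the Hugging Face GPT-J tokenizer (the specific tag assignments below are made up for illustration, not my actual scheme):
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
>>> PUN_START, PUN_END = "<|extratoken_1|>", "<|extratoken_2|>"  # hypothetical tag assignments
>>> tokenizer.tokenize(PUN_START)  # a reserved tag should come back as exactly one token
>>> tokenizer(PUN_START + "Old pianists never die they just get tuned away." + PUN_END)["input_ids"]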
Puns are often based on ambiguity from partial homophones or phonetic overlap. The trickery of pun humor is primarily mechanical (which is why they always feel a bit cheap and cringeworthy), but they still require a lot of world knowledge and semantics to follow through. In the end, the fine tuned model will need to be able to:
Comprehend phonetics (specifically, homophones and near-homophones)
Gonsalves’ approach seems to already work here.
Sort of know a lot of stuff that most other people also sort of know
The pretraining takes care of this part.
Know what is and is not a proper pun (e.g. some sweet spot of funniness, universality, and speed of comprehension, etc.)
I used part of the SemEval-2017 puns dataset for this part, along with corresponding human annotations from the 2022 “ExPUNations” paper.
I’m fine tuning on my home rig with consumer-grade GPUs (two RTX 3090s plus an NVLink bridge), and each block of the dataset is capped at 1024 tokens. Parallelism is achieved via DeepSpeed with ZeRO stage 1 optimization. The good news is that this setup takes about a day to fine tune GPT-J-6B with 8-bit quantized weights. The bad news is that Hugging Face doesn’t yet support round-tripping models trained in 8-bit through the Hub:
“8-bit state dicts cannot currently be loaded directly into the 8-bit model after being pushed on the Hub. This is due to the fact that the statistics (remember weight.CB and weight.SCB) computed by the model are not currently stored or taken into account inside the state dict, and the Linear8bitLt module does not support this feature yet. We think that having the ability to save that and push it to the Hub might contribute to greater accessibility.”
It looks like the Facebook team recently dropped support for the bitsandbytes GitHub repo, possibly because they realized how many bad puns were about to be generated in their name. Hopefully more progress happens. I’ve tried loading GPT-J-6B with bf16 weights and DeepSpeed ZeRO stage 3 optimization, but then the fine tuning time increases by a factor of ~20. It’s still possible to use a non-Hugging-Face operation like torch.save() to store the monkey-patched model somewhere non-volatile, but loading and patching the model class from HF makes redistribution more cumbersome.
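Concretely, the stopgap looks something like this (the filename is a placeholder, and “model” stands for the monkey-patched, fine tuned GPT-J instance; the same patches have to be re-applied to a freshly loaded model before the state dict can go back in):
>>> import torch
>>> torch.save(model.state_dict(), "punbot_8bit_state_dict.pt")
>>> # later, after rebuilding and re-patching the model the same way:
>>> # model.load_state_dict(torch.load("punbot_8bit_state_dict.pt"))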
It’s a pity Facebook went dark on this, because 8-bit quantization appears to be a very efficient way to lossy-compress a pretrained model without losing much of the brainpower acquired during full-weight pretraining (a process which can run upwards of a million USD in compute alone). Models like the bf16 version mentioned above are supported by HF’s git system, and can be converted on load by toggling the “load_in_8bit” kwarg in the .from_pretrained() method.
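For completeness, the conversion that paragraph is describing looks roughly like this, assuming a transformers version recent enough to expose the kwarg plus the bitsandbytes and accelerate packages:
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained(
...     "EleutherAI/gpt-j-6B",
...     load_in_8bit=True,   # quantize the linear layers on load via bitsandbytes
...     device_map="auto",   # spread layers across whatever GPUs are visible
... )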
Now that the background is out of the way: the actual fine tuning procedure was very straightforward. I used various extra tokens to tag the start and end of each task. Some of the puns had human ratings and explanations. At the beginning of the most highly rated puns I put a unique token that I hoped might produce a sort of Greg Rutkowski effect by distinguishing good output from great output. I also grabbed a dictionary, converted it to Festival phonemes, matched that against a list of homophones, and appended pieces from both to pad out the smaller pun blocks. Any homophones appearing directly in the pun took precedence in padding, but I also mixed in random phoneme-grapheme conversion tasks unrelated to the original pun.
The result has been a robot that makes puns. Once the training is complete I’ll push the weights to Hugging Face manually.
TBC
Update 3/9/23: As promised, punbot is available for download.