Tuesday, January 22, 2013

When is a meat sandwich like a merchant? A Python joke generator


When is a meat sandwich like a merchant? When it is a burgher. 

Yes, you can groan but don't blame me, heckle the computer.

I enjoyed a recent New York Times piece, "A Motherboard Walks Into a Bar ...", on how and whether computers can learn what is or is not funny. I'm a big fan of groan-inducing puns and "physics particle X walks into a bar" type jokes. As I read the article, it occurred to me that there must be some simple lexical patterns that a computer could pick up on and use to auto-generate jokes. Consider the following:

What do you call a strange market? A bizarre bazaar.

That has the structure "What do you call a [Adjective1] [Noun1]? A [Adjective2] [Noun2]." where [Adjective2] and [Noun2] are homonyms of each other, [Adjective1] and [Adjective2] are synonyms, and [Noun1] and [Noun2] are synonyms.

(A homonym is a word pronounced the same as another but differing in meaning, whether spelled the same way or not. Example: hare and hair. Synonyms are two or more different words with the same meaning. Example: lazy and idle.)

If we take a look through a list of English homonyms, we can easily pick out such joke material:

suite: ensemble
sweet: sugary

leads to "What do you call a sugary ensemble? A sweet suite."

Similarly,
What do you call a breezy eagle's nest? An airy aerie.
What do you call a coarse pleated collar? A rough ruff.

Another structure is when the homonyms are both nouns:

stake: wooden pole
steak: slice of meat

leads to "When is a slice of meat like a wooden pole? When it is a stake."

(Slightly more complicated is "When is a car like a frog? When it is being toad.")

This suggests that we can easily auto-generate jokes such as these. So, let's do it.

First, I downloaded that homonym webpage and parsed the HTML using the Python BeautifulSoup library to extract the homonyms. One short function parses the HTML to obtain two homonyms and their short definitions, and for each homonym I call a second function, which calls an unofficial Google dictionary API to obtain the part of speech (noun, adjective, etc.) of the homonym. Running python extract_homonyms.py > processed_homonyms.txt produces a flat text file with six tab-separated fields per line: homonym1, definition1, pos1, homonym2, definition2, pos2
Here is the code.
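To make the file format concrete, here is a minimal sketch of one record and how it round-trips through a tab-separated line. The names HomonymRecord, format_record and parse_record are hypothetical and are not taken from extract_homonyms.py; the only thing carried over from the scripts is the six-field, tab-separated layout.

```python
from collections import namedtuple

# Hypothetical container for one line of processed_homonyms.txt.
HomonymRecord = namedtuple(
    "HomonymRecord", ["w1", "d1", "pos1", "w2", "d2", "pos2"]
)

def format_record(rec):
    """Serialize a record as one tab-separated line (no trailing newline)."""
    return "\t".join(rec)

def parse_record(line):
    """Parse one tab-separated line back into a HomonymRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 tab-separated fields, got %d" % len(fields))
    return HomonymRecord(*fields)

# Words and definitions come from the stake/steak example above;
# the part-of-speech tags are what the dictionary lookup would presumably return.
rec = HomonymRecord("stake", "wooden pole", "noun",
                    "steak", "slice of meat", "noun")
line = format_record(rec)
print(repr(line))
print(parse_record(line) == rec)
```

Checking the field count up front means a malformed line fails with a readable message instead of an unpacking error further down.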

With the hard work out of the way, generating the jokes is simple. A second short script, generate_jokes.py, produces two types of jokes: 1) one homonym is an adjective and the other is a noun, 2) both homonyms are nouns:

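def indefinite_article(w):
    # Return "a " or "an " for w, or "" if w already starts with an article.
    if w.lower().startswith("a ") or w.lower().startswith("an "): return ""
    return "an " if w.lower()[0] in 'aeiou' else "a "

def camel(s):
    # Capitalize the first character only; an empty string is returned unchanged.
    return s[:1].upper() + s[1:]

def joke_type1(d1,d2,w1,w2):
    # Adjective/noun joke: d1 and w1 are the adjective's definition and word,
    # d2 and w2 are the noun's.
    return "What do you call " + indefinite_article(d1) + d1 + " " + d2 + "? " + \
           camel(indefinite_article(w1)) + w1 + " " + w2 + "."

def joke_type2(d1,d2,w1,w2):
    # Noun/noun joke: the punchline is the second homonym, w2 (w1 is unused).
    return "When is " + indefinite_article(d1) + d1 + " like " + indefinite_article(d2) + d2 + "? " + \
           "When it is " + indefinite_article(w2) + w2 + "."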

with open("processed_homonyms.txt","r") as f:
    for line in f:
        # Each line holds six tab-separated fields.
        w1,d1,pos1,w2,d2,pos2 = line.strip().split("\t")
        if pos1=='adjective' and pos2=='noun':
            print(joke_type1(d1,d2,w1,w2))
        elif pos1=='noun' and pos2=='adjective':
            # Swap the arguments so the adjective always comes first.
            print(joke_type1(d2,d1,w2,w1))
        elif pos1=='noun' and pos2=='noun':
            # Two nouns give one joke in each direction.
            print(joke_type2(d1,d2,w1,w2))
            print(joke_type2(d2,d1,w2,w1))
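To see the two templates in action without the data file, here is a compact Python 3 sketch of the same logic applied to two in-memory records. It is a re-implementation for illustration, not the original script: article, with_article and jokes_from_line are hypothetical names, and the part-of-speech tags on the sample rows are assumed (the words and definitions are the sweet/suite and stake/steak examples from earlier in the post).

```python
def article(phrase):
    """Return 'a ' or 'an ', or '' if the phrase already starts with one."""
    lowered = phrase.lower()
    if lowered.startswith(("a ", "an ")):
        return ""
    return "an " if lowered[0] in "aeiou" else "a "

def with_article(phrase):
    """Prefix the phrase with its indefinite article."""
    return article(phrase) + phrase

def jokes_from_line(line):
    """Return the list of jokes (possibly empty) for one tab-separated record."""
    w1, d1, pos1, w2, d2, pos2 = line.strip().split("\t")
    if (pos1, pos2) == ("noun", "adjective"):
        # Reorder so that (w1, d1) is the adjective and (w2, d2) the noun.
        w1, d1, pos1, w2, d2, pos2 = w2, d2, pos2, w1, d1, pos1
    if (pos1, pos2) == ("adjective", "noun"):
        question = with_article(d1 + " " + d2)
        answer = with_article(w1 + " " + w2)
        return ["What do you call %s? %s." % (question, answer[0].upper() + answer[1:])]
    if (pos1, pos2) == ("noun", "noun"):
        template = "When is %s like %s? When it is %s."
        return [
            template % (with_article(d1), with_article(d2), with_article(w2)),
            template % (with_article(d2), with_article(d1), with_article(w1)),
        ]
    return []  # other part-of-speech combinations produce no joke

# Sample rows; the part-of-speech tags are assumed for illustration.
sample_lines = [
    "suite\tensemble\tnoun\tsweet\tsugary\tadjective",
    "stake\twooden pole\tnoun\tsteak\tslice of meat\tnoun",
]
for line in sample_lines:
    for joke in jokes_from_line(line):
        print(joke)
```

The first row yields the "sweet suite" joke, and the second yields one "When is ..." joke in each direction.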

When we run this, we output 493 wonderful, classy jokes (from just 70 lines of code). A few of my favorites are:
  • What do you call an accomplished young woman? A made maid.
  • When is a disparaging sounds from fans like a whiskey? When it is a booze.
  • When is a fish eggs like a seventeenth letter of Greek alphabet? When it is a rho.
  • When is a bench-mounted clamp like a bad habit? When it is a vice.
  • When is a fermented grape juice like an annoying cry? When it is a whine.
  • When is a location like a flounder? When it is a plaice.
  • What do you call a fake enemy? A faux foe.
  • What do you call a beloved Bambi? A dear deer.

Not bad, not bad, although even Carrot Top's career is probably safe with these.

This is the complete source code.

(Another potential joke pattern comes from "What is the difference between a pretty glove and a silent cat? One is a cute mitten, the other is a mute kitten." where we can observe a transposition of the first letters of two pairs of words. You can discern some other patterns in this joke generator site.)
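The letter-transposition idea can be sketched as a search over a word list: take two words, swap their first letters, and keep the pair if both results are also words. This is only a rough sketch with hypothetical function names and a hand-picked toy vocabulary. Note that it works on spelling, so it finds "dear beer" / "bear deer" but misses "cute mitten" / "mute kitten", where the swap is in sound (c versus k) rather than in letters; catching that would need pronunciation data.

```python
from itertools import combinations

def swap_initials(a, b):
    """Swap the first letters of two words: ('bear', 'deer') -> ('dear', 'beer')."""
    return b[0] + a[1:], a[0] + b[1:]

def letter_swap_pairs(vocabulary):
    """Find word pairs whose first letters can be swapped to give two other words.

    Assumes every word is non-empty. Each group of four words is reported once.
    """
    vocab = set(vocabulary)
    seen = set()
    results = []
    for a, b in combinations(sorted(vocab), 2):
        # Skip swaps that change nothing or that merely exchange the two words.
        if a[0] == b[0] or a[1:] == b[1:]:
            continue
        a2, b2 = swap_initials(a, b)
        key = frozenset((a, b, a2, b2))
        if a2 in vocab and b2 in vocab and key not in seen:
            seen.add(key)
            results.append(((a, b), (a2, b2)))
    return results

# Toy vocabulary chosen by hand for illustration.
words = ["dear", "beer", "bear", "deer", "cute", "mitten", "mute", "kitten"]
for original, swapped in letter_swap_pairs(words):
    print(" ".join(original), "<->", " ".join(swapped))
```

Turning a hit into a full joke would still need definitions for both phrases, as with the homonym jokes.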

So, we can conceive that a computer could be programmed with, or learn, the structure of jokes. This is a generative approach (e.g., Manurung et al.).

A second approach is to learn which jokes are considered funny by humans. Given a suitable corpus and a reasonable set of features, any number of classifiers could learn, at least statistically, to sort the funny from the unfunny (e.g., Kiddon & Brun, That's what she said detector).

Finally, given a set of jokes, a system could learn which are funny to you given some basic training. Jester is a system where you are asked to rate 10 jokes. After that, you are presented with a series of jokes that you are more likely to find funny than other jokes. In web terms, it is an old site with what amounts to an early recommender system (Goldberg et al. 2000).
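As a rough illustration of the "rate a few jokes, then get recommendations" idea, here is a minimal nearest-neighbour sketch. It is not the algorithm from Goldberg et al.; it is a much simpler similarity-weighted scheme, and the users, joke ids and ratings are made up purely for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity computed over the jokes both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[j] * v[j] for j in shared)
    norm_u = math.sqrt(sum(u[j] ** 2 for j in shared))
    norm_v = math.sqrt(sum(v[j] ** 2 for j in shared))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(new_user, others, top_n=1):
    """Rank jokes the new user has not rated by similarity-weighted ratings."""
    scores = {}
    for ratings in others.values():
        sim = cosine(new_user, ratings)
        if sim <= 0:
            continue  # ignore users with unrelated or opposite taste
        for joke, rating in ratings.items():
            if joke not in new_user:
                scores[joke] = scores.get(joke, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Made-up ratings on an arbitrary -10..+10 scale.
others = {
    "pun_lover": {"j1": 9, "j2": 8, "j3": -6, "j4": 7, "j5": -3},
    "pun_hater": {"j1": -8, "j2": -7, "j3": 9, "j4": -5, "j5": 6},
}
me = {"j1": 8, "j2": 6, "j3": -4}
print(recommend(me, others))
```

Here the new user's three ratings line up with pun_lover, so the joke that user liked most among the unrated ones is suggested first.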

One final joke from my code:

What do you call a least best sausage? A worst wurst.

Ba dum dum. Thanks, folks! I'll be here all week.



