Tuesday, January 22, 2013

When is a meat sandwich like a merchant? A Python joke generator


When is a meat sandwich like a merchant? When it is a burgher. 

Yes, you can groan but don't blame me, heckle the computer.

I enjoyed a recent New York Times piece, "A Motherboard Walks Into a Bar ...", on how and whether a computer can learn what is or is not funny. I'm a big fan of groan-inducing puns and "particle X walks into a bar" physics jokes. As I read the article, it occurred to me that there must be some simple lexical patterns that a computer could pick up on and use to auto-generate jokes. Consider the following:

What do you call a strange market? A bizarre bazaar.

That has the structure "What do you call a [Adjective1] [Noun1]? A [Adjective2] [Noun2]", where [Adjective2] and [Noun2] are homonyms, and [Adjective1]/[Adjective2] and [Noun1]/[Noun2] are synonym pairs.

(A homonym is a word pronounced the same as another but differing in meaning, whether or not it is spelled the same way. Example: hare and hair. Synonyms are two or more different words with the same meaning. Example: lazy and idle.)

If we take a look through a list of English homonyms, we can easily pick out such joke material:

suite: ensemble
sweet: sugary

leads to "What do you call a sugary ensemble? A sweet suite."

Similarly,
What do you call a breezy eagle's nest? An airy aerie.
What do you call a coarse pleated collar? A rough ruff.

Another structure is when the homonyms are both nouns:

stake: wooden pole
steak: slice of meat

leads to "When is a slice of meat like a wooden pole? When it is a stake."

(Slightly more complicated is "When is a car like a frog? When it is being toad.")

This suggests that we can easily auto-generate jokes such as these. So, let's do it.

First, I downloaded that homonym webpage and parsed the HTML using the Python BeautifulSoup library to extract the homonyms. There is one short function that parses the HTML to obtain each pair of homonyms and their short definitions, and for each homonym I call a second function which queries an unofficial Google dictionary API to obtain its part of speech (noun, adjective, etc.). Running python extract_homonyms.py > processed_homonyms.txt produces a flat, tab-separated text file with six pieces of information per line: homonym1, definition1, pos1, homonym2, definition2, pos2.
Here is the code.
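For a rough idea of what that step involves, here is a minimal sketch (the table layout of the homonym page and the get_part_of_speech helper are assumptions for illustration, not the actual markup or API call):

# Hypothetical sketch of extract_homonyms.py; the page markup and the
# part-of-speech lookup are assumptions, not the real script.
from bs4 import BeautifulSoup

def get_part_of_speech(word):
    # Stand-in for the unofficial Google dictionary API call, which is
    # undocumented; stubbed out here.
    return "noun"

def extract_pairs(html):
    soup = BeautifulSoup(html)
    for row in soup.find_all("tr"):          # assume one homonym pair per table row
        cells = [c.get_text(strip=True) for c in row.find_all("td")]
        if len(cells) == 4:                  # word1, definition1, word2, definition2
            w1, d1, w2, d2 = cells
            yield (w1, d1, get_part_of_speech(w1),
                   w2, d2, get_part_of_speech(w2))

for fields in extract_pairs(open("homonyms.html").read()):
    print "\t".join(fields)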

With the hard work out of the way, generating the jokes is simple. A second short script, generate_jokes.py, handles two types of jokes: 1) one homonym is an adjective and the other is a noun, 2) both homonyms are nouns:

def indefinite_article(w):
    # "a " or "an " depending on the first letter; "" if the phrase
    # already begins with an article.
    if w.lower().startswith("a ") or w.lower().startswith("an "): return ""
    return "an " if w.lower()[0] in list('aeiou') else "a "

def camel(s):
    # Capitalize the first character so the answer starts the sentence cleanly.
    return s[0].upper() + s[1:]

def joke_type1(d1,d2,w1,w2):
    # Adjective/noun pair: "What do you call a [def1] [def2]? A [word1] [word2]."
    return "What do you call " + indefinite_article(d1) + d1 + " " + d2 + "? " + \
           camel(indefinite_article(w1)) + w1 + " " + w2 + "."

def joke_type2(d1,d2,w1,w2):
    # Noun/noun pair: "When is a [def1] like a [def2]? When it is a [word2]."
    return "When is " + indefinite_article(d1) + d1 + " like " + indefinite_article(d2) + d2 + "? " + \
           "When it is " + indefinite_article(w2) + w2 + "."

data = open("processed_homonyms.txt","r").readlines()

for line in data:
    [w1,d1,pos1,w2,d2,pos2] = line.strip().split("\t")
    if pos1=='adjective' and pos2=='noun':
        print joke_type1(d1,d2,w1,w2)
    elif pos1=='noun' and pos2=='adjective':
        print joke_type1(d2,d1,w2,w1)
    elif pos1=='noun' and pos2=='noun':
        print joke_type2(d1,d2,w1,w2)
        print joke_type2(d2,d1,w2,w1)

When we run this, it outputs 493 wonderful, classy jokes (from just 70 lines of code). A few of my favorites:
  • What do you call an accomplished young woman? A made maid.
  • When is a disparaging sounds from fans like a whiskey? When it is a booze.
  • When is a fish eggs like a seventeenth letter of Greek alphabet? When it is a rho.
  • When is a bench-mounted clamp like a bad habit? When it is a vice.
  • When is a fermented grape juice like an annoying cry? When it is a whine.
  • When is a location like a flounder? When it is a plaice.
  • What do you call a fake enemy? A faux foe.
  • What do you call a beloved Bambi? A dear deer.

Not bad, not bad, although even Carrot Top's career is probably safe with these.

This is the complete source code.

(Another potential joke pattern comes from "What is the difference between a pretty glove and a silent cat? One is a cute mitten, the other is a mute kitten.", where we can observe a transposition of the first sounds of two pairs of words. You can discern some other patterns at this joke generator site.)
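For what it's worth, a rough letter-level sketch of that pattern follows. It only catches pairs where the swap works in spelling (e.g., "mad banner" / "bad manner" as an illustrative pair); the real pattern is phonetic, as cute/kitten shows, and building the full setup line would also need a thesaurus for the synonyms. The word-list path is an assumption.

# Letter-level approximation of the first-sound transposition pattern.
# The word-list path is an assumption; any plain-text dictionary will do.
words = set(w.strip().lower() for w in open("/usr/share/dict/words") if w.strip().isalpha())

def transpose(adj, noun):
    # Swap the first letters of the two words: ("mad", "banner") -> ("bad", "manner")
    return noun[0] + adj[1:], adj[0] + noun[1:]

def transposable(adj, noun):
    a2, n2 = transpose(adj, noun)
    return a2 in words and n2 in words

print transpose("mad", "banner")     # ('bad', 'manner')
print transposable("mad", "banner")  # True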

So, we can conceive that a computer could be programmed with, or learn, the structure of jokes. This is a generative approach (e.g., Manurung et al.).

A second approach is to learn which jokes are considered funny by humans. Given a suitable corpus and a reasonable set of features, any number of classifiers could learn, at least statistically, to sort the funny from the unfunny (e.g., Kiddon & Brun's "that's what she said" detector).

Finally, given a set of jokes, a system could learn which ones are funny to you, given some basic training. Jester is a system that asks you to rate 10 jokes; after that, you are presented with a series of jokes that you are more likely to find funny than the rest. In web terms, it is an old site with what amounts to an early recommender system (Goldberg et al. 2000).

One final joke from my code:

What do you call a least best sausage? A worst wurst.

Ba dum dum. Thanks, folks! I'll be here all week.




Sunday, January 6, 2013

What's the significance of 0.05 significance?


Why do we tend to use a statistical significance level of 0.05? When I teach statistics or mentor colleagues brushing up, I often get the sense that a statistical significance level of α = 0.05 is viewed as some hard and fast threshold, a publishable / not publishable step function. I've seen grad students finish up an empirical experiment and groan to find that p = 0.052. Depressed, they head for the pub. I've seen the same grad students extend their experiment just long enough for statistical variation to swing in their favor to obtain p = 0.049. Happy, they head for the pub. 

Clearly, 0.05 is not the only significance level used. 0.1, 0.01, and some smaller values are common too. This is partly related to the field. In my experience, the ecological literature and other fields that are often plagued by small sample sizes are more likely to use 0.1. Engineering and manufacturing, where larger samples are easier to obtain, tend to use 0.01. Most people in most fields, however, use 0.05. It is indeed the default value in most statistical software applications.

This "standard" 0.05 level is typically associated with Sir R. A. Fisher, a brilliant biologist and statistician that pioneered many areas of statistics, including ANOVA and experimental design. However, the true origins make for a much richer story.

Let's start, however, with Fisher's contribution. In Statistical Methods for Research Workers (1925), he states
The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.
The next year he states, somewhat loosely,
... it is convenient to draw the line at about the level at which we can say: "Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials."... 
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.

And there you have it. With no theoretical justification, these few sentences drove the standard significance level that we use to this day. 

Fisher was not the first to think about this, but he was the first to reframe it as a probability in this manner and the first to state the 0.05 value explicitly.

Those two z-values in the first quote, however, hint at a longer history and basis for the different significance levels that we know and love. Cowles & Davis (1982), "On the Origins of the .05 Level of Statistical Significance," describe a fascinating extended history that reads like a Who's Who of statistical luminaries: De Moivre, Pearson, Gosset (Student), Laplace, Gauss, and others.

Our story really begins in 1818 with Bessel, who coined the term "probable error" (well, at least its equivalent in German). The probable error is the semi-interquartile range. That is, ±1PE contains the central 50% of values and is roughly 2/3 of a standard deviation. So, for a uniform distribution ±2PE contains all values, but for a standard normal it contains only the central 82% of values. Finally, and crucially to our story,
  • ±3PE contains the central ~95% of values. 1 - 0.95 = 0.05
  • People like Quetelet and Galton had tended to express variation or errors outside some typical range in terms of ±3PE, even after Pearson coined the term standard deviation. 

There you have the basis of 0.05 significance: ±3PE was in common use in the late 1890s, and it translates to roughly 0.05. A probability of 1 in 20 is easier for most people to interpret than a z-value of 2 or a multiple of PE (Cowles & Davis, 1982), which explains why 0.05 became the more popular way of expressing the criterion.
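A quick numerical check of that arithmetic, as a sketch assuming SciPy is available:

# Coverage of +/- k probable errors under a normal distribution (SciPy assumed).
from scipy.stats import norm

pe = norm.ppf(0.75)          # probable error is ~0.6745 standard deviations
for k in (1, 2, 3):
    coverage = norm.cdf(k * pe) - norm.cdf(-k * pe)
    print "+/- %dPE covers %.1f%% of values" % (k, 100 * coverage)

# +/- 1PE covers 50.0% of values
# +/- 2PE covers 82.3% of values
# +/- 3PE covers 95.7% of values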

In one paper from the 1890s, Pearson remarks on the different p-values he obtained:

p = 0.5586 --- "thus we may consider the fit remarkably good"
p = 0.28 --- "fairly represented"
p = 0.1 --- "not very improbable that the observed frequencies are compatible with a random sampling"
p = 0.01 --- "this very improbable result"

and here we see the start of the different significance levels: 0.1 is still somewhat probable, 0.01 is very improbable, and 0.05 rests between the two.

Despite this, ±3PE continued to be used as the primary criterion up to the 1920s and is still used in some fields today, especially in physics. It was Fisher who rounded off the probability to 0.05, which in turn switched the criterion from a clean ±2σ to ±1.96σ.

In summary, ±3PE --> ±2σ --> ±1.96σ --> α = 0.05 more accurately describes the evolution of statistical significance.
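Putting rough numbers on that chain (again a sketch, assuming SciPy):

from scipy.stats import norm

pe = norm.ppf(0.75)
print 3 * pe                       # ~2.02 sigma: the old +/-3PE criterion
print 2 * (1 - norm.cdf(3 * pe))   # ~0.043: its two-tailed probability
print 2 * (1 - norm.cdf(2.0))      # ~0.0455 for a clean +/-2 sigma
print norm.ppf(1 - 0.05 / 2)       # 1.96: the cutoff once we round to alpha = 0.05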