Tuesday, February 19, 2013

When tf*idf and cosine similarity fail


In this post I cover two edge cases of cosine similarity with tf*idf weights, i.e. cases that don't produce the cosine similarity values that intuition and common sense say they should. 

In information retrieval, tf*idf forms the basis of scoring documents for relevance when querying a corpus, as in a search engine. It is the product of two terms: term frequency and inverse document frequency.

Term frequency is the frequency of some term in the document, typically an absolute count or relative frequency. Documents with more mentions of a term are more likely to be relevant with respect to that term. For example, when querying for "dog," a document about caring for your dog which mentions "dog" 46 times is more likely to be relevant than a document with a single mention of "the dog days of summer."

Inverse document frequency (IDF) measures the dispersion of that term across the corpus. If every document contains "the," then "the" is not a particularly discriminating word. IDF is the ratio of the corpus size to the number of documents containing that term. The smaller the proportion of documents containing that term, the higher the magnitude of this metric. (In reality, we take the log of the ratio. That is, idf = log(N/n_i)).

These two measures quantify the frequency within a document and the relative rarity across the corpus. Taking the product, we arrive at a simple, satisfyingly intuitive but surprisingly powerful metric for scoring documents. For each term t in each document d in some corpus D we can compute the tf*idf score. Let's call this tfidf(t,d).

We rarely query a corpus for a single term. Instead, we have a query q consisting of multiple terms. Now we want to compute the similarity between the query q and each document d in the corpus. For this, we tend to use something called cosine similarity.

This is a measure of the angle between two vectors:

similarity 
    = cos(a,b) 
    = dotproduct(a,b) / ( norm(a) * norm(b) ) 
    = a.b / ( ||a|| * ||b|| )

[Definition: if a = (a1,a2,...,an) and b = (b1,b2,...,bn) then a.b = a1*b1 + a2*b2 + ... + an*bn 
and ||a|| = sqrt(a1^2 + a2^2 + ... + an^2) and ||b|| = sqrt(b1^2 + b2^2 + ... + bn^2). ]

The smaller the angle, the more similar are the two vectors.

In this case, the variables of a and b are the set of unique terms in q and d. For example, when q = "big red balloon" and d ="small green balloon" then the variables are (big,red,balloon,small,green) and a = (1,1,1,0,0) and b = (0,0,1,1,1).
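For concreteness, here is the cosine similarity of those two vectors computed in a few lines of Python (a quick sketch using presence/absence weights, before any tf*idf weighting):

```python
import math

def cosine(a, b):
    # dot product divided by the product of the two norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# variables: (big, red, balloon, small, green)
a = [1, 1, 1, 0, 0]  # q = "big red balloon"
b = [0, 0, 1, 1, 1]  # d = "small green balloon"

print(cosine(a, b))  # 1/3 -- one shared term, "balloon"
```

The single shared term contributes 1 to the dot product, and each vector has norm sqrt(3), giving 1/3.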

Not all words are created equal. Some are more important than others when computing similarity. Rather than use the count or the presence/absence of each term, we can use a weight. For example, we can give a lower weight to common words. What would make a suitable weighting? tf*idf of course. Putting this all together,

similarity(q,d) = a.b / ( ||a|| * ||b|| ) 

where

a = (
    tfidf("big",q), 
    tfidf("red",q), 
    tfidf("balloon",q), 
    tfidf("small",q),
    tfidf("green",q)
   )

and

b = (
    tfidf("big",d),
    tfidf("red",d), 
    tfidf("balloon",d),
    tfidf("small",d), 
    tfidf("green",d)
   ).

While cosine similarity with tf*idf works well, really well, there are a couple of edge cases where it fails, corner cases that don't seem to be covered in most introductory explanations and tutorials.

FAIL 1: imagine you have a corpus D consisting of one document d. You come along with a query q where q == d. That is, the corpus has exactly what you are looking for. Intuition says that cosine similarity should be 1 because q == d. So, what do we get? Not 1, when you use tf*idf weights. The tf*idf of each term of d will be zero--each term of d is in all documents (because D == d), so every idf is log(1) = 0. Therefore, the dot product is zero, but the norms of the two vectors are also zero, and computing the similarity triggers a division by zero error. In summary, similarity is 0/0 and so undefined. 

FAIL 2: imagine you have a corpus with two documents, d_1 = "blue bag" and d_2 = "green bag". What is their similarity? Intuition says there are some similarities between them, they both contain "bag," but there are some differences: "blue" vs "green". Thus we should get a cosine similarity somewhere between 0 and 1. Wrong! Tf*idf for "bag," the common term, is zero because its IDF is zero. "blue" is not a shared term, so its term of the dot product is zero, as is the term for "green." In other words, where the documents differ they contribute zero terms to the dot product, and where they are similar, those terms convey no information whatsoever and so also contribute zeros.
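Both failures are easy to reproduce. Below is a minimal sketch that builds tf*idf vectors by hand, using raw counts for tf and idf = log(N/n_i) as defined above, and hits each failure in turn:

```python
import math

def tfidf_vectors(docs):
    # docs is a list of token lists; returns one weight vector per document
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    def idf(t):
        return math.log(float(n) / sum(1 for d in docs if t in d))
    return [[d.count(t) * idf(t) for t in vocab] for d in docs]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norms  # raises ZeroDivisionError when either norm is 0

# FAIL 1: one-document corpus queried with itself -> every idf is log(1) = 0
(v,) = tfidf_vectors([["dog", "care"]])
print(v)  # [0.0, 0.0]; cosine(v, v) would raise ZeroDivisionError

# FAIL 2: the shared term has zero weight; the distinct terms never overlap
v1, v2 = tfidf_vectors([["blue", "bag"], ["green", "bag"]])
print(cosine(v1, v2))  # 0.0, despite the shared "bag"
```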

While these two scenarios may seem contrived, I encountered them while writing unit tests, where I wanted to use the smallest corpora possible to test my code. It seems that one needs three distinct documents to avoid the problems above, or else your code must handle a NaN.

I use tf*idf and cosine similarity frequently. They can get you far at little cost (if your documents are not enormous). They do have a big limitation though: this is a "bag of words" model, meaning it does not consider word order. In many cases, word order matters a lot---a red couch with gold legs is very different from a gold couch with red legs. What one can do is use the fast and cheap cosine similarity with tf*idf weights to narrow down a larger corpus to a smaller subset of documents, and then run a more computationally expensive, more domain-specific model or algorithm that does consider word order on that subset.


Thursday, February 14, 2013

20 Tips and tricks working with data


In this post I lay out a number of tips, tricks and suggestions for working with data, from collection to storage to analysis. The main aim is to help students and other young analysts and data scientists. However, they apply at all ages and levels of experience. Indeed, while many of the following ideas may sound obvious and trivial, it is surprising how often I see the counter-examples and anti-patterns. 

This is almost certainly a biased selection based on my personal experience over the years. Thus, if you have any additional comments or tips, feel free to add those in the comments.

This section is about data itself.

1) Inspect your data
When you receive a dataset, open a text editor and inspect the data. Don't assume all is present and correct. Page through the data quickly to get a sense of columns that are mostly NULL, data that are zero or negative, the ranges of dates and so on. It is very easy to make mistakes when reformatting data and data entry people are often low-paid, low-skilled workers. Assume that there is something wrong with your data. In *nix, use the head and more commands to get a sense of the data.

2) Think through what you would expect.
Related to the above, it helps to have an expectation of what you might find in each column. This will help in spotting errors. For instance, would you expect to find negative values? I once worked with a government-recorded health dataset that included people's heights, something one might expect would be easy to measure and record. However, we found a chunk of people listed as 5 inches tall. Likely, they should have been 5 feet tall, but such outliers are easy to spot with a frequency histogram. In R, get in the habit of calling summary() on your dataset, which provides a 5-number summary of each variable. Plot univariate distributions, and in R pairs(), a function that produces a scatter plot of each pair of variables, can be very valuable.
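As an illustration of the 5-inches-tall problem, here is a sketch using pandas, whose describe() is a rough analogue of R's summary() (the names and numbers below are made up):

```python
import pandas as pd

# hypothetical heights column, in inches, where two "5 feet" entries
# were mistakenly recorded as 5 inches
heights = pd.Series([65, 70, 62, 5, 68, 5, 71, 64], name="height_in")

print(heights.describe())          # a min of 5 immediately looks suspicious

suspects = heights[heights < 24]   # nobody is under two feet tall
print(suspects.tolist())           # [5, 5]
```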

3) Check your units.
It may sound obvious but check your units. CNN reports 
NASA lost a $125 million Mars orbiter because a Lockheed Martin engineering team used English units of measurement while the agency's team used the more conventional metric system for a key spacecraft operation.
Yeah, it's that important.

4) Plain text > binary. 
Plain text data files are always better than binaries. You get to see the raw data, to see how columns are delimited and to see which line ending types are used. Any text editor can be used to view the data. With a simple text file, it can be easy to write a python script to iterate through the data. 

5) Structured > unstructured.
While there has been a recent focus on NoSQL technologies that store schemaless documents, if data are or should be well-structured or have a strict, static schema, keep them that way. It will make querying, slicing and dicing much easier later. 

6) Clean > more > less
Much of the time building models, classifiers and so on is spent cleaning data. It takes a lot of time and can be depressing to see huge swathes of data thrown away or excluded. It is better to start with clean data. This means having quality control on data entry. Run validators on HTML forms where possible. If you can't get clean data, at least get a lot of it (assuming it is a well-structured experiment). Clean data are better than more data which is better than less data.

7) Track data provenance
Provenance means where data came from. In many cases, provenance is explicit, as data come from a single source or there is some source/vendor code or name. Imagine though that you create a corpus for a model in which you combine data from multiple sources. If you later find a problem with one of those sources, you will likely want to back out those data or weight them less. If your corpus is just a big set of numbers and you don't know which data came from where, you now have polluted, low-quality data.

8) Be honest about data loss acceptability
When processing data, there can be problems especially when working with remote services. Servers can be down, databases connection can momentarily drop etc. Think through how important it is to process each item. If you are a bank or many other transaction based businesses, everything must work perfectly. This though comes at a high cost. However, if you are running say sentiment analysis on a twitter stream, or running some summary metrics on historical data, a small data loss may be perfectly acceptable. Will the extra cost of getting every last tweet make a big difference to your analysis?

9) Embrace the *nix command line.
It is hard to stress this enough. Learn to love the command line. While the commands may seem terse and cryptic, you will be amazed how much more productive you can be with just a few basic commands such as:
  • wc -l filename: returns the number of lines in the file
  • head -n N filename: shows the first N lines of a file
  • grep searchterm filename: filters to the lines containing this search term
10) Denormalize data for warehousing
There are very good reasons for normalized database tables. A well-designed schema will be flexible for future data and grow by rows, not columns, over time. For warehousing, where data are stored longer term, it can often be a good idea to denormalize data. This can make the provenance more explicit, can make it easier to split data into chunks, and is often easier for analysts to work with as you can avoid costly, complex joins.

This section refers mostly to processing data with scripts

11) Learn a scripting language
Python, ruby, it doesn't really matter which but pick one. You want a language with at least simple I/O, a range of data structures and libraries to make HTTP requests. You will end up doing a lot of file manipulation in your career and so you need a good general tool to automate tasks. Also, some analyses, especially involving time series of individuals, are far easier to do in scripting code than with SQL self-join queries.

12) Comment your code
This is a general tip that applies to all source code, not just scripts for analyzing data. When your head is in a problem there are subtle nuances about why you're doing something the way that you are. You'll forget that in a week, especially if you work on multiple projects simultaneously. Comment on why you are taking a certain approach or why this is an edge case. Focus less on what you are doing---ideally your code will be readable enough to show that---but a few such comments, in combination with well chosen function and variable names, will help you, or someone else, get into the code later. Comment for your future self.

13) Use source control
Again, a general coding tip. Source control (CVS, SVN, git, mercurial etc.) is indispensable. First, it is a form of backup. Second, it allows one to share code and work collaboratively. Third, it gives one the freedom to experiment and try things out without fear of losing the last working iteration--you can always revert to the old code if it doesn't work out. Don't leave home without it. Also, always, always write a commit message when committing. Like code comments, commit messages should focus more on what the new code does rather than how it works. Again, comment for your future self.

14) Automate
This is obvious, but it is a lesson that most people learn only after they have had to do a task manually several times when they thought it was a one-off. If there is any chance that you might need to do a task twice, it probably means you will actually need to do it five times. Automate the task where possible. This might mean a python script, a SQL stored procedure, or a cron job, but in most cases it is worth it. Even if only some of the steps are automated, it can make a huge difference.

15) Create services
In years of creating data-related tooling, I've found that people always use the tools in unexpected ways. Make the tools as flexible as possible. Doing so will provide an ecosystem of functionality. Create top-level functions that serve 80% of uses as well as provide access to the lower-level functionality. Where possible, expose the functionality as web services that can talk to each other in a simple, clean manner. In many cases, JSON is preferable to XML or SOAP. 

16) Don't obsess over big data
I might get lynched for this one but there is a lot of hype over big data as service companies vie to take a slice of the analytics pie. While big data certainly has a place, many people who think they need big data don't have big data and don't need big data technologies. If you work in megabytes or gigabytes, it is not big. Take a deep breath and answer honestly: do I have big data, or do I need to scale to huge proportions; do I have a problem that can be parallelized well (some problems can't); do I need to process all my data, or can I sample? I should stress that I am in no way against big data approaches---I use them myself---but they are not a panacea. Other, simpler approaches may get you what you need.

17) Be honest about required latency
In most websites, data are replicated and warehoused, and analytics are performed offline. The latency for analysis can often be hours or a day without any real impact on the business. There are cases where more real-time analysis is required, for instance collaborative filtering or computing real-time twitter trends. Such short latencies come at a cost, and significantly more so the shorter the latency. Don't waste money creating a system capable of 1-second sales-data latency if you only analyze it in batch mode overnight.

I am hesitant to dip into data visualization as that is a whole post in itself but I will include the following two:

18) Use color wisely
While neon colors might be in vogue at American Apparel this season, that doesn't mean I want to see them in your charts. Take note of the default color palettes in programs such as Excel and libraries such as ggplot2. A lot of thought, research and experience has gone into those. If you break the rules, do it for good reason and to good effect. P.S. don't use yellow text on a white background. While it might look good on your laptop screen, in almost all cases it is unreadable when projected.

19) Declutter charts and visualizations
When producing charts, infographics and other data visualizations, think very carefully about what you want the reader to take away from the data. Focus on that and make that message shine through with the least amount of reader effort. Think about information density. If I need a magnifying glass to decipher all the details, you're doing it wrong. If a chart has a lot of textual annotations, you are likely doing it wrong. If you need 6 charts to convey one main point, you are probably missing some key summary metric. Use larger fonts than you might think if you are projecting to a large room. (Where possible, actually stand at the back of the room and check readability of your presentation.) You would do well to read Edward Tufte's earlier books.

20) Use # dashboard users as a metric in your dashboards
In his Foundation video series, Kevin Rose interviewed Jack Dorsey (CEO of Square and Executive Chairman of Twitter) [video]. Dorsey made a very interesting point about how he has tried to make Square data-centric. A dashboard is useless if no one is using it. How valuable a dashboard or other metric is should be reflected in the dashboard itself: use the number of users as a metric in the dashboard itself. I would add that such metrics should be honest signals. That is, if reports are emailed out to recipients, the number of recipients is not necessarily a true reflection of use. Perhaps most people don't open it. Email open rate, however, is a more reliable and reflective indicator. 

There you go. A few tips and suggestions for working with data. Let me know what you think and what I missed.

Tuesday, January 22, 2013

When is a meat sandwich like a merchant? A python joke generator


When is a meat sandwich like a merchant? When it is a burgher. 

Yes, you can groan but don't blame me, heckle the computer.

I enjoyed a recent New York Times piece, "A Motherboard Walks Into a Bar ...," on how and whether computers can learn what is or is not funny. I'm a big fan of groan-inducing puns and "Physics particle X walks into a bar" type jokes. As I read the article, it occurred to me that there must be some simple lexical patterns that a computer could pick up on and use to auto-generate jokes. Consider the following:

What do you call a strange market? A bizarre bazaar.

That has the structure "What do you call a [Adjective1] [Noun1]? A [Adjective2] [Noun2]" where [Adjective2] and [Noun2] are homonyms and [Adjective1] and [Adjective2] and [Noun1] and [Noun2] are synonym pairs.

(A homonym is a word pronounced the same as another but differing in meaning, whether spelled the same way or not. Example: hare and hair. Synonyms are two or more different words with the same meaning. Example: lazy and idle.)

If we take a look through a list of English homonyms, we can easily pick out such joke material:

suite: ensemble
sweet: sugary

leads to "What do you call a sugary ensemble? A sweet suite."

Similarly,
What do you call a breezy eagle's nest? An airy aerie.
What do you call a coarse pleated collar? A rough ruff.

Another structure is when the homonyms are both nouns:

stake: wooden pole
steak: slice of meat

leads to "When is a slice of meat like a wooden pole? When it is a stake."

(Slightly more complicated is "When is a car like a frog? When it is being toad.")

This suggests that we can easily auto-generate jokes such as these. So, let's do it.

First, I downloaded that homonym webpage and parsed the HTML using the python BeautifulSoup library to extract the homonyms. There is one short function to parse the HTML and obtain two homonyms and their short definitions; for each homonym I call a second function, which queries an unofficial google dictionary API to obtain the part of speech (noun, adjective etc.) of the homonym. Running  python extract_homonyms.py > processed_homonyms.txt  produces a flat, tab-delimited text file with six pieces of information per line: homonym1, definition1, pos1, homonym2, definition2, pos2
Here is the code.

With the hard work out of the way, generating the jokes is simple. A second short script, generate_jokes.py, has two types of jokes: 1) one homonym is an adjective and the other is a noun, 2) both homonyms are nouns: 

def indefinite_article(w):
    if w.lower().startswith("a ") or w.lower().startswith("an "): return ""
    return "an " if w.lower()[0] in list('aeiou') else "a "

def camel(s):
    return s[0].upper() + s[1:]

def joke_type1(d1,d2,w1,w2):
    return "What do you call " + indefinite_article(d1) + d1 + " " + d2 + "? " + \
           camel(indefinite_article(w1)) + w1 + " " + w2 + "."

def joke_type2(d1,d2,w1,w2):
    return "When is " + indefinite_article(d1) + d1 + " like " + indefinite_article(d2) + d2 + "? " + \
           "When it is " + indefinite_article(w2) + w2 + "."

data = open("processed_homonyms.txt","r").readlines()

for line in data:
     [w1,d1,pos1,w2,d2,pos2]=line.strip().split("\t")
     if pos1=='adjective' and pos2=='noun':
         print joke_type1(d1,d2,w1,w2)
     elif pos1=='noun' and pos2=='adjective':
         print joke_type1(d2,d1,w2,w1)
     elif pos1=='noun' and pos2=='noun':
         print joke_type2(d1,d2,w1,w2)
         print joke_type2(d2,d1,w2,w1)

When we run this, we output 493 wonderful, classy jokes (from just 70 lines of code). A few of my favorites are:
  • What do you call an accomplished young woman? A made maid.
  • When is a disparaging sounds from fans like a whiskey? When it is a booze.
  • When is a fish eggs like a seventeenth letter of Greek alphabet? When it is a rho.
  • When is a bench-mounted clamp like a bad habit? When it is a vice.
  • When is a fermented grape juice like an annoying cry? When it is a whine.
  • When is a location like a flounder? When it is a plaice.
  • What do you call a fake enemy? A faux foe.
  • What do you call a beloved Bambi? A dear deer.

Not bad, not bad, although even Carrot Top's career is probably safe with these.

This is the complete source code.

(Another potential joke pattern comes from "What is the difference between a pretty glove and a silent cat? One is a cute mitten, the other is a mute kitten." where we can observe a transposition of the first letters of two pairs of words. You can discern some other patterns in this joke generator site.)

So, we can conceive that a computer could be programmed with, or learn, the structure of jokes. This is a generative approach (e.g., Manurung et al.).

A second approach is to learn which jokes are considered funny by humans. Given a suitable corpus and a reasonable set of features, any number of classifiers could learn, at least statistically, to sort the funny from the unfunny (e.g., Kiddon & Brun, That's what she said detector).

Finally, given a set of jokes, a system could learn which are funny to you given some basic training. Jester is a system where you are asked to rate 10 jokes. After that, you are presented with a series of jokes that you are more likely to find funny than other jokes. In web terms, it is an old site with what amounts to an early recommender system (Goldberg et al. 2000).

One final joke from my code:

What do you call a least best sausage? A worst wurst.

Ba dum dum, Thanks, folks! I'll be here all week.




Sunday, January 6, 2013

What's the significance of 0.05 significance?


Why do we tend to use a statistical significance level of 0.05? When I teach statistics or mentor colleagues brushing up, I often get the sense that a statistical significance level of α = 0.05 is viewed as some hard and fast threshold, a publishable / not publishable step function. I've seen grad students finish up an empirical experiment and groan to find that p = 0.052. Depressed, they head for the pub. I've seen the same grad students extend their experiment just long enough for statistical variation to swing in their favor to obtain p = 0.049. Happy, they head for the pub. 

Clearly, 0.05 is not the only significance level used. 0.1, 0.01 and some smaller values are common too. This is partly related to field. In my experience, the ecological literature and other fields that are often plagued by small sample sizes are more likely to use 0.1. Engineering and manufacturing where larger samples are easier to obtain tend to use 0.01. Most people in most fields, however, use 0.05. It is indeed the default value in most statistical software applications.

This "standard" 0.05 level is typically associated with Sir R. A. Fisher, a brilliant biologist and statistician who pioneered many areas of statistics, including ANOVA and experimental design. However, the true origins make for a much richer story.

Let's start, however, with Fisher's contribution. In Statistical Methods for Research Workers (1925), he states
The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.
The next year he states, somewhat loosely,
... it is convenient to draw the line at about the level at which we can say: "Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials."... 
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.

And there you have it. With no theoretical justification, these few sentences drove the standard significance level that we use to this day. 

Fisher was not the first to think about this but he was the first to reframe it as a probability in this manner and the first to state this 0.05 value explicitly. 

Those two z-values in the first quote, however, hint at a longer history behind the different significance levels that we know and love. Cowles & Davis (1982), On the Origins of the .05 Level of Statistical Significance, describe a fascinating extended history which reads like a Who's Who of statistical luminaries: De Moivre, Pearson, Gosset (Student), Laplace, Gauss and others. 

Our story really begins in 1818 with Bessel, who coined the term "probable error" (well, at least its German equivalent). The probable error is the semi-interquartile range. That is, ±1PE contains the central 50% of values, and 1PE is roughly 2/3 of a standard deviation. So, for a uniform distribution ±2PE contains all values, but for a standard normal it contains only the central 82% of values. Finally, and crucially for our story,
  • ±3PE contains the central ~95% of values. 1 - 0.95 = 0.05
  • People like Quetelet and Galton had tended to express variation or errors outside some typical range in terms of ±3PE, even after Pearson coined the term standard deviation. 
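The arithmetic is easy to check with Python's standard library (statistics.NormalDist, available since Python 3.8; this is my illustration, not part of the historical record):

```python
from statistics import NormalDist

z = NormalDist()       # standard normal
pe = z.inv_cdf(0.75)   # one probable error, ~0.6745 standard deviations

def central_coverage(k):
    # fraction of a standard normal falling within +/- k
    return z.cdf(k) - z.cdf(-k)

print(round(pe, 4))                        # 0.6745
print(round(central_coverage(3 * pe), 3))  # 0.957, the central ~95%
print(round(central_coverage(1.96), 3))    # 0.95
```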

There you have the basis of 0.05 significance: ±3PE was in common use in the late 1890s, and this translates to 0.05. "1 in 20" is easier for most people to interpret than a z value of 2 or a multiple of PE (Cowles & Davis, 1982), which explains why 0.05 became more popular. 

In one paper from the 1890s, Pearson remarks on different p-values obtained as

p = 0.5586 --- "thus we may consider the fit remarkably good"
p = 0.28 --- "fairly represented"
p = 0.1 --- "not very improbable that the observed frequencies are compatible with a random sampling"
p = 0.01 --- "this very improbable result"

and here we see the start of different significance levels. 0.1 is a little probable and 0.01 very improbable. 0.05 rests between the two.

Despite this, ±3PE continued to be used as the primary criterion up to the 1920s and is still used in some fields today, especially in physics. It was Fisher who rounded off the probability to 0.05, which in turn switched the criterion from a clean ±2σ to ±1.96σ.

In summary, ±3PE --> ±2σ --> ±1.96σ --> α = 0.05 more accurately describes the evolution of statistical significance.

Monday, December 31, 2012

Howto: build a news aggregator (in 100 LOC)


In this post, I'll cover how to build a crude, simple but working news aggregator in about 100 lines of python (source code). A complete news aggregator comprises at least three steps:
  • filter: select a subset of stories of interest, i.e. the top news -- e.g., that Zynga's shuttering of several games will be one of the stories most interesting to Techmeme's readership in the screenshot below
  • aggregate: cluster like documents -- e.g., that Techcrunch and GamesIndustry International are both covering the same story
  • choose: choose the best representative story -- e.g., that the Techcrunch story should be highlighted over GamesIndustry International's story.
In this article, we will focus on the second step only: aggregation of similar articles. 


Let us state the goal explicitly: given multiple sources of documents, such as RSS news feeds, cluster together similar documents that cover the same material. For instance, both Reuters and Associated Press might cover Hillary Clinton's admission to hospital (a major story at the time of writing). In the Google News screenshot below, we can see that USA Today, Detroit Free Press, Boston Herald and at least 10 other news organizations covered that story. We wish to recognize that these are the same underlying story and cluster, or aggregate, them.


My approach was the following:
  • parse RSS feeds
  • extract keywords using TF-IDF (i.e., term frequency * inverse document frequency)
  • union those to a set of features that encompasses the whole corpus
  • create a feature vector for each document
  • compute matrix of cosine similarities
  • run hierarchical clustering and plot dendrogram
  • extract final clusters of interest
I'm not claiming this is an optimal scheme or that it will necessarily scale well, and I did not focus on vectorized operations. It does, however, produce non-embarrassing results in 100 lines of code. Later, we will discuss other approaches, including the algorithm-versus-human debate.
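As an aside, the same pipeline can be sketched in a few lines with scikit-learn and SciPy (this is not the code used in this post, and the headlines below are made up; the hand-rolled version that follows shows what these libraries hide):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

# stand-in headlines: two cover the same story, one is unrelated
docs = [
    "Clinton admitted to hospital with blood clot",
    "Hillary Clinton hospitalized over blood clot",
    "Zynga shutting down several games",
]

# tf*idf feature vectors over the corpus vocabulary
X = TfidfVectorizer().fit_transform(docs).toarray()

# average-linkage hierarchical clustering on cosine distances,
# cutting the dendrogram at a distance threshold of 0.8
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=0.8, criterion="distance")

print(labels)  # the two Clinton stories share a cluster label
```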

Let's build an aggregator. To start, we will need some data sources. I chose a handful of RSS feeds about technology and IT news:

feeds = [
    'http://www.sfgate.com/rss/feed/Tech-News-449.php',
    'http://feeds.feedburner.com/TechCrunch/startups',
    'http://news.cnet.com/8300-1001_3-92.xml',
    'http://www.zdnet.com/news/rss.xml',
    'http://www.computerweekly.com/rss/Latest-IT-news.xml',
    'http://feeds.reuters.com/reuters/technologyNews',
    'http://www.tweaktown.com/news-feed/'
]

(We can assume that these sources have already mostly accomplished the first news aggregator step: filtering.)

Next, we need to parse each source. I decided to use the document title and description, which is a shortish summary. I used the feedparser python library and the natural language processing library NLTK. For each document, I concatenate the title and description, convert to lowercase, and remove one-character words. I store these final documents in a corpus and also store the titles to make the final results more interpretable:

import feedparser
import nltk

corpus = []
titles = []
ct = -1
for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        # Strip HTML, tokenize description and title, lowercase,
        # and drop one-character words.
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])

Here is a sample of those data (titles only):

0 TITLE Tweeters say the dumbest things
1 TITLE IFixit thrives by breaking tech items down
2 TITLE Earthcam's New Year's Live 2013
3 TITLE Content Fleet offers publishers hot tips
...
96 TITLE Sony stops shipments of PS2 console in Japan
97 TITLE RIM's first patent payment to Nokia: $65 million
98 TITLE ZTE aiming for thinner 5-inch 1080p-capable smartphone, Grand S has grand dreams
99 TITLE Japanese security firm to take to the sky in 2014, will unleash crime-fighting drones

We will extract keywords using TF-IDF. I modified the following from this post:

import math
from operator import itemgetter

def freq(word, document): return document.count(word)
def wordCount(document): return len(document)

def numDocsContaining(word, documentList):
  count = 0
  for document in documentList:
    if freq(word, document) > 0:
      count += 1
  return count

# tf: relative frequency of the word within the document.
def tf(word, document): return (freq(word,document) / float(wordCount(document)))
# idf: log of corpus size over the number of documents containing the word.
# The float cast matters: without it, Python 2 integer division distorts the ratio.
def idf(word, documentList): return math.log(len(documentList) / float(numDocsContaining(word,documentList)))
def tfidf(word, document, documentList): return (tf(word,document) * idf(word,documentList))
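As a quick sanity check, here is the same scoring on a toy corpus (a self-contained sketch with made-up documents and a local helper; note the float casts to avoid integer division):

```python
import math

# Toy corpus: each document is a list of lowercased tokens (made up).
toy_corpus = [['dog', 'dog', 'care'],
              ['cat', 'care'],
              ['dog', 'days', 'of', 'summer']]

def toy_tfidf(word, document, corpus):
    # tf: relative frequency of the word within the document.
    tf = document.count(word) / float(len(document))
    # idf: log of corpus size over number of documents containing the word.
    n_containing = sum(1 for d in corpus if word in d)
    return tf * math.log(len(corpus) / float(n_containing))

# "dog" is frequent in doc 0 and appears in 2 of 3 documents.
print(toy_tfidf('dog', toy_corpus[0], toy_corpus))   # (2/3) * log(3/2) ~ 0.27
# "care" also appears in 2 of 3 documents but only once in doc 0.
print(toy_tfidf('care', toy_corpus[0], toy_corpus))  # (1/3) * log(3/2) ~ 0.14
```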

I only want to extract a few top keywords from each document and I'll store the set of those keywords, across all documents, in key_word_list:

import operator

def top_keywords(n, doc, corpus):
    d = {}
    for word in set(doc):
        d[word] = tfidf(word, doc, corpus)
    sorted_d = sorted(d.iteritems(), key=operator.itemgetter(1), reverse=True)
    return [w[0] for w in sorted_d[:n]]

key_word_list = set()
nkeywords = 4
for doc in corpus:
    for word in top_keywords(nkeywords, doc, corpus):
        key_word_list.add(word)
   
ct=-1
for doc in corpus:
   ct+=1
   print ct,"KEYWORDS"," ".join(top_keywords(nkeywords,doc,corpus))

Here is a sample of those data:

0 KEYWORDS she tweeted athletes south
1 KEYWORDS ifixit repair devincenzi items
2 KEYWORDS earthcam live famed sound
3 KEYWORDS fleet content tips publishers
...
96 KEYWORDS playstation sony console ouya
97 KEYWORDS payment nokia patent agreement
98 KEYWORDS zte grand boots ces
99 KEYWORDS drones soon 2014 drone

Now that we have this superset of keywords, we need to go through each document again and compute the TF-IDF score for each term in the superset. The result will likely be a sparse vector, as most of the entries will be zero.

feature_vectors = []
n = len(corpus)

for document in corpus:
    vec = []
    for word in key_word_list:
        vec.append(tfidf(word, document, corpus) if word in document else 0)
    feature_vectors.append(vec)

With a TF-IDF vector for each document, we can compare the documents with each other. I decided to use cosine similarity, computed here via NLTK's cosine distance, which is one minus the similarity (and yes, this matrix is symmetric, so the code is far from optimized):

import numpy
mat = numpy.empty((n, n))
for i in xrange(0,n):
    for j in xrange(0,n):
       mat[i][j] = nltk.cluster.util.cosine_distance(feature_vectors[i],feature_vectors[j])

We now have a single number for each pair of documents: the cosine distance, i.e. one minus the similarity.
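For concreteness, here is what that distance computes (a self-contained sketch with toy vectors):

```python
import math

def cosine_distance(u, v):
    # 1 - (u . v) / (|u| |v|): 0 for identical directions, 1 for orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1, 0], [1, 0]))  # 0.0: identical documents
print(cosine_distance([1, 0], [0, 1]))  # 1.0: no shared terms
```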

The next step is clustering. The most obvious initial choice might be k-means or k-means++ clustering: simple, well known, and often reasonably effective. Beyond its tendency to get stuck in local optima, however, the major problem here is that one has to choose k, which is not always easy. Moreover, I strongly suspect that the characteristics of the news change dramatically from day to day. On presidential election day, almost all top stories will be about voting and speculation on the results; there may be very few, very large clusters. On slow news days the story landscape may look very different. As such, it would be difficult to choose a constant k across days, and the data may change so much that a scheme to choose k dynamically might be tricky.

I chose a different approach: agglomerative or hierarchical clustering. Here clusters are grown by fusing neighboring documents to form a tree. The structure of that tree may change radically among different days but we can choose a similarity threshold to prune the tree to a set of final clusters. That similarity metric can be constant over days.

I used the hcluster Python library and saved the dendrogram to file:

from hcluster import linkage, dendrogram

t = 0.8
Z = linkage(mat, 'single')
dendrogram(Z, color_threshold=t)

import pylab
pylab.savefig("hcluster.png", dpi=800)



Finally, we need to choose a threshold at which to prune -- how similar is similar enough? A value of 0.8 worked best for my data. There is then a little work to extract the clustered items from the dendrogram and print out the IDs and titles of the final clusters.

def extract_clusters(Z, threshold, n):
    clusters = {}
    ct = n
    for row in Z:
        if row[2] < threshold:
            n1 = int(row[0])
            n2 = int(row[1])

            # Indices >= n refer to previously merged clusters;
            # indices < n refer to original documents (leaves).
            if n1 >= n:
                l1 = clusters[n1]
                del(clusters[n1])
            else:
                l1 = [n1]

            if n2 >= n:
                l2 = clusters[n2]
                del(clusters[n2])
            else:
                l2 = [n2]

            l1.extend(l2)
            clusters[ct] = l1
            ct += 1
        else:
            return clusters
    # If every merge falls below the threshold, return what we have.
    return clusters

clusters = extract_clusters(Z,t,n)

for key in clusters:
   print "=============================================" 
   for id in clusters[key]:
       print id,titles[id]
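Incidentally, the hand-rolled extraction above is essentially what SciPy's cluster.hierarchy.fcluster provides; a minimal sketch on made-up 2D points (assuming SciPy is installed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four made-up 2D "documents": two tight pairs, far apart from each other.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# Build a single-linkage tree, then prune it at a distance threshold of 1.0.
Z = linkage(points, 'single')
labels = fcluster(Z, t=1.0, criterion='distance')
print(labels)  # two clusters: the first pair and the second pair
```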

Here then, in combination with the dendrogram above, are my final results:

=============================================
35 Huawei linked to plan to sell restricted equipment to Iran
65 Exclusive: Huawei partner offered embargoed HP gear to Iran
=============================================
10 UK's Pearson invests in Barnes & Noble's Nook
73 Pearson to buy stake in Nook, Barnes & Noble shares up
=============================================
21 Kim Dotcom To Host Mega’s Launch Event At His New Mega Zealand Mansion Next Month
94 Kim Dotcom to host Mega's launch event at his New Zealand mega mansion next month
=============================================
90 Data suggests Facebook's Poke has led to online buzz around Snapchat
93 Snapchat videos and photos can be saved after all
=============================================
14 Apple CEO Tim Cook paid $4.16 million
41 Apple's Tim Cook sees his 2012 pay fall 99 percent
=============================================
8 Netflix CEO gets pay bump after 2012 cut
72 Netflix increases CEO Hastings' 2013 salary to $4 million
=============================================
95 Nissan Leaf owners can get their batteries refreshed under new warranty options
99 Japanese security firm to take to the sky in 2014, will unleash crime-fighting drones
=============================================
83 Windows RT ported to HTC HD2
77 Samsung Galaxy S III extended battery arriving in Germany on January 5
92 RumorTT: Samsung Galaxy S IV rumored for April Launch with integrated S Pen
75 Unnamed BlackBerry 10 device slides through FCC with AT&T-capable LTE abilities
98 ZTE aiming for thinner 5-inch 1080p-capable smartphone, Grand S has grand dreams
=============================================
40 HP confirms: Feds investigating the Autonomy acquisition
52 US Department of Justice is investigating allegations of accounting fraud at Autonomy
11 HP says gov't investigating troubled Autonomy unit
39 Autonomy founder fires back at HP after news of DOJ inquiry
=============================================
55 Top 10 broadband stories of 2012
63 BT wins BDUK broadband contract in Norfolk
=============================================
97 RIM's first patent payment to Nokia: $65 million
84 Fujitsu blames weak Windows 8 demand for poor sales
87 Cheezburger network raises $5M in funding for LOLcats, FAIL Blog, others
79 Samsung has big plans for Silicon Valley with 1.1 million square foot R&D center
89 Apple drops patent claim against Samsung Galaxy S III Mini

The complete output is here

To sort the items within a cluster -- the final news aggregator step -- one could sort by PageRank using code such as this.

We see good clusters around Huawei, Pearson, Kim Dotcom, Facebook/Snapchat, Tim Cook, and Netflix. All of those seem reasonable. The next is spurious. Next is a broader cluster around smartphones. Then comes a good four-story cluster around the HP and Autonomy scandal. The next cluster relates to broadband; the documents are related but not really about the same story. Finally, there is a cluster around "millions", which covers both dollars and square feet.

The results are not great. I'm not going to build a techmeme.com competitor from this. However, they are surprisingly respectable for a first pass and for so few lines of code. 

To try to improve results, I tried a weighted TF-IDF. I suspect that words in the title are relatively more important than those in the description, so I set a weight of 1 for each word in the description and a weight title_wt > 1 for each word in the title. When computing term frequency, I sum the weights of the words present rather than their counts. The code for this change is here

I did not find that this improved results. The dominant parameter by far was the number of keywords taken from each document (nkeywords=4 above). Increasing it dramatically increases false positives (i.e., larger mixed clusters, such as the "millions" cluster above). Reducing it below 4 led to very few clusters (for a given document similarity threshold). Actually, I was surprised that the optimal value of 4 was so low; I had expected that a larger number would be needed to disambiguate the articles.

A potential improvement that I have not tried is stemming the words. In the sample keyword output above you might have noticed the following:

99 KEYWORDS drones soon 2014 drone

"Drones" and "drone" are the same noun. Had I stemmed these words, it would have increased the TF-IDF score of "drone" and freed a slot for an additional keyword.
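To illustrate the idea, here is a crude suffix-stripping sketch; a real implementation would use a proper stemmer such as NLTK's PorterStemmer (the helper below is made up, for illustration only):

```python
def crude_stem(word):
    # Strip a trailing "s" from plural-looking words. Far less sophisticated
    # than the Porter algorithm, but enough to show the effect.
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

keywords = ['drones', 'soon', '2014', 'drone']
print(sorted(set(crude_stem(w) for w in keywords)))
# ['2014', 'drone', 'soon'] -- "drones" and "drone" collapse to one term
```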

I may be able to tweak my current code and improve results: try different parameters, different similarity metrics, different clustering schemes. But it is not clear how good the results could get. Could I match a human curator? According to Gabe Rivera, the founder of Techmeme.com, algorithms are not currently a match for humans. His site employs a small team of editors [http://techmeme.com/about] to curate the top news:

Our experience leads us to believe that a thoughtful combination of both algorithmic and human editing offers the best means for curating in a space as broad as technology.

I somehow started thinking about news aggregators just a few days before Prismatic got a large round of funding. In the days that followed there was some debate about algorithms versus humans. Here is TechCrunch:
News aggregation is a glory-less pursuit, where even the companies that do succeed have to grapple with the fact that no algorithm is ever going to be as good as the human brain at judging what news/content will appeal to said brain
And Gabe Rivera (from Techmeme.com): 
"A lot of people who think they can go all the way with the automated approach fail to realize a news story has become obsolete," said Rivera, explaining that an article can be quickly superseded even if it receives a million links or tweets. from gigaom 
and paraphrasing:
Some decisions, after all, are best left to humans.  When a news story breaks, for example, or a much-hyped product launches, a wealth of news coverage follows.  Which is the best version?  While Techmeme’s algorithms often derive this answer from “signals” on the Internet, sometimes surfacing the right story for the audience — perhaps the most comprehensive take or the most nuanced coverage — requires human intervention. outbrain.com
I should note that the experience of introducing direct editing has been a revelation even for us, despite the fact that we planned it. Interacting directly with an automated news engine makes it clear that the human+algorithm combo can curate news far more effectively than the individual human or algorithmic parts. It really feels like the age of the news cyborg has arrived. Our goal is to apply this new capability to producing the clearest and most useful tech news overview available.
(See also this thread.)

As mentioned earlier, news aggregation is not just about clustering similar stories. There are really three steps: story selection, clustering like documents, and selecting the best item from each cluster. This is certainly a set of hard problems. Which stories might be of interest depends on the audience and other context. Clustering similar documents is probably the easiest subtask, and even that is fraught with problems, e.g. a bag-of-words model does not distinguish millions of dollars from millions of square feet. (Thus, an n-gram model might be another possible approach.) There is a sea of potential news sources; they have different foci, reputations come and go, quality changes, and news organizations' editors vary and change too. As Gabe mentions, there is a temporal aspect as well: there is a window of interest before the news gets stale.

There is no reason that an algorithm or suite of algorithms could not do this well. There are certainly a lot of social cues now that an algorithm could use to identify trending, interesting stories: the number of likes, shares or retweets; the reputation of those sharing the news; the number or rate of comments on those stories, etc. To understand the audience and refine itself, an algorithm could use direct user feedback, and again the shares / likes / retweets of those top stories. However, it remains a hard problem. With the big news sources covering core issues, smaller news operations have to work a different angle, and sometimes those are the most interesting. Journalists are not consistent: they might write a blazing article one day and a mediocre one the next. Thus, for now, an algorithmic base plus human editors, or pure crowd-sourced filtering (such as reddit), does seem to produce the best-quality results.

Thursday, December 27, 2012

Detecting credible tweets


There has been some buzz in the last week or so around a paper on detecting the credibility of tweets. As Twitter has grown, so has spam. Moreover, during crises such as tsunamis, the Arab Spring, the UK riots and so on, one sees deliberate strategies to spread disinformation. The effects can range from inconvenience, to wasted time and resources, to outright danger. Thus, a system to filter out such disinformation would be valuable for media, first responders and others.

In Credibility Ranking of Tweets during High Impact Events (PSOSM 2012), Gupta and Kumaraguru attempted to identify features that distinguish credible information from opinion and from spam. They built a supervised learning model using features of the tweets and the senders, and ran those through a regression model and re-ranking scheme that they claim provides reasonable predictive ability.

To do this, they collected 35 million tweets from 6 million users based on search terms from trending topics during summer 2011. These were high-impact events, meaning they produced at least 25k tweets and persisted as a trending topic for more than 48 hours. They then selected 14 trending topics---including the UK riots, Hurricane Irene, Steve Jobs' resignation and the bomb blast in Mumbai---and annotated the associated tweets for credibility. In short, a human had to decide whether each tweet was definitely credible / seems credible / definitely incredible / can't decide, and whether it contained information, was unrelated to the news event, or was spam. An example of the three classes is shown below.



They found that 30% of tweets provided information, 17% were credible and 14% were spam. They then proceeded to extract a large set of features from the tweets, the senders and their user profiles. These features included:

Tweet: number of words, number of characters, whether it was a retweet, whether it contained a URL, whether it contained emoticons, number of hashtags etc.

Sender: age, number of followers, number of friends, description length, whether it was a verified account etc.

They performed a logistic regression (1 = credible, 0 = not credible/spam/opinion) and found the following significant indicators (p < 0.001):
  • number of characters and unique characters present in tweet --- likely tweets with hashtags, @mentions and URLs contain more unique characters and so are more informative
  • presence of swear words --- indicates opinion
  • inclusion of pronouns and presence of sad / happy emoticons --- fact based tweets are less personal
  • p<0.01: presence of URL --- often linked to images, video and resources related to the event.
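To sketch the kind of model they fit, here is a hand-rolled logistic regression trained by gradient descent on made-up features and labels (the paper's actual features, data and coefficients are far richer; everything below is illustrative only):

```python
import math

# Made-up training data: (fraction of unique chars, has swear word) -> credible?
X = [(0.9, 0), (0.8, 0), (0.85, 0), (0.2, 1), (0.3, 1), (0.25, 0)]
y = [1, 1, 1, 0, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit weights by stochastic gradient descent on the log-loss.
w = [0.0, 0.0, 0.0]  # bias, unique-chars weight, swear-word weight
for _ in range(5000):
    for (x1, x2), target in zip(X, y):
        p = sigmoid(w[0] + w[1] * x1 + w[2] * x2)
        err = target - p
        w[0] += 0.1 * err
        w[1] += 0.1 * err * x1
        w[2] += 0.1 * err * x2

preds = [sigmoid(w[0] + w[1] * x1 + w[2] * x2) > 0.5 for x1, x2 in X]
print(w)      # unique-chars weight positive, swear-word weight negative
print(preds)
```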
They found that "low number of happy emoticons [:-), :)] and high number of sad emoticons [:-(, :(] act as strong predictors of credibility". Now, most of the high-impact events in their list (arguably 13 of 14) are negative, so one could imagine that in a general model covering both positive and negative events this feature would simply become use of emoticons: happy emoticons for happy events, sad emoticons for sad events. However, one could also argue that a happy emoticon such as :-) can signal a joke, i.e. that the comment is not credible. We will see data from a second study below that finds the same result, so the latter is more likely: sad emoticons alone are strongly linked to credible information.

They also used an SVM and pseudo-relevance feedback (PRF) to improve results. They ranked documents using RankSVM, identified the top K tweets using a unigram model, and re-ranked them using a BM25 similarity metric. (This metric is a function of TF, IDF, K, tweet length, and some constants. See the paper for a full description.) To evaluate the results they used the Normalized Discounted Cumulative Gain (NDCG) metric. For the top 25 tweets, they achieved 0.37 NDCG, a statistically significant (p < 0.05) improvement over non-ranked data. Unfortunately, the paper gives scant further detail about the analysis. I have no idea as to the accuracy, precision, recall or F-scores of their classifier. Saying that an additional layer of analysis improved results is not as useful as detailing fully how good the initial results were.
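For reference, the BM25 metric can be sketched as follows (standard Okapi BM25 with the usual default constants k1 and b; the paper's exact variant and constants may differ, and the documents below are made up):

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.5, b=0.75):
    # Okapi BM25: a TF-IDF variant that saturates term frequency and
    # normalizes by document length relative to the corpus average.
    avgdl = sum(len(d) for d in docs) / float(len(docs))
    N = len(docs)
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in docs if term in d)
        if n == 0:
            continue
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1.0)
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [['riots', 'in', 'london'],
        ['cute', 'cat'],
        ['london', 'riots', 'update', 'riots']]
print(bm25_score(['riots'], docs[2], docs))  # highest: two mentions
print(bm25_score(['riots'], docs[1], docs))  # 0.0: term absent
```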

Gupta and Kumaraguru are not the only ones working in this field. Castillo et al. (2011) published a more compelling paper entitled "Information Credibility on Twitter", with a more detailed statistical analysis. They too sucked up tweets from trending topics, this time over a 2-month period. They used Mechanical Turk to annotate newsworthiness and credibility, and asked the evaluators to write a sentence justifying their assessment. They extracted a set of features, similar to but larger and more detailed than the above.

They tried a number of learning schemes including SVMs, decision trees, decision rules, and Bayes networks. Results across these techniques were comparable, with the best coming from a J48 decision tree: an accuracy of 89%. For detecting newsworthy tweets they achieved an impressive F1 score of 0.924. For predicting credibility a J48 tree also worked best, with the following significant features: presence of question marks (not credible), sentiment (positive: not credible; negative: credible), and presence of negative emoticons (credible). More active users are more credible. (I wonder if this is related to professional media organizations such as CNN, Reuters and local news organizations, whose business it is to be both credible and active.) Together, these features produced a model with 86% accuracy. See the table below.


In both of these studies, the significant features make sense: active users, tweeting frequently, with lots of followers, linking to relevant material, and with reasonably fleshed-out user profiles. With 86% accuracy from a balanced training set, such features and the associated model could form the basis of a reasonably good filter for a Twitter stream, and in the case of emergency services could save valuable resources and possibly lives.

P.S. It is worth checking out the Truthy system at http://truthy.indiana.edu/

Friday, December 14, 2012

On the unreasonable effectiveness of data: why are more data better?

Update 11/30/14: I incorrectly named the authors of the "unreasonable effectiveness study, now corrected.

In the paper "The Unreasonable Effectiveness of Data" [Halevy et al., 2009], Halevy, Norvig and Pereira, all from Google, argue that interesting things happen when corpora get to web scale:
"simple models and a lot of data trump more elaborate models based on less data".
In that paper, and in the more detailed tech talk given by Norvig, they demonstrate that when corpora reach hundreds of millions or trillions of training samples or words, very simple models with basic independence assumptions can outperform more complex models, such as those based on carefully crafted ontologies, trained on less data. However, they provide relatively little explanation as to why more data are better. In this post, I want to think that through.

I propose that there are several classes of problems, with different reasons why more data are better.

The first class are nearest-neighbor-type problems. Halevy et al. mention: "James Hays and Alexei A. Efros addressed the task of scene completion: removing an unwanted, unsightly automobile or ex-spouse from a photograph and filling in the background with pixels taken from a large corpus of other photos." [Hays and Efros, 2007]



Norvig presented the following schematic: 


and described it as a "data threshold" at which results went from really bad to really good.

I'm not convinced that there is any threshold or anything that resembles a phase transition. This seems to me to be a problem of finding the closest match. The more data, the closer the expected match.

Hays and Efros (2007) mention:
"Indeed, our initial experiments with the gist descriptor on a dataset of ten thousand images were very discouraging. However, increasing the image collection to two million yielded a qualitative leap in performance... Independently, Torralba et al. [2007] have observed a similar effect with a dataset of up to 70 million tiny (32x32) images...It takes a large amount of data for our method to succeed. We saw dramatic improvement when moving from ten thousand to two million images" [Hays and Efros, 2007]
There is a large difference in those corpus sizes and a "qualitative leap" is not the same as a threshold (sensu phase transition).

More data can dramatically affect such metrics through simple statistical effects. For instance, consider a sample of size n from a standard normal distribution. How does the minimum of that sample vary with n? Let's create samples of different sizes and plot the minimum using the following R code:

x<-seq(1,7,0.5)
y<-vector(mode="numeric",length=length(x))
for (i in 1:length(x)){ y[i] <- min(rnorm(10^(x[i]))) }
plot(x,y,xlab="Sample size, n (log10 scale)",ylab="Minimum value of sample",type="b")


The minimum decreases log-linearly. This is a case of extrema from an unbounded tail. More relevant here, though: for a minimization problem such as scene matching, the distance has a lower bound -- for all intents and purposes, a perfect match. For instance, perhaps someone else stood in the exact same tourist spot and took a picture of the exact same scene, but without the obstructing car.

I think this is what is happening in Norvig's schematic: at a certain corpus size we've found a really good match, and a larger corpus doesn't and cannot improve results.

In summary, for nearest neighbor type minimization problems with a non-negative distance function (meaning that the cost function has a lower bound of zero), that distance function will, on average, decrease monotonically with data or sample size. 
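This can be checked with a quick simulation (Python here rather than R; made-up uniform points, with prefixes of one large sample standing in for corpora of increasing size):

```python
import random

random.seed(42)
query = (0.5, 0.5)

# One large random "corpus" of points in the unit square; its prefixes
# serve as nested corpora of increasing size.
points = [(random.random(), random.random()) for _ in range(100000)]

def nearest_distance(n):
    # Squared distance from the query to its nearest neighbor among
    # the first n points.
    return min((x - query[0]) ** 2 + (y - query[1]) ** 2 for x, y in points[:n])

for n in [10, 1000, 100000]:
    print(n, nearest_distance(n))
# The minimum decreases with n toward its lower bound of zero.
```

Because each corpus is a superset of the previous one, the minimum distance can only decrease as n grows.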

The second class are counting or relative-frequency problems. These were the primary focus of Halevy et al., and Norvig presented a few examples. In segmentation, the task is to split a string such as "cheapdealsandstuff.com" into the most likely sequence of words. These strings are short enough to brute-force the possible partitions, but for each partition we have to assess its likelihood. The simplest approach is to assume independence among the words. Thus, if Pr(w) is the probability of a word w given some corpus, we can compute, say,

Pr(che,apdeals,andstuff) = Pr(che) * Pr(apdeals) * Pr(andstuff).
...
Pr(cheap,deals,and,stuff) = Pr(cheap) * Pr(deals) * Pr(and) * Pr(stuff).
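A brute-force sketch of this segmentation under the independence assumption (toy, made-up probabilities; a real system would use web-scale counts):

```python
import math
from functools import lru_cache

# Toy unigram model (made-up relative frequencies); unseen words get
# a tiny floor probability.
probs = {'cheap': 0.02, 'deals': 0.01, 'and': 0.05, 'stuff': 0.01,
         'che': 1e-6, 'apdeals': 1e-9, 'andstuff': 1e-9}

def score(words):
    # log Pr(w1..wk) = sum of log Pr(wi) under the independence assumption.
    return sum(math.log(probs.get(w, 1e-12)) for w in words)

@lru_cache(maxsize=None)
def segment(text):
    # Best-scoring split of text; memoized so the brute force stays cheap.
    if not text:
        return []
    candidates = ([text[:i]] + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(candidates, key=score)

print(segment('cheapdealsandstuff'))  # ['cheap', 'deals', 'and', 'stuff']
```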

One can of course also use n-grams, e.g. for bigrams: Pr("cheap deals") * Pr("and stuff")

A second example that Norvig covered was spell checking. Here we can take a misspelled word and compute the likelihood of the possible variants to suggest the most likely form. 
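A minimal sketch of that idea, along the lines of Norvig's well-known spelling corrector: generate all candidates within one edit of the word and pick the most frequent (the counts below are made up):

```python
# Toy word counts standing in for corpus frequencies (made up).
counts = {'the': 500, 'then': 40, 'tea': 10, 'he': 90}

def edits1(word):
    # All strings one edit away: deletes, transposes, replaces, inserts.
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Prefer the known candidate with the highest corpus frequency.
    candidates = [w for w in edits1(word) | {word} if w in counts] or [word]
    return max(candidates, key=lambda w: counts.get(w, 0))

print(correct('teh'))  # 'the' -- a transposition of a frequent word
```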

In both cases, we need a corpus that includes both common and uncommon phrases, and we need counts of the occurrences of those phrases to compute relative frequencies. The larger and more comprehensive the corpus, the better. I think that there are two statistical effects going on here:
  • the larger the corpus, the better the quality of the estimate of the relative frequency. This is the law of large numbers.
  • the larger the corpus, the more likely it is to include unusual phrases, i.e. the long tail. This is an unbounded effect: as more of the web is indexed, there will always be new, never-seen-before phrases. The problem is exacerbated by the fact that the distribution of words in the English language is a power law (e.g. Zipf, G. The Psycho-Biology of Language. Houghton Mifflin, Boston, MA, 1935). This means that the tail is especially long, and thus especially large samples are needed to capture those rare phrases.
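The first effect is easy to demonstrate: estimate the relative frequency of a made-up "word" whose true occurrence probability is 0.1 from samples of different sizes (seeded for reproducibility):

```python
import random

random.seed(0)

def mean_abs_error(n, trials=100, p=0.1):
    # Average absolute error of the relative-frequency estimate of a
    # "word" with true probability p, over samples of n tokens each.
    total = 0.0
    for _ in range(trials):
        hits = sum(1 for _ in range(n) if random.random() < p)
        total += abs(hits / float(n) - p)
    return total / trials

print(mean_abs_error(100))    # noisy estimate from small samples
print(mean_abs_error(10000))  # roughly 10x more accurate
```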
A third class are problems of estimating univariate distributions. I recently encountered an interesting example while attending a tech talk by Peter Skomoroch from LinkedIn. He showed a plot of the likelihood of a member having particular software-related job titles versus the number of months since graduation. What we see from the data is that the distributions for "Sr software engineer" and "senior software engineer" are almost identical, as one would expect since they are synonyms, as are "CTO" and "chief technology officer". This presents an interesting way of identifying synonyms and so deduping the data, rather than maintaining a huge master list of acronyms and abbreviations. It is only possible because of the scale of data LinkedIn has, where the generated distributions are reliable and presumably close to the true underlying population distributions.


A fourth class are general multivariate or correlational problems, in which we are trying to estimate the relationship among variables. This could be estimating a relationship y = f(x), or perhaps estimating the joint pdf of many variables. We might use this for word-sense disambiguation (e.g., is the document referring to pike the fish or pike the pointed weapon?) or to build up a dossier of associated features or concepts about an entity (e.g., a company has an associated CEO, head office, tax ID and so on). Here we are interested in the correlations among words or phrases. The problem is that web documents are very high-dimensional, and in high-dimensional problems like these we face the curse of dimensionality: data become very sparse. Thus, one effect of larger samples is to increase the density of data across the state space. Again, with larger samples we can estimate location metrics (the mean, median and other measures of the center of a distribution) more accurately. We can also estimate joint pdfs more accurately. Below is a simple example from the following code:

library(MASS)  # for mvrnorm

par(mfrow=c(1,2))
plot(mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(1, .9, .9, 1), 2)), xlab="X", ylab="Y", ylim=c(-4,4))
title("n = 100")
plot(mvrnorm(10000, mu = c(0, 0), Sigma = matrix(c(1, .9, .9, 1), 2)), xlab="X", ylab="Y", ylim=c(-4,4))
title("n = 10000")



At left is a small sample; it could easily be interpreted as linear. At right, with a larger sample, the true underlying bivariate normal is more obvious. Clearly, this is a trivial example. The point is that for higher dimensions you will need much larger sample sizes to estimate the joint pdfs well.

This is clearly a cursory answer to why more data are better. Quality data are still preferred. However, for many organizations, such as Google, Twitter, LinkedIn and Facebook, where content is user-generated, often free-form text, and/or covers many domains (so that deep cleaning and the use of ontologies are infeasible), very large datasets can compensate for the noise. It all evens out in the end, and in the case of nearest neighbor problems, the solution will, on average, only get better.