Tuesday, February 19, 2013

When tf*idf and cosine similarity fail

In this post I cover 2 edge cases of cosine similarity with tf*idf weights that fail, i.e. that don't provide the cosine similarity values that intuition and common sense says that they should return. 

In information retrieval, tf*idf forms the basis of scoring documents for relevance when querying a corpus, as in a search engine. It is the product of two terms: term frequency and inverse document frequency.

Term frequency is the frequency of some term in the document, typically an absolute count or relative frequency. Documents with more mentions of a term are more likely to be relevant with respect to that term. For example, when querying for "dog," a document about caring for your dog which mentions "dog" 46 times is more likely to be relevant than a document with a single mention of "the dog days of summer."

Inverse document frequency (IDF) measures the dispersion of that term across the corpus. If every document contains "the," then "the" is not a particularly discriminating word. IDF is the ratio of the corpus size to the number of documents containing that term. The smaller the proportion of documents containing that term, the higher the magnitude of this metric. (In reality, we take the log of the ratio. That is, idf = log(N/n_i)).

These two measures quantify the frequency within a document and the relative rarity across the corpus. Taking the product we arrive at a simple, satisfyingly intuitive but surprisingly powerful metric to score documents. For each term t in each document d in some corpus D we can compute the tf*idf score. Let's call this tfidf (t,d).

We rarely query a corpus for a single term. Instead, we have a query q consisting of multiple terms. Now we want to compute the similarity between the query q and each document d in the corpus. For this, we tend to use something called cosine similarity

This is a measure of the angle between two unit vectors:

    = cos(a,b) 
    = dotproduct(a,b) / ( norm(a) * norm(b) ) 
    = a.b / ||a|| * ||b||

[Definition: if a = (a1,a2,...,an) and b = (b1,b2,...,bn) then a.b = Sum(a1*b1 + a2*b2 + ... + an*bn) 
and ||a|| = sqrt(a1^2 + a2^2 + ... + an^2) and ||b|| = sqrt(b1^2 + b2^2 + ... + bn^2). ]

The smaller the angle, the more similar are the two vectors.

In this case, the variables of a and b are the set of unique terms in q and d. For example, when q = "big red balloon" and d ="small green balloon" then the variables are (big,red,balloon,small,green) and a = (1,1,1,0,0) and b = (0,0,1,1,1).

Not all words are created equally. Some are more important than others when computing similarity. Rather than use the count or the presence/absence of each term, we can use a weight. For example, we can give a lower weight to common words. What would make a suitable weighting? tf*idf of course. Putting this altogether,

similarity(q,d) = a.b / ||a|| * ||b|| 


a = (


b = (

While cosine similarity with tf*idf works well, really well, there are a couple of edge cases where it fails, corner cases that don't seem to be covered in most introductory explanations and tutorials.

FAIL 1: imagine you have a corpus D consisting of one document d. You come along with a query q where q == d. That is, the corpus has exactly what you are looking for. Intuition should say that we expect that cosine similarity would be 1 because q == d. So, what do we get? While the dot product of q and d should be 1 giving cosine similarity 1, it is not when you use tf*idf weights. The tf*idf of each term of d will be zero--each term of d is in all documents (D==d). Therefore, the dot product is zero but the norms of the two vectors is also zero and will generate a division by zero error. In summary, similarity is 0/0 and so undefined. 

FAIL 2: imagine you have a corpus with two documents, d_1 = "blue bag" and d_2 = "green bag". What is their similarity? Intuition says there are some similarities between them, they both contain "bag," but there are some differences: "blue" vs "green". Thus, this should mean that we get a cosine similarity somewhere between 0 and 1. Wrong! Tf*idf for "bag," the common term, is zero because IDF is zero. "blue" is not a shared term and so that term of the dot product is zero as is for term "green." In other words, where they differ it pumps zero terms into the dot product and where they are similar, those terms effectively convey no information whatsoever and so also generate zero values.

While these two scenarios may seem contrived, I encountered them while writing unit tests where I wanted to use minimal corpora possible to test my code. It seems that one needs three distinct documents to avoid the problems above, or your code must handle a NaN.

I use tf*idf and cosine similarity frequently. It can get you far with little cost (if your documents are not enormous). It does have a big limitation though, it is a "bag of words" model meaning it does not consider word order. In many cases, specific word order matters a lot---a red couch with gold legs is very different from a gold couch with red legs. What one can do is to use the fast and cheap cosine similarity with tf*idf weights to narrow down some larger corpus to a smaller subset of documents for which you run a more computationally expensive, more domain specific model or algorithm that does consider word order.

Thursday, February 14, 2013

20 Tips and tricks working with data

In this post I lay out a number of tips, tricks and suggestions for working with data, from collection to storage to analysis. The main idea is to help students and other younger analysts and data scientists. However, they apply to all ages and experiences. Indeed, while many of the following ideas may sound obvious and trivial, it is surprising how often I see the counter-examples and anti-patterns. 

This is almost certainly a biased selection based on my personal experience over the years. Thus, if you have any additional comments or tips, feel free to add those in the comments.

This section is about data itself.

1) Inspect your data
When you receive a dataset, open a text editor and inspect the data. Don't assume all is present and correct. Page through the data quickly to get a sense of columns that are mostly NULL, data that are zero or negative, the ranges of dates and so on. It is very easy to make mistakes when reformatting data and data entry people are often low-paid, low-skilled workers. Assume that there is something wrong with your data. In *nix, use the head and more commands to get a sense of the data.

2) Think through what you would expect.
Related to above, it helps to have an expectation of what you might expect in each column. This will help in spotting errors. For instance, would you expect to find negative values? I once worked with a government-recorded health dataset that included data of people's height, something one might expect would be easy to measure and record. However, we found a chunk of people listed as 5 inches tall. Likely, they should have been 5 feet tall but such outliers are easy to spot with a frequency histogram. In R, get in the habit of calling summary() on your dataset which provides a 5-number summary of each variable. Plot univariate distribution and in R pairs(), a function that produces a scatter plot of each pair of variables, can be very valuable.

3) Check your units.
It may sound obvious but check your units. CNN reports 
NASA lost a $125 million Mars orbiter because a Lockheed Martin engineering team used English units of measurement while the agency's team used the more conventional metric system for a key spacecraft operation.
Yeah, it's that important.

4) Plain text > binary. 
Plain text data files are always better than binaries. You get to see the raw data, to see how columns are delimited and to see which line ending types are used. Any text editor can be used to view the data. With a simple text file, it can be easy to write a python script to iterate through the data. 

5) Structured > unstructured.
While there has been a recent focus on NoSQL technologies that store schemaless documents, if data are or should be well-structured or have a strict, static schema, keep them that way. It will make querying, slicing and dicing much easier later. 

6) Clean > more > less
Much of the time building models, classifiers and so on is spent cleaning data. It takes a lot of time and can be depressing to see huge swathes of data thrown away or excluded. It is better to start with clean data. This means having quality control on data entry. Run validators on HTML forms where possible. If you can't get clean data, at least get a lot of it (assuming it is a well-structured experiment). Clean data are better than more data which is better than less data.

7) Track data provenance
Provenance means where data came from. In many cases, provenance is is explicit as it comes from a single source or there is some source/vendor code or name. Imagine though you create a corpus for a model in which you combine data from multiple sources. If you later find a problem with one of those souces, you will likey want to back out those data or weight them less. If your corpus is just a big set of numbers and you don't know which data came from where, you now have polluted low-quality data.

8) Be honest about data loss acceptability
When processing data, there can be problems especially when working with remote services. Servers can be down, databases connection can momentarily drop etc. Think through how important it is to process each item. If you are a bank or many other transaction based businesses, everything must work perfectly. This though comes at a high cost. However, if you are running say sentiment analysis on a twitter stream, or running some summary metrics on historical data, a small data loss may be perfectly acceptable. Will the extra cost of getting every last tweet make a big difference to your analysis?

9) Embrace the *nix command line.
It is hard not to stress this enough. Learn to love the command line. While the commands may seem terse and obtuse, you will be amazed how much more productive you can be with just a few basic commands such as:
  • wc -l filename: returns the number of lines in the file.
  • head -n filename: shows the first n lines of a file
  • grep searchterm filename: filter in the lines containing this search term.
10) Denormalize data for warehousing
There are very good reasons for normalized databases tables. A well-designed schema will be flexible for future data and grow by rows not columns over time. For wareshousing where data are stored for longer term, it can often be a good idea to denormalize data. This can make the provenance more explicit, can make it easier to split into chunks, and it is often easier for analysts to work with as you can avoid costly, complex joins.

This section refers mostly to processing data with scripts

11) Learn a scripting language
Python, ruby, it doesn't really matter which but pick one. You want a language with at least simple I/O, a range of data structures and libraries to make HTTP requests. You will end up doing a lot of file manipulation in your career and so you need a good general tool to automate tasks. Also, some analyses, especially involving time series of individuals, are far easier to do in scripting code than with SQL self-join queries.

12) Comment your code
This is a general tip that applies to all source code not just scripts for analyzing data. When your head is in a problem there are subtle nuances about why you're doing something the way that you are. You'll forget that in a week, especially if you work on multiple projects simultaneously. Comment on why you are taking a certain approach or why this is an edge case. Focus less on what you are doing--ideally your code will be readable enough to show that---but a few such comments, in combination with well chosen function and variable names, will help you, or someone else, get into the code later. Comment for your future self.

13) Use source control
Again, a general coding tip. Source control (CSV, SVN, git, mercurial etc.) is indispensible. First, it is a form of backup. Second, it allows one to share code and work collaboratively. Third, it allows one the freedom to experiment more and try things out without fear of getting back to your last iteration--you can always revert back to the old code if it doesn't work out. Don't leave home without it. Also, always always write a commit message when committing. Like code comments, commit comments should focus more on what new code does rather than how the new code works. Again, comment for your future self.

14) Automate
This is obvious but is a lesson that most people learn only after they have had to do a task manually several times when they thought it was a one off. If there is any chance that you might need to do a task twice, it probably means you will actually need to do it five times. Automate the task where possible. This might mean a python script, a SQL stored procedure, or a cron job but it most cases, it is worth it. Even if only some of the steps are automated, it can make a huge difference.

15) Create services
In years of creating data-related tooling, I've found that people always use the tool in unexpected ways. Make the tools flexible as much of possible. Doing so, will provide an ecosystem of functionality. Create top-level functions that serve the 80% of uses as well as provide access to the lower-level functionality. Where possible make the functionality as web services that can talk to each other in a simple clean manner. In many cases, JSON is preferable to XML or SOAP. 

16) Don't obsess over big data
I might get lynched for this one but there is a lot of hype over big data as service companies vie to take a slice of the analytics pie. While big data certainly has a place, many people who think they need big data don't have big data and don't need big data technologies. If you work in mega- or giga-bytes, it is not big. Take a deep breath and answer honestly, do I have big data or do I need to scale to huge proportions; do I have a problem that can be parallelized well (some problems don't); do I need to process all my data or can I sample? I should stress that I am in no way against big data approaches---I use them myself---but it is not a panacea. Other, simpler approaches may get you what you need.

17) Be honest about required latency
In most websites, data are replicated, warehoused and analytics performed offline. That latency for analysis can often be hours or a day without any real impact on the business. There are cases where more real time analysis is required, for instance collaborative filtering or computing real time twitter trends. Such short latencies comes at a cost, and significantly more so, the smaller the value. Don't waste money creating a system capable of 1 second sales data latency if you only analyze it in batch mode overnight.

I am hesitant to dip into data visualization as that is a whole post in itself but I will include the following two:

18) Use color wisely
While neon colors might be in vogue at American Apparel this season that doesn't mean I want to see it in your charts. Take note of default color palettes in programs such as Excel and libraries such as ggplot2. A lot of thought, research and experience have gone into those. If you break the rules do it for good reason and for good effect. P.S. don't use yellow text on a white background. While it might look good on your laptop screen in almost all cases, it is unreadable when projected.

19) Declutter charts and visualizations
When producing charts, infographics and other data visualizations, think very carefully about what you want the reader to take away from the data. Focus on that and make that message shine through with the least amount of reader effort. Think about information density. If I need a magnifying glass to decipher all the details, you're doing it wrong. If a chart has a lot of textual annotations, you are likely doing it wrong. If you need 6 charts to convey one main point, you are probably missing some key summary metric. Use larger fonts than you might think if you are projecting to a large room. (Where possible, actually stand at the back of the room and check readability of your presentation.) You would do well to read Edward Tufte's earlier books.

20) Use # dashboard users as a metric in your dashboards
In his foundation video series, Kevin Rose interviewed Jack Dorsey (CEO Square and Exec Chairman, Twitter) [video]. Dorsey made a very interesting point in how he has tried to make Square data-centric. A dashboard is useless if no one is using it. How valuable a dashboard or other metrics are should be reflected in the very dashboard itself: use the numbers of users as a metric in the dashboard itself. I would add that such metrics should be honest signals. That is, if reports are emailed out to recipients, the number of recipients is not necessarily a true reflection of use. Perhaps most people don't open it. Email open rate, however, is a more reliable and reflective indicator. 

There you go. A few tips and suggestions for working with data. Let me know what you think and what I missed.