## Tuesday, February 19, 2013

### When tf*idf and cosine similarity fail

In this post I cover 2 edge cases of cosine similarity with tf*idf weights that fail, i.e. that don't provide the cosine similarity values that intuition and common sense says that they should return.

In information retrieval, tf*idf forms the basis of scoring documents for relevance when querying a corpus, as in a search engine. It is the product of two terms: term frequency and inverse document frequency.

Term frequency is the frequency of some term in the document, typically an absolute count or relative frequency. Documents with more mentions of a term are more likely to be relevant with respect to that term. For example, when querying for "dog," a document about caring for your dog which mentions "dog" 46 times is more likely to be relevant than a document with a single mention of "the dog days of summer."

Inverse document frequency (IDF) measures the dispersion of that term across the corpus. If every document contains "the," then "the" is not a particularly discriminating word. IDF is the ratio of the corpus size to the number of documents containing that term. The smaller the proportion of documents containing that term, the higher the magnitude of this metric. (In reality, we take the log of the ratio. That is, idf = log(N/n_i)).

These two measures quantify the frequency within a document and the relative rarity across the corpus. Taking the product we arrive at a simple, satisfyingly intuitive but surprisingly powerful metric to score documents. For each term t in each document d in some corpus D we can compute the tf*idf score. Let's call this tfidf (t,d).

We rarely query a corpus for a single term. Instead, we have a query q consisting of multiple terms. Now we want to compute the similarity between the query q and each document d in the corpus. For this, we tend to use something called cosine similarity

This is a measure of the angle between two unit vectors:

similarity
= cos(a,b)
= dotproduct(a,b) / ( norm(a) * norm(b) )
= a.b / ||a|| * ||b||

[Definition: if a = (a1,a2,...,an) and b = (b1,b2,...,bn) then a.b = Sum(a1*b1 + a2*b2 + ... + an*bn)
and ||a|| = sqrt(a1^2 + a2^2 + ... + an^2) and ||b|| = sqrt(b1^2 + b2^2 + ... + bn^2). ]

The smaller the angle, the more similar are the two vectors.

In this case, the variables of a and b are the set of unique terms in q and d. For example, when q = "big red balloon" and d ="small green balloon" then the variables are (big,red,balloon,small,green) and a = (1,1,1,0,0) and b = (0,0,1,1,1).

Not all words are created equally. Some are more important than others when computing similarity. Rather than use the count or the presence/absence of each term, we can use a weight. For example, we can give a lower weight to common words. What would make a suitable weighting? tf*idf of course. Putting this altogether,

similarity(q,d) = a.b / ||a|| * ||b||

where

a = (
tfidf("big",q),
tfidf("red",q),
tfidf("balloon",q),
tfidf("small",q),
tfidf("green",q)
)

and

b = (
tfidf("big",d),
tfidf("red",d),
tfidf("balloon",d),
tfidf("small",d),
tfidf("green",d)
).

While cosine similarity with tf*idf works well, really well, there are a couple of edge cases where it fails, corner cases that don't seem to be covered in most introductory explanations and tutorials.

FAIL 1: imagine you have a corpus D consisting of one document d. You come along with a query q where q == d. That is, the corpus has exactly what you are looking for. Intuition should say that we expect that cosine similarity would be 1 because q == d. So, what do we get? While the dot product of q and d should be 1 giving cosine similarity 1, it is not when you use tf*idf weights. The tf*idf of each term of d will be zero--each term of d is in all documents (D==d). Therefore, the dot product is zero but the norms of the two vectors is also zero and will generate a division by zero error. In summary, similarity is 0/0 and so undefined.

FAIL 2: imagine you have a corpus with two documents, d_1 = "blue bag" and d_2 = "green bag". What is their similarity? Intuition says there are some similarities between them, they both contain "bag," but there are some differences: "blue" vs "green". Thus, this should mean that we get a cosine similarity somewhere between 0 and 1. Wrong! Tf*idf for "bag," the common term, is zero because IDF is zero. "blue" is not a shared term and so that term of the dot product is zero as is for term "green." In other words, where they differ it pumps zero terms into the dot product and where they are similar, those terms effectively convey no information whatsoever and so also generate zero values.

While these two scenarios may seem contrived, I encountered them while writing unit tests where I wanted to use minimal corpora possible to test my code. It seems that one needs three distinct documents to avoid the problems above, or your code must handle a NaN.

I use tf*idf and cosine similarity frequently. It can get you far with little cost (if your documents are not enormous). It does have a big limitation though, it is a "bag of words" model meaning it does not consider word order. In many cases, specific word order matters a lot---a red couch with gold legs is very different from a gold couch with red legs. What one can do is to use the fast and cheap cosine similarity with tf*idf weights to narrow down some larger corpus to a smaller subset of documents for which you run a more computationally expensive, more domain specific model or algorithm that does consider word order.

1. You should smooth your TF-IDF measure, for example, log(N+1/n_i+1)), that helps in many cases and avoids NaN issue. Or use something like Okapi BM25.

If a term occurs in every document in your corpus then presence of that term in the query is not giving any meaningful information any way.

Also, similarity metric is very important for most of the IR/ML algorithms, and you need to be very careful how you define your similarity metric including Cosine and TF-IDF.

2. You are using this on a small dataset. This is expected.

3. Please use this to overcome these limitations:

tf (t in d) = frequency½

idf (t) = 1 + log ( N / ni + 1).

This is used in the popular Lucene engine.

Thanks