Comments on p-value.info: When tf*idf and cosine similarity fail

Please use this to overcome these limitations: t...

2014-06-27T08:23:41.577-07:00

Please use this to overcome these limitations:

tf(t in d) = frequency^(1/2), i.e. sqrt(frequency)

idf(t) = 1 + log(N / (n_i + 1))

This is used in the popular Lucene engine.
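
For illustration, here is a minimal Python sketch of the two formulas above. The function names are made up for this example and are not Lucene's API (Lucene itself is Java); N is the number of documents and n_i is the number of documents containing the term.

```python
import math


def lucene_style_tf(frequency):
    """Square-root term frequency: tf = frequency^(1/2)."""
    return math.sqrt(frequency)


def lucene_style_idf(num_docs, doc_freq):
    """idf = 1 + log(N / (n_i + 1)), using the natural log."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical toy numbers: a term that occurs 4 times in a document
# and appears in all 3 documents of a 3-document corpus.
tf = lucene_style_tf(4)
idf = lucene_style_idf(3, 3)
weight = tf * idf
```

With these toy numbers the idf works out to 1 + log(3/4), which is still positive, so a term that appears in every document keeps a non-zero weight instead of collapsing to zero as it does under a plain log(N / n_i).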

Thanks

You are using this on a small dataset. This is exp...

2014-05-09T23:06:47.329-07:00

You are using this on a small dataset. This is expected.

Interesting read. A few comments: You should smoo...

2013-02-19T13:44:50.589-08:00

Interesting read. A few comments:

You should smooth your TF-IDF measure, for example with log((N + 1) / (n_i + 1)); that helps in many cases and avoids the NaN issue. Or use something like Okapi BM25.
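
A small Python sketch of that smoothing, compared with an unsmoothed idf. Both function names are hypothetical, and the unsmoothed form log(N / n_i) is assumed here as the baseline rather than taken from the post.

```python
import math


def plain_idf(num_docs, doc_freq):
    """Unsmoothed idf = log(N / n_i); fails when n_i is 0."""
    return math.log(num_docs / doc_freq)


def smoothed_idf(num_docs, doc_freq):
    """Smoothed idf = log((N + 1) / (n_i + 1)); defined even when n_i is 0."""
    return math.log((num_docs + 1) / (doc_freq + 1))


# A query term that appears in none of the 3 documents.
try:
    plain_idf(3, 0)
    plain_failed = False
except ZeroDivisionError:
    plain_failed = True

unseen_term_idf = smoothed_idf(3, 0)      # finite: log(4 / 1)
everywhere_term_idf = smoothed_idf(3, 3)  # log(4 / 4)
```

Note that under this particular smoothing a term present in every document still gets an idf of log(1) = 0; what the +1 terms guard against is the division by zero for unseen terms.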

If a term occurs in every document in your corpus, then the presence of that term in the query is not giving any meaningful information anyway.

Also, the similarity metric is very important for most IR/ML algorithms, so you need to be very careful about how you define it, and that includes cosine similarity and TF-IDF.