Comments on p-value.info: When tf*idf and cosine similarity fail
Author: Carl Anderson (http://www.blogger.com/profile/11930448254473684406)
3 comments, feed last updated 2023-09-26

Comment posted 2014-06-27:
Please use this to overcome these limitations:

tf(t in d) = frequency^(1/2)

idf(t) = 1 + log(N / (n_i + 1))

This is used in the popular Lucene engine.

Thanks
Anonymous (https://www.blogger.com/profile/09867724863759888684)

Comment posted 2014-05-09:
You are using this on a small dataset. This is expected.
Ali Ayaz Gajani (https://www.blogger.com/profile/15778005501555515595)

Comment posted 2013-02-19:
Interesting read. A few comments:

You should smooth your TF-IDF measure, for example with log((N + 1) / (n_i + 1)); that helps in many cases and avoids the NaN issue. Or use something like Okapi BM25.

If a term occurs in every document in your corpus, then the presence of that term in the query does not give any meaningful information anyway.

Also, the similarity metric is very important for most IR/ML algorithms, and you need to be very careful about how you define it, including cosine similarity and TF-IDF.
Mitul Tiwari (https://www.blogger.com/profile/13083079571729960860)
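
The weighting formulas suggested in these comments can be compared side by side. The following is a minimal Python sketch, not code from the post or from Lucene itself; the function names are made up for illustration, and the natural logarithm is assumed because the comments do not state a base.

```python
import math


def plain_idf(n_docs, doc_freq):
    """Unsmoothed idf, log(N / n_i). Raises ZeroDivisionError if doc_freq is 0."""
    return math.log(n_docs / doc_freq)


def add_one_idf(n_docs, doc_freq):
    """Add-one smoothing from the comments: log((N + 1) / (n_i + 1))."""
    return math.log((n_docs + 1) / (doc_freq + 1))


def lucene_style_idf(n_docs, doc_freq):
    """Lucene-style idf from the comments: 1 + log(N / (n_i + 1))."""
    return 1.0 + math.log(n_docs / (doc_freq + 1))


def lucene_style_tf(frequency):
    """Lucene-style tf from the comments: square root of the raw term count."""
    return math.sqrt(frequency)


if __name__ == "__main__":
    n_docs = 4  # hypothetical corpus size, chosen only for illustration
    for doc_freq in (1, 2, 4):
        print(
            doc_freq,
            round(plain_idf(n_docs, doc_freq), 4),
            round(add_one_idf(n_docs, doc_freq), 4),
            round(lucene_style_idf(n_docs, doc_freq), 4),
        )
```

For a term that occurs in every document (n_i = N), both the unsmoothed and the add-one variants evaluate to zero, which matches the point that such a term carries no information; only the Lucene-style variant keeps a positive weight there. The add-one variant does stay finite when n_i = 0, where the unsmoothed form divides by zero.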