Monday, December 31, 2012

Howto: build a news aggregator (in 100 LOC)


In this post, I'll cover how to build a crude, simple, but working news aggregator in about 100 lines of Python (source code). While a complete news aggregator comprises at least three steps:
  • filter: select a subset of stories of interest, i.e. the top news -- e.g., that Zynga's shuttering of several games will be one of the most interesting stories to Techmeme's readership in the screenshot below
  • aggregate: cluster similar documents -- e.g., recognizing that Techcrunch and GamesIndustry International are both covering the same story
  • choose: pick the best representative story from each cluster -- e.g., that the Techcrunch story should be highlighted over GamesIndustry International's,
in this article we will focus only on the second step: aggregating similar articles.


Let us state the goal explicitly: given multiple sources of documents, such as RSS news feeds, cluster together similar documents that cover the same material. For instance, both Reuters and Associated Press might cover Hillary Clinton's admission to hospital (a major story at the time of writing). In the Google News screenshot below, we can see that USA Today, Detroit Free Press, Boston Herald and at least 10 other news organizations covered that story. We wish to recognize that these report the same underlying story and to cluster, or aggregate, them.


My approach was the following:
  • parse RSS feeds
  • extract keywords using TF-IDF (i.e., term frequency * inverse document frequency)
  • union those into a set of features that encompasses the whole corpus
  • create a feature vector for each document
  • compute a matrix of pairwise cosine distances
  • run hierarchical clustering and plot dendrogram
  • extract final clusters of interest
I'm not claiming this is an optimal scheme or that it will necessarily scale well, and I did not focus on vectorized operations. It does, however, produce non-embarrassing results in 100 lines of code. Later, we will discuss other approaches, including the algorithms-versus-humans debate.

Let's build an aggregator. To start, we will need some data sources. I chose a handful of RSS feeds about technology and IT news:

feeds = [
    'http://www.sfgate.com/rss/feed/Tech-News-449.php',
    'http://feeds.feedburner.com/TechCrunch/startups',
    'http://news.cnet.com/8300-1001_3-92.xml',
    'http://www.zdnet.com/news/rss.xml',
    'http://www.computerweekly.com/rss/Latest-IT-news.xml',
    'http://feeds.reuters.com/reuters/technologyNews',
    'http://www.tweaktown.com/news-feed/'
]

(We can assume that these sources have already mostly accomplished the first news aggregator step: filtering.)

Next, we need to parse the sources. I decided to use the document title and the description, which is a shortish summary. I used the feedparser Python library and the natural language processing library NLTK. For each document, I concatenate the title and description, convert to lowercase, and remove one-character words. I store those final token lists in a corpus and also store the titles to make the final results more interpretable:

import feedparser
import nltk

corpus = []
titles = []
ct = -1
for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        # strip HTML from the description, tokenize it, then add the title tokens
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        # lowercase and drop one-character tokens
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])

Here is a sample of those data (titles only):

0 TITLE Tweeters say the dumbest things
1 TITLE IFixit thrives by breaking tech items down
2 TITLE Earthcam's New Year's Live 2013
3 TITLE Content Fleet offers publishers hot tips
...
96 TITLE Sony stops shipments of PS2 console in Japan
97 TITLE RIM's first patent payment to Nokia: $65 million
98 TITLE ZTE aiming for thinner 5-inch 1080p-capable smartphone, Grand S has grand dreams
99 TITLE Japanese security firm to take to the sky in 2014, will unleash crime-fighting drones

We will extract keywords using TF-IDF. I modified the following from this post:

import math

def freq(word, document): return document.count(word)
def wordCount(document): return len(document)

def numDocsContaining(word, documentList):
    count = 0
    for document in documentList:
        if freq(word, document) > 0:
            count += 1
    return count

# term frequency, normalized by document length
def tf(word, document): return freq(word, document) / float(wordCount(document))

# inverse document frequency; float() avoids Python 2 integer division
def idf(word, documentList): return math.log(len(documentList) / float(numDocsContaining(word, documentList)))

def tfidf(word, document, documentList): return tf(word, document) * idf(word, documentList)

I only want to extract a few top keywords from each document, and I'll store the set of those keywords, across all documents, in key_word_list:

import operator

def top_keywords(n, doc, corpus):
    # score every distinct word in the document and return the n highest-scoring
    d = {}
    for word in set(doc):
        d[word] = tfidf(word, doc, corpus)
    sorted_d = sorted(d.iteritems(), key=operator.itemgetter(1), reverse=True)
    return [w[0] for w in sorted_d[:n]]

key_word_list = set()
nkeywords = 4
for doc in corpus:
    for x in top_keywords(nkeywords, doc, corpus):
        key_word_list.add(x)

ct = -1
for doc in corpus:
    ct += 1
    print ct, "KEYWORDS", " ".join(top_keywords(nkeywords, doc, corpus))

Here is a sample of those data:

0 KEYWORDS she tweeted athletes south
1 KEYWORDS ifixit repair devincenzi items
2 KEYWORDS earthcam live famed sound
3 KEYWORDS fleet content tips publishers
...
96 KEYWORDS playstation sony console ouya
97 KEYWORDS payment nokia patent agreement
98 KEYWORDS zte grand boots ces
99 KEYWORDS drones soon 2014 drone

Now that we have this superset of keywords, we need to go through each document again and compute the TF-IDF of each of those terms. The resulting feature vector will likely be sparse, as most of the entries will be zero.

feature_vectors = []
n = len(corpus)

for document in corpus:
    # one TF-IDF entry per keyword in the global keyword set
    vec = [tfidf(word, document, corpus) if word in document else 0
           for word in key_word_list]
    feature_vectors.append(vec)

With a TF-IDF vector for each document, we can compare every pair of documents. I decided to use cosine distance -- NLTK's cosine_distance, which is one minus the cosine similarity (and yes, this matrix is symmetric, so the code is far from optimized):

import numpy
mat = numpy.empty((n, n))
for i in xrange(0, n):
    for j in xrange(0, n):
        # the matrix is symmetric, but for simplicity both halves are computed
        mat[i][j] = nltk.cluster.util.cosine_distance(feature_vectors[i], feature_vectors[j])
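
As an aside, the same pairwise distances can be computed in one vectorized call. A minimal sketch, assuming SciPy is installed (the condensed form returned by pdist is also the form that linkage accepts as a precomputed distance matrix):

import numpy
from scipy.spatial.distance import pdist, squareform

# all pairwise cosine distances in one call; squareform expands the
# condensed result into the full symmetric matrix used above
condensed = pdist(numpy.array(feature_vectors), 'cosine')
mat_fast = squareform(condensed)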

We now have a single distance number for each pair of documents.

The next step is clustering. The most obvious first choice might be k-means or k-means++ clustering: it is simple, well known, and can work relatively well. Besides its tendency to get stuck in local optima, the major problem here is that one has to choose k, and that is not always easy. Moreover, I strongly suspect that the characteristics of the news change dramatically from day to day. On a presidential election day, almost all top stories will be about voting and speculation about the results, so there may be very few, very large clusters; on a slow news day the story landscape may look very different. As such, I think it would be difficult to choose a constant k across days, and the data may change so much that a scheme to choose k dynamically could be tricky.
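
For illustration only, here is a minimal sketch of what that would look like with NLTK's k-means clusterer; the num_means value is an arbitrary guess, which is exactly the problem:

import numpy
from nltk.cluster import KMeansClusterer, cosine_distance

# k must be fixed up front (10 below is an arbitrary guess)
vectors = [numpy.array(f) for f in feature_vectors]
kclusterer = KMeansClusterer(10, cosine_distance, repeats=5)
assignments = kclusterer.cluster(vectors, assign_clusters=True)
print assignments[:10]   # cluster label assigned to each of the first ten documents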

I chose a different approach: agglomerative, or hierarchical, clustering. Here, clusters are grown by fusing neighboring documents together to form a tree. The structure of that tree may change radically from day to day, but we can choose a distance threshold to prune the tree down to a set of final clusters, and that threshold can be held constant across days.

I used the hcluster Python library and saved the dendrogram to file:

from hcluster import linkage, dendrogram

t = 0.8                           # distance threshold, used later to prune the tree
Z = linkage(mat, 'single')        # single-linkage agglomerative clustering
dendrogram(Z, color_threshold=t)

import pylab
pylab.savefig("hcluster.png", dpi=800)



Finally, we need to choose a threshold at which to prune -- how similar is similar enough. A value of 0.8 worked best for my data. There is then a little work to extract the clustered items from the dendrogram and print out the IDs and titles of the final clusters.

def extract_clusters(Z, threshold, n):
    # Z is the linkage matrix: each row merges two nodes (ids >= n refer to
    # previously formed clusters) at the distance given in column 2
    clusters = {}
    ct = n
    for row in Z:
        if row[2] < threshold:
            n1 = int(row[0])
            n2 = int(row[1])

            # look up (and remove) an existing cluster, or start a new singleton
            if n1 >= n:
                l1 = clusters[n1]
                del clusters[n1]
            else:
                l1 = [n1]

            if n2 >= n:
                l2 = clusters[n2]
                del clusters[n2]
            else:
                l2 = [n2]

            l1.extend(l2)
            clusters[ct] = l1
            ct += 1
        else:
            return clusters
    return clusters

clusters = extract_clusters(Z, t, n)

for key in clusters:
    print "============================================="
    for id in clusters[key]:
        print id, titles[id]
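
As an aside, assuming the hcluster library exposes fcluster the way its scipy.cluster.hierarchy descendant does, the same pruning could be done without the hand-rolled extract_clusters; a minimal sketch:

from collections import defaultdict
from hcluster import fcluster

labels = fcluster(Z, t, criterion='distance')   # flat cluster label per document
grouped = defaultdict(list)
for doc_id, label in enumerate(labels):
    grouped[label].append(doc_id)
for ids in grouped.values():
    if len(ids) > 1:                            # only show genuine clusters
        print "============================================="
        for doc_id in ids:
            print doc_id, titles[doc_id]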

Here then, in combination with the dendrogram above, are my final results:

=============================================
35 Huawei linked to plan to sell restricted equipment to Iran
65 Exclusive: Huawei partner offered embargoed HP gear to Iran
=============================================
10 UK's Pearson invests in Barnes & Noble's Nook
73 Pearson to buy stake in Nook, Barnes & Noble shares up
=============================================
21 Kim Dotcom To Host Mega’s Launch Event At His New Mega Zealand Mansion Next Month
94 Kim Dotcom to host Mega's launch event at his New Zealand mega mansion next month
=============================================
90 Data suggests Facebook's Poke has led to online buzz around Snapchat
93 Snapchat videos and photos can be saved after all
=============================================
14 Apple CEO Tim Cook paid $4.16 million
41 Apple's Tim Cook sees his 2012 pay fall 99 percent
=============================================
8 Netflix CEO gets pay bump after 2012 cut
72 Netflix increases CEO Hastings' 2013 salary to $4 million
=============================================
95 Nissan Leaf owners can get their batteries refreshed under new warranty options
99 Japanese security firm to take to the sky in 2014, will unleash crime-fighting drones
=============================================
83 Windows RT ported to HTC HD2
77 Samsung Galaxy S III extended battery arriving in Germany on January 5
92 RumorTT: Samsung Galaxy S IV rumored for April Launch with integrated S Pen
75 Unnamed BlackBerry 10 device slides through FCC with AT&T-capable LTE abilities
98 ZTE aiming for thinner 5-inch 1080p-capable smartphone, Grand S has grand dreams
=============================================
40 HP confirms: Feds investigating the Autonomy acquisition
52 US Department of Justice is investigating allegations of accounting fraud at Autonomy
11 HP says gov't investigating troubled Autonomy unit
39 Autonomy founder fires back at HP after news of DOJ inquiry
=============================================
55 Top 10 broadband stories of 2012
63 BT wins BDUK broadband contract in Norfolk
=============================================
97 RIM's first patent payment to Nokia: $65 million
84 Fujitsu blames weak Windows 8 demand for poor sales
87 Cheezburger network raises $5M in funding for LOLcats, FAIL Blog, others
79 Samsung has big plans for Silicon Valley with 1.1 million square foot R&D center
89 Apple drops patent claim against Samsung Galaxy S III Mini

The complete output is here.

To rank the items within a cluster, the final news aggregator step, one could sort by PageRank using code such as this.
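
As a crude stand-in (not PageRank itself), here is a minimal sketch that ranks a cluster's members by a per-source reputation score; source_score and doc_source below are hypothetical, made up for illustration:

# hypothetical reputation scores per source domain (illustrative values only)
source_score = {'reuters.com': 0.9, 'cnet.com': 0.7, 'tweaktown.com': 0.4}

def rank_cluster(ids, doc_source, scores):
    # doc_source would map a document id to the domain it was fetched from
    return sorted(ids, key=lambda i: scores.get(doc_source.get(i), 0.0), reverse=True)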

We see good clusters around Huawei, Pearson, Kim Dotcom, Facebook/Snapchat, Tim Cook, and Netflix. All of those seem reasonable. The next (the Nissan Leaf and the crime-fighting drones) is spurious. Next is a broader cluster around smartphones. The next is a good four-story cluster around the HP and Autonomy scandal. The next cluster relates to broadband; the documents are related but not really about the same story. Finally, there is a cluster around "millions" that covers both dollars and square feet.

The results are not great. I'm not going to build a techmeme.com competitor from this. However, they are surprisingly respectable for a first pass and for so few lines of code. 

To try to improve the results, I tried a weighted TF-IDF. I suspect that words in the title are relatively more important than those in the description, so I set a weight of 1 for each word in the description and a weight title_wt > 1 for each word in the title. When computing term frequency, I sum the weights of the words present rather than their counts. The code for this change is here.
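
The linked code has the actual change; the core idea is sketched below, where each token carries a weight and term frequency becomes a weighted sum (title_wt and the (word, weight) representation are illustrative choices, not the exact code):

# sketch of weighted term frequency: description words get weight 1.0,
# title words get weight title_wt > 1
title_wt = 1.5   # illustrative value

def weighted_freq(word, weighted_doc):
    # weighted_doc is a list of (word, weight) pairs
    return sum(w for t, w in weighted_doc if t == word)

def weighted_tf(word, weighted_doc):
    total = float(sum(w for _, w in weighted_doc))
    return weighted_freq(word, weighted_doc) / total

# while parsing the feeds, a document would then be built as:
# weighted_doc = [(w, 1.0) for w in description_words] + [(w, title_wt) for w in title_words]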

I did not find that this improved results. The dominant parameter by far was the number of keywords taken from each document (nkeywords=4 above). Increasing this dramatically increases false positives (i.e., larger mixed clusters, such as the "millions" cluster above). Reducing it below 4 led to very few clusters (for a given document similarity threshold). Actually, I was surprised that the optimal value of 4 was so low; I had expected that a larger number would be needed to disambiguate the different articles.

A potential improvement that I have not tried is stemming the words. In the sample keyword output above you might have noticed the following:

99 KEYWORDS drones soon 2014 drone

Drones and drone are the same noun. Had I stemmed these, it would have increased the TF-IDF of drone and created space for an additional keyword.
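
A minimal sketch with NLTK's Porter stemmer shows the effect; dropped into the feed-parsing loop above, it would be a one-line change:

import nltk

stemmer = nltk.PorterStemmer()
print stemmer.stem("drones"), stemmer.stem("drone")   # both reduce to "drone"

# in the feed-parsing loop above, the change would be:
# lowerwords = [stemmer.stem(x.lower()) for x in words if len(x) > 1]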

I may be able to tweak my current code and improve the results: try different parameters, different similarity metrics, different clustering schemes. It is not clear, though, how good the results could get. Could I match a human curator? According to Gabe Rivera, the founder of Techmeme.com, algorithms are not currently a match for humans. His site employs a small team of editors [http://techmeme.com/about] to curate the top news:

Our experience leads us to believe that a thoughtful combination of both algorithmic and human editing offers the best means for curating in a space as broad as technology.

I somehow started thinking about news aggregators just a few days before Prismatic got a large round of funding. In the days that followed, there was some debate about algorithms versus humans. Here is Techcrunch:
News aggregation is a glory-less pursuit, where even the companies that do succeed have to grapple with the fact that no algorithm is ever going to be as good as the human brain at judging what news/content will appeal to said brain
And Gabe Rivera (from Techmeme.com): 
"A lot of people who think they can go all the way with the automated approach fail to realize a news story has become obsolete," said Rivera, explaining that an article can be quickly superseded even if it receives a million links or tweets. from gigaom 
and paraphrasing:
Some decisions, after all, are best left to humans.  When a news story breaks, for example, or a much-hyped product launches, a wealth of news coverage follows.  Which is the best version?  While Techmeme’s algorithms often derive this answer from “signals” on the Internet, sometimes surfacing the right story for the audience — perhaps the most comprehensive take or the most nuanced coverage — requires human intervention. outbrain.com
I should note that the experience of introducing direct editing has been a revelation even for us, despite the fact that we planned it. Interacting directly with an automated news engine makes it clear that the human+algorithm combo can curate news far more effectively than the individual human or algorithmic parts. It really feels like the age of the news cyborg has arrived. Our goal is to apply this new capability to producing the clearest and most useful tech news overview available.
(See also this thread.)

As mentioned earlier, news aggregation is not just about clustering similar stories. There are really three steps: story selection, clustering like documents, and selecting the best item from each cluster. This is a set of hard problems. Which stories might be of interest depends on the audience and other context. Clustering similar documents is probably the easiest subtask, and even that is fraught with problems, e.g. a bag-of-words model does not distinguish millions of dollars from millions of square feet. (Thus, an n-gram model might be another possible approach.) There is a sea of potential news sources with different foci; reputations come and go, quality changes, and news organization editors vary and change too. As Gabe mentions, there is a temporal aspect as well: there is a window of interest before the news gets stale.
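
For example, here is a minimal bigram sketch with NLTK; pairs such as 'million_square' and 'of_dollars' become distinct features that a plain bag of words cannot tell apart (the sentence is made up for illustration):

import nltk

tokens = "firm raises millions of dollars for a million square foot center".split()
bigram_tokens = ["_".join(pair) for pair in nltk.bigrams(tokens)]
print bigram_tokens
# [..., 'millions_of', 'of_dollars', ..., 'million_square', 'square_foot', ...]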

There is no reason that an algorithm or suite of algorithms could not do this well. There are certainly a lot of social cues now that an algorithm could use to identify trending, interesting stories: the number of likes, shares or retweets; the reputation of those sharing the news; the number or rate of comments on those stories, etc. To understand the audience and refine itself, an algorithm could use direct user feedback, and again the shares / likes / retweets of its top stories. However, it remains a hard problem. With the big news sources covering the core issues, smaller news operations have to work a different angle, and sometimes those angles are the most interesting. Journalists are not consistent: they might write a blazing article one day and a mediocre one the next. Thus, for now, an algorithmic base plus human editors, or pure crowd-sourced filtering (such as reddit), does seem to produce the best-quality results.

17 comments:

  1. Thanks for sharing this. Really helpful and informative.

  2. This is great, thank you for sharing!

  4. Thank you for a very good article. I am able to run the program partially; however, I am unable to configure hcluster. I get this error:

    from hcluster import linkage, dendrogram
    ImportError: No module named hcluster

    My question is: is it mandatory to have Visual Studio installed in order to import hcluster?

    Thanks
    Sree

  5. Sree,

    You will need to install that library yourself, as it is not part of core Python 2.7.

    I just installed on my unix system with:

    wget https://pypi.python.org/packages/source/h/hcluster/hcluster-0.2.0.tar.gz
    gunzip hcluster-0.2.0.tar.gz
    tar xvf hcluster-0.2.0.tar
    cd hcluster-0.2.0
    python setup.py install

    Hopefully this helps.

    Carl

  6. Thanks. Able to figure it out. It's working now.

  8. I used MinGW32 to install hcluster. It worked. I had to download Microsoft VS C++ as well to render the graph. Not sure why Python needs to have a dependency on the Microsoft CLR.

  9. Thanks for the wonderful article. However, I see one issue with it. I tried using the code with a couple of Google News RSS feeds.
    Ex: 'http://news.google.co.in/news?pz=1&cf=all&ned=in&hl=en&q=fifa+2014&output=rss',
    'http://news.google.co.in/news?pz=1&cf=all&ned=in&hl=en&q=fifa&output=rss',
    'http://news.google.co.in/news?pz=1&cf=all&ned=in&hl=en&q=fifa+worldcup&output=rss',
    'http://news.google.co.in/news?pz=1&cf=all&ned=in&hl=en&q=fifa+world+cup+2014&output=rss',
    'http://news.google.co.in/news?pz=1&cf=all&ned=in&hl=en&q=fifa+world+cup&output=rss',

    I see duplicate results, just with different sources. On further digging, I see that they are the same articles with minor changes to the title and content.
    The issue here is with tf*idf, which would ignore all keywords with partially matching titles -
    Ex: 1. Brazil Fights Sex Tourism, Child Prostitution Ahead of FIFA World Cup - NDTVSports.com
    2. Brazil fights sex tourism, child prostitution ahead of FIFA World Cup 2014 - India.com

    Replies
    1. Hi Raja,

      in that case, typically one would further preprocess the content by stemming or lemmatizing it. Stemmers chop off word endings to leave word roots, e.g. airliner -> airlin (the final word root may or may not be a proper word), and lemmatizers turn a word into its canonical form (airliner -> airline). NLTK has a whole slew of different stemmers (default=Porter) and lemmatizers. This way, you standardize the content to enable more matches among similar but slightly different content. I hope that this helps.

      Carl

  10. Hi Carl, thanks for such a wonderful article. Can you also please provide some pointers/approaches for the 1st and 3rd items, i.e. filter and choose, so that I can do some research on those as well?

    Replies
    1. Mansoor,

      Thanks for your kind words. For choosing a representative story, there are many ways, but you could try to ascertain some sort of "reputation" metric for the source. Thus, CNN or Reuters should have a higher score than some small-town newspaper covering the same story. One approach could be to use Google. Searching for "wired" shows 2.6 million followers on Google+, but Techcrunch has 5.3 million. It is crude, but it might work.

      For filtering, it kind of depends on the source of data and the context. For instance, in some situations you could ask the reader to list their interests (as Prismatic does) and do some keyword matching. You could cluster the data and look at the size of the clusters themselves. If there are 35 stories from different sources in one cluster, it is probably a big story.

      Those are a couple of ideas off the top of my head.

      Regards,

      Carl

  11. Hi. Thanks for the article. I'm unable to install hcluster.

  12. I have solved this issue by installing scipy.

  14. An awesome post!! Thanks a lot...

