Monday, December 31, 2012

Howto: build a news aggregator (in 100 LOC)


In this post, I'll cover how to build a crude, simple, but working news aggregator in about 100 lines of Python (source code). While a complete news aggregator comprises at least three steps:
  • filter: select a subset of stories of interest, i.e. the top news -- e.g., that Zynga's shuttering of several games will be one of the most interesting stories to Techmeme's readership in the screenshot below
  • aggregate: cluster similar documents -- e.g., recognizing that Techcrunch and GamesIndustry International are both covering the same story
  • choose: pick the best representative story from each cluster -- e.g., that the Techcrunch story should be highlighted over GamesIndustry International's,
in this article we will focus only on the second step: aggregating similar articles.


Let us state the goal explicitly: given multiple sources of documents, such as RSS news feeds, cluster together similar documents that cover the same material. For instance, both Reuters and Associated Press might cover Hillary Clinton's admission to hospital (a major story at the time of writing). In the Google News screenshot below, we can see that USA Today, Detroit Free Press, Boston Herald and at least 10 other news organizations covered that story. We wish to recognize that these report the same underlying story and to cluster, or aggregate, them.


My approach was the following:
  • parse RSS feeds
  • extract keywords using TF-IDF (i.e., term frequency * inverse document frequency)
  • union those into a set of features that encompasses the whole corpus
  • create a feature vector for each document
  • compute a matrix of pairwise cosine distances
  • run hierarchical clustering and plot dendrogram
  • extract final clusters of interest
I'm not claiming this is an optimal scheme or that it will necessarily scale well, and I did not focus on vectorized operations. It does, however, produce non-embarrassing results in 100 lines of code. Later, we will discuss other approaches, including the algorithms-versus-humans debate.

Let's build an aggregator. To start, we will need some data sources. I chose a handful of RSS feeds about technology and IT news:

feeds = [
    'http://www.sfgate.com/rss/feed/Tech-News-449.php',
    'http://feeds.feedburner.com/TechCrunch/startups',
    'http://news.cnet.com/8300-1001_3-92.xml',
    'http://www.zdnet.com/news/rss.xml',
    'http://www.computerweekly.com/rss/Latest-IT-news.xml',
    'http://feeds.reuters.com/reuters/technologyNews',
    'http://www.tweaktown.com/news-feed/'
]

(We can assume that these sources have already mostly accomplished the first news aggregator step: filtering.)

Next, we need to parse the sources. I decided to use the document title and the description, which is a shortish summary. I used the feedparser Python library and the natural language processing library NLTK. For each document, I concatenate the title and description, convert to lowercase, and remove one-character words. I store those final token lists in a corpus and also store the titles to make the final results more interpretable:

import feedparser
import nltk

corpus = []
titles = []
ct = -1
for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        # strip HTML from the description, tokenize it, then add the title tokens
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        # lowercase and drop one-character tokens
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])

Here is a sample of those data (titles only):

0 TITLE Tweeters say the dumbest things
1 TITLE IFixit thrives by breaking tech items down
2 TITLE Earthcam's New Year's Live 2013
3 TITLE Content Fleet offers publishers hot tips
...
96 TITLE Sony stops shipments of PS2 console in Japan
97 TITLE RIM's first patent payment to Nokia: $65 million
98 TITLE ZTE aiming for thinner 5-inch 1080p-capable smartphone, Grand S has grand dreams
99 TITLE Japanese security firm to take to the sky in 2014, will unleash crime-fighting drones

We will extract keywords using TF-IDF. I modified the following from this post:

import math

def freq(word, document): return document.count(word)
def wordCount(document): return len(document)

def numDocsContaining(word, documentList):
    count = 0
    for document in documentList:
        if freq(word, document) > 0:
            count += 1
    return count

# term frequency, normalized by document length
def tf(word, document): return freq(word, document) / float(wordCount(document))

# inverse document frequency; float() avoids Python 2 integer division
def idf(word, documentList): return math.log(len(documentList) / float(numDocsContaining(word, documentList)))

def tfidf(word, document, documentList): return tf(word, document) * idf(word, documentList)

I only want to extract a few top keywords from each document, and I'll store the set of those keywords, across all documents, in key_word_list:

import operator

def top_keywords(n, doc, corpus):
    # score every distinct word in the document and return the n highest-scoring
    d = {}
    for word in set(doc):
        d[word] = tfidf(word, doc, corpus)
    sorted_d = sorted(d.iteritems(), key=operator.itemgetter(1), reverse=True)
    return [w[0] for w in sorted_d[:n]]

key_word_list = set()
nkeywords = 4
for doc in corpus:
    for x in top_keywords(nkeywords, doc, corpus):
        key_word_list.add(x)

ct = -1
for doc in corpus:
    ct += 1
    print ct, "KEYWORDS", " ".join(top_keywords(nkeywords, doc, corpus))

Here is a sample of those data:

0 KEYWORDS she tweeted athletes south
1 KEYWORDS ifixit repair devincenzi items
2 KEYWORDS earthcam live famed sound
3 KEYWORDS fleet content tips publishers
...
96 KEYWORDS playstation sony console ouya
97 KEYWORDS payment nokia patent agreement
98 KEYWORDS zte grand boots ces
99 KEYWORDS drones soon 2014 drone

Now that we have this superset of keywords, we need to go through each document again and compute the TF-IDF of each of those terms. The resulting feature vector will likely be sparse, as most of the entries will be zero.

feature_vectors = []
n = len(corpus)

for document in corpus:
    # one TF-IDF entry per keyword in the global keyword set
    vec = [tfidf(word, document, corpus) if word in document else 0
           for word in key_word_list]
    feature_vectors.append(vec)

With a TF-IDF vector for each document, we can compare every pair of documents. I decided to use cosine distance -- NLTK's cosine_distance, which is one minus the cosine similarity (and yes, this matrix is symmetric, so the code is far from optimized):

import numpy
mat = numpy.empty((n, n))
for i in xrange(0, n):
    for j in xrange(0, n):
        # the matrix is symmetric, but for simplicity both halves are computed
        mat[i][j] = nltk.cluster.util.cosine_distance(feature_vectors[i], feature_vectors[j])
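
As an aside, the same pairwise distances can be computed in one vectorized call. A minimal sketch, assuming SciPy is installed (the condensed form returned by pdist is also the form that linkage accepts as a precomputed distance matrix):

import numpy
from scipy.spatial.distance import pdist, squareform

# all pairwise cosine distances in one call; squareform expands the
# condensed result into the full symmetric matrix used above
condensed = pdist(numpy.array(feature_vectors), 'cosine')
mat_fast = squareform(condensed)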

We now have a single distance number for each pair of documents.

The next step is clustering. The most obvious first choice might be k-means or k-means++ clustering: it is simple, well known, and can work relatively well. Besides its tendency to get stuck in local optima, the major problem here is that one has to choose k, and that is not always easy. Moreover, I strongly suspect that the characteristics of the news change dramatically from day to day. On a presidential election day, almost all top stories will be about voting and speculation about the results, so there may be very few, very large clusters; on a slow news day the story landscape may look very different. As such, I think it would be difficult to choose a constant k across days, and the data may change so much that a scheme to choose k dynamically could be tricky.
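
For illustration only, here is a minimal sketch of what that would look like with NLTK's k-means clusterer; the num_means value is an arbitrary guess, which is exactly the problem:

import numpy
from nltk.cluster import KMeansClusterer, cosine_distance

# k must be fixed up front (10 below is an arbitrary guess)
vectors = [numpy.array(f) for f in feature_vectors]
kclusterer = KMeansClusterer(10, cosine_distance, repeats=5)
assignments = kclusterer.cluster(vectors, assign_clusters=True)
print assignments[:10]   # cluster label assigned to each of the first ten documents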

I chose a different approach: agglomerative, or hierarchical, clustering. Here, clusters are grown by fusing neighboring documents together to form a tree. The structure of that tree may change radically from day to day, but we can choose a distance threshold to prune the tree down to a set of final clusters, and that threshold can be held constant across days.

I used the hcluster Python library and saved the dendrogram to file:

from hcluster import linkage, dendrogram

t = 0.8                           # distance threshold, used later to prune the tree
Z = linkage(mat, 'single')        # single-linkage agglomerative clustering
dendrogram(Z, color_threshold=t)

import pylab
pylab.savefig("hcluster.png", dpi=800)



Finally, we need to choose a threshold at which to prune -- how similar is similar enough. A value of 0.8 worked best for my data. There is then a little work to extract the clustered items from the dendrogram and print out the IDs and titles of the final clusters.

def extract_clusters(Z, threshold, n):
    # Z is the linkage matrix: each row merges two nodes (ids >= n refer to
    # previously formed clusters) at the distance given in column 2
    clusters = {}
    ct = n
    for row in Z:
        if row[2] < threshold:
            n1 = int(row[0])
            n2 = int(row[1])

            # look up (and remove) an existing cluster, or start a new singleton
            if n1 >= n:
                l1 = clusters[n1]
                del clusters[n1]
            else:
                l1 = [n1]

            if n2 >= n:
                l2 = clusters[n2]
                del clusters[n2]
            else:
                l2 = [n2]

            l1.extend(l2)
            clusters[ct] = l1
            ct += 1
        else:
            return clusters
    return clusters

clusters = extract_clusters(Z, t, n)

for key in clusters:
    print "============================================="
    for id in clusters[key]:
        print id, titles[id]
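
As an aside, assuming the hcluster library exposes fcluster the way its scipy.cluster.hierarchy descendant does, the same pruning could be done without the hand-rolled extract_clusters; a minimal sketch:

from collections import defaultdict
from hcluster import fcluster

labels = fcluster(Z, t, criterion='distance')   # flat cluster label per document
grouped = defaultdict(list)
for doc_id, label in enumerate(labels):
    grouped[label].append(doc_id)
for ids in grouped.values():
    if len(ids) > 1:                            # only show genuine clusters
        print "============================================="
        for doc_id in ids:
            print doc_id, titles[doc_id]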

Here then, in combination with the dendrogram above, are my final results:

=============================================
35 Huawei linked to plan to sell restricted equipment to Iran
65 Exclusive: Huawei partner offered embargoed HP gear to Iran
=============================================
10 UK's Pearson invests in Barnes & Noble's Nook
73 Pearson to buy stake in Nook, Barnes & Noble shares up
=============================================
21 Kim Dotcom To Host Mega’s Launch Event At His New Mega Zealand Mansion Next Month
94 Kim Dotcom to host Mega's launch event at his New Zealand mega mansion next month
=============================================
90 Data suggests Facebook's Poke has led to online buzz around Snapchat
93 Snapchat videos and photos can be saved after all
=============================================
14 Apple CEO Tim Cook paid $4.16 million
41 Apple's Tim Cook sees his 2012 pay fall 99 percent
=============================================
8 Netflix CEO gets pay bump after 2012 cut
72 Netflix increases CEO Hastings' 2013 salary to $4 million
=============================================
95 Nissan Leaf owners can get their batteries refreshed under new warranty options
99 Japanese security firm to take to the sky in 2014, will unleash crime-fighting drones
=============================================
83 Windows RT ported to HTC HD2
77 Samsung Galaxy S III extended battery arriving in Germany on January 5
92 RumorTT: Samsung Galaxy S IV rumored for April Launch with integrated S Pen
75 Unnamed BlackBerry 10 device slides through FCC with AT&T-capable LTE abilities
98 ZTE aiming for thinner 5-inch 1080p-capable smartphone, Grand S has grand dreams
=============================================
40 HP confirms: Feds investigating the Autonomy acquisition
52 US Department of Justice is investigating allegations of accounting fraud at Autonomy
11 HP says gov't investigating troubled Autonomy unit
39 Autonomy founder fires back at HP after news of DOJ inquiry
=============================================
55 Top 10 broadband stories of 2012
63 BT wins BDUK broadband contract in Norfolk
=============================================
97 RIM's first patent payment to Nokia: $65 million
84 Fujitsu blames weak Windows 8 demand for poor sales
87 Cheezburger network raises $5M in funding for LOLcats, FAIL Blog, others
79 Samsung has big plans for Silicon Valley with 1.1 million square foot R&D center
89 Apple drops patent claim against Samsung Galaxy S III Mini

The complete output is here.

To rank the items within a cluster, the final news aggregator step, one could sort by PageRank using code such as this.
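
As a crude stand-in (not PageRank itself), here is a minimal sketch that ranks a cluster's members by a per-source reputation score; source_score and doc_source below are hypothetical, made up for illustration:

# hypothetical reputation scores per source domain (illustrative values only)
source_score = {'reuters.com': 0.9, 'cnet.com': 0.7, 'tweaktown.com': 0.4}

def rank_cluster(ids, doc_source, scores):
    # doc_source would map a document id to the domain it was fetched from
    return sorted(ids, key=lambda i: scores.get(doc_source.get(i), 0.0), reverse=True)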

We see good clusters around Huawei, Pearson, Kim Dotcom, Facebook/Snapchat, Tim Cook, and Netflix. All of those seem reasonable. The next (the Nissan Leaf and the crime-fighting drones) is spurious. Next is a broader cluster around smartphones. The next is a good four-story cluster around the HP and Autonomy scandal. The next cluster relates to broadband; the documents are related but not really about the same story. Finally, there is a cluster around "millions" that covers both dollars and square feet.

The results are not great. I'm not going to build a techmeme.com competitor from this. However, they are surprisingly respectable for a first pass and for so few lines of code. 

To try to improve the results, I tried a weighted TF-IDF. I suspect that words in the title are relatively more important than those in the description, so I set a weight of 1 for each word in the description and a weight title_wt > 1 for each word in the title. When computing term frequency, I sum the weights of the words present rather than their counts. The code for this change is here.
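
The linked code has the actual change; the core idea is sketched below, where each token carries a weight and term frequency becomes a weighted sum (title_wt and the (word, weight) representation are illustrative choices, not the exact code):

# sketch of weighted term frequency: description words get weight 1.0,
# title words get weight title_wt > 1
title_wt = 1.5   # illustrative value

def weighted_freq(word, weighted_doc):
    # weighted_doc is a list of (word, weight) pairs
    return sum(w for t, w in weighted_doc if t == word)

def weighted_tf(word, weighted_doc):
    total = float(sum(w for _, w in weighted_doc))
    return weighted_freq(word, weighted_doc) / total

# while parsing the feeds, a document would then be built as:
# weighted_doc = [(w, 1.0) for w in description_words] + [(w, title_wt) for w in title_words]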

I did not find that this improved results. The dominant parameter by far was the number of keywords taken from each document (nkeywords=4 above). Increasing this dramatically increases false positives (i.e., larger mixed clusters, such as the "millions" cluster above). Reducing it below 4 led to very few clusters (for a given document similarity threshold). Actually, I was surprised that the optimal value of 4 was so low; I had expected that a larger number would be needed to disambiguate the different articles.

A potential improvement that I have not tried is stemming the words. In the sample keyword output above you might have noticed the following:

99 KEYWORDS drones soon 2014 drone

Drones and drone are the same noun. Had I stemmed these, it would have increased the TF-IDF of drone and created space for an additional keyword.
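
A minimal sketch with NLTK's Porter stemmer shows the effect; dropped into the feed-parsing loop above, it would be a one-line change:

import nltk

stemmer = nltk.PorterStemmer()
print stemmer.stem("drones"), stemmer.stem("drone")   # both reduce to "drone"

# in the feed-parsing loop above, the change would be:
# lowerwords = [stemmer.stem(x.lower()) for x in words if len(x) > 1]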

I may be able to tweak my current code and improve the results: try different parameters, different similarity metrics, different clustering schemes. It is not clear, though, how good the results could get. Could I match a human curator? According to Gabe Rivera, the founder of Techmeme.com, algorithms are not currently a match for humans. His site employs a small team of editors [http://techmeme.com/about] to curate the top news:

Our experience leads us to believe that a thoughtful combination of both algorithmic and human editing offers the best means for curating in a space as broad as technology.

I somehow started thinking about news aggregators just a few days before Prismatic got a large round of funding. In the days that followed, there was some debate about algorithms versus humans. Here is Techcrunch:
News aggregation is a glory-less pursuit, where even the companies that do succeed have to grapple with the fact that no algorithm is ever going to be as good as the human brain at judging what news/content will appeal to said brain
And Gabe Rivera (from Techmeme.com): 
"A lot of people who think they can go all the way with the automated approach fail to realize a news story has become obsolete," said Rivera, explaining that an article can be quickly superseded even if it receives a million links or tweets. from gigaom 
and paraphrasing:
Some decisions, after all, are best left to humans.  When a news story breaks, for example, or a much-hyped product launches, a wealth of news coverage follows.  Which is the best version?  While Techmeme’s algorithms often derive this answer from “signals” on the Internet, sometimes surfacing the right story for the audience — perhaps the most comprehensive take or the most nuanced coverage — requires human intervention. outbrain.com
I should note that the experience of introducing direct editing has been a revelation even for us, despite the fact that we planned it. Interacting directly with an automated news engine makes it clear that the human+algorithm combo can curate news far more effectively than the individual human or algorithmic parts. It really feels like the age of the news cyborg has arrived. Our goal is to apply this new capability to producing the clearest and most useful tech news overview available.
(See also this thread.)

As mentioned earlier, news aggregation is not just about clustering similar stories. There are really three steps: story selection, clustering like documents, and selecting the best item from each cluster. This is a set of hard problems. Which stories might be of interest depends on the audience and other context. Clustering similar documents is probably the easiest subtask, and even that is fraught with problems, e.g. a bag-of-words model does not distinguish millions of dollars from millions of square feet. (Thus, an n-gram model might be another possible approach.) There is a sea of potential news sources with different foci; reputations come and go, quality changes, and news organization editors vary and change too. As Gabe mentions, there is a temporal aspect as well: there is a window of interest before the news gets stale.
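
For example, here is a minimal bigram sketch with NLTK; pairs such as 'million_square' and 'of_dollars' become distinct features that a plain bag of words cannot tell apart (the sentence is made up for illustration):

import nltk

tokens = "firm raises millions of dollars for a million square foot center".split()
bigram_tokens = ["_".join(pair) for pair in nltk.bigrams(tokens)]
print bigram_tokens
# [..., 'millions_of', 'of_dollars', ..., 'million_square', 'square_foot', ...]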

There is no reason that an algorithm or suite of algorithms could not do this well. There are certainly a lot of social cues now that an algorithm could use to identify trending, interesting stories: the number of likes, shares or retweets; the reputation of those sharing the news; the number or rate of comments on those stories, etc. To understand the audience and refine itself, an algorithm could use direct user feedback, and again the shares / likes / retweets of its top stories. However, it remains a hard problem. With the big news sources covering the core issues, smaller news operations have to work a different angle, and sometimes those angles are the most interesting. Journalists are not consistent: they might write a blazing article one day and a mediocre one the next. Thus, for now, an algorithmic base plus human editors, or pure crowd-sourced filtering (such as reddit), does seem to produce the best-quality results.

17 comments:

  1. Thanks for sharing this. Really helpful and informative.

  2. This is great, thank you for sharing!

  4. Thank you for a very good article. I am able to run the program partially; however, I am unable to configure hcluster. I get this error:

    from hcluster import linkage, dendrogram
    ImportError: No module named hcluster

    My question is: is it mandatory to have Visual Studio installed in order to import hcluster?

    Thanks
    Sree

  5. Sree,

    You will need to install that library yourself, as it is not part of core Python 2.7.

    I just installed on my unix system with:

    wget https://pypi.python.org/packages/source/h/hcluster/hcluster-0.2.0.tar.gz
    gunzip hcluster-0.2.0.tar.gz
    tar xvf hcluster-0.2.0.tar
    cd hcluster-0.2.0
    python setup.py install

    Hopefully this helps.

    Carl

  6. Thanks. Able to figure it out. It's working now.

  8. I used MinGW32 to install hcluster. It worked. I had to download Microsoft VS C++ as well to render the graph. Not sure why Python needs to have a dependency on the Microsoft CLR.

  9. Thanks for the wonderful article. However, I see one issue with it. I tried using the code with a couple of Google News RSS feeds.
    Ex: 'http://news.google.co.in/news?pz=1&cf=all&ned=in&hl=en&q=fifa+2014&output=rss',
    'http://news.google.co.in/news?pz=1&cf=all&ned=in&hl=en&q=fifa&output=rss',
    'http://news.google.co.in/news?pz=1&cf=all&ned=in&hl=en&q=fifa+worldcup&output=rss',
    'http://news.google.co.in/news?pz=1&cf=all&ned=in&hl=en&q=fifa+world+cup+2014&output=rss',
    'http://news.google.co.in/news?pz=1&cf=all&ned=in&hl=en&q=fifa+world+cup&output=rss',

    I see duplicate results, just with different sources. On further digging, I see that they are the same articles with minor changes to the title and content.
    The issue here is with tf*idf, which would ignore all keywords with partially matching titles -
    Ex: 1. Brazil Fights Sex Tourism, Child Prostitution Ahead of FIFA World Cup - NDTVSports.com
    2. Brazil fights sex tourism, child prostitution ahead of FIFA World Cup 2014 - India.com

    Replies
    1. Hi Raja,

      in that case, typically one would further preprocess the content by stemming or lemmatizing it. Stemmers chop off word endings to leave word roots, e.g. airliner -> airlin (the final word root may or may not be a proper word), and lemmatizers turn a word into its canonical form (airliner -> airline). NLTK has a whole slew of different stemmers (default=Porter) and lemmatizers. This way, you standardize the content to enable more matches among similar but slightly different content. I hope that this helps.

      Carl

  10. Hi Carl, thanks for such a wonderful article. Can you also please provide some pointers/approaches for the 1st and 3rd items, i.e. filter and choose, so that I can do some research on those as well?

    Replies
    1. Mansoor,

      Thanks for your kind words. For choosing a representative story, there are many ways, but you could try to ascertain some sort of "reputation" metric for the source. Thus, CNN or Reuters should have a higher score than some small-town newspaper covering the same story. One approach could be to use Google. Searching for "wired" shows 2.6 million followers on Google+, but Techcrunch has 5.3 million. It is crude, but it might work.

      For filtering, it kind of depends on the source of data and the context. For instance, in some situations you could ask the reader to list their interests (as Prismatic does) and do some keyword matching. You could cluster the data and look at the size of the clusters themselves. If there are 35 stories from different sources in one cluster, it is probably a big story.

      Those are a couple of ideas off the top of my head.

      Regards,

      Carl

  11. Hi. Thanks for the article. I'm unable to install hcluster.

  12. I have solved this issue by installing scipy.

  14. An awesome post!! Thanks a lot...

