p-value.info: November 2012

Thursday, November 29, 2012

Introduction to Sentiment Analysis

In this post we are going to explore sentiment analysis using python. Let us first define the problem. Given a corpus of documents, we want to train a classifier to classify documents into one of several classes:

positive: e.g., "that movie was awesome"

negative: e.g., "that movie sucked"

We may also want a third category:

neutral: e.g., "I saw the movie at the Odeon".

In reality, we may have a fourth category unknown where we cannot tell. For instance, it may contain a document in a foreign language, a set of stop words only, or a set of words that our classifier does not understand, e.g. SMS acronyms (e.g., "cu l8r"). For simplicity, we can simply not classify these documents or we can categorize them as neutral.

To train the classifier we will need to choose a set of features, the facets of data that the classifier will work on. The simplest scenario is to use each word in the document. (Later, we will discuss more complex features and other features.)

This is a supervised learning problem so we will need a labeled corpus of data. That is, a set of documents each of which has been labeled with one of our class names or IDs. One of the simplest ways is a CSV or a TSV file:

pos that movie was awesome

neu I saw the movie at the Odeon

pos I love this place

pos I am happy

....

We will need a learner, an algorithm that will learn to classify. In most examples that I've seen online, people have used a naive Bayes classifier but there are many others that one could choose.

Importantly, sentiment analysis is very context sensitive. If you train a classifier using movie review data, it likely will not fare well classifying documents about election results or your startup's product. A classifier trained on US english tweets may or may not classify UK tweets well.

Finally, we cannot expect to assign sentiment correctly 100% of the time. Even humans can often disagree about the sentiment of documents. There are several reasons:

Cultural differences mentioned above. For example, "rocking" and "sick" are positive adjectives to members of some demographics and cultures but not others.
English is a hard and ambiguous language. For instance, consider Chomsky's example: "old men and women". Is that [old men] and [old women] or does it mean [old men] and [women]. Another example is Eats, Shoots & Leaves.
Sarcasm, innuendo, and double entendres. This is one of the current challenges in NLP. For a fun example take a look at the problem of detecting that's what she said.
That people often use a bag of words model for sentiment analysis, at least as a first pass. That is, we analyze a document as a set of words and not a phrase. Thus, we will miss that the "not" in "not good" negates "good". In general, we will miss double negatives and other qualifiers. I love this illustrative example:

During a lecture the Oxford linguistic philosopher J.L. Austin made the claim that although a double negative in English implies a positive meaning, there is no language in which a double positive implies a negative. To which Morgenbesser responded in a dismissive tone, "Yeah, yeah."

Now we have the background out of the way, let's starting building a concrete example.

First, let's grab a corpus. I'm taking the UMICH SI650 - Sentiment Classification training set from Kaggle.

This is a tab-delimited file with 7086 sentences tagged as 1 or 0.

head training.txt

1 The Da Vinci Code book is just awesome.

1 this was the first clive cussler i've ever read, but even books like Relic, and Da Vinci code were more plausible than this.

1 i liked the Da Vinci Code a lot.

1 I liked the Da Vinci Code but it ultimatly didn't seem to hold it's own.

1 that's not even an exaggeration ) and at midnight we went to Wal-Mart to buy the Da Vinci Code, which is amazing of course.

1 I loved the Da Vinci Code, but now I want something better and different!..

1 i thought da vinci code was great, same with kite runner.

1 The Da Vinci Code is actually a good movie...

1 I thought the Da Vinci Code was a pretty good book.

There are many duplicates. Let's remove them:

cat training.txt | sort | uniq > uniq_training.txt

How many positive and negative samples remain?

cat uniq_training.txt | grep ^1 | wc -l

772

cat uniq_training.txt | grep ^0 | wc -l

639

We need to extract features from a document. We'll take unique, lowercase words with more than two characters:

def extract_features(document):

features={}

for word in set(document.split()):

if len(word) > 2:

features['contains(%s)' % word.lower()] = True

return features

>>> extract_features('" A couple of very liberal people I know thought Brokeback Mountain was " stupid exploitation.')

{'contains(very)': True, 'contains(people)': True, 'contains(couple)': True, 'contains(mountain)': True, 'contains(was)': True, 'contains(brokeback)': True, 'contains(liberal)': True, 'contains(exploitation.)': True, 'contains(know)': True, 'contains(thought)': True, 'contains(stupid)': True}

We need to read in our documents:

documents=[]

f = open("uniq_training.txt","r")

for document in f.readlines():

parts= document.strip().split("\t")

documents.append((parts[1],bool(int(parts[0]))))

>>> documents[0]

('" A couple of very liberal people I know thought Brokeback Mountain was " stupid exploitation.', True)

Now extract features from each document in our corpus and split into a training set (80%) and a test set (20%):

import random

random.seed(1234) #so that you can reproduce my results if you wish

random.shuffle(documents)

import nltk

n_train = int(0.8*len(documents))

training_set = nltk.classify.apply_features(extract_features,documents[:n_train])

test_set = nltk.classify.apply_features(extract_features,documents[n_train:])

>>> training_set[0]

({'contains(very)': True, 'contains(people)': True, 'contains(couple)': True, 'contains(mountain)': True, 'contains(was)': True, 'contains(brokeback)': True, 'contains(liberal)': True, 'contains(exploitation.)': True, 'contains(know)': True, 'contains(thought)': True, 'contains(stupid)': True}, True)

Finally, let's now train our classifier:

>>> classifier = nltk.NaiveBayesClassifier.train(training_set)

>>> nltk.classify.accuracy(classifier, test_set)

0.911660777385159

Whoah, 91% accuracy isn't bad at all given that we had a fairly balanced training set:

ct_pos=0

for d in training_set:

if d[1]==True: ct_pos+=1

print ct_pos, len(training_set)-ct_pos

617 511

By that I mean, I would have been suspicious if we had a very imbalanced training set and say 95% of samples were positive.

The proper way to assess the performance is to examine precision, recall and F-scores:

import collections
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(test_set):
refsets[label].add(i)
observed = classifier.classify(feats)
testsets[observed].add(i)

>>> print 'pos precision: %2.3f' % nltk.metrics.precision(refsets[True], testsets[True])

pos precision: 0.951

>>> print 'pos recall: %2.3f' % nltk.metrics.recall(refsets[True], testsets[True])

pos recall: 0.884

>>> print 'pos F-measure: %2.3f' % nltk.metrics.f_measure(refsets[True], testsets[True])

pos F-measure: 0.916

>>> print 'neg precision: %2.3f' % nltk.metrics.precision(refsets[False], testsets[False])

neg precision: 0.871

>>> print 'neg recall: %2.3f' % nltk.metrics.recall(refsets[False], testsets[False])

neg recall: 0.945

>>> print 'neg F-measure: %2.3f' % nltk.metrics.f_measure(refsets[False], testsets[False])

neg F-measure: 0.906

and now the confusion matrix:

observed=[]

actual=[]

for i, (feats, label) in enumerate(test_set):

actual.append(label)

observed.append(classifier.classify(feats))

print nltk.ConfusionMatrix(actual,observed)

| F |

| a T |

| l r |

| s u |

| e e |

------+---------+

False |<121> 7 |

True | 18<137>|

------+---------+

(row = reference; col = test)

These are all great numbers.

Now we can use it:

>>> classifier.classify(extract_features("that movie was awful"))

False

>>> classifier.classify(extract_features("that movie was great"))

True

>>> classifier.show_most_informative_features(16)

Most Informative Features

contains(awesome) = True True : False = 37.8 : 1.0

contains(sucked) = True False : True = 31.0 : 1.0

contains(hate) = True False : True = 24.0 : 1.0

contains(love) = True True : False = 17.8 : 1.0

contains(heard) = True False : True = 10.1 : 1.0

contains(kinda) = True False : True = 8.4 : 1.0

contains(it,) = True False : True = 7.6 : 1.0

contains(want) = True True : False = 6.9 : 1.0

contains(evil) = True False : True = 5.6 : 1.0

contains(those) = True False : True = 5.2 : 1.0

contains(didn't) = True False : True = 5.2 : 1.0

contains(has) = True False : True = 5.2 : 1.0

contains(think) = True False : True = 4.7 : 1.0

contains(miss) = True True : False = 4.7 : 1.0

contains(watch) = True False : True = 4.6 : 1.0

contains(liked) = True True : False = 4.5 : 1.0

>>>

These terms make a lot of sense: awesome, suck, hate, love...

For a more detailed tutorial using NLTK see offical docs. See also this post.

We trained our classifier to learn keywords that have particularly positive or negative associations (the most informative features). Once it has learned that list it is then easy to apply to a new input document. There are list of these learned words available such as AFINN and pattern. For these then, running sentiment analysis is trivial.

In pattern, "the sentiment() function returns a (polarity, subjectivity)-tuple for the given sentence (based on the adjectives in it), with polarity between -1.0 and 1.0 and subjectivity between 0.0 and 1.0".

Let's fire up a sentiment analyzer and use it

>>> from pattern.en import sentiment

>>> print sentiment("that movie was awesome")

(1.0, 1.0)

>>> print sentiment("that movie sucked")

(0, 0)

>>> print sentiment("that movie was ok")

(0.5, 0.5)

>>> print sentiment("that movie was fairly good")

(0.6, 0.8500000000000001)

That's all there is to it.

You can see that returning a continuous value from (-1,1) means that we could easily define neutral documents as those with some intermediate value, say (-0.1,0.1).

Let's return to the issue mentioned earlier that this is just a bag of words model and the issue "not good" versus "good". One thing we can do is to include bigrams in our features. That is, adjacent pairs of words:

>>> from nltk import bigrams

>>> bigrams("That movie was awful".split())

[('That', 'movie'), ('movie', 'was'), ('was', 'awful')]

Let's add bigrams to our extract features function:

def extract_features(document):

features={}

for word in set(document.split()):

if len(word) > 2:

features['contains(%s)' % word.lower()] = True

for bigram in bigrams(document.split()):

features['contains(%s)' % "_".join(i.lower() for i in bigram)] = True

return features

Retraining our classifier with this, we get 93.6% accuracy and a more interesting feature list: I hate, I love, I think, love the...

>>> classifier.show_most_informative_features(16)

Most Informative Features

contains(awesome) = True True : False = 37.8 : 1.0

contains(sucked) = True False : True = 31.0 : 1.0

contains(hate) = True False : True = 24.0 : 1.0

contains(love_the) = True True : False = 22.4 : 1.0

contains(i_hate) = True False : True = 22.2 : 1.0

contains(i_love) = True True : False = 22.1 : 1.0

contains(love) = True True : False = 17.8 : 1.0

contains(code_was) = True True : False = 15.7 : 1.0

contains(i_think) = True False : True = 14.9 : 1.0

contains(3_was) = True True : False = 13.0 : 1.0

contains(i_like) = True True : False = 11.9 : 1.0

contains(like_harry) = True True : False = 10.2 : 1.0

contains(heard) = True False : True = 10.1 : 1.0

contains(i_heard) = True False : True = 9.3 : 1.0

contains(the_mission) = True True : False = 8.6 : 1.0

contains(kinda) = True False : True = 8.4 : 1.0

Hopefully you can see that you can get some good sentiment results with just a few tens of lines of python, if you have the right libraries.

Here is the source code: p-value.info github

Monday, November 26, 2012

Why k-means++ works

I've become somewhat obsessed by clustering recently. Like many people I tend to use that general workhorse: k-means (also known as Lloyd's algorithm in comp. sci. circles). It does a pretty good job, on average, and is simple and intuitive:

Initialization step: pick k clusters at random from the training set
Expectation step: assign each member of the training set to the nearest of those k centers
Minimization step: update cluster centers to be the centroid of the training samples assigned to them
repeat the expectation-maximization steps until convergence, i.e. there is no change in the assignments or the cluster centers.

With a few tens of iterations, maybe up to a hundred, the algorithm will have converged to a reasonable set of clusters

if they exist.
if k is appropriate for the data.
if your clusters are roughly spherical.
if dimensionality is not too high / data are not too sparse.

While it is guaranteed to converge, it is not guaranteed to find the global minimum. It can get stuck on local optima. This can happen because of the stochastic nature of the initialization step in which it is possible for two or more of the initial cluster centers to be chosen close together and for k-means to then split the cluster. This is problematic enough that it is typically recommended to run k-means multiple times and to then choose the run with the lowest cost (sum of squared distances between the centers and their associated training samples).

In a key paper by David & Vassilvitskii 2007, k-means++: the advantages of careful seeding, they authors presented a tweak to k-means which (on average) reduced both the cost and the running time of k-means. It sounds too good to be true but it works. They tackled the cluster initialization step to avoid cluster centers being chosen close together. Their idea was as follows:

Pick the first cluster center randomly from the training set.
Compute the distance from each training sample to that first center. Pick the second cluster center in proportion to the squared distances.
Now compute the shortest distance from each of the remaining training samples to all the cluster centers. Pick the next cluster center in proportion to those squared distances.
Repeat step above until k centers have been chosen.

This seems simple enough but what is it doing? The first center is chosen randomly. The second center is chosen to be, on average, far from the first. The third is chosen to be far from both the previous two. And so on. With the centers more or less spread out throughout the dataset, there is less chance of splitting clusters and with the centers being a better starting point, the centers can converge more quickly. This more complex initialization comes at a cost---it is as costly as one iteration of k-means---but overall it converges more quickly and so is cheaper.

Their analyses and data demonstrated clearly that this is a better approach. However, something bugged me about this. There were two aspects:

It is probabilistic. There is still a chance that clusters can be close together.
We choose cluster centers sequentially which is not necessarily the same arrangement that one might choose if we could choose the k clusters simultaneously. That is, imagine your data were uniformly distributed across a disk and k = 3. You choose the first cluster, the expected location of which would be the center of the disk. The second cluster would be at the margin of the disk. Where now would the third cluster be placed? At the margin opposite center #2. The clusters centers are not optimally placed. If we could choose 3 cluster centers at once, we might choose them in an equilateral triangle at 0, 120, and 240 degrees and at say radius 1/sqrt(2).

I started thinking through this. How could you deterministically and globally choose the optimal cluster placement? How about a regular square lattice? No. What do you do with k = 7? A 3 * 3 grid with some cells empty? OK then, how about multidimensional cell packing, similar to soap bubbles? That is tricky, computationally intensive but possibly doable. How about an electron shell type arrangement in which we place a cluster center at the center of the data, a next shell with say 2 centers, the next outer shell with more cluster centers. (These ideas seems a little wacky but they seemed feasible at the time.) However, the more I thought about it, and the more I thought how real world data in not multi-uniform, the more the statistical properties of the k-means++ initialization made sense. This is the real point of this post.

I realized that any approach that attempted to distribute the cluster centers evenly across the span or convex hull of the data is going to be sensitive to outliers. We might start to place cluster centers across very low density or empty areas.

By choosing the data in proportion to the squared distances, we are not just choosing far away points but we are also choosing an area of feature space based on density. Thus, if we have 10 points close together roughly at distance D we are more likely to choose one of them than a single outlier point at distance 2D. This is precisely what we want in a good initialization step. In a sense, the simple procedure is statistically balancing distance among the cluster centers and density of data points, which could represent a cluster per se. Beautiful. As David & Vassilvitskii summarize in one of their presentations: "friends don't let their friends use k-means." Unfortunately, however, it does not seem to have the adoption that it deserves.

By the way, Vassilvitskii and colleagues went on to parallelize k-means++ for big data in a 2012 paper in a version called k-means||.

Sunday, November 25, 2012

Free Datascience books

I've been impressed in recent months by the number and quality of free datascience/machine learning books available online. I don't mean free as in some guy paid for a PDF version of an O'Reilly book and then posted it online for others to use/steal, but I mean genuine published books with a free online version sanctioned by the publisher. That is, "the publisher has graciously agreed to allow a full, free version of my book to be available on this site."
Here are a few in my collection:

Mining of Massive datasets by Rajamaran, Leskovic & Ullman
Bayesian Reasoning and Machine Learning by David Barber [website]
Information Theory, Inference, and Learning Algorithms by David J.C. Mackay
~~Foundations of Statistical Natural Language Processing by Manning & Schütze~~ only sample chapters

While we are on the subject, I would be remiss of me not to recommend D.J. Patil's free minibooks/essays. While they are not the thick comprehensive tomes of those above, they are definitely worth the time to read.

Data Jujitsu by D.J. Patil
Building Data Science Teams by D.J. Patil

Finally, this is work in progress (just 3 chapters to date) but is one to watch: Network Science by A.-L. Barabasi.

Update [12/27/12]: adding in some additions from a hacker news discussion and the comments below (thanks guys):

Introduction to Information Retrieval, by Manning, Raghavan and Schütze
A first encounter with machine learning by Welling
Gaussian processes for Machine Learning by C.E. Rasmussen
The Elements of Statistical Learning, by Hastie, Tibshirani, Friedman -- grandaddy of them all
Introduction to Machine Learning by Smola, Vishwanathan
Think Bayes by Downey

Saturday, November 24, 2012

GraphConnect 2012

On Nov 6th, I attended the GraphConnect conference at the San Francisco Hyatt Regency. This was the best one day conference I have attended for quite a while.

Although it was organized and sponsored by Neo4j, it was not a Neo4j trade show. Instead, there were a number of general talks about graph databases and social network analysis. (However, if you wanted Neo4j centric-talks, they were provided too.) Now that some of the slides and videos are up on the site, I can link to a few highlights:

The Power of the Social Graph by James Fowler. Author of Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives gave an amazing overview of social network analysis. What was so refreshing, however, is that he had conducted actual experiments, A/B tests and the like and not just post hoc analyses.
Data Modeling with Graphs – Peter Bell, Graph Database Evangelist by Peter Bell. Thinking in graphs in really quite different from thinking about relational databases and requires a different mindset for data modeling.
Graph Processing at Twitter by Pankaj Gupta. It is really quite humbling to think through the largest datasets I have worked with (single digit terabytes) and the magnitude of data that Twitter handles every single day. He gave a really interesting overview of the different data sources, uses, approaches and technologies used at Twitter.

It is a little unfair to single out just those three as I enjoyed the whole day. The food was great too and as I had attended DataWeek (which was a tiresome tradeshow), I got a discount code that got me into GraphConnect for only $100. No complaints here.

Learning from imbalanced data

I recently read an awesome review article: He & Garcia, 2009. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21 (9): 1263--1284. I knew that imbalanced data, data with "the presence of underrepresented data and severe class distribution skews", was problematic and knew a few ways to deal with it. However, I didn't realize quite the variety of approaches, the magnitude of active research effort or even that there were several conferences solely around this topic.

If you are building a classifier with underrepresented data, I highly recommend reading this article. It covers the problem per se, some different approaches, metrics and an extensive literature review.

Hello World!

Hi there. I'm Dr. Carl Anderson, a data scientist at One Kings Lane. I get to do the fun things such as predictive analytics, personalization, recommendations, image processing and more. These are my ramblings on data science, machine learning and stats, making it up as I go along.