Friday, December 14, 2012

On the unreasonable effectiveness of data: why are more data better?

Update 11/30/14: I incorrectly named the authors of the "unreasonable effectiveness" study; now corrected.

In the paper "The unreasonable effectiveness of data" [Halevy et al, 2009], Halevy, Norvig and Pererira, all from Google, argue that interesting things happen when corpora get to web scale:
"simple models and a lot of data trump more elaborate models based on less data".
In that paper, and in the more detailed tech talk given by Norvig, they demonstrate that when corpora reach hundreds of millions or trillions of training samples or words, very simple models with basic independence assumptions can outperform more complex models trained on less data, such as those based on carefully crafted ontologies. However, they provided relatively little explanation as to why more data are better. In this post, I want to think that through.

I propose that there are several classes of problems, and corresponding reasons why more data are better.

The first class comprises nearest neighbor type problems. Halevy et al. mention that "James Hays and Alexei A. Efros addressed the task of scene completion: removing an unwanted, unsightly automobile or ex-spouse from a photograph and filling in the background with pixels taken from a large corpus of other photos." [Hays and Efros, 2007]



Norvig presented the following schematic: 


and described it as a "data threshold" at which results went from really bad to really good.

I'm not convinced that there is any threshold or anything that resembles a phase transition. This seems to me to be a problem of finding the closest match. The more data, the closer the expected match.

Hays and Efros (2007) mention:
"Indeed, our initial experiments with the gist descriptor on a dataset of ten thousand images were very discouraging. However, increasing the image collection to two million yielded a qualitative leap in performance... Independently, Torralba et al. [2007] have observed a similar effect with a dataset of up to 70 million tiny (32x32) images...It takes a large amount of data for our method to succeed. We saw dramatic improvement when moving from ten thousand to two million images" [Hays and Efros, 2007]
There is a large difference in those corpus sizes and a "qualitative leap" is not the same as a threshold (sensu phase transition).

Simple statistical effects can cause metrics to change dramatically as data grow. For instance, consider a sample of size n from a standard normal distribution. How does the minimum of that sample vary with n? Let's create samples of different sizes and plot the minimum using the following R code:

# draw samples of increasing size from a standard normal and record each minimum
x <- seq(1, 7, 0.5)                                # log10 of the sample sizes
y <- vector(mode = "numeric", length = length(x))
for (i in 1:length(x)) { y[i] <- min(rnorm(10^x[i])) }
plot(x, y, xlab = "Sample size, n (log10 scale)",
     ylab = "Minimum value of sample", type = "b")


The minimum decreases log-linearly. This is a case of extrema from an unbounded tail. More relevant here, though, is that a minimization problem such as scene matching has a lower bound: for all intents and purposes, a perfect match. For instance, perhaps someone else stood in the exact same tourist spot and took a picture of the exact same scene but without the obstructing car.

I think this is what is happening in Norvig's schematic. At a certain corpus size we have found a really good match, and a larger corpus cannot improve the result much further.

In summary, for nearest neighbor type minimization problems with a non-negative distance function (meaning that the cost function has a lower bound of zero), that distance function will, on average, decrease monotonically with data or sample size. 
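To make this concrete, here is a quick simulation of my own (not from Norvig's talk or the Hays and Efros paper): the distance from a fixed "query" point to its nearest neighbor in a random two-dimensional sample shrinks toward its lower bound of zero as the sample grows.

query <- c(0.5, 0.5)                                  # the point we want to match
n <- 10^(1:6)                                         # increasing sample sizes
nearest <- sapply(n, function(k) {
  pts <- matrix(runif(2 * k), ncol = 2)               # k random points in the unit square
  min(sqrt((pts[, 1] - query[1])^2 + (pts[, 2] - query[2])^2))
})
plot(log10(n), nearest, type = "b",
     xlab = "Sample size (log10 scale)", ylab = "Distance to nearest neighbor")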

The second class comprises counting or relative-frequency problems. These were the primary focus of Halevy et al., and Norvig presented a few examples. In segmentation, the task is to split a string such as "cheapdealsandstuff.com" into the most likely sequence of words. These strings are short enough to enumerate the possible partitions by brute force, but for each partition we have to assess its likelihood. The simplest approach is to assume independence among words. Thus, if Pr(w) is the probability of a word w given some corpus, we can compute, say

Pr(che,apdeals,andstuff) = Pr(che) * Pr(apdeals) * Pr(andstuff).
...
Pr(cheap,deals,and,stuff) = Pr(cheap) * Pr(deals) * Pr(and) * Pr(stuff).

One can of course also use n-grams, e.g. for bigrams: Pr("cheap deals") * Pr("and stuff")
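Here is a rough sketch of the unigram version of that scoring (my own toy example, not Norvig's code; the probabilities are made-up relative frequencies of the kind one would estimate from a large corpus):

unigram_probs <- c(cheap = 2e-4, deals = 1e-4, and = 2e-2, stuff = 5e-5,
                   che = 1e-7, apdeals = 1e-9, andstuff = 1e-9)
score <- function(words) sum(log(unigram_probs[words]))  # log-probabilities avoid underflow
score(c("cheap", "deals", "and", "stuff"))   # plausible split scores much higher...
score(c("che", "apdeals", "andstuff"))       # ...than the implausible one

A real segmenter would enumerate every partition of the string and return the highest-scoring one; the point of Halevy et al. is that with a large enough corpus even this crude independence assumption works well.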

A second example that Norvig covered was spell checking. Here we can take a misspelled word and compute the likelihood of the possible variants to suggest the most likely form. 
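A minimal sketch of that idea (my own illustration, not Norvig's spell corrector; the counts are hypothetical): generate candidate words within a small edit distance of the misspelling and rank them by corpus frequency.

corpus_counts <- c(deals = 5200, dealt = 900, heals = 310, seals = 450)  # toy corpus counts
misspelling <- "dealz"
d <- adist(misspelling, names(corpus_counts))             # Levenshtein edit distances (base R)
candidates <- corpus_counts[d <= 1]                       # words within one edit
sort(candidates / sum(corpus_counts), decreasing = TRUE)  # suggest the most frequent: "deals"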

In both cases, we need a corpus that includes both common and uncommon phrases and we need counts of the occurrences of those phrases to compute relative frequency. The larger and more comprehensive the corpus, the better. I think that there are two statistical effects going on here:
  • the larger the corpus, the better the quality of the estimate of the relative frequency. This is the law of large numbers.
  • the larger the corpus, the more likely it is to include unusual phrases, i.e. the long tail. This is an unbounded effect: no matter how much of the web is indexed, there will always be new, never-seen-before phrases. The problem is exacerbated by the fact that the distribution of words in the English language is a power law (e.g. Zipf, G. The Psycho-Biology of Language. Houghton Mifflin, Boston, MA, 1935). This means that the tail is especially long, and thus especially large samples are needed to capture those rare phrases (a small simulation of this effect follows below).
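The simulation below (my own rough illustration, with a made-up vocabulary size) draws words from a Zipf-like distribution and shows how slowly the observed share of the vocabulary grows with corpus size.

vocab <- 1e5                                    # hypothetical vocabulary size
p <- (1 / (1:vocab)) / sum(1 / (1:vocab))       # Zipf: Pr(rank r) proportional to 1/r
sizes <- 10^(3:7)                               # corpus sizes in words
coverage <- sapply(sizes, function(n)
  length(unique(sample.int(vocab, n, replace = TRUE, prob = p))) / vocab)
plot(log10(sizes), coverage, type = "b",
     xlab = "Corpus size (log10 scale)", ylab = "Fraction of vocabulary observed")
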
A third class comprises problems of estimating univariate distributions. I recently encountered an interesting example while attending a tech talk by Peter Skomoroch of LinkedIn. He showed a plot of the likelihood of a member having particular software-related job titles versus the number of months since graduation. What we see from the data is that the distributions for "Sr Software engineer" and "senior software engineer" are almost identical, as one would expect since they are synonyms, as are those for "CTO" and "chief technology officer". This presents an interesting way of identifying synonyms and so deduping the data, rather than maintaining a huge master list of acronyms and abbreviations. It is only possible because of the scale of data they have, which makes the estimated distributions reliable and presumably close to the true underlying population distributions.
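A toy sketch of the idea (entirely hypothetical numbers, not LinkedIn's data): if two titles are synonyms, their distributions over months since graduation should be nearly identical, so a small distance between the estimated distributions flags them as candidates for merging.

months <- seq(0, 240, by = 24)                                        # months since graduation
normalize <- function(x) x / sum(x)
p_senior_swe <- normalize(dgamma(months, shape = 6.0, scale = 20.0))  # "senior software engineer"
p_sr_swe     <- normalize(dgamma(months, shape = 6.2, scale = 19.5))  # "Sr Software engineer"
p_new_grad   <- normalize(dgamma(months, shape = 1.5, scale = 10.0))  # a clearly different title
tv <- function(p, q) 0.5 * sum(abs(p - q))                            # total-variation distance
tv(p_senior_swe, p_sr_swe)    # small: likely synonyms
tv(p_senior_swe, p_new_grad)  # large: different roles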


A fourth class comprises general multivariate or correlational problems, in which we are trying to estimate the relationship among variables. This could be estimating the relationship y = f(x) or perhaps estimating the joint pdf of many variables. We might use this for word sense disambiguation (e.g. is the document referring to pike the fish or pike the pointed weapon?) or to build up a dossier of associated features or concepts about an entity (e.g. a company has an associated CEO, head office, tax ID and so on). Here we are interested in the correlations among words or phrases. The problem is that web documents are very high dimensional, and in high-dimensional problems like these we are under the curse of dimensionality: data become very sparse. Thus, one effect of larger samples is to increase the density of data across the state space. Again, with larger samples we can estimate location metrics (the mean, median and other measures of the center of a distribution) more accurately. We can also estimate joint pdfs more accurately. Below is a simple example from the following code (mvrnorm comes from the MASS package):

library(MASS)                       # for mvrnorm
par(mfrow = c(1, 2))                # two panels side by side
plot(mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(1, .9, .9, 1), 2)),
     xlab = "X", ylab = "Y", ylim = c(-4, 4))
title("n = 100")
plot(mvrnorm(10000, mu = c(0, 0), Sigma = matrix(c(1, .9, .9, 1), 2)),
     xlab = "X", ylab = "Y", ylim = c(-4, 4))
title("n = 10000")



At left is a small sample. It could easily be interpreted as linear. At right, with a larger sample, the true underlying bivariate normal is more obvious. Clearly, this is a trivial example. The point is that for higher dimensions you will need much larger sample sizes to estimate the joint pdfs well.
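To illustrate why the required sample size grows so quickly with dimension, here is a small sketch of my own: with n fixed, the fraction of uniform points falling within a fixed distance of the center of the unit hypercube collapses as the dimension grows, leaving local neighborhoods nearly empty.

n <- 1e4
for (d in c(1, 2, 5, 10)) {
  pts <- matrix(runif(n * d), ncol = d)              # n uniform points in the d-dimensional unit cube
  frac <- mean(sqrt(rowSums((pts - 0.5)^2)) < 0.5)   # share within radius 0.5 of the center
  cat(sprintf("d = %2d: fraction near the center = %.4f\n", d, frac))
}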

This is clearly a cursory answer as to why more data are better. Quality data are still preferred. However, for many organizations, such as Google, Twitter, LinkedIn and Facebook, where content is user-generated, is often free-form text and/or covers many domains (and so deep cleaning and the use of ontologies are infeasible), having very large datasets can compensate for the noise. It all evens out in the end, and in the case of nearest neighbor problems the solution will only get better.

