p-value.info: 2013

Sunday, November 10, 2013

The data scene in NYC

I love the NYC data scene. It is vibrant, healthy and welcoming.

Eight months ago I left San Francisco and moved to New York. I had come very close to taking an offer in DC and had been struggling to weigh the pros and cons of DC versus NYC. I had previously lived in DC so knew the area, Virginia schools are very good and the cost of living is significantly lower than in the New York area, all important considerations when you have a young family. Warby Parker (my current employer) had set me up to have a chat with DJ Patil who at the time was data scientist in residence at Greylock Partners. We had an honest and open discussion of these issues and he asked, "That is a great offer but what happens if it doesn't work out? New York has a rich and vibrant data scene. It is going to be stimulating and there are lots of opportunities." He was right of course.

DC has changed significantly since I left in 2007. There are many more startups and networks for entrepreneurs and Harlan Harris and co have done an amazing job bringing together the data community under the umbrella of Data Community DC. The reality, however, is that there are relatively few opportunities. The two big tech companies, AOL and Living Social, are both a mess. There are many other data-related positions, if you have security clearance, and the majority of other positions tend to be with small consulting firms that service the government and Department of Defense.

Contrast this with New York:

New York is the place to be for advertising and media.
New York is the place to be for fashion.
New York is the place to be for finance.

The tech scene is rich too and while I know that this is a gross oversimplification there is somehow something very tangible about the startups here. They are more likely to sell real goods and services, physical goods from Etsy, Rent the Runway or Birchbox. In other words, eCommerce is thriving. In addition, under Bloomberg's initiatives, the city is investing heavily in data science and statistics per se with new institutes and a campus on Roosevelt Island.

At the O'Reilly Strata + Hadoop World conference last week there was an interesting panel "New York City: a Data Science Mecca." On the panel were Yann LeCun (NYU), Chris Wiggins (HackNY/Columbia) and Deborah Estrin (Cornell NYC Tech). Yann LeCun is the Director of the newly opened Center for Data Science, a multi- and inter-disciplinary research institute that plans to churn out 50 data science Masters students per year as well as host a PhD program, all of which will have strong ties to the local tech scene. Similarly, Columbia's new Institute for Data Science and Engineering will be hiring 30 new faculty, setting up shop in a new 44k square foot building and running an industrial affiliate program. Finally, Cornell will be moving to Roosevelt Island in 2015 with a broader program than just data science that covers computer science and operations research, all skills that feature in the data science world. The panel made the point that New York is such a great place to be for data because of the density of the ecosystem. On a tiny island with great public transport you have a huge conglomeration of finance, media and advertising companies, other organizations such as Mt Sinai (who recently hired Jeff Hammerbacher---the very person who coined the term "data scientist" with DJ Patil), a suite of world class universities who are investing in faculty and buildings (highly significant given that land is precious) and have both research and training foci, and finally the city itself. Yann also made the point that the density of jobs and other organizations here makes it easier to attract part-time students.

The data-related Meetup scene is very strong too. (Meetup is based in NYC.) You could go to a packed and interesting data- or data-tech-related meetup almost every night. DataKind is based here too. One of the most prominent data scientists, Hilary Mason (now data scientist in residence at Accel Partners), is based here. Strata just took place last week. During Fashion Week, 596 people attended "Fashion Tech: Demos & Drinks", which showcased the local fashion-related tech companies. I couldn't attend. Why? Because I was attending a DataGotham event, another important data conference that brings the data community together. Later this month is PyData. You get the idea.

NYC is a fruitful mix of data-research, data practitioners and a strong data community. I am happy to be both here and at Warby Parker. Oh, and the team that I would have joined in DC recently imploded. I dodged a bullet. Thanks DJ.

Sunday, September 8, 2013

Matching misspelled brand names -- the easy way

On Friday, Warby Parker's Director of Consumer Insights approached me with some data. They had sent out a survey, one question of which was "list up to five eyeglass brands you are familiar with." She wanted to aggregate the data but, being free text, the answers were riddled with misspellings. Resolving the variants manually would take a lot of time and effort. She asked if there was a better way to standardize them.

With a question like this, your first reaction might be to think of regular expressions. While there are common forms of misspelling such as transposed characters (ie <--> ei) and (un)doubled consonants (Cincinnati, Mississippi), this is not a tenable approach. First, you are not going to capture all of the variants, common and uncommon. Second, brand names are not necessarily dictionary words and so may not follow normal spelling rules.

If we had a set of target brands, we might be able to use edit distance to associate some variants but "d+g" is very far away from "Dolce & Gabbana". That won't work. Besides, we don't have a limited list of brands. This is an open-ended question and gives rise to open-ended results. For instance, Dale Earnhardt Jr has a line of glasses (who knew?) and appeared in our results.
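To make the edit-distance point concrete, here is a minimal sketch of Levenshtein distance in plain Python. The function name edit_distance is my own for illustration; it is not code from the original analysis.

```python
def edit_distance(s, t):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn s into t."""
    # previous[j] is the distance between the characters of s processed
    # so far (excluding the current one) and the first j characters of t
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(edit_distance("tommy hilfigar", "tommy hilfiger"))  # 1
print(edit_distance("d+g", "dolce & gabbana"))            # 13
```

A one-character typo sits at distance 1 from its target, but the abbreviation is 13 edits away from a 15-character name, so no sensible distance threshold would link the two.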

To get a sense of the problem, here are the variants of just Tommy Hilfiger in our dataset:

tommy helfinger
tommy hf
tommy hildfigers
tommy hilfger
tommy hilfieger
tommy hilfigar
tommy hilfiger
tommy hilfigger
tommy hilfigher
tommy hilfigur
tommy hilfigure
tommy hilfiinger
tommy hilfinger
tommy hilifiger
tommy hillfiger
tommy hillfigger
tommy hillfigur
tommy hillfigure
tommy hillfinger

Even if it kind of worked and you could resolve the easy cases, leaving the remainder to resolve manually, it still is not desirable. We want to run this kind of survey at regular intervals and I'm lazy: I want to write code for the whole problem once and rerun multiple times later. Set it and forget it.

This kind of problem is something that search engines contend with all the time. So, what would Google do? They have a sophisticated set of algorithms which associate document content and especially link text with target websites. They also have a ton of data and can reach out along the long tail of these variants. For my purposes, however, it doesn't matter how they do it but whether I can piggyback off their efforts.

Here was my solution to the problem:

If I type "Diane von Burstenburg" into Google, it returns,

Showing results for Diane von Furstenberg

and the top result is for dvf.com. This is precisely the behavior I want. It will map all of those Tommy Hilfiger variants to the same website.

We now have a reasonable approach. What about implementation? Google has really locked down API access. Their old JSON API is deprecated; it is still available but limited to 100 queries per day. (I used pygoogle to query it easily with

>>> from pygoogle import pygoogle
>>> g = pygoogle('ray ban')
>>> g.pages = 1
>>> g.get_urls()[0]
u'http://www.ray-ban.com/'

but was shut down by Google within minutes). Even if you wget their results pages, they don't contain the search results as these are all ajaxed in. I didn't want to pay for API access to their results for a small one-off project so went looking elsewhere. DuckDuckGo has a nice JSON API but its results are limited. I didn't feel like parsing Bing's results page. Yahoo (+ BeautifulSoup + urllib) saves the day!

The following works well, albeit slowly due to my rate limiting (sleep for 15 seconds):

from bs4 import BeautifulSoup
import urllib
import time
import urlparse

f_in = open("terms.txt", "r")          # list of terms, one per line
f_out = open("output_terms.tsv", "w")  # term <tab> top link <tab> domain
for line in f_in:
    term = line.strip()
    try:
        print term
        # search for the term as an exact phrase
        response = urllib.urlopen("http://search.yahoo.com/search?p=" + "\"" + term + "\"")
        soup = BeautifulSoup(response)
        # first result link; find() returns None when it is missing,
        # so the subscript raises TypeError
        link = soup.find("a", {"id": "link-1"})['href']
        parsed_uri = urlparse.urlparse(link)
        domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        f_out.write(term + "\t" + link + "\t" + domain + "\n")
        time.sleep(15)  # rate limiting
    except TypeError:
        print "ERROR with " + term

f_in.close()

f_out.close()

where terms.txt was the set of unique terms after I lowercased each term and removed hyphens, ampersands and " and ".

This code outputs rows such as:

under amour http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armer http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armor http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armour http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
underarmor http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/

It is not perfect by any means. Those Tommy Hilfiger variants result in a mix of tommy.com and tommyhilfiger.com, which is Hilfiger's fault for having confusing / poor SEO. More importantly, about 10% of the results map to wikipedia:

karl lagerfeld http://en.wikipedia.org/wiki/Karl_Lagerfeld

For these, I did

cat output_terms.tsv | grep wikipedia | awk -F '\t' '{print $1}' > wikipediaterms.txt

and reran these through my code using this query instead:

f = urllib.urlopen("http://search.yahoo.com/search?p=" + "\"" + term +"\"" + "-wikipedia")

This worked well and Lagerfeld now maps to

karl lagerfeld http://www.karl.com/ http://www.karl.com/

(There are of course still errors: Catherine Deneuve maps to http://www.imdb.com/name/nm0000366/ http://www.imdb.com/, a perfectly reasonable response. I tried queries with '"searchterm" +glasses' for greater context but the overall results were not great. With that I got lots of ebay links appearing.)

Now I have a hands-free process that seems to capture most of the variants and has trouble mostly only on low-frequency, seemingly genuinely ambiguous cases. This can easily be run and rerun for future surveys. Laziness for the win! I don't even care if it fails to swap out very uncommon variants because in this case we don't need perfect data. We will aggregate the websites and filter out anything with frequency less than, say, X, so we don't need to worry about the odd terms this process got wrong. In other words, we care most about the first few bars of our ordered histogram.
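The aggregate-and-filter step could be sketched as follows. This is an illustration only: it assumes the term-to-domain mapping has already been joined back onto the raw survey answers so that there is one row per answer, and top_domains and min_count (standing in for X) are hypothetical names. The sample rows are made up from domains that appear in this post.

```python
from collections import Counter


def top_domains(rows, min_count):
    """Count answers per resolved domain and drop the rare ones.

    rows: iterable of (term, link, domain) tuples, one per survey answer.
    Returns (domain, count) pairs, most frequent first.
    """
    counts = Counter(domain for _term, _link, domain in rows)
    return [(d, n) for d, n in counts.most_common() if n >= min_count]


# Made-up sample rows in the same three-column layout as output_terms.tsv
rows = [
    ("under amour", "http://www.underarmour.com/shop/us/en/", "http://www.underarmour.com/"),
    ("under armor", "http://www.underarmour.com/shop/us/en/", "http://www.underarmour.com/"),
    ("underarmor", "http://www.underarmour.com/shop/us/en/", "http://www.underarmour.com/"),
    ("ray ban", "http://www.ray-ban.com/", "http://www.ray-ban.com/"),
    ("ray ban", "http://www.ray-ban.com/", "http://www.ray-ban.com/"),
    ("catherine deneuve", "http://www.imdb.com/name/nm0000366/", "http://www.imdb.com/"),
]

print(top_domains(rows, min_count=2))
# [('http://www.underarmour.com/', 3), ('http://www.ray-ban.com/', 2)]
```

The single imdb.com row falls below the threshold and drops out, which is exactly the kind of low-frequency miss we can afford to ignore.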

Data scientists need to be good at lateral thinking. One skill is not to focus too much on the perfect algorithmic solution to a problem but, when possible, find a quick, cheap and dirty solution that gets you what you want. If that means a simple hack to piggyback off a huge team of search engine engineers and their enormous corpus, so much the better.

Thursday, May 2, 2013

Leading Indicators: a response

In an interesting thought experiment, Mike Loukides and Q Ethan McCallum asked the question

"If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it?"

I certainly have my list of what I think are great companies and organizations for data science. Quora users have their list too but how do we know if we don't have first-hand knowledge of working at all these places?

The authors provide two examples to motivate the discussion. The first I consider a kind of negative evidence: in a hotel, if they can't get a club sandwich right (the basics) they are certainly not going to get the details (such as great concierge service) right. That is, a club sandwich is a good indicator or proxy of the hotel's overall quality and level of service. The second example I consider more as positive evidence. If a school has a great music program then it almost certainly excels in other areas too. Again, the music program is a good proxy for the school's overall program.

With this framework, they reframe the overall question:

"What are the proxies that allow you to evaluate a data science program from the “outside,” on the information that you might be able to cull from company blogs, a job interview, or even a job posting?"

They then list out 7 ideas. However, I am not convinced that many of them are evaluable as an outsider---excepting asking the questions explicitly during a job interview. Let's review them one by one.

Loukides & McCallum #1: Are the data scientists simply human search engines, or do they have real projects that allow them to explore and be curious? If they have management support for learning what can be learned from the organization’s data, and if management listens to what they discover, they’re accomplishing something significant. If they’re just playing Q&A with the company data, finding answers to specific questions without providing any insight, they’re not really a data science group.

How would I determine that as an outsider? Those projects would have to be externally visible, on a website, on a blog or published. Or, if the project refers solely to code, it would have to be open source. However, any write up is unlikely to contain information about the managerial component.

Loukides & McCallum #2: Do the data scientists live in a silo, or are they connected with the rest of the company? In Building Data Science Teams, DJ Patil wrote about the value of seating data scientists with designers, marketers, with the entire product group so that they don’t do their work in isolation, and can bring their insights to bear on all aspects of the company.

This is how data scientists should be incorporated into an organization but unless a job description explicitly says "you will be reporting to the head of {marketing,operations,customer insight,...}", you will be hard-pressed to know as an outsider.

Loukides & McCallum #3: When the data scientists do a study, is the outcome predetermined by management? Is it OK to say “we don’t have an answer” or to come up with a solution that management doesn’t like? Granted, you aren’t likely to be able to answer this question without insider information.

Self-explanatory

Loukides & McCallum #4: What do job postings look like? Does the company have a mission and know what it’s looking for, or are they asking for someone with a huge collection of skills, hoping that they will come in useful? That’s a sign of data science cargo culting.

This is certainly a valid point and a primary indicator. A buzzword-filled job ad with little focus may well indicate a company jumping on the bandwagon. Conversely, a larger company with a larger, more well-established team is more likely to have more specialized job descriptions. [examples]

Loukides & McCallum #5: Does management know what their tools are for, or have they just installed Hadoop because it’s what the management magazines tell them to do? Can managers talk intelligently to data scientists?

How would I determine that as an outsider? You might be able to question that during a job interview though.

Loukides & McCallum #6: What sort of documentation does the group produce for its projects? Like a club sandwich, it’s easy to shortchange documentation.

If the project is open source, then yes, good documentation is a great indicator.

Loukides & McCallum #7: Is the business built around the data? Or is the data science team an add-on to an existing company? A data science group can be integrated into an older company, but you have to ask a lot more questions; you have to worry a lot more about silos and management relations than you do in a company that is built around data from the start.

This is the sort of information that you may be able to glean as an interviewee.

I had a few other ideas:

Tech talks: if the company is hosting tech talks and attracting well-known and respected speakers from other high quality organizations, this is good positive evidence.
Team size: how many other data scientists work there? You can likely find this out from a LinkedIn search. If you are young, starting out, you might prefer a larger team with more potential mentors, better tools and so on. Those with more experience might prefer to blaze a trail.
Existing team members: who are they and what is their pedigree? You can check their LinkedIn profiles or personal websites but there are other ways. For instance, LinkedIn has some great data scientists. How do I know they are good? They tweet relevant information, they write thoughtful posts. They speak at conferences. Their team is highly active in the general data science scene. All this visibility---provided the content is meaningful---is good evidence.
Publications: academic journal publications may or may not be a good indicator. There is typically a big gulf between academic systems and toy problems and the mess and noise of real world systems and data. An algorithm may work great on the static data set that the grad student has worked with for 3 years but it might not scale, might require far too much parameter tuning in the real world. There are many exceptions of course. It really depends on the system.
Patents: patents coming out of an organization may or may not be a good indicator. They are essentially stale data as patents reveal the degree of innovation at the company two or more years ago (given the time it takes to process and award). A strong patent culture might mean that the IP is locked down so that you may not be able to discuss systems in development at conferences, publish work, open source the code etc.
Internship program: if the company has a strong internship program, attracting top young talent from top schools, and those interns go on to do good things as data scientists, this is very good evidence.

Wednesday, April 10, 2013

How do you create a data-driven organization?

Something that I've been thinking a lot about recently is how to create a data-driven organization. A lot of companies pay lip service to the idea but when it comes down to making decisions they end up being made by those that are more senior (by HiPPO: highest paid person's opinion) or, worse, loudest, based on gut instinct, experience or opinion. What does it take to create a company that makes evidence-based decisions and involves a broad swathe of employees vested in data capture, metric design and analysis?

I have recently taken on this challenge. A few weeks ago, I switched coasts and companies to head up the data team at Warby Parker in New York. This is a company that has been very successful to date and has grown very rapidly. So fast in fact that it has had little time, bandwidth or experience to put in place a centralized, standardized and scalable data infrastructure with professional business intelligence tools. Analysts are working at the limits of Excel and have great difficulties tying together different data sources across the company. What it does have, however, is a strong desire to change and is willing to provide resources and support to establish new data capture, analysis and reporting systems---and promoting the appropriate culture---that will take the company to the next level.

In this post, I wanted to set out what I've been thinking and how I've started to go about it. It is absolutely a work in progress. I cannot guarantee that the following is the right approach or that it will go exactly as planned. Thus, I would love to hear how others have fared and what comments and suggestions you might have.

UNDERSTAND THE BUSINESS AND INTERNAL CUSTOMERS

Listen to people. Chat with the different heads of departments, analysts and other data stakeholders and listen to what they do, how they do it, the data that they work with and what they would like to do. Ask them how the data team can help. Identify tasks that are highly manual, repetitive and could easily be automated. Identify different data sources, pain points and aspirations. Asking about what they would like to do but cannot is as important as asking what they currently do.

Identify easy wins and build trust. While the rule is always under promise and over deliver, it is always good to identify easy wins that provide some immediate benefit and set up good will and trust. We were able to identify and implement a simple javascript/HTML tool that will save one of our teams at least 100 hours/year. While it was not strictly a data project, the cost to us was just 3 hours and that team now loves our data team and will likely be more accepting of interruptions to their work flow as we implement changes.

TRAINING AND SKILL SETS

Identify workers with skills but not tools. One of our staff knows SQL well but has not had access to databases in his current position. Thus, he is relegated to working with Excel spreadsheets. Try to get those people the tools they already know how to use well and will use. There has to be some level of control here---you don't want too much tool / language proliferation and fragmentation---but if these will become core team skills or tools anyway, get them to those people early.

Identify staff hungry to learn. Identify any particular staff that are hungry for new tools and to learn new skills. These may be staff who are already taking statistics or data science classes outside work. Mentor them, work to get them tools they will make good use of. Send them to training. These people, as well as being more productive and happier, will become your prime advocates. They will be willing to mentor others and share their experience and skills.

Train and mentor. At a broader level, if all your analysts are using Excel, train them to also use SQL, R, python or some other skill to take them to the next level, some skill that will allow them to produce more detailed, insightful, automated analyses. Start with a small, motivated group and let them set a great example for others. Statistics is not only a set of tools for analysis but it also provides a framework for critical thinking, in this case about evidence. At Warby Parker, we are planning to send a reasonably large cohort of staff to statistics training this year. With great free online courses available now, this represents a relatively low cost to the company, other than employee time, but we expect that having a significant fraction of the company thinking more critically, numerically and objectively will have a profound effect on culture and decision-making.

Carefully choose the right tools. Clearly, if you are introducing a new tool to a team or organization, make sure that it is the right one. It should perform the tasks you need, ideally with an easy-to-use interface but also with power-user functionality, be well documented and supported and in an ideal world be open source.

DATA INFRASTRUCTURE

It goes without saying that you need a robust, scalable data infrastructure.

Centralize data where possible. This is very company-scale dependent but try to create a data infrastructure to bring together all the different data sources where possible to allow a holistic view of your customers and business. For instance, make it easy to tie ad strategy to clickstream to sales to social etc. A particular solution may not scale indefinitely as the business and data grow. For instance, you may need to switch from a single MySQL instance to a hadoop-based solution in time but scaling issues are always good problems to have.

Create an open data warehouse. Create a data warehouse with broad access and easy to use tables. For instance, there may be some key concepts that are frequently used for analysis that require highly complex joins across multiple tables. For those, denormalize the tables to allow easy querying (as well as other housing benefits). There will be some data that are sensitive---credit card transactions, any medical or HIPAA compliant data etc---but favor openness wherever possible.

Automate where possible. If I think that I will need to do a task two or more times, I will attempt to automate it if possible. Whenever you think something is a one off, it almost certainly is not. By automating processes, you will free up future analyst time to focus on, you know, analysis.

Focus on the team's ROI. Like everyone else, a data team has limited time and resources. Focus on return on investment. Implementing two "good enough" solutions for problems in a week may pay higher dividends than one "almost perfect" solution for one problem. Beware of diminishing returns.

Suck down data now. Some data are potentially valuable or useful but ephemeral. For instance, Instagram photos are only available for a week or so before they disappear forever. Suck them down now as you never know when or for what you might need them later.

METRICS AND DASHBOARDS

The goal of the above strategies is to capture the data and make it accessible. Now comes the fun part: analysis and reporting.

Design metrics carefully. They should be unbiased, deterministic and should reflect true measurable variables. They should be readily interpretable. They should reflect the business. Design or identify metrics that make the company tick. Think carefully about units. If you end up trying to compare apples and oranges, is there some common currency, such as dollars, that they can be converted to? For instance, if you improve operations and can ship product to customers one day faster, what is that worth? Can you assign a dollar cost to ship time per customer, per day or per order?

Remove redundancy. Dashboards should be information rich. Like building a statistical model, if you have two very highly correlated metrics you can consider one to be redundant and you may do better to remove it and increase the information density of the remaining metrics.

Tailor to your audience. In some cases, it may make sense to have multiple reports with different levels of details for different audiences. For instance, a manager may have a highly detailed report about their team and responsibilities, a higher level report goes out to her team and the C-level execs get the 50,000ft view, which if you choose the right metrics, will still be useful and relevant.

Pander to the C-level. In terms of driving a data-driven culture, impressing C-level execs with dashboards and reports that deliver huge value (and are not just eye candy) will almost certainly produce a trickle-down effect. They will expect such reports and will provide resources to make such reports commonplace. Create dashboards so relevant that C-level execs watch them like a hawk.

Identify metrics that map to the organization's core values. One of Warby Parker's core values is to deliver insanely great customer service. Thus, there are metrics that relate to that, net promoter score being one. Those metrics should be highly visible across the organization: in top level dashboards, on screens, on reports that get emailed out.

Conversely, take out distracting metrics. One of Marissa Mayer's first actions when she took over Yahoo! was to take the share price off their internal home page. It is her job to worry about that but the rest of the org had been focusing on actions that tried to drive up the share price (unsuccessfully) and they had almost forgotten about the users and the value that Yahoo! delivered or should be delivering to them.

Where possible, tie those key indicators to the other metrics that fundamentally drive them. For instance, if the major component of customer dissatisfaction is late shipping, then promote that shipping metric to a top level dashboard.

Let the data speak. In some cases, a machine learning or unsupervised learning approach may bring some surprising insights. For instance, many companies segment their customers using some set of subjective a priori criteria. Running unsupervised clustering may support those segment choices but it also may provide some interesting insight into new types of groups you might never have expected. Be open to findings that challenge your intuition or understanding of your business, market and customers. Be objective: if an A/B test shows a higher average order value but the results are not statistically significant, accept that they are not significant. Never fish for significant results.
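As one way of checking whether a difference in average order value is statistically significant, here is a minimal sketch of a two-sided permutation test. The function name, the number of permutations and the sample order values are all assumptions for illustration; this is not a description of how any particular company runs its tests.

```python
import random


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Repeatedly shuffles the pooled values into two groups of the
    original sizes and counts how often the shuffled difference in
    means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Hypothetical order values (in dollars) for variants A and B
orders_a = [95, 110, 95, 190, 95, 120, 95, 145]
orders_b = [95, 130, 95, 210, 110, 95, 150, 95]
p = permutation_p_value(orders_a, orders_b)
print("p-value: %.3f" % p)
```

If the p-value is above the threshold chosen before the test started, the honest conclusion is "no detectable difference", however appealing the higher average looks.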

Let the interns speak. A data-driven organization should give the data a voice from wherever that derives. Thus, a new intern who has been analyzing data with a fresh perspective should be given as much voice and respect as a senior manager. Data are the senior partners here. Give people a voice, forum and opportunity to provide data-backed evidence.

Share the data and findings widely. A data-driven organization should share data and reports widely. This is not to say that they should be blasted to everyone as spam but those that are interested should have access. (Remember that interesting insights and alternative points of view could come from anywhere in the company.) Owners and higher managers should be open to questions, to alternative evidence, and to implement change based on that evidence.

Those are my initial thoughts. I will report back later in the year, perhaps at DataGotham in New York, about what worked and what did not. Again, if you have any suggestions or feedback, I would love to hear from you.

Tuesday, February 19, 2013

When tf*idf and cosine similarity fail

In this post I cover 2 edge cases of cosine similarity with tf*idf weights that fail, i.e. that don't provide the cosine similarity values that intuition and common sense say they should return.

In information retrieval, tf*idf forms the basis of scoring documents for relevance when querying a corpus, as in a search engine. It is the product of two terms: term frequency and inverse document frequency.

Term frequency is the frequency of some term in the document, typically an absolute count or relative frequency. Documents with more mentions of a term are more likely to be relevant with respect to that term. For example, when querying for "dog," a document about caring for your dog which mentions "dog" 46 times is more likely to be relevant than a document with a single mention of "the dog days of summer."

Inverse document frequency (IDF) measures the dispersion of that term across the corpus. If every document contains "the," then "the" is not a particularly discriminating word. IDF is the ratio of the corpus size to the number of documents containing that term. The smaller the proportion of documents containing that term, the higher the magnitude of this metric. (In reality, we take the log of the ratio. That is, idf = log(N/n_i), where N is the number of documents in the corpus and n_i is the number of documents containing term i.)

These two measures quantify the frequency within a document and the relative rarity across the corpus. Taking the product we arrive at a simple, satisfyingly intuitive but surprisingly powerful metric to score documents. For each term t in each document d in some corpus D we can compute the tf*idf score. Let's call this tfidf (t,d).
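Here is a minimal sketch of the computation, using a raw count for term frequency and idf = log(N/n_i) as defined above. The function names and the toy corpus are my own for illustration, and tfidf takes the corpus as an extra argument so that the example is self-contained.

```python
import math
from collections import Counter


def idf(term, corpus):
    """idf = log(N / n_i): N documents in the corpus, n_i of them contain term.

    Raises ZeroDivisionError if no document contains the term.
    """
    n_i = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_i)


def tfidf(term, doc, corpus):
    """Raw count of term in doc multiplied by its idf across the corpus."""
    return Counter(doc)[term] * idf(term, corpus)


# Toy corpus: each document is a list of tokens
corpus = [
    "the dog chased the ball".split(),
    "the cat sat".split(),
    "the dog days of summer".split(),
    "the bird sang".split(),
]

print(tfidf("the", corpus[0], corpus))  # 0.0: "the" is in every document
print(tfidf("dog", corpus[0], corpus))  # log(4/2), about 0.693
```

Note that "the" scores zero even though it occurs twice in the first document, because a term that appears in every document has idf = log(1) = 0.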

We rarely query a corpus for a single term. Instead, we have a query q consisting of multiple terms. Now we want to compute the similarity between the query q and each document d in the corpus. For this, we tend to use something called cosine similarity.

This is the cosine of the angle between two vectors:

similarity

= cos(a,b)

= dotproduct(a,b) / ( norm(a) * norm(b) )

= a.b / ( ||a|| * ||b|| )

[Definition: if a = (a1,a2,...,an) and b = (b1,b2,...,bn) then a.b = a1*b1 + a2*b2 + ... + an*bn

and ||a|| = sqrt(a1^2 + a2^2 + ... + an^2) and ||b|| = sqrt(b1^2 + b2^2 + ... + bn^2). ]

The smaller the angle, the more similar are the two vectors.

In this case, the dimensions of a and b are the set of unique terms in q and d. For example, when q = "big red balloon" and d = "small green balloon" then the dimensions are (big,red,balloon,small,green) and a = (1,1,1,0,0) and b = (0,0,1,1,1).
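As a quick check of the balloon example, here is a minimal Python sketch of the formula above using presence/absence vectors. The function name cosine_similarity is my own choice for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Dimensions: (big, red, balloon, small, green)
a = (1, 1, 1, 0, 0)  # q = "big red balloon"
b = (0, 0, 1, 1, 1)  # d = "small green balloon"

similarity = cosine_similarity(a, b)
print(round(similarity, 4))  # dot product 1, both norms sqrt(3), so 1/3
```

Only "balloon" is shared, so the dot product is 1 and each norm is sqrt(3), giving a similarity of 1/3.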

Not all words are created equal. Some are more important than others when computing similarity. Rather than use the count or the presence/absence of each term, we can use a weight. For example, we can give a lower weight to common words. What would make a suitable weighting? tf*idf of course. Putting this all together,

similarity(q,d) = a.b / ( ||a|| * ||b|| )

where

a = ( tfidf("big",q), tfidf("red",q), tfidf("balloon",q), tfidf("small",q), tfidf("green",q) )

and

b = ( tfidf("big",d), tfidf("red",d), tfidf("balloon",d), tfidf("small",d), tfidf("green",d) )

While cosine similarity with tf*idf works well, really well, there are a couple of edge cases where it fails, corner cases that don't seem to be covered in most introductory explanations and tutorials.

FAIL 1: imagine you have a corpus D consisting of one document d. You come along with a query q where q == d. That is, the corpus has exactly what you are looking for. Intuition says that the cosine similarity should be 1 because q == d. So, what do we get? With raw counts the two vectors are identical and the cosine similarity is indeed 1, but not when you use tf*idf weights. The tf*idf of each term of d will be zero because each term of d is in all documents (D==d), so idf = log(1/1) = 0. Therefore, the dot product is zero but the norms of the two vectors are also zero, which generates a division by zero error. In summary, similarity is 0/0 and so undefined.

FAIL 2: imagine you have a corpus with two documents, d_1 = "blue bag" and d_2 = "green bag". What is their similarity? Intuition says there are some similarities between them, they both contain "bag," but there are some differences: "blue" vs "green". Thus, we should get a cosine similarity somewhere between 0 and 1. Wrong! Tf*idf for "bag," the common term, is zero because its IDF is zero. "blue" is not a shared term and so that term of the dot product is zero, as is the term for "green." In other words, where the documents differ they pump zero terms into the dot product, and where they are similar, those terms convey no information whatsoever and so also generate zero values. The norms are non-zero this time, so the cosine similarity comes out as exactly 0.

While these two scenarios may seem contrived, I encountered them while writing unit tests where I wanted to use the smallest possible corpora to test my code. It seems that one needs three distinct documents to avoid the problems above, or your code must handle a NaN.
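Both failures are easy to reproduce. The following is a minimal sketch, not my production code: it assumes tf is a raw count, idf = log(N/n_i), whitespace tokenization, and it returns NaN rather than raising when a norm is zero. All function names are made up for this illustration.

```python
import math

def tfidf_vector(doc, corpus, vocab):
    """tf*idf weights for doc over vocab, with tf = raw count and idf = log(N/n_i)."""
    n_docs = len(corpus)
    words = doc.split()
    weights = []
    for term in vocab:
        n_with_term = sum(1 for d in corpus if term in d.split())
        idf = math.log(n_docs / n_with_term) if n_with_term else 0.0
        weights.append(words.count(term) * idf)
    return weights

def cosine_similarity(a, b):
    """Cosine similarity, or NaN when either vector has zero length (the 0/0 case)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else float("nan")

def similarity(q, d, corpus):
    vocab = sorted(set(q.split()) | set(d.split()))
    return cosine_similarity(tfidf_vector(q, corpus, vocab),
                             tfidf_vector(d, corpus, vocab))

# FAIL 1: one-document corpus, query identical to the document.
fail1 = similarity("blue bag", "blue bag", ["blue bag"])

# FAIL 2: two documents that share one term.
corpus2 = ["blue bag", "green bag"]
fail2 = similarity(corpus2[0], corpus2[1], corpus2)

# A third, distinct document gives "bag" a non-zero idf.
corpus3 = ["blue bag", "green bag", "red balloon"]
ok = similarity(corpus3[0], corpus3[1], corpus3)

print(fail1, fail2, round(ok, 2))
```

With these assumptions FAIL 1 gives NaN, FAIL 2 gives 0.0, and the three-document corpus gives roughly 0.12, a value between 0 and 1 as intuition expects.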

I use tf*idf and cosine similarity frequently. It can get you far with little cost (if your documents are not enormous). It does have a big limitation though: it is a "bag of words" model, meaning it does not consider word order. In many cases, specific word order matters a lot---a red couch with gold legs is very different from a gold couch with red legs. What one can do is to use the fast and cheap cosine similarity with tf*idf weights to narrow down some larger corpus to a smaller subset of documents, on which you then run a more computationally expensive, more domain-specific model or algorithm that does consider word order.

Thursday, February 14, 2013

20 Tips and tricks working with data

In this post I lay out a number of tips, tricks and suggestions for working with data, from collection to storage to analysis. The main idea is to help students and other younger analysts and data scientists. However, they apply to all ages and experiences. Indeed, while many of the following ideas may sound obvious and trivial, it is surprising how often I see the counter-examples and anti-patterns.

This is almost certainly a biased selection based on my personal experience over the years. Thus, if you have any additional comments or tips, feel free to add those in the comments.

This section is about data itself.

1) Inspect your data.

When you receive a dataset, open a text editor and inspect the data. Don't assume all is present and correct. Page through the data quickly to get a sense of columns that are mostly NULL, data that are zero or negative, the ranges of dates and so on. It is very easy to make mistakes when reformatting data and data entry people are often low-paid, low-skilled workers. Assume that there is something wrong with your data. In *nix, use the head and more commands to get a sense of the data.

2) Think through what you would expect.

Related to the above, it helps to have an idea of what you expect to see in each column. This will help in spotting errors. For instance, would you expect to find negative values? I once worked with a government-recorded health dataset that included data of people's height, something one might expect would be easy to measure and record. However, we found a chunk of people listed as 5 inches tall. Likely, they should have been 5 feet tall but such outliers are easy to spot with a frequency histogram. In R, get in the habit of calling summary() on your dataset, which provides a five-number summary plus the mean of each numeric variable. Plot the univariate distributions too. In R, pairs(), a function that produces a scatter plot of each pair of variables, can be very valuable.
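If you work in Python rather than R, a rough equivalent can be sketched with the standard library. The heights below are made up for illustration, the 1.5 * IQR cutoff is a common rule of thumb rather than anything specific to that health dataset, and statistics.quantiles needs Python 3.8 or later.

```python
import statistics

# Made-up heights in inches; the 5 stands in for a "5 feet" entry recorded in the wrong unit.
heights_in = [62, 64, 65, 66, 67, 68, 69, 70, 71, 5]

def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

summary = five_number_summary(heights_in)
print(summary)

# Flag values far outside the interquartile range (1.5 * IQR rule of thumb).
low, q1, median, q3, high = summary
iqr = q3 - q1
suspects = [h for h in heights_in if h < q1 - 1.5 * iqr or h > q3 + 1.5 * iqr]
print(suspects)
```

The minimum of 5 jumps out of the summary immediately, and the rule of thumb flags it as the only suspect value.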

3) Check your units.

It may sound obvious but check your units. CNN reports

NASA lost a $125 million Mars orbiter because a Lockheed Martin engineering team used English units of measurement while the agency's team used the more conventional metric system for a key spacecraft operation.

Yeah, it's that important.

4) Plain text > binary.

Plain text data files are always better than binaries. You get to see the raw data, to see how columns are delimited and to see which line ending types are used. Any text editor can be used to view the data. With a simple text file, it can be easy to write a Python script to iterate through the data.
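As an illustration of how little code that takes, here is a small sketch that walks a tab-delimited file and counts missing values. The column names and contents are invented, and io.StringIO stands in for a real file so that the example is self-contained.

```python
import csv
import io

# Stand-in for open("data.tsv"): a hypothetical file name with made-up contents.
raw = io.StringIO("id\theight_in\n1\t64\n2\t\n3\t70\n")

reader = csv.DictReader(raw, delimiter="\t")
rows = 0
missing = 0
for row in reader:
    rows += 1
    if row["height_in"] == "":
        missing += 1  # empty field: the value was never recorded

print(rows, "rows,", missing, "missing height values")
```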

5) Structured > unstructured.

While there has been a recent focus on NoSQL technologies that store schemaless documents, if data are or should be well-structured or have a strict, static schema, keep them that way. It will make querying, slicing and dicing much easier later.

6) Clean > more > less.

Much of the time building models, classifiers and so on is spent cleaning data. It takes a lot of time and can be depressing to see huge swathes of data thrown away or excluded. It is better to start with clean data. This means having quality control on data entry. Run validators on HTML forms where possible. If you can't get clean data, at least get a lot of it (assuming it is a well-structured experiment). Clean data are better than more data which is better than less data.

7) Track data provenance.

Provenance means where data came from. In many cases, provenance is explicit as the data come from a single source or there is some source/vendor code or name. Imagine though that you create a corpus for a model in which you combine data from multiple sources. If you later find a problem with one of those sources, you will likely want to back out those data or weight them less. If your corpus is just a big set of numbers and you don't know which data came from where, you now have polluted, low-quality data.
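One simple way to keep provenance, sketched here with hypothetical vendor names and made-up values, is to tag every record with its source when you combine the data, so that a bad source can be backed out later.

```python
# Hypothetical sources and made-up measurements.
vendor_a = [3.1, 2.9, 3.3]
vendor_b = [3.0, 9.9, 3.2]

# Tag each value with where it came from as the corpus is assembled.
corpus = [{"source": "vendor_a", "value": v} for v in vendor_a]
corpus += [{"source": "vendor_b", "value": v} for v in vendor_b]

# If vendor_b later turns out to be unreliable, its rows are easy to remove.
trusted = [rec for rec in corpus if rec["source"] != "vendor_b"]
print(len(corpus), "records,", len(trusted), "trusted")
```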

8) Be honest about data loss acceptability.

When processing data, there can be problems, especially when working with remote services. Servers can be down, database connections can momentarily drop etc. Think through how important it is to process each item. If you are a bank or one of many other transaction-based businesses, everything must work perfectly. This though comes at a high cost. However, if you are running, say, sentiment analysis on a Twitter stream, or running some summary metrics on historical data, a small data loss may be perfectly acceptable. Will the extra cost of getting every last tweet make a big difference to your analysis?

9) Embrace the *nix command line.

It is hard to stress this enough. Learn to love the command line. While the commands may seem terse and obtuse, you will be amazed how much more productive you can be with just a few basic commands such as:

wc -l filename: returns the number of lines in the file.
head -n N filename: shows the first N lines of the file.
grep searchterm filename: keeps only the lines containing the search term.

10) Denormalize data for warehousing.

There are very good reasons for normalized database tables. A well-designed schema will be flexible for future data and grow by rows not columns over time. For warehousing, where data are stored for the longer term, it can often be a good idea to denormalize data. This can make the provenance more explicit, can make it easier to split into chunks, and it is often easier for analysts to work with as you can avoid costly, complex joins.

This section refers mostly to processing data with scripts

11) Learn a scripting language.

Python, Ruby, it doesn't really matter which but pick one. You want a language with at least simple I/O, a range of data structures and libraries to make HTTP requests. You will end up doing a lot of file manipulation in your career and so you need a good general tool to automate tasks. Also, some analyses, especially involving time series of individuals, are far easier to do in scripting code than with SQL self-join queries.

12) Comment your code.

This is a general tip that applies to all source code, not just scripts for analyzing data. When your head is in a problem there are subtle nuances about why you're doing something the way that you are. You'll forget that in a week, especially if you work on multiple projects simultaneously. Comment on why you are taking a certain approach or why this is an edge case. Focus less on what you are doing--ideally your code will be readable enough to show that. A few such comments, in combination with well-chosen function and variable names, will help you, or someone else, get into the code later. Comment for your future self.

13) Use source control.

Again, a general coding tip. Source control (CVS, SVN, git, mercurial etc.) is indispensable. First, it is a form of backup. Second, it allows one to share code and work collaboratively. Third, it gives you the freedom to experiment and try things out without fear of being unable to get back to your last iteration--you can always revert to the old code if it doesn't work out. Don't leave home without it. Also, always always write a commit message when committing. Like code comments, commit comments should focus more on what new code does rather than how the new code works. Again, comment for your future self.

14) Automate.

This is obvious but is a lesson that most people learn only after they have had to do a task manually several times when they thought it was a one-off. If there is any chance that you might need to do a task twice, it probably means you will actually need to do it five times. Automate the task where possible. This might mean a Python script, a SQL stored procedure, or a cron job but in most cases, it is worth it. Even if only some of the steps are automated, it can make a huge difference.

15) Create services.

In years of creating data-related tooling, I've found that people always use the tool in unexpected ways. Make the tools as flexible as possible. Doing so will provide an ecosystem of functionality. Create top-level functions that serve the 80% of uses as well as provide access to the lower-level functionality. Where possible, expose the functionality as web services that can talk to each other in a simple, clean manner. In many cases, JSON is preferable to XML or SOAP.

16) Don't obsess over big data.

I might get lynched for this one but there is a lot of hype over big data as service companies vie to take a slice of the analytics pie. While big data certainly has a place, many people who think they need big data don't have big data and don't need big data technologies. If you work in mega- or giga-bytes, it is not big. Take a deep breath and answer honestly: do I have big data or need to scale to huge proportions? Do I have a problem that can be parallelized well (some problems can't)? Do I need to process all my data or can I sample? I should stress that I am in no way against big data approaches---I use them myself---but they are not a panacea. Other, simpler approaches may get you what you need.

17) Be honest about required latency.

In most websites, data are replicated, warehoused and analytics performed offline. That latency for analysis can often be hours or a day without any real impact on the business. There are cases where more real-time analysis is required, for instance collaborative filtering or computing real-time Twitter trends. Such short latencies come at a cost, and the cost rises steeply as the required latency shrinks. Don't waste money creating a system capable of 1 second sales data latency if you only analyze it in batch mode overnight.

I am hesitant to dip into data visualization as that is a whole post in itself but I will include the following two:

18) Use color wisely.

While neon colors might be in vogue at American Apparel this season, that doesn't mean I want to see them in your charts. Take note of default color palettes in programs such as Excel and libraries such as ggplot2. A lot of thought, research and experience have gone into those. If you break the rules, do it for good reason and for good effect. P.S. don't use yellow text on a white background. While it might look good on your laptop screen, in almost all cases it is unreadable when projected.

19) Declutter charts and visualizations.

When producing charts, infographics and other data visualizations, think very carefully about what you want the reader to take away from the data. Focus on that and make that message shine through with the least amount of reader effort. Think about information density. If I need a magnifying glass to decipher all the details, you're doing it wrong. If a chart has a lot of textual annotations, you are likely doing it wrong. If you need 6 charts to convey one main point, you are probably missing some key summary metric. Use larger fonts than you might think if you are projecting to a large room. (Where possible, actually stand at the back of the room and check readability of your presentation.) You would do well to read Edward Tufte's earlier books.

20) Use # dashboard users as a metric in your dashboards.

In his foundation video series, Kevin Rose interviewed Jack Dorsey (CEO Square and Exec Chairman, Twitter) [video]. Dorsey made a very interesting point in how he has tried to make Square data-centric. A dashboard is useless if no one is using it. How valuable a dashboard or other metrics are should be reflected in the very dashboard itself: use the numbers of users as a metric in the dashboard itself. I would add that such metrics should be honest signals. That is, if reports are emailed out to recipients, the number of recipients is not necessarily a true reflection of use. Perhaps most people don't open it. Email open rate, however, is a more reliable and reflective indicator.

There you go. A few tips and suggestions for working with data. Let me know what you think and what I missed.

Tuesday, January 22, 2013

When is a meat sandwich like a merchant? A python joke generator

When is a meat sandwich like a merchant? When it is a burgher.

Yes, you can groan but don't blame me, heckle the computer.

I enjoyed a recent New York Times piece, "A Motherboard Walks Into a Bar ..." on how and whether computers can learn what is or is not funny. I'm a big fan of groan-inducing puns and Physics particle X walks into a bar type jokes. As I read the article, it occurred to me that there must be some simple lexical patterns that a computer could pick up on to auto-generate jokes. Consider the following:

What do you call a strange market? A bizarre bazaar.

That has the structure "What do you call a [Adjective1] [Noun1]? A [Adjective2] [Noun2]" where [Adjective2] and [Noun2] are homonyms, and where [Adjective1] and [Adjective2] are a synonym pair, as are [Noun1] and [Noun2].

(A homonym is a word pronounced the same as another but differing in meaning, whether spelled the same way or not. Example: hare and hair. Synonyms are two or more different words with the same meaning. Example: lazy and idle.)

If we take a look through a list of English homonyms, we can easily pick out such joke material:

suite: ensemble

sweet: sugary

leads to "What do you call a sugary ensemble? A sweet suite."

Similarly,

What do you call a breezy eagle's nest? An airy aerie.

What do you call a coarse pleated collar? A rough ruff.

Another structure is when the homonyms are both nouns:

stake: wooden pole

steak: slice of meat

leads to "When is a slice of meat like a wooden pole? When it is a stake."

(Slightly more complicated is "When is a car like a frog? When it is being toad.")

This suggests that we can easily auto-generate jokes such as these. So, let's do it.

First, I downloaded that homonym webpage and parsed the HTML using the python BeautifulSoup library to extract the homonyms. There is one short function to parse the HTML to obtain two homonyms and their short definitions, and for each homonym I call a second function which calls an unofficial google dictionary API to obtain the part of speech (noun, adjective etc.) of the homonym. Calling python extract_homonyms.py > processed_homonyms.txt produces a tab-delimited flat text file of the six pieces of information: homonym1, definition1, pos1, homonym2, definition2, pos2

Here is the code.

With the hard work out of the way, generating the jokes is simple. A second short script, generate_jokes.py, has two types of jokes: 1) one homonym is an adjective and the other is a noun, 2) both homonyms are nouns:

def indefinite_article(w):
    # No article is needed if the definition already starts with one.
    if w.lower().startswith("a ") or w.lower().startswith("an "):
        return ""
    return "an " if w.lower()[0] in "aeiou" else "a "

def camel(s):
    # Capitalize the first letter; s is empty when no article was added.
    return s[:1].upper() + s[1:]

def joke_type1(d1, d2, w1, w2):
    # Adjective + noun pun: definitions form the question, homonyms the answer.
    return ("What do you call " + indefinite_article(d1) + d1 + " " + d2 + "? " +
            camel(indefinite_article(w1)) + w1 + " " + w2 + ".")

def joke_type2(d1, d2, w1, w2):
    # Noun + noun pun.
    return ("When is " + indefinite_article(d1) + d1 + " like " + indefinite_article(d2) + d2 + "? " +
            "When it is " + indefinite_article(w2) + w2 + ".")

# Each line holds: homonym1, definition1, pos1, homonym2, definition2, pos2
with open("processed_homonyms.txt", "r") as f:
    data = f.readlines()

for line in data:
    w1, d1, pos1, w2, d2, pos2 = line.strip().split("\t")
    if pos1 == 'adjective' and pos2 == 'noun':
        print(joke_type1(d1, d2, w1, w2))
    elif pos1 == 'noun' and pos2 == 'adjective':
        print(joke_type1(d2, d1, w2, w1))
    elif pos1 == 'noun' and pos2 == 'noun':
        print(joke_type2(d1, d2, w1, w2))
        print(joke_type2(d2, d1, w2, w1))

When we run this, we output 493 wonderful, classy jokes (from just 70 lines of code). A few of my favorites are:

What do you call an accomplished young woman? A made maid.
When is a disparaging sounds from fans like a whiskey? When it is a booze.
When is a fish eggs like a seventeenth letter of Greek alphabet? When it is a rho.
When is a bench-mounted clamp like a bad habit? When it is a vice.
When is a fermented grape juice like an annoying cry? When it is a whine.
When is a location like a flounder? When it is a plaice.
What do you call a fake enemy? A faux foe.
What do you call a beloved Bambi? A dear deer.

Not bad, not bad although even Carrot Top's career is probably safe with these.

This is the complete source code.

(Another potential joke pattern comes from "What is the difference between a pretty glove and a silent cat? One is a cute mitten, the other is a mute kitten." where we can observe a transposition of the first letters of two pairs of words. You can discern some other patterns in this joke generator site.)

So, we can conceive that a computer could be programmed with, or learn, the structure of jokes. This is a generative approach (e.g., Manurung et al.).

A second approach is to learn which jokes are considered funny by humans. Given a suitable corpus and a reasonable set of features, any number of classifiers could learn, at least statistically, to sort the funny from the unfunny (e.g., Kiddon & Brun, That's what she said detector).

Finally, given a set of jokes, a system could learn which are funny to you given some basic training. Jester is a system where you are asked to rate 10 jokes. After that, you are presented with a series of jokes that you are more likely to find funny than other jokes. In web terms, it is an old site with what amounts to an early recommender system (Goldberg et al. 2000).

One final joke from my code:

What do you call a least best sausage? A worst wurst.

Ba dum dum, Thanks, folks! I'll be here all week.

Sunday, January 6, 2013

What's the significance of 0.05 significance?

Why do we tend to use a statistical significance level of 0.05? When I teach statistics or mentor colleagues brushing up, I often get the sense that a statistical significance level of α = 0.05 is viewed as some hard and fast threshold, a publishable / not publishable step function. I've seen grad students finish up an empirical experiment and groan to find that p = 0.052. Depressed, they head for the pub. I've seen the same grad students extend their experiment just long enough for statistical variation to swing in their favor to obtain p = 0.049. Happy, they head for the pub.

Clearly, 0.05 is not the only significance level used. 0.1, 0.01 and some smaller values are common too. This is partly related to field. In my experience, the ecological literature and other fields that are often plagued by small sample sizes are more likely to use 0.1. Engineering and manufacturing where larger samples are easier to obtain tend to use 0.01. Most people in most fields, however, use 0.05. It is indeed the default value in most statistical software applications.

This "standard" 0.05 level is typically associated with Sir R. A. Fisher, a brilliant biologist and statistician who pioneered many areas of statistics, including ANOVA and experimental design. However, the true origins make for a much richer story.

Let's start, however, with Fisher's contribution. In Statistical Methods for Research Workers (1925), he states

The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.

The next year he states, somewhat loosely,

... it is convenient to draw the line at about the level at which we can say: "Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials."...

If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.

(See http://www.jerrydallal.com/LHSP/p05.htm)

And there you have it. With no theoretical justification, these few sentences drove the standard significance level that we use to this day.

Fisher was not the first to think about this but he was the first to reframe it as a probability in this manner and the first to state this 0.05 value explicitly.

Those two z-values in the first quote, however, hint at a longer history and basis of the different significance levels that we know and love. Cowles & Davis (1982) On the Origins of the .05 level of statistical significance describe a fascinating extended history which reads like a Who's Who of statistical luminaries: De Moivre, Pearson, Gosset (Student), Laplace, Gauss and others.

Our story really begins in 1818 with Bessel who coined the term "probable error" (well, at least the equivalent in German). Probable error is the semi-interquartile range. That is, ±1PE contains the central 50% of values and is roughly 2/3 of a standard deviation. So, for a uniform distribution ±2PE contains all values but for a standard normal it contains only the central 82% of values. Finally, and crucially to our story,

±3PE contains the central ~95% of values. 1 - 0.95 = 0.05

People like Quetelet and Galton had tended to express variation or errors outside some typical range in terms of ±3PE, even after Pearson coined the term standard deviation.

There you have the basis of 0.05 significance: ±3PE was in common use in the late 1890s and this translates to 0.05. 1 in 20 is easier to interpret for most people than a z value of 2 or in terms of PE (Cowles & Davis, 1982) and thus explains why 0.05 became more popular.
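These figures are easy to verify numerically. The sketch below assumes a standard normal distribution and takes one probable error as 0.6745 standard deviations, the usual approximation to the normal's 75th percentile. The function name is my own.

```python
import math

def central_coverage(z):
    """Fraction of a standard normal lying within +/- z standard deviations."""
    return math.erf(z / math.sqrt(2))

# One probable error in units of sigma (approximate 75th percentile of the normal).
PE = 0.6745

coverage = {k: central_coverage(k * PE) for k in (1, 2, 3)}
for k, c in coverage.items():
    print(f"+/-{k}PE = +/-{k * PE:.3f} sigma covers {c:.3f} of values")

# Two-sided tail probabilities for the two z-values in Fisher's quote.
tail_2sigma = 1 - central_coverage(2.0)
tail_196 = 1 - central_coverage(1.96)
print(f"beyond 2 sigma: {tail_2sigma:.4f} (about 1 in {1 / tail_2sigma:.0f})")
print(f"beyond 1.96 sigma: {tail_196:.4f}")
```

Under these assumptions ±1PE, ±2PE and ±3PE cover about 50%, 82% and 96% of values respectively, while a deviation beyond 2σ occurs about once in 22 trials and one beyond 1.96σ once in 20, matching the two figures in Fisher's 1925 passage.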

In one paper from the 1890s, Pearson remarks on different p-values obtained as

p = 0.5586 --- "thus we may consider the fit remarkably good"

p = 0.28 --- "fairly represented"

p = 0.1 --- "not very improbable that the observed frequencies are compatible with a random sampling"

p = 0.01 --- "this very improbable result"

and here we see the start of different significance levels. 0.1 is a little probable and 0.01 very improbable. 0.05 rests between the two.

Despite this, ±3PE continued to be used as the primary criterion up to the 1920s and is still used in some fields today, especially in physics. It was Fisher who rounded off the probability to 0.05, which in turn switched the criterion from a clean ±2σ to ±1.96σ.

In summary, ±3PE --> ±2σ --> ±1.96σ --> α = 0.05 more accurately describes the evolution of statistical significance.