Wednesday, May 7, 2014

How to create a data-driven organization: one year on

A switch from "I think" to "The data show"
A year ago, I wrote a well-received post here entitled "How do you create a data-driven organization?". I had just joined Warby Parker and set out my various thoughts on the subject at the time, covering topics such as understanding the business and customer, skills and training, infrastructure, dashboards and metrics. One year on, I decided to write an update. So, how did we do?

We've achieved a lot in the last year: great strides in some areas, less so in others.

Initiative metrics
One of the greatest achievements and impacts, because it cuts across the whole organization and affects all managers, concerns our initiative teams and how they are evaluated. Evaluation is now tied very strongly to metrics, to evidence backing underlying assumptions, and to return on investment.


What’s an initiative team?
Much of the work and many of the improvements that individual teams (such as Customer Experience, Retail and Consumer Insights) want require software development work from our technology team. For instance, Retail might want a custom point-of-sale application, or Supply Chain might want better integration and tracking with vendors and optical labs. The problem is that the number of developers, here organized into agile teams, is limited. Thus, different departments essentially have to compete for software development time. If they win, they get use of a team --- a 3-month joint "initiative" between an agile team and a business owner --- and can implement their vision. With such limited, vital resources, it is imperative that the diverse initiative proposals are evaluated carefully and are comparable (that we can compare apples to apples), and that we track costs, progress and success objectively.

These proposals are expected to set out the metrics that the initiative is trying to drive (revenue, cost, customer satisfaction etc.) and upon which the initiative will be evaluated --- for example, reducing website bounce rate (the proximate metric) should lead to increased revenue (the ultimate metric). They are also expected to set out the initiative's assumptions. If you claim that this shiny new feature will drive $1 million in increased revenue, you need to back up your claim. As these proposals will be reviewed, discussed and voted upon by all managers in the company, and they are in competition, there is increased pressure to have a bulletproof argument with sound assumptions and evidence, and to focus on work that will really make a difference to the company. It also has an additional benefit: it keeps all the managers up to speed with what teams are thinking about and what they would like to accomplish, even if their initiative does not get "funded" this time around.

It took a few rounds of this process to get where we are now and there are still improvements to be made. For instance, in this last round we still saw hours saved proposed as an initiative impact, even though the hourly rate varies considerably among employees. Such estimates should be standardized into actual dollars, or at least into a tiered system of hourly rates, so that hours saved on a more expensive team can be compared fairly (see the sketch below). This visibility of work, metrics, assumptions and the process by which resources are allocated has really pushed us towards data-driven decisions about priorities and resource allocation.
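To illustrate what such a standardization might look like, here is a minimal sketch; the tiers and hourly rates are invented for illustration and are not our actual figures:

# Hypothetical tiered hourly rates (illustrative only, not actual figures)
TIER_RATES = {"associate": 30.0, "senior": 60.0, "director": 110.0}

def dollars_saved(hours_saved_by_tier):
    """Convert hours saved per tier into a single dollar figure
    so that impacts from different initiatives are comparable."""
    return sum(TIER_RATES[tier] * hours
               for tier, hours in hours_saved_by_tier.items())

# e.g. an initiative claiming 100 associate hours and 20 director hours saved
print(dollars_saved({"associate": 100, "director": 20}))  # 5200.0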

ROI
While the initiative process covers the overall strategy and only touches on tactics at a very high level, what happens within a funded initiative is all about low-level tactics. Teams have different options for achieving their goals and driving their metrics: which features should they work on specifically, and when? This, too, is a very data-driven process, all about return on investment (ROI). Again, there is a good process in place in which our business analysts estimate costs, returns, assumptions and impacts on metrics (this is the "return" component). While development time is mostly a fixed cost (the agile teams are stable), costs can vary because a team may choose to pay for a 3rd-party vendor or service rather than build the same functionality (this is the "investment" component). These ROI discussions are really negotiations between the lead on the agile team and the business owner (such as the head of Supply Chain): what makes most sense for us to work on this sprint? This ROI process covers my team, the Data Science team, as well; we are outside the initiative process but have similar negotiations with department heads who request work, which allows us to say no to requests whose ROI is too low. By asking department heads to spell out precisely the business impact and ROI of their requests, it also gets them to think more carefully about their strategy and tactics.
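As a sketch of the arithmetic behind a build-versus-buy discussion (all of the numbers below are hypothetical):

def roi(estimated_return, investment):
    """Simple return on investment: net return per dollar invested."""
    return (estimated_return - investment) / float(investment)

# Hypothetical feature: same expected return, two possible investments
build = roi(estimated_return=50000.0, investment=20000.0)  # build in-house
buy = roi(estimated_return=50000.0, investment=12000.0)    # buy from a 3rd-party vendor
print("build ROI: %.2f, buy ROI: %.2f" % (build, buy))     # 1.50 vs 3.17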

Our ROI process is very new but is clearly a step in the right direction. Estimating the return on investment and justifying the assumptions is not at all easy, but it is the right thing to do. In essence, we are switching from "I think" to "The data show"...

"Guild meetings are held to improve. They are there to ‘sharpen your saw’. Every minute you use for ‘sawing’ decreases the amount of time for sharpening" from a post by Rini van Solingen 

Analyst guild
Warby Parker employs a decentralized analyst model. That is, analysts are embedded in individual teams such as Digital Marketing, Customer Experience, and Consumer Insights. Those analysts report to their respective team leads and share those teams' goals. The advantage, of course, is that analysts are very close to what their team is thinking about, what it is trying to measure, and what questions it is asking. The downside, however, is that metrics, processes and tools can get out of sync with analysts on other teams. This can --- and in our case did --- result in redundant effort, divergent metric definitions and a proliferation of tools and approaches.


To compensate for these inefficiencies, we instituted a "guild," a group that cuts across the organization (rather like a matrix-style organization). The guild is an email list and, more importantly, an hour-long meeting every two weeks, a place for all the analysts to come together to discuss analytics, share their experiences and detail new data sources that might be useful to other teams. In recent weeks, the guild has switched to a more show-and-tell format in which analysts showcase their work, ask for honest feedback and stimulate discussion. This is working really well. Now, we all have a better sense of whom to ask about metrics and issues, what our KPIs mean, where collaborations may lie and what new data sources and data vendors we are testing or are in discussion with. When the analysts are aligned, you stand a far greater chance of aligning the organization, too.

SQL Warehouse
Supporting the analysts, my team has built a MySQL data warehouse that pulls all the data from our enterprise resource planning software (hereafter ERP; we use Netsuite) with 30-minute latency and exposes those data through a simpler, cleaner SQL interface. Combined with SQL training, this has had a significant impact on the analysts' ability to conduct analysis and compile reports on large datasets.

Prior to that, all analysts were exporting data from the ERP as CSV files and doing their analysis in Excel, an approach that came with problems. The ERP software has limits, so exports can time out. Excel has its limits too, and analysts could run out of rows or, more frequently, memory. Finally, the ERP software did not allow (easy) custom joins in the data; you exported what the view showed. This meant that analysts had to export multiple sets of data in separate CSV files and then run huge VLOOKUPs in Excel. Those lookups might run for 6 hours or more and would frequently crash. (There is a reason that the financial analysts have the machines with the most RAM in the whole company.)

To combat this insanity, we built a data warehouse. We flattened some of those tables to make them easier to use. We then ran a number of SQL trainings, combining material from w3schools with interactive sessions and tutorials using simplified Warby Parker data. After analysts got their feet wet, we supplemented these with one-on-one tutorials and help sessions, and also hosted group sessions in the analyst guild, where individuals could show off their queries and share how quickly and easily they ran compared to the old ERP/Excel approach. We now have a reasonable number of analysts running queries regularly and getting answers to their questions far more easily and quickly.
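For a flavor of the difference, here is the kind of query that replaces exporting two CSVs and running a multi-hour VLOOKUP. The table and column names are hypothetical, and MySQLdb is just one of several Python drivers one could point at such a warehouse:

import MySQLdb  # one possible MySQL driver; connection details are illustrative

conn = MySQLdb.connect(host="warehouse", user="analyst", passwd="...", db="warehouse")
cursor = conn.cursor()

# One query replaces two CSV exports plus a VLOOKUP in Excel.
# orders and customers are hypothetical flattened warehouse tables.
cursor.execute("""
    SELECT c.state, SUM(o.total) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.order_date >= '2014-01-01'
    GROUP BY c.state
    ORDER BY revenue DESC
""")
for state, revenue in cursor.fetchall():
    print("%s\t%s" % (state, revenue))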

In addition, this centralization of not just the raw data but also the derived measures, such as sales channel or net promoter score, means a central source of truth with a single definition. This has helped move us away from decentralized definitions in Excel formulae sitting on people’s laptops to standard definitions baked into a SQL database field. In short, people now (mostly) speak the same language when they reference these metrics and measures. While there is still work to be done, we are in a better place than a year ago.

"People want to move from a culture of reporting to a culture of analytics" - Steffin Harris


BI tooling
While writing SQL is one approach to getting answers, it does not suit all levels of skill and experience. Once you have a set of core queries, you likely want to run them frequently and automatically and share the results (think canned reports and dashboards). This is where business intelligence tools come into play. While we did a good job of automating a number of core queries and reports using Pentaho Data Integration, we did not make sufficient progress (and I am solely to blame for this) in rolling out a more self-service set of business intelligence tools, a place where analysts can spend more time exploring and visualizing data without writing SQL. While we trialed Tableau and, more recently, Looker, my team did not push analysts hard enough to use these tools, to switch away from Excel charting and dashboards, and to report feedback. Thus, while we are rolling these out to production this quarter, we could have done so up to six months ago. Getting analysts to switch sooner would have created more high-quality dashboards that could easily be shared around the company. It would have gotten more people to see data on monitors or in their inbox. It would also have freed up more time for analysts to conduct analysis rather than reporting, an important distinction.

Statistics

Another area where I made less progress than I expected was the level of statistical expertise across the company. Having a Ph.D. from a probability and statistics department, I am clearly biased, but I think statistical training is hugely valuable not just for the analysts performing the analysis and designing the experiments, but for their managers too. Statistical training imparts a degree of rigor: thinking in terms of hypotheses and experimental design, thinking about populations and samples, as well as the analysis per se. In many cases, analysts would ask for my advice about how to analyze some dataset. I would ask, "Precisely what are you trying to answer?", and they wouldn't be able to express it clearly and unambiguously. When I pushed them to set out a null and alternative hypothesis, this crystallized the questions in their minds and made the associated metrics and analytical approach far more obvious.

I announced that I would run some statistical training and 40 people (a large proportion of the company at the time) immediately signed up. There was a lot of interest and excitement. I vetted a number of online courses and chose Udacity's "The Science of Decisions" course. It has a great interactive interface (the student is asked to answer a number of questions inline in the video itself during each lesson) and a good curriculum for an introductory course. It also has course notes, another feature I liked. I decided to send about 20 employees through the first trial.

It was a complete disaster.

The number of people who completed the course: zero. The number of people who completed half the course: zero. The problem was completely unrelated to Udacity; it was our fault. Students (that is, staff) weren't fully committed to spending several hours per week of their own time learning what should be a valuable, transferable skill. To truly embrace statistical thinking you have to practice, to do the exercises, and to attempt to translate the concepts you are learning to personal examples, such as specific datasets that you use in your job as an analyst. There was insufficient buy-in for this degree of effort. There was also insufficient reinforcement and monitoring of progress from managers; that is, expecting participation and following up with their direct reports. I am also to blame for not having in-house check-in sessions, a chance to go through material and cover problematic concepts.


I haven't yet solved this. What I have noticed is that a concrete need drives a response to "level up" and meet expectations. Over the last few months, our A/B testing has picked up. The business owners, the agile teams building features (including their business analysts and project managers), and the particular manager who runs our A/B tests and analysis are all simultaneously expecting and being expected to provide rigor and objectivity: to run a sample size and power analysis in advance of the test, to define clear metrics and null and alternative hypotheses, and to be able to explain why they are using a Chi-squared test rather than another test. Having to defend themselves against probing questions from other senior managers, ones who do have some experience in this area and ask the right questions, is forcing people to learn, and to learn quickly. This is not ideal, and there is a long way to go, but I do feel that this represents a significant shift.
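To make that concrete, here is a minimal sketch of the pre-test and post-test rigor being asked for, using statsmodels for the power analysis and scipy for the test itself; the baseline rate, lift and counts are all hypothetical:

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower
from scipy.stats import chi2_contingency

# Before the test: visitors needed per arm to detect a lift
# from a 10% to an 11% conversion rate (numbers hypothetical)
effect = proportion_effectsize(0.10, 0.11)
n = NormalIndPower().solve_power(effect, alpha=0.05, power=0.8,
                                 alternative='two-sided')
print("required sample size per arm: %d" % round(n))

# After the test: a Chi-squared test on the observed 2x2 table;
# rows = control/variant, columns = converted/did not convert
table = [[1000, 9000],   # control
         [1100, 8900]]   # variant
chi2, p, dof, expected = chi2_contingency(table)
print("chi2 = %.2f, p-value = %.4f" % (chi2, p))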



In conclusion, the business is in a far better place than a year ago. People are starting to ask the right questions and to have the right expectations of peers. More decisions are based on data-backed assumptions and results than before. More people are talking in precise language that involves phrases such as "test statistic" and "p-value." More people are harnessing the power of databases to leverage their strength: crunching through large amounts of data in seconds. We are not there yet. My dream for the coming year is:
  • more canned reports
  • more sharing of results and insights
  • more dashboards on monitors
  • more time spent on analysis and deep dives rather than reporting
  • more accountability and retrospectives, such as for prediction errors and misplaced assumptions
  • more A/B testing
  • more clickstream analysis
  • more holistic view of the business

How will we do? I don't know. Check back next year!


Sunday, April 27, 2014

Interview with Data Science Weekly

I was recently interviewed for Data Science Weekly in a post entitled
Data Science & Online Retail - At Warby Parker and Beyond: Carl Anderson Interview
It was also incorporated into a PDF collection of interviews here.

Update: It was picked up by Business Insider.

Update: And then things really jumped the shark on gothamist

Saturday, April 19, 2014

Data science in e-Commerce

I meet a lot of aspiring data scientists, people starting out who are often switching from academia or finance. They are all keen-eyed and bushy-tailed, drawn in by the tales of advanced algorithms from Netflix, the latest competition at Kaggle or the shiny new visualization from Facebook. However, when it comes to e-Commerce, they are kind of stumped. They don't really grasp the scope of how data science can help a business that sells physical "stuff". They get the idea of recommendation engines baked into almost every chunk of Amazon's website, of course, but beyond that, they find it hard to imagine how else data scientists may spend their days in such companies.

The purpose of this post, then, is to give a brief, almost superficial, overview of some of the different aspects of a typical e-Commerce business where data scientists can add value.

Before I start, however, I want to mention a couple of caveats:

    All of the areas below are serviced by a swathe of specialized vendors. They can do a great job --- potentially far superior to an in-house data science team because their business is so focused and specialized and their tools so developed --- but it usually comes at a price. At small-company scale, an individual data scientist or team may be able to provide something that is sufficiently good to meet the company's needs or to demonstrate the need for a specialized service. At larger scale, it may make sense to build such systems in-house using the data science team.

    In the list below, there is a broad overlap between the responsibilities of a typical analyst and a data scientist. Some aspects, such as "implement a recommendation engine," are clearly in the data science camp. Other areas, such as those relating to customer insights, are usually covered by analysts. In this case, however, the data scientists may be able to help the business and analysts with more sophisticated statistical approaches (say, feature reduction or unsupervised clustering of customers rather than a priori slicing and dicing based on age, gender, zip etc.), in other words advanced analytics, or with more programmatic approaches (e.g., using an API to pull down supplementary data).

Few individual companies will use data scientists for all of these aspects. The point here is to highlight the different areas where data scientists can and do get involved and provide some value and insight.

Recommendation and Personalization
Let's get the obvious one out of the way. Consumers are increasingly reliant on recommendations these days, whether for news, restaurants, bands or items to purchase. Many, if not most, e-Commerce sites have some sort of recommendation engine under the hood, and it is typically the data scientist's role to help conceive its type, features and weights, and in many cases to implement it. These engines are used for cross-sell ("you are ordering this iPad so you probably want one of these cases to protect it"), up-sell ("you have been looking at this camera, here is the next level up which is even more awesome") and personalization. It is the data scientist's role to learn the attributes of and relationships among products and, when possible, to learn the tastes and anticipate the needs of customers. They can then help tailor the customer's experience. This might involve changing the ordering of products in the search results or gallery pages specifically for that customer.
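As a minimal sketch of one common flavor of this, here is item-item similarity computed from a purchase matrix; the data are toy data, not a production approach:

import numpy as np

# Rows = customers, columns = products; 1 = purchased (toy data)
purchases = np.array([[1, 1, 0, 0],
                      [1, 1, 1, 0],
                      [0, 1, 1, 1],
                      [0, 0, 1, 1]], dtype=float)

# Cosine similarity between product columns
norms = np.linalg.norm(purchases, axis=0)
sim = purchases.T.dot(purchases) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)  # an item should not recommend itself

# "Customers who bought product 0 also bought..."
print("most similar to product 0: product %d" % sim[0].argmax())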

Many of us have supermarket loyalty cards. Some e-Commerce sites have the equivalent (think Amazon Prime). They are a source of extremely valuable data (so much so that it may even be worth taking some amount of loss on those customers). Coupons and discounts can drive new purchase behavior and provide insights for whole segments of customers not in the loyalty program itself. Those programs need to be conceived and managed, and maximal use needs to be made of the data.

Product strategy
All e-Commerce sites have to tackle the questions: what should we sell, at what price, and when? Data scientists can help define and optimize the product mix. In some cases, such as my current employer Warby Parker, the company may design and manufacture its own products. That is, it owns the whole process from product conception to final sale to a customer. While there is typically a product team that owns that design process, data scientists can and do help with forecasting. Is there a hole in our product mix? What should we make and when should we sell it? How many units should we order in the initial batch from the factory? When should we retire products? Analysts will typically tackle the retrospective analysis (how much did we sell, what are the duds) whereas data scientists can help with the more advanced prescriptive and predictive analytics.

Supply chain
If an e-Commerce site is to sell "stuff," it needs the right amount of the right stuff in the right place at the right time. Supply chain is a particularly complex and important part of the business. It is complex because it often involves multiple vendors and factories, significant time lags for international shipping, significant shipping costs (especially if one gets it wrong and has to expedite pallets of goods to warehouses) and significant capex. Also, there can be very narrow windows of demand for a product, and if you miss that window, you might be stuck with a big pile of useless inventory (think of "Happy New Year 2014" products on Jan 2). Finally, demand can be highly unpredictable and might correlate strongly with exogenous factors such as above-average weather. Ideally, an e-Commerce company will work with specialized vendors to handle supply chain or hire an expert in-house operations research team. However, at many e-Commerce sites, especially small ones, there is plenty of scope for data scientists to perform detailed analysis and develop predictive models that can help minimize risk, inform strategy and optimize customer satisfaction.
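One classic example of the kind of model involved is the newsvendor formula for a single-period order quantity. A minimal sketch, with invented costs and a normal demand forecast:

from scipy.stats import norm

# Hypothetical single-period inventory decision (the newsvendor model)
unit_cost, price, salvage = 20.0, 50.0, 5.0
cu = price - unit_cost    # cost of understocking: margin lost per missed sale
co = unit_cost - salvage  # cost of overstocking: loss per unsold unit

critical_ratio = cu / (cu + co)

# Demand forecast: normally distributed (parameters invented)
mean_demand, sd_demand = 1000.0, 200.0
order_qty = norm.ppf(critical_ratio, loc=mean_demand, scale=sd_demand)
print("critical ratio %.2f -> order %d units" % (critical_ratio, order_qty))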

Customer Service
A company that puts customers first is going to have a great customer service team that handles issues, deals with returns and complaints and generally tries to keep customers happy. These teams generate a trove of data from phone calls, instant messages and email interactions with customers and back-end systems. They also tend to be fairly metric-driven: how long on average does it take to answer the phone, or to resolve a case? What is the size of the case backlog? Data scientists can help with predictive models and visualization. They can also use their skills with natural language processing. For instance, they could use keyword extraction and topic modeling to understand the types of complaints and issues being filed.
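For instance, here is a bare-bones sketch of topic modeling over support tickets using scikit-learn; the tickets are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for customer service emails and chat transcripts
tickets = ["my order arrived broken and I want a refund",
           "lenses scratched, need replacement frames",
           "refund not processed for my broken order",
           "replacement lenses arrived scratched again"]

vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(tickets)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Top words per discovered topic
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-3:]]
    print("topic %d: %s" % (i, ", ".join(top)))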

Fraud
Fraud is, unfortunately, very common, and the strategies employed by thieves are varied and in some cases sophisticated. They range from the use of stolen credit cards to non-returned items to returns that arrive shrink-wrapped but do not contain the original product. Again, there is potential for data scientists to develop models, monitoring, or alerting systems.
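As a hedged sketch of one such approach, an off-the-shelf anomaly detector run over simple order features (both the features and the data are invented):

from sklearn.ensemble import IsolationForest

# Toy order features: [order total, number of items, returns in last 90 days]
orders = [[120.0, 2, 0], [85.0, 1, 0], [95.0, 2, 1],
          [110.0, 3, 0], [2500.0, 20, 9]]  # the last row looks suspicious

clf = IsolationForest(contamination=0.2, random_state=0).fit(orders)
print(clf.predict(orders))  # -1 = flag for review, 1 = looks normal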

HR
Hiring is tough, especially in technology where it is extremely competitive to hire good engineers (and data scientists of course). Hiring is time consuming and expensive because of the cost of recruiters, fees, and time spent interviewing. In addition, a bad hire can be counter-productive to the team or company and expensive to manage. Increasingly, companies are interested in honing their recruiting process: what makes a good fit for our company, where can we streamline the interview process, what are good discriminating interview questions and so on. Models can be used to understand attrition and retention, identify who should be rejected at the resume phase, and analyze and optimize the interview pipeline.

Customer insights
An e-Commerce site has stuff to sell, but who are the people buying it? What are they interested in? Where do they live? How can we serve them better? What makes them tick? These questions are typically answered by analysts in a group akin to customer insights or, as the company scales, by specialized teams that each work within just one realm of the product space. As above, data scientists can help here with more advanced analytics (classifiers, predictive modeling, unsupervised clustering and segmentation and so on). This team is often responsible for customer surveys, so there is ample opportunity to help with natural language processing, including keyword extraction and topic modeling. (I wrote about this earlier in my post about matching misspelled brand names.)
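A minimal sketch of that kind of unsupervised segmentation, clustering on behavioral features rather than a priori demographic buckets (toy data):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy features: [orders per year, avg order value, days since last order]
customers = [[12, 95.0, 10], [1, 40.0, 300], [8, 120.0, 25],
             [2, 35.0, 250], [10, 110.0, 15], [1, 50.0, 320]]

X = StandardScaler().fit_transform(customers)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(segments)  # e.g. frequent high-value vs lapsed low-value customers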

Marketing
OK, so the site has product to sell and knows something about its customers. An obvious next question is how to get more customers or encourage existing customers to purchase more. Here we enter the realm of marketing. Once again, there is lots of scope for data scientists to contribute. This might range from adword-buying optimization, channel-mix optimization (by that I mean print vs web vs TV), ad-retargeting optimization, and SEO. Most e-Commerce sites send out a lot of emails, especially if they are in the flash-sales business. There is a lot of scope for understanding, optimizing and A/B testing subject lines, content, send times and so on. (At One Kings Lane, a home decor flash-sales site, we sent customers up to 17 emails per week.) There is a careful balance between reminding customers about your presence and what you offer, and turning people off by being a nuisance. At many e-Commerce sites, cart abandonment rates reach dizzying heights, and understanding and addressing that can pay rich rewards. Data scientists can often comprise a core part of personalization programs.

Web analytics
Another obvious area where data scientists can contribute is web analytics. How do people come to the site (this relates to SEO, search terms and referrer URL analysis)? What paths do they take? When and where do they bounce? At which stages of the checkout funnel do we lose the most customers? How can we make the experience more frictionless, enjoyable and relevant? Which products are customers typing into our search box that we do not currently, but should, supply? These are all areas which should be covered by a specialized analyst team as the company scales, but where data scientists can help with data munging, visualization, advanced clickstream analysis and A/B testing, as well as contributing data products for the site (personalization and recommender APIs).
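Many of those questions reduce to simple funnel arithmetic before any advanced analysis is needed; a back-of-the-envelope sketch with invented step counts:

# Hypothetical checkout funnel: unique visitors reaching each step
funnel = [("product page", 100000), ("cart", 18000),
          ("shipping info", 9000), ("payment", 7500), ("confirmation", 6800)]

for (step, n), (next_step, m) in zip(funnel, funnel[1:]):
    print("%s -> %s: %.1f%% continue" % (step, next_step, 100.0 * m / n))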


And there you have it: a whirlwind tour of an e-Commerce site from the perspective of data scientists. It is this breadth that makes being a data scientist fun, rewarding and challenging. You get to work with a broad spectrum of partners across the organization, dip into different domains, and make a difference in a variety of ways.

Sunday, November 10, 2013

The data scene in NYC

I love the NYC data scene. It is vibrant, healthy and welcoming.

Eight months ago I left San Francisco and moved to New York. I had come very close to taking an offer in DC and had been struggling to weigh the pros and cons of DC versus NYC. I had previously lived in DC so knew the area; Virginia schools are very good and the cost of living is significantly lower than in the New York area, all important considerations when you have a young family. Warby Parker (my current employer) had set me up to have a chat with DJ Patil, who at the time was data scientist in residence at Greylock Partners. We had an honest and open discussion of these issues and he asked, "That is a great offer but what happens if it doesn't work out? New York has a rich and vibrant data scene. It is going to be stimulating and there are lots of opportunities." He was right, of course.

DC has changed significantly since I left in 2007. There are many more startups and networks for entrepreneurs, and Harlan Harris and co. have done an amazing job bringing together the data community under the umbrella of Data Community DC. The reality, however, is that there are relatively few opportunities. The two big tech companies, AOL and Living Social, are both a mess. There are many other data-related positions if you have security clearance, and the majority of the remaining positions tend to be with small consulting firms that service the government and the Department of Defense.

Contrast this with New York:
  • New York is the place to be for advertising and media. 
  • New York is the place to be for fashion. 
  • New York is the place to be for finance. 
The tech scene is rich too and, while I know that this is a gross oversimplification, there is somehow something very tangible about the startups here. They are more likely to sell real goods and services: physical goods from Etsy, Rent the Runway or Birchbox. In other words, eCommerce is thriving. In addition, under Bloomberg's initiatives, the city is investing heavily in data science and statistics per se, with new institutes and a campus on Roosevelt Island.

At the O'Reilly Strata + Hadoop World conference last week there was an interesting panel, "New York City: a Data Science Mecca." On the panel were Yann LeCun (NYU), Chris Wiggins (HackNY/Columbia) and Deborah Estrin (Cornell NYC Tech). Yann LeCun is the Director of the newly opened Center for Data Science, a multi- and inter-disciplinary research institute that plans to churn out 50 data science Masters students per year as well as host a PhD program, all of which will have strong ties to the local tech scene. Similarly, Columbia's new Institute for Data Science and Engineering will be hiring 30 new faculty, taking up shop in a new 44k-square-foot building and running an industrial affiliate program. Finally, Cornell will be moving to Roosevelt Island in 2015 with a program broader than just data science, one that covers computer science and operations research, all skills that feature in the data science world. The panel made the point that New York is such a great place to be for data because of the density of the ecosystem. On a tiny island with great public transport you have a huge conglomeration of finance, media and advertising companies; other organizations such as Mt Sinai (who recently hired Jeff Hammerbacher, the very person who coined the term "data scientist" with DJ Patil); a suite of world-class universities that are investing in faculty and buildings (highly significant given that land is precious) and have both research and training foci; and finally the city itself. Yann also made the point that the density of jobs and other organizations here makes it easier to attract part-time students.

The data-related Meetup scene is very strong too. (Meetup is based in NYC.) You could go to a packed and interesting data- or data-tech-related meetup almost every night. DataKind is based here too. One of the most prominent data scientists, Hilary Mason (now data scientist in residence at Accel Partners), is based here. Strata just took place last week. During Fashion Week, 596 people attended "Fashion Tech: Demos & Drinks," which showcased the local fashion-related tech companies. I couldn't attend. Why? Because I was attending a DataGotham event, another important data conference that brings the data community together. Later this month is PyData. You get the idea.

NYC is a fruitful mix of data-research, data practitioners and a strong data community. I am happy to be both here and at Warby Parker. Oh, and the team that I would have joined in DC recently imploded. I dodged a bullet. Thanks DJ.

Sunday, September 8, 2013

Matching misspelled brand names -- the easy way

On Friday, Warby Parker's Director of Consumer Insights approached me with some data. They had sent out a survey, one question of which was "list up to five eyeglass brands you are familiar with." She wanted to aggregate the data but, being free text, the answers were riddled with misspellings. Manually, it would take a lot of time and effort to resolve the variants. She asked if there was a better way to standardize them.

With a question like this, your first reaction might be to think of regular expressions. While there are common forms of misspelling such as transposed characters (ie <--> ei) and (un)doubled consonants (Cincinnati, Mississippi), this is not a tenable approach. First, you are not going to capture all of the variants, common and uncommon. Second, brand names are not necessarily dictionary words and so may not follow normal spelling rules.

If we had a set of target brands, we might be able to use edit distance to associate some variants, but "d+g" is very far away from "Dolce & Gabbana". That won't work. Besides, we don't have a limited list of brands. This is an open-ended question and gives rise to open-ended results. For instance, Dale Earnhardt Jr has a line of glasses (who knew?) and it appeared in our results.
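To quantify that point, here is a standard Levenshtein (edit distance) implementation run on the "d+g" example:

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = range(len(b) + 1)
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("d+g", "dolce & gabbana"))  # 13: nearly the whole string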

To get a sense of the problem, here are the variants of just Tommy Hilfiger in our dataset:

tommy helfinger
tommy hf
tommy hildfigers
tommy hilfger
tommy hilfieger
tommy hilfigar
tommy hilfiger
tommy hilfigger
tommy hilfigher
tommy hilfigur
tommy hilfigure
tommy hilfiinger
tommy hilfinger
tommy hilifiger
tommy hillfiger
tommy hillfigger
tommy hillfigur
tommy hillfigure
tommy hillfinger

Even if it kind of worked and you could resolve the easy cases, leaving the remainder to resolve manually, it still would not be desirable. We want to run this kind of survey at regular intervals and I'm lazy: I want to write code for the whole problem once and rerun it multiple times later. Set it and forget it.

This kind of problem is something that search engines contend with all the time. So, what would google do? They have a sophisticated set of algorithms which associate document content, and especially link text, with target websites. They also have a ton of data and can reach out along the long tail of these variants. For my purposes, however, it doesn't matter how they do it but whether I can piggyback off their efforts.

Here was my solution to the problem:

If I type "Diane von Burstenburg" into google, it returns,

Showing results for Diane von Furstenberg

and the top result is for dvf.com. This is precisely the behavior I want. It will map all of those Tommy Hilfiger variants to the same website.

We now have a reasonable approach. What about implementation? Google has really locked down API access. Their old JSON API is deprecated but still available, though limited to 100 queries per day. (I used pygoogle to query it easily with

>>> from pygoogle import pygoogle
>>> g = pygoogle('ray ban')
>>> g.pages = 1
>>> g.get_urls()[0]
u'http://www.ray-ban.com/'

but was shut down by google within minutes.) Even if you wget their results pages, they don't contain the search results, as those are all ajaxed in. I didn't want to pay for API access to their results for a small one-off project, so I went looking elsewhere. DuckDuckGo has a nice JSON API but its results are limited. I didn't feel like parsing Bing's results page. Yahoo (+ BeautifulSoup + urllib) saves the day!

The following works well, albeit slowly due to my rate limiting (sleep for 15 seconds):

from bs4 import BeautifulSoup
import urllib
import time
import urlparse

f_out = open("output_terms.tsv", "w")  # output: term, top link, domain per line
f_in = open("terms.txt", "r")          # input: list of terms, one per line
for line in f_in.readlines():
    term = line.strip()
    try:
        print term
        # URL-encode the quoted term so the exact phrase is searched
        page = urllib.urlopen("http://search.yahoo.com/search?p="
                              + urllib.quote_plus('"' + term + '"'))
        soup = BeautifulSoup(page)
        # The top organic result carries the id "link-1"
        link = soup.find("a", {"id": "link-1"})['href']
        parsed_uri = urlparse.urlparse(link)
        domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        f_out.write(term + "\t" + link + "\t" + domain + "\n")
        time.sleep(15)  # rate limit to be polite to Yahoo
    except TypeError:
        print "ERROR with " + term

f_in.close()
f_out.close()

where terms.txt contained the set of unique terms after I lowercased each term and removed hyphens, ampersands and " and ".
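That normalization is itself only a few lines; a sketch (the toy input list is invented):

raw_terms = ["Dolce & Gabbana", "dolce and gabbana", "Ray-Ban", "ray ban"]

def normalize(term):
    """Lowercase and strip hyphens, ampersands and ' and '."""
    term = term.lower().replace("-", " ").replace("&", " ")
    term = term.replace(" and ", " ")
    return " ".join(term.split())  # collapse repeated whitespace

print(sorted(set(normalize(t) for t in raw_terms)))
# ['dolce gabbana', 'ray ban']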

The scraping code outputs rows such as:

under amour http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armer http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armor http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armour http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
underarmor http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/

It is not perfect by any means. Those Tommy Hilfiger variants result in a mix of tommy.com and tommyhilfiger.com, which is Hilfiger's fault for having confusing / poor SEO. More importantly, about 10% of the results map to wikipedia:

karl lagerfeld http://en.wikipedia.org/wiki/Karl_Lagerfeld

For these, I did

cat output_terms.tsv | grep wikipedia | awk -F '\t' '{print $1}' > wikipediaterms.txt

and reran these through my code using this query instead:

page = urllib.urlopen("http://search.yahoo.com/search?p=" + urllib.quote_plus('"' + term + '"' + " -wikipedia"))

This worked well and Lagerfeld now maps to

karl lagerfeld http://www.karl.com/ http://www.karl.com/

(There are of course still errors: Catherine Deneuve maps to http://www.imdb.com/name/nm0000366/ http://www.imdb.com/, a perfectly reasonable response. I tried queries with '"searchterm" +glasses' for greater context, but the overall results were not great: I got lots of ebay links appearing.)

Now I have a hands-free process that seems to capture most of the variants and has trouble mostly only with low-frequency, seemingly genuinely ambiguous cases. This can easily be run and rerun for future surveys. Laziness for the win! I don't even care if it fails to swap out very uncommon variants because in this case we don't need perfect data. We will aggregate the websites and filter out anything with frequency less than, say, X, so we don't need to worry about the odd terms this process got wrong. In other words, we care most about the first few bars of our ordered histogram.

Data scientists need to be good at lateral thinking. One skill is not to focus too much on the perfect algorithmic solution to a problem but, when possible, to find a quick, cheap and dirty solution that gets you what you want. If that means a simple hack to piggyback off a huge team of search engine engineers and their enormous corpus, so much the better.


Thursday, May 2, 2013

Leading Indicators: a response


In an interesting thought experiment, Mike Loukides and Q Ethan McCallum asked the question 
"If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it?"
I certainly have my list of what I think are great companies and organizations for data science. Quora users have their list too, but how do we know if we don't have first-hand knowledge of working at all these places?

The authors provide two examples to motivate the discussion. The first I consider a kind of negative evidence: in a hotel, if they can't get a club sandwich right (the basics), they are certainly not going to get the details (such as great concierge service) right. That is, a club sandwich is a good indicator or proxy of the hotel's overall quality and level of service. The second example I consider more as positive evidence: if a school has a great music program, then it almost certainly excels in other areas too. Again, the music program is a good proxy for the school's overall program.

With this framework, they reframe the overall question:
"What are the proxies that allow you to evaluate a data science program from the “outside,” on the information that you might be able to cull from company blogs, a job interview, or even a job posting?"
They then list out seven ideas. However, I am not convinced that many of them can be evaluated by an outsider --- except by asking the questions explicitly during a job interview. Let's review them one by one.
Loukides & McCallum #1: Are the data scientists simply human search engines, or do they have real projects that allow them to explore and be curious? If they have management support for learning what can be learned from the organization's data, and if management listens to what they discover, they're accomplishing something significant. If they're just playing Q&A with the company data, finding answers to specific questions without providing any insight, they're not really a data science group.
How would I determine that as an outsider? Those projects would have to be externally visible: on a website, on a blog or published. Or, if the project refers solely to code, it would have to be open source. However, any write-up is unlikely to contain information about the managerial component.
Loukides & McCallum #2: Do the data scientists live in a silo, or are they connected with the rest of the company? In Building Data Science Teams, DJ Patil wrote about the value of seating data scientists with designers, marketers, with the entire product group so that they don't do their work in isolation, and can bring their insights to bear on all aspects of the company.
This is how data scientists should be incorporated into an organization, but unless a job description explicitly says "you will be reporting to the head of {marketing, operations, customer insight, ...}", you will be hard-pressed to know as an outsider.
Loukides & McCallum #3: When the data scientists do a study, is the outcome predetermined by management? Is it OK to say "we don't have an answer" or to come up with a solution that management doesn't like? Granted, you aren't likely to be able to answer this question without insider information.
Self-explanatory
Loukides & McCallum #4: What do job postings look like? Does the company have a mission and know what it's looking for, or are they asking for someone with a huge collection of skills, hoping that they will come in useful? That's a sign of data science cargo culting.
This is certainly a valid point and a primary indicator. A buzzword-filled job ad with little focus may well indicate a company jumping on the bandwagon. Conversely, a larger company with a larger, more well-established team is more likely to have more specialized job descriptions. [examples]
Loukides & McCallum #5: Does management know what their tools are for, or have they just installed Hadoop because it's what the management magazines tell them to do? Can managers talk intelligently to data scientists?
How would I determine that as an outsider? You might be able to probe this during a job interview, though.
Loukides & McCallum #6: What sort of documentation does the group produce for its projects? Like a club sandwich, it's easy to shortchange documentation.
If the project is open source, then yes, good documentation is a great indicator.
Loukides & McCallum #7: Is the business built around the data? Or is the data science team an add-on to an existing company? A data science group can be integrated into an older company, but you have to ask a lot more questions; you have to worry a lot more about silos and management relations than you do in a company that is built around data from the start.
This is the sort of information that you may be able to glean as an interviewee.

I had a few other ideas:
  • Tech talks: if the company is hosting tech talks and attracting well-known and respected speakers from other high quality organizations, this is good positive evidence.
  • Team size: how many other data scientists work there? You can likely find this out from a LinkedIn search. If you are young, starting out, you might prefer a larger team with more potential mentors, better tools and so on. Those with more experience might prefer to blaze a trail.
  • Existing team members: who are they and what is their pedigree? You can check their LinkedIn profiles or personal websites, but there are other ways. For instance, LinkedIn has some great data scientists. How do I know they are good? They tweet relevant information, they write thoughtful posts, they speak at conferences. Their team is highly active in the general data science scene. All this visibility---provided the content is meaningful---is good evidence.
  • Publications: academic journal publications may or may not be a good indicator. There is typically a big gulf between academic systems and toy problems on the one hand, and the mess and noise of real-world systems and data on the other. An algorithm may work great on the static dataset that a grad student has worked with for three years, but it might not scale, or might require far too much parameter tuning, in the real world. There are many exceptions of course. It really depends on the system.
  • Patents: patents coming out of an organization may or may not be a good indicator. They are essentially stale data, as patents reveal the degree of innovation at the company two or more years ago (given the time it takes to process and award them). A strong patent culture might also mean that the IP is locked down, so that you may not be able to discuss systems in development at conferences, publish work, open source the code, etc.
  • Internship program: if the company has a strong internship program, attracting top young talent from top schools, and those interns go on to do good things as data scientists, this is very good evidence.