p-value.info: 2016

Monday, February 8, 2016

The role of model interpretability in data science

This is a cross post of a piece I posted on medium (Feb 1, 2016):

In data science, models can involve abstract features in high dimensional spaces, or they can be more concrete, in lower dimensions, and more readily understood by humans; that is, they are interpretable. What’s the role of interpretable models in data science, especially when working with less technical partners from the business? When, and why, should we favor model interpretability?

The key here is figuring out the audience. Who is going to use the model and to what purpose? Let’s take a simple but specific example. Last week, I was working on a typical cold-start predictive modeling problem for e-commerce: how do you predict initial sales for new products if you’ve never sold them before?

One common approach is to make the most of your existing data. For instance, you can estimate the effect of temporal features, such as launch month or day of week, using your historical data. One can also find similar products that you have sold, making the naïve assumption that the market will respond similarly to this new product, and create predictors based on their historical sales. In this case, I had access to a set of product attributes: frame measurements of Warby Parker glasses. The dataset also happened to contain a set of principal components for those measurements. (If you don’t understand what this means, don’t worry, that is the whole point of this story.) I created a model that contained both readily-understood temporal features and the more abstract principal component #3, “pc03,” which turned out to be a good predictor. However, I decided to rip out pc03 and replace it with the set of raw product attributes. I pruned the model with stepwise regression and was left with a model that had three frame measurement variables instead of just one (pc03). Thus, the final model was more complex for no additional predictive power. However, it was the right approach from a business perspective. Why?

I wasn’t creating the model for my benefit. I was creating it for the benefit of the business, in particular, a team whose job it is to estimate demand and place purchase orders for initial inventory and ongoing replenishment. This team is skilled and knows the products, market, and our customers extremely well but they are not statisticians. This is a really tough task and they need help. Further, ultimately, they — not me — are on the hook for bad estimates. If they under-predict demand and we stock out, we lose those immediate sales and the lifetime value of any potential customers that go elsewhere. If they over-predict, we invest unnecessary CAPEX in purchasing the units, we incur increased warehouse costs, and we may be left with a pile of duds. Thus, the idea of my model is to serve as an additional voice to help them make their decisions. However, they need to trust and understand it, and therein lies the rub. They don’t know what principal component means. It is a very abstract concept. I can’t point to a pair of glasses and show them what it represents because it doesn’t exist like that. However, by restricting the model to actual physical features, features they know very well, they could indeed understand and trust the model. This final model had a very similar prediction error profile — i.e., the model was basically just as good — and it yielded some surprising insights for them. The three measurements the model retained were not the set that they had assumed a priori were most important.

This detailed example was meant to highlight a few reasons why a poorer, more complex but interpretable models might be favored:

Interpretable models can be understood by less-technical (non data scientists) in the business and, importantly, they are often the decision makers. They need to understand and use the model, otherwise the work has no ultimate impact. My job as a data scientist is to maximize impact.
Interpretable models can yield simple direct insights, the sort that they can readily communicate to colleagues.
Interpretable models can help build up trust with those partners and, through repeat engagements, can lay foundations for more sophisticated approaches for the future.
In many cases, the interpretable model is similarly performant to a more complex model anyway. That is, you may not be losing much predictive power.

My team often makes these tradeoffs, i.e., choosing an interpretable model when working in close partnership with a business team where we care more about understanding a system or the customer than pure predictive power. We may use classification trees as we can sit with the business owner and interpret the model together and, importantly, discuss the actionable insights which tie naturally to the decision points, the splits, in those trees. Or, in other cases, we may opt for Naïve Bayes over support vector classifiers, again because the terms of the model are readily understood and communicated.

In all these cases, it represents a calculated decision of how we maximize our impact as a data science team and an understanding of the actual tradeoffs, if there is a loss of performance.

Of course, those tradeoffs do not always land in favor of interpretable models. There are many cases where the best model, as determined byRMSE, F-score and so on, do and should win out irrespective of whether it is interpretable or not. In these cases, performance is more important than understanding. For instance,

Show me the money! The goal is simply to maximize revenue. A hedge fund CEO is probably not going to be worried about how the algorithmic trading models work under the hood if they bring home the bacon.
There is an existing culture of trust and trail of evidence. Amazon has done recommenders well for years. They’ve proved themselves, earned the respect to use the very best models, whatever they are.
The goal is purely model performance for performance sake. Kaggleleaderboards are littered with overly-optimized and overly-complex leaners (although one could argue it is related to 1 through the prize money).
Non-financial stakes are high. I would want a machine-learned heart disease detector to do the best possible job as any type II prediction error could be devastating to the subject. Here, I believe, the confusion matrix performance outweighs a doctor’s need to understand fully how it works.

When choosing a model, think carefully about what one is trying to achieve. Is it simply to minimize a metric, to understand system, or to make inroads to a more data-driven culture within the organization? A model’s total performance is the product of the model predictive performance times the probability that the model will be used. One needs to optimize for both.

Saturday, January 23, 2016

When should I hire a data scientist?

This is a cross-post from a piece I wrote in medium:

Being part of the New York data scene, and especially being part of multiple VC networks, I often get asked to meet and advise early-stage startups and give my perspective on setting up the right data infrastructure and data team. Frequently, I get asked “when should I hire a data scientist?” as the word on the street is that you need a data scientist on staff. Sooner is better than later, right? They are often surprised when I say, “No, not yet. You are not ready.”

The truth is that very early startups often only have a basic data infrastructure in place to support the current business and are not ready to spend precious resources for more advanced analytics and data products such as recommenders. Focus on the foundational data first. Keep the website backend up and running, keep the signups flowing into a database table, instrument the site and track how users use your products. There’s lots to do.

You need some central transactional and analytics store that will at least scale for the next year or two. Be safe and opt for boring technology such as relational databases unless you have good reason otherwise. They are tried and tested. Boring is good. Centralize the data. Build in data quality processes. Create more robust ETLs to marshall the data. The data engineer is going to support the whole business not just analytics so is a good deal. Moreover, they are easier and cheaper to hire than data scientist.

“OK, great. We’ll hire a data engineer. And then we hire a data scientist?”

No, not yet. I would recommend hiring a data analyst first. Why? An early stage startup is probably still feeling out their business model. They are still trying to work out strategically where they should go. They are probably still seeking funding. These activities require getting answers from traditional analytics to help the founders and advisors make the right decisions and to provide the necessary information to investors. Excel will probably suffice for this work — and you can even connect it to a relational database as a data source. A good analyst will take you far. If they know SQL and they can query raw data stores directly, or can do some modeling in R, even better. Importantly for a cash strapped startup, a business analyst is probably only half the price of a good data scientist.

So at this point, you have a reasonable data infrastructure, hopefully some somewhat solid data quality processes, and you’ve met the founders’ basic data needs. Now, we hire a data scientist? Well, maybe. It very much depends on the type of business and whether a data scientist will be a central part of the business model. If you could only hire one more person for the team/rocket ship, who would provide the greatest return? That is, if a central offering of the business is a data product, a data science driven process, such as recommenders, or something similar that provide a competitive advantage then now might indeed be a good time. Maybe not. Maybe you just need another analyst. You need to have a good idea of why you need that data scientist. Don’t get me wrong. I’m very pro data science. I’m a data scientist. However, I do believe that for early stage startups at least, there can be too early for a data scientist. We are not cheap. We need data, and we are not necessarily the best people to be building out the early ETLs to get the data. Others, including software engineers, can probably do a better job, more quickly for less.

One option of course is to outsource. If you have a clear, crisp question you can essentially hand over a dataset to a consulting data scientist and let them at it. Who is going to prepare that dataset, or build an API to get the data, or provide raw access to the database? That’s right: the data engineer that you hired ahead of the data scientist.

By all means hire a data scientist but let them come into an environment where there is data ready to be mined and others to focus on vanilla business intelligence reporting and analysis and free up the data scientist to focus as much as possible on what they are good at: the fun stuff, where that unique blend of business, data, math, stats, and visualization skills can really shine.