Saturday, January 23, 2016

When should I hire a data scientist?

This is a cross-post from a piece I wrote on Medium:

Being part of the New York data scene, and especially being part of multiple VC networks, I often get asked to meet and advise early-stage startups and give my perspective on setting up the right data infrastructure and data team. Frequently, I get asked “when should I hire a data scientist?” as the word on the street is that you need a data scientist on staff. Sooner is better than later, right? Founders are often surprised when I say, “No, not yet. You are not ready.”

The truth is that very early startups often have only a basic data infrastructure in place to support the current business and are not ready to spend precious resources on more advanced analytics and data products such as recommenders. Focus on the foundational data first. Keep the website backend up and running, keep the signups flowing into a database table, instrument the site, and track how users use your products. There’s lots to do.

You need some central transactional and analytics store that will scale for at least the next year or two. Be safe and opt for boring technology such as relational databases unless you have good reason to do otherwise. They are tried and tested. Boring is good. Centralize the data. Build in data quality processes. Create more robust ETLs to marshal the data. This is work for a data engineer, so hire one first. The data engineer is going to support the whole business, not just analytics, so they are a good deal. Moreover, they are easier and cheaper to hire than data scientists.
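To make "build in data quality processes" concrete, here is a minimal sketch of a load step that checks rows before they reach a relational table. It uses Python's built-in `sqlite3` module as a stand-in for whatever boring relational database you choose. The `signups` table, its columns, and the `load_signups` helper are hypothetical names chosen for illustration, and the checks are deliberately simplistic.

```python
import sqlite3

def load_signups(conn, rows):
    """Load (email, signed_up_at) rows, rejecting ones that fail basic checks."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signups ("
        " email TEXT PRIMARY KEY,"
        " signed_up_at TEXT NOT NULL)"
    )
    rejected = []
    for email, signed_up_at in rows:
        # Check 1: required fields must be present and look roughly sane.
        if not email or "@" not in email or not signed_up_at:
            rejected.append((email, signed_up_at))
            continue
        try:
            conn.execute(
                "INSERT INTO signups VALUES (?, ?)",
                (email.lower(), signed_up_at),
            )
        except sqlite3.IntegrityError:
            # Check 2: the primary key rejects duplicate emails.
            rejected.append((email, signed_up_at))
    conn.commit()
    return rejected

# Hypothetical raw input, including a duplicate and two malformed rows.
conn = sqlite3.connect(":memory:")
raw = [
    ("ann@example.com", "2016-01-04"),
    ("bob@example.com", "2016-01-05"),
    ("ANN@example.com", "2016-01-06"),  # duplicate once lower-cased
    ("not-an-email", "2016-01-06"),     # fails the format check
    ("cy@example.com", None),           # missing timestamp
]
rejected = load_signups(conn, raw)
loaded = conn.execute("SELECT COUNT(*) FROM signups").fetchone()[0]
print(f"loaded={loaded} rejected={len(rejected)}")
```

The point is not these particular checks but that rejected rows are counted and kept visible rather than silently dropped or loaded.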

“OK, great. We’ll hire a data engineer. And then we hire a data scientist?”

No, not yet. I would recommend hiring a data analyst first. Why? An early-stage startup is probably still feeling out its business model. The founders are still trying to work out strategically where they should go. They are probably still seeking funding. These activities require answers from traditional analytics to help the founders and advisors make the right decisions and to provide the necessary information to investors. Excel will probably suffice for this work, and you can even connect it to a relational database as a data source. A good analyst will take you far. If they know SQL and can query raw data stores directly, or can do some modeling in R, even better. Importantly for a cash-strapped startup, a business analyst is probably only half the price of a good data scientist.
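As a rough illustration of the kind of question an analyst with SQL can answer straight from the raw store, the sketch below counts signups per month and how many of them were on a paid plan. It runs against an in-memory SQLite database via Python's built-in `sqlite3` module. The `signups` table, its `plan` column, and the sample rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (email TEXT, signed_up_at TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO signups VALUES (?, ?, ?)",
    [
        ("ann@example.com", "2015-11-20", "free"),
        ("bob@example.com", "2015-12-02", "paid"),
        ("cy@example.com", "2015-12-15", "free"),
        ("di@example.com", "2016-01-04", "paid"),
    ],
)

# Signups per month, plus how many of those were on the paid plan.
# In SQLite the comparison plan = 'paid' evaluates to 0 or 1, so SUM counts matches.
monthly = conn.execute(
    """
    SELECT substr(signed_up_at, 1, 7) AS month,
           COUNT(*) AS signups,
           SUM(plan = 'paid') AS paid_signups
    FROM signups
    GROUP BY month
    ORDER BY month
    """
).fetchall()

for month, signups, paid_signups in monthly:
    print(month, signups, paid_signups)
```

A result like this can be pasted into Excel or a slide for investors, which is often all an early-stage company needs.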

So at this point, you have a reasonable data infrastructure, hopefully some reasonably solid data quality processes, and you’ve met the founders’ basic data needs. Now, we hire a data scientist? Well, maybe. It very much depends on the type of business and whether a data scientist will be a central part of the business model. If you could only hire one more person for the team/rocket ship, who would provide the greatest return? That is, if a central offering of the business is a data product, a data-science-driven process such as recommenders, or something similar that provides a competitive advantage, then now might indeed be a good time. Maybe not. Maybe you just need another analyst. You need to have a good idea of why you need that data scientist. Don’t get me wrong. I’m very pro data science. I’m a data scientist. However, I do believe that, for early-stage startups at least, it can be too early for a data scientist. We are not cheap. We need data, and we are not necessarily the best people to be building out the early ETLs to get the data. Others, including software engineers, can probably do a better job, more quickly and for less.

One option, of course, is to outsource. If you have a clear, crisp question, you can essentially hand over a dataset to a consulting data scientist and let them at it. Who is going to prepare that dataset, or build an API to get the data, or provide raw access to the database? That’s right: the data engineer that you hired ahead of the data scientist.

By all means, hire a data scientist, but let them come into an environment where there is data ready to be mined and where others can focus on vanilla business intelligence reporting and analysis. That frees up the data scientist to focus as much as possible on what they are good at: the fun stuff, where that unique blend of business, data, math, stats, and visualization skills can really shine.