Monday, December 3, 2012

What is a data scientist?

Over the last year, there has been a lot of discussion about what data science is and who data scientists are. There are those who think that "data scientist" is simply a glorified, over-hyped term for a business intelligence analyst. On the other hand, there are those who equate it with big data. I think that neither of those interpretations is correct, and here is my take.

I like to think of it in the following manner: imagine a triangle. In one corner are business intelligence (BI) analysts. In a second corner are data engineers and in the third corner are applied statisticians. 

Each corner represents an extreme form of those roles. For BI, it might represent someone who knows a little SQL and does everything else in Excel. There is nothing wrong with this at all; Excel can get you far. My underlying point, though, is that most BI analysts do much more than this: they protrude much further into the triangle. The extreme data engineer might be someone who owns some data processing system, is agnostic about the data flowing through their system and cares solely about data gathering, throughput, scaling, peak loads, logging and so on. Again, most data engineers do much more than this. Finally, the extreme applied statistician might be someone involved in statistical modeling or algorithm development. For instance, they might study recommender systems with toy datasets somewhat in a vacuum from real problems. Again, most statisticians have a broader role and set of skills than this.

While BI analysts are mostly in their corner of the triangle, there are some that overlap significantly with data engineers. They are involved in data gathering, architectural decisions about data storage, processing, visualization solutions and replication. They may query with big data technologies such as Pig, Hive and so on. Some BI analysts may also stray significantly towards the applied statistician corner. For instance, they may be involved in some sort of predictive modeling, significance tests or experimental design, and may use tools such as R.

Data engineers may show some overlap with BI, perhaps implementing tools for data analysis, doing some analyses, and caring about the meaning and value of the data in their systems. They may also work closely with data scientists. For instance, they may take prototypes and algorithms implemented by the data scientists and refactor them or completely reimplement them in a new language to scale. They may understand the core algorithms sufficiently well to improve them for faster performance or better results.

Data scientists are a group of people that cover some central swathe of the triangle: the blue area of the triangle (which is purely illustrative). Within that we might find specializations: the machine learners, the data miners, the recsys guys. For now, let's label them collectively as data scientists.

Some data scientists may stray towards BI and write retrospective analyses and, yes, may even use Excel. They may also work significantly with big data, using MapReduce, Pig, Hive, Kafka, Fluentd etc., and be concerned with robust, scalable, high-throughput systems. They may also be concerned with algorithm development, predictive analytics and a whole lot more. No single person, however, covers all these areas. Each will cover some proportion of the area. Thus, one data scientist may be represented by the orange blob of the triangle while a different data scientist may be represented by the red blob.

This then, I think, is the root of some of the problem here. Job ads for data scientists tend to cover the large blue area. They have requirements or nice-to-haves that cover predictive analytics, recommenders, big data, A/B testing and coding. And, certainly when I interview data scientists, I will often touch upon all of these, especially if they make it to an onsite interview. However, I don't expect any single candidate to be great at, or have experience in, all these areas. If they did, their experience would be so shallow as to be useless. The trick then is to define the area of the triangle you need to cover with your data science team and then hire staff that collectively, not necessarily individually, cover that area.
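To make the "collectively, not individually" idea concrete, here is a minimal sketch in Python. It treats the area you need as a set of skills and checks what a proposed team still leaves uncovered. The skill list is borrowed from the job-ad example above; the candidate labels and their skill sets are invented purely for illustration.

```python
# Skill areas taken from the job-ad list above (an illustrative choice).
REQUIRED = {"predictive analytics", "recommenders", "big data", "A/B testing", "coding"}

# Hypothetical candidates: labels and skill sets are made up for this sketch.
CANDIDATES = {
    "analyst": {"A/B testing", "coding"},
    "engineer": {"big data", "coding"},
    "modeler": {"predictive analytics", "recommenders", "coding"},
}


def team_coverage(team, candidates=CANDIDATES):
    """Return the union of the skills of everyone on the team."""
    covered = set()
    for name in team:
        covered |= candidates[name]
    return covered


def missing_skills(team, required=REQUIRED, candidates=CANDIDATES):
    """Return the required skills that nobody on the team covers."""
    return required - team_coverage(team, candidates)


if __name__ == "__main__":
    print(missing_skills(["analyst", "engineer"]))
    print(missing_skills(["analyst", "engineer", "modeler"]))
```

No single candidate here covers everything, but the three together leave nothing missing, which is the point: evaluate the team's union, not each person against the whole list.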

What then is a data scientist? My favorite definition comes from Josh Wills:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
While inadequate, this definition does capture the essence of the idea that there is a continuum in terms of skills and experience. Data scientists are not at the extremes of the triangle; each person is a bit of this and a bit of that. It also captures the idea that a typical (non-extreme) data scientist must be able to code. It might be in R or MATLAB, but it must be code and not solely equations. Preferable is a scripting language such as Ruby or Python, which allows rapid prototyping and gives you the ability to tap into libraries such as NumPy, SciPy and NLTK. If you can't code, you are unlikely to get too far as a data scientist. Some describe the data scientist's role as finding the needle in the haystack or the gold in the hills. Doing that, however, is going to require some code to automate processing or to learn the parameters of some model.

I think data science is more about building data products and new data-driven features and services. I don't mean shuffling existing data around as web services but generating new data that drives a feature or product that, in turn, drives value to the organization. This might be an internal sales forecast, website personalization or a recommendation. In my role at, some of my time involves image processing, generating entirely new data from product images which can drive new site features such as faceted search or deeper BI analyses.

Data science can involve big data but not necessarily. If you are Facebook, Twitter, or LinkedIn, with a large and very active user base, then it is a necessity. But big data is not a necessary condition for data science. In my last post, we built a very respectable sentiment analyzer with a training set of less than 100 KB. Quality data are preferred but if one has big data to offer, we'll take it.

In summary, data scientists vary. They cover a lot of territory, certainly collectively but not necessarily individually. Some are more statistical, some are more analytical and into data visualization, and some are more computational. While they may overlap with BI and data engineers, they are not the same, are complementary, and do possess a certain novel mix of skills and talents that can provide significant value to their organizations.


  1. Great post! Personally, I'm a "data scientist" who is in the Applied Statistician's corner of your triangle, and trying to gain more of the Data Engineer's skills.

    For small start-ups that may be unable to hire a large data-science team, which part of the triangle is most important? Personally, I think statistics, because you can lean on developers for engineering and marketing for BI, but I'm interested to hear your thoughts!

  2. I definitely think that works in a large company setting - you want people who expect clean data and likely never write code running in production systems, but, make wonderful high-precision models and can juice out that last 3% -- it's competitive advantage.

    The risk with startups is that a statistician can quickly become someone who is productive only when paired with an engineer, and that increases the cost of in-house data science. On the other hand, an engineer who can "get away with stats and ML" is much more useful since they can come up with a solution and MVP it themselves. For most early-stage companies, going 60-70% of the way faster is much, much more valuable than going all the way in even twice as much time.

    I'm definitely curious as to what Carl's thoughts are here.

  3. Tina and Dhruvkaran, like all things it depends. For a startup, the key thing is that someone is looking at the data. You want someone, such as a BI analyst, creating basic reports: how did we do today compared to yesterday, to last month etc. Simple counts, histograms etc. And, you want someone like that thinking about what data should be collected. This may rely on some help from engineering to gather, store, replicate the data and make it accessible for analysis. If you want more actionability in the data then a statistician could make sense: they can also do clustering, feature selection and so on. More generally, for startups, you want generalists who can switch and multitask and get a minimal viable product out of the door and that generally means coding. As such, I tend to agree with Dhruvkaran here. As you scale, you can take on more specialists.

    However, it really depends on the sphere of the startup. If you are a data-product-driven startup such as Prismatic, you probably want data scientists from the start. If you are in e-commerce, you probably want BI analysts first and add statisticians and data scientists later, when you have reached a certain scale or size of revenue and you view data products as the next generation of offerings on your site.

