Over the last year, there has been a lot of discussion about what data science is and who data scientists are. Some think that "data scientist" is simply a glorified, over-hyped term for a business intelligence analyst. Others equate data science with big data. I think that neither of those interpretations is correct, and here is my take.
I like to think of it in the following manner: imagine a triangle. In one corner are business intelligence (BI) analysts. In a second corner are data engineers and in the third corner are applied statisticians.
Each corner represents an extreme form of those roles. For BI, it might represent someone who knows a little SQL and does everything else in Excel. There is nothing wrong with this at all, and Excel can get you far, but my underlying point is that most BI analysts do much more than this: they protrude much further into the triangle. The extreme data engineer might be someone who owns some data processing system, is agnostic about the data flowing through their system and cares solely about data gathering, logging, throughput, scaling, peak loads and so on. Again, most data engineers do much more than this. Finally, the extreme applied statistician might be someone involved in statistical modeling or algorithm development. For instance, they might study recommender systems with toy datasets, somewhat in a vacuum from real problems. Again, most statisticians have a broader role and set of skills than this.
While BI analysts are mostly in their corner of the triangle, some overlap significantly with data engineers. They are involved in data gathering, architectural decisions about data storage, processing, visualization solutions and replication. They may query with big data technologies such as Pig, Hive and so on. Some BI analysts may also stray significantly towards the applied statistician corner. For instance, they may be involved in some sort of predictive modeling, significance tests or experimental design, and may use tools such as R.
Data engineers may show some overlap with BI, perhaps implementing tools for data analysis, doing some analyses themselves and caring about the meaning and value of the data in their systems. They may also work closely with data scientists. For instance, they may take prototypes and algorithms implemented by the data scientists and refactor them, or completely reimplement them in a new language, so that they scale. They may understand the core algorithms sufficiently well to improve them for faster performance or better results.
Data scientists are a group of people that cover some central swathe of the triangle: the blue area of the triangle (which is purely illustrative). Within that we might find specializations: the machine learners, the data miners, the recsys guys. For now, let's label them collectively as data scientists.
Some data scientists may stray towards BI and write retrospective analyses and, yes, may even use Excel. They may also work significantly with big data, using MapReduce, Pig, Hive, Kafka, Fluentd etc., and be concerned with robust, scalable, high-throughput systems. They may also be concerned with algorithm development, predictive analytics and a whole lot more. No single person, however, covers all these areas. Each will cover some proportion of the area. Thus, one data scientist may be represented by the orange blob of the triangle while a different data scientist may be represented by the red blob.
This then, I think, is the root of some of the problem here. Job ads for data scientists tend to cover the large blue area. They have requirements or nice-to-haves that cover predictive analytics, recommenders, big data, A/B testing and coding. And, certainly when I interview data scientists, I will often touch upon all of these, especially if they make it to an onsite interview. However, I don't expect any single candidate to be great at, or have experience in, all these areas. If they did, their experience would be so shallow as to be useless. The trick then is to define the area of the triangle you need to cover with your data science team and then hire staff that collectively, not necessarily individually, cover that area.
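To make the "collectively, not necessarily individually" idea concrete, here is a minimal sketch in Python that treats each area as a label and each team member as a set of labels. The skill names are taken from the job-ad list above, but the candidates and their skills are entirely made up for illustration.

```python
# Hypothetical required areas (from the job-ad list above) and a made-up team.
REQUIRED = {"predictive analytics", "recommenders", "big data", "A/B testing", "coding"}

team = {
    "candidate_a": {"predictive analytics", "A/B testing", "coding"},
    "candidate_b": {"big data", "coding"},
    "candidate_c": {"recommenders", "coding"},
}


def uncovered(required, members):
    """Return the required areas that no team member covers."""
    covered = set().union(*members.values())
    return required - covered


def covers_alone(required, members):
    """Return the names of members who individually cover every required area."""
    return [name for name, skills in members.items() if required <= skills]


print(uncovered(REQUIRED, team))     # set(): the team covers everything together
print(covers_alone(REQUIRED, team))  # []: nobody covers everything alone
```

The point of the sketch is only that the union of the team's skills is what has to match the requirement, not any one person's set.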
What then is a data scientist? My favorite definition comes from Josh Wills:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
While inadequate, this definition does capture the essence of the idea that there is a continuum in terms of skills and experience. Data scientists are not at the extremes of the triangle; each person is a bit of this and a bit of that. It also captures the idea that a typical (non-extreme) data scientist must be able to code. It might be in R or MATLAB, but it must be code and not solely equations. Preferable still is a scripting language such as Ruby or Python that allows rapid prototyping and the ability to tap into libraries such as NumPy, SciPy and NLTK. If you can't code, you are unlikely to get too far as a data scientist. Some describe the data scientist's role as finding the needle in the haystack or the gold in the hills. Doing that, however, is going to require some code to automate processing or to learn the parameters of some model.
I think data science is more about building data products and new data-driven features and services. I don't mean shuffling existing data around as web services but generating new data that drives a feature or product that, in turn, drives value to the organization. This might be an internal sales forecast, website personalization or a recommendation. In my role at onekingslane.com, some of my time involves image processing, generating entirely new data from product images which can drive new site features such as faceted search or deeper BI analyses.
Data science can involve big data but not necessarily. If you are Facebook, Twitter, or LinkedIn with a large and very active user base then it is a necessity. But big data is not a necessary condition for data science. In my last post, we built a very respectable sentiment analyzer with a training set of less than 100 KB. Quality data are preferred but if one has big data to offer, we'll take it.
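As a rough illustration of how little data a simple model needs in order to run at all, here is a minimal sketch of a bag-of-words naive Bayes classifier with add-one smoothing, written in plain Python. It is not the analyzer from that earlier post: the technique is my own choice for the example, and the six training sentences are invented.

```python
import math
from collections import Counter

# Made-up toy training set, for illustration only.
TRAIN = [
    ("great product love it", "pos"),
    ("excellent quality very happy", "pos"),
    ("love the fast delivery", "pos"),
    ("terrible quality very disappointed", "neg"),
    ("awful product hate it", "neg"),
    ("slow delivery and bad support", "neg"),
]


def train(examples):
    """Count words per label, documents per label, and the overall vocabulary."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(classify("love this excellent product", model))   # pos
print(classify("bad quality and slow support", model))  # neg
```

A real analyzer would need far more than six sentences, proper tokenization and a held-out test set, but the structure would stay much the same.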
In summary, data scientists vary. They cover a lot of territory, certainly collectively but not necessarily individually. Some are more statistical, some are more analytical and into data visualization, and some are more computational. While they may overlap with BI analysts and data engineers, they are not the same. The roles are complementary, and data scientists possess a certain novel mix of skills and talents that can provide significant value to their organizations.