p-value.info: May 2013

In an interesting thought experiment, Mike Loukides and Q Ethan McCallum asked the question

"If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it?"

I certainly have my list of what I think are great companies and organizations for data science. Quora users have their list too, but how do we know if we don't have first-hand knowledge of working at all these places?

The authors provide two examples to motivate the discussion. The first I consider a kind of negative evidence: in a hotel, if they can't get a club sandwich right (the basics), they are certainly not going to get the details (such as great concierge service) right. That is, a club sandwich is a good indicator or proxy of the hotel's overall quality and level of service. The second example I consider more as positive evidence: if a school has a great music program, then it almost certainly excels in other areas too. Again, the music program is a good proxy for the school's overall program.

With this framework, they reframe the overall question:

"What are the proxies that allow you to evaluate a data science program from the “outside,” on the information that you might be able to cull from company blogs, a job interview, or even a job posting?"

They then list seven ideas. However, I am not convinced that many of them can be evaluated by an outsider, except by asking the questions explicitly during a job interview. Let's review them one by one.

Loukides & McCallum #1: Are the data scientists simply human search engines, or do they have real projects that allow them to explore and be curious? If they have management support for learning what can be learned from the organization’s data, and if management listens to what they discover, they’re accomplishing something significant. If they’re just playing Q&A with the company data, finding answers to specific questions without providing any insight, they’re not really a data science group.

How would I determine that as an outsider? Those projects would have to be externally visible: on a website, on a blog, or published. Or, if the project refers solely to code, it would have to be open source. However, any write-up is unlikely to contain information about the managerial component.

Loukides & McCallum #2: Do the data scientists live in a silo, or are they connected with the rest of the company? In Building Data Science Teams, DJ Patil wrote about the value of seating data scientists with designers, marketers, with the entire product group so that they don’t do their work in isolation, and can bring their insights to bear on all aspects of the company.

This is how data scientists should be incorporated into an organization, but unless a job description explicitly says "you will be reporting to the head of {marketing,operations,customer insight,...}", you will be hard-pressed to know as an outsider.

Loukides & McCallum #3: When the data scientists do a study, is the outcome predetermined by management? Is it OK to say “we don’t have an answer” or to come up with a solution that management doesn’t like? Granted, you aren’t likely to be able to answer this question without insider information.

Self-explanatory: as the authors themselves concede, this one requires insider information.

Loukides & McCallum #4: What do job postings look like? Does the company have a mission and know what it’s looking for, or are they asking for someone with a huge collection of skills, hoping that they will come in useful? That’s a sign of data science cargo culting.

This is certainly a valid point and a primary indicator. A buzzword-filled job ad with little focus may well indicate a company jumping on the bandwagon. Conversely, a larger company with a larger, better-established team is more likely to have more specialized job descriptions. [examples]

Loukides & McCallum #5: Does management know what their tools are for, or have they just installed Hadoop because it’s what the management magazines tell them to do? Can managers talk intelligently to data scientists?

How would I determine that as an outsider? You might be able to probe it during a job interview, though.

Loukides & McCallum #6: What sort of documentation does the group produce for its projects? Like a club sandwich, it’s easy to shortchange documentation.

If the project is open source, then yes, good documentation is a great indicator.

Loukides & McCallum #7: Is the business built around the data? Or is the data science team an add-on to an existing company? A data science group can be integrated into an older company, but you have to ask a lot more questions; you have to worry a lot more about silos and management relations than you do in a company that is built around data from the start.

This is the sort of information that you may be able to glean as an interviewee.

I had a few other ideas:

Tech talks: if the company is hosting tech talks and attracting well-known and respected speakers from other high quality organizations, this is good positive evidence.
Team size: how many other data scientists work there? You can likely find this out from a LinkedIn search. If you are just starting out, you might prefer a larger team with more potential mentors, better tools and so on. Those with more experience might prefer to blaze a trail.
Existing team members: who are they and what is their pedigree? You can check their LinkedIn profiles or personal websites, but there are other ways. For instance, LinkedIn has some great data scientists. How do I know they are good? They tweet relevant information and write thoughtful posts. They speak at conferences. Their team is highly active in the general data science scene. All this visibility, provided the content is meaningful, is good evidence.
Publications: academic journal publications may or may not be a good indicator. There is typically a big gulf between academic systems and toy problems on the one hand and the mess and noise of real-world systems and data on the other. An algorithm may work great on the static data set that a grad student has worked with for 3 years, but it might not scale or might require far too much parameter tuning in the real world. There are many exceptions, of course. It really depends on the system.
Patents: patents coming out of an organization may or may not be a good indicator. They are essentially stale data, as patents reveal the degree of innovation at the company two or more years ago (given the time it takes to process and award them). A strong patent culture might also mean that the IP is locked down, so that you may not be able to discuss systems in development at conferences, publish work, open source the code, etc.
Internship program: if the company has a strong internship program, attracting top young talent from top schools, and those interns go on to do good things as data scientists, this is very good evidence.

Thursday, May 2, 2013

Leading Indicators: a response