You don’t need a data scientist (yet)

The hype around big data has caused many organisations to hire data scientists without giving much thought to what these data scientists are going to do and whether they’re actually needed. This is a source of frustration for all parties involved. This post discusses some questions you should ask yourself before deciding to hire your first data scientist.

Q1: Do you know what data scientists do?

Somewhat surprisingly, there are quite a few companies that hire data scientists without having a clear idea of what data scientists actually do. People seem to have a fear of missing out on the big data hype, and think of hiring data scientists as the solution. A common misconception is that a data scientist’s role includes telling you what to do with your data. While this may sometimes happen in practice, the ideal scenario is where the business has problems that can be solved using data science (more on this under Q3 below). If you don’t know what your data scientist is going to do, you probably don’t need one.

So what do data scientists do? When you think about it, adding the word “data” to “science” is a bit redundant, as all science is based on data. Following from this, anyone who does any kind of data analysis is a data scientist. While it may be true, this broad definition is not very helpful. As discussed in a previous post, it’s more useful to define data scientists as individuals who combine expertise in statistics and machine learning with strong software engineering skills.

Q2: Do you have enough data available?

It’s not uncommon to see products that suffer from over-engineering and premature investment in advanced analytics capabilities. In the early stages, it’s important to focus on creating a minimum viable product and getting it to market quickly. Data science starts to shine once the product is generating enough data, as most of the power of advanced analytics is in optimising and automating existing processes.

Not having a data scientist in the early stages doesn’t mean the data is being ignored – it just means that it doesn’t require the attention of a full-time data scientist. If your product is at an early stage and you are still concerned, you’re better off hiring a data science consultant for a few days to help lay out the long-term vision for data-driven capabilities. This would be cheaper and less time-consuming than hiring a full-timer. The exception to this rule is when the product itself is built around advanced analytics (e.g., AlchemyAPI or Enlitic). Building such products without data scientists is far from ideal, or just impossible.

Even if your product is mature and generating a lot of data, it doesn’t mean it’s ready for data science. Advanced analytics capabilities are at the top of data’s hierarchy of needs: If your product is buggy, or if your data is scattered everywhere and your platform lacks centralised reporting, you need to first invest in fixing your data plumbing. This is the job of data engineers. Getting data scientists involved when the data is hardly available due to infrastructure issues is likely to lead to frustration. In addition, setting up centralised reporting and dashboarding is likely to give you ideas for problems that data scientists can solve.

Q3: Do you have a specific problem to solve?

If the problem you’re trying to solve is “everyone is doing smart things with data, we should be doing stuff with data too”, you don’t have a specific problem that can be solved by bringing a data scientist on board. Defining the problem often ends up occupying a lot of the data scientist’s time, so you are likely to obtain better results if you have more than just a vague idea around “doing something with data, because Hadoop”. Ideally you want to optimise an existing process that is currently handled with heuristics, make an existing model better, implement a new data-driven feature, or something along these lines. Common examples include reducing churn, increasing conversions, and replacing manual processes with automated data-driven systems. Again, getting advice from experienced data scientists before committing to hiring one may be your best first step.

Q4: Can you get away with heuristics, intuition, and/or manual processes?

Some data scientists would passionately claim that you must deploy only models that are theoretically justified and well-tested. However, in many cases you can get away with using simple heuristics, intuition, and/or manual processes. These can be orders of magnitude cheaper than building sophisticated predictive models and the infrastructure to support them. For many businesses, there are more pressing needs than doing everything in a theoretically sound way. Despite what many technical people like to think, customers don’t tend to care how things are implemented, as long as their needs are fulfilled.

For example, I spent some time with a client whose product includes a semi-manual part where structured data is extracted from documents. Their process included sending some of the documents to a trained team in the Philippines for manual analysis. The client was interested in replacing that manual work with a machine learning algorithm. As is often the case with machine learning, it was unknown whether the resultant model would be accurate enough to completely replace the manual workers. This generally depends on data quality and the feasibility of solving the problem. Assessing the feasibility would have taken some time and money, so the client decided to park the idea and focus on other areas of their business.

Every business has resource constraints. Situations where the best investment you can make is hiring a full-time data scientist are rarer than the hype may lead you to think. It’s often the case that functions that would be the responsibility of a data scientist are adequately performed by existing employees, such as software engineers, business/data analysts, and marketers.

Q5: Are you committed to being data-driven?

I have seen more than one case where data scientists are hired only to be blocked or ignored. This is more prevalent in the corporate world, where managers are often incentivised to prioritise doing things that look good over things that make financial sense. But even if recruitment is done with the best intentions, progress may be blocked by employees who feel threatened because they would be replaced by automated data-driven algorithms. Successful data science projects require support from senior leadership, as discussed by Greta Roberts, Radim Řehůřek, Alec Smith, and many others. Without such support and a strong commitment to making data-driven decisions, everyone is just wasting their time.

Closing thoughts

While data science is currently over-hyped, many organisations still have much to gain from hiring data scientists. I hope that this post has helped you decide whether you need a data scientist right now. If you’re unsure, please don’t hesitate to contact me. And to any data scientists reading this: Be very wary of potential employers who do not have good answers to the above questions. At this point in time you can afford to be picky, at least until the hype is over.



    I enjoyed the post - though I offer some contrary points to consider:

    I have learned that if it is clear (to someone who knows what data scientists do) that you will need a data scientist, then you should get one as soon as possible. Don’t wait. Data scientists work best when they have full context for the problem they are here to solve. Getting them in early allows them to help frame the problem. This framing is critical. If the framing is off, it takes a very long time (sometimes never) to get it back on track. A late-to-the-game data scientist can be too influenced by the existing framing they are given. They tend to think within that box, when in reality, the box was never the right way to approach the problem. Even if they do see outside of it, it can be very difficult to convince the original framers that there is a better way to do things (people can get quite attached to their vision).

    It also can be wise to NOT WAIT till there is data to analyze. Too often, data is an afterthought. It’s important for the data scientist to get in early on the initiative so he or she can help define the needed instrumentation and data acquisition strategy. They can even guide the needs of the data warehouse and other repositories where the newly captured data will reside.

    Further, it is often the case that it is the data scientist that identifies the specific problem to solve. At my company, I estimate that over half of the ideas for new data products, features, and services come from the data science team – not the business. This is intuitive as the data scientists are the folks that are most intimate with the data and are least constrained by what is possible to do with data. Give them business context and they will come up with problems/solutions that no one has thought of.

    Finally, I find heuristics to be dangerous. At best they are suboptimal, and more often than not, they are just plain wrong (those with extensive A/B testing experience can attest to the fact that our intuition fails us again and again). Undoing a bad heuristic can be very painful - in the technical work, the coordination work, and in the resetting of expectations. It’s hard to get people to not walk on a paved path … even if that path is the long way or a dead-end.

    I totally agree with “Q5: Are you committed to being data-driven?”. This comes down to business model and culture. Is your business model one where data science can be the source of strategic differentiation? Is your culture able to support empiricism? The answer to both of these has to be ‘yes’ in order to commit to being data-driven.

    Thank you for your thoughtful comments, Eric!

    I generally agree that it can be beneficial to involve data scientists early on and to avoid thoughtless heuristics, but it all depends on having a supportive data-driven environment and on resource constraints. As mentioned under Q2, getting advice from a data scientist in the early stages of the product is worthwhile, so it may be smart to pay for a few days of consulting, but not necessarily a good idea to hire a full-timer. A lot of it depends on the general product vision.

    Another note regarding heuristics and intuition: While some may be dangerous, you can view many modelling decisions as heuristics. For example, when building a predictive model, you have to make some intuition-driven choices around features (no model uses all the knowledge in the world), learning algorithms and their hyperparameters. You just can’t test everything, so there’s a need for compromises if you aim to ever deliver anything.