I got my first data science job in 2012, the year Harvard Business Review announced data scientist to be the sexiest job of the 21st century. Two years later, I published a post on my then-favourite definition of data science, as the intersection between software engineering and statistics. Unfortunately, that definition became somewhat irrelevant as more and more people jumped on the data science bandwagon – possibly to the point of making data scientist useless as a job title. However, I still call myself a data scientist. Even better – I still get paid for being a data scientist. But what does it mean? What do I actually do here? This article is a short summary of my understanding of the definition of data science in 2018.
It’s not all about machine learning
As I was wrapping up my PhD in 2012, I started thinking about my next steps. I knew I wanted to get back to working in the tech industry, ideally with a small startup. But it wasn’t clear to me how to market myself – my LinkedIn title at the time was “software engineer with a research background”, which is a bit of a mouthful. Around that time I heard about Kaggle and decided to try competing. This went pretty well, and exposed me to the data science community globally and in Melbourne, where I was living at the time. That’s how I first met Adam Neumann, the founder of Giveable, a startup that aimed to recommend gifts based on social networking data. Upon graduating, I joined Giveable as a data scientist. Changing my LinkedIn title quickly led to many other offers, but I was happy to be working on Giveable – I felt fortunate to have found a startup job that was related to my PhD research on recommender systems.
My understanding of data science at the time was heavily influenced by Kaggle and the tech industry. Kaggle was only about predictive modelling competitions back then, and so I believed that data science is about using machine learning to build models and deploy them as part of various applications. I was very comfortable with that definition, having spent my PhD years on several predictive modelling tasks, and having worked as a software engineer prior to that.
Things have changed considerably since 2012. It is now much easier to deploy machine learning models, even without a deep understanding of how they work. Many more people call themselves data scientists, including some who are more focused on data analysis than on building data products. Even Kaggle – which is now owned by Google – has broadened its scope beyond modelling competitions to support other types of analysis. Numerous articles have been published on the meaning of data science in the past six years. We seem to be going towards a broad definition of the field, which includes any type of general data analysis. This trend of broadening the definition may make data scientist somewhat useless as a job title. However, I believe that data science tasks remain useful, as shown by the following definitions.
Recent definitions by Hernán, Hawkins, and Dubossarsky
In a recent article, Hernán et al. classify data science tasks into three types: description, prediction, and causal inference. Like other authors, they argue that causal inference has been neglected by traditional statistics and some scientific disciplines. They claim that the emergence of data science is an opportunity to get causal inference “right”. Further, they emphasise the importance of domain expert knowledge, which is essential in causal inference. Defining data science in this broad manner seems to capture the essence of what the field is about these days. However, purely descriptive tasks are still often performed by data analysts rather than scientists. And the distinction between prediction and causal inference can be a bit fuzzy, especially as the tools for the latter are at a lower level of maturity. In addition, while I agree with Hernán et al. that domain expertise is important, it seems unlikely that this will forever be the case. No one is born an expert – expertise is gained by learning from and interacting with the world. Therefore, it’s plausible that gaining expertise can and will be automated. Further, there are numerous cases where experts were proven to be wrong. For example, it wasn’t so long ago that doctors recommended smoking.
Despite the importance of domain knowledge, one can argue that scientists that specialise in a single domain are not data scientists. In fact, the ability to go beyond one domain and think of data in a more abstract manner is what makes a data scientist. Applying this abstract knowledge often requires some domain expertise or input from domain experts, but most data science techniques are not domain-specific – they can be applied to many different problems. John Hawkins explains this point well in an article titled why all scientists are not data scientists:
Those scientists and statisticians who have focused themselves on understanding the limitations and possibilities of making inferences from experimental data are the ones who are the forerunners to data scientists. They have a skill which transcends the particulars of what it takes to do lab work on cell cultures, or field studies for ecology etc. Their core skill involves thinking about the data involved at an abstracted level. To ask the question “given data with these properties, what conclusions can we draw?”
Finally, according to Eugene Dubossarsky, “there’s only one purpose to data science, and that is to support decisions. And more specifically, to make better decisions. That should be something no one can argue with.” This goal-focused definition is unsurprising, given the fact that Eugene runs a training and consulting business and has been working in the field for over 20 years. I’m not going to argue with him, but to put it all together, we can define data science as a field that deals with description, prediction, and causal inference from data in a manner that is both domain-independent and domain-aware, with the ultimate goal of supporting decisions.
What about AI?
Everyone loves a good buzzword, and these days AI (Artificial Intelligence) is one of the hottest buzzwords. However, despite what some people may try to tell you, AI is unlikely to make data science obsolete any time soon. Following the above definition, as long as there is a need to make decisions based on data, there will be a need for data scientists. This includes decisions that aren’t made by humans, as data scientists are involved in building systems that make decisions autonomously.
The resurgence of AI feels somewhat amusing given my personal experience. One of the reasons I decided to pursue a PhD in natural language processing and personalisation was my interest in what I considered to be AI back in 2008. My initial introduction to the field was through an AI course and a project I did as part of my bachelor’s degree in computer science. However, by the time I graduated from my PhD, saying that I’m an AI expert seemed less useful than calling myself a data scientist. It may be that the field is about to shift again, and that rebranding as an AI expert would be more beneficial (though I’d be doing exactly the same work). Titles are somewhat silly – I’m going to continue working with data to support decisions for as long as there is demand for this kind of work and I continue enjoying it. There is plenty to learn and develop in this area, regardless of buzzwords and sexy titles.
Great set of definitions and path of evolutions here!
There has to be chaos and confusion as it evolves surely, but the consensus as you very well mentioned is decisions. Anything done in the data world, if not leading to decisions is not quite viable in long term.
Thanks for sharing your thoughts, love reading your blog.