# Software commodities are eating interesting data science work

The passage of time makes wizards of us all. Today, any dullard can make bells ring across the ocean by tapping out phone numbers, cause inanimate toys to march by barking an order, or activate remote devices by touching a wireless screen. Thomas Edison couldn’t have managed any of this at his peak—and shortly before his time, such powers would have been considered the unique realm of God.

– Rob Reid, After On

Being a data scientist can sometimes feel like a race against software innovations. Every interesting and useful problem is bound to become a software commodity. My story seems to reflect that: From my first steps in sentiment analysis and topic modelling, through building recommender systems while dabbling in Kaggle competitions and deep learning a few years ago, and to my present-day interest in causal inference. What can one do to remain relevant in such an environment? Read this post to find out.

## Highlights from my past

When I started my PhD in 2009, the plan was to work on sentiment analysis of opinion polls. This got me into applied machine learning using Java and Weka, with which I made some modest contributions to the field. Today, researching sentiment analysis would feel somewhat pointless, given the plethora of sentiment analysis services. Sentiment analysis is a commodity – using it in practice is a software engineering problem.

Moving forward in my PhD, I got into topic modelling. I learned about Bayesian statistics and conjugate priors. I went through the arduous process of solving integrals by hand and coding a custom Gibbs sampler for the models I specified. Today, I probably wouldn’t bother with the maths. Instead, I’d specify the model and let a probabilistic programming tool like pymc3 or Stan handle the rest. Bayesian inference is now a commodity that’s accessible to any hacker.

A part of my PhD thesis that can probably be replaced by a probabilistic programming tool

Towards the end of my PhD in 2012, I got into Kaggle competitions. Back then, it seemed like “real” data science consisted of building and tuning machine learning models – that’s what Kaggle was all about. While I’ve done quite well in those competitions, I’ve come to realise that the utility of fine-tuning machine learning algorithms is quite limited. In reality, problem definition and solution measurement are more challenging and important. Using machine learning in practice is typically an engineering problem: We can use an existing service or package, follow best practices, and have a great solution for most use cases. No research or custom data work is required beyond turning data into features, which is essentially a data engineering problem. In short, solid machine learning solutions are delivered by solid engineers who glue together solid commodity components. Quoting Google’s Rules of Machine Learning:

To make great products: do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.

Most of the problems you will face are, in fact, engineering problems. Even with all the resources of a great machine learning expert, most of the gains come from great features, not great machine learning algorithms. So, the basic approach is:

1. Make sure your pipeline is solid end to end.
2. Start with a reasonable objective.
3. Add common-sense features in a simple way.
4. Make sure that your pipeline stays solid.

Many problems in data “science” are actually engineering problems – described best by the flow on the right (source)

Some of my first jobs as a data scientist in industry involved building recommender systems. With recommender systems, much of the work is on the system around the recommendation algorithm. That is, building a recommender system was always mostly an engineering problem. However, these days we have services like AWS Personalize, which does most of the heavy lifting around recommendation. This makes the deployment of recommender systems a pure engineering problem. Like many other problems, recommender systems have been commodified.

I have not done much with deep learning, but there the general trend is even more apparent: Useful innovations quickly turn into tools. Examples include library evolution from Theano to TensorFlow, and commodified prediction services from companies like Google, Amazon, and Microsoft. If you want to use a deep learning service in your application, you probably don’t need a data scientist or even a machine learning engineer. A solid software engineer who can pick the right tools should be enough.

## How to remain relevant?

So where does this leave us? It seems to be a more general phenomenon. Essentially every problem that requires specialised knowledge and is valuable ends up attracting repeatable solutions that obviate the need for deep thinking and manual work. These solutions are software commodities. Deploying them is a matter of writing some glue code and fitting them into the overall system – an engineering problem. Implementing data science components to compete with commodities may be interesting and fun, but it’s usually a waste of time when there’s a generic solution that is good enough.

As an individual data scientist, what can you do when your speciality becomes a software commodity? I see a few options:

1. Embrace the engineering angle. Become good (or better) at engineering solutions. Be pragmatic. Do what it takes to get the job done. This is probably easier for data scientists like me, who have an engineering background, than for more research/analysis-oriented data scientists. Such data scientists sometimes sneer at engineering work, claiming it’s “fake” data science. Fake or not, solid engineering tools can easily make stubborn data scientists obsolete.
2. Keep building custom solutions even when viable commodities exist. While this may be more fun for the individual, I believe it isn’t a sustainable approach. The cost of building and maintaining custom solutions will typically be higher than the cost of commodity solutions. Insisting on custom solutions seems like a recipe for becoming irrelevant.
3. Keep adapting and moving to non-commodity areas. Some things are easier to automate than others. For example, building a machine learning pipeline when the problem is well-defined is relatively easy, but deciding what features to create typically requires some domain expertise. In addition, new research keeps coming out in areas that are less hot than machine learning. One such area is causal inference, where there are still solutions that are yet to be commodified.
4. Move to the cutting edge. If you want to research novel methods, a “standard” data scientist position may not be for you. Many industry positions are focused on applying proven solutions to a specific organisation. If that doesn’t sound like fun, you’re better off moving to academia or joining a commercial research group.

Are there any other options I don’t see? Let me know in the comments!

# Defining data science in 2018

I got my first data science job in 2012, the year Harvard Business Review announced data scientist to be the sexiest job of the 21st century. Two years later, I published a post on my then-favourite definition of data science, as the intersection between software engineering and statistics. Unfortunately, that definition became somewhat irrelevant as more and more people jumped on the data science bandwagon – possibly to the point of making data scientist useless as a job title. However, I still call myself a data scientist. Even better – I still get paid for being a data scientist. But what does it mean? What do I actually do here? This article is a short summary of my understanding of the definition of data science in 2018.

## It’s not all about machine learning

As I was wrapping up my PhD in 2012, I started thinking about my next steps. I knew I wanted to get back to working in the tech industry, ideally with a small startup. But it wasn’t clear to me how to market myself – my LinkedIn title at the time was “software engineer with a research background”, which is a bit of a mouthful. Around that time I heard about Kaggle and decided to try competing. This went pretty well, and exposed me to the data science community globally and in Melbourne, where I was living at the time. That’s how I first met Adam Neumann, the founder of Giveable, a startup that aimed to recommend gifts based on social networking data. Upon graduating, I joined Giveable as a data scientist. Changing my LinkedIn title quickly led to many other offers, but I was happy to be working on Giveable – I felt fortunate to have found a startup job that was related to my PhD research on recommender systems.

My understanding of data science at the time was heavily influenced by Kaggle and the tech industry. Kaggle was only about predictive modelling competitions back then, and so I believed that data science is about using machine learning to build models and deploy them as part of various applications. I was very comfortable with that definition, having spent my PhD years on several predictive modelling tasks, and having worked as a software engineer prior to that.

Things have changed considerably since 2012. It is now much easier to deploy machine learning models, even without a deep understanding of how they work. Many more people call themselves data scientists, including some who are more focused on data analysis than on building data products. Even Kaggle – which is now owned by Google – has broadened its scope beyond modelling competitions to support other types of analysis. Numerous articles have been published on the meaning of data science in the past six years. We seem to be going towards a broad definition of the field, which includes any type of general data analysis. This trend of broadening the definition may make data scientist somewhat useless as a job title. However, I believe that data science tasks remain useful, as shown by the following definitions.

## Recent definitions by Hernán, Hawkins, and Dubossarsky

In a recent article, Hernán et al. classify data science tasks into three types: description, prediction, and causal inference. Like other authors, they argue that causal inference has been neglected by traditional statistics and some scientific disciplines. They claim that the emergence of data science is an opportunity to get causal inference “right”. Further, they emphasise the importance of domain expert knowledge, which is essential in causal inference. Defining data science in this broad manner seems to capture the essence of what the field is about these days. However, purely descriptive tasks are still often performed by data analysts rather than scientists. And the distinction between prediction and causal inference can be a bit fuzzy, especially as the tools for the latter are at a lower level of maturity. In addition, while I agree with Hernán et al. that domain expertise is important, it seems unlikely that this will forever be the case. No one is born an expert – expertise is gained by learning from and interacting with the world. Therefore, it’s plausible that gaining expertise can and will be automated. Further, there are numerous cases where experts were proven to be wrong. For example, it wasn’t so long ago that doctors recommended smoking.

Despite the importance of domain knowledge, one can argue that scientists that specialise in a single domain are not data scientists. In fact, the ability to go beyond one domain and think of data in a more abstract manner is what makes a data scientist. Applying this abstract knowledge often requires some domain expertise or input from domain experts, but most data science techniques are not domain-specific – they can be applied to many different problems. John Hawkins explains this point well in an article titled why all scientists are not data scientists:

Those scientists and statisticians who have focused themselves on understanding the limitations and possibilities of making inferences from experimental data are the ones who are the forerunners to data scientists. They have a skill which transcends the particulars of what it takes to do lab work on cell cultures, or field studies for ecology etc. Their core skill involves thinking about the data involved at an abstracted level. To ask the question “given data with these properties, what conclusions can we draw?”

Finally, according to Eugene Dubossarsky, “there’s only one purpose to data science, and that is to support decisions. And more specifically, to make better decisions. That should be something no one can argue with.” This goal-focused definition is unsurprising, given the fact that Eugene runs a training and consulting business and has been working in the field for over 20 years. I’m not going to argue with him, but to put it all together, we can define data science as a field that deals with description, prediction, and causal inference from data in a manner that is both domain-independent and domain-aware, with the ultimate goal of supporting decisions.

Everyone loves a good buzzword, and these days AI (Artificial Intelligence) is one of the hottest buzzwords. However, despite what some people may try to tell you, AI is unlikely to make data science obsolete any time soon. Following the above definition, as long as there is a need to make decisions based on data, there will be a need for data scientists. This includes decisions that aren’t made by humans, as data scientists are involved in building systems that make decisions autonomously.

The resurgence of AI feels somewhat amusing given my personal experience. One of the reasons I decided to pursue a PhD in natural language processing and personalisation was my interest in what I considered to be AI back in 2008. My initial introduction to the field was through an AI course and a project I did as part of my bachelor’s degree in computer science. However, by the time I graduated from my PhD, saying that I’m an AI expert seemed less useful than calling myself a data scientist. It may be that the field is about to shift again, and that rebranding as an AI expert would be more beneficial (though I’d be doing exactly the same work). Titles are somewhat silly – I’m going to continue working with data to support decisions for as long as there is demand for this kind of work and I continue enjoying it. There is plenty to learn and develop in this area, regardless of buzzwords and sexy titles.

# Customer lifetime value and the proliferation of misinformation on the internet

## Background: Misleading search results and fake news

While Google tries to filter obvious spam from its index, it still relies to a great extent on popularity to rank search results. Popularity is a function of inbound links (weighted by site credibility), and of user interaction with the presented results (e.g., time spent on a result page before moving on to the next result or search). There are two obvious problems with this approach. First, there are no guarantees that wrong, misleading, or inaccurate pages won’t be popular, and therefore earn high rankings. Second, given Google’s near-monopoly of the search market, if a page ranks highly for popular search terms, it is likely to become more popular and be seen as credible. Hence, when searching for the truth, it’d be wise to follow Abraham Lincoln’s famous warning not to trust everything you read on the internet.

Google is not alone in helping spread misinformation. Following Donald Trump’s recent victory in the US presidential election, many people have blamed Facebook for allowing so-called fake news to be widely shared. Indeed, any popular media outlet or website may end up spreading misinformation, especially if – like Facebook and Google – it mainly aggregates and amplifies user-generated content. However, as noted by John Herrman, the problem is much deeper than clearly-fabricated news stories. It is hard to draw the lines between malicious spread of misinformation, slight inaccuracies, and plain ignorance. For example, how would one classify Trump’s claims that climate change is a hoax invented by the Chinese? Should Twitter block his account for knowingly spreading outright lies?

## Wrong customer value calculation by example

Fortunately, when it comes to customer lifetime value, I doubt that any of the top results returned by Google is intentionally misleading. This is a case where inaccuracies and misinformation result from ignorance rather than from malice. However, relying on such resources without digging further is just as risky as relying on pure fabrications. For example, see this infographic by Kissmetrics, which suggests three different formulas for calculating the average lifetime value of a Starbucks customer. Those three formulas yield very different values (\$5,489, \$11,535, and \$25,272), which the authors then say should be averaged to yield the final lifetime value figure. All formulas are based on numbers that the authors call constants, despite the fact that numbers such as the average customer lifespan or retention rate are clearly not constant in this context (since they’re estimated from the data and used as projections into the future). Indeed, several people have commented on the flaws in Kissmetrics’ approach, which is reminiscent of the Dilbert strip where the pointy-haired boss asks Dilbert to average and multiply wrong data.

My main problem with the Kissmetrics infographic is that it helps feed an illusion of understanding that is prevalent among those with no statistical training. As the authors fail to acknowledge the fact that the predictions produced by the formulas are inaccurate, they may cause managers and marketers to believe that they know the lifetime value of their customers. However, it’s important to remember that all models are wrong (but some models are useful), and that the lifetime value of active customers is unknowable since it involves forecasting of uncertain quantities. Hence, it is reckless to encourage people to use the Kissmetrics formulas without trying to quantify how wrong they may be on the specific dataset they’re applied to.

## Fader and Hardie: The voice of reason

The formula discussed by Fader and Hardie is $CLV = \sum_{t=0}^{T} m \frac{r^t}{(1 + d)^t}$, where $m$ is the net cash flow per period, $r$ is the retention rate, $d$ is the discount rate, and $T$ is the time horizon. The five issues that Fader and Hardie identify are as follows.

1. The true lifetime value is unknown while the customer is still active, so the formula is actually for the expected lifetime value, i.e., $E(CLV)$.
2. Since the summation is bounded, the formula isn’t really for the lifetime value – it is an estimate of value up to period $T$ (which may still be useful).
3. As the summation starts at $t=0$, it gives the expected value of a customer that hasn’t been acquired yet. According to Fader and Hardie, in some cases the formula starts at $t=1$, i.e., it applies only to existing customers. The distinction between the two cases isn’t always made clear.
4. The formula assumes a constant retention rate. However, it is often the case that retention increases with tenure, i.e., customers who have been with the company for a long time are less likely to churn than recently-acquired customers.
5. It isn’t always possible to calculate a retention rate, as the point at which a customer churns isn’t observed for many products. For example, Starbucks doesn’t know whether customers who haven’t made a purchase for a while have decided to never visit Starbucks again, or whether they’re just going through a period of inactivity. Further, given the ubiquity of Starbucks, it is probably safe to assume that all past customers have a non-zero probability of making another purchase (unless they’re physically dead).
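To make issues 2 to 4 concrete, here is a minimal Python sketch of the formula above. The function names and the numbers in the usage lines are hypothetical, chosen only for illustration. The second function is one possible way to drop the constant-retention assumption, not something taken from Fader and Hardie.

```python
def expected_clv(m, r, d, T, start=0):
    """Sum of m * r**t / (1 + d)**t for t = start..T (the formula above)."""
    return sum(m * r**t / (1 + d) ** t for t in range(start, T + 1))


def expected_clv_varying_retention(m, retention_by_period, d):
    """Variant with a separate retention rate for each period.

    retention_by_period[i] is the assumed probability that a customer who is
    active in period i is still active in period i + 1.
    """
    total = 0.0
    survival = 1.0  # probability of still being active in period t
    for t in range(len(retention_by_period) + 1):
        total += m * survival / (1 + d) ** t
        if t < len(retention_by_period):
            survival *= retention_by_period[t]
    return total


# Hypothetical inputs: 100 per period, 80% retention, 10% discount rate.
from_acquisition = expected_clv(100, 0.8, 0.1, T=10)         # starts at t=0
existing_customer = expected_clv(100, 0.8, 0.1, T=10, start=1)  # starts at t=1
longer_horizon = expected_clv(100, 0.8, 0.1, T=1000)

# Retention that increases with tenure, as described in issue 4.
increasing = expected_clv_varying_retention(100, [0.6, 0.8, 0.9], 0.1)
```

The two starting points differ by exactly $m$, and the choice of $T$ changes the result noticeably. Both are easy to lose track of when the formula is quoted without context.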

According to Fader and Hardie, “the bottom line is that there is no ‘one formula’ that can be used to compute customer lifetime value”. Therefore, teaching the above formula (or one of its variants) misleads people into thinking that they know how to calculate the lifetime value of customers. Hence, they advocate going back to the definition of lifetime value as “the present value of the future cashflows attributed to the customer relationship”, and using a probabilistic approach to generate estimates of the expected lifetime value for each customer. This conclusion also appears in a more accessible series of blog posts by Custora, where it is claimed that probabilistic modelling can yield significantly more accurate estimates than naive formulas.

## Getting serious with the lifetimes package

As mentioned above, Fader and Hardie provide Excel implementations of some of their models, which produce individual-level lifetime value predictions. While this is definitely an improvement over using general formulas, better solutions are available if you can code (or have access to people who can do coding for you). For example, using a software package makes it easy to integrate the lifetime value calculation into a live product, enabling automated interventions to increase revenue and profit (among other benefits). According to Roberto Medri, this approach is followed by Etsy, where lifetime value predictions are used to retain customers and increase their value.

An example of a software package that I can vouch for is the Python lifetimes package, which implements several probabilistic models for lifetime value prediction in a non-contractual setting (i.e., where churn isn’t observed – as in the Starbucks example above). This package is maintained by Cameron Davidson-Pilon of Shopify, who may be known to some readers from his Bayesian Methods for Hackers book and other Python packages. I’ve successfully used the package on a real dataset and have contributed some small fixes and improvements. The documentation on GitHub is quite good, so I won’t repeat it here. However, it is worth reiterating that as with any predictive model, it is important to evaluate performance on your own dataset before deciding to rely on the package’s predictions. If you only take away one thing from this article, let it be the reminder that it is unwise to blindly accept any formula or model. The models implemented in the package (some of which were introduced by Fader and Hardie) are fairly simple and generally applicable, as they rely only on the past transaction log. These simple models are known to sometimes outperform more complex models that rely on richer data, but this isn’t guaranteed to happen on every dataset. My untested feeling is that in situations where clean and relevant training data is plentiful, models that use other features in addition to those extracted from the transaction log would outperform the models provided by the lifetimes package (if you have empirical evidence that supports or refutes this assumption, please let me know).
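As a rough illustration of what evaluating on your own dataset can look like, the sketch below splits a transaction log into a calibration period and a holdout period using only the Python standard library. It does not call the lifetimes package. The function name, the toy log, and the exact definitions of frequency, recency, and T are assumptions made for this example, so check them against the package documentation before relying on them.

```python
from collections import defaultdict
from datetime import date


def summarise_transactions(transactions, calibration_end, holdout_end):
    """Summarise (customer_id, purchase_date) pairs per customer.

    Calibration-period fields (definitions assumed for this sketch):
      frequency: number of repeat purchase days (distinct days minus the first)
      recency:   days between the first and last calibration purchase
      T:         days between the first purchase and calibration_end
    holdout_purchases counts purchase days after calibration_end, which is
    what a model fitted on the calibration fields would try to predict.
    Customers whose first purchase falls in the holdout period are skipped.
    """
    calibration = defaultdict(set)
    holdout = defaultdict(set)
    for customer_id, day in transactions:
        if day <= calibration_end:
            calibration[customer_id].add(day)
        elif day <= holdout_end:
            holdout[customer_id].add(day)

    summary = {}
    for customer_id, days in calibration.items():
        first, last = min(days), max(days)
        summary[customer_id] = {
            "frequency": len(days) - 1,
            "recency": (last - first).days,
            "T": (calibration_end - first).days,
            "holdout_purchases": len(holdout[customer_id]),
        }
    return summary


# Hypothetical toy log, for illustration only.
log = [
    ("a", date(2016, 1, 1)),
    ("a", date(2016, 1, 15)),
    ("a", date(2016, 3, 1)),
    ("b", date(2016, 1, 10)),
]
summary = summarise_transactions(log, date(2016, 1, 31), date(2016, 3, 31))
```

Whatever model is used, its predictions for the holdout period can then be compared with `holdout_purchases` to see how wrong it is on this particular dataset.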

## Conclusion: You’re better than that

Accurate estimation of customer lifetime value is crucial to most businesses. It informs decisions on customer acquisition and retention, and getting it wrong can drive a business from profitability to insolvency. The rise of data science increases the availability of statistical and scientific tools to small and large businesses. Hence, there are few reasons why a revenue-generating business should rely on untested customer value formulas rather than on more realistic models. This extends beyond customer value to nearly every business endeavour: Relying on fabrications is not a sustainable growth strategy, there is no way around learning how to be intelligently driven by data, and no amount of cheap demagoguery and misinformation can alter the objective reality of our world.

# If you don’t pay attention, data can drive you off a cliff

You’re a hotshot manager. You love your dashboards and you keep your finger on the beating pulse of the business. You take pride in using data to drive your decisions rather than shooting from the hip like one of those old-school 1950s bosses. This is the 21st century, and data is king. You even hired a sexy statistician or data scientist, though you don’t really understand what they do. Never mind, you can proudly tell all your friends that you are leading a modern data-driven team. Nothing can go wrong, right? Incorrect. If you don’t pay attention, data can drive you off a cliff. This article discusses seven of the ways this can happen. Read on to ensure it doesn’t happen to you.

## 1. Pretending uncertainty doesn’t exist

Last month, your favourite metric was 5.2%. This month, it’s 5.5%. Looks like things are getting better – you must be doing something right! But is 5.5% really different from 5.2%? All things being equal, you should expect some variability in most of your metrics. The values you see are drawn from a distribution of possible values, which means you can’t be certain what value you’ll be seeing next. Fortunately, with more data you would be able to quantify this uncertainty and know which values are more likely. Don’t fear or ignore uncertainty. Embrace and study it, and you’ll be on the right track.
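A minimal sketch of quantifying this uncertainty, assuming the metric is a proportion (say, a conversion rate) and that the monthly user counts are known. The counts below are hypothetical, and the interval uses the simple normal approximation, which is only one of several reasonable choices.

```python
import math


def difference_interval(successes_a, n_a, successes_b, n_b, z=1.96):
    """Approximate 95% interval for the difference p_b - p_a.

    Uses the normal approximation for two independent proportions.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    standard_error = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * standard_error, diff + z * standard_error


# Hypothetical counts: 5.2% then 5.5% of 2,000 users in each month.
small_sample = difference_interval(104, 2000, 110, 2000)

# The same rates measured on 200,000 users in each month.
large_sample = difference_interval(10400, 200000, 11000, 200000)
```

With the smaller counts the interval includes zero, so the apparent improvement is compatible with no change at all. With the larger counts it excludes zero. The same two percentages can therefore support different conclusions depending on how much data sits behind them.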

## 2. Confusing observed and unobserved quantities

Everyone agrees that the future is uncertain. We can generate forecasts with varying degrees of confidence, but we never know for sure what’s going to happen. However, some people tend to ignore uncertainty in forecasts, treating the unobserved future values as comparable to observed present values. For example, marketers often compare customer lifetime value with the cost of acquiring a customer. The problem is that customer lifetime value relies on a prediction of the net profit from a customer (so it’s largely unobserved and uncertain), while the business has much more control and certainty around the cost of acquiring a customer (though it’s not completely known). Treating the two values as if they’re observed and known is risky, as it can lead to major financial losses.

## 3. Thinking that your data is correct

Ask anyone who works with data, and they’ll tell you that it’s always messy. A well-known saying among data scientists is that 80% of the work is data cleaning and the other 20% is complaining about data cleaning. Hence, it’s likely that at least some of the figures you’re relying on to make decisions are somewhat inaccurate. However, it’s important to remember that this doesn’t make the data completely useless. That said, if something looks too good to be true, it probably isn’t true. Finally, it’s highly unlikely that the data is always correct when you like the results and always incorrect when the results aren’t favourable, so don’t use the “guy on the internet said our data isn’t 100% correct” excuse to push back on inconvenient truths.

## 4. Believing that your data is complete

No matter how big you are, your data doesn’t capture everything your customers do. Even Google and the NSA don’t have a full view of what people are up to in the non-digital world, and they can’t completely read our minds (yet). Most businesses have much less data than the big tech companies, and they look a bit silly trying to explain customer behaviour using only the data they have. At the end of the day, you have to work with the data you can access, but never underestimate the effectiveness of obtaining more (relevant) data.

## 5. Measuring the wrong thing

Maybe you recently read an article emphasising the importance of real metrics, like daily active users, as opposed to vanity metrics like number of signups to your service. You therefore decide to track the daily active users of your product. But have you thought about whether this metric is relevant to what you’re trying to achieve? If you run a business like Airbnb, where transactions are inherently infrequent, do you really care if people don’t regularly log in? You probably don’t, as long as they use the product when they actually need it. Measuring and trying to optimise the wrong thing can be very risky. Indeed, deciding on metrics and their measurement can be seen as the hardest parts of data science.

## 6. Not recognising your unconscious incompetence

To quote Bertrand Russell: “One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision.” Not recognising the extent of your ignorance when it comes to data is pretty common among those with no training in the field, which may lead to illusory superiority. This may be exacerbated by the fact that those who do know what they’re doing tend to talk a lot about uncertainty and how there are many things that are simply unknowable. My hope is that this short article will help people graduate from unconscious incompetence, where you don’t even recognise the importance of what you don’t know, to conscious incompetence, where you recognise the need to learn and rely on expert advice.

## 7. Ignoring expert advice

Once you’ve recognised your skill gaps, you may decide to hire a data scientist to help you get more value out of your data. However, despite the hype, data scientists are not magicians. In fact, because of the hype, the definition of data science is so diluted that some people say that the term itself has become useless. The truth is that dealing with data is hard, every organisation is somewhat different, and it takes time and commitment to get value out of data. The worst thing you can do is to hire an expensive expert to help you, and then ignore their advice when their findings are hard to digest. If you’re not ready to work with a data scientist, you might as well save yourself some money and remain in a state of blissful ignorance.

Note: This article is not a portrayal of how things are with my current employer, Car Next Door. Views expressed are my own. In fact, if you want to work at a place where expert advice is acted on and uncertainty is seen as something to be studied rather than ignored, we’re hiring!

# Is Data Scientist a useless job title?

Data science can be defined as either the intersection or union of software engineering and statistics. In recent years, the field seems to be gravitating towards the broader unifying definition, where everyone who touches data in some way can call themselves a data scientist. Hence, while many people whose job title is Data Scientist do very useful work, the title itself has become fairly useless as an indication of what the title holder actually does. This post briefly discusses how we got to this point, where I think the field is likely to go, and what data scientists can do to remain relevant.

## The many definitions of data science

About two years ago, I published a post discussing the definition of data scientist by Josh Wills, as a person who is better at statistics than any software engineer and better at software engineering than any statistician. I still quite like this definition, because it describes me well, as someone with education and experience in both areas. However, to be better at statistics than any software engineer and better at software engineering than any statistician, you have to be truly proficient in both areas, as some software engineers are comfortable running complex experiments, and some statisticians are capable of building solid software. Quite a few people who don’t meet Wills’s criteria have decided they wanted to be data scientists too, expanding the definition to be something along the lines of someone who is better at statistics than some software engineers (who’ve never done anything fancier than calculating a sample mean) and better at software engineering than some statisticians (who can’t code).

In addition to software engineering and statistics, data scientists are expected to deeply understand the domain in which they operate, and be excellent communicators. This leads to the proliferation of increasingly ridiculous Venn diagrams, such as the one by Stephan Kolassa:

The perfect data scientist from Kolassa’s Venn diagram is a mythical sexy unicorn ninja rockstar who can transform a business just by thinking about its problems. A more realistic (and less exciting) view of data scientists is offered by Rob Hyndman:

I take the broad inclusive view. I am a data scientist because I do data analysis, and I do research on the methodology of data analysis. The way I would express it is that I’m a data scientist with a statistical perspective and training. Other data scientists will have different perspectives and different training.

We are comfortable with having medical specialists, and we will go to a GP, endocrinologist, physiotherapist, etc., when we have medical problems. We also need to take a team perspective on data science.

None of us can realistically cover the whole field, and so we specialise on certain problems and techniques. It is crazy to think that a doctor must know everything, and it is just as crazy to think a data scientist should be an expert in statistics, mathematics, computing, programming, the application discipline, etc. Instead, we need teams of data scientists with different skills, with each being aware of the boundary of their expertise, and who to call in for help when required.

Indeed, data science is too broad for any data scientist to fully master all areas of expertise. Despite the misleading name of the field, it encompasses both science and engineering, which is why data scientists can be categorised into two types, as suggested by Michael Hochster:

• Type A (analyst): focused on static data analysis. Essentially a statistician with coding skills.
• Type B (builder): focused on building data products. Essentially a software engineer with knowledge in machine learning and statistics.

Type A is more of a scientist, and Type B is more of an engineer. Many people end up doing both, but it is pretty rare to have an even 50-50 split between the science and engineering sides, as they require different mindsets. This is illustrated by the following diagram, showing the information flow in science and engineering (source).

## Why Data Scientist is a useless job title

Given that a data scientist is someone who does data analysis, and/or a scientist, and/or an engineer, what does it mean for a person to hold a Data Scientist position? It can mean anything, as it depends on the company and industry. A job title like Data Scientist at Company is about as meaningful as Engineer at Organisation, Scientist at Institution, or Doctor at Hospital. It gives you a general idea what the person’s background is, but provides little clue as to what the person actually does on a day-to-day basis.

Don’t believe me? Let’s look at a few examples. Noah Lorang (Basecamp) is OK with mostly doing arithmetic. David Robinson (Stack Overflow) builds machine learning features and internal R packages, and visualises data. Robert Chang (Twitter) helps surface product insights, create data pipelines, run A/B tests, and build predictive models. Rob Hyndman (Monash University) and Jake VanderPlas (University of Washington) are academic data scientists who contribute to major R and Python open-source libraries, respectively. From personal knowledge, data scientists in many Australian enterprises focus on generating reports and building dashboards. And in my current role at Car Next Door I do a little bit of everything, e.g., implement new features, fix bugs, set up data pipelines and dashboards, run experiments, build predictive models, and analyse data.

To be clear, the work done by many data scientists is very useful. The number of decisions made based on arbitrary thresholds and some means multiplied together on a spreadsheet can be horrifying to those of us with minimal knowledge of basic statistics. Having a good data scientist on board can have a transformative effect on a business. But it’s also very easy to end up with ineffective hires working on low-impact tasks if the business has no idea what their data scientists should be doing. This situation isn’t uncommon, given the wide range of activities that may be performed by data scientists, the lack of consensus on the definition of the field, and a general disagreement over who deserves to be called a real data scientist. We need to move beyond the hype towards clearer definitions that would help align the expectations of data scientists with those of their current and future employers.

## It’s time to specialise

Four years ago, I changed my LinkedIn title from software engineer with a research background to data scientist. Various offers started coming my way, and they haven’t stopped since. Many people have done the same. To be a data scientist, you just need to call yourself a data scientist. The dilution of the term means that as a job title, it is useless. Useless terms are unlikely to last, so if you’re seriously thinking of becoming a data scientist, you should also consider specialising. I believe we’ll see the emergence of new specific titles, such as Machine Learning Engineer. In addition, less “sexy” titles, such as Data Analyst, may end up making a comeback. In any case, those of us who invest in building our skills, delivering value in our jobs, and making sure people know about it don’t have much to worry about.

What do you think? Is specialisation inevitable or are generalist data scientists here to stay? Please let me know privately, via Twitter, or in the comments section.

# You don’t need a data scientist (yet)

The hype around big data has caused many organisations to hire data scientists without giving much thought to what these data scientists are going to do and whether they’re actually needed. This is a source of frustration for all parties involved. This post discusses some questions you should ask yourself before deciding to hire your first data scientist.

### Q1: Do you know what data scientists do?

Somewhat surprisingly, there are quite a few companies that hire data scientists without having a clear idea of what data scientists actually do. People seem to have a fear of missing out on the big data hype, and think of hiring data scientists as the solution. A common misconception is that a data scientist’s role includes telling you what to do with your data. While this may sometimes happen in practice, the ideal scenario is where the business has problems that can be solved using data science (more on this under Q3 below). If you don’t know what your data scientist is going to do, you probably don’t need one.

So what do data scientists do? When you think about it, adding the word “data” to “science” is a bit redundant, as all science is based on data. Following from this, anyone who does any kind of data analysis is a data scientist. While it may be true, this broad definition is not very helpful. As discussed in a previous post, it’s more useful to define data scientists as individuals who combine expertise in statistics and machine learning with strong software engineering skills.

### Q2: Do you have enough data available?

It’s not uncommon to see products that suffer from over-engineering and premature investment in advanced analytics capabilities. In the early stages, it’s important to focus on creating a minimum viable product and getting it to market quickly. Data science starts to shine once the product is generating enough data, as most of the power of advanced analytics is in optimising and automating existing processes.

Not having a data scientist in the early stages doesn’t mean the data is being ignored – it just means that it doesn’t require the attention of a full-time data scientist. If your product is at an early stage and you are still concerned, you’re better off hiring a data science consultant for a few days to help lay out the long-term vision for data-driven capabilities. This would be cheaper and less time-consuming than hiring a full-timer. The exception to this rule is when the product itself is built around advanced analytics (e.g., AlchemyAPI or Enlitic). Building such products without data scientists is far from ideal, or just impossible.

Even if your product is mature and generating a lot of data, it doesn’t mean it’s ready for data science. Advanced analytics capabilities are at the top of data’s hierarchy of needs: If your product is buggy, or if your data is scattered everywhere and your platform lacks centralised reporting, you need to first invest in fixing your data plumbing. This is the job of data engineers. Getting data scientists involved when the data is hardly available due to infrastructure issues is likely to lead to frustration. In addition, setting up centralised reporting and dashboarding is likely to give you ideas for problems that data scientists can solve.

### Q3: Do you have a specific problem to solve?

If the problem you’re trying to solve is “everyone is doing smart things with data, we should be doing stuff with data too”, you don’t have a specific problem that can be solved by bringing a data scientist on board. Defining the problem often ends up occupying a lot of the data scientist’s time, so you are likely to obtain better results if you have more than just a vague idea around “doing something with data, because Hadoop”. Ideally you want to optimise an existing process that is currently being solved with heuristics, make an existing model better, implement a new data-driven feature, or something along these lines. Common examples include reducing churn, increasing conversions, and replacing manual processes with automated data-driven systems. Again, getting advice from experienced data scientists before committing to hiring one may be your best first step.

### Q4: Can you get away with heuristics, intuition, and/or manual processes?

Some data scientists would passionately claim that you must deploy only models that are theoretically justified and well-tested. However, in many cases you can get away with using simple heuristics, intuition, and/or manual processes. These can be orders of magnitude cheaper than building sophisticated predictive models and the infrastructure to support them. For many businesses, there are more pressing needs than doing everything in a theoretically sound way. Despite what many technical people like to think, customers don’t tend to care how things are implemented, as long as their needs are fulfilled.

For example, I spent some time with a client whose product includes a semi-manual part where structured data is extracted from documents. Their process included sending some of the documents to a trained team in the Philippines for manual analysis. The client was interested in replacing that manual work with a machine learning algorithm. As is often the case with machine learning, it was unknown whether the resultant model would be accurate enough to completely replace the manual workers. This generally depends on data quality and the feasibility of solving the problem. Assessing the feasibility would have taken some time and money, so the client decided to park the idea and focus on other areas of their business.

Every business has resource constraints. Situations where the best investment you can make is hiring a full-time data scientist are rarer than what the hype may make you think. It’s often the case that functions that would be the responsibility of a data scientist are adequately performed by existing employees, such as software engineers, business/data analysts, and marketers.

### Q5: Are you committed to being data-driven?

I have seen more than one case where data scientists are hired only to be blocked or ignored. This is more prevalent in the corporate world, where managers are often incentivised to prioritise doing things that look good over things that make financial sense. But even if recruitment is done with the best intentions, progress may be blocked by employees who feel threatened because they would be replaced by automated data-driven algorithms. Successful data science projects require support from senior leadership, as discussed by Greta Roberts, Radim Řehůřek, Alec Smith, and many others. Without such support and a strong commitment to making data-driven decisions, everyone is just wasting their time.

### Closing thoughts

While data science is currently over-hyped, many organisations still have much to gain from hiring data scientists. I hope that this post has helped you decide whether you need a data scientist right now. If you’re unsure, please don’t hesitate to contact me. And to any data scientists reading this: Be very wary of potential employers who do not have good answers to the above questions. At this point in time you can afford to be picky, at least until the hype is over.

Almost a year ago, I left my last full-time job and decided to set out on an independent path that includes data science consulting and work on my own projects. The ultimate goal is to generate enough passive income to live comfortably, so that I don’t have to sell my time for money. My five main areas of focus are – in no particular order – personal branding & networking, data science contracting, Bandcamp Recommender, Price Dingo, and marine conservation. This post summarises what I’ve been doing in each of these five areas, including highlights and lowlights. So far, it’s way better than having a “real” job. I hope this post will help others who are on a similar journey (there seem to be more and more of us – I’d love to hear from you).

### Personal branding & networking

Finding clients requires considerably more work than finding a full-time job. As with job hunting, the ideal situation is where people come to you for help, rather than you chasing them. To this end, I’ve been networking a lot, giving talks, writing up posts and working on distributing them. It may be harder than getting a full-time job, but it’s also much more interesting.

Highlights: going viral in China, getting a post featured in KDNuggets
Lowlights: not having enough time to write all the things and meet all the people

### Data science contracting

My goal with contracting/consulting is to have a steady income stream while working on my own projects. As my projects are small enough to be done only by me (with optional outsourcing to contractors), this means I have infinite runway to pursue them. While this is probably not the best way of building a Silicon Valley-style startup that is going to make the world a better place, many others have applied this approach to building a so-called lifestyle business, which is what I want to achieve.

Early on, I realised that doing full-on consulting would be too time consuming, as many clients expect full-time availability. In addition, constantly needing to find new clients means that not much time would be left for work on my own projects. What I really wanted was a stable part-time gig. The first one was with GetUp (who reached out to me following a workshop I gave at General Assembly), where I did some work on forecasting engagement and churn. In parallel, I went through the interview process at DuckDuckGo, which included delivering a piece of work to production. DuckDuckGo ended up wanting me to work full-time (like a few other companies), so last month I started a part-time (three days a week) contract at Commonwealth Bank. I joined a team of very strong data scientists – it looks like it’s going to be interesting.

Highlights: seeing my DuckDuckGo work every time I search for a Python package, the work environment at GetUp
Lowlights: chasing leads that never eventuated

### Bandcamp Recommender (BCRecommender)

I’ve written several posts about BCRecommender, my Bandcamp music recommendation project. While I’ve always treated it as a side-project, it’s been useful in learning how to get traction for a product. It now has thousands of monthly users, and is still growing. My goal for BCRecommender has changed from the original one of finding music for myself to growing it enough to be a noticeable source of traffic for Bandcamp, thereby helping artists and fans. Doing it in side-project mode can be a bit challenging at times (because I have so many other things to do and a long list of ideas to make the app better), but I’ve been making gradual progress and discovering a lot of great music in the process.

Highlights: every time someone gives me positive feedback, every time I listen to music I found using BCRecommender
Lowlights: dealing with Parse issues and random errors

### Price Dingo

The inability to reliably compare prices for many types of products has been bothering me for a while. Unlike general web search, where the main providers rank results by relevance, most Australian price comparison engines still require merchants to pay to even have their products listed. This creates an obvious bias in the results. To address this bias, I created Price Dingo – a user-centric price comparison engine. It serves users with results they can trust by not requiring merchants to pay to have their products listed. Just like general web search engines, the main ranking factor is relevancy to the user. This relevancy is also achieved by implementing Price Dingo as a network of independent sites, each focused on a specific product category, with the first category being scuba diving gear.

Implementing Price Dingo hasn’t been too hard – the main challenge has been finding the time to do it with all the other stuff I’ve been doing. There are still plenty of improvements to be made to the site, but now the main goal is to get enough traction to make ongoing time investment worthwhile. Judging by the experience of Booko’s founder, there is space in the market for niche price comparison sites and apps, so it is just a matter of execution.

Highlights: being able to finally compare dive gear prices, the joys of integrating Algolia
Lowlights: extracting data from messy websites – I’ve seen some horrible things…

### Marine conservation

The first thing I did after leaving my last job was go overseas for five weeks, which included a ten-day visit to Israel (rockets!) and three weeks of conservation diving with New Heaven Dive School in Thailand. Back in Sydney, I joined the Underwater Research Group of NSW, a dive club that’s involved in many marine conservation and research activities, including Reef Life Survey (RLS) and underwater cleanups. With URG, I’ve been diving more than before, and for a change, some of my dives actually do good. I’d love to do this kind of stuff full-time, but there’s a lot less money in getting people to do less stuff (i.e., conservation and sustainability) than in getting them to consume more. The compromise for now is that a portion of Price Dingo’s scuba revenue goes to the Australian Marine Conservation Society, and the plan is to expand this to other charities as more categories are added. Update – May 2015: I decided that this compromise isn’t good enough for me, so I shut down Price Dingo to focus on projects that are more aligned with my values.

Highlights: becoming a certified RLS diver, pretty much every dive
Lowlights: cutting my hand open by falling on rocks on the first day of diving in Thailand

### The future

So far, I’m pretty happy with this not-having-a-job-doing-my-own-thing business. According to The 1000 Day Rule, I still have a long way to go until I get the lifestyle I want. It may even take longer than 1000 days given my decision to not work full-time on a single profitable project, together with my tendency to take more time off than I would if I had a “real” job. But the beauty of this path is that there are no investors breathing down my neck or the feeling of mental rot that comes with a full-time job, so there’s really no rush and I can just enjoy the ride.

# BCRecommender Traction Update

This is the fifth part of a series of posts on my Bandcamp recommendations (BCRecommender) project.
Check out previous posts on the general motivation behind this project, the system’s architecture, the recommendation algorithms, and initial traction planning.

In a previous post, I discussed my plans to apply the Bullseye framework from the Traction Book to BCRecommender, my Bandcamp recommendations project. In that post, I reviewed the 19 traction channels described in the book, and decided to focus on the three most promising ones: blogger outreach, search engine optimisation (SEO), and content marketing. This post discusses my progress to date.

### Goals

My initial traction goals were rather modest: get some feedback from real people, build up steady nonzero traffic to the site, and then increase that traffic to 10+ unique visitors per day. It’s worth noting that I have four other main areas of focus at the moment, so BCRecommender is not getting all the attention I could potentially give it. Nonetheless, I have made good progress towards my goals (the first two have been met, but traffic still fluctuates), and learnt a lot in the process.

### Things that worked

Blogger outreach. The most obvious people to contact are existing Bandcamp fans. It was straightforward to generate a list of prolific fans with blogs, as Bandcamp allows people to populate their profile with a short bio and links to their sites. I worked my way through part of the list, sending each fan an email introducing BCRecommender and asking for their feedback. Each email required some manual work, as the vast majority of people don’t have their email address listed on their Bandcamp profile page. I was careful not to be too spammy, which seemed to work: about 50% of the people I contacted visited BCRecommender, 20% responded with positive feedback, and 10% linked to BCRecommender in some form, with the largest volume of traffic coming from my Hypebot guest post. The problem with this approach is that it doesn’t scale, but the most valuable thing I got out of it was that people like the project and that there’s a real need for it.

Twitter. I’m not sure where Twitter falls as a traction channel. It’s probably somewhere between (micro)blogger outreach and content marketing. However you categorise Twitter, it has been working well as a source of traffic. Simply finding people who may be interested in BCRecommender and tweeting related content has proven to be a rather low-effort way of getting attention, which is great at this stage. I have a few ideas for driving more traffic from Twitter, which I will try as I go.

### Things that didn’t work

Content marketing. I haven’t really spent time doing serious content marketing apart from the Spotlights pilot. My vision for the spotlights was to generate quality articles automatically and showcase music on Bandcamp in an engaging way that helps people discover new artists, even if they don’t have a fan account. However, full automation of the spotlight feature would require a lot of work, and I think that there is lower-hanging fruit that I should focus on first. For example, finding interesting insights in the data and presenting them in an engaging way may be a better content strategy, as it would be unique to BCRecommender. For the spotlights, partnering with bloggers to write the articles may be a better approach than automation.

SEO. I expected BCRecommender to rank higher for “bandcamp recommendations” by now, as a result of my blogger outreach efforts. At the moment, it’s still on the second page for this query on Google, though it’s the first result on Bing and DuckDuckGo. Obviously, “bandcamp recommendations” is not the only query worth ranking for, but it’s very relevant to BCRecommender, and not too competitive (half of the first page results are old forum posts). One encouraging outcome from the work done so far is that my Hypebot guest post does appear on the first page. Nonetheless, I’m still interested in getting more search engine traffic. Ranking higher would probably require adding more relevant content on the site and getting more quality links (basically what SEO is all about).

### Points to improve and next steps

I could definitely do better work on all of the above channels. Contrary to what’s suggested by the Bullseye framework, I would like to put more effort into the channels that didn’t work well. The reason is that I think they didn’t work well because of lack of attention and weak experiments, rather than due to their unsuitability to BCRecommender.

As mentioned above, my main limiting factor is a lack of time to spend on the project. However, there’s no pressing need to hit certain traction milestones by a specific deadline. My stretch goals are to get all Bandcamp fans to check out the project (hundreds of thousands of people), and have a significant portion of them convert by signing up to updates (tens of thousands of people). Getting there will take time. So far I’m finding the process educational and enjoyable, which is a pleasant surprise.

# Applying the Traction Book’s Bullseye framework to BCRecommender

This is the fourth part of a series of posts on my Bandcamp recommendations (BCRecommender) project.
Check out previous posts on the general motivation behind this project, the system’s architecture, and the recommendation algorithms.

Having used BCRecommender to find music I like, I’m certain that other Bandcamp fans would like it too. It could probably be extended to attract a wider audience of music lovers, but for now, just getting feedback from Bandcamp fans would be enough. There are about 200,000 fans that I know of – getting even a fraction of them to use and comment on BCRecommender would serve as a good guide to what’s worth building and improving.

In addition to getting feedback, the personal value for me in getting BCRecommender users is learning some general lessons on traction building. Like many technical people, I like building products and playing with data, but I don’t really enjoy sales and marketing (and that’s an understatement). One of my goals in working independently is forcing myself to get better at the things I’m not good at. To that end, I recently started reading Traction: A Startup Guide to Getting Customers by Gabriel Weinberg and Justin Mares.

The Traction book identifies 19 different channels for getting traction, and suggests a simple framework (named Bullseye) for ranking and quickly exploring the channels. The authors explain that many technical founders tend to focus on traction channels they’re familiar with, and that the effort invested in those channels tends to be rather small compared to the investment in building the product. They rightly note that “Almost every failed startup has a product. What failed startups don’t have is traction – real customer growth.” They argue that following a rigorous approach to gaining traction via their framework is likely to improve a startup’s chances of success. From personal experience, this is very likely to be true.

The key steps in the Bullseye framework are brainstorming ideas for each traction channel, ranking the channels into tiers, prioritising the most promising ones, testing them, and focusing on the channels that work. This is not a one-off process – channel suitability changes over time, and one needs to go through the process repeatedly as the product evolves and traction grows.

Here are the traction channels, listed in the same order as in the book. Each traction channel is marked with a letter denoting its ranking tier from A (most appropriate) to C (unsuitable right now). A short explanation is provided for each channel.

• [B] viral marketing: everyone wants to go viral, but at the moment I don’t have a good-enough understanding of my target audience to seriously pursue this channel.
• [C] public relations (PR): I don’t think that PR would give me access to the kind of focused user group I need at this phase.
• [C] unconventional PR: same as conventional PR.
• [C] search engine marketing (SEM): may work, but I don’t want to spend money at this stage.
• [C] social and display ads: see SEM.
• [C] offline ads: see SEM.
• [A] search engine optimization (SEO): this channel seems promising, as ranking highly for queries such as “bandcamp recommendations” should drive quality traffic that is likely to convert (i.e., play recommendations and sign up for updates). It doesn’t seem like “bandcamp recommendations” is a very competitive query, so it’s definitely worth doing some SEO work.
• [A] content marketing: I think that there’s definitely potential in this channel, since I have a lot of data that can be explored and presented in interesting ways. The problem is creating content that is compelling enough to attract people. I started playing with this channel via the Spotlights feature, but it’s not good enough yet.
• [B] email marketing: BCRecommender already has the subscription feature for retention. At this stage, this doesn’t seem like a viable acquisition channel.
• [B] engineering as marketing: this channel sounds promising, but I don’t have good ideas for it at the moment. This may change soon, as I’m currently reading this chapter.
• [A] targeting blogs: this approach should work for getting high-quality feedback, and help SEO as well.
• [C] business development: there may be some promising ideas in this channel, but only worth pursuing later.
• [C] sales: not much to sell.
• [C] affiliate programs: I’m not going to pay affiliates as I’m not making any money.
• [B] existing platforms: in a way, I’m already building on top of the existing Bandcamp platform. One way of utilising it for growth is by getting fans to link to BCRecommender when it leads to sales (as I’ve done on my fan page), but that would be more feasible at a later stage with more active users.
• [C] trade shows: I find it hard to think of trade shows where there are many Bandcamp fans.
• [C] offline events: probably easier than trade shows (think concerts/indie events), but doesn’t seem worth pursuing at this stage.
• [C] speaking engagements: similar to offline events. I do speaking engagements, and I’m actually going to mention BCRecommender as a case study at my workshop this week, but the intersection between Bandcamp fans and people interested in data science seems rather small.
• [C] community building: this may be possible later on, when there is a core group of loyal users. However, some aspects of community building are provided by Bandcamp and I don’t want to compete with them.
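The ranking step above can be sketched as a small lookup table. This is only an illustration of the tiering, not part of the Traction book or of BCRecommender itself: the `CHANNEL_TIERS` dictionary and the `channels_by_tier` helper are hypothetical names, and the tiers are copied from the list above.

```python
from collections import defaultdict

# Tier per channel, copied from the list above
# (A = most appropriate, C = unsuitable right now).
# The dictionary name and layout are assumptions made for illustration.
CHANNEL_TIERS = {
    "viral marketing": "B",
    "public relations (PR)": "C",
    "unconventional PR": "C",
    "search engine marketing (SEM)": "C",
    "social and display ads": "C",
    "offline ads": "C",
    "search engine optimization (SEO)": "A",
    "content marketing": "A",
    "email marketing": "B",
    "engineering as marketing": "B",
    "targeting blogs": "A",
    "business development": "C",
    "sales": "C",
    "affiliate programs": "C",
    "existing platforms": "B",
    "trade shows": "C",
    "offline events": "C",
    "speaking engagements": "C",
    "community building": "C",
}


def channels_by_tier(tiers):
    """Group channel names by tier, keeping the book's original order."""
    grouped = defaultdict(list)
    for channel, tier in tiers.items():
        grouped[tier].append(channel)
    return dict(grouped)


grouped = channels_by_tier(CHANNEL_TIERS)
for tier in sorted(grouped):
    print(tier, len(grouped[tier]), grouped[tier])
```

Since channel suitability changes over time, re-running the process amounts to editing the tier letters and regrouping.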

Cool, writing everything up explicitly was actually helpful! The next step is to test the three channels that ranked the highest: SEO, content marketing and targeting blogs. I will report the results in future posts.

# Data’s hierarchy of needs

One of my favourite blog posts in recent times is The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps. That post comprehensively describes how abstracting all the data produced by LinkedIn’s various components into a single log pipeline greatly simplified their architecture and enabled advanced data-driven applications. Among the various technical details there are some beautifully-articulated business insights. My favourite one defines data’s hierarchy of needs:

Effective use of data follows a kind of Maslow’s hierarchy of needs. The base of the pyramid involves capturing all the relevant data, being able to put it together in an applicable processing environment (be that a fancy real-time query system or just text files and python scripts). This data needs to be modeled in a uniform way to make it easy to read and process. Once these basic needs of capturing data in a uniform way are taken care of it is reasonable to work on infrastructure to process this data in various ways—MapReduce, real-time query systems, etc.

It’s worth noting the obvious: without a reliable and complete data flow, a Hadoop cluster is little more than a very expensive and difficult to assemble space heater. Once data and processing are available, one can move concern on to more refined problems of good data models and consistent well understood semantics. Finally, concentration can shift to more sophisticated processing—better visualization, reporting, and algorithmic processing and prediction.

In my experience, most organizations have huge holes in the base of this pyramid—they lack reliable complete data flow—but want to jump directly to advanced data modeling techniques. This is completely backwards. [emphasis mine]

Visually, it looks something like this:

In addition, before starting to build a data pipeline, one needs to ensure that the tracked system works as expected. For example, a buggy website is likely to produce weird metrics, which in turn would make the data processing, reporting and predictions unreliable. I completely agree with Jay’s point about needing to get the base of the pyramid right before setting out to do “something with data” (which seems to be the desire of every company nowadays).

The general point is that it’s important to have realistic expectations about what can be obtained by data-driven algorithms and insights. These can only be as good as the underlying data, with the results always depending to a large degree on having a solid infrastructure. Not everything has to be perfect from the start (most things never will be), but some degree of robustness is required to avoid spending too many resources on things that would never work. Trying to apply the latest predictive models without a reliable data infrastructure is like driving a fancy car on broken roads – you’re unlikely to get very far.
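As a rough illustration of the "base first" argument, the hierarchy can be treated as an ordered checklist, where the next thing to work on is always the lowest unmet level. The level names below are a paraphrase of the quote (plus the working-system prerequisite mentioned above), and `HIERARCHY` and `next_need` are hypothetical names, so treat this as a sketch rather than a definitive model.

```python
# Levels ordered from the base of the pyramid upwards.
# The wording is a paraphrase, not an official taxonomy.
HIERARCHY = [
    "tracked system works as expected",
    "relevant data is captured",
    "data is modelled in a uniform way",
    "processing infrastructure is in place",
    "data models and semantics are well understood",
    "sophisticated processing (visualisation, reporting, prediction)",
]


def next_need(satisfied):
    """Return the lowest level that is not yet satisfied, or None if all are."""
    for level in HIERARCHY:
        if level not in satisfied:
            return level
    return None


# An organisation that jumped straight to prediction without capturing data:
status = {
    "tracked system works as expected",
    "sophisticated processing (visualisation, reporting, prediction)",
}
print(next_need(status))
```

Here the sketch points back to data capture, even though the top level is nominally ticked off.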