# Bootstrapping the right way?

Bootstrapping the right way is a talk I gave earlier this year at the YOW! Data conference in Sydney. You can now watch the video of the talk and have a look through the slides. The content of the talk is similar to a post I published on bootstrapping pitfalls, with some additional simulations.

The main takeaways shared in the talk are:

• Don’t compare single-sample confidence intervals by eye
• Use enough resamples (15K?)
• Use a solid bootstrapping package (e.g., Python ARCH)
• Use the right bootstrap for the job
• Consider going parametric Bayesian
• Test all the things

Testing all the things typically requires writing code, which I did for the talk. You can browse through it in this notebook. The most interesting findings from my tests are summarised by the following figure.

The figure shows how the accuracy of confidence interval estimation varies by algorithm, sample size, and the number of bootstrapping resamples on a synthetic revenue dataset. This sort of dataset may occur in freemium scenarios, where several product variations are offered at a few price tiers, including a price of zero (i.e., free). In all cases, the dashed line denotes the requested confidence level of 95%, i.e., the true difference in means between the two revenue distributions should be inside the confidence interval in approximately 95% of the simulations for it to be accurate. Unfortunately, it is clear that both the percentile and BCa algorithms perform poorly on the simulated data. Even with a sample size of 10K, they both yield “95%” confidence intervals that contain the true difference in means less than 90% of the time, i.e., the intervals are too narrow. By contrast, the studentized algorithm gets much closer to the requested confidence level, but this comes at the price of considerably longer runtime due to the need for nested bootstrapping.
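If you’d like to run this sort of check yourself, here’s a minimal sketch of a coverage simulation built on the ARCH package. The freemium-style mixture distribution, sample size, and simulation counts below are illustrative stand-ins, not the actual data or code from the talk:

```python
import numpy as np
from arch.bootstrap import IIDBootstrap

def coverage(n=100, reps=1000, n_sims=500, method="percentile"):
    """Fraction of simulations where a '95%' CI for the difference in
    means contains the true difference (should be close to 0.95)."""
    # Illustrative freemium revenue: mostly free users, a few price tiers
    tiers = np.array([0.0, 5.0, 10.0, 50.0])
    probs_a = np.array([0.80, 0.10, 0.07, 0.03])
    probs_b = np.array([0.75, 0.12, 0.09, 0.04])
    true_diff = tiers @ probs_b - tiers @ probs_a
    hits = 0
    for i in range(n_sims):
        rng = np.random.default_rng(i)  # a fresh seed every iteration
        a = rng.choice(tiers, size=n, p=probs_a)
        b = rng.choice(tiers, size=n, p=probs_b)
        ci = IIDBootstrap(a, b).conf_int(
            lambda x, y: y.mean() - x.mean(), reps=reps, method=method)
        lo, hi = ci.flatten()
        hits += lo <= true_diff <= hi
    return hits / n_sims

print(coverage(method="percentile"))
print(coverage(method="bca"))
```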

Note that the results presented in the talk are slightly different from the figure above. The difference is due to a small bug in the simulation code: I used a constant random seed for all the bootstrapping simulation iterations (every iteration still contained different data). This led to the surprising finding that accuracy with 10,000 resamples was lower than with 1,000 resamples. I attributed that finding to dataset quirks, and noted that my results may not generalise to all cases. Indeed, I recently ran a similar set of experiments on different data as part of my work at Automattic, and found that the studentized algorithm’s accuracy wasn’t as impressive as the results shown here.

In addition to synthetic data, the experiments I ran at Automattic included an implementation of an idea by my colleague, Demet Dagdelen: Test accuracy on samples from the full population for a given period (e.g., all sales over a calendar year). In such cases, the full population is well-defined. Therefore, we know the value of the “true” parameters, and we can run the same simulations as on synthetic data. While I can’t share that data, I can say that all algorithms performed much worse on real data than on simulated data. Therefore, we decided to follow the penultimate takeaway and use a parametric Bayesian approach for modelling our data. We may share insights from that line of work on data.blog in the future. In the meantime, comments are very welcome!

# Hackers beware: Bootstrap sampling may be harmful

Bootstrap sampling techniques are very appealing, as they don’t require much statistical knowledge or reliance on opaque formulas. Instead, all one needs to do is resample the given data many times and calculate the desired statistics. Therefore, bootstrapping has been promoted as an easy way of modelling uncertainty to hackers who don’t have much statistical knowledge. For example, the main thesis of the excellent Statistics for Hackers talk by Jake VanderPlas is: “If you can write a for-loop, you can do statistics”. Similar ground was covered by Erik Bernhardsson in The Hacker’s Guide to Uncertainty Estimates, which provides more use cases for bootstrapping (with code examples). However, I’ve learned in the past few weeks that there are quite a few pitfalls in bootstrapping. Much of what I’ve learned is summarised in a paper titled What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum by Tim Hesterberg. I doubt that many hackers would be motivated to read a paper with such a title, so my goal with this post is to make some of my discoveries more accessible to a wider audience. To learn more about the issues raised in this post, it’s worth reading Hesterberg’s paper and other linked resources.

For quick reference, here’s a summary of the advice in this post:

• Use an accurate method for estimating confidence intervals
• Use enough resamples – at least 10-15K
• Don’t compare confidence intervals visually
• Ensure that the basic assumptions apply to your situation

## Pitfall #1: Inaccurate confidence intervals

Confidence intervals are a common way of quantifying the uncertainty in an estimate of a population parameter. The percentile method is one of the simplest bootstrapping approaches for generating confidence intervals. For example, let’s say we have a data sample of size n and we want to estimate a 95% confidence interval for the population mean. We take r bootstrap resamples from the original data sample, where each resample is a sample with replacement of size n. We calculate the mean of each resample and store the means in a sorted array. We then return the 95% confidence interval as the values that fall at the 0.025r and 0.975r indices of the sorted array (i.e., the 2.5% and 97.5% percentiles). The following table shows what the first two resamples may look like for a data sample of size n = 5.

|        | Original sample | Resample #1 | Resample #2 |
|--------|-----------------|-------------|-------------|
| Values | 10              | 30          | 20          |
|        | 12              | 20          | 20          |
|        | 20              | 12          | 30          |
|        | 30              | 12          | 30          |
|        | 45              | 45          | 30          |
| Mean   | 23.4            | 23.8        | 26.0        |
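In code, the whole procedure takes only a few lines. Here’s a minimal NumPy sketch (the function name and defaults are mine):

```python
import numpy as np

def percentile_ci(sample, stat=np.mean, r=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    # r resamples with replacement, each of size n
    stats = np.sort([stat(rng.choice(sample, size=n, replace=True))
                     for _ in range(r)])
    # the values at the 0.025r and 0.975r indices (for alpha = 0.05)
    return stats[int(alpha / 2 * r)], stats[int((1 - alpha / 2) * r)]

print(percentile_ci([10, 12, 20, 30, 45]))
```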

The percentile method is nice and simple – as the sketch above shows, any programmer should be able to implement it in their favourite programming language, assuming they can actually program. Unfortunately, this method is just not accurate enough for small sample sizes. Quoting Hesterberg:

The sample sizes needed for different intervals to satisfy the “reasonably accurate” (off by no more than 10% on each side) criterion are: n ≥ 101 for the bootstrap t, 220 for the skewness-adjusted t statistic, 2,235 for expanded percentile, 2,383 for percentile, 4,815 for ordinary t (which I have rounded up to 5,000 above), 5,063 for t with bootstrap standard errors and something over 8,000 for the reverse percentile method.

In a shorter version of the paper cited above, Hesterberg concludes that:

In practice, implementing some of the more accurate bootstrap methods is difficult (especially those not described here), and people should use a package rather than attempt this themselves.

In short, make sure you’re using an accurate method for estimating confidence intervals when dealing with sample sizes of less than a few thousand values. Using a package is a great idea, but unfortunately I don’t know of any Python bootstrapping package that is feature-complete: ARCH and scikits-bootstrap support advanced confidence interval methods but don’t support analysis of two samples of uneven sizes, while bootstrapped works with samples of uneven sizes but only supports the percentile and the reverse percentile method (which Hesterberg found to be even less accurate). If you know of any better Python packages, please let me know! (I don’t use R, but I suspect the situation is better there). Update: ARCH now supports analysis of samples of uneven sizes following an issue I reported. It seems to be the best Python bootstrapping package, so I recommend using it.

## Pitfall #2: Not enough resamples

Accurate bootstrap estimates require a large number of resamples. Many code snippets use 1,000 resamples, probably because it looks like a large number. However, seeming large isn’t enough. Quoting Hesterberg again:

For both the bootstrap and permutation tests, the number of resamples needs to be 15,000 or more, for 95% probability that simulation-based one-sided levels fall within 10% of the true values, for 95% intervals and 5% tests. I recommend r = 10,000 for routine use, and more when accuracy matters.

[…]

We want decisions to depend on the data, not random variation in the Monte Carlo implementation. We used r = 500,000 in the Verizon project.

That’s right, half a million resamples! Accuracy mattered in the Verizon case, as the results of the analysis determined whether large penalties were paid or not. In short, use at least 10-15,000 resamples to be safe. Don’t use 1,000.
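To see the Monte Carlo variation for yourself, you can rerun the percentile sketch from Pitfall #1 on the same data with different seeds, and compare how much the interval endpoints move (the skewed test data here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=10, size=100)  # an arbitrary skewed sample

for r in (1_000, 15_000):
    # Same data, different Monte Carlo seeds: only the resampling varies
    cis = np.array([percentile_ci(data, r=r, seed=s) for s in range(20)])
    print(f"r={r:>6}: endpoint standard deviations = {cis.std(axis=0)}")
```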

## Pitfall #3: Comparison of single-sample confidence intervals

Confidence intervals are commonly used to decide if the difference between two samples is statistically significant. Bootstrapping provides a straightforward way of estimating confidence intervals without making assumptions about the way the data was generated. For example, given two samples, we can obtain confidence intervals for the mean of each sample and end up with a plot like this:

When looking at this plot, some people may conclude that the difference between the groups isn’t statistically significant because the confidence intervals overlap. However, overlapping confidence intervals don’t imply a lack of statistical significance because it is possible for the confidence interval of the difference between the sample means to not contain zero. Prasanna Parasurama explained why this happens in this post. While this issue isn’t unique to bootstrapping, it’s worth remembering that when comparing two groups, we need to obtain the confidence interval for the difference in the parameter we’re comparing, not compare single-sample confidence intervals.

For a concrete example, consider a case where we’re looking at binary outcomes (yes/no or 1/0), which occur in coin flips or online A/B tests. Sample A consists of 2,150 zeroes and 350 ones, while sample B consists of 2,250 zeroes and 440 ones. As these are fairly large samples, we can use the bootstrap percentile method to obtain 95% confidence intervals for the mean of each sample. As the following figure shows, these intervals overlap. If we use the same method to also obtain a 95% confidence interval for the difference in means between B and A, we see that it doesn’t include zero. Therefore, we can say that the difference between B and A is statistically significant, despite the overlap between the single-sample confidence intervals.
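Here’s a minimal sketch of that comparison with the percentile method (the counts match the example above; helper names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.concatenate([np.zeros(2150), np.ones(350)])   # mean 0.14
b = np.concatenate([np.zeros(2250), np.ones(440)])   # mean ~0.164

def boot_means(x, r=15_000):
    """Means of r resamples (with replacement) of x."""
    return np.array([rng.choice(x, size=len(x)).mean() for _ in range(r)])

def pct_ci(stats, alpha=0.05):
    """Percentile CI from an array of bootstrapped statistics."""
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

print("A:    ", pct_ci(boot_means(a)))                  # roughly [0.13, 0.15]
print("B:    ", pct_ci(boot_means(b)))                  # roughly [0.15, 0.18]
print("B - A:", pct_ci(boot_means(b) - boot_means(a)))  # excludes zero
```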

It’s worth noting that when analysing binary outcomes, we can make stronger assumptions about the data rather than use bootstrapping to obtain confidence intervals. Erik Bernhardsson suggests using the Beta distribution to obtain single-sample confidence intervals, but as we’ve seen, they don’t tell us enough about the differences between samples. I suggested using a Bayesian approach in the past, which makes explicit modelling assumptions that allow us to encode our prior knowledge on the specific environment where the data was generated. For example, when running online A/B tests, we often have a ballpark figure for reasonable results, which can be used in the Bayesian A/B testing calculator I built.
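For illustration, here’s what the Beta-based single-sample intervals and a posterior comparison look like on the same counts, assuming independent uniform Beta(1, 1) priors (the prior choice is an assumption of mine, not taken from the linked posts):

```python
import numpy as np
from scipy import stats

# Posterior for each rate: Beta(prior + ones, prior + zeroes)
post_a = stats.beta(1 + 350, 1 + 2150)
post_b = stats.beta(1 + 440, 1 + 2250)

print(post_a.interval(0.95))  # single-sample credible interval for A
print(post_b.interval(0.95))  # single-sample credible interval for B

# What we actually care about: Monte Carlo estimate of P(rate_B > rate_A)
rng = np.random.default_rng(2)
draws = 100_000
p = (post_b.rvs(draws, random_state=rng)
     > post_a.rvs(draws, random_state=rng)).mean()
print(f"P(B > A) = {p:.3f}")  # close to 1 for these counts
```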

## Pitfall #4: Unrepresentative and dependent samples

While the basic bootstrap makes no assumption about the underlying distribution of the data, it is not assumption-free. For example, when dealing with correlated data points from a time series, using the basic bootstrapping approach is wrong because it assumes that the data points are independent. Instead, a block bootstrap should be used – see the ARCH package for some implementation examples. In addition, bootstrapping doesn’t solve problems with the underlying sampling approach. For example, the data sample may not be representative of the population because of its small size, or there may be selection biases and measurement errors. No amount of bootstrapping is going to help with such issues. In general, it always helps to be aware of the data’s generation process, e.g., different considerations apply when dealing with data from online experiments versus observational studies.
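As a sketch of what that looks like with ARCH, here’s a stationary block bootstrap applied to an illustrative autocorrelated series (the AR(1) coefficient and mean block size are arbitrary choices):

```python
import numpy as np
from arch.bootstrap import StationaryBootstrap

rng = np.random.default_rng(3)
# An AR(1) series: IID resampling would destroy its autocorrelation
y = np.zeros(500)
for t in range(1, len(y)):
    y[t] = 0.8 * y[t - 1] + rng.normal()

bs = StationaryBootstrap(10, y)  # blocks of ~10 consecutive observations
print(bs.conf_int(np.mean, reps=10_000, method="percentile").flatten())
```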

## Conclusion and next steps

While bootstrapping is a powerful method, its initial impression of simplicity is misleading. To draw valid conclusions, it’s a good idea to use a package and be aware of considerations that are specific to the analysed data sample. However, if you’re already increasing your awareness of the data and its generation process, it may make sense to explicitly encode your assumptions in the model. This is where another hacker resource would come in handy: Probabilistic Programming & Bayesian Methods for Hackers by Cam Davidson-Pilon. Admittedly, it’s a bit longer than the average blog post or conference talk, but it is worth reading.

Going down the bootstrapping rabbit hole has reminded me of an important lesson: Blog posts and talks – especially ones with the word hacker in the title – may be a good starting point, but they shouldn’t be relied on for serious work. Instead, it is better to consult peer-reviewed resources and textbooks, such as the references listed in ARCH’s documentation. In my future explorations of bootstrapping and other methods, I will heed Abraham Lincoln’s timeless advice to not trust everything I read on the internet.

Update (Oct 2019): I published a post summarising a talk I gave on the topic, complete with simulation code that illustrates the issues with some bootstrapping algorithms.

# The most practical causal inference book I’ve read (is still a draft)

I’ve been interested in the area of causal inference in the past few years. In my opinion it’s more exciting and relevant to everyday life than more hyped data science areas like deep learning. However, I’ve found it hard to apply what I’ve learned about causal inference to my work. Now, I believe I’ve finally found a book with practical techniques that I can use on real problems: Causal Inference by Miguel Hernán and Jamie Robins. It is available for free from their site, but is still in draft mode. This post is a short summary of the reasons why I think Causal Inference is a great practical resource.

One of the things that sets Causal Inference apart from other books on the topic is the background of its authors. Hernán and Robins are both epidemiologists, which means they often have to deal with data with strong limitations on sample size and feasibility of experiments. Decisions driven by causal inference in epidemiology can often make the difference between life and death of individuals. Hence, the book is full of practical examples.

The book focuses on randomised controlled trials and well-defined interventions as the basis of causal inference from both experimental and observational data. As the authors show, even with randomised experiments, the analysis often requires using observational causal inference tools due to factors like selection and measurement biases. Their insistence on well-defined interventions is particularly refreshing, as one of the things that bothers me about the writings of Judea Pearl (a prominent researcher of causal inference) is the vagueness of statements like “smoking causes cancer” and “mud doesn’t cause rain”. The need for well-defined interventions was summarised by Hernán in the article Does water kill? A call for less casual causal inferences.

Unlike some other resources, Causal Inference doesn’t appear to be too dogmatic about the framework used for modelling causality. I’m not an expert on where each idea originated, but it seems like the authors mix elements from the potential outcomes framework and from Pearl’s graphical models. They also don’t neglect time as an important consideration in cause-and-effect relationships. In fact, the third part of the book is dedicated to the topic of time-varying treatments and effects.

The practicality of the book is also demonstrated by the fact that it comes with code examples in multiple languages. In addition, the authors don’t dwell too much on the philosophy of causality. While it is a fascinating topic, the opening paragraphs of the book make its goals clear:

By reading this book you are expressing an interest in learning about causal inference. But, as a human being, you have already mastered the fundamental concepts of causal inference. You certainly know what a causal effect is; you clearly understand the difference between association and causation; and you have used this knowledge constantly throughout your life. In fact, had you not understood these causal concepts, you would have not survived long enough to read this chapter–or even to learn to read. As a toddler you would have jumped right into the swimming pool after observing that those who did so were later able to reach the jam jar. As a teenager, you would have skied down the most dangerous slopes after observing that those who did so were more likely to win the next ski race. As a parent, you would have refused to give antibiotics to your sick child after observing that those children who took their medicines were less likely to be playing in the park the next day.

Since you already understand the definition of causal effect and the difference between association and causation, do not expect to gain deep conceptual insights from this chapter. Rather, the purpose of this chapter is to introduce mathematical notation that formalizes the causal intuition that you already possess. Make sure that you can match your causal intuition with the mathematical notation introduced here. This notation is necessary to precisely define causal concepts, and we will use it throughout the book.

I won’t try to summarise the technical aspects of the book – partly because I don’t fully understand it all, and partly because the book itself is already a summary of a very rich research area. However, I’m likely to go back and reread the book in the future, with the goal of applying the techniques from the book to my work. I’d also like to take Hernán’s causal inference course as a way of practising what I’ve learned from the book. For people who want a non-technical summary of the topics covered by the book, I recommend the article The c-word: Scientific euphemisms do not improve causal inference from observational data. If you’re curious about other (less practical) causality books I’ve read, check out my causal inference reading list and my two previous posts on the topic: Why you should stop worrying about deep learning and deepen your understanding of causality instead and Diving deeper into causality: Pearl, Kleinberg, Hill, and untested assumptions.

# Reflections on remote data science work

It’s been about a year and a half since I joined Automattic as a remote data scientist. This is the longest I’ve been in one position since finishing my PhD in 2012. This is also the first time I’ve worked full-time with a fully-distributed team. In this post, I briefly discuss some of the top pluses and minuses of remote work, based on my experience so far.

## + Flexible hours – Potentially boundless work

By far, one of the top perks of remote work with a distributed team is truly flexible hours. I only have one or two synchronous meetings a week, and in the rest of my time I’m free to work the hours I prefer. No one expects me to be online at specific times, as long as the work gets done and I respond to pings within a reasonable time. As I’m a morning person, this means that I typically work a few hours in the early morning, take a long break (e.g., to surf or run some errands), and then work a few more hours in the afternoon or early evening.

The potential downside of such flexibility is not being able to stop working, especially as most of my colleagues are in Europe and North America. I deal with this by avoiding all work communications during my designated non-work hours. For example, I don’t have any work-related apps on my phone, I keep all my work tabs in a separate tab group, and I turn Slack off when I’m not working. I found that this approach sets enough of a boundary between my work and personal life, though I do end up thinking about work problems outside work hours occasionally.

## + More time for non-work activities – There’s never enough time!

Not commuting freed up the equivalent of a workday in my schedule. In addition, having flexible hours means that I can make time in the middle of the day for leisure activities like surfing and diving. However, it’s still a full-time job, so I’m not completely free to pursue non-work activities. It often feels like there isn’t enough time in the day, as I can always think of more stuff I’d like to do. But my current situation is much better than having to commute on a daily basis. Even though it’s been a relatively short time, I find the idea of going back to full-time office work hard to imagine.

## + No need to attend an office – Possible isolation from colleagues (and the real world)

Offices – especially open-plan offices – are not great places to get work done. This is definitely the case with work that requires a high level of concentration over uninterrupted blocks of time, like coding and data analysis. Working from home is great for avoiding distractions – there’s no need for silly horse blinders here (though I do enjoy looking at the bird and lizard action outside my window).

One good thing about offices is the physical availability of colleagues. It’s easy to ask others for feedback, socialise over drinks or shared meals, and keep up to date with company politics. Automattic works around the lack of daily physical interaction by running a few meetups a year. The number of people attending a meetup can vary from a handful for team meetups, to hundreds for the annual Grand Meetup. In all cases, the idea is to bring employees together for up to a week at a time to work and socialise. In my experience, the everyday distance creates a craving to attend meetups. I’ve never worked in a place where co-workers were so enthusiastic about spending so much time together – with non-distributed companies, team building is often seen as a chore. I suppose that the physical distance makes us appreciate the opportunity to be together and make the most of this precious time – it’s a bit like being in a long-distance relationship.

That said, most of the time, isolation can be a problem. As I’m based in Australia, I probably feel it more than others – most of my teammates are offline during my work hours, which means that there’s no one to chat with on Slack. This isn’t a huge issue, but I do need to ensure I get enough social interaction through other avenues. As the jobs page of Bandcamp (another distributed company) used to say: “If you do not have a strong social structure outside of work then employment at Bandcamp will likely lead to heart disease and an early death. We’re hiring!”

## + Most communication is written – Information overload

As Automattic is a fully-distributed company, most of the communication is done in writing. The main tools are Slack and internal forums called P2s (emails are rarely used). This makes catching up on the latest company news easy in comparison to places that rely more heavily on synchronous meetings. The downside of so much written communication is potential information overload. It is impossible to follow all the P2 posts, and even keeping up with stuff I should know can sometimes be overwhelming. I especially feel it in the mornings, as most of my colleagues work while I’m sleeping. Therefore, catching up on everything that happened overnight and responding to pings often takes over an hour – things are rarely as I left them when I last logged off. I experience this same feeling of being overwhelmed when coming back from vacation. Depending on the length of time away, it can take days to catch up. On the plus side, this process doesn’t rely on someone filling me in – it’s all there for me to read.

## + Free trips around the world – Jet lag and flying

As noted above, Automatticians meet in person a few times a year. Since joining, I attended meetups in Montreal, Whistler, Playa del Carmen, Bali, and Orlando. In some cases, I used the opportunity for personal trips near the meetup locations. Such trips can be a lot of fun. However, the obvious downside when travelling from Australia is that getting to meetups usually involves days of jetlag and long flights (e.g., the 17-hour Dallas to Sydney trip). Nonetheless, I still enjoy the travel opportunities. For example, I doubt I would have ever visited Florida and snorkelled with manatees if it wasn’t for Automattic.

## + Exposure to diverse opinions and people – Cultural differences can pose challenges

Australia’s population is made up of many migrants, especially in the tech industry. However, all such migrants have some familiarity with Australian culture and values. The composition of Automattic’s workforce is even more diverse, and it lacks the unifying factor of everyone choosing to live in the same place. This is mostly positive, as I find the exposure to a diverse set of people interesting, and everyone tends to be friendly, welcoming, and focused on the work rather than on cultural differences. However, it’s important to be aware of differences in communication styles. There’s also a wider range of cultural sensitivities than when working with a more homogeneous group. Still, I haven’t found it to be much of an issue, possibly because I’m already used to being a migrant. For example, moving to Australia from Israel required some adjustment of my communication style to be less direct.

## Closing words

Overall, I like working with Automattic. For me, the positives outweigh the negatives, as evidenced by the fact that it’s the longest I’ve been in one position since 2012. Doing remote data science work doesn’t seem particularly different to doing any other sort of non-physical work remotely. I hope that more companies will join Automattic and the growing list of remote companies, and offer their employees the option to work from wherever they’re most productive.

Update (March 2019): I also covered similar topics in a Data Science Sydney talk about a day in the life of a remote data scientist.

# Defining data science in 2018

I got my first data science job in 2012, the year Harvard Business Review announced data scientist to be the sexiest job of the 21st century. Two years later, I published a post on my then-favourite definition of data science, as the intersection between software engineering and statistics. Unfortunately, that definition became somewhat irrelevant as more and more people jumped on the data science bandwagon – possibly to the point of making data scientist useless as a job title. However, I still call myself a data scientist. Even better – I still get paid for being a data scientist. But what does it mean? What do I actually do here? This article is a short summary of my understanding of the definition of data science in 2018.

## It’s not all about machine learning

As I was wrapping up my PhD in 2012, I started thinking about my next steps. I knew I wanted to get back to working in the tech industry, ideally with a small startup. But it wasn’t clear to me how to market myself – my LinkedIn title at the time was “software engineer with a research background”, which is a bit of a mouthful. Around that time I heard about Kaggle and decided to try competing. This went pretty well, and exposed me to the data science community globally and in Melbourne, where I was living at the time. That’s how I first met Adam Neumann, the founder of Giveable, a startup that aimed to recommend gifts based on social networking data. Upon graduating, I joined Giveable as a data scientist. Changing my LinkedIn title quickly led to many other offers, but I was happy to be working on Giveable – I felt fortunate to have found a startup job that was related to my PhD research on recommender systems.

My understanding of data science at the time was heavily influenced by Kaggle and the tech industry. Kaggle was only about predictive modelling competitions back then, and so I believed that data science is about using machine learning to build models and deploy them as part of various applications. I was very comfortable with that definition, having spent my PhD years on several predictive modelling tasks, and having worked as a software engineer prior to that.

Things have changed considerably since 2012. It is now much easier to deploy machine learning models, even without a deep understanding of how they work. Many more people call themselves data scientists, including some who are more focused on data analysis than on building data products. Even Kaggle – which is now owned by Google – has broadened its scope beyond modelling competitions to support other types of analysis. Numerous articles have been published on the meaning of data science in the past six years. We seem to be going towards a broad definition of the field, which includes any type of general data analysis. This trend of broadening the definition may make data scientist somewhat useless as a job title. However, I believe that data science tasks remain useful, as shown by the following definitions.

## Recent definitions by Hernán, Hawkins, and Dubossarsky

In a recent article, Hernán et al. classify data science tasks into three types: description, prediction, and causal inference. Like other authors, they argue that causal inference has been neglected by traditional statistics and some scientific disciplines. They claim that the emergence of data science is an opportunity to get causal inference “right”. Further, they emphasise the importance of domain expert knowledge, which is essential in causal inference. Defining data science in this broad manner seems to capture the essence of what the field is about these days. However, purely descriptive tasks are still often performed by data analysts rather than scientists. And the distinction between prediction and causal inference can be a bit fuzzy, especially as the tools for the latter are at a lower level of maturity. In addition, while I agree with Hernán et al. that domain expertise is important, it seems unlikely that this will forever be the case. No one is born an expert – expertise is gained by learning from and interacting with the world. Therefore, it’s plausible that gaining expertise can and will be automated. Further, there are numerous cases where experts were proven to be wrong. For example, it wasn’t so long ago that doctors recommended smoking.

Despite the importance of domain knowledge, one can argue that scientists that specialise in a single domain are not data scientists. In fact, the ability to go beyond one domain and think of data in a more abstract manner is what makes a data scientist. Applying this abstract knowledge often requires some domain expertise or input from domain experts, but most data science techniques are not domain-specific – they can be applied to many different problems. John Hawkins explains this point well in an article titled why all scientists are not data scientists:

Those scientists and statisticians who have focused themselves on understanding the limitations and possibilities of making inferences from experimental data are the ones who are the forerunners to data scientists. They have a skill which transcends the particulars of what it takes to do lab work on cell cultures, or field studies for ecology etc. Their core skill involves thinking about the data involved at an abstracted level. To ask the question “given data with these properties, what conclusions can we draw?”

Finally, according to Eugene Dubossarsky, “there’s only one purpose to data science, and that is to support decisions. And more specifically, to make better decisions. That should be something no one can argue with.” This goal-focused definition is unsurprising, given the fact that Eugene runs a training and consulting business and has been working in the field for over 20 years. I’m not going to argue with him, but to put it all together, we can define data science as a field that deals with description, prediction, and causal inference from data in a manner that is both domain-independent and domain-aware, with the ultimate goal of supporting decisions.

Everyone loves a good buzzword, and these days AI (Artificial Intelligence) is one of the hottest buzzwords. However, despite what some people may try to tell you, AI is unlikely to make data science obsolete any time soon. Following the above definition, as long as there is a need to make decisions based on data, there will be a need for data scientists. This includes decisions that aren’t made by humans, as data scientists are involved in building systems that make decisions autonomously.

The resurgence of AI feels somewhat amusing given my personal experience. One of the reasons I decided to pursue a PhD in natural language processing and personalisation was my interest in what I considered to be AI back in 2008. My initial introduction to the field was through an AI course and a project I did as part of my bachelor’s degree in computer science. However, by the time I graduated from my PhD, saying that I’m an AI expert seemed less useful than calling myself a data scientist. It may be that the field is about to shift again, and that rebranding as an AI expert would be more beneficial (though I’d be doing exactly the same work). Titles are somewhat silly – I’m going to continue working with data to support decisions for as long as there is demand for this kind of work and I continue enjoying it. There is plenty to learn and develop in this area, regardless of buzzwords and sexy titles.

# Engineering Data Science at Automattic

A post I’ve written on applying some software engineering best practices to data science projects.

Most data scientists have to write code to analyze data or build products. While coding, data scientists act as software engineers. Adopting best practices from software engineering is key to ensuring the correctness, reproducibility, and maintainability of data science projects. This post describes some of our efforts in the area.

One of many data science Venn diagrams. Source: Data Science Stack Exchange

## Different data scientists, different backgrounds

Data science is often defined as the intersection of many fields, including software engineering and statistics. However, as demonstrated by the above Venn diagram, viewing it as an intersection tends to be too exclusive – in reality, it’s a union of many fields. Hence, data scientists tend to come from various backgrounds, and it is common to encounter data scientists with no formal training in computer science or software engineering. According to Michael Hochster, data scientists can be classified into two types: Type A, who focus on analysis, and Type B, who focus on building data products.


# Advice for aspiring data scientists and other FAQs

Aspiring data scientists and other visitors to this site often repeat the same questions. This post is the definitive collection of my answers to such questions (which may evolve over time).

How do I become a data scientist?

It depends on your situation. Before we get into it, have you thought about why you want to become a data scientist?

Hmm… Not really. Why should I become a data scientist?

I can’t answer this for you, but it’s great to see you asking why. Do you know what data science is? Do you understand what data scientists do?

Sort of. Just so we’re on the same page, what is data science?

See Defining data science in 2018 above, along with What are the hardest parts of data science?

Thanks, that’s helpful. But what do data scientists actually do?

It varies a lot. This variability makes the job title somewhat useless. You should try to get an idea of what areas of data science interest you. For many people, excitement over the technical aspects wanes with time. And even if you still find the technical aspects exciting, most jobs have boring parts. When considering career changes, think of the non-technical aspects that would keep you engaged.

To answer the question, here are some posts on things I’ve done: Joined Automattic by improving the Elasticsearch language detection plugin, calculated customer lifetime value, analysed A/B test results, built recommender systems (including one for Bandcamp music), competed on Kaggle, and completed a PhD. I’ve also dabbled in deep learning, marine surveys, causality, and other things that I haven’t had the chance to write about.

Cool! Can you provide a general overview of how to become a data scientist?

I’m pretty happy with my current job, but still thinking of becoming a data scientist. What should I do?

Find ways of doing data science within your current role, working overtime if needed. Working on a real problem in a familiar domain is much more valuable than working on toy problems from online courses and platforms like Kaggle (though they’re also useful). If you’re a data analyst, learn how to program to automate and simplify your analyses. If you’re a software engineer, become comfortable with analysing and modelling data. Machine learning doesn’t have to be a part of what you choose to do.

I’m pretty busy. What online course should I take to learn about the area?

Calling Bullshit: Data Reasoning for the Digital Age is a good place to start. Deep learning should be pretty low on your list if you don’t have much background in the area.

Should I learn Python or R? Keras or Tensorflow? What about <insert name here>?

It doesn’t matter. Focus on principles and you’ll be fine. The following quote by Harrington Emerson still applies today (to people of all genders).

As to methods, there may be a million and then some, but principles are few. The man who grasps principles can successfully select his own methods. The man who tries methods, ignoring principles, is sure to have trouble.

I want to become a data science freelancer. Can you provide some advice?

As with any freelancing job, expect to spend much of your time on sales and networking. I’ve only explored the freelancing path briefly, but Radim Řehůřek has published great slides on the topic. If you’re thinking of freelancing as a way of gaining financial independence, also consider spending less, earning more, and investing wisely.

Can you recommend an academic data science degree?

Sorry, but I don’t know much about those degrees. Boris Gorelik has some interesting thoughts on studying data science.

Will you be my mentor?

Probably not, unless you’re hard-working, independent, and doing something I find interesting. Feel free to contact me if you believe we’d both find the relationship beneficial.

Can you help with my project?

Probably not, as I work full-time with Automattic. I barely have time for my side projects, and I’m not looking for more paid work. However, if you think I’d find your project exciting, please do contact me.

What about them?

There isn’t a single definition of right and wrong, as morality is multi-dimensional. I believe it’s important to question your own choices, and avoid applying data science blindly. For me, this means divesting from harmful industries like fossil fuels and striving to go beyond the creation of greedy robots (among other things).

I’m a manager. When should I hire a data scientist and start using machine learning?

There’s a good chance you don’t need a data scientist yet, but you should be aware of common pitfalls when trying to be data-driven. It’s also worth reading Paras Chopra’s post on what you need to know before you board the machine learning train.

Do you want to buy my products or services?

No. If I did, I’d contact you.

I have a question that isn’t answered here or anywhere on the internet, and I think you can help. Can I contact you?

# My 10-step path to becoming a remote data scientist with Automattic

About two years ago, I read the book The Year without Pants, which describes the author’s experience leading a team at Automattic (the company behind WordPress.com, among other products). Automattic is a fully-distributed company, which means that all of its employees work remotely (hence pants are optional). While the book discusses some of the challenges of working remotely, the author’s general experience was very positive. A few months after reading the book, I decided to look for a full-time position after a period of independent work. Ideally, I wanted a well-paid data science-y remote job with an established distributed tech company that offers a good life balance and makes products I care about. Automattic seemed to tick all my boxes, so I decided to apply for a job with them. This post describes my application steps, which ultimately led to me becoming a data scientist with Automattic.

Before jumping in, it’s worth noting that this post describes my personal experience. If you apply for a job with Automattic, your experience is likely to be different, as the process varies across teams, and evolves over time.

## 📧 Step 1: Do background research and apply

I decided to apply for a data wrangler position with Automattic in October 2015. While data wrangler may sound less sexy than data scientist, reading the job ad led me to believe that the position may involve interesting data science work. This impression was strengthened by some LinkedIn stalking, which included finding current data wranglers and reading through their profiles and websites. I later found out that all the people in the data division start out as data wranglers, and then they may pick their own title. Some data wranglers do data science work, while others are more focused on data engineering, and there are some projects that require a broad range of skills. As the usefulness of the term data scientist is questionable, I’m not too fussed about fancy job titles. It’s more important to do interesting work in a supportive environment.

Applying for the job was fairly straightforward. I simply followed the instructions from the ad:

Does this sound interesting? If yes, please send a short email to jobs @ this domain telling us about yourself and attach a resumé. Let us know what you can contribute to the team. Include the title of the position you’re applying for and your name in the subject. Proofread! Make sure you spell and capitalize WordPress and Automattic correctly. We are lucky to receive hundreds of applications for every position, so try to make your application stand out. If you apply for multiple positions or send multiple emails there will be one reply.

Having been on the receiving side of job applications, I find it surprising that many people don’t bother writing a cover letter, addressing the selection criteria in the ad, or even applying for a job they’re qualified to do. Hence, my cover letter was fairly short, comprising several bullet points that highlighted the similarities between the job requirements and my experience. It was nothing fancy, but simple cover letters have worked well for me in the past.

## ⏳ Step 2: Wait patiently

The initial application was followed by a long wait. From my research, this is the typical scenario. This is unsurprising, as Automattic is a fairly small company with a large footprint, which is both distributed and known as a great place to work (e.g., its Glassdoor rating is 4.9). Therefore, it attracts many applicants from all over the world, and processing them all takes a while. In addition, Matt Mullenweg (Automattic’s CEO) reviews job applications before passing them on to the team leads.

As I didn’t know that Matt reviewed job applications, I decided to try to shorten the wait by getting introduced to someone in the data division. My first attempt was via a second-degree LinkedIn connection who works for Automattic. He responded quickly when I reached out to him, saying that his experience working with the company is in line with the Glassdoor reviews – it’s the best job he’s had in his 15-year-long career. However, he couldn’t help me with an intro, because there is no simple way around Automattic’s internal processes. Nonetheless, he reassured me that it is worth waiting patiently, as the strict process means that you end up working with great people.

I wasn’t in a huge rush to find a job, but in December 2015 I decided to accept an offer to become the head of data science at Car Next Door. This was a good decision at the time, as I believe in the company’s original vision of reducing the number of cars on the road through car sharing, and it seemed like there would be many interesting projects for me to work on. The position wasn’t completely remote, but as the company was already spread across several cities, I was able to work from home for a day or two every week. In addition, it was a pleasant commute by bike from my Sydney home to the office, so putting the fully-remote job search on hold didn’t seem like a major sacrifice. As I hadn’t heard anything from Automattic at that stage, it seemed unwise to reject a good offer, so I started working full-time with Car Next Door in January 2016.

I successfully attracted Automattic’s attention with a post I published on the misuse of the word insights by many tech companies, which included an example from WordPress.com. Greg Ichneumon Brown, one of the data wranglers, commented on the post, and invited me to apply to join Automattic and help them address the issues I raised. This happened after I had accepted the offer from Car Next Door, and didn’t result in any speed-up of the process, so I just gave up on Automattic and carried on with my life.

## 💬 Step 3: Chat with the data lead

I finally heard back from Automattic in February 2016 (four months after my initial application and a month into my employment with Car Next Door). Martin Remy, who leads the data division, emailed me to enquire whether I was still interested in the position. I informed him that I was no longer looking for a job, but we agreed to have an informal chat, as I’d been waiting for such a long time.

As is often the case with Automattic interviews, the chat with Martin was completely text-based. Working with a distributed team means that voice and video calls can be hard to schedule. Hence, Automattic relies heavily on textual channels, and text-based interviews allow the company to test the written communication skills of candidates. The chat revolved around my past work experience, and Martin also took the time to answer my questions about the company and the data division. At the conclusion of the chat, Martin suggested I contact him directly if I was ever interested in continuing the application process. While I was happy with my position at the time, the chat strengthened my positive impression of Automattic, and I decided that I would reapply if I were to look for a full-time position again.

My next job search started earlier than I had anticipated. In October 2016, I decided to leave Car Next Door due to disagreements with the founders over the general direction of the company. In addition, I had more flexibility in choosing where to live, as my personal circumstances had changed. As I’ve always been curious about life outside the capital cities of Australia, I wanted to move away from Sydney. While I could have probably continued working remotely with Car Next Door, I felt that it would be better to find a job with a fully-distributed team. Therefore, I messaged Martin and we scheduled another chat.

The second chat with Martin took place in early November. Similarly to the first chat, it was conducted via Skype text messages, and revolved around my work in the time that had passed since the first chat. This time, as I was keen on continuing with the process, I asked more specific questions about what kind of work I was likely to end up doing and what the next steps would be. The answers were that I’d be joining the data science team, and that the next steps were a pre-trial test, a paid trial, and a final interview with Matt. While this sounds straightforward, it took another six months until I finally became an Automattic employee (but I wasn’t in a rush).

## ☑️ Step 4: Pass the pre-trial test

The pre-trial test consisted of a data analysis task, where I was given a dataset and a set of questions to answer by Carly Stambaugh, the data science lead. The goal of the test is to evaluate the candidate’s approach to a problem, and assess organisational and communication skills. As such, the focus isn’t on obtaining a specific result, so candidates are given a choice of several potential avenues to explore. The open-ended nature of the task is reminiscent of many real-world data science projects, where you don’t always have a clear idea of what you’re going to discover. While some people may find this kind of uncertainty daunting, I find it interesting, as it is one of the things that makes data science a science.

I spent a few days analysing the data and preparing a report, which was submitted as a Jupyter Notebook. After submitting my initial report, there were a few follow-up questions, which I answered by email. The report was reviewed by Carly and Martin, and as they were satisfied with my work, I was invited to proceed to the next stage: A paid trial project.

## 👨‍💻 Step 5: Do the trial project

The main part of the application process with Automattic is the paid trial project. The rationale behind doing paid trials was explained a few years ago by Matt in Hire by Auditions, Not Resumes:

Before we hire anyone, they go through a trial process first, on contract. They can do the work at night or over the weekend, so they don’t have to leave their current job in the meantime. We pay a standard rate of $25 per hour, regardless of whether you’re applying to be an engineer or the chief financial officer. During the trials, we give the applicants actual work. If you’re applying to work in customer support, you’ll answer tickets. If you’re an engineer, you’ll work on engineering problems. If you’re a designer, you’ll design. There’s nothing like being in the trenches with someone, working with them day by day. It tells you something you can’t learn from resumes, interviews, or reference checks. At the end of the trial, everyone involved has a great sense of whether they want to work together going forward. And, yes, that means everyone — it’s a mutual tryout. Some people decide we’re not the right fit for them. The goal of my trial project was to improve the Elasticsearch language detection algorithm. This took about a month, and ultimately resulted in a pull request that got merged into the language detection plugin. I find this aspect of the process pretty exciting: While the plugin is used to classify millions of documents internally by Automattic, its impact extends beyond the company, as Elasticsearch is used by many other organisations and projects. This stands in contrast to many other technical job interviews, which consist of unpaid work on toy problems under stressful conditions, where the work performed is ultimately thrown away. While the monetary compensation for the trial work is lower than the market rate for data science consulting, I valued the opportunity to work on a real open source project, even if this hadn’t led to me getting hired. There was much more to the trial project than what’s shown in the final pull request. Most of the discussions were held on an internal project thread, primarly under the guidance of Carly (the data science lead), and Greg (the data wrangler who replied to my post a year earlier). The project was kicked off with a general problem statement: There was some evidence that the Elasticsearch language detection plugin doesn’t perform well on short texts, and my mission was to improve it. As the plugin didn’t include any tests for short texts, one of the main contributions of my work was the creation of datasets and tests to measure its accuracy on texts of different lengths. This was followed by some tweaks that improved the plugin’s performance, as summarised in the pull request. Internally, this work consisted of several iterations where I came up with ideas, asked questions, implemented the ideas, shared the results, and discussed further steps. There are still many possible improvements to the work done in the trial. However, as trials generally last around a month, we decided to end it after a few iterations. I enjoyed the trial process, but it is definitely not for everyone. Most notably, there is a strong emphasis on asynchronous text-based communication, which is the main mode by which projects are coordinated at Automattic. People who don’t enjoy written communication may find this aspect challenging, but I have always found that writing helps me organise my thoughts, and that I retain information better when reading than when listening to people speak. That being said, Automatticians do meet in person several times a year, and some teams have video chats for some discussions. 
While doing the trial, I had a video chat with Carly, which was the first (and last) time in the process that I got to see and hear a live human. However, this was not an essential part of the trial project, as our chat was mostly on the data scientist role and my job expectations. ## ⏳ Step 6: Wait patiently I finished working on the trial project just before Christmas. The feedback I received throughout the trial was positive, but Martin, Carly, and Greg had to go through the work and discuss it among themselves before making a final decision. This took about a month, due to the holiday period, various personal circumstances, and the data science team meetup that was scheduled for January 2017. Eventually, Martin got back to me with positive news: They were satisfied with my trial work, which meant there was only one stage left – the final interview with Matt Mullenweg, Automattic’s CEO. ## 👉 Step 7: Ping Matt Like other parts of the process, the interview with Matt is text-based. The way it works is fairly simple: I was instructed to message Matt on Slack and wait for a response, which may take days or weeks. I sent Matt a message on January 25, and was surprised to hear back from him the following morning. However, that day was Australia Day, which is a public holiday here. Therefore, I only got back to him two hours after he messaged me that morning, and by that time he was probably already busy with other things. This was the start of a pretty long wait. ## ⏳ Step 8: Wait patiently I left Car Next Door at the end of January, as I figured that I would be able to line up some other work even if things didn’t work out with Automattic. My plan was to take some time off, and then move up to the Northern Rivers area of New South Wales. I had two Reef Life Survey trips planned, so I wasn’t going to start working again before mid-April. I assumed that I would hear back from Matt before then, which would have allowed me to make an informed decision whether to look for another job or not. After two weeks of waiting, the time for my dive trips was nearing. As I was going to be without mobile reception for a while, I thought it’d be worth letting Matt know my schedule. After discussing the matter with Martin, I messaged Matt. He responded, saying that we might as well do the interview at the beginning of April, as I won’t be starting work before that time anyway. I would have preferred to be done with the interview earlier, but was happy to have some certainty and not worry about missing more chat messages before April. In early April, I returned from my second dive trip (which included a close encounter with Cyclone Debbie), and was hoping to sort out my remote work situation while completing the move up north. Unfortunately, while the move was successful, I was ready to give up on Automattic because I haven’t heard back from Matt at all in April. However, Martin remained optimistic and encouraged me to wait patiently, which I did as I was pretty busy with the move and with some casual freelancing projects. ## 💬 Step 9: Chat with Matt and accept the job offer The chat with Matt finally happened on May 2. As is often the case, it took a few hours and covered my background, the trial process, and some other general questions. I asked him about my long wait for the final chat, and he apologised for me being an outlier, as most chats happen within two weeks of a candidate being passed over to him. 
As the chat was about to conclude, we got to the topic of salary negotiation (which went well), and then the process was finally over! Within a few hours of the chat I was sent an offer letter and an employment contract. As Automattic has an entity in Australia (called Ausomattic), it’s a fairly standard contract. I signed the contract and started work the following week – over a year and a half after my initial application. Even before I started working, I booked tickets to meet the data division in Montréal – a fairly swift transition from the long wait for the final interview. ## 🎉 Step 10: Start working and choose a job title As noted above, Automatticians get to choose their own job titles, so to become a data scientist with Automattic, I had to set my job title to Data Scientist. This is generally how many people become data scientists these days, even outside Automattic. However, job titles don’t matter as much as job satisfaction. And after 2.5 months with Automattic, I’m very satisfied with my decision to join the company. My first three weeks were spent doing customer support, like all new Automattic employees. Since then, I’ve been involved in projects to make engagement measurement more consistent (harder than it sounds, as counting things is hard), and to improve the data science codebase (e.g., moving away from Legacy Python). Besides that, I also went to Montréal for the data division meetup, and have started getting into chatbot work. I’m looking forward to doing more work and sharing my experience here and on data.blog. # Customer lifetime value and the proliferation of misinformation on the internet Suppose you work for a business that has paying customers. You want to know how much money your customers are likely to spend to inform decisions on customer acquisition and retention budgets. You’ve done a bit of research, and discovered that the figure you want to calculate is commonly called the customer lifetime value. You google the term, and end up on a page with ten results (and probably some ads). How many of those results contain useful, non-misleading information? As of early 2017, fewer than half. Why is that? How can it be that after nearly 20 years of existence, Google still surfaces misleading information for common search terms? And how can you calculate your customer lifetime value correctly, avoiding the traps set up by clever search engine marketers? Read on to find out! ## Background: Misleading search results and fake news While Google tries to filter obvious spam from its index, it still relies to a great extent on popularity to rank search results. Popularity is a function of inbound links (weighted by site credibility), and of user interaction with the presented results (e.g., time spent on a result page before moving on to the next result or search). There are two obvious problems with this approach. First, there are no guarantees that wrong, misleading, or inaccurate pages won’t be popular, and therefore earn high rankings. Second, given Google’s near-monopoly of the search market, if a page ranks highly for popular search terms, it is likely to become more popular and be seen as credible. Hence, when searching for the truth, it’d be wise to follow Abraham Lincoln’s famous warning not to trust everything you read on the internet. Google is not alone in helping spread misinformation. Following Donald Trump’s recent victory in the US presidential election, many people have blamed Facebook for allowing so-called fake news to be widely shared. 
Indeed, any popular media outlet or website may end up spreading misinformation, especially if – like Facebook and Google – it mainly aggregates and amplifies user-generated content. However, as noted by John Herrman, the problem is much deeper than clearly-fabricated news stories. It is hard to draw the lines between malicious spread of misinformation, slight inaccuracies, and plain ignorance. For example, how would one classify Trump’s claims that climate change is a hoax invented by the Chinese? Should Twitter block his account for knowingly spreading outright lies?

## Wrong customer value calculation by example

Fortunately, when it comes to customer lifetime value, I doubt that any of the top results returned by Google is intentionally misleading. This is a case where inaccuracies and misinformation result from ignorance rather than from malice. However, relying on such resources without digging further is just as risky as relying on pure fabrications. For example, see this infographic by Kissmetrics, which suggests three different formulas for calculating the average lifetime value of a Starbucks customer. Those three formulas yield very different values ($5,489, $11,535, and $25,272), which the authors then say should be averaged to yield the final lifetime value figure. All formulas are based on numbers that the authors call constants, despite the fact that numbers such as the average customer lifespan or retention rate are clearly not constant in this context (since they’re estimated from the data and used as projections into the future). Indeed, several people have commented on the flaws in Kissmetrics’ approach, which is reminiscent of the Dilbert strip where the pointy-haired boss asks Dilbert to average and multiply wrong data.

My main problem with the Kissmetrics infographic is that it helps feed an illusion of understanding that is prevalent among those with no statistical training. As the authors fail to acknowledge the fact that the predictions produced by the formulas are inaccurate, they may cause managers and marketers to believe that they know the lifetime value of their customers. However, it’s important to remember that all models are wrong (but some models are useful), and that the lifetime value of active customers is unknowable since it involves forecasting of uncertain quantities. Hence, it is reckless to encourage people to use the Kissmetrics formulas without trying to quantify how wrong they may be on the specific dataset they’re applied to.
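
One straightforward way of quantifying how wrong a formula is on your own data is backtesting: Pretend you are at some past cut-off date, compute the formula’s prediction from the data before the cut-off, and compare it with what customers actually spent afterwards. Here is a minimal sketch, assuming a hypothetical transaction format of (customer, period, amount) tuples with integer period indices; the naive prediction it tests is illustrative, not Kissmetrics’ exact formula.

```python
from collections import defaultdict

def backtest_naive_formula(transactions, cutoff, horizon):
    """Estimate how wrong a naive value formula is by backtesting.

    `transactions` is an iterable of (customer_id, period, amount)
    tuples, where `period` is an integer time index (e.g., a week
    number) -- an illustrative format, not a standard one.
    """
    past_spend = defaultdict(float)
    past_periods = defaultdict(set)
    future_spend = defaultdict(float)
    for customer, period, amount in transactions:
        if period <= cutoff:
            past_spend[customer] += amount
            past_periods[customer].add(period)
        elif period <= cutoff + horizon:
            future_spend[customer] += amount

    abs_errors = []
    for customer, spend in past_spend.items():
        # Naive prediction in the spirit of the criticised formulas:
        # average spend per active period, projected over the horizon.
        predicted = spend / len(past_periods[customer]) * horizon
        abs_errors.append(abs(predicted - future_spend[customer]))
    return sum(abs_errors) / len(abs_errors)  # mean absolute error

# Made-up transactions for two customers over eight weekly periods.
txns = [('a', 1, 10), ('a', 2, 10), ('a', 5, 10),
        ('b', 1, 50), ('b', 6, 20)]
print(backtest_naive_formula(txns, cutoff=4, horizon=4))  # 105.0
```

Even a crude check like this exposes how far the point estimates can be from reality, which is exactly the information the infographic omits.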

## Fader and Hardie: The voice of reason

The formula discussed by Fader and Hardie is $CLV = \sum_{t=0}^{T} m \frac{r^t}{(1 + d)^t}$, where $m$ is the net cash flow per period, $r$ is the retention rate, $d$ is the discount rate, and $T$ is the time horizon. The five issues that Fader and Hardie identify are as follows (a small code sketch of the formula appears after the list).

1. The true lifetime value is unknown while the customer is still active, so the formula is actually for the expected lifetime value, i.e., $E(CLV)$.
2. Since the summation is bounded, the formula isn’t really for the lifetime value – it is an estimate of value up to period $T$ (which may still be useful).
3. As the summation starts at $t=0$, it gives the expected value of a customer that hasn’t been acquired yet. According to Fader and Hardie, in some cases the formula starts at $t=1$, i.e., it applies only to existing customers. The distinction between the two cases isn’t always made clear.
4. The formula assumes a constant retention rate. However, it is often the case that retention increases with tenure, i.e., customers who have been with the company for a long time are less likely to churn than recently-acquired customers.
5. It isn’t always possible to calculate a retention rate, as the point at which a customer churns isn’t observed for many products. For example, Starbucks doesn’t know whether customers who haven’t made a purchase for a while have decided to never visit Starbucks again, or whether they’re just going through a period of inactivity. Further, given the ubiquity of Starbucks, it is probably safe to assume that all past customers have a non-zero probability of making another purchase (unless they’re physically dead).
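
To make the formula concrete, here is a minimal sketch in Python with made-up inputs (the numbers are illustrative assumptions, not estimates from any real dataset). The `start` parameter captures the $t=0$ versus $t=1$ distinction from issue 3, and the constant retention rate is exactly the simplification flagged in issue 4.

```python
def expected_clv(m, r, d, T, start=0):
    """Expected customer value up to period T, assuming a constant
    retention rate r. start=0 values a not-yet-acquired customer;
    start=1 values an existing one."""
    return sum(m * r**t / (1 + d)**t for t in range(start, T + 1))

# Hypothetical numbers: $20 net cash flow per period, 80% retention,
# 10% discount rate, 10-period horizon.
print(expected_clv(m=20, r=0.8, d=0.1, T=10))           # ≈ 71.13
print(expected_clv(m=20, r=0.8, d=0.1, T=10, start=1))  # ≈ 51.13
```

Note that the function returns a single number with no uncertainty attached, which is precisely the illusion of precision criticised above.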

According to Fader and Hardie, “the bottom line is that there is no ‘one formula’ that can be used to compute customer lifetime value”. Therefore, teaching the above formula (or one of its variants) misleads people into thinking that they know how to calculate the lifetime value of customers. Hence, they advocate going back to the definition of lifetime value as “the present value of the future cashflows attributed to the customer relationship”, and using a probabilistic approach to generate estimates of the expected lifetime value for each customer. This conclusion also appears in a more accessible series of blog posts by Custora, where it is claimed that probabilistic modelling can yield significantly more accurate estimates than naive formulas.

## Getting serious with the lifetimes package

As mentioned above, Fader and Hardie provide Excel implementations of some of their models, which produce individual-level lifetime value predictions. While this is definitely an improvement over using general formulas, better solutions are available if you can code (or have access to people who can code for you). For example, using a software package makes it easy to integrate the lifetime value calculation into a live product, enabling automated interventions to increase revenue and profit (among other benefits). According to Roberto Medri, this approach is followed by Etsy, where lifetime value predictions are used to retain customers and increase their value.

An example of a software package that I can vouch for is the Python lifetimes package, which implements several probabilistic models for lifetime value prediction in a non-contractual setting (i.e., where churn isn’t observed – as in the Starbucks example above). This package is maintained by Cameron Davidson-Pilon of Shopify, who may be known to some readers from his Bayesian Methods for Hackers book and other Python packages. I’ve successfully used the package on a real dataset and have contributed some small fixes and improvements.

The documentation on GitHub is quite good, so I won’t repeat it here. However, it is worth reiterating that as with any predictive model, it is important to evaluate performance on your own dataset before deciding to rely on the package’s predictions. If you only take away one thing from this article, let it be the reminder that it is unwise to blindly accept any formula or model.

The models implemented in the package (some of which were introduced by Fader and Hardie) are fairly simple and generally applicable, as they rely only on the past transaction log. These simple models are known to sometimes outperform more complex models that rely on richer data, but this isn’t guaranteed to happen on every dataset. My untested feeling is that in situations where clean and relevant training data is plentiful, models that use other features in addition to those extracted from the transaction log would outperform the models provided by the lifetimes package (if you have empirical evidence that supports or refutes this assumption, please let me know).
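
For a flavour of what this looks like in practice, here is a minimal sketch along the lines of the package’s quick-start documentation, fitting the BG/NBD model to the CDNOW sample dataset that ships with the package (API details may vary between versions):

```python
from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

# Summary of a transaction log: for each customer, the number of repeat
# purchases (frequency), the age at last purchase (recency), and the
# total observation time (T), all in days.
data = load_cdnow_summary(index_col=[0])

# Fit the BG/NBD model, one of the Fader & Hardie models mentioned above.
bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(data['frequency'], data['recency'], data['T'])

# Expected number of purchases per customer over the next 90 days.
data['predicted_purchases'] = bgf.conditional_expected_number_of_purchases_up_to_time(
    90, data['frequency'], data['recency'], data['T']
)
print(data.sort_values('predicted_purchases', ascending=False).head())
```

To translate expected purchase counts into monetary value, the package also implements the Gamma-Gamma model, and its utilities can split a transaction log into calibration and holdout periods – which supports exactly the kind of evaluation on your own data advocated above.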

## Conclusion: You’re better than that

Accurate estimation of customer lifetime value is crucial to most businesses. It informs decisions on customer acquisition and retention, and getting it wrong can drive a business from profitability to insolvency. The rise of data science increases the availability of statistical and scientific tools to small and large businesses. Hence, there are few reasons why a revenue-generating business should rely on untested customer value formulas rather than on more realistic models. This extends beyond customer value to nearly every business endeavour: Relying on fabrications is not a sustainable growth strategy, there is no way around learning how to be intelligently driven by data, and no amount of cheap demagoguery and misinformation can alter the objective reality of our world.

# Ask Why! Finding motives, causes, and purpose in data science

Some people equate predictive modelling with data science, thinking that mastering various machine learning techniques is the key that unlocks the mysteries of the field. However, there is much more to data science than the What and How of predictive modelling. I recently gave a talk where I argued for the importance of asking Why, touching on three different topics: stakeholder motives, cause-and-effect relationships, and finding a sense of purpose. A video of the talk is available below. Unfortunately, the videographer mostly focused on my pacing rather than on the screen, but you can check out the slides here (note that you need to use both the left/right and up/down arrows to see all the slides).

If you’re interested in the topics covered in the talk, here are a few posts you should read.

Stakeholders and their motives

Causality and experimentation

Purpose, ethics, and my personal path

Cover image: Why by Ksayer