
Propaganda graffiti

Customer lifetime value and the proliferation of misinformation on the internet

Suppose you work for a business that has paying customers. You want to know how much money your customers are likely to spend to inform decisions on customer acquisition and retention budgets. You’ve done a bit of research, and discovered that the figure you want to calculate is commonly called the customer lifetime value. You google the term, and end up on a page with ten results (and probably some ads). How many of those results contain useful, non-misleading information? As of early 2017, fewer than half. Why is that? How can it be that after nearly 20 years of existence, Google still surfaces misleading information for common search terms? And how can you calculate your customer lifetime value correctly, avoiding the traps set up by clever search engine marketers? Read on to find out!

Background: Misleading search results and fake news

While Google tries to filter obvious spam from its index, it still relies to a great extent on popularity to rank search results. Popularity is a function of inbound links (weighted by site credibility), and of user interaction with the presented results (e.g., time spent on a result page before moving on to the next result or search). There are two obvious problems with this approach. First, there are no guarantees that wrong, misleading, or inaccurate pages won’t be popular, and therefore earn high rankings. Second, given Google’s near-monopoly of the search market, if a page ranks highly for popular search terms, it is likely to become more popular and be seen as credible. Hence, when searching for the truth, it’d be wise to follow Abraham Lincoln’s famous warning not to trust everything you read on the internet.

Abraham Lincoln internet quote

Google is not alone in helping spread misinformation. Following Donald Trump’s recent victory in the US presidential election, many people have blamed Facebook for allowing so-called fake news to be widely shared. Indeed, any popular media outlet or website may end up spreading misinformation, especially if – like Facebook and Google – it mainly aggregates and amplifies user-generated content. However, as noted by John Herrman, the problem is much deeper than clearly-fabricated news stories. It is hard to draw the lines between malicious spread of misinformation, slight inaccuracies, and plain ignorance. For example, how would one classify Trump’s claims that climate change is a hoax invented by the Chinese? Should Twitter block his account for knowingly spreading outright lies?

Wrong customer value calculation by example

Fortunately, when it comes to customer lifetime value, I doubt that any of the top results returned by Google is intentionally misleading. This is a case where inaccuracies and misinformation result from ignorance rather than from malice. However, relying on such resources without digging further is just as risky as relying on pure fabrications. For example, see this infographic by Kissmetrics, which suggests three different formulas for calculating the average lifetime value of a Starbucks customer. Those three formulas yield very different values ($5,489, $11,535, and $25,272), which the authors then say should be averaged to yield the final lifetime value figure. All formulas are based on numbers that the authors call constants, despite the fact that numbers such as the average customer lifespan or retention rate are clearly not constant in this context (since they’re estimated from the data and used as projections into the future). Indeed, several people have commented on the flaws in Kissmetrics’ approach, which is reminiscent of the Dilbert strip where the pointy-haired boss asks Dilbert to average and multiply wrong data.

Dilbert: average and multiply wrong data

My main problem with the Kissmetrics infographic is that it helps feed an illusion of understanding that is prevalent among those with no statistical training. As the authors fail to acknowledge the fact that the predictions produced by the formulas are inaccurate, they may cause managers and marketers to believe that they know the lifetime value of their customers. However, it’s important to remember that all models are wrong (but some models are useful), and that the lifetime value of active customers is unknowable since it involves forecasting of uncertain quantities. Hence, it is reckless to encourage people to use the Kissmetrics formulas without trying to quantify how wrong they may be on the specific dataset they’re applied to.

Fader and Hardie: The voice of reason

Notably, the work of Peter Fader and Bruce Hardie on customer lifetime value isn’t directly referenced on the first page of Google results. This is unfortunate, as they have gone through the effort of making their models accessible to people with no academic background, e.g., using Excel spreadsheets and YouTube videos. However, it is clear that they are not optimising for search engine rankings, as I found out about their work by adding search terms that the average marketer is unlikely to use (e.g., Python and Bayesian). While surveying Fader and Hardie’s large body of work is beyond the scope of this article, it is worth summarising their criticism of the lifetime value formula that is taught in introductory marketing courses.

The formula discussed by Fader and Hardie is CLV = \sum_{t=0}^{T} m \frac{r^t}{(1 + d)^t}, where m is the net cash flow per period, r is the retention rate, d is the discount rate, and T is the time horizon (a short code sketch of this formula follows the list below). The five issues that Fader and Hardie identify are as follows.

  1. The true lifetime value is unknown while the customer is still active, so the formula is actually for the expected lifetime value, i.e., E(CLV).
  2. Since the summation is bounded, the formula isn’t really for the lifetime value – it is an estimate of value up to period T (which may still be useful).
  3. As the summation starts at t=0, it gives the expected value of a customer that hasn’t been acquired yet. According to Fader and Hardie, in some cases the formula starts at t=1, i.e., it applies only to existing customers. The distinction between the two cases isn’t always made clear.
  4. The formula assumes a constant retention rate. However, it is often the case that retention increases with tenure, i.e., customers who have been with the company for a long time are less likely to churn than recently-acquired customers.
  5. It isn’t always possible to calculate a retention rate, as the point at which a customer churns isn’t observed for many products. For example, Starbucks doesn’t know whether customers who haven’t made a purchase for a while have decided to never visit Starbucks again, or whether they’re just going through a period of inactivity. Further, given the ubiquity of Starbucks, it is probably safe to assume that all past customers have a non-zero probability of making another purchase (unless they’re physically dead).
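To make the critique concrete, here is a minimal Python sketch of the textbook formula above; the function name and input numbers are mine, and the point is that it returns a single point estimate with no indication of how wrong that estimate may be:

```python
def naive_expected_clv(m, r, d, T):
    """Textbook E(CLV): discounted expected cash flows, summed from t=0 to T.

    m: net cash flow per period, r: retention rate, d: discount rate.
    Treats r and d as known constants, which is exactly what Fader and Hardie caution against.
    """
    return sum(m * r ** t / (1 + d) ** t for t in range(T + 1))

# Made-up inputs: $50 per period, 80% retention, 10% discount rate, 20 periods.
print(round(naive_expected_clv(m=50, r=0.8, d=0.1, T=20), 2))
```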

According to Fader and Hardie, “the bottom line is that there is no ‘one formula’ that can be used to compute customer lifetime value”. Therefore, teaching the above formula (or one of its variants) misleads people into thinking that they know how to calculate the lifetime value of customers. Hence, they advocate going back to the definition of lifetime value as “the present value of the future cashflows attributed to the customer relationship”, and using a probabilistic approach to generate estimates of the expected lifetime value for each customer. This conclusion also appears in a more accessible series of blog posts by Custora, where it is claimed that probabilistic modelling can yield significantly more accurate estimates than naive formulas.

Getting serious with the lifetimes package

As mentioned above, Fader and Hardie provide Excel implementations of some of their models, which produce individual-level lifetime value predictions. While this is definitely an improvement over using general formulas, better solutions are available if you can code (or have access to people who can do coding for you). For example, using a software package makes it easy to integrate the lifetime value calculation into a live product, enabling automated interventions to increase revenue and profit (among other benefits). According to Roberto Medri, this approach is followed by Etsy, where lifetime value predictions are used to retain customers and increase their value.

An example of a software package that I can vouch for is the Python lifetimes package, which implements several probabilistic models for lifetime value prediction in a non-contractual setting (i.e., where churn isn’t observed – as in the Starbucks example above). This package is maintained by Cameron Davidson-Pilon of Shopify, who may be known to some readers from his Bayesian Methods for Hackers book and other Python packages. I’ve successfully used the package on a real dataset and have contributed some small fixes and improvements. The documentation on GitHub is quite good, so I won’t repeat it here. However, it is worth reiterating that as with any predictive model, it is important to evaluate performance on your own dataset before deciding to rely on the package’s predictions. If you only take away one thing from this article, let it be the reminder that it is unwise to blindly accept any formula or model. The models implemented in the package (some of which were introduced by Fader and Hardie) are fairly simple and generally applicable, as they rely only on the past transaction log. These simple models are known to sometimes outperform more complex models that rely on richer data, but this isn’t guaranteed to happen on every dataset. My untested feeling is that in situations where clean and relevant training data is plentiful, models that use other features in addition to those extracted from the transaction log would outperform the models provided by the lifetimes package (if you have empirical evidence that supports or refutes this assumption, please let me know).
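To give a flavour of the workflow, here is a minimal sketch based on the package’s quickstart as I remember it, fitting the beta-geometric/NBD model to the bundled CDNOW sample data; method and column names may vary between versions, so treat it as illustrative and consult the documentation:

```python
from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

# Each row summarises one customer's transaction log: frequency (repeat purchases),
# recency, and T (the customer's "age" in the dataset).
summary = load_cdnow_summary(index_col=[0])

# Fit the beta-geometric/NBD model, which handles the unobserved-churn setting.
bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(summary['frequency'], summary['recency'], summary['T'])

# Rank customers by their expected number of purchases over the next 30 periods.
summary['predicted_purchases'] = bgf.conditional_expected_number_of_purchases_up_to_time(
    30, summary['frequency'], summary['recency'], summary['T'])
print(summary.sort_values('predicted_purchases', ascending=False).head())
```

Turning expected purchase counts into monetary value requires modelling transaction values as well (the package includes a Gamma-Gamma model for that purpose), and, as stressed above, any such model should be validated on a holdout portion of your own transaction log.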

If you don't test your models, you're gonna have a bad time

Conclusion: You’re better than that

Accurate estimation of customer lifetime value is crucial to most businesses. It informs decisions on customer acquisition and retention, and getting it wrong can drive a business from profitability to insolvency. The rise of data science increases the availability of statistical and scientific tools to small and large businesses. Hence, there are few reasons why a revenue-generating business should rely on untested customer value formulas rather than on more realistic models. This extends beyond customer value to nearly every business endeavour: Relying on fabrications is not a sustainable growth strategy, there is no way around learning how to be intelligently driven by data, and no amount of cheap demagoguery and misinformation can alter the objective reality of our world.

cliff

If you don’t pay attention, data can drive you off a cliff

You’re a hotshot manager. You love your dashboards and you keep your finger on the beating pulse of the business. You take pride in using data to drive your decisions rather than shooting from the hip like one of those old-school 1950s bosses. This is the 21st century, and data is king. You even hired a sexy statistician or data scientist, though you don’t really understand what they do. Never mind, you can proudly tell all your friends that you are leading a modern data-driven team. Nothing can go wrong, right? Incorrect. If you don’t pay attention, data can drive you off a cliff. This article discusses seven of the ways this can happen. Read on to ensure it doesn’t happen to you.

1. Pretending uncertainty doesn’t exist

Last month, your favourite metric was 5.2%. This month, it’s 5.5%. Looks like things are getting better – you must be doing something right! But is 5.5% really different from 5.2%? All things being equal, you should expect some variability in most of your metrics. The values you see are drawn from a distribution of possible values, which means you can’t be certain what value you’ll be seeing next. Fortunately, with more data you would be able to quantify this uncertainty and know which values are more likely. Don’t fear or ignore uncertainty. Embrace and study it, and you’ll be on the right track.

2. Confusing observed and unobserved quantities

Everyone agrees that the future is uncertain. We can generate forecasts with varying degrees of confidence, but we never know for sure what’s going to happen. However, some people tend to ignore uncertainty in forecasts, treating the unobserved future values as comparable to observed present values. For example, marketers often compare customer lifetime value with the cost of acquiring a customer. The problem is that customer lifetime value relies on a prediction of the net profit from a customer (so it’s largely unobserved and uncertain), while the business has much more control and certainty around the cost of acquiring a customer (though it’s not completely known). Treating the two values as if they’re observed and known is risky, as it can lead to major financial losses.

3. Thinking that your data is correct

Dilbert: average and multiply wrong data

Ask anyone who works with data, and they’ll tell you that it’s always messy. A well-known saying among data scientists is that 80% of the work is data cleaning and the other 20% is complaining about data cleaning. Hence, it’s likely that at least some of the figures you’re relying on to make decisions are somewhat inaccurate. However, it’s important to remember that this doesn’t make the data completely useless. But if something looks too good to be true, it probably isn’t true. Finally, it’s highly unlikely that the data is always correct when you like the results and always incorrect when the results aren’t favourable, so don’t use the “guy on the internet said our data isn’t 100% correct” excuse to push back on inconvenient truths.

4. Believing that your data is complete

iceberg

No matter how big you are, your data doesn’t capture everything your customers do. Even Google and the NSA don’t have a full view of what people are up to in the non-digital world, and they can’t completely read our minds (yet). Most businesses have much less data than the big tech companies, and they look a bit silly trying to explain customer behaviour using only the data they have. At the end of the day, you have to work with the data you can access, but never underestimate the effectiveness of obtaining more (relevant) data.

5. Measuring the wrong thing

Maybe you recently read an article emphasising the importance of real metrics, like daily active users, as opposed to vanity metrics like number of signups to your service. You therefore decide to track the daily active users of your product. But have you thought about whether this metric is relevant to what you’re trying to achieve? If you run a business like Airbnb, where transactions are inherently infrequent, do you really care if people don’t regularly log in? You probably don’t, as long as they use the product when they actually need it. Measuring and trying to optimise the wrong thing can be very risky. Indeed, deciding on metrics and their measurement can be seen as the hardest parts of data science.

6. Not recognising your unconscious incompetence

To quote Bertrand Russell: “One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision.” Not recognising the extent of your ignorance when it comes to data is pretty common among those with no training in the field, which may lead to illusory superiority. This may be exacerbated by the fact that those who do know what they’re doing tend to talk a lot about uncertainty and how there are many things that are simply unknowable. My hope is that this short article would help people graduate from unconscious incompetence, where you don’t even recognise the importance of what you don’t know, to conscious incompetence, where you recognise the need to learn and rely on expert advice.

7. Ignoring expert advice

Hal Varian sexy statistician quote

Once you’ve recognised your skill gaps, you may decide to hire a data scientist to help you get more value out of your data. However, despite the hype, data scientists are not magicians. In fact, because of the hype, the definition of data science is so diluted that some people say that the term itself has become useless. The truth is that dealing with data is hard, every organisation is somewhat different, and it takes time and commitment to get value out of data. The worst thing you can do is to hire an expensive expert to help you, and then ignore their advice when their findings are hard to digest. If you’re not ready to work with a data scientist, you might as well save yourself some money and remain in a state of blissful ignorance.

Note: This article is not a portrayal of how things are with my current employer, Car Next Door. Views expressed are my own. In fact, if you want to work at a place where expert advice is acted on and uncertainty is seen as something to be studied rather than ignored, we’re hiring!

Bayesian split testing calculator screenshot

Making Bayesian A/B testing more accessible

Much has been written in recent years on the pitfalls of using traditional hypothesis testing with online A/B tests. A key issue is that you’re likely to end up with many false positives if you repeatedly check your results and stop as soon as you reach statistical significance. One way of dealing with this issue is by following a Bayesian approach to deciding when the experiment should be stopped. While I find the Bayesian view of statistics much more intuitive than the frequentist view, it can be quite challenging to explain Bayesian concepts to laypeople. Hence, I decided to build a new Bayesian A/B testing calculator, which aims to make these concepts clear to any user. This post discusses the general problem and existing solutions, followed by a review of the new tool and how it can be improved further.

The problem

The classic A/B testing problem is as follows. Suppose we run an experiment where we have a control group and a test group. Participants (typically website visitors) are allocated to groups randomly, and each group is presented with a different variant of the website or page (e.g., variant A is assigned to the control group and variant B is assigned to the test group). Our aim is to increase the overall number of binary successes, where success can be defined as clicking a button or opening a new account. Hence, we track the number of trials in each group together with the number of successes. For a given group, the number of successes divided by the number of trials is the group’s raw success rate.

Given the results of an experiment (trials and successes for each group), there are a few questions we would typically like to answer:

  1. Should we choose variant A or variant B to maximise our success rate?
  2. How much would our success rate change if we chose one variant over the other?
  3. Do we have enough data or should we keep experimenting?

It’s important to note some points that might be obvious, but are often overlooked. First, we run an experiment because we assume that it will help us uncover a causal link, where something about A or B is hypothesised to cause people to behave differently, thereby affecting the overall success rate. Second, we want to make a decision and choose either A or B, rather than maintain multiple variants and present the best variant depending on a participant’s features (a problem that’s addressed by contextual bandits, for example). Third, online A/B testing is different from traditional experiments in a lab, because we often have little control over the characteristics of our participants, and when, where, and how they choose to interact with our experiment. This is an important point, because it means that we may need to wait a long time until we get a representative sample of the population. In addition, the raw numbers of trials and successes can’t tell us whether the sample is representative.

Bayesian solutions

Many blog posts have been written on how to use Bayesian statistics to answer the above questions, so I won’t get into too much detail here (see the posts by David Robinson, Maciej Kula, Chris Stucchio, and Evan Miller if you need more background). The general idea is that we assume that the success rates for the control and test variants are drawn from Beta(α_A, β_A) and Beta(α_B, β_B), respectively, where Beta(α, β) is the beta distribution with shape parameters α and β (which yields values in the [0, 1] interval). As the experiment runs, we update the parameters of the distributions – each success gets added to the group’s α, and each unsuccessful trial gets added to the group’s β. It is often reasonable to assume that the prior (i.e., initial) values of α and β are the same for both variants. If we denote the prior values of the parameters with α_0 and β_0, and the number of successes and trials for group x with S_x and T_x respectively, we get that the success rates are distributed according to Beta(α_0 + S_A, β_0 + T_A – S_A) for control and Beta(α_0 + S_B, β_0 + T_B – S_B) for test.

For example, if α_0 = β_0 = 1, T_A = 200, S_A = 120, T_B = 200, and S_B = 100, plotting the probability density functions yields the following chart (A – blue, B – red):

Beta distributions examples
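A minimal Python sketch of these two posteriors, and of the sampling-based estimates discussed next, might look as follows (the variable names are mine; the calculator itself is implemented in CoffeeScript):

```python
import numpy as np
from scipy import stats

alpha_0 = beta_0 = 1    # uniform prior for both variants
T_A, S_A = 200, 120     # control: trials and successes
T_B, S_B = 200, 100     # test: trials and successes

# Posterior success-rate distributions after observing the data.
posterior_A = stats.beta(alpha_0 + S_A, beta_0 + T_A - S_A)
posterior_B = stats.beta(alpha_0 + S_B, beta_0 + T_B - S_B)

# Estimate P(A beats B) and a 95% interval for the difference by sampling.
samples_A = posterior_A.rvs(100_000)
samples_B = posterior_B.rvs(100_000)
print('P(A > B):', (samples_A > samples_B).mean())
print('95% interval for A - B:', np.percentile(samples_A - samples_B, [2.5, 97.5]))
```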

Given these distributions, we can calculate the most probable range for the success rate of each variant, and estimate the difference in success rate between the variants. These can be calculated by deriving closed formulas, or by drawing samples from each distribution. In addition, it is important to note that the distributions change as we gather more data, even if the raw success rates don’t. For example, multiplying each count by 10 to obtain T_A = 2000, S_A = 1200, T_B = 2000, and S_B = 1000 doesn’t change the success rates, but it does change the distributions – they become much narrower:

Narrower beta distributions

In the second case we’ve gathered ten times the data, which made the distributions much more distinct. Intuitively, this means we can now be more confident that the success rate of A is higher than that of B. Quantifying this confidence and deciding when to conclude the experiment isn’t straightforward, and should depend on factors that aren’t fully captured by the raw counts. The way I chose to address this issue is presented below, after briefly discussing existing calculators and their limitations.

Existing online calculators

The beauty of frequentist tools for significance testing is that they always give you a simple answer. For example, if we plug the numbers from the first case above (T_A = 200, S_A = 120, T_B = 200, and S_B = 100) into Evan Miller’s calculator, we get:

Chi-Squared test example

Unfortunately, both Bayesian calculators that I’m aware of have some limitations. Plugging the same numbers into the calculators by PeakConversion and Lyst would inform you that the probability of A being best is approximately 0.98, but neither will tell you the best way forward given this information. PeakConversion also outputs the 95% success rate intervals for A (between 53.1% and 66.7%) and B (between 43.1% and 56.9%), but it doesn’t let users set the prior values α_0 and β_0 (it uses α_0 = β_0 = 0.5). The ability to set priors based on what we know about our experimental setting is an important feature of Bayesian statistics that can help reduce the number of false positives. Hiding the priors in PeakConversion’s calculator makes it easier to use but less powerful than Lyst’s tool. In addition, Lyst’s calculator presents the distribution of differences between the success rates of A and B, i.e., the effect size. This is important because we may not bother implementing certain changes if the effect is negligible, even if the probability of one variant being better than the other is very close to 1.

Despite being more powerful, I find Lyst’s calculator just a bit too technical. Specifically, setting the α_0 and β_0 priors requires some familiarity with the beta distribution, which many people don’t have. Also, the effect size distribution is important, but can be hard to get one’s head around. Therefore, I decided to extend Lyst’s calculator, aiming to release a new tool that is both powerful and easy to use.

Building the new calculator

The source code for Lyst’s calculator is available on GitHub, so I decided to use that as the foundation of the new calculator. The first step was to convert the code from HTML, CSS, and JavaScript to Jade, Sass, and CoffeeScript, and clean up some code duplication. As the calculator is served from my GitHub Pages domain, it was easiest to put all the code in that repository. Once I had an environment and codebase that I was happy with, it was time to make functional changes:

  • Change the layout to be responsive, so it’d work well on mobile devices.
  • Enable sharing of results by changing the URL when the input changes.
  • Provide clear instructions, so that the calculator can be used by people who don’t necessarily have a strong background in statistics.
  • Allow users to set priors using quantities that are more familiar than the beta distribution’s α_0 and β_0 parameters.
  • Make a clear and well-justified recommendation on how to proceed.

While the first two changes were straightforward to implement, the other points were somewhat more challenging. Specifically, providing clear explanations that assume little background knowledge isn’t simple, and I still feel that the current version of the new calculator is a bit too wordy (this may be improved in the future based on user feedback – suggestions welcome). Life would be easier if everyone thought of observed values as being drawn from distributions, but in my experience this is not always the case. However, I believe it is important to communicate the reality of uncertainty, so I don’t want to hide it from users of the calculator, even at the price of more elaborate explanations.

Making the priors more intuitive was a bit tricky. At first, I thought I’d let users state their prior knowledge in terms of the mean and variance of past performance, relying on the fact that for Beta(α, β) the mean μ is α / (α + β), and the variance σ² is αβ / ((α + β)²(α + β + 1)). The problem is that while the mean is simple to set, as it is always in the (0, 1) range, the upper bound for the variance depends on the mean. Specifically, it can be shown that the variance is in the range (0, μ(1 – μ)). Therefore, I decided to let users quantify their uncertainty about the mean as a number u in the range (0, 1), where σ² = uμ(1 – μ). Having played with the calculator a bit, I think this makes it easier to set good informative priors. It is also worth noting that I considered allowing users to set different priors for the control and test group, but decided against it to reduce complexity. In addition, it makes sense to have the same prior for both groups – if you have a strong belief or knowledge about which one is going to perform better, you probably don’t need to run an experiment.
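Concretely, here is a sketch of the conversion from a prior mean and variance to the beta parameters via the method of moments. The helper name is mine, and the last lines map u to a variance using the relationship stated above; the calculator itself may scale its uncertainty input differently, so treat that part as illustrative:

```python
def beta_params(mean, variance):
    """Method-of-moments conversion from (mean, variance) to Beta(alpha, beta) parameters.

    Valid for 0 < mean < 1 and 0 < variance < mean * (1 - mean).
    """
    assert 0 < variance < mean * (1 - mean), 'variance must be below mean * (1 - mean)'
    total = mean * (1 - mean) / variance - 1  # this is alpha + beta
    return mean * total, (1 - mean) * total

# Illustrative: a prior mean of 50% with uncertainty u, where variance = u * mean * (1 - mean).
prior_mean, u = 0.5, 1 / 3
alpha_0, beta_0 = beta_params(prior_mean, u * prior_mean * (1 - prior_mean))
print(alpha_0, beta_0)  # recovers the uniform prior under this mapping, i.e., 1.0 1.0
```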

One of the main reasons I decided to build the calculator was because I wanted a tool that outputs a clear recommendation. This proved to be the most challenging (and interesting) part of this project, as there are quite a few options for Bayesian stopping rules. After reading David Robinson’s review of the limitations of a stopping rule based on the expected loss, and a few of the other resources mentioned in his post, I decided to go with a combination of the third and fourth rules tested by John Kruschke. These rules rely on a threshold of caring, which is the minimum effect size that is seen as significant by the user. For example, if we’re running experiments on the conversion rate of a landing page, we may decide that we don’t care if the absolute change in conversion rate is less than 0.1%. Given this threshold and data from the experiment, the following recommendations are possible:

  1. Stop the experiment and implement either variant, because the difference between the variants is smaller than the threshold.
  2. Stop the experiment and implement the winning variant, because the difference between the variants is greater than the threshold.
  3. Keep running the experiment, because there isn’t enough data to make a decision.

Formally, Kruschke’s rules work as follows. Given the minimum effect threshold t, we define a region of practical equivalence (ROPE) to zero difference as the interval [-t, t]. Then, we compare the ROPE to the 95% high density interval (HDI) of the distribution of differences between A and B. When comparing the ROPE and HDI, there are three options that correspond to the recommendations above (a code sketch of the comparison follows the list):

  1. The HDI is completely contained in the ROPE (stop the experiment and implement either variant).
  2. The intersection between the ROPE and HDI is empty (stop the experiment and implement the winning variant).
  3. The ROPE and HDI overlap, but the HDI is not contained in the ROPE (keep running the experiment).
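Below is a rough sketch of this comparison, reusing the difference samples from the earlier snippet, approximating the 95% HDI with the central 95% interval for simplicity (the calculator uses a proper HDI), and including the precision refinement described in the next paragraph; the function name and structure are mine:

```python
import numpy as np

def rope_recommendation(diff_samples, min_effect, precision=0.8, interval_mass=0.95):
    """Compare the ROPE [-min_effect, min_effect] with a credible interval of the
    posterior difference samples and return a stopping recommendation."""
    tail = 100 * (1 - interval_mass) / 2
    low, high = np.percentile(diff_samples, [tail, 100 - tail])
    narrow_enough = (high - low) < precision * 2 * min_effect  # ROPE width is 2 * min_effect
    confidence = 'high' if narrow_enough else 'low'
    if -min_effect <= low and high <= min_effect:
        return f'stop and implement either variant ({confidence} confidence)'
    if high < -min_effect or low > min_effect:
        winner = 'A' if np.mean(diff_samples) > 0 else 'B'
        return f'stop and implement variant {winner} ({confidence} confidence)'
    return 'keep running the experiment'

# Example: diff_samples = samples_A - samples_B from the earlier snippet, 1% minimum effect.
# print(rope_recommendation(samples_A - samples_B, min_effect=0.01))
```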

Kruschke’s post shows that making the rule more restrictive by adding a notion of user-settable precision can reduce the rate of false positives. The idea is to stop only if the HDI is narrower than precision multiplied by the width of the ROPE. Intuitively, this forces the experimenter to collect more data because it makes the posterior distributions narrower (as shown by the charts above). I found it hard to explain the idea of precision, and didn’t want to confuse users by adding another parameter, so I decided to use a constant precision value of 0.8. If the ROPE and HDI don’t overlap, the tool makes a recommendation to stop, accompanied by a binary level of confidence: high if the precision condition is met, and low otherwise.

Putting in the numbers from the running example (T_A = 200, S_A = 120, T_B = 200, and S_B = 100) together with a minimum effect of 1%, a prior success rate of 50%, and 57.74% uncertainty (equivalent to α_0 = β_0 = 1), we get the following output:

Calculator recommendation example

The full results also include plots of the distributions and their high density intervals. I’m pretty happy with the richer information provided by the calculator, though it still has some limitations and areas that can be improved.

Limitations and potential improvements

As mentioned above, I’d love to reduce the wordiness of the calculator while keeping it self-contained, but I need some feedback to understand if any explanations are redundant. It’d also be great to reduce the reliance on magic numbers, such as the 95% HDI and 0.8 precision used for generating a recommendation. However, making these settable by users would increase the complexity of using the calculator, which is already harder to use than the frequentist alternative. Nonetheless, it’s important to remember that oversimplification is the reason why it’s easier to make the wrong decision when following the classical approach.

Other potential changes include switching to a closed-form formula rather than draws from a distribution, comparing more than two variants, and improving Kruschke’s stopping rules by simulating more scenarios than those considered in his post. In addition, I’d like to go beyond binary responses (success/failure) to support continuous rewards (e.g., revenue), and allow users to specify different costs for the variants (e.g., implementing B may cost more than sticking with A).

Finally, it is important to keep in mind that significance testing can’t tell you whether your sample is representative of the population. For example, if you run an experiment on a very popular website, you can get a sample of thousands of people within a few minutes. Concluding an experiment based on such a sample is probably a bad idea, as it is plausible that you would reach different conclusions if you kept running the experiment for a few days, to reduce the effect that the time of day has on the results. Similarly, a few days may not be enough if your user population behaves differently on weekends – you would need to run the experiment over a few weeks. This can be extended to months and years to rule out seasonal effects, but it is up to the experimenter to weigh the practicality of considering such factors versus the need to make decisions (see articles by Peep Laja, Martin Goodson, Sam Ju, and Kohavi et al. for more details). The main thing to remember is that you just cannot completely eliminate uncertainty and the need to consider background knowledge, which is why I believe that helping more people follow the Bayesian approach is a step in the right direction.

Correlation and causation XKCD: https://xkcd.com/552/

Why you should stop worrying about deep learning and deepen your understanding of causality instead

Everywhere you go these days, you hear about deep learning’s impressive advancements. New deep learning libraries, tools, and products get announced on a regular basis, making the average data scientist feel like they’re missing out if they don’t hop on the deep learning bandwagon. However, as Kamil Bartocha put it in his post The Inconvenient Truth About Data Science, 95% of tasks do not require deep learning. This is obviously a made up number, but it’s probably an accurate representation of the everyday reality of many data scientists. This post discusses an often-overlooked area of study that is of much higher relevance to most data scientists than deep learning: causality.

Causality is everywhere

An understanding of cause and effect is something that is not unique to humans. For example, the many videos of cats knocking things off tables appear to exemplify experimentation by animals. If you are not familiar with such videos, it can easily be fixed. The thing to notice is that cats appear genuinely curious about what happens when they push an object. And they tend to repeat the experiment to verify that if you push something off, it falls to the ground.

Humans rely on much more complex causal analysis than that done by cats – an understanding of the long-term effects of one’s actions is crucial to survival. Science, as defined by Wikipedia, is a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about the universe. Causal analysis is key to producing explanations and predictions that are valid and sound, which is why understanding causality is so important to data scientists, traditional scientists, and all humans.

What is causality?

It is surprisingly hard to define causality. Just like cats, we all have an intuitive sense of what causality is, but things get complicated on deeper inspection. For example, few people would disagree with the statement that smoking causes cancer. But does it cause cancer immediately? Would smoking a few cigarettes today and never again cause cancer? Do all smokers develop cancer eventually? What about light smokers who live in areas with heavy air pollution?

Samantha Kleinberg summarises it very well in her book, Why: A Guide to Finding and Using Causes:

While most definitions of causality are based on Hume’s work, none of the ones we can come up with cover all possible cases and each one has counterexamples another does not. For instance, a medication may lead to side effects in only a small fraction of users (so we can’t assume that a cause will always produce an effect), and seat belts normally prevent death but can cause it in some car accidents (so we need to allow for factors that can have mixed producer/preventer roles depending on context).

The question often boils down to whether we should see causes as a fundamental building block or force of the world (that can’t be further reduced to any other laws), or if this structure is something we impose. As with nearly every facet of causality, there is disagreement on this point (and even disagreement about whether particular theories are compatible with this notion, which is called causal realism). Some have felt that causes are so hard to find as for the search to be hopeless and, further, that once we have some physical laws, those are more useful than causes anyway. That is, “causes” may be a mere shorthand for things like triggers, pushes, repels, prevents, and so on, rather than a fundamental notion.

It is somewhat surprising, given how central the idea of causality is to our daily lives, but there is simply no unified philosophical theory of what causes are, and no single foolproof computational method for finding them with absolute certainty. What makes this even more challenging is that, depending on one’s definition of causality, different factors may be identified as causes in the same situation, and it may not be clear what the ground truth is.

Why study causality now?

While it’s hard to conclusively prove, it seems to me like interest in formal causal analysis has increased in recent years. My hypothesis is that it’s just a natural progression along the levels of data’s hierarchy of needs. At the start of the big data boom, people were mostly concerned with storing and processing large amounts of data (e.g., using Hadoop, Elasticsearch, or your favourite NoSQL database). Just having your data flowing through pipelines is nice, but not very useful, so the focus switched to reporting and visualisation to extract insights about what happened (commonly known as business intelligence). While having a good picture of what happened is great, it isn’t enough – you can make better decisions if you can predict what’s going to happen, so the focus switched again to predictive analytics. Those who are familiar with predictive analytics know that models often end up relying on correlations between the features and the predicted labels. Using such models without considering the meaning of the variables can lead us to erroneous conclusions, and potentially harmful interventions. For example, based on the following graph we may make a recommendation that the US government decrease its spending on science to reduce the number of suicides by hanging.

US science spending versus suicides

Source: Spurious Correlations by Tyler Vigen

Causal analysis aims to identify factors that are independent of spurious correlations, allowing stakeholders to make well-informed decisions. It is all about getting to the top of the DIKW (data-information-knowledge-wisdom) pyramid by understanding why things happen and what we can do to change the world. However, finding true causes can be very hard, especially in cases where you can’t perform experiments. Judea Pearl explains it well:

We know, from first principles, that any causal conclusion drawn from observational studies must rest on untested causal assumptions. Cartwright (1989) named this principle ‘no causes in, no causes out,’ which follows formally from the theory of equivalent models (Verma and Pearl, 1990); for any model yielding a conclusion C, one can construct a statistically equivalent model that refutes C and fits the data equally well.

What this means in practice is that you can’t, for example, conclusively prove that smoking causes cancer without making some reasonable assumptions about the mechanisms at play. For ethical reasons, we can’t perform a randomised controlled trial where a test group is forced to smoke for years while a control group is forced not to smoke. Therefore, our conclusions about the causal link between smoking and cancer are drawn from observational studies and an understanding of the mechanisms by which various cancers develop (e.g., the effect of cigarette smoke on individual cells can be studied without forcing people to smoke). Tobacco companies have exploited this fact for years, making the claim that the probability of both cancer and smoking is raised by some mysterious genetic factors. Fossil fuel and food companies use similar arguments to sell their products and block attempts to regulate their industries (as discussed in previous posts on the hardest parts of data science and nutritionism). Fighting against such arguments is an uphill battle, as it is easy to sow doubt with a few simplistic catchphrases, while proving and communicating causality to laypeople is much harder (or impossible when it comes to deeply-held irrational beliefs).

My causality journey is just beginning

My interest in formal causal analysis was seeded a couple of years ago, with a reading group that was dedicated to Judea Pearl’s work. We didn’t get very far, as I was a bit disappointed with what causal calculus can and cannot do. This may have been because I didn’t come in with the right expectations – I expected a black box that automatically finds causes. Recently reading Samantha Kleinberg’s excellent book Why: A Guide to Finding and Using Causes has made my expectations somewhat more realistic:

Thousands of years after Aristotle’s seminal work on causality, hundreds of years after Hume gave us two definitions of it, and decades after automated inference became a possibility through powerful new computers, causality is still an unsolved problem. Humans are prone to seeing causality where it does not exist and our algorithms aren’t foolproof. Even worse, once we find a cause it’s still hard to use this information to prevent or produce an outcome because of limits on what information we can collect and how we can understand it. After looking at all the cases where methods haven’t worked and researchers and policy makers have gotten causality really wrong, you might wonder why you should bother.

[…]

Rather than giving up on causality, what we need to give up on is the idea of having a black box that takes some data straight from its source and emits a stream of causes with no need for interpretation or human intervention. Causal inference is necessary and possible, but it is not perfect and, most importantly, it requires domain knowledge.

Kleinberg’s book is a great general intro to causality, but it intentionally omits the mathematical details behind the various methods. I am now ready to once again go deeper into causality, perhaps starting with Kleinberg’s more technical book, Causality, Probability, and Time. Other recommendations are very welcome!

Cover image source: xkcd: Correlation

DIKW pyramid

This holiday season, give me real insights

Merriam-Webster defines an insight as an understanding of the true nature of something. Many companies seem to define an insight as any piece of data or information, which I would call a pseudo-insight. This post surveys some examples of pseudo-insights, and discusses how these can be built upon to provide real insights.

Exhibit A: WordPress stats

This website is hosted on wordpress.com. I’m generally happy with WordPress – though it’s not as exciting and shiny as newer competitors, it is rock-solid and very feature-rich. An example of a great WordPress feature is the new stats area (available under wordpress.com/stats if you have a WordPress website). This area includes an insights page, which is full of prime examples of pseudo-insights.

At the top of the insights page, there is a visualisation of posting activity. As the image below shows, this isn’t very interesting for websites like mine. I already know that I post irregularly, because writing a blog post is time-consuming. I suspect that this visualisation isn’t very useful even for more active multi-author blogs, as it is essentially just a different way of displaying the raw data of post dates. Without joining this data with other information, we won’t gain a better understanding of how the blog is performing and why it performs the way it does.

WordPress insights: posting activity

An attempt to extract more meaningful insights from posting times appears further down the page, in the form of a widget that tells you the most popular day and hour. The help text says: “This is the day and hour when you have been getting the most Views on average. The best timing for publishing a post may be around this period.” Unfortunately, I’m pretty certain that this isn’t true in my case. Monday happens to be the most popular day because that’s when I published two of my most popular posts, and I usually try to spread the word about a new post as soon as I publish it. Further, blog posts can become popular a long time after publication, so it is unlikely that the best timing for publishing a post is around Monday 3pm.

WordPress insights: most popular day and hour

What would real WordPress insights look like? If we stick to the idea of exploring the effect of publication timing, I would be curious to know if there is indeed a link between when a post is published and its popularity. Automattic (the company behind WordPress) is in a position to test this, as they can explore data from millions of blogs. My gut feeling is that the time of publication has a negligible effect on popularity. Things that matter much more are a post’s title, content, and effective distribution channels. Given the amount of data that they have, Automattic data scientists can definitely explore all of these factors. This would allow them to surface insights that will help authors drive more quality traffic to their websites.

Exhibit B: Facebook page insights

As anyone who manages a Facebook page probably knows, Facebook provides pretty rich analytics of pages on their platform. For example, you can see the likes you’ve received over time and how your posts perform, and slice and dice this information in various ways. This is a great feature, but again, calling it insights is a misuse of the word and somewhat of an insult for those of us who work to extract real insights from data. An analytics dashboard is not insights.

Facebook page insights

What would real Facebook page insights look like? Working off the assumption that people manage a Facebook page to reach and engage their audience, real insights would enhance a page administrator’s understanding of their audience and improve their ability to engage them and reach new people. However, Facebook is famous for having a conflict of interest here, because they require you to pay to reach more people. For example, if a post you shared is performing better than usual, Facebook will send you a notification, asking you to pay to boost the post further. It would be better if they told you what has caused this post to reach more people, and how to reproduce this success with future posts (for free). But this is very unlikely to happen. In the words of CGP Grey: “professional sharers cannot trust the platforms upon which they stand, audiences cannot trust the platform to show what they asked to see.”

Exhibit C: LinkedIn profile views

Who’s viewed your profile is a popular LinkedIn feature. A key part of this feature is a graph that includes your weekly profile views together with actions taken on LinkedIn. The official LinkedIn blog calls this graph the insights graph and provides some examples for its uses:

So, for example, if you are trying to attract new clients or business leads, you can see how many potential partners looked at your profile after you joined an important industry group. Or, if you’re looking for a new job, you can look at your insights graph to see whether adding a skill to your profile or endorsing a peer gave you a bigger bump in views by recruiters. No matter your goal, you’ll be able to see which actions lead to the most relevant profile views – then start reaching out and closing the sale or applying for your dream job.

As the examples show, the so-called insights graph merely provides information about past actions and profile views on the LinkedIn platform. It is up to you to come up with the insights, but this may be hard if you consider only the actions taken within the walled garden of LinkedIn. For example, as shown in the following graph, my profile views received a boost on the week starting November 23, which was mostly due to publishing a popular post on this website. In general, social networks such as LinkedIn, Twitter, and Facebook tend to have a very narrow view of the world – as if the only interesting things happen on the platform. In reality, most of the action happens off-platform, either within other digital assets or in the physical world.

LinkedIn profile views

What would real LinkedIn insights look like? First, I think that the focus on profile views is somewhat misguided. It’s not that hard to artificially generate profile views – simply view other people’s profiles. There is no intrinsic value in someone having viewed your profile – the value comes from a connection that leads to an interesting offer or conversation. Second, LinkedIn is about professional networking that is based on real-world activity. As such, it only forms a small part of the world of professional networking by allowing people to have an online presence that makes them contactable by people they don’t already know. When it comes to insights, it’d be useful to know the true causal factors that lead to interesting connections – much more useful than suggestions such as add software development as a skill on your profile to get up to 3% more profile views.

Summary: Real insights are about the why

There are many other examples of pseudo-insights out there. The reason is probably that the field of analytics is becoming increasingly commoditised, and it is easier to rebrand an analytics dashboard as an insights dashboard than to provide real insights. Providing real insights requires moving up the DIKW pyramid from data and information to knowledge and wisdom – from describing the past to learning general lessons that allow you to influence the future. Providing real insights can be very hard, as it often requires inferring the causes of events – the why that comes after the what and how. More on this later – I have just started reading Samantha Kleinberg’s Why: A Guide to Finding and Using Causes and will report (hopefully real) insights on causality in future posts.