# Making Bayesian A/B testing more accessible

Much has been written in recent years on the pitfalls of using traditional hypothesis testing with online A/B tests. A key issue is that you’re likely to end up with many false positives if you repeatedly check your results and stop as soon as you reach statistical significance. One way of dealing with this issue is by following a Bayesian approach to deciding when the experiment should be stopped. While I find the Bayesian view of statistics much more intuitive than the frequentist view, it can be quite challenging to explain Bayesian concepts to laypeople. Hence, I decided to build a new Bayesian A/B testing calculator, which aims to make these concepts clear to any user. This post discusses the general problem and existing solutions, followed by a review of the new tool and how it can be improved further.

## The problem

The classic A/B testing problem is as follows. Suppose we run an experiment where we have a control group and a test group. Participants (typically website visitors) are allocated to groups randomly, and each group is presented with a different variant of the website or page (e.g., variant A is assigned to the control group and variant B is assigned to the test group). Our aim is to increase the overall number of binary successes, where success can be defined as clicking a button or opening a new account. Hence, we track the number of trials in each group together with the number of successes. For a given group, the number of successes divided by number of trials is the group’s raw success rate.

Given the results of an experiment (trials and successes for each group), there are a few questions we would typically like to answer:

1. Should we choose variant A or variant B to maximise our success rate?
2. How much would our success rate change if we chose one variant over the other?
3. Do we have enough data or should we keep experimenting?

It’s important to note some points that might be obvious, but are often overlooked. First, we run an experiment because we assume that it will help us uncover a causal link, where something about A or B is hypothesised to cause people to behave differently, thereby affecting the overall success rate. Second, we want to make a decision and choose either A or B, rather than maintain multiple variants and present the best variant depending on a participant’s features (a problem that’s addressed by contextual bandits, for example). Third, online A/B testing is different from traditional experiments in a lab, because we often have little control over the characteristics of our participants, and when, where, and how they choose to interact with our experiment. This is an important point, because it means that we may need to wait a long time until we get a representative sample of the population. In addition, the raw numbers of trials and successes can’t tell us whether the sample is representative.

## Bayesian solutions

Many blog posts have been written on how to use Bayesian statistics to answer the above questions, so I won’t get into too much detail here (see the posts by David Robinson, Maciej Kula, Chris Stucchio, and Evan Miller if you need more background). The general idea is that we assume that the success rates for the control and test variants are drawn from Beta(αA, βA) and Beta(αB, βB), respectively, where Beta(α, β) is the beta distribution with shape parameters α and β (which yields values in the [0, 1] interval). As the experiment runs, we update the parameters of the distributions – each success gets added to the group’s α, and each unsuccessful trial gets added to the group’s β. It is often reasonable to assume that the prior (i.e., initial) values of α and β are the same for both variants. If we denote the prior values of the parameters with α0 and β0, and the number of successes and trials for group x with Sx and Tx respectively, we get that the success rates are distributed according to Beta(α0 + SA, β0 + TA – SA) for control and Beta(α0 + SB, β0 + TB – SB) for test.

For example, if α0 = β0 = 1, TA = 200, SA = 120, TB = 200, and SB = 100, plotting the probability density functions yields the following chart (A – blue, B – red):

Given these distributions, we can calculate the most probable range for the success rate of each variant, and estimate the difference in success rate between the variants. These can be calculated by deriving closed formulas, or by drawing samples from each distribution. In addition, it is important to note that the distributions change as we gather more data, even if the raw success rates don’t. For example, multiplying each count by 10 to obtain TA = 2000, SA = 1200, TB = 2000, and SB = 1000 doesn’t change the success rates, but it does change the distributions – they become much narrower:

In the second case we’ve gathered ten times the data, which made the distributions much more distinct. Intuitively, this means we can now be more confident that the success rate of A is higher than that of B. Quantifying this confidence and deciding when to conclude the experiment isn’t straightforward, and should depend on factors that aren’t fully captured by the raw counts. The way I chose to address this issue is presented below, after briefly discussing existing calculators and their limitations.

## Existing online calculators

The beauty of frequentist tools for significance testing is that they always give you a simple answer. For example, if we plug the numbers from the first case above (TA = 200, SA = 120, TB = 200, and SB = 100) into Evan Miller’s calculator, we get:

Unfortunately, both Bayesian calculators that I’m aware of have some limitations. Plugging the same numbers into the calculators by PeakConversion and Lyst would inform you that the probability of A being best is approximately 0.98, but it won’t tell you what’s the best way forward given this information. PeakConversion also outputs the 95% success rate intervals for A (between 53.1% and 66.7%) and B (between 43.1% and 56.9%), but it doesn’t let users set the prior values α0 and β0 (it uses α0 = β0 = 0.5). The ability to set priors based on what we know about our experimental setting is an important feature of Bayesian statistics that can help reduce the number of false positives. Hiding the priors in PeakConversion’s calculator makes it easier to use but less powerful than Lyst’s tool. In addition, Lyst’s calculator presents the distribution of differences between the success rates of A and B, i.e., the effect size. This is important because we may not bother implementing certain changes if the effect is negligible, even if the probability of one variant being better than the other is very close to 1.

Despite being more powerful, I find Lyst’s calculator just a bit too technical. Specifically, setting the α0 and β0 priors requires some familiarity with the beta distribution, which many people don’t have. Also, the effect size distribution is important, but can be hard to get one’s head around. Therefore, I decided to extend Lyst’s calculator, aiming to release a new tool that is both powerful and easy to use.

## Building the new calculator

The source code for Lyst’s calculator is available on GitHub, so I decided to use that as the foundation of the new calculator. The first step was to convert the code from HTML, CSS, and JavaScript to Jade, Sass, and CoffeeScript, and clean up some code duplication. As the calculator is served from my GitHub Pages domain, it was easiest to put all the code in that repository. Once I had an environment and codebase that I was happy with, it was time to make functional changes:

• Change the layout to be responsive, so it’d work well on mobile devices.
• Enable sharing of results by changing the URL when the input changes.
• Provide clear instructions, so that the calculator can be used by people who don’t necessarily have a strong background in statistics.
• Allow users to set priors based on more familiar figures than the beta distribution’s α0 and β0 priors.
• Make a clear and well-justified recommendation on how to proceed.

While the first two changes were straightforward to implement, the other points were somewhat more challenging. Specifically, providing clear explanations that assume little background knowledge isn’t simple, and I still feel that the current version of the new calculator is a bit too wordy (this may be improved in the future based on user feedback – suggestions welcome). Life would be easier if everyone thought of observed values as being drawn from distributions, but in my experience this is not always the case. However, I believe it is important to communicate the reality of uncertainty, so I don’t want to hide it from users of the calculator, even at the price of more elaborate explanations.

Making the priors more intuitive was a bit tricky. At first, I thought I’d let users state their prior knowledge in terms of the mean and variance of past performance, relying on the fact that for Beta(α, β) the mean μ is α / (α + β), and the variance σ2 is αβ / (α + β)2(α + β + 1). The problem is that while the mean is simple to set, as it is always in the (0, 1) range, the upper bound for the variance depends on the mean. Specifically, it can be shown that the variance is in the range (0, μ(1 – μ)). Therefore, I decided to let users quantify their uncertainty about the mean as a number u in the range (0, 1), where σ2 = uμ(1 – μ). Having played with the calculator a bit, I think this makes it easier to set good informative priors. It is also worth noting that I considered allowing users to set different priors for the control and test group, but decided against it to reduce complexity. In addition, it makes sense to have the same prior for both groups – if you have a strong belief or knowledge on which one is going to perform better, you probably don’t need to run an experiment.

One of the main reasons I decided to build the calculator was because I wanted a tool that outputs a clear recommendation. This proved to be the most challenging (and interesting) part of this project, as there are quite a few options for Bayesian stopping rules. After reading David Robinson’s review of the limitations of a stopping rule based on the expected loss, and a few of the other resources mentioned in his post, I decided to go with a combination of the third and fourth rules tested by John Kruschke. These rules rely on a threshold of caring, which is the minimum effect size that is seen as significant by the user. For example, if we’re running experiments on the conversion rate of a landing page, we may decide that we don’t care if the absolute change in conversion rate is less than 0.1%. Given this threshold and data from the experiment, the following recommendations are possible:

1. Stop the experiment and implement either variant, because the difference between the variants is smaller than the threshold.
2. Stop the experiment and implement the winning variant, because the difference between the variants is greater than the threshold.
3. Keep running the experiment, because there isn’t enough data to make a decision.

Formally, Kruschke’s rules work as follows. Given the minimum effect threshold t, we define a region of practical equivalence (ROPE) to zero difference as the interval [-tt]. Then, we compare the ROPE to the 95% high density interval (HDI) of the distribution of differences between A and B. When comparing the ROPE and HDI, there are three options that correspond to the recommendations above:

1. The ROPE is completely contained in the HDI (stop the experiment and implement either variant).
2. The intersection between the ROPE and HDI is empty (stop the experiment and implement the winning variant).
3. The ROPE and HDI only partly overlap (keep running the experiment).

Kruschke’s post shows that making the rule more restrictive by adding a notion of user-settable precision can reduce the rate of false positives. The idea is to stop only if the HDI is narrower than precision multiplied by the width of the ROPE. Intuitively, this forces the experimenter to collect more data because it makes the posterior distributions narrower (as shown by the charts above). I found it hard to explain the idea of precision, and didn’t want to confuse users by adding another parameter, so I decided to use a constant precision value of 0.8. If the ROPE and HDI don’t overlap, the tool makes a recommendation to stop, accompanied by a binary level of confidence: high if the precision condition is met, and low otherwise.

Putting in the numbers from the running example (TA = 200, SA = 120, TB = 200, and SB = 100) together with a minimum effect of 1%, prior success rate of 50%, and 57.74% uncertainty (equivalent to α0 = β0 = 1), we get the following output:

The full results also include plots of the distributions and their high density intervals. I’m pretty happy with the richer information provided by the calculator, though it still has some limitations and areas that can be improved.

## Limitations and potential improvements

As mentioned above, I’d love to reduce the wordiness of the calculator while keeping it self-contained, but I need some feedback to understand if any explanations are redundant. It’d also be great to reduce the reliance on magic numbers, such as the 95% HDI and 0.8 precision used for generating a recommendation. However, making these settable by users would increase the complexity of using the calculator, which is already harder to use than the frequentist alternative. Nonetheless, it’s important to remember that oversimplification is the reason why it’s easier to make the wrong decision when following the classical approach.

Other potential changes include switching to a closed-form formula rather than draws from a distribution, comparing more than two variants, and improving Kruschke’s stopping rules by simulating more scenarios than those considered in his post. In addition, I’d like to go beyond binary responses (success/failure) to support continuous rewards (e.g., revenue), and allow users to specify different costs for the variants (e.g., implementing B may cost more than sticking with A).

Finally, it is important to keep in mind that significance testing can’t tell you whether your sample is representative of the population. For example, if you run an experiment on a very popular website, you can get a sample of thousands of people within a few minutes. Concluding an experiment based on such a sample is probably a bad idea, as it is plausible that you would reach different conclusions if you kept running the experiment for a few days, to reduce the effect that the time of day has on the results. Similarly, a few days may not be enough if your user population behaves differently on weekends – you would need to run the experiment over a few weeks. This can be extended to months and years to rule out seasonal effects, but it is up to the experimenter to weigh the practicality of considering such factors versus the need to make decisions (see articles by Peep Laja, Martin Goodson, Sam Ju, and Kohavi et al. for more details). The main thing to remember is that you just cannot completely eliminate uncertainty and the need to consider background knowledge, which is why I believe that helping more people follow the Bayesian approach is a step in the right direction.

# Diving deeper into causality: Pearl, Kleinberg, Hill, and untested assumptions

Background: I have previously written about the need for real insights that address the why behind events, not only the what and how. This was followed by a fairly popular post on causality, which was heavily influenced by Samantha Kleinberg’s book Why: A Guide to Finding and Using Causes. This post continues my exploration of the field, and is primarily based on Kleinberg’s previous book: Causality, Probability, and Time.

The study of causality and causal inference is central to science in general and data science in particular. Being able to distinguish between correlation and causation is key to designing effective interventions in business, public policy, medicine, and many other fields. There are quite a few approaches to inferring causal relationships from data. In this post, I discuss some aspects of Judea Pearl’s graphical modelling approach, and how its limitations are addressed in recent work by Samantha Kleinberg. I then finish with a brief survey of the Bradford Hill criteria and their applicability to a key limitation of all causal inference methods: The need for untested assumptions.

## Overcoming my Pearl bias

First, I must disclose that I have a personal bias in favour of Pearl’s work. While I’ve never met him, Pearl is my academic grandfather – he was the PhD advisor of my main PhD supervisor (Ingrid Zukerman). My first serious exposure to his work was through a Sydney reading group, where we discussed parts of Pearl’s approach to causal inference. Recently, I refreshed my knowledge of Pearl causality by reading Causal inference in statistics: An overview. I am by no means an expert in Pearl’s huge body of work, but I think I understand enough of it to write something of use.

Pearl’s theory of causality employs Bayesian networks to represent causal structures. These are directed acyclic graphs, where each vertex represents a variable, and an edge from X to Y implies that X causes Y. Pearl also introduces the `do(X)` operator, which simulates interventions by removing all the causes of X, setting it to a constant. There is much more to this theory, but two of its main contributions are the formalisation of causal concepts that are often given only a verbal treatment, and the explicit encoding of causal assumptions. These assumptions must be made by the modeller based on background knowledge, and are encoded in the graph’s structure – a missing edge between two vertices indicates that there is no direct causal relationship between the two variables.

My main issue with Pearl’s treatment of causality is that he doesn’t explicitly handle time. While time can be encoded into Pearl’s models (e.g., via dynamic Bayesian networks), there is nothing that prevents creation of models where the future causes changes in the past. A closely-related issue is that Pearl’s causal models must be directed acyclic graphs, making it hard to model feedback loops. For example, Pearl says that “mud does not cause rain”, but this isn’t true – water from mud evaporates, causing rain (which causes mud). What’s true is that “mud now doesn’t cause rain now” or something along these lines, which is something that must be accounted for by adding temporal information to the models.

Nonetheless, Pearl’s theory is an important step forward in the study of causality. In his words, “in the bulk of the statistical literature before 2000, causal claims rarely appear in the mathematics. They surface only in the verbal interpretation that investigators occasionally attach to certain associations, and in the verbal description with which investigators justify assumptions.” The importance of formal causal analysis cannot be overstated, as it underlies many decisions that affect our lives. However, it seems to me like there’s still plenty of work to be done before causal analysis becomes as established as other statistical tools.

## Kleinberg: Addressing gaps in Pearl’s work

I recently finished reading Samantha Kleinberg’s Causality, Probability, and Time. Kleinberg dedicates a good portion of the book to presenting the history of causality and discussing its many definitions. As hinted by the book’s title, Kleinberg believes that one cannot discuss causality without considering time. In her words: “One of the most critical pieces of information about causality, though – the time it takes for the cause to produce its effect – has been largely ignored by both philosophical theories and computational methods. If we do not know when the effect will occur, we have little hope of being able to act successfully using the causal relationship.” Following this assertion, Kleinberg presents a new approach to causal inference that is based on probabilistic computation tree logic (PCTL). With PCTL, one can concisely express probabilistic temporal statements. For example, if we observe a potential cause c occurring at time t, and a possible effect e occurring at time t’, we can use PCTL to state the hypothesis that in general, after c becomes true, it takes between one and |t’ – t| time units for e to become true with probability at least p, i.e., c leads to e:

It is obvious why PCTL may be a better fit than Bayesian networks for expressing causal statements. For example, with a Bayesian network, we can easily express the statement that smoking causes lung cancer with probability 0.3, but this isn’t that useful, as it doesn’t tell us how long it’ll take for cancer to develop. With PCTL, we can state that smoking causes lung cancer in 5-30 years with probability at least 0.3. This matches our knowledge that cancer doesn’t develop immediately – one cigarette won’t kill you.

One of the key concepts introduced by Kleinberg is that of causal significance. Calculating the causal significance of a cause c to an effect e relies on first identifying the set X of potential (or prima facie) causes of e. The set X contains all discrete variables x such that E[e|x]≠E[e] and x occurs earlier than e. Given the set X, the causal significance of c to e is the mean of E[e|c∧x] – E[e|¬c∧x] for all x≠c. The intuition is that if a cause c is significant, its causal significance value will be high when other potential causes are held fixed. For example, if c is heavy smoking and e is severity of lung cancer (with e=0 meaning no cancer), the expected value of e given c is likely to be higher than the expected value of e given ¬c, when conditioned on any other potential cause. Once causal significance has been measured, we can separate significant causes from insignificant causes by setting a threshold on causal significance values (this threshold can be inferred from the data). Significant causes are considered to be genuine if the data is stationary and the common causes of all pairs of variables have been included, which is a very strong condition that may be hard to fulfil in realistic scenarios. However, causal significance is an evolving concept – last year, Huang and Kleinberg introduced a new definition of causal significance that can be inferred faster and yield more accurate results. My general feeling is that this line of research will continue to yield many interesting and useful results in coming years.

Kleinberg’s work is not without its limitations. In addition to the assumptions that causal relationships are stationary and the requirement to identify all potential causes, the recently-introduced definition of causal significance also requires the relationships to be linear and additive (though this limitation may be relaxed in future work). Another issue is that most of the evaluation in the studies I’ve read was done on synthetic datasets. While there are some results on real-life health and finance data, I find it hard to judge the practicality of utilising Kleinberg’s methods without applying them to problems that I’m more familiar with. Finally, as with other work in the field of causal inference, we need to have some degree of belief in untested assumptions to reach useful conclusions. In Kleinberg’s words:

Thus, a just so cause is genuine in the case where all of the outlined assumptions hold (namely that all common causes are included, the structure is representative of the system and, when data is used, a formula satisfied by the data will be satisfied by the structure). Our belief in whether a cause is genuine, in the case where it is not certain that the assumptions hold, should be proportional to how much we believe that the assumptions are true.

## Hill: Testing untested assumptions

To the best of my knowledge, all causal inference methods rely on untested assumptions. Specifically, we can never include all the variables in the universe in our models. Therefore, any conclusions drawn are reliant on deciding what, when, and how to measure potential causes and effects. Another issue is that no matter how good and believable our modelling is, we cannot use causal inference to convince unreasonable people. For example, some people may cite divine intervention as an unmeasurable cause of anything and everything. In addition, people with certain commercial interests often try to raise doubt about well-established causal mechanisms by making unreasonable claims for evidence of various hidden factors. For example, tobacco companies used to claim that both smoking and lung cancer were caused by a common hidden factor, making the link between smoking and lung cancer a mere association.

Assuming that we are dealing with reasonable people, there’s still the question of where we should get our untested assumptions from. This question is fairly old, and has been partly answered in 1965 by Austin Bradford Hill, with nine criteria that he recommended should be considered before calling an association causal:

1. Strength: How strong is the association? For example, lung cancer deaths of heavy smokers are 20-30 times greater than those of non-smokers.
2. Consistency: Has the association been repeatedly observed in various circumstances? For example, many different populations have exhibited an association between smoking rates and cancer.
3. Specificity: Can we pin down specific instances of the effect to specific instances of the cause? Hill sees this as a nice-to-have condition rather than a must-have – cases with multiple possible causes may not fulfil the specificity requirement.
4. Temporality: Do we know that c leads to e or are we observing them together? This is a condition that isn’t always easy to fulfil, especially when dealing with feedback loops and slow processes.
5. Biological gradient: Hill’s focus was on medicine, and this condition refers to the association exhibiting some dose-response curve. This can be generalised to other fields, as we can expect some regularity in the effect if it is a function of the cause (though it doesn’t have to be a linear function).
6. Plausibility: Do we know of a mechanism that can explain how the cause brings about the effect?
7. Coherence: Does the association conflict with our current knowledge? Even if it does, it isn’t enough to rule out causality, as our current knowledge may be incomplete or wrong.
8. Experiment: If possible, running controlled experiments may yield very powerful evidence in favour of causation.
9. Analogy: Do we know of any similar cause-and-effect relationships?

Hill summarises the list of criteria (or viewpoints) with the following statements.

Here then are nine different viewpoints from all of which we should study association before we cry causation. What I do not believe – and this has been suggested – is that we can usefully lay down some hard-and-fast rules of evidence that must be obeyed before we accept cause and effect. None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non. What they can do, with greater or less strength, is to help us to make up our minds on the fundamental question – is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect?

No formal tests of significance can answer those questions. Such tests can, and should, remind us of the effects that the play of chance can create, and they will instruct us in the likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our hypothesis.

Hill then goes on to criticise the increased focus on statistical significance as a condition for accepting scientific papers for publication. Remembering that this was over 50 years ago, it is a bit worrying that it has taken so long for the statistical community to formally acknowledge the fact that statistical significance does not imply scientific importance, or constitutes enough evidence to support a causal hypothesis.

## Closing thoughts

This post has only scratched the surface of the vast field of study of causality. At this point, I feel like I’ve read quite a bit, and it is time to apply what I learned to real problems. I encounter questions of causality in my everyday work, but haven’t fully applied formal causal inference to any problem yet. My view is that everyone needs to at least be aware of the need to consider causality, and of what it’d take to truly prove causal impact. A large proportion of what many people need in practice may be addressed by Hill’s criteria, rather than by formal methods for causal analysis. Nonetheless, I will report back when I get a chance to apply formal causal inference to real datasets. Stay tuned!

# The rise of greedy robots

Given the impressive advancement of machine intelligence in recent years, many people have been speculating on what the future holds when it comes to the power and roles of robots in our society. Some have even called for regulation of machine intelligence before it’s too late. My take on this issue is that there is no need to speculate – machine intelligence is already here, with greedy robots already dominating our lives.

## Machine intelligence or artificial intelligence?

The problem with talking about artificial intelligence is that it creates an inflated expectation of machines that would be completely human-like – we won’t have true artificial intelligence until we can create machines that are indistinguishable from humans. While the goal of mimicking human intelligence is certainly interesting, it is clear that we are very far from achieving it. We currently can’t even fully simulate C. elegans, a 1mm worm with 302 neurons. However, we do have machines that can perform tasks that require intelligence, where intelligence is defined as the ability to learn or understand things or to deal with new or difficult situations. Unlike artificial intelligence, there is no doubt that machine intelligence already exists.

Airplanes provide a famous example: we don’t commonly think of them as performing artificial flight – they are machines that fly faster than any bird. Likewise, computers are super-intelligent machines. They can perform calculations that humans can’t, store and recall enormous amounts of information, translate text, play Go, drive cars, and much more – all without requiring rest or food. The robots are here, and they are becoming increasingly useful and powerful.

## Who are those greedy robots?

Greed is defined as a selfish desire to have more of something (especially money). It is generally seen as a negative trait in humans. However, we have been cultivating an environment where greedy entities – for-profit organisations – thrive. The primary goal of for-profit organisations is to generate profit for their shareholders. If these organisations were human, they would be seen as the embodiment of greed, as they are focused on making money and little else. Greedy organisations “live” among us and have been enjoying a plethora of legal rights and protections for hundreds of years. These entities, which were formed and shaped by humans, now form and shape human lives.

Humans running for-profit organisations have little choice but to play by their rules. For example, many people acknowledge that corporate tax avoidance is morally wrong, as revenue from taxes supports the infrastructure and society that enable corporate profits. However, any executive of a public company who refuses to do everything they legally can to minimise their tax bill is likely to lose their job. Despite being separate from the greedy organisations we run, humans have to act greedily to effectively serve their employers.

Greedy robots are all around us. Advertising-driven companies like Facebook and Google use sophisticated algorithms to get people to click on ads. Retail companies like Amazon use machine intelligence to mine through people’s shopping history and generate product recommendations. Banks and mutual funds utilise algorithmic trading to drive their investments. None of this is science fiction, and it doesn’t take much of a leap to imagine a world where greedy robots are even more dominant. Just like we have allowed greedy legal entities to dominate our world and shape our lives, we are allowing greedy robots to do the same, just more efficiently and pervasively.

## Will robots take your job?

The growing range of machine intelligence capabilities gives rise to the question of whether robots are going to take over human jobs. One salient example is that of self-driving cars, that are projected to render millions of professional drivers obsolete in the next few decades. The potential impact of machine intelligence on jobs was summarised very well by CGP Grey in his video Humans Need Not Apply. The main message of the video is that machines will soon be able to perform any job better or more cost-effectively than any human, thereby making humans unemployable for economic reasons. The video ends with a call to society to consider how to deal with a future where there are simply no jobs for a large part of the population.

Despite all the technological advancements since the start of the industrial revolution, the prevailing mode of wealth distribution remains paid labour, i.e., jobs. The implication of this is that much of the work we do is unnecessary or harmful – people work because they have no other option, but their work doesn’t necessarily benefit society. This isn’t a new insight, as the following quotes demonstrate:

• “Most men appear never to have considered what a house is, and are actually though needlessly poor all their lives because they think that they must have such a one as their neighbors have. […] For more than five years I maintained myself thus solely by the labor of my hands, and I found that, by working about six weeks in a year, I could meet all the expenses of living.” – Henry David Thoreau, Walden (1854)
• “I think that there is far too much work done in the world, that immense harm is caused by the belief that work is virtuous, and that what needs to be preached in modern industrial countries is quite different from what always has been preached. […] Modern technique has made it possible to diminish enormously the amount of labor required to secure the necessaries of life for everyone. […] If, at the end of the war, the scientific organization, which had been created in order to liberate men for fighting and munition work, had been preserved, and the hours of the week had been cut down to four, all would have been well. Instead of that the old chaos was restored, those whose work was demanded were made to work long hours, and the rest were left to starve as unemployed.” – Bertrand Russell, In Praise of Idleness (1932)
• “In the year 1930, John Maynard Keynes predicted that technology would have advanced sufficiently by century’s end that countries like Great Britain or the United States would achieve a 15-hour work week. There’s every reason to believe he was right. In technological terms, we are quite capable of this. And yet it didnâ€™t happen. Instead, technology has been marshaled, if anything, to figure out ways to make us all work more. In order to achieve this, jobs have had to be created that are, effectively, pointless. Huge swathes of people, in Europe and North America in particular, spend their entire working lives performing tasks they secretly believe do not really need to be performed. The moral and spiritual damage that comes from this situation is profound. It is a scar across our collective soul. Yet virtually no one talks about it.” – David Graeber, On the Phenomenon of Bullshit Jobs (2013)

This leads to the conclusion that we are unlikely to experience the utopian future in which intelligent machines do all our work, leaving us ample time for leisure. Yes, people will lose their jobs. But it is not unlikely that new unnecessary jobs will be invented to keep people busy, or worse, many people will simply be unemployed and will not get to enjoy the wealth provided by technology. Stephen Hawking summarised it well recently:

If machines produce everything we need, the outcome will depend on how things are distributed. Everyone can enjoy a life of luxurious leisure if the machine-produced wealth is shared, or most people can end up miserably poor if the machine-owners successfully lobby against wealth redistribution. So far, the trend seems to be toward the second option, with technology driving ever-increasing inequality.

## Where to from here?

Many people believe that the existence of powerful greedy entities is good for society. Indeed, there is no doubt that we owe many beneficial technological breakthroughs to competition between for-profit companies. However, a single-minded focus on profit means that in many cases companies do what they can to reduce their responsibility for harmful side-effects of their activities. Examples include environmental pollution, multinational tax evasion, and health effects of products like tobacco and junk food. As history shows us, in truly unregulated markets, companies would happily utilise slavery and child labour to reduce their costs. Clearly, some regulation of greedy entities is required to obtain the best results for society.

With machine intelligence becoming increasingly powerful every day, some people think that to produce the best outcomes, we just need to wait for robots to be intelligent enough to completely run our lives. However, as anyone who has actually built intelligent systems knows, the outputs of such systems are strongly dependent on the inputs and goals set by system designers. Machine intelligence is just a tool – a very powerful tool. Like nuclear energy, we can use it to improve our lives, or we can use it to obliterate everything around us. The collective choice is ours to make, but is far from simple.

Everywhere you go these days, you hear about deep learning’s impressive advancements. New deep learning libraries, tools, and products get announced on a regular basis, making the average data scientist feel like they’re missing out if they don’t hop on the deep learning bandwagon. However, as Kamil Bartocha put it in his post The Inconvenient Truth About Data Science, 95% of tasks do not require deep learning. This is obviously a made up number, but it’s probably an accurate representation of the everyday reality of many data scientists. This post discusses an often-overlooked area of study that is of much higher relevance to most data scientists than deep learning: causality.

## Causality is everywhere

An understanding of cause and effect is something that is not unique to humans. For example, the many videos of cats knocking things off tables appear to exemplify experimentation by animals. If you are not familiar with such videos, it can easily be fixed. The thing to notice is that cats appear genuinely curious about what happens when they push an object. And they tend to repeat the experiment to verify that if you push something off, it falls to the ground.

Humans rely on much more complex causal analysis than that done by cats – an understanding of the long-term effects of one’s actions is crucial to survival. Science, as defined by Wikipedia, is a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about the universe. Causal analysis is key to producing explanations and predictions that are valid and sound, which is why understanding causality is so important to data scientists, traditional scientists, and all humans.

## What is causality?

It is surprisingly hard to define causality. Just like cats, we all have an intuitive sense of what causality is, but things get complicated on deeper inspection. For example, few people would disagree with the statement that smoking causes cancer. But does it cause cancer immediately? Would smoking a few cigarettes today and never again cause cancer? Do all smokers develop cancer eventually? What about light smokers who live in areas with heavy air pollution?

Samantha Kleinberg summarises it very well in her book, Why: A Guide to Finding and Using Causes:

While most definitions of causality are based on Hume’s work, none of the ones we can come up with cover all possible cases and each one has counterexamples another does not. For instance, a medication may lead to side effects in only a small fraction of users (so we can’t assume that a cause will always produce an effect), and seat belts normally prevent death but can cause it in some car accidents (so we need to allow for factors that can have mixed producer/preventer roles depending on context).

The question often boils down to whether we should see causes as a fundamental building block or force of the world (that can’t be further reduced to any other laws), or if this structure is something we impose. As with nearly every facet of causality, there is disagreement on this point (and even disagreement about whether particular theories are compatible with this notion, which is called causal realism). Some have felt that causes are so hard to find as for the search to be hopeless and, further, that once we have some physical laws, those are more useful than causes anyway. That is, “causes” may be a mere shorthand for things like triggers, pushes, repels, prevents, and so on, rather than a fundamental notion.

It is somewhat surprising, given how central the idea of causality is to our daily lives, but there is simply no unified philosophical theory of what causes are, and no single foolproof computational method for finding them with absolute certainty. What makes this even more challenging is that, depending on oneâ€™s definition of causality, different factors may be identified as causes in the same situation, and it may not be clear what the ground truth is.

## Why study causality now?

While it’s hard to conclusively prove, it seems to me like interest in formal causal analysis has increased in recent years. My hypothesis is that it’s just a natural progression along the levels of data’s hierarchy of needs. At the start of the big data boom, people were mostly concerned with storing and processing large amounts of data (e.g., using Hadoop, Elasticsearch, or your favourite NoSQL database). Just having your data flowing through pipelines is nice, but not very useful, so the focus switched to reporting and visualisation to extract insights about what happened (commonly known as business intelligence). While having a good picture of what happened is great, it isn’t enough – you can make better decisions if you can predict what’s going to happen, so the focus switched again to predictive analytics. Those who are familiar with predictive analytics know that models often end up relying on correlations between the features and the predicted labels. Using such models without considering the meaning of the variables can lead us to erroneous conclusions, and potentially harmful interventions. For example, based on the following graph we may make a recommendation that the US government decrease its spending on science to reduce the number of suicides by hanging.

Causal analysis aims to identify factors that are independent of spurious correlations, allowing stakeholders to make well-informed decisions. It is all about getting to the top of the DIKW (data-information-knowledge-wisdom) pyramid by understanding why things happen and what we can do to change the world. However, finding true causes can be very hard, especially in cases where you can’t perform experiments. Judea Pearl explains it well:

We know, from first principles, that any causal conclusion drawn from observational studies must rest on untested causal assumptions. Cartwright (1989) named this principle ‘no causes in, no causes out,’ which follows formally from the theory of equivalent models (Verma and Pearl, 1990); for any model yielding a conclusion C, one can construct a statistically equivalent model that refutes C and fits the data equally well.

What this means in practice is that you can’t, for example, conclusively prove that smoking causes cancer without making some reasonable assumptions about the mechanisms at play. For ethical reasons, we can’t perform a randomly controlled trial where a test group is forced to smoke for years while a control group is forced not to smoke. Therefore, our conclusions about the causal link between smoking and cancer are drawn from observational studies and an understanding of the mechanisms by which various cancers develop (e.g., the effect of cigarette smoke on individual cells can be studied without forcing people to smoke). Cancer Tobacco companies have exploited this fact for years, making the claim that the probability of both cancer and smoking is raised by some mysterious genetic factors. Fossil fuel and food companies use similar arguments to sell their products and block attempts to regulate their industries (as discussed in previous posts on the hardest parts of data science and nutritionism). Fighting against such arguments is an uphill battle, as it is easy to sow doubt with a few simplistic catchphrases, while proving and communicating causality to laypeople is much harder (or impossible when it comes to deeply-held irrational beliefs).

## My causality journey is just beginning

My interest in formal causal analysis was seeded a couple of years ago, with a reading group that was dedicated to Judea Pearl’s work. We didn’t get very far, as I was a bit disappointed with what causal calculus can and cannot do. This may have been because I didn’t come in with the right expectations – I expected a black box that automatically finds causes. Recently reading Samantha Kleinberg’s excellent book Why: A Guide to Finding and Using Causes has made my expectations somewhat more realistic:

Thousands of years after Aristotle’s seminal work on causality, hundreds of years after Hume gave us two definitions of it, and decades after automated inference became a possibility through powerful new computers, causality is still an unsolved problem. Humans are prone to seeing causality where it does not exist and our algorithms aren’t foolproof. Even worse, once we find a cause it’s still hard to use this information to prevent or produce an outcome because of limits on what information we can collect and how we can understand it. After looking at all the cases where methods havenâ€™t worked and researchers and policy makers have gotten causality really wrong, you might wonder why you should bother.

[…]

Rather than giving up on causality, what we need to give up on is the idea of having a black box that takes some data straight from its source and emits a stream of causes with no need for interpretation or human intervention. Causal inference is necessary and possible, but it is not perfect and, most importantly, it requires domain knowledge.

Kleinberg’s book is a great general intro to causality, but it intentionally omits the mathematical details behind the various methods. I am now ready to once again go deeper into causality, perhaps starting with Kleinberg’s more technical book, Causality, Probability, and Time. Other recommendations are very welcome!

Cover image source: xkcd: Correlation

# The joys of offline data collection

Many modern data scientists don’t get to experience data collection in the offline world. Recently, I spent a month sailing down the northern Great Barrier Reef, collecting data for the Reef Life Survey project. In addition to being a great diving experience, the trip helped me obtain general insights on data collection and machine learning, which are shared in this article.

## The Reef Life Survey project

Reef Life Survey (RLS) is a citizen scientist project, led by a team from the University of Tasmania. The data collected by RLS volunteers is freely available on the RLS website, and has been used for producing various reports and scientific publications. An RLS survey is performed along a 50 metre tape, which is laid at a constant depth following a reef’s contour. After laying the tape, one diver takes photos of the bottom at 2.5 metre intervals along the transect line. These photos are automatically analysed to classify the type of substrate or growth (e.g., hard coral or sand). Divers then complete two swims along each side of the transect. On the first swim (method 1), divers record all the fish species and large swimming animals found in a 5 metre corridor from the line. The second swim (method 2) requires keeping closer to the bottom and looking under ledges and vegetation in a 1 metre corridor from the line, targeting invertebrates and cryptic animals. The RLS manual includes all the details on how surveys are performed.

Performing RLS surveys is not a trivial task. In the tropics, it is not uncommon to record around 100 fish species on method 1. The scientists running the project are very conscious of the importance of obtaining high-quality data, so training to become an RLS volunteer takes considerable effort and dedication. The process generally consists of doing surveys together with an experienced RLS diver, and comparing the data after each dive. Once the trainee’s data matches that of the experienced RLSer, they are considered good enough to perform surveys independently. However, retraining is often required when surveying new ecoregions (e.g., an RLSer trained in Sydney needs further training to survey the Great Barrier Reef).

RLS requires a lot of hard work, but there are many reasons why it’s worth the effort. As someone who cares about marine conservation, I like the fact that RLS dives yield useful data that is used to drive environmental management decisions. As a scuba diver, I enjoy the opportunity to dive places that are rarely dived and the enhanced knowledge of the marine environment – doing surveys makes me notice things that I would otherwise overlook. Finally, as a data scientist, I find the exposure to the work of marine scientists very educational.

## Pre-training and thoughts on supervised learning

Doing surveys in the tropics is a completely different story from surveying temperate reefs, due to the substantially higher diversity and abundance of marine creatures. Producing high-quality results requires being able to identify most creatures underwater, while doing the survey. It is possible to write down descriptions and take photos of unidentified species, but doing this for a large number of species is impractical.

Training the neural network in my head to classify tropical fish by species was an interesting experience. The approach that worked best was making flashcards using reveal.js, photos scraped from various sources, and past survey data. As the image below shows, each flashcard consists of a single photo, and pressing the down arrow reveals the name of the creature. With some basic JavaScript, I made the presentation select a different subset of photos on each load. Originally, I tried to learn all the 1000+ species that were previously recorded in the northern Great Barrier Reef, but this proved to be too hard – I realised that a better strategy was needed. The strategy that I chose was to focus on the most frequently-recorded species: I started by memorising the most frequent ones (e.g., those recorded on more than 50% of surveys), and gradually made it more challenging by decreasing the frequency threshold (e.g., to 25% in 5% steps). This proved to be pretty effective – by the time I started diving I could identify about 50-100 species underwater, even though I had mostly been using static images. It’d be interesting to know whether this kind of approach would be effective in training neural networks (or other batch-trained models) in certain scenarios – spend a few epochs training with instances from a subset of the classes, and gradually increase the number of considered classes. This may be effective when errors on certain classes are more important than others, and may yield different results from simply weighting classes or instances. Please let me know if you know of anyone who has experimented with this idea (update: gwern from Reddit pointed me to the paper Curriculum Learning by Bengio et al., which discusses this idea).

RLS flashcard example (Chaetodon lunulatus)

While repeatedly looking at photos and their labels felt a lot like training an artificial neural network, as a human I have the advantage of being able to easily use information from multiple sources. For example, fish ID books such as Reef Fish Identification: Tropical Pacific provide concise descriptions of the identifying physical features of each fish (see the image below for the book’s entry for Chaetodon lunulatus – the butterflyfish from the flashcard above). Reading those descriptions made me learn more effectively, by helping me focus my attention on the parts that matter for classification. Learning only from static images can be hard when classifying creatures with highly variable colour schemes – using extraneous knowledge about what actually matters when it comes to classification is the way to go in practice. Further, features that are hard to decode from photos – like behaviour and habitat – are sometimes crucial to distinguishing different species. One interesting thought is that while photos can be seen as raw data, natural language descriptions are essentially models. Utilising such models is likely to be of benefit in many areas. For example, being able to tell a classifier what to look for in an image would make training a supervised classifier more similar to the way humans learn. This may be achieved using similar techniques to those used for generating image descriptions, except that the goal would be to use descriptions of the classes to improve classification accuracy.

Fish ID example (Chaetodon lunulatus). Source: Reef Fish Identification: Tropical Pacific

Another difference between my learning and supervised machine learning is that if I found a creature hard to identify, I would go and look for more photos or videos of them. Videos were especially valuable, because in practice I rarely had to identify static creatures. This approach may be applicable in situations where labelled data is abundant. Sometimes, using all the labelled data makes model training too slow to be practical. An approach I used in the past to overcome this issue is to randomly sample the data, but it often makes sense to sample in a way that yields the best model, e.g., by sampling more instances from classes that are harder to classify.

One similarity to supervised machine learning that I encountered was the danger of overfitting. Due to the relatively small number of photos and the fact that I had to view each one of them multiple times, I found that in some cases I memorised the entire photo rather than the creature. This was especially the case with low-quality photos or ones that were missing key features. My regularisation approach consisted of trying to memorise the descriptions from the book, and collecting more photos. I wish more algorithms were this self-conscious about overfitting!

## Can’t this be automated?

While doing surveys and studying species, I kept asking myself whether the whole thing can be automated. Thanks to deep learning, computers have recently gotten very good at classifying images, sometimes outperforming humans. It seems likely that at some point the survey methodology would be changed to just taking a video of the dive, and letting an algorithm do the hard job of identifying the creatures. Analysis of the bottom photos is automated, so it is reasonable to automate the other survey methods as well. However, there are quite a few challenges that need to be overcome before full automation can be implemented.

If the results of the LifeCLEF 2015 Fish Task are any indication, we are quite far from automating fish identification. The precision of the top methods in that challenge was around 80% for identifying 15 fish species from underwater videos, where the chosen species are quite distinct from each other. In tropical surveys it is not uncommon to record around 100 fish species along the 50 metre transect, with many species being similar to each other. It’s usually the case that it’s not same species on every dive (even at the same site), so replacing humans would require training a highly accurate classifier on thousands of species.

Dealing with high diversity isn’t the only challenge in automating RLS. The appearance of many species varies by gender and age, so the classifier would have to learn all those variations (see image below for an example). Getting good training data can be very challenging, since the labelling process is labour-intensive, and elements like colour and backscatter are highly dependent on dive site conditions and the quality of the camera. Another complication is that RLS data includes size estimates, which can be hard to obtain from videos and photos without knowing how far the camera was from the subject and the type of lens used. In addition, accounting for side information (geolocation, behaviour, depth, etc.) can make a huge difference in accurately identifying species, but it isn’t easy to integrate with some learning models. Finally, it is likely that some species will be missed when videos are taken without any identification done underwater, because RLSers tend to get good photos of species that they know will be hard to identify, even if it means spending more time at one spot or shining strobes under ledges.

Chlorurus sordidus variations. Source: Tropical Marine Fishes of Australia

Another aspect of automating surveys is completely removing the need for human divers by sending robots down. This is an active research area, and is the only way of surveying deep waters. However, this approach still requires a boat-based crew to deploy the robots. It may also yield different data from RLS for cryptic species, though this depends on the type of robots used. In addition, there’s the issue of cost – RLS relies on volunteer scuba divers who are diving anyway, so the cost of getting RLSers to do surveys is rather low (especially for shore dives near a diver’s home, where there is no cost to RLS). Further, RLS’s mission is “to inspire and engage a global volunteer community to survey reefs using scientific methods and share knowledge about marine ecosystem health”. Engaging the community is a crucial part of RLS because robots do not care about the environment. Humans do.

## Small data is valuable

When compared to datasets commonly encountered online, RLS data is small. As the image below shows, fewer than 10,000 surveys have been conducted to date. However, this data is still valuable, as it provides a high-quality snapshot of the state of marine ecosystems in areas that wouldn’t be surveyed if it wasn’t for RLS volunteers. For example, in a recent Nature article, the authors used RLS data to assess the vulnerability of marine fauna to global warming.

RLS surveys by Australian financial year (July-June). Source: RLS Foundation Annual Report 2015

Each RLS survey requires several hours of work. In addition to performing the survey itself, a lot of work goes into entering the data and verifying its quality. Getting to the survey sites is not always a trivial task, especially for remote sites such as some of those we dived on my recent trip. Spending a month diving the Great Barrier Reef is a good way of appreciating its greatness. As the map shows, the surveys we did covered only the top part of the reef’s 2300 kilometres, and we only sampled a few sites within that part. The Great Barrier Reef is very vast, and it is hard to convey its vastness with just words or a map. You have to be there to understand – it is quite humbling.

In summary, the RLS experience has given me a new appreciation for small data in the offline world. Offline data collection is often expensive and labour-intensive – you need to work hard to produce a few high-quality data points. But the size of your data doesn’t matter (though having more quality data is always good). What really matters is what you do with the data – and the RLS team and their collaborators have been doing quite a lot. The RLS experience also illustrates the importance of domain expertise: I’ve looked at the RLS datasets, but I have no idea what questions are worth asking and answering using those datasets. The RLS project is yet another example of how in science collecting data is time-consuming, and coming up with appropriate research questions is hard. It is a lot of fun, though.

# This holiday season, give me real insights

Merriam-Webster defines an insight as an understanding of the true nature of something. Many companies seem to define an insight as any piece of data or information, which I would call a pseudo-insight. This post surveys some examples of pseudo-insights, and discusses how these can be built upon to provide real insights.

## Exhibit A: WordPress stats

This website is hosted on wordpress.com. I’m generally happy with WordPress – though it’s not as exciting and shiny as newer competitors, it is rock-solid and very feature-rich. An example of a great WordPress feature is the new stats area (available under wordpress.com/stats if you have a WordPress website). This area includes an insights page, which is full of prime examples of pseudo-insights.

At the top of the insights page, there is a visualisation of posting activity. As the image below shows, this isn’t very interesting for websites like mine. I already know that I post irregularly, because writing a blog post is time-consuming. I suspect that this visualisation isn’t very useful even for more active multi-author blogs, as it is essentially just a different way of displaying the raw data of post dates. Without joining this data with other information, we won’t gain a better understanding of how the blog is performing and why it performs the way it does.

An attempt to extract more meaningful insights from posting times appears further down the page, in the form of a widget that tells you the most popular day and hour. The help text says that This is the day and hour when you have been getting the most Views on average. The best timing for publishing a post may be around this period. Unfortunately, I’m pretty certain that this isn’t true in my case. Monday happens to be the most popular day because that’s when I published two of my most popular posts, and I usually try to spread the word about a new post as soon as I publish it. Further, blog posts can become popular a long time after publication, so it is unlikely that the best timing for publishing a post is around Monday 3pm.

What would real WordPress insights look like? If we stick to idea of exploring the effect of publication timing, I would be curious to know if there is indeed a link between when a post is published and its popularity. Automattic (the company behind WordPress) is in a position to test this, as they can explore data from millions of blogs. My gut feeling is that the time of publication has a negligible effect on popularity. Things that matter much more are a post’s title, content, and effective distribution channels. Given the amount of data that they have, Automattic data scientists can definitely explore all of these factors. This would allow them to surface insights that will help authors drive more quality traffic to their websites.

## Exhibit B: Facebook page insights

As anyone who manages a Facebook page probably knows, Facebook provides pretty rich analytics of pages on their platform. For example, you can see the likes you’ve received over time and how your posts perform, and slice and dice this information in various ways. This is a great feature, but again, calling it insights is a misuse of the word and somewhat of an insult for those of us who work to extract real insights from data. An analytics dashboard is not insights.

What would real Facebook page insights look like? Working off the assumption that people manage a Facebook page to reach and engage their audience, real insights would enhance a page administrator’s understanding of their audience and improve their ability to engage them and reach new people. However, Facebook is famous for having a conflict of interest here, because they require you to pay to reach more people. For example, if a post you shared is performing better than usual, Facebook will send you a notification, asking you to pay to boost the post further. It would be better if they told you what has caused this post to reach more people, and how to reproduce this success with future posts (for free). But this is very unlikely to happen. In the words of CGP Grey: professional sharers cannot trust the platforms upon which they stand, audiences cannot trust the platform to show what they asked to see.

## Exhibit C: LinkedIn profile views

Who’s viewed your profile is a popular LinkedIn feature. A key part of this feature is a graph that includes your weekly profile views together with actions taken on LinkedIn. The official LinkedIn blog calls this graph the insights graph and provides some examples for its uses:

So, for example, if you are trying to attract new clients or business leads, you can see how many potential partners looked at your profile after you joined an important industry group. Or, if you’re looking for a new job, you can look at your insights graph to see whether adding a skill to your profile or endorsing a peer gave you a bigger bump in views by recruiters. No matter your goal, you’ll be able to see which actions lead to the most relevant profile views – then start reaching out and closing the sale or applying for your dream job.

As the examples show, the so-called insights graph merely provides information about past actions and profile views on the LinkedIn platform. It is up to you to come up with the insights, but this may be hard if you consider only the actions taken within the walled garden of LinkedIn. For example, as shown in the following graph, my profile views received a boost on the week starting November 23, which was mostly due to publishing a popular post on this website. In general, social networks such as LinkedIn, Twitter, and Facebook tend to have a very narrow view of the world – as if the only interesting things happen on the platform. In reality, most of the action happens off-platform, either within other digital assets or in the physical world.

What would real LinkedIn insights look like? First, I think that the focus on profile views is somewhat misguided. It’s not that hard to artificially generate profile views – simply view other people’s profiles. There is no intrinsic value in someone having viewed your profile – the value comes from a connection that leads to an interesting offer or conversation. Second, LinkedIn is about professional networking that is based on real-world activity. As such, it only forms a small part of the world of professional networking by allowing people to have an online presence that makes them contactable by people they don’t already know. When it comes to insights, it’d be useful to know the true causal factors that lead to interesting connections – much more useful than suggestions such as add software development as a skill on your profile to get up to 3% more profile views.

## Summary: Real insights are about the why

There are many other examples of pseudo-insights out there. The reason is probably that the field of analytics is becoming increasingly commoditised, and it is easier to rebrand an analytics dashboard as an insights dashboard than to provide real insights. Providing real insights requires moving up the DIKW pyramid from data and information to knowledge and wisdom – from describing the past to learning general lessons that allow you to influence the future. Providing real insights can be very hard, as it often requires inferring the causes of events – the why that comes after the what and how. More on this later – I have just started reading Samantha Kleinberg’s Why: A Guide to Finding and Using Causes and will report (hopefully real) insights on causality in future posts.

# The hardest parts of data science

Contrary to common belief, the hardest part of data science isn’t building an accurate model or obtaining good, clean data. It is much harder to define feasible problems and come up with reasonable ways of measuring solutions. This post discusses some examples of these issues and how they can be addressed.

## The not-so-hard parts

Before discussing the hardest parts of data science, it’s worth quickly addressing the two main contenders: model fitting and data collection/cleaning.

Model fitting is seen by some as particularly hard, or as real data science. This belief is fuelled in part by the success of Kaggle, that calls itself the home of data science. Most Kaggle competitions are focused on model fitting: Participants are given a well-defined problem, a dataset, and a measure to optimise, and they compete to produce the most accurate model. Coupling Kaggle’s excellent marketing with their competition setup leads many people to believe that data science is all about fitting models. In reality, building reasonably-accurate models is not that hard, because many model-building phases can easily be automated. Indeed, there are many companies that offer model fitting as a service (e.g., Microsoft, Amazon, Google and others). Even Ben Hamner, CTO of Kaggle, has said that he is “surprised at the number of ‘black box machine learning in the cloud’ services emerging: model fitting is easy. Problem definition and data collection are not.”

Data collection/cleaning is the essential part that everyone loves to hate. DJ Patil (US Chief Data Scientist) is quoted as saying that “the hardest part of data science is getting good, clean data. Cleaning data is often 80% of the work.” While I agree that collecting data and cleaning it can be a lot of work, I don’t think of this part as particularly hard. It’s definitely important and may require careful planning, but in many cases it just isn’t very challenging. In addition, it is often the case that the data is already given, or is collected using previously-developed methods.

## Problem definition is hard

There are many reasons why problem definition can be hard. It is sometimes due to stakeholders who don’t know what they want, and expect data scientists to solve all their data problems (either real or imagined). This type of situation is summarised by the following Dilbert strip. It is best handled by cleverly managing stakeholder expectations, while stirring them towards better-defined problems.

Well-defined problems are great, for the obvious reason that they can actually be addressed. Examples of such problems include:

• Build a model to predict the sales of a marketing campaign
• Create a system that runs campaigns that automatically adapt to customer feedback
• Identify key objects in images
• Improve click-through rates on search engine results, ads, or any other element
• Detect whale calls from underwater recordings to prevent collisions

Often, it can be hard to get to the stage where the problem is agreed on, because this requires dealing with people who only have a fuzzy idea of what can be done with data science. Dilbertian situations aside, these people often have real problems that they care about, so exploring the core issues with them is time well-spent.

## Solution measurement is often harder than problem definition

Many problems that actually matter have solutions that are really hard to measure. For example, improving the well-being of the population (e.g., a company’s customers or a country’s citizens) is an overarching problem that arises in many situations. However, this problem gives rise to the hard question of how well-being can be measured and aggregated. The following paragraphs discuss issues that occur in solution measurement, often making it the hardest part of data science.

Ideally, we would always be able to run randomised controlled trials to measure treatment effects. However, the reality is that experimental data is often censored, there many constraints on running experiments (ethics, practicality, budget, etc.), and confounding factors may make it impossible to identify the true causal impact of interventions. These issues seriously influence many aspects of our lives. I’ve written a post on how these issues manifest themselves in research on the connection between nutrition and our health. Here, I’ll discuss another major example: the health effects of smoking and anthropogenic climate change.

While smoking and anthropogenic climate change may seem unrelated, they actually have a lot in common. In both cases it is hard (or impossible) to perform experiments to determine causality, and in both cases this fact has been used to mislead the public by parties with commercial and ideological interests. In the case of smoking, due to ethical reasons, one can’t perform an experiment where a random control group is forced not to smoke, while a treatment group is forced to smoke. Further, since it can take many years for smoking-caused diseases to develop, it’d take a long time to obtain the results of such an experiment. Tobacco companies have exploited this fact for years, claiming that there may be some genetic factor that causes both smoking and a higher susceptibility to smoking-related diseases. Fortunately, we live in a world where these claims have been widely discredited, and it is now clear to most people that smoking is harmful. However, similar doubt-casting techniques are used by polluters and their supporters in the debate on anthropogenic climate change. While no serious climate scientist doubts the fact that human activities are causing climate change, this can’t be proved through experimentation on another Earth. In both cases, the answers should be clear when looking at the evidence and the mechanisms at play without an ideological bias. It doesn’t take a scientist to figure out that pumping your lungs full of smoke on a regular basis is likely to be harmful, as is pumping the atmosphere full of greenhouse gases that have been sequestered for millions of years. However, as said by Upton Sinclair, “it is difficult to get a man to understand something, when his salary depends upon his not understanding it.”

Assuming that we have addressed the issues raised so far, there is the matter of choosing a measure or metric of success. How do we know that our solution works well? A common approach is to choose a single metric to focus on, such as increasing conversion rates. However, all metrics have their flaws, and there are quite a few problems with metric selection and its maintenance over time.

First, focusing on a single metric can be harmful, because no metric is perfect. A classic example of this issue is the focus on growing the economy, as measured by gross domestic product (GDP). The article What is up with the GDP? by Frank Shostak summarises some of the problems with GDP:

The GDP framework cannot tell us whether final goods and services that were produced during a particular period of time are a reflection of real wealth expansion, or a reflection of capital consumption.

For instance, if a government embarks on the building of a pyramid, which adds absolutely nothing to the well-being of individuals, the GDP framework will regard this as economic growth. In reality, however, the building of the pyramid will divert real funding from wealth-generating activities, thereby stifling the production of wealth.

[…]

The whole idea of GDP gives the impression that there is such a thing as the national output. In the real world, however, wealth is produced by someone and belongs to somebody. In other words, goods and services are not produced in totality and supervised by one supreme leader. This in turn means that the entire concept of GDP is devoid of any basis in reality. It is an empty concept.

Shostak’s criticism comes from a right-winged viewpoint – his argument is that the GDP is used as an excuse for unnecessary government intervention with the market. However, the focus on GDP growth is also heavily-criticised by the left due to the fact that it doesn’t consider environmental effects and inequalities in the distribution of wealth. It is a bit odd that GDP growth is still considered a worthwhile goal by many people, given that it can easily be skewed by a few powerful individuals who choose to build unnecessary pyramids (though perhaps this is the real reason why the GDP persists – wealthy individuals have an interest in keeping it this way).

Even if we decide to use multiple metrics to evaluate our solution, our troubles aren’t over yet. Using multiple metrics often means that there are trade-offs between the different metrics. For example, with the precision and recall measures that are commonly used to evaluate the performance of search engines, it is rare to be able to increase both precision and recall at the same time. Precision is the percentage of relevant items out of those that have been returned, while recall is the percentage of relevant items that have been returned out of the overall number of relevant items. Hence, it is easy to artificially increase recall to 100% by always returning all the items in the database, but this would mean settling for near-zero precision. Similarly, one can increase precision by always returning a single item that the algorithm is very confident about, but this means that recall would suffer. Ultimately, the best balance between precision and recall depends on the application.

Another issue with choosing metrics is the impossibility of reliably evaluating our choices. This is summarised well by Scott Berkun in his book The Year Without Pants:

All metrics create temptations. Even with great intentions and smart minds, data runs you faster and faster into a stupid self-destructive circle. Data can’t decide things for you. It can help you see things more clearly if captured carefully, but that’s not the same as deciding. Just as there is an advice paradox, there is a data paradox: no matter how much data you have, you still depend on your intuition for deciding how to interpret and then apply the data.

Put another way, there is no good KPI for measuring KPIs. There are no good metrics for evaluating metrics (or for evaluating metrics for evaluating metrics for evaluating metrics, and on it goes).

OK, so we’ve picked some flawed measures that we can’t really evaluate, and we’ve accepted the imperfections of the evaluation process. Are we done yet? No. There’s still the small matter of Goodhart’s Law, which states that “when a measure becomes a target, it ceases to be a good measure.” This is often the case because people will tend to manipulate results and game the system (not necessarily maliciously) in order to hit measured goals. However, even without manipulation and gaming, we often deal with moving targets. Just because the measure we’ve chosen is suitable today, it doesn’t mean it will still be relevant in a few months or years because reality changes. For example, in the 1990s, the number of page views was a good measure of interaction with websites, but nowadays it is a pretty weak measure because many websites are single-page applications. Reality changes and so should our problems, solutions, measures, and goals.

## Embracing ambiguity and uncertainty

Personally, I find the complexities of measurement and problem definition quite interesting. However, many people aren’t that interested in this stuff – they just want working solutions and simple stories. As demonstrated by the examples throughout this article, over-simplification of complicated matters is a pervasive issue that goes beyond what’s commonly considered “data science”. This is why storytelling is seen as a key skill that data scientists should possess. I believe it’s also important to maintain one’s integrity and not just make up stories that people would buy, but it’d be naive to assume that this never happens. Either way, good data scientists embrace uncertainty and ambiguity, but can still tell a simple story if needed.

Note: The ideas in this post were first presented at The Sydney Data Science Breakfast Meetup Group. The slides for that talk are available here.

# Migrating a simple web application from MongoDB to Elasticsearch

Bandcamp Recommender (BCRecommender) is a web application that serves music recommendations from Bandcamp. I recently switched BCRecommender’s data store from MongoDB to Elasticsearch. This has made it possible to offer a richer search experience to users at a similar cost. This post describes the migration process and discusses some of the advantages and disadvantages of using Elasticsearch instead of MongoDB.

## Motivation: Why swap MongoDB for Elasticsearch?

I’ve written a few posts in the past on BCRecommender’s design and implementation. It is a fairly simple application with two main components: the backend worker that crawls data and generates recommendations in batch, and the webapp that serves the recommendations. Importantly, each of these components has its own data store, with the recommendations synced up from the worker to the webapp, and data like events and subscriptions synced down from the webapp to the worker. Recently, I migrated the webapp component from Parse to DigitalOcean, replacing Parse’s data store with MongoDB. Choosing MongoDB was meant to simplify the transition – Parse uses MongoDB behind the scenes, as does the backend worker. However, moving out of Parse’s sandboxed environment freed me to choose any data store, and Elasticsearch seemed like a good candidate that would make it possible to expose advanced search capabilities to end users.

Advanced search means different things to different people. In BCRecommender’s case what I had in mind was rather modest, at least for the initial stages. BCRecommender presents recommendations for two types of entities: fans and tralbums (tracks/albums). In both cases, the recommended items are tralbums. When the key is a fan, the recommendations are tralbums that they may like, and when the key is a tralbum, the recommendations are similar tralbums. Each tralbum has a title, an artist name, and a list of tags. Each fan has its Bandcamp username as a primary key, and a list of tags that is derived from the tralbums in the fan’s collection. Originally, “searching” required users to either enter the exact username of a Bandcamp fan, or the exact Bandcamp link of a tralbum – not the best user experience! Indeed, I was tracking the search terms and found that many people were unsuccessfully trying to use unstructured queries. My idea of advanced search was to move away from the original key-value approach to full-text search that considers tags, titles, artists, and other fields that may get added later.

It was clear that while it may be possible to provide advanced search with MongoDB, it wouldn’t be a smooth ride. While recent versions of MongoDB include support for full-text search, it isn’t as feature-rich as Elasticsearch. For example, MongoDB text indices do not store phrases or information about the proximity of words in the documents, making phrase queries run slowly unless the entire collection fits in memory. The names really say it all: MongoDB is a database with some search capabilities, and Elasticsearch is a search engine with some database capabilities. It seems pretty common to use MongoDB (or another database) as a data store and supply search through Elasticsearch, so I figured it isn’t a bad idea to apply this pattern to BCRecommender.

It is worth noting that if BCRecommender were a for-profit project, I would probably use Algolia rather than Elasticsearch. My experience with Algolia on a different project has been excellent – they make it easy for you to get started, have great customer service, and deliver good and fast results with minimal development and operational effort. The two main disadvantages of Algolia are its price and the fact that it’s a closed-source solution (see further discussion on Quora). At over two million records, the monthly cost of running Algolia for BCRecommender would be around US\$649, which is more than what I’m willing to spend on this project. However, for a business this may be a reasonable cost because deploying and maintaining an Elasticsearch cluster may end up costing more. Nonetheless, many businesses use Elasticsearch successfully, which is why I have no doubt that it’s a great choice for my use case – it just requires more work than Algolia to get up and running.

## Executing the migration plan

The plan for migrating the webapp from MongoDB to Elasticsearch was pretty simple:

1. Read the Elasticsearch manual to ensure it suits my needs
2. Replace MongoDB with Elasticsearch without making any user-facing changes
3. Expose full-text search to BCRecommender users
4. Improve search performance based on user behaviour
5. Implement more search features

Reading the manual is not something I do for every piece of technology I use (there are just too many tools out there these days), but for Elasticsearch it seemed to be worth the effort. I’m not done reading yet, but covering the material in the Getting Started and Search in Depth sections gave me enough information to complete steps 2 & 3. The main things I was worried about was Elasticsearch’s performance as a database and how memory-hungry it’d be. Reading the manual allowed me to avoid some memory-use pitfalls and gave me insights on the way MongoDB and Elasticsearch compare (see details below).

Switching from MongoDB to Elasticsearch as a simple database was pretty straightforward. Both are document-based, so there were no changes required to the data models, but I did use the opportunity to fix some issues. For example, I changed the sitemap generation process from dynamic to static to avoid having to scroll through the entire dataset to fetch deep sitemap pages. To support BCRecommender’s feature of browsing through random fans, I replaced MongoDB’s somewhat-hacky approach of returning random results with Elasticsearch’s cleaner method. As the webapp is implemented in Python, I originally used the elasticsearch-dsl package, but found it too hard to debug queries (e.g., figuring out how to rank results randomly was a bit of a nightmare). Instead, I ended up using the elasticsearch-py package, which is only a thin wrapper around the Elasticsearch API. This approach yields code that doesn’t look very Pythonic – rather than following the Zen of Python’s flat is better than nested aphorism, the API follows the more Java-esque belief of you can never have enough nesting (see image below for example). However, I prefer overly-nested structures that I can debug to flat code that doesn’t work. I may try using the DSL again in the future, once I’ve gained more experience with Elasticsearch.

As mentioned, one of my worries was that I would have to increase the amount of memory allocated to the machine where Elasticsearch runs. Since BCRecommender is a fairly low-budget project, I’m willing to sacrifice high availability to save a bit on operational costs. Therefore, the webapp and its data store run on the same DigitalOcean instance, which is enough to happily serve the current amount of traffic (around one request per second). By default, Elasticsearch indexes all the fields, and even includes an extra indexed _all field that is a concatenation of all string fields in a document. While indexing everything may be convenient, it wasn’t necessary for the first stage. Choosing the minimal index settings allowed me to keep using the same instance size as before (1GB RAM and 30GB SSD). In fact, due to the switch to static sitemaps and the removal of MongoDB’s random attribute hack, fewer indexes were required after the change.

Once I had all the code converted and working on my local Vagrant environment, it was time to deploy. The deployment was fairly straightforward and required no downtime, as I simply provisioned a new instance and switched over the floating IP once it was all tested and ready to go. I monitored response time and memory use closely and everything seemed to be working just fine – similarly to MongoDB. After a week of monitoring, it was time to take the next step and enable advanced search.

Enabling full-text search is where things got interesting. This phase required adding a search result page (previously users were redirected to the queried page if it was found), and reindexing the data. For this phase, I tried to keep things as simple as possible, and just indexed the string fields (tags, artist, and title) using the standard analyser. I did some manual testing of search results based on common queries, and played a bit with improving precision and recall. Perhaps the most important tweak was allowing an item’s activity level to influence the ranking. For each tralbum, the activity level is the number of fans that have the tralbum in their collection, and for each fan, it is the size of the collection. For example, when searching for amanda, the top result is the fan with username amanda, followed by tralbums by the popular Amanda Palmer. Before I added the consideration of activity level, all tralbums and fans that contained the word amanda had the same ranking.

I deployed full-text search earlier this week, and so far it’s looking pretty good. Elasticsearch seems to be coping well with having the same level of resources allocated as before, but it’s still too early to tell if this is sustainable over time. Most importantly, users are finally seeing results when they enter unstructured queries, which increases their engagement and retention. Woohoo!

Improving search performance based on user behaviour is expected to be an ongoing effort. Despite having many ideas, I resisted the temptation of endless offline tinkering and opted to release a working search page quickly. With Google Analytics now set up to track site search, the plan is keep identifying gaps and tweak the search settings continuously. This will take a while, as the number of daily users is currently 200-300, and they don’t all use site search.

Implementing more search features is another set of items on my to-do list that will be addressed over time. For example, it’d be great to have search auto-completion and a prettier result page. However, I have more ideas than time to implement them, and I’m not working on BCRecommender full-time. For now, I’m pretty happy with finally having the search function.

## Elasticsearch versus MongoDB: Key findings

Comparisons between tools should always be taken with a grain of salt. General comparisons may not address features that are important for your specific use case, or may overemphasise aspects that you don’t care about. In addition, actively developed tools are moving targets. Since I started the transition to Elasticsearch, version 2.0 has been released, and MongoDB 3.2 is expected very soon. The following list is derived from my experience and may not apply to you. You have been warned!

With the disclaimer out of the way, here are some of the advantages of Elasticsearch over MongoDB:

• Better full-text search support (duh!).
• Enforceable schemas and type validation (note: some form of optional schema is expected in MongoDB 3.2).
• All fields are indexed by default, making it easy to explore unstructured data without worrying about adding indices.
• It appears that indexing is implemented in a more efficient way that doesn’t block the node. Slowness due to indexing operations seems to be a common issue with MongoDB, even with background index creation.
• It’s possible to query multiple indices and types (same as MongoDB databases & collections, respectively) in the same query. This is a huge advantage in my case as it makes it possible to efficiently search both fans and tralbums in a single query.
• Index aliases make it easy to change the indices without changing the application.
• Multi-get by IDs returns results in the order they were requested. This is not the case with MongoDB, where using \$in doesn’t have any guarantees on the returned documents’ order. It’s easy to work around this issue, but it can be the source of subtle bugs. In my case, recommendations were unintentionally sorted in random order until I added an additional step to sort them correctly.
• Built-in support for random scoring (note: random sampling will finally be available in MondoDB 3.2 – the ticket for this has been open for 5 years).
• Built-in support for multiple types of analysis on the same field.

Some disadvantages of Elasticsearch in comparison to MongoDB are:

• All fields are indexed by default, making it easy to run into memory issues. Adjusting these default settings is strongly recommended if you know how you’re going to query the data.
• Documents are immutable, so every update requires deleting the original document and re-inserting it (in practice, it seems like this isn’t much of an issue).
• Sorting results by a field requires reading all the field’s values and sorting them in memory. The sorted results are cached, but this may cause issues if memory is too limited.

In conclusion, my experience with Elasticsearch has been mostly positive so far and I’m glad I’ve made the switch. I’m looking forward to taking further advantage of advanced search features to improve user experience on BCRecommender. New posts on the topic may be published in the future, so please subscribe to be notified when this happens. As always, I’m happy to receive feedback through the comments or privately.

# Miscommunicating science: Simplistic models, nutritionism, and the art of storytelling

I recently finished reading the book In Defense of Food: An Eater’s Manifesto by Michael Pollan. The book criticises nutritionism – the idea that one should eat according to the sum of measured nutrients while ignoring the food that contains these nutrients. The key argument of the book is that since the knowledge derived using food science is still very limited, completely relying on the partial findings and tools provided by this science is likely to lead to health issues. Instead, the author says we should “Eat food. Not too much. Mostly plants.” One of the reasons I found the book interesting is that nutritionism is a special case of misinterpretation and miscommunication of scientific results. This is something many data scientists encounter in their everyday work – finding the balance between simple and complex models, the need to “sell” models and their results to non-technical stakeholders, and the requirement for well-performing models. This post explores these issues through the example of predicting human health based on diet.

As an aside, I generally agree with the book’s message, which is backed by fairly thorough research (though it is a bit dated, as the book was released in 2008). There are many commercial interests invested in persuading us to eat things that may be edible, but shouldn’t really be considered food. These food-like products tend to rely on health claims that dumb down the science. A common example can be found in various fat-free products, where healthy fat is replaced with unhealthy amounts of sugar to compensate for the loss of flavour. These products are then marketed as healthy due to their lack of fat. The book is full of such examples, and is definitely worth reading, especially if you live in the US or in a country that’s heavily influenced by American food culture.

### Running example: Predicting a person’s health based on their diet

Predicting health based on diet isn’t an easy problem. First, how do you quantify and measure health? You could use proxies like longevity and occurrence/duration of disease, but these are imperfect measures because you can have a long unhealthy life (thanks to modern medicine) and some diseases are more unbearable than others. Another issue is that there are many factors other than diet that contribute to health, such as genetics, age, lifestyle, access to healthcare, etc. Finally, even if you could reliably study the effect of diet in isolation from other factors, there’s the question of measuring the diet. Do you measure each nutrient separately or do you look at foods and consumption patterns? Do you group foods by time (e.g., looking at overall daily or monthly patterns)? If you just looked at the raw data of foods and nutrients consumed at certain points in time, every studied subject is likely to be an outlier (due to the curse of dimensionality). The raw data on foods consumed by individuals has to be grouped in some way to build a generalisable model, but groupings necessitate removal of some data.

Modelling real-world data is rarely straightforward. Many assumptions are embedded in the measurements and models. Good scientific papers are explicit about the shortcomings and limitations of the presented work. However, by the time scientific studies make it to the real world, shortcomings and limitations are removed to present palatable (and often wrong) conclusions to a general audience. This is illustrated nicely by the following comic:

Source: “Piled Higher and Deeper” by Jorge Cham www.phdcomics.com

### Selling your model with simple explanations

People like simple explanations for complex phenomena. If you work as a data scientist, or if you are planning to become/hire one, you’ve probably seen storytelling listed as one of the key skills that data scientists should have. Unlike “real” scientists that work in academia and have to explain their results mostly to peers who can handle technical complexities, data scientists in industry have to deal with non-technical stakeholders who want to understand how the models work. However, these stakeholders rarely have the time or patience to understand how things truly work. What they want is a simple hand-wavy explanation to make them feel as if they understand the matter – they want a story, not a technical report (an aside: don’t feel too smug, there is a lot of knowledge out there and in matters that fall outside of our main interests we are all non-technical stakeholders who get fed simple stories).

One of the simplest stories that most people can understand is the story of correlation. Going back to the running example of predicting health based on diet, it is well-known that excessive consumption of certain fats under certain conditions is correlated with an increase in likelihood of certain diseases. This is simplified in some stories to “consuming more fat increases your chance of disease”, which leads to the conclusion that consuming no fat at all decreases the chance of disease to zero. While this may sound ridiculous, it’s the sad reality. According to a recent survey, while the image of fat has improved over the past few years, 42% of Americans still try to limit or avoid all fats.

A slightly more involved story is that of linear models – looking at the effect of the most important factors, rather than presenting a single factor’s contribution. This storytelling technique is commonly used even with non-linear models, where the most important features are identified using various techniques. The problem is that people still tend to interpret this form of presentation as a simple linear relationship. Expanding on the previous example, this approach goes from a single-minded focus on fat to the need to consume less fat and sugar, but more calcium, protein and vitamin D. Unfortunately, even linear models with tens of variables are hard for people to use and follow. In the case of nutrition, few people really track the intake of all the nutrients covered by recommended daily intakes.

### Few interesting relationships are linear

Complex phenomena tend to be explained by complex non-linear models. For example, it’s not enough to consume the “right” amount of calcium – you also need vitamin D to absorb it, but popping a few vitamin D pills isn’t going to work well if you don’t consume them with fat, though over-consumption of certain fats is likely to lead to health issues. This list of human-friendly rules can go on and on, but reality is much more complex. It is naive to think that it is possible to predict something as complex as human health with a simple linear model that is based on daily nutrient intake. That being said, some relationships do lend themselves to simple rules of thumb. For example, if you don’t have enough vitamin C, you’re very likely to get scurvy, and people who don’t consume enough vitamin B1 may contract beriberi. However, when it comes to cancers and other diseases that take years to develop, linear models are inadequate.

An accurate model to predict human health based on diet would be based on thousands to millions of variables, and would consider many non-linear relationships. It is fairly safe to assume that there is no magic bullet that simply explains how diet affects our health, and no superfood is going to save us from the complexity of our nutritional needs. It is likely that even if we had such a model, it would not be completely accurate. All models are wrong, but some models are useful. For example, the vitamin C versus scurvy model is very useful, but it is often wrong when it comes to predicting overall health. Predictions made by useful complex models can be very hard to reason about and explain, but it doesn’t mean we shouldn’t use them.

### The ongoing quest for sellable complex models

All of the above should be pretty obvious to any modern data scientist. The culture of preferring complex models with high predictive accuracy to simplistic models with questionable predictive power is now prevalent (see Leo Breiman’s 2001 paper for a discussion of these two cultures of statistical modelling). This is illustrated by the focus of many Kaggle competitions on producing accurate models and the recent successes of deep learning for computer vision. Especially with deep learning for vision, no one expects a handful of variables (pixels) to be predictive, so traditional explanations of variable importance are useless. This does lead to a general suspicion of such models, as they are too complex for us to reason about or fully explain. However, it is very hard to argue with the empirical success of accurate modelling techniques.

Nonetheless, many data scientists still work in environments that require simple explanations. This may lead some data scientists to settle for simple models that are easier to sell. In my opinion, it is better to make up a simple explanation for an accurate complex model than settle for a simple model that doesn’t really work. That being said, some situations do call for simple or inflexible models due to a lack of data or the need to enforce strong prior assumptions. In Albert Einstein’s words, “it can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience”. Make things as simple as possible, but not simpler, and always consider the interests of people who try to sell you simplistic (or unnecessarily complex) explanations.

# The wonderful world of recommender systems

I recently gave a talk about recommender systems at the Data Science Sydney meetup (the slides are available here). This post roughly follows the outline of the talk, expanding on some of the key points in non-slide form (i.e., complete sentences and paragraphs!). The first few sections give a broad overview of the field and the common recommendation paradigms, while the final part is dedicated to debunking five common myths about recommender systems.

### Motivation: Why should we care about recommender systems?

The key reason why many people seem to care about recommender systems is money. For companies such as Amazon, Netflix, and Spotify, recommender systems drive significant engagement and revenue. But this is the more cynical view of things. The reason these companies (and others) see increased revenue is because they deliver actual value to their customers – recommender systems provide a scalable way of personalising content for users in scenarios with many items.

Another reason why data scientists specifically should care about recommender systems is that it is a true data science problem. That is, at least according to my favourite definition of data science as the intersection between software engineering, machine learning, and statistics. As we will see, building successful recommender systems requires all of these skills (and more).

### Defining recommender systems

When trying to the define anything, a reasonable first step is to ask Wikipedia. Unfortunately, as of the day of this post’s publication, Wikipedia defines recommender systems too narrowly, as “a subclass of information filtering system that seek to predict the ‘rating’ or ‘preference’ that a user would give to an item” (I should probably fix it, but this wrong definition helped my talk flow better – let me know if you fix it and I’ll update this paragraph).

The problem with Wikipedia’s definition is that there’s so much more to recommender systems than rating prediction. First, recommender is a misnomer – calling it a discovery assistant is better, as the so-called recommendations are far from binding. Second, system means that elements like presentation are important, which is part of what makes recommendation such an interesting data science problem.

My definition is simply:

Recommender systems are systems that help users discover items they may like.

Depending on who you ask, there are between two and twenty different recommendation paradigms. The usual classification is by the type of data that is used to generate recommendations. The distinction between approaches is more academic than practical, as it is often a good idea to use hybrids/ensembles to address each method’s limitations. Nonetheless, it is worthwhile discussing the different paradigms. The way I see it, if you ignore trivial approaches that often work surprisingly well (e.g., popular items, and “watch it again”), there are four main paradigms: collaborative filtering, content-based, social/demographic, and contextual recommendation.

Collaborative filtering is perhaps the most famous approach to recommendation, to the point that it is sometimes seen as synonymous with the field. The main idea is that you’re given a matrix of preferences by users for items, and these are used to predict missing preferences and recommend items with high predictions. One of the key advantages of this approach is that there has been a huge amount of research into collaborative filtering, making it pretty well-understood, with existing libraries that make implementation fairly straightforward. Another important advantage is that collaborative filtering is independent of item properties. All you need to get started is user and item IDs, and some notion of preference by users for items (ratings, views, etc.).

The major limitation of collaborative filtering is its reliance on preferences. In a cold-start scenario, where there are no preferences at all, it can’t generate any recommendations. However, cold starts can also occur when there are millions of available preferences, because pure collaborative recommendation doesn’t work for items or users with no ratings, and often performs pretty poorly when there are only a few ratings. Further, the underlying collaborative model may yield disappointing results when the preference matrix is sparse. In fact, this has been my experience in nearly every situation where I deployed collaborative filtering. It always requires tweaking, and never simply works out of the box.

Content-based algorithms are given user preferences for items, and recommend similar items based on a domain-specific notion of item content. The main advantage of content-based recommendation over collaborative filtering is that it doesn’t require as much user feedback to get going. Even one known user preference can yield many good recommendations (which can lead to the collection of preferences to enable collaborative recommendation). In many scenarios, content-based recommendation is the most natural approach. For example, when recommending news articles or blog posts, it’s natural to compare the textual content of the items. This approach also extends naturally to cases where item metadata is available (e.g., movie stars, book authors, and music genres).

One problem with deploying content-based recommendations arises when item similarity is not so easily defined. However, even when it is natural to measure similarity, content-based recommendations may end up being too homogeneous to be useful. Such recommendations may also be too static over time, thereby failing to adjust to changes in individual user tastes and other shifts in the underlying data.

Social and demographic recommenders suggest items that are liked by friends, friends of friends, and demographically-similar people. Such recommenders don’t need any preferences by the user to whom recommendations are made, making them very powerful. In my experience, even trivially-implemented approaches can be depressingly accurate. For example, just summing the number of Facebook likes by a person’s close friends can often be enough to paint a pretty accurate picture of what that person likes.

Given this power of social and demographic recommenders, it isn’t surprising that social networks don’t easily give their data away. This means that for many practitioners, employing social/demographic recommendation algorithms is simply impossible. However, even when such data is available, it is not always easy to use without creeping users out. Further, privacy concerns need to be carefully addressed to ensure that users are comfortable with using the system.

Contextual recommendation algorithms recommend items that match the user’s current context. This allows them to be more flexible and adaptive to current user needs than methods that ignore context (essentially giving the same weight to all of the user’s history). Hence, contextual algorithms are more likely to elicit a response than approaches that are based only on historical data.

The key limitations of contextual recommenders are similar to those of social and demographic recommenders – contextual data may not always be available, and there’s a risk of creeping out the user. For example, ad retargeting can be seen as a form of contextual recommendation that follows users around the web and across devices, without having the explicit consent of the users to being tracked in this manner.

### Five common myths about recommender systems

There are some common myths and misconceptions surrounding recommender systems. I’ve picked five to address in this post. If you disagree, agree, or have more to add, I would love to hear from you either privately or in the comment section.

The accuracy myth
Offline optimisation of an accuracy measure is sufficient for creating a successful recommender
Reality
Users don’t really care about accuracy

This is perhaps the most prevalent myth of all, as evidenced by Wikipedia’s definition of recommender systems. It’s somewhat surprising that it still persists, as it’s been almost ten years since McNee et al.’s influential paper on the damage the focus on accuracy measures has done to the field.

It is therefore worth asking where this myth came from. My theory is that it is a feedback loop between academia and industry. In academia it is pretty easy to publish papers with infinitesimal improvements to arbitrary accuracy measures on offline datasets (I’m also guilty of doing just that), while it’s relatively hard to run experiments on live systems. However, one of the moves that significantly increased focus on offline predictive accuracy came from industry, in the form of the \$1M Netflix prize, where the goal was to improve the accuracy of Netflix’s rating prediction algorithm by 10%.

Notably, most of the algorithms that came out of the three-year competition were never integrated into Netflix. As discussed on the Netflix blog:

You might be wondering what happened with the final Grand Prize ensemble that won the \$1M two years later… We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.

Our business objective is to maximize member satisfaction and month-to-month subscription retention… Now it is clear that the Netflix Prize objective, accurate prediction of a movie’s rating, is just one of the many components of an effective recommendation system that optimizes our members’ enjoyment.

The following chart says it all (taken from the second part of the blog post quoted above):

An important question that arises is: If users don’t really care about predictive accuracy, what do they care about? The answer is that predictive accuracy has some importance (as evidenced by the above chart), but it is not the only thing. In my opinion, the key consideration is UI/UX. You can have the most accurate recommendations in the world, but no one would know about it (or care) if they are not served in a timely manner through a friendly interface.

Of course, even with a great user interface and accurate predictions, there are other issues that require attention when designing recommender systems. Examples include diversity (showing various types of items), serendipity/novelty (showing non-obvious recommendations that users don’t already know about), and coverage (being able to generate recommendations for all users and items). Many other considerations are covered in an excellent survey by Guy Shani and Asela Gunawardana.

It’s also worth noting that there is an inherent problem with common accuracy measures. Specifically, when using a measure like root mean square error, a rating prediction algorithm can be made to perform better by reducing errors on low ratings. This is rather pointless, because items with low ratings will not be shown to users in any case.

Finally, a key issue that arises with offline evaluation is that there are biases in offline datasets that do not necessarily carry over to online scenarios. For instance, in many cases there is an implicit assumption that data is missing at random, when it really isn’t, e.g., the fact that users took the effort to watch and rate a movie already tells us a lot about a bias they have towards this movie (the team that won the Netflix prize used this bias to their advantage). Hiding this rating and trying to predict it is not the same as predicting a rating for a movie that is picked at random from the entire set of movies.

The black box myth
You can build successful recommender systems without worrying about what’s being recommended and how recommendations are being served
Reality
UI/UX is king, item type is critical

A good recommender system has to consider how users interact with the recommendations. For example, the number of displayed recommendations should inform the optimisation procedure (e.g., are you aiming for precision@1 or precision@10?). How these recommendations are laid out (e.g., horizontally/vertically) tends to influence user interaction. In addition, being able to explain the reasons for the recommendations can yield easy wins. Finally, in many cases there are constraints on the amount of time that can be spent generating recommendations.

In addition to UI/UX, the design of good recommender systems has to account for what’s being recommended. For example, music tracks and short videos can be played many times, so it’s probably a good idea to recommend items that the user has already seen. On the other hand, items like washing machines and cars don’t get consumed as often. If a user has just bought a washing machine, they’re unlikely to want another one anytime soon (but they may want a dryer or a clothes line).

Hynt is a recommender-system-as-a-service for e-commerce whose development I led up until the middle of last year. The general idea is that merchants simply add a few lines of JavaScript to their shop pages and Hynt does the hard work of recommending relevant items from the store, while considering the user and page context. Going live with Hynt reaffirmed many well-known UI/UX lessons. Most notably:

• Above the fold is better than below. Engagement with Hynt widgets that were visible without scrolling was higher than those that were lower on the page.
• More recommendations are better than a few. Hynt widgets are responsive, adapting to the size of the container they’re placed in. Engagement was more likely when more recommendations were displayed, because users were more likely to find something they liked without scrolling through the widget.
• Fast is better than slow. If recommendations load faster, more people see them, which increases engagement. In Hynt’s case speed was especially important because the widgets load asynchronously after the host page finishes loading.

Another important UI/UX element is explanations. Displaying a plausible explanation next to a recommendation can do wonders, without making any changes to the underlying recommendation algorithms. The impact of explanations has been studied extensively by Nava Tintarev and Judith Masthoff. They have identified seven different aims of explanations, which are summarised in the following table (reproduced from their survey of explanations in recommender systems).

Aim Definition
Transparency Explain how the system works
Scrutability Allow users to tell the system it is wrong
Trust Increase user confidence in the system
Effectiveness Help users make good decisions
Persuasiveness Convince users to try or buy
Efficiency Help users make decisions faster
Satisfaction Increase ease of usability or enjoyment

Explanations are ubiquitous in real-world recommender systems. For example, Amazon uses explanations like “frequently bought together”, and “customers who bought this item also bought”, while Netflix presents different lists of recommendations where each list is driven by a different reason. However, as the following Netflix example shows, it is worth making sure that the explanations you provide don’t make you look stupid.

The solved problem myth
The space of recommender systems has been exhaustively explored
Reality
Development of new methods is often required

When I finished my PhD, about three years ago, I joined a small startup called Giveable as the first employee (essentially part of the founding team that was formed after Adam Neumann, the original founder, graduated from AngelCube and raised some seed funding). Giveable’s original product was a webapp where users could connect with their Facebook account and find gifts for their friends.

At the time, there wasn’t much published research on gift recommendation, and there was more or less nothing about the specific problem of recommending gifts for Facebook friends using liked pages. Here are some of the ways this problem differs from classic recommendation scenarios.

• Need to consider giver and receiver. Unlike traditional scenarios, the recommended items aren’t consumed by the user to whom they’re shown. In practice, this meant that we had to ensure the items are giftable, and take into account the relationship between the giver and the receiver. For example, the type of gift your mum may give you is different from gifts your partner may give you.
• Likes are historical, sparse, and often nonsensical. This is best illustrated by an example: What does liking a page such as Tony Abbott – Worst PM in Australian History tell us about gifts the user may like? Tony Abbott is no longer prime minister (thankfully), so it’s historical, and while this page is quite popular, there are many other pages out there that are difficult to interpret and are liked by only a handful of people (this video is a good summary of why Tony is disliked, for those who are unfamiliar with Australian politics).
• Likes are not for recommended items. As the above example shows, just because you like disliking Tony, it doesn’t exactly lead to useful gifts. Even with things that are more related to interests, such as authors and bands, the liked pages aren’t recommendable as gifts.
• Likes are not always available offline. This was an important engineering consideration: We didn’t have much time to generate recommendations from the point where a new user gave us permission to view their likes and the likes of their friends. Ideally, recommendation generation would take less than a second from the time we got all the data from Facebook. This puts a strong constraint on the types of algorithms we could use.

The key to effectively addressing the Giveable recommendation problem was doing as much processing offline as possible. Specifically:

• Similar pages were inferred using Latent Dirichlet Allocation (which can be seen as a collaborative filtering technique). This made it possible to use information from pages that are not directly linked to giftable products, e.g., for the above Tony Abbott example, people who dislike him are likely to be left-leaning, which implies many other interests.
• Facebook pages were matched to giftable products with heuristics + Mechanical Turk + machine learning. This took a few iterations of what was essentially partly-manual semi-supervised learning, where we obtained high-confidence matches through heuristics and manual tagging, and then used this to train a classifier that was used to classify uncertain matches. The results of classification on a hold-out set were then verified through manual tagging of subsamples.
• We enriched the page and product data with structured information from the Freebase knowledge graph (which has since been deprecated). This allowed us to easily match giftable products to liked pages, e.g., books to authors.

The online part included taking a receiver’s liked pages, inferring likes for similar pages, and matching all these pages to a ranked and diversified list of giftable product recommendations. These recommendations came with explanations, which were quite important in this case because the giver of a gift has to know why they’re giving it.

The silver bullet myth
Optimising a single measure or using a single algorithm is sufficient for generating a good recommendation list
Reality
Hybrids work best

Netflix provides another example for how focusing on a single algorithm or measure of success is far from sufficient. In a recent blog post, they talk about how they use multiple algorithms to optimise the order of different recommendation lists and each list’s internal ranking, while considering device-specific UI constraints, relevance, engagement, diversity, business requirements, and more.

An example from my experience comes from Giveable (which ended up evolving into Hynt), where a single list was generated by mixing the outputs of the following recommendation approaches: contextual, direct likes, inferred likes, content-based, social, collaborative filtering of products, previously viewed items, and popular interests/products. The weight of each algorithm in the mix was static – it was either set manually or through A/B testing, and then left as a hardcoded constant.

This kind of static mix can get you very far, but there’s a better way that I haven’t gotten around to implementing before leaving to work on other things. This way is described in a series of posts on bandits for recommenders by Sergey Feldman of RichRelevance. The general idea is to train recommendation models offline using a small number of strategies/paradigms. Online, recommendations are served from strategies that maximise clickthrough and revenue, given a context of features that describe the user, merchant, and web page where the RichRelevance widget is embedded. Rather than setting static weights for the strategies, the bandit model continuously adjusts the weights, while balancing between exploring new strategy weights and exploiting strategies that have been known to work well in a specific context. This allows the overall recommendation engine to adjust to changes in reality and in the underlying data.

The omnipresence myth
Every personalised system is a recommender system
Reality
This one is kinda true, but not necessarily useful…

The first conference I attended as a PhD student was the 18th International Conference on User Modeling, Adaptation and Personalization (UMAP), back in 2010. The field of recommender systems was getting increased attention, and Peter Brusilovsky, who has been working in the UMAP field for decades, argued that recommender systems are the new expert systems. This was partly because the hype was causing people to broaden the definition of the field to allow them to say that they’re working on recommender systems.

I don’t think it’s incorrect that personalisation and recommender systems are different things. However, one problem that this may cause is making people think that common recommendation techniques would apply in scenarios where they’re unlikely to work. For example, web search can be seen as a recommender system for pages that gives a high weight to the user’s intent, as captured by the query. Hence, when personalising web search, it seems sensible to use collaborative filtering techniques. This was indeed my experience with the Yandex search personalisation competition: employing a matrix factorisation approach that was inspired by collaborative filtering turned out to be a waste of time compared to domain-specific methods.

In conclusion, recommenders are about as murky as data science. Just like data science, the boundaries of recommender systems are hard to define and they are sometimes over-hyped. This hype may lead to people investing in a recommender system they don’t really need, just like the common issue of premature investment in data science. However, the hype is based on real value, which can definitely be delivered by recommender systems when they are used correctly.