Correlation and causation XKCD: https://xkcd.com/552/

Why you should stop worrying about deep learning and deepen your understanding of causality instead

Everywhere you go these days, you hear about deep learning’s impressive advancements. New deep learning libraries, tools, and products get announced on a regular basis, making the average data scientist feel like they’re missing out if they don’t hop on the deep learning bandwagon. However, as Kamil Bartocha put it in his post The Inconvenient Truth About Data Science, 95% of tasks do not require deep learning. This is obviously a made-up number, but it’s probably an accurate representation of the everyday reality of many data scientists. This post discusses an often-overlooked area of study that is of much higher relevance to most data scientists than deep learning: causality.

Causality is everywhere

An understanding of cause and effect is not unique to humans. For example, the many videos of cats knocking things off tables appear to exemplify experimentation by animals. If you are not familiar with such videos, that is easily fixed. The thing to notice is that cats appear genuinely curious about what happens when they push an object. And they tend to repeat the experiment to verify that if you push something off, it falls to the ground.

Humans rely on much more complex causal analysis than that done by cats – an understanding of the long-term effects of one’s actions is crucial to survival. Science, as defined by Wikipedia, is a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about the universe. Causal analysis is key to producing explanations and predictions that are valid and sound, which is why understanding causality is so important to data scientists, traditional scientists, and all humans.

What is causality?

It is surprisingly hard to define causality. Just like cats, we all have an intuitive sense of what causality is, but things get complicated on deeper inspection. For example, few people would disagree with the statement that smoking causes cancer. But does it cause cancer immediately? Would smoking a few cigarettes today and never again cause cancer? Do all smokers develop cancer eventually? What about light smokers who live in areas with heavy air pollution?

Samantha Kleinberg summarises it very well in her book, Why: A Guide to Finding and Using Causes:

While most definitions of causality are based on Hume’s work, none of the ones we can come up with cover all possible cases and each one has counterexamples another does not. For instance, a medication may lead to side effects in only a small fraction of users (so we can’t assume that a cause will always produce an effect), and seat belts normally prevent death but can cause it in some car accidents (so we need to allow for factors that can have mixed producer/preventer roles depending on context).

The question often boils down to whether we should see causes as a fundamental building block or force of the world (that can’t be further reduced to any other laws), or if this structure is something we impose. As with nearly every facet of causality, there is disagreement on this point (and even disagreement about whether particular theories are compatible with this notion, which is called causal realism). Some have felt that causes are so hard to find as for the search to be hopeless and, further, that once we have some physical laws, those are more useful than causes anyway. That is, “causes” may be a mere shorthand for things like triggers, pushes, repels, prevents, and so on, rather than a fundamental notion.

It is somewhat surprising, given how central the idea of causality is to our daily lives, but there is simply no unified philosophical theory of what causes are, and no single foolproof computational method for finding them with absolute certainty. What makes this even more challenging is that, depending on one’s definition of causality, different factors may be identified as causes in the same situation, and it may not be clear what the ground truth is.

Why study causality now?

While it’s hard to conclusively prove, it seems to me like interest in formal causal analysis has increased in recent years. My hypothesis is that it’s just a natural progression along the levels of data’s hierarchy of needs. At the start of the big data boom, people were mostly concerned with storing and processing large amounts of data (e.g., using Hadoop, Elasticsearch, or your favourite NoSQL database). Just having your data flowing through pipelines is nice, but not very useful, so the focus switched to reporting and visualisation to extract insights about what happened (commonly known as business intelligence). While having a good picture of what happened is great, it isn’t enough – you can make better decisions if you can predict what’s going to happen, so the focus switched again to predictive analytics. Those who are familiar with predictive analytics know that models often end up relying on correlations between the features and the predicted labels. Using such models without considering the meaning of the variables can lead us to erroneous conclusions, and potentially harmful interventions. For example, based on the following graph we may make a recommendation that the US government decrease its spending on science to reduce the number of suicides by hanging.

US science spending versus suicides

Source: Spurious Correlations by Tyler Vigen
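To see how easily such a graph can arise, here is a minimal sketch in Python. It does not use the real spending or suicide figures: the two series are hypothetical, generated independently, and share nothing except that each trends over time. The function names, slopes, and noise levels are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def trending_series(rng, n, slope, noise_sd):
    """Hypothetical metric: a straight-line trend plus independent noise."""
    return [slope * t + rng.gauss(0, noise_sd) for t in range(n)]


def differences(xs):
    """Period-to-period changes, which remove a linear trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable
series_a = trending_series(rng, 100, slope=1.0, noise_sd=2.0)
series_b = trending_series(rng, 100, slope=0.5, noise_sd=1.0)

# Correlation of the raw series is dominated by the shared upward trend.
raw_r = pearson(series_a, series_b)
# Correlation of the changes reflects only the (independent) noise.
diff_r = pearson(differences(series_a), differences(series_b))

print(f"raw series:  r = {raw_r:.2f}")
print(f"differenced: r = {diff_r:.2f}")
```

The raw series correlate strongly even though neither influences the other, while the correlation between their period-to-period changes is much weaker. Differencing is only one simple check for trend-driven correlation; passing it would not establish causation either.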

Causal analysis aims to identify factors that are independent of spurious correlations, allowing stakeholders to make well-informed decisions. It is all about getting to the top of the DIKW (data-information-knowledge-wisdom) pyramid by understanding why things happen and what we can do to change the world. However, finding true causes can be very hard, especially in cases where you can’t perform experiments. Judea Pearl explains it well:

We know, from first principles, that any causal conclusion drawn from observational studies must rest on untested causal assumptions. Cartwright (1989) named this principle ‘no causes in, no causes out,’ which follows formally from the theory of equivalent models (Verma and Pearl, 1990); for any model yielding a conclusion C, one can construct a statistically equivalent model that refutes C and fits the data equally well.

What this means in practice is that you can’t, for example, conclusively prove that smoking causes cancer without making some reasonable assumptions about the mechanisms at play. For ethical reasons, we can’t perform a randomised controlled trial where a test group is forced to smoke for years while a control group is forced not to smoke. Therefore, our conclusions about the causal link between smoking and cancer are drawn from observational studies and an understanding of the mechanisms by which various cancers develop (e.g., the effect of cigarette smoke on individual cells can be studied without forcing people to smoke).

Tobacco companies have exploited this fact for years, claiming that the probability of both cancer and smoking is raised by some mysterious genetic factors. Fossil fuel and food companies use similar arguments to sell their products and block attempts to regulate their industries (as discussed in previous posts on the hardest parts of data science and nutritionism). Fighting against such arguments is an uphill battle, as it is easy to sow doubt with a few simplistic catchphrases, while proving and communicating causality to laypeople is much harder (or impossible when it comes to deeply-held irrational beliefs).
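The hidden-factor argument can be made concrete with a toy simulation. The sketch below is not a model of smoking or cancer, and every probability in it is invented. It describes a hypothetical world in which a hidden factor raises the chance of both an exposure and an outcome, while the exposure itself has no effect at all. Observational data from that world still show a large gap between exposed and unexposed groups, and only random assignment of the exposure makes the gap disappear.

```python
import random


def simulate(rng, n, randomise):
    """Return (outcome rate if exposed, outcome rate if unexposed).

    Toy world with invented probabilities: a hidden factor raises the
    chance of both exposure and outcome; exposure has NO causal effect.
    """
    counts = {True: [0, 0], False: [0, 0]}  # exposed -> [outcomes, total]
    for _ in range(n):
        hidden = rng.random() < 0.3
        if randomise:
            # Experiment: a coin flip decides exposure, ignoring the hidden factor.
            exposed = rng.random() < 0.5
        else:
            # Observation: the hidden factor makes exposure more likely.
            exposed = rng.random() < (0.8 if hidden else 0.2)
        # The outcome depends on the hidden factor only, never on exposure.
        outcome = rng.random() < (0.4 if hidden else 0.1)
        counts[exposed][0] += outcome
        counts[exposed][1] += 1
    return tuple(counts[e][0] / counts[e][1] for e in (True, False))


rng = random.Random(42)  # fixed seed so the run is repeatable
obs_exposed, obs_unexposed = simulate(rng, 100_000, randomise=False)
rct_exposed, rct_unexposed = simulate(rng, 100_000, randomise=True)

print(f"observational: {obs_exposed:.3f} vs {obs_unexposed:.3f}")
print(f"randomised:    {rct_exposed:.3f} vs {rct_unexposed:.3f}")
```

The point is not that this story is true of smoking. It is that observational data alone fit it just as well as a genuinely causal story, which is why the case against tobacco also rests on knowledge of biological mechanisms.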

My causality journey is just beginning

My interest in formal causal analysis was seeded a couple of years ago, with a reading group that was dedicated to Judea Pearl’s work. We didn’t get very far, as I was a bit disappointed with what causal calculus can and cannot do. This may have been because I didn’t come in with the right expectations – I expected a black box that automatically finds causes. Recently reading Samantha Kleinberg’s excellent book Why: A Guide to Finding and Using Causes has made my expectations somewhat more realistic:

Thousands of years after Aristotle’s seminal work on causality, hundreds of years after Hume gave us two definitions of it, and decades after automated inference became a possibility through powerful new computers, causality is still an unsolved problem. Humans are prone to seeing causality where it does not exist and our algorithms aren’t foolproof. Even worse, once we find a cause it’s still hard to use this information to prevent or produce an outcome because of limits on what information we can collect and how we can understand it. After looking at all the cases where methods haven’t worked and researchers and policy makers have gotten causality really wrong, you might wonder why you should bother.

[…]

Rather than giving up on causality, what we need to give up on is the idea of having a black box that takes some data straight from its source and emits a stream of causes with no need for interpretation or human intervention. Causal inference is necessary and possible, but it is not perfect and, most importantly, it requires domain knowledge.

Kleinberg’s book is a great general intro to causality, but it intentionally omits the mathematical details behind the various methods. I am now ready to once again go deeper into causality, perhaps starting with Kleinberg’s more technical book, Causality, Probability, and Time. Other recommendations are very welcome!

Cover image source: xkcd: Correlation

32 comments

  1. It seems to me that causality is another of our thought conveniences, just one more attempt at linearising our frustratingly non-linear existence, akin to teaching with Newtonian physics, segue to Einstein’s relativistic mechanics when the kids are ready (if ever). Cyclic systems can self-perpetuate in non-repeating cycles (chaos theory) but also respond with or resist change arising from external inputs. I believe when people speak of causality, what they are really thinking about (and desiring) is a conversation around stability versus volatility.

  2. Hey Yanir – great post.

    If you’ve not already, you should read Mostly Harmless Econometrics. They take quite a different approach to causality than Pearl (though there is a lot of conceptual overlap). It definitely helps build intuition for the topic. It’s also worth reading the relevant mid-70s papers from Rubin.

  3. I took a look at the Amazon sample for Causality, Probability and Time but I doubt if I’ll buy it just yet. I’ve got Judea Pearl’s Probabilistic Reasoning in Intelligent Systems already and think I want to work through that in a programming language (R is my first choice) before buying any more books.😉

  4. I appreciate this post. I teach General Psychology, and this is a central issue that I present to my students. In the meantime, I regularly come across articles, in peer-reviewed as well as mainstream publications, which discuss correlational data as if it were supporting a causal relationship. As I tell my students, one of the difficulties is the use of the word “factor” in both types of discussions. In correlation, factors are pieces of information which give you a more likely guess about an unknown piece of information. In causation, factors are things that contribute to something else existing. Both concepts feed the mind’s desire to find patterns in the relevant world which inform our decisions/behaviors so that we can continue living, hopefully in a pleasant state. We are often tricked by these patterns (illusions, etc.), but most of the time they pan out in a beneficial way. Making the leap from “this is how things tend to work in my immediate experience” to “this is how things work everywhere for everyone” is where theories are born, where science lives, and where we often make mistakes along the way. Proceed with caution from observation to theory, but by all means, proceed!

  5. The problem with the search for causality (or, more generally, explainability) is that in many cases, it is “not interesting”. If I click on Google search results, neither I nor Google algorithm developers are truly interested in how the algorithm decided to rank Page A before Page B. It is OK for me, as an end user, not to care about those details, as much as I don’t care about hydraulics every time I take a shower. Is it OK for me, as a data scientist, not to care about the reasons behind my models? Honestly, I don’t yet know.

    1. I agree that in many cases the reasoning behind models isn’t interesting, as long as the models produce satisfactory results. Web search is actually a good example. Yes, many end users don’t really care how Google ranks pages, but SEO practitioners go to great lengths to understand search algorithms and get pages to rank well (see https://moz.com/search-ranking-factors for example).

      As data scientists, it’s important to consider model stability in production. Sculley et al. said it well in their paper on machine learning technical debt (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43146.pdf): “Machine learning systems often have a difficult time distinguishing the impact of correlated features. This may not seem like a major problem: if two features are always correlated, but only one is truly causal, it may still seem okay to ascribe credit to both and rely on their observed co-occurrence. However, if the world suddenly stops making these features co-occur, prediction behavior may change significantly.”

      Finally, in many cases what we really care about is interventionality. I don’t think it’s a real word, but what it means is that you don’t really care whether A causes B, you want to know whether intervening to change A would change B. These inferences are critical in fields like medicine and marketing, but we can look at an example from the world of blogging, which is probably more relevant to you. Many bloggers would like to attract more readers. A possible costly intervention would be to switch platforms from WordPress to Medium. Cheaper interventions may be changing the site’s layout, writing titles that get people interested, and posting links to your content on relevant channels. Another intervention would be trying to post at different times (as implied by WordPress insights and discussed in https://yanirseroussi.com/2015/12/08/this-holiday-season-give-me-real-insights/). Obviously, one would like to apply the interventions with the highest return on investment first, and data that helps with ranking the interventions is very interesting.

  6. I’ve been thinking about this lately quite a bit. The fact that I can type this comment and send it across the internet rests on the ability to create a completely controlled causal environment. Inside the computer, all noise and randomness is kept below the threshold of the data, and every process is completely causal. Meanwhile, outside the computer, most measurements are mostly noise, and extracting any sort of causal relation is very difficult and often impossible. My mind seems to have some sort of idea of cause as something like the interaction of balls on a pool table. The cue ball strikes the eight ball and knocks it into the corner pocket, etc. But when one tries to measure things, mostly one finds nothing like this. Instead, one finds that some measurements tend to be found with other measurements most of the time, but not all of the time. Cause thus seems a statistical thing, and in no way absolute. I have difficulty reconciling the two views. One thing that occurred to me to investigate was the manner in which several huge internet outages developed involving the BGP protocol. It seemed to me that every individual packet must experience a completely causal path, but the aggregate turns into the statistical causal form we most usually deal with. I haven’t followed up with this idea so far, however.

    1. Interesting. I think that one of the dividing factors between traditional software engineering and data science is the attitude towards uncertainty. Whereas, as you say, coding is all about creating a controlled deterministic environment, data science and statistics thrive on uncertainty. It’s similar with computer networks as well, where there is always a non-deterministic element (e.g., packets may be lost, arrive out-of-order, or come in bursts).

  7. There is a subtle difference between Woodward’s approach and that of Pearl and of Spirtes et al., which Glymour discusses in the following places:

    https://www.ncbi.nlm.nih.gov/pubmed/24887161
    http://repository.cmu.edu/cgi/viewcontent.cgi?article=1280&context=philosophy

    Basically, Woodward starts with the notion of an intervention on a variable and defines other concepts (e.g. direct cause) in terms of it, whereas Pearl and Spirtes et al. start with the notion of direct cause. One consequence of this difference is that properties like sex and race that cannot be intervened upon in a straightforward way cannot be causes for Woodward, strictly speaking, but can be for Pearl and Spirtes et al. This is a fine point, however, and it’s very nearly true that they simply provide alternative formulations of the same theory, with Woodward focusing on conceptual issues and the others focusing on methodology.

    1. Thanks for all the pointers, Greg! I’ll definitely check them out. Personally, I have a slight bias towards Pearl, as he is my academic grandfather (he was my advisor’s advisor), but I’m keen on learning as much as possible on all the different approaches to causality. It is a fascinating area!

  8. “Thinking, Fast & Slow” touches on some of this in later chapters. Some algebra is used to help illustrate the deception causality and efforts towards finding it can cause.

  9. Great post, thanks Yanir.
    I have been afraid of getting lost in big data, i.e., swarmed by a vast number of correlations.
    So understanding causality is important for making better decisions.
    Not being a data scientist but a startup founder, my approach is to try to understand how many signals I perceive, what story I tell with them, and then what vision of reality I get. D. Kahneman is a great help here, as we should always be aware that our story is made up of only a small proportion of all signals, and the causality we build into a story is only one among many.
    So, how do we improve the process?
    Still searching ;)
    Jerome

  10. I am tangentially interested in that fundamental topic. I really enjoyed a short research course by Marloes Maathuis on causality in high dimensions. It covered both interesting algorithms and mathematical results (on consistency, I think). It’s worth a look!

  11. Great writeup!
    I think causality will become something that data scientists will need to acknowledge and think about more explicitly.

    Coming from machine learning, it took me a while to wrap my head around the subtle but important differences in the way similar ideas are used in prediction vs. causal inference.

    Recently I gave an ICML tutorial about causality, together with David Sontag. This might be of interest to your readers as a starting point, particularly for people who are well-versed in ML. It’s here:
    http://www.cs.nyu.edu/~shalit/tutorial.html

  12. Hi Yanir,
    Thanks for addressing this huge gap in how we interpret data! Models that handle causality well will take us from finding cats in images to solving more subtle problems in robotics and adaptive systems. The topic deserves more attention than it gets. Thanks for keeping it in the spotlight.
    Brandon

  13. My understanding is that causality has always been the central focus of science. Machine learning/data mining is a relatively more recent thing, and its greatest benefit lies in solving complex prediction problems. But I think the study of causality and data mining can help each other. For the purpose of understanding causality, data mining can be used in an exploratory way that helps scientists to generate theories (I think it is possible to study the features (i.e., hidden units) extracted by deep learning networks), then experiments, longitudinal studies and traditional stats can be used to test the theories. For the purpose of solving practical prediction problems, theories developed from causality studies can help identify useful features as input to the machine learning algorithms. In fact, this was done all the time, especially before deep learning became popular. I agree that scientists should improve their understanding of causality, but picking up new technologies that take advantage of modern computers and large data won’t really hurt.
