correlation: neither everything nor nothing

Intellectual culture, like most social phenomena, is subject to pendulum swings, to backlashes and overcorrections. Right now, I think we’re witnessing a pretty wild back and forth about statistical reasoning, under the (not entirely helpful) frame of Big Data. I recently expressed some reservations about how statistical significance testing can lead us astray, but I also think that statistical approaches to knowledge generation are vital and useful. They require an appropriate skepticism. The phrase “correlation does not imply causation,” at this point in the internet cycle, risks becoming a kind of research nihilism. I worry that the denial of the importance of correlation is a bigger impediment to human knowledge and understanding than belief in specious relationships between correlation and causation.

First, you should read two pieces on the “correlation does not imply causation” phenomenon, which has gone from a somewhat arcane notion common to research methods classes to a full-fledged meme. This piece by Greg Laden is absolute required reading on correlation and causation and how to think about both. Second, this piece by Daniel Engber does good work talking about how “correlation does not imply causation” became an overused and unhelpful piece of internet lingo.

As Laden points out, the question is really this: what does “imply” mean? The people who employ “correlation does not imply causation” as a kind of argumentative trump card are typically using “imply” in a way that nobody actually means, namely as a synonym for “prove.” That’s pretty far from what we mean by “implies”! In fact, using the typical meaning of implication, correlation often implies causation, in the sense that it provides powerful evidence for a causal relationship. In careful, rigorously conducted research, a strong correlation can offer very strong evidence of causation, if that correlation is embedded in a theoretical argument for how that causative relationship works.

A few things I’d like people to think about.

Correlation describes a relationship between quantitative variables. I often read people saying something like “I think there’s a correlation between party affiliation and IQ.” Party affiliation is a categorical variable, and correlation describes a relationship between two or more quantitative variables. There may be an association between an explanatory categorical variable and a response quantitative variable, but not a correlation. To investigate that kind of association we would use an ANOVA instead of a correlation. I say this simply because if we’re going to understand how to draw responsible conclusions from evidence, we need to be clear about terms and procedures. Likewise, saying something like “there’s a correlation between your opinion of a writer and whether you usually agree with them” just strikes me as a misuse of specific terminology. It’s too vague to really make anyone more informed. (Don’t check my archives to see if I’ve been guilty of this!)
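
To make the distinction concrete, here is a minimal sketch in plain Python. Every number in it is made up for illustration, and the names `pearson_r`, `anova_f`, and the group labels are my own placeholders, not anything from a real study. A correlation coefficient needs two quantitative variables; for a categorical explanatory variable, a one-way ANOVA compares the group means instead.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two quantitative variables."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def anova_f(groups):
    """One-way ANOVA F statistic: a quantitative response split by category."""
    values = [v for g in groups.values() for v in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Variation of individual values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two quantitative variables (made-up data): correlation applies.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 74]
r = pearson_r(hours_studied, exam_score)

# A categorical explanatory variable and a quantitative response
# (made-up data): compare group means with ANOVA instead.
score_by_group = {
    "group_a": [98, 102, 101, 99],
    "group_b": [100, 103, 97, 100],
    "group_c": [101, 99, 100, 102],
}
f_stat = anova_f(score_by_group)

print(f"r = {r:.3f}, F = {f_stat:.3f}")
```

The F statistic on its own is not a verdict; it would still need to be compared against an F distribution with the right degrees of freedom, which a statistics package does for you.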

There are specific reasons that an assertion of causation from correlation data might be incorrect. There is a vast literature of research methodology, across just about every research field you can imagine. Correlation-causation fallacies have been investigated and understood for a long time. Among the potential dangers is the confounding variable, where an unknown variable is driving the change in two other variables, making them appear to influence one another. This gives us the famous drownings-and-ice cream correlation– as drownings go up, so do ice cream sales. The confounding variable, of course, is temperature. There are all sorts of nasty little interpretation problems in the literature. These dangers are real. But in order to have understanding, we have to actually investigate why a particular relationship is spurious. Just saying “correlation does not imply causation” doesn’t do anything to actually improve our understanding.
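
Here is what investigating a spurious relationship can look like, as a rough sketch on simulated data. The coefficients, noise levels, and variable names below are assumptions I chose for illustration, not real figures on drownings or ice cream. Both outcomes are generated from temperature alone, and the sketch compares the raw correlation with the correlation left over once temperature is accounted for (a partial correlation, computed here from straight-line residuals).

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two quantitative variables."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after a least-squares straight-line fit on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Simulated confounder: daily temperature.
temperature = [rng.uniform(0, 35) for _ in range(n)]
# Each outcome depends on temperature plus its own independent noise.
# Neither outcome appears in the other's formula.
ice_cream = [10 + 2.0 * t + rng.gauss(0, 8) for t in temperature]
drownings = [1 + 0.3 * t + rng.gauss(0, 2) for t in temperature]

raw_r = pearson_r(ice_cream, drownings)
partial_r = pearson_r(residuals(ice_cream, temperature),
                      residuals(drownings, temperature))

print(f"raw correlation:            {raw_r:.3f}")
print(f"controlling for temperature: {partial_r:.3f}")
```

In this setup the raw correlation is strong and the partial correlation sits close to zero, which is the signature of a confounder. The point is that this is a claim you can check, provided you have a candidate confounder and data on it.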

Correlation evidence can be essential when it is difficult or impossible to investigate a causative mechanism. Cigarette smoking causes cancer. We know that. We know it because many, many rigorous and careful studies have established that connection. It might surprise you to know that the large majority of our evidence demonstrating that relationship comes from correlation studies, rather than experiments. Why? Well, as my statistics instructor frequently says– here, let’s prove cigarette smoking causes cancer. We’ll round up some infants, and we’ll divide them into experimental and control groups, and we’ll expose the experimental group to tobacco smoke, and in a few years, we’ll have proven a causal relationship. Sound like a good idea to you? Me neither. We knew that cigarettes were contributing to lung cancer long before we identified what was actually happening in the human body, and we have correlational studies to thank for that. Same with heart attacks and diet, and a variety of other relationships.

Or consider relationships which we believe to be strong but in which we are unlikely to ever identify a specific causal mechanism. I have on my desk a raft of research showing a strong positive correlation between parental income and student performance on various educational metrics. It’s a relationship we find in a variety of locations, across a variety of ages, and through a variety of different research contexts. This is important research, it has stakes; it helps us to understand the power of structural advantage and contributes to political critique of our supposedly meritocratic social systems. Suppose I were prohibited from asserting that this correlation proved anything because I couldn’t prove causation. My question is this: how could I find a specific causal mechanism? The relationship is likely very complex, and in some cases, not subject to external observation by researchers at all. To refuse to consider this relationship in our knowledge making or our policy decisions because of an overly skeptical attitude towards correlational data would be profoundly misguided. Of course there are limitations and restrictions we need to keep in mind– the relationship is consistent but not universal, its effect is different for different parts of the income scale, it varies with a variety of factors. It’s not a complete or simple story. But I’m still perfectly willing to say: poverty causes poor educational performance. That’s the only reasonable conclusion from the data.

Correlation is a statistical relationship. Causation is a judgment call. I frequently find that people seem to believe that there is some sort of mathematical proof of causation that a high correlation does not merit, some number that can be spit out by statistical packages that says “here’s causation.” But causation is always a matter of the informed judgment of the research community. Controlled experiments are the gold standard in that regard, but there are controlled experiments that can’t prove causation and other research methods that have established causation to the satisfaction of most members of a discipline.

Human beings have the benefit of human reasoning. One of my frustrations with the “correlation does not imply causation” line is that it’s often deployed in instances where no one is asserting that we’ve adequately proved causation. I sometimes feel as though people are trying to protect us from mistakes of reasoning that no one would actually fall victim to. In an (overall excellent) piece for the Times, Gary Marcus and Ernest Davis write, “A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two.” That’s true– it is hard to imagine! So hard to imagine that I don’t think anyone would have that problem. I get the point that it’s a deliberately exaggerated example, and I also fully recognize that there are some correlation-causation assumptions that are tempting but wrong. But I think that, when people state the dangers of drawing specious relationships, they sometimes act as if we’re all dummies.
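
For what it’s worth, the reason examples like that one are so easy to manufacture is mechanical: any two series that both trend over the same period will correlate. A small sketch on synthetic data (the slopes, noise levels, and names are arbitrary assumptions, and these are not the real murder-rate or browser figures) shows the correlation in levels and what happens once the shared trend is removed by looking at step-to-step changes.

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two quantitative variables."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def diffs(xs):
    """Step-to-step changes, which strip out a steady trend."""
    return [later - earlier for earlier, later in zip(xs, xs[1:])]

rng = random.Random(1)  # fixed seed so the run is repeatable
steps = range(200)      # hypothetical time steps

# Two synthetic series that both drift downward, with independent noise.
series_a = [100 - 0.3 * t + rng.gauss(0, 3) for t in steps]
series_b = [50 - 0.1 * t + rng.gauss(0, 2) for t in steps]

level_r = pearson_r(series_a, series_b)                 # driven by the shared trend
diff_r = pearson_r(diffs(series_a), diffs(series_b))    # trend removed

print(f"correlation of levels:  {level_r:.3f}")
print(f"correlation of changes: {diff_r:.3f}")
```

The levels correlate strongly because both series go down over time; the changes, which is where a real link would have to show up, do not.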

Those disagreeing with conclusions drawn from correlational data have a burden of proof too. This is the thing, for me, more than anything. It’s fine to dispute a suggestion of causation drawn from correlation data. Just recognize that you have to actually make the case. Different people can have responsible, reasonable disagreements about statistical inferences. Both sides have to present evidence and make a rational argument drawn from theory. “Correlation does not imply causation” is the beginning of discussion, not the end.

I consider myself on the skeptical side when it comes to research, and as someone who is frequently frustrated by hype and woowoo, I’m firmly in the camp that says we need skepticism ingrained in how we think and write about new types of inquiry. I personally do think that many of the claims about Big Data applications are overblown, and I also think that the notion that we’ll ever be post-theory or purely empirical is dangerously misguided. But there’s no need to throw the baby out with the bathwater. While we should maintain a healthy criticism of them, new ventures dedicated to researched, data-driven writing should be greeted as a welcome development. What we need, I think, is to contribute to a communal understanding of research methods and statistics, including healthy skepticism, and there’s reason for optimism in that regard.


  1. Using “correlation does not imply causation” as an oversimplified charm against things we feel like rejecting out of hand is so 2012. Now all the cool kids are saying “it’s not a free speech issue if it’s a private actor doing it.” (“Free speech,” like “imply,” is often used colloquially in a much broader way than it’s used in a technical context. The definitely correct statement is “it’s not a violation of the First Amendment if it’s a private actor doing it,” although I have seen some people go so far as to say “it’s not censorship if…”, which is definitely incorrect.)

  2. A lot of people do use “imply” to mean “logically implies or entails” or some such. That’s the way I use it, and hear it used. But that may be a habit from reading in math and logic. If I want to say something weaker, I say “x suggests y.”

    Asking the skeptic for an alternative explanation of the correlation usually turns the conversation in an ok direction.

  3. Hmm. “Correlation does not imply causation” is another way of saying “don’t jump to conclusions”.

    Besides, causation is tricky and multilayered. You say that smoking causes cancer. But for a couple of decades now everyone knows that smoking causes cancer. And people still smoke. You say: addiction. But everyone knows about addiction, and yet they still choose to start smoking. Advertisement is banned, terrifying pictures are printed on the cigarette boxes, and yet they still do.

    At this point, saying “cigarette smoking causes cancer” becomes nearly as meaningless as saying “self-inflicted bullet wound in one’s head causes death”. Cigarette smoking has, in fact, ceased causing cancer; you now have to look for psychological and social causes.

    1. That is an absurd semantic distinction that does not accurately describe how human beings use the idea of causation.

      1. It’s not absurd, just a different angle, different scope, different context. For the pathologist drug overdose is a cause of death, the psychiatrist is looking for a different kind of cause (e.g. psychological trauma), and the sociologist for a different one yet (e.g. egotistic competitive culture, or whatever). Correlations won’t help much, you need a context, structural framework. Otherwise you’ll have a bunch of disconnected cause-effect items that don’t add up into any consistent logical worldview.

        1. The Big Bang caused cancer.

          Joking aside: The appropriate scope of the causal chain is determined by usefulness. If one is trying to determine whether a product causes disease, it is not useful to say, “No, it doesn’t, because the decision to use the product is caused by advertising/popularity/genetics, which are determined by…” ad infinitum.

  4. Well said, but it might help to consider physical disciplines where these terms have simpler and more direct meaning.

    I am sure it is not my unique construction. I will lazily avoid your links for now. There is a test for correlation and causation. And maybe more useful understanding.

    Correlation is a state measure. Accelerator depression is correlated to vehicle speed.

    Causation is a relationship between derivatives. If I change the accelerator, vehicle speed changes: ds/da is not zero. But if I change the speed (push the car downhill) the accelerator doesn’t move. You can more naturally write this as the correlation between the time derivatives of the two. This also lets you naturally extend to confounding variables. And explains experiment directions.

    Thanks for the links and nice writeup. Your writing really is a pleasure to read.

  5. Correlation implies symptom or cause.

    We have a guy at work constantly pointing out metrics that change over the same time frame as problems we’re trying to diagnose. He’s always pointing to them as causes, so far they’ve all been symptoms.

  6. “There may be an association between an explanatory categorical variable and a response quantitative variable, but not a correlation. To investigate that kind of association we would use an ANOVA instead of a correlation. I say this simply because if we’re going to understand how to draw responsible conclusions from evidence, we need to be clear about terms and procedures. Likewise, saying something like “there’s a correlation between your opinion of a writer and if you usually agree with them” just strikes me as a misuse of specific terminology.”

    This is unfair to everyday English. We (statisticians) appropriated the term, which was already very much in common use indicating that some things had a relationship together. In fact, when we say that “correlation does not imply causation” this is exactly how we’re using the term. Saying “association does not imply causation” conveys the same sentiment but lacks the charming consonance. The “correlation does not imply causation” lesson holds for categorical variables in the same ways it holds for quantitative variables.

    When we say things like “the correlation coefficient measures the strength and direction of the linear association between two quantitative variables,” we’re being sloppy in ways the layperson is innocent of. We know (or should know) that “correlation” is a vague idea, that a “correlation coefficient” is one of many specific ways to make use of the vague idea, that there is in fact more than one such measurement sailing under the “correlation coefficient” flag, and that the one we’re usually talking about is more precisely referred to as “Pearson’s product-moment correlation coefficient.” We have measures of correlation the way we have measures of center and measures of spread, and “correlation” is a vague idea just as “center” and “spread” are vague ideas.

  7. Correlation implies causation. It doesn’t prove it. A correlation could be due to a random fluke, an error, or fraud. Nor does correlation prove in which direction the arrow of causation points: from A to B, from B to A, or from C to both A and B, or some combination of those. But, yes, consistent correlation suggests some form of causation.

  8. I enjoyed your exploration of the “causation is not correlation” overuse.

    However, here you state, “But I’m still perfectly willing to say: poverty causes poor educational performance. That’s the only reasonable conclusion from the data.”

    You fall prey to the very thing you are pointing out. The belief that “poverty” causes poor educational performance is a knee jerk response just as “correlation is not causation” is.

    Any account of the causes of poor performance must take into account all the “black swans”: the people, and entire countries, that rose to the top despite economic hardship. Abraham Lincoln is but one example. Israel is another.

    It must also explain the other causes associated with why poverty exists, beyond luck: work ethic, religious affiliation, disease, and whether the curriculum is designed to prepare students for success, to mention a few.

    “Poverty is the cause of poor performance” is just not true, and far too simplistic a statement.

    I will pursue your other links to further my understanding of your arguments.
    Once again, thank you for the reality check.

    1. Saying “causation” doesn’t imply a single cause. But a strong direction, yes. I think a good and practical definition of cause is: measurably increases the probability of an outcome.

      Otherwise, we end up with no causation outside of the hard sciences, and even there only in simple cases (at what error level?). And without causal models, you can’t improve or affect much of anything.

  9. Thanks, good article. If two things are related you’d expect to see a correlation. The existence of a correlation between two things does not, on its own, imply a causal relationship between them, but it hints at the possibility one might exist. The absence of any correlation between two factors generally kills the hypothesis that one causes the other.

  10. I think that if we consider that everything boils down to a few universal laws of physics, then what we call “causation” isn’t really relevant. Everything is correlation (something that exists beside another thing, where neither is the “cause” of the other).
    You can try to argue timeline ordering (but at the basic interaction level they can be reversed which makes it harder to say that it is relevant).
    I think we talk about causation because of a very human-centered view of the universe.
    Of course you can argue there are useful correlations and useless correlations. (the useful one is when actively affecting one variable will cause a distinct variation of another variable).
