I’ve written, in the past, that I think the reflexive statement “correlation is not causation” has actually become more dangerous than people naively assuming that correlation does equal causation. I was reminded of this recently when I was reading Siddhartha Mukherjee’s magnificent “biography of cancer,” The Emperor of All Maladies. The relationship between lung cancer and smoking is the perfect example of how correlational studies can lead us to better understand the world, and in a way that has clear stakes. In time, causal evidence for the link was found, but this took years, and in fact Mukherjee devotes many pages to the difficulty of defining causation and establishing it in a situation where an experiment would have been deeply unethical. Correlational data can go wrong. It can also save lives.
I wrote a long piece about this issue here, and you should check it out if you’re interested. I just want to look at one example of how statistical skepticism could become more of a hindrance than a benefit.
Consider this graph, courtesy of The Washington Post and via the Dish:

Now. Here is a case where we have a simple, intuitive relationship between a statistical observation, the spike in searches for hangover cures, and a potential cause, a holiday associated with the consumption of alcohol. What I want to point out first is that you could just as accurately say “correlation does not equal causation” here as you can with any of the intentionally absurd correlations that are trotted out for rhetorical effect. That correlation does not prove causation is just as true with people Googling “hangover cure” as it is with the famous example of ice cream sales being strongly correlated with deaths by drowning. (Sticklers would in fact call this particular relationship an association, not a correlation, as the holiday is a categorical value and not a numerical one.) In both cases, it is accurate to say that the observed correlation does not prove causation. And yet with the hangover example, but not the ice cream one, I am willing to say that correlation in fact strongly implies causation.
How can I say that, when the fact that “correlation does not imply causation” has become holy writ on the internet? I can say it because I have a functioning human intelligence and the power of discrimination. I can say it because I have common sense. I can say it because I know that the definition of the word “imply” is not identical to the definition of the word “prove.” Most of all, I can say it because I have a strong theoretical, deductive basis for assuming causation. Or, to put it another way, I lack any remotely satisfying alternative explanation for why this search would spike around New Year’s. It might be true that all of these people are Googling hangover cures around this date because their great uncle Pappy died around this time, and they are drinking to forget. But that is exceedingly unlikely, far less likely than the simpler explanation that people drink too much at New Year’s and get hangovers. Contrast that with the ice cream sales and drowning deaths example. We lack a coherent explanation for how one could cause the other, and we have a perfectly good deductive reason for understanding the association: people both eat more ice cream and swim more often when the temperature goes up. It would indeed be dumb to assume that people eating more ice cream causes drownings, but luckily, we aren’t dumb. We have a broader understanding of the world and can use that understanding to avoid such a confused interpretation.
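For readers who like to see the ice cream case in numbers, here is a minimal sketch in Python. Everything in it is invented for illustration: the temperature range, the coefficients, and the noise levels are assumptions, not real sales or drowning figures. In the simulation, temperature drives both series and neither one affects the other, yet the two come out clearly correlated. Once the part of each series explained by temperature is removed, the leftover correlation is close to zero.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


def simulate(n_days=5000, seed=0):
    """Return (raw correlation, correlation after removing temperature)."""
    rng = random.Random(seed)
    # Hypothetical daily temperatures; the hidden common cause.
    temp = [rng.uniform(0, 35) for _ in range(n_days)]
    # Both series depend on temperature only, never on each other.
    ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temp]
    drownings = [0.1 * t + rng.gauss(0, 1) for t in temp]
    raw = pearson(ice_cream, drownings)
    adjusted = pearson(residuals(ice_cream, temp), residuals(drownings, temp))
    return raw, adjusted


if __name__ == "__main__":
    raw, adjusted = simulate()
    print(f"raw correlation: {raw:.2f}")
    print(f"after removing temperature: {adjusted:.2f}")
```

The code only knows to adjust for temperature because we told it to. That choice comes from our understanding of the world, not from the numbers themselves.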
Now we could complicate things. Here’s the same search over a larger time frame, from January of 2010 to December of 2014.
Here, we can see that a similar pattern exists: peaks that are fairly consistent around December and January. We’ve lost some of the granularity by broadening out, making it harder to see the specific concentration around New Year’s. But we can also see that the relationship, while consistent, is not exclusive; the rate of interest in hangover cures is not static at other times of the year, either. Nor is it immediately clear why the relative volume of searches for hangover cures has risen over time. You could easily imagine people making an intuitive leap based on some of this data that may not be responsible. But it is not irresponsible to look at the massive peak in searches in the first graph and assume a causal relationship.
What I’m arguing here, ultimately, is simple: that we have the benefit of our broader understanding of the world when we examine statistical data, and we should use it. I am also arguing that people who say “correlation is not causation” face a burden of proof too. If you want to look at that peak and sniff that correlation does not imply causation, that’s fine, but you had better be able to bring theory and evidence to bear to justify that skepticism. And your burden of proof will be higher than if I said that ice cream sales cause drowning deaths, because the situation is different. Methodological ideas do not exist in a vacuum, but occupy a broad theoretical and empirical framework that complicates them at every turn. We don’t have a general problem, online, with people being either too credulous or too skeptical towards statistical data. We have both too many writers running simple linear regressions and drawing overly broad conclusions from them without appropriate checks and skepticism, and too many people repeating this cliche like parrots without actually bothering to dig into the intellectual work of understanding the world. When I become frustrated by the overt credulity of some data journalism, I read the comments on sites that are home to many skeptics, such as io9, where I find standards of evidence so absurdly high that no real truth claims could ever survive them. The goal has to be to develop a happy medium, one that recognizes the difference between different claims, their strengths, their theoretical backing, and the presence or absence of plausible alternatives.
People who blindly repeat that correlation does not imply causation act as though appropriate empirical skepticism requires us to act as though we are all dumb. We aren’t all dumb. We get things wrong, we fall prey to spurious associations, we bite off more than we can chew, but we aren’t dumb. Let’s be skeptics, not nihilists, in 2015.

Fundamental principle of actuarial science: Correlation is not causation, but that’s the way to bet.
I don’t recall exactly where, but Richard Lewontin made a very similar argument a good while back, basically saying that there’s no point to science if correlation doesn’t imply causation. While I agree with you about the knee-jerk playing of that card on the internet, and also among graduate students (and some faculty) over-committed to appropriately critical stances on scientism, I really don’t want to throw the critical baby out with the pro-science bathwater. Having spent a good bit of time as a sociologist of agriculture and the environment, I have seen a remarkable amount of mainstream “basic” and “applied” science so mis-define problems as to generate really problematic correlations that are treated as causative and then made more “real” by being implemented in policy.
Moving from statistical claims and random sampling to machine learning, there’s been a huge back-and-forth in claims about causality with Bayesian networks. One of the jokes in that community is that correlation does not imply causality with two variables, but it might with four variables. As studies become more finely detailed in the data that is collected, I could well imagine truisms about causality and correlations becoming a bit more nuanced. Check out just about anything by Judea Pearl, and the book Causality in particular.
Just sayin, but the verb ‘imply’ very often does mean ‘prove’, in the sense of ‘include as a necessary consequence.’
You and I are on board a plane, about to skydive. I hand you a parachute designed for children. You say to me, “Did the salesman say that this parachute could work for an adult?” I reply, “He implied that it would.”
Do you jump?
That’s the use of ‘imply’, usually to describe a person’s speech or behavior, that means ‘suggest indirectly’. ‘Imply’, used to describe relations of logical parts, can *also* mean ‘logically entail’. As in ‘responsibility implies freedom’ or the like.
I think it’s the latter definition (which is the first one in my dictionary) that’s at work in the correlation/causation maxim.
Then let’s resolve to use “suggest,” instead.
Sure. Probably no one is so dumb as to say ‘correlation doesn’t suggest causation’, since everyone knows that sometimes it does, sometimes it doesn’t.
Between no parachute and a children’s parachute, I would jump right away from the cancer plane, if you get what I mean.
“Correlation is not causation” is a hack debating tool of anyone who wants to knock down someone else’s argument. Correlation is a key piece of data on the way to proving causation. You also want to find a mechanism, a systematic study controlling for co-factors, etc. Correlation is certainly a motivation for further study.
Demanding scientific certainty from a political adversary is also a hack trick. Scientists are extremely cautious and self-critical, and that’s a good thing, but that means that they often can’t provide a solid answer at a time when it’s needed. If you’re dealing with a problem in real time you can’t wait ten years for a scientist to tell you what you should have done ten years earlier. So you go with estimates, the scientific consensus, and whatever else you have in order to make your decision right now.
Hacks always use the legal “beyond reasonable doubt” principle, as if the law did not also have a “preponderance of evidence” principle (used in civil suits).
And the trick is ASSUMING your own point of view at the beginning, and demanding that the adversary PROVE his. Dirty pool, but it works.
I’m sort of more curious about why May 15 (or thereabouts) is the second-highest search of “hangover cure” on the first graph.
Memorial Day?
Sampling variability. The big effects in the time series appear to be (1) New Years and (2) weekends. Suppose we’re sampling 52 weekend values at random from some distribution. One of them is bound to be a maximum. In this particular sample that inevitable maximum happens to have fallen around mid May.
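A quick simulation makes the sampling-variability point concrete. This is only a sketch under made-up assumptions: each of the 52 weekend values is drawn independently from the same bell curve (mean 100, standard deviation 10, both arbitrary), so no week is special by construction. Some week still ends up as the maximum every year, it stands well above the average, and which week it is changes from one simulated year to the next.

```python
import random
from collections import Counter


def peak_week(rng, n_weeks=52):
    """Draw one simulated year; return (index of the top week, its value)."""
    # Assumed distribution: identical for every week, so no real effect exists.
    values = [rng.gauss(100, 10) for _ in range(n_weeks)]
    week = max(range(n_weeks), key=values.__getitem__)
    return week, values[week]


def run(trials=2000, seed=1):
    """Repeat the yearly draw; tally where the peak lands and how high it is."""
    rng = random.Random(seed)
    peaks = [peak_week(rng) for _ in range(trials)]
    counts = Counter(week for week, _ in peaks)
    average_peak = sum(value for _, value in peaks) / trials
    return counts, average_peak


if __name__ == "__main__":
    counts, average_peak = run()
    print(f"weeks that were the yearly maximum at least once: {len(counts)}")
    print(f"average height of the yearly maximum: {average_peak:.1f}")
```

In a setup like this, a lone second-place spike in one year of data needs no explanation beyond chance. A spike that recurred at the same date year after year would be a different matter.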
Imagine two naive, non-English-speaking androids reading this data. “The date being January 1st causes people to spontaneously search for the characters ‘hangover cure’,” says the first. “I am skeptical,” says the second. “January 1st being correlated with that search string doesn’t imply that it causes it. I think there is a third variable, a hidden variable, at work, affecting these.”
And in fact there is, as you said. January 1st doesn’t cause drinking. But the holiday celebrated on that day does cause non-regular drinkers to drink (regular drinkers wouldn’t be searching for hangover cures).
A deeper familiarity with the problem space, deeper than what’s presented on the graph, is crucial to making sense of the spike on that date and to understanding what’s being evidenced, both in this example and in the ice cream/death by drowning one.