archives: p-value weirdness in a world of Big Data

This post was originally published on March 30, 2014.

I thought this Tim Harford piece, on the seduction of so-called Big Data and the notion of post-theory, was really good. Harford makes several important points about the ways in which Big Data enthusiasts have underestimated or misunderstood long-standing issues with analyzing statistical data. I want to expand a little bit on the question of statistical significance in order to back up his skepticism.

Think of statistical significance like this. Suppose I came to you and claimed that I had found an unbalanced or trick quarter, one that was more likely to come up heads than tails. As proof, I tell you that I had flipped the quarter 15 times, and 10 of those times, it had come up heads. Would you take that as acceptable proof? You would not; with that small number of trials, there is a relatively high probability that I would get that result simply through random chance. In fact, we could calculate that probability easily. But if I instead said to you that I had flipped the coin 15,000 times, and it had come up heads 10,000 times, you would accept my claim that the coin was weighted. Again, we could calculate the odds that this happened simply thanks to the underlying variability of flipping a coin, which would be quite low – close to zero. This example shows that we have a somewhat intuitive understanding of what we mean by statistically significant.
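To make that concrete, here's a minimal sketch of the calculation – my own illustration, not something in the original post – using SciPy's binomial distribution and assuming a fair coin as the null:

```python
from scipy.stats import binom

# Chance of at least 10 heads in 15 flips if the coin is actually fair
print(binom.sf(9, 15, 0.5))          # ~0.15 -- entirely plausible by luck alone
# Chance of at least 10,000 heads in 15,000 flips if the coin is actually fair
print(binom.sf(9_999, 15_000, 0.5))  # effectively zero
```

A result you'd see about 15% of the time by chance proves nothing; a result you'd essentially never see by chance is another story.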

We would call the possibility that the coin is in fact like any other coin the null hypothesis. We would call the possibility that the coin is in fact meaningfully different from any other coin the alternative hypothesis. The p-value tells us the probability of seeing a result at least as extreme as the one we observed if the null hypothesis were true. In other words, the p-value tells us whether an observed quantitative difference is likely to be the result of the underlying variation in our experiment, or whether it is likely (never certain) to be the result of a real underlying phenomenon in our data. If the p-value is low, we tend to reject the null hypothesis and proceed as if there is some underlying difference in what we're comparing. But we're never 100% certain; there's always variability.

We account for this variability by setting a particular threshold of p-value that we are willing to tolerate, which is called alpha. So if our alpha is .01, our analysis would require that there be only a 1% chance that our observations were the product of random error before we called the result significant. Choosing alpha comes down to a variety of things, the most powerful of which, in practice, is usually the convention for a particular discipline or journal. In the human sciences, where what we're looking at tends to be much more multivariate and harder to control than in the physical sciences, we often use an alpha of .05. That means that in 1 out of every 20 trials where there is no real effect, we would expect to get a statistically significant result just by random chance. In other domains, particularly where it's super important that our results be conservative – think, say, whether a drug is effective – we could use an alpha as low as .0001. It's easy to say that we should just choose the smallest alpha that's feasible, but the danger there is that you may never make any claims about real differences in the world because you've diminished the power of your tests. (Statistical power = the ability to avoid false negatives.) It all depends, and statisticians and researchers have to make a series of judgment calls throughout research.
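That 1-in-20 figure is easy to see with a quick simulation (again, my sketch, not from the post): run a t-test over and over on two groups drawn from the same distribution, so the null hypothesis is true by construction, and count how often p dips below .05.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha = 0.05
trials = 2_000
false_positives = 0

for _ in range(trials):
    # Both groups come from the same distribution: the null hypothesis is true.
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    if ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(false_positives / trials)  # hovers around 0.05 -- roughly 1 in 20
```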

Harford talks about the multiple-comparisons problem, which really gets at some of the profound weirdness of how p-value operates, and how much theory there always is hiding in empiricism.

The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.

There are various ways to deal with this but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. Without careful analysis, the ratio of genuine patterns to spurious patterns – of signal to noise – quickly tends to zero.

I wrote about this exact temptation a while back when I talked about the Texas Sharpshooter problem. The Texas Sharpshooter is the guy who fires his gun into the side of his barn, looks to see where the bullets are clustered, and then paints a bullseye around the cluster. In a world where we have sets of massive data and the ability to perform near-instantaneous computations with spreadsheets and statistical packages, it's potentially a major problem.

There are techniques to manipulate alpha in a way that can help account for these issues – the Bonferroni adjustment, Tukey's range test, Dunnett's test, the Scheffé method. There are lots of ways to use these intelligently, but you have to be careful. In the simplest, Bonferroni, you just divide your alpha by the number of comparisons you're making. The problem is that, if you're making as many as 7 or 8 or more comparisons, you're giving yourself a tiny alpha for every individual comparison, making it more and more likely you won't be able to see real differences. Again: a balance of interests, presumptions, judgment calls.
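In code, the simple Bonferroni version really is just division; the p-values below are made up purely to show how a result that clears .05 on its own can fail the adjusted bar.

```python
def bonferroni(p_values, alpha=0.05):
    """Which comparisons survive the simple Bonferroni adjustment?
    Each p-value is held to alpha divided by the number of comparisons."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Eight comparisons: the bar drops from 0.05 to 0.05 / 8 = 0.00625,
# so nominally "significant" results like p = 0.03 no longer count.
print(bonferroni([0.004, 0.03, 0.04, 0.20, 0.51, 0.007, 0.012, 0.0004]))
# [True, False, False, False, False, False, False, True]
```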

Philosophically, this is also all kind of strange. A big part of dealing with the vagaries of significance testing is beginning with a set plan, a number of comparisons that you want to look at before you even do data collection, and one that you have good reason to explore thanks to theory. The real serious stats guys I know are super strict about that stuff. That comes from precisely the kind of problems that Harford is talking about. But it is kind of weird: if a data set has a particular statistically significant relationship hidden in it, what difference does it make if you look for it or not? The numbers are the same. Not looking at particular relationships may prevent you from finding a spurious relationship, but it doesn't change the numbers. The point is not that the numbers aren't there if we don't look. The point is that, while sample size, the central limit theorem, and independence of observations can help, in a world of variability, there will often be relationships that look real but aren't, so we should only go looking for them if we have good reason to – if we have a good theory, in other words.
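A quick way to see why (my own toy simulation, not anyone's actual data): generate a table of pure noise and go fishing for pairwise correlations. With enough comparisons, "significant" relationships show up even though, by construction, nothing real is there to find.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
data = rng.normal(size=(100, 40))  # 40 columns of pure noise, no real relationships

hits = 0
n_cols = data.shape[1]
for i in range(n_cols):
    for j in range(i + 1, n_cols):
        _, p = pearsonr(data[:, i], data[:, j])
        if p < 0.05:
            hits += 1

# 40 * 39 / 2 = 780 comparisons; expect roughly 5% of them, around 39 pairs,
# to look "significant" by chance alone.
print(hits)
```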

Things get even stranger when we're talking about multivariable methods like regression and ANOVA. Individual predictors confound and interact with each other. Entire models can be significant without any individual predictors being significant; entire models can be insignificant while specific predictors are super significant. (OK, can have super low p-values. Some stats guys are strongly against ever saying "more" or "less" significant.) A predictor can be totally insignificant when in the model by itself but become super significant in the presence of another predictor. Two predictors can each be significant when entered first but not when the other is entered first, thanks to overlapping sums of squares. In polynomial regression, we have to keep insignificant predictors in our model if they precede significant predictors. And so on.
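One of those oddities is easy to reproduce with a toy regression (again, my example, using statsmodels, not anything from the post): make two predictors nearly collinear and the overall model can be wildly significant while neither individual coefficient clears the bar.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1
y = x1 + x2 + rng.normal(size=n)          # y genuinely depends on both

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.f_pvalue)   # overall model: essentially zero
print(fit.pvalues)    # individual slopes: can easily sit well above .05
```

The model as a whole "knows" something real is going on, but because the two predictors carry nearly the same information, neither one can claim the credit individually.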

The really interesting thing is that, as Harford points out, we’ve known the basic contours of these problems for decades. They really aren’t new. They’ve just taken on new urgency in a world of Big Data hubris.

The point is not to be nihilistic about data. The point is that all of these are theory-laden, value-laden, choice-laden exercises. They require us to make decisions, often involving tradeoffs between predictability and interpretability, often involving a more conservative, less interesting model vs. a more interesting but potentially distorting model. All of that is theory. In a world where the Nate Silver/Ezra Klein “I consider myself an empiricist” vision of knowledge is ascendant, we have to remind people that there has never been a statistical test devised that can’t go wrong, even in the hands of a smart and principled investigator.