p-value weirdness in a world of big data

I thought this Tim Harford piece, on the seduction of so-called Big Data and the notion of post-theory, was really good. Harford makes several important points about the ways in which Big Data enthusiasts have underestimated or misunderstood long-term issues with analyzing statistical data. I want to expand a little bit on the question of statistical significance in order to back up his skepticism.

Think of statistical significance like this. Suppose I came to you and claimed that I had found an unbalanced or trick quarter, one that was more likely to come up heads than tails. As proof, I tell you that I had flipped the quarter 15 times, and 10 of those times, it had come up heads. Would you take that as acceptable proof? You would not; with that small number of trials, there is a relatively high probability that I would get that result simply through random chance. In fact, we could calculate that probability easily. But if I instead said to you that I had flipped the coin 15,000 times, and it had come up heads 10,000 times, you would accept my claim that the coin was weighted. Again, we could calculate the odds that this happened by random chance, which would be quite low– close to zero. This example shows that we have a somewhat intuitive understanding of what we mean by statistically significant. We call something significant if it has a low p-value, or the chance that a given quantitative result is the product of random error. (Remember, in statistics, error = inevitable, bias = really bad.) The p-value of the 15 trial example would be relatively high, too high to trust the result. The p-value of the 15,000 trial example, low enough to be treated as zero. (Strictly speaking, the p-value is the probability of getting a result at least this lopsided if the coin were actually fair.)
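Here is a minimal sketch of the calculation for the coin example, in Python. It assumes a one-sided test (heads at least as often as observed) against a perfectly fair coin; the function name is my own invention. Because the 15,000-flip probability is too small for an ordinary floating-point number, the sketch works with its base-10 logarithm.

```python
import math

def upper_tail_log10(n, k_min):
    """log10 of P(X >= k_min) when X counts heads in n flips of a fair coin."""
    term = math.comb(n, k_min)  # sequences with exactly k_min heads
    total = 0
    for k in range(k_min, n + 1):
        total += term
        # Step from C(n, k) to C(n, k + 1) with exact integer arithmetic.
        term = term * (n - k) // (k + 1)
    # Divide the count of qualifying sequences by all 2**n sequences, in log space.
    return math.log10(total) - n * math.log10(2)

# 10 or more heads in 15 flips: happens about 15% of the time with a fair coin.
p_small = 10 ** upper_tail_log10(15, 10)

# 10,000 or more heads in 15,000 flips: kept as a logarithm to avoid underflow.
log10_p_large = upper_tail_log10(15000, 10000)

print(f"15 flips:     p = {p_small:.4f}")
print(f"15,000 flips: log10(p) = {log10_p_large:.1f}")
```

The first p-value comes out around 0.15, nowhere near any conventional threshold. The second is smaller than 10 to the power of -300, which is what "low enough to be treated as zero" looks like in practice.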

We account for this error (variability) by setting a particular threshold of p-value that we are willing to tolerate, which is called alpha. So if our alpha is .01, we would only call a result significant if there were no more than a 1% chance of seeing observations like ours through random error alone. Choosing alpha comes down to a variety of things, the most powerful of which, in practice, is usually the convention for a particular discipline or journal. In the human sciences, where what we’re looking at tends to be much more multivariate and harder to control than the physical sciences, we often use an alpha of .05. That means that, when there is no real effect, we would expect about 1 out of every 20 trials to produce a statistically significant result just by random chance. In other domains, particularly where it’s super important that our results be conservative– think, say, whether a drug is effective– we could use an alpha as low as .0001. It’s easy to say that we should just choose the smallest alpha that’s feasible, but the danger there is that you may never make any claims about real differences in the world because you’ve diminished the power of your mechanisms. (Statistical power = the ability to avoid false negatives.) It all depends, and statisticians and researchers have to make a series of judgment calls throughout research.
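To see the "1 out of every 20" idea in action, here is a small simulation sketch. Everything in it is an assumption made for illustration: the data are pure noise (normal draws with mean 0 and a known standard deviation of 1), the test is a simple two-sided z-test, and the function names, sample sizes, and seed are made up.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(alpha, experiments=10_000, n=20, seed=0):
    """Share of no-effect experiments that still come out 'significant'."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true mean is 0
        z = (sum(sample) / n) * math.sqrt(n)  # z-test with known sd of 1
        if two_sided_p(z) < alpha:
            hits += 1
    return hits / experiments

rate_05 = false_positive_rate(0.05)
rate_01 = false_positive_rate(0.01)
print(f"alpha = .05 -> {rate_05:.3f} of null experiments significant")
print(f"alpha = .01 -> {rate_01:.3f} of null experiments significant")
```

Both rates should land close to the alpha that was chosen, even though there is nothing to find in the data. A smaller alpha buys fewer flukes, at the cost of power.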

Harford talks about the multiple-comparisons problem, which really gets at some of the profound weirdness of how p-value operates, and how much theory there always is hiding in empiricism.

The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.

There are various ways to deal with this but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. Without careful analysis, the ratio of genuine patterns to spurious patterns – of signal to noise – quickly tends to zero.
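To put a rough number on Harford's point, here is a back-of-the-envelope sketch. It assumes the comparisons are independent and that none of them reflects a real effect, which is a simplification; the function name is hypothetical.

```python
def chance_of_any_fluke(alpha, comparisons):
    """P(at least one false positive) across independent no-effect tests."""
    return 1 - (1 - alpha) ** comparisons

for m in (1, 8, 20, 100):
    print(f"{m:>3} comparisons: {chance_of_any_fluke(0.05, m):.3f}")
```

At an alpha of .05, the chance of at least one fluke is roughly 34% with 8 comparisons, 64% with 20, and over 99% with 100.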

I wrote about this exact temptation a while back when I talked about the Texas Sharpshooter problem. The Texas Sharpshooter is the guy who fires his gun into the side of his barn, looks to see where the bullets are clustered, and then paints a bullseye around the cluster. In a world where we have sets of massive data and the ability to perform near-instantaneous computations with spreadsheets and statistical packages, it’s potentially a major problem. As Harford points out, there are techniques to adjust alpha in a way that can help account for these issues– the Bonferroni adjustment, Tukey’s range test, Dunnett’s, the Scheffé method. There are lots of ways to use these intelligently, but you have to be careful. In the simplest, Bonferroni, you just divide your alpha by the number of comparisons you’re making. The problem is that, if you’re making as many as 7 or 8 or more comparisons, you’re giving yourself a tiny alpha for every individual comparison, making it more and more likely you won’t be able to see real differences. Again: a balance of interests, presumptions, judgment calls.
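The Bonferroni arithmetic can be sketched in a few lines. The p-values below are made up purely for illustration, and the function name is my own.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-comparison threshold and which p-values clear it."""
    threshold = alpha / len(p_values)  # split alpha evenly across comparisons
    return threshold, [p <= threshold for p in p_values]

# Hypothetical p-values from 8 comparisons (invented numbers, not real data).
p_values = [0.001, 0.004, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]

threshold, significant = bonferroni(p_values)
uncorrected = sum(p < 0.05 for p in p_values)

print(f"per-comparison alpha: {threshold}")
print(f"significant without correction: {uncorrected}")
print(f"significant with Bonferroni:    {sum(significant)}")
```

With 8 comparisons the per-comparison alpha drops to .00625, and three of the five results that would have passed at .05 no longer do. Whether those three were flukes or real differences is exactly what the correction cannot tell you.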

Philosophically, this is also all kind of strange. A big part of dealing with the vagaries of significance testing is beginning with a set plan, a number of comparisons that you want to look at before you even do data collection, and one that you have good reason to explore thanks to theory. The real serious stats guys I know are super strict about that stuff. That comes from precisely the kind of problems that Harford is talking about. But it is kind of weird: if a data set has a particular statistically significant relationship hidden in it, what difference does it make if you look for it or not? The numbers are the same. Not looking at particular relationships may prevent you from finding a spurious relationship, but it doesn’t change the numbers. The point is not that the numbers aren’t there if we don’t look. The point is that, while sample size, the central limit theorem, and independence of observations can help, in a world of variability, there will often be relationships that look real but aren’t, so we should only go looking for them if we have good reason to– if we have a good theory, in other words.

Things get even stranger when we’re talking about multivariable methods like regression and ANOVAs. Individual predictors confound and interact with each other. Entire models can be significant without any individual predictors being significant; entire models can be insignificant while specific predictors are super significant. (OK, can have super low p-values. Some stats guys are strongly against ever saying “more” or “less” significant.) A predictor can be totally insignificant when in the model by itself but become super significant in the presence of another predictor. Either of two predictors can be significant if it goes in first but not if the other is in there first, thanks to overlapping sums of squares. In polynomial regression, we have to keep insignificant lower-order terms in our model if a higher-order term is significant. And so on.

The really interesting thing is that, as Harford points out, we’ve known the basic contours of these problems for decades. They really aren’t new. They’ve just taken on new urgency in a world of Big Data hubris.

The point is not to be nihilistic about data. The point is that all of these are theory-laden, value-laden, choice-laden exercises. They require us to make decisions, often involving tradeoffs between predictability and interpretability, often involving a more conservative, less interesting model vs. a more interesting but potentially distorting model. All of that is theory. In a world where the Nate Silver/Ezra Klein/“I consider myself an empiricist” vision of knowledge is ascendant, we have to remind people that there has never been a statistical test devised that can’t go wrong, even in the hands of a smart and principled investigator.


  1. Nice piece. I would suggest, however, that you consider using a gender-neutral term for “stats guys.” Perhaps you could try statisticians.

    1. I confess that I’ve always thought of “guys” as gender neutral, although perhaps I didn’t mean it that way at the time. It’s a fair point.

  2. Great post. The one thing I would add is that the search for significance typically ignores the effect size (how large the significant effect is). Reporting both p and strength of effect would probably rule out some of the spurious results.

    1. Totally. But I think effect size can lack that kind of intuitive understanding that I mentioned with p-value above, so it’s easier for people to ignore.

  3. What about controlling for the “false discovery rate” instead of adjusting for multiple comparisons? Isn’t that in vogue now?

      1. The FDR leads to rejecting the null hypothesis when you shouldn’t (it’s less stringent than many multiple comparison methods such as Bonferroni). It’s useful, though, when you’re generating potential hypotheses.

        1. This is my impression too- I’ve heard of the FDR getting increasingly fashionable, but I still have qualms about the lack of conservatism.

          FDR methods, as far as I can tell (my knowledge of statistics is fairly rudimentary) strive to control the ratio

          false positives / (false positives + true positives),

          whereas more traditional methods strive to control the experimentwise error rate, i.e.

          false positives / (false positives + true negatives).

    1. Oh, no expertise, no. I just have learned a lot in the last few years in grad school and want to put some ideas into conversation. Definitely not trying to claim expertise.

      1. We call something significant if it has a low p-value, or the chance that a given quantitative result is the product of random error. (Remember, in statistics, error = inevitable, bias = really bad.) The p-value of the 15 trial example would be relatively high, too high to trust the result. The p-value of the 15,000 trial example, low enough to be treated as zero.

        Wait, you forgot there are two kinds of error.

        1. I tell you that you do not have cancer; in fact you do.

        2. I tell you that you do have cancer; in fact you don’t.

        p-values and alpha levels only cover one of these. Statistical power comes from the number of observations (like in your coin example).

        Check out http://notstatschat.tumblr.com/post/67132955516/moving-the-goalposts

  4. I spend an annoying amount of time chasing my own tail about multiple comparisons corrections.

    When data are plentiful, another idea I’ve heard is to split your data into an exploratory and a confirmatory set: you do whatever you want to in the exploratory set, but you only trust the models that look similar in the confirmatory set. At least, this idea is good in theory. I have never gotten to try it out myself. In fact I don’t think I have ever seen anyone use it in a paper, maybe because nobody ever feels they have enough data. In fairness, convincing NIH to give you enough money for 2n subjects, where n is the number your power calculations support, is probably a non-starter.

    1. You’ve just described a big cultural problem– everybody feeling forced to chase a particular p-value rather than being more pragmatic with their data.

      1. Well, and “pragmatic” to scientists often means “do a zillion comparisons and then come up with some justification for only reporting the ones that look good.” When I started in a stats program I thought (/hoped) I’d hear a lot about this issue, but weirdly I don’t. The math stat folks aren’t thinking much about data at all!

        Andrew Gelman (a statistician in political science at Columbia) has been on this issue for the last year or so as it’s gained attention in psychology, another field that relies on stats a lot. I recommend his blog if you are interested in the topic.

  5. In my experience, part of the difficulty is that people too easily conflate “statistically significant” with “true” when it more correctly means “not obviously false.” As suggested above, a statistically significant result may be worth your time to follow up and attempt to confirm with an independent data set while an insignificant result is probably not worth the effort.

    We should all bear in mind that science is about preponderance of evidence, not proof. My favorite public domain example is the case of Vioxx: according to the court documents, in every trial that looked at heart issues the evidence for rejecting the “causes no heart problems” was only borderline significant and easily dismissed – but taken together it was impossible to argue that every single study had been unlucky in the same direction and thus the preponderance of evidence showed that Vioxx had serious side effects.

  6. Basic FDR can be quite devastating for your results indeed, which should be totally unnecessary especially if you have clear (and preferably one-directional) hypotheses.

    I found q-values (which have a GUI in R) quite appealing after doing some digging into the subject a while ago: a much smaller chance of false negatives than the FDR. They do however require independence of p-values, especially in small datasets, so ideally you’d design your experiment to accommodate this.

  7. Can’t find a way to comment on http://www.academia.edu/6516483/Statistical_Hands_Rhetorical_Hearts on academia.edu nor does it look like you’ve posted the paper here, so …. just leaving a few thoughts here.

    As a “stats person”: I’m not sure how stats could benefit rhetoric or humanities people. But there are pretty obvious blind spots where people who spent their time broadening their intellect rather than (like us) narrowing it to the point of mere niftiness could help: we suffer from the groupthink of everyone having done their undergraduate studies in some soft science or other, and I end up talking to people who think the answer to various problems is moar hyperparameters.

    The basics [distributions, conditional dist, Taylor expansion, a couple measures of dispersion, centreing data, Q-Q vs a Gaussian, mean/median/mode, median polish, covariance] aren’t super hard to get (especially if you skip the proofs of everything except Gauss-Markov theorem). If I were giving advice I’d say find a good exploratory data analysis class and do some regressions on data you give a sh_t about. Any teacher worth their salt should focus on assumptions more than the proofs and use a data set worth caring about. I’d also say look at what a Gaussian smoother does to photos and think about 1/2^|x| for x=0,±1,±2,±3 as a way to not be scared by the standard normal eqn.

    Come to think of it I have a lot of opinions about how statisticians shouldn’t waste students’ time and should do a better job of attracting humanities people. But the real reason is that I’d have better discussions as a result. Maybe the fact that “we” are actually suffering from the lack of “you”‘s around us, is more compelling than the need to «speak Dean»?

  8. The point is that all of these are theory-laden, value-laden, choice-laden exercises.

    You aren’t focussing on this as much in this post, but the choice of what variables to measure, defining how the variables will be coded, and choosing what subjects to study in the first place (i.e., pre-statistics experiments—and remember that statisticians usually have little or no experimental design training unless they got it on the job) are where [I personally] think the humanities people might add the most value. Within the entire universe of dumb things to do (rather than in a specified model), what dumb things are we doing, before we even get numbers to play with?
