Statistical significance is an essential but confusing concept. As I’ve discussed in this space before, statistical significance fundamentally concerns the chance that a given quantitative result is merely the product of random variation, of the variability that’s inherent to real-world numbers. (**Update:** Better — the odds that we’d observe the given result, or a more extreme result, simply through the underlying variability of our data.) Suppose I have a class of 20, ten girls and ten boys. I give the class a math test. I find that the average score on the math test is 17 for girls and 14 for boys. I conclude that this means that girls are better at math than boys. Do you trust this result?

Intuitively, you almost certainly don’t, and you shouldn’t. That’s because we know that life is full of variability — of individual differences that are the product of factors that aren’t related to the construct we’re investigating. (In this case, gender differences in math ability.) With only 10 observations in each group, we can imagine all sorts of reasons that the data shows this difference that aren’t actually a direct result of gender. For example, a couple of the boys might not have had breakfast that morning, maybe one didn’t get enough sleep the night before, and maybe another didn’t take the test that seriously. Or, more generally, maybe my class just has boys who are a little behind and girls who are a little ahead by random sorting processes. Statistical significance tests are designed to help us avoid mistakenly perceiving an effect, or a difference, or a relationship, based on the random variability that’s inherent to life.

By far the most common statistic used to represent statistical significance is a p-value. A p-value is calculated by looking at the size of a perceived difference between groups (or, alternatively, at the strength of a relationship), at the variability within the groups that you’re examining (that is, whether the averages for our boys and our girls are derived from test scores that are all lumped together or test scores that are all spread apart), and at the number of our observations. (These are used along with an assumed ideal distribution that we don’t need to worry about here.) Generally speaking, the bigger the difference between the groups (or the more they vary together when measuring their relationship), the lower the spread of the data (that is, the tighter together the data points are clustered), and the more repetitions we have, the more likely the results are to be statistically significant. This should make intuitive sense: a bigger difference is less likely to be the result of random noise, which if truly random would tend to move data in both directions at once; less spread to our data makes us more confident that the result is real, as our individual data points are more alike rather than more different; and the more repetitions we have, the less likely it is that we’ve simply gotten unlucky and randomly drawn data that has a fake difference within it. A p-value tells us the chance that a given result is simply the product of random variation, so the lower, the better. By convention, we compare this p-value to a set value, called an alpha, to determine whether or not we consider the result significant. Alpha is typically determined by field- and journal-specific standards. In human research, such as in education and language, we’ll frequently use a .05 alpha — that is, a result is significant if it has less than a 5% chance of being the result of random variation. In other contexts, we might use an alpha of .01 or .001.
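
To make this concrete, here’s a minimal sketch of the kind of calculation that produces a p-value for our two-group comparison. The scores below are invented for illustration (chosen only so the means match the 17 and 14 above), and I’m assuming a standard two-sample t-test, not any particular procedure from the discussion:

```python
# Hypothetical scores for ten girls and ten boys, invented so that the
# group means match the example above (17 and 14).
from scipy import stats

girls = [18, 16, 17, 19, 15, 17, 18, 16, 17, 17]  # mean = 17
boys = [14, 15, 13, 16, 12, 14, 15, 13, 14, 14]   # mean = 14

# Welch's t-test weighs the size of the difference in means against the
# spread within each group and the number of observations.
t_stat, p_value = stats.ttest_ind(girls, boys, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")

# By convention, compare the p-value to a preselected alpha.
alpha = 0.05
print("significant" if p_value < alpha else "not significant")
```

The three ingredients from the paragraph above are all in there: the gap between the means, the spread within each group, and the sample size all feed into the t statistic, which is then converted into a p-value using an assumed distribution.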

Statistical significance is absolutely essential in a world of randomness and error. But the concept is often misapplied or misunderstood. In particular, it’s all too easy to mistake statistical significance for practical significance. A very large study can find a very low p-value, suggesting that the observed difference is real, even while the effect is so small as to be of negligible real-world value. In other words, *p-value is not a measure of the strength of a difference or effect.* If we did do a large-scale study that proved inherent gender differences in math ability to a very low p-value, that would only suggest that there is a real difference — a difference that is unlikely to be the product of random variation. But the actual difference could be so small as to have no meaningful consequences for educators. For this reason, my old stats instructor was a stickler for not saying “very significant” or similar terms, as he felt such language made it easier for people to mistake a very low p-value for a very strong result. Instead, a result is either significant to a preselected alpha or it isn’t.

Because a p-value can vary dramatically even among values that are all considered significant, some people tend to think that a very low p-value guarantees a practically significant (strong) difference or relationship. But remember, the spread of our data and the number of repetitions we run have a large impact on our p-value. A large sample size can make a tiny difference statistically significant. Indeed: a classic bit of research trickery is to run an experiment, find that your difference is not significant to the alpha prescribed by the journal you want to publish in, run a few dozen more repetitions (expand your sample size), and hey, there’s a significant result! This isn’t technically cheating on the numbers, but it undermines the inherent conservatism of significance testing. You’re getting a significant result not because the underlying effect you’re studying is stronger, but because you’ve increased your statistical power — the ability of your research to avoid missing a real, underlying relationship in your data.
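
The sample-size effect is easy to demonstrate by simulation. This sketch (with invented data) holds a negligible true difference fixed, one twentieth of a standard deviation, and shows what happens to the t-test’s p-value as the sample grows:

```python
# Simulate two populations whose true means differ by only 0.05 standard
# deviations, a difference too small to matter in practice, and run the
# same t-test at increasing sample sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

for n in [50, 500, 50_000]:
    group_a = rng.normal(loc=0.00, scale=1.0, size=n)
    group_b = rng.normal(loc=0.05, scale=1.0, size=n)
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"n = {n:>6}: p = {p_value:.4f}")
```

The underlying effect never changes; only the sample size does. At the largest sample size the tiny difference comes out significant, which is exactly why a low p-value alone can’t tell you whether an effect is big.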

What we want, then, is a metric that can be reported alongside p-value to tell us the strength of our effect. This is effect size. Effect size tells us the size of a difference, the strength of a relationship, the impact of an intervention, the power of signal relative to noise, and similar.

The most common way to represent effect size (and the way that I understand best) is Cohen’s d. Cohen’s d is a statistic derived by pooling the standard deviation of the groups that you’re comparing — that is, by finding out how variable the data in both groups are. You then divide the difference in means of these two groups by that pooled standard deviation. What does that do for you? It allows you to compare the average of one group to where it would fall in the percentile distribution of the other. So think again about our girls and boys. If our Cohen’s d is 0 for gender effects in math, that means that the mean of our girls would sit at exactly the 50th percentile for boys. In other words, the average girl would be just like the average boy. If, on the other hand, the effect size was .5, the average girl would sit at the 69th percentile if she were a boy, indicating a healthy advantage for girls relative to boys in math. You can find a good conversion chart here. Because it is based on standard deviation comparisons, Cohen’s d has a practical range from -3 to +3, though in most research contexts you are very unlikely to find anything bigger than -2 or +2, and will very often be dealing with results between 0 and 1 in human research. Cohen defined a small effect size at .2, medium at .5, and large at .8, but I personally don’t think that such definitions are much use outside of a specific research context.
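
For the curious, Cohen’s d is simple enough to compute by hand. This is a sketch of the pooled-standard-deviation formula described above, plus the normal CDF used for the percentile conversion (a d of 0.5 puts the average member of one group at roughly the 69th percentile of the other):

```python
import math
import statistics

def cohens_d(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = statistics.stdev(group1), statistics.stdev(group2)
    # Pool the two sample standard deviations, weighted by group size.
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

def percentile_of_mean(d):
    """Where group 1's mean falls in group 2's distribution (standard normal CDF)."""
    return 0.5 * (1 + math.erf(d / math.sqrt(2)))

print(percentile_of_mean(0.0))            # 0.5, the 50th percentile
print(round(percentile_of_mean(0.5), 2))  # 0.69, the 69th percentile
```

Note that the sample sizes appear only as weights in the pooling step; unlike a p-value, d does not shrink or grow just because you collected more data.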

Correlational studies are easy because the most common coefficient of correlation, the Pearson r (what most people think of when they think of correlation), is essentially an effect size itself. The Pearson r itself tells us the strength of a correlation. And because statistical software typically reports a p-value alongside a correlation, we can tell whether a correlation is statistically significant in addition to knowing the strength of the relationship (that is, again, whether the perceived relationship is likely the result of random variation). In my own research, where I am often mining thousands of text samples for data, I will often find very weak correlations (say .20) that are nonetheless significant to an alpha of .001, thanks to the size of my sample. This is a good example of the difference between a significant relationship and a meaningful relationship.
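
That weak-but-significant pattern is easy to reproduce with simulated data. This sketch builds a deliberately weak relationship (true r around .20) and computes Pearson’s r and its p-value at two sample sizes; the data are invented, not from any real corpus:

```python
# Simulated data with a weak underlying relationship (true r near .20):
# at n = 20 such a correlation is typically nowhere near significant,
# while at n = 5000 it is significant far beyond alpha = .001.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

for n in [20, 5000]:
    x = rng.normal(size=n)
    y = 0.2 * x + rng.normal(size=n)  # weak signal plus lots of noise
    r, p = stats.pearsonr(x, y)
    print(f"n = {n:>4}: r = {r:.2f}, p = {p:.5f}")
```

The r barely moves between the two runs, because it measures the strength of the relationship; only the p-value changes with the sample size.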

Statistical significance is still very important. I could run a comparison based on very few observations and find a difference that appears very powerful. But, just as in the boys vs. girls math study discussed at the start, it could easily be the case that a difference that seems large actually stems from random differences in our population. Statistical significance can help assuage our fears that an observed effect size is merely the product of random variation. In the future, researchers should report both effect size and p-values together in order to minimize the always-present risk of random variation or bias appearing to be a real difference.

Part of the beauty of effect size is that, because these measures are standardized, they can be compared to each other from one research study to the next. P-values can’t be meaningfully compared in this way; you can’t say, for example, that a study that was significant to .01 proved a stronger effect than one that was significant to .05. This ease of comparability is particularly important given the deep need for researchers to perform meta-analysis and replication studies. We’re in something of a crisis moment when it comes to the replicability and robustness of our research; too many studies are proving impossible to replicate, casting doubt on a great deal of our research archives. By reporting effect size, we make it much easier to distinguish statistical significance from practical significance, and we make it easier for other researchers to replicate and validate our work.

There’s tons of more detailed and sophisticated information out there by people who know much more about this than I do. You can easily find effect size calculators online, and Cohen’s d is neither particularly onerous to calculate nor hard to understand. Paul D. Ellis has written several books and this very useful FAQ on effect size and why it matters. If you’ve spotted an error in this post (always a possibility) or you have questions, feel free to drop a comment below.

Nice post! If people are interested in seeing a different (Bayesian) perspective, I recommend checking out Andrew Gelman’s blog, andrewgelman.com. I have a couple of small technical comments:

1. “A p-value tells us the chance that a given result is simply the product of random variation”: Not quite. It’s the chance that the given result *could be* observed simply because of random variation under the null model (e.g., assuming that there is no real difference between the groups). It’s almost what you wrote, but to see the difference, check out xkcd.com/1132.

2. “a classic bit of research trickery… This isn’t technically cheating on the numbers”: This *is* cheating on the numbers. Because the person checked the results at multiple points in the experiment and could have stopped at any of them, they need to do a multiple comparisons correction to all of their p-values. Otherwise, they’re effectively flipping a coin repeatedly and then claiming something interesting must have happened as soon as they get their first long run of heads.

Thanks very much for your input! Somebody on Twitter made the same point about p-value; I’ll provide a correction when I’m at a computer.

It’s a shame that the social sciences are still stuck in the dark ages of frequentist methodology. In recent decades, the inherent superiority of Bayesian methodology has gotten greater and greater recognition in the hard sciences (or at least in physics, the one I know the most about).

Luckily you’re here to tell us how much we suck!

You only suck if you’re not willing to learn. I converted to Bayesianism many years ago when frequentist methods gave me a nonsensical result on a problem I was working on (specifically, a frequentist upper limit on the occurrence of an unobserved phenomenon got worse when the experiment was re-analyzed assuming better resolution). This sort of thing happens a lot with frequentist methods in low-statistics situations. (In high-statistics situations, the conclusions are essentially independent of method, as they should be.) Frequentists have no coherent position about this; they just change methods willy-nilly until hitting on one that isn’t obviously wrong. Bayesians, on the other hand, start with a philosophically coherent ideology, from which everything flows smoothly.

Dude, you’re making Bayesianism look terrible.

“In high-statistics situations, the conclusions are essentially independent of method”: There are situations where the frequentist approach works but a Bayesian one fails, even in the limit of infinite data (http://www.biostat.harvard.edu/robins/coda.pdf).

As for that “philosophically coherent ideology,” check out Gelman and Shalizi: http://arxiv.org/abs/1006.3868. (But first, anyone who’s actually reading this should check out Shalizi’s blog, bactra.org/weblog — statistics, socialism, good writing, it’s got it all!)

In the first paper you cite, Bayesian methods “fail” only if priors are chosen that assign zero probability to the possibility that some variables are correlated. To which I say: duh.

I simply disagree with the second paper, as do many others. And it offers no coherent frequentist philosophy, which I continue to argue does not exist.

Gelman, the Bayesian whose blog I recommended, is a political scientist.

Thanks for this. I had a great stats teacher in grad school. I failed the midterm, and then busted my ass trying to make up the grade. Along the way I got really excited about the subject, and certain I wanted to pursue a related career.

Now I can’t remember a damn thing, but I’m always thinking I’ll find the time to read up on it. Your blog is the only place I ever actually do.

Yes, to add to Daniel Weissman, shorthand journalistic definitions of p values are pernicious because they are always wrong. [You know this, Freddie, but] p value has a 4-part technical definition. It is tied to the underlying model, to an assumed normal distribution of results around the mean, and to the assumption that your *null hypothesis is correct.* That’s what p means. No more, no less. If you try to summarize it you’ll be wrong. P value definitely does not mean “the odds that my finding was the result of random chance,” no matter how desperately we want that information. (That’s the inverse probability fallacy.)

It’s a pet peeve of mine, because if you try to explain to someone what a p value is, in a context where it matters, they can always say, “No, you’re wrong, I read a different definition in” the New York Times, the New Yorker, Vox, or wherever. Even intro statistics textbooks get it wrong.

Assuming a normal distribution isn’t necessarily part of calculating a p-value. Your null hypothesis could be (for example) that the data are Poisson-distributed with two groups having equal means.
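
To illustrate the commenter’s point that a p-value needn’t assume normality, here is a sketch of a distribution-free alternative: a permutation test on two groups of invented count data, where the null hypothesis is simply that the group labels are interchangeable:

```python
# Two groups of invented count data; the null hypothesis is only that
# the group labels are interchangeable, with no normality assumption.
import numpy as np

rng = np.random.default_rng(2)
group_a = np.array([3, 5, 2, 4, 6, 3, 5, 4])
group_b = np.array([7, 6, 8, 5, 9, 7, 6, 8])
observed = abs(group_a.mean() - group_b.mean())

# Repeatedly shuffle the pooled data and see how often a random split
# produces a difference at least as large as the one we observed.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_perms = 10_000
hits = 0
for _ in range(n_perms):
    rng.shuffle(pooled)
    if abs(pooled[:n_a].mean() - pooled[n_a:].mean()) >= observed:
        hits += 1

p_value = hits / n_perms
print(f"observed difference = {observed}, permutation p = {p_value:.4f}")
```

The p-value here has the same interpretation as before (how often randomness alone would produce a difference this big), but no ideal distribution is assumed anywhere.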

This is a nice write-up. For brevity, I explain p-values as helping us answer the question “Is it real?” when we’re looking at an observed difference, while an effect size measurement like Cohen’s d or Pearson’s r answers “Is it big?”

Having one without the other is often misleading.

I’m a little confused by Pearson’s r being an effect size. A data set that shows a nearly (but not exactly) horizontal line wouldn’t seem to show a strong effect, because the dependent variable is nearly the same regardless of the value of the independent variable. But it will have an r value very close to 1 if all of the points lie very close to the line. On the other hand, that would seem to indicate that there are probably not any other variables messing with your dependent variable, so that the effect is strong compared to uncontrolled effects. Is that the sense in which Pearson’s r is an effect size?

“I’m a little confused by Pearson’s r being an effect size. A data set that shows a nearly (but not exactly) horizontal line wouldn’t seem to show a strong effect, because the dependent variable is nearly the same regardless of the value of the independent variable.”

Indeed. A horizontal line will in fact have a Pearson r of 0, as one measure varies entirely independently of the other. That also indicates an effect size of 0. Horizontal line scatter plots are tricky, as the consistency of the line suggests that there is some underlying relationship, but fundamentally they represent a lack of relationship — there’s no effect of one on the other. Of course, in real-world data we will almost never see a perfectly straight horizontal line, and rather we’ll see “snow” as two variables with no relationship vary randomly above and below an imagined horizontal line.

A good gloss on r as an effect size can be found here: http://staff.bath.ac.uk/pssiw/stats2/page2/page14/page14.html

“You already know the most common effect-size measure, as the correlation/regression coefficients r and R are actually measures of effect size. Because r covers the whole range of relationship strengths, from no relationship whatsoever (zero) to a perfect relationship (1, or -1), it is telling us exactly how large the relationship really is between the variables we’ve studied — and is independent of how many people were tested.”

FYI, as a medical researcher, I’ve never heard of Cohen’s d except via your blog. A lot of clinical research will use mean differences (e.g. in blood pressure, or weight loss, or whatever). Epidemiology in particular uses risk ratios, risk differences, rate differences and ratios, and odds ratios – all measures to compare the frequency of events between groups. (i.e. if group A’s risk of lung cancer is 1% and group B’s is 3%, the risk difference is 2% and the risk ratio is (3/1)=3.)
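
The arithmetic behind those epidemiological measures is simple. A sketch using the hypothetical 1% and 3% lung-cancer risks from the comment:

```python
# Hypothetical risks from the comment: group A at 1%, group B at 3%.
risk_a = 0.01
risk_b = 0.03

risk_difference = risk_b - risk_a  # absolute gap in risk between groups
risk_ratio = risk_b / risk_a       # how many times group A's risk

# Odds ratio: each group's odds are risk / (1 - risk).
odds_ratio = (risk_b / (1 - risk_b)) / (risk_a / (1 - risk_a))

print(f"risk difference = {risk_difference:.2%}")
print(f"risk ratio = {risk_ratio:.1f}")
print(f"odds ratio = {odds_ratio:.2f}")
```

Note how the ratio and the difference answer different questions: a risk ratio of 3 can describe either a trivial change (0.001% to 0.003%) or a huge one (10% to 30%), which is why the baseline risk matters too.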

We generally only use Rs to report on the fit of models – they’re never used as the primary measure of effect size. R values tell you the extent to which variation in Y is explained by variation in Z, but they don’t tell you the actual magnitude of the change. For example, smoking status might explain 90% of the variation seen in the risk of lung cancer between different groups, but that doesn’t actually tell you what smoking *does* to your risk of lung cancer in practical terms – does it double it? Triple it? Increase it by an order of magnitude? (The latter).

In epidemiology we’d never call an R value a measure of effect size – I’d argue it’s still only a measure of statistical soundness, not practical significance (what we call “clinical significance”). I don’t really care what percentage of variation in lung cancer rates is explained by smoking except if I’m trying to judge whether you’ve left other important explanatory variables out of your statistical model – I care *how much* it moves that risk in real terms. If the increase in risk is trivial, even if it’s entirely attributable to smoking (according to the R value – which only pertains to your model), then it doesn’t matter. In reality, lung cancer is common and smoking significantly increases your risk, which is a big problem. In comparison, cranial CT scans double your risk of brain cancer, but your baseline risk is so trivial it hardly matters. Neither R values nor P values can communicate that information – you need to know the actual risk, and the risk ratio or the risk difference. Those are what I would call measures of effect size – not the R.

D. Weissman — You are right, of course. (1) *Some* distribution must be hypothesized, and (2) my understanding (although I’m many years out of it) is that in the social sciences the normal distribution (z test) is the default. The typical reader/consumer of a p value doesn’t even know this much, and has probably imbued the inverse probability fallacy from years of bad journalism.

/ *imbibed* or absorbed