Statistical significance is an essential but confusing concept. As I’ve discussed in this space before, statistical significance fundamentally concerns the odds that a given quantitative result is simply the product of random chance, of the variability that’s inherent to real-world numbers. (Update: Better — the odds that we’d observe the given result, or a more extreme result, simply through the underlying variability of our data.) Suppose I have a class of 20, ten girls and ten boys. I give the class a math test. I find that the average score on the math test is 17 for girls and 14 for boys. I conclude that this means that girls are better at math than boys. Do you trust this result?
Intuitively, you almost certainly don’t, and you shouldn’t. That’s because we know that life is full of variability — of individual differences that are the product of factors unrelated to the construct we’re investigating (in this case, gender differences in math ability). With only 10 observations in each group, we can imagine all sorts of reasons the data shows this difference that have nothing to do with gender. For example, a couple of the boys might not have had breakfast that morning, maybe one didn’t get enough sleep the night before, and maybe another didn’t take the test that seriously. Or, more generally, maybe my class just happens to have boys who are a little behind and girls who are a little ahead, thanks to random sorting. Statistical significance tests are designed to help us avoid mistakenly perceiving an effect, or a difference, or a relationship, based on the random variability that’s inherent to life.
By far the most common statistic used to represent statistical significance is a p-value. A p-value is calculated by looking at the size of an observed difference between groups (or, alternatively, at the strength of a relationship), at the variability within the groups that you’re examining (that is, whether the averages for our boys and our girls are derived from test scores that are clustered tightly together or spread far apart), and at the number of observations. (These are used along with an assumed ideal distribution that we don’t need to worry about here.) Generally speaking, the bigger the difference between the groups (or the more they vary together when measuring a relationship), the lower the spread of the data (that is, the tighter the data points are clustered), and the more observations we have, the more likely the results are to be statistically significant. This should make intuitive sense: a bigger difference is less likely to be the result of random noise, which, if truly random, would tend to push individual scores in both directions rather than consistently one way; less spread in our data makes us more confident that the result is real, as our individual data points are more alike than different; and the more observations we have, the less likely it is that we’ve simply gotten unlucky and randomly drawn data with a fake difference in it.

A p-value tells us the chance that we’d see a result at least as extreme as ours if only random variation were at work, so the lower, the better. By convention, we compare this p-value to a set value, called an alpha, to determine whether or not we consider the result significant. Alpha is typically determined by field- and journal-specific standards. In human research, such as in education and language, we’ll frequently use a .05 alpha — that is, a result is significant if its p-value falls below .05. In other contexts, we might use an alpha of .01 or .001.
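To make that concrete, here’s a minimal sketch in Python of the boys-vs-girls comparison, run as an independent two-sample t-test with scipy. The individual scores are invented to match the 17-vs-14 averages from the example; they aren’t real data.

```python
from scipy import stats

# Hypothetical scores, chosen so the group means match the example (17 vs. 14)
girls = [20, 14, 18, 17, 19, 13, 16, 18, 15, 20]
boys = [17, 11, 15, 13, 18, 10, 14, 16, 12, 14]

# Independent two-sample t-test: is the difference in means bigger than
# we'd expect from the variability within each group?
t_stat, p_value = stats.ttest_ind(girls, boys)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05  # conventional threshold in education/language research
print("significant at .05" if p_value < alpha else "not significant at .05")
```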
Statistical significance is absolutely essential in a world of randomness and error. But the concept is often misapplied or misunderstood. In particular, it’s all too easy to mistake statistical significance for practical significance. A very large study can find a very low p-value, suggesting that the observed difference is real, even while the effect is so small as to be of negligible real-world value. In other words, a p-value is not a measure of the strength of a difference or effect. If we ran a large-scale study that found an inherent gender difference in math ability at a very low p-value, that would only suggest that there is a real difference — a difference that is unlikely to be the product of random variation. The actual difference could still be so small as to have no meaningful consequences for educators. For this reason, my old stats instructor was a stickler for not saying “very significant” or similar terms, as he felt they made it easier for people to mistake a very low p-value for a very strong result. Instead, a result is either significant at a preselected alpha or it isn’t.
Because a p-value can vary dramatically even among values that are all considered significant, some people assume that a very low p-value guarantees a practically significant (strong) difference or relationship. But remember, the spread of our data and the number of observations we collect have a large impact on our p-value. A large sample size can make a tiny difference statistically significant. Indeed, a classic bit of research trickery is to run an experiment, find that your difference is not significant at the alpha prescribed by the journal you want to publish in, collect a few dozen more observations (expand your sample size), and hey, there’s a significant result! This isn’t technically cheating on the numbers, but it undermines the inherent conservatism of significance testing. You’re getting a significant result not because the underlying effect you’re studying is stronger, but because you’ve increased your statistical power — the ability of your research to avoid missing a real, underlying relationship in your data.
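To see what that looks like in practice, here’s a rough simulation of my own (not from any real study): the underlying difference between groups is held fixed at a trivially small size, and only the sample grows. Larger samples tend to push the p-value below the alpha even though the effect never changes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_diff = 1.0   # a tiny real difference, in test-score points
sd = 10.0         # lots of individual variability

for n in (20, 200, 2000, 20000):
    group_a = rng.normal(100.0, sd, size=n)
    group_b = rng.normal(100.0 + true_diff, sd, size=n)
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"n per group = {n:6d}   p = {p_value:.4f}")

# The effect is identical in every run; only statistical power changes.
```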
What we want, then, is a metric that can be reported alongside p-value to tell us the strength of our effect. This is effect size. Effect size tells us the size of a difference, the strength of a relationship, the impact of an intervention, the power of signal relative to noise, and similar.
The most common way to represent effect size (and the one I understand best) is Cohen’s d. Cohen’s d is a statistic derived by pooling the standard deviations of the groups that you’re comparing — that is, by finding out how variable the data in both groups are. You then divide the difference in means of the two groups by that pooled standard deviation. What does that do for you? It allows you to see where the average of one group would fall in the percentile distribution of the other. So think again about our girls and boys. If our Cohen’s d for gender effects in math is 0, the mean for our girls would sit at exactly the 50th percentile for boys. In other words, the average girl would be just like the average boy. If, on the other hand, the effect size were .5, the average girl would sit at about the 69th percentile of the boys’ distribution, indicating a healthy advantage for girls relative to boys in math. You can find a good conversion chart here. Because it is expressed in standard deviation units, Cohen’s d rarely falls outside of -3 to +3; practically speaking, you are very unlikely to find anything bigger than -2 or +2 in most research contexts, and will very often be dealing with results between 0 and 1 in human research. Cohen defined a small effect size as .2, medium as .5, and large as .8, but I personally don’t think such definitions are much use outside of a specific research context.
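Here’s a minimal sketch of the calculation, reusing the made-up scores from the t-test example above. The percentile conversion at the end assumes roughly normal distributions, which is where figures like “a d of .5 puts the average girl at about the 69th percentile of the boys” come from.

```python
import numpy as np
from scipy import stats

def cohens_d(group_a, group_b):
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled standard deviation: each group's variance weighted by its degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

girls = [20, 14, 18, 17, 19, 13, 16, 18, 15, 20]  # same hypothetical scores as above
boys = [17, 11, 15, 13, 18, 10, 14, 16, 12, 14]

d = cohens_d(girls, boys)
percentile = stats.norm.cdf(d) * 100  # where the average girl falls among the boys
print(f"Cohen's d = {d:.2f}; average girl at roughly the {percentile:.0f}th percentile of boys")
```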
Correlational studies are easier in this respect, because the most common coefficient of correlation, the Pearson r (what most people think of when they think of correlation), is essentially an effect size itself: the Pearson r directly tells us the strength of a relationship. And because correlation software typically reports a p-value alongside r, we can tell whether a correlation is statistically significant (that is, again, how likely it is that the perceived relationship is merely the result of random variation) at the same time that we see its strength. In my own research, where I am often mining thousands of text samples for data, I will often find very weak correlations (say, r = .20) that are nonetheless significant at an alpha of .001, thanks to the size of my sample. This is a good example of the difference between a significant relationship and a meaningful relationship.
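As a quick simulated illustration (this isn’t my actual text data), here’s how a weak correlation of roughly r = .20 can come back with a vanishingly small p-value once the sample runs into the thousands.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 5000                                  # thousands of observations, as in a large text corpus
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)          # build in a weak relationship plus lots of noise

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p_value:.2e}")  # weak correlation, but a tiny p-value
```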
Statistical significance is still very important. I could run a comparison based on very few observations and find a difference that appears very powerful. But, just as in our initial boys-vs-girls math example, it could easily be the case that a difference that seems large actually stems from random differences in our sample. Statistical significance can help assuage our fears that an observed effect, however large, is merely the product of random variation. In the future, researchers should report both effect sizes and p-values together in order to minimize the always-present risk of random variation or bias appearing to be a real difference.
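Here’s one last simulation to illustrate the point: two tiny samples drawn from the exact same population can show a sizable apparent Cohen’s d, and it’s the significance test that warns us not to take that number at face value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Both groups come from the SAME population, so the true effect is zero
a = rng.normal(loc=15, scale=3, size=5)
b = rng.normal(loc=15, scale=3, size=5)

pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # equal group sizes
d = (a.mean() - b.mean()) / pooled_sd
t_stat, p_value = stats.ttest_ind(a, b)

# d may look substantial purely by chance; check p against alpha before trusting it
print(f"apparent d = {d:.2f}, p = {p_value:.3f}")
```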
Part of the beauty of effect size is that, because these measures are standardized, they can be compared to each other from one research study to the next. P-values can’t be meaningfully compared in this way; you can’t say, for example, that a study that was significant to .01 proved a stronger effect than one that was significant to .05. This ease of comparability is particularly important given the deep need for researchers to perform meta-analysis and replication studies. We’re in something of a crisis moment when it comes to the replicability and robustness of our research; too many studies are proving impossible to replicate, casting doubt on a great deal of our research archives. By reporting effect size, we make it much easier to distinguish statistical significance from practical significance, and we make it easier for other researchers to replicate and validate our work.
There’s tons of more detailed and sophisticated information out there from people who know much more about this than I do. You can easily find effect size calculators online, and Cohen’s d is neither particularly onerous to calculate nor hard to understand. Paul D. Ellis has written several books and this very useful FAQ on effect size and why it matters. If you’ve spotted an error in this post (always a possibility) or you have questions, feel free to drop a comment below.