[Update: I’ve been convinced by readers that, while this is a good example of the power of selection bias, it doesn’t really work as an example of regression to the mean.]
I thought that this graph was a good way to talk about regression to the mean. You hear that term thrown around a lot online these days, often in discussions of advanced metrics in sports. (Not always helpfully, but that’s a discussion for another time.) The data here isn’t so important. It’s SAT data from the mid-90s. The data points are states. For this year, the average SAT score was about 965. (This was on the old 1600-point scale.)
As you can see, there is a strong, negative relationship between SAT score and the percentage of eligible students taking the test. Now these points aren’t labeled, but way up high on the Y dimension sits Mississippi and way down low sits Massachusetts. Anyone who’s familiar with American educational data would be surprised at this result; that’s pretty much the opposite of what we’d expect. But Mississippi is way to the left on the X axis and Massachusetts is way to the right. A way, way lower percentage of students from Mississippi took the test than from Massachusetts. When we have a population mean, as we increase the size of samples within that population, we expect to get sample means closer and closer to that population mean. (Not because the sample is proportionally closer to the population!) The states with very low participation in the test have a restricted range. Although we would expect some variation between states, given variables that affect SATs, if we increased participation rates, we would expect these state averages to get closer to the national average. This is regression towards the mean, the tendency of sample means to get closer to the population mean as sample size increases.
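A minimal sketch of the sample-size point, for anyone who wants to see it in numbers. The population mean of 965 is the figure quoted above; the standard deviation of 200, the normal distribution, and the function name `spread_of_sample_means` are assumptions made up for illustration. Note that the samples here are drawn purely at random, which is not how students end up taking the SAT.

```python
import random
import statistics


def spread_of_sample_means(sample_size, n_samples=500,
                           pop_mean=965.0, pop_sd=200.0, seed=0):
    """Standard deviation of the means of many random samples of one size.

    pop_sd=200 and the normal shape are illustrative assumptions,
    not values taken from the SAT data.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        # Draw one random sample and record its mean.
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    # How far the sample means typically sit from one another.
    return statistics.stdev(means)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(spread_of_sample_means(n), 1))
```

The printed spread should shrink as the sample size grows: bigger random samples give means that sit closer to the population mean.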
You’ll note that these scores are averages rather than individual observations, but the same fundamental phenomenon applies in either case.
Why are the states with low participation clustered above the population mean? Well, think about it: in a state where few students take the test, the students most likely to take it are the students who expect to do well. They’re the students who intend to go to college. They’re also likely to be the kind of students who have parents motivating them to take the test, and motivated parents tend to be associated with many factors that are themselves positively associated with educational outcomes.
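The selection effect can be sketched with a toy simulation. Everything here is hypothetical: every "state" gets the same underlying score distribution (mean 965 from above, an assumed standard deviation of 200), and the function `state_average` makes the deliberately extreme assumption that only the strongest students sit the test. Real self-selection is much noisier than that.

```python
import random
import statistics


def state_average(participation, n_students=20000,
                  pop_mean=965.0, pop_sd=200.0, seed=1):
    """Average score of test-takers when only the top fraction participates.

    Assumption (for illustration only): students take the test strictly
    in order of how well they would score, best first.
    """
    rng = random.Random(seed)
    # Scores every eligible student WOULD get, sorted best to worst.
    scores = sorted(
        (rng.gauss(pop_mean, pop_sd) for _ in range(n_students)),
        reverse=True,
    )
    n_takers = max(1, int(participation * n_students))
    # Only the top n_takers actually sit the test.
    return statistics.fmean(scores[:n_takers])


if __name__ == "__main__":
    for rate in (0.05, 0.25, 0.50, 1.00):
        print(f"{rate:.0%} participation: {state_average(rate):.0f}")
```

Under these assumptions the average falls steadily as participation rises, even though the underlying students are identical in every case. That is the same downward slope as in the graph, produced by nothing except who chooses to take the test.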