**[Update:** I’ve been convinced by readers that, while this is a good example of the power of selection bias, it doesn’t really work as an example of regression to the mean.]

I thought that this graph was a good way to talk about regression to the mean. You hear that term thrown around a lot online these days, often in discussions of advanced metrics in sports. (Not always helpfully, but that’s a discussion for another time.) The data here isn’t so important. It’s SAT data from the mid-90s. The data points are states. For this year, the average SAT score was about 965. (This was on the old 1600 point scale.)

As you can see, there is a strong, negative relationship between SAT score and the percentage of eligible students taking the test. Now these points aren’t labeled, but way up high on the Y dimension sits Mississippi and way down low sits Massachusetts. Anyone who’s familiar with American educational data would be surprised at this result; that’s pretty much the opposite of what we’d expect. But Mississippi is way to the left of the X axis and Massachusetts is way to the right. A way, way lower percentage of students from Mississippi took the test than from Massachusetts. When we have a population mean, as we increase the size of samples within that population, we expect to get sample means closer and closer to that population mean. (Not because the sample is proportionally closer to the population!) The states with very low participation in the test have a restricted range. Although we would expect some variation between states, given variables that affect SATs, if we increased participation rates, we would expect these state averages to get closer to the national average. This is regression towards the mean, the tendency of sample means to get closer to the population mean as sample size increases.

You’ll note that these scores are averages rather than individual observations, but the same fundamental phenomenon applies in either case.

Why are the states with low participation clustered above the population mean? Well, think about it: the students most likely to take the test are the students who expect to do well. They’re the students who intend to go to college. They’re also likely to be the kind of students who have parents motivating them to take the test, and motivated parents tend to be associated with many factors that are themselves positively associated with educational outcomes.
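To make that selection effect concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration rather than anything taken from the SAT data: the function name `mean_score_by_participation`, the score spread of 200 points, and the idea that a student’s inclination to sit the test is their score plus random noise. It simply shows that when only the most test-inclined slice of a population takes the test, the average among test-takers sits well above the population average, and drifts down towards it as participation grows.

```python
import random
import statistics

def mean_score_by_participation(rates, n_students=20000, seed=0):
    """Mean score among test-takers when only the most test-inclined fraction opts in."""
    rng = random.Random(seed)
    # Hypothetical population: scores centred on 965 with a made-up SD of 200.
    scores = [rng.gauss(965, 200) for _ in range(n_students)]
    # Assumption: inclination to take the test = score + noise, so stronger
    # students are more likely (but not certain) to opt in.
    ranked = sorted(scores, key=lambda s: s + rng.gauss(0, 200), reverse=True)
    result = {}
    for rate in rates:
        # The test-takers are the top `rate` share of students by inclination.
        takers = ranked[: max(1, int(rate * n_students))]
        result[rate] = statistics.mean(takers)
    return result

means = mean_score_by_participation([0.05, 0.25, 0.50, 0.80, 1.00])
for rate, mean in means.items():
    print(f"{rate:.0%} participation: mean {mean:.0f}")
```

Under these assumptions the downward slope comes entirely from who opts in; every simulated “state” is drawn from the same underlying population.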

This seems a little off to me. As you say, the sample from Massachusetts on the right is representative of the population mean (i.e. all SAT scores nationally), while the sample from Mississippi on the left is not representative of the population mean. However, the Mississippi sample is extreme for a very non-random reason, namely the sampling bias you describe: the smaller set of students taking the SAT in Mississippi are predictably better than average. To my mind, “regression to the mean” refers to a dynamic in which an extreme *but randomly encountered* sample is likely to be followed by a less extreme sample as a direct consequence of the prior sample’s “extremeness”. Daniel Kahneman offers a story illustrating this concept (summarized here: http://neilbendle.com/regression-to-the-mean). Expressing this point another way: If this is a graph that illustrates “regression to the mean”, then what graph would you use to illustrate the distinct (but somewhat similar) concept of “sampling bias”?

Regardless, it’s always good to explore our thinking on these topics because they are useful concepts, but our mental systems are poorly suited to them and, for most of us, it’s an endless struggle.

Hmmmm, here’s how I’d put it: the fact that the states with the lower participation rates have a distribution that is higher than the mean is a product of selection bias. The fact that the trend moves towards the mean as participation rate grows is indicative of the fact that expanding the range will result in regression towards the mean.

I see what you’re saying; the arc of the model is not itself indicative of a regression to the mean. I guess what I’m trying to say is that within an individual state, as the range increases, the state (sample) mean will move towards the national (population) mean because of regression towards the mean. I suppose what I really would have to do would be to show how an individual state’s average changes (almost certainly towards the mean) as sample size is increased. But here, given that there isn’t a *ton* of variability by state, I feel like the relationship between increasing sample size and score moving towards the mean is an effective proxy. Could be wrong!

In other words, it’s not that the trend line shows regression towards the mean. It’s the fact that as you expand the range, the trend will necessarily be towards the mean. Expanding the range could move a sample with a restricted range towards the mean from either direction, but if the sample size is large enough, it would be extremely unlikely for it to move away from the population mean. Yeah?

Regression to the mean typically refers to repeated measurements, in that more extreme scorers appear to score less extremely on follow-up, due to the nature of measurement error. Sampling bias, such as shown in your example, can also contribute to regression to the mean if you test an extreme group more than once.
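A rough sketch of that repeated-measurement version, for contrast with the selection-bias picture. The model and every number in it are assumptions chosen for illustration (the helper name `retest_top_scorers`, an ability SD of 150, a measurement-error SD of 100): each observed score is a stable ability plus independent noise, the top 10% on a first test are picked out, and the same people are scored again.

```python
import random
import statistics

def retest_top_scorers(n=20000, top_fraction=0.10, seed=1):
    """Compare first-test and retest means for the top scorers on the first test."""
    rng = random.Random(seed)
    # Assumed model: observed score = stable ability + independent measurement error.
    ability = [rng.gauss(965, 150) for _ in range(n)]
    first = [a + rng.gauss(0, 100) for a in ability]
    second = [a + rng.gauss(0, 100) for a in ability]
    # Select the extreme group using the first test only.
    cutoff = sorted(first, reverse=True)[int(top_fraction * n) - 1]
    chosen = [i for i in range(n) if first[i] >= cutoff]
    return (
        statistics.mean(first[i] for i in chosen),   # extreme group, test 1
        statistics.mean(second[i] for i in chosen),  # same people, test 2
        statistics.mean(second),                     # everyone, test 2
    )

first_mean, second_mean, overall_mean = retest_top_scorers()
print(f"top scorers, first test: {first_mean:.0f}")
print(f"same people, retest:     {second_mean:.0f}")
print(f"everyone, retest:        {overall_mean:.0f}")
```

In this setup the selected group’s retest average falls back part of the way towards the overall average even though nobody’s ability changed, because part of what made their first scores extreme was noise.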