I can teach you regression discontinuity design in two images

Like so. (Fake charts I made with fake data, btw.)

[Two scatterplots: “No Treatment Effect” (top) and “Significant Treatment Effect” (bottom)]

You already get it, right?

Typically, when we perform some sort of experiment, we want to look at how a particular number responds to the treatment – how blood pressure reacts to a new drug, say, or how students improve on a reading test when they’re given a new kind of lesson. We want to make sure that the observed differences are really the product of the treatment and not some underlying difference between the groups being compared. That’s what randomized controlled trials are for. We randomly assign subjects to treatment and control groups, compare the averages for the two groups, note the size of the effect, and determine whether it is statistically significant.

But sometimes we have real-world conditions that dictate that subjects get sorted into one group or another non-randomly. If we then look at how the different groups perform after some treatment, we know that we’re potentially facing severe selection effects thanks to that non-random assignment. But consider what happens when assignment is based purely on some quantitative metric, with a cutoff score that sorts people into one group or the other. (Suppose, for example, that students become eligible for a gifted student program only if they score above a cut score on some test.) Here we have a non-random assignment that we can actually exploit for research purposes. A regression discontinuity design allows us to explore the impact of such a program because, so long as students aren’t able to influence their assignment beyond their score on that test, we can be confident that students just above and just below the cutoff score are very similar.

Regression analyses are run on all of the data, with subjects below and above the cut score combined but flagged into different groups. Researchers then run statistical models to determine whether there is a difference between the group that received the treatment and the group that didn’t. As you can see in the scatterplots above, a large effect will be readily apparent in how the data look. In this scenario, the X axis represents the score students received on the test, the cut score is 15, and the Y axis represents performance on some later educational metric. In the top scatterplot, there is no meaningful difference from the gifted student program, as the relationship between the two metrics is the same above and below the cut score. But in the bottom graph, there’s a significant jump at the cut score. Note that even after the intervention, the relationship is still linear – students who did better on the initial test do better on the later metric. But there is a clear jump in scores right at the cut score.
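
Here’s a minimal sketch of what that model looks like in practice, using simulated data; the specifics (a cut score of 15, the size of the jump, the use of statsmodels) are my own assumptions for illustration, not anything from a real study.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
score = rng.uniform(0, 30, n)              # running variable: initial test score
treated = (score >= 15).astype(int)        # assignment rule: above the cut score -> program
later = 20 + 1.2 * score + 5.0 * treated + rng.normal(0, 3, n)  # simulated 5-point jump

# Center the running variable at the cutoff so the coefficient on `treated`
# estimates the size of the jump right at the discontinuity.
X = sm.add_constant(np.column_stack([score - 15, treated]))
fit = sm.OLS(later, X).fit()
print(fit.params)      # [intercept, slope, estimated jump at the cut score]
print(fit.pvalues[2])  # is the jump statistically significant?
```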

There are, as you’d probably imagine, a number of potential pitfalls here, and assumption checks and quality controls are essential. Everyone tested has to be sorted into the gifted program solely on the basis of the test score, the cutoff score has to fall somewhere with plenty of students around it rather than far out in the tails, and you need sufficient numbers on either side of the cut score to see the relationship, among other things. But if you have the right conditions, regression discontinuity design is a great way to get something close to the quality of a randomized experiment in situations where you can’t run one for pragmatic or ethical reasons.

restriction of range: what it is and why it matters

Let’s imagine a bit of research that we could easily perform, following standard procedures, and still get a misleading result.

Say I’m an administrator at Harvard, a truly selective institution. I want to verify the College Board’s confidence that the SAT effectively predicts freshman year academic performance. I grab the SAT data, grab freshmen GPAs, and run a simple Pearson correlation to find out the relationship between the two. To my surprise, I find that the correlation is quite low. I resolve to argue to colleagues that we should not be requiring students to submit SAT or similar scores for admissions, as those scores don’t tell us anything worthwhile anyway.

Ah. But what do we know about the SAT scores of Harvard freshmen? We know that they’re very tightly grouped because they are almost universally very high. Indeed, something like a quarter of all incoming freshmen got a perfect score on the (new-now-old) 2400 scale:

Section     Average   25th Percentile   75th Percentile
Math        750       700               800
Reading     750       700               800
Writing     750       710               790
Composite   2250      2110              2390

The reason your correlation is so low (and note that this dynamic applies to typical linear regression procedures as well) is that there simply isn’t enough variation in one of your numbers to get a high metric of relationship. You’ve fallen victim to a restriction of range.

Think about it. When we calculate a correlation, we take pairs of numbers and see how one number changes compared to the other. So if I restrict myself to children and I look at age in months compared to height, I’m going to see consistent changes in the same direction – my observations of height at 6 months will be smaller than my observations at 12 months, and those will in turn be smaller than at 24 months. This correlation will not be perfect, as different children are of different heights and grow at different rates. The overall trend, however, will be clear and strong. But in simple mathematical terms, in order to get a high degree of relationship you have to have a certain range of scores in both numbers – if you only looked at children between 18 and 24 months, you’d necessarily be restricting the size of the relationship. In the above example, if Harvard became so competitive that every incoming freshman had a perfect SAT score, the correlation between SAT scores and GPA (or anything else) would collapse entirely – with no variation at all in SAT scores, there would be no relationship left to measure.
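
You can see the effect in a couple of lines of simulation. This is a toy sketch with made-up numbers, not real admissions data: the full-range correlation is strong, and simply truncating the sample to high scorers shrinks it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
sat = rng.normal(1500, 250, n)                        # hypothetical SAT-ish scores
gpa = 1.0 + 0.0015 * sat + rng.normal(0, 0.4, n)      # GPA loosely driven by SAT

full_r = np.corrcoef(sat, gpa)[0, 1]

admitted = sat > 1900                                 # keep only very high scorers
restricted_r = np.corrcoef(sat[admitted], gpa[admitted])[0, 1]

print(round(full_r, 2), round(restricted_r, 2))       # the restricted correlation is much smaller
```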

Of course, most schools don’t have incoming populations similar to Harvard’s. Their average SAT scores, and the degree of variation in their SAT scores, would likely be different. Big public state schools, for example, tend to have a much wider achievement band of incoming students, who run the gamut from those who are competitive with those Ivy League students to those who are marginally prepared, and perhaps gained admission via special programs designed to expand opportunity. In a school like that, given adequate sample size and an adequate range of SAT scores, the correlation would be much less restricted – and it’s likely, given the consistent evidence that SAT scores are a good predictor of GPA, significantly higher.

Note that we could also expect a similar outcome in the opposite direction. In many graduate school contexts, it’s notoriously hard to get bad grades. (This is not, in my opinion, a problem in the same way that grade inflation is a potential problem for undergraduate programs, given that most grad school-requiring jobs don’t really look at grad school GPA as an important metric.) With so many GPAs clustering in such a narrow upper band, you’d expect raw GRE-GPA correlations to be fairly low – which is precisely what the research finds.

Here’s a really cool graphic demonstration of this in the form of two views on the same scatterplot. (I’m afraid I don’t know where this came from, otherwise I’d give credit.)

This really helps show restricted range in an intuitive way: when you zoom in on a narrow slice of one variable, you just don’t have the perspective to see the broader trend.

What can we do about this? Does this mean that we just can’t look for these relationships when we have a restricted range? No. There are a number of statistical adjustments that we can make to estimate a range-corrected value for metrics of relationship. The most common of these, Thorndike’s case 2, was (like a lot of stats formulas) patiently explained to me by a skilled instructor who guided me to an understanding of how it works, which then escaped my brain in my sleep one night like air slowly getting let out of a balloon. But you can probably intuitively understand how such a correction would work in broad strokes – we have a certain data set restricted on the X variable, the relationship is strong along that restricted range, its spread is s in that range, so let’s use that to guide an estimate of the relationship further along X. As you can probably guess, we can do so with more confidence if we have a stronger relationship and lower spread in the data that we do have. And we have to have a certain degree of range in our real-world data to be able to calculate a responsible adjustment.
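
For the curious, the standard Case 2 correction itself is short enough to write down. This is a sketch of the textbook formula, with invented numbers in the example; for anything that matters, check it against a proper reference (or a statistician).

```python
import math

def thorndike_case2(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate an unrestricted correlation from a range-restricted sample.

    Textbook Case 2 correction for direct restriction on the predictor:
    r_c = r*u / sqrt(1 - r^2 + r^2 * u^2), where u = SD(unrestricted) / SD(restricted).
    """
    u = sd_unrestricted / sd_restricted
    r = r_restricted
    return (r * u) / math.sqrt(1 - r**2 + (r**2) * (u**2))

# E.g., a correlation of .30 observed in a sample where the SAT spread is 60,
# when the spread in the full applicant pool is roughly 200 (invented numbers):
print(round(thorndike_case2(0.30, 200, 60), 2))
```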

There have been several validation studies of Thorndike’s case 2 where researchers had access to both a range-restricted sample (because of some set cut point) and an unrestricted sample and were able to compare the corrected results on the restricted sample to the raw correlations on unrestricted samples. The results have provided strong validating evidence for the correction formula. Here’s a good study.

There are also imputation models that are used to correct for range restriction. Imputation is a process common in regression when we have missing data and want to fill in the blanks, sometimes by making estimates based on the strength of observed relationships and spread, sometimes by using real values pulled from other data points… It gets very complicated and I don’t know much about it. As usual, if you really need to understand this stuff for research purposes – get ye to a statistician!

two concepts about sampling that were tricky for me

Here are a couple of related points about statistics that took me a long time to grasp, and which really improved my intuitive understanding of statistical arguments.

1. It’s not the sample size, it’s the sampling mechanism. 

Well, OK. It’s somewhat the sample size, obviously. My point is that most people who encounter a study’s methodology are much more likely to remark on the sample size – and pronounce it too small – than to remark on the sampling mechanism. I can’t tell you how often I’ve seen studies with an n = 100 that have been dismissed by commenters online as too small to take seriously. Depending on the design of the study, and the variables being evaluated, 100 can be a good-enough sample size. In fact, under certain circumstances (medical testing of rare conditions, say) an n of 30 is sufficient to draw some conclusions about populations.

We can’t say with 100% accuracy what a population’s average for a given trait is when we use inferential statistics. (We actually can’t say that with 100% accuracy even when taking a census, but that’s another discussion.) But we can say with a chosen level of confidence that the average lies in a particular range, which can often be quite small, and from which we can make predictions of remarkable accuracy – provided the sampling mechanism was adequately random. By random, we mean that every member of the population has an equivalent chance of being selected for the sample. If there are factors that make one group more or less likely to be selected for the sample, that is statistical bias (as opposed to statistical error).

It’s important to understand the declining influence of sample size in reducing statistical error as sample size grows. Because calculating confidence intervals and margins of error involves placing n under a square root sign, each additional observation buys you less precision than the one before. Here’s the formula for margin of error:

margin of error = z* · σ / √n

where z* is a critical value that you look up in a table for a given confidence level (often 95% or 99%), σ is the standard deviation, and n is your number of observations. You can see two things clearly here: first, spread (standard deviation) is super important to how confident we can be about the accuracy of an average. (Report spread when reporting an average!) Second, we get declining improvements in accuracy as we increase sample size. That means that after a point, adding hundreds more observations gets you less than adding 10 did when n was small. Given the resources involved in data collection, this can make expanding sample size a low-value proposition.
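
To make the diminishing returns concrete, here’s a quick sketch that plugs a few sample sizes into that formula (the standard deviation of 15 is arbitrary):

```python
import math

def margin_of_error(sd, n, z=1.96):        # z = 1.96 for ~95% confidence
    return z * sd / math.sqrt(n)

sd = 15
for n in (30, 100, 500, 1000, 5000, 10000):
    print(n, round(margin_of_error(sd, n), 2))
# Going from 30 to 100 observations tightens the margin far more than
# going from 5,000 to 10,000 does.
```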

Now compare a rigorously controlled study with an n = 30 drawn with a random sampling mechanism to, say, those surveys that ESPN.com used to run all the time. Those routinely drew sample sizes in the hundreds of thousands. But the sampling mechanism is a nightmare: they’re voluntary response instruments, biased in any number of ways – underrepresenting people without internet access, people who aren’t interested in sports, people who go to SI.com instead of ESPN.com, on and on. The value of the 30-person instrument is far higher than that of the ESPN.com data. The sampling mechanism makes the sample size irrelevant.

Sample size does matter, but in common discussions of statistics its importance is misunderstood, and the value of increasing sample size declines as n grows.

2. For any reasonable definition of a sample, population size relative to sample size is irrelevant for the statistical precision of findings.

A 1,000 person sample, if drawn with some sort of rigorous random sampling mechanism, is exactly as descriptive and predictive when drawn randomly from the ~570,000 person population of Wyoming as it is when drawn randomly from the ~315 million person population of the United States. (If intended as samples of people in Wyoming and people in the United States respectively, of course.)

I have found this one very hard to wrap my mind around, but it’s the case. The formulas for margin of error, confidence intervals, and the like make no reference to the size of the total population. You can think about it this way: each time you pull another observation at random from some population, the odds of your sample being unlike the population go down, regardless of the size of that population. The mistake lies in thinking that the point of increasing sample size is to make the sample closer in proportion to the population. In reality, the point is just to increase the number of draws in order to reduce the possibility that the previous draws produced statistically unlikely results. Even if you had an infinite population, every additional draw would decrease the chance that you’re randomly pulling an unrepresentative sample.
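
One way to see this is with the textbook finite population correction, which is the only place population size even enters the margin-of-error math – and which barely moves the answer when the sample is a small fraction of the population. (The correction formula is standard; the numbers below just reuse the Wyoming/US example from above.)

```python
import math

def moe(sd, n, pop_size=None, z=1.96):
    base = z * sd / math.sqrt(n)
    if pop_size is None:
        return base
    # finite population correction: only matters when n is a large share of the population
    return base * math.sqrt((pop_size - n) / (pop_size - 1))

sd, n = 15, 1000
print(round(moe(sd, n, 570_000), 3))       # sampling Wyoming
print(round(moe(sd, n, 315_000_000), 3))   # sampling the United States
print(round(moe(sd, n), 3))                # ignoring population size entirely
```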

The essential caveat lies in “for any reasonable definition of a sample.” Yes, testing 900 out of a population of 1,000 is more accurate than testing 900 out of a population of 1,000,000. But nobody would ever call 90% of a population a sample. You see different thresholds for where a sample begins and ends; some people say that anything larger than 1/100th of the total population is no longer a sample, but it varies. The point holds: when we’re dealing with real-world samples, where the population we care about is vastly larger than any reasonable sample size, the population size is irrelevant to the calculation of error in our statistical inferences. This one is quite counterintuitive and took me a long time to really grasp.

Reporting Regression Results Responsibly

We’re in a Golden Age for access to data, which unfortunately also means we’re in a Golden Age for the potential to misinterpret data. Though the absurdity of gated academic journals persists, academic research is more accessible now than ever before. We’ve also seen a rapid growth in the use of arguments based on statistics in the popular media in the last several years. This is potentially a real boon to our ability to understand the world around us, but it carries with it all of the potential for misleading statistical arguments.

My request is pretty simple. All statistical techniques, particularly the basic parametric statistical techniques that are most likely to show up in data journalism, require the satisfaction of assumptions and the checking of diagnostic measures to ensure that hidden bias isn’t misleading us. Many of these assumptions and diagnostics are ultimately judgment calls, relying on practitioners to make informed decisions about what degree of wiggle room is appropriate given the research scenario. There are, however, conventions and implied standards that people can use to guide their decisions. The most important and useful kind of check, though, is the eyes of other researchers. Given that hosting graphs, tables, and similar kinds of data online is simple and nearly free, I think that researchers and data journalists alike should provide links to their data and to the graphs and tables they use to check assumptions and diagnostic measures. In the digital era, it’s crazy that this is still a rare practice. I don’t expect to find these graphs and tables sitting square in the center of a blog post, and I expect that 90% of readers wouldn’t bother to look. But there’s nothing to risk in having them available, and transparency, accountability, and collaboration to gain.

That’s the simple part, and you can feel free to close the tab now. For a little more:

What kind of assumptions and diagnostics am I talking about? Let’s consider the case of one of the most common parametric methods, linear regression. Whether we have a single predictor for simple linear regression or multiple predictors for multiple linear regression, regression is fundamentally a matter of assessing the relationship between quantitative (continuous) predictor variables and a quantitative (continuous) outcome variable. For example, we might ask how well SAT scores predict college GPA; we might ask how well age, weight, and height predict blood pressure. When someone talks about how one number predicts another, the strength of their relationship, and how we might attempt to change one by changing the other, they’re probably making an appeal to regression.

The types of regression analysis, and the issues therein, are vast, and there are many technical issues at play that I’ll never understand. But I think it’s worthwhile to talk about some of the assumptions we need to check and some problems we have to look out for. Regression has come in for a fair amount of abuse lately from sticklers and skeptics, and not for no reason; it’s easy to use the techniques irresponsibly. But we’re inevitably going to ask basic questions of how X and Y predict Z, so I think we should expand public literacy about these things. I want to talk a little bit about these issues not because I think I’m qualified to teach statistics to others, or because regression is the only statistical process that we need to see assumptions and diagnostics for. Rather, I think regression is an illustrative example through which to explore why we need to check this stuff, to talk about both the power and pitfalls of public engagement with data.

There are four assumptions that need to be true to run a linear (least squares) regression: independence of observations, linearity, constancy of variance, and normality. (Some purists add a fifth, existence, which, whatever.)

Independence of Observations

This is the biggie, and it’s why doing good research can be so hard and expensive. It’s the necessary assumption that one observation does not affect another. This is the assumption that requires randomness. Remember that in statistics error, or necessary and expected variation, is inevitable, but bias, or the systematic influence on observations, is lethal.

Suppose you want to see how eating ice cream affects blood sugar level. You gather 100 students into the gym and have them all eat ice cream. You then go one by one through the students and give them a blood test. You dutifully record everyone’s values. When you get back to the lab, you find that your data does not match that of much of the established research literature. Confused, you check your data again. You use your spreadsheet software to arrange the cells by blood sugar. You find a remarkably steady progression of results running higher to lower. Then it hits you: it took you several hours to test the 100 students. The highest readings are all from the students who were first to be tested, the lowest from those who were tested last. Your data was corrupted by an uncontrolled variable, time-after-eating-to-test. Your observations were not truly independent of each other – one observation influenced another because taking one delayed taking the other. This is an example that you’d hope most people would avoid, but the history of research is the history of people making oversights that were, in hindsight, quite obvious.

Independence is scary because threats to it so often lurk out of sight. And the presumption of independence often prohibits certain kinds of analysis that we might find natural. For example, think of assigning control and test conditions to classes rather than individual students in educational research. This is often the only practical way to do it; you can’t fairly ask teachers to teach half their students with one technique and the other half with another. You give one set of randomly-assigned classes a new pedagogical technique, while using the old standard with your control classes. You give a pre- and post-test to both and pop both sets of results in an ANOVA. You’ve just violated the assumption of independence. We know that there are clustering effects of children within classrooms; that is, their results are not entirely independent of each other. We can correct for this sort of thing using techniques like hierarchical modeling, but first we have to recognize that those dangers exist!
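
As a rough illustration of what “correcting for clustering” can look like, here’s a sketch of a random-intercept (hierarchical) model on simulated classroom data. The setup – 20 classrooms, whole classes assigned to treatment, statsmodels’ mixed-model routine – is my own invented example, not anything from a real study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
classroom = np.repeat(np.arange(20), 25)             # 20 classrooms of 25 students each
treated = (classroom % 2 == 0).astype(int)           # whole classrooms assigned to treatment
class_effect = rng.normal(0, 3, 20)[classroom]       # shared classroom effect = clustering
score = 70 + 2 * treated + class_effect + rng.normal(0, 5, len(classroom))

df = pd.DataFrame({"score": score, "treated": treated, "classroom": classroom})

# A random intercept per classroom acknowledges that students in the same
# class are not independent observations.
model = smf.mixedlm("score ~ treated", df, groups=df["classroom"]).fit()
print(model.summary())
```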

Independence is the assumption that is least subject to statistical correction. It’s also the assumption that is the hardest to check just by looking at graphs. Confidence in independence stems mostly from rigorous and careful experimental design. You can check a graph of your observations (your actual data points) against your residuals (the distance between your observed values and the linear progression from your model), which can sometimes provide clues. But ultimately, you’ve just got to know your data was collected appropriately. On this one, we’re largely on our own. However, I think it’s a good idea for academic researchers to provide online access to a Residuals vs. Observations graph when they run a regression. This is very rare, currently.

Here’s a Residuals vs. Observations graph I pulled off of Google Images. This is what we want to see: snow. Clear nonrandom patterns in this plot are bad.
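
If you want to produce that kind of plot for your own model, a few lines will do it. This sketch uses residuals against fitted values, which is one common variant of the check described above; the data are simulated.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 3 + 2 * x + rng.normal(0, 1, 200)

fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted: we want 'snow', no visible pattern")
plt.show()
```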

Linearity

The name of the technique is linear regression, which means that observed relationships should be roughly linear to be valid. In other words, you want your relationship to fall along a more or less linear path as you move across the x axis; the relationship can be weaker or it can be stronger, but you want it to be roughly as strong all the way along the line. This matters in part because curvilinear relationships can appear to regression analysis to be no relationship at all. Regression is all about interpolation: if I check my data and find a strong linear relationship, and my data has a range from A to B, I should be able to check any x value between A and B and have a pretty good prediction for y. (What “pretty good” means in practice is a matter of residuals and r-squared, or the portion of the variance in y that’s explained by my xs.) If my relationship isn’t linear, my confidence in that prediction is unfounded.

Take a look at these scatter plots. Both show close to zero linear relationship according to Pearson’s product-moment coefficient:

And yet clearly, there’s something very different going on from one plot to the next. The first is true random variance; there is no consistent relationship between our x and y variables. The second shows a very clear association; it’s just not a linear one. The degree and direction in which y varies along x changes over different values of x. Failure to recognize that non-linear relationship could lead us to think that there is no relationship at all. If the violation of linearity is as clear and consistent as in this scatter plot, it can be cleaned up fairly easily by transforming the data.
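
Here’s the same point as a two-minute simulation – a relationship that is obviously strong but curved can produce a Pearson correlation near zero, and a simple transformation recovers it. (The parabola and the specific numbers are my own toy example.)

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, 500)
noise = rng.normal(0, 0.5, 500)

y_random = noise                 # no relationship at all
y_curved = x**2 + noise          # strong, but curvilinear, relationship

print(round(pearsonr(x, y_random)[0], 2))    # near zero
print(round(pearsonr(x, y_curved)[0], 2))    # also near zero, despite the obvious pattern
print(round(pearsonr(x**2, y_curved)[0], 2)) # transforming x reveals the strong relationship
```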

Regression is fairly robust to violations of linearity, and it’s worth noting that any relationship meaningfully weaker than a perfect correlation will be non-linear in the strict sense. But clear, consistent curves in the data can invalidate our regression analyses.

Readers could check data for linearity if scatter plots were posted for simple linear regression. For multiple regression, it’s a bit messier; you could plot every individual predictor, but I would be satisfied if you just mentioned that you checked linearity.

Constancy of variance

Also known by one of my very favorite ten-cent words, homoscedasticity. Constancy of variance means that, along your range of x predictors, your y varies about as much; it has as much spread, as much error. Remember, when I’m doing inferential statistics, I’m sampling, and sampling means sampling error – even if I’m getting quality results, I’m inevitably going to get differences in my data from one collection of samples to the next. But if our assumptions are true, we can trust that those samples will vary in predictable intervals relative to the true mean. That is, if an SAT score predicts freshman year GPA with a certain degree of consistency for students scoring 400, it should be about as consistent for students scoring 800, 1200, and 1600, even though we know that from one data set to the next, we’re not going to get the exact same values even if we assume that all of the variables of interest are the same. We just need to know that the degree to which they vary for a given x is constant over our range.

Why is this important? Think again about interpolation. I run a regression because I want to understand a relationship between various quantitative variables, and often because I want to use my predictor variables to… predict. Regression is useful insofar as I can move along the axes of my x values and produce a meaningful, subject-to-error-but-still-useful value for y. Violating the assumption of constant variance means that you can’t predict y with equal confidence as you move around x(s); the relationship is stronger at some points than others, making you vulnerable to inaccurate predictions.

Here’s a residuals plot showing the dreaded megaphone effect: the error (size of residuals, difference between observations and results expected from the regression equation) increases as we move from low to high values of x. The relationship is strong at low values of x and much weaker at high values.

Readers could check homoscedasticity if they had access to residual plots. Violations of constant variance can often be fixed via transformation, although it may be easier to use techniques that are more inherently robust to this violation, such as quantile regression.
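
There are also formal checks. Here’s a sketch that manufactures a “megaphone” on purpose and runs the Breusch–Pagan test from statsmodels on the residuals; the test is my suggestion, not something named above, and the data are simulated.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, 300)
y = 2 + 3 * x + rng.normal(0, x, 300)     # error grows with x: the megaphone

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)    # a small p-value flags non-constant variance
```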

Normality

The concept of the normal distribution is at once simple and counterintuitive, and I’ve spent a lot of my walks home trying to think of the best way to explain it. The “parametric” in parametric statistics refers to the assumption that there is a given underlying distribution for most observable data, and frequently this distribution is the normal distribution or bell curve. Think of yourself walking down the street and noticing that someone is unusually tall or unusually short. The fact that you notice is in and of itself a consequence of the normal distribution. When we think of someone as unusually tall or short, we are implicitly assuming that we will find fewer and fewer people as we move further along the extremes of the height distribution. If you see a man in North America who is 5’10, he is above average height, but you wouldn’t bat an eye; if you see a man who is 6’3, you might think to yourself, that’s a tall guy; when you see someone who is 6’9, you say, wow, he is tall!, and when you see a 7 footer, you take out your cell phone. This is the central meaning of the normal distribution: the average is more likely to occur than the extremes, and the relationship between position on the distribution and probability of occurrence is predictable.

Not everything in life is normally distributed. Poll 1,000 people and ask how much money they received in car insurance payments last year and it won’t look normal. But a remarkable number of naturally occurring phenomena are normally distributed, simply thanks to the reality of numbers and extremes, and the central limit theorem teaches us that the averages of repeated samples are themselves normally distributed. (That is, if I take a 100-person sample of a population for a given quantitative trait, I will get a mean; if I take another 100-person sample, I will get a similar but not identical mean, and so on. If I plot those means, they will be normally distributed even if the overall distribution is not.)
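
That parenthetical is easy to verify for yourself. The sketch below uses a deliberately skewed, decidedly non-normal population (an exponential distribution, my choice) and shows that the means of repeated 100-person samples still pile up into a bell curve:

```python
import numpy as np

rng = np.random.default_rng(7)
# A decidedly non-normal population: most values smallish, a few huge
population = rng.exponential(scale=1000, size=1_000_000)

sample_means = [rng.choice(population, 100).mean() for _ in range(5000)]

print(round(np.mean(sample_means), 1), round(np.std(sample_means), 1))
# Histogram sample_means and you'll see a bell curve centered near 1000,
# even though the population itself looks nothing like one.
```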

The assumption of normality in regression requires our data – more precisely, the model’s residuals – to be roughly normally distributed; in order to assess the relationship of y as it moves across x, we need to know the relative frequency of extreme observations compared to observations close to the mean. It’s a fairly robust assumption, and you’re never going to have perfectly normal data, but too strong a violation will invalidate your analysis. We check normality with what’s called a Q-Q plot. Here’s an almost-perfect one, again scraped from Google Images:

That strongly linear, nearly 45 degree angle is just what we want to see. Here’s a bad one, demonstrating the “fat tails” phenomenon – that is, too many observations clustered at the extremes relative to the mean:

Usually the rule is that unless you’ve got a really clear break from a straightish 45 degree angle, you’re probably alright. When the going gets tough, seek help from a statistician.
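
If you want to generate the two kinds of plot yourself rather than scrape Google Images, here’s a sketch: normal residuals hug the line, while a fat-tailed t distribution (my stand-in for problem data) peels away at the ends.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
normal_resid = rng.normal(0, 1, 500)
fat_tailed_resid = rng.standard_t(df=2, size=500)    # t with 2 df: heavy tails

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
stats.probplot(normal_resid, dist="norm", plot=axes[0])
axes[0].set_title("Roughly normal: points hug the 45-degree line")
stats.probplot(fat_tailed_resid, dist="norm", plot=axes[1])
axes[1].set_title("Fat tails: the ends peel away from the line")
plt.tight_layout()
plt.show()
```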

Diagnostics

OK, so 2,000 words into this thing, we’ve checked our four assumptions. Are we good? Well, not so fast. We need to check a few diagnostic measures, or what my stats instructor used to call “the laundry list.” This is a matter of investigating influence. When we run an analysis like regression, we’re banking on the aggregate power of all of our observations to help us make responsible inferences. We never want to rely too heavily on individual observations or small numbers of them, because that increases the influence of error in our analysis. Diagnostic measures in regression typically involve using statistical procedures to look for influential observations that have too much sway over our analysis.

The first thing to say about outliers is that you want a systematic reason for eliminating them. There are entire books about the identification and elimination of outliers, and I’m not qualified to say what the best method is in any given situation. But you never want to toss an observation simply because it would help your analysis. When you’ve got that one data point that’s dragging your line out of significance, it’s tempting to get rid of it, but you want to analyze that observation for a methodology-internal justification for eliminating it. On the other hand, sometimes you have the opposite situation: your purported effect is really the product of a single or small number of influential outliers that have dragged the line in your favor (that is, to a p-value you like). Then, of course, the temptation is simply to not mention the outlier and publish anyway. Especially if a tenure review is in your future…

Some examples of influential observation diagnostics in regression include leverage, which identifies outliers in your predictors that have a great deal of influence on your overall model; Cook’s Distance, which tells you how different your model would be if you deleted a given observation; DFBETAS, which tell you how a given observation influences a particular parameter estimate; and more. Most modern statistical packages like SAS or R have commands for checking diagnostic measures like these. While offering numbers would be nice, I would mostly like it if researchers reassured readers that they had run diagnostic measures for regression and found acceptable results. Just let me know: I looked for outliers and influential observations and things came back fairly clean.
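
For what it’s worth, these diagnostics are a few lines in most packages. Here’s a sketch using statsmodels’ influence tools on simulated data with one deliberately planted wild point; the setup is mine, not anything from the text.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 100)
y = 1 + 0.5 * x + rng.normal(0, 1, 100)
x[0], y[0] = 30, 40                      # plant one wildly influential observation

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()

cooks_d = influence.cooks_distance[0]    # Cook's distance per observation
leverage = influence.hat_matrix_diag     # leverage per observation
dfbetas = influence.dfbetas              # influence on each coefficient

print(int(cooks_d.argmax()), round(cooks_d.max(), 2))   # observation 0 dominates the fit
```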

*****

Regression is just one part of a large number of techniques and applications that are happening in data journalism right now. But essentially any statistical technique is going to involve checking assumptions and diagnostic measures. A typical ANOVA, for example, the categorical equivalent of regression, will involve checking some of the same assumptions. In the era of the internet, there is no reason not to provide a link to a brief, simple rundown of what quality controls were pursued in your analysis.

None of these things are foolproof. Sums of squares are spooky things; we get weird results as we add and remove predictors from our models. Individual predictors are strongly significant by themselves but not when added together; models are significant with no individual predictors significant; individual predictors are highly significant without model significance; the order you put your predictors in changes everything; and so on. It’s fascinating and complicated. We’re always at the mercy of how responsible and careful researchers are. But by sharing information, we raise the odds that what we’re looking at is a real effect.

This might all sound like an impossibly high bar to clear. There are so many ways things can go wrong. And it’s true that, in general, I worry that people today are too credulous towards statistical arguments, which are often advanced without sufficient qualifications. There are some questions where statistics more often mislead than illuminate. But there is a lot we can and do know. We know that age is highly predictive of height in children but not in adults; we know that there is a relationship between SAT scores and freshman year GPA; we know point differential is a better predictor of future win-loss record than past win-loss record. We can learn lots of things, but we always do it better together. So I think that academic researchers and data journalists should share their work to a greater degree than they do now. That requires a certain compromise. After all, it’s scary to have tons of strangers looking over your shoulder. So I propose that we get more skeptical and critical of statistical arguments as a media and a readership, but more forgiving of individual researchers who are, after all, only human. That strikes me as a good bargain.

And one I’m willing to make myself, so please email me to point out the mistakes I’ve inevitably made in this post.

correlation: neither everything nor nothing

via Overthinking

One thing that everyone on the internet knows, about statistics, is this: correlation does not imply causation. It’s a stock phrase, a bauble constantly polished and passed off in internet debate. And it’s not wrong, at least not on its face. But I worry that the denial of the importance of correlation is a bigger impediment to human knowledge and understanding than belief in specious relationships between correlation and causation.

First, you should read two pieces on the “correlation does not imply causation” phenomenon, which has gone from a somewhat arcane notion common to research methods classes to a full-fledged meme. This piece by Greg Laden is absolute required reading on correlation and causation and how to think about both. Second, this piece by Daniel Engber does good work talking about how “correlation does not imply causation” became an overused and unhelpful piece of internet lingo.

As Laden points out, the question is really this: what does “imply” mean? The people who employ “correlation does not imply causation” as a kind of argumentative trump card are typically using “imply” in a way that nobody actually means, which is as synonymous with “prove.” That’s pretty far from what we usually mean by “implies”! In fact, using the typical meaning of implication, correlation sometimes implies causation, in the sense that it provides evidence for a causal relationship. In careful, rigorously conducted research, a strong correlation can offer some evidence of causation, if that correlation is embedded in a theoretical argument for how that causative relationship works. If nothing else, correlation is often the first stage in identifying relationships of interest that we might then investigate in more rigorous ways, if we can.

A few things I’d like people to think about.

There are specific reasons that an assertion of causation from correlation data might be incorrect. There is a vast literature of research methodology, across just about every research field you can imagine. Correlation-causation fallacies have been investigated and understood for a long time. Among the potential dangers is the confounding variable, where an unknown variable is driving the change in two other variables, making them appear to influence one another. This gives us the famous drownings-and-ice-cream correlation – as drownings go up, so do ice cream sales. The confounding variable, of course, is temperature. There are all sorts of nasty little interpretation problems in the literature. These dangers are real. But in order to have understanding, we have to actually investigate why a particular relationship is spurious. Just saying “correlation does not imply causation” doesn’t do anything to actually improve our understanding. Explore why, if you want to be useful. Use the phrase as the beginning of a conversation, not a talisman.
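
If you want to see the drownings-and-ice-cream dynamic in miniature, here’s a simulation (entirely made-up numbers) where temperature drives both series. The raw correlation looks impressive; once you control for temperature by correlating the residuals, it collapses.

```python
import numpy as np

rng = np.random.default_rng(10)
temperature = rng.normal(70, 15, 365)                        # the lurking variable
ice_cream_sales = 100 + 5 * temperature + rng.normal(0, 40, 365)
drownings = 1 + 0.05 * temperature + rng.normal(0, 0.8, 365)

# Ice cream sales and drownings look strongly related...
print(round(np.corrcoef(ice_cream_sales, drownings)[0, 1], 2))

# ...but after removing what temperature explains from each, almost nothing is left.
resid_ice = ice_cream_sales - np.poly1d(np.polyfit(temperature, ice_cream_sales, 1))(temperature)
resid_drown = drownings - np.poly1d(np.polyfit(temperature, drownings, 1))(temperature)
print(round(np.corrcoef(resid_ice, resid_drown)[0, 1], 2))
```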

Correlation evidence can be essential when it is difficult or impossible to investigate a causative mechanism. Cigarette smoking causes cancer. We know that. We know it because many, many rigorous and careful studies have established that connection. It might surprise you to know that the large majority of our evidence demonstrating that relationship comes from correlational studies, rather than experiments. Why? Well, as my statistics instructor used to say – here, let’s prove cigarette smoking causes cancer. We’ll round up some infants, and we’ll divide them into experimental and control groups, and we’ll expose the experimental group to tobacco smoke, and in a few years, we’ll have proven a causal relationship. Sound like a good idea to you? Me neither. We knew that cigarettes were contributing to lung cancer long before we identified what was actually happening in the human body, and we have correlational studies to thank for that. Blinded randomized controlled experimental studies are the gold standard, but they are rare precisely because they are hard, sometimes impossible. To refuse to take anything else as meaningful evidence is nihilism, not skepticism.

Sometimes what we care about is association. Consider relationships which we believe to be strong but in which we are unlikely to ever identify a specific causal mechanism. I have on my desk a raft of research showing a strong correlation between parental income and student performance on various educational metrics. It’s a relationship we find in a variety of locations, across a variety of ages, and through a variety of different research contexts. This is important research, it has stakes; it helps us to understand the power of structural advantage and contributes to political critique of our supposedly meritocratic social systems.

Suppose I were prohibited from asserting that this correlation proved anything because I couldn’t prove causation. My question is this: how could I find a specific causal mechanism? The relationship is likely very complex, and in some cases not subject to external observation by researchers at all. To refuse to consider this relationship in our knowledge making or our policy decisions because of an overly skeptical attitude towards correlational data would be profoundly misguided. Of course there are limitations and restrictions we need to keep in mind – the relationship is consistent but not universal, its effect differs across different parts of the income scale, it varies with a variety of factors. It’s not a complete or simple story. But I’m still perfectly willing to say that poverty is associated with poor educational performance. That’s the only reasonable conclusion from the data. That association matters, even if we can’t find a specific causal mechanism.

Correlation is a statistical relationship. Causation is a judgment call. I frequently find that people seem to believe that there is some sort of mathematical proof of causation that a high correlation does not merit, some number that can be spit out by statistical packages that says “here’s causation.” But causation is always a matter of the informed judgment of the research community. Controlled experiments are the gold standard in that regard, but there are controlled experiments that can’t prove causation and other research methods that have established causation to the satisfaction of most members of a discipline.

Human beings have the benefit of human reasoning. One of my frustrations with the “correlation does not imply causation” line is that it’s often deployed in instances where no one is asserting that we’ve adequately proved causation. I sometimes feel as though people are trying to protect us from mistakes of reasoning that no one would actually fall victim to. In an (overall excellent) piece for the Times, Gary Marcus and Ernest Davis write, “A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two.” That’s true – it is hard to imagine! So hard to imagine that I don’t think anyone would have that problem. I get the point that it’s a deliberately exaggerated example, and I also fully recognize that there are some correlation-causation assumptions that are tempting but wrong. But I think that, when people state the dangers of drawing specious relationships, they sometimes act as if we’re all dummies. No one will look at these correlations and think they’re describing real causal relationships because no one is that senseless. So why are we so afraid of that potential bad reasoning?

Those disagreeing with conclusions drawn from correlational data have a burden of proof too. This is the thing, for me, more than anything. It’s fine to dispute a suggestion of causation drawn from correlation data. Just recognize that you have to actually make the case. Different people can have responsible, reasonable disagreements about statistical inferences. Both sides have to present evidence and make a rational argument drawn from theory. “Correlation does not imply causation” is the beginning of discussion, not the end.

I consider myself on the skeptical side when it comes to Big Data, at least in certain applications. As someone who is frequently frustrated by hype and woowoo, I’m firmly in the camp that says we need skepticism ingrained in how we think and write about statistical inquiry. I personally do think that many of the claims about Big Data applications are overblown, and I also think that the notion that we’ll ever be post-theory or purely empirical is dangerously misguided. But there’s no need to throw the baby out with the bathwater. While we should maintain a healthy criticism of them, new ventures dedicated to researched, data-driven writing should be greeted as a welcome development. What we need, I think, is to contribute to a communal understanding of research methods and statistics, including healthy skepticism, and there’s reason for optimism in that regard. Reasonable skepticism, not unthinking rejection; a critical utilization, not a thoughtless embrace.


 

norm referencing, criterion referencing, and ed policy

I want to talk a bit about a distinction between different types of educational testing/assessment and how they interface with some basic questions we have about education policy. The two concepts are norm referencing and criterion referencing.

Criterion Referencing

Why do we perform tests? What is their purpose? One common reason is to ensure that people are able to perform some sort of essential task. Take a driver’s test. The point is to make sure that people who are on the roads possess certain minimal abilities to safely pilot a car, based on social standards of competence that are written into policy. While we might understand that some people are better or worse drivers than others, we’re not really interested in using a driver’s test to say who is adequate, who is good, and who is excellent. Rather we just want to know: do you meet this minimal threshold? The name for that kind of test is a criterion referenced test. We have some criterion (or criteria) and we check to see if the people taking the test fulfill it. Sometimes we want these tests to be fairly generous; society couldn’t function, in many locales, if a majority of competent adults couldn’t pass a driver’s test. On the other hand, we probably want the benchmarks for, say, running a nuclear reactor to be fairly strict. The social costs here would be higher for a test that was too lenient than for one that was too strict. In either case, though, our interest is not in discriminating between different individuals to a fine level of gradation, particularly for those who are clearly good enough or clearly not. Rather we just want to know: is the test taker competent to perform the real-world task?

Norm Referencing

Criterion referencing depends on, well, the existence of a criterion. That is, there has to be some sort of benchmark or goal that the test taker will either reach or not. What would it mean to have a criterion referenced test for, say, college readiness? We can certainly imagine a set benchmark for being prepared for college, and we’d probably like to think that there’s some minimal level of preparation that’s required for any college-bound student. But in a broader sense we are probably aware that there is no one set criterion that would work given the large range of schools and students that “college readiness” reflects. What’s more, we also know that colleges are profoundly interested in relative readiness; elite colleges spend vast amounts of money attracting the most highly-qualified students to campus.

For that we need to use tests like the SAT and ACT, which are oriented not around fulfilling a given criterion but around placing test takers on a scale and discriminating between different students in exacting detail. We need, that is, norm referenced tests. When we say “norm” here we mean in comparison to others, to an average and to percentiles, and in particular to the normal distribution. I don’t want to get too into the weeds on that big topic, for those who aren’t already versed in it. Suffice to say for now that the normal distribution is a very common distribution of observed values for all sorts of naturally-occurring phenomena that have a finite range (that is, a beginning and an ending) and which are affected by multiple variables. The ideal normal distribution looks like this:

The big center line in there is the mean, median, and mode – that is, the arithmetic average, the line that divides one half of the data from the other, and the observation that occurs most often. As you can see, in a true normal distribution the observations fall in very particular patterns relative to the average. In particular, the further away we go from the average, the less likely we are to find observations, again in predictable quantitative relationships. When we talk about something changing relative to a standard deviation, using that statistic (actually a measure of spread) as a measure of distance or extremity, we’re doing so in relation to the normal distribution. Tests like the SAT and ACT, the GRE, IQ tests, and all manner of other tests used for the purpose of screening applicants for some finite number of slots use norm referencing.

Why? Again, think about what we’re trying to accomplish with norm referencing. I want to give a test that lets me say that test taker X is better than test taker Y but worse than test taker Z. But we also want to be able to say where each of them falls relative to the mass of data. Norm referencing allows us to make meaningful statements about how someone will perform relative to others, and this in turn gives us information about how to, for example, select people for our scarce admissions slots at an exclusive college.
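
To make “relative to others” concrete: once a test is scaled to a roughly normal distribution, converting a score to a percentile is a one-liner. The mean of 500 and standard deviation of 100 below are invented, SAT-section-ish numbers, not real testing-company parameters.

```python
from scipy.stats import norm

mean, sd = 500, 100    # hypothetical scale parameters

def percentile(score):
    """Share of test takers scoring at or below `score`, assuming a normal scale."""
    return norm.cdf((score - mean) / sd)

for s in (400, 500, 600, 700, 800):
    print(s, f"{percentile(s):.1%}")
```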

As Glenn Fulcher puts it in his (excellent) book Practical Language Testing:

As we move away from the mean the scores in the distribution become more extreme, and so less common. It is very rare for test takers to get all items correct, just as it is very rare for them to get all items incorrect. But there are a few in every large group who do exceptionally well, or exceptionally poorly. The curve of normal distribution tells us what the probability is that a test taker could have got the score they have, given the place of the score in a particular distribution. And this is why we can say that a score is ‘exceptional’ or ‘in the top 10 per cent’, or ‘just a little better than average’.

(Incidentally, it’s a very common feature of various types of educational, intelligence, and language testing that scores become less meaningful as they move towards the extremes. That is, a 10 point difference on a well-validated IQ test means a lot when it comes to the difference between a 95 and a 105, but it means much less when it comes to a difference between 25 and 35 or 165 and 175. Why? In part because outliers are rare, by their nature, which means we have less data to validate that range of our scale. Also, practically speaking, there are floors and ceilings. Someone who gets a 20 on the TOEFL iBT and someone who gets a 30 share the most important thing: they’re functionally unable to communicate in English. This is also why you shouldn’t trust anyone who tells you they have an IQ over, say, 150 or so. The scale just doesn’t mean anything up that high.)

How do we get these pretty normal distributions? That’s the work of test development, and part of why it can be a seriously difficult and expensive undertaking. The nature of numbers (and the central limit theorem) helps, but ultimately the big testing companies have to spend a ton of time and money getting a distribution as close to normal as possible – and whatever else their flaws, organizations like ETS do that very well. Either way, it’s essential to say that the normal distribution does not arrange itself like magic in tests. It has to be produced with careful work.

The Question of Grades

Thinking about these two paradigms can help us think through some questions. Here’s a simple one: are grades norm referenced or criterion referenced? The answer is surely both and neither, but I think it’s useful to consider the dynamics here for a minute. In one sense, grades are clearly criterion referenced, in that they are meant to reflect a given student’s mastery of the given course’s subject matter. And we aren’t likely to say that only X% of students in a class should pass or fail according to a model distribution; rather, we think we should pass everyone who demonstrates the knowledge, skills, and competencies a class is designed to instill. And we have little reason to think that grades are actually normally distributed between students in the average class.

Yet, in another sense, we clearly think of grades as norm referenced. When we talk about grade inflation, we are often speaking in terms of norm referencing, complaining that we’re losing the ability to discriminate between different levels of ability. Grades are used to compare different students from remarkably different educational contexts in the college admissions process. And sometimes, we “curve a test,” meaning adjusting a test to better match the normal distribution – which does not, contrary to undergraduate presumption, necessarily help the people who took the test. There’s an intuitive sense that grades should match a fairness distribution that, while not normal around the natural mean (you wouldn’t want the average grade to be a 50, I trust), still essentially replicates normality in that both very high and very low grades should be rare. In practice, this is not often the case. A’s, in my experience, outnumber F’s.

In grad school a perpetual concern was the very high average grade in freshman composition, something like a B+. And the downward pressure on that average was largely the product of students who stopped coming to class and thus got F’s, meaning that the average grade of students who actually completed the course was probably very high indeed. Is this a problem? I guess it depends on your point of view. If we were trying to meaningfully discriminate between different students based on their freshman comp grades, then certainly – but I’m not sure we’d want to do that. On the other hand, it might be that freshman comp is a class that we think most students can naturally be expected to do well in; it isn’t written anywhere that the criterion for success has to be set at a particularly high level. Certainly the major departments wouldn’t like too many students failing that Gen Ed, given that those students would then have to fill valuable schedule space retaking it. The question, though, is whether those grades actually reflect a meaningful meeting of the given benchmarks – fulfilling the criterion. I’m of the opinion that the answer is often “no” when it comes to freshman comp, meaning that the average grades are probably too high no matter what.

I don’t have any grand insight here, and I think most people are able to meaningfully think about grades in a way that reflects both norm-referenced and criterion-referenced interests. But I do think that these dynamics are important to think about. As I’ve been saying lately, I think that there are some basic aspects of education and education policy that we simply haven’t thought through adequately, and we all could benefit from going back to the basics and pulling apart what we think we want.

selection bias: a parable

To return to this blog’s favorite theme, how selection bias dictates educational outcomes, let me tell you a true story about a real public university in a large public system. (Not my school, for the record.) Let’s call it University X. University X, like most, has an Education department. Like at most universities, University X’s Education majors largely plan on becoming teachers in the public school system of that state. Unfortunately, like at many schools across the country, University X’s Education department has poor metrics – low graduation rate, discouraging post-graduation employment and income measures, poor performance on assessments of student learning relative to same-school peers. (The reasons for this dynamic are complex and outside of the scope of this post.)

The biggest problem, though, is this: far too many of University X’s graduates from the Education program go on to fail the test that the state mandates as part of teacher accreditation. This means that they are effectively barred from their chosen profession. Worse, going to another state to work as a public teacher is often not feasible, as accreditation standards will sometimes require them to take a year or more of classes in that state’s system post-graduation to become accredited. So you end up with a lot of students with degrees designed to get them into a profession they can’t get into, and eventually the powers that be looked askance.

So what did University X’s Education department do? Their move was ingenious, really: they required students applying to the major to take a practice version of the accreditation test, one built using real questions from the real test and otherwise identical to the real thing. They then only accepted into the major those students who scored above benchmarks on that practice test. And the problem, in a few years, was solved! They saw their graduates succeed on the accreditation test at vastly higher rates. Unfortunately, they also saw their number of majors decline precipitously, which in turn put them in a tough spot with the administration of University X, which used declining enrollment as a reason to cut their funding.

Now here’s the question for all of you: was introducing the screening mechanism of the practice accreditation test a cynical ploy to artificially boost their metrics? Or by preventing students from entering a program designed to lead to a job they ultimately couldn’t get, were they doing the only moral thing?

lots of fields have p-value problems, not just psychology

You likely have heard of the replication crisis going on, where past research findings cannot be reproduced by other researchers using the same methods. The issue, typically, lies with the p-value, an essential but limited statistic that we use to establish statistical significance. (There are other replication problems besides the p-value, but that’s the one you read about the most.) You can read about p-value here and the replication crisis here.

These problems are often associated with the social sciences in general and the fields of psychology and education specifically. This is largely due to the inherent complexities of human-subject research, which typically involves many variables that researchers cannot control; the inability to perform true control-grouped experimental studies due to practical or ethical limitations; and the relatively high alpha thresholds associated with these fields, typically .05, which are necessary because effects studied in the social sciences are often weak compared to those in the natural or applied sciences.
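To make the weak-effects point concrete, here is a minimal simulation of my own (not drawn from any of the studies discussed here): assume a small true effect, the modest sample sizes common in human-subject work, and the usual .05 threshold, and see how often a study – or an exact replication of it – comes up significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_study(effect=0.2, n=30, alpha=0.05):
    """Simulate one two-group study of a weak true effect and report
    whether it clears the conventional significance threshold."""
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(effect, 1.0, n)
    _, p = stats.ttest_ind(treatment, control)
    return p < alpha

# An "original" literature of studies, plus one exact replication attempt each.
originals = [run_study() for _ in range(5000)]
replications = [run_study() for _ in range(5000)]

print(f"Original studies reaching p < .05:   {np.mean(originals):.2f}")
print(f"Exact replications reaching p < .05: {np.mean(replications):.2f}")
# Both hover around 0.12 -- the power of the design. Since an exact replication
# succeeds at roughly the power rate, most attempts to replicate even a true,
# published finding will fail under these conditions.
```

With numbers like these, a literature full of findings that don’t replicate is what we should expect even before anyone does anything questionable with their data.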

However, it is important to be clear that the p-value problem exists in all manner of fields, including some that are among the “hardest” of scientific disciplines. In a 2016 story for Slate, Daniel Engber writes that “much of cancer research in the lab—maybe even most of it—simply can’t be trusted. The data are corrupt. The findings are unstable. The science doesn’t work,” because of p-value and associated problems. In a 2016 article for the Proceedings of the National Academy of Sciences, Eklund, Nichols, and Knutsson found that inferences drawn from fMRI brain imaging are frequently invalid, echoing concerns voiced in a 2016 eNeuro article by Katherine S. Button about replication problems across the biomedical sciences. A 2016 paper by Eric Turkheimer, an expert in the genetic heritability of behavioral traits, discussed the ways that even replicable weak associations between genes and behavior prevent researchers from drawing meaningful conclusions about that relationship. And in a 2014 article for Science, Erik Stokstad expressed concern that the ecology literature was increasingly likely to report p-values, but that the actual effects being explained were growing weaker, and that p-values were not adequately contextualized through reference to other statistics.

Clearly, we can’t reassure ourselves that p-value problems are found only in the “soft” sciences. There is a far broader problem with basic approaches to statistical inference, one that affects a large number of fields. The implications of this are complex; as I have said and will say again, research nihilism is not the answer. But neither is laughing it off as a problem inherent just to those “soft” sciences. More here.

journalists, beware the gambler’s fallacy

One persistent way that human beings misrepresent the world is through the gambler’s fallacy, and there’s a kind of implied gambler’s fallacy that works its way into journalism quite often. It’s hugely important to anyone who cares about research and journalism about research.

The gambler’s fallacy is when you expect a certain periodicity in outcomes when you have no reason to expect it. That is, you look at events that happened in the recent past, and say “that is an unusually high/low number of times for that event to happen, so therefore what will follow is an unusually low/high number of times for it to happen.” The classic case is roulette: you’re walking along the casino floor, and you see the electronic sign showing that a roulette table has hit black 10 times in a row. You know the odds of this are very small, so you rush over to place a bet on red. But of course that’s not justified: the table doesn’t “know” it has come up black 10 times in a row. You’ve still got the same (bad) odds of hitting red, 47.4%. You’re still playing with the same house edge. A coin that’s just come up heads 50 times in a row has the same odds of being heads again as being tails again. The expectation that non-periodic random events are governed by some sort of god of reciprocal probabilities is the source of tons of bad human reasoning – and journalism is absolutely stuffed with it. You see it any time people point out that a particular event hasn’t happened in a long time, so therefore we’ve got an increased chance of it happening in the future.
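If you want to convince yourself, a quick simulation does it. This is a sketch of my own, using an American double-zero wheel, where 18 of 38 pockets are red:

```python
import random

random.seed(1)
POCKETS = ["red"] * 18 + ["black"] * 18 + ["green"] * 2  # American wheel: 38 pockets

def spin():
    return random.choice(POCKETS)

# Watch for runs of 10 blacks, then record the color of the very next spin.
streak = reds_after = spins_after = 0
for _ in range(5_000_000):
    color = spin()
    if streak >= 10:
        spins_after += 1
        if color == "red":
            reds_after += 1
    streak = streak + 1 if color == "black" else 0

print(f"P(red) immediately after 10 blacks in a row: {reds_after / spins_after:.3f}")
# Prints something close to 0.474 -- the same 18/38 you get on any other spin.
```

The streak carries no information about the next spin; the wheel has no memory, and neither does the coin.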

Perhaps the classic case of this was Kathryn Schulz’s Pulitzer Prize-winning, much-celebrated New Yorker article on the potential mega-earthquake in the Pacific northwest. This piece was a sensation when it appeared, thanks to its prominent placement in a popular publication, the deftness of Schulz’s prose, and the artful construction of her story – but also because of the gambler’s fallacy. At the time I heard about the article constantly, from a lot of smart, educated people, and it was all based on the idea that we were “overdue” for a huge earthquake in that region. People I know were considering selling their homes. Rational adults started stockpiling canned goods. The really big one was overdue.

Was Schulz responsible for this idea? After publication, she was dismissive of the suggestion that she had created the impression that we were overdue for such an earthquake. She wrote in a followup to the original article,

Are we overdue for the Cascadia earthquake?

No, although I heard that word a lot after the piece was published. As DOGAMI’s Ian Madin told me, “You’re not overdue for an earthquake until you’re three standard deviations beyond the mean”—which, in the case of the full-margin Cascadia earthquake, means eight hundred years from now. (In the case of the “smaller” Cascadia earthquake, the magnitude 8.0 to 8.6 that would affect only the southern part of the zone, we’re currently one standard deviation beyond the mean.) That doesn’t mean that the quake won’t happen tomorrow; it just means we are not “overdue” in any meaningful sense.

How did people get the idea that we were overdue? The original:

we now know that the Pacific Northwest has experienced forty-one subduction-zone earthquakes in the past ten thousand years. If you divide ten thousand by forty-one, you get two hundred and forty-three, which is Cascadia’s recurrence interval: the average amount of time that elapses between earthquakes. That timespan is dangerous both because it is too long—long enough for us to unwittingly build an entire civilization on top of our continent’s worst fault line—and because it is not long enough. Counting from the earthquake of 1700, we are now three hundred and fifteen years into a two-hundred-and-forty-three-year cycle.

By saying that there is a “two-hundred-and-forty-three-year cycle,” Schulz implied a regular periodicity. The definition of a cycle, after all, is “a series of events that are regularly repeated in the same order.” That simply isn’t how a recurrence interval functions, as Schulz would go on to clarify in her followup – which of course got vastly less attention. I appreciate that, in her followup, Schulz was more rigorous and specific, referring to an expert’s explanation, but it takes serious chutzpah to have written the preceding paragraph and then to later act as though there’s no reason your readers thought the next quake was “overdue.” The closest thing to a clarifying statement in the original article is as follows:

It is possible to quibble with that number. Recurrence intervals are averages, and averages are tricky: ten is the average of nine and eleven, but also of eighteen and two. It is not possible, however, to dispute the scale of the problem.

If we bother to explain that first sentence thoroughly, we can see it’s a remarkable to-be-sure statement – she is obliquely admitting that since there is no regular periodicity to a recurrence interval, there is no sense in which that “two-hundred-and-forty-three-year cycle” is actually a cycle. It’s just an average. Yes, the “really big one” could hit the Pacific northwest tomorrow – and if it did, it still wouldn’t imply that we’ve been overdue, as her later comments acknowledge. The earthquake might also happen 500 years from now. That’s not a quibble; it’s the root of the very panic she set off by publishing the piece. But by immediately leaping from such an under-explained discussion of what a recurrence interval is and isn’t to the irrelevant and vague assertion about “the scale of the problem,” Schulz ensured that her readers would misunderstand in the most sensationalistic way possible. However well crafted her story was, it left people getting a very basic fact wrong, and was thus bad science writing. I don’t think Schulz was being dishonest, but this was a major problem with a piece that received almost universal praise.
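To see why an average interval isn’t a schedule, here’s a toy sketch of my own. It treats the gaps between quakes as draws from an exponential distribution with a 243-year mean – a deliberate oversimplification of real subduction-zone behavior, used only to illustrate that an average can hide wildly irregular spacing:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy model only: gaps between quakes drawn from an exponential distribution
# with a 243-year mean. Real subduction-zone recurrence is not this simple;
# the point is just that an average interval is not a schedule.
gaps = rng.exponential(scale=243, size=41)

print(f"Average gap:  {gaps.mean():.0f} years")   # close to 243 in expectation
print(f"Shortest gap: {gaps.min():.0f} years")
print(f"Longest gap:  {gaps.max():.0f} years")

# Under this (memoryless) model, the chance of a quake in the next 50 years
# is the same whether 10 years or 315 years have passed since the last one.
p_next_50 = 1 - np.exp(-50 / 243)
print(f"P(quake within the next 50 years): {p_next_50:.2f}")
```

Actual hazard models are more sophisticated than this – that’s what Madin’s three-standard-deviations comment is about – but the basic lesson stands: a recurrence interval is an average, not a cycle, and elapsed time alone doesn’t make a quake “due.”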

I just read another good example of an implied gambler’s fallacy in a comprehensively irresponsible Gizmodo piece on supposed future pandemics. I am tempted to just fisk the whole thing, but I’ll spare you. For our immediate interests let’s just look at how a gambler’s fallacy can work by implication. George Dvorsky:

Experts say it’s not a matter of if, but when a global scale pandemic will wipe out millions of people…. Throughout history, pathogens have wiped out scores of humans. During the 20th century, there were three global-scale influenza outbreaks, the worst of which killed somewhere between 50 and 100 million people, or about 3 to 5 percent of the global population. The HIV virus, which went pandemic in the 1980s, has infected about 70 million people, killing 35 million.

Those specific experts are not named or quoted, so we’ll have to take Dvorsky’s word for it. But note the implication here: because we’ve had pandemics in the past that killed significant percentages of the population, we are likely to have more in the future. An-epidemic-is-upon-us stories are a dime a dozen in contemporary news media, given their obvious ability to drive clicks. Common to these pieces is the implication that we are overdue for another epidemic because epidemics used to happen regularly in the past. But of course, conditions change, and there are few fields where conditions have changed more in the recent past than infectious diseases. Dvorsky implies that they have changed for the worse:

Diseases, particularly those of tropical origin, are spreading faster than ever before, owing to more long-distance travel, urbanization, lack of sanitation, and ineffective mosquito control—not to mention global warming and the spread of tropical diseases outside of traditional equatorial confines.

Sure, those are concerns. But since he’s specifically set us up to expect more pandemics by referencing those in the early 20th century, maybe we should take a somewhat broader perspective and look at how infectious diseases have changed in the past 100 years. Let’s check with the CDC.

The most salient change, when it comes to infectious disease, has been the astonishing progress of modern medicine. We have a methodology for fighting infectious disease that has saved hundreds of millions of lives. Unsurprisingly, the diseases that keep getting nominated as the source of the next great pandemic keep failing to spread at expected rates. Dvorsky names diseases like SARS (global cases since 2004: zero) and Ebola (for which we just discovered a very promising vaccine), not seeming to realize that these are examples of victories for the control of infectious disease, as tragic as the loss of life has been. The actual greatest threats to human health remain what they have been for some time: the deeply unsexy threats of smoking, heart disease, and obesity.

Does the dramatically lower rate of deaths from infectious disease mean a pandemic is impossible? Of course not. But “this happened often in the past, and it hasn’t happened recently, so….” is fallacious reasoning. And you see it in all sorts of domains of journalism. “This winter hasn’t seen a lot of snow so far, so you know February will be rough.” “There hasn’t been a murder in Chicago in weeks, and police are on their toes for the inevitable violence to come.” “The candidate has been riding a crest of good polling numbers, but analysts expect he’s due for a swoon.” None of these are sound reasoning, even though they seem superficially correct based on our intuitions about the world. It’s something journalists in particular should watch out for.

why selection bias is the most powerful force in education

Imagine that you are a gubernatorial candidate who is making education and college preparedness a key facet of your campaign. Consider these two states’ average SAT scores.

              Quantitative   Verbal   Total

Connecticut        450         480     930

Mississippi        530         550    1080

Your data analysts assure you that this difference is statistically significant. You know that SAT scores are a strong overall metric for educational aptitude in general, and particularly that they are highly correlated with freshman year performance and overall college outcomes. Those who score higher on the test tend to receive higher college grades, are less likely to drop out in their freshman year, are more likely to complete their degrees in four or six years, and are more likely to gain full-time employment when they’re done.

You believe that making your state’s high school graduates more competitive in college admissions is a key aspect of improving the economy of the state. You also note that Connecticut has powerful teacher unions which represent almost all of the public teachers in the state, while Mississippi’s public schools are largely free of public teacher unions. You resolve to make opposing teacher unions in your state a key aspect of your educational platform, out of a conviction that getting rid of the unions will ultimately benefit your students based on this data.

Is this a reasonable course of action?

Anyone who follows major educational trends would likely be surprised at these SAT results. After all, Connecticut consistently places among the highest-achieving states in educational outcomes, Mississippi among the worst. In fact, on the National Assessment of Educational Progress (NAEP), widely considered the gold standard of American educational testing, Connecticut recently ranked as the second-best state for 4th graders and the best for 8th graders. Mississippi ranked second-to-worst for both 4th graders and 8th graders. So what’s going on?

The key is participation rate, or the percentage of eligible juniors and seniors taking the SAT, as this scatter plot shows.

As can be seen, there is a strong negative relationship between participation rate and average SAT score. Generally, the higher the percentage of students taking the test in a given state, the lower the average score. Why? Think about what it means for students in Mississippi, where the participation rate is 3%, to take the SAT. Those students are the ones who are most motivated to attend college and the ones who are most college-ready. In contrast, in Connecticut 88% of eligible juniors and seniors take the test. (Data.) This means that almost everyone of appropriate age takes the SAT in Connecticut, including many students who are unprepared or only marginally prepared for college. Most Mississippi students select themselves out of the sample. The top-performing quintile (20%) of Connecticut students handily outperforms the top-performing quintile of Mississippi students. Typically, the highest state average in the country is that of North Dakota—where only 2% of those eligible take the SAT at all.

In other words, what we might have perceived as a difference in education quality was really the product of systematic differences in how the two populations were put together. The groups we compared had a hidden, non-random composition. This is selection bias.
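The mechanism is easy to simulate. Here’s a minimal sketch of my own, with invented numbers: both “states” draw students from the identical ability distribution, and the only difference is what fraction of them sit for the test (modeled, crudely, as the most college-ready slice).

```python
import numpy as np

rng = np.random.default_rng(7)

def average_sat(participation_rate, n_students=100_000):
    """Both 'states' draw students from the same ability distribution; only
    the share of students who actually sit for the test differs. Test-takers
    are modeled, crudely, as the most college-ready slice of the population."""
    ability = rng.normal(500, 100, n_students)            # identical everywhere
    cutoff = np.quantile(ability, 1 - participation_rate) # only the top slice tests
    return ability[ability >= cutoff].mean()

print(f"3% participation (Mississippi-like):  {average_sat(0.03):.0f}")
print(f"88% participation (Connecticut-like): {average_sat(0.88):.0f}")
# The low-participation state posts a far higher average score even though
# the two student populations are literally drawn from the same distribution.
```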

*****

My hometown had three high schools – the local coed public high school (where I went), a private Catholic boys’ school, and a private Catholic girls’ school. People involved with the private high schools liked to brag about the high scores their students earned on standardized tests – without bothering to mention that you had to score well on such a test to get into them in the first place. This is, as I’ve said before, akin to having a height requirement for your school and then bragging about how tall your student body is. And of course, there’s another set of screens involved here that also powerfully shape outcomes: private schools cost a lot of money, and so students who can’t afford to attend are screened out. Students from lower socioeconomic backgrounds have consistently lower performance on a broad variety of metrics, and so private schools are again advantaged in comparison to public schools. To draw conclusions about educational quality from student outcomes without rigorous attempts to control for differences in which students are sorted into which schools, programs, or pedagogies – without randomization – is to ensure that you’ll draw unjustified conclusions.

Here’s an image that I often use to illustrate a far broader set of realities in education. It’s a regression analysis showing institutional averages for the Collegiate Learning Assessment, a standardized test of college learning and the subject of my dissertation. Each dot is a college’s average score. The blue dots are average scores for freshmen; the red dots, for seniors. The gap between the red and blue dots shows the degree of learning going on in this data set, which is robust for essentially all institutions. The very strong relationship between SAT scores and CLA scores shows the extent to which different incoming student populations – the inherent, powerful selection bias of the college admissions process – determine different test outcomes. (Note that very similar relationships are observed in similar tests such as ETS’s Proficiency Profile.) To blame educators at a school on the left-hand side of the regression for failing to match the schools on the right-hand side of the graphic is to punish them for differences in the prerequisite ability of their students.

Harvard students have remarkable post-collegiate outcomes, academically and professionally. But then, Harvard invests millions of dollars carefully managing their incoming student bodies. The truth is most Harvard students are going to be fine wherever they go, and so our assumptions about the quality of Harvard’s education itself are called into question. Or consider exclusive public high schools like New York’s Stuyvesant, a remarkably competitive institution where the city’s best and brightest students compete to enroll, thanks to the great educational benefits of attending. After all, the alumni of high schools such as Stuyvesant are a veritable Who’s Who of high achievers and success stories; those schools must be of unusually high quality. Except that attending those high schools simply doesn’t matter in terms of conventional educational outcomes. When you look at the edge cases – when you restrict your analysis to those students who are among the last let into such schools and those who are among the last left out – you find no statistically meaningful differences between them. Of course, when you have a mechanism in place to screen out all of the students with the biggest disadvantages, you end up with an impressive-looking set of alumni. The admissions procedures at these schools don’t determine which students get the benefit of a better education; the perception of a better education is itself an artifact of the admissions procedure. The screening mechanism is the educational mechanism.
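The edge-case finding is easy to mimic with a toy simulation of my own (entirely invented numbers, not the actual studies): admission depends only on an entrance exam, later outcomes depend only on underlying ability, and the “school effect” is zero by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n = 50_000
ability = rng.normal(0, 1, n)
exam = ability + rng.normal(0, 0.5, n)   # entrance exam: a noisy measure of ability
cutoff = np.quantile(exam, 0.95)         # the top 5% of scorers are admitted
admitted = exam >= cutoff

# By construction, later outcomes depend on ability alone -- the school adds nothing.
outcome = ability + rng.normal(0, 0.5, n)

print(f"Alumni mean outcome: {outcome[admitted].mean():.2f}")
print(f"Everyone else:       {outcome[~admitted].mean():.2f}")

# Now compare only the marginal admits and marginal rejects near the cutoff.
band = 0.02
just_in = admitted & (exam < cutoff + band)
just_out = ~admitted & (exam > cutoff - band)
_, p = stats.ttest_ind(outcome[just_in], outcome[just_out])
print(f"Marginal admits vs. marginal rejects: p = {p:.2f}")
# The alumni look far better than everyone else, yet students just above and
# just below the cutoff are nearly identical, so the marginal comparison will
# usually show no statistically meaningful difference. The screen, not the
# school, produces the impressive-looking outcomes.
```

None of this proves that real exam schools add nothing; it just shows how a powerful admissions screen can manufacture the appearance of a school effect all by itself.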

Thinking about selection bias compels us to consider our perceptions of educational cause and effect in general. A common complaint of liberal education reformers is that students who face consistent achievement gaps, such as poor minority students, suffer because they are systematically excluded from the best schools, screened out by high housing prices in these affluent, white districts. But what if this confuses cause and effect? Isn’t it more likely that we perceive those districts to be the best precisely because they effectively exclude students who suffer under the burdens of racial discrimination and poverty? Of course schools look good when, through geography and policy, they are responsible for educating only those students who receive the greatest socioeconomic advantages our society provides. But this reversal of perceived cause and effect is almost entirely absent from education talk, in either liberal or conservative media.

Immigrant students in American schools outperform their domestic peers, and the reason is about culture and attitude, the immigrant’s willingness to strive and persevere, right? Nah. Selection bias. So-called alternative charters have helped struggling districts turn it around, right? Not really; they’ve just artificially created selection bias. At Purdue, where there is a large Chinese student population, I always chuckled to hear domestic students say “Chinese people are all so rich!” It didn’t seem to occur to them that attending a school that costs better than $40,000 a year for international students acted as a natural screen to exclude the vast number of Chinese people who live in deep poverty. And I had to remind myself that my 8:00 AM writing classes weren’t going so much better than my 2:00 PM classes because I was somehow a better teacher in the mornings, but because the students who would sign up for an 8:00 AM class were probably the most motivated and prepared. There’s plenty of detailed work by people who know more than I do about the actual statistical impact of these issues and how to correct for them. But we all need to be aware of how deeply unequal populations influence our perceptions of educational quality.

Selection bias hides everywhere in education. Sometimes, in fact, it is deliberately hidden in education. A few years ago, Reuters undertook an exhaustive investigation of the ways that charter schools deliberately exclude the hardest-to-educate students, despite the fact that most are ostensibly required to accept all kinds of students, just as public schools are. For all the talk of charters as some sort of revolution in effective public schooling, what we find is that charter administrators work feverishly to tip the scales, finding all kinds of crafty ways to ensure that they don’t have to educate the hardest students to educate. And even when we look past all of the dirty tricks they use – like, say, requiring parents to attend meetings held at specific times when most working parents can’t – there are all sorts of ways in which students are assigned to charter schools non-randomly and in ways that advantage those schools. Excluding students with cognitive and developmental disabilities is a notorious example. (Despite what many people presume, in most locales a majority of students with special needs take state-mandated standardized tests and are included in data like graduation rates.) Simply the fact that parents typically have to opt in to charter school lotteries for their children to attend functions as a screening mechanism.

Large-scale studies of charter efficacy such as Stanford’s CREDO project argue confidently that they have controlled for the enormous number of potential screening mechanisms that hide in large-scale education research. These researchers are among the best in the world and I don’t mean to disparage their work. But given how high the stakes are and the truth of Campbell’s Law, I have to report that I remain skeptical that we have ever truly controlled effectively for all the ways that schools and their leaders cook the books and achieve non-random student populations. Given that random assignment to condition is the single most essential aspect of responsible social scientific study, I think caution is warranted. And as I’ll discuss in a post in the future, the observed impact of school quality on student outcomes in those cases where we have the most confidence in truly random assignment to condition is not encouraging.

I find it’s nearly impossible to get people to think about selection bias when they consider schools and their quality. Parents look at a private school and say, look, all these kids are doing so well, I’ll send my troubled child and he’ll do well, too. They look at the army of strivers marching out of Stanford with their diplomas held high and say, boy, that’s a great school. And they look at the Harlem Children’s Zone schools and celebrate their outcome metrics, without pausing to consider that it’s a lot easier to get those outcomes when you’re constantly expelling the students most predisposed to fail. But we need to look deeper and recognize these dynamics if we want to evaluate the use of scarce educational resources fairly and effectively.

Tell me how your students are getting assigned to your school, and I can predict your outcomes – not perfectly, but well enough that it calls into question many of our core presumptions about how education works.