I can teach you regression discontinuity design in two images

Like so. (Fake charts I made with fake data, btw.)

[Two scatterplots of the fake data: the first titled “No Treatment Effect,” the second titled “Significant Treatment Effect.”]

You already get it, right?

Typically, when we perform some sort of experiment, we want to look at how a particular number responds to the treatment – how blood pressure reacts to a new drug, say, or how students improve on a reading test when they’re given a new kind of lesson. We want to make sure that the observed differences are really the product of the treatment and not some underlying difference between the groups. That’s what randomized controlled trials are for. We randomly assign subjects to treatment and control groups, compare the averages of the two groups, note the size of the effect, and determine whether it is statistically significant.

But sometimes real-world conditions dictate that subjects get sorted into one group or another non-randomly. If we then look at how the different groups perform after some treatment, we know that we’re potentially facing severe selection effects thanks to that non-random assignment. But consider what happens when assignment is based purely on some quantitative metric, with a cutoff score that sorts people into one group or the other. (Suppose, for example, students only become eligible for a gifted student program if they score above a cut score on some test.) Here we have a non-random assignment that we can actually exploit for research purposes. A regression discontinuity design allows us to explore the impact of such a program because, so long as students aren’t able to influence their assignment beyond their score on that test, we can be confident that students just above and just below the cutoff score are very similar.

Regression analyses are run on all of the data, with subjects below and above the cut score combined but flagged as different groups. Researchers then run statistical models to determine whether there is a difference between the group that receives the treatment and the group that doesn’t. As you can see in the scatterplots above, a large effect will be readily apparent in how the data looks. In this scenario, the X axis represents the score students received on the test, the cut score is 15, and the Y axis represents performance on some later educational metric. In the top scatterplot, there is no meaningful difference from the gifted student program, as the relationship between the two metrics is the same above and below the cut score. But in the bottom graph, there’s a significant jump at the cut score. Note that even after the intervention, the relationship is still linear – students who did better on the initial test do better on the later metric. But scores jump right at the cut score.
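
Here’s a minimal sketch of what that looks like in code, in the spirit of the fake charts above. The variable names, the cut score of 15, and the size of the simulated jump are my own made-up choices, not anything from a real study; the point is just that adding a treatment indicator at the cutoff lets the regression estimate the discontinuity.

```python
# A toy regression discontinuity sketch with simulated (fake) data.
# Assumptions: cut score of 15, a hypothetical jump of 10 points for treated students.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
test_score = rng.uniform(0, 30, n)          # running variable (initial test)
treated = (test_score >= 15).astype(int)    # assignment rule: at or above the cut score
later_metric = (
    2.0 * test_score                        # underlying linear relationship
    + 10.0 * treated                        # the treatment effect at the discontinuity
    + rng.normal(0, 5, n)                   # noise
)

# Regress the outcome on the running variable plus a treatment dummy.
X = sm.add_constant(np.column_stack([test_score, treated]))
model = sm.OLS(later_metric, X).fit()
print(model.params)  # the coefficient on the dummy estimates the jump at the cutoff
```

In the “no treatment effect” version you’d set the simulated jump to zero, and the dummy’s coefficient should come back near zero.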

There are, as you’d probably imagine, a number of potential pitfalls here, and assumption checks and quality controls are essential. Everyone tested has to be sorted into the gifted program solely on the basis of the test, the cutoff score has to fall near the mean, and you need sufficient numbers on either side of the cut score to see the relationship, among other things. But if you have the right conditions, regression discontinuity design is a great way to get something close to experimental quality in situations where true random assignment is impossible for pragmatic or ethical reasons.

restriction of range: what it is and why it matters

Let’s imagine a bit of research that we could easily perform, following standard procedures, and still get a misleading result.

Say I’m an administrator at Harvard, a truly selective institution. I want to verify the College Board’s confidence that the SAT effectively predicts freshman year academic performance. I grab the SAT data, grab freshmen GPAs, and run a simple Pearson correlation to find out the relationship between the two. To my surprise, I find that the correlation is quite low. I resolve to argue to colleagues that we should not be requiring students to submit SAT or similar scores for admissions, as those scores don’t tell us anything worthwhile anyway.

Ah. But what do we know about the SAT scores of Harvard freshmen? We know that they’re very tightly grouped because they are almost universally very high. Indeed, something like a quarter of all incoming freshmen got a perfect score on the (new-now-old) 2400 scale:

Section     Average   25th Percentile   75th Percentile
Math        750       700               800
Reading     750       700               800
Writing     750       710               790
Composite   2250      2110              2390

The reason your correlation is so low (and note that this dynamic applies to typical linear regression procedures as well) is that there simply isn’t enough variation in one of your numbers to get a high metric of relationship. You’ve fallen victim to a restriction of range.

Think about it. When we calculate a correlation, we take pairs of numbers and see how one number changes compared to the other. So if I restrict myself to children and look at age in months compared to height, I’m going to see consistent changes in the same direction – my observations of height at 6 months will be smaller than my observations at 12 months, and those will in turn be smaller than at 24 months. This correlation will not be perfect, as different children are different heights and grow at different rates. The overall trend, however, will be clear and strong. But in simple mathematical terms, in order to get a high degree of relationship you have to have a certain range of scores in both numbers – if you only looked at children between 18 and 24 months, you’d necessarily be restricting the size of the relationship. In the above example, if Harvard became so competitive that every incoming freshman had a perfect SAT score, the correlation between SAT scores and GPA (or any other number) couldn’t even be computed – with no variation at all in one variable, there’s no relationship left to measure.

Of course, most schools don’t have incoming populations similar to Harvard’s. Their average SAT scores, and the degree of variation in their SAT scores, would likely be different. Big public state schools, for example, tend to have a much wider achievement band of incoming students, who run the gamut from those who are competitive with those Ivy League students to those who are marginally prepared, and perhaps gained admission via special programs designed to expand opportunity. In a school like that, given adequate sample size and an adequate range of SAT scores, the correlation would be much less restricted – and it’s likely, given the consistent evidence that SAT scores are a good predictor of GPA, significantly higher.

Note that we could also expect a similar outcome in the opposite direction. In many graduate school contexts, it’s notoriously hard to get bad grades. (This is not, in my opinion, a problem in the same way that grade inflation is a potential problem for undergraduate programs, given that most grad school-requiring jobs don’t really look at grad school GPA as an important metric.) With so many GPAs clustering in such a narrow upper band, you’d expect raw GRE-GPA correlations to be fairly low – which is precisely what the research finds.

Here’s a really cool graphic demonstration of this in the form of two views on the same scatterplot. (I’m afraid I don’t know where this came from, otherwise I’d give credit.)

This really helps show restricted range in an intuitive way: when you’re zoomed in too close on a small slice of one variable, you just don’t have the perspective to see the broader trend.
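
If you want to see the effect numerically rather than visually, here’s a quick simulated sketch (my own toy numbers, not real SAT data): generate two correlated variables, then recompute the correlation after truncating to a narrow band of the predictor.

```python
# Toy demonstration of restriction of range with simulated (not real) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000
sat = rng.normal(1000, 200, n)                      # hypothetical SAT-like scores
gpa = 2.0 + 0.001 * sat + rng.normal(0, 0.3, n)     # GPA loosely driven by SAT plus noise

r_full, _ = stats.pearsonr(sat, gpa)

# Now restrict to a narrow, Harvard-like band at the top of the range.
mask = sat > np.percentile(sat, 95)
r_restricted, _ = stats.pearsonr(sat[mask], gpa[mask])

print(f"correlation, full range:       {r_full:.2f}")
print(f"correlation, top 5% of scores: {r_restricted:.2f}")  # much smaller
```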

What can we do about this? Does this mean that we just can’t look for these relationships when we have a restricted range? No. There are a number of statistical adjustments we can make to estimate a range-corrected value for metrics of relationship. The most common of these, Thorndike’s Case 2, was (like a lot of stats formulas) patiently explained to me by a skilled instructor who guided me to an understanding of how it works, which then escaped my brain in my sleep one night like air slowly getting let out of a balloon. But you can probably intuitively understand how such a correction would work in broad strokes: we have a data set restricted on the X variable; within that restricted range we know the strength of the relationship and the spread; so we use those, together with the spread in the unrestricted population, to estimate the relationship over the full range of X. As you can probably guess, we can do so with more confidence if we have a stronger relationship and lower spread in the data that we do have. And we need a certain amount of range in our real-world data to be able to calculate a responsible adjustment.
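
For the curious, the Case 2 correction itself is compact enough to write down. Here’s a sketch – I’m reproducing the standard formula from memory, so check it against a proper stats reference before relying on it for real work. The idea is that it scales the restricted correlation by the ratio of the unrestricted to the restricted standard deviation of the predictor.

```python
# Thorndike's Case 2 correction for direct range restriction (sketch).
# r_restricted: correlation observed in the restricted sample
# sd_restricted: SD of the predictor in the restricted sample
# sd_unrestricted: SD of the predictor in the full population
import math

def thorndike_case2(r_restricted: float, sd_restricted: float, sd_unrestricted: float) -> float:
    ratio = sd_unrestricted / sd_restricted
    return (r_restricted * ratio) / math.sqrt(1 - r_restricted**2 + (r_restricted**2) * ratio**2)

# Example with made-up numbers: a weak-looking r of 0.30 in a sample whose SD is
# half the population SD corrects to a substantially larger estimate (about 0.53).
print(round(thorndike_case2(0.30, 100, 200), 2))
```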

There have been several validation studies of Thorndike’s case 2 where researchers had access to both a range-restricted sample (because of some set cut point) and an unrestricted sample and were able to compare the corrected results on the restricted sample to the raw correlations on unrestricted samples. The results have provided strong validating evidence for the correction formula. Here’s a good study.

There are also imputation models that are used to correct for range restriction. Imputation is a process common to regression when we have missing data and want to fill in the blanks, sometimes by making estimates based on the strength of observed relationships and spread, sometimes by using real values pulled from other data points…. It gets very complicated and I don’t know much about it. As usual if you really need to understand this stuff for research purposes – get ye to a statistician!

two concepts about sampling that were tricky for me

Here are a couple of related points about statistics that took me a long time to grasp, and which really improved my intuitive understanding of statistical arguments.

1. It’s not the sample size, it’s the sampling mechanism. 

Well, OK. It’s somewhat the sample size, obviously. My point is that most people who encounter a study’s methodology are much more likely to remark on the sample size – and pronounce it too small – than to remark on the sampling mechanism. I can’t tell you how often I’ve seen studies with an n = 100 dismissed by commenters online as too small to take seriously. Depending on the design of the study and the variables being evaluated, 100 can be a good-enough sample size. In fact, under certain circumstances (medical testing of rare conditions, say) an n of 30 is sufficient to draw some conclusions about populations.

We can’t say with 100% accuracy what a population’s average for a given trait is when we use inferential statistics. (We actually can’t say that with 100% accuracy even when taking a census, but that’s another discussion.) But we can say with a chosen level of confidence that the average lies in a particular range, which can often be quite small, and from which we can make predictions of remarkable accuracy – provided the sampling mechanism was adequately random. By random, we mean that every member of the population has an equivalent chance of being selected for the sample. If there are factors that make one group more or less likely to be selected for the sample, that is statistical bias (as opposed to statistical error).

It’s important to understand the declining influence of sample size in reducing statistical error as sample size grows. Because calculating confidence intervals and margins of error involves placing n under a square root sign, the returns from increasing sample size diminish quickly. Here’s the formula for margin of error:

margin of error = z* × σ / √n

where z* is a critical value you look up in a table for a given confidence level (often 95% or 99%), σ is the standard deviation, and n is your number of observations. You see two clear things here: first, spread (the standard deviation) is super important to how confident we can be about the accuracy of an average. (Report spread when reporting an average!) Second, we get declining improvements to accuracy as we increase sample size. That means that after a point, adding hundreds more observations buys you less precision than adding 10 did at lower ns. Given the resources involved in data collection, this can make expanding sample size a low-value proposition.
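
Here’s a quick sketch of that diminishing return, using an assumed standard deviation of 15 and a 95% confidence level (both arbitrary choices for illustration):

```python
# Margin of error = z* * sigma / sqrt(n), shown for growing sample sizes.
from scipy import stats

sigma = 15                        # assumed population standard deviation (arbitrary)
z_star = stats.norm.ppf(0.975)    # two-sided 95% critical value, ~1.96

for n in [10, 100, 1_000, 10_000, 100_000]:
    moe = z_star * sigma / n**0.5
    print(f"n = {n:>6}: margin of error ≈ {moe:.2f}")
```

Going from 10 to 100 observations shrinks the margin of error by a factor of about three; going from 10,000 to 100,000 buys that same factor again at a vastly higher cost in data collection.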

Now compare a rigorously controlled study with an n = 30 which was drawn with a random sampling mechanism to, say, those surveys that ESPN.com used to run all the time. Those very often get sample sizes in the hundreds of thousands. But the sampling mechanism is a nightmare. They’re voluntary response instruments that are biased in any number of ways: underrepresenting people without internet access, people who aren’t interested in sports, people who go to SI.com instead of ESPN.com, on and on. The value of the 30 person instrument is far higher than that of the ESPN.com data. The sampling mechanism makes the sample size irrelevant.

Sample size does matter, but in common discussions of statistics its importance is misunderstood, and the value of increasing sample size declines as n grows.

2. For any reasonable definition of a sample, population size relative to sample size is irrelevant for the statistical precision of findings.

A 1,000 person sample, if drawn with some sort of rigorous random sampling mechanism, is exactly as descriptive and predictive when drawn randomly from the ~570,000 person population of Wyoming as it is when drawn randomly from the ~315 million person population of the United States. (If intended as samples of people in Wyoming and people in the United States respectively, of course.)

I have found this one very hard to wrap my mind around, but it’s the case. The formulas for margin of error, confidence intervals, and the like make no reference to the size of the total population. You can think about it this way: each additional observation you draw at random lowers the odds of your sample being unlike the population, regardless of the size of that population. The mistake lies in thinking that the point of increasing sample size is to bring it closer in proportion to the population. In reality, the point is just to increase the number of draws in order to reduce the possibility that previous draws produced statistically unlikely results. Even if you had an infinite population, every additional random draw would decrease the chance that you’ve pulled an unrepresentative sample.

The essential caveat lies in “for any reasonable definition of a sample.” Yes, testing 900 out of a population of 1,000 is more accurate than testing 900 out of a population of 1,000,000. But nobody would ever call 90% of a population a sample. You see different thresholds for where a sample ends and a census begins; some people say that anything larger than 1/100th of the total population is no longer a sample, but it varies. The point holds: when we’re dealing with real-world samples, where the population we care about is vastly larger than any reasonable sample size, the population size is irrelevant to the error calculations in our statistical inferences. This one is quite counterintuitive and took me a long time to really grasp.
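
Here’s a toy simulation of the Wyoming-versus-US point (simulated populations and a made-up trait, scaled down so it runs quickly): draw 1,000-person samples from a small population and from one a hundred times larger, and the spread of the sample means comes out essentially the same.

```python
# Sample size vs. population size: 1,000-person samples drawn from a small
# simulated population and from one 100x larger give similarly precise means.
import numpy as np

rng = np.random.default_rng(1)
small_pop = rng.normal(100, 15, 100_000)      # stand-in for a small state
large_pop = rng.normal(100, 15, 10_000_000)   # stand-in for a whole country

def sd_of_sample_means(population, n=1_000, trials=1_000):
    # Sampling with replacement here for speed; with n this much smaller than
    # the population it makes no practical difference.
    means = [rng.choice(population, size=n).mean() for _ in range(trials)]
    return np.std(means)

print("spread of sample means, small population:", round(sd_of_sample_means(small_pop), 2))
print("spread of sample means, large population:", round(sd_of_sample_means(large_pop), 2))
# Both come out near 15 / sqrt(1000) ≈ 0.47, regardless of population size.
```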

Reporting Regression Results Responsibly

We’re in a Golden Age for access to data, which unfortunately also means we’re in a Golden Age for the potential to misinterpret data. Though the absurdity of gated academic journals persists, academic research is more accessible now than ever before. We’ve also seen a rapid growth in the use of arguments based on statistics in the popular media in the last several years. This is potentially a real boon to our ability to understand the world around us, but it carries with it all of the potential for misleading statistical arguments.

My request is pretty simple. All statistical techniques, particularly the basic parametric techniques most likely to show up in data journalism, require satisfying assumptions and checking diagnostic measures to ensure that hidden bias isn’t misleading us. Many of these assumptions and diagnostics are ultimately judgment calls, relying on practitioners to make informed decisions about what degree of wiggle room is appropriate given the research scenario. There are, however, conventions and implied standards that people can use to guide their decisions. The most important and useful kind of check, though, is the eyes of other researchers. Given that hosting graphs, tables, and similar kinds of data online is simple and nearly free, I think that researchers and data journalists alike should provide links to their data and to the graphs and tables they use to check assumptions and diagnostic measures. In the digital era, it’s crazy that this is still a rare practice. I don’t expect to find these graphs and tables sitting square in the center of a blog post, and I expect that 90% of readers wouldn’t bother to look. But there’s nothing to risk in having them available, and transparency, accountability, and collaboration to gain.

That’s the simple part, and you can feel free to close the tab now. For a little more:

What kind of assumptions and diagnostics am I talking about? Let’s consider one of the most common parametric methods, linear regression. Whether we have a single predictor for simple linear regression or several predictors for multiple regression, fundamentally regression is a matter of assessing the relationship between quantitative (continuous) predictor variables and a quantitative (continuous) outcome variable. For example, we might ask how well SAT scores predict college GPA; we might ask how well age, weight, and height predict blood pressure. When someone talks about how one number predicts another, the strength of their relationship, and how we might attempt to change one by changing the other, they’re probably making an appeal to regression.

The types of regression analysis, and the issues therein, are vast, and there are many technical issues at play that I’ll never understand. But I think it’s worthwhile to talk about some of the assumptions we need to check and some problems we have to look out for. Regression has come in for a fair amount of abuse lately from sticklers and skeptics, and not for no reason; it’s easy to use the techniques irresponsibly. But we’re inevitably going to ask basic questions of how X and Y predict Z, so I think we should expand public literacy about these things. I want to talk a little bit about these issues not because I think I’m qualified to teach statistics to others, or because regression is the only statistical process that we need to see assumptions and diagnostics for. Rather, I think regression is an illustrative example through which to explore why we need to check this stuff, to talk about both the power and pitfalls of public engagement with data.

There are four assumptions that need to hold for a linear (least squares) regression to be valid: independence of observations, linearity, constancy of variance, and normality. (Some purists add a fifth, existence, which, whatever.)

Independence of Observations

This is the biggie, and it’s why doing good research can be so hard and expensive. It’s the necessary assumption that one observation does not affect another. This is the assumption that requires randomness. Remember that in statistics error, or necessary and expected variation, is inevitable, but bias, or the systematic influence on observations, is lethal.

Suppose you want to see how eating ice cream affects blood sugar level. You gather 100 students into the gym and have them all eat ice cream. You then go one by one through the students and give them a blood test. You dutifully record everyone’s values. When you get back to the lab, you find that your data does not match that of much of the established research literature. Confused, you check your data again. You use your spreadsheet software to arrange the cells by blood sugar. You find a remarkably steady progression of results running higher to lower. Then it hits you: it took you several hours to test the 100 students. The highest readings are all from the students who were first to be tested, the lowest from those who were tested last. Your data was corrupted by an uncontrolled variable, time-after-eating-to-test. Your observations were not truly independent of each other – one observation influenced another because taking one delayed taking the other. This is an example that you’d hope most people would avoid, but the history of research is the history of people making oversights that were, in hindsight, quite obvious.

Independence is scary because threats to it so often lurk out of sight. And the presumption of independence often prohibits certain kinds of analysis that we might find natural. For example, think of assigning control and test conditions to classes rather than individual students in educational research. This is often the only practical way to do it; you can’t fairly ask teachers to teach half their students with one technique and the other half with another. You give one set of randomly-assigned classes a new pedagogical technique, while using the old standard with your control classes. You give a pre- and post-test to both and pop both sets of results into an ANOVA. You’ve just violated the assumption of independence. We know that there are clustering effects of children within classrooms; that is, their results are not entirely independent of each other. We can correct for this sort of thing using techniques like hierarchical modeling, but first we have to recognize that those dangers exist!

Independence is the assumption that is least subject to statistical correction. It’s also the assumption that is the hardest to check just by looking at graphs. Confidence in independence stems mostly from rigorous and careful experimental design. You can check a graph of your observations (your actual data points) against your residuals (the distance between your observed values and the linear progression from your model), which can sometimes provide clues. But ultimately, you’ve just got to know your data was collected appropriately. On this one, we’re largely on our own. However, I think it’s a good idea for academic researchers to provide online access to a Residuals vs. Observations graph when they run a regression. This is very rare, currently.

Here’s a Residuals vs. Observations graph I pulled off of Google Images. This is what we want to see: snow. Clear nonrandom patterns in this plot are bad.
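
If you want to produce one of these yourself, here’s a minimal sketch on simulated data (the variable names are just placeholders, and I’m plotting residuals against fitted values, which works the same way for this purpose):

```python
# Fit a simple regression on simulated data and plot residuals vs. fitted values.
# With well-behaved data the plot should look like "snow": no visible pattern.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 200)
y = 3 + 2 * x + rng.normal(0, 2, 200)   # linear relationship plus noise

model = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted: we want snow, not structure")
plt.show()
```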

Linearity

The name of the technique is linear regression, which means that observed relationships should be roughly linear to be valid. In other words, you want your relationship to fall along a more or less linear path as you move across the x axis; the relationship can be weaker or stronger, but you want it to hold more or less consistently as you move along the line. This is particularly important because curvilinear relationships can appear to a regression analysis to be no relationship at all. Regression is all about interpolation: if I check my data and find a strong linear relationship, and my data has a range from A to B, I should be able to check any x value between A and B and have a pretty good prediction for y. (What “pretty good” means in practice is a matter of residuals and r-squared, or the portion of the variance in y that’s explained by my xs.) If my relationship isn’t linear, my confidence in that prediction is unfounded.

Take a look at these scatter plots. Both show close to zero linear relationship according to Pearson’s product-moment coefficient:

And yet clearly, there’s something very different going on from one plot to the next. The first is true random variance; there is no consistent relationship between our x and y variables. The second is a very clear association; it’s just not a linear relationship. The degree and direction of y varying along x changes over different values for x. Failure to recognize that non-linear relationship could compel us to think that there is no relationship at all. If the violation of linearity is as clear and consistent as in this scatter plot, it can be cleaned up fairly easily by transforming the data.
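
You can reproduce that second plot’s situation numerically. Here’s a sketch with made-up data where a perfectly systematic curved relationship still yields a Pearson correlation near zero:

```python
# A strong but non-linear (quadratic) relationship can produce a near-zero
# Pearson correlation, which is why checking linearity matters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 300)
y_random = rng.normal(0, 1, 300)            # true noise: no relationship at all
y_curved = x**2 + rng.normal(0, 0.3, 300)   # clear relationship, just not linear

print("Pearson r, pure noise:        %.3f" % stats.pearsonr(x, y_random)[0])
print("Pearson r, quadratic pattern: %.3f" % stats.pearsonr(x, y_curved)[0])
# Both are near zero, but only the first reflects a genuine absence of association.
```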

Regression is fairly robust to violations of linearity, and it’s worth noting that any relationship whose correlation is meaningfully below 1 won’t be perfectly linear in the strict sense. But clear, consistent curves in the data can invalidate our regression analyses.

Readers could check data for linearity if scatter plots were posted for simple linear regression. For multiple regression, it’s a bit messier; you could plot every individual predictor, but I would be satisfied if you just mentioned that you checked linearity.

Constancy of variance

Also known by one of my very favorite ten-cent words, homoscedasticity. Constancy of variance means that, across your range of x predictors, your y varies by about the same amount; it has as much spread, as much error, at one end of the range as at the other. Remember, when I’m doing inferential statistics, I’m sampling, and sampling means sampling error – even if I’m getting quality results, I’m inevitably going to get differences in my data from one collection of samples to the next. But if our assumptions hold, we can trust that those samples will vary in predictable intervals relative to the true mean. That is, if an SAT score predicts freshman year GPA with a certain degree of consistency for students scoring 400, it should be about as consistent for students scoring 800, 1200, and 1600, even though we know that from one data set to the next we’re not going to get the exact same values even if all of the variables of interest stay the same. We just need to know that the degree to which they vary for a given x is constant over our range.

Why is this important? Think again about interpolation. I run a regression because I want to understand a relationship between various quantitative variables, and often because I want to use my predictor variables to… predict. Regression is useful insofar as I can move along the axes of my x values and produce a meaningful, subject-to-error-but-still-useful value for y. Violating the assumption of constant variance means that you can’t predict y with equal confidence as you move around x(s); the relationship is stronger at some points than others, making you vulnerable to inaccurate predictions.

Here’s a residuals plot showing the dreaded megaphone effect: the error (size of residuals, difference between observations and results expected from the regression equation) increases as we move from low to high values of x. The relationship is strong at low values of x and much weaker at high values.

Readers could check homoscedasticity if residual plots were made available. Violations of constant variance can often be fixed via transformation, although it may be easier to use techniques that are inherently more robust to this violation, such as quantile regression.
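
Here’s a sketch of how you might both simulate that megaphone pattern and test for it formally. The Breusch-Pagan test is one common check – my choice of example, not anything prescribed above:

```python
# Simulate heteroscedastic data (error grows with x) and run a Breusch-Pagan test.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 300)
y = 2 + 3 * x + rng.normal(0, 0.5 * x, 300)   # noise scales with x: the "megaphone"

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # small p-value flags non-constant variance
```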

Normality

The concept of the normal distribution is at once simple and counterintuitive, and I’ve spent a lot of my walks home trying to think of the best way to explain it. The “parametric” in parametric statistics refers to the assumption that there is a given underlying distribution for the data we observe, and frequently that distribution is the normal distribution, or bell curve. Think of yourself walking down the street and noticing that someone is unusually tall or unusually short. The fact that you notice is itself a consequence of the normal distribution. When we think of someone as unusually tall or short, we are implicitly assuming that we will find fewer and fewer people as we move further out along the extremes of the height distribution. If you see a man in North America who is 5’10”, he is above average height, but you wouldn’t bat an eye; if you see a man who is 6’3”, you might think to yourself, that’s a tall guy; when you see someone who is 6’9”, you say, wow, he is tall!; and when you see a 7 footer, you take out your cell phone. This is the central meaning of the normal distribution: values near the average are more likely to occur than extreme values, and the relationship between position on the distribution and probability of occurrence is predictable.

Not everything in life is normally distributed. Poll 1,000 people and ask how much money they received in car insurance payments last year and it won’t look normal. But a remarkable number of naturally occurring phenomena are normally distributed, simply thanks to the reality of numbers and extremes, and the central limit theorem teaches us that sample averages are themselves (approximately) normally distributed. (That is, if I take a 100 person sample of a population for a given quantitative trait, I will get a mean; if I take another 100 person sample, I will get a similar but not identical mean, and so on. If I plot those means, they will be normally distributed even if the overall distribution is not.)
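
That parenthetical is easy to verify with a quick simulation (a skewed, made-up distribution of my own choosing):

```python
# Central limit theorem sketch: means of repeated samples from a skewed
# distribution pile up into a roughly normal shape.
import numpy as np

rng = np.random.default_rng(11)
population = rng.exponential(scale=10, size=1_000_000)   # decidedly non-normal

sample_means = [rng.choice(population, size=100).mean() for _ in range(5_000)]

print("population skew is obvious:   mean %.1f, median %.1f" %
      (population.mean(), np.median(population)))
print("sample means cluster tightly: mean %.1f, SD %.2f" %
      (np.mean(sample_means), np.std(sample_means)))
# Histogram the sample_means and you'll see the familiar bell shape.
```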

The assumption of normality in regression is, strictly speaking, about the errors: the residuals should be roughly normally distributed, so that we know the relative frequency of extreme observations compared to observations close to the mean. Regression is fairly robust to violations of this assumption, and you’re never going to have perfectly normal data, but too strong a violation will invalidate your analysis. We check normality with what’s called a Q-Q plot. Here’s an almost-perfect one, again scraped from Google Images:

That strongly linear, nearly 45 degree angle is just what we want to see. Here’s a bad one, demonstrating the “fat tails” phenomenon – that is, too many observations clustered at the extremes relative to the mean:

Usually the rule is that unless you’ve got a really clear break from a straightish 45 degree angle, you’re probably alright. When the going gets tough, seek help from a statistician.
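
Here’s a sketch of how you’d draw a pair of these yourself, again on simulated residuals:

```python
# Q-Q plots for regression residuals: one well-behaved, one heavy-tailed.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(13)
normal_resid = rng.normal(0, 1, 500)                 # what healthy residuals look like
fat_tailed_resid = rng.standard_t(df=2, size=500)    # heavy tails, a "bad" Q-Q plot

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(normal_resid, dist="norm", plot=axes[0])
axes[0].set_title("Roughly normal: points hug the 45-degree line")
stats.probplot(fat_tailed_resid, dist="norm", plot=axes[1])
axes[1].set_title("Fat tails: ends peel away from the line")
plt.tight_layout()
plt.show()
```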

Diagnostics

OK, so 2,000 words into this thing, we’ve checked our four assumptions. Are we good? Well, not so fast. We need to check a few diagnostic measures, or what my stats instructor used to call “the laundry list.” This is a matter of investigating influence. When we run an analysis like regression, we’re banking on the aggregate power of all of our observations to help us make responsible observations and inferences. We never want to rely too heavily on an individual observation or a small number of observations, because that increases the influence of error in our analysis. Diagnostic measures in regression typically involve using statistical procedures to look for influential observations that have too much sway over our analysis.

The first thing to say about outliers is that you want a systematic reason for eliminating them. There are entire books about the identification and elimination of outliers, and I’m not qualified to say what the best method is in any given situation. But you never want to toss an observation simply because it would help your analysis. When you’ve got that one data point that’s dragging your line out of significance, it’s tempting to get rid of it, but you want to analyze that observation for a methodology-internal justification for eliminating it. On the other hand, sometimes you have the opposite situation: your purported effect is really the product of a single or small number of influential outliers that have dragged the line in your favor (that is, to a p-value you like). Then, of course, the temptation is simply to not mention the outlier and publish anyway. Especially if a tenure review is in your future…

Some examples of influence diagnostics in regression include leverage, which flags outliers in your predictors that have a great deal of influence on your overall model; Cook’s Distance, which tells you how different your model would be if you deleted a given observation; DFBETAS, which tells you how a given observation influences a particular parameter estimate; and more. Most modern statistical packages like SAS or R have commands for checking diagnostic measures like these. While offering the numbers would be nice, I would mostly like it if researchers reassured readers that they had run diagnostic measures for regression and found acceptable results. Just let me know: I looked for outliers and influential observations and things came back fairly clean.
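
In Python, for instance, those same checks are only a few lines. This is a sketch on simulated data with one deliberately planted bad point; statsmodels is my choice of tool here, not something the studies I’m describing necessarily use.

```python
# Leverage, Cook's Distance, and DFBETAS for a simple regression via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(17)
x = rng.uniform(0, 10, 100)
y = 1 + 2 * x + rng.normal(0, 1, 100)
x[0], y[0] = 25, 5        # plant one wildly influential point

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]
dfbetas = influence.dfbetas

print("largest leverage:       %.3f (observation %d)" % (leverage.max(), leverage.argmax()))
print("largest Cook's D:       %.3f (observation %d)" % (cooks_d.max(), cooks_d.argmax()))
print("largest |DFBETA| slope: %.3f (observation %d)" %
      (np.abs(dfbetas[:, 1]).max(), np.abs(dfbetas[:, 1]).argmax()))
```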

*****

Regression is just one part of a large number of techniques and applications being used in data journalism right now. But essentially any statistical technique is going to involve checking assumptions and diagnostic measures. A typical ANOVA, for example – the equivalent of regression for categorical predictors – will involve checking some of the same assumptions. In the era of the internet, there is no reason not to provide a link to a brief, simple rundown of what quality controls were pursued in your analysis.

None of these things are foolproof. Sums of squares are spooky things; we get weird results as we add and remove predictors from our models. Individual predictors are strongly significant by themselves but not when added together; models are significant with no individual predictors significant; individual predictors are highly significant without model significance; the order you put your predictors in changes everything; and so on. It’s fascinating and complicated. We’re always at the mercy of how responsible and careful researchers are. But by sharing information, we raise the odds that what we’re looking at is a real effect.

This might all sound like an impossibly high bar to clear. There are so many ways things can go wrong. And it’s true that, in general, I worry that people today are too credulous towards statistical arguments, which are often advanced without sufficient qualifications. There are some questions where statistics more often mislead than illuminate. But there is a lot we can and do know. We know that age is highly predictive of height in children but not in adults; we know that there is a relationship between SAT scores and freshman year GPA; we know point differential is a better predictor of a team’s future win-loss record than its past win-loss record. We can learn lots of things, but we always do it better together. So I think that academic researchers and data journalists should share their work to a greater degree than they do now. That requires a certain compromise. After all, it’s scary to have tons of strangers looking over your shoulder. So I propose that we get more skeptical and critical of statistical arguments as a media and readership, but more forgiving of individual researchers who are, after all, only human. That strikes me as a good bargain.

And one I’m willing to make myself, so please email me to point out the mistakes I’ve inevitably made in this post.

correlation: neither everything nor nothing

via Overthinking

One thing that everyone on the internet knows about statistics is this: correlation does not imply causation. It’s a stock phrase, a bauble constantly polished and passed around in internet debate. And it’s not wrong, at least not on its face. But I worry that the denial of the importance of correlation is a bigger impediment to human knowledge and understanding than belief in specious relationships between correlation and causation.

First, you should read two pieces on the “correlation does not imply causation” phenomenon, which has gone from a somewhat arcane notion common to research methods classes to a full-fledged meme. This piece by Greg Laden is absolute required reading on correlation and causation and how to think about both. Second, this piece by Daniel Engber does good work talking about how “correlation does not imply causation” became an overused and unhelpful piece of internet lingo.

As Laden points out, the question is really this: what does “imply” mean? The people who employ “correlation does not imply causation” as a kind of argumentative trump card are typically using “imply” as if it were synonymous with “prove.” That’s pretty far from what we usually mean by “implies”! In fact, using the typical meaning of implication, correlation sometimes does imply causation, in the sense that it provides evidence for a causal relationship. In careful, rigorously conducted research, a strong correlation can offer some evidence of causation, if that correlation is embedded in a theoretical argument for how that causative relationship works. If nothing else, correlation is often the first stage in identifying relationships of interest that we might then investigate in more rigorous ways, if we can.

A few things I’d like people to think about.

There are specific reasons that an assertion of causation from correlational data might be incorrect. There is a vast literature on research methodology, across just about every research field you can imagine. Correlation-causation fallacies have been investigated and understood for a long time. Among the potential dangers is the confounding variable, where an unseen third variable drives the change in two other variables, making them appear to influence one another. This gives us the famous drownings-and-ice-cream correlation – as drownings go up, so do ice cream sales. The confounding variable, of course, is temperature. There are all sorts of nasty little interpretation problems in the literature. These dangers are real. But in order to have understanding, we have to actually investigate why a particular relationship is spurious. Just saying “correlation does not imply causation” doesn’t do anything to actually improve our understanding. Explore why, if you want to be useful. Use the phrase as the beginning of a conversation, not a talisman.
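
Here’s a toy version of the drownings-and-ice-cream example in code (entirely invented numbers): both series are driven by simulated temperature, they correlate strongly with each other, and the apparent relationship collapses once you hold temperature fixed.

```python
# Confounding sketch: drownings and ice cream sales both driven by temperature.
import numpy as np
from scipy import stats

rng = np.random.default_rng(21)
days = 365
temperature = rng.normal(15, 10, days)                    # the confounder
ice_cream = 50 + 3 * temperature + rng.normal(0, 10, days)
drownings = 2 + 0.2 * temperature + rng.normal(0, 1, days)

print("raw correlation, ice cream vs. drownings: %.2f"
      % stats.pearsonr(ice_cream, drownings)[0])

# Crude way to "control" for the confounder: correlate the residuals after
# regressing each variable on temperature (a partial correlation, by hand).
resid_ice = ice_cream - np.poly1d(np.polyfit(temperature, ice_cream, 1))(temperature)
resid_drown = drownings - np.poly1d(np.polyfit(temperature, drownings, 1))(temperature)
print("partial correlation, controlling for temperature: %.2f"
      % stats.pearsonr(resid_ice, resid_drown)[0])
```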

Correlation evidence can be essential when it is difficult or impossible to investigate a causative mechanism. Cigarette smoking causes cancer. We know that. We know it because many, many rigorous and careful studies have established that connection. It might surprise you to know that the large majority of our evidence demonstrating that relationship comes from correlational studies, rather than experiments. Why? Well, as my statistics instructor used to say – here, let’s prove cigarette smoking causes cancer. We’ll round up some infants, and we’ll divide them into experimental and control groups, and we’ll expose the experimental group to tobacco smoke, and in a few years, we’ll have proven a causal relationship. Sound like a good idea to you? Me neither. We knew that cigarettes were contributing to lung cancer long before we identified what was actually happening in the human body, and we have correlational studies to thank for that. Blinded randomized controlled experiments are the gold standard, but they are rare precisely because they are hard, sometimes impossible. To refuse to take anything else as meaningful evidence is nihilism, not skepticism.

Sometimes what we care about is association. Consider relationships which we believe to be strong but in which we are unlikely to ever identify a specific causal mechanism. I have on my desk a raft of research showing a strong correlation between parental income and student performance on various educational metrics. It’s a relationship we find in a variety of locations, across a variety of ages, and through a variety of different research contexts. This is important research, it has stakes; it helps us to understand the power of structural advantage and contributes to political critique of our supposedly meritocratic social systems.

Suppose I were prohibited from asserting that this correlation proved anything because I couldn’t prove causation. My question is this: how could I ever find a specific causal mechanism? The relationship is likely very complex, and in some cases not subject to external observation by researchers at all. To refuse to consider this relationship in our knowledge making or our policy decisions because of an overly skeptical attitude towards correlational data would be profoundly misguided. Of course there are limitations and restrictions we need to keep in mind – the relationship is consistent but not universal, its effect differs across the income scale, it varies with a variety of factors. It’s not a complete or simple story. But I’m still perfectly willing to say that poverty is associated with poor educational performance. That’s the only reasonable conclusion from the data. That association matters, even if we can’t find a specific causal mechanism.

Correlation is a statistical relationship. Causation is a judgment call. I frequently find that people seem to believe that there is some sort of mathematical proof of causation that a high correlation does not merit, some number that can be spit out by statistical packages that says “here’s causation.” But causation is always a matter of the informed judgment of the research community. Controlled experiments are the gold standard in that regard, but there are controlled experiments that can’t prove causation and other research methods that have established causation to the satisfaction of most members of a discipline.

Human beings have the benefit of human reasoning. One of my frustrations with the “correlation does not imply causation” line is that it’s often deployed in instances where no one is asserting that we’ve adequately proved causation. I sometimes feel as though people are trying to protect us from mistakes of reasoning that no one would actually fall victim to. In an (overall excellent) piece for the Times, Gary Marcus and Ernest Davis write, “A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two.” That’s true – it is hard to imagine! So hard to imagine that I don’t think anyone would have that problem. I get the point that it’s a deliberately exaggerated example, and I also fully recognize that there are some correlation-causation assumptions that are tempting but wrong. But I think that, when people state the dangers of drawing specious relationships, they sometimes act as if we’re all dummies. No one will look at these correlations and think they’re describing real causal relationships because no one is that senseless. So why are we so afraid of that potential bad reasoning?

Those disagreeing with conclusions drawn from correlational data have a burden of proof too. This is the thing, for me, more than anything. It’s fine to dispute a suggestion of causation drawn from correlation data. Just recognize that you have to actually make the case. Different people can have responsible, reasonable disagreements about statistical inferences. Both sides have to present evidence and make a rational argument drawn from theory. “Correlation does not imply causation” is the beginning of discussion, not the end.

I consider myself on the skeptical side when it comes to Big Data, at least in certain applications. As someone who is frequently frustrated by hype and woowoo, I’m firmly in the camp that says we need skepticism ingrained in how we think and write about statistical inquiry. I personally do think that many of the claims about Big Data applications are overblown, and I also think that the notion that we’ll ever be post-theory or purely empirical is dangerously misguided. But there’s no need to throw the baby out with the bathwater. While we should maintain a healthy criticism of them, new ventures dedicated to researched, data-driven writing should be greeted as a welcome development. What we need, I think, is to contribute to a communal understanding of research methods and statistics, including healthy skepticism, and there’s reason for optimism in that regard. Reasonable skepticism, not unthinking rejection; a critical utilization, not a thoughtless embrace.


 

norm referencing, criterion referencing, and ed policy

I want to talk a bit about a distinction between different types of educational testing/assessment and how they interface with some basic questions we have about education policy. The two concepts are norm referencing and criterion referencing.

Criterion Referencing

Why do we perform tests? What is their purpose? One common reason is to ensure that people are able to perform some sort of essential task. Take a driver’s test. The point is to make sure that people who are on the roads possess certain minimal abilities to safely pilot a car, based on social standards of competence that are written into policy. While we might understand that some people are better or worse drivers than others, we’re not really interested in using a driver’s test to say who is adequate, who is good, and who is excellent. Rather we just want to know: do you meet this minimal threshold? The name for that kind of test is a criterion referenced test. We have some criterion (or criteria) and we check to see whether the people taking the test fulfill it. Sometimes we want these tests to be fairly generous; society couldn’t function, in many locales, if a majority of competent adults couldn’t pass a driver’s test. On the other hand, we probably want the benchmarks for, say, running a nuclear reactor to be fairly strict; here the social costs of a test that was too lenient would be higher than those of one that was too strict. In either case, though, our interest is not in discriminating between different individuals to a fine level of gradation, particularly for those who are clearly good enough or clearly not. Rather we just want to know: is the test taker competent to perform the real-world task?

Norm Referencing

Criterion referencing depends on, well, the existence of a criterion. That is, there has to be some sort of benchmark or goal that the test taker will either reach or not. What would it mean to have a criterion referenced test for, say, college readiness? We can certainly imagine a set benchmark for being prepared for college, and we’d probably like to think that there’s some minimal level of preparation that’s required for any college-bound student. But in a broader sense we are probably aware that there is no one set criterion that would work given the large range of schools and students that “college readiness” reflects. What’s more, we also know that colleges are profoundly interested in relative readiness; elite colleges spend vast amounts of money attracting the most highly-qualified students to campus.

For that we need tests like the SAT and ACT, which are oriented not around fulfilling a given criterion but around placing test-takers on a scale and discriminating between different students in exacting detail. We need, that is, norm referenced tests. When we say “norm” here we mean in comparison to others – to an average, to quintiles, and in particular to the normal distribution. I don’t want to get too into the weeds on that big topic for those who aren’t already versed in it. Suffice to say for now that the normal distribution is a very common distribution of observed values for all sorts of naturally occurring phenomena that have a finite range (that is, a beginning and an ending) and which are affected by multiple variables. The ideal normal distribution looks like this:

The big center line in there is the mean, median, and mode – that is, the arithmetic average, the line that divides one half of the data from the other, and the observation that occurs most often. As you can see, in a true normal distribution the observations fall in very particular patterns relative to the average. In particular, the further we move away from the average, the less likely we are to find observations, again in predictable quantitative relationships. When we talk about something changing relative to a standard deviation, using that statistic (actually a measure of spread) as a measure of distance or extremity, we’re doing so in relation to the normal distribution. Tests like the SAT, ACT, GRE, IQ tests, and all manner of other tests used for the purpose of screening applicants for some finite number of slots use norm referencing.

Why? Again, think about what we’re trying to accomplish with norm referencing. I want to give a test that lets me say that test taker X did better than test taker Y but worse than test taker Z. But we also want to be able to say where each of them falls relative to the mass of data. Norm referencing allows us to make meaningful statements about how someone will perform relative to others, and this in turn gives us information about how to, for example, select people for our scarce admissions slots at an exclusive college.

As Glenn Fulcher puts it in his (excellent) book Practical Language Testing:

As we move away from the mean the scores in the distribution become more extreme, and so less common. It is very rare for test takers to get all items correct, just as it is very rare for them to get all items incorrect. But there are a few in every large group who do exceptionally well, or exceptionally poorly. The curve of normal distribution tells us what the probability is that a test taker could have got the score they have, given the place of the score in a particular distribution. And this is why we can say that a score is ‘exceptional’ or ‘in the top 10 per cent’, or ‘just a little better than average’.

(Incidentally, it’s a very common feature of various types of educational, intelligence, and language testing that scores become less meaningful as they move towards the extremes. That is, a 10 point difference on a well-validated IQ test means a lot when it comes to the difference between a 95 and a 105, but it means much less when it comes to a difference between 25 and 35 or 165 and 175. Why? In part because outliers are rare, by their nature, which means we have less data to validate that range of our scale. Also, practically speaking, there are floors and ceilings. Someone who gets a 20 on the TOEFL iBT and someone who gets a 30 share the most important thing: they’re functionally unable to communicate in English. This is also why you shouldn’t trust anyone who tells you they have an IQ over, say, 150 or so. The scale just doesn’t mean anything up that high.)
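
To make the “top 10 per cent” idea concrete, here’s a small sketch of how a norm referenced score gets translated into a percentile, using an SAT-like scale with an assumed mean of 500 and standard deviation of 100 (illustrative numbers, not actual published norms):

```python
# Turning a norm referenced score into a percentile via the normal distribution.
from scipy import stats

mean, sd = 500, 100          # assumed section mean and SD, for illustration only

for score in [400, 500, 600, 700, 800]:
    z = (score - mean) / sd                      # distance from the mean in SDs
    percentile = stats.norm.cdf(z) * 100         # share of test takers at or below
    print(f"score {score}: z = {z:+.1f}, roughly the {percentile:.0f}th percentile")
```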

How do we get these pretty normal distributions? That’s the work of test development, and part of why it can be a seriously difficult and expensive undertaking. The nature of numbers (and the central limit theorem) help, but ultimately the big testing companies have to spend a ton of time and money getting a distribution as close to normal as possible – and whatever else their flaws, organizations like ETS do that very well. Either way, it’s essential to say that the normal distribution does not arrange itself like magic in tests. It has to be produced with careful work.

The Question of Grades

Thinking about these two paradigms can help us think through some questions. Here’s a simple one: are grades norm referenced or criterion referenced? The answer is surely both and neither, but I think it’s useful to consider the dynamics here for a minute. In one sense, grades are clearly criterion referenced, in that they are meant to reflect a given student’s mastery of the given course’s subject matter. And we aren’t likely to say that only X% of students in a class should pass or fail according to a model distribution; rather, we think we should pass everyone who demonstrates the knowledge, skills, and competencies a class is designed to instill. And we have little reason to think that grades are actually normally distributed between students in the average class.

Yet, in another sense, we clearly think of grades as norm referenced. When we talk about grade inflation, we are often speaking in terms of norm referencing, complaining that we’re losing the ability to discriminate between different levels of ability. Grades are used to compare students from remarkably different educational contexts in the college admissions process. And sometimes we “curve a test,” meaning we adjust scores to better match the normal distribution – which does not, contrary to undergraduate presumption, necessarily help the people who took the test. There’s an intuitive sense that grades should follow a fairness distribution that, while not centered on the natural mean (you wouldn’t want the average grade to be a 50, I trust), still essentially replicates normality in that both very high and very low grades should be rare. In practice, this is not often the case. A’s, in my experience, outnumber F’s.

In grad school a perpetual concern was the very high average grade in freshman composition, something like a B+. And the downward pressure on that average was largely the product of students who stopped coming to class and thus got F’s, meaning that the average grade of students who actually completed the course was probably very high indeed. Is this a problem? I guess it depends on your point of view. If we were trying to meaningfully discriminate between different students based on their freshman comp grades, then certainly – but I’m not sure we’d want to do that. On the other hand, it might be that freshman comp is a class most students might naturally be expected to do well in; it isn’t written anywhere that the criterion for success has to be set at a particularly high level. Certainly the major departments wouldn’t like too many students failing that Gen Ed, given that those students would then have to fill valuable schedule space retaking it. The question, though, is whether those grades actually reflect a meaningful meeting of the given benchmarks – fulfilling the criterion. I’m of the opinion that the answer is often “no” when it comes to freshman comp, meaning that the average grades are probably too high no matter what.

I don’t have any grand insight here, and I think most people are able to meaningfully think about grades in a way that reflects both norm-referenced and criterion-referenced interests. But I do think that these dynamics are important to think about. As I’ve been saying lately, I think that there are some basic aspects of education and education policy that we simply haven’t thought through adequately, and we all could benefit from going back to the basics and pulling apart what we think we want.

selection bias: a parable

To return to this blog’s favorite theme, how selection bias dictates educational outcomes, let me tell you a true story about a real public university in a large public system. (Not my school, for the record.) Let’s call it University X. University X, like most, has an Education department. Like at most universities, University X’s Education majors largely plan on becoming teachers in the public school system of that state. Unfortunately, like at many schools across the country, University X’s Education department has poor metrics – low graduation rate, discouraging post-graduation employment and income measures, poor performance on assessments of student learning relative to same-school peers. (The reasons for this dynamic are complex and outside of the scope of this post.)

The biggest problem, though, is this: far too many of University X’s graduates from the Education program go on to fail the test that the state mandates as part of teacher accreditation. This means that they are effectively barred from their chosen profession. Worse, going to another state to work as a public school teacher is often not feasible, as accreditation standards will sometimes require them to take a year or more of classes in that state’s system post-graduation to become accredited. So you end up with a lot of students holding degrees designed for a profession they can’t enter, and eventually the powers that be looked askance.

So what did University X’s Education department do? Their move was ingenious, really: they required students applying to the major to take a practice version of the accreditation test, one built from real questions from the real test and otherwise identical to the real thing. They then accepted into the major only those students who scored above benchmarks on that practice test. And the problem, within a few years, was solved! They saw their graduates succeed on the accreditation test at vastly higher rates. Unfortunately, they also saw their number of majors decline precipitously, which in turn put them in a tough spot with the administration of University X, who used declining enrollment as a reason to cut their funding.

Now here’s the question for all of you: was introducing the screening mechanism of the practice accreditation test a cynical ploy to artificially boost their metrics? Or by preventing students from entering a program designed to lead to a job they ultimately couldn’t get, were they doing the only moral thing?