lots of fields have p-value problems, not just psychology

You have likely heard of the ongoing replication crisis, in which past research findings cannot be reproduced by other researchers using the same methods. The issue typically lies with the p-value, an essential but limited statistic that we use to establish statistical significance. (There are replication problems beyond the p-value, but it's the one you read about the most.) You can read about p-value here and the replication crisis here.

These problems are often associated with the social sciences in general and the fields of psychology and education specifically. This is largely due to the inherent complexities of human-subject research, which typically involves many variables that researchers cannot control; the frequent inability to perform true experimental studies with randomized control groups, due to practical or ethical limitations; and the relatively high alpha thresholds these fields use, typically .05, which are necessary because the effects studied in the social sciences are often weak compared to those in the natural or applied sciences.
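To make that .05 threshold concrete, here's a minimal simulation, a sketch in standard NumPy and SciPy with arbitrary study sizes of my choosing: test enough truly null effects at alpha = .05 and roughly one in twenty comes out "significant" anyway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate 1,000 studies of a null effect: two groups of 30 drawn
# from the SAME distribution, compared with a two-sample t-test.
n_studies, n_per_group, alpha = 1000, 30, 0.05
false_positives = 0
for _ in range(n_studies):
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)  # no true difference exists
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

# By construction, about 5% of these null studies come out "significant."
print(f"'Significant' null results: {false_positives / n_studies:.1%}")
```

Multiply that across thousands of labs, add journals that mostly publish the significant one in twenty, and you have the raw material of a replication crisis.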

However, it is important to be clear that the p-value problem exists in all manner of fields, including some of the "hardest" scientific disciplines. In a 2016 story for Slate, Daniel Engber writes that "much of cancer research in the lab—maybe even most of it—simply can't be trusted. The data are corrupt. The findings are unstable. The science doesn't work," because of p-value and associated problems. In a 2016 article for the Proceedings of the National Academy of Sciences, Eklund, Nichols, and Knutsson found that inferences drawn from fMRI brain imaging are frequently invalid, echoing concerns voiced in a 2016 eNeuro article by Katherine S. Button about replication problems across the biomedical sciences. A 2016 paper by Eric Turkheimer, an expert in the genetic heritability of behavioral traits, discussed the ways that even replicable weak associations between genes and behavior prevent researchers from drawing meaningful conclusions about that relationship. And in a 2014 article for Science, Erik Stokstad expressed concern that the ecology literature was more and more likely to report p-values, that the effects being reported were growing weaker and weaker, and that p-values were not adequately contextualized through reference to other statistics.

Clearly, we can't reassure ourselves that p-value problems are found only in the "soft" sciences. There is a far broader problem with basic approaches to statistical inference, one that affects a large number of fields. The implications of this are complex; as I have said and will say again, research nihilism is not the answer. But neither is laughing this off as a problem inherent only to the "soft" sciences. More here.

journalists, beware the gambler’s fallacy

One persistent way that human beings misrepresent the world is through the gambler's fallacy, and a kind of implied gambler's fallacy works its way into journalism quite often. It's hugely important to anyone who cares about research, and about journalism about research.

The gambler's fallacy is expecting a certain periodicity in outcomes when you have no reason to expect it. That is, you look at events that happened in the recent past and say, "that is an unusually high/low number of times for that event to happen, so what follows will be an unusually low/high number of times for it to happen." The classic case is roulette: you're walking along the casino floor, and you see the electronic sign showing that a roulette table has hit black 10 times in a row. You know the odds of this are very small, so you rush over to place a bet on red. But of course that's not justified: the table doesn't "know" it has come up black 10 times in a row. You've still got the same (bad) odds of hitting red on an American wheel, 47.4%. You're still playing with the same house edge. A coin that's just come up heads 50 times in a row has the same odds of coming up heads again as tails. The expectation that non-periodic random events are governed by some sort of god of reciprocal probabilities is the source of tons of bad human reasoning – and journalism is absolutely stuffed with it. You see it any time people point out that a particular event hasn't happened in a long time and conclude that we therefore face an increased chance of it happening in the future.
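If the arithmetic doesn't convince you, brute force will. Here's a quick simulation sketch (plain NumPy; the ten-spin streak and the total number of spins are my arbitrary choices): the chance of red immediately after ten blacks is the same 18/38 ≈ 47.4% as on any other spin.

```python
import numpy as np

rng = np.random.default_rng(0)

# American roulette: pockets 0-17 black, 18-35 red, 36-37 green.
spins = rng.integers(0, 38, size=10_000_000)
is_black = spins < 18
is_red = (spins >= 18) & (spins < 36)

# Flag every spin that immediately follows ten blacks in a row.
streak = np.ones(len(spins) - 10, dtype=bool)
for k in range(10):
    streak &= is_black[k : k + len(streak)]
next_spin_red = is_red[10:]

print(f"P(red) overall:               {is_red.mean():.4f}")
print(f"P(red | ten blacks just hit): {next_spin_red[streak].mean():.4f}")
# Both hover around 18/38 = 0.4737; the wheel has no memory.
```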

Perhaps the classic case of this was Kathryn Schulz's Pulitzer Prize-winning, much-celebrated New Yorker article on the potential mega-earthquake in the Pacific Northwest. This piece was a sensation when it appeared, thanks to its prominent placement in a popular publication, the deftness of Schulz's prose, and the artful construction of her story – but also because of the gambler's fallacy. At the time I heard about the article constantly, from a lot of smart, educated people, and it was all based on the idea that we were "overdue" for a huge earthquake in that region. People I know were considering selling their homes. Rational adults started stockpiling canned goods. The really big one was overdue.

Was Schulz responsible for this idea? After publication, she was dismissive of the idea that she had created the impression that we were overdue for such an earthquake. She wrote in a follow-up to the original article:

Are we overdue for the Cascadia earthquake?

No, although I heard that word a lot after the piece was published. As DOGAMI’s Ian Madin told me, “You’re not overdue for an earthquake until you’re three standard deviations beyond the mean”—which, in the case of the full-margin Cascadia earthquake, means eight hundred years from now. (In the case of the “smaller” Cascadia earthquake, the magnitude 8.0 to 8.6 that would affect only the southern part of the zone, we’re currently one standard deviation beyond the mean.) That doesn’t mean that the quake won’t happen tomorrow; it just means we are not “overdue” in any meaningful sense.

How did people get the idea that we were overdue? The original:

we now know that the Pacific Northwest has experienced forty-one subduction-zone earthquakes in the past ten thousand years. If you divide ten thousand by forty-one, you get two hundred and forty-three, which is Cascadia’s recurrence interval: the average amount of time that elapses between earthquakes. That timespan is dangerous both because it is too long—long enough for us to unwittingly build an entire civilization on top of our continent’s worst fault line—and because it is not long enough. Counting from the earthquake of 1700, we are now three hundred and fifteen years into a two-hundred-and-forty-three-year cycle.

By saying that there is a "two-hundred-and-forty-three-year cycle," Schulz implied a regular periodicity. The definition of a cycle, after all, is "a series of events that are regularly repeated in the same order." That simply isn't how a recurrence interval functions, as Schulz would go on to clarify in her follow-up – which, of course, got vastly less attention. I appreciate that, in her follow-up, Schulz was more rigorous and specific, referring to an expert's explanation, but it takes serious chutzpah to have written the preceding paragraph and then to act later as though there's no reason your readers thought the next quake was "overdue." The closest thing to a clarifying statement in the original article is as follows:

It is possible to quibble with that number. Recurrence intervals are averages, and averages are tricky: ten is the average of nine and eleven, but also of eighteen and two. It is not possible, however, to dispute the scale of the problem.

If we bother to unpack that first sentence, we can see it's a remarkable to-be-sure statement – she is obliquely admitting that since there is no regular periodicity to a recurrence interval, there is no sense in which that "two-hundred-and-forty-three-year cycle" is actually a cycle. It's just an average. Yes, the "really big one" could hit the Pacific Northwest tomorrow – and if it did, it still wouldn't mean that we had been overdue, as her later comments acknowledge. The earthquake might also happen 500 years from now. That's not a quibble; it's the root of the very panic she set off by publishing the piece. But by immediately leaping from such an under-explained discussion of what a recurrence interval is and isn't to the irrelevant and vague assertion about "the scale of the problem," Schulz ensured that her readers would misunderstand in the most sensationalistic way possible. However well crafted her story was, it left people getting a very basic fact wrong, and it was thus bad science writing. I don't think Schulz was being dishonest, but this was a major problem with a piece that received almost universal praise.
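One way to internalize the difference between an average and a cycle is to model the gaps with a distribution whose mean is 243 years and ask what 315 quiet years buys you. The sketch below uses a deliberately oversimplified memoryless (exponential) model – my toy assumption, not anyone's seismology; real subduction-zone quakes are closer to quasi-periodic, which is exactly why Madin reaches for standard deviations rather than elapsed time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: quake gaps as memoryless exponential waits, mean 243 years.
# An average recurrence interval, by itself, implies no schedule at all.
gaps = rng.exponential(scale=243, size=1_000_000)
print(f"Mean gap: {gaps.mean():.0f} years")  # ~243

# Condition on the gaps that have already run 315 years with no quake.
survivors = gaps[gaps > 315]
print(f"Mean additional wait after 315 quiet years: "
      f"{(survivors - 315).mean():.0f} years")  # still ~243
```

In this model, 315 years of quiet leaves the expected wait exactly where it started; "overdue" is meaningless. Real recurrence distributions are not this extreme, but the lesson carries: without the spread around the mean, "315 years into a 243-year cycle" says far less than it seems to.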

I just read another good example of an implied gambler's fallacy in a comprehensively irresponsible Gizmodo piece on supposed future pandemics. I am tempted to fisk the whole thing, but I'll spare you. For our immediate purposes, let's just look at how a gambler's fallacy can work by implication. George Dvorsky:

Experts say it’s not a matter of if, but when a global scale pandemic will wipe out millions of people…. Throughout history, pathogens have wiped out scores of humans. During the 20th century, there were three global-scale influenza outbreaks, the worst of which killed somewhere between 50 and 100 million people, or about 3 to 5 percent of the global population. The HIV virus, which went pandemic in the 1980s, has infected about 70 million people, killing 35 million.

Those specific experts are not named or quoted, so we'll have to take Dvorsky's word for it. But note the implication here: because we've had pandemics in the past that killed significant percentages of the population, we are likely to have more in the future. An-epidemic-is-upon-us stories are a dime a dozen in contemporary news media, given their obvious ability to drive clicks. Common to these pieces is the implication that we are overdue for another epidemic because epidemics used to happen regularly in the past. But of course conditions change, and there are few fields where conditions have changed more in the recent past than infectious disease. Dvorsky implies that they have changed for the worse:

Diseases, particularly those of tropical origin, are spreading faster than ever before, owing to more long-distance travel, urbanization, lack of sanitation, and ineffective mosquito control—not to mention global warming and the spread of tropical diseases outside of traditional equatorial confines.

Sure, those are concerns. But since he’s specifically set us up to expect more pandemics by referencing those in the early 20th century, maybe we should take a somewhat broader perspective and look at how infectious diseases have changed in the past 100 years. Let’s check with the CDC.

The most salient change, when it comes to infectious disease, has been the astonishing progress of modern medicine. We have a methodology for fighting infectious disease that has saved hundreds of millions of lives. Unsurprisingly, the diseases that keep getting nominated as the source of the next great pandemic keep failing to spread at the expected rates. Dvorsky names diseases like SARS (global cases since 2004: zero) and Ebola (for which we just discovered a very promising vaccine), not seeming to realize that these are examples of victories in the control of infectious disease, as tragic as the loss of life has been. The actual greatest threats to human health remain what they have been for some time: the deeply unsexy threats of smoking, heart disease, and obesity.

Does the dramatically lower rate of deaths from infectious disease mean a pandemic is impossible? Of course not. But “this happened often in the past, and it hasn’t happened recently, so….” is fallacious reasoning. And you see it in all sorts of domains of journalism. “This winter hasn’t seen a lot of snow so far, so you know February will be rough.” “There hasn’t been a murder in Chicago in weeks, and police are on their toes for the inevitable violence to come.” “The candidate has been riding a crest of good polling numbers, but analysts expect he’s due for a swoon.” None of these are sound reasoning, even though they seem superficially correct based on our intuitions about the world. It’s something journalists in particular should watch out for.

why selection bias is the most powerful force in education

Imagine that you are a gubernatorial candidate who is making education and college preparedness a key facet of your campaign. Consider these two states' average SAT scores.

                Quantitative    Verbal    Total

Connecticut         450           480       930

Mississippi         530           550      1080

Your data analysts assure you that this difference is statistically significant. You know that SAT scores are a strong overall metric for educational aptitude in general, and particularly that they are highly correlated with freshman year performance and overall college outcomes. Those who score higher on the test tend to receive higher college grades, are less likely to drop out in their freshman year, are more likely to complete their degrees in four or six years, and are more likely to gain full-time employment when they’re done.

You believe that making your state’s high school graduates more competitive in college admissions is a key aspect of improving the economy of the state. You also note that Connecticut has powerful teacher unions which represent almost all of the public teachers in the state, while Mississippi’s public schools are largely free of public teacher unions. You resolve to make opposing teacher unions in your state a key aspect of your educational platform, out of a conviction that getting rid of the unions will ultimately benefit your students based on this data.

Is this a reasonable course of action?

Anyone who follows major educational trends would likely be surprised at these SAT results. After all, Connecticut consistently places among the highest-achieving states in educational outcomes, Mississippi among the worst. In fact, on the National Assessment of Educational Progress (NAEP), widely considered the gold standard of American educational testing, Connecticut recently ranked as the second-best state for 4th graders and the best for 8th graders. Mississippi ranked second-to-worst for both 4th graders and 8th graders. So what’s going on?

The key is participation rate, or the percentage of eligible juniors and seniors taking the SAT, as this scatter plot shows.

As can be seen, there is a strong negative relationship between participation rate and average SAT score: generally, the higher the percentage of students taking the test in a given state, the lower the average score. Why? Think about what it means for students in Mississippi, where the participation rate is 3%, to take the SAT. Those students are the ones most motivated to attend college and the ones most college-ready. In contrast, in Connecticut 88% of eligible juniors and seniors take the test. (Data.) This means that almost everyone of appropriate age takes the SAT in Connecticut, including many students who are unprepared or only marginally prepared for college. Most Mississippi students self-select out of the sample. The top-performing quintile (20%) of Connecticut students handily outperforms the top-performing quintile of Mississippi students. Typically, the highest state average in the country is that of North Dakota—where only 2% of those eligible take the SAT at all.

In other words, what we might have perceived as a difference in education quality was really the product of systematic differences in how the considered populations were put together. The groups we considered had a hidden non-random distribution. This is selection bias.
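It takes remarkably little to manufacture that reversal. In the sketch below, every number is invented for illustration: two states with identical underlying ability distributions, and a crude model in which the test-takers are simply the top slice of each pool. The low-participation state "wins" by a wide margin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical states with IDENTICAL ability distributions,
# on an SAT-like scale (mean 1000, sd 200).
pool_a = rng.normal(1000, 200, 100_000)  # high participation: 88%
pool_b = rng.normal(1000, 200, 100_000)  # low participation: 3%

# Crude self-selection model: only the top slice of each pool
# actually sits for the test.
takers_a = np.sort(pool_a)[-int(0.88 * len(pool_a)):]
takers_b = np.sort(pool_b)[-int(0.03 * len(pool_b)):]

print(f"Average, 88% participation: {takers_a.mean():.0f}")  # ~1045
print(f"Average,  3% participation: {takers_b.mean():.0f}")  # ~1450
```

Real self-selection is noisier than skimming the literal top of the distribution, but the direction is the same: the smaller and more self-selected the tested sample, the higher the average.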

*****

My hometown had three high schools – the local coed public high school (where I went), plus a boys' and a girls' private Catholic high school. People involved with the private high schools liked to brag about the high scores their students earned on standardized tests – without bothering to mention that you had to score well on such a test to get into them in the first place. This is, as I've said before, akin to having a height requirement for your school and then bragging about how tall your student body is. And of course, there's another set of screens involved here that also powerfully shapes outcomes: private schools cost a lot of money, and students who can't afford to attend are screened out. Students from lower socioeconomic backgrounds have consistently lower performance on a broad variety of metrics, so private schools are again advantaged in comparison to public schools. To draw conclusions about educational quality from student outcomes without rigorous attempts to control for differences in which students are sorted into which schools, programs, or pedagogies – without randomization – is to ensure that you'll draw unjustified conclusions.

Here's an image that I often use to illustrate a far broader set of realities in education. It's a regression analysis showing institutional averages for the Collegiate Learning Assessment, a standardized test of college learning and the subject of my dissertation. Each dot is a college's average score. The blue dots are average scores for freshmen; the red dots, for seniors. The gap between the red and blue dots shows the degree of learning going on in this data set, which is robust for essentially all institutions. The very strong relationship between SAT scores and CLA scores shows the extent to which different incoming student populations – the inherent, powerful selection bias of the college admissions process – determine different test outcomes. (Note that very similar relationships are observed in similar tests such as ETS's Proficiency Profile.) To blame educators at a school on the left-hand side of the regression for failing to match the schools on the right-hand side is to punish them for differences in the prerequisite ability of their students.
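A toy version of that regression makes the point. In the sketch below (all scales and numbers invented; this is not the CLA data), one hundred colleges all add exactly the same amount of learning, yet their senior averages track their incoming SAT scores almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 hypothetical colleges that all teach equally well: each adds the
# same fixed learning gain, but they admit very different students.
n_schools = 100
incoming_sat = rng.uniform(900, 1400, n_schools)  # admissions selectivity varies
gain = 150                                        # identical "value added" everywhere
freshman_avg = 0.8 * incoming_sat + rng.normal(0, 20, n_schools)
senior_avg = freshman_avg + gain + rng.normal(0, 20, n_schools)

r = np.corrcoef(incoming_sat, senior_avg)[0, 1]
print(f"Correlation of senior averages with incoming SAT: r = {r:.2f}")  # ~0.97
```

In a world like this one, ranking schools by their senior scores is just ranking them by their admissions offices.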

Harvard students have remarkable post-collegiate outcomes, academically and professionally. But then, Harvard invests millions of dollars in carefully managing its incoming student body. The truth is that most Harvard students are going to be fine wherever they go, which calls into question our assumptions about the quality of Harvard's education itself. Or consider exclusive public high schools like New York's Stuyvesant, a remarkably competitive institution where the city's best and brightest students compete to enroll, thanks to the great educational benefits of attending. After all, the alumni of high schools such as Stuyvesant are a veritable Who's Who of high achievers and success stories; those schools must be of unusually high quality. Except that attending those high schools simply doesn't matter in terms of conventional educational outcomes. When you look at the edge cases – when you restrict your analysis to the students who were among the last let into such schools and those who were among the last left out – you find no statistically meaningful differences between them. Of course, when you have a mechanism in place to screen out all of the students with the biggest disadvantages, you end up with an impressive-looking set of alumni. The admissions procedures at these schools don't determine which students get the benefit of a better education; the perception of a better education is itself an artifact of the admissions procedure. The screening mechanism is the educational mechanism.
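Studies of those edge cases use formal regression-discontinuity designs on real admissions data; what follows is only a cartoon of their logic, with every number invented. A school admits everyone above a cutoff, the model grants the school itself zero causal effect, and yet the overall admitted-versus-rejected gap looks enormous while the gap at the margin nearly vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy exam school: admit everyone above an ability cutoff. Later
# outcomes depend on ability plus noise -- the school itself adds
# NOTHING in this model.
n = 200_000
ability = rng.normal(0, 1, n)
admitted = ability > 1.5                    # roughly the top 7%
outcome = ability + rng.normal(0, 1, n)     # zero school effect

gross = outcome[admitted].mean() - outcome[~admitted].mean()
print(f"All admitted minus all rejected:     {gross:+.2f}")   # huge gap

# Edge cases: just above vs. just below the cutoff.
just_in = (ability > 1.5) & (ability < 1.6)
just_out = (ability > 1.4) & (ability <= 1.5)
margin = outcome[just_in].mean() - outcome[just_out].mean()
print(f"Barely admitted minus barely missed: {margin:+.2f}")  # near zero
```

The big gap reflects who got in, not what the school did; at the margin, where the students are nearly identical, the "Stuyvesant effect" all but disappears.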

Thinking about selection bias compels us to reconsider our perceptions of educational cause and effect in general. A common complaint of liberal education reformers is that students who face consistent achievement gaps, such as poor minority students, suffer because they are systematically excluded from the best schools, screened out by high housing prices in affluent, white districts. But what if this confuses cause and effect? Isn't it more likely that we perceive those districts to be the best precisely because they effectively exclude students who suffer under the burdens of racial discrimination and poverty? Of course schools look good when, through geography and policy, they are responsible for educating only the students who receive the greatest socioeconomic advantages our society provides. But this reversal of perceived cause and effect is almost entirely absent from education talk, in either liberal or conservative media.

Immigrant students in American schools outperform their domestic peers, and the reason is culture and attitude, the immigrant's willingness to strive and persevere, right? Nah. Selection bias. So-called alternative charters have helped struggling districts turn things around, right? Not really; they've just artificially created selection bias. At Purdue, which has a large Chinese student population, I always chuckled to hear domestic students say "Chinese people are all so rich!" It didn't seem to occur to them that attending a school that costs more than $40,000 a year for international students acts as a natural screen excluding the vast number of Chinese people who live in deep poverty. And I had to remind myself that my 8:00 AM writing classes weren't going better than my 2:00 PM classes because I was somehow a better teacher in the morning; they were going better because the students who sign up for an 8:00 AM class are probably the most motivated and prepared. There's plenty of detailed work by people who know more than I do about the actual statistical impact of these issues and how to correct for them. But we all need to be aware of how deeply unequal populations influence our perceptions of educational quality.

Selection bias hides everywhere in education. Sometimes, in fact, it is deliberately hidden. A few years ago, Reuters undertook an exhaustive investigation of the ways that charter schools deliberately exclude the hardest-to-educate students, despite the fact that most are ostensibly required to accept all kinds of students, as public schools must. For all the talk of charters as a revolution in effective public schooling, what we find is that charter administrators work feverishly to tip the scales, finding all kinds of crafty ways to avoid educating the hardest students to educate. And even when we look past the dirty tricks – like, say, requiring parents to attend meetings held at times when most working parents can't – there are all sorts of ways in which students are assigned to charter schools non-randomly, and in ways that advantage those schools. Excluding students with cognitive and developmental disabilities is a notorious example. (Despite what many people presume, in most locales a majority of students with special needs take state-mandated standardized tests and are included in data like graduation rates.) Even the simple fact that parents typically have to opt in to charter school lotteries functions as a screening mechanism.

Large-scale studies of charter efficacy such as Stanford's CREDO project argue confidently that they have controlled for the enormous number of potential screening mechanisms that hide in large-scale education research. These researchers are among the best in the world, and I don't mean to disparage their work. But given the size of the stakes and the truth of Campbell's Law, I have to report that I remain skeptical that we have ever truly controlled for all the ways that schools and their leaders cook the books and achieve non-random student populations. Given that random assignment to condition is the single most essential aspect of responsible social-scientific study, I think caution is warranted. And as I'll discuss in a future post, the observed impact of school quality on student outcomes in the cases where we have the most confidence in truly random assignment is not encouraging.

I find it’s nearly impossible to get people to think about selection bias when they consider schools and their quality. Parents look at a private school and say, look, all these kids are doing so well, I’ll send my troubled child and he’ll do well, too. They look at the army of strivers marching out of Stanford with their diplomas held high and say, boy, that’s a great school. And they look at the Harlem Children’s Zone schools and celebrate their outcome metrics, without pausing to consider that it’s a lot easier to get those outcomes when you’re constantly expelling the students most predisposed to fail. But we need to look deeper and recognize these dynamics if we want to evaluate the use of scarce educational resources fairly and effectively.

Tell me how your students are getting assigned to your school, and I can predict your outcomes – not perfectly, but well enough that it calls into question many of our core presumptions about how education works.