Study of the Week: the Gifted and the Grinders

Back in high school, I was a pretty classic example of a kid that teachers said was bright but didn’t apply himself. There were complex reasons for that, some of them owing to my home life, some of it my failure to understand the stakes, and some of it laziness and arrogance. Though I wasn’t under the impression that I was a genius, I did think that in the higher placement classes there were people who got by on talent and people who were striver types, the ones who gritted out high grades more through work than through being naturally bright.

This is, of course, reductive thinking, and was self-flattery on my part. (In my defense, I was a teenager.) Obviously, there’s a range of smarts and a range when it comes to perseverance and work ethic, and there are all sorts of aspects of these things that are interacting with each other. And clearly those at the very top of the academic game likely have both smarts and work ethic in spades. (And luck. And privilege.) But my old vague sense that some people were smarties and some were grinders seems pervasive to me. Our culture is full of those archetypes. Is it really the case that intelligence and work ethic are separate, and that they’re often found in quite different amounts in individuals?

Kind of, yeah.

At least, there’s evidence for that in a recent replication study performed by Clemens Lechner, Daniel Danner, and Beatrice Rammstedt of the Leibniz Institute for the Social Sciences, which I will talk about today for the first Study of the Week, and which I’ll use to take a quick look at a few core concepts.

Construct and Operationalization

Social sciences are hard, for a lot of reasons. One is the famously large number of variables that influence human behavior, which in turn makes it difficult to identify which variables (or interactions of variables) are responsible for a given outcome. Another is the concept of the construct.

In the physical sciences, we’re generally measuring things that are straightforward facets of the physical universe, things that are to one degree or another accessible and mutually defined by different people. We might have different standards of measure, we might have different tools to measure them, and we might need a great deal of experimental sophistication to obtain these measurements, but there is usually a fundamental simplicity to what we’re attempting to measure. Take length. You might measure it in inches or in centimeters. You might measure it with a yardstick or a laser system. You might have to use complex ideas like cosmic distance ladders. But fundamentally the concept of length, or temperature, or mass, or luminosity, is pretty easy to define in a way that most every scientist will agree with.

The social sciences are mostly not that way. Instead, we often have to look at concepts like intelligence, reading ability, tolerance, anxiety…. Each of these reflects a real-world phenomenon that most humans can agree exists, but what exactly they entail and how to measure them are matters of controversy. They just aren’t available to direct measurement in the ways common to the natural and physical sciences. So we need to define how we’re going to measure them in a way that will be regarded as valid by others – and that’s often not an uncomplicated task.

Take reading. Everybody knows what reading is, right? But testing reading ability turns out to be a complex task. If we want to test reading ability, how would we go about doing that? A simple way might be to have a test subject read a book out loud. We might then decide if the subject can be put into the CAN READ or CAN’T READ pile. But of course that’s quite lacking in granularity and leaves us with a lot of questions. If a reader mispronounces a word but understands its meaning, does that mean they can’t read that word? How many words can a reader fail to read correctly in a given text before we sort them into the CAN’T READ pile? Clearly, reading isn’t really a binary activity. Some people are better or worse readers and some people can read harder or easier texts. What we need is a scale and a test to assign readers to it. What form should that scale take? How many questions is best? Should the test involve reading passages or reading sentences? Fill in the blank or multiple choice? Is the ability to spot grammatical errors in a text an aspect of reading, or is that a different construct? Is vocabulary knowledge a part of the construct of reading ability or a separate construct?

You get the idea. It’s complicated stuff. We can’t just say “reading ability” and know that everyone is going to agree with what that is or how to measure it. Instead, we recognize the social processes inherent in defining such concepts by referring to them as a construct and to the way we are measuring that construct as an operationalization. (You are invited to roll your eyes at the jargon if you’d like.) So we might have the concept “reading ability” and operationalize it with a multiple choice test. Note that the operationalization isn’t merely an instrument or a metric but the whole sense of how we take the necessarily indistinct construct and make it something measurable.

Construct and operationalization, as clunky as the terms are and as convoluted as they seem, are essential concepts for understanding the social sciences. In particular, I find the difficulty merely in defining our variables of interest and how to measure them a key reason for epistemic humility in our research.

So back to the question of intelligence vs. work ethic. The construct “intelligence” is notoriously contested, with hundreds of books written about its definition, its measurement, and the presumed values inherent to how we talk about it. For our purposes, let’s accept merely that this is a subject of a huge body of research, and that we have the concepts of IQ and g in the public consciousness already. We’ll set aside all of the empirical and political issues with IQ for now. But what about work ethic/perseverance/“grinding”? How would we operationalize such a construct? Here we’ll have to talk about psychology’s Five Factor Model.

The “Big Five” or Five Factor Model

The Five Factor Model is a vision of human personality, particularly favored by those in behavioral genetics, that says there are essentially only five major factors in human personality: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism, sometimes abbreviated with the acronym OCEAN. To one degree or another, proponents of the Five Factor Model argue that all of our myriad terms for personality traits are really just synonyms for these five things. That’s right, that’s all there is to it – those are the traits that make up the human personality, and we’re all found along a range on those scales. I’m exaggerating, of course, but in the case of some true believers not by much. Steven Pinker, for example, flogs the concept relentlessly in his most famous book, The Blank Slate. That’s not a coincidence; behavioral genetics, as a field, loves the Five Factor Model because it fits empirically with the maximalist case for genetic determinism. (Pinker’s MO is to say that both genetics and other factors matter about equally and then to speak as if only genetics matter.) In other words the Five Factor Model helps people make a certain kind of argument about human nature, so it gets a lot of extra attention. I sometimes call this kind of thinking the validity of convenience.

The standard defense of the Five Factor Model is, hey, it replicates – that is, its experimental reliability tends to be high, in that different researchers using somewhat different methods to measure these traits will find similar results. But this is the tail wagging the dog; that something replicates doesn’t mean it’s a valid theoretical construct, only that it tracks with some persistent real-world quality. As Louis Menand put it in a New Yorker review of The Blank Slate that is quite entertaining if you find Pinker ponderous,

When Pinker and Harris say that parents do not affect their children’s personalities, therefore, they mean that parents cannot make a fretful child into a serene adult. It’s irrelevant to them that parents can make their children into opera buffs, water-skiers, food connoisseurs, bilingual speakers, painters, trumpet players, and churchgoers—that parents have the power to introduce their children to the whole supra-biological realm—for the fundamental reason that science cannot comprehend what it cannot measure.

That results of the Five Factor Model can be replicated does not mean that the idea of dividing the human psyche into five reductive factors, and declaring them the whole of personality, is valid. It simply means that our operationalizations of the construct are indeed measuring some consistent property of individuals. It’s like answering the question “what is a human being?” by saying “a human being is bipedal.” If you then send a team of observers out into the world to measure the number of legs that tend to be found on humans, you will no doubt find that different researchers are likely to obtain similar findings when counting the number of legs of an individual person. But this doesn’t provide evidence that bipedalism is the sum of mankind; it merely suggests that legs are a thing you can consistently measure, among many. Reliability is a necessary criterion for validity, but it isn’t sufficient. I don’t doubt that the Five Factor Model describes consistent and real aspects of human personality, but the way that Pinker and others treat that model as a more or less comprehensive catalog of what it means to be human is not justified. I’m sure that you could meet two different people who share the same outcomes on the five measured traits in the model, fall madly in love with one of them, and declare the other the biggest asshole you’ve ever met in your life. We’re a multivariate species.

That windup aside, for this particular kind of analysis, I think a construct like “conscientiousness” can be analytically useful. That is, I think that we can avoid the question of whether the Five Factors are actually a comprehensive catalog of essential personality traits while recognizing that there’s some such property of educational perseverance and that it is potentially measurable. (Angela Lee Duckworth’s “grit” concept has been a prominent rebranding of this basic human capacity, although it has begun to generate some criticism.) The question is, does this trait really exist independent of intelligence, and how effective of a predictor is it compared to IQ testing?

Intelligence, Achievement, Test Scores, and Grades

In educational testing, it’s a constant debate: to what degree do various tests measure specific and independent qualities of tested subjects, and to what degree are they just rough approximations of IQ? You can find reams of studies concerning this question. The question hinges a great deal on the subject matter; obviously, a really high IQ isn’t going to mean much if you’re taking a Latin test and you’ve never studied Latin. On the other hand, tests like the SAT and its constituent sections tend to be very highly correlated with IQ tests, to the point where many argue that the test is simply a de facto test for g, the general intelligence factor that IQ tests are intended to measure. What makes these questions difficult, in part, is that we’re often going to be considering variables that are likely to be highly correlated within individuals. That is, the question of whether a given achievement test measures something other than g is harder to answer because people with a high g are also those who are likely to score highly on an achievement test even if that test effectively measures something other than g. Make sense?

Today’s study offers two main research questions:

first, whether achievement and intelligence tests are empirically distinct; second, how much variance in achievement measures is accounted for by intelligence vs. by personality, whereby R² increments of personality after adjusting for intelligence are the primary interest

I’m not going to wade into the broader debate about whether various achievement tests effectively measure properties distinct from IQ. I’m not qualified, statistically, to try and separate the various overlapping sums of squares in intelligence and achievement testing. And given that the g-men are known for being rather, ah, strident, I’d prefer to avoid the issue. Besides I think the first question is of much more interest to professionals in psychometrics and assessment than the general public. (This week’s study is in fact a replication of a study that was in turn disputed by another researcher.) But the second question is interesting and relevant to everyone interested in education: how much of a given student’s outcomes are the product of intelligence and how much is the product of personality? In particular, can we see a difference in how intelligence (as measured with IQ and its proxies) influences test scores and grades and how personality (as operationalized through the Five Factor Model) influences them?

The Present Study

In the study at hand, the researchers utilized a data set of 13,648 German 9th graders. The student records included their grades; their results on academic achievement tests; their results on a commonly-used test of the Five Factors; and their performance on a test of reasoning/general intelligence (a Raven’s Standard Progressive Matrices analog) and a processing speed test, which are often used in this kind of cognitive research.

The researchers undertook a multivariable analysis-of-variance technique called “exploratory structural equation modeling.” I would love to tell you what that is and how it works but I have no idea. I’m not equipped, statistically, to explain the process or judge whether it was appropriate in this instance. We’re just going to have to trust the researchers and recognize that the process does what analysis of variance does generally, which is to look at the quantitative relationships between variables to explain how they predict, or fail to predict, each other. The nut of it is here:

First, we regressed each of the four cognitive skill measures on all Big Five dimensions. Second, we decomposed the variance of the achievement measures (achievement test scores and school grades) by regressing them on intelligence alone and then on personality and intelligence jointly.

(“Decomposing” variance, in statistics, is a fancy way of saying that you’re using mathematical techniques to split the variation in an outcome into the portions attributable to different predictors, which might otherwise be difficult to separate thanks to their close quantitative relationships.)

What did they find? The results are pretty intuitive. There is, as to be expected, a strong (.76) correlation between performance on the intelligence test and performance on achievement tests. There’s also a considerable but much weaker relationship between achievement tests and grades (.44) and the general intelligence test and grades (.32). So kids who are smarter as defined by achievement and reasoning tests do get better grades, but the relationship isn’t super strong. There are other factors involved. And a big part of that unexplained variance, according to that research, is personality.

The Big Five explain a substantial, and almost identical, share of variance in grades and achievement tests, amounting to almost one-fifth. By comparison, they explain less than half as much—<.10—of the variance in reasoning, and almost none in processing speed (0.07%).

In other words, if you’re trying to predict how students will do on grades and achievement tests, their personalities are pretty strong predictors. But if you’re trying to predict their pure reasoning ability, personality is pretty useless. And the good-at-tests, bad grades students like high school Freddie are pretty plentiful:

the predictive power of intelligence is markedly different for these two achievement measures: it is much higher than that of personality in the case of achievement—but much lower in the case of school grades, where personality alone explains almost two times more variance than intelligence alone does.

So it would seem there may be some validity to the concept of the naturally bright and grinders after all. And the obverse, the less naturally bright but highly motivated grinder types?

Conscientiousness has a substantial positive relationship with grades—but negative relationships with both achievement test scores and reasoning.

In other words, the more conscientious you are, the better the grades you receive, even though you score lower on achievement and intelligence tests. Unsurprisingly, Conscientiousness (the “grit,” perseverance, stick-to-itiveness factor) correlated most highly with school grades, at .27. The ability to continue to work diligently and through adversity makes a huge difference on getting good grades but is much less important when it comes to raw intelligence testing.

What It Means

Ultimately, this research result is intuitive and matches with the personal experience of many. As someone who spent a lot of his life skating by on being bright, and only really became academically focused late in my undergraduate education, there’s something selfishly comforting here. But in the broader, more socially responsible sense, I think we should take care not to perpetuate any stigmas about the grinders. On the one hand, our culture is absolutely suffused with celebrations of conscientiousness and hard work, so it’s not like I think grinders get no credit. And it is important to say that there are certain scenarios where pure reasoning ability matters; if you’re intent on being a research physicist or mathematician, for example, or if you’re bent on being a chess Grandmaster, hard work will not be sufficient, no matter what Malcolm Gladwell says. On the other hand, I am eager to contribute in whatever way to undermining the Cult of Smartness. We’ve perpetuated the notion that those naturally gifted with high intelligence are our natural leaders for decades, and to show for it we have immense elite failures and a sickening lack of social responsibility on Wall Street and in Silicon Valley, where the supposed geniuses roam.

What we really need, ultimately, from both our educational system and our culture, is a theme I will return to in this blog again and again: a broader, more charitable, more humanistic definition of what it means to be a worthwhile human being.

(Thanks to SlateStarCodex for bringing this study to my attention.)


Success Academy Charter Schools accepted $550,000 from pro-Trump billionaires

The charter school movement retains a reputation for being at least nominally politically progressive. This is strange, as the movement entails defunding public institutions and removing public accountability and replacing them with for-profit or not-for-profit-in-name-only institutions, and in doing so causing cuts in stable, unionized public sector jobs. Charter schools are notorious union busters and in some locales have resulted in hugely disproportionate job cuts against black women. The basic assumptions of school “choice” rely on conservative economic arguments – that markets always improve quality. As terrible as Betsy DeVos’s appointment to Secretary of Education in the Trump administration may be, it at least helps clarify what should already be obvious: that support for charter schools is conservative on its face.

Here’s a little more evidence for you, concerning school reform darling Success Academy Charter Schools, celebrated for their strong metrics and notorious for their abusive methods in achieving them.

Via my friend and comrade Mindy Rosier, a public school special education teacher and tireless activist, I learned about these financial disclosures from the Mercer Family Foundation. The Mercer Family Foundation is the funding arm of a secretive, powerful family of reactionary billionaires who have spent the last few years empowering Republicans generally and Donald Trump specifically. And in 2014 they donated $550,000 to Success Academy.

Overall, the Mercer Family Foundation’s donations are a veritable Who’s Who of reactionary conservatism, with large donations going to the Heritage Foundation, the Cato Institute, the George W. Bush Foundation, the Barry Goldwater Institute, the Manhattan Institute…. Breitbart also receives key financial backing from the Mercer Family Foundation. What kind of publication is Breitbart?

How does Success Academy justify taking money from people who fund such hateful rhetoric? I don’t know. Betsy Woodruff has written about this connection, pointing out that Donald Trump has become a staunch advocate of charter schools, but otherwise it’s failed to generate much attention. Might be time for an enterprising reporter to pick up the phone.

One way or another, this should be just another clear indication of the obvious: the charter school movement is part of the same conservative movement that brought us Donald Trump.

do peer effects matter? nah

There’s long been a belief that peer effects play a significant role in how well students perform academically – that is, that learning alongside higher-achieving peers likely helps students achieve themselves, while learning alongside lower-achieving peers might drag them down. Is that the case?

Probably not. The newer, larger, higher-quality studies don’t show evidence for that in quantitative outcomes, anyway. A large study looking at exam schools in New York and Boston – that is, selective public high schools in large urban districts – found that even though enrolling in these institutions dramatically increased the average academic performance of peers (thanks to the screening process to get in), the impact on relative performance was essentially nil. That’s true in terms of test metrics like the PSAT, SAT, and AP Scores, and in terms of college outcomes after graduation. It just doesn’t matter much. A study among students transitioning from the primary school level to the secondary school level in England, where dramatic changes occur in peer group composition, found a significant but very small effect from peer group in quantitative indicators. Like, really small. Assuming the null is a pretty good bet in a lot of education research.

Of course, none of this means there’s no human value in sending your kids to school with elite peers. There are many things that matter in life beyond quantitative education indicators. (Though you’d never know that if you listen to some pundits.) Your kids may find their school experience more pleasant, and it may help them network later in life, if they attend school with high achievers. On the other hand, it will inevitably increase the homogeneity of their learning environment, which seems less than ideal to me in a multicultural democracy like ours. Public schools that are struggling desperately need financially secure parents who have the social capital necessary to advocate for them, too. Either way, though, if you’re worrying about how peer effects will impact your kid’s outcomes, you shouldn’t. Like an awful lot of things that parents worry about when it comes to their children, it just doesn’t matter much.

notes for 3/31/17

  • The first book review should be out for subscribers today! It’s a reprint of an academic review I wrote, but in the future they’ll all be new content. Just need a week to read a couple more books appropriate for this blog. This is the first time I’ve ever distributed a reward through Patreon so please let me know if you didn’t receive an email or can’t access the review. If you’re not a subscriber yet, think it over! I’m exploring some cool options for other rewards and I hope to let you know about some of them soon.
  • Be sure to spread the word about this project if you like it.
  • Some readers have pointed out that the rates for SAT participation in some states are so low (mentioned in this post) because those states require the ACT as a learning assessment. Which is certainly true! But note that the point of that post isn’t to say “look at how low these participation rates are” but rather to explore selection bias, which in the case of ACT-dominant regions would be even more pronounced – only the very motivated students, particularly those looking to attend elite private institutions, would be likely to take the SAT.
  • I have gotten a fair amount of pushback on the idea that randomized trials of charter school efficacy aren’t really random. I agree that this is an idea that I need to explore at greater length in the future. In addition to what I suspect is lurking non-random distribution, I think the bigger question is whether “charter school” even makes sense as a condition suitable for randomization. More to come.
  • On the other side, I appear to have been too kind to the CREDO studies. To call survivorship bias a demonstration of quality on the part of charters is just… not cool.
  • The first Study of the Week post should come out on Monday. It’s a big meaty one and I’m really happy with how it’s shaping up. Not 100% sure but I’m guessing I’ll distribute book reviews on the weekend and do Study of the Week on Monday or Tuesday. And feel free to email me with suggestions or requests.

looking beyond test scores in defense of after school programs

The Trump administration has proposed cutting funding for a program that provides after school programs for low-income students. At the Atlantic, Leah Askarinam defends the programs. I’m on board with continuing to fund them, but I find her defense counterproductive.

Askarinam’s argument is kind of strange. The Brookings Institution ran several large-n studies in the middle of the aughts that showed, without much ambiguity at all, that the quantitative learning gains from these programs are minimal. Askarinam fixates on the age of those studies as a reason to question their validity. It’s true that the latest study on the efficacy of these programs is about a decade old, which isn’t ideal, but also isn’t unusual; it’s really hard, far harder than most people think, to run effective large-scale social science research projects. More to the point, why would we assume that something fundamental has changed in the outcomes of these programs in the past 10 years? She notes that the federal policy situation was different then, but that hardly seems to be sufficiently explanatory for me – the federal education policy situation changes all the time, without seeing systematic differences in student outcomes. (Indeed, the irrelevance of federal education policy to student outcomes is the source of great lamentation.) Consider the standard here: if ten+ year old studies had shown robust learning gains, would Askarinam now say that they were too old to be trusted? Such a standard would cut both ways, after all.

And while it’s true that absence of evidence isn’t evidence of absence, absence of evidence is… absence of evidence. Askarinam offers some anecdotal evidence of academic improvement, discusses internal research, and speaks generally of gains not captured by those older studies. That’s fine as far as it goes, but none of it amounts to responsible evidence for the kind of quantitative gains the Brookings studies were looking for. More study is needed, obviously – you can use that phrase like a comma when you’re talking about ed research – but as with pre-K programs, I think if the question is “are test score and other quantitative gains in outcomes sufficient to justify the expense of publicly-funded after school programs?,” the answer is clearly no.

So am I opposed to funding for after school programs? No, not at all. I just think we should fund them for defensible reasons. Askarinam quotes David Muhlhausen of the conservative Heritage Foundation: “It’s a place to have their kids while the parents are at work,” Muhlhausen said. “That’s the real key to these programs and why they’re popular—not that they provide any benefits to the students. It’s basically a babysitting program for parents who aren’t home.”

Sounds good to me.

The birth of publicly-funded, federally-guaranteed education for children aged 5-18 was one of the greatest advancements in human well-being in history. It helped move millions of children into formal education, providing not only the various benefits of schooling to them but also the essential ancillary benefit of childcare. This in turn made it easier for both parents to work. While we might lament the fact that it’s now necessary for most households to have two incomes to survive, the fact is that it is necessary, and without the free childcare that public schools provide, family life would be impossible for much of the country. Public education also helps our slow, imperfect march towards gender equality. And in a world where digital technologies make it easier and easier to avoid interacting with people who are outside our immediate familial and friend networks, formal schooling can help make the kinds of connections between people from radically different backgrounds that are essential for a functioning democracy.

The cost of these programs is around $1 billion a year, or about one quarter of one percent of what we’ve spent on the failed F-35 jet project in the past 15 years.

In an era of stagnant real incomes for most workers and spiraling costs of housing, healthcare, and higher education, programs that provide safe supervision for children are worth supporting. “Traditional values” conservatives should embrace programs that make child rearing feasible for more families; liberals and leftists should appreciate expanding government assistance and taking more social goods (like childcare) out of the market.

Askarinam’s defensiveness, it seems to me, reflects the way that the widespread acceptance of test score obsession has boxed us in. Too many well-meaning progressives have adopted this reductive view of the purpose of education; they then end up unable to defend programs they favor when the results of those programs on test scores are inevitably small or nonexistent. The universal pre-K debate is a perfect example. The endless back-and-forth involves credible arguments from both supporters and skeptics, but few would question that the test score and other quantitative gains we’re arguing over are modest. So stop arguing through that frame. As long as test scores are taken as the criterion of interest, we’ll be playing defense. Instead, we should argue that the basic benefit of pre-K and after school programs is to provide essential childcare support to struggling families, and to provide social and personal enrichment that has value even if uncorrelated with test score increases. We need to expand our definitions of the purpose of education outside of the quantitative, rather than staying rooted in a frame that often doesn’t help us. Askarinam describes an after school program that offers social and emotional health benefits. That’s worth fighting for on its own. So articulate that case, and do the same with pre-K. Argue from strength, not weakness.

why selection bias is the most powerful force in education

Imagine that you are a gubernatorial candidate who is making education and college preparedness a key facet of your campaign. Consider these two state average SAT scores.

State          Quantitative   Verbal   Total
Connecticut    450            480      930
Mississippi    530            550      1080

Your data analysts assure you that this difference is statistically significant. You know that SAT scores are a strong overall metric for educational aptitude in general, and particularly that they are highly correlated with freshman year performance and overall college outcomes. Those who score higher on the test tend to receive higher college grades, are less likely to drop out in their freshman year, are more likely to complete their degrees in four or six years, and are more likely to gain full-time employment when they’re done.

You believe that making your state’s high school graduates more competitive in college admissions is a key aspect of improving the economy of the state. You also note that Connecticut has powerful teacher unions which represent almost all of the public teachers in the state, while Mississippi’s public schools are largely free of public teacher unions. You resolve to make opposing teacher unions in your state a key aspect of your educational platform, out of a conviction that getting rid of the unions will ultimately benefit your students based on this data.

Is this a reasonable course of action?

Anyone who follows major educational trends would likely be surprised at these SAT results. After all, Connecticut consistently places among the highest-achieving states in educational outcomes, Mississippi among the worst. In fact, on the National Assessment of Educational Progress (NAEP), widely considered the gold standard of American educational testing, Connecticut recently ranked as the second-best state for 4th graders and the best for 8th graders. Mississippi ranked second-to-worst for both 4th graders and 8th graders. So what’s going on?

The key is participation rate, or the percentage of eligible juniors and seniors taking the SAT, as this scatter plot shows.

As can be seen, there is a strong negative relationship between participation rate and average SAT score. Generally, the higher the percentage of students taking the test in a given state, the lower the average score. Why? Think about what it means for students in Mississippi, where the participation rate is 3%, to take the SAT. Those students are the ones who are most motivated to attend college and the ones who are most college-ready. In contrast, in Connecticut 88% of eligible juniors and seniors take the test. (Data.) This means that almost everyone of appropriate age takes the SAT in Connecticut, including many students who are not prepared for college or are only marginally prepared. Most Mississippi students select themselves out of the sample. The top-performing quintile (20%) of Connecticut students handily outperforms the top-performing quintile of Mississippi students. Typically, the highest state average in the country is that of North Dakota, where only 2% of those eligible take the SAT at all.

In other words, what we might have perceived as a difference in education quality was really the product of systematic differences in how the considered populations were put together. The groups we considered had a hidden non-random distribution. This is selection bias.
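
To make the mechanism concrete, here is a minimal sketch in Python. It is a toy model, not an analysis of real SAT data: I assume every state draws from the same normal distribution of scores (the mean of 1000 and standard deviation of 200 are made-up numbers), and that only the top-scoring share of students sits the test. The function name and its parameters are my own inventions for illustration.

```python
from statistics import NormalDist


def mean_score_of_test_takers(participation_rate, mean=1000.0, sd=200.0):
    """Average score among test takers when only the top-scoring
    `participation_rate` share of students takes the test.

    Assumption (hypothetical): scores are normally distributed and
    self-selection is a strict cut from the top. Uses the standard
    formula for the mean of the upper tail of a normal distribution.
    """
    if not 0.0 < participation_rate <= 1.0:
        raise ValueError("participation_rate must be in (0, 1]")
    if participation_rate == 1.0:
        # Everyone takes the test, so the average is the population mean.
        return mean
    standard = NormalDist()  # mean 0, standard deviation 1
    cutoff = standard.inv_cdf(1.0 - participation_rate)
    return mean + sd * standard.pdf(cutoff) / participation_rate


if __name__ == "__main__":
    # Same underlying population, different participation rates.
    for rate in (0.03, 0.88, 1.0):
        print(f"{rate:.0%} participation: {mean_score_of_test_takers(rate):.0f}")
```

Under those assumptions, a state with a 3% participation rate posts a much higher average than a state with an 88% rate, even though the two underlying student populations are identical by construction. Real self-selection is messier than a strict cut from the top, so treat the size of the gap as illustrative only.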


My hometown had three high schools – the local coed public high school (where I went), and two private Catholic high schools, one for boys and one for girls. People involved with the private high schools liked to brag about the high scores their students earned on standardized tests – without bothering to mention that you had to score well on such a test to get into them in the first place. This is, as I’ve said before, akin to having a height requirement for your school and then bragging about how tall your student body is. And of course, there’s another set of screens involved here that also powerfully shape outcomes: private schools cost a lot of money, and so students who can’t afford to attend are screened out. Students from lower socioeconomic backgrounds have consistently lower performance on a broad variety of metrics, and so private schools are again advantaged in comparison to public. To draw conclusions about educational quality from student outcomes without rigorous attempts to control for differences in which students are sorted into which schools, programs, or pedagogies – without randomization – is to ensure that you’ll draw unjustified conclusions.

Here’s an image that I often use to illustrate a far broader set of realities in education. It’s a regression analysis showing institutional averages for the Collegiate Learning Assessment, a standardized test of college learning and the subject of my dissertation. Each dot is a college’s average score. The blue dots are average scores for freshmen; the red dots, for seniors. The gap between the red and blue dots shows the degree of learning going on in this data set, which is robust for essentially all institutions. The very strong relationship between SAT scores and CLA scores shows the extent to which different incoming student populations – the inherent, powerful selection bias of the college admissions process – determine different test outcomes. (Note that very similar relationships are observed in similar tests such as ETS’s Proficiency Profile.) To blame educators at a school on the left hand side of the regression for failing to match the schools on the right hand side of the graphic is to punish them for differences in the prerequisite ability of their students.

Harvard students have remarkable post-collegiate outcomes, academically and professionally. But then, Harvard invests millions of dollars carefully managing their incoming student bodies. The truth is most Harvard students are going to be fine wherever they go, and so our assumptions about the quality of Harvard’s education itself are called into question. Or consider exclusive public high schools like New York’s Stuyvesant, a remarkably competitive institution where the city’s best and brightest students compete to enroll, thanks to the great educational benefits of attending. After all, the alumni of high schools such as Stuyvesant are a veritable Who’s Who of high achievers and success stories; those schools must be of unusually high quality. Except that attending those high schools simply doesn’t matter in terms of conventional educational outcomes. When you look at the edge cases – when you restrict your analysis to those students who are among the last let into such schools and those who are among the last left out – you find no statistically meaningful differences between them. Of course, when you have a mechanism in place to screen out all of the students with the biggest disadvantages, you end up with an impressive-looking set of alumni. The admissions procedures at these schools don’t determine which students get the benefit of a better education; the perception of a better education is itself an artifact of the admissions procedure. The screening mechanism is the educational mechanism.
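
The logic of that edge-case comparison can be sketched with a small simulation. Everything here is hypothetical: the ability, entrance-exam, and outcome numbers are invented, and the school is built to have no causal effect at all, since outcomes are generated the same way whether or not a student is admitted. The real studies use far more careful methods; this only shows why comparing students near the cutoff is informative.

```python
import random
from statistics import mean


def compare_at_cutoff(n_students=50_000, cutoff=1.0, band=0.1, seed=0):
    """Simulate a selective school that has zero effect on outcomes.

    Returns (gap_all, gap_edge): the admitted-minus-rejected difference
    in mean outcome across all students, and across only those students
    whose entrance score is within `band` of the cutoff.
    """
    rng = random.Random(seed)
    students = []
    for _ in range(n_students):
        ability = rng.gauss(0.0, 1.0)
        entrance = ability + rng.gauss(0.0, 0.5)  # noisy entrance exam
        outcome = ability + rng.gauss(0.0, 0.5)   # no school effect added
        students.append((entrance, outcome))

    def gap(rows):
        admitted = [o for e, o in rows if e >= cutoff]
        rejected = [o for e, o in rows if e < cutoff]
        return mean(admitted) - mean(rejected)

    # The last students let in and the last students left out.
    near_cutoff = [(e, o) for e, o in students if abs(e - cutoff) <= band]
    return gap(students), gap(near_cutoff)


if __name__ == "__main__":
    gap_all, gap_edge = compare_at_cutoff()
    print(f"gap, all students:      {gap_all:.2f}")
    print(f"gap, near the cutoff:   {gap_edge:.2f}")
```

In this toy setup the raw gap between admitted and rejected students is large, while the gap among students within a narrow band of the cutoff is far smaller, which is what you would expect if the screening rather than the schooling were doing the work.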

Thinking about selection bias compels us to consider our perceptions of educational cause and effect in general. A common complaint of liberal education reformers is that students who face consistent achievement gaps, such as poor minority students, suffer because they are systematically excluded from the best schools, screened out by high housing prices in these affluent, white districts. But what if this confuses cause and effect? Isn’t it more likely that we perceive those districts to be the best precisely because they effectively exclude students who suffer under the burdens of racial discrimination and poverty? Of course schools look good when, through geography and policy, they are responsible for educating only those students who receive the greatest socioeconomic advantages our society provides. But this reversal of perceived cause and effect is almost entirely absent from education talk, in either liberal or conservative media.

Immigrant students in American schools outperform their domestic peers, and the reason is about culture and attitude, the immigrant’s willingness to strive and persevere, right? Nah. Selection bias. So-called alternative charters have helped struggling districts turn it around, right? Not really; they’ve just artificially created selection bias. At Purdue, where there is a large Chinese student population, I always chuckled to hear domestic students say “Chinese people are all so rich!” It didn’t seem to occur to them that attending a school that costs better than $40,000 a year for international students acted as a natural screen to exclude the vast number of Chinese people who live in deep poverty. And I had to remind myself that my 8:00 AM writing classes weren’t going so much better than my 2:00 PM classes because I was somehow a better teacher in the mornings, but because the students who would sign up for an 8:00 AM class were probably the most motivated and prepared. There’s plenty of detailed work by people who know more than I do about the actual statistical impact of these issues and how to correct for them. But we all need to be aware of how deeply unequal populations influence our perceptions of educational quality.

Selection bias hides everywhere in education. Sometimes, in fact, it is deliberately hidden in education. A few years ago, Reuters undertook an exhaustive investigation of the ways that charter schools deliberately exclude the hardest-to-educate students, despite the fact that most are ostensibly required to accept all kinds of students, as public schools are bound to. For all the talk of charters as some sort of revolution in effective public schooling, what we find is that charter administrators work feverishly to tip the scales, finding all kinds of crafty ways to ensure that they don’t have to educate the hardest students to educate. And even when we look past all of the dirty tricks they use – like, say, requiring parents to attend meetings held at specific times when most working parents can’t – there are all sorts of ways in which students are assigned to charter schools non-randomly and in ways that advantage those schools. Excluding students with cognitive and developmental disabilities is a notorious example. (Despite what many people presume, a majority of students with special needs take state-mandated standardized tests and are included in data like graduation rates, in most locales.) Simply the fact that parents typically have to opt in to charter school lotteries for their students to attend functions as a screening mechanism.

Large-scale studies of charter efficacy such as Stanford’s CREDO project argue confidently that they have controlled for the enormous number of potential screening mechanisms that hide in large-scale education research. These researchers are among the best in the world and I don’t mean to disparage their work. But given the enormity of the stakes and the truth of Campbell’s Law, I have to report that I remain skeptical that we have truly ever controlled effectively for all the ways that schools and their leaders cook the books and achieve non-random student populations. Given that random assignment to condition is the single most essential aspect of responsible social scientific study, I think caution is warranted. And as I’ll discuss in a post in the future, the observed impact of school quality on student outcomes in those cases where we have the most confidence in the truly random assignment to condition is not encouraging.

I find it’s nearly impossible to get people to think about selection bias when they consider schools and their quality. Parents look at a private school and say, look, all these kids are doing so well, I’ll send my troubled child and he’ll do well, too. They look at the army of strivers marching out of Stanford with their diplomas held high and say, boy, that’s a great school. And they look at the Harlem Children’s Zone schools and celebrate their outcome metrics, without pausing to consider that it’s a lot easier to get those outcomes when you’re constantly expelling the students most predisposed to fail. But we need to look deeper and recognize these dynamics if we want to evaluate the use of scarce educational resources fairly and effectively.

Tell me how your students are getting assigned to your school, and I can predict your outcomes – not perfectly, but well enough that it calls into question many of our core presumptions about how education works.

welcome to the ANOVA

Hi there, my name is Freddie deBoer. I’ve been blogging off and on since 2008. I’ve also written for many newspapers, magazines, and websites. (You can see some of my published writing by clicking the My Work tab above.) In my professional life, I work at Brooklyn College in the City University of New York in the Office of Academic Assessment, where I work with faculty to help them develop and implement faculty-led assessments of student learning, and as coordinator of the Writing Across the Curriculum program. This project, a new blog called the ANOVA, is designed to combine those two parts of my life while narrowing and focusing my engagement.

The ANOVA will be about education research and education policy. That way, I can continue to work and research in education in my professional life, and take the reading and engagement I’m doing and make them useful for a popular audience. I will discuss major trends in education, legislation and federal policy related to education, new and existing research in the field, and the philosophy and purpose of education. I expect I will post 3-4 times a week. One of these posts will be a Study of the Week, where I look at a prominent, problematic, or interesting research study in education, whether old or new, discussing the findings and what they mean for the broader world.

I will be attempting to monetize this blog through Patreon, so please consider pledging to support this project financially. Those who contribute $5 a month or more will get access to a weekly book review. If the amount of contributions exceeds my expectations, I will think of other ways to reward patrons. You can also make a one-time donation on PayPal.

Why “the ANOVA”? Because the term, which stands for Analysis Of VAriance, refers to a statistical technique commonly used in education research; because the attempt to define how variance in educational outcomes is determined by predictor variables is perhaps the essential question in quantitative study of education; and because it’s a beautiful word.

I will not avoid talking about the political dimensions of education. Education is an inherently political topic. However, this will not be a political blog and will feature no political writing that is not narrowly focused on education. I will not, for example, weigh in on the campus political wars in this space. When in doubt, I will err on the side of not engaging if a subject is not clearly directly concerned with education. Please bear that in mind if you’re thinking about contributing. It should go without saying that this project will not be affiliated with or endorsed by Brooklyn College in any way, and that I will not be working on it during my regular work hours.

I’ve gotten a lot out of writing online, but it has had downsides, especially concerning people targeting my employment. Online politics are not good for my mental well-being. As someone with poor impulse control and bipolar disorder, it’s best to limit my political engagement in digital mediums that favor immediacy over thoughtfulness. I also have found much better ways to utilize my political energy in recent months. Since moving to New York I’ve gotten involved in my own union, in a tenants’ union, and in local education politics, along with attending many protests. This has been wonderful for my mood and sense of political purpose. Online politics leave me discouraged and unhappy; offline politics make me hopeful and energized. So I intend to keep my political engagement squarely offline.

This is a modest project with modest goals. I want an outlet where I can write for a small audience of interested people and share a little of my expertise and my opinions. I’m hoping to carve out a niche where I can engage productively and professionally about topics related to my expertise and which I am passionate about. I hope you join me.