(Actual) Study of the Week: Academic Outcomes for Preemies

Now back to our regularly scheduled programming….

There’s a lurking danger in the “nature vs nurture” debate that has been so prominent in educational research for so long: people tend to assume that genetic influence means that something is immutable, while environmental influences are assumed to be changeable. The former is not correct, at least in the sense that there are a lot of genetically influenced traits that can be altered or ameliorated – all manner of physical skills, for example, are subject to the impact of exercise, even while we acknowledge that at the top of the distribution tiers, natural/genetic talents play a big role. Likewise, we can believe in educational efforts that somewhat ameliorate genetic influences even while we recognize that biological parentage powerfully shapes intellectual outcomes.

The obverse is even more often forgotten: just because an influence is environmental in nature, that does not mean we can necessarily change its effects. Lead exposure, for example, leads to relatively small but persistent damage to cognitive function. This is certainly environmental influence, but not one that we have tools to ameliorate. I’m not quite sure if we would call neonatal development “environmental,” but influences on children in the womb are a good example of non-genetic influences that are potentially immutable. And they are also another lens through which I want us to consider our tangled, frequently-contradictory intuitions about academic performance and just deserts.

Today’s Study, written by the exceptionally-Dutch-named Cornelieke Sandrine Hanan Aarnoudse-Moens, Nynke Weisglas-Kuperus,  Johannes Bernard van Goudoever, and Jaap Oosterlaan, is a meta-analysis of extant research on the academic outcomes of children who were born very prematurely and/or at very low birth weight. (For an overview of meta-analysis and effect size, please see this post.)

The studies had a number of restrictions in addition to typical quality checks. First, the studies considered had to look at very premature births, defined as less than 33 weeks gestation and/or with very low birth weight, defined as less than 1500 grams. Additionally, for inclusion in the meta-analysis, the studies had to track student performance to at least age 5, as this is where formal schooling begins and where responsible analysis of academic outcomes can be considered. These studies reported on academic outcomes, behavioral outcomes as represented by teacher and parent observation checklists/surveys, and so-called executive functioning variables, which include things like impulse control and ability to plan (and which have been pretty trendy). All in all, data from 14 studies on academic outcomes, 9 on behavioral outcomes, and 6 on executive functioning were considered. (There was some overlap.) In total, 4125 very preterm and/or very low birth weight children were compared to 3197 children born at term. The authors performed standard meta-analytic procedures, which involved pooling SDs and weighting by sample size, and reported effect sizes in good old Cohen’s d.
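For readers who want to see what "pooling SDs and weighting by sample size" looks like in practice, here is a minimal sketch in Python. It is not the authors' actual procedure or data: the study numbers below are invented, the function names are my own, and many published meta-analyses weight by inverse variance rather than by raw sample size.

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference between two groups, using the pooled SD."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

def weighted_mean_effect(effects, sizes):
    """Average the effect sizes, weighting each study by its total sample size."""
    return sum(d * n for d, n in zip(effects, sizes)) / sum(sizes)

# Hypothetical studies (invented numbers, not from the meta-analysis):
# (preterm mean, preterm SD, preterm n, term mean, term SD, term n)
studies = [
    (92.0, 15.0, 100, 100.0, 15.0, 100),
    (95.0, 14.0, 60, 101.0, 16.0, 50),
]

effects = [cohens_d(*study) for study in studies]
sizes = [study[2] + study[5] for study in studies]
overall = weighted_mean_effect(effects, sizes)

print("per-study d:", [round(d, 2) for d in effects])
print("weighted overall d:", round(overall, 2))
```

A negative d here means the preterm group scored lower than the term group, measured in units of the pooled standard deviation.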

They also used a couple of statistical tests to attempt to adjust for publication bias. Publication bias is a troubling aspect of research studies that can undermine meta-analysis, which is particularly problematic given that meta-analysis is often viewed as a way to ameliorate (never eliminate) other problems like p-value hacking or similar. Publication bias refers to the fact that journals are much more likely to publish studies with significant effects than those without them. This has several bad outcomes – for one, it provides perverse incentives for academics trying to get jobs and tenure. But it also distorts our view of reality. We adjust for the various issues with individual studies, in part, by looking at a broad swath of research literature. But if the non-significant results are sitting in a drawer while significant results are in Google Scholar, that’s not going to help, even with meta-analysis.

The results are not particularly surprising, but are sad all the same: children born very prematurely and/or at very low birth weight have persistently worse academic outcomes compared to similar children. In terms of academic outcomes, we’re talking about -0.48 SD for reading, -0.60 SD for mathematics, and -0.76 SD for spelling. These are, in the context of educational research, large effects. There was some variation between studies, as is to be expected in any meta-analysis, but this variation was not large enough to undermine our confidence in these results. Checks for publication bias came up largely clean as well. There are also findings indicating that children born premature have problems with attention, verbal fluency, and working memory. These effect sizes had no meaningful relationship to the age of assessment, suggesting that these problems are persistent. With a few exceptions, these relationships are continuous – that is, children with lower gestational ages and birth weight are generally worse off in terms of outcomes even when compared to other children born prematurely and/or at low birth weight.

First, this is very important to say: the studies included in this meta-analysis represent averages. We live in a world of variability. There are certainly many children who are born severely prematurely and go on to academic excellence. It would be wrong to assume that these influences indicate a certain academic destiny, as it is for any variable we examine in educational research. The trends, however, are clear. Sadly, other research suggests that these problems are likely to extend into at least young adulthood.

What are some of the consequences here? Well, to begin with, I think it’s another important facet of how we think about educational outcomes and how much of those outcomes lie outside of the hands of students, parents, and teachers. No one has chosen this outcome. For another thing, there’s the breaking of the nature/nurture binary I pointed out above. This is a non-genetic but uncontrolled introduction of major influence into the educational outcomes of children. I don’t mean to be fatalistic about things;  there’s always a chance that we’ll find some interventions that help to close these gaps. But I think this is another reason for us to get outside of a moralistic framework for education, where every below-average outcome has to be the fault of someone – the parents, the teachers, or the student themselves. 

And again, I think this points in the direction of a societal need to expand our definition of what it means to be a good student, and through that, what it means to be a valuable human being. True, very early births are comparatively rare, though almost 10% of all American births are preterm. (Like seemingly everything else in the United States, preterm birth rates are influenced by race, class, and geography.) But this dynamic is just another data point in a large set of evidence that suggests that academic outcomes are largely outside of the hands of individuals, parents, and teachers, particularly if we recognize that genetic influence is not controlled by those groups. What’s interesting with premature babies is that I doubt anyone would think that they somehow deserve worse life outcomes as a result of their academic struggles. Who could be so callous? And yet when it comes to genetic gifts – which are just as uncontrolled by individuals as being born prematurely – there are many who think it’s fine to disproportionately hand out reward. I don’t get that.

Ultimately, rather than continuing to engage in a quixotic policy agenda designed to give every child the exact same odds of being a Stanford-trained computer scientist, we should recognize as a society that we will always have a range of academic outcomes, that this means we will always have people who struggle as well as excel, and that to a large extent these outcomes are not controlled by individuals. Therefore we should build a robust social safety net to protect people who are not fortunate enough to be academically gifted, and we should critique the Cult of Smart, recognizing that there are all manner of ways to be valuable human beings.

Study of the Week: Are Public Dollars Better Spent on Prisons Than On Schools?

We’re in the middle of a criminal justice reform movement that has demanded change in all manner of ways. Increasingly, activists have tied our criminal justice system into the broader social context, arguing that we must address our incarceration problem by concentrating on our education system and our economy as well. Only by looking at the entire life cycle can we begin to reduce our prison population. Key to this argument is the question of scarce resources, namely tax dollars. Activists point out that it is far cheaper to put someone through the school system than it is to imprison them in the penal system – inviting the question of what we actually value in this country.

But what if that’s wrong? What if, in fact, when it comes to educational outcomes and overall quality of life of our children, public money is better spent on prisons than on schools? What if you could raise test scores not by spending more on schools, but on spending more on prisons? And should progressives change their political commitments in response to this new evidence?

Those are questions raised by a provocative new study from the University of Mississippi. As anyone with experience in education research knows, Mississippi suffers from truly discouraging education metrics, with some of the worst test scores, high school graduation rates, and college attendance numbers in the country. Researchers at Ole Miss wanted to take a fresh look at just what could be done to improve this condition – and weren’t afraid to court controversy in doing so. They conducted a discriminant function analysis on a large data set of Mississippi residents as they moved through the school system, the employment world, and/or the penal system. They then compared various life outcomes like tendency to commit a felony or college graduation percentage to state expenditures on both the justice system and the school system. Thanks to the power of quadratic regression, they were able to decompose out these various inputs to determine what actually improved academic outcomes more, dollars spent on education or dollars spent on prisons. And the results may upend a lot of progressive assumptions about how best to build a healthy society.

Figure 1. Scatterplot of data

OK, I’ll level with you. There is no Mississippi study. This post is a con, a ruse. It’s a trap. People keep commenting on my work – getting into long debates with me and others, sharing it on social media, referring to it later on to prove some other point – when they clearly have not read the piece that they’re talking about. People are utterly shameless and unapologetic about it. I get paragraphs-long email responses from people that make assertions about what I’ve said that are directly and unambiguously contradicted by the text. My Facebook is a daily exercise in despair as I wade through response after response that demonstrates they maybe read the first couple paragraphs. So I’m out to snooker some of them, in the hopes that they will be shamed and change their ways in the future.

And I genuinely apologize to my serious readers for this. I really do. But I am exhausted and spent by the culture of not-reading that has deepened online. I know people are always skeptical of arguments that such conditions are getting worse – hey, people have never read! – but I just can’t go along with that. In my time writing online, which is approaching 10 years now, I’ve never seen anything like the current moment when it comes to the utter collapse of any communal expectation that people will read the work they’re commenting on.

I will read an interesting article, want to see what people are saying about it, pop the link in the Twitter search bar, and I will be absolutely amazed at what % of the reactions demonstrate that the people talking about it haven’t actually read the piece. You will see conversations about various essays that go on for dozens and dozens of exchanges where it is glaringly clear that not one person in the conversation actually has a grasp of what the essay says. And these aren’t just randoms, either, but usually writers themselves, people who have built careers producing text. Go to any event where established people give young writers advice and they always say, you have to read to write! But my impression is that many, many professional writers don’t.

I get that there are structural reasons that professional writers don’t read. I get that it’s not all a character or integrity issue. I get that the modern media economy forces people to be producing at a pace that makes reading enough difficult. I’m not unsympathetic. But at some point people have to make the personal decision to say “I’m not going to comment on something I haven’t read.”

I meet people IRL who know me from writing a lot more often, now that I live in New York. And sometimes there’s tension. I’ll be introduced by a friend of a friend to someone who is sure they don’t like me. If I get the chance, I’ll eventually try to tease out which of my opinions they reject. Likewise, I sometimes challenge people on social media or in my email to list their actual grievances, to tell me what I believe that is so objectionable. Often enough – maybe a majority of the time – it will turn out that they are mad at me about something I don’t believe and have never said. I am fine with being controversial or personally disliked for what I actually think and have actually said. But at present my online reputation has almost nothing to do with me or my actual beliefs, because no one online reads anything.

Read. If you’re going to engage with writing, read it first. If you are a private citizen who is not interested in reading and would like to be left alone, go with God. But if you are someone who regularly engages with writing, who comments or shares or writes response essays or generally takes part in the public conversation about what people have written, then I am literally begging you to read what you talk about. I entreat you. I implore you. I beseech you. Read. If you are a writer, read. If you want to be politically involved, read. If you are an activist, read. If you intend to change the world, read. If you are going to comment on writing publicly, read. If you are prepared to be offended by something, read. If you are looking to be convinced, read. If you are already convinced, read. Please. Please read. Please please please. Please read.

This is a fishing expedition. I apologize for taking some of you along with me. But I’m about to catch a lot of fish. And hey, if you see someone responding to this without reading it, screencap it for me.

two concepts about sampling that were tricky for me

Here’s a couple related points about statistics that it took me a long time to grasp, and which really improved my intuitive understanding of statistical arguments.

1. It’s not the sample size, it’s the sampling mechanism. 

Well, OK. It’s somewhat the sample size, obviously. My point is that most people who encounter a study’s methodology are much more likely to remark on the sample size – and pronounce it too small – than to remark on the sampling mechanism. I can’t tell you how often I’ve seen studies with an n = 100 that have been dismissed by commenters online as too small to take seriously. Depending on the design of the study, and the variables being evaluated, 100 can be a good-enough sample size. In fact, under certain circumstances (medical testing of rare conditions, say) an n of 30 is sufficient to draw some conclusions about populations.

We can’t say with 100% accuracy what a population’s average for a given trait is when we use inferential statistics. (We actually can’t say that with 100% accuracy even when taking a census, but that’s another discussion.) But we can say with a chosen level of confidence that the average lies in a particular range, which can often be quite small, and from which we can make predictions of remarkable accuracy – provided the sampling mechanism was adequately random. By random, we mean that every member of the population has an equivalent chance of being selected for the sample. If there are factors that make one group more or less likely to be selected for the sample, that is statistical bias (as opposed to statistical error).

It’s important to understand the declining influence of sample size in reducing statistical error as sample size grows. Because calculating confidence intervals and margins of error involves placing the n under a square root sign, the payoff from each additional observation shrinks as n grows: error falls in proportion to 1/√n, so you have to quadruple the sample to cut the margin of error in half. Here’s the formula for margin of error:

Z* σ/√(n) 

where Z is a Z-value that you look up in a chart for a given confidence level (often 95% or 99%), σ is the standard deviation, and n is your number of observations. You see two clear things here: first, spread (standard deviation) is super important to how confident we can be about the accuracy of an average. (Report spread when reporting an average!) Second, we get declining improvements to accuracy as we increase sample size. That means that after a point, adding hundreds more observations gets you less power than you got from adding 10 at lower ns. Given the resources involved in data collection, this can make expanding sample size a low-value proposition.
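To make the diminishing returns concrete, here is a small Python sketch of the margin-of-error formula above. The standard deviation of 15 and the particular sample sizes are assumptions picked for illustration; 1.96 is the usual Z-value for 95% confidence.

```python
import math

def margin_of_error(z, sigma, n):
    """Margin of error for a sample mean: Z * sigma / sqrt(n)."""
    return z * sigma / math.sqrt(n)

Z_95 = 1.96    # Z-value for a 95% confidence level
SIGMA = 15.0   # assumed standard deviation, for illustration only

for n in (30, 100, 400, 1000, 10000):
    print(n, round(margin_of_error(Z_95, SIGMA, n), 2))

# Improvement from adding 10 observations to a sample of 30 ...
gain_small = margin_of_error(Z_95, SIGMA, 30) - margin_of_error(Z_95, SIGMA, 40)
# ... versus adding 300 observations to a sample of 1000.
gain_large = margin_of_error(Z_95, SIGMA, 1000) - margin_of_error(Z_95, SIGMA, 1300)

print("gain from 30 -> 40:", round(gain_small, 3))
print("gain from 1000 -> 1300:", round(gain_large, 3))
```

Under these assumptions, ten extra observations at n = 30 tighten the estimate more than three hundred extra observations do at n = 1000.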

Now compare a rigorously controlled study with an n = 30 which was drawn with a random sampling mechanism to, say, those surveys that ESPN.com used to run all the time. Those very often get sample sizes in the hundreds of thousands. But the sampling mechanism is a nightmare. They’re voluntary response instruments that are biased in any number of ways: underrepresenting people without internet access, people who aren’t interested in sports, people who go to SI.com instead of ESPN.com, on and on. The value of the 30 person instrument is far higher than that of the ESPN.com data. The sampling mechanism makes the sample size irrelevant.
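Here is a toy simulation of that comparison in Python. Everything in it is invented for illustration: a population split exactly 50/50 on some yes/no opinion, and a voluntary-response poll in which people holding one opinion are three times as likely to respond. It is a sketch of the general idea, not a model of any real ESPN.com poll.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: exactly half hold opinion 1, half hold opinion 0.
population = [1] * 50_000 + [0] * 50_000
true_share = sum(population) / len(population)

# Voluntary-response poll: assume opinion-1 holders respond 30% of the time
# and opinion-0 holders only 10% of the time (invented rates).
RESPONSE_RATE = {1: 0.30, 0: 0.10}
volunteers = [p for p in population if random.random() < RESPONSE_RATE[p]]
volunteer_error = abs(sum(volunteers) / len(volunteers) - true_share)

# Simple random samples of 30, repeated many times to see the typical error.
TRIALS = 500
random_errors = []
for _ in range(TRIALS):
    sample = random.sample(population, 30)
    random_errors.append(abs(sum(sample) / 30 - true_share))
typical_random_error = sum(random_errors) / TRIALS

print("volunteer sample size:", len(volunteers))
print("volunteer poll error:", round(volunteer_error, 3))
print("typical error, random n=30:", round(typical_random_error, 3))
```

The volunteer poll collects thousands of responses and still lands far from the true share, because more responses only give you a more precise estimate of the wrong number. The small random samples are noisy, but they are centered on the truth.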

Sample size does matter, but in common discussions of statistics, its importance is misunderstood, and the value of increasing sample size declines as it grows.

2. For any reasonable definition of a sample, population size relative to sample size is irrelevant for the statistical precision of findings.

A 1,000 person sample, if drawn with some sort of rigorous random sampling mechanism, is exactly as descriptive and predictive when drawn randomly from the ~570,000 person population of Wyoming as it is when drawn randomly from the ~315 million person population of the United States. (If intended as samples of people in Wyoming and people in the United States respectively, of course.)

I have found this one very hard to wrap my mind around, but it’s the case. The formulas for margin of error, confidence intervals, and the like do not involve any reference to the size of the total population. You can think about it this way: each time you pull an observation at random from some population, the odds of your sample being unlike the population go down regardless of the size of that population. The mistake lies in thinking that the point of increasing sample size lies in making it closer in proportion to population. In reality, the point is just to increase the number of attempts in order to reduce the possibility that previous attempts produced statistically unlikely results. Even if you had an infinite population, every time you draw another observation from that population you would be decreasing the chance that you’re randomly pulling an unrepresentative sample.

The essential caveat lies in “for any reasonable definition of a sample.” Yes, testing 900 out of a population of 1000 is more accurate than testing 900 out of a population of 1,000,000. But nobody would ever call 90% of a population a sample. You see different thresholds for where a sample begins and ends; some people say that anything larger than 1/100th of the total population is no longer a sample, but it varies. The point holds: when we’re dealing with real-world samples, where the population we care about is vastly larger than any reasonable sample size, the population size is irrelevant to the calculation of error in our statistical inferences. This one is quite counterintuitive and took me a long time to really grasp.
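One way to check this numerically is with the finite population correction, the standard textbook adjustment for sampling without replacement from a population of known size. The post doesn't name it, so treat this Python sketch as my own illustration: the function names are hypothetical, and σ = 0.5 is the worst-case standard deviation for a yes/no proportion.

```python
import math

def fpc(population_size, sample_size):
    """Finite population correction: sqrt((N - n) / (N - 1))."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))

def margin_of_error(z, sigma, n, population_size=None):
    """Z * sigma / sqrt(n), optionally scaled by the finite population correction."""
    moe = z * sigma / math.sqrt(n)
    if population_size is not None:
        moe *= fpc(population_size, n)
    return moe

Z_95 = 1.96
SIGMA = 0.5  # worst-case SD for a proportion (assumption for illustration)

wyoming = margin_of_error(Z_95, SIGMA, 1000, population_size=570_000)
usa = margin_of_error(Z_95, SIGMA, 1000, population_size=315_000_000)
uncorrected = margin_of_error(Z_95, SIGMA, 1000)

print("Wyoming:", wyoming)
print("United States:", usa)
print("no correction:", uncorrected)

# The caveat case: 900 drawn from 1,000 versus 900 drawn from 1,000,000.
print("900 of 1,000:", margin_of_error(Z_95, SIGMA, 900, population_size=1_000))
print("900 of 1,000,000:", margin_of_error(Z_95, SIGMA, 900, population_size=1_000_000))
```

For the 1,000-person samples the correction factor is so close to 1 that the Wyoming and United States margins of error are nearly identical. It only starts to matter when the sample is a large fraction of the population, as in the 900-out-of-1,000 case.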

genetic behaviorism supports the influence of chance on life outcomes


I’ve been trying, in this space, to rehabilitate the modern science of genetic influence on individual variation in academic outcomes to progressives. Many left-leaning people have perfectly reasonable fears about this line of inquiry, as in the past similar-sounding arguments have been used to justify eugenics, while in the present many racists make pseudoscientific arguments based on similar evidence to justify their bigotry. Like others, I am interested in showing that there are progressive ways to understand genetic behaviorism that reject racism and which support, rather than undermine, redistributive visions of social justice.

I can’t deny, though, that there are many regressive ways to make these arguments. That’s particularly true given that there’s a large overlap in the Venn diagram of IQ determinists and economic libertarians. I want to take a moment and demonstrate how conservatives misread and misuse genetic behaviorism to advance their ideological preferences for free market economics.

In this post, Ben Southwood of the conservative Adam Smith Institute uses evidence from genetic behaviorism and education research to argue that luck really doesn’t play much of a role in life outcomes. To prove this point, he cites many high-quality studies showing that random assignment (or last in/last out models) to schools of supposedly differing quality has little impact on student academic outcomes. He argues that our understanding of genetic influence on intelligence should influence our perception of how much schools can really do to help struggling students. This is, in general, a line of thinking that fits with my own. But he makes a leap into then suggesting that what we call luck (let’s say the uncontrolled vicissitudes of chance and circumstance that are beyond the control of the individual) has little or nothing to do with life outcomes. He does so because this presumably lends credence to libertarian economics, which are based on a just deserts model – the notion that the market economy basically rewards and punishes people in line with their own merit. This leap is totally unsupportable and is undermined by the very evidence he points to.

To begin with, Southwood ignores a particularly inconvenient fact for his brand of conservative determinism: the large portion of unaccounted-for variation in IQ and academic outcomes even when accounting for genetics and the shared environment (code for the portion of the environment in a child’s life controlled by parents and the family). There is famously (or notoriously) a portion of variation in measurable psychological outcomes that we can’t explain, a large portion – as much as half of the variation, maybe, depending on what study you’re looking at. And this portion seems unlikely to be explainable in systematic terms. Plomin and Daniels called this the “gloomy prospect,” writing

One gloomy prospect is that the salient environment might be unsystematic, idiosyncratic, or serendipitous events such as accidents, illnesses, or other traumas . . . . Such capricious events are likely to prove a dead end for research.

Turkheimer wrote recently:

scientific study of  the nonshared environment and molecular aspects of the genome have proven much harder than anyone anticipated.  But I still feel bad about harping on it, as though I am spoiling the good vibes of hardworking scientists, who are naturally optimistic about the work they are conducting.  But ever since I was in graduate school, I have felt that biogenetic science has always oversold their contribution, tried to convince everyone that the next new method is going to be the one that finally turns psychology into a real natural science, drags our understanding of ourselves out of the humanistic muck.  But it never actually happens.

The gloomy prospect, in other words, represents exactly the influence of what we usually refer to as luck. Southwood claims that genetics explains perhaps .90 of the variation in adulthood, but this represents extreme upper bound predictions for that influence. Most of the literature suggests significantly more modest heritability estimates than that. So we are left with this big uncontrolled portion, which as Turkheimer says has proven resistant to systematic understanding and which likely reflects truly idiosyncratic and individual impacts on the lives of individuals. Unfortunately for progressives who want to dramatically improve educational outcomes by changing the home environment of children, quality studies consistently find that the impact of changes to that environment is minor. Unfortunately for Southwood, the unexplained portion of academic outcomes (and subsequent economic outcomes) looks precisely like chance, or at least, that which is uncontrolled by either the individual or his or her parents. The last line of his post is thus totally unsupported by the evidence.

But there’s an even bigger issue for Southwood here: no one is in control of their own genotype. It’s bizarre when conservative-leaning people endorse genetic determinism as a justification for just-deserts economic theories. Genetic influence on human behavior stands directly in contrast to the notion that we control our own destinies. How then can Southwood advance a vision of free market economics as a system in which reward is parceled out fairly, given that the distribution of genetic material between individuals is entirely outside of their control? Which genetic code you happen to be born with is a lottery. I happen to not have gotten a scratch off ticket that allows me to have been an NFL player or a research physicist. That’s not a tragedy because I am still able to secure my basic material needs and comforts. But not everyone is so lucky, and for many the free market will result only in suffering and hopelessness.

It is immoral, and irrational, to build a society in which conditions you do not choose dictate whether you live rich and prosperous or poor and hopeless. That is true if this inequality is caused by inheriting money from your rich parents or by inheriting their genes or by being deeply influenced by the vagaries of chance. The best, most rational system in a world of uncontrolled variation in outcomes is a system that guarantees a standard of living even under the worst of luck – that is, socialism.

Correction: Southwood has taken considerable umbrage to this post, which he expressed in a dozen-tweet missive and Medium post. You should read that. I concede that I was uncharitable in characterizing how he talks about luck, and I recognize that he sees luck as impacting life events. I do not agree with his claim that path dependence and luck do not contribute to life outcomes, and it’s weird that his post title alludes to Gregory Clark’s The Son Also Rises, which demonstrates that wealth benefits from inheritance can persist for far longer than traditionally thought. But that’s immaterial to the question of whether I accurately reflected Southwood’s position on luck and redistribution. So consider this an apology. I should have spoken more carefully and read more charitably and for that I’m sorry.

As for the 90% of variance figure, my wording “perhaps .90” is an accurate reading of a presentation of a range, and I don’t withdraw it. If anyone objects, I am happy to tutor them in reading, for a fee.

Study of the Week: We’ll Only Scale Up the Good Ones

When it comes to education research and public policy, scale is the name of the game.

Does pre-K work? Left-leaning people (that is, people who generally share my politics) tend to be strong advocates of these programs. It’s true that generically, it’s easier to get meaningful educational benefits from interventions in early childhood than later in life. And pre-K proponents tend to cite some solid studies that show some gains relative to peer groups, though these gains are generally modest and tend to fade out over time. Unfortunately, while some of these studies have responsible designs, many that are still cited are old, from small programs, or both.

Today’s Study of the Week, by Mark W. Lipsey, Dale C. Farran, and Kerry G. Hofer, is a much-discussed, controversial study from Tennessee’s Voluntary Prekindergarten Program. The Vanderbilt University researchers investigated the academic and social impacts of the state’s pre-K programs on student outcomes. The study we’re looking at is a randomized experimental design, which was pulled from a larger observational study. The Tennessee program, in some locales, had more applicants than available seats. These seats are filled by a random lottery, creating a natural control and experimental group.

There is one important caveat here: the students examined in the intensive portion of the research had to be selected from those whose parents gave consent. That’s about a third of the potential students. This is a potential source of bias. While the randomized design will help, what we can responsibly say is that we have random selection within the group of students whose parents opted in, but with a nonrandom distribution relative to the overall group of students attending this program. I don’t think that’s a particularly serious problem, but it’s a source of potential selection bias and something to be aware of. There’s also my persistent question about the degree to which school selection lotteries can be gamed by parents and administrators. There are lots of examples of this happening. (Here’s one at a much-lauded magnet school in Connecticut.) Most people in the research field seem not to see this as a big concern. I don’t know.

In any event, the results of the research were not encouraging. Researchers examined six identified subtests (two language, two literacy, two math) from the Woodcock-Johnson tests of cognitive ability, a well-validated and widely-used battery of tests of student academic and intellectual skills. They also looked at a set of non-cognitive abilities related to behavior, socialization, and enthusiasm for school. A predictable pattern played out. Students who attended the Tennessee pre-K program saw short-term significant gains relative to their peers who did not attend the program. But over time, the peer group caught up, and in fact in this study, exceeded the test group. That is, students who attended Tennessee’s pre-K program ended up actually underperforming those who were not selected into it.

By the end of kindergarten, the control children had caught up to the TN‐VPK children and there were no longer significant differences between them on any achievement measures. The same result was obtained at the end of first grade using both composite achievement measures. In second grade, however, the groups began to diverge with the TN‐VPK children scoring lower than the control children on most of the measures….  In terms of behavioral effects, in the spring the first grade teachers reversed the fall kindergarten teacher ratings. First grade teachers rated the TN‐ VPK children as less well prepared for school, having poorer work skills in the classrooms, and feeling more negative about school.

This dispiriting outcome mimics that of the Head Start study, another much-discussed, controversial study that found similar outcomes: initial advantages for Head Start students that are lost entirely by 3rd grade.

Further study is needed, but it seems that the larger and more representative the study, the less impressive – and the less persistent – the gains from pre-K. There’s a bit of uncertainty here about whether the differences in outcomes are really the product of differences in programs or due to differences in the research itself. And I don’t pretend that this is a settled question. But it is important to recognize that the positive evidence for pre-K comes from smaller, higher-resource, more-intensive programs. Larger programs have far less encouraging outcomes.

The best guess, it seems to me, is that at scale universal pre-K programs would function more like the Tennessee system and less like the small, higher-performing programs. That’s because scaling up any major institutional venture, in a country the size of the United States, is going to entail the inevitable moderating effects of many repetitions. That is, you can build one school or one program and invest a lot of time, effort, and resources into making it as effective as possible, and potentially see significant gains relative to other schools. But it strikes me as a simple statement of the nature of reality that this intensity of effort and attention can’t scale. As Farran and Lipsey say in a Brookings Institution essay, “To assert that these same outcomes can be achieved at scale by pre-K programs that cost less and don’t look the same is unsupported by any available evidence.”

Some will immediately say, well, let’s just pay as much for large-scale pre-K as they do in the other programs and model their techniques. The $26 billion question is, can you actually do that? Can what makes these programs special actually be scaled? Is there hidden bias here that will wash out as we expand the programs? I confess I’m skeptical that we’ll see these quantitative gains under even the best scenario. I think we need to understand the inevitability of mediocrity and regression to the mean. That doesn’t mean I don’t support universal pre-kindergarten childcare. As with after school programs, I do, though for social and political reasons, not out of the conviction that they’ll change test scores much. I’d be happy to be proven wrong.

Now I don’t mean to extrapolate irresponsibly. But allow me to extrapolate irresponsibly: isn’t this precisely what we should expect with charter schools, too? We tend to see, survivorship-bias heavy CREDO studies aside, that at scale the median charter school does little or nothing to improve on traditional public schools. We also see a number of idiosyncratic, high-intensity, high-attention charters that report better outcomes. The question you have to ask, based on how the world works, is which is more likely to be replicated at scale – the median, or the exceptions?

I’ve made this point before about Donald Trump’s favorite charter schools, Success Academy here in New York. Let’s set aside questions of the abusive nature of the teaching that goes on in these schools. The basic charter proponent argument is that these schools succeed because they can fire bad teachers and replace them with good. Success Academy schools are notoriously high stress, long-hour, low pay affairs. This leads naturally to high teacher attrition. Luckily for the NYC-based Success Academy, New York is filled with lots of eager young people who want to get a foothold in the city, do some do-goodering, then bail for their “real” careers later on – essentially replicating the Teach for America model. So: even if we take all of the results from such programs at face value, do you think this is a situation that can be scaled up in places that are far less attractive to well-educated, striving young workers? Can you get that kind of churn and get the more talented candidates you say you need, at no higher cost, to come to the Ozarks or Flint, Michigan or the Native American reservations? Can you nationally have a profession of 3 million people, already caught in a teacher shortage, and then replicate conditions that lead to somewhere between 35%-50% annual turnover, depending on whose numbers you trust?

And am I really being too skeptical if my assumption is to say no, you can’t?


public services are not an ATM

Built into the rhetoric of school choice is a deeply misguided vision of how public investment works.

You sometimes hear people advocating for charters or voucher programs by saying that parents just want to take “their share” of public education funds and use it to get their child an education, whether by siphoning it from traditional public schools towards charters or by cutting checks to private schools. The “money should follow the child,” to use another euphemism. But this reflects a strange and deeply conservative vision of how public spending works. There is no “your share” of public funds. There is the money that we take via taxation from everyone which represents the pooled resources of civic society, and there is what civic society decides to spend it on via the democratic process. You might use that democratic process to create a system where some of the money goes to charter schools or private school vouchers or all manner of things I don’t approve of. But it’s not your money, no matter how much you paid into taxes. And the distinction matters.

To begin with, the constantly-repeated claim that charter schools don’t cost traditional public schools money is just proven wrong again and again. People lay out these theoretical systems where they don’t, like you can just subtract one student and all of the costs associated with that student and just shift the kid and the money to another school. But this reflects a basic failure to understand pooled costs and economies of scale. And when we go looking, that’s what we find: after years of promises that charters are not an effort to defund traditional public schools, our reality checks show they have that effect. Take Chicago, where the charter school system has absolutely contributed to the fiscal crisis in the traditional public schools. Or Nashville. Or Los Angeles. I could go on.

But suppose we knew that we could extract exactly as much, dollar for dollar and student for student, from public education for each student who leaves. Would that be a wise thing to do? Not according to any conventional progressive philosophy towards government.

Do we let you take “your share” out of the public transportation system so that you can use it to defray the cost of buying your own car? Can you take “your share” out of the police budgets to hire your own private security? Can I extract my tax dollars from the public highway system I almost never use in order to build my own bike lanes? Of course not. In many cases this simply wouldn’t make sense; how can you extract your share from a building, or a bridge, or any other type of physical infrastructure? And besides: the basic progressive nature of public ownership means that we are pooling resources so that those who have the least ability to pay for their own services can benefit from the contributions of those with the most ability to pay. To advance the notion of people pulling “their” tax dollars out from public schools undermines the very conception of shared social spending. And governmental spending should require true democratic accountability; letting the Bill and Melinda Gates Foundation dictate public education policy, Mark Zuckerberg become the wholly unqualified education czar of Newark, or the Catholic church control public education dollars through voucher programs directly undermines that accountability.

So of course there’s a deep and widening split opening up within the school reform coalition, which has always been filled with self-styled progressives. There’s a major, existential disagreement at play about the basic concepts of social spending and the public good. These have been papered over for years by the missionary zeal of choice acolytes and their crisis narrative. But there was never a coherent progressive political philosophy underneath. The Donald Trump and Betsy DeVos education platform is a disaster in the making, but at least it has brought these basic conflicts into the light. These issues are not going away, nor should they, and the “progressive” ed reform movement is going to have to do a lot of soul searching.

Reporting Regression Results Responsibly

We’re in a Golden Age for access to data, which unfortunately also means we’re in a Golden Age for the potential to misinterpret data. Though the absurdity of gated academic journals persists, academic research is more accessible now than ever before. We’ve also seen a rapid growth in the use of arguments based on statistics in the popular media in the last several years. This is potentially a real boon to our ability to understand the world around us, but it carries with it all of the potential for misleading statistical arguments.

My request is pretty simple. All statistical techniques, particularly the basic parametric statistical techniques that are most likely to show up in data journalism, require the satisfaction of assumptions and checking of diagnostic measures to ensure that hidden bias isn’t misleading us. Many of these assumptions and diagnostics are ultimately judgment calls, relying on practitioners to make informed decisions about what degree of wiggle room is appropriate given the research scenario. There are, however, conventions and implied standards that people can use to guide their decisions. The most important and useful kind of check, though, is the eyes of other researchers. Given that the ability to host graphs, tables, and similar kinds of data online is simple and nearly free, I think that researchers and data journalists alike should provide links to their data and to the graphs and tables they use to check assumptions and diagnostic measures. In the digital era, it’s crazy this is still a rare practice. I don’t expect to find these graphs and tables sitting square in the center of a blog post, and I expect that 90% of readers wouldn’t bother to look. But there’s nothing to risk in having them available, and transparency, accountability, and collaboration to gain.

That’s the simple part, and you can feel free to close tab. For a little more:

What kind of assumptions and diagnostics am I talking about? Let’s consider the case of one of the most common types of parametric methods, linear regression. Whether we have a single predictor for simple linear regression or multiple predictors for multilinear regression, fundamentally regression is a matter of assessing the relationship between quantitative (continuous) predictor variables and a quantitative (continuous) outcome variable. For example, we might ask how well SAT scores predict college GPA; we might ask how well age, weight, and height predict blood pressure. When someone talks about how one number predicts another, the strength of their relationship, and how we might attempt to change one by changing the other, they’re probably making an appeal to regression.

The types of regression analysis, and the issues therein, are vast, and there are many technical issues at play that I’ll never understand. But I think it’s worthwhile to talk about some of the assumptions we need to check and some problems we have to look out for. Regression has come in for a fair amount of abuse lately from sticklers and skeptics, and not for no reason; it’s easy to use the techniques irresponsibly. But we’re inevitably going to ask basic questions of how X and Y predict Z, so I think we should expand public literacy about these things. I want to talk a little bit about these issues not because I think I’m qualified to teach statistics to others, or because regression is the only statistical process that we need to see assumptions and diagnostics for. Rather, I think regression is an illustrative example through which to explore why we need to check this stuff, to talk about both the power and pitfalls of public engagement with data.

There are four assumptions that need to be true to run a linear (least squares) regression: independence of observations, linearity, constancy of variance, and normality. (Some purists add a fifth, existence, which, whatever.)

Independence of Observations

This is the biggie, and it’s why doing good research can be so hard and expensive. It’s the necessary assumption that one observation does not affect another. This is the assumption that requires randomness. Remember that in statistics error, or necessary and expected variation, is inevitable, but bias, or the systematic influence on observations, is lethal.

Suppose you want to see how eating ice cream affects blood sugar level. You gather 100 students into the gym and have them all eat ice cream. You then go one by one through the students and give them a blood test. You dutifully record everyone’s values. When you get back to the lab, you find that your data does not match that of much of the established research literature. Confused, you check your data again. You use your spreadsheet software to arrange the cells by blood sugar. You find a remarkably steady progression of results running higher to lower. Then it hits you: it took you several hours to test the 100 students. The highest readings are all from the students who were first to be tested, the lowest from those who were tested last. Your data was corrupted by an uncontrolled variable, time-after-eating-to-test. Your observations were not truly independent of each other – one observation influenced another because taking one delayed taking the other. This is an example that you’d hope most people would avoid, but the history of research is the history of people making oversights that were, in hindsight, quite obvious.

Independence is scary because threats to it so often lurk out of sight. And the presumption of independence often prohibits certain kinds of analysis that we might find natural. For example, think of assigning control and test conditions to classes rather than individual students in educational research. This is often the only practical way to do it; you can’t fairly ask teachers to only teach half their students one technique and half another. You give one set of randomly-assigned classes a new pedagogical technique, while using the old standard with your control classes. You give a pre- and post-test to both and pop both sets of results in an ANOVA. You’ve just violated the assumption of independence. We know that there are clustering effects of children within classrooms; that is, their results are not entirely independent of each other. We can correct for this sort of thing using techniques like hierarchical modeling, but first we have to recognize that those dangers exist!
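To make the clustering problem concrete, here is a small simulation sketch in Python. Everything in it is made up for illustration: the 30 classes of 25 students, the score scale, and the size of the shared classroom effect are my assumptions, not numbers from any study. It compares how much class averages actually vary with how much they would vary if every student were a truly independent observation.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

N_CLASSES = 30    # hypothetical number of classrooms
CLASS_SIZE = 25   # hypothetical students per classroom

classes = []
for _ in range(N_CLASSES):
    # Everyone in a class shares one classroom-level bump (teacher, room, peers).
    class_effect = rng.gauss(0, 8)
    scores = [70 + class_effect + rng.gauss(0, 10) for _ in range(CLASS_SIZE)]
    classes.append(scores)

all_scores = [score for scores in classes for score in scores]
class_means = [statistics.fmean(scores) for scores in classes]

# If students were independent draws, the variance of a class mean would be
# roughly the overall variance divided by the class size.
expected_if_independent = statistics.variance(all_scores) / CLASS_SIZE
observed = statistics.variance(class_means)
clustering_ratio = observed / expected_if_independent

print(f"variance of class means, expected under independence: {expected_if_independent:.1f}")
print(f"variance of class means, observed: {observed:.1f}")
print(f"ratio: {clustering_ratio:.1f}")
```

A ratio near 1 is what independence would look like. The further above 1 it climbs, the more an analysis that treats each student as an independent observation overstates how much information it really has.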

Independence is the assumption that is least subject to statistical correction. It’s also the assumption that is the hardest to check just by looking at graphs. Confidence in independence stems mostly from rigorous and careful experimental design. You can check a graph of your observations (your actual data points) against your residuals (the distance between your observed values and the linear progression from your model), which can sometimes provide clues. But ultimately, you’ve just got to know your data was collected appropriately. On this one, we’re largely on our own. However, I think it’s a good idea for academic researchers to provide online access to a Residuals vs. Observations graph when they run a regression. This is very rare, currently.

Here’s a Residuals vs. Observations graph I pulled off of Google Images. This is what we want to see: snow. Clear nonrandom patterns in this plot are bad.


Linearity

The name of the technique is linear regression, which means that observed relationships should be roughly linear to be valid. In other words, you want your relationship to fall along a more or less linear path as you move across the x axis; the relationship can be weaker or it can be stronger, but you want it to be more or less as strong as you move across the line. This is particularly the case because curvilinear relationships can appear to regression analysis to be no relationship. Regression is all about interpolation: if I check my data and find a strong linear relationship, and my data has a range from A to B, I should be able to check any x value within A and B and have a pretty good prediction for y. (What “pretty good” means in practice is a matter of residuals and r-squared, or the portion of the variance in y that’s explained by my xs.) If my relationship isn’t linear, my confidence in that prediction is unfounded.

Take a look at these scatter plots. Both show close to zero linear relationship according to Pearson’s product-moment coefficient:

And yet clearly, there’s something very different going on from one plot to the next. The first is true random variance; there is no consistent relationship between our x and y variables. The second is a very clear association; it’s just not a linear relationship. The degree and direction of y varying along x changes over different values for x. Failure to recognize that non-linear relationship could compel us to think that there is no relationship at all. If the violation of linearity is as clear and consistent as in this scatter plot, it can be cleaned up fairly easily by transforming the data.
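The same contrast can be reproduced with simulated data. The sketch below is mine, not the source of the plots described above: the sample size, the U-shaped y = x² relationship, and the noise levels are all assumptions chosen for illustration. It computes Pearson's r by hand for pure noise and for a clean curve, then again after one possible transformation (regressing on x² instead of x).

```python
import random

def pearson_r(xs, ys):
    """Pearson product-moment correlation, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)  # fixed seed so the sketch is repeatable
xs = [rng.uniform(-3, 3) for _ in range(500)]

# Case 1: y has nothing to do with x.
noise_only = [rng.gauss(0, 1) for _ in xs]

# Case 2: y depends strongly on x, but along a U-shaped curve.
curved = [x ** 2 + rng.gauss(0, 0.5) for x in xs]

r_noise = pearson_r(xs, noise_only)
r_curved = pearson_r(xs, curved)

# Transforming the predictor (here, squaring it) exposes the relationship.
r_transformed = pearson_r([x ** 2 for x in xs], curved)

print(f"r for pure noise:        {r_noise:.2f}")
print(f"r for the curve:         {r_curved:.2f}")
print(f"r after transforming x:  {r_transformed:.2f}")
```

The first two correlations both come out close to zero, which is exactly why the coefficient alone can't distinguish "no relationship" from "a strong relationship that isn't a line." You have to look at the plot.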

Regression is fairly robust to violations of linearity, and it’s worth noting that any relationship with a correlation sufficiently below 1 will be non-linear in the strict sense. But clear, consistent curves in data can invalidate our regression analyses.

Readers could check data for linearity if scatter plots are posted for simple linear regression. For multilinear regression, it’s a bit messier; you could plot every individual predictor, but I would be satisfied if you just mention that you checked linearity.

Constancy of variance

Also known by one of my very favorite ten-cent words, homoscedasticity. Constancy of variance means that, along your range of x predictors, your y varies about as much; it has as much spread, as much error. Remember, when I’m doing inferential statistics, I’m sampling, and sampling means sampling error – even if I’m getting quality results, I’m inevitably going to get differences in my data from one collection of samples to the next. But if our assumptions are true, we can trust that those samples will vary in predictable intervals relative to the true mean. That is, if an SAT score predicts freshman year GPA with a certain degree of consistency for students scoring 400, it should be about as consistent for students scoring 800, 1200, and 1600, even though we know that from one data set to the next, we’re not going to get the exact same values even if we assume that all of the variables of interest are the same. We just need to know that the degree to which they vary for a given x is constant over our range.

Why is this important? Think again about interpolation. I run a regression because I want to understand a relationship between various quantitative variables, and often because I want to use my predictor variables to… predict. Regression is useful insofar as I can move along the axes of my x values and produce a meaningful, subject-to-error-but-still-useful value for y. Violating the assumption of constant variance means that you can’t predict y with equal confidence as you move around x(s); the relationship is stronger at some points than others, making you vulnerable to inaccurate predictions.

Here’s a residuals plot showing the dreaded megaphone effect: the error (size of residuals, difference between observations and results expected from the regression equation) increases as we move from low to high values of x. The relationship is strong at low values of x and much weaker at high values.

We could check homoscedasticity by having access to residual plots. Violations of constant variance can often be fixed via transformation, although it may often be easier to use techniques that are more inherently robust to this violation, such as quantile regression.
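For readers who want to see the megaphone in numbers rather than in a picture, here is a rough sketch. The data are simulated under my own assumptions (a true line of y = 2x with noise whose spread grows in proportion to x), and comparing the lowest and highest thirds of x is just one crude check I picked for illustration, not a formal test.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(7)  # fixed seed so the sketch is repeatable
xs = [rng.uniform(1, 10) for _ in range(600)]
# Assumed data-generating process: the noise gets wider as x grows.
ys = [2.0 * x + rng.gauss(0, 0.5 * x) for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Sort the residuals by x, then compare their spread at the two ends.
ordered = [resid for _, resid in sorted(zip(xs, residuals))]
third = len(ordered) // 3
low_spread = statistics.pstdev(ordered[:third])
high_spread = statistics.pstdev(ordered[-third:])
spread_ratio = high_spread / low_spread

print(f"residual spread, lowest third of x:  {low_spread:.2f}")
print(f"residual spread, highest third of x: {high_spread:.2f}")
print(f"ratio: {spread_ratio:.2f}")
```

With constant variance the two spreads would be roughly equal. A ratio well above 1 is the numerical footprint of the megaphone shape.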


Normality

The concept of the normal distribution is at once simple and counterintuitive, and I’ve spent a lot of my walks home trying to think of the best way to explain it. The “parametric” in parametric statistics refers to the assumption that there is a given underlying distribution for most observable data, and frequently this distribution is the normal distribution or bell curve. Think of yourself walking down the street and noticing that someone is unusually tall or unusually short. The fact that you notice is in and of itself a consequence of the normal distribution. When we think of someone that is unusually tall or short, we are implicitly assuming that we will find fewer and fewer people as we move further along the extremes of the height distribution. If you see a man in North America who is 5’10, he is above average height, but you wouldn’t bat an eye; if you see a man who is 6’3, you might think to yourself, that’s a tall guy; when you see someone who is 6’9, you say, wow, he is tall!, and when you see a 7 footer, you take out your cell phone. This is the central meaning of the normal distribution: that the average is more likely to occur than extremes, and that the relationship between position on the distribution and probability of occurrence is predictable.

Not everything in life is normally distributed. Poll 1,000 people and ask how much money they received in car insurance payments last year and it won’t look normal. But a remarkable amount of naturally occurring phenomena are normally distributed, simply thanks to the reality of numbers and extremes, and the central limit theorem teaches us that essentially all averages are normally distributed. (That is, if I take a 100 person sample of a population for a given quantitative trait, I will get a mean; if I take another 100 person sample, I will get a similar but not exact mean, and so on. If I plot those means, they will be normal even if the overall distribution is not.)
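The parenthetical above is easy to try out. In this sketch I stand in for a lopsided real-world quantity with an exponential distribution, which is my assumption and not a model of actual insurance payments. The code draws many 100-person samples, takes each sample's mean, and uses skewness (0 for a perfectly symmetric distribution) as a rough yardstick for how lopsided the raw values and the means are.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed z-score. Zero means symmetric."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# A heavily right-skewed "population": mostly small values, a few large ones.
raw_values = [rng.expovariate(1.0) for _ in range(5000)]

# 2,000 separate samples of 100 people each; keep only each sample's mean.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(100)])
    for _ in range(2000)
]

skew_raw = skewness(raw_values)
skew_means = skewness(sample_means)

print(f"skewness of raw values:   {skew_raw:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
```

The raw values are badly lopsided, while the means pile up in a nearly symmetric bell around the population average. The approximation gets better as the samples get bigger.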

The assumption of normality in regression requires our residuals (strictly speaking, the errors around the regression line rather than the raw data) to be roughly normally distributed; in order to assess the relationship of y as it moves across x, we need to know the relative frequency of extreme observations to observations close to the mean. It’s a fairly robust assumption, and you’re never going to have perfectly normal residuals, but too strong of a violation will invalidate your analysis. We check normality with what’s called a qq plot. Here’s an almost-perfect one, again scraped from Google Images:

That strongly linear, nearly 45 degree angle is just what we want to see. Here’s a bad one, demonstrating the “fat tails” phenomenon – that is, too many observations clustered at the extremes relative to the mean:

Usually the rule is that unless you’ve got a really clear break from a straightish 45 degree angle, you’re probably alright. When the going gets tough, seek help from a statistician.
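If you want to build the points of a qq plot yourself, the recipe is short: standardize and sort your values, then pair each one with the value a perfect normal distribution would put at the same rank. The sketch below does that with Python's standard library. The two data sets are simulated under my own assumptions (one normal, one a mixture with occasional wild values to mimic fat tails), and "largest gap from the 45 degree line" is an informal summary I chose for illustration, not a standard test statistic.

```python
import random
import statistics

def qq_points(values):
    """Pair each theoretical normal quantile with the matching sorted, standardized value."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    observed = sorted((v - mean) / sd for v in values)
    normal = statistics.NormalDist()  # standard normal: mean 0, sd 1
    theoretical = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

def worst_gap(values):
    """Largest vertical distance between the qq points and the 45 degree line."""
    return max(abs(obs - theo) for theo, obs in qq_points(values))

rng = random.Random(11)  # fixed seed so the sketch is repeatable

well_behaved = [rng.gauss(0, 1) for _ in range(1000)]
# Fat tails: about one value in ten comes from a much wider distribution.
fat_tailed = [
    rng.gauss(0, 5) if rng.random() < 0.1 else rng.gauss(0, 1)
    for _ in range(1000)
]

gap_normal = worst_gap(well_behaved)
gap_fat = worst_gap(fat_tailed)

print(f"largest gap, normal sample:     {gap_normal:.2f}")
print(f"largest gap, fat-tailed sample: {gap_fat:.2f}")
```

Plotting the pairs from `qq_points` gives the picture itself. The gap number is only a quick way to see that the fat-tailed sample peels away from the line at the extremes while the normal one hugs it.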


OK, so 2000 words into this thing, we’ve checked out four assumptions. Are we good? Well, not so fast. We need to check a few diagnostic measures, or what my stats instructor used to call “the laundry list.” This is a matter of investigating influence. When we run an analysis like regression, we’re banking on the aggregate power of all of our observations to help us make responsible observations and inferences. We never want to rely too heavily on individual or small numbers of observations because that increases the influence of error in our analysis. Diagnostic measures in regression typically involve using statistical procedures to look for influential observations that have too much sway over our analysis.

The first thing to say about outliers is that you want a systematic reason for eliminating them. There are entire books about the identification and elimination of outliers, and I’m not qualified to say what the best method is in any given situation. But you never want to toss an observation simply because it would help your analysis. When you’ve got that one data point that’s dragging your line out of significance, it’s tempting to get rid of it, but you want to analyze that observation for a methodology-internal justification for eliminating it. On the other hand, sometimes you have the opposite situation: your purported effect is really the product of a single or small number of influential outliers that have dragged the line in your favor (that is, to a p-value you like). Then, of course, the temptation is simply to not mention the outlier and publish anyway. Especially if a tenure review is in your future…

Some examples of influential observation diagnostics in regression include examining leverage, or outliers in your predictors that have a great deal of influence on your overall model; Cook’s Distance, which tells you how different your model will be if you delete a given observation; DFBetas, which tell you how much a given observation influences a particular parameter estimate; and more. Most modern statistical packages like SAS or R have commands for checking diagnostic measures like these. While offering numbers would be nice, I would mostly like it if researchers reassured readers that they had run diagnostic measures for regression and found acceptable results. Just let me know: I looked for outliers and influential observations and things came back fairly clean.
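To show what two of these diagnostics look like under the hood, here is a hand-rolled sketch for the simplest case, a regression with a single predictor. It uses the textbook formulas for leverage and Cook's Distance; the eleven data points are invented, with the last one deliberately placed far from the others in x and well off the line. A real analysis would lean on a statistical package rather than code like this.

```python
def influence_measures(xs, ys):
    """Leverage and Cook's Distance for a one-predictor least squares fit."""
    n = len(xs)
    p = 2  # parameters estimated: intercept and slope
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage: how unusual each x value is relative to the other x values.
    leverages = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    # Cook's Distance: combines residual size and leverage into one
    # "how much would the fit move without this point" score.
    cooks = [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]
    return leverages, cooks

# Ten invented points near the line y = 2x + 1, plus one far-out observation.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 25]
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs[:-1]] + [20.0]

leverages, cooks = influence_measures(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: cooks[i])

for i, (h, d) in enumerate(zip(leverages, cooks)):
    print(f"point {i}: x={xs[i]:>2}  leverage={h:.3f}  Cook's D={d:.3f}")
print(f"most influential point: index {most_influential}")
```

The planted point stands out on both measures, which is the whole idea: the diagnostics flag the observations that deserve a second look before anyone decides whether to keep or drop them.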


Regression is just one part of a large number of techniques and applications that are happening in data journalism right now. But essentially any statistical techniques are going to involve checking assumptions and diagnostic measures. A typical ANOVA, for example, the categorical equivalent of regression, will involve checking some of the same assumptions. In the era of the internet, there is no reason not to provide a link to a brief, simple rundown of what quality controls were pursued in your analysis.

None of these things are foolproof. Sums of squares are spooky things; we get weird results as we add and remove predictors from our models. Individual predictors are strongly significant by themselves but not when added together; models are significant with no individual predictors significant; individual predictors are highly significant without model significance; the order you put your predictors in changes everything; and so on. It’s fascinating and complicated. We’re always at the mercy of how responsible and careful researchers are. But by sharing information, we raise the odds that what we’re looking at is a real effect.

This might all sound like an impossibly high bar to clear. There are so many ways things can go wrong. And it’s true that, in general, I worry that people today are too credulous towards statistical arguments, which are often advanced without sufficient qualifications. There are some questions where statistics more often mislead than illuminate. But there is a lot we can and do know. We know that age is highly predictive of height in children but not in adults; we know that there is a relationship between SAT scores and freshman year GPA; we know point differential is a better predictor of future win-loss record than past win-loss record. We can learn lots of things, but we always do it better together. So I think that academic researchers and data journalists should share their work to a greater degree than they do now. That requires a certain compromise. After all, it’s scary to have tons of strangers looking over your shoulder. So I propose that we get more skeptical and critical of statistical arguments as a media and readership, but more forgiving of individual researchers who are, after all, only human. That strikes me as a good bargain.

And one I’m willing to make myself, so please email me to point out the mistakes I’ve inevitably made in this post.

diversifying the $5 reward tier

Hey gang, first I’m sorry content has been a bit light on the main site this week. Good things are coming in bunches soon. I have been releasing archival content to all subscribers on the Patreon page at a steady clip. I wanted to let you know that I’ve decided to diversify the $5 patron content a little. It’s not so much that I’m not keeping up with the book reading – it’s been a bit tough but not bad – but rather that I’m feeling a little constrained by the review format. So I’m going to alternate between book reviews and more general cultural writing, reading recommendations, considerations of contemporary criticism, etc. There will still not be any explicitly political content, which I host on Medium.

Book reviews return this weekend at last, though, and thanks for your patience. I’ve got a number of good ones coming up. Thank you for your continued support. If you aren’t yet a Patreon patron, please consider it. Also, thanks so much for the emails, and I apologize if I haven’t gotten back to you. I’ve taken some unexpected heat lately, and the support means more than I can say.

g-reliant skills seem most susceptible to automation

This post is 100% informed speculation.

As someone who is willing to acknowledge that IQ tests measure something real, measurable, and largely persistent, I take some flak from people who are skeptical of such metrics. As someone who does not think that IQ (or g, the general intelligence factor that IQ tests purport to measure) is the be-all, end-all of human worth, I take some flak from the internet’s many excitable champions of IQ. This is one of those things where I get accused of strawmanning – “nobody thinks IQ measures everything worthwhile!” – but please believe me that long experience shows that there are an awful lot of very vocal people online who are deeply insistent that IQ measures not just raw processing power but all manner of human value. Like so many other topics, IQ seems to be subject to a widespread binarism, with most people clustered at two extremes and very few with more nuanced positions. It’s kind of exhausting.

I want to make a point that, though necessarily speculative, seems highly intuitive to me. If we really are facing an era where superintelligent AI is capable of automating a great many jobs out from under human workers, it seems to me that many g-reliant jobs are precisely the ones most likely to be automated away. If the factor represents the ability to do raw intellectual processing, then it seems likely to me that the g-factor will become less economically worthwhile when such processing is offloaded to software. IQ-dominant tasks in specific domains like chess have already been conquered by task-specific AI. It doesn’t seem like a stretch to me to suggest that more obviously vocational skills will be colonized by new AI systems.

Meanwhile, contrast this with professions that are dependent on “soft” skills. Extreme IQ partisans are very dismissive of these things, often arguing that they aren’t real or that they’re just correlated with IQ anyway. But I believe that there are social, emotional, and therapeutic skills that are not validly measured by IQ tests, and these skills strike me as precisely those that AI will have the hardest time replicating. Human social interactions are incredibly complex and are barely understood by human observers who are steeped in them every day. And human beings need each other; we crave human contact and human interaction. It’s part of why people pay for human instructors in all sorts of tasks that they could learn from free online videos, why we pay three times as much for a drink at a bar than we would pay to mix it at home, why we have set up these odd edifices like coworking spaces that simply permit us to do solo tasks surrounded by other human beings. I don’t really know what’s going to happen with automation and the labor market; no one does. But that so many self-identified smart people are placing large intellectual bets on the persistent value of attributes that computers are best able to replicate seems very strange to me.

You could of course go too far with this. I don’t think that people at the very top of their games need to worry too much; research physicists, for example, probably combine high IQs and a creative/imaginative capacity we haven’t yet really captured in research. But the thing about these extremely high performers is that they’re so rare that they’re not really relevant from a big picture perspective anyway. It’s the larger tiers down, the people whose jobs are g-dependent but who aren’t part of a truly small elite, that I think should worry – maybe not that group today, but its analog 50 or 100 years from now. I mean, despite all of the “teach a kid to code” rhetoric, computer science is probably a heavily IQ-screened field and it’s silly to try and push everyone into it anyway. But even beyond that… someday it’s code that will write code.

Predictions are hard, especially about the future. I could be completely wrong. But this seems like an intuitively persuasive case to me, and yet I never hear it discussed much. That’s the problem with the popular conversation on IQ being dominated by those who consider themselves to have high IQs; they might have too much skin in the game to think clearly.

Study of the Week: Of Course Virtual K-12 Schools Don’t Work

This one seems kind of like shooting fish in a barrel, but given that “technology will solve our educational problems” is holy writ among the Davos crowd no matter what the evidence, I suppose this is worth doing.

Few people would ever come out and say this, but central to assumptions about educational technology is that human teachers are an inefficiency to be removed from the system by whatever means possible. Right now, not even the most credulous Davos type, nor the most shameless ed tech profiteer, is making the case for fully automated AI-based instruction. But attempts to dramatically increase the number of students that you can force through the capitalist pipeline at low cost (excuse me, that you can help nurture and grow) are well under way, typically by using digital systems to let one teacher teach more students than you’d see in a brick-and-mortar classroom. This also cuts down on the costs of facilities, which give kids a safe and engaging place to go every day but which are expensive. So you build a virtual platform, policy types use words like “innovation” and “disrupt,” and for-profit entities start sucking up public money with vague promises of deliverance-through-digital-technology. Kids and parents get “choice,” which the ed reform movement has successfully branded as a good thing even though at scale school choice has not been demonstrated to have any meaningful relationship to improved outcomes at all.

Today’s Study of the Week, from a couple years ago, takes a look at whether these virtual K-12 schools actually, you know, work. It’s a part of the CREDO project. I have a number of issues, methodological and political, with the CREDO program generally, but I still think this is high-quality data. It’s a large data set that compares the outcomes of students in traditional public schools, brick and mortar charters, and virtual charters. The study uses a matched data method – in simple terms, comparing students from the different “conditions” who match on a variety of demographic and educational metrics in order to attempt to control for construct-irrelevant variance. This can help to ameliorate some of the problems with observational studies, but bear in mind that once again, this is not the same as a true randomized controlled trial. They had to do things this way because online charter seats are not assigned via lottery. (For the record, I do not trust the randomization effects of such lotteries because of the many ways in which they are gamed, but here that’s not even an issue because there’s no lottery at all.)

The matched variables, if you’re curious:

• Grade level
• Gender
• Race/Ethnicity
• Free or Reduced-Price Lunch Eligibility
• English Language Learner Status
• Special Education Status
• Prior test score on state achievement test
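If the idea of matching feels abstract, here is a minimal sketch of how exact matching on variables like these could work. To be clear, this is my own toy illustration and not the study's actual procedure: the field names (`grade`, `frl`, `growth`, and so on), the `score_tolerance` parameter, and the idea of averaging all matched comparison students are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical field names mirroring the matched variables listed above.
EXACT_KEYS = ("grade", "gender", "race_ethnicity", "frl", "ell", "sped")


def match_key(student):
    """Build the tuple of characteristics that must match exactly."""
    return tuple(student[k] for k in EXACT_KEYS)


def matched_differences(virtual, traditional, score_tolerance=0.1):
    """For each virtual-school student, find traditional-school students
    with identical demographics and a prior score within score_tolerance,
    then return the virtual student's growth minus the matched average.

    Assumption: scores and growth are already standardized (in SD units).
    """
    pool = defaultdict(list)
    for t in traditional:
        pool[match_key(t)].append(t)

    diffs = []
    for v in virtual:
        candidates = [
            t for t in pool.get(match_key(v), [])
            if abs(t["prior_score"] - v["prior_score"]) <= score_tolerance
        ]
        if not candidates:
            continue  # students with no match are dropped from the comparison
        diffs.append(v["growth"] - mean(t["growth"] for t in candidates))
    return diffs


if __name__ == "__main__":
    # Invented toy records, purely to show the call shape.
    base = dict(grade=5, gender="F", race_ethnicity="A",
                frl=True, ell=False, sped=False)
    virtual = [dict(base, prior_score=0.0, growth=-0.2)]
    traditional = [
        dict(base, prior_score=0.05, growth=0.1),
        dict(base, prior_score=-0.05, growth=0.0),
        dict(base, grade=6, prior_score=0.0, growth=0.9),  # wrong grade, ignored
    ]
    print(matched_differences(virtual, traditional))
```

Note what this does and does not buy you: students only get compared to others who look like them on the listed variables, but anything that isn't on the list (motivation, family circumstances, why a family chose a virtual school in the first place) is left uncontrolled. That's the gap between matching and a true randomized trial.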

So how well do online charters work? They don’t. They don’t work. Look at this.

Please note that, though these negative effect sizes may not seem that big to you, in a context where most attempted interventions are not statistically different from zero, they’re remarkable. I invite you to look at the “days of learning lost” scale on the right of the graphic. There are only 180 days in the typical K-12 school year! This is educational malpractice. How could such a thing have been attempted with over 160,000 students without any solid evidence it could work? Because the constant, the-sky-is-falling crisis narrative in education has created a context where people believe they are entitled to try anything, so long as their intentions are good. Crisis narratives undermine checks and balances and the natural skepticism that we should ordinarily apply to the interests of young children and to public expenditure. So you get millions of dollars spent on online charter schools that leave students a full school year behind their peers.

Are policy types still going full speed ahead, working to send more and more students – and more and more public dollars – into these failed, broken online schools? Of course. Educational technology and the ed reform movement writ large cannot fail, they can only be failed, and nothing as trivial as reality is going to stand in the way.