Study of the Week: More Bad News for College Remediation

Today’s Study of the Week combines two subjects we’ve talked about recently on the ANOVA, college remediation and regression discontinuity design. The study, by the University of Warwick’s Emma Duchini, throws even more cold water on our efforts to fix gaps in college student readiness with remediation – and leaves us wondering what to do instead.

One of the basic difficulties in improving educational outcomes lies in the chain of disadvantage. Students who start out behind tend to stay behind, and it’s not productive to ask teachers to make up for the gaps that have been opened over the course of a student’s life. As I’ve said on this blog many times, most students tend to sort themselves into fairly stable academic rankings early in life, and though individuals move between those rankings fairly often, at scale and in numbers this hierarchy is remarkably persistent. So third grade reading group serves as a good predictor of high school graduation rates, which in turn obviously predicts college completion rates. Meanwhile, the racial achievement gap appears to exist before students ever show up in formal schooling at all. It’s discouraging.

This study comes from the United Kingdom, but it concerns a question of great interest on this side of the Atlantic: do college remediation classes work? We know that college student populations are profoundly different in incoming ability. The college admissions process makes sure of that. That means that institutions like mine, the City University of New York, face profoundly higher hurdles in getting students to typical levels of ability, as our admissions data tells us that many of our students are unprepared. Typically, this results in remedial classes, to the tune of $4 billion a year for public universities. But as Duchini notes, evidence for the effectiveness of remediation is thin on the ground. Her study takes another look.

Duchini’s study draws its data from the economics department of a public Italian university. This university implemented an entrance exam for potential students, consisting of a math section, a verbal section, and a logic section. The results of this test, combined with high school grades, determine whether students are admitted to the program. However, the math section alone determines whether students must take a remedial program. Because placement turns on a cut score, because that cut score falls fairly close to the mean, and because there are no other systematic differences between students placed in or out of the remediation program, this is an ideal situation for a regression discontinuity design, as I explained in this previous post.


Ultimately Duchini considers the exam scores and educational and demographic data of 2,682 students, sorted into descriptive categories like gender, immigrant or domestic status, and vocational or general high school track. Importantly for a regression discontinuity design, there is no evidence of students clustering tightly on either side of the cut score; such bunching would suggest that students were manipulating their placement, which would invalidate the design.
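A crude version of that bunching check can be sketched with synthetic data (every number below is invented for illustration): count students in narrow bins just below and just above the cutoff and see whether one side is suspiciously heavy.

```python
import random

random.seed(0)
CUTOFF = 50.0

# Invented entrance-exam scores for 2,000 hypothetical students;
# nobody is gaming their placement here, so no bunching should appear.
scores = [random.gauss(50, 10) for _ in range(2000)]

# McCrary-style density check: compare counts in narrow bins on
# either side of the cut score.
just_below = sum(1 for s in scores if CUTOFF - 2 <= s < CUTOFF)
just_above = sum(1 for s in scores if CUTOFF <= s < CUTOFF + 2)

print(just_below, just_above)  # similar counts -> no obvious manipulation
```

A real check (the McCrary density test) fits the score density formally on each side, but the intuition is just this comparison of counts at the boundary.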

There’s an interesting dynamic in the data set, perhaps an example of Berkson’s paradox. Students who perform better on the entrance test are actually less likely to enroll in the program, even though doing well on the test is a requirement for attendance. Why? Think about what it means to do well on the test: those students are more academically prepared overall, and thus have more options for majors, meaning that more of them will choose to enroll in a different program.

In any event, Duchini uses a regression discontinuity design to see whether there is any meaningful difference between students on either side of the cut score, and how the trend line changes there, looking at outcome variables like the odds of dropping out, passing college-level math, and credits accumulated. The results are not encouraging. The real nut is how remediation affects the odds of passing college-level math. Note that the sample is restricted here to cases near the cutoff, as we don’t want to get a misleading picture from looking at students too far from it – this is a last in/last out style model, after all – and bear in mind that because this is a remediation test, the treatment is assigned to those on the left-hand side of the cut line.

The upward-sloping trend is no surprise; we should expect student performance on an entrance exam to predict the likelihood that they’ll get through a class in the test subject. What we want to see here is a large break in the performance of the groups at the cut score, with a corresponding shift in the trend line, to suggest that the remediation program is meaningfully affecting outcomes – that is, that it’s bringing students below the cut line closer to the performance of those well above it. Neither eyeballing this scatterplot nor the statistical significance checks Duchini describes provides any such evidence. I find it interesting that the data points are more tightly grouped on the left side of the cut line than on the right, but I’m guessing it’s mostly noise. Look in the PDF for more scatterplots with similar trend lines, as well as the model and threshold for significance.
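To make the "break at the cut score" idea concrete, here is a toy regression discontinuity estimate on invented data, where (as in the study) the true remediation effect is zero: fit a line on each side of the cutoff within a bandwidth, then measure the jump between the two fitted lines at the cut score.

```python
import random

random.seed(1)
CUTOFF = 50.0
BANDWIDTH = 10.0

# Invented data: the probability of passing college math rises with
# the entrance score; remediation (assigned below the cutoff) has no
# true effect here, so the fitted lines should meet at the cutoff.
data = []
for _ in range(5000):
    score = random.gauss(50, 12)
    p = min(max(0.01 * score, 0.0), 1.0)
    passed = 1.0 if random.random() < p else 0.0
    data.append((score, passed))

def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b

left = [(s, y) for s, y in data if CUTOFF - BANDWIDTH <= s < CUTOFF]
right = [(s, y) for s, y in data if CUTOFF <= s <= CUTOFF + BANDWIDTH]

a_l, b_l = fit_line(left)
a_r, b_r = fit_line(right)

# The RD estimate is the jump in predicted outcome at the cutoff.
jump = (a_r + b_r * CUTOFF) - (a_l + b_l * CUTOFF)
print(round(jump, 3))  # near zero: no discontinuity, i.e. no effect
```

In the paper's case, a meaningful remediation effect would show up as a clearly nonzero jump; the restriction to a bandwidth around the cutoff is the same "edge cases only" restriction described above.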

Duchini goes into a lot of extra detail, breaking the data set down by demographic groupings and educational factors, though in every case there is little evidence of meaningful gains from the remediation program. Duchini also speaks at length about potential reasons why the program failed to meaningfully prepare students to pass college-level math, including wondering if being assigned remediation might discourage students by making them feel like the work of getting their degree will be even harder than they thought. It’s interesting stuff and worth reading, but for our purposes the conclusion is simple: this remediation program does not appear to meaningfully help students succeed in later college endeavors. It’s only one study from a particular context. But given similar studies that also find little value in remediation, this is more reason to question the value of such programs. More study is needed, but it’s not looking good.

Clearly, if remedial classes don’t work, and they cost students time and money, they should be scrapped. But scrapping them won’t solve the underlying problem: students are arriving at college without the academic skills necessary to succeed. College educators will typically lament that they’re trying to solve the deficiencies of high school education, but of course high school teachers can fairly point still further back. Ultimately the dynamic is applicable to the whole system: students are profoundly unequal in their various academic talents from a very early age, and we’re all searching for ways to serve them better. Perhaps the conversation needs to turn to whether we should be pushing so many students into college in the first place, and whether we need to look for answers to economic woes outside of the education system entirely. But for now, we as college educators are left with a sticky problem: our students come to our schools unprepared, but our programs to fill those gaps show little sign of working.

Study of the Week: Hitting the Books, or Hitting the Bong?

Today’s Study of the Week, by Olivier Marie and Ulf Zölitz, considers the impact of access to legal marijuana on college performance. (Via Vox’s podcast The Weeds.) The researchers took advantage of an unusual legal circumstance to examine a natural experiment involving college students and marijuana. For years now, the Netherlands has been working to avoid some of the negative consequences of its famous legal marijuana industry. While most in the country still support decriminalization, many have felt frustrated by the influx of (undoubtedly annoying) tourists who show up to Dutch cities simply looking to get high. This has led to some policies designed to ameliorate the negative impacts of marijuana tourism without going backwards towards criminalization.

In the city of Maastricht, one such policy involved selling marijuana only to people who could show citizenship identification from the Netherlands, Germany, or Belgium, and not to people of other nationalities. These specific countries seem to have been chosen as a matter of geography – look at Maastricht on a map and you’ll see it’s part of a small Dutch “peninsula” wedged between Germany and Belgium. Importantly for our purposes here, Maastricht features a large university, and like a lot of European schools it attracts students from all over the continent. That means that when the selective-enforcement policy went into effect in 2011, one group of students still had access to marijuana, while another lost it, at least legally. That provided an opportunity to study how decriminalization impacts academic outcomes.

This research thus does not amount to a true randomized experiment, although I suppose it’s one you could actually run, given the long-established relative safety of marijuana use. (“Dude, I’ll slip you $100 not to end up in the control group! No placebo!”) Instead, like a couple of our Studies of the Week in the past, this research uses a difference-in-differences design, comparing outcomes for the two groups using panel data, with a lot of the standard quality checks and corrections to try to root out construct-irrelevant variance between the groups. Ultimately they looked at 4,323 students from the School of Business and Economics. Importantly for our purposes here, dropout rates were roughly equivalent between the two groups; differential attrition can wreak havoc on this kind of analysis.
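The difference-in-differences logic itself is simple enough to sketch with invented numbers: take each group's before-after change in average grades, then difference those changes, attributing the remaining gap to the policy under the usual parallel-trends assumption.

```python
# Difference-in-differences on a toy panel (all numbers invented):
# average grades by group, before and after the 2011 policy change.
# "restricted" students lost legal access; "unrestricted" kept it.
means = {
    ("unrestricted", "before"): 6.2,
    ("unrestricted", "after"):  6.3,
    ("restricted",   "before"): 6.1,
    ("restricted",   "after"):  6.35,
}

# Each group's before/after change...
change_restricted = means[("restricted", "after")] - means[("restricted", "before")]
change_unrestricted = means[("unrestricted", "after")] - means[("unrestricted", "before")]

# ...and the difference between those changes is the DiD estimate:
# the policy effect, assuming both groups would otherwise have
# trended in parallel.
did = change_restricted - change_unrestricted
print(round(did, 2))  # 0.15
```

The real analysis is a regression with student, course, and time controls rather than a four-cell comparison, but this is the core identification idea.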

There are a couple of obvious issues here. First, not only are these groups not randomly selected, they are deliberately selected by nationality. This could potentially open up a lot of confounds and makes me nervous. Still, it’s hard to imagine that smoking marijuana has a distinct impact on the brains of people of different European nationalities, and the authors are quite confident in the power of their models to wash out nonrandom group variance. Second, you might immediately object that of course even students who are not legally permitted to buy marijuana will frequently smoke it, and that many who can won’t. How do we know there aren’t crossover effects? Well, this is actually potentially a feature of the research, not a bug. See, that condition would be true in any decriminalization scheme; there will inevitably be people who use during a period of illegality and people who don’t when it’s decriminalized. In other words, this research really is looking at the overall aggregate impact of policy, not the impact of marijuana smoking on individual students. Much like the reasoning behind intent-to-treat models, we want to capture noncompliance because noncompliance will be present in real-world scenarios.

So what did they find? Effects of legal access to marijuana are negative, although to my mind quite modest. Their summary:

the temporary restriction of legal cannabis access increased performance by on average .093 standard deviations and raised the probability of passing a course by 5.4 percent

That effect size – not even a tenth of an SD – is interesting, as when I heard this study discussed casually, it sounded as if the effect was fairly powerful. Still, it’s not nothing, and the course-passing probability makes a difference, particularly given that we’re potentially multiplying these effects across thousands of students. The authors make the case for its practical significance like so:

Our reduced form estimates are roughly the same size as the effect as having a professor whose quality is one standard deviation above the mean (Carrell and West, 2010) or of the effect of being taught by a non-tenure track faculty member (Figlio, Shapiro and Soter, 2014). It is about twice as large as having a same gender instructor (Hoffmann and Oreopoulos, 2009) and of similar size as having a roommate with a one standard deviation higher GPA (Sacerdote, 2001). The effect of the cannabis prohibition we find is a bit smaller than the effect of starting school one hour later and therefore being less sleep-deprived (Carell, Maghakian & West, 2011).

This context strikes me as mostly being proof that most interventions into higher ed are low-impact, but still, the discussed effects are real, and given that marijuana use is associated with minor cognitive impairment, it’s an important finding. Interestingly, the negative effects were most concentrated among women, lower-performing students, and students in quantitative classes, suggesting that the average negative impact of legalization would be unequally distributed. One important note: these findings were consistent even when correcting for time spent studying, suggesting that it wasn’t merely that students who had access to marijuana were less inclined to work, but that they actually performed less well on their tasks on a minute-per-minute basis.

What do we want to do with this information? Does this count as evidence supporting continued marijuana criminalization? No, not to me. Part of what makes achieving a sensible drug policy difficult lies in this shifting of the burden of proof: things that are already illegal are often treated as worthy of decriminalization only if they can be proven to be literally harmless. But any number of behaviors that are perfectly legal involve harms. Alcohol and tobacco use are obvious examples, but there are others, including eating junk food – which is not just legal but actively subsidized by our government, thanks to a raft of bad laws and regulations that provide perverse incentives for food production. Part of freedom means the freedom to make bad choices. The question is when those choices are so bad that society feels compelled to prevent individuals from making them. Even if you aren’t as attached to civil liberties as I am, I think you can agree that marijuana use simply doesn’t qualify.

As for myself, I actually mostly stopped smoking when I got to grad school. In part that’s because I didn’t enjoy it anymore the way I once did. But it was also because I knew I simply couldn’t read and write effectively after I had smoked, and graduate study required me to be reading and writing upwards of 12 hours a day. That’s by no means universal; some people I know find it helps them concentrate. Likewise, I am useless as a writer after more than one beer, though of course there are many writers who famously wrote best when soused. Still, it seems to me entirely intuitive that habitual marijuana use would have minor-but-real negative impacts on academic outcomes. Marijuana, as safe as it is, and as ridiculous as its continued federal illegality in the United States is, does tend to cause minor cognitive impairments, and it would be foolish to assume there’s no negative educational impacts associated with it.

I’d still rather have college kids getting stoned than binge drinking constantly. And ultimately this is a question of pluses and minuses that individual people should be able to weigh for themselves, just as they do when they decide on a cheeseburger or a salad. That’s what freedom is all about, and one part of college is giving young people a chance to make these kinds of adult decisions for themselves.

Study of the Week: Modest But Real Benefits From Lead Exposure Interventions

Today’s Study of the Week, via SlateStarCodex, considers the impact of intervention programs designed to ameliorate the effects of lead exposure on children. Exposure to lead, even at relatively low doses, has a long-established set of negative consequences, particularly pertaining to cognitive functioning and behavioral control. This dynamic has long been hypothesized as a source of a great many social problems, perhaps even explaining the dramatic rise and fall in crime rates in America in the 20th century, given the rise and fall of leaded gasoline. Those broader questions are persistently controversial and will take years to answer. In the meantime, we have interventions designed to ameliorate the negative impacts of lead exposure, but little in the way of large-scale responsible research to measure their impact. This study is a step in closing that gap.

In the study, written by Stephen B. Billings and Kevin T. Schnepel, a set of observational data is analyzed to see how children eligible for inclusion in a program of interventions for lead exposure compared to a control group that did not receive the intervention. The data, taken from North Carolina programs in the 1990s, is robust and full-featured, allowing the researchers to consider behavioral outcomes for children, later-in-life criminal behavior, educational outcomes, and some other metrics of overall quality of life.

For obvious reasons, the study is not a true experiment – you can’t expose children to lead as an experimental treatment and note the difference. But the researchers are able to approximate an experimental design, first thanks to the number of statistical controls, and second thanks to a quirk of the screening process. Lead testing is notoriously finicky, so children are usually tested twice in early childhood. If children were tested once and found to have lead levels above the threshold, they were tested again several months later; if they again exceeded the threshold, they were assigned to the intervention protocol. This gave the researchers the opportunity to compare children who tested above the threshold the first time but not the second to those who tested above it both times. Because only those above the threshold twice received the intervention, these formed natural “control” and “treatment” groups, subject to quality and robustness checks. And because those in the intervention group had higher lead exposure overall, their outcomes were statistically corrected for comparison to the control group.
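The assignment rule is easy to sketch. With an invented threshold and invented test records, the treatment and comparison groups fall out of the two-test screen like so:

```python
# Toy version of the screening rule (threshold and records invented):
# children testing above the threshold twice get the intervention;
# those above once but below on the retest form a natural
# comparison group.
THRESHOLD = 10  # micrograms of lead per deciliter, for illustration

children = [
    {"id": 1, "test1": 12, "test2": 13},   # above twice -> treated
    {"id": 2, "test1": 11, "test2": 8},    # above once  -> comparison
    {"id": 3, "test1": 14, "test2": 11},   # above twice -> treated
    {"id": 4, "test1": 12, "test2": 9},    # above once  -> comparison
    {"id": 5, "test1": 7,  "test2": None}, # below once  -> never retested
]

treatment = [c["id"] for c in children
             if c["test1"] >= THRESHOLD and (c["test2"] or 0) >= THRESHOLD]
comparison = [c["id"] for c in children
              if c["test1"] >= THRESHOLD and (c["test2"] or 0) < THRESHOLD]

print(treatment)   # [1, 3]
print(comparison)  # [2, 4]
```

The measurement noise in lead testing is what makes this work: children just over the threshold once but not twice are plausibly similar to children just over it twice, before the statistical corrections for overall exposure.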

As discussed in the last Study of the Week, this research uses an intent-to-treat model (“once randomized, always analyzed”) because it was not possible to tell what portion of test subjects actually completed the interventions, and because there will certainly be noncompliers in other real-world populations as well, helping to avoid overly strong estimates of intervention effects.

The interventions included education and awareness campaigns, general medical screenings for overall childhood health, nutritional interventions which are believed (but not proven) to be effective at mitigating the effects of lead exposure, educational interventions, and for higher levels of exposure, efforts to physically locate and remove the sources of contamination, usually lead paint. These efforts can be quite expensive, with an estimated average cost of intervention for in-study participants of $5,288. To my mind this is precisely the kind of thing a healthy society should ensure is paid for.

I want to note that this study strikes me as a monumental undertaking. The sheer variety of data they pulled – birth records, housing records, educational data, criminal justice data, and more – must have taken great effort, and wrangling that much data from that many different sources is no mean feat. They even investigate which of their research subjects may have lived in the same house. And the sheer number of controls and quality tests employed here is remarkable. It’s admirable work, which will serve as a good model for replication going forward.

Unsurprisingly, lead exposure has a serious impact on educational outcomes:

This is consistent with a large body of research, as suggested previously. The behavioral outcomes are even more pronounced, which you can investigate in the paper. Bear in mind that in the raw numbers there are many confounds – poor people and people of color are disproportionately likely to live in lead-tainted environments, and they are also more likely to suffer from educational disadvantage in general, thanks to many social factors. But these trends are true within identified demographic groups as well.

Luckily, the intervention protocol does have an impact. To estimate it, the researchers combine this data (math and reading at 3rd and 8th grade, and grade retention from grades 1-9) into an educational intervention index. They find an overall effect of .117 SD improvement relative to the control group on this index, though with a p-value significant only at .10, a threshold not typically considered significant in many contexts. This is perhaps explained in part by the sample size of 301 and may improve with larger replications. There is a great deal of variation among the metrics that make up the index, listed in Table 4, so I urge you to investigate their individual effect sizes and p-values.
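A back-of-the-envelope power calculation (mine, not the paper's, and assuming roughly 150 children per arm) illustrates why an effect of .117 SD is hard to distinguish from zero at this sample size:

```python
from statistics import NormalDist

def two_sample_power(effect_sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for a
    standardized mean difference (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Noncentrality: effect size times sqrt(n/2) for equal groups.
    shift = effect_sd * (n_per_group / 2) ** 0.5
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Assumed split: ~150 subjects per arm (about 301 total), with the
# reported .117 SD effect.
power = two_sample_power(0.117, 150)
print(round(power, 2))  # roughly 0.17 -> seriously underpowered
```

With power that low, a true effect of this size would usually fail to reach conventional significance, which is why the .10 p-value here is less damning than it might look.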

This overall effect size of .117 for the educational index is somewhat discouraging, even though it suggests the intervention does have a positive impact. The biggest positive educational interventions achievable by policy, such as high-intensity tutoring, tend to have a .4-.5 SD effect on quantitative outcomes; the black-white achievement gap in many metrics is around 1 SD. So we’re talking about modest gains that don’t close the educational disadvantages associated with lead exposure. This is perhaps intuitive given that these efforts are largely aimed at preventing further exposure rather than counteracting the impact of past exposure. Still, in a world where we’re grasping for whatever positive impacts we can find, and given our clear moral responsibility to help children grow up in lead-free environments regardless of the educational impacts, it’s an encouraging sign.

What’s more, the behavioral indices were more encouraging. The researchers assembled an antisocial behavior index including metrics related to school discipline and criminality. Here the effect size was .184, significant at an alpha of .05. These non-cognitive skills make a big impact on the quality of life of students, parents, teachers, and peers. The effect is still fairly modest, but more than worth the costs.

Seems pretty clear to me that we need robust efforts to clean up lead in our environment and to mitigate the damage done to people already exposed. This is an important study and I’m eager to see replication attempts.

Study of the Week: To Remediate or Not to Remediate?

Today’s Study of the Week comes from researchers at my own university, the City University of New York, and concerns an issue of profound professional interest to me: the success of students who are required to attend remedial math classes in our community colleges. CUNY is a system of vast differences between and within its institutions, playing host to programs at senior colleges with well-prepared students that could succeed anywhere and also to many severely under-prepared students who struggle and drop out at unacceptable rates. In this diversity of outcomes, you have a microcosm of American higher education writ large, which like seemingly all things American is plagued by profound inequality.

Here at Brooklyn College, fully two thirds of undergraduates are transfer students, the vast majority of them having come from the CUNY community college system. (Those who get a sufficient number of credits from the community colleges must be admitted to the institution under CUNY policy, even when they would ordinarily not have met the necessary academic standards.) Typical academic outcomes data for these students, as distinct from the third who start and finish their careers at Brooklyn College, are vastly different, and in a discouraging direction. Since my job entails demonstrating to people in positions of power that students are learning adequately here, and explaining why unfortunate numbers look the way they do, this difference is important. But CUNY policy and the overall rhetoric of American college militates against this nuance. Indeed, the recent adoption of the Pathways system of credits is based on a simple premise: that a CUNY student is a CUNY student, a CUNY class a CUNY class, and a CUNY credit a CUNY credit. This assumption of equivalence across the very large system makes life easier for students and administrators. It is also, I would argue, empirically wrong. But this is a question far above my pay grade.

In any event, the fact is that CUNY colleges host many students who lack the level of prerequisite ability we would hope for. Today’s study asks an essential question: is the best way to serve CUNY community college students who lack basic math skills to send them to non-credit bearing remedial algebra classes? Or is it to substitute a credit-bearing college course in statistics? The question has relevance far beyond CUNY.

Algebra is a Problem

When we’re talking about incoming students who fail to meet standards, we’re also talking about how they fared at the high school level, and the failure to meet entrance requirements for college corresponds with a failure to meet graduation requirements for high school. Among the biggest, most intractable of these problems is algebra. A raft of evidence tells us that algebra requirements stand as one of the biggest impediments to students graduating from high school in our system. Here in New York City, the pass rate for the relevant sections of the Regents Exam has fluctuated with changes to standards, with 65% passing Algebra I in 2014, 52% in 2015, and 62% in 2016. Even with changing standards, in other words, more than a third of all NYC students are failing to meet Algebra I requirements – and that’s despite longstanding complaints that the standards are too low. Low standards in math might help explain why 57% of undergraduates in the CUNY system writ large were found unable to pass their math requirements in a 2012 study.

Indeed, rising graduation rates nation-wide have come along with concerns that this improvement is the product of lower standards. You can see this dynamic play out with allegations against “online credit recovery” or in the example of a San Diego charter school where graduation rates and grades are totally contrary to test performance. Someone I know who works in education policy in the think tank world told me recently that he suspects that less than half of American high school graduates actually have the skills and knowledge required of them by math standards, as distinct from just formally passing.

The political scientist Andrew Hacker, himself of CUNY, has made the case against algebra requirements at book length in his recent The Math Myth. As Hacker says, the screening mechanism of getting through algebra, pre-calculus, and similar required courses prevents many students who are otherwise academically sufficient for higher education from attending college. He marries this argument to a critique of the funnel-every-student-into-STEM-career school of ed philosophy that has become so dominant and which I myself have argued, at length, is empirically unjustifiable, economically illiterate, and educationally impossible. Rather than trying to get every kid to be an aspiring quant, Hacker recommends replacing algebra and calculus requirements with more forgiving, practically-aligned and conceptual courses in quantitative literacy.

The question is, can we do what Hacker has asked without wholesale remaking the college system, a very large boat that’s notoriously slow to turn? That’s the question this Study of the Week is intended to answer.

The Study

Today’s study was conducted by A. W. Logue, Mari Watanabe-Rose, and Daniel Douglas, all of CUNY. They were able to take advantage of an unusual degree of administrative access to conduct a true randomized experiment, assigning students to conditions randomly in a way very rarely possible in practical educational settings. The researchers conducted their study at three (unnamed) CUNY community colleges. Their research subjects were students who would ordinarily be required to take a remedial non-credit-bearing algebra course. These students were randomly assigned to one of three groups: the traditional elementary algebra class (which we can think of as a control), an elementary algebra class where students were required to participate in a support workshop of a type often recommended as a remediation effort, and an undergraduate-level, credit-bearing introductory statistics course with its own workshop.

In order to control for instructor effects, all instructors in the research taught one section each of the various classes, helping to minimize systematic differences between the experimental groups. Additionally, there was an important quality check in place regarding non-participants. In an ideal world, true randomization would mean that everyone selected for a treatment or group would participate, but of course you can’t force participation in experiments. That means that there might be some bias if students assigned to one treatment were more likely to decline to participate. Because of the nature of this study, the researchers were able to track the performance of non-participants, who took the standard elementary algebra class. Those students performed similarly to the in-study control group, an important source of confidence in the research.

The researchers used several different techniques to examine their relationships of interest, specifically the odds of passing the course and the amount of credits earned in the following year. One technique was an intent-to-treat (ITT) analysis, which is a kind of model used to address the fact that participants in randomized controlled trials will often drop out or otherwise not comply with the experimental program. It generates conservative effect size estimates by simply assuming that everyone who was randomized into a group stayed there for statistical purposes, even if we know we had some attrition and non-compliance along the way. (“Once randomized, always analyzed.” ) Why would we do that? Because we know that in a real-world scenario “subjects” won’t stick with their assigned “treatments” either, and we want to avoid overly optimistic effect sizes that might come with only looking at compliance.

(As always, if you want the real skinny on these model adjustments I urge you to read people who really know this stuff. Here’s a useful, simply stated brief on intent to treat.)
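A toy intent-to-treat comparison, with invented students, shows the "once randomized, always analyzed" rule in action: the dropout and the non-complier are counted in the arms they were assigned to, not the classes they actually finished.

```python
# Toy intent-to-treat comparison (all records invented). Students are
# analyzed in the arm they were randomized to, whether or not they
# actually completed that arm's class.
students = [
    # (assigned_arm, completed_assignment, passed)
    ("stats",   True,  True),
    ("stats",   True,  True),
    ("stats",   False, False),  # dropped out, still counted as "stats"
    ("algebra", True,  True),
    ("algebra", True,  False),
    ("algebra", False, False),  # non-complier, still counted as "algebra"
]

def pass_rate(arm):
    group = [passed for a, _, passed in students if a == arm]
    return sum(group) / len(group)

# ITT effect: the difference in pass rates by *assignment*, not by
# compliance, which keeps the estimate honest about real-world dropout.
itt_effect = pass_rate("stats") - pass_rate("algebra")
print(round(itt_effect, 3))  # 0.333
```

Analyzing only compliers would typically inflate the effect, since the students who stick with a program are rarely a random sample of those assigned to it.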

The results seem like a pretty big deal to me: after analysis, including throwing in some covariates, they find that there is no significant difference in passing the course between students enrolled in the traditional elementary algebra class and that class plus a workshop, but there is a significant and fairly large (16% without covariates in the model, 14% with) difference in odds of passing the course for those randomized to the intro stats course compared to the elementary algebra course. That is, after randomization students were 16% more likely to pass a credit-bearing college-level course than a non-credit-bearing elementary algebra course. Additionally, the stats group had a significantly higher number of total credits accumulated during the experimental semester and next year, even after subtracting the credits earned for that stats course.

(Please do take a look at the confidence interval numbers listed in brackets below, which tell you a range of effects that we can say with 95% confidence contains the true average effect. Getting in the habit of looking at confidence intervals is an important step if you’re just starting to read research reports.)
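For readers who want to see where such an interval comes from, here is a standard Wald interval for a difference in proportions, computed on invented numbers (not the study's actual cell counts):

```python
from math import sqrt

def diff_prop_ci(p1, n1, p2, n2, z=1.96):
    """95% Wald confidence interval for a difference in proportions."""
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z * se, diff + z * se

# Invented example: 60% of 150 stats students pass versus 44% of 150
# algebra students -> a 16-point gap, with its 95% interval.
lo, hi = diff_prop_ci(0.60, 150, 0.44, 150)
print(round(lo, 3), round(hi, 3))  # interval excludes zero
```

When the whole interval sits above zero, as here, we can be fairly confident the true difference favors the stats group, though the plausible effect ranges from small to quite large.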

Additionally, as the authors write, “as of 1 year after the end of the experiment, 57.32% of the Stat-WS students had passed a college-level quantitative course…, while 37.80% still had remedial need. In contrast, only 15.98% of the EA students had passed a college-level quantitative course and 50.00% still had remedial need.”

Another thing that jumped out at me: in an aside, the authors note that there was no significant difference between the groups in their likelihood of taking credits a year after the experimental semester, with all groups around 66% still enrolled. Think about that – just a year out, fully a third of all students in the study were not enrolled, reflecting the tendency to stop out or drop out that is endemic to community colleges.

Of course, none of this would be inconsistent with assuming a good deal of explanatory power of incoming ability effects, and the relationship between performance on the Compass algebra placement test and the odds of passing is about what you’d expect. Prerequisite ability matters.

In short, students who were randomly selected into an elementary algebra class with a supportive workshop attached were no more likely to pass that class than those sorted into a regular algebra class, but those sorted into an introductory statistics class were 16% more likely to have passed that course. Additionally, the latter group earned significantly more college credits in the following year than the other groups, and were much more likely to have completed a quantitative skills class.

OK. So what do we think about all of this?

First, I would be very skeptical about extrapolating these results into other knowledge domains such as literacy, writing, or similar. I don’t think all remediation efforts are the same across content domains, and similar research will need to be done in other fields. Second, the fact that a supporting workshop did little to improve outcomes relative to the class without one is discouraging but hardly surprising. Such interventions have been attempted for a long time and at scale, but their results have been frustratingly limited.

All in all, the evidence in this study supports Hacker’s point of view, and I suppose my own: students progress toward graduation more quickly if we just let them take college stats instead of forcing them to take remedial algebra first. But there’s a dimension that the researchers leave largely unexplored: whether this all just represents the benefits of lowering standards.

Are We Just Avoiding Rigor?

The authors examine many potential explanations for why the stats-taking students outperformed the other groups, including potential non-random differences between groups, motivation, and similar, but seem oddly uninterested in what strikes me as the most obvious read of the data: that it’s just easier to pass Intro to Stats than it is to pass even a remedial algebra course. They do obliquely get at this point in the discussion, writing

degree progression is not the only consideration in setting remediation policy. The participants in Group Stat-WS were only taught elementary algebra material to the extent that such material was needed to understand the statistics material. Whether students should be graduating from college having learned statistics but without having learned all of elementary algebra is one of the many decisions that a college must make regarding which particular areas of knowledge should be required for a college degree. Views can differ as to which quantitative subjects a college graduate should know.

They sure can! This seems to me to be the root of the policy issue: should we substitute stats courses for algebra courses if we think doing so will make it less likely for students to drop out or otherwise be disrupted on the path to graduation?

This is not really a criticism of this research, though I’d have liked a little more straightforward discussion of this from the authors. But I will hold with Hacker in suggesting that this does represent a lowering of standards, and that this is a feature, not a bug. That is, I think we should allow some students to avoid harder math requirements precisely because the current standards are too high. Students in deeply quantitative fields will have higher in-major math requirements anyway. Of course, in order to take advantage of this, we’d have to acknowledge that the “every student a future engineer” school of educational policy is ill-conceived and likely to result only in a lot of otherwise talented students butting their heads up against the wall of math standards. But unlike most ed policy people, I am willing to say straightforwardly that there are real and obvious differences in the specific academic talents of different individual students, and that these differences cannot be closed through normal pedagogical means. That’s what the best evidence tells us, including this very study.

Hacker says that many ostensibly quantitative professions, like computer programmer or doctor, require far less abstract math skill than is presumed. I don’t doubt he’s correct. The question is whether we as a society – and, more important, employers – are willing to accept a world where some significant percentage of people in such jobs never had to pass an Algebra II or Calculus class. Or, failing that, can we redefine our sense of what is valuable work so that the many people who seem incapable of reaching standards in math can go on to have productive, financially secure lives?

What We’re Attempting with College is Very Hard

Colleges and universities have found themselves under a great deal of pressure, internal and external, in recent years. This is to be expected; they are charging their students exorbitant tuition and fees, after all, and despite an army of concern trolls doubting their value, the degrees they hand out in return are arguably more essential than ever for securing the good life. Though enrollment growth has slowed in recent years, over time the trend is clear:

Policymakers and politicians must understand: these new enrollments are coming overwhelmingly from the ranks of those who would once have been considered unprepared for higher education, and this has increased the difficulty of what we’re attempting dramatically.

What we’re attempting is to admit millions more people into the higher education system than before, almost all of whom come from educational and demographic backgrounds that would once have screened them out from attendance. Because those backgrounds are so deeply intertwined with traditional inequalities and social injustice, we have rightly felt a moral need to expand opportunity to those who come from them. Because the 21st century economy grants such economic rewards to those who earn a bachelor’s degree, we have developed a policy regime designed to get more of them one. I cannot help but see the moral logic behind these sentiments. And yet.

Let’s set aside my perpetual questions about the difference between relative and absolute academic performance and how they are rewarded. (Can the economic advantage of a college degree survive the erosion of the rarity of holding one? How could it possibly under any basic theory of market goods?) We’re still left with this dilemma: can we possibly maintain some coherent standards for what a college degree means while dramatically expanding the people who get them?

One way or another, everyone with an interest in college must understand that the transformation we’re attempting as a community of schools, educators, and policymakers is unprecedented. Today, the demands placed on higher education seem impossible: we must educate more cheaply, we must educate more quickly, we must educate far more underprepared students, and we must do so without sacrificing standards. This seems quixotic to me. Adjusting curricula in the way proposed in this research, and accepting that higher completion rates probably require lower standards, is one way forward. Or we can refuse to adapt to the clear implication of a mountain of discouraging data for students at the lower end of the performance distribution, and get back failure in return.

(Actual) Study of the Week: Academic Outcomes for Preemies

Now back to our regularly scheduled programming….

There’s a lurking danger in the “nature vs nurture” debate that has been so prominent in educational research for so long: people tend to assume that genetic influences are immutable, while environmental influences are assumed to be changeable. The former is not correct, at least in the sense that there are a lot of genetically influenced traits that can be altered or ameliorated – all manner of physical skills, for example, are subject to the impact of exercise, even as we acknowledge that at the top of the distribution, natural/genetic talents play a big role. Likewise, we can believe in educational efforts that somewhat ameliorate genetic influences even while we recognize that biological parentage powerfully shapes intellectual outcomes.

The obverse is even more often forgotten: just because an influence is environmental in nature, that does not mean we can necessarily change its effects. Lead exposure, for example, leads to relatively small but persistent damage to cognitive function. This is certainly environmental influence, but not one that we have tools to ameliorate. I’m not quite sure if we would call neonatal development “environmental,” but influences on children in the womb are a good example of non-genetic influences that are potentially immutable. And they are also another lens through which I want us to consider our tangled, frequently-contradictory intuitions about academic performance and just deserts.

Today’s Study, written by the exceptionally-Dutch-named Cornelieke Sandrine Hanan Aarnoudse-Moens, Nynke Weisglas-Kuperus, Johannes Bernard van Goudoever, and Jaap Oosterlaan, is a meta-analysis of extant research on the academic outcomes of children who were born very prematurely and/or at very low birth weight. (For an overview of meta-analysis and effect size, please see this post.)

The studies had to meet a number of restrictions in addition to typical quality checks. First, the studies considered had to look at very premature births, defined as less than 33 weeks gestation, and/or very low birth weight, defined as less than 1500 grams. Additionally, for inclusion in the meta-analysis, the studies had to track student performance to at least age 5, as this is where formal schooling begins and where responsible analysis of academic outcomes becomes possible. These studies reported on academic outcomes, behavioral outcomes as represented by teacher and parent observation checklists/surveys, and so-called executive functioning variables, which include things like impulse control and the ability to plan (and which have been pretty trendy). All in all, data from 14 studies on academic outcomes, 9 on behavioral outcomes, and 6 on executive functioning were considered. (There was some overlap.) In total, 4,125 very preterm and/or very low birth weight children were compared to 3,197 children born at term. The authors performed standard meta-analytic procedures, pooling SDs and weighting by sample size, and reported effect sizes in good old Cohen’s d.
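If you’re new to effect sizes, here’s a toy sketch of how a Cohen’s d is computed with a pooled SD, and how estimates from several studies get combined with sample-size weights. All the numbers are invented; the paper uses standard meta-analytic weighting, of which this is a simplified miniature:

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical study results (preterm vs. term reading scores), illustration only:
# (mean_preterm, sd_preterm, n_preterm, mean_term, sd_term, n_term)
studies = [
    (95, 15, 100, 102, 15, 100),
    (90, 14, 50, 98, 16, 60),
]

# Per-study effect sizes, then a sample-size-weighted pooled estimate.
ds = [cohens_d(*s) for s in studies]
ns = [s[2] + s[5] for s in studies]
pooled_d = sum(d * n for d, n in zip(ds, ns)) / sum(ns)

print(f"per-study d: {[round(d, 2) for d in ds]}, pooled d: {pooled_d:.2f}")
```

The negative signs mean the preterm group scored lower; a pooled d around -0.5, like the reading result in this meta-analysis, is a large effect by educational-research standards.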

They also used a couple of statistical tests to attempt to adjust for publication bias. Publication bias is a troubling aspect of research literatures that can undermine meta-analysis – particularly problematic given that meta-analysis is often viewed as a way to ameliorate (never eliminate) other problems like p-hacking. The term refers to the fact that journals are much more likely to publish studies with significant effects than those without them. This has several bad outcomes – for one, it provides perverse incentives for academics trying to get jobs and tenure. But it also distorts our view of reality. We adjust for the various issues with individual studies, in part, by looking at a broad swath of research literature. But if the non-significant results are sitting in a drawer while the significant results are in Google Scholar, that broad look won’t help, even with meta-analysis.
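The intuition behind such checks is the funnel plot: small, noisy studies should scatter symmetrically around the pooled effect, but if only their significant results get published, effect size and standard error become correlated. Formal tests like Egger’s regression refine this idea; here is the crude version with invented numbers:

```python
# Crude sketch of the funnel-plot intuition behind publication-bias checks.
# All numbers are invented for illustration.

effects = [-0.50, -0.48, -0.55, -0.70, -0.85]  # hypothetical study effect sizes
ses = [0.05, 0.06, 0.10, 0.20, 0.30]           # their standard errors

# Pearson correlation between effect size and standard error, by hand.
n = len(effects)
mean_e, mean_s = sum(effects) / n, sum(ses) / n
cov = sum((e - mean_e) * (s - mean_s) for e, s in zip(effects, ses)) / n
var_e = sum((e - mean_e) ** 2 for e in effects) / n
var_s = sum((s - mean_s) ** 2 for s in ses) / n
r = cov / (var_e * var_s) ** 0.5

# A strong correlation means the biggest effects come from the noisiest
# (smallest) studies -- a classic red flag for publication bias.
print(f"effect-size / standard-error correlation: {r:.2f}")
```

In this fake data the noisiest studies report the largest effects, so the correlation is strongly negative; in an unbiased literature it should hover near zero.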

The results are not particularly surprising, but they are sad all the same: children born very prematurely and/or at very low birth weight have persistently worse academic outcomes than their term-born peers. In terms of academic outcomes, we’re talking about -0.48 SD for reading, -0.60 SD for mathematics, and -0.76 SD for spelling. These are, in the context of educational research, large effects. There was some variation between studies, as is to be expected in any meta-analysis, but this variation was not large enough to undermine confidence in these results. Checks for publication bias came up largely clean as well. There are also findings indicating that children born prematurely have problems with attention, verbal fluency, and working memory. These effect sizes had no meaningful relationship to the age of assessment, suggesting that these problems are persistent. With a few exceptions, these relationships are continuous – that is, children with lower gestational ages and birth weights are generally worse off even when compared to other children born prematurely and/or at low birth weight.

First, this is very important to say: the results of this meta-analysis are averages. We live in a world of variability. There are certainly many children who are born severely prematurely and go on to academic excellence. It would be wrong to assume that these influences spell a certain academic destiny, as it would be for any variable we examine in educational research. The trends, however, are clear. Sadly, other research suggests that these problems are likely to extend into at least young adulthood.

What are some of the consequences here? Well, to begin with, I think it’s another important facet of how we think about educational outcomes and how much of those outcomes lie outside the hands of students, parents, and teachers. No one has chosen this outcome. For another, there’s the breaking of the nature/nurture binary I pointed out above: this is a non-genetic but uncontrolled introduction of major influence into the educational outcomes of children. I don’t mean to be fatalistic about things; there’s always a chance that we’ll find some interventions that help to close these gaps. But I think this is another reason for us to get outside of a moralistic framework for education, where every below-average outcome has to be the fault of someone – the parents, the teachers, or the student themselves.

And again, I think this points in the direction of a societal need to expand our definition of what it means to be a good student, and through that, what it means to be a valuable human being. True, very early births are comparatively rare, though almost 10% of all American births are preterm. (Like seemingly everything else in the United States, preterm birth rates are influenced by race, class, and geography.) But this dynamic is just another data point in a large set of evidence that suggests that academic outcomes are largely outside of the hands of individuals, parents, and teachers, particularly if we recognize that genetic influence is not controlled by those groups. What’s interesting with premature babies is that I doubt anyone would think that they somehow deserve worse life outcomes as a result of their academic struggles. Who could be so callous? And yet when it comes to genetic gifts – which are just as uncontrolled by individuals as being born prematurely – there are many who think it’s fine to disproportionately hand out reward. I don’t get that.

Ultimately, rather than continuing to engage in a quixotic policy agenda designed to give every child the exact same odds of being a Stanford-trained computer scientist, we should recognize as a society that we will always have a range of academic outcomes, that this means we will always have people who struggle as well as excel, and that to a large extent these outcomes are not controlled by individuals. Therefore we should build a robust social safety net to protect people who are not fortunate enough to be academically gifted, and we should critique the Cult of Smart, recognizing that there are all manner of ways to be valuable human beings.

Study of the Week: We’ll Only Scale Up the Good Ones

When it comes to education research and public policy, scale is the name of the game.

Does pre-K work? Left-leaning people (that is, people who generally share my politics) tend to be strong advocates of these programs. It’s true that generically, it’s easier to get meaningful educational benefits from interventions in early childhood than later in life. And pre-K proponents tend to cite some solid studies that show some gains relative to peer groups, though these gains are generally modest and tend to fade out over time. Unfortunately, while some of these studies have responsible designs, many that are still cited are old, from small programs, or both.

Today’s Study of the Week, by Mark W. Lipsey, Dale C. Farran, and Kerry G. Hofer, is a much-discussed, controversial study from Tennessee’s Voluntary Prekindergarten Program. The Vanderbilt University researchers investigated the academic and social impacts of the state’s pre-K programs on student outcomes. The study we’re looking at is a randomized experimental design, which was pulled from a larger observational study. The Tennessee program, in some locales, had more applicants than available seats. These seats are filled by a random lottery, creating a natural control and experimental group.

There is one important caveat here: the students examined in the intensive portion of the research had to be selected from those whose parents gave consent. That’s about a third of the potential students. This is a potential source of bias. While the randomized design will help, what we can responsibly say is that we have random selection within the group of students whose parents opted in, but with a nonrandom distribution relative to the overall group of students attending this program. I don’t think that’s a particularly serious problem, but it’s a source of potential selection bias and something to be aware of. There’s also my persistent question about the degree to which school selection lotteries can be gamed by parents and administrators. There are lots of examples of this happening. (Here’s one at a much-lauded magnet school in Connecticut.) Most people in the research field seem not to see this as a big concern. I don’t know.

In any event, the results of the research were not encouraging. Researchers examined six identified subtests (two language, two literacy, two math) from the Woodcock-Johnson tests of cognitive ability, a well-validated and widely-used battery of tests of student academic and intellectual skills. They also looked at a set of non-cognitive abilities related to behavior, socialization, and enthusiasm for school. A predictable pattern played out. Students who attended the Tennessee pre-K program saw short-term significant gains relative to their peers who did not attend the program. But over time, the peer group caught up, and in fact in this study, exceeded the test group. That is, students who attended Tennessee’s pre-K program ended up actually underperforming those who were not selected into it.

By the end of kindergarten, the control children had caught up to the TN‐VPK children and there were no longer significant differences between them on any achievement measures. The same result was obtained at the end of first grade using both composite achievement measures. In second grade, however, the groups began to diverge with the TN‐VPK children scoring lower than the control children on most of the measures…. In terms of behavioral effects, in the spring the first grade teachers reversed the fall kindergarten teacher ratings. First grade teachers rated the TN‐VPK children as less well prepared for school, having poorer work skills in the classrooms, and feeling more negative about school.

This dispiriting outcome mimics that of the Head Start study, another much-discussed, controversial study that found similar outcomes: initial advantages for Head Start students that are lost entirely by 3rd grade.

Further study is needed, but it seems that the larger and more representative the study, the less impressive – and the less persistent – the gains from pre-K. There’s a bit of uncertainty here about whether the differences in outcomes are really the product of differences in programs or due to differences in the research itself. And I don’t pretend that this is a settled question. But it is important to recognize that the positive evidence for pre-K comes from smaller, higher-resource, more-intensive programs. Larger programs have far less encouraging outcomes.

The best guess, it seems to me, is that at scale universal pre-K programs would function more like the Tennessee system and less like the small, higher-performing programs. That’s because scaling up any major institutional venture, in a country the size of the United States, is going to entail the inevitable moderating effects of many repetitions. That is, you can build one school or one program and invest a lot of time, effort, and resources into making it as effective as possible, and potentially see significant gains relative to other schools. But it strikes me as a simple statement of the nature of reality that this intensity of effort and attention can’t scale. As Farran and Lipsey say in a Brookings Institution essay, “To assert that these same outcomes can be achieved at scale by pre-K programs that cost less and don’t look the same is unsupported by any available evidence.”

Some will immediately say: well, let’s just pay as much for large-scale pre-K as they do in the other programs and model their techniques. The $26 billion question is, can you actually do that? Can what makes these programs special actually be scaled? Is there hidden bias here that will wash out as we expand the programs? I confess I’m skeptical that we’ll see these quantitative gains under even the best scenario. I think we need to understand the inevitability of mediocrity and regression to the mean. That doesn’t mean I don’t support universal pre-kindergarten childcare. As with after school programs, I do – for social and political reasons, though, not out of any conviction that they’ll change test scores much. I’d be happy to be proven wrong.

Now I don’t mean to extrapolate irresponsibly. But allow me to extrapolate irresponsibly: isn’t this precisely what we should expect with charter schools, too? We tend to see, survivorship-bias-heavy CREDO studies aside, that at scale the median charter school does little or nothing to improve on traditional public schools. We also see a number of idiosyncratic, high-intensity, high-attention charters that report better outcomes. The question you have to ask, based on how the world works, is which is more likely to be replicated at scale – the median, or the exceptions?

I’ve made this point before about Donald Trump’s favorite charter schools, Success Academy here in New York. Let’s set aside questions of the abusive nature of the teaching that goes on in these schools. The basic charter proponent argument is that these schools succeed because they can fire bad teachers and replace them with good. Success Academy schools are notoriously high-stress, long-hour, low-pay affairs. This leads naturally to high teacher attrition. Luckily for the NYC-based Success Academy, New York is filled with lots of eager young people who want to get a foothold in the city, do some do-goodering, then bail for their “real” careers later on – essentially replicating the Teach for America model. So: even if we take all of the results from such programs at face value, do you think this is a situation that can be scaled up in places that are far less attractive to well-educated, striving young workers? Can you get that kind of churn and get the more talented candidates you say you need, at no higher cost, to come to the Ozarks or Flint, Michigan or the Native American reservations? Can you nationally have a profession of 3 million people, already caught in a teacher shortage, and then replicate conditions that lead to somewhere between 35% and 50% annual turnover, depending on whose numbers you trust?

And am I really being too skeptical if my assumption is to say no, you can’t?


Study of the Week: Of Course Virtual K-12 Schools Don’t Work

This one seems kind of like shooting fish in a barrel, but given that “technology will solve our educational problems” is holy writ among the Davos crowd no matter what the evidence, I suppose this is worth doing.

Few people would ever come out and say this, but central to assumptions about educational technology is that human teachers are an inefficiency to be removed from the system by whatever means possible. Right now, not even the most credulous Davos type, nor the most shameless ed tech profiteer, is making the case for fully automated AI-based instruction. But attempts to dramatically increase the number of students that you can force through the capitalist pipeline at low cost – sorry, that you can help nurture and grow – are well under way, typically by using digital systems to let one teacher teach more students than you’d see in a brick-and-mortar classroom. This also cuts down on the costs of facilities, which give kids a safe and engaging place to go every day but which are expensive. So you build a virtual platform, policy types use words like “innovation” and “disrupt,” and for-profit entities start sucking up public money with vague promises of deliverance-through-digital-technology. Kids and parents get “choice,” which the ed reform movement has successfully branded as a good thing even though at scale school choice has not been demonstrated to have any meaningful relationship to improved outcomes at all.

Today’s Study of the Week, from a couple years ago, takes a look at whether these virtual K-12 schools actually, you know, work. It’s a part of the CREDO project. I have a number of issues, methodological and political, with the CREDO program generally, but I still think this is high-quality data. It’s a large data set that compares the outcomes of students in traditional public schools, brick and mortar charters, and virtual charters. The study uses a matched data method – in simple terms, comparing students from the different “conditions” who match on a variety of demographic and educational metrics in order to attempt to control for construct-irrelevant variance. This can help to ameliorate some of the problems with observational studies, but bear in mind that, once again, this is not the same as a true randomized controlled trial. They had to do things this way because online charter seats are not assigned via lottery. (For the record, I do not trust the randomization effects of such lotteries because of the many ways in which they are gamed, but here that’s not even an issue because there’s no lottery at all.)

The matched variables, if you’re curious:

• Grade level
• Gender
• Race/Ethnicity
• Free or Reduced-Price Lunch Eligibility
• English Language Learner Status
• Special Education Status
• Prior test score on state achievement test
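To make the matching idea concrete, here’s a miniature sketch of exact matching on variables like those above. The real CREDO method builds composite “virtual twins” from multiple matched records; this simplified version, with entirely invented data, just pairs each virtual-charter student with a traditional-school student who matches on every variable and compares their test-score gains:

```python
# A miniature, simplified version of matching on observed characteristics.
# The variable names mirror the list above; all records are invented.

virtual = [
    {"grade": 5, "gender": "F", "frl": True,  "prior_score": 210, "gain": -12},
    {"grade": 5, "gender": "M", "frl": False, "prior_score": 230, "gain": -8},
]
traditional = [
    {"grade": 5, "gender": "F", "frl": True,  "prior_score": 210, "gain": 3},
    {"grade": 5, "gender": "M", "frl": False, "prior_score": 230, "gain": 5},
    {"grade": 4, "gender": "M", "frl": True,  "prior_score": 190, "gain": 2},
]

MATCH_KEYS = ("grade", "gender", "frl", "prior_score")

def key(student):
    return tuple(student[k] for k in MATCH_KEYS)

# Pair each virtual-charter student with a traditional-school "twin" who
# matches on every listed variable, then compare their score gains.
twins = {key(t): t for t in traditional}
diffs = [v["gain"] - twins[key(v)]["gain"] for v in virtual if key(v) in twins]

print(f"matched pairs: {len(diffs)}, mean gain difference: {sum(diffs)/len(diffs):.1f}")
```

The point of matching is that any remaining difference in gains can’t be attributed to the listed variables – though, unlike randomization, it can’t rule out differences on anything unlisted.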

So how well do online charters work? They don’t. They don’t work. Look at this.

Please note that, though these negative effect sizes may not seem that big to you, in a context where most attempted interventions are not statistically different from zero, they’re remarkable. I invite you to look at the “days of learning lost” scale on the right of the graphic. There are only 180 days in the typical K-12 school year! This is educational malpractice. How could such a thing have been attempted with over 160,000 students without any solid evidence it could work? Because the constant, the-sky-is-falling crisis narrative in education has created a context where people believe they are entitled to try anything, so long as their intentions are good. Crisis narratives undermine checks and balances and the natural skepticism that we should ordinarily apply to the interests of young children and to public expenditure. So you get millions of dollars spent on online charter schools that leave students a full school year behind their peers.

Are policy types still going full speed ahead, working to send more and more students – and more and more public dollars – into these failed, broken online schools? Of course. Educational technology and the ed reform movement writ large cannot fail, they can only be failed, and nothing as trivial as reality is going to stand in the way.

Study of the Week: Trade Schools Are No Panacea

You will likely have encountered the common assertion that we need to send people into trade schools to address problems like college dropout rates and soft labor markets for certain categories of workers. As The Atlantic recently pointed out, the idea that we need to be sending more people to trade and tech schools has broad bipartisan, cross-ideological appeal. This argument has a lot of different flavors, but it tends to come down to the claim that we shouldn’t be sending everyone to college (I agree!) and that instead we should be pushing more people into skilled trades. Oftentimes this is encouraged as an apprenticeship model over a schooling model.

I find there’s far more in the way of narrative force behind these claims than actual proof. It just sounds good – we need to get back to making things, to helping people learn how to build and repair! But… where’s the evidence? I’ve often looked at brute-force figures like unemployment rates for particular professions, but it’s hard to draw responsible conclusions from that kind of analysis. Well, there’s a big new study out that looks at the question in a much more rigorous way – and the results aren’t particularly encouraging.

Today’s Study of the Week, written by Eric A. Hanushek, Guido Schwerdt, Ludger Woessmann, and Lei Zhang, looks at how workers who attend vocational schools perform relative to those who attend general education schools. Like the recent Study of the Week on the impact of universal free school breakfast, this study uses a difference-in-differences approach to explore causation, again because it’s impossible to do an experiment with this type of question – you can’t exactly tell people that your randomization has sorted them into a particular type of schooling and potentially life-long career path, after all. The primary data source is the International Adult Literacy Survey, a very large survey rich in demographic, education, and employment data from 18 countries, gathered from 1994 to 1998. (The authors restrict their analysis to the 11 countries that have robust vocational education systems in place.) The age of the data is unfortunate, but there’s little reason to believe the analysis would change dramatically with newer data, and the data set is so rich with variables (and thus the potential to do extensive checks for robustness and bias) that it’s a good resource. What do they find? In broad strokes, vocational/tech training helps you get a job right out of school, but hurts you as you go along later in life:

(don’t be too offended by excluding women – their overall change in workforce participation made it necessary)

Most important to our purpose, while individuals with a general education are initially (normalized to an age of 16 years) 6.9 percentage points less likely to be employed than those with a vocational education, the gap in employment rates narrows by 2.1 percentage points every ten years. This implies that by age 49, on average, individuals completing a general education are more likely to be employed than individuals completing a vocational education. Individuals completing a secondary-school equivalency or other program (the “other” category) have a virtually identical employment trajectory as those completing a vocational education.
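It’s worth checking the arithmetic in that passage: with an initial 6.9-point vocational advantage closing at 2.1 points per decade, the crossover does indeed land at about age 49:

```python
# Back-of-the-envelope check on the quoted crossover age: vocational
# graduates start 6.9 percentage points ahead (normalized to age 16),
# and the gap closes by 2.1 points per decade, so the employment-rate
# lines cross once the initial gap has been eaten up.

initial_gap = 6.9        # percentage points at age 16
closing_rate = 2.1 / 10  # percentage points per year

years_to_crossover = initial_gap / closing_rate
crossover_age = 16 + years_to_crossover

print(f"employment rates cross at roughly age {crossover_age:.0f}")
```

So the study’s headline claim is internally consistent: the early employability advantage of vocational training is fully exhausted by one’s late forties.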

Now, they go on to do a lot of quality controls and checks for robustness and confounds. As much of a slog as that stuff is, I recommend you check some of that out and start to pick some of it apart. Becoming a skilled reader of academic research literature really requires that you get used to picking apart the quality controls, because this is often where the juicy stuff can be found. Still, in this study the various checks and controls all support the same basic analysis: those who attend vocational schools or programs enjoy higher initial employability but go on to suffer from higher unemployment later in life.

What’s going on with these trends? The suggestion of the authors seems correct to me: vocational training is likely more specific and job-focused than general ed, which means that its students are more ready to jump right into work. But over time, technological and economic changes change which skills and competencies are valued by employers, and the general education students have been “taught to learn,” meaning that they are more adaptable and can acquire new and valuable skills.

I’m not 100% convinced that counseling more people into the trades is a bad idea. After all, the world needs people who can do these things, and early-career employability is nothing to dismiss. But the affirmative case that more trade school is a solution to long-term unemployment problems seems clearly wrong. And in fact this type of education seems to deepen one of our bigger problems in the current economy: technological change moves so fast these days that it’s hard for older workers to adapt, and they often find themselves in truly unfortunate positions. Even in trades that are less susceptible to technological change, there’s uncertainty; a lot of the traditional construction trades, for example, are very exposed to the housing market, as we learned the hard way in 2009. Do we want to use public policy to deepen these risks?

In a broader sense: it’s unclear if it’s ever a good idea to push people into a particular narrow range of occupations, because then people rush into them and… there stops being any shortage and advantage for labor. For a little while there, petrochemical engineering seemed huge. But it takes a lot of schooling to do those jobs, and then the oil market crashed. Pharmacy was the safe haven, and then word got out, a ton of people went into the field, and the labor market advantage was eroded. Also, there are limits to our understanding of how many workers we need in a given field. Some people argue there’s a teacher shortage; some insist there isn’t. Some people believe there’s a shortage of nurses; some claim there’s a glut. If you were a young student, would you want to bet your future on this uncertainty? It seems far more useful to me to try and train students into being nimble, adaptable learners than to train them for particular jobs. That has the bonus advantage of restoring the “practical” value of the humanities and arts, which have always been key aspects of learning to be well-rounded intellects.

My desires are twofold. First, that we be very careful when making claims about the labor market of the future, given the certainty that trends change. (One of my Purdue University students once told me, with a smirk, that he had intended to study Search Engine Optimization when he was in school, only to find that Facebook had eaten Google as the primary driver of many kinds of web traffic.) Second, that we stop saying “the problem is you went into X field” altogether. Individual workers are not responsible for labor market conditions. Those are the product of macroeconomic conditions – inadequate aggregate demand, outsourcing, and the merciless march of automation. What’s needed is not to try and read the tea leaves and guess which fields might reward some slice of our workforce now, but to redefine our attitude towards work and material security through the institution of some sort of guaranteed minimum income. Then, we can train students in the fields in which they have interest and talent, contribute to their human flourishing in doing so, and help shelter them from the fickleness of the economy. The labor market is not a morality play.

Study of the Week: Feed Kids to Feed Them

Today’s Study of the Week is about subsidized meal programs for public school students, particularly breakfast. School breakfast programs have been targeted by policymakers for a while, in part because of discouraging participation levels. Even many students who are eligible for subsidized lunches often don’t take advantage of school breakfast. The reasons for this are multiple. Price is certainly a factor. As you’d expect, price is inversely related to participation rates for school breakfast. Also, in order to take advantage of breakfast programs, you need to arrive at school early enough to eat before school formally begins, and it’s often hard enough to get teenagers to school on time just for class. Finally, there’s a stigma component, particularly associated with subsidized breakfast programs. It was certainly the case at my public high school, where 44% of students were eligible for federal school lunch subsidies, that school breakfast carried class associations. At lunch, everybody’s eating together, but students at breakfast tended to be poorer kids – which in turn likely makes it less likely that students will want to be seen getting school breakfast.

The study, written by Jacob Leos-Urbel, Amy Ellen Schwartz, Meryle Weinstein, and Sean Corcoran (all of NYU), takes advantage of a policy change in New York public schools in 2003. Previously, school breakfast had been free only to those who were eligible for federal lunch subsidies, which remains the case in most school districts. New York made breakfast free for all students, defraying the costs by raising the price of unsubsidized lunch from $1.00 to $1.50. They then went looking to see if the switch to free breakfast for all changed participation in the breakfast program, looking for differences between the three tiers – free lunch students, reduced lunch students, and students who pay full price. They also compared outcomes from traditional schools to Universal Free Meal (UFM) schools, where the percentage of eligible students is so high that everyone in the school gets meals for free already. This helped them tease out possible differences in participation based on moving to a universal free breakfast model. They were able to use a robust data set comprising results from 723,843 students from 667 schools, grades 3–8. They also investigated whether breakfast participation rates were associated with performance in quantitative educational metrics.

It’s important to say that it’s hard to really get at causality here because we’re not doing a randomized experiment. Such an experiment would be flatly unethical – “sorry, kid, you got sorted into the no-free-breakfast group, good luck.” So we have to do observational studies and use what techniques we can to adjust for their weaknesses. In this study, the authors used what’s called a difference-in-difference design. These techniques are often used when analyzing natural experiments. In the current case, we have schools where the change in policy has no impact on who receives free breakfast (the UFM schools) and schools where there is an impact (the traditional schools). Therefore the UFM schools can function as a kind of natural control group, since they did not receive the “treatment.” You then use a statistical model to compare the change in the variables of interest for the “control” group to the change for the “treatment” group. Make sense?
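At its core, that comparison of changes reduces to a single subtraction of subtractions. Here’s a minimal sketch with entirely made-up participation numbers (the study’s actual models control for much more):

```python
# A bare-bones difference-in-difference calculation with hypothetical
# numbers. Mean breakfast participation before/after the policy change,
# for the "control" (UFM) schools and the "treatment" (traditional) schools:
control_pre, control_post = 0.40, 0.42  # hypothetical UFM means
treated_pre, treated_post = 0.25, 0.31  # hypothetical traditional means

# The DiD estimate is the treatment group's change minus the control
# group's change, which nets out the time trend shared by both groups.
did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"estimated policy effect: {did:+.2f}")  # +0.04
```

The point of the subtraction is that anything affecting both kinds of schools equally over the period – a citywide trend, say – cancels out, leaving only the change attributable to the policy.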

What did the authors find? The results of the policy change were modest, in almost every measurable way, and consistent across a number of models that the authors go into in great detail in the paper. Students did take advantage of school breakfast more after breakfast became universally free. On the one hand, students who paid full price increased breakfast participation by 55%, which is a large number; but on the other hand, their initial baseline participation rates were so low (again because breakfast participation is class-influenced) that they only ate on average 6 additional breakfasts a year. Participation among reduced-price and free-lunch students increased by 33% and 15%, respectively – the latter particularly interesting given that those students did not pay for breakfast to begin with. Still, that too only represents about 6 meals over the course of a year, not nothing but perhaps less than we’d hope for a program with low participation rates. The only meaningful difference in models seems to be when they restrict their analysis to the small number (91) of schools where less than a third of students are eligible for lunch subsidies, in which case breakfast participation grew by a substantially larger amount. The purchase of lunches, for what it’s worth, remained static despite the price increase.

There’s a lot of picking apart the data and attempting to determine to what degree these findings are related to stigma. I confess I find the discussion a bit muddled but your mileage may vary. The educational impacts, also, were slight. They found a small increase in attendance, but this result was not significant, and no impact on reading and math outcomes.

These findings are somewhat discouraging. Certainly we would hope that moving to a universal program would help to spur participation rates to a greater degree than we’re seeing here. But it’s important to note that the authors largely restricted their analysis to the years immediately before and after the policy change, thanks to the needs of their model. When broadening the time frame by a couple years, they find an accelerating trend in participation rates, though the model is somewhat less robust. What’s more, as the authors note, decreasing stigma is the kind of thing that takes time. If it is in fact the case that stigma keeps students from taking part in school breakfast, it may well take a longer time period for universal free breakfast to erode that disincentive.

I’m also inclined to suspect that the need to get kids to school early to eat represents a serious challenge to the pragmatic success of this program. There’s perhaps good news on the way:

Even when free for all, school breakfast is voluntary. Further, unlike school lunch, breakfast traditionally is not fully incorporated into the school day and students must arrive at school early in order to participate. Importantly, in the time period since the introduction of the universal free breakfast policy considered in this paper, New York City and other large cities have begun to explore other avenues to increase participation. Most notably, some schools now provide breakfast in the classroom.

Ultimately, I believe that making school breakfast universally free is a great change even in light of relatively modest impacts on participation rate. We should embrace providing free breakfast to all students regardless of income level out of the principle of doing so, particularly considering that fluctuations in parental income might make kids who are technically ineligible unable to pay for breakfast. In time, if we set up this universal program as an embedded part of the school day, and work diligently to erase the stigma of using it, I believe more and more kids will begin their days with a full stomach.

As for the lack of impacts on quantitative metrics, well – I think that’s no real objection at all. We should feed kids to feed them, not to improve their numbers. This all dovetails with my earlier point about after school programs: if we insist on viewing every question through the lens of test scores, we’re missing out on opportunities to improve the lives of children and parents that are real and important. Again, I will say that I recognize the value of quantitative academic outcomes in certain policy situations. But the relentless focus on quantitative outcomes leads to scenarios where we have to ask questions like whether giving kids free breakfast improves test scores. If it does, great – but the reason to feed children is to feed children. When it comes to test scores and education policy, the tail too often wags the dog, and it has to stop.

Study of the Week: Better and Worse Ways to Attack Entrance Exams

For this week’s Study of the Week I want to look at standardized tests, the concept of validity, and how best – and worst – to criticize exams like the SAT and ACT. To begin, let’s consider what exactly it means to call such exams valid.

What is validity?

Validity is a multi-faceted concept that’s seen as a core aspect of test development. Like many subjects in psychometrics and stats, it tends to be used casually and referred to as something fairly simple, when in fact the concept is notoriously complex. Accepting that any one-sentence definition of validity is thus a distortion, generally we say that validity refers to the degree that a test measures that which it purports to measure. A test is more valid or less depending on its ability to actually capture the underlying traits we are interested in investigating through its mechanism. No test can ever be fully or perfectly validated; rather we can say that it is more or less valid. Validity is a vector, not a destination.

Validity is so complex, and so interesting, in part because it sits at the nexus of both quantitative and philosophical concerns. Concepts that we want to test may appear superficially simple but are often filled with hidden complexity. As I wrote in a past Study of the Week, talking about the related issues of construct and operationalization,

If we want to test reading ability, how would we go about doing that? A simple way might be to have a test subject read a book out loud. We might then decide if the subject can be put into the CAN READ or CAN’T READ pile. But of course that’s quite lacking in granularity and leaves us with a lot of questions. If a reader mispronounces a word but understands its meaning, does that mean they can’t read that word? How many words can a reader fail to read correctly in a given text before we sort them into the CAN’T READ pile? Clearly, reading isn’t really a binary activity. Some people are better or worse readers and some people can read harder or easier texts. What we need is a scale and a test to assign readers to it. What form should that scale take? How many questions is best? Should the test involve reading passages or reading sentences? Fill in the blank or multiple choice? Is the ability to spot grammatical errors in a text an aspect of reading, or is that a different construct? Is vocabulary knowledge a part of the construct of reading ability or a separate construct?

Questions such as these are endemic to test development, and frequently we are forced to make subjective decisions about how best to measure complex constructs of interest. Common to the quantitative social sciences, this subjective, theoretical side of validity is often written out of our conception of the topic, as we want to speak with the certainty of numbers and the authority of the “harder” sciences. But theory is inextricable from empiricism, and the more that we wish to hide it, the more subject we are to distortions that arise from failing to fully think through our theories and what they mean. Good empiricists know theory comes first; without it, the numbers are meaningless.

Validity has been subdivided into a large number of types, which reflect different goals and values within the test development process. Some examples include:

  • Predictive Validity: The ability of a test’s results to predict that which it should be able to predict if the test is in fact valid. If a reading test predicts whether students can in fact read texts of a given complexity or reading level, that would provide evidence of predictive validity. The SAT’s ability to predict the grades of college freshmen is a classic example.
  • Concurrent Validity: If a test’s results are strongly correlated with that of a test that measures similar constructs and which has itself been sufficiently validated, that provides evidence of concurrent validity. Of course, you have to be careful – two invalid tests might provide similar results but not tell us much of actual worth. Still, a test of quantitative reasoning and a test of math would be expected to be imperfectly yet moderately-to-strongly correlated if each is itself a valid test of the given construct.
  • Curricular Validity: As the name implies, curricular validity reflects the degree to which a test matches with a given curriculum. If a test of biology closely matches the content in the syllabus of that biology course, we would argue for high curricular validity. This is important because we can easily imagine a scenario where general ability in biology could be measured effectively by a test that lacked curricular validity – students who are strong in biology might score well on a test, and students who are poor would likely score poorly, even if that test didn’t closely match the curriculum. But that test would still not be a particularly valid measure of biology as learned in that class, so curricular validity would be low. This is often expressed as a matter of ethics.
  • Ecological Validity: Heading in a “softer” direction, ecological validity is often discussed to refer to the degree to which a test or similar assessment instrument matches the real-life contexts in which its consequences will be enacted. Take writing assessment. In previous generations, it was common for student writing ability to be tested through multiple choice tests on grammar and sentence combining. These tests were argued to be valid because their results tend to be highly correlated with the scores that students receive on written essay exams. But writing teachers objected, quite reasonably, that we should test student writing by having them write, even if those correlations are strong. This is an invocation of ecological validity and reflects a broader (and to me positive) effort to not reduce validity to narrowly numerical terms.

I could go on!

When we talk about entrance examinations like the SAT or GRE, we often fixate on predictive validity, for obvious reasons. If we’re using test scores as criteria for entry into selective institutions, we are making a set of claims about the relationship between those scores and the eventual performance of those students. Most importantly, we’re saying that the tests help us to know that students can complete a given college curriculum, that we’re not setting them up to fail by admitting them to a school where they are not academically prepared to thrive. This is, ostensibly, the first responsibility of the college admissions process. Ostensibly.

Of course, there are ceiling effects here, and a whole host of social and ethical concerns that predictive validity can’t address. I can’t find a link now but a while back a Harvard admissions officer admitted that something like 90% of the applicants have the academic ability to succeed at the school, and that much of the screening process had little to do with actual academic preparedness. This is a big subject that’s outside of the bounds of this week’s study.

The ACT: Still Predictively Valid

Today’s study, by Paul A. Westrick, Huy Le, Steven B. Robbins, Justine M. R. Radunzel, and Frank L. Schmidt, is a large-n (189,612) study about the predictive validity of the ACT, with analysis of the role of socioeconomic status (SES) and high school grades in retention and college grades. The researchers examined the outcomes of students who took the ACT and went on to enroll in 4-year institutions from 2000 to 2006.

The nut:

After corrections for range restriction, the estimated mean correlation between ACT scores and 1st-year GPA was .51, and the estimated mean correlation between high school GPA and 1st-year GPA was .58. In addition, the validity coefficients for ACT Composite score and high school GPA were found to be somewhat variable across institutions, with 90% of the coefficients estimated to fall between .43 and .60, and between .49 and .68, respectively (as indicated by the 90% credibility intervals). In contrast, after correcting for artifacts, the estimated mean correlation between SES and 1st-year GPA was only .24 and did not vary across institutions….

…1st-year GPA, the most proximal predictor of 2nd-year retention, had the strongest relationship (.41). ACT Composite scores (.19) and high school GPA (.21) were similar in the strength of their relationships with 2nd-year retention, and SES had the weakest relationship with 2nd-year retention (.10).
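A note on the phrase “after corrections for range restriction” in that excerpt: because we only observe college grades for students who were admitted and enrolled, the sample has a narrower spread of ACT scores than the full applicant pool, which mechanically shrinks the observed correlation. The standard fix is Thorndike’s Case II formula. Here’s a sketch with illustrative numbers of my own choosing – this is the textbook correction, not a reproduction of the authors’ exact procedure:

```python
import math

# Thorndike's Case II correction for direct range restriction.
# sd_ratio = (SD of the predictor in the unrestricted applicant pool)
#          / (SD of the predictor in the restricted, enrolled sample).
def correct_for_range_restriction(r_restricted, sd_ratio):
    u = sd_ratio
    return (r_restricted * u) / math.sqrt(
        1 - r_restricted**2 + (r_restricted * u) ** 2
    )

# Hypothetical example: an observed ACT/GPA correlation of .40 among
# enrolled students, where the applicant pool's ACT spread is 1.3x
# that of the enrollees:
print(round(correct_for_range_restriction(0.40, 1.3), 2))  # → 0.49
```

The correction always pushes the estimate upward, which is why the corrected .51 reported above is larger than the raw correlations you’d compute from enrolled students alone.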

The results should be familiar to anyone who has taken a good look at the literature on these tests, and to anyone who has been a regular reader of this blog. The ACT is in fact a pretty strong predictor of GPA, though far from a perfect one at .51. Context is key here; in the world of social sciences and education, .51 is an impressive degree of predictive validity for the criterion of interest. But there’s lots of wiggle! And I think that’s ultimately a good thing; it permits us to recognize that there are a variety of ways to effectively navigate the challenges of the college experience… and to fail to do so. (As the Study of the Week post linked to above notes, GPA is strongly influenced by Conscientiousness, the part of the Five Factor Model associated with persistence and delaying gratification.) We live in a world of variability, and no test can ever make perfectly accurate predictions about who will succeed or fail. Exceptions abound. Proponents of these tests will say, though, that they are probably much more valid predictors of college grades and dropout rates than more subjective criteria like essays and extracurricular activities. And they have a point.

Does the fact that SES correlates “only” at .24 with college GPA mean SES doesn’t matter? Of course not. That level of correlation for a variable that is truly construct-irrelevant and which has such obvious social justice dimensions is notable even if it’s less powerful than some would suspect. It simply means that we should take care not to exaggerate that relationship, or the relationship between SES and performance on tests like the ACT and SAT, which is similar at about .25 in the best data known to me. Again: clearly that is a relevant relationship, and clearly it does not support the notion that these tests only reflect differences in SES.

Ultimately, every read I have of the extant evidence demonstrates that tests like the SAT and ACT are moderately to highly effective at predicting which students will succeed in terms of college GPA and retention rates. They are not perfect and should not be treated as such, so we should use other types of evidence such as high school grades and other, “soft” factors in our college admissions procedures – in other words, what we already do – if we’re primarily concerned with screening for prerequisite ability. Does that mean I have no objections to these tests or their use? Not at all. It just means that I want to make the right kinds of criticisms.

Don’t Criticize Strength, Criticize Weakness

A theme that I will return to again and again in this space is that we need to consider education and its place in society from a high enough level to think coherently. Critics of the SAT and ACT tend to pitch their criticisms at a level that does them no good.

So take this piece in Slate from a couple of enthusiastic SAT (and IQ) proponents. In it, they take several liberal academics to task for making inaccurate claims about the SAT, in particular the idea that the SAT only measures how well you take the SAT. As the authors say, the evidence against this is overwhelming; the SAT, like the ACT, is and has always been an effective predictor of college grades and retention rates, which is precisely what the test is meant to predict. The big testing companies invest a great deal of money and effort in making them predictively valid. (And a great deal of test taker time and effort, too, given that one section out of each given exam is “experimental,” unscored and used for the production of future tests.) When you attack the predictive validity of these tests – their ability to make meaningful predictions about who will succeed and who will fail at college – you are attacking them at their strongest point. It’s like their critics are deliberately making the weakest critique possible.

“These tests are only proxies for socioeconomic status” is a factually incorrect attempt to make a criticism of how our educational system replicates received advantage. It fails because it does not operate at the right level of perspective. Here’s a better version, my version: “these tests are part of an educational system that reflects a narrow definition of student success that is based on the needs of capitalism, rather than a fuller, more humanistic definition of what it means to be a good student.”

These tests do indeed tell us how well students are likely to do in college and in turn provide some evidence of how well they will do in the working world. But college, like our educational system as a whole, has been tuned to attend to the needs of the market rather than to the broader needs of humanity. The former privileges the kind of abstract processing and brute reasoning skills that tests are good at measuring and which makes one a good Facebook or Boeing employee. The latter would include things like ethical responsibility, aesthetic appreciation, elegance of expression, and dedication to equality, among other things, which tests are not well suited to measuring. A more egalitarian society would of course also have need for, and value, the raw processing power that we can test for effectively, but that strength would be correctly seen as just one value among many. To get there, though, we have to make much broader critiques and reforms of contemporary society than the “SAT just measures how well you take the SAT” crowd tend to engage in.

What I am asking for, in other words, is that we focus on telling the whole story rather than distorting what we know about part of the story. There is so much to criticize in our system and how it doles out rewards, so let’s attack weakness, not strength.