Study of the Week: More Bad News for College Remediation

Today’s Study of the Week combines two subjects we’ve talked about recently on the ANOVA, college remediation and regression discontinuity design. The study, by the University of Warwick’s Emma Duchini, throws even more cold water on our efforts to fix gaps in college student readiness with remediation – and leaves us wondering what to do instead.

One of the basic difficulties in improving educational outcomes lies in the chain of disadvantage. Students who start out behind tend to stay behind, and it’s not productive to ask teachers to make up for the gaps that have been opened over the course of a student’s life. As I’ve said on this blog many times, most students tend to sort themselves into fairly stable academic rankings early in life, and though individuals move between those rankings fairly often, at scale and in numbers this hierarchy is remarkably persistent. So third grade reading group serves as a good predictor of high school graduation rates, which in turn obviously predicts college completion rates. Meanwhile, the racial achievement gap appears to exist before students ever show up in formal schooling at all. It’s discouraging.

This study comes from the United Kingdom, but it concerns a question of great interest on this side of the Atlantic: do college remediation classes work? We know that college student populations are profoundly different in incoming ability. The college admissions process makes sure of that. That means that institutions like mine, the City University of New York, face profoundly higher hurdles in getting students to typical levels of ability, as our admissions data tells us that many of our students are unprepared. Typically, this results in remedial classes, to the tune of $4 billion a year for public universities. But as Duchini notes, evidence for the effectiveness of remediation is thin on the ground. Her study takes another look.

Duchini’s study draws its data from the economics department of a public Italian university. This university implemented an entrance exam for potential students, consisting of a math section, a verbal section, and a logic section. The results of this test, combined with high school grades, determine whether students are admitted to the program. However, the math section alone is used to determine whether students need to take a remedial program. Because placement involves a cut score, because that cut score falls fairly close to the mean, and because there are no other systematic differences between students placed in or out of the remediation program, this is an ideal situation for a regression discontinuity design, as I explained in this previous post.

Ultimately Duchini considers the exam scores and educational and demographic data of 2,682 students, sorted into descriptive categories like gender, immigrant or native-born, and vocational or general high school track. Importantly for a regression discontinuity design, there is no evidence of students bunching tightly on either side of the cut score – the kind of pattern that would suggest students were manipulating their placement, which would invalidate the design.
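
If you want the intuition behind that check, here is a minimal sketch in Python with simulated scores (the cutoff, scale, and counts are all invented, not Duchini’s). A real analysis would use something like the formal McCrary density test, but a suspicious pile-up of students on the “no remediation” side of the cut score is the basic thing you’re looking for.

```python
import numpy as np

# Hypothetical entrance-exam math scores; the real cutoff and scale are Duchini's, not shown here.
rng = np.random.default_rng(0)
scores = rng.normal(loc=50, scale=12, size=2682)
cutoff = 48

# Count students in narrow bins just below and just above the cutoff.
bandwidth = 2
just_below = np.sum((scores >= cutoff - bandwidth) & (scores < cutoff))
just_above = np.sum((scores >= cutoff) & (scores < cutoff + bandwidth))

# If students could game their placement, we'd expect a suspicious pile-up on
# the "no remediation" side; roughly equal counts are consistent with no manipulation.
print(f"just below cutoff: {just_below}, just above cutoff: {just_above}")
```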

There’s an interesting dynamic in the data set, perhaps an example of Berkson’s paradox. Students who perform better on the entrance test are actually less likely to enroll in the program, even though doing well on the test is a requirement for attendance. Why? Think about what it means to do well on the test: those students are more academically prepared overall, and thus have more options for what to study, meaning that more of them will choose to enroll in a different program.

In any event, Duchini uses a regression discontinuity design to see if there is any meaningful difference between students on either side of the cut score, and whether the trend line shifts there, looking at outcome variables like the odds of dropping out, passing college-level math, and credits accumulated. The results are not encouraging. In particular, the real nut is how remediation affects the odds of passing college-level math. Note that the sample is restricted here to edge cases, as we don’t want to get a misleading picture from looking at students too far from the cutoff – this is a last in/last out style model, after all – and bear in mind that because this is a remediation test, the treatment is assigned to those on the left-hand side of the cut line.
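
To make the estimation concrete, here is a minimal regression discontinuity sketch with simulated data (statsmodels; the variable names, cutoff, and bandwidth are my inventions, not the paper’s): keep observations near the cut score, center the running variable, and regress the outcome on a treatment indicator, the centered score, and their interaction. The coefficient on the treatment indicator estimates the jump at the cutoff.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data standing in for Duchini's: math entrance score and a later outcome
# (passing college-level math). Remediation is assigned below the cutoff.
rng = np.random.default_rng(1)
n = 2682
score = rng.normal(50, 12, n)
cutoff = 48
remediated = (score < cutoff).astype(int)
# The outcome depends smoothly on the score; a nonzero `true_jump` would show up at the cutoff.
true_jump = 0.0
passed = (0.3 + 0.01 * (score - cutoff) + true_jump * remediated
          + rng.normal(0, 0.15, n)) > 0.5

df = pd.DataFrame({
    "passed": passed.astype(int),
    "centered": score - cutoff,
    "remediated": remediated,
})

# Local linear RD: keep only observations close to the cutoff, allow separate slopes on each side.
window = df[df["centered"].abs() < 10]
model = smf.ols("passed ~ remediated + centered + remediated:centered", data=window).fit()
print(model.params["remediated"])            # estimated discontinuity at the cut score
print(model.conf_int().loc["remediated"])    # its 95% confidence interval
```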

The upward-sloping trend is no surprise; we should expect student performance on an entrance exam to predict the likelihood that they’ll get through a class in the test subject. What we want to see here is a large break in the performance of the groups at the cut score, with a corresponding shift in the trend line, to suggest that the remediation program is meaningfully affecting outcomes – that is, that it’s bringing students below the cutline closer to the performance of those well above it. Neither eyeballing this scatterplot nor the statistical significance checks Duchini describes provides any such evidence. I find it interesting that the data points are more tightly grouped on the left side of the cutline than on the right, but I’m guessing it’s mostly noise. Look in the PDF for more scatterplots with similar trend lines as well as the model and threshold for significance.

Duchini goes into a lot of extra detail, breaking the data set down by demographic groupings and educational factors, though in every case there is little evidence of meaningful gains from the remediation program. Duchini also speaks at length about potential reasons why the program failed to meaningfully prepare students to pass college-level math, including wondering if being assigned remediation might discourage students by making them feel like the work of getting their degree will be even harder than they thought. It’s interesting stuff and worth reading, but for our purposes the conclusion is simple: this remediation program does not appear to meaningfully help students succeed in later college endeavors. It’s only one study from a particular context. But given similar studies that also find little value in remediation, this is more reason to question the value of such programs. More study is needed, but it’s not looking good.

Clearly, if remedial classes don’t work, and they cost students time and money, they should be scrapped. But scrapping them won’t solve the underlying problem: students are arriving at college without the necessary academic skills to ensure that they succeed. College educators will typically lament that they’re trying to solve the deficiencies of high school education, but of course high school teachers can fairly look back as well. Ultimately the dynamic is applicable to the whole system: students are profoundly unequal in their various academic talents from a very early age, and we’re all searching for ways to serve them better. Perhaps the conversation needs to turn to whether we should be pushing so many students into college in the first place, and whether we need to look for answers to economic woes outside of the education system entirely. But for now, we as college educators are left with a sticky problem: our students come to our schools unprepared, but our programs to fill those gaps show little sign of working.

Study of the Week: Hitting the Books, or Hitting the Bong?

Today’s Study of the Week, by Olivier Marie and Ulf Zölitz, considers the impact of access to legal marijuana on college performance. (Via Vox’s podcast The Weeds.) The researchers took advantage of an unusual legal circumstance to examine a natural experiment involving college students and marijuana. For years now, the Netherlands has been working to avoid some of the negative consequences of its famous legal marijuana industry. While most in the country still support decriminalization, many have felt frustrated by the influx of (undoubtedly annoying) tourists who show up to Dutch cities simply looking to get high. This has led to some policies designed to ameliorate the negative impacts of marijuana tourism without going backwards towards criminalization.

In the city of Maastricht, one such policy restricted marijuana sales to people with citizenship identification from the Netherlands, Germany, or Belgium, excluding other nationalities. These specific countries seem to have been chosen as a matter of geography – look at Maastricht on a map and you’ll see it’s part of a small Dutch “peninsula” wedged between Germany and Belgium. Importantly for our purposes here, Maastricht features a large university, and like a lot of European schools it attracts students from all over the continent. That means that when the selective-enforcement policy went into effect in 2011, one group of students still had access to marijuana, while another lost it, at least legally. That provided an opportunity to study how decriminalization impacts academic outcomes.

This research thus does not amount to a true randomized experiment, although I suppose that’s an experiment you could actually run, given the long-established relative safety of marijuana use. (“Dude, I’ll slip you $100 not to end up in the control group! No placebo!”) Instead, like a couple of our Studies of the Week in the past, this research utilizes a difference-in-differences design, comparing outcomes for the two different groups using panel data, with a lot of the standard quality checks and corrections to try and root out construct-irrelevant variance between the groups. Ultimately they looked at 4,323 students from the School of Business and Economics. Importantly for our purposes here, dropout rates were roughly equivalent between the two groups; differential attrition can wreak havoc on this kind of analysis if the groups are not closely matched.
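
For readers new to difference-in-differences, here is a minimal sketch with made-up data (the group labels, column names, and effect sizes are mine, not the authors’): compare the before/after change for the group that lost legal access to the before/after change for the group that kept it. The interaction coefficient is the difference-in-differences estimate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up course-grade data: "restricted" students lost legal access in the post period.
rng = np.random.default_rng(2)
n = 4000
restricted = rng.integers(0, 2, n)          # 1 = nationality that lost legal access
post = rng.integers(0, 2, n)                # 1 = observation after the policy change
grade = (6.5 + 0.1 * restricted - 0.05 * post
         + 0.09 * restricted * post          # the "true" policy effect in this simulation
         + rng.normal(0, 1, n))

df = pd.DataFrame({"grade": grade, "restricted": restricted, "post": post})

# Difference-in-differences: the restricted:post coefficient estimates the policy effect,
# netting out fixed group differences and common time trends.
did = smf.ols("grade ~ restricted + post + restricted:post", data=df).fit()
print(did.params["restricted:post"])
```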

There are a couple of obvious issues here. First, not only are these groups not randomly selected, they are deliberately selected by nationality. This could potentially open up a lot of confounds and makes me nervous. Still, it’s hard to imagine that there is a distinct impact of smoking marijuana on the brains of people from different European nationalities, and the authors are quite confident in the power of their models to wash out nonrandom group variance. Second, you might immediately object that of course even students who are not legally permitted to buy marijuana will frequently smoke anyway, and that many who can legally buy it won’t. How do we know there aren’t crossover effects? Well, this is actually potentially a feature of the research, not a bug. See, that condition would be true in any decriminalization scheme; there will inevitably be people who use under a period of illegality and who don’t when decriminalized. In other words, this research really is looking at the overall aggregate impact of policy, not the impact of marijuana smoking on individual students. Much like the reasoning behind intent-to-treat models, we want to capture noncompliance because noncompliance will be present in real-world scenarios.

So what did they find? Effects of legal access to marijuana are negative, although to my mind quite modest. Their summary:

the temporary restriction of legal cannabis access increased performance by on average .093 standard deviations and raised the probability of passing a course by 5.4 percent

That effect size – not even a tenth of an SD – is interesting, as when I heard this study discussed casually, it sounded as if the effect was fairly powerful. Still, it’s not nothing, and the course-passing probability makes a difference, particularly given that we’re potentially multiplying these effects across thousands of students. The authors make the case for its practical significance like so:

Our reduced form estimates are roughly the same size as the effect as having a professor whose quality is one standard deviation above the mean (Carrell and West, 2010) or of the effect of being taught by a non-tenure track faculty member (Figlio, Shapiro and Soter, 2014). It is about twice as large as having a same gender instructor (Hoffmann and Oreopoulos, 2009) and of similar size as having a roommate with a one standard deviation higher GPA (Sacerdote, 2001). The effect of the cannabis prohibition we find is a bit smaller than the effect of starting school one hour later and therefore being less sleep-deprived (Carell, Maghakian & West, 2011).

This context strikes me as mostly being proof that most interventions into higher ed are low-impact, but still, the discussed effects are real, and given that marijuana use is associated with minor cognitive impairment, it’s an important finding. Interestingly, the negative effects were most concentrated among women, lower-performing students, and students in quantitative courses, suggesting that the average negative impact of legalization would be unequally distributed. One important note: these findings were consistent even when correcting for time spent studying, suggesting that it wasn’t merely that students with access to marijuana were less inclined to study; they actually performed worse on their tasks on a minute-for-minute basis.

What do we want to do with this information? Does this count as evidence supporting continued marijuana criminalization? No, not to me. Part of what makes achieving a sensible drug policy difficult lies in this shifting of the burden of proof: things that are already illegal are often treated as worthy of decriminalization only if they can be proven to be literally harmless. But any number of behaviors that are perfectly legal involve harms. Alcohol and tobacco use are obvious examples, but there are others, including eating junk food – which is not just legal but actively subsidized by our government, thanks to a raft of bad laws and regulations that provide perverse incentives for food production. Part of freedom means the freedom to make bad choices. The question is when those choices are so bad that society feels compelled to prevent individuals from making them. Even if you aren’t as attached to civil liberties as I am, I think you can agree that marijuana use simply doesn’t rise to that level.

As for myself, I actually mostly stopped smoking when I got to grad school. In part that’s because I didn’t enjoy it anymore the way I once did. But it was also because I knew I simply couldn’t read and write effectively after I had smoked, and graduate study required me to be reading and writing upwards of 12 hours a day. That’s by no means universal; some people I know find it helps them concentrate. Likewise, I am useless as a writer after more than one beer, though of course there are many writers who famously wrote best when soused. Still, it seems to me entirely intuitive that habitual marijuana use would have minor-but-real negative impacts on academic outcomes. Marijuana, as safe as it is, and as ridiculous as its continued federal illegality in the United States is, does tend to cause minor cognitive impairments, and it would be foolish to assume there are no negative educational impacts associated with it.

I’d still rather have college kids getting stoned than binge drinking constantly. And ultimately this is a question of pluses and minuses that individual people should be able to weigh for themselves, just as they do when they decide on a cheeseburger or a salad. That’s what freedom is all about, and one part of college is giving young people a chance to make these kinds of adult decisions for themselves.

Study of the Week: Modest But Real Benefits From Lead Exposure Interventions

Today’s Study of the Week, via SlateStarCodex, considers the impact of intervention programs designed to help ameliorate the effects of lead exposure on children. Exposure to lead, even at relatively low doses, has a long-established set of negative consequences, particularly pertaining to cognitive functioning and behavioral control. This dynamic has long been hypothesized as a source of a host of social problems, perhaps even explaining the dramatic rise and fall in crime rates in America in the 20th century, given the rise and fall of leaded gasoline. Those broader questions are persistently controversial and will take years to answer. In the meantime, we have interventions designed to ameliorate the negative impacts of lead exposure, but little in the way of large-scale responsible research to measure their impact. This study is a step in closing that gap.

In the study, written by Stephen B. Billings and Kevin T. Schnepel, a set of observational data is analyzed to see how children eligible for a program of lead-exposure interventions compare to a control group that did not receive the intervention. The data, taken from North Carolina programs in the 1990s, is robust and full-featured, allowing the researchers to consider behavioral outcomes for children, later-in-life criminal behavior, educational outcomes, and some other metrics of overall quality of life.

For obvious reasons, the study is not a true experiment – you can’t expose children to lead as an experimental treatment and note the difference. But they are able to approximate an experimental design, first thanks to the number of statistical controls, and second thanks to a trick of the screening process. Lead testing is notoriously finicky, so children are usually tested twice in early childhood. If children were tested once and found to have lead levels higher than the threshold, they would then be tested again several months later. If they were found to have again exceeded that threshold, they would be assigned to the intervention protocol. This provided researchers with the opportunity to examine children who tested above the threshold the first time but not the second and compare them to those who tested above the threshold both times. Because only those who were above threshold twice were subject to interventions, these formed natural “control” and “test” groups, subject to quality and robustness checks. Because those in the intervention groups had higher lead exposure overall, their outcomes were statistically corrected for comparison to the control group.
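
Here is a rough sketch of that comparison with invented data (the threshold, units, and column names are mine, not the authors’): children above the threshold on the first test form the sample, those above it on the second test get the intervention flag, and the outcome regression adjusts for measured blood lead to approximate the correction the authors describe.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented screening data: blood lead (micrograms/dL) measured twice in early childhood.
rng = np.random.default_rng(3)
n = 5000
true_lead = rng.gamma(shape=4, scale=2.0, size=n)
test1 = true_lead + rng.normal(0, 1.5, n)   # lead tests are noisy, hence the retest
test2 = true_lead + rng.normal(0, 1.5, n)
threshold = 10

sample = pd.DataFrame({"test1": test1, "test2": test2, "lead": (test1 + test2) / 2})
sample = sample[sample["test1"] >= threshold].copy()   # everyone here failed the first screen
sample["intervention"] = (sample["test2"] >= threshold).astype(int)

# Invented outcome: a standardized test score that declines with lead exposure.
sample["score"] = -0.03 * sample["lead"] + rng.normal(0, 1, len(sample))

# Compare intervention vs. non-intervention children, correcting for the fact that
# the twice-above-threshold group has higher exposure on average.
fit = smf.ols("score ~ intervention + lead", data=sample).fit()
print(fit.params["intervention"])
```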

As discussed in the last Study of the Week, this research uses an intent-to-treat model (“once randomized, always analyzed”) because it was not possible to tell what portion of test subjects actually completed the interventions, and because there will certainly be noncompliers in other real-world populations as well, helping to avoid overly strong estimates of intervention effects.

The interventions included education and awareness campaigns, general medical screenings for overall childhood health, nutritional interventions which are believed (but not proven) to be effective at mitigating the effects of lead exposure, educational interventions, and for higher levels of exposure, efforts to physically locate and remove the sources of contamination, usually lead paint. These efforts can be quite expensive, with an estimated average cost of intervention for in-study participants of $5,288. To my mind this is precisely the kind of thing a healthy society should ensure is paid for.

I want to note that this study strikes me as a monumental task. The sheer variety of data they pulled – birth records, housing records, educational data, criminal justice data, and others – must have taken great effort to assemble, and wrangling that amount of data from that many different sources is no mean feat. They even investigate which of their research subjects may have lived in the same house. And the sheer number of controls and quality tests employed here is remarkable. It’s admirable work, which will serve as a good model for replication going forward.

Unsurprisingly, lead exposure has a serious impact on educational outcomes, as the figures in the paper show.

This is consistent with a large body of research, as suggested previously. The behavioral outcomes are even more pronounced, which you can investigate in the paper. Bear in mind that in the raw numbers there are many confounds – poor people and people of color are disproportionately likely to live in lead-tainted environments, and they are also more likely to suffer from educational disadvantage in general, thanks to many social factors. But these trends are true within identified demographic groups as well.

Luckily, the intervention protocol does have an impact. To estimate it, the researchers combine these data (math and reading at 3rd and 8th grade and grade retention from grades 1-9) into an educational index. They find an overall effect of .117 SD improvement relative to the control group on this index, though the result is significant only at the .10 level, which is not typically considered significant in many contexts. This is perhaps explained in part by the sample size of 301 and may improve with larger replications. There is a great deal of variation among the metrics that make up the index, listed in Table 4, so I urge you to investigate their individual effect sizes and p-values.
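
If the index idea is unfamiliar, the usual recipe is to standardize each component against the control group, average the standardized components, and estimate the treatment effect on that average, so the result reads in standard-deviation units. A minimal sketch with invented component names and data, not the authors’ actual construction:

```python
import numpy as np
import pandas as pd

# Invented component measures; the real index combines 3rd/8th grade math and reading
# scores and grade retention, per the paper.
rng = np.random.default_rng(4)
n = 301
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "math_3": rng.normal(0, 1, n),
    "read_3": rng.normal(0, 1, n),
    "retained": rng.integers(0, 2, n),
})

components = ["math_3", "read_3", "retained"]
control = df[df["treated"] == 0]

# Standardize each component against the control group's mean and SD, then average.
z = (df[components] - control[components].mean()) / control[components].std()
z["retained"] *= -1                      # flip sign so "higher = better" for every component
df["index"] = z.mean(axis=1)

# Treatment effect on the index, already in control-group SD units.
effect = df.loc[df["treated"] == 1, "index"].mean() - df.loc[df["treated"] == 0, "index"].mean()
print(effect)
```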

This overall effect size of .117 on the educational index is somewhat discouraging, even though it suggests the intervention does have a positive impact. The biggest positive educational interventions achievable by policy, such as high-intensity tutoring, tend to have a .4-.5 SD effect on quantitative outcomes; the black-white achievement gap in many metrics is around 1 SD. So we’re talking about modest gains that don’t close the educational disadvantages associated with lead exposure. This is perhaps intuitive given that these efforts are largely aimed at preventing further exposure rather than counteracting the impact of past exposure. Still, in a world where we’re grasping for whatever positive impacts we can get, and given our clear moral responsibility to help children grow in lead-free environments regardless of the educational impacts, it’s an encouraging sign.

What’s more, the behavioral indices were more encouraging. The researchers assembled an antisocial behavior index from metrics related to school discipline and criminality. Here the effect size was .184, significant at the .05 level. These non-cognitive outcomes make a big impact on the quality of life of students, parents, teachers, and peers. Still fairly modest in impact, but more than worth the costs.

Seems pretty clear to me that we need robust efforts to clean up lead in our environment and to mitigate the damage done to people already exposed. This is an important study and I’m eager to see replication attempts.

Study of the Week: To Remediate or Not to Remediate?

Today’s Study of the Week comes from researchers at my own university, the City University of New York, and concerns an issue of profound professional interest to me: the success of students who are required to attend remedial math classes in our community colleges. CUNY is a system of vast differences between and within its institutions, playing host to programs at senior colleges with well-prepared students that could succeed anywhere and also to many severely under-prepared students who struggle and drop out at unacceptable rates. In this diversity of outcomes, you have a microcosm of American higher education writ large, which like seemingly all things American is plagued by profound inequality.

Here at Brooklyn College, fully two-thirds of undergraduates are transfer students, the vast majority of them having come from the CUNY community college system. (Those who get a sufficient number of credits from the community colleges must be admitted to the institution under CUNY policy, even when they would ordinarily not have met the necessary academic standards.) Typical academic outcomes data for these students, as distinct from the third who start and finish their careers at Brooklyn College, are vastly different, and in a discouraging direction. Since my job entails demonstrating to people in positions of power that students are learning adequately here, and explaining why unfortunate numbers look the way they do, this difference is important. But CUNY policy and the overall rhetoric of American higher education militate against this nuance. Indeed, the recent adoption of the Pathways system of credits is based on a simple premise: that a CUNY student is a CUNY student, a CUNY class a CUNY class, and a CUNY credit a CUNY credit. This assumption of equivalence across the very large system makes life easier for students and administrators. It is also, I would argue, empirically wrong. But this is a question far above my pay grade.

In any event, the fact is that CUNY colleges host many students who lack the level of prerequisite ability we would hope for. Today’s study asks an essential question: is the best way to serve CUNY community college students who lack basic math skills to send them to non-credit bearing remedial algebra classes? Or is it to substitute a credit-bearing college course in statistics? The question has relevance far beyond CUNY.

Algebra is a Problem

When we’re talking about incoming students who fail to meet standards, we’re also talking about how they fared at the high school level. And the failure to meet entrance requirements for college corresponds with a failure to meet graduation requirements for high school. One of the biggest, most intractable problems in getting students to meet standards is algebra. A raft of evidence tells us that algebra requirements stand as one of the biggest impediments to students graduating from high school in our system. Here in New York City, the pass rate for the relevant sections of the Regents Exam has fluctuated with changes to standards, with 65% passing the Algebra I exam in 2014, 52% passing in 2015, and 62% in 2016. Even with changing standards, in other words, more than a third of all NYC students are failing to meet Algebra I requirements – and that’s despite longstanding complaints that the standards are too low. Low standards in math might help explain why 57% of undergraduates in the CUNY system writ large were found to be unable to pass their math requirements in a 2012 study.

Indeed, rising graduation rates nation-wide have come along with concerns that this improvement is the product of lower standards. You can see this dynamic play out with allegations against “online credit recovery” or in the example of a San Diego charter school where graduation rates and grades are totally contrary to test performance. Someone I know who works in education policy in the think tank world told me recently that he suspects that less than half of American high school graduates actually have the skills and knowledge required of them by math standards, as distinct from just formally passing.

The political scientist Andrew Hacker, himself of CUNY, has made the case against algebra requirements at book length in his recent The Math Myth. As Hacker says, the screening mechanism of getting through algebra, pre-calculus, and similar required courses prevents many students who are otherwise academically sufficient for higher education from attending college. He marries this argument to a critique of the funnel-every-student-into-STEM-career school of ed philosophy that has become so dominant and which I myself have argued, at length, is empirically unjustifiable, economically illiterate, and educationally impossible. Rather than trying to get every kid to be an aspiring quant, Hacker recommends replacing algebra and calculus requirements with more forgiving, practically-aligned and conceptual courses in quantitative literacy.

The question is, can we do what Hacker has asked without remaking the college system wholesale, a very large boat that’s notoriously slow to turn? That’s the question this Study of the Week is intended to answer.

The Study

Today’s study was conducted by A. W. Logue, Mari Watanabe-Rose, and Daniel Douglas, all of CUNY. They were able to take advantage of an unusual degree of administrative access to conduct a true randomized experiment, assigning students to conditions randomly in a way very rarely possible in practical educational settings. The researchers conducted their study at three (unnamed) CUNY community colleges. Their research subjects were students who would ordinarily be required to take a remedial non-credit-bearing algebra course. These students were randomly assigned to one of three groups: the traditional elementary algebra class (which we can think of as a control), an elementary algebra class where students were required to participate in a support workshop of a type often recommended as a remediation effort, and an undergraduate-level, credit-bearing introductory statistics course with its own workshop.

In order to control for instructor effects, all instructors in the research taught one section each of the various classes, helping to minimize systematic differences between the experimental groups. Additionally, there was an important quality check in place regarding non-participants. In an ideal world, true randomization would mean that everyone selected for a treatment or group would participate, but of course you can’t force participation in experiments. That means that there might be some bias if students assigned to one treatment were more likely to decline to participate. Because of the nature of this study, the researchers were able to track the performance of non-participants, who took the standard elementary algebra class. Those students performed similarly to the in-study control group, an important source of confidence in the research.

The researchers used several different techniques to examine their relationships of interest, specifically the odds of passing the course and the number of credits earned in the following year. One technique was an intent-to-treat (ITT) analysis, which is a kind of model used to address the fact that participants in randomized controlled trials will often drop out or otherwise not comply with the experimental program. It generates conservative effect size estimates by simply assuming that everyone who was randomized into a group stayed there for statistical purposes, even if we know we had some attrition and non-compliance along the way. (“Once randomized, always analyzed.”) Why would we do that? Because we know that in a real-world scenario “subjects” won’t stick with their assigned “treatments” either, and we want to avoid the overly optimistic effect sizes that might come with only looking at compliers.
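
Here is a minimal sketch of the intent-to-treat logic with fake data (the pass rates and compliance rate are invented): students assigned to the stats course stay in the stats column even if they never attended, and for contrast I also compute a per-protocol estimate that drops non-compliers and tends to flatter the treatment.

```python
import numpy as np
import pandas as pd

# Fake trial data: assignment is random, but some assigned students don't comply.
rng = np.random.default_rng(5)
n = 900
assigned_stats = rng.integers(0, 2, n)            # 1 = randomized into the stats course
complied = rng.random(n) < 0.8                     # some assigned students never attend
# Passing depends on what students actually took, in this simulation.
passed = rng.random(n) < np.where(assigned_stats & complied, 0.56, 0.40)

df = pd.DataFrame({"assigned_stats": assigned_stats, "complied": complied, "passed": passed})

# Intent-to-treat: compare by *assignment*, ignoring compliance ("once randomized, always analyzed").
itt = (df.loc[df["assigned_stats"] == 1, "passed"].mean()
       - df.loc[df["assigned_stats"] == 0, "passed"].mean())

# Per-protocol (for contrast): drops non-compliers and tends to overstate the effect.
pp = (df.loc[(df["assigned_stats"] == 1) & df["complied"], "passed"].mean()
      - df.loc[df["assigned_stats"] == 0, "passed"].mean())

print(f"ITT estimate: {itt:.3f}, per-protocol estimate: {pp:.3f}")
```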

(As always, if you want the real skinny on these model adjustments I urge you to read people who really know this stuff. Here’s a useful, simply stated brief on intent to treat.)

The results seem like a pretty big deal to me: after analysis, including throwing in some covariates, they find that there is no significant difference in passing the course between students enrolled in the traditional elementary algebra class and that class plus a workshop, but there is a significant and fairly large (16% without covariates in the model, 14% with) difference in the odds of passing the course for those randomized to the intro stats course compared to the elementary algebra course. That is, after randomization students were 16% more likely to pass a credit-bearing college-level course than a non-credit-bearing elementary algebra course. Additionally, the stats group had a significantly higher number of total credits accumulated during the experimental semester and the following year, even after subtracting the credits earned for that stats course.

(Please do take a look at the confidence interval numbers listed in brackets below, which tell you a range of effects that we can say with 95% confidence contains the true average effect. Getting in the habit of looking at confidence intervals is an important step if you’re just starting to read research reports.)
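
If confidence intervals are new to you, here is a minimal sketch of how a 95% interval for a difference in passing rates is computed; the counts are invented, chosen only to be in the ballpark of a 16-point gap, not taken from the study.

```python
import numpy as np
from scipy.stats import norm

# Invented counts: passes and group sizes for a stats group and an algebra control group.
pass_stats, n_stats = 280, 500
pass_alg, n_alg = 200, 500

p1, p2 = pass_stats / n_stats, pass_alg / n_alg
diff = p1 - p2

# Standard error of a difference in proportions, then a 95% interval around the estimate.
se = np.sqrt(p1 * (1 - p1) / n_stats + p2 * (1 - p2) / n_alg)
z = norm.ppf(0.975)
ci = (diff - z * se, diff + z * se)

# If zero falls outside this interval, the difference is significant at the 5% level.
print(f"difference: {diff:.3f}, 95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
```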

Additionally, as the authors write, “as of 1 year after the end of the experiment, 57.32% of the Stat-WS students had passed a college-level quantitative course…, while 37.80% still had remedial need. In contrast, only 15.98% of the EA students had passed a college-level quantitative course and 50.00% still had remedial need.”

Another thing that jumped out at me: in an aside, the authors note that there was no significant difference between the groups in their likelihood of taking credits a year after the experimental semester, with all groups around 66% still enrolled. Think about that – just a year out, fully a third of all students in the study were not enrolled, reflecting the tendency to stop out or drop out that is endemic to community colleges.

Of course, none of this is inconsistent with incoming ability having a good deal of explanatory power, and the relationship between performance on the Compass algebra placement test and the odds of passing is about what you’d expect. Prerequisite ability matters.

In short, students who were randomly selected into an elementary algebra class with a supportive workshop attached were no more likely to pass that class than those sorted into a regular algebra class, but those sorted into an introductory statistics class were 16% more likely to have passed that course. Additionally, the latter group earned significantly more college credits in the following year than the other groups, and were much more likely to have completed a quantitative skills class.

OK. So what do we think about all of this?

First, I would be very skeptical about extrapolating these results into other knowledge domains such as literacy, writing, or similar. I don’t think all remediation efforts are the same across content domains and it’s likely that research will need to be done in other fields. Second, the fact that a supporting workshop did little to improve outcomes compared to students without such a workshop is discouraging but hardly surprising. Such interventions have been attempted for a long time and at scale, but their results have been frustratingly limited.

All in all, the evidence in this study supports Hacker’s point of view, and I suppose my own: students make faster progress toward graduation if we just let them take college stats instead of forcing them to take remedial algebra first. But there’s a dimension that the researchers leave largely unexplored, which is the question of whether this all just represents the benefits of lowering standards.

Are We Just Avoiding Rigor?

The authors examine many potential explanations for why the stats-taking students outperformed the other groups, including potential non-random differences in groups, motivation, and similar, but seem oddly uninterested in what strikes me as the most obvious read of the data: that it’s just easier to pass Intro to Stats than it is to pass even a remedial algebra course. They do obliquely get at this point in the discussion, writing

degree progression is not the only consideration in setting remediation policy. The participants in Group Stat-WS were only taught elementary algebra material to the extent that such material was needed to understand the statistics material. Whether students should be graduating from college having learned statistics but without having learned all of elementary algebra is one of the many decisions that a college must make regarding which particular areas of knowledge should be required for a college degree. Views can differ as to which quantitative subjects a college graduate should know.

They sure can! This seems to me to be the root of the policy issue: should we substitute stats courses for algebra courses if we think doing so will make it less likely for students to drop out or otherwise be disrupted on the path to graduation?

This is not really a criticism of this research, though I’d have liked a little more straightforward discussion of this from the authors. But I will hold with Hacker in suggesting that this does represent a lowering of standards, and that this is a feature, not a bug. That is, I think we should allow some students to avoid harder math requirements precisely because the current standards are too high. Students in deeply quantitative fields will have higher in-major math requirements anyway. Of course, in order to take advantage of this, we’d have to acknowledge that the “every student a future engineer” school of educational policy is ill-conceived and likely to result only in a lot of otherwise talented students butting their heads up against the wall of math standards. But unlike most ed policy people, I am willing to say straightforwardly that there are real and obvious differences in the specific academic talents of different individual students, and that these differences cannot be closed through normal pedagogical means. That’s what the best evidence tells us, including this very study.

Hacker says that many ostensibly quantitative professions, like computer programmer or doctor, require far less abstract math skill than is presumed. I don’t doubt he’s correct. The question is whether we as a society – and, more important, whether employers – are willing to accept a world where some significant percentage of people in such jobs never had to pass an Algebra II or Calculus class. Or, failing that, can we redefine our sense of what is valuable work so that the many people who seem incapable of reaching standards in math can go on to have productive, financially-secure lives?

What We’re Attempting with College is Very Hard

Colleges and universities have found themselves under a great deal of pressure, internal and external, in recent years. This is to be expected; they are charging their students exorbitant tuition and fees, after all, and despite an army of concern trolls doubting their value, the degrees they hand out in return are arguably more essential than ever for securing the good life. Though enrollment growth has slowed in recent years, the long-term trend is clearly upward.

Policymakers and politicians must understand: these new enrollments are coming overwhelmingly from the ranks of those who would once have been considered unprepared for higher education, and this has increased the difficulty of what we’re attempting dramatically.

What we’re attempting is to admit millions more people into the higher education system than before, almost all of whom come from educational and demographic backgrounds that would once have screened them out from attendance. Because those backgrounds are so deeply intertwined with traditional inequalities and social injustice, we have rightly felt a moral need to expand opportunity to people from those backgrounds. Because the 21st century economy grants such economic rewards to those who earn a bachelor’s degree, we have developed a policy regime designed to push ever more students toward one. I cannot help but see the moral logic behind these sentiments. And yet.

Let’s set aside my perpetual questions about the difference between relative and absolute academic performance and how they are rewarded. (Can the economic advantage of a college degree survive the erosion of the rarity of holding one? How could it possibly under any basic theory of market goods?) We’re still left with this dilemma: can we possibly maintain some coherent standards for what a college degree means while dramatically expanding the people who get them?

One way or another, everyone with an interest in college must understand that the transformation that we’re attempting as a community of schools, educators, and policymakers is unprecedented. Today, the messages we receive in higher education seem impossible: we must educate more cheaply, we must educate more quickly, we must educate far more underprepared students, and we must do so without sacrificing standards. This seems quixotic to me. Adjusting curricula in the way proposed in this research, and accepting that higher completion rates probably require lower standards, is one way forward. Or we can refuse to adapt to the clear implication of a mountain of discouraging data for students at the lower end of the performance distribution, and get back failure in return.

(Actual) Study of the Week: Academic Outcomes for Preemies

Now back to our regularly scheduled programming….

There’s a lurking danger in the “nature vs nurture” debate that has been so prominent in educational research for so long: people tend to assume that genetic influence means that something is immutable, while environmental influences are assumed to be changeable. The former is not correct, at least in the sense that there are a lot of genetically influenced traits that can be altered or ameliorated – all manner of physical skills, for example, are subject to the impact of exercise, even while we acknowledge that at the top of the distribution tiers, natural/genetic talents play a big role. Likewise, we can believe in educational efforts that somewhat ameliorate genetic influences even while we recognize that biological parentage powerfully shapes intellectual outcomes.

The obverse is even more often forgotten: just because an influence is environmental in nature, that does not mean we can necessarily change its effects. Lead exposure, for example, leads to relatively small but persistent damage to cognitive function. This is certainly environmental influence, but not one that we have tools to ameliorate. I’m not quite sure if we would call neonatal development “environmental,” but influences on children in the womb are a good example of non-genetic influences that are potentially immutable. And they are also another lens through which I want us to consider our tangled, frequently-contradictory intuitions about academic performance and just deserts.

Today’s Study, written by the exceptionally-Dutch-named Cornelieke Sandrine Hanan Aarnoudse-Moens, Nynke Weisglas-Kuperus, Johannes Bernard van Goudoever, and Jaap Oosterlaan, is a meta-analysis of extant research on the academic outcomes of children who were born very prematurely and/or at very low birth weight. (For an overview of meta-analysis and effect size, please see this post.)

The studies had a number of restrictions in addition to typical quality checks. First, the studies considered had to look at very premature births, defined as less than 33 weeks gestation, and/or very low birth weight, defined as less than 1500 grams. Additionally, for inclusion in the meta-analysis, the studies had to track student performance to at least age 5, as this is where formal schooling begins and where responsible analysis of academic outcomes can be considered. These studies reported on academic outcomes, behavioral outcomes as represented by teacher and parent observation checklists/surveys, and so-called executive functioning variables, which include things like impulse control and the ability to plan (and which have been pretty trendy). All in all, data from 14 studies on academic outcomes, 9 on behavioral outcomes, and 6 on executive functioning were considered. (There was some overlap.) In total, 4125 very preterm and/or very low birth weight children were compared to 3197 children born at term. The authors performed standard meta-analytic procedures, pooling SDs and weighting by sample size, and reported effect sizes in good old Cohen’s d.
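
For a sense of the mechanics, here is a minimal sketch of computing Cohen’s d for a single study and then pooling several studies with inverse-variance (fixed-effect) weights. The per-study numbers are invented; the paper’s actual procedure is more involved.

```python
import numpy as np

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = np.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

def d_variance(d, n_t, n_c):
    """Approximate sampling variance of Cohen's d."""
    return (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))

# Invented per-study summaries: (preterm mean, term mean, preterm SD, term SD, n preterm, n term).
studies = [
    (92.0, 100.0, 15.0, 15.0, 240, 200),
    (95.0, 101.0, 14.0, 13.0, 310, 260),
    (90.0, 99.0, 16.0, 15.0, 180, 150),
]

ds = np.array([cohens_d(*s) for s in studies])
vs = np.array([d_variance(d, s[4], s[5]) for d, s in zip(ds, studies)])

# Fixed-effect pooling: weight each study by the inverse of its variance.
weights = 1 / vs
pooled_d = np.sum(weights * ds) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
print(f"pooled d: {pooled_d:.2f} (SE {pooled_se:.2f})")
```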

They also used a couple of statistical tests to attempt to adjust for publication bias. Publication bias is a troubling aspect of research publishing that can undermine meta-analysis, particularly problematic given that meta-analysis is often viewed as a way to ameliorate (never eliminate) other problems like p-value hacking or similar. Publication bias refers to the fact that journals are much more likely to publish studies with significant effects than those without them. This has several bad outcomes – for one, it provides perverse incentives for academics trying to get jobs and tenure. But it also distorts our view of reality. We adjust for the various issues with individual studies, in part, by looking at a broad swath of research literature. But if the non-significant results are sitting in a drawer while significant results are in Google Scholar, that’s not going to help, even with meta-analysis.
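
I don’t know exactly which tests the authors ran, but a standard check for this kind of bias is Egger’s regression test: regress each study’s standardized effect on its precision, and an intercept far from zero suggests funnel-plot asymmetry. A minimal sketch with invented effects and standard errors:

```python
import numpy as np
import statsmodels.api as sm

# Invented study-level effects and standard errors; small studies have bigger SEs.
effects = np.array([-0.55, -0.62, -0.48, -0.70, -0.58, -0.41])
ses = np.array([0.08, 0.12, 0.10, 0.20, 0.15, 0.09])

# Egger's test: regress the standardized effect (effect / SE) on precision (1 / SE).
# An intercept far from zero suggests funnel-plot asymmetry, i.e. possible publication bias.
y = effects / ses
x = sm.add_constant(1 / ses)
egger = sm.OLS(y, x).fit()
print(egger.params[0], egger.pvalues[0])   # intercept and its p-value
```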

The results are not particularly surprising, but are sad all the same: children born very prematurely and/or at very low birth weight have persistently worse academic outcomes compared to children born at term. In terms of academic outcomes, we’re talking about -0.48 SD for reading, -0.60 SD for mathematics, and -0.76 SD for spelling. These are, in the context of educational research, large effects. There was some variation between studies, as is to be expected in any meta-analysis, but this variation was not large enough to undermine our confidence in these results. Checks for publication bias came up largely clean as well. There are also findings indicating that children born prematurely have problems with attention, verbal fluency, and working memory. These effect sizes had no meaningful relationship to the age of assessment, suggesting that these problems are persistent. With a few exceptions, these relationships are continuous – that is, children with lower gestational ages and birth weights are generally worse off in terms of outcomes even when compared to other children born prematurely and/or at low birth weight.

First, this is very important to say: the studies included in this meta-analysis represent averages. We live in a world of variability. There are certainly many children who are born severely prematurely and go on to academic excellence. It would be wrong to assume that these influences indicate a certain academic destiny, as it would be for any variable we examine in educational research. The trends, however, are clear. Sadly, other research suggests that these problems are likely to extend into at least young adulthood.

What are some of the consequences here? Well, to begin with, I think it’s another important facet of how we think about educational outcomes and how much of those outcomes lies outside the hands of students, parents, and teachers. No one has chosen this outcome. For another thing, there’s the breaking of the nature/nurture binary I pointed out above. This is a non-genetic but uncontrolled introduction of major influence into the educational outcomes of children. I don’t mean to be fatalistic about things; there’s always a chance that we’ll find some interventions that help to close these gaps. But I think this is another reason for us to get outside of a moralistic framework for education, where every below-average outcome has to be the fault of someone – the parents, the teachers, or the student themselves.

And again, I think this points in the direction of a societal need to expand our definition of what it means to be a good student, and through that, what it means to be a valuable human being. True, very early births are comparatively rare, though almost 10% of all American births are preterm. (Like seemingly everything else in the United States, preterm birth rates are influenced by race, class, and geography.) But this dynamic is just another data point in a large set of evidence that suggests that academic outcomes are largely outside of the hands of individuals, parents, and teachers, particularly if we recognize that genetic influence is not controlled by those groups. What’s interesting with premature babies is that I doubt anyone would think that they somehow deserve worse life outcomes as a result of their academic struggles. Who could be so callous? And yet when it comes to genetic gifts – which are just as uncontrolled by individuals as being born prematurely – there are many who think it’s fine to disproportionately hand out reward. I don’t get that.

Ultimately, rather than continuing to engage in a quixotic policy agenda designed to give every child the exact same odds of being a Stanford-trained computer scientist, we should recognize as a society that we will always have a range of academic outcomes, that this means we will always have people who struggle as well as excel, and that to a large extent these outcomes are not controlled by individuals. Therefore we should build a robust social safety net to protect people who are not fortunate enough to be academically gifted, and we should critique the Cult of Smart, recognizing that there are all manner of ways to be valuable human beings.

Study of the Week: We’ll Only Scale Up the Good Ones

When it comes to education research and public policy, scale is the name of the game.

Does pre-K work? Left-leaning people (that is, people who generally share my politics) tend to be strong advocates of these programs. It’s true that generically, it’s easier to get meaningful educational benefits from interventions in early childhood than later in life. And pre-K proponents tend to cite some solid studies that show some gains relative to peer groups, though these gains are generally modest and tend to fade out over time. Unfortunately, while some of these studies have responsible designs, many that are still cited are old, from small programs, or both.

Today’s Study of the Week, by Mark W. Lipsey, Dale C. Farran, and Kerry G. Hofer, is a much-discussed, controversial study from Tennessee’s Voluntary Prekindergarten Program. The Vanderbilt University researchers investigated the academic and social impacts of the state’s pre-K programs on student outcomes. The study we’re looking at is a randomized experimental design, which was pulled from a larger observational study. The Tennessee program, in some locales, had more applicants than available seats. Those seats were filled by a random lottery, creating natural control and experimental groups.

There is one important caveat here: the students examined in the intensive portion of the research had to be selected from those whose parents gave consent. That’s about a third of the potential students. This is a potential source of bias. While the randomized design will help, what we can responsibly say is that we have random selection within the group of students whose parents opted in, but with a nonrandom distribution relative to the overall group of students attending this program. I don’t think that’s a particularly serious problem, but it’s a source of potential selection bias and something to be aware of. There’s also my persistent question about the degree to which school selection lotteries can be gamed by parents and administrators. There are lots of examples of this happening. (Here’s one at a much-lauded magnet school in Connecticut.) Most people in the research field seem not to see this as a big concern. I don’t know.

In any event, the results of the research were not encouraging. Researchers examined six identified subtests (two language, two literacy, two math) from the Woodcock-Johnson tests of cognitive ability, a well-validated and widely-used battery of tests of student academic and intellectual skills. They also looked at a set of non-cognitive abilities related to behavior, socialization, and enthusiasm for school. A predictable pattern played out. Students who attended the Tennessee pre-K program saw short-term significant gains relative to their peers who did not attend the program. But over time, the peer group caught up, and in fact in this study, exceeded the test group. That is, students who attended Tennessee’s pre-K program ended up actually underperforming those who were not selected into it.

By the end of kindergarten, the control children had caught up to the TN‐VPK children and there were no longer significant differences between them on any achievement measures. The same result was obtained at the end of first grade using both composite achievement measures. In second grade, however, the groups began to diverge with the TN‐VPK children scoring lower than the control children on most of the measures…. In terms of behavioral effects, in the spring the first grade teachers reversed the fall kindergarten teacher ratings. First grade teachers rated the TN‐VPK children as less well prepared for school, having poorer work skills in the classrooms, and feeling more negative about school.

This dispiriting outcome mimics that of the Head Start study, another much-discussed, controversial study that found similar outcomes: initial advantages for Head Start students that are lost entirely by 3rd grade.

Further study is needed, but it seems that the larger and more representative the study, the less impressive – and the less persistent – the gains from pre-K. There’s a bit of uncertainty here about whether the differences in outcomes are really the product of differences in programs or due to differences in the research itself. And I don’t pretend that this is a settled question. But it is important to recognize that the positive evidence for pre-K comes from smaller, higher-resource, more-intensive programs. Larger programs have far less encouraging outcomes.

The best guess, it seems to me, is that at scale universal pre-K programs would function more like the Tennessee system and less like the small, higher-performing programs. That’s because scaling up any major institutional venture, in a country the size of the United States, is going to entail the inevitable moderating effects of many repetitions. That is, you can build one school or one program and invest a lot of time, effort, and resources into making it as effective as possible, and potentially see significant gains relative to other schools. But it strikes me as a simple statement of the nature of reality that this intensity of effort and attention can’t scale. As Farran and Lipsey say in a Brookings Institution essay, “To assert that these same outcomes can be achieved at scale by pre-K programs that cost less and don’t look the same is unsupported by any available evidence.”

Some will immediately say, well, let’s just pay as much for large-scale pre-K as they do in the other programs and model their techniques. The $26 billion question is, can you actually do that? Can what makes these programs special actually be scaled? Is there hidden bias here that will wash out as we expand the programs? I confess I’m skeptical that we’ll see these quantitative gains under even the best scenario. I think we need to understand the inevitability of mediocrity and regression to the mean. That doesn’t mean I don’t support universal pre-kindergarten childcare. As with after-school programs, I support them for social and political reasons, not out of much conviction that they’ll change test scores. I’d be happy to be proven wrong.

Now I don’t mean to extrapolate irresponsibly. But allow me to extrapolate irresponsibly: isn’t this precisely what we should expect with charter schools, too? We tend to see, survivorship-bias-heavy CREDO studies aside, that at scale the median charter school does little or nothing to improve on traditional public schools. We also see a number of idiosyncratic, high-intensity, high-attention charters that report better outcomes. The question you have to ask, based on how the world works, is which is more likely to be replicated at scale – the median, or the exceptions?

I’ve made this point before about Donald Trump’s favorite charter schools, Success Academy here in New York. Let’s set aside questions of the abusive nature of the teaching that goes on in these schools. The basic charter proponent argument is that these schools succeed because they can fire bad teachers and replace them with good ones. Success Academy schools are notoriously high-stress, long-hours, low-pay affairs. This leads naturally to high teacher attrition. Luckily for the NYC-based Success Academy, New York is filled with lots of eager young people who want to get a foothold in the city, do some do-goodering, then bail for their “real” careers later on – essentially replicating the Teach for America model. So: even if we take all of the results from such programs at face value, do you think this is a situation that can be scaled up in places that are far less attractive to well-educated, striving young workers? Can you get that kind of churn and get the more talented candidates you say you need, at no higher cost, to come to the Ozarks or Flint, Michigan or the Native American reservations? Can you nationally have a profession of 3 million people, already caught in a teacher shortage, and then replicate conditions that lead to somewhere between 35% and 50% annual turnover, depending on whose numbers you trust?

And am I really being too skeptical if my assumption is to say no, you can’t?

Study of the Week: Of Course Virtual K-12 Schools Don’t Work

This one seems kind of like shooting fish in a barrel, but given that “technology will solve our educational problems” is holy writ among the Davos crowd no matter what the evidence, I suppose this is worth doing.

Few people would ever come out and say this, but central to assumptions about educational technology is the idea that human teachers are an inefficiency to be removed from the system by whatever means possible. Right now, not even the most credulous Davos type, nor the most shameless ed tech profiteer, is making the case for fully automated AI-based instruction. But attempts to dramatically increase the number of students that you can force through the capitalist pipeline at low cost (excuse me, that you can help nurture and grow) are well under way, typically by using digital systems to let one teacher teach more students than you’d see in a brick-and-mortar classroom. This also cuts down on the costs of facilities, which give kids a safe and engaging place to go every day but which are expensive. So you build a virtual platform, policy types use words like “innovation” and “disrupt,” and for-profit entities start sucking up public money with vague promises of deliverance-through-digital-technology. Kids and parents get “choice,” which the ed reform movement has successfully branded as a good thing even though at scale school choice has not been demonstrated to have any meaningful relationship to improved outcomes at all.

Today’s Study of the Week, from a couple years ago, takes a look at whether these virtual K-12 schools actually, you know, work. It’s a part of the CREDO project. I have a number of issues, methodological and political, with the CREDO program generally, but I still think this is high-quality data. It’s a large data set that compares the outcomes of students in traditional public schools, brick-and-mortar charters, and virtual charters. The study uses a matched data method – in simple terms, comparing students from the different “conditions” who match on a variety of demographic and educational metrics in order to control for construct-irrelevant variance; a rough sketch of this kind of matching follows the variable list below. This can help to ameliorate some of the problems with observational studies, but bear in mind that once again, this is not the same as a true randomized controlled trial. They had to do things this way because online charter seats are not assigned via lottery. (For the record, I do not trust the randomization effects of such lotteries because of the many ways in which they are gamed, but here that’s not even an issue because there’s no lottery at all.)

The matched variables, if you’re curious:

• Grade level
• Gender
• Race/Ethnicity
• Free or Reduced-Price Lunch Eligibility
• English Language Learner Status
• Special Education Status
• Prior test score on state achievement test
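
Here is the rough sketch of that matching logic promised above, with invented student records (CREDO’s actual procedure is more elaborate than this): match each virtual-charter student to a traditional-public student who is identical on the categorical variables and closest on prior test score, then average the outcome gaps across matched pairs.

```python
import numpy as np
import pandas as pd

# Invented student records; CREDO's actual matching procedure is more elaborate than this.
rng = np.random.default_rng(6)

def fake_students(n, sector):
    return pd.DataFrame({
        "sector": sector,
        "grade": rng.integers(3, 9, n),
        "frl": rng.integers(0, 2, n),                 # free/reduced-price lunch eligibility
        "prior_score": rng.normal(0, 1, n),
        "outcome": rng.normal(0, 1, n),
    })

virtual = fake_students(500, "virtual")
tps = fake_students(5000, "tps")

# For each virtual-charter student, find a traditional-public student with the same
# categorical profile and the closest prior test score.
matched_effects = []
for _, v in virtual.iterrows():
    pool = tps[(tps["grade"] == v["grade"]) & (tps["frl"] == v["frl"])]
    if pool.empty:
        continue                                      # no exact categorical match available
    twin = pool.iloc[(pool["prior_score"] - v["prior_score"]).abs().argmin()]
    matched_effects.append(v["outcome"] - twin["outcome"])

# Average outcome gap across matched pairs: the virtual-charter "effect" in this toy setup.
print(np.mean(matched_effects))
```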

So how well do online charters work? They don’t. They don’t work. Look at this.

Please note that, though these negative effect sizes may not seem that big to you, in a context where most attempted interventions are not statistically different than zero, they’re remarkable. I invite you to look at the “days of learning lost” scale on the right of the graphic. There’s only 180 days in the typical K-12 school year! This is educational malpractice. How could such a thing have been attempted with over 160,000 students without any solid evidence it could work? Because the constant, the-sky-is-falling crisis narrative in education has created a context where people believe they are entitled to try anything, so long as their intentions are good. Crisis narratives undermine checks and balances and the natural skepticism that we should ordinarily apply to the interests of young children and to public expenditure. So you get millions of dollars spent on online charter schools that leave students a full school year behind their peers.

Are policy types still going full speed ahead, working to send more and more students – and more and more public dollars – into these failed, broken online schools? Of course. Educational technology and the ed reform movement writ large cannot fail, they can only be failed, and nothing as trivial as reality is going to stand in the way.