Study of the Week: We’ll Only Scale Up the Good Ones

When it comes to education research and public policy, scale is the name of the game.

Does pre-K work? Left-leaning people (that is, people who generally share my politics) tend to be strong advocates of these programs. It’s true that generically, it’s easier to get meaningful educational benefits from interventions in early childhood than later in life. And pre-K proponents tend to cite some solid studies that show some gains relative to peer groups, though these gains are generally modest and tend to fade out over time. Unfortunately, while some of these studies have responsible designs, many that are still cited are old, from small programs, or both.

Today’s Study of the Week, by Mark W. Lipsey, Dale C. Farran, and Kerry G. Hofer, is a much-discussed, controversial study from Tennessee’s Voluntary Prekindergarten Program. The Vanderbilt University researchers investigated the academic and social impacts of the state’s pre-K programs on student outcomes. The study we’re looking at uses a randomized experimental design, drawn from a larger observational study. The Tennessee program, in some locales, had more applicants than available seats, and those seats were filled by random lottery, creating natural control and experimental groups.

There is one important caveat here: the students examined in the intensive portion of the research had to be selected from those whose parents gave consent. That’s about a third of the potential students. This is a potential source of bias. While the randomized design will help, what we can responsibly say is that we have random selection within the group of students whose parents opted in, but with a nonrandom distribution relative to the overall group of students attending this program. I don’t think that’s a particularly serious problem, but it’s a source of potential selection bias and something to be aware of. There’s also my persistent question about the degree to which school selection lotteries can be gamed by parents and administrators. There are lots of examples of this happening. (Here’s one at a much-lauded magnet school in Connecticut.) Most people in the research field seem not to see this as a big concern. I don’t know.

In any event, the results of the research were not encouraging. Researchers examined six identified subtests (two language, two literacy, two math) from the Woodcock-Johnson tests of cognitive ability, a well-validated and widely-used battery of tests of student academic and intellectual skills. They also looked at a set of non-cognitive abilities related to behavior, socialization, and enthusiasm for school. A predictable pattern played out. Students who attended the Tennessee pre-K program saw significant short-term gains relative to their peers who did not attend the program. But over time the peer group caught up and, in this study, in fact exceeded the test group. That is, students who attended Tennessee’s pre-K program ended up actually underperforming those who were not selected into it.

By the end of kindergarten, the control children had caught up to the TN-VPK children and there were no longer significant differences between them on any achievement measures. The same result was obtained at the end of first grade using both composite achievement measures. In second grade, however, the groups began to diverge with the TN-VPK children scoring lower than the control children on most of the measures…. In terms of behavioral effects, in the spring the first grade teachers reversed the fall kindergarten teacher ratings. First grade teachers rated the TN-VPK children as less well prepared for school, having poorer work skills in the classrooms, and feeling more negative about school.

This dispiriting outcome mimics that of the Head Start study, another much-discussed, controversial study that found similar outcomes: initial advantages for Head Start students that are lost entirely by 3rd grade.

Further study is needed, but it seems that the larger and more representative the study, the less impressive – and the less persistent – the gains from pre-K. There’s a bit of uncertainty here about whether the differences in outcomes are really the product of differences in programs or due to differences in the research itself. And I don’t pretend that this is a settled question. But it is important to recognize that the positive evidence for pre-K comes from smaller, higher-resource, more-intensive programs. Larger programs have far less encouraging outcomes.

The best guess, it seems to me, is that at scale universal pre-K programs would function more like the Tennessee system and less like the small, higher-performing programs. That’s because scaling up any major institutional venture, in a country the size of the United States, is going to entail the inevitable moderating effects of many repetitions. That is, you can build one school or one program and invest a lot of time, effort, and resources into making it as effective as possible, and potentially see significant gains relative to other schools. But it strikes me as a simple statement of the nature of reality that this intensity of effort and attention can’t scale. As Farran and Lipsey say in a Brookings Institution essay, “To assert that these same outcomes can be achieved at scale by pre-K programs that cost less and don’t look the same is unsupported by any available evidence.”

Some will immediately say, well, let’s just pay as much for large-scale pre-K as they do in the other programs and model their techniques. The $26 billion question is, can you actually do that? Can what makes these programs special actually be scaled? Is there hidden bias here that will wash out as we expand the programs? I confess I’m skeptical that we’ll see these quantitative gains under even the best scenario. I think we need to understand the inevitability of mediocrity and regression to the mean. That doesn’t mean I don’t support universal pre-kindergarten childcare. As with after school programs, I do for social and political reasons, though, not out of any conviction that they’ll change test scores much. I’d be happy to be proven wrong.

Now I don’t mean to extrapolate irresponsibly. But allow me to extrapolate irresponsibly: isn’t this precisely what we should expect with charter schools, too? We tend to see, survivorship-bias heavy CREDO studies aside, that at scale the median charter school does little or nothing to improve on traditional public schools. We also see a number of idiosyncratic, high-intensity, high-attention charters that report better outcomes. The question you have to ask, based on how the world works, is which is more likely to be replicated at scale – the median, or the exceptions?

I’ve made this point before about Donald Trump’s favorite charter schools, Success Academy here in New York. Let’s set aside questions of the abusive nature of the teaching that goes on in these schools. The basic charter proponent argument is that these schools succeed because they can fire bad teachers and replace them with good ones. Success Academy schools are notoriously high-stress, long-hours, low-pay affairs. This leads naturally to high teacher attrition. Luckily for the NYC-based Success Academy, New York is filled with lots of eager young people who want to get a foothold in the city, do some do-goodering, then bail for their “real” careers later on – essentially replicating the Teach for America model. So: even if we take all of the results from such programs at face value, do you think this is a situation that can be scaled up in places that are far less attractive to well-educated, striving young workers? Can you get that kind of churn and get the more talented candidates you say you need, at no higher cost, to come to the Ozarks or Flint, Michigan or the Native American reservations? Can you nationally have a profession of 3 million people, already caught in a teacher shortage, and then replicate conditions that lead to somewhere between 35% and 50% annual turnover, depending on whose numbers you trust?

And am I really being too skeptical if my assumption is to say no, you can’t?

Study of the Week: Of Course Virtual K-12 Schools Don’t Work

This one seems kind of like shooting fish in a barrel, but given that “technology will solve our educational problems” is holy writ among the Davos crowd no matter what the evidence, I suppose this is worth doing.

Few people would ever come out and say this, but central to assumptions about educational technology is the idea that human teachers are an inefficiency to be removed from the system by whatever means possible. Right now, not even the most credulous Davos type, nor the most shameless ed tech profiteer, is making the case for fully automated AI-based instruction. But attempts to dramatically increase the number of students that you can force through the capitalist pipeline at low cost – sorry, that you can help nurture and grow – are well under way, typically by using digital systems to let one teacher teach more students than you’d see in a brick-and-mortar classroom. This also cuts down on the costs of facilities, which give kids a safe and engaging place to go every day but which are expensive. So you build a virtual platform, policy types use words like “innovation” and “disrupt,” and for-profit entities start sucking up public money with vague promises of deliverance-through-digital-technology. Kids and parents get “choice,” which the ed reform movement has successfully branded as a good thing even though at scale school choice has not been demonstrated to have any meaningful relationship to improved outcomes at all.

Today’s Study of the Week, from a couple years ago, takes a look at whether these virtual K-12 schools actually, you know, work. It’s a part of the CREDO project. I have a number of issues, methodological and political, with the CREDO program generally, but I still think this is high-quality data. It’s a large data set that compares the outcomes of students in traditional public schools, brick and mortar charters, and virtual charters. The study uses a matched data method – in simple terms, comparing students from the different “conditions” who match on a variety of demographic and educational metrics in order to attempt to control construct-irrelevant variance. This can help ameliorate some of the problems with observational studies, but bear in mind that once again, this is not the same as a true randomized controlled trial. They had to do things this way because online charter seats are not assigned via lottery. (For the record, I do not trust the randomization effects of such lotteries because of the many ways in which they are gamed, but here that’s not even an issue because there’s no lottery at all.)

The matched variables, if you’re curious:

• Grade level
• Gender
• Race/Ethnicity
• Free or Reduced-Price Lunch Eligibility
• English Language Learner Status
• Special Education Status
• Prior test score on state achievement test
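To make the matching idea concrete, here is a minimal sketch of how one might pair each virtual-charter student with a comparison student who is identical on the variables above and close on prior test score, then compare their later outcomes. This is an illustration only, not CREDO’s actual matching procedure; the column names, dataframe layout, and score tolerance are hypothetical.

```python
import pandas as pd

# Hypothetical columns; the real matching procedure is more involved.
MATCH_VARS = ["grade", "gender", "race_ethnicity", "frl", "ell", "sped"]

def match_students(virtual: pd.DataFrame, traditional: pd.DataFrame,
                   score_tol: float = 0.1) -> pd.DataFrame:
    """Pair each virtual-charter student with a traditional-school student who
    matches exactly on the demographic variables and whose prior test score is
    within `score_tol` (in standardized score units)."""
    candidates = virtual.merge(traditional, on=MATCH_VARS,
                               suffixes=("_virtual", "_traditional"))
    close = candidates[
        (candidates["prior_score_virtual"]
         - candidates["prior_score_traditional"]).abs() <= score_tol
    ]
    # Keep one comparison student per virtual-charter student.
    return close.groupby("student_id_virtual", as_index=False).first()

# The outcome contrast is then just the mean difference in later scores
# between the matched pairs, e.g.:
# matched["score_virtual"].mean() - matched["score_traditional"].mean()
```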

So how well do online charters work? They don’t. They don’t work. Look at this.

Please note that, though these negative effect sizes may not seem that big to you, in a context where most attempted interventions are not statistically different from zero, they’re remarkable. I invite you to look at the “days of learning lost” scale on the right of the graphic. There are only 180 days in the typical K-12 school year! This is educational malpractice. How could such a thing have been attempted with over 160,000 students without any solid evidence it could work? Because the constant, the-sky-is-falling crisis narrative in education has created a context where people believe they are entitled to try anything, so long as their intentions are good. Crisis narratives undermine checks and balances and the natural skepticism that we should ordinarily apply when the interests of young children and public money are at stake. So you get millions of dollars spent on online charter schools that leave students a full school year behind their peers.

Are policy types still going full speed ahead, working to send more and more students – and more and more public dollars – into these failed, broken online schools? Of course. Educational technology and the ed reform movement writ large cannot fail, they can only be failed, and nothing as trivial as reality is going to stand in the way.

Study of the Week: Trade Schools Are No Panacea

You will likely have encountered the common assertion that we need to send people into trade schools to address problems like college dropout rates and soft labor markets for certain categories of workers. As The Atlantic recently pointed out, the idea that we need to be sending more people to trade and tech schools has broad bipartisan, cross-ideological appeal. This argument has a lot of different flavors, but it tends to come down to the claim that we shouldn’t be sending everyone to college (I agree!) and that instead we should be pushing more people into skilled trades. Oftentimes this is encouraged as an apprenticeship model over a schooling model.

I find there’s far more in the way of narrative force behind these claims than actual proof. It just sounds good – we need to get back to making things, to helping people learn how to build and repair! But… where’s the evidence? I’ve often looked at brute-force figures like unemployment rates for particular professions, but it’s hard to draw responsible conclusions from that kind of analysis. Well, there’s a big new study out that looks at the question in a much more rigorous way – and the results aren’t particularly encouraging.

Today’s Study of the Week, written by Eric A. Hanushek, Guido Schwerdt, Ludger Woessmann, and Lei Zhang, looks at how workers who attend vocational schools perform relative to those who attend general education schools. Like the recent Study of the Week on the impact of universal free school breakfast, this study uses a difference-in-differences approach to explore causation, again because it’s impossible to do an experiment with this type of question – you can’t exactly tell people that your randomization has sorted them into a particular type of schooling and potentially life-long career path, after all. The primary data they use is the International Adult Literacy Survey, a very large, metadata-robust survey with demographic, education, and employment data from 18 countries, gathered from 1994 to 1998. (The authors restrict their analysis to the 11 countries that have robust vocational education systems in place.) The age of the data is unfortunate, but there’s little reason to believe that the analysis here would have changed dramatically, and the data set is so rich with variables (and thus the potential to do extensive checks for robustness and bias) that it’s a good resource. What do they find? In broad strokes, vocational/tech training helps you get a job right out of school, but hurts you as you go along later in life:

(don’t be too offended by the exclusion of women – their overall change in workforce participation made it necessary)

Most important to our purpose, while individuals with a general education are initially (normalized to an age of 16 years) 6.9 percentage points less likely to be employed than those with a vocational education, the gap in employment rates narrows by 2.1 percentage points every ten years. This implies that by age 49, on average, individuals completing a general education are more likely to be employed than individuals completing a vocational education. Individuals completing a secondary-school equivalency or other program (the “other” category) have a virtually identical employment trajectory as those completing a vocational education.

Now, they go on to do a lot of quality controls and checks for robustness and confounds. As much of a slog as that stuff is, I recommend you check it out and start to pick it apart. Becoming a skilled reader of academic research literature really requires that you get used to picking apart the quality controls, because this is often where the juicy stuff can be found. Still, in this study the various checks and controls all support the same basic analysis: those who attend vocational schools or programs enjoy initially higher employability but go on to suffer from higher unemployment later in life.

What’s going on with these trends? The suggestion of the authors seems correct to me: vocational training is likely more specific and job-focused than general ed, which means that its students are more ready to jump right into work. But over time, technological and economic changes change which skills and competencies are valued by employers, and the general education students have been “taught to learn,” meaning that they are more adaptable and can acquire new and valuable skills.

I’m not 100% convinced that counseling more people into the trades is a bad idea. After all, the world needs people who can do these things, and early-career employability is nothing to dismiss. But the affirmative case that more trade school is a solution to long-term unemployment problems seems clearly wrong. And in fact this type of education seems to deepen one of our bigger problems in the current economy: technological change moves so fast these days that it’s hard for older workers to adapt, and they often find themselves in truly unfortunate positions. Even in trades that are less susceptible to technological change, there’s uncertainty; a lot of the traditional construction trades, for example, are very exposed to the housing market, as we learned the hard way in 2009. Do we want to use public policy to deepen these risks?

In a broader sense: it’s unclear if it’s ever a good idea to push people into a particular narrow range of occupations, because then people rush into them and… there stops being any shortage and advantage for labor. For a little while there, petrochemical engineering seemed huge. But it takes a lot of schooling to do those jobs, and then the oil market crashed. Pharmacy was the safe haven, and then word got out, a ton of people went into the field, and the labor market advantage was eroded. Also, there are limits to our understanding of how many workers we need in a given field. Some people argue there’s a teacher shortage; some insist there isn’t. Some people believe there’s a shortage of nurses; some claim there’s a glut. If you were a young student, would you want to bet your future on this uncertainty? It seems far more useful to me to try and train students into being nimble, adaptable learners than to train them for particular jobs. That has the bonus advantage of restoring the “practical” value of the humanities and arts, which have always been key aspects of learning to be well-rounded intellects.

My desires are twofold. First, that we be very careful when making claims about the labor market of the future, given the certainty that trends change. (One of my Purdue University students once told me, with a smirk, that he had intended to study Search Engine Optimization when he was in school, only to find that Facebook had eaten Google as the primary driver of many kinds of web traffic.) Second, that we stop saying “the problem is you went into X field” altogether. Individual workers are not responsible for labor market conditions. Those are the product of macroeconomic conditions – inadequate aggregate demand, outsourcing, and the merciless march of automation. What’s needed is not to try and read the tea leaves and guess which fields might reward some slice of our workforce now, but to redefine our attitude towards work and material security through the institution of some sort of guaranteed minimum income. Then, we can train students in the fields in which they have interest and talent, contribute to their human flourishing in doing so, and help shelter them from the fickleness of the economy. The labor market is not a morality play.

Study of the Week: Feed Kids to Feed Them

Today’s Study of the Week is about subsidized meal programs for public school students, particularly breakfast. School breakfast programs have been targeted by policymakers for a while, in part because of discouraging participation levels. Even many students who are eligible for subsidized lunches often don’t take advantage of school breakfast. The reasons for this are multiple. Price is certainly a factor. As you’d expect, price is inversely related to participation rates for school breakfast. Also, in order to take advantage of breakfast programs, you need to arrive at school early enough to eat before school formally begins, and it’s often hard enough to get teenagers to school on time just for class. Finally, there’s a stigma component, particularly associated with subsidized breakfast programs. It was certainly the case at my public high school, where 44% of students were eligible for federal school lunch subsidies, that school breakfast carried class associations. At lunch, everybody eats together, but the students at breakfast tended to be the poorer kids – which in turn makes it less likely that students will want to be seen getting school breakfast.

The study, written by Jacob Leos-Urbel, Amy Ellen Schwartz, Meryle Weinstein, and Sean Corcoran (all of NYU), takes advantage of a policy change in New York public schools in 2003. Previously, school breakfast had been free only to those who were eligible for federal lunch subsidies, which remains the case in most school districts. New York made breakfast free for all students, defraying the costs by raising the price of unsubsidized lunch from $1.00 to $1.50. They then went looking to see if the switch to free breakfast for all changed participation in the breakfast program, looking for differences between the three tiers – free lunch students, reduced lunch students, and students who pay full price. They also compared outcomes from traditional schools to Universal Free Meal (UFM) schools, where the percentage of eligible students is so high that everyone in the school gets meals for free already. This helped them tease out possible differences in participation based on moving to a universal free breakfast model. They were able to use a robust data set comprising results from 723,843 students from 667 schools, grades 3–8. They also investigated whether breakfast participation rates were associated with performance in quantitative educational metrics.

It’s important to say that it’s hard to really get at causality here because we’re not doing a randomized experiment. Such an experiment would be flatly unethical – “sorry, kid, you got sorted into the no-free-breakfast group, good luck.” So we have to do observational studies and use what techniques we can to adjust for their weaknesses. In this study, the authors used what’s called a difference-in-differences design. These techniques are often used when analyzing natural experiments. In the current case, we have schools where the change in policy has no impact on who receives free breakfast (the UFM schools) and schools where there is an impact (the traditional schools). Therefore the UFM schools can function as a kind of natural control group, since they did not receive the “treatment.” You then use a statistical model to compare the change in the variables of interest for the “control” group to the change for the “treatment” group. Make sense?
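If that’s still abstract, here’s the core arithmetic as a minimal sketch. The participation rates below are made up purely for illustration; the real paper estimates this effect with a regression model and many controls.

```python
# Hypothetical breakfast-participation rates (fraction of students eating
# school breakfast), before and after the universal-free-breakfast policy.
treatment_pre, treatment_post = 0.20, 0.26  # traditional schools (policy changes who eats free)
control_pre, control_post = 0.55, 0.57      # UFM schools (breakfast was already free)

# Each group's raw change over time...
treatment_change = treatment_post - treatment_pre  # ≈ 0.06
control_change = control_post - control_pre        # ≈ 0.02

# ...and the difference-in-differences estimate: the change in the treatment
# group over and above the shared trend captured by the control group.
did_estimate = treatment_change - control_change
print(f"DiD estimate of the policy effect: {did_estimate:.2f}")  # ≈ 0.04
```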

What did the authors find? The results of the policy change were modest, in almost every measurable way, and consistent across a number of models that the authors go into in great detail in the paper. Students did take advantage of school breakfast more after breakfast became universally free. On the one hand, students who paid full price increased breakfast participation by 55%, which is a large number; but on the other hand, their initial baseline participation rates were so low (again because breakfast participation is class-influenced) that they only ate on average 6 additional breakfasts a year. Participation among reduced-price and free-lunch students increased by 33% and 15%, respectively – the latter particularly interesting given that those students did not pay for breakfast to begin with. Still, that too only represents about 6 meals over the course of a year, not nothing but perhaps less than we’d hope for a program with low participation rates. The only meaningful difference in models seems to be when they restrict their analysis to the small number (91) of schools where less than a third of students are eligible for lunch subsidies, in which case breakfast participation grew by a substantially larger amount. The purchase of lunches, for what it’s worth, remained static despite the price increase.

There’s a lot of picking apart the data and attempting to determine to what degree these findings are related to stigma. I confess I find the discussion a bit muddled, but your mileage may vary. The educational impacts, also, were slight. The authors found a small increase in attendance, though the result was not statistically significant, and no impact on reading and math outcomes.

These findings are somewhat discouraging. Certainly we would hope that moving to a universal program would help to spur participation rates to a greater degree than we’re seeing here. But it’s important to note that the authors largely restricted their analysis to the years immediately before and after the policy change, thanks to the needs of their model. When broadening the time frame by a couple years, they find an accelerating trend in participation rates, though the model is somewhat less robust. What’s more, as the authors note, decreasing stigma is the kind of thing that takes time. If it is in fact the case that stigma keeps students from taking part in school breakfast, it may well take a longer time period for universal free breakfast to erode that disincentive.

I’m also inclined to suspect that the need to get kids to school early to eat represents a serious challenge to the pragmatic success of this program. There’s perhaps good news on the way:

Even when free for all, school breakfast is voluntary. Further, unlike school lunch, breakfast traditionally is not fully incorporated into the school day and students must arrive at school early in order to participate. Importantly, in the time period since the introduction of the universal free breakfast policy considered in this paper, New York City and other large cities have begun to explore other avenues to increase participation. Most notably, some schools now provide breakfast in the classroom.

Ultimately, I believe that making school breakfast universally free is a great change even in light of relatively modest impacts on participation rates. We should embrace providing free breakfast to all students regardless of income level as a matter of principle, particularly considering that fluctuations in parental income might make kids who are technically ineligible unable to pay for breakfast. In time, if we set up this universal program as an embedded part of the school day, and work diligently to erase the stigma of using it, I believe more and more kids will begin their days with a full stomach.

As for the lack of impacts on quantitative metrics, well – I think that’s no real objection at all. We should feed kids to feed them, not to improve their numbers. This all dovetails with my earlier point about after school programs: if we insist on viewing every question through the lens of test scores, we’re missing out on opportunities to improve the lives of children and parents that are real and important. Again, I will say that I recognize the value of quantitative academic outcomes in certain policy situations. But the relentless focus on quantitative outcomes leads to scenarios where we have to ask questions like whether giving kids free breakfast improves test scores. If it does, great – but the reason to feed children is to feed children. When it comes to test scores and education policy, the tail too often wags the dog, and it has to stop.

Study of the Week: Better and Worse Ways to Attack Entrance Exams

For this week’s Study of the Week I want to look at standardized tests, the concept of validity, and how best – and worst – to criticize exams like the SAT and ACT. To begin, let’s consider what exactly it means to call such exams valid.

What is validity?

Validity is a multi-faceted concept that’s seen as a core aspect of test development. Like many subjects in psychometrics and stats, it tends to be used casually and referred to as something fairly simple, when in fact the concept is notoriously complex. Accepting that any one-sentence definition of validity is thus a distortion, generally we say that validity refers to the degree to which a test measures what it purports to measure. A test is more or less valid depending on its ability to actually capture the underlying traits we are interested in investigating through its mechanism. No test can ever be fully or perfectly validated; rather we can say that it is more or less valid. Validity is a vector, not a destination.

Validity is so complex, and so interesting, in part because it sits at the nexus of both quantitative and philosophical concerns. Concepts that we want to test may appear superficially simple but are often filled with hidden complexity. As I wrote in a past Study of the Week, talking about the related issues of construct and operationalization,

If we want to test reading ability, how would we go about doing that? A simple way might be to have a test subject read a book out loud. We might then decide if the subject can be put into the CAN READ or CAN’T READ pile. But of course that’s quite lacking in granularity and leaves us with a lot of questions. If a reader mispronounces a word but understands its meaning, does that mean they can’t read that word? How many words can a reader fail to read correctly in a given text before we sort them into the CAN’T READ pile? Clearly, reading isn’t really a binary activity. Some people are better or worse readers and some people can read harder or easier texts. What we need is a scale and a test to assign readers to it. What form should that scale take? How many questions is best? Should the test involve reading passages or reading sentences? Fill in the blank or multiple choice? Is the ability to spot grammatical errors in a text an aspect of reading, or is that a different construct? Is vocabulary knowledge a part of the construct of reading ability or a separate construct?

Questions such as these are endemic to test development, and frequently we are forced to make subjective decisions about how best to measure complex constructs of interest. As is common in the quantitative social sciences, this subjective, theoretical side of validity is often written out of our conception of the topic, as we want to speak with the certainty of numbers and the authority of the “harder” sciences. But theory is inextricable from empiricism, and the more that we wish to hide it, the more subject we are to distortions that arise from failing to fully think through our theories and what they mean. Good empiricists know theory comes first; without it, the numbers are meaningless.

Validity has been subdivided into a large number of types, which reflect different goals and values within the test development process. Some examples include:

  • Predictive Validity: The ability of a test’s results to predict that which it should be able to predict if the test is in fact valid. If a reading test predicts whether students can in fact read texts of a given complexity or reading level, that would provide evidence of predictive validity. The SAT’s ability to predict the grades of college freshmen is a classic example.
  • Concurrent Validity: If a test’s results are strongly correlated with that of a test that measures similar constructs and which has itself been sufficiently validated, that provides evidence of concurrent validity. Of course, you have to be careful – two invalid tests might provide similar results but not tell us much of actual worth. Still, a test of quantitative reasoning and a test of math would be expected to be imperfectly yet moderately-to-strongly correlated if each is itself a valid test of the given construct.
  • Curricular Validity: As the name implies, curricular validity reflects the degree to which a test matches with a given curriculum. If a test of biology closely matches the content in the syllabus of that biology course, we would argue for high curricular validity. This is important because we can easily imagine a scenario where general ability in biology could be measured effectively by a test that lacked curricular validity – students who are strong in biology might score well on a test, and students who are poor would likely score poorly, even if that test didn’t closely match the curriculum. But that test would still not be a particularly valid measure of biology as learned in that class, so curricular validity would be low. This is often expressed as a matter of ethics.
  • Ecological Validity: Heading in a “softer” direction, ecological validity is often invoked to refer to the degree to which a test or similar assessment instrument matches the real-life contexts in which its consequences will be enacted. Take writing assessment. In previous generations, it was common for student writing ability to be tested through multiple choice tests on grammar and sentence combining. These tests were argued to be valid because their results tend to be highly correlated with the scores that students receive on written essay exams. But writing teachers objected, quite reasonably, that we should test student writing by having them write, even if those correlations are strong. This is an invocation of ecological validity and reflects a broader (and to me positive) effort to not reduce validity to narrowly numerical terms.

I could go on!

When we talk about entrance examinations like the SAT or GRE, we often fixate on predictive validity, for obvious reasons. If we’re using test scores as criteria for entry into selective institutions, we are making a set of claims about the relationship between those scores and the eventual performance of those students. Most importantly, we’re saying that the tests help us to know that students can complete a given college curriculum, that we’re not setting them up to fail by admitting them to a school where they are not academically prepared to thrive. This is, ostensibly, the first responsibility of the college admissions process. Ostensibly.

Of course, there are ceiling effects here, and a whole host of social and ethical concerns that predictive validity can’t address. I can’t find a link now, but a while back a Harvard admissions officer admitted that something like 90% of the applicants have the academic ability to succeed at the school, and that much of the screening process had little to do with actual academic preparedness. This is a big subject that’s outside of the bounds of this week’s study.

The ACT: Still Predictively Valid

Today’s study, by Paul A. Westrick, Huy Le, Steven B. Robbins, Justine M. R. Radunzel, and Frank L. Schmidt, is a large-n (189,612) study about the predictive validity of the ACT, with analysis of the role of socioeconomic status (SES) and high school grades in retention and college grades. The researchers examined the outcomes of students who took the ACT and went on to enroll in 4-year institutions from 2000 to 2006.

The nut:

After corrections for range restriction, the estimated mean correlation between ACT scores and 1st-year GPA was .51, and the estimated mean correlation between high school GPA and 1st-year GPA was .58. In addition, the validity coefficients for ACT Composite score and high school GPA were found to be somewhat variable across institutions, with 90% of the coefficients estimated to fall between .43 and .60, and between .49 and .68, respectively (as indicated by the 90% credibility intervals). In contrast, after correcting for artifacts, the estimated mean correlation between SES and 1st-year GPA was only .24 and did not vary across institutions….

…1st-year GPA, the most proximal predictor of 2nd-year retention, had the strongest relationship (.41). ACT Composite scores (.19) and high school GPA (.21) were similar in the strength of their relationships with 2nd-year retention, and SES had the weakest relationship with 2nd-year retention (.10).

The results should be familiar to anyone who has taken a good look at the literature on these tests, and to anyone who has been a regular reader of this blog. The ACT is in fact a pretty strong predictor of GPA, though far from a perfect one at .51. Context is key here; in the world of social sciences and education, .51 is an impressive degree of predictive validity for the criterion of interest. But there’s lots of wiggle! And I think that’s ultimately a good thing; it permits us to recognize that there are a variety of ways to effectively navigate the challenges of the college experience… and to fail to do so. (As the Study of the Week post linked to above notes, GPA is strongly influenced by Conscientiousness, the part of the Five Factor Model associated with persistence and delaying gratification.) We live in a world of variability, and no test can ever make perfectly accurate predictions about who will succeed or fail. Exceptions abound. Proponents of these tests will say, though, that they are probably much more valid predictors of college grades and dropout rates than more subjective criteria like essays and extracurricular activities. And they have a point.
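For concreteness, a predictive validity coefficient like that .51 is just a correlation between test scores and the later criterion. Here’s a minimal sketch with invented students and numbers; the real study also applies corrections for range restriction and other artifacts, which this toy example ignores.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson correlation: the basic predictive-validity coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Hypothetical ACT composite scores and 1st-year GPAs for a handful of students.
act_scores = [21, 28, 24, 32, 18, 26, 30, 23]
first_year_gpa = [2.6, 3.4, 3.0, 3.7, 2.3, 2.9, 3.5, 3.1]
print(round(pearson_r(act_scores, first_year_gpa), 2))
```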

Does the fact that SES correlates “only” at .24 with college GPA mean SES doesn’t matter? Of course not. That level of correlation for a variable that is truly construct-irrelevant and which has such obvious social justice dimensions is notable even if it’s less powerful than some would suspect. It simply means that we should take care not to exaggerate that relationship, or the relationship between SES and performance on tests like the ACT and SAT, which is similar at about .25 in the best data known to me. Again: clearly that is a relevant relationship, and clearly it does not support the notion that these tests only reflect differences in SES.

Ultimately, every read I have of the extant evidence demonstrates that tests like the SAT and ACT are moderately to highly effective at predicting which students will succeed in terms of college GPA and retention rates. They are not perfect and should not be treated as such, so we should use other types of evidence such as high school grades and other, “soft” factors in our college admissions procedures – in other words, what we already do – if we’re primarily concerned with screening for prerequisite ability. Does that mean I have no objections to these tests or their use? Not at all. It just means that I want to make the right kinds of criticisms.

Don’t Criticize Strength, Criticize Weakness

A theme that I will return to again and again in this space is that we need to consider education and its place in society from a high enough level to think coherently. Critics of the SAT and ACT tend to pitch their criticisms at a level that does them no good.

So take this piece in Slate from a couple of enthusiastic SAT (and IQ) proponents. In it, they take several liberal academics to task for making inaccurate claims about the SAT, in particular the idea that the SAT only measures how well you take the SAT. As the authors say, the evidence against this is overwhelming; the SAT, like the ACT, is and has always been an effective predictor of college grades and retention rates, which is precisely what the test is meant to predict. The big testing companies invest a great deal of money and effort in making them predictively valid. (And a great deal of test taker time and effort, too, given that one section out of each given exam is “experimental,” unscored and used for the production of future tests.) When you attack the predictive validity of these tests – their ability to make meaningful predictions about who will succeed and who will fail at college – you are attacking them at their strongest point. It’s like their critics are deliberately making the weakest critique possible.

“These tests are only proxies for socioeconomic status” is a factually incorrect attempt to make a criticism of how our educational system replicates received advantage. It fails because it does not operate at the right level of perspective. Here’s a better version, my version: “these tests are part of an educational system that reflects a narrow definition of student success that is based on the needs of capitalism, rather than a fuller, more humanistic definition of what it means to be a good student.”

These tests do indeed tell us how well students are likely to do in college and in turn provide some evidence of how well they will do in the working world. But college, like our educational system as a whole, has been tuned to attend to the needs of the market rather than to the broader needs of humanity. The former privileges the kind of abstract processing and brute reasoning skills that tests are good at measuring and which makes one a good Facebook or Boeing employee. The latter would include things like ethical responsibility, aesthetic appreciation, elegance of expression, and dedication to equality, among other things, which tests are not well suited to measuring. A more egalitarian society would of course also have need for, and value, the raw processing power that we can test for effectively, but that strength would be correctly seen as just one value among many. To get there, though, we have to make much broader critiques and reforms of contemporary society than the “SAT just measures how well you take the SAT” crowd tend to engage in.

What I am asking for, in other words, is that we focus on telling the whole story rather than distorting what we know about part of the story. There is so much to criticize in our system and how it doles out rewards, so let’s attack weakness, not strength.

Study of the Week: What Actually Helps Poor Students? Human Beings

As I’ve said many times, a big part of improving our public debates about education (and, with hope, our policy) lies in having a more realistic attitude towards what policy and pedagogy are able to accomplish in terms of changing quantitative outcomes. We are subject to socioeconomic constraints which create persistent inequalities, such as the racial achievement gap; these may be fixable via direct socioeconomic policy (read: redistribution and hierarchy leveling), but have proven remarkably resistant to fixing through educational policy. We also are constrained by the existence of individual differences in academic talent, the origins of which are controversial but the existence of which should not be. These, I believe, will be with us always, though their impact on our lives can be ameliorated through economic policy.

I have never said that there is no hope for changing quantitative indicators. I have, instead, said that the reduction of the value of education to only those quantitative indicators is a mistake, especially if we have a realistic attitude towards what pedagogy and policy can achieve.  We can and should attempt to improve outcomes on these metrics, but we must be realistic, and the absolute refusal of policy types to do so has resulted in disasters like No Child Left Behind. Of course we should ask questions about what works, but we must be willing to recognize that even what works is likely of limited impact compared to factors that schools, teachers, and policy don’t control.

This week’s Study of the Week, by Dietrichson, Bøg, Filges, and Jørgensen, provides some clues. It’s a meta-analysis of 101 studies from the past 15 years, three quarters of which were randomized controlled trials. That’s a particularly impressive evidentiary standard. It doesn’t mean that the conclusions are completely certain, but that number of studies, particularly with randomized controlled designs, lends powerful evidence to what the authors find. If we’re going to avoid the pitfalls of significance testing and replicability, we have to do meta-analyses, even as we recognize that they are not a panacea. Before we take a look at this one, a quick word on how they work.

Effect Size and Meta-Analysis

The term “statistically significant” appears in discussions of research all the time, but as you often hear, statistical significance is not the same thing as practical significance. (After “correlation does not imply causation!” that’s the second most common Stats 101 bromide people will throw at you on the internet.) And it’s true and important to understand. Statistical significance tests are performed to help ascertain the likelihood that a perceived quantitative effect is a figment of our data. So we have some hypothesis (giving kids an intervention before they study will boost test scores, say) and we also have the null hypothesis (kids who had the intervention will not perform differently than those who didn’t take it). After we do our experiment we have two average test scores for the two groups, and we know how many of each we have and how spread out their scores are (the standard deviation). Afterwards we can calculate a p-value, which tells us the likelihood that we would have gotten that difference in average test scores or better even if the null was actually true. Stat heads hate this kind of language but casually people will say that a result with a low p-value is likely a “real” effect.
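As a minimal sketch of that logic, here’s a two-sample t-test on made-up scores (the groups and numbers are invented; any standard statistics library would do the same job):

```python
from scipy import stats

# Hypothetical test scores for the intervention and comparison groups.
intervention = [78, 85, 72, 90, 84, 77, 88, 81]
comparison = [74, 80, 69, 83, 79, 75, 82, 76]

mean_diff = (sum(intervention) / len(intervention)
             - sum(comparison) / len(comparison))

# The t-test asks: if the null hypothesis were true (no real difference),
# how likely is a gap in average scores at least this large?
result = stats.ttest_ind(intervention, comparison)
print(f"difference in means: {mean_diff:.2f}")
print(f"p-value: {result.pvalue:.3f}")
```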

For all of its many problems, statistical significance testing remains an important part of navigating a world of variability. But note what a p-value is not telling us: the actual strength of the effect. That is, a p-value helps us have confidence in making decisions based on a perceived difference in outcomes, but it can’t tell us how practically strong the effect is. So in the example above, the p-value would not be an appropriate way to report the size of the difference in averages between the two groups. Typically people have just reported those different averages and left it at that. But consider the limitations of that approach: frequently we’re going to be comparing different figures from profoundly different research contexts and derived from different metrics and scales. So how can we responsibly compare different studies and through them different approaches? By calculating and reporting effect size.

As I discussed the other day, we frequently compare different interventions and outcomes through reference to the normal distribution and standard deviation. As I said, that allows us to make easy comparisons between positions on different scales. You look at the normal distribution and can say OK, students in group A were this far below the mean, students in group B were this far above it, and so we can say responsibly how different they are and where they stand relative to the norm. Pragmatically speaking (and please don’t scold me), there’s only about three standard deviations of space below and above the mean in normally-distributed data. So when we say that someone is a standard deviation above or below someone else, that gives you a sense of the scale we’re talking about here. Of course, the context and subject matter make a good deal of difference too.

There’s lots of different ways to calculate effect sizes, though all involve comparing the size of the given effect to the standard deviation. (Remember, standard deviation is important because spread tells us how much we should trust a given average. If I give a survey on a 0-10 scale and I get equal numbers of every number on that scale – exactly as many 0s, 1s, 2s, 3s, etc. – I’ll get an average of 5. If I give that same survey and everyone scores a 5, I still get an average of 5. But for which situation is 5 a more accurate representation of my data?) In the original effect size measure, and one that you still see sometimes, you simply divide the difference between the averages by the pooled standard deviation of the groups you’re comparing, to give you Cohen’s d. There are much fancier ways to calculate effect size, but that’s outside the bounds of this post.
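Here’s that original formula as a minimal sketch; the two groups below are invented, and real analyses often use bias-corrected variants rather than plain Cohen’s d.

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled SD: a sample-size-weighted combination of the two groups' spreads.
    pooled_sd = (((n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2)
                 / (n_a + n_b - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd

# Hypothetical intervention and comparison groups on the same test.
print(round(cohens_d([78, 85, 72, 90, 84, 77], [74, 80, 69, 83, 79, 75]), 2))
```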

A meta-analysis takes advantage of the affordances of effect size to compare different interventions in a mathematically responsible way. A meta-analysis isn’t just a literature review; rather than just reporting what previous researchers have found, those conducting a meta-analysis use quantitative data made available to researchers to calculate pooled effect sizes. When doing so, they weight the data by looking at the sample size (more is better), the standard deviation (less spread is better), and the size of the effect. There are then some quality controls and attempts to account for differences in context and procedure between different studies. What you’re left with is the ability to compare different results and discuss how big effects are in a way that helps mitigate the power of error and variability in individual studies.
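To give a flavor of the pooling step, here is a minimal fixed-effect sketch in which each study’s effect size is weighted by the inverse of its variance, so that larger, less noisy studies count for more. The studies and numbers are invented, and real meta-analyses (including this one) typically use random-effects models and additional adjustments.

```python
def pooled_effect(effect_sizes, variances):
    """Fixed-effect meta-analytic pooling: an inverse-variance weighted average."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effect_sizes)) / sum(weights)
    pooled_variance = 1.0 / sum(weights)
    return pooled, pooled_variance

# Hypothetical studies: (effect size in SD units, variance of that estimate).
studies = [(0.30, 0.010), (0.12, 0.004), (0.45, 0.025)]
d, var = pooled_effect([s[0] for s in studies], [s[1] for s in studies])
print(f"pooled effect: {d:.2f} SD, 95% CI: ±{1.96 * var ** 0.5:.2f}")
```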

Because meta-analyses must go laboriously through explanations of how studies were selected and disqualified, as well as descriptions of quality controls and the particular methods to pool standard deviations and calculate effect sizes, reading them carefully is very boring. So feel free to hit up the Donation buttons to the right to reward your humble servant for peeling through all this.

Bet On the Null

One cool thing about meta-analyses is that they allow you to get a bird’s-eye view of the kinds of effects reported in various studies of various types of interventions. And what you find, in ed research, is that we’re mostly playing with small effects.

In the graphic above, the scale at the bottom is for effect sizes represented in standard deviations. The dots on the lines are the effect sizes for a given study. The lines extending from the dots are our confidence intervals. A confidence interval is another way of grappling with statistical significance and how much we trust a given average. Because of the inevitability of measurement error, we can never say with 100% certainty that a sample mean is the actual mean of that population. Instead, we can say with a certain degree of confidence, which we choose ourselves, that the true mean lies within a given range of values. 95% confidence intervals, such as these, are a typical convention. Those lines tell us that, given the underlying data, we can say with 95% confidence that the true average lies within those lines. If you wanted to narrow those lines, you could choose a lower % of confidence, but then you’re necessarily increasing the chance that the true mean isn’t actually within the interval.
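Here’s where those interval bars come from, as a minimal sketch under the usual normal approximation; the effect size and standard error below are invented.

```python
# 95% confidence interval under a normal approximation:
# estimate plus or minus 1.96 standard errors.
effect_size = 0.25      # hypothetical effect, in standard deviations
standard_error = 0.10   # hypothetical precision of that estimate

lower = effect_size - 1.96 * standard_error
upper = effect_size + 1.96 * standard_error
print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # (0.05, 0.45): excludes zero, just barely
```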

Anyhow, look at the effects here. As is so common in education, we’re generally talking about small impacts from our various interventions. This doesn’t tell you what kind of interventions these studies performed – we’ll get there in just a second – but I just want to note how studies with the most dependable designs tend to produce limited effects in education. In fact, in a majority of these studies the confidence interval includes zero. Meanwhile, only 6 of these studies show meaningfully powerful effects, though in the context of education research those are pretty large.

Not to cast aspersions, but the Good et al. study reports the kind of effect size that makes me skeptical right off the bat. The very large confidence interval should also give us pause. That doesn’t mean the researchers weren’t responsible, or that we throw out that study entirely. It just means that this is exactly what meta-analysis is for: it helps us put results in context, to compare the quantitative results of individual studies against others and to get a better perspective on the size of a given effect and the meaning of a confidence interval. In this case, the confidence interval is so wide that we should take the result with several pinches of salt, given the variability involved. Again, no insult to the researchers; ed data is highly variable so getting dependable numbers is hard. We just need to be real: when it comes to education interventions, we are constrained by the boundaries of the possible.

Poor students benefit most from the intervention of human beings

OK, on to the findings. When it comes to improving outcomes for students from poor families, what does this meta-analysis suggest works?

A few things here. We’ve got a pretty good illustration of the relationship between confidence intervals and effect size; small-group instruction has a strong effect size, but because the confidence interval (just barely) overlaps with 0 it could not be considered statistically significant at the .05 level. Does that mean we throw out the findings? No; the .05 cutoff isn’t a dogma, despite what journal publishing guidelines might make you think. But it does mean that we have to be frank about the level of variability in outcomes here. It seems small group instruction is pretty effective in some contexts for some students but potentially not effective at all in others.

Bear in mind: because we’re looking at aggregates of various studies here, wide confidence intervals likely mean that different studies found conflicting findings. We might say, then, that these interventions can be powerful but that we are less certain about the consistency of their outcomes; maybe these things work well for some students but not at all for others. Meanwhile an intervention like increased resources has a nice tight confidence interval, giving us more confidence that the effect is “real,” but a small effect size. Is it worth it? That’s a matter of perspective.

Tutoring looks pretty damn good, doesn’t it? True, we’re talking about less than .4 of a SD on average, but again, look at the context here. And that confidence interval is nice and tight, meaning that we should feel pretty strongly that this is a real effect. This should not be surprising to anyone who has followed the literature on tutoring interventions. Yet how often do you hear about tutoring from ed reformers? How often does it pop up at The Atlantic or The New Republic? Compare that to computer-mediated instruction, which is a topic of absolute obsession in our ed debate, the digital Godot we’re all waiting for to swoop in and save our students. No matter how often we get the same result, technology retains its undeserved reputation as the key to fixing our system. When I say that education reform is an ideological project and not a practical one, this is what I mean.

What’s shared by tutoring, small group instruction, cooperative learning, and feedback and progress monitoring – the interventions that come out looking best? The influence of another human being. The ability to work closely with others, particularly trained professionals, to go through the hard, inherently social work of error and correction and trying again. Being guided by another human being towards mastery of skills and concepts. Not spending tons of money on some ed tech boondoggle. Rather, giving individual people the time necessary to work closely with students and shepherd their progress. Imagine if we invested our money in giving all struggling students the ability to work individually or in small groups with dedicated educational professionals whom we treated as respected experts and paid accordingly.

What are we doing instead? Oh, right. Funneling millions of dollars into one of the most profitable companies in the world for little proven benefit. Guess you can’t be too cynical.

Study of the Week: Rebutting Academically Adrift with Its Own Mechanism

It’s a frustrating fact of life that the arguments that are most visible are always going to be, for most people, the arguments that define the truth. I fear that’s the case with Academically Adrift, the 2011 book by Richard Arum and Josipa Roksa that has done so much to set the conventional wisdom about the value of college. That book made incendiary claims about the limited learning that college students are supposedly doing. Many people assume that the book’s argument is the final word. There are in fact many critical words out there on its methodology, or the methodology we’re allowed to see. (One of the primary complaints about the book is that the authors hide the evidence for some of their claims.) Richard Haswell’s review, available here, is particularly cogent and critical:

They compared the performance of 2,322 students at twenty-four institutions during the first and fourth semesters on one Collegiate Learning Assessment task. Surprisingly, their group as a whole recorded statistically significant gain. More surprisingly, every one of their twenty-seven subgroups recorded gain. Faced with this undeniable improvement, the authors resort to the Bok maneuver and conclude that the gain was “modest” and “limited,” that learning in college is “adrift.” Not one piece of past research showing undergraduate improvement in writing and critical thinking—and there are hundreds—appears in the authors’ discussion or their bibliography….

What do they do? They create a self-selected set of participants and show little concern when more than half of the pretest group drops out of the experiment before the post-test. They choose to test that part of the four academic years when students are least likely to record gain, from the first year through the second year, ending at the well-known “sophomore slump.” They choose prompts that ask participants to write in genres they have not studied or used in their courses. They keep secret the ways that they measured and rated the student writing. They disregard possible retest effects. They run hundreds of tests of statistical significance looking for anything that will support the hypothesis of nongain and push their implications far beyond the data they thus generate.

There are more methodological critiques out there to be found, if you’re interested.

Hoisted By Their Own Petard

But let’s say we want to be charitable and accept their basic approach as sound. Even then, their conclusions are hard to justify, as later research using the same primary mechanism found much more learning. Academically Adrift utilized the Collegiate Learning Assessment (CLA), a test of college learning developed by the Council for Aid to Education (CAE). I wrote my dissertation on the CLA and its successor, the CLA+, so as you can imagine my thoughts on the test in general are complex. I can come up with both a list of things that I like about it and a list of things that I don’t like about it. For now, though, what matters is that the CLA was the primary mechanism through which Arum and Roksa made their arguments. Yet research with a far larger data set, undertaken across the freshman-to-senior cycle that the CLA was designed to measure, has shown far larger gains than those reported by Arum and Roksa.

This report from CAE, titled “Does College Matter?” and this week’s Study of the Week, details research on a larger selection of schools than that measured in Academically Adrift. In contrast to the .18 SAT-normed standard deviation growth in performance that Arum and Roksa found, CAE finds an average growth of .78 SAT-normed standard deviations, with no school demonstrating an effect size of less than .62. Now, if you don’t work in effect sizes often, you might not recognize how large an increase that is, but as an average level of growth across institutions, it’s in fact quite impressive. The institutional average score grew on the CLA’s scale, which is quite similar to that of the SAT and also runs from 400 to 1600, by over 100 points. (For contrast, gains from commercial tutoring programs for tests like the SAT and ACT rarely exceed 15 to 20 points, despite the claims of the major prep companies.)
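
To put those effect sizes in rough scale-point terms, here’s a back-of-the-envelope sketch. The standard deviation I plug in is my own illustrative assumption, chosen to be roughly consistent with the figures above; it is not a number taken from the CAE report.

```python
# Rough, illustrative conversion from effect size to scale points:
# assume the CLA score scale has an SD somewhere around 130 points
# (my assumption, not a figure from the report).
assumed_sd = 130  # hypothetical scale SD, in points

for label, d in [("Academically Adrift", 0.18), ("CAE report", 0.78)]:
    print(f"{label}: {d} SD ~= {d * assumed_sd:.0f} scale points")

# ~23 points versus ~101 points -- the latter consistent with the
# "over 100 points" of institutional growth described above.
```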

The authors write:

This stands in contrast to the findings of Academically Adrift (Arum and Roska, 2011) who also examined student growth using the CLA. They suggest that there is little growth in critical thinking as measured by the CLA. They report an effect size of .18, or less than 20% of a standard deviation. However, Arum and Roska used different methods of estimating this growth, which may explain the differences in growth shown here with that reported in Academically Adrift…. The Arum and Roska study is also limited to a small sample of schools that are a subset of the broader group of institutions that conduct value-added research using the CLA, and so may not be representative of CLA growth in general.

To summarize: research undertaken with the same mechanism as used in Academically Adrift and with both a dramatically larger sample size and a sample more representative of American higher education writ large contradicts the book’s central claim. It would be nice if that would seep out into the public consciousness, given how ubiquitous Academically Adrift was a few years ago.

Of course, the single best way to predict a college’s CLA scores is with the SAT scores of its incoming classes… but you’ve heard that from me before.


Motivation Matters

There’s a wrinkle to all of these tests, and a real challenge to their validity.

A basic assumption of educational and cognitive testing is that students are attempting to do their best work; if all students are not sincerely trying to do their best, they introduce construct-irrelevant variance and degrade the validity of the assessment. This issue of motivation is a particularly acute problem for value-added tests like the CLA, as students who apply greater effort to the test as freshmen than they do as seniors would artificially reduce the amount of demonstrated learning.

At present, the CLA is a low stakes test for students. Unlike with tests like the SAT and GRE, which have direct relevance to admission into college and graduate school, there is currently no appreciable gain to be had for individual students from taking the CLA. Whatever criticisms you may have of the SAT or ACT, we can say with confidence that most students are applying their best effort to them, given the stakes involved in college admissions. Frequently, CLA schools have to provide incentives for students to take the test at all, which typically involve small discounts on graduation-related fees or similar. The question of student motivation is therefore of clear importance for assessing the test’s validity. The developers of the test apparently agree, as in their pamphlet “Reliability and Validity of CLA+,” they write “low student motivation and effort are threats to the validity of test score interpretations.”

In this 2013 study, Ou Lydia Liu, Brent Bridgeman, and Rachel Adler studied the impact of student motivation on ETS’s Proficiency Profile, itself a test of collegiate learning and a competitor to the CLA+. They tested motivation by dividing test takers into two groups. In the experimental group, students were told that their scores would be added to a permanent academic file and noted by faculty and administrators. In the second group, no such information was delivered. The study found that “students in the [experimental] group performed significantly and consistently better than those in the control group at all three institutions and the largest difference was .68 SD.” That’s a mighty large effect! And so a major potential confound. It is true that the Proficiency Profile is a different testing instrument than the CLA, although Liu, Bridgeman, and Adler suggest that this phenomenon could be expected in any test of college learning that is considered low stakes. The results of this research were important enough that CAE’s Roger Benjamin, in an interview with Inside Higher Ed, said that the research “raises significant questions” and that the results are “worth investigating and [CAE] will do so.”

Now, in terms of test-retest scores and value added, the big question is, do we think motivation is constant between administrations? That is, do we think our freshman and senior cohorts are each working equally hard at the test? If not, we’re potentially inflating or deflating the observed learning. Personally, I think first-semester freshmen are much more likely to work hard than last-semester seniors; first-semester freshmen are so nervous and dazed you could tell them to do jumping jacks in a trig class and they’d probably dutifully get up and go for it. But absent some valid and reliable motivation indicator, there’s just a lot of uncertainty as long as students are taking tests that they are not intrinsically motivated to perform well on.

Disciplinary Knowledge – It’s Important

Let’s set aside questions of the test’s validity for a moment. There’s another reason not to think that modest gains on a test like this are reason to fret too much about the degree of learning on campus: they don’t measure disciplinary knowledge and aren’t intended to.

That is, these tests don’t measure (and can’t measure) how much English an English major learns, whether a computer science student can code, what a British history major knows about the Battle of Agincourt, if an Education major will be able to pass a state teacher accreditation test…. These are pretty important details! The reason for this omission is simple: because these instruments want to measure students across different majors and schools, content-specific knowledge can’t be involved. There’s simply too much variation in what’s learned from one major to the next to make such comparisons fruitful. But try telling a professor that! “Hey, we found limited learning on college campuses. Oh, measuring the stuff you actually teach your majors? We didn’t try that.” This is especially a problem because late-career students are presumably most invested in learning within their major and getting professionalized into a particular discipline.

I do think that “learning to learn,” general meta-academic skills, and cross-disciplinary skills like researching and critically evaluating sources are important and worth investigating. But let’s call tests of those things tests of those things instead of summaries of college learning writ large.

Neither Everything Nor Nothing

I have gotten in some trouble with peers in the humanities and social sciences in the past for offering qualified defenses of test instruments like the CLA+. To many, these tests are porting the worst kinds of testing mania into higher education and reducing the value of college to a number. I understand these critiques and think they have some validity, but I think they are somewhat misplaced.

First, it’s important to say that for all of the huffing and puffing of presidential administrations since the Reagan White House put out A Nation at Risk, there’s still very little in the way of overt federal pressure being placed on institutions to adopt tests like this, particularly in a high-stakes way. We tend to think of colleges and universities as politically powerless, but in fact they represent a powerful lobby and have proven to be able to defend their own independence. (Though not their funding, I’m very sorry to say.)

Second, I will again say that a great deal of the problem with standardized testing lies in the absurd scope of that testing. That is, the current mania for testing has convinced people that the only way to test is to do census-style testing – that is, testing all the students, all the time. But as I will continue to insist, the power of inferential statistics means that we can learn a great deal about the overall trends in college learning without overly burdening students or forcing professors to teach to the test. Scaling results up from carefully-collected, randomized and stratified samples is something that we do very well. We can have relatively small numbers of college students taking tests a couple times in their careers and still glean useful information about our schools and the system.
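
For a sense of how far a modest sample can take you, here’s a quick sketch with an invented population and invented strata; the numbers are mine, made up for illustration, not anyone’s real data. A proportionally allocated stratified sample of 1,500 students recovers the population mean to within a handful of points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 200,000 students in three strata
# (say, institution types), with different mean scores.
strata = {
    "community":   (80_000, 1050, 120),   # (size, mean, sd)
    "public_4yr":  (90_000, 1150, 130),
    "private_4yr": (30_000, 1250, 110),
}
scores = {name: rng.normal(mu, sd, n) for name, (n, mu, sd) in strata.items()}
pop_mean = np.mean(np.concatenate(list(scores.values())))

# Stratified sample: only 1,500 students total, allocated proportionally.
total_n = sum(n for n, _, _ in strata.values())
sample_size = 1500
est, var = 0.0, 0.0
for name, (n, _, _) in strata.items():
    share = n / total_n
    k = int(round(sample_size * share))
    s = rng.choice(scores[name], size=k, replace=False)
    est += share * s.mean()                      # stratified mean estimate
    var += share**2 * s.var(ddof=1) / k          # its sampling variance

moe = 1.96 * np.sqrt(var)                        # ~95% margin of error
print(f"population mean:     {pop_mean:.1f}")
print(f"stratified estimate: {est:.1f} +/- {moe:.1f}")
```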

Ultimately, I think we need to be doing something to demonstrate learning gains in college. Because the university is threatened. We have many enemies, and they are powerful. And unless we can make an affirmative case for learning, we will be defenseless. Should a test like the CLA+ be the only way we make that case? Of course not. Instead, we should use a variety of means, including tests like the CLA+ or Proficiency Profile, disciplinary tests developed by subject-matter experts and given to students in appropriate disciplines, faculty-led and controlled assessment of individual departments and programs, raw metrics like graduation rate and time-to-graduation, student satisfaction surveys like the Gallup-Purdue index, and broader, more humanistic observations of contemporary campus life. These instruments should be tools in our toolbelt, not the hammer that forces us to see a world of nails. And the best data available to us that utilizes this particular tool tells us the average American college is doing a pretty good job.

Study of the Week: Nicaraguan Sign Language and the Speaking Animal

As I’ve mentioned before, research into childhood development is tricky, thanks to ethical and practical constraints on what researchers can do. Consider randomized controlled experimental studies – that is, taking a group of test subjects, dividing them at random into a group that receives some sort of experimenter-determined influence and a group that does not, and noting the differences between the two groups after. This is considered the gold standard for making causal inferences, and it is the way that, for example, we test the efficacy of drugs that are in development.

But there are obvious constraints. For one, ethics prevents us from deliberately causing harm; you can’t apply the condition of abuse or malnutrition to children, of course. But we remain interested in how these conditions affect development. There are also practical constraints. Even aside from ethical reasons, we have no way to randomly assign dyslexia or aphasia to people. More pragmatically, in educational research we often can’t randomly assign condition to individual students in a class setting. It’s simply not feasible or fair to ask a teacher to give, say, 7 students one lesson plan and 13 students another in the same class setting. Typically we instead assign condition to classes rather than individuals – Class A doing the conventional lesson plan (control) and Class B a new one (test). But assigning at the class level raises issues for alpha (our Type I error rate) and statistical inference, though we can address them with things like hierarchical linear models and other types of nested models.
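
For the statistically curious, here’s a minimal sketch of the kind of nested model I mean, using invented classroom data and the mixedlm routine in statsmodels. It’s an illustration of the general approach, not the model from any particular study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Invented data: 20 classrooms of 25 students. Whole classrooms are
# assigned to the new lesson plan, so students within a class are not
# independent observations.
rows = []
for classroom in range(20):
    treated = classroom % 2              # alternate classes get the new plan
    class_effect = rng.normal(0, 5)      # shared classroom-level noise
    for student in range(25):
        score = 70 + 3 * treated + class_effect + rng.normal(0, 10)
        rows.append({"classroom": classroom, "treated": treated, "score": score})
df = pd.DataFrame(rows)

# A mixed (hierarchical) model with a random intercept per classroom
# keeps the inference honest about the clustered assignment.
model = smf.mixedlm("score ~ treated", df, groups=df["classroom"]).fit()
print(model.summary())
```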

One of the places where research constraints have significantly limited our ability to investigate core developmental questions lies in language acquisition. The degree to which language is learned vs. acquired – indeed, what that distinction even means – remains somewhat unsettled. So does the question of a “critical period,” the idea that there is a time in the life cycle when children’s brains are particularly attuned to acquiring language. We have developed this intuition, in part, from the sturdy observation that children seem to be better able to acquire (learn?) a second language than adults, particularly when in an immersive environment. But conclusive proof of the existence of the critical period remains elusive, in part because we have limited ability to study what happens when children grow up in unusually linguistically poor environments. Given that we can’t go out of our way to stunt the language development of test subjects, for obvious reasons, we often have to turn to “natural experiments” – that is, situations where circumstances have conspired to create something like a natural assignment to condition. There are all sorts of complicated epistemological questions in these cases, but natural experiments are often the best we can do given our constraints.

Today’s Study of the Week concerns the development of language and just how powerful the human language instinct is, through one of the most fascinating natural experiments I’ve ever read about: the story of Nicaraguan Sign Language.

The Poverty of the Stimulus

Modern syntax has a kind of communal origin story, and like so many other aspects of linguistics, that origin story comes back to Noam Chomsky.

Chomsky famously rewrote the study of language, placing theoretical syntax at the fore of the field and declaring the study of language to be fundamentally part of cognitive science. Since at least Ferdinand de Saussure, the emphasis on mental structures (as opposed to exploring the development and “meaning” of arbitrary phonological shapes of particular words in particular languages) has been core to linguistics, but Chomsky took this work further than anyone, leading to a state where Chomsky acolyte Akeel Bilgrami could write in a 2015 book,

[The theory of language] is not a theory about external utterances, nor is it, therefore, about a social phenomenon. The nomenclature to capture this latter distinction between what is individual/internal/intensional and what is externalized/social is I-languages and E-languages respectively. It is I-languages alone that can be the subject of scientific study, not E-languages.

This is High Chomskyanism at its most frustrating, as I’ve written about before, but for our purposes it will suffice to define his project. And this focus on the interior, cognitive dimensions of language is intimately connected to a central concept of Chomsky’s approach: the notion that the language capacity is a part of our genetic endowment, and that learning language is therefore fundamentally different from learning algebra or how to skip a rock. It was this contention that led to a book review which helped to make Chomsky’s reputation and which is part of the origin story of modern linguistics I mentioned above.

The book in question was written by BF Skinner, then the most famous and influential mind in psychology and human development. Skinner’s ideas – still prominent in parts of human psychology and especially in animal behaviorism – were centered on the idea of conditioning, the notion that behavior is the product of external systems of reward. Pavlov’s dog salivated when it heard a bell ring because it had been conditioned to associate that bell with a food reward; a neglected child throws tantrums because she has been conditioned to see doing so as the only way to get the reward of attention. In basic terms we can still see the truth of this essential insight, that behaviors that are rewarded tend to be repeated.

But Skinner, like a lot of great minds, over-generalized his most famous theory, seeking to push it into more and more domains. In particular, he wrote a book called Verbal Behavior, published in 1957, that sought to explain language acquisition through behaviorist principles. A child cries in a particular way, his mother learns that this cry means “I’m hungry,” and he is rewarded for communicating. As he grows, he makes more and more sophisticated sounds, modeling the words he hears around him, and is similarly rewarded. Eventually he develops adult language capacity through stimulus and reward.

Chomsky wrote a famous, famously scathing review of Skinner’s book. I suspect that the review has become somewhat overemphasized in the story of Chomsky and how he came to dominate contemporary linguistics, but there’s no question that it was a prominent early moment for his theories, or that taking on the old guard in that way was highly symbolic. Central to Chomsky’s critique was the concept of the “poverty of the stimulus.” The poverty of the stimulus argument depends on a simple observation: what even very young users of language can accomplish with language far exceeds what they’ve been exposed to. That is, the stimulus (their observations of the language use of others) is insufficient (impoverished) as an explanation for what they can do. The average five year old is perfectly capable of combining words and phrases in such a way that they produce a sentence that has never before been uttered in the history of the world. Language is made up of discrete parts that humans arrange into meaningful expressions without consciously controlling them, and the implicit rules through which we do this arranging are the natural subject of linguistic study. Or so say the followers of Chomsky.

Chomsky used the poverty of the stimulus theory as indirect evidence for the notion of a genetic language capacity, something unique to the human genome that gives us the ability to master the incredible complexity of real grammars without ever being formally trained in their use. (You don’t, I hope, sit there showing your 6 month old flash cards of the parts of speech.) Rather than arriving at language tabula rasa and acquiring it through the kind of rote practice by which you learn to tie a tie, you “learn language in a way a bird learns wings,” as he once put it. We are the language-using animal, and this is the product of evolution and not of culture.

Where exactly this language capacity resides in the brain, and where the language instinct can be found in the genome, remain unanswered questions. And as with any prominent theory there are detractors and skeptics. The inability to conduct an experimental study to see how linguistic deprivation might influence the acquisition of language complicates our ability to sort these questions. But in the late 1970s, the government of Nicaragua inadvertently provided us with clues.

Nicaraguan Sign Language: the Language of the Truly Stimulus-Impoverished

For much of the history of the Nicaraguan state, deaf children had essentially no formal support from the government. As has been sadly typical in the history of children with disabilities, deaf Nicaraguan children were often socially isolated, kept at home away from peers and the greater community, typically lacking any formal education at all. Obviously, lacking the ability to hear and often to speak, and never being taught any kind of formal sign language, these children faced enormous obstacles to communicating effectively.

But in the late 1970s, as part of a broader set of social reforms, the government opened an elementary school for children with disabilities. Later on they would found a similarly-focused vocational school. For the first time, these communicatively-deprived children were granted the opportunity to interact with peers and learn in a formal setting. But they were still not granted the chance to be fully-functioning communicators. The school administrators had decided, for whatever reason, to restrict the students to signing – letter by letter – in Spanish, rather than teaching them a mature sign language. The communicative insufficiency of this should be obvious. Try speaking to someone you know by spelling out each word by the letter and you’ll see what I mean. So the kids took matters into their own hands: they generated their own human language.

Within a few years, a complex and robust language, Nicaraguan Sign Language (NSL), had been born. It has been passed down through generations of deaf Nicaraguan children, advancing and evolving quickly as it does. In time, researchers realized what this represented – a natural experiment on the linguistic capacities held by children who faced enormous disadvantages, and the chance to watch the birth of a new language in real time.

The Study

The story has been told in several places, but today I encourage you to read a 2004 study in Science on NSL. The study, by Ann Senghas, Sotaro Kita, and Asli Ozyurek, uses NSL to consider how languages develop over time. In particular, they use signs for types of motion to show how a language develops the ability to talk in greater abstraction and thus becomes more sophisticated and flexible. Why motion? Think about the nature of signing. If I want to get the concept of a wave across to you, I would naturally tend to make a wave with my hand, as remains the sign for wave in American Sign Language. But note that communicating iconically – that is, by matching something about the thing being referred to with something in the sign you’re using to refer to it – is in the broader sense unsophisticated or insufficient. What functioning languages must do is present the opportunity for abstraction and segmentation. Pictographic languages, where words are images of the things they stand for, are primitive because they prevent us from moving from those specifics to more general ideas. Instead, mature languages combine several key properties that make them flexible and useful:

We focus here on two particular properties of language: discreteness and combinatorial patterning. Every language consists of a finite set of recombinable parts. These basic elements are perceived categorically, not continuously, and are organized in a principled, hierarchical fashion. For example, we have discrete sounds that are combined to form words, that are combined to form phrases, and then sentences, and so on. Even those aspects of the world that are experienced as continuous and holistic are represented with language that is discrete and combinatorial. Together, these properties make it possible to generate an infinite number of expressions with a finite system.

The researchers therefore were interested in seeing how these properties were developing in NSL. So they divided their participants into different cohorts of aptitude and fluency in NSL, to see how much more abstracted the motion signs of the advanced NSL users might be. They found that in fact the more advanced cohort was significantly more likely to use the type of discrete patterning described above. That is, the more sophisticated speakers were, the more abstracted their description of motion was even though motion signs can be easily understood outside of a sign language context – and despite the fact that there are trade-offs here:

Note that this change to the language, in the short term, entails a loss of information. When representations express manner and path separately, it is no longer iconically clear that the two aspects of movement occurred simultaneously, within a single event. For example, roll followed by downward might have instead referred to two separate events, meaning “rolling, then descending.” However, the communicative power gained by combining elements more than offsets this potential for ambiguity

This is essentially the deal that we make – that our brains make – as we develop more sophisticated languages. We trade simplicity (a picture of a sun for a system of sounds/letters that can be broken apart and assembled into a combination that conveys the abstracted idea “the sun”) for the capacity for complexity and abstraction. Given that many have argued that the development of sophisticated languages marks the beginning of humanity’s great intellectual leap forward out of the pre-modern phase and into civilization, this seems like a good trade indeed.

The questions of how deeply embedded the language capacity may be in the human genome, and of what precisely that capacity determines in terms of rules for how languages work, remain unanswered. Going on 60 years into the Chomsky project, we still don’t have a comprehensive set of “rules” that the genetic language capacity enforces on human expression. But the idea that language is a kind of information that is learned like any other, through conscious absorption and rote practice, seems unsupportable. To explain what these children did, and what humans have done for millennia, it seems inarguable to me that there is some special capacity in the genome for language learning, as surely as there is something in our genome that compels us to walk on two feet.

Wherever the study of human language development goes next, I will always come back to the story of Nicaraguan Sign Language, which has fascinated me for years and which never fails to amaze me even after all that I’ve read about it. This is that story: a group of young deaf children, all of whom suffered from severely reduced exposure to language compared to most children, many or most of whom grew up in poverty, some of whom had various other cognitive and developmental disabilities, spontaneously generated a functioning human grammar despite the immense complexity of such grammar and in the face of adult authorities actively trying to stop them from doing so. That’s the potential of the human language instinct, which functions, as a distributed network, as the most powerful information system in the history of the world. Nothing else, not even the entirety of the internet, comes close.

Study of the Week: When It Comes to Student Satisfaction, Faculty Matter Most

In an excellent piece from years back arguing against the endless push for online-only college education, Williams College President Adam Falk wrote:

At Williams College, where I work, we’ve analyzed which educational inputs best predict progress in these deeper aspects of student learning. The answer is unambiguous: By far, the factor that correlates most highly with gains in these skills is the amount of personal contact a student has with professors. Not virtual contact, but interaction with real, live human beings, whether in the classroom, or in faculty offices, or in the dining halls. Nothing else—not the details of the curriculum, not the choice of major, not the student’s GPA—predicts self-reported gains in these critical capacities nearly as well as how much time a student spent with professors.

I was always intrigued by this, and it certainly matched with my experience as both a student and instructor – that what really makes a difference to individual college students is meaningful interaction with faculty. I have always wanted the ability to share this impression in a persuasive way. But Falk was making reference to internal research at one small college, and I had no way to responsibly discuss those findings. We just didn’t have the research to back it up.

But we do now, and we have for a couple years, and that research is our Study of the Week.

The Gallup-Purdue Index came together while I was getting my doctorate at the school. (Humorously, it was always referred to as the Purdue-Gallup Index there and no one would ever make note of the discrepancy.) It’s the biggest and highest-quality survey of post-graduate satisfaction with one’s collegiate education ever done, with about 30,000 responses in both 2014 and 2015. The survey is nationally representative across different educational and demographic groups. That might not leap out at you, but understand that in this kind of longitudinal survey data, that’s all very impressive. It’s incredibly difficult to get a good response rate in that kind of alumni research. For example, a division of my current institution sent an alumni survey out seeking some simple information on post-college outcomes. They restricted their call to those for whom they had contact information known to be less than two years old and who had had some form of interaction with the university in that span. Their response rate was something like 3%. That’s not at all unusual. The Gallup-Purdue Index had an advantage: they had the resources necessary to provide incentives for participants to complete the survey.

Why student satisfaction and not, say, income or unemployment rate? To the credit of the researchers, they wanted to use a deeper concept of the value of college than just pecuniary. Treating college as a simple dollars-and-cents investment happens in ed talk all the time – take this piece by Derek Thompson on the value of elite schools, relative to others, once you control for student ability effects. (There isn’t much.) I get why Thompson frames it this way, but this sort of thinking inevitably degrades the humanistic values on which college was built. It also unfairly discounts the value of the humanities and arts, which often attract precisely the kinds of students who are most likely to choose non-financial reward over financial. There’s still a lot of value talk in there – this was, it’s worth mentioning, a Mitch Daniels-influenced project, and Mitch is as corporate as higher ed gets – but I’ll take it as an antidote to Anthony Carnevale-style obsession with financial outcomes.

So what do the researchers find? To begin with: from the perspective of student satisfaction, American colleges and universities are doing a pretty good job – unless they’re for-profit schools.

alumni rated on a five-point scale whether they agreed their education was worth the cost. Given that many families invest heavily in higher education for their children, there should be little doubt about its value. However, only half of graduates overall (50%) were unequivocally positive in their response, giving the statement a 5 rating on the scale ranging from strongly disagree (1) to strongly agree (5). Another 27% rated their agreement at 4, while 23% gave it a 3 rating or less.

This figure varies only slightly between alumni of public universities (52%) and alumni of private nonprofit universities (47%), but it drops sharply to 26% among graduates of private for-profit universities. Alumni from for-profit schools are disproportionately minorities or first-generation college students and are substantially more likely than those from public or private nonprofit schools to have taken out $25,000 or more in student loans.

This tracks with a broader sense I’ve cultivated for years, particularly in assembling my dissertation research on standardized testing of college learning, that the common perception that we are in some sort of college quality crisis is incorrect. I suspect that the average college graduate, as opposed to student, learns a great deal and goes on to do OK. The problems, financial and otherwise, mostly afflict the students we have to sadly label “some college” – those who have taken on student loan debt for degrees that they then don’t complete. They are a tragic case and, like student loan borrowers writ large, they desperately need government-provided financial relief.

(Incidentally, this dynamic where people think there’s a crisis with colleges writ large but rate their own school highly happens with parents and public schools too.)

More important than the overall quality findings, this research tells us what ought to have always been clear: that faculty, and the ability of faculty to form meaningful relationships with students, are the most important part of a satisfying education. Check it out.

The three strongest predictors of seeing your college education as worth the cost are about faculty. And not just about faculty in general, but about relationships with faculty, the ability to have meaningful interactions with them.

There are certainly conclusions you could draw from this that not all faculty would like. In particular, this would lend credence to those who think that faculty at research universities are too disconnected from their students because of the drive for prestige and the pressure to publish. But two conclusions that most any faculty members could get on board with are these: one, that the adjunctification of the American university cuts directly against the interests of students, and two, that online-only or online-dominant education cuts students out of the most important aspect of college.

The all-adjunct teaching corps is bad not because adjuncts are bad educators or because they don’t care; many adjuncts are incredibly skilled teachers who care deeply for their students. Rather, adjuncts must teach so much, and balance such hectic schedules, that reaching out to students personally in this way is just not possible. At CUNY, I believe the rule is that adjuncts can work three classes on one campus and two at another. I know it’s difficult for people who don’t teach to understand how much work and time a college class takes up, but 5 classes in a semester is a ton. (Even at my most efficient, as a teacher I could never spend less than about three hours out of class for every one hour in class, between lesson planning, logistics, and grading.) Add to that having to go from, say, Hostos Community College in the Bronx to Brooklyn College here in south Brooklyn, and you can get a sense of the schedule constraints. Many adjuncts don’t even have an office space to meet students in outside of class. How could they be expected to form these essential relationships, then?

Online courses, meanwhile, cut the face-to-face interaction entirely. Can those relationships still be fostered? I’m sure it’s not impossible, but I’m also sure that it’s much, much harder. People have expressed confusion to me over my skepticism towards online classes for a long time, but I’m confused by their confusion. Teaching is a social activity. There are many aspects of face-to-face teaching that we don’t even attempt to replicate in online spaces. As both a teacher and a student, I have found my online courses to be deeply alienating, lacking any of the organic sense of mutual commitment that is essential to good pedagogy. My biggest concern is the lack of human accountability in online education. As an educator, I often felt that my first and most important role was not as a dispenser of information, much less as an evaluator of progress, but as a human face – one that brought with it the sort of sympathetic responsibility that underwrites so much of social life. What I could offer to students was support, yes, but of a certain kind: the support that implies reciprocity, the support that comes packaged with the expectation that the student would then recognize his or her commitment to doing the work. Not for me, but for the project of education, for the process.

I know that this is all the sort of airy talk that a lot of policy and technology types wave away. But in a completely grounded and pragmatic sense, in my own teaching the students who did best were those who demonstrated a sense of personal accountability to me and my course. They were the ones who realized that a class is an agreement between two people, teacher and student, and that each owed something to the other. How could I foster that if I never saw anything else beyond text and numbers on a screen? And note too that the financial motivation for online courses is often put in stark terms: you can dramatically upscale the number of students in an online course per instructor. Well, you can’t dramatically upscale a teacher’s attention, and there is no digital substitute for personal connection.

Does this mean there’s no place for online courses? No. In big lecture hall-style classes, there isn’t a lot of mutual accountability and social interaction, either. I don’t doubt that online courses can be part of a strong education. But I feel very bad for any student who is forced to go through college without repeated and extended opportunities for faculty mentorship – particularly students who aren’t among the top high school graduates, given the durable finding that the students who do best in online courses are those who need the least help, likely because they already have the soft skills so many in the policy world take for granted.

The contemporary university is under enormous pressures, external and internal. We ask it for all sorts of things that cut against each other – educate more students, but be more selective; keep standards high, but graduate a higher percentage; move students through the system more cheaply, more quickly, and with higher quality. Meanwhile we lack a healthy job market for those who don’t have a college education, making the pressure only more intense. There are no magic bullets for this situation. But it is remarkable that so little of our attempts to make progress involve recognizing that the teachers are the heart of any institution of learning. We’ve systematically devalued the profession, doing everything we can to replace experienced, full-time, career academics with cheaper alternatives. Perhaps it’s time to listen to our intuition and to our alumni and begin to rebuild the professoriate, the ones who will ultimately shepherd our students to graduation. If we’re going to invest in college, let’s start there.

Study of the Week: Discipline Reform and Test Score Mania

This week’s study considers how quantitative educational indicators (read: test scores) are affected by serious disciplinary action against students.

The context

We’re in the midst of a criminal justice reform movement in this country. The regularity of police killing, particularly of black men, and our immense prison population have led to civic unrest and widespread perception that something needs to change. We’ve even seen rare bipartisan admissions that something has gone wrong, at least to a point. But we’ve made frustratingly little in the way of actual progress thus far.

One of the salutary aspects of this movement has been the insistence, by activists, on seeing crime and policing as part of broader social systems. You can’t meaningfully talk about how crime happens, or why abusive policing or over-incarceration happen, without talking about the socioeconomic conditions of this country. In particular, there’s been a great deal of interest in the “school to prison pipeline,” the way that some kids – particularly poor students of color – are set up to fail by the system. One aspect of our school system that clearly connects with criminal justice reform is our discipline system. Students who receive suspensions and other serious disciplinary action frequently struggle academically, and they are disproportionately likely to be students of color. As activists have argued, in many ways these students begin their lives in an overly punitive system and continue to suffer in that condition into adulthood.

In an era of test score mania, it’s inevitable that people will ask – how does academic discipline impact test scores? And how can we assess this relationship when there are such obvious confounds? In this week’s paper, the University of Arkansas’s Kaitlin P. Anderson, Gary W. Ritter, and Gema Zamarro attempt to explore that relationship, and arrive at surprising results.

The data

What’s really remarkable about this research is the size and quality of the data set, summarized like this:

This study uses six years of de-identified demographic, achievement (test score), and disciplinary data from all K-12 schools in Arkansas provided by the Arkansas Department of Education (2008-09 through 2013-14). Demographic data include race, gender, grade, special education status, limited English proficiency-status, and free-and-reduced-lunch (FRL) status.

That’s some serious data! We’re talking in the hundreds of thousands of observed students, with longitudinal (multiple observations over time) information and a good amount of demographic data for stratification. I’d kill for a data set like this. (If I could get such anonymized data for students who go through NYC public schools and enroll in CUNY, man.)

Why does the longitudinal aspect matter? Because of endogeneity.

The arrow of causation again, or endogeneity

In last week’s Study of the Week, I pointed out that experimental research in education is rare. Sometimes this is a resource issue, but mostly it’s an issue of ethics and practicality. Suspensions and their impact on academic performance are a perfect example: you can’t go randomly assigning the experimental condition of suspension to kids. That means we can’t be sure whether the negative academic outcomes typically associated with suspensions, discussed in the literature review of this study, are actually caused by them. It may instead be that kids who struggle academically are more likely to be suspended. You might presume that if one thing follows another that’s sufficient to prove causation, but what if some preceding event caused both? (Parents divorcing, say.) It’s tricky. Because experimental designs physically intercede and are randomly controlled, they don’t have this problem, but again, nobody should be suspending kids at random in the name of science.

This research question has an endogeneity problem, in other words. Endogeneity is a fancy statistical term that, like many, is often not used in a particularly precise way. The Wikipedia article for it is awful, but the first couple pages here are a good resource. In general, endogeneity means that one of the variables in your model depends on something else, inside or outside the model, in a way that isn’t captured by the relationship you’re actually studying. That is, there’s a hidden relationship lurking in your model that potentially confounds your ability to assess causation. Often this is defined as your error term being correlated with your independent variable(s) (your input variables, the predictors, the variables you suspect may influence the output variable you’re looking at).

Say you’re running a simple linear regression analysis and your model looks at the relationship between income and happiness as expressed on some interval scale. Your model will always include an error term, which contains all the stuff impacting your variable of interest (here happiness) that’s not captured by your model. That’s OK – real world models are never fully explanatory. Uncontrolled variability is inevitable and fine in many research situations. The trouble is that some of the untested variables, the error portion, are likely to be correlated with income. If you’re healthy you’re likely to have a better income. If you’re healthy you’re likely to be happier (depending on type of illness). If you just plug income in as a predictor of happiness, and income correlates with health, and health correlates with happiness, then you can end up overestimating the impact of income. If you’re really just looking for associations, no harm done. But if you want to make a causal inference, you’re asking for trouble. That’s (one type of) endogeneity.
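
Here’s a minimal simulation of that income/health/happiness story, with invented numbers, just to show the mechanics: leave health out of the model and the income coefficient gets inflated; put it in and you recover the true effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5_000

# Invented data: health pushes up both income and happiness.
health = rng.normal(0, 1, n)
income = 1.0 * health + rng.normal(0, 1, n)
happiness = 0.5 * income + 1.0 * health + rng.normal(0, 1, n)

# Naive model: income only. Health ends up in the error term, and the
# error term is therefore correlated with income -- endogeneity.
naive = sm.OLS(happiness, sm.add_constant(income)).fit()

# Controlling for health recovers the true income coefficient (0.5).
both = sm.OLS(happiness, sm.add_constant(np.column_stack([income, health]))).fit()

print("income coefficient, health omitted:", round(naive.params[1], 2))   # ~1.0, inflated
print("income coefficient, health included:", round(both.params[1], 2))   # ~0.5
```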

Now, there, you have an endogeneity problem that could be solved by putting more variables into your model, like some sort of indicator of health. But sometimes you have endogeneity that stems from the kind of arrow of causation question that I’ve talked about in this space before. The resource I linked to above details a common example. Actors who have more status are perceived as having more skill at acting. But of course having more skill at acting gives you more status as an actor. Again, if you’re just looking for association, no problem. But there’s no way to really dig out causation – and it doesn’t matter if you add more variables to the model. That problem pops up again and again in education research.

Endogeneity is discussed at length in this paper. The study’s authors attempt to confront it, first, by throwing in demographic variables that may help control for unexplained variation, such as whether students qualify for federal school lunch assistance (a standard proxy for socioeconomic status). They also use a fixed effects panel data model. Fixed effects models attempt to account for unexplained variation by looking at how particular variables change over time for an individual research subject/observation. Fixed effect data is longitudinal, in other words, rather than cross-sectional (looking at each subject/observation only once). Why does this matter? There’s a great explanation by example in this resource here regarding demand for a service and that service’s cost. By using a fixed effect model, you can look at correlations over time within a given subject or observation.

Suppose I took a bunch of homeless kids and looked at the relationship between the calories they consumed and their self-reported happiness. I do a regression and I find, surprisingly, that even among kids with an unusually high chance of being malnourished, calories are inversely correlated with self-reported happiness – the more calories, the lower the happiness. But we know that different kids have different metabolisms and different caloric needs. So now I take multiple observations of the same kids. I find that for each individual kid, rising caloric intake is positively correlated with happiness. Kids who consume fewer calories might be happier, but the idea that lower calories causes higher happiness has proven to be an illusion. Looking longitudinally shows that for each kid higher calories are associated with higher happiness. That’s the benefit of a fixed effect model. Make sense?
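
Here’s that calorie example as a quick simulation, with invented numbers; the point is just to show how demeaning within each kid, the basic move of a fixed effects estimator, flips the sign of the relationship.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Invented panel: 50 kids observed 8 times each. Kids with higher
# baseline caloric needs happen to report lower baseline happiness,
# but for any one kid, eating more goes with being happier.
rows = []
for kid in range(50):
    baseline_cal = rng.uniform(1200, 2600)
    baseline_happy = 10 - 0.004 * baseline_cal + rng.normal(0, 0.5)
    for t in range(8):
        cal = baseline_cal + rng.normal(0, 150)
        happy = baseline_happy + 0.003 * (cal - baseline_cal) + rng.normal(0, 0.3)
        rows.append({"kid": kid, "calories": cal, "happiness": happy})
df = pd.DataFrame(rows)

def slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Pooled (cross-sectional) slope: negative, because it mostly reflects
# differences *between* kids.
print("pooled slope:    ", round(slope(df["calories"], df["happiness"]), 4))

# Fixed-effects (within) slope: demean within each kid first, so only
# variation *within* the same kid over time is left. Positive.
demeaned = df[["calories", "happiness"]] - df.groupby("kid")[["calories", "happiness"]].transform("mean")
print("within-kid slope:", round(slope(demeaned["calories"], demeaned["happiness"]), 4))
```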

The authors of this study use a panel data (read: contains longitudinal data) fixed effects model as an attempt to confront the obvious confounds here. As they say, most prior research is simply correlational, using cross-sectional approaches that merely compare incidence of suspensions to academic outcomes. By introducing longitudinal data, they can use a fixed effects model to look at variation within particular students, which helps address endogeneity concerns. Do I understand everything going on in their model, statistically? I most certainly do not. So if I’ve gotten something wrong about how they’re attempting to address endogeneity with a fixed effects model, please write me an email and I’ll run a correction and credit you by name.

The findings

What the authors find, having used their complex model and their longitudinal data, is counterintuitive and goes against the large pool of correlational studies: students who receive serious disciplinary actions don’t suffer academically, at least in terms of test scores, when you control for other variables. In fact there are statistically significant but very small increases in test scores associated with serious disciplinary action. This is true for both math and language arts. The effects are small enough that they aren’t worth touting as a positive, in my view. (This is why we need to report effect sizes.) But still, the research directly cuts against the idea that serious disciplinary action hurts test scores.

This is just one study and will need replication. But it utilizes a very large, high-quality data set and attempts methodologically to address consistent problems with prior correlational research. So I would lend a good deal of credence to its findings. The question is, what do we do with it?

Keeping results separate from conclusions

To state the obvious: it’s important that we do research like this. We need to know these relationships. But it’s also incredibly important that we recognize the difference between the empirical outcomes of a given study and what our policy response should be. We need, in other words, to separate results from conclusions.

This research occurs in a period of dogged fixation on test scores by policy types. This is, as I’ve argued many times, a mistake. The tail has clearly come to wag the dog when it comes to test scores, with those quantitative indicators of success now pursued so doggedly that they have overwhelmed our interest in the lives of the very children we are meant to be advocating for. And while I don’t think many people will come out and say, “suspensions don’t hurt test scores, so let’s keep suspending so many kids,” this research comes in a policy context where test scores loom so large that they dominate the conversation.

To their credit, the authors of this study express the direct conclusion in a limited and fair way: “Based on our results, if policymakers continue to push for changes to disciplinary policies, they should do so for reasons other than the hypothesized negative impacts of exclusionary discipline on all students.” This is, I think, precisely the right way to frame this. We should not change disciplinary policy out of a concern for test scores. We should change disciplinary policy out of a concern for justice. Do the authors agree? They are cagey, but I suspect they show their hands several times in this research. They caution that the discipline reform movement is leading to deteriorating “school climate” measures and in general concern troll their way through the final paragraphs of their paper. I wish they would state what seems to me to be the most important point: that while we should empirically assess the relationship between discipline and test scores, as they have just done admirably well, the moral question of discipline reform is simply not related to that empirical question. When it comes to asking if we’re suspending too many kids, test scores are simply irrelevant.

I am not a “no testing, ever” guy. That would be strange, given that I spend a considerable portion of my professional life researching educational testing. I see tests as a useful tool – that is, they exist to satisfy some specific pragmatic human purpose and are valuable to us as long as they fulfill that purpose and their side effects are not so onerous that they overwhelm that positive benefit. As I have said for years, “no testing” or “test everyone all the time” is a false binary; we enjoy the power of inferential statistics, which make it possible to know how our students are doing at scale with great precision. And since relative standardized testing outcomes (that is, individual student performance relative to peers) tend to be remarkably static over life, we don’t have much reason to worry about test data suddenly going obsolete. Careful, responsibly-implemented random sampling with stratification can give us useful data without the social and emotional costs on children that limitless testing imposes. No kids lie awake at night crying because they’re stressed about having to take the NAEP.

The only people who are harmed by reducing the amount of testing in this way are the for-profit testing companies and ancillary businesses that suck up public funds for dubious value, and the politically motivated who use test scores as an instrument to bash teachers and schools. Whether you see their interests as equal to those of the people most directly affected – the students who must endure days of stress and boredom and the teachers who must turn their classes into little more than test prep factories – is an issue for you and your conscience.

Ultimately, the conclusions we must draw about the use of suspensions and other serious disciplinary actions must be moral and political in their nature. As such, good empiricism can function as evidence, context, and support, but it cannot solve the questions for us. To their credit, the researchers behind this study conclude by saying “as we seek to better understand these relationships, we must also consider the systemic effects.” Though I might not reach the same political conclusions as they do, I agree completely.

Many thanks to the American Prospect’s outstanding education reporter Rachel Cohen for bringing this study to my attention.