Study of the Week: Discipline Reform and Test Score Mania

This week’s study considers how quantitative educational indicators (read: test scores) are affected by serious disciplinary action against students.

The context

We’re in the midst of a criminal justice reform movement in this country. The regularity of police killings, particularly of black men, and our immense prison population have led to civic unrest and a widespread perception that something needs to change. We’ve even seen rare bipartisan admissions that something has gone wrong, at least to a point. But we’ve made frustratingly little in the way of actual progress thus far.

One of the salutary aspects of this movement has been the insistence, by activists, on seeing crime and policing as part of broader social systems. You can’t meaningfully talk about how crime happens, or why abusive policing or over-incarceration happen, without talking about the socioeconomic conditions of this country. In particular, there’s been a great deal of interest in the “school to prison pipeline,” the way that some kids – particularly poor students of color – are set up to fail by the system. One aspect of our school system that clearly connects with criminal justice reform is our discipline system. Students who receive suspensions and other serious disciplinary action frequently struggle academically, and they are disproportionately likely to be students of color. As activists have argued, in many ways these students begin their lives in an overly punitive system and continue to suffer in that condition into adulthood.

In an era of test score mania, it’s inevitable that people will ask – how does academic discipline impact test scores? And how can we assess this relationship when there are such obvious confounds? In this week’s paper, the University of Arkansas’s Kaitlin P. Anderson, Gary W. Ritter, and Gema Zamarro attempt to explore that relationship, and arrive at surprising results.

The data

What’s really remarkable about this research is the size and quality of the data set, summarized like this:

This study uses six years of de-identified demographic, achievement (test score), and disciplinary data from all K-12 schools in Arkansas provided by the Arkansas Department of Education (2008-09 through 2013-14). Demographic data include race, gender, grade, special education status, limited English proficiency-status, and free-and-reduced-lunch (FRL) status.

That’s some serious data! We’re talking in the hundreds of thousands of observed students, with longitudinal (multiple observations over time) information and a good amount of demographic data for stratification. I’d kill for a data set like this. (If I could get such anonymized data for students who go through NYC public schools and enroll in CUNY, man.)

Why does the longitudinal aspect matter? Because of endogeneity.

The arrow of causation again, or endogeneity

In last week’s Study of the Week, I pointed out that experimental research in education is rare. Sometimes this is a resource issue, but mostly it’s an issue of ethics and practicality. Suspensions and their impact on academic performance are a perfect example: you can’t go randomly assigning the experimental condition of suspension to kids. That means that the negative academic outcomes typically associated with suspensions, discussed in the literature review of this study, might be caused by them. But it may instead be that kids who struggle academically are more likely to be suspended. You might presume that if one follows the other, that’s sufficient to prove causation, but what if there was some preceding event that caused both? (Parents divorcing, say.) It’s tricky. Because experiments intervene directly and assign conditions at random, they don’t have this problem, but again, nobody should be suspending kids at random in the name of science.

This research question has an endogeneity problem, in other words. Endogeneity is a fancy statistical term that, like many, is often not used in a particularly precise way. The Wikipedia entry for it is awful, but the first couple pages here are a good resource. In general, endogeneity means that something in your model depends on something else in your model in a way that’s not expressed in your model’s studied relationship. That is, there’s a hidden relationship within your model that potentially confounds your ability to assess causation. Often this is defined as your error term being correlated with your independent variable(s) (your input variables, the predictors, the variables you suspect may influence the output variable you’re looking at).

Say you’re running a simple linear regression analysis and your model looks at the relationship between income and happiness as expressed on some interval scale. Your model will always include an error term, which contains all the stuff impacting your variable of interest (here happiness) that’s not captured by your model. That’s OK – real world models are never fully explanatory. Uncontrolled variability is inevitable and fine in many research situations. The trouble is that some of the untested variables, the error portion, are likely to be correlated with income. If you’re healthy you’re likely to have a better income. If you’re healthy you’re likely to be happier (depending on type of illness). If you just plug income in as a predictor of happiness and income correlates with health and health correlates with happiness then you can end up overestimating the impact of income. If you’re really just looking for associations, no harm done. But if you want to make a causal inference, you’re asking for trouble. That’s (one type of) endogeneity.
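To make that concrete, here is a minimal simulation sketch of the income, health, and happiness story. Every number in it is invented for illustration (the coefficients, the sample size, the helper name `ols`); it isn’t drawn from any real data. It builds a world where health raises both income and happiness, then fits the regression with and without health in the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical data-generating process: health raises both income and happiness.
health = rng.normal(size=n)
income = 0.6 * health + rng.normal(size=n)
true_income_effect = 0.3
happiness = true_income_effect * income + 0.8 * health + rng.normal(size=n)


def ols(y, *predictors):
    """Least-squares coefficients for y on an intercept plus the predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Health is left in the error term, so the error is correlated with income.
naive = ols(happiness, income)[1]
# Health is in the model, so it no longer hides in the error term.
controlled = ols(happiness, income, health)[1]

print(f"true effect of income:      {true_income_effect:.2f}")
print(f"income only (endogenous):   {naive:.2f}")
print(f"income with health control: {controlled:.2f}")
```

With these made-up coefficients, the income-only estimate comes out at roughly double the true effect, while the version that includes health lands close to it. That gap is the overestimate described above.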

Now, there, you have an endogeneity problem that could be solved by putting more variables into your model, like some sort of indicator of health. But sometimes you have endogeneity that stems from the kind of arrow of causation question that I’ve talked about in this space before. The resource I linked to above details a common example. Actors who have more status are perceived as having more skill at acting. But of course having more skill at acting gives you more status as an actor. Again, if you’re just looking for association, no problem. But there’s no way to really dig out causation – and it doesn’t matter if you add more variables to the model. That problem pops up again and again in education research.
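The feedback-loop version can be sketched the same way. This is a toy simulation under assumptions I’m making up for the purpose (the coefficients `a` and `b`, the noise terms, the sample size): skill raises status, status raises perceived skill, and there is no omitted variable anywhere in the system.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical feedback loop (both coefficients are invented):
#   status = b * skill  + v   (skill earns status)
#   skill  = a * status + u   (status inflates perceived skill)
a, b = 0.5, 0.5
u = rng.normal(size=n)
v = rng.normal(size=n)

# Solve the two equations simultaneously to get the values we would observe.
skill = (u + a * v) / (1 - a * b)
status = (b * u + v) / (1 - a * b)

# Simple regression slope of status on skill.
slope = np.cov(skill, status)[0, 1] / np.var(skill, ddof=1)

print(f"true effect of skill on status: {b:.2f}")
print(f"regression estimate:            {slope:.2f}")
```

The regression slope overshoots `b` even though nothing has been left out of the model. The error term `v` feeds into skill through status, so the predictor is correlated with the error by construction, and there is no extra control variable to add that would fix it.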

Endogeneity is discussed at length in this paper. The study’s authors attempt to confront it, first, by throwing in demographic variables that may help control for unexplained variation, such as whether students qualify for federal school lunch assistance (a standard proxy for socioeconomic status). They also use a fixed effects panel data model. Fixed effects models attempt to account for unexplained variation by looking at how particular variables change over time for an individual research subject/observation, which strips out any stable differences between subjects. The data such a model uses is longitudinal, in other words, rather than cross-sectional (looking at each subject/observation only once). Why does this matter? There’s a great explanation by example in this resource here regarding demand for a service and that service’s cost. By using a fixed effects model, you can look at correlations over time within a given subject or observation.

Suppose I took a bunch of homeless kids and looked at the relationship between the calories they consumed and their self-reported happiness. I do a regression and I find, surprisingly, that even among kids with an unusually high chance of being malnourished, calories are inversely correlated with self-reported happiness – the more calories, the lower the happiness. But we know that different kids have different metabolisms and different caloric needs. So now I take multiple observations of the same kids. I find that for each individual kid, rising caloric intake is positively correlated with happiness. Kids who consume fewer calories might be happier, but the idea that lower calories cause higher happiness has proven to be an illusion. Looking longitudinally shows that for each kid higher calories are associated with higher happiness. That’s the benefit of a fixed effects model. Make sense?
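Here is a rough sketch of that calorie example in code. All of it is hypothetical: the number of kids, the scale of the happiness measure, and the assumption that kids with higher caloric needs eat more than other kids but still fall short of what they need. The “within” estimate subtracts each kid’s own averages before regressing, which is one standard way to compute a fixed effects estimate. It is not meant to reproduce the model in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n_kids, n_obs = 500, 6
true_within_effect = 0.01  # hypothetical happiness points per extra calorie

# Each kid has a stable, unobserved caloric need (relative to the average kid).
need = rng.normal(0, 300, size=n_kids)
# Assumption: kids with higher needs eat more, but only half the extra need is met.
usual_intake = 0.5 * need
kid = np.repeat(np.arange(n_kids), n_obs)  # kid index for every observation

calories = usual_intake[kid] + rng.normal(0, 50, size=n_kids * n_obs)
happiness = (
    5
    + true_within_effect * (calories - need[kid])  # what matters is intake vs. need
    + rng.normal(0, 0.2, size=n_kids * n_obs)
)


def slope(x, y):
    """Simple regression slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def demean_by_kid(values):
    """Subtract each kid's own average, leaving only within-kid variation."""
    kid_means = np.bincount(kid, weights=values) / n_obs
    return values - kid_means[kid]


pooled = slope(calories, happiness)  # compares kids to each other
within = slope(demean_by_kid(calories), demean_by_kid(happiness))  # fixed effects

print(f"pooled slope: {pooled:+.4f}")
print(f"within slope: {within:+.4f}")
```

The pooled slope comes out negative because it mostly compares different kids, whose needs differ. The within slope comes out positive and close to the effect built into the simulation, because each kid’s stable need cancels out when you subtract that kid’s own averages.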

The authors of this study use a panel data (read: contains longitudinal data) fixed effects model as an attempt to confront the obvious confounds here. As they say, most prior research is simply correlational, using cross-sectional approaches that merely compare incidence of suspensions to academic outcomes. By introducing longitudinal data, they can use a fixed effects model to look at variation within particular students, which helps address endogeneity concerns. Do I understand everything going on in their model, statistically? I most certainly do not. So if I’ve gotten something wrong about how they’re attempting to control endogeneity with a fixed effects model, please write me an email and I’ll run it and credit you by name.

The findings

What the authors find, having used their complex model and their longitudinal data, is counterintuitive and goes against the large pool of correlational studies: students who receive serious disciplinary actions don’t suffer academically, at least in terms of test scores, when you control for other variables. In fact there are statistically significant but very small increases in test scores associated with serious disciplinary action. This is true for both math and language arts. The effects are small enough that, in my view, they shouldn’t be presented as a real benefit. (This is why we need to report effect sizes.) But still, the research directly cuts against the idea that serious disciplinary action hurts test scores.
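The gap between “statistically significant” and “large enough to matter” is easy to see with simulated numbers. The sketch below uses invented figures, not the study’s results: two groups of 200,000 students whose true average scores differ by 0.02 standard deviations.

```python
import math

import numpy as np

rng = np.random.default_rng(3)
n = 200_000  # students per group; hypothetical

# Scores in standard-deviation units, with a made-up true gap of 0.02 SD.
comparison = rng.normal(0.00, 1.0, size=n)
disciplined = rng.normal(0.02, 1.0, size=n)

diff = disciplined.mean() - comparison.mean()
pooled_sd = math.sqrt((disciplined.var(ddof=1) + comparison.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd  # effect size: the gap in SD units

standard_error = pooled_sd * math.sqrt(2 / n)
z = diff / standard_error
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation

print(f"effect size (Cohen's d): {cohens_d:.3f}")
print(f"p-value:                 {p_value:.2g}")
```

With samples this large, a gap of a couple hundredths of a standard deviation clears the usual significance threshold easily. The p-value only says the gap probably isn’t zero; the effect size says how big it is.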

This is just one study and will need replication. But it utilizes a very large, high-quality data set and attempts methodologically to address consistent problems with prior correlational research. So I would lend a good deal of credence to its findings. The question is, what do we do with it?

Keeping results separate from conclusions

To state the obvious: it’s important that we do research like this. We need to know these relationships. But it’s also incredibly important that we recognize the difference between the empirical outcomes of a given study and what our policy response should be. We need, in other words, to separate results from conclusions.

This research occurs in a period of dogged fixation on test scores by policy types. This is, as I’ve argued many times, a mistake. The tail has clearly come to wag the dog when it comes to test scores, with those quantitative indicators of success now pursued so single-mindedly that they have overwhelmed our interest in the lives of the very children we are meant to be advocating for. And while I don’t think many people will come out and say, “suspensions don’t hurt test scores, so let’s keep suspending so many kids,” this research comes in a policy context where test scores loom so large that they dominate the conversation.

To their credit, the authors of this study express the direct conclusion in a limited and fair way: “Based on our results, if policymakers continue to push for changes to disciplinary policies, they should do so for reasons other than the hypothesized negative impacts of exclusionary discipline on all students.” This is, I think, precisely the right way to frame this. We should not change disciplinary policy out of a concern for test scores. We should change disciplinary policy out of a concern for justice. Do the authors agree? They are cagey, but I suspect they show their hands several times in this research. They caution that the discipline reform movement is leading to deteriorating “school climate” measures and in general concern troll their way through the final paragraphs of their paper. I wish they would state what seems to me to be the most important point: that while we should empirically assess the relationship between discipline and test scores, as they have just done admirably well, the moral question of discipline reform is simply not related to that empirical question. When it comes to asking if we’re suspending too many kids, test scores are simply irrelevant.

I am not a “no testing, ever” guy. That would be strange, given that I spend a considerable portion of my professional life researching educational testing. I see tests as a useful tool – that is, they exist to satisfy some specific pragmatic human purpose and are valuable to us as long as they fulfill that purpose and their side effects are not so onerous that they overwhelm that positive benefit. As I have said for years, “no testing” or “test everyone all the time” is a false binary; we enjoy the power of inferential statistics, which make it possible to know how our students are doing at scale with great precision. And since relative standardized testing outcomes (that is, individual student performance relative to peers) tend to be remarkably static over life, we don’t have much reason to worry about test data suddenly going obsolete. Careful, responsibly implemented random sampling with stratification can give us useful data without the social and emotional costs on children that limitless testing imposes. No kids lie awake at night crying because they’re stressed about having to take the NAEP.
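For a sense of how little testing a stratified sample requires, here is a small sketch. The population is entirely made up (three strata with invented sizes and score distributions), and the 1% sampling rate is just an assumption for illustration, not a recommendation or a description of how any real assessment samples.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical population: three strata with different sizes and score distributions.
strata = {
    "stratum_a": rng.normal(250, 30, size=150_000),
    "stratum_b": rng.normal(235, 30, size=100_000),
    "stratum_c": rng.normal(260, 30, size=50_000),
}
population_size = sum(len(scores) for scores in strata.values())
true_mean = sum(scores.sum() for scores in strata.values()) / population_size

sampling_rate = 0.01  # assumption: test 1% of students in each stratum
estimate = 0.0
n_tested = 0
for scores in strata.values():
    n_sample = round(len(scores) * sampling_rate)
    sample = rng.choice(scores, size=n_sample, replace=False)
    # Weight each stratum's sample mean by that stratum's share of the population.
    estimate += sample.mean() * len(scores) / population_size
    n_tested += n_sample

print(f"students in population: {population_size}")
print(f"students tested:        {n_tested}")
print(f"true mean score:        {true_mean:.1f}")
print(f"estimated mean score:   {estimate:.1f}")
```

In this toy setup, testing 3,000 students out of 300,000 gives an estimate of the average score that sits very close to the true value, and the other 297,000 kids never have to sit the test.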

The only people who are harmed by reducing the amount of testing in this way are the for-profit testing companies and ancillary businesses that suck up public funds for dubious value, and the politically motivated who use test scores as an instrument to bash teachers and schools. Whether you see their interests as equal to those of the people most directly affected – the students who must endure days of stress and boredom and the teachers who must turn their classes into little more than test prep factories – is an issue for you and your conscience.

Ultimately, the conclusions we must draw about the use of suspensions and other serious disciplinary actions must be moral and political in their nature. As such, good empiricism can function as evidence, context, and support, but it cannot solve the questions for us. To their credit, the researchers behind this study conclude by saying “as we seek to better understand these relationships, we must also consider the systemic effects.” Though I might not reach the same political conclusions as they do, I agree completely.

Many thanks to the American Prospect’s outstanding education reporter Rachel Cohen for bringing this study to my attention.