Campbell’s Law, statistical caveats, and standardized testing

A colleague of mine sent along this excellent piece by John Ewing of Math for America. It’s a very necessary and important critique of the use of value added models in education assessment. (It’s also – are you listening, Nick Kristof? – an excellent example of academic writing that is accessible and understandable.) I highly, highly encourage you to read the whole thing; it’s only five and a half pages. I want to pull out a couple pieces of this argument because I really think it’s immensely important for our policy debates.

First is Ewing’s reference to Campbell’s Law: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to measure.” This can be seen in a variety of arenas, with the Wire’s depiction of police stat-juking as a perfectly indicative example. In education, the most notorious examples are the widespread cheating observed in school systems like Washington DC during the Michelle Rhee era. What’s essential to understand is that Campbell’s Law is descriptive, not normative: this corruption will happen regardless of whether it should happen. A lot of ed reform types respond to cheating in the face of pressure to improve on standardized tests by groaning about the immorality of our public educators. I certainly don’t condone cheating. But when the ability to remain employed is tied to measures that educators have little ability to actually control, you are directly incentivizing cheating, and in fact often leaving people with a stark choice between cheating and losing their jobs. I’m not excusing cheating under those conditions. I’m arguing that it’s inevitable.

Second, Ewing elegantly distills the evidence that value added measures of teacher performance are, in a perverse way, an argument against the assumptions that undergird the ed reform movement: that teacher quality is a static value, which can be validly measured and quantified and reliably and consistently separated from student inputs, and that this quality will demonstrate over time which teachers to pay better and which to fire. As Ewing points out, the actual result of many or most value added models is to show wildly inconsistent results for individual teachers and schools. I really can’t overstate this: if we take value added models seriously in the way that ed reformers want us to, the only reasonable conclusion is that static teacher quality does not exist. These models return such wildly divergent results for individual teachers that, if you really believe they accurately reflect teacher quality, you must conclude that teacher quality is so variable from year to year and class to class that it is next to impossible to predict, and thus to fairly reward. Either extant value added models are invalid or unreliable indicators, or teaching quality is not stable.
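
To make the reliability point concrete, here is a minimal simulation sketch in Python. It is not any actual value added model, and the numbers in it (1,000 teachers, a noise level equal in size to the trait itself) are hypothetical choices made purely for illustration. The sketch assumes every teacher has a perfectly stable underlying quality and that each year’s estimate adds independent classroom-level noise. Even under that assumption, the estimates correlate only around .5 from year to year, and roughly a third of teachers flip from the top half of the ranking to the bottom half, or vice versa.

```python
# A toy illustration, not a real value added model: perfectly stable teacher
# quality plus year-specific noise (class composition, test error, etc.).
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 1000  # hypothetical sample size for illustration

# Assume quality is perfectly stable across years.
true_quality = rng.normal(0, 1, n_teachers)

# Hypothetical noise level: as much variance as the trait itself.
noise_sd = 1.0
estimate_y1 = true_quality + rng.normal(0, noise_sd, n_teachers)
estimate_y2 = true_quality + rng.normal(0, noise_sd, n_teachers)

# Year-to-year correlation of the two sets of estimates (expected around .5).
r = np.corrcoef(estimate_y1, estimate_y2)[0, 1]
print(f"year-to-year correlation of estimates: {r:.2f}")

# Share of teachers whose above/below-median ranking flips between years,
# even though their underlying quality never changed (roughly a third).
flipped = np.mean(
    (estimate_y1 > np.median(estimate_y1)) != (estimate_y2 > np.median(estimate_y2))
)
print(f"share whose ranking flips across the median: {flipped:.0%}")
```

The point of the sketch is simply that wild year-to-year swings are exactly what a noisy measure of even a perfectly stable trait produces, which is why the divergent results Ewing describes force you to choose between an unreliable instrument and an unstable trait.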

I say this all from the perspective of social science. I am not trying to make a political plea here; I’m saying it in the most ruthless sense of empirically justifiable policy. It’s a consistent and major frustration of mine that, in these debates, arguments about the soundness (or lack thereof) of teaching assessment regimes are so often dismissed as politicized defenses of the status quo. Ewing is a mathematician making a mathematician’s argument, and anyone defending value added models must address his concerns.

Finally and most importantly, there’s Ewing’s discussion of the way in which statistical and empirical caveats get sanded away over time. This is a constant frustration, and I think it’s especially relevant in light of recent discussions about the mutual mistrust and lack of understanding between academics and journalists. Ewing rightfully goes after the Los Angeles Times for its notorious publication of LA teacher value added metrics. As he points out, the reporters involved waved briefly at the many caveats and limitations that the developers of these metrics include, then completely ignored them. This is a dynamic I’ve found again and again in the popular media’s treatment of empirical research. It comes from simply reading abstracts or press releases without going into the actual methodology, and it’s very destructive.

So take the Collegiate Learning Assessment (CLA+), the subject of my dissertation. I have a complicated relationship to the test. I have a great many criticisms of this kind of instrument in general, and I think that there are specific dangers with this test in particular. At the same time, the procedures and documentation of the CLA+ are admirable in the way that they admit limitation, argue against a high-stakes approach to using the test, and are scrupulous in saying that the mechanism is specifically designed to measure learning in the aggregate, across instructors and departments, and cannot be used to assess between-department differences. This point is made again and again in the supporting documentation of the CLA+: the entire purpose of its mechanism is to show aspects of collegiate learning working in concert, and thus it cannot be used to assess the instructional quality of particular majors or instructors.

For these reasons among others, I see the CLA+ as a potentially better test than some of the proposed alternatives. As one tool among others, it could demonstrate the actual intellectual growth that I deeply believe goes on at college campuses, and enable students to show this growth to employers in a brutal labor market. But in order for the test to actually function the way its creators intended, actual, on-the-ground implementation at universities requires administrators to understand these caveats and limitations and to draw conclusions accordingly. As someone who is observing such an implementation in real time, I can tell you it’s not at all clear that the people pushing for this test here at Purdue really understand these limitations or take them seriously. You cannot hope to draw real-world value from any test or empirical research without scrupulously understanding and taking seriously the limitations that the developers or researchers announce. Otherwise, you end up making a significant investment of resources and effort and getting a lot of statistical noise back for your trouble.

What has been remarkable for me, as I have gradually acquired an education as a quantitative empirical researcher, is how more and more knowledge leads to greater skepticism, not less. This is not research nihilism; I am a researcher, after all, and I believe that we can learn many things about the world. But I cannot tell you how easy it is for these metrics to go wrong. They are so manipulable, so subject to corruption, and so easily misinterpreted if you aren’t rigorously careful. So I genuinely beg (beg) the popularizers and explainers out there: take these caveats and limitations very seriously. They are not the fine print; they are, often, the most essential part of the text. The work of explaining complex research to the public is noble and essential in our society. I am a bit cynical about the ability of the web to make prominent figures in the media more accessible, at this stage in the internet’s development, but I maintain a streak of hope in that regard. So I use my limited platform to ask people like Matt Yglesias and Ezra Klein and Dylan Matthews and those at Wonkbook and The Atlantic and Slate and The New York Times to please, please, please be skeptical and rigorous readers of this research. Used carefully, empirical research can improve our policy. But it’s always easier to simply become a nation of stats jukers, moving the metrics to move the metrics and avoiding uncomfortable conclusions.

4 responses

  1. Given that they are both g-loaded, the SAT and the CLA+ correlate very well (institutional correlations at around .9, which is remarkable). As someone who has researched the CLA+ at length, can you comment on the necessity for an additional psychometric instrument? Is this just neoliberal capitalism at work, or is there actually something new being added to the conversation?

    Secondly, can you comment on the black-white achievement gap on the CLA+, if any? Do we still see the standard 1 SD gulf, or does the CLA+ do better on this than other psychometric instruments?

  2. In your view, what is the problem tests like the CLA+ are meant to solve? Is it more 1) defense of already successful institutions against anti-education corporatist forces, or 2) actually sorting out whether college is worth it, since we have no other way of knowing? Or something else?
