two economists ask teachers to behave as irrational actors

I was considering doing a front-to-back fisking of this interview of Raj Chetty, Professor of Economics at Stanford University, conducted by the libertarian economist Tyler Cowen. Despite Chetty’s obviously impressive credentials, he says several things in the interview that simply don’t hold up to scrutiny, in particular regarding the simultaneity problem and the impact of the shared environment. I’ve decided to just focus on one key point, though.

The standard neoliberal ed reform argument goes like this: the major entrenched socioeconomic and racial inequalities in this country are no excuse for poor quantitative outcomes for groups of students; teachers and schools, despite all of the evidence to the contrary, control most of the variation in educational outcomes; therefore our perceived education problems are the result of lazy, untalented teachers; introducing a market for schooling will force schools to get rid of those teachers and metrics will improve. Now this story has failed to play out this way again and again in places like Detroit and Washington DC, but we’ll let that slide for now. If we accept this argument on its own terms, we need to get many talented people into teaching and replace the hundreds of thousands of “bad” teachers we’d be getting rid of.

Ed reform types are typically cagey about the scale of teacher dismissals – they hate to actually come out and say “I’d like to get hundreds of thousands of teachers fired” – but based on their own numbers, their own claims about the size and extent of the problem, that’s what needs to happen. You can’t simultaneously say that there’s a nationwide education crisis that needs to be solved by firing teachers and avoid the conclusion that huge numbers need to be fired. If reformers claim that even one out of every ten public teachers needs to be let go (a low number in reform rhetoric), we’re talking about more than 300,000 fired teachers.

I’ve argued before that the idea that market economics are effective means to solve educational problems falls apart once you recognize that, unlike a factory building a widget, educators don’t control most of what contributes to a child’s learning outcomes. But suppose you do believe in the standard conservative economics take on school reform: how can Chetty’s ideas make sense, if we trust young workers in a labor market to act in their own rational best interest? Chetty believes that we need, at scale, to “either retrain or dismiss the teachers who are less effective, [to] substantially increase productivity without significantly increasing cost.” Without increasing costs means, in other words, without raising teacher salaries. The median teacher in this country makes ~$57,000 a year; the 75th percentile makes ~$73k, and the 25th percentile, ~$45k. Compare with median lawyer salaries well above $100,000 a year and median doctor salaries close to $200,000, or an average of $125,000+ for MBA graduates. So we’re not going to pay teachers more, and we’re going to have to erode labor protections enough to dismiss those less effective teachers. Already this doesn’t sound like a good deal.

Of course, teachers don’t just suffer from low median wages compared to people with similar levels of schooling. They also suffer from far lower social status than they are typically afforded in other countries, as Dr. Chetty acknowledges:

Yeah, I think status seems incredibly important. My sense of the K–12 education system in the US is, unfortunately for many kids graduating from top colleges, teaching is not near the top of the list of professions that they’d consider. It’s partly because, in a sense, they can’t afford to be teachers because it entails such a pay cut. But also because they feel that it’s not the most prestigious career to pursue.

Why yes, Dr. Chetty, it’s true! Teachers don’t get a lot of prestige in this country! Maybe that’s because well-paid celebrity academics who make several times the median teacher salary – people like you – talk casually about firing them en masse and insist that they are the source of poor metrics! The ed reform movement has insulted the profession of public school teacher for years. Popular expressions of that philosophy, like the execrable documentary Waiting for “Superman,” have contributed to widespread assumptions that students are failing because their teachers are lazy and corrupt. How can a political movement that has relentlessly insulted the teaching profession not contribute to declining interest in being part of that profession?

Here in New York, the numbers are clear: we’re already facing a serious teacher shortage.

What Chetty and Cowen are asking for makes no sense according to their own manner of thinking. Dr. Chetty, Dr. Cowen: there is no bullpen. Even if I thought that teachers controlled far more of the variance in quantitative education metrics than I do, and even if I didn’t have objections about fair labor practices against removing hundreds of thousands of teachers, we would be stuck with this simple fact. We do not have hundreds of thousands of talented young professionals, eager to forgo the far greater rewards available in the private sector, ready to jump in and start teaching. And we certainly won’t have such a thing if we share Chetty’s resistance to paying teachers more and his commitment to making it easier to fire them.

So: no higher salaries for a relatively low-paying profession, eroding the job security that is the most treasured benefit of the job, continuing to degrade and insult the current workforce as lazy and undeserving, getting rid of hundreds of thousands of them, and yet somehow attracting hundreds of thousands of more talented, more committed young workers to become teachers.

According to what school of economics, exactly, is such a thing possible?


Study of the Week: Better and Worse Ways to Attack Entrance Exams

For this week’s Study of the Week I want to look at standardized tests, the concept of validity, and how best – and worst – to criticize exams like the SAT and ACT. To begin, let’s consider what exactly it means to call such exams valid.

What is validity?

Validity is a multi-faceted concept that’s seen as a core aspect of test development. Like many subjects in psychometrics and stats, it tends to be used casually and referred to as something fairly simple, when in fact the concept is notoriously complex. Accepting that any one-sentence definition of validity is thus a distortion, generally we say that validity refers to the degree that a test measures that which it purports to measure. A test is more valid or less depending on its ability to actually capture the underlying traits we are interested in investigating through its mechanism. No test can ever be fully or perfectly validated; rather we can say that it is more or less valid. Validity is a vector, not a destination.

Validity is so complex, and so interesting, in part because it sits at the nexus of both quantitative and philosophical concerns. Concepts that we want to test may appear superficially simple but are often filled with hidden complexity. As I wrote in a past Study of the Week, talking about the related issues of construct and operationalization,

If we want to test reading ability, how would we go about doing that? A simple way might be to have a test subject read a book out loud. We might then decide if the subject can be put into the CAN READ or CAN’T READ pile. But of course that’s quite lacking in granularity and leaves us with a lot of questions. If a reader mispronounces a word but understands its meaning, does that mean they can’t read that word? How many words can a reader fail to read correctly in a given text before we sort them into the CAN’T READ pile? Clearly, reading isn’t really a binary activity. Some people are better or worse readers and some people can read harder or easier texts. What we need is a scale and a test to assign readers to it. What form should that scale take? How many questions is best? Should the test involve reading passages or reading sentences? Fill in the blank or multiple choice? Is the ability to spot grammatical errors in a text an aspect of reading, or is that a different construct? Is vocabulary knowledge a part of the construct of reading ability or a separate construct?

Questions such as these are endemic to test development, and frequently we are forced to make subjective decisions about how best to measure complex constructs of interest. Common to the quantitative social sciences, this subjective, theoretical side of validity is often written out of our conception of the topic, as we want to speak with the certainty of numbers and the authority of the “harder” sciences. But theory is inextricable from empiricism, and the more that we wish to hide it, the more subject we are to distortions that arise from failing to fully think through our theories and what they mean. Good empiricists know theory comes first; without it, the numbers are meaningless.

Validity has been subdivided into a large number of types, which reflect different goals and values within the test development process. Some examples include:

  • Predictive Validity: The ability of a test’s results to predict that which it should be able to predict if the test is in fact valid. If a reading test predicts whether students can in fact read texts of a given complexity or reading level, that would provide evidence of predictive validity. The SAT’s ability to predict the grades of college freshmen is a classic example.
  • Concurrent Validity: If a test’s results are strongly correlated with that of a test that measures similar constructs and which has itself been sufficiently validated, that provides evidence of concurrent validity. Of course, you have to be careful – two invalid tests might provide similar results but not tell us much of actual worth. Still, a test of quantitative reasoning and a test of math would be expected to be imperfectly yet moderately-to-strongly correlated if each is itself a valid test of the given construct.
  • Curricular Validity: As the name implies, curricular validity reflects the degree to which a test matches with a given curriculum. If a test of biology closely matches the content in the syllabus of that biology course, we would argue for high curricular validity. This is important because we can easily imagine a scenario where general ability in biology could be measured effectively by a test that lacked curricular validity – students who are strong in biology might score well on a test, and students who are poor would likely score poorly, even if that test didn’t closely match the curriculum. But that test would still not be a particularly valid measure of biology as learned in that class, so curricular validity would be low. This is often expressed as a matter of ethics.
  • Ecological Validity: Heading in a “softer” direction, ecological validity is often used to refer to the degree to which a test or similar assessment instrument matches the real-life contexts in which its consequences will be enacted. Take writing assessment. In previous generations, it was common for student writing ability to be tested through multiple choice tests on grammar and sentence combining. These tests were argued to be valid because their results tend to be highly correlated with the scores that students receive on written essay exams. But writing teachers objected, quite reasonably, that we should test student writing by having them write, even if those correlations are strong. This is an invocation of ecological validity and reflects a broader (and to me positive) effort to not reduce validity to narrowly numerical terms.

I could go on!

When we talk about entrance examinations like the SAT or GRE, we often fixate on predictive validity, for obvious reasons. If we’re using test scores as criteria for entry into selective institutions, we are making a set of claims about the relationship between those scores and the eventual performance of those students. Most importantly, we’re saying that the tests help us to know that students can complete a given college curriculum, that we’re not setting them up to fail by admitting them to a school where they are not academically prepared to thrive. This is, ostensibly, the first responsibility of the college admissions process. Ostensibly.

Of course, there are ceiling effects here, and a whole host of social and ethical concerns that predictive validity can’t address. I can’t find a link now but a while back a Harvard admissions officer admitted that something like 90% of the applicants have the academic ability to succeed at the school, and that much of the screening process had little to do with actual academic preparedness. This is a big subject that’s outside of the bounds of this week’s study.

The ACT: Still Predictively Valid

Today’s study, by Paul A. Westrick, Huy Le, Steven B. Robbins, Justine M. R. Radunzel, and Frank L. Schmidt, is a large-n (189,612) study about the predictive validity of the ACT, with analysis of the role of socioeconomic status (SES) and high school grades in retention and college grades. The researchers examined the outcomes of students who took the ACT and went on to enroll in 4-year institutions from 2000 to 2006.

The nut:

After corrections for range restriction, the estimated mean correlation between ACT scores and 1st-year GPA was .51, and the estimated mean correlation between high school GPA and 1st-year GPA was .58. In addition, the validity coefficients for ACT Composite score and high school GPA were found to be somewhat variable across institutions, with 90% of the coefficients estimated to fall between .43 and .60, and between .49 and .68, respectively (as indicated by the 90% credibility intervals). In contrast, after correcting for artifacts, the estimated mean correlation between SES and 1st-year GPA was only .24 and did not vary across institutions….

…1st-year GPA, the most proximal predictor of 2nd-year retention, had the strongest relationship (.41). ACT Composite scores (.19) and high school GPA (.21) were similar in the strength of their relationships with 2nd-year retention, and SES had the weakest relationship with 2nd-year retention (.10).
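
For readers wondering what a “correction for range restriction” involves: colleges only observe grades for the students they admit, so the spread of ACT scores in the enrolled sample is narrower than in the applicant pool, and that shrinks the observed correlation. Here is a minimal sketch of one textbook correction, usually called Thorndike’s Case II, for direct restriction on the predictor. This is an illustration of the general idea, not necessarily the procedure the study’s authors used; the function name and the example numbers are hypothetical.

```python
import math


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction for direct range restriction.

    r_restricted:    correlation observed in the selected (restricted) sample
    sd_unrestricted: predictor standard deviation in the full applicant pool
    sd_restricted:   predictor standard deviation in the selected sample
    """
    u = sd_unrestricted / sd_restricted  # how much wider the full pool is
    return (r_restricted * u) / math.sqrt(1 + r_restricted**2 * (u**2 - 1))


# Hypothetical numbers, chosen only to show the direction of the adjustment:
# an observed correlation of .40 where the applicant pool's spread of scores
# is 1.5 times the spread among enrolled students.
corrected = correct_for_range_restriction(0.40, sd_unrestricted=1.5, sd_restricted=1.0)
print(f"observed r = 0.40, corrected r = {corrected:.2f}")
```

When the two standard deviations are equal the function returns the observed correlation unchanged; the more selective the sample, the larger the upward adjustment.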

The results should be familiar to anyone who has taken a good look at the literature on these tests, and to anyone who has been a regular reader of this blog. The ACT is in fact a pretty strong predictor of GPA, though far from a perfect one at .51. Context is key here; in the world of social sciences and education, .51 is an impressive degree of predictive validity for the criterion of interest. But there’s lots of wiggle! And I think that’s ultimately a good thing; it permits us to recognize that there are a variety of ways to effectively navigate the challenges of the college experience… and to fail to do so. (As the Study of the Week post linked to above notes, GPA is strongly influenced by Conscientiousness, the part of the Five Factor Model associated with persistence and delaying gratification.) We live in a world of variability, and no test can ever make perfectly accurate predictions about who will succeed or fail. Exceptions abound. Proponents of these tests will say, though, that they are probably much more valid predictors of college grades and dropout rates than more subjective criteria like essays and extracurricular activities. And they have a point.

Does the fact that SES correlates “only” at .24 with college GPA mean SES doesn’t matter? Of course not. That level of correlation for a variable that is truly construct-irrelevant and which has such obvious social justice dimensions is notable even if it’s less powerful than some would suspect. It simply means that we should take care not to exaggerate that relationship, or the relationship between SES and performance on tests like the ACT and SAT, which is similar at about .25 in the best data known to me. Again: clearly that is a relevant relationship, and clearly it does not support the notion that these tests only reflect differences in SES.
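
One rough way to get a feel for these numbers is to square each correlation, which gives the share of variance in first-year GPA that a predictor accounts for in a simple linear relationship. The sketch below uses only the three coefficients quoted from the study; the variable and function names are mine, and treating each predictor in isolation is a simplifying assumption, since the predictors overlap with one another.

```python
# Estimated mean correlations with 1st-year GPA, as quoted from the study.
correlations = {
    "ACT Composite": 0.51,
    "High school GPA": 0.58,
    "SES": 0.24,
}


def shared_variance(r):
    """Proportion of variance shared by two variables correlated at r."""
    return r ** 2


for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}, r^2 = {shared_variance(r):.2f}")
```

Read this way, ACT scores account for about 26% of the variance in first-year GPA, high school GPA about 34%, and SES about 6%: real relationships in every case, with plenty of wiggle left over.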

Ultimately, every read I have of the extant evidence demonstrates that tests like the SAT and ACT are moderately to highly effective at predicting which students will succeed in terms of college GPA and retention rates. They are not perfect and should not be treated as such, so we should use other types of evidence such as high school grades and other, “soft” factors in our college admissions procedures – in other words, what we already do – if we’re primarily concerned with screening for prerequisite ability. Does that mean I have no objections to these tests or their use? Not at all. It just means that I want to make the right kinds of criticisms.

Don’t Criticize Strength, Criticize Weakness

A theme that I will return to again and again in this space is that we need to consider education and its place in society from a high enough level to think coherently. Critics of the SAT and ACT tend to pitch their criticisms at a level that does them no good.

So take this piece in Slate from a couple of enthusiastic SAT (and IQ) proponents. In it, they take several liberal academics to task for making inaccurate claims about the SAT, in particular the idea that the SAT only measures how well you take the SAT. As the authors say, the evidence against this is overwhelming; the SAT, like the ACT, is and has always been an effective predictor of college grades and retention rates, which is precisely what the test is meant to predict. The big testing companies invest a great deal of money and effort in making them predictively valid. (And a great deal of test taker time and effort, too, given that one section out of each given exam is “experimental,” unscored and used for the production of future tests.) When you attack the predictive validity of these tests – their ability to make meaningful predictions about who will succeed and who will fail at college – you are attacking them at their strongest point. It’s like their critics are deliberately making the weakest critique possible.

“These tests are only proxies for socioeconomic status” is a factually incorrect attempt to make a criticism of how our educational system replicates received advantage. It fails because it does not operate at the right level of perspective. Here’s a better version, my version: “these tests are part of an educational system that reflects a narrow definition of student success that is based on the needs of capitalism, rather than a fuller, more humanistic definition of what it means to be a good student.”

These tests do indeed tell us how well students are likely to do in college and in turn provide some evidence of how well they will do in the working world. But college, like our educational system as a whole, has been tuned to attend to the needs of the market rather than to the broader needs of humanity. The former privileges the kind of abstract processing and brute reasoning skills that tests are good at measuring and which make one a good Facebook or Boeing employee. The latter would include things like ethical responsibility, aesthetic appreciation, elegance of expression, and dedication to equality, among other things, which tests are not well suited to measuring. A more egalitarian society would of course also have need for, and value, the raw processing power that we can test for effectively, but that strength would be correctly seen as just one value among many. To get there, though, we have to make much broader critiques and reforms of contemporary society than the “SAT just measures how well you take the SAT” crowd tend to engage in.

What I am asking for, in other words, is that we focus on telling the whole story rather than distorting what we know about part of the story. There is so much to criticize in our system and how it doles out rewards, so let’s attack weakness, not strength.


  • For some odd reason my last post, on public subsidies for wealthy Ivies in an era of austerity, did not get pushed out to RSS readers. Apparently that’s happened before. It’s frustrating and I’m not sure what’s happening. You can always follow the ANOVA’s Twitter account for new posts.
  • That post has been republished at Jacobin.
  • I was on the left-leaning military affairs podcast What a Hell of a Way to Die, talking about the GI Bill, recently.
  • I will be appearing on the Katie Halper Show on June 14th at Brooklyn Commons from 7 PM to 10 PM, with the brilliant Angela Nagle. It’s a fundraiser for WBAI, which is well worth supporting.
  • This past week’s book review, on “Rebekah Nathan”’s My Freshman Year, has been delayed and will be pushed out to Patreon patrons tomorrow afternoon. Archival content for patrons is coming later today.
  • Sometimes I write about non-education stuff on Medium. Here’s me on podcasts.
  • A couple people have asked about my and ResearchGate profiles, so I’ll just note that I often forget those exist and they are rarely if ever updated, though I’m going to make an effort to get them up to speed this week.
  • Coming soon: posts on teacher observations, corpus linguistics, and regression.

two sets of universities, two countries, two futures

image by Flickr user John Walker used under CC License

Today, Yale University’s 316th commencement will take place. Beaming young people and their proud parents will flock to the immaculate New Haven campus, eager to start their climb further up the ladder of American success. They know, as they surely knew the day they arrived, that their passage through such an august institution prepares them for a life of financial security and high social standing. They know, in other words, that as much as any young people, they are positioned to advance to the rarefied world of elite America.

Meanwhile, elsewhere in Connecticut, twelve community colleges and four public universities – including one found in the very same city – are starved to death by austerity and neoliberalism, as the Democratic governor and a Democratic state legislature in a rich blue state enact brutal cuts to education, social services, and mental health care, while fighting to cut taxes on corporations. The cuts to the Connecticut State University system are particularly devastating. They risk killing majors, shuttering departments, and destroying tenure. Programs that help shepherd a student body that comes disproportionately from non-traditional backgrounds, and thus needs help the most, are under threat. Classes may be cut from course schedules, making it even harder for working students and students who are parents to fit school into their schedules. In every way, a university system that already struggles to serve its students and its state thanks to resource constraints will be hurt even more.

These cuts are personal, for me, as I am a graduate of Central Connecticut State University in the CSU system. I will risk self-aggrandizement in saying that I am an example of the kind of success story that is routinely produced by the CSU system and systems like it. In my early 20s I was lost – orphaned, broke, alcoholic, struggling from then-undiagnosed mental illness, and completely without direction or a sense of purpose. But I took classes at the local community college for a year, then transferred to Central, where I met warm, engaging, committed educators who shepherded me through my education and showed me that I had skills and knowledge that had value – that my life had value. Today, I have a PhD, live in New York City, work at a wonderful public college myself, and have been published by some of the most prominent newspapers and magazines in the world. I owe all of that, without exception, to my time in the CSU system. It was there that I put my life back together, thanks to the dedication of the professionals who worked there and the relatively low tuition costs that enabled me to attend. I say with no exaggeration: the Connecticut State University system saved my life. And now, for shortfalls of less than $100 million a year, that system risks being permanently crippled.

To make all of this worse, down I-91 from my old university, Yale sits on a mountain of money, and yet receives more and more from public funds. The degree to which our government subsidizes the immensely wealthy Ivy League schools defies belief. A report from Open the Books, an organization that works for transparency in government spending, estimates that the federal and state governments spent over $40 billion on the Ivy League schools in tax exemptions, contracts, grants, and direct gifts from 2010 to 2015. The eight Ivy League universities – small, elite institutions from one region of the country that serve a tiny fraction of our college students and who could scarcely need government support less – receive more money annually from the federal government, on average, than 16 states. Four in ten students from the top 0.1% of families by income attend the Ivy League or similarly elite institutions; in 2012, 70% of Yale’s incoming freshmen came from families making more than $120,000; the median family income for Harvard students is triple the national average. The overwhelming majority of these students go on to lives of economic security, and many to the upper echelons of our economy.

Yet we continue to pour in government money to these rich institutions, and their wealthy alumni pour in hundreds of millions of dollars to their endowments untaxed, often invoking the spirit of giving and the need for equal opportunity while they do so. Meanwhile, we know empirically that systems like the CSU system, or the City University of New York system (where I now work), or the California State University system – America’s Great Working Class Colleges – do a far better job of creating social mobility than their elite counterparts. Yet each of these systems struggles under brutal cuts to their funding even though our country has never been richer.

What political philosophy, exactly, could possibly justify this condition? What ideology would conclude that this is a good use of resources, either public or philanthropic?

And yet the condition endures, even accelerates, year after year. No one seems to ask why those institutions who are objectively fabulously wealthy should receive such outlandish public subsidy, nor does anyone provide an answer as to why so many of our wealthiest continue to cut large checks for these institutions while our working class colleges, who need the money so desperately, starve. I am absolutely committed to the idea that higher education should be funded with public moneys, but I am also perplexed at the tendency of charitable donations to go where they are needed least of all. Where is Bill Gates to subsidize our working class colleges? Where is Mark Zuckerberg? Why does the philanthropic impulse, when it comes to higher education, always result in the rich getting richer? Connecticut is home to a small army of hedge fund managers and other incredibly wealthy types. I would love it if we could take their money by force for the good of all of society. But barring that, why don’t they use Connecticut’s starving public system for tax avoidance, rather than elite universities that are already filthy rich? Unless the entire point of such gifts is not to create equality of opportunity but to destroy it, to ensure that only those who start out at the top get to end up there. Our elite universities do many good things, but there is no question that they perpetuate and deepen inequality. That is in fact their most basic function: the replication of the ruling class.

I have no doubt that Yale’s class of 2017 is full of smart, talented, and passionate young people. I wish them the best. I also have no doubt that those among them who may not be talented or hardworking will be wholly inoculated from that condition thanks to the accidents of birth and privilege that helped them reach their rarefied station in the first place. As a socialist, I am not interested in making them more susceptible to material hardship and the vagaries of chance, but rather in giving everyone that same level of protection – and that means raiding the coffers of their school, their parents, and their future employers for the betterment of all. I also don’t doubt that, on balance, graduates of the Connecticut State system will succeed as well. College graduates writ large enjoy a substantial premium in income and unemployment rates over those without degrees, after all. But how hard will they have to struggle, as their instructors are stretched thinner and thinner by these brutal cuts? How many of them will sink deeper into debt as they are forced to take additional semesters of classes to complete their degrees? How many of them will drop out, thanks to these cuts, and suffer under the burden of student loan debt with no degree to help them secure a better life? How many people who could have been saved, as I was saved, now won’t be because of these cuts?

Today’s Yale commencement ceremony, of course, will be stocked with liberals, decent progressive folk who will tell you they believe in equality and social justice. The parents will mostly be liberal Democrats. The student ranks will be filled, no doubt, by genuine radicals, and the faculty with Marxists and socialists. They do good deeds at these places, such as how Yale’s community recently forced the school to change the name of Calhoun College, thanks to John C Calhoun’s history as a slave owner. I celebrate the activist zeal of all involved in such actions. Yet what Yale’s community can’t do – and perhaps wouldn’t, if it could – is to dismantle its place in the engine of American inequality. For all of the decent people involved in that institution, there is no chance that it will ever voluntarily abandon its role as an incubator of the ruling class. To do so would be unthinkable. That’s the reality of higher education: ostensible leftists preside over the ever-accelerating accumulation of power, money, and privilege. A better way is possible, but it cannot be achieved from within campus.

Until we reach that better world, we’ll be left with these ugly divides. In a sea of political ugliness it’s hard for me to imagine a more stark statement of America’s grand failures than this, a starving public university system that serves the poor and the brown and the needy, while next door a school for the 1% sits on $25 billion, untaxed. CSU students, like Yale students, will walk on campus lawns with caps and gowns, eager to begin their new lives. Like Yale students, CSU students will seek a better life. But how many of them will be stuck here in this other America, inequality America, austerity America, while those who’ve already been given so much are given even more?

Correction: Fixed some inaccurate wording in the fourth paragraph.

“Like the validity of intelligence testing, the heritability of intelligence is no longer scientifically contentious.”

That headline is taken from this piece on Vox, advocating a third way between “race realist”-style racism and liberal blank slatism. I’ve chosen it as the title for this post because I thought the reaction on social media showed the power of a headline for shaping popular perception of an argument.

Yesterday was an interesting day for me, watching that piece get passed around approvingly on Twitter and Facebook. Interesting because I wrote a post that made substantially the same argument as the Vox piece – that intelligence testing is predictively valid and that genes account for some of the individual variation in that testing, but that racial groupings are socially constructed and arguments about inherent racial inferiority are invalid (and bigoted). Yet I got a lot of heat for my post while the Vox piece was roundly praised. In particular, I was told often that a) IQ and its proxies are not valid and b) there is no genetic influence on psychological and cognitive outcomes. Both of these ideas are strenuously denied by the authors, who are (unlike me) experts in the relevant fields. Yet because the piece was pitched as anti-Charles Murray (and Sam Harris), objections to these points were muted to nonexistent. Still, it’s essential that progressive people recognize the most important contention of the Vox piece: that rejecting pseudoscientific racism does not undermine the predictive validity of IQ testing or the overwhelming evidence of polygenic heritability of cognitive outcomes. As they say,

a realistic acceptance of the facts about intelligence and genetics, tempered with an appreciation of the complexities and gaps in evidence and interpretation, does not commit the thoughtful scholar to Murrayism in either its right-leaning mainstream version or its more toxically racialist forms.

Obviously, when topics are as sensitive as these, first impressions are incredibly important. Still, it was simultaneously gratifying and aggravating to me. For example, I was accused of “cosplaying as Charles Murray” at Lawyers, Guns, and Money for my initial post on these topics, but a blogger there approvingly shared the Vox piece that made the same argument yesterday as a rejection of Murray. Such a fine line between imitation and rejection, when you don’t read carefully! Like I said in a brief post on Medium, same planet, different worlds.

In any event, I am encouraged by the success of that essay. The authors deserve credit for laying out the case so persuasively, and I think the worm is finally turning against blank slatism or IQ denialism as default progressive opinions. It is not necessary to embrace blank slate thinking to fight racism, and in fact our efforts to do so will be strengthened by our willingness to embrace genetic behaviorism.

Why It Matters

Some people ask me, why bother? Why not just leave this stuff alone, given that some have taken ideas in the same general orbit to truly noxious ends?

It matters that progressive people reject blank slatism because blank slatism is incorrect and we should tell the truth. But even from the most pragmatic or consequentialist perspective, we should accept the contemporary science on intelligence and heritability because doing so is the only way to effectively fight racism and white supremacy. By refusing to engage with the extant science on individual variation, we leave that field of argument entirely to those who would use it for the worst possible ends. As the authors say,

The left has another lesson to learn as well. If people with progressive political values, who reject claims of genetic determinism and pseudoscientific racialist speculation, abdicate their responsibility to engage with the science of human abilities and the genetics of human behavior, the field will come to be dominated by those who do not share those values. Liberals need not deny that intelligence is a real thing or that IQ tests measure something real about intelligence, that individuals and groups differ in measured IQ, or that individual differences are heritable in complex ways.

This is precisely my position. Don’t play to the alt-right frame; don’t help them make the case that progressives are anti-science or resistant to facts. Fight bad science with better. It is also my position, as readers of this blog know, that the assumption that all human beings have equal academic potential produces bad educational policy and leads inevitably to conservative “just deserts” economic attitudes and the social inequalities inherent to meritocracy.

As the authors note, the heritability of cognitive outcomes does not imply that they are not mutable. My position is not, and has never been, genetic determinism, which suggests that genes are destiny and that there are no other meaningful factors. But the influence of genes has to be part of a frank discussion about the fact that, summatively speaking, we have overwhelming evidence that not all individuals have the same academic potential. I have always actually been mechanism agnostic about this; that is, I am not sure to what degree the persistence of academic inequality is about genes or parenting or environment or resources or pure luck. I’m just sure that when we look at scale, the obvious conclusion has to be that not everyone has the same level of potential in all academic endeavors. (We should bear in mind that just as genetic influence does not make an outcome immutable, environmental influence does not mean it’s necessarily changeable.)

As someone interested in education policy, the obvious analytical conclusion is that we should stop trying to force students to reach universal arbitrary performance goals, as No Child Left Behind mandated and test mania encourages. As a socialist, the obvious moral conclusion is that we should move more and more material needs outside of the market economy and guarantee them via government, as our own inability to fully control our academic outcomes means that they cannot morally be used as justification for increased risk of poverty, hopelessness, and marginalization. As I’ve said many times, I believe the racial academic achievement gap will be closed, precisely because I don’t think races are meaningful categories or that they express intrinsic differences in human value. The question is, what happens after we close the racial achievement gap? Would it imply that a bigotry against those not blessed with strong academic potential would be justified? That’s what “meritocracy” argues, and I believe it’s a moral error. I believe that this tendency, called the hereditarian left by some, will only grow in a world where the logic of meritocracy has brought us spiraling inequality, the division of our country into essentially two different societies with profoundly different qualities of life.

To fight against it, we have to talk clearly and openly about these issues, and I think that Vox piece was a strong step in that direction. The Genetics & Human Agency project, which is led by Turkheimer and Harden, is a step in the right direction too. There are big moral and political questions here, and it’s up to us to answer them in a fair and humane way.



Campbell’s Law and the inevitability of school fraud

When discussing education policy there are few things more useful to understand than Campbell’s Law:

The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.

There’s a great piece on Campbell’s Law and testing mania from a mathematician here (PDF). The implications of this dynamic are obvious in ed circles. Why do teachers cheat in high-stakes environments? Why do charter schools cook the books while preaching about the importance of their mission? Why do I suspect that there’s a mountain of cheating and corruption going on in our ed data that hasn’t been discovered? Because of Campbell’s Law and the stubbornness of academic inequalities.

It’s hard to think of a more apt example of the influence of Campbell’s Law than this story out of a San Diego charter school. I highly encourage you to read this piece, as it’s the kind of diligent and important local journalism that is so deeply threatened today. It tells the story of a school where a leader with missionary zeal and a no excuses culture has conspired to pressure teachers into rampant grade inflation, sending young people into higher education with grades that don’t remotely match their skills.

Forgive this lengthy excerpt but I think it’s worth it here.

Teachers who have worked with 48-year-old Riveroll say he’s an inspiring leader, a visionary with extraordinary charisma and passion. Parents adore the man who has been named teacher of the year, educator of the year and selected as one of four principals nationwide to participate in the Public Education Leadership Program at Harvard University.

Yet data, documents and interviews contradict the Gompers brand of preparing every student for college. Gompers’ standardized test scores — one metric for college acceptance — are among the bottom of schools in San Diego County and California. These numbers are in contrast to students’ straight A grades with courses in precalculus, advanced biology and AP history.

Teachers say grades are inflated, and if students still can’t graduate, they are “counseled” to attend school elsewhere. The same teachers who praise Riveroll’s talent blame him, saying he shames educators who assign failing grades by telling them they are “murdering” kids.

“He knows he’s not allowed to say, ‘Change their grades or else,’” said former Gompers chemistry teacher Ben Davey.

“But he can say, ‘You’re killing these kids, are you sure you want to leave it as an F?’”

Many people have pointed to rising graduation rates as evidence of the effectiveness of ed reform. And more kids graduating from high school is a good thing indeed. But there are concerns about the graduation rate, involving juking the stats (again) and the fear that this stems from lower standards rather than objectively better students. (Take Renewal Schools here in NYC, for example.) I’m agnostic on the overall question, although I agree with pessimists who say, for example, that something like a third to a half of all graduating American high school students probably couldn’t demonstrate the level of algebra ability required to graduate from high school. The question is, what happens when you combine intense pressure from above to graduate students with the reality that (as I keep insisting) student outcomes are not nearly as plastic as policy types like to imagine they are?

Standardized tests show proficiency in math and English language arts at Gompers has gotten worse from 2011 to 2016. Forty percent of 11th-graders are below basic proficiency in English. Ninety-one percent didn’t reach the state standard for mathematics….

Six percent of Gompers students were considered “college-ready” based on their SAT scores in 2015-2016. Five percent based on their ACT.

Twenty-two percent of the Advanced Placement (AP) tests taken that year were marked three or higher, the level at which college credit is granted. San Diego Unified averaged 59 percent.

However, inewsource learned that of the 113 students graduating this year, not one earned a grade lower than a C in the first semester of their 2015-2016 school year. More than half of the class had straight A’s with courses in advanced chemistry, AP history and precalculus. Some of those students failed several lower-level classes the year before.

The class averaged a 4.7 GPA out of 5 the first half of their junior year.

It’s essential to say: this kind of dynamic, where a crusading spirit and insistence that everyone can achieve to the same level collides with the limitations of reality, makes fraud, lowered standards, or both inevitable. It’s an entirely predictable condition; as long as you make people’s jobs dependent on reaching metrics that they can’t reach legitimately, they will achieve them illegitimately. It doesn’t matter how much integrity they have. It doesn’t matter if they’re good people. It doesn’t matter if they’re really invested in their students’ success. Campbell’s Law is not a normative claim but an empirical observation; educational fraud does happen under these conditions, no matter what we think about it morally. And these lowered standards inevitably come back to bite us in the end, as these students go on to colleges where they either fail out from lack of prerequisite ability or are graduated into jobs they then can’t perform. That’s a problem here in the CUNY system, for example, where 57% of undergraduates can’t pass an algebra test. Advancing them in the system might seem humane at the individual level but in the broader perspective it’s just amplifying problems.

This is a story that some might imagine inspires a level of bitter cynicism in me. But I don’t feel embittered about the story; I just feel sad about it. It seems genuinely tragic to me, in that there are genuinely good intentions leading to these bad outcomes. The charter school world is no doubt full of profiteers and con men, but I also acknowledge that many people really believe their cheery, fingers-stuck-in-their-ears rhetoric about how every single child is capable of excelling. The problem is that this enthusiasm is destructive in a world where different students have very real differences in their individual ability and in the socioeconomic and environmental conditions in which they learn. Truly humane education policy would acknowledge those differences, not attempt to paper them over with cheerful, dishonest bromides. What we need to accept as a society is that what’s really “killing these kids” is not their lack of academic preparation for college but an economic system in which only those who are so prepared have a meaningful shot at a comfortable and secure life.

She remembers a strict but supportive director who valued 12-hour days out of his staff, along with sacrificing vacation time, career goals and a personal life. But she also remembers that Riveroll would bring her coffee after she’d put in late hours the night before.

“It is very challenging to balance this work,” said Parsons. “It is truly missionary work.”

So much pathology. If it is to be as effective and pragmatically useful as it can be at scale, teaching cannot be missionary work. This has always been the problem with the Dead Poets Society, one-inspiring-teacher-breaks-through-to-kids-in-the-ghetto narrative. Even if I were sure these narratives actually reflected what’s best for kids, they are by nature not scalable or subject to being instituted by policy. It’s a hard thing for teachers to accept, but true: by its very nature, inspiration cannot be required or replicated. It’s a beautiful thing when the lives of students are changed in that way. But mass education has to be a system that works in the mundane constraints of real life.

Acknowledging those constraints is necessary if we’re really committed to improving our system for the pragmatic benefit of all. And the first step to getting there is to recognize that insisting that all students can excel will inevitably result in these kinds of pleasant lies.

Study of the Week: What Actually Helps Poor Students? Human Beings

As I’ve said many times, a big part of improving our public debates about education (and, with hope, our policy) lies in having a more realistic attitude towards what policy and pedagogy are able to accomplish in terms of changing quantitative outcomes. We are subject to socioeconomic constraints which create persistent inequalities, such as the racial achievement gap; these may be fixable via direct socioeconomic policy (read: redistribution and hierarchy leveling), but have proven remarkably resistant to fixing through educational policy. We also are constrained by the existence of individual differences in academic talent, the origins of which are controversial but the existence of which should not be. These, I believe, will be with us always, though their impact on our lives can be ameliorated through economic policy.

I have never said that there is no hope for changing quantitative indicators. I have, instead, said that the reduction of the value of education to only those quantitative indicators is a mistake, especially if we have a realistic attitude towards what pedagogy and policy can achieve. We can and should attempt to improve outcomes on these metrics, but we must be realistic, and the absolute refusal of policy types to do so has resulted in disasters like No Child Left Behind. Of course we should ask questions about what works, but we must be willing to recognize that even what works is likely of limited impact compared to factors that schools, teachers, and policy don’t control.

This week’s Study of the Week, by Dietrichson, Bøg, Filges, and Jørgensen, provides some clues. It’s a meta-analysis of 101 studies from the past 15 years, three quarters of which were randomized controlled trials. That’s a particularly impressive evidentiary standard. It doesn’t mean that the conclusions are completely certain, but that number of studies, particularly with randomized controlled designs, lends powerful evidence to what the authors find. If we’re going to avoid the pitfalls of significance testing and replicability, we have to do meta-analyses, even as we recognize that they are not a panacea. Before we take a look at this one, a quick word on how they work.

Effect Size and Meta-Analysis

The term “statistically significant” appears in discussions of research all the time, but as you often hear, statistical significance is not the same thing as practical significance. (After “correlation does not imply causation!” that’s the second most common Stats 101 bromide people will throw at you on the internet.) And it’s true and important to understand. Statistical significance tests are performed to help ascertain the likelihood that a perceived quantitative effect is a figment of our data. So we have some hypothesis (giving kids an intervention before they study will boost test scores, say) and we also have the null hypothesis (kids who had the intervention will not perform differently than those who didn’t take it). After we do our experiment we have two average test scores for the two groups, and we know how many of each we have and how spread out their scores are (the standard deviation). Afterwards we can calculate a p-value, which tells us the likelihood that we would have gotten that difference in average test scores or better even if the null was actually true. Stat heads hate this kind of language but casually people will say that a result with a low p-value is likely a “real” effect.
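To make the logic of a p-value concrete, here is a minimal sketch in Python. It uses a permutation test, which is my own choice for illustration because it needs nothing beyond the standard library; most published studies would use a t-test or something fancier. The test scores are invented, and the function name permutation_p_value is hypothetical. The idea: if the null is true, the group labels are arbitrary, so we shuffle them many times and count how often chance alone produces a difference at least as big as the one we observed.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_resamples=10_000, seed=0):
    """One-sided p-value: the share of random relabelings whose difference
    in means is at least as large as the observed difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, who got the intervention is arbitrary, so shuffle.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if diff >= observed:
            hits += 1
    return hits / n_resamples


# Invented test scores, for illustration only.
treated = [78, 85, 90, 88, 92, 81, 87, 95]
control = [70, 75, 72, 80, 68, 74, 77, 71]

p = permutation_p_value(treated, control)
print(f"difference in means: {mean(treated) - mean(control):.2f}, p = {p:.4f}")
```

A small p here means that shuffled labels almost never reproduce a gap this large. Note that it says nothing about whether the gap itself is big enough to care about.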

For all of its many problems, statistical significance testing remains an important part of navigating a world of variability. But note what a p-value is not telling us: the actual strength of the effect. That is, a p-value helps us have confidence in making decisions based on a perceived difference in outcomes, but it can’t tell us how practically strong the effect is. So in the example above, the p-value would not be an appropriate way to report the size of the difference in averages between the two groups. Typically people have just reported those different averages and left it at that. But consider the limitations of that approach: frequently we’re going to be comparing different figures from profoundly different research contexts and derived from different metrics and scales. So how can we responsibly compare different studies and through them different approaches? By calculating and reporting effect size.

As I discussed the other day, we frequently compare different interventions and outcomes through reference to the normal distribution and standard deviation. As I said, that allows us to make easy comparisons between positions on different scales. You look at the normal distribution and can say OK, students in group A were this far below the mean, students in group B were this far above it, and so we can say responsibly how different they are and where they stand relative to the norm. Pragmatically speaking (and please don’t scold me), there’s only about three standard deviations of space below and above the mean in normally-distributed data. So when we say that someone is a standard deviation above or below someone else, that gives you a sense of the scale we’re talking about here. Of course, the context and subject matter make a good deal of difference too.

There are lots of different ways to calculate effect sizes, though all involve comparing the size of the given effect to the standard deviation. (Remember, standard deviation is important because spread tells us how much we should trust a given average. If I give a survey on a 0-10 scale and I get equal numbers of every number on that scale – exactly as many 0s, 1s, 2s, 3s, etc. – I’ll get an average of 5. If I give that same survey and everyone scores a 5, I still get an average of 5. But for which situation is 5 a more accurate representation of my data?) In the original effect size, and one that you still see sometimes, you simply divide the difference between the averages by the pooled standard deviation of the two groups you’re comparing, to give you Cohen’s d. There are much fancier ways to calculate effect size, but that’s outside the bounds of this post.
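Here is a short Python sketch of both ideas, using only the standard library. The first part reproduces the survey example: two sets of responses with the same average and very different spread. The second computes Cohen’s d with the usual pooled-standard-deviation formula. The function name cohens_d and all of the numbers are mine, invented for illustration.

```python
from math import sqrt
from statistics import mean, pstdev, variance

# The survey example: one response at each value 0 through 10,
# versus eleven respondents who all answer 5.
spread_out = list(range(11))
all_fives = [5] * 11

print(mean(spread_out), pstdev(spread_out))  # same mean, large spread
print(mean(all_fives), pstdev(all_fives))    # same mean, zero spread


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Invented scores: the groups differ by 1 point on a scale where the
# pooled standard deviation is 2, so d comes out to half an SD.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d)
```

Because d is expressed in standard deviations rather than raw points, the same number means roughly the same thing whether the underlying test was scored out of 10 or out of 1600.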

A meta-analysis takes advantage of the affordances of effect size to compare different interventions in a mathematically responsible way. A meta-analysis isn’t just a literature review; rather than just reporting what previous researchers have found, those conducting a meta-analysis use quantitative data made available to researchers to calculate pooled effect sizes. When doing so, they weight the data by looking at the sample size (more is better), the standard deviation (less spread is better), and the size of the effect. There are then some quality controls and attempts to account for differences in context and procedure between different studies. What you’re left with is the ability to compare different results and discuss how big effects are in a way that helps mitigate the power of error and variability in individual studies.
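For a rough sense of how the pooling works, here is a sketch of the simplest textbook version: a fixed-effect, inverse-variance weighted average. To be clear, this is an assumption on my part for illustration and not necessarily what any particular meta-analysis does; real ones often use random-effects models and other adjustments. The function name pooled_effect and the study numbers are invented. Each study’s effect size is weighted by one over its squared standard error, so big, low-variance studies count for more.

```python
from math import sqrt


def pooled_effect(effects, standard_errors):
    """Fixed-effect pooled estimate: weight each study by 1 / SE^2."""
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, effects)) / total_weight
    pooled_se = sqrt(1 / total_weight)  # pooled estimate is more precise
    return pooled, pooled_se


# Invented studies: a precise one with a small effect and a noisy one
# with a larger effect. The precise study dominates the pooled result.
effects = [0.1, 0.5]
standard_errors = [0.1, 0.2]

estimate, se = pooled_effect(effects, standard_errors)
print(f"pooled effect: {estimate:.2f} SD (standard error {se:.3f})")
```

Notice that the pooled estimate lands much closer to the precise study than to the simple midpoint of the two effects, which is the whole point of weighting.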

Because meta-analyses must go laboriously through explanations of how studies were selected and disqualified, as well as descriptions of quality controls and the particular methods to pool standard deviations and calculate effect sizes, reading them carefully is very boring. So feel free to hit up the Donation buttons to the right to reward your humble servant for peeling through all this.

Bet On the Null

One cool thing about meta-analyses is that they allow you to get a bird’s eye view of the kinds of effects that are reported in various studies of various types of interventions. And what you find, in ed research, is that we’re mostly playing with small effects.

In the graphic above, the scale at the bottom is for effect sizes represented in standard deviations. The dots on the lines are the effect sizes for a given study. The lines extending from the dots are our confidence interval. A confidence interval is another way of grappling with statistical significance and how much we trust a given average. Because of the inevitability of measurement error, we can never say with 100% certainty that a sample mean is the actual mean of that population. Instead, we can say with a certain degree of confidence, which we choose ourselves, that the true mean lies within a given range of values. 95% confidence intervals, such as these, are a typical convention. Those lines tell us that, given the underlying data, we can say with 95% confidence that the true average lies within those lines. If you wanted to narrow those lines, you could choose a lower % of confidence, but then you’re necessarily increasing the chance the true mean isn’t actually within the line.
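A quick Python sketch of the trade-off between confidence level and interval width, again with invented scores and a hypothetical function name. It assumes a normal approximation for simplicity; with a sample this small, a t-distribution would be the more careful choice and would give a slightly wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Confidence interval for a mean, using a normal approximation."""
    # Critical value: about 1.96 for 95% confidence, about 1.28 for 80%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(sample) / sqrt(len(sample))
    center = mean(sample)
    return center - margin, center + margin


# Invented test scores, for illustration only.
scores = [72, 85, 90, 66, 78, 95, 81, 70, 88, 76]

low_95, high_95 = mean_confidence_interval(scores, confidence=0.95)
low_80, high_80 = mean_confidence_interval(scores, confidence=0.80)

print(f"95% interval: {low_95:.1f} to {high_95:.1f}")
print(f"80% interval: {low_80:.1f} to {high_80:.1f}")  # narrower, less sure
```

The 80% interval is tighter, but only because we have agreed to be wrong more often.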

Anyhow, look at the effects here. As is so common in education, we’re generally talking about small impacts from our various interventions. This doesn’t tell you what kind of interventions these studies performed – we’ll get there in just a second – but I just want to note how studies with the most dependable designs tend to produce limited effects in education. In fact in a majority of these studies the confidence interval includes zero. Meanwhile, only 6 of these studies have meaningfully powerful effects, although in context they’re pretty large.

Not to cast aspersions but the Good et al. study is the kind of effect size that makes me skeptical right off the bat. The very large confidence interval should also give us pause. That doesn’t mean the researchers weren’t responsible, or that we throw out that study entirely. It just means that this is exactly what meta-analysis is for: it helps us put results in context, to compare the quantitative results of individual studies against others and to get a better perspective on the size of a given effect and the meaning of a confidence interval. In this case, the confidence interval is so wide that we should take the result with several pinches of salt, given the variability involved. Again, no insult to the researchers; ed data is highly variable so getting dependable numbers is hard. We just need to be real: when it comes to education interventions, we are constrained by the boundaries of the possible.

Poor students benefit most from the intervention of human beings

OK, on to the findings. When it comes to improving outcomes for students from poor families, what does this meta-analysis suggest works?

A few things here. We’ve got a pretty good illustration of the relationship between confidence intervals and effect size; small-group instruction has a strong effect size but because the confidence interval (just barely) overlaps with 0 it could not be considered statistically significant at the .05 level. Does that mean we throw out the findings? No; the .05 threshold isn’t a dogma, despite what journal publishing guidelines might make you think. But it does mean that we have to be frank about the level of variability in outcomes here. It seems small group instruction is pretty effective in some contexts for some students but potentially not effective at all in others.

Bear in mind: because we’re looking at aggregates of various studies here, wide confidence intervals likely mean that different studies found conflicting findings. We might say, then, that these interventions can be powerful but that we are less certain about the consistency of their outcomes; maybe these things work well for some students but not at all for others. Meanwhile an intervention like increased resources has a nice tight confidence interval, giving us more confidence that the effect is “real,” but a small effect size. Is it worth it? That’s a matter of perspective.

Tutoring looks pretty damn good, doesn’t it? True, we’re talking about less than .4 of a SD on average, but again, look at the context here. And that confidence interval is nice and tight, meaning that we should feel pretty strongly that this is a real effect. This should not be surprising to anyone who has followed the literature on tutoring interventions. Yet how often do you hear about tutoring from ed reformers? How often does it pop up at The Atlantic or The New Republic? Compare that to computer-mediated instruction, which is a topic of absolute obsession in our ed debate, the digital Godot we’re all waiting for to swoop in and save our students. No matter how often we get the same result, technology retains its undeserved reputation as the key to fixing our system. When I say that education reform is an ideological project and not a practical one, this is what I mean.

What’s shared by tutoring, small group instruction, cooperative learning, and feedback and progress monitoring – the interventions that come out looking best? The influence of another human being. The ability to work closely with others, particularly trained professionals, to go through the hard, inherently social work of error and correction and trying again. Being guided by another human being towards mastery of skills and concepts. Not paying tons of money on some ed tech boondoggle. Rather, giving individual people the time necessary to work closely with students and shepherd their progress. Imagine if we invested our money in giving all struggling students the ability to work individually or in small groups with dedicated educational professionals that we treated as respected experts and paid accordingly.

What are we doing instead? Oh, right. Funneling millions of dollars into one of the most profitable companies in the world for little proven benefit. Guess you can’t be too cynical.

no, really, race is a social construct

As I’ve argued in this space before, perceived racial differences in academic outcomes (the racial achievement gap) are the product of socioeconomic and environmental inequalities, while differences between individuals (even non-monozygotic twin siblings) are partially genetic.

This position has proven consistently controversial, but I’ve never quite understood why; it’s explicitly rejecting racist pseudoscience while accepting the findings of a vast body of well-replicated research demonstrating that genes influence (not determine!) psychological traits like academic ability. What’s more, I find it frustrating how many people reject it given that this belief is perfectly in keeping with the idea that race is a social construct, which is an idea that I believe. Again, the relationship between parents and their children is genetically simple, whereas the relationship between genetically and historically distant people who we have socially categorized into racial categories is extremely complex, inconsistent, and tangled. To think that individual genetic differences imply racial genetic differences is to think like a race realist – so it’s strange that so many people who accept the social construction of race make that leap!

Here’s a passage from a recent (gated, alas) Danish study that I will be writing at length about next week. I think it fits into all of this nicely.

However, recent evidence from the United States indicates that hereditary factors are not a major constraint for low SES students (Nisbett et al., 2012). For example, Tucker-Drob, Rhemtulla, Harden, Turkheimer, and Fask (2011) found no significant differences between children in high and low SES families on the Bayley Short Form–Research Edition (see, e.g., Andreassen & Fletcher, 2007)—a test of infant mental ability—at the age of 10 months, but by age 2 children in high SES families scored about one third of a standard deviation higher than children in low SES families. Genes accounted for nearly 50% of the variation in mental ability of high SES children but only a negligible share of low SES children’s variation, indicating that the latter are not reaching their full cognitive potential. Rhemtulla and Tucker-Drob (2012) found similar patterns of gene and SES interactions in follow-up tests of mathematics skill at age 4 (but no significant interactions in reading). Fryer and Levitt (2013) found no significant differences on the Bayley Short Form–Research Edition among Hispanic, Asian, Black, and White infants aged 8 to 12 months, although a one standard deviation gap in test scores between Black and White children, which typically differ in SES, has been observed by age 3.

Now, how can the “race realists” account for the lack of a racial gap in cognitive abilities among children at 8 to 12 months, and the presence of a large gap at 3? Am I to believe that their genome changed that much in 24-28 months? Or is there a much more plausible explanation – that we have socially constructed racial categories and embedded deep and persistent inequality into our society based on those categories, which in turn results in educational inequality? The race science crowd will insist that we’ve controlled for socioeconomic status, but this presumes that metrics like income band can account entirely for the relationship between race and socioeconomic reality, which I don’t think is true. The impact of race on someone’s position in our society is profoundly multivariate, with all manner of pernicious inequalities that are not fully explained by raw metrics of income and wealth.

Meanwhile, the genetic relationship between parents and children, between brothers and sisters, and the lack of same between adopted children and parents/siblings implies a not-particularly-controversial degree of influence on all kinds of outcomes, including academic ability. That, too, is perfectly in keeping with social construct theory – and, as I will keep insisting, functions as a powerful argument for economic redistribution and away from market capitalism.

None of this means that my take on these issues is correct, necessarily, though I think the evidence grows all the time. But the insistence that belief in genetic influence on individual academic talent somehow implies racist pseudoscience seems to me to make the exact same error that racists do: imagining that race must be biologically real.

Update: To briefly answer the “why do you care about differences in individual academic potential” question:

  1. because the belief that everyone has identical academic potential leads inevitably to profoundly conservative “just deserts”-style economics and
  2. the most disastrous education policy efforts in our history, especially No Child Left Behind, have been based on the assumption that there are no constraints on what policy, educators, and schools can achieve, and I’d like us to have good education policy instead of bad.

norm referencing, criterion referencing, and ed policy

I want to talk a bit about a distinction between different types of educational testing/assessment and how they interface with some basic questions we have about education policy. The two concepts are norm referencing and criterion referencing.

Criterion Referencing

Why do we perform tests? What is their purpose? One common reason is to ensure that people are able to perform some sort of an essential task. Take a driver’s test. The point is to make sure that people who are on the roads possess certain minimal abilities to safely pilot a car, based on social standards of competence that are written into policy. While we might understand that some people are better or worse drivers than others, we’re not really interested in using a driver’s test to say who is adequate, who is good, and who is excellent. Rather we just want to know: do you meet this minimal threshold? The name for that kind of test is a criterion referenced test. We have some criterion (or criteria) and we check and see if the people taking the test fulfill them. Sometimes we want these tests to be fairly generous; society couldn’t function, in many locales, if a majority of competent adults couldn’t pass a driver’s test. On the other hand, we probably want the benchmarks for, say, running a nuclear reactor to be fairly strict. The social costs would here be higher for a test that was too lenient rather than too strict. In either case, though, our interest is not in discriminating between different individuals to a fine level of gradation, particularly for those who are clearly good enough or clearly not. Rather we just want to know: is the test taker competent to perform the real-world task?

Norm Referencing

Criterion referencing depends on, well, the existence of a criterion. That is, there has to be some sort of benchmark or goal that the test taker will either reach or not. What would it mean to have a criterion referenced test for, say, college readiness? We can certainly imagine a set benchmark for being prepared for college, and we’d probably like to think that there’s some minimal level of preparation that’s required for any college-bound student. But in a broader sense we are probably aware that there is no one set criterion that would work given the large range of schools and students that “college readiness” reflects. What’s more, we also know that colleges are profoundly interested in relative readiness; elite colleges spend vast amounts of money attracting the most highly-qualified students to campus.

For that we need to use tests like the SAT and ACT, which are oriented not around fulfilling a given criterion but around creating a scale of test-takers and discriminating between different students in exacting detail. We need, that is, norm referenced tests. When we say “norm” here we mean in comparison to others, to an average and to quintiles, and in particular to the normal distribution. I don’t want to get too into the weeds on that big topic, for those who aren’t already versed in it. Suffice it to say for now that the normal distribution is a very common distribution of observed values for all sorts of naturally-occurring phenomena that have a finite range (that is, a beginning and an ending) and which are affected by multiple variables. The ideal normal distribution looks like this:

The big center line in there is the mean, median, and mode – that is, the arithmetic average, the line that divides one half of the data from the other, and the observation that occurs most. As you can see, in a true normal distribution the observations fall in very particular patterns relative to the average. In particular, the further away we go from the average, the less likely we are to find observations, again in predictable quantitative relationships. When we talk about something changing relative to a standard deviation, using that statistic (actually a measure of spread) as a measure of distance or extremity, we’re doing so in relation to the normal distribution. Tests like the SAT and ACT, GRE, IQ tests, and all manner of other tests used for the purpose of screening applicants for some finite number of slots use norm referencing.
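For those who like to see the numbers, here is a minimal sketch of those “predictable quantitative relationships,” assuming an ideal normal distribution. The function names are my own, chosen for illustration; the only library call is Python's standard `math.erf`.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative probability of a standard normal variable at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def share_within(k: float) -> float:
    """Share of observations within k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)


# How much of an ideal normal distribution sits near the average?
for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

Run as written, this should show roughly 68%, 95%, and 99.7% of observations falling within one, two, and three standard deviations of the mean. That is why a score two or three standard deviations out counts as unusual.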

Why? Again, think about what we’re trying to accomplish with norm referencing. I want to give a test to be able to say that test taker X is better than test taker Y but worse than test taker Z. But we also want to be able to say where they each fall relative to the mass of data. Norm referencing allows us to make meaningful statements about how someone will perform relative to others, and this in turn gives us information about how to, for example, select people for our scarce admissions slots at an exclusive college.

As Glenn Fulcher puts it in his (excellent) book Practical Language Testing:

As we move away from the mean the scores in the distribution become more extreme, and so less common. It is very rare for test takers to get all items correct, just as it is very rare for them to get all items incorrect. But there are a few in every large group who do exceptionally well, or exceptionally poorly. The curve of normal distribution tells us what the probability is that a test taker could have got the score they have, given the place of the score in a particular distribution. And this is why we can say that a score is ‘exceptional’ or ‘in the top 10 per cent’, or ‘just a little better than average’.

(Incidentally, it’s a very common feature of various types of educational, intelligence, and language testing that scores become less meaningful as they move towards the extremes. That is, a 10 point difference on a well-validated IQ test means a lot when it comes to the difference between a 95 and a 105, but it means much less when it comes to a difference between 25 and 35 or 165 and 175. Why? In part because outliers are rare, by their nature, which means we have less data to validate that range of our scale. Also, practically speaking there are floors and ceilings. Someone who gets a 20 on the TOEFL iBT and someone who gets a 30 share one most important thing: they’re functionally unable to communicate in English. This is also why you shouldn’t trust anyone who tells you they have an IQ over, say, 150 or so. The scale just doesn’t mean anything up that high.)
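To put a rough number on “outliers are rare,” here is a small sketch. It assumes the conventional IQ scaling (mean 100, standard deviation 15) and a perfectly normal distribution. The norming sample of 10,000 people is a hypothetical figure I picked for illustration, not a claim about any real test.

```python
import math

MEAN, SD = 100.0, 15.0      # conventional IQ scaling, assumed here
NORMING_SAMPLE = 10_000     # hypothetical sample size, for illustration only


def share_above(score: float) -> float:
    """Share of an ideal normal population scoring above `score`."""
    z = (score - MEAN) / SD
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Expected number of people in the norming sample above each score.
for score in (105, 130, 150, 165):
    expected = share_above(score) * NORMING_SAMPLE
    print(f"above {score}: about {expected:.2f} people per {NORMING_SAMPLE:,}")
```

Under those assumptions you would expect only a handful of people above 150 in a sample that size, and effectively nobody above 165. There just isn't much data out there to check the scale against.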

How do we get these pretty normal distributions? That’s the work of test development, and part of why it can be a seriously difficult and expensive undertaking. The nature of numbers (and the central limit theorem) help, but ultimately the big testing companies have to spend a ton of time and money getting a distribution as close to normal as possible – and whatever else their flaws, organizations like ETS do that very well. Either way, it’s essential to say that the normal distribution does not arrange itself like magic in tests. It has to be produced with careful work.

The Question of Grades

Thinking about these two paradigms can help us think through some questions. Here’s a simple one: are grades norm referenced or criterion referenced? The answer is surely both and neither, but I think it’s useful to consider the dynamics here for a minute. In one sense, grades are clearly criterion referenced, in that they are meant to reflect a given student’s mastery of the given course’s subject matter. And we aren’t likely to say that only X% of students in a class should pass or fail according to a model distribution; rather, we think we should pass everyone who demonstrates the knowledge, skills, and competencies a class is designed to instill. And we have little reason to think that grades are actually normally distributed between students in the average class.

Yet, in another sense, we clearly think of grades as norm referenced. When we talk about grade inflation, we are often speaking in terms of norm referencing, complaining that we’re losing the ability to discriminate between different levels of ability. Grades are used to compare different students from remarkably different educational contexts in the college admissions process. And sometimes, we “curve a test,” meaning adjusting a test to better match the normal distribution – which does not, contrary to undergraduate presumption, necessarily help the people who took the test. There’s an intuitive sense that grades should match a fairness distribution that, while not normal around the natural mean (you wouldn’t want the average grade to be a 50, I trust), still essentially replicates normality in that both very high and very low grades should be rare. In practice, this is not often the case. A’s, in my experience, outnumber F’s.

In grad school a perpetual concern was the very high average grade of freshman composition, something like a B+. And the downwards pressure on that average was largely the product of students who stopped coming to class and thus got F’s, meaning that the average grade of students who actually completed the course was probably very high indeed. Is this a problem? I guess it depends on your point of view. If we were trying to meaningfully discriminate between different students based on their freshman comp grades, then certainly – but I’m not sure if we’d want to do that. On the other hand, it might be that freshman comp is a class that we think most students might naturally be expected to do well in; it isn’t written anywhere that the criterion for success be at a particularly high level. Certainly the major departments wouldn’t like too many students failing that Gen Ed, given that they’d then have to fill valuable schedule space when they retook it. The question, though, is whether those grades actually reflect a meaningful meeting of the given benchmarks – fulfilling the criterion. I’m of the opinion that the answer is often “no” when it comes to freshman comp, meaning that the average grades are probably too high no matter what.

I don’t have any grand insight here, and I think most people are able to meaningfully think about grades in a way that reflects both norm-referenced and criterion-referenced interests. But I do think that these dynamics are important to think about. As I’ve been saying lately, I think that there are some basic aspects of education and education policy that we simply haven’t thought through adequately, and we all could benefit from going back to the basics and pulling apart what we think we want.

Study of the Week: Rebutting Academically Adrift with Its Own Mechanism

It’s a frustrating fact of life that arguments that are most visible are always going to be, for most people, the arguments that define the truth. I fear that’s the case with Academically Adrift, the 2011 book by Richard Arum and Josipa Roksa that has done so much to set the conventional wisdom about the value of college. That book made incendiary claims about the limited learning that college students are supposedly doing. Many people assume that the book’s argument is the final word. There are in fact many critical words out there on its methodology, or the methodology we’re allowed to see. (One of the primary complaints about the book is that the authors hide the evidence for some of their claims.) Richard Haswell’s review, available here, is particularly cogent and critical:

They compared the performance of 2,322 students at twenty-four institutions during the first and fourth semesters on one Collegiate Learning Assessment task. Surprisingly, their group as a whole recorded statistically significant gain. More surprisingly, every one of their twenty-seven subgroups recorded gain. Faced with this undeniable improvement, the authors resort to the Bok maneuver and conclude that the gain was “modest” and “limited,” that learning in college is “adrift.” Not one piece of past research showing undergraduate improvement in writing and critical thinking—and there are hundreds—appears in the authors’ discussion or their bibliography….

What do they do? They create a self-selected set of participants and show little concern when more than half of the pretest group drops out of the experiment before the post-test. They choose to test that part of the four academic years when students are least likely to record gain, from the first year through the second year, ending at the well-known “sophomore slump.” They choose prompts that ask participants to write in genres they have not studied or used in their courses. They keep secret the ways that they measured and rated the student writing. They disregard possible retest effects. They run hundreds of tests of statistical significance looking for anything that will support the hypothesis of nongain and push their implications far beyond the data they thus generate.

There are more methodological critiques out there to be found, if you’re interested.

Hoisted By Their Own Petard

But let’s say we want to be charitable and accept their basic approach as sound. Even then, their conclusions are hard to justify, as later research using the same primary mechanism found much more learning. Academically Adrift utilized the Collegiate Learning Assessment (CLA), a test of college learning developed by the Council for Aid to Education (CAE). I wrote my dissertation on the CLA and its successor, the CLA+, so as you can imagine my thoughts on the test in general are complex. I can come up with both a list of things that I like about it and a list of things that I don’t like about it. For now, though, what matters is that the CLA was the primary mechanism through which Arum and Roksa made their arguments. Yet research with a far larger data set and undertaken using the freshman-to-senior academic cycle that the CLA was intended to use has shown far larger gains than those reported by Arum and Roksa.

This report from CAE, titled “Does College Matter?” and this week’s Study of the Week, details research on a larger selection of schools than that measured in Academically Adrift. In contrast to the .18 SAT-normed standard deviation growth in performance that Arum and Roksa found, CAE finds an average growth of .78 SAT-normed standard deviations, with no school demonstrating an effect size of less than .62. Now, if you don’t work with effect sizes often, you might not find this a particularly large increase, but as an average level of growth across institutions, that’s in fact quite impressive. The institutional average score grew on the CLA’s scale, which is quite similar to that of the SAT and also runs from 400 to 1600, by over 100 points. (For contrast, gains from commercial tutoring programs for tests like the SAT and ACT rarely exceed 15 to 20 points, despite the claims of the major prep companies.)
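If effect sizes feel abstract, one common way to read them is as a percentile shift. The sketch below rests on two assumptions of mine that the report excerpt doesn't confirm: that scores are roughly normal, and that the effect sizes are expressed in student-level standard deviations. It asks where a median freshman would land in the freshman distribution after gaining a given number of standard deviations. The function names are illustrative.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative probability of a standard normal variable at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percentile_after_gain(effect_size: float, start_z: float = 0.0) -> float:
    """Percentile of the starting distribution reached by a student who
    begins at `start_z` (0.0 = the median) and gains `effect_size` SDs."""
    return 100.0 * normal_cdf(start_z + effect_size)


# Effect sizes quoted in the text: .18 (Academically Adrift),
# .62 (lowest school in the CAE report), .78 (CAE average).
for label, d in (("Academically Adrift", 0.18),
                 ("CAE lowest school", 0.62),
                 ("CAE average", 0.78)):
    print(f"{label}: 50th percentile -> {percentile_after_gain(d):.0f}th")
```

Under those assumptions, a .18 gain moves the median student to about the 57th percentile of the freshman distribution, while a .78 gain moves them to about the 78th. That is a meaningfully different story.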

The authors write:

This stands in contrast to the findings of Academically Adrift (Arum and Roska, 2011) who also examined student growth using the CLA. They suggest that there is little growth in critical thinking as measured by the CLA. They report an effect size of .18, or less than 20% of a standard deviation. However, Arum and Roska used different methods of estimating this growth, which may explain the differences in growth shown here with that reported in Academically Adrift…. The Arum and Roska study is also limited to a small sample of schools that are a subset of the broader group of institutions that conduct value-added research using the CLA, and so may not be representative of CLA growth in general.

To summarize: research undertaken with the same mechanism as used in Academically Adrift and with both a dramatically larger sample size and a sample more representative of American higher education writ large contradicts the book’s central claim. It would be nice if that would seep out into the public consciousness, given how ubiquitous Academically Adrift was a few years ago.

Of course, the single best way to predict a college’s CLA scores is with the SAT scores of its incoming classes… but you’ve heard that from me before.


Motivation Matters

There’s a wrinkle to all of these tests, and a real challenge to their validity.

A basic assumption of educational and cognitive testing is that students are attempting to do their best work; if all students are not sincerely trying to do their best, they introduce construct-irrelevant variance and degrade the validity of the assessment. This issue of motivation is a particularly acute problem for value-added metrics like the CLA, as students who apply greater effort to the test as freshmen than they do as seniors would artificially reduce the amount of demonstrated learning.

At present, the CLA is a low stakes test for students. Unlike with tests like the SAT and GRE, which have direct relevance to admission into college and graduate school, there is currently no appreciable gain to be had for individual students from taking the CLA. Whatever criticisms you may have of the SAT or ACT, we can say with confidence that most students are applying their best effort to them, given the stakes involved in college admissions. Frequently, CLA schools have to provide incentives for students to take the test at all, which typically involve small discounts on graduation-related fees or similar. The question of student motivation is therefore of clear importance for assessing the test’s validity. The developers of the test apparently agree, as in their pamphlet “Reliability and Validity of CLA+,” they write “low student motivation and effort are threats to the validity of test score interpretations.”

In this 2013 study, Ou Lydia Liu, Brent Bridgeman, and Rachel Adler studied the impact of student motivation on ETS’s Proficiency Profile, itself a test of collegiate learning and a competitor to the CLA+. They tested motivation by dividing test takers into two groups. In the experimental group, students were told that their scores would be added to a permanent academic file and noted by faculty and administrators. In the control group, no such information was delivered. The study found that “students in the [experimental] group performed significantly and consistently better than those in the control group at all three institutions and the largest difference was .68 SD.” That’s a mighty large effect! And so a major potential confound. It is true that the Proficiency Profile is a different testing instrument than the CLA, although Liu, Bridgeman, and Adler suggest that this phenomenon could be expected in any test of college learning that is considered low stakes. The results of this research were important enough that CAE’s Roger Benjamin, in an interview with Inside Higher Ed, said that the research “raises significant questions” and that the results are “worth investigating and [CAE] will do so.”

Now, in terms of test-retest scores and value added, the big question is, do we think motivation is constant between administrations? That is, do we think our freshman and senior cohorts are each working equally hard at the test? If not, we’re potentially inflating or deflating the observed learning. Personally, I think first-semester freshmen are much more likely to work hard than last-semester seniors; first-semester freshmen are so nervous and dazed you could tell them to do jumping jacks in a trig class and they’d probably dutifully get up and go for it. But absent some valid and reliable motivation indicator, there’s just a lot of uncertainty as long as students are taking tests that they are not intrinsically motivated to perform well on.
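To see how much this uncertainty could matter, here is a deliberately crude sketch. It assumes a simple additive model that I'm making up for illustration: whatever seniors lose to lower effort, in standard deviation units, gets subtracted straight from the gain we observe. The effort drops are hypothetical. The .68 value borrows the largest motivation gap from the Liu, Bridgeman, and Adler study as an illustrative upper bound, not as an estimate of what actually happens on the CLA.

```python
def true_gain(observed_gain: float, senior_effort_drop: float) -> float:
    """Underlying gain (in SDs) if seniors' scores are depressed by
    `senior_effort_drop` SDs relative to freshmen. Purely additive model."""
    return observed_gain + senior_effort_drop


# Observed effect sizes quoted earlier in this post.
OBSERVED = {"Academically Adrift": 0.18, "CAE report": 0.78}

# Hypothetical effort drops; .68 is the largest gap in the 2013 ETS study.
EFFORT_DROPS = (0.0, 0.2, 0.68)

for study, observed in OBSERVED.items():
    for drop in EFFORT_DROPS:
        print(f"{study}: effort drop {drop:.2f} SD -> "
              f"implied gain {true_gain(observed, drop):.2f} SD")
```

The point isn't any particular number. It's that a plausible-sized difference in effort between administrations can be as large as the learning gain we're trying to measure. If freshmen were the ones slacking, the sign would flip and the observed gains would be inflated instead.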

Disciplinary Knowledge – It’s Important

Let’s set aside questions of the test’s validity for a moment. There’s another reason not to think that modest gains on a test like this are reason to fret too much about the degree of learning on campus: they don’t measure disciplinary knowledge and aren’t intended to.

That is, these tests don’t measure (and can’t measure) how much English an English major learns, whether a computer science student can code, what a British history major knows about the Battle of Agincourt, if an Education major will be able to pass a state teacher accreditation test…. These are pretty important details! The reason for this omission is simple: because these instruments want to measure students across different majors and schools, content-specific knowledge can’t be involved. There’s simply too much variation in what’s learned from one major to the next to make such comparisons fruitful. But try telling a professor that! “Hey, we found limited learning on college campuses. Oh, measuring the stuff you actually teach your majors? We didn’t try that.” This is especially a problem because late-career students are presumably most invested in learning within their major and getting professionalized into a particular discipline.

I do think that “learning to learn,” general meta-academic skills, and cross-disciplinary skills like researching and critically evaluating sources are important and worth investigating. But let’s call tests of those things tests of those things instead of summaries of college learning writ large.

Neither Everything Nor Nothing

I have gotten in some trouble with peers in the humanities and social sciences in the past for offering qualified defenses of test instruments like the CLA+. To many, these tests are porting the worst kinds of testing mania into higher education and reducing the value of college to a number. I understand these critiques and think they have some validity, but I think they are somewhat misplaced.

First, it’s important to say that for all of the huffing and puffing of presidential administrations since the Reagan White House put out A Nation at Risk, there’s still very little in the way of overt federal pressure being placed on institutions to adopt tests like this, particularly in a high-stakes way. We tend to think of colleges and universities as politically powerless, but in fact they represent a powerful lobby and have proven to be able to defend their own independence. (Though not their funding, I’m very sorry to say.)

Second, I will again say that a great deal of the problem with standardized testing lies in the absurd scope of that testing. That is, the current mania for testing has convinced people that the only way to test is to do census-style testing – that is, testing all the students, all the time. But as I will continue to insist, the power of inferential statistics means that we can learn a great deal about the overall trends in college learning without overly burdening students or forcing professors to teach to the test. Scaling results up from carefully-collected, randomized and stratified samples is something that we do very well. We can have relatively small numbers of college students taking tests a couple times in their careers and still glean useful information about our schools and the system.
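Here is a back-of-the-envelope sketch of why small samples can go a long way. It assumes a simple random sample and uses the textbook standard error of a mean. A well-designed stratified sample would typically do at least this well, but that is my assumption and the code doesn't model it. The function name and the sample sizes are illustrative.

```python
import math


def margin_of_error(sd: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample mean, given the
    score standard deviation `sd` and a simple random sample of size `n`."""
    return z * sd / math.sqrt(n)


# Margin of error expressed in standard deviation units (sd = 1.0),
# so it can be compared directly to effect sizes.
for n in (100, 400, 1000, 2500):
    print(f"n = {n:>4}: average score known to within "
          f"+/- {margin_of_error(1.0, n):.3f} SD")
```

With a thousand randomly selected test takers, the average is pinned down to within a few hundredths of a standard deviation. That is a long way from testing every student every year.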

Ultimately, I think we need to be doing something to demonstrate learning gains in college. Because the university is threatened. We have many enemies, and they are powerful. And unless we can make an affirmative case for learning, we will be defenseless. Should a test like the CLA+ be the only way we make that case? Of course not. Instead, we should use a variety of means, including tests like the CLA+ or Proficiency Profile, disciplinary tests developed by subject-matter experts and given to students in appropriate disciplines, faculty-led and controlled assessment of individual departments and programs, raw metrics like graduation rate and time-to-graduation, student satisfaction surveys like the Gallup-Purdue index, and broader, more humanistic observations of contemporary campus life. These instruments should be tools in our toolbelt, not the hammer that forces us to see a world of nails. And the best data available to us that utilizes this particular tool tells us the average American college is doing a pretty good job.