Study of the Week: Computers in the Home

A quickie today. It is fair to say that technology plays an enormous role in our educational discourse. Indeed, “technology will solve our educational problems” is a central part of the solutionism that dominates ed talk. From “teach a kid to code” to “who needs highly trained teachers when there’s Khan Academy?,” the idea that digital technology holds the key to the future of schooling is ubiquitous and unavoidable.

This is strange given that educational technology has done almost nothing but fail. Study after study has found no impact on education metrics from technology.

(Now, let me say upfront: this blog post is not intended as a literature review for the vast body of work on the educational impacts of technology. It is instead using a large and indicative study to discuss a broader research trend. If you would like for me to write a real literature review, my PayPal is available at the right.)

Consider having a personal computer in the home. Many would assume that this would give kids an advantage in school. After all, they could play educational software, surf the Web, get help on their homework remotely…. And yet that appears to not be the case. Published in 2013, this week’s study comes from the National Bureau of Economic Research. Written by Robert W. Fairlie and Jonathan Robinson of UC Santa Cruz, the study finds in fact that a personal computer in the home simply makes no difference to student outcomes – not good, not bad, nothing.

The study is large (n = 1,123) and high quality. In particular, it offers the rare advantage of being a genuine controlled randomized experiment. That is, the researchers identified research subjects who, at baseline, did not own computers, and assigned them randomly to control (no computer) and test (given a computer) groups. This is really not common in educational research. Typically, you’d have to do an observational/correlational study. That is, you’d try to identify research subjects, find which of them already had computers and which didn’t, and look for differences in the groups. These studies are often very useful and the best we have to go on given the nature of the questions we are likely to ask. You can’t, for example, assign poverty as a condition to some kids and not to others. (And, obviously, it would be unethical if you could.) But experiments, where researchers actually cause the difference between experimental and control groups – some methodologists say that there must be, in some sense, a physical intervention to manipulate independent variables – are the gold standard because they are the studies where we can most carefully assess cause and effect. Giving one set of kids computers certainly qualifies as a physical intervention.
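To make the difference between the two designs concrete, here is a minimal sketch in Python. Everything in it is a made-up assumption for illustration, not anything taken from the Fairlie and Robinson study: a hypothetical standardized "income" variable, a score that depends on income, and a computer that has zero true effect. In the observational comparison, richer families are more likely to own a computer, so owners appear to score higher; under coin-flip assignment the gap should shrink toward zero.

```python
import random
from statistics import mean

def simulate(n=20000, seed=0):
    """Toy comparison of observational vs. randomized designs.

    All quantities are hypothetical. The computer has NO true effect
    on scores; only family income does.
    """
    rng = random.Random(seed)
    obs_owner, obs_non = [], []      # observational groups
    exp_treat, exp_ctrl = [], []     # randomized groups
    for _ in range(n):
        income = rng.gauss(0, 1)                 # assumed standardized income
        score = 0.5 * income + rng.gauss(0, 1)   # score depends on income only
        # Observational world: higher income makes ownership more likely.
        owns = income + rng.gauss(0, 1) > 0
        (obs_owner if owns else obs_non).append(score)
        # Experimental world: a coin flip decides who gets a computer.
        assigned = rng.random() < 0.5
        (exp_treat if assigned else exp_ctrl).append(score)
    obs_gap = mean(obs_owner) - mean(obs_non)
    exp_gap = mean(exp_treat) - mean(exp_ctrl)
    return obs_gap, exp_gap

if __name__ == "__main__":
    obs_gap, exp_gap = simulate()
    print(f"observational gap: {obs_gap:.3f}")
    print(f"randomized gap:    {exp_gap:.3f}")
```

The point of the sketch is only the logic: the observational gap here is produced entirely by income, which is exactly the kind of confound that random assignment is meant to break.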

And the results are clear: it just doesn’t matter. Grades, test scores, absenteeism and more… no impact. The study is largely accessible to a general audience, save for some discussion of their statistical controls, and I encourage you to peruse it on your own.

In its irrelevance for academic outcomes, owning a personal computer joins a whole host of other educational interventions via digital technology that have washed out completely. But hope springs eternal. I couldn’t help but laugh at this interview of Marc Andreessen in Vox, as it’s so indicative of how this conversation works. Andreessen makes outsized claims about the future impacts of technology. Timothy Lee points out that these claims have never come true in the past. Andreessen simply asserts that this will change, and Lee dutifully writes it down. That is the basic trend, always: the repeated failures of technology to make actually meaningful impacts on student outcomes will always be hand waved away; progress is always coming, next year, or the year after that, or the next. Meanwhile, we had the internet in my classrooms in my junior high school in 1995. Maybe it’s time to stop waiting for technology to save us.

But then again, there are iPads to sell….

addressing some complaints

There was a lively discussion about my last post on Facebook yesterday. There were a lot of enthusiastic people participating. Let me address some common complaints.

I am mad because I believe something that you expressly agree with. The most depressing response was all of the people who made claims that I had myself made in the post and then represented that as criticism. That is, dozens of people made statements that they imagined contradicted me, even though those statements were points I had made in the very piece they were attempting to critique. This problem can generally be avoided if you take the radical step of reading what you are responding to.

Things people floated as disagreements that I explicitly said in the original piece include:

  • Race is a social construct
  • Genetics do not explain all the variation in IQ tests and other quantitative measures of academic ability
  • The precise amount of variation explained by genetics is contested
  • IQ tests are not complete or comprehensive measures of human mental capacity or human worth
  • Other things than IQ are valuable and important predictors of student success
  • The definition of academic success is socially mediated and influenced by capitalism
  • There are methodological criticisms of twin studies
  • Some established researchers disagree with this line of thinking
  • More research is needed

I could go on. To each of these “criticisms,” the answer is the same: yes, I agree, that’s why I said them in the essay that you’re criticizing and clearly didn’t read.

[voice of a New England blueblood wearing a blazer with brass buttons, Nantucket Red pants, and an ascot while swirling a glass of port] “Sirrah! What are your qualifications!” My qualifications are in fact irrelevant here. I would defend them if I thought it was relevant. But I’m not doing primary research. I didn’t disappear into a lab and emerge with a new model of human cognition. I’m reading studies and books, which is what I do all day, and faithfully reporting back what I find. And what I find, and have reported, is that in the fields of genetic behaviorism and developmental psychology there is a broad agreement that academic and cognitive outcomes are significantly influenced by genetics.

That’s not really a consensus. I do not agree. I’m afraid there is no system of consensus points that I could assemble to establish this point objectively. However, given how many people within the relevant fields discuss those fields, I find it hard to dispute that there is broad agreement. Consider this from the Plomin piece I linked to:

Finding that differences between individuals (traits, whether assessed quantitatively as a dimension or qualitatively as a diagnosis) are significantly heritable is so ubiquitous for behavioural traits that it has been enshrined as the first law of behavioural genetics. Although the pervasiveness of this finding makes it a commonplace observation, it should not be taken for granted, especially in the behavioural sciences, because this was the battleground for nature-nurture wars until only a few decades ago in psychiatry, even fewer decades ago in psychology, and continuing today in some areas such as education. It might be argued that it is no longer surprising to demonstrate genetic influence on a behavioural trait, and that it would be more interesting to find a trait that shows no genetic influence….

For some areas of behavioural research—especially in psychiatry—the pendulum has swung so far from a focus on nurture to a focus on nature that it is important to highlight a second law of genetics for complex traits and common disorders: All traits show substantial environmental influence, in that heritability is not 100% for any trait.

Or this from the Turkheimer I linked to:

I too am a behavior geneticist, so it is important to conclude this response with a “lest I be misunderstood” paragraph. It is remarkable that in this day and age there continues to be a school of thought maintaining that behavior genetics is fundamentally mistaken about even weak genetic influence, that the nearly universal findings of quantitative genetics can be dismissed because of methodological assumptions of twin studies (Joseph, 2014) or contemporary findings in epigenetics (Charney, 2012). Those arguments can be evaluated on their own terms, but my point of view must not be cited in their support. Genetic influence is real and has profound methodological implications for how human behavior is studied.

Note that many people cited Turkheimer to me as a skeptic of behavioral genetics writ large. You could take The Blank Slate, which is now rather out of date but which functions as a book-length exploration of these topics. Or you could read The Nurture Assumption which had a new edition come out in 2009. There is a lot of literature out there for you to consider. Does this mean that nobody disagrees? Of course not! And I specifically said that there is controversy here. But the existence of dissenters does not mean that there is not broad agreement. They could all be mistaken. But I’m not mistaken for saying that this is a widespread belief in the relevant fields. Please stop saying I’m making this up.

This blog post is not an exhaustive literature review! No. That’s true. I’m afraid I don’t have it in me to conduct one when I’m never going to be able to stick it in a tenure review. (I mean, I’m not even in a tenure track job.) Luckily, other people who receive more direct professional incentives for doing literature studies have already put them together. I linked to the Plomin article because it is a very recent review that includes citations to dozens of papers that establish a long research record. I write a lot, and I enjoy it, and I remain your humble servant, but give me a break, please.

IQ tests don’t measure anything. I’m sorry, you guys, but you have to drop this one. It is not true. The predictive power of IQ tests has been replicated over and over again. If I take a group of 8 year olds, or a group of 12 year olds, or a group of 16 year olds, and give them high-quality age-appropriate IQ tests, those results will be strongly predictive of various academic outcomes. Not perfectly! No one ever said they were perfectly predictive. We live in a world of variability. But in the world of social science and human research, they are remarkably well validated. If I want to know if someone will pass high school algebra, yes, IQ tests tell me something. If I want to know if someone will graduate from high school on time, yes, IQ tests tell me something. If I want to predict how selective of a college someone will go to, yes, IQ tests tell me something. They also predict a number of social and life outcomes that are not academic, although generally less well than they do academic outcomes. Click the Slate link above. The evidence is out there. This one is really a kind of know-nothingism. It’s casually destructive to keep saying that without consulting the evidence.

Now: is there something tautological about this? I think so, yes. Does this reflect assumptions about value and human worth that stem from capitalism and ideology and such? Yes, and I said so. Should we prefer a society where these things are less valued? Yes, and I said so. Are there strong objections to the manner of thinking that created the tests, and the hierarchical systems which we sort people into? Of course. I am a socialist in part because I want to tear down those systems. But if you want to attack IQ tests, attack their weaknesses, not their strengths. That is, don’t attack their predictive validity; attack the social and economic framework in which they are potentially destructive. That leads to the next complaint.

You invented capitalism, “meritocracy,” and test mania. Uh, I in fact did not. So many of the complaints I’ve received have been about systems and ways of thinking that I openly oppose. Yes, it’s true that it’s bad to reduce human value to test scores – but I am against that and I said so. Yes, it’s true that this kind of thinking can lead to pernicious tracking systems and restrictions on opportunity – but I’m against that and I said so. I am attempting to describe problems with a system from within that system. That does not imply my endorsement of that system, only my understanding that we are currently in it.

The fact of the matter is that American education policy is being written by people who are obsessive about quantitative metrics of academic performance generally and test scores particularly, and who believe against all evidence that all students can reach the same arbitrary performance standards. That is a recipe for disaster for our public schools. To mount an argument against this situation, I have to be able to address the problems in a way that does not preemptively assume a radical critique that people with power in our education system are unlikely to share. Does that make sense?

I cherry-picked this one study that disputes what you’re saying. You can do that. I will read those studies with interest if I haven’t already, when I find the time. As I said, repeatedly, there is still work to be done and there are still controversies here. My mind is not closed. My honest take on the extant evidence is that these dissenting studies, while interesting and valuable, are not sufficient to counter the general trend. I could be wrong. But I’m laying out a case here and particularly citing a lot of qualified people who have made the same case.

Only twin studies have shown this result. Nope.

A belief in genetic influences on academic performance is incompatible with a belief that the racial achievement gap is the product of socioeconomic inequality. Not so. Let me argue by analogy.

Suppose I wanted to study what variables impact how high people can jump. Most people would not dispute that your genetics has an impact. We are not all equal when it comes to our natural talents for physical activities like jumping. Children of high jump Olympians will tend (tend!) to jump higher than the average person. Of course, there’s also substantial non-genetic variation in play – the amount you train, your diet and nutrition, etc. To say that jumping ability is substantially genetic is not the same as saying that it is exclusively genetic. And in fact most children of Olympic high jumpers will not go on to be Olympic high jumpers themselves, just as most geniuses do not have genius parents or children even though there is a significant genetic influence on intelligence.

Now, let’s suppose that a certain portion of society – like, say, black and Hispanic people – are fitted by society with heavy weight belts at birth. These weight belts would, obviously, constrain the ability of black and Hispanic people to jump high. If you simply looked at the average heights of jumps by racial groups, you might conclude that black and Hispanic people are genetically predisposed to being bad jumpers. But of course, when you’re wearing a weight belt, it’s hard to jump high.

Now: does arguing that the weight belt is creating a perceived difference in jumping ability mean that genetic explanations are invalid? Of course not. It means that whatever genetic predisposition individuals have is being washed out by the weight belt. The existence of the weight belts is not an argument against genetic influence on jumping ability. It’s instead a non-genetic variable that produces a group difference. Were the weight belts to be removed from black and Hispanic people, there would still be substantial genetic variation between individuals in their ability to jump. We would just find the average to be higher relative to other groups.

Of course, in this silly analogy, white supremacy and its many manifestations are the weight belt. Yes, as Charles Murray types always insist, income band alone does not sufficiently explain various aspects of the racial achievement gap. But then, who ever said racial inequality is only about income gaps? Racial inequality is a profoundly multivariate phenomenon. It manifests itself in all sorts of ways. And I don’t believe that the “human biodiversity” types have come close to accounting for the influence of all of those variables.

My belief is that, if and when we remove the weight belt of white supremacy from black and Hispanic people, the racial achievement gap will disappear, and at scale we’ll see equivalent academic performance across groups. But we’ll still also see substantial variation between individuals; the racial groupings will be proportionately arranged around performance bands, but there will still be people who do better or worse in school/on IQ tests. And that variation, the evidence suggests, is significantly (but not completely, of course) influenced by genetics.
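For readers who like to see the logic of the weight belt analogy run as numbers, here is a toy sketch in Python. It is only an illustration of the argument's structure: the jump heights, the size of the belt penalty, and the "genes" variable are all invented assumptions, and nothing here is estimated from real data. Every group draws its individual genetic potential from the same distribution; the only difference is whether a belt is subtracting from the result.

```python
import random
from statistics import mean, pstdev

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

def simulate_group(n, belt, rng):
    """Hypothetical jumpers: same genetic distribution, optional belt penalty."""
    genes = [rng.gauss(0, 1) for _ in range(n)]      # individual potential
    jumps = [50 + 5 * g + rng.gauss(0, 5) - belt     # genes + other factors - belt
             for g in genes]
    return genes, jumps

def run(n=10000, belt=8.0, seed=1):
    rng = random.Random(seed)
    results = {}
    for name, penalty in [("no_belt", 0.0),
                          ("belt", belt),
                          ("belt_removed", 0.0)]:
        genes, jumps = simulate_group(n, penalty, rng)
        # Store the group average and the within-group gene/jump correlation.
        results[name] = (mean(jumps), pearson(genes, jumps))
    return results

if __name__ == "__main__":
    for name, (avg, corr) in run().items():
        print(f"{name:13s} mean jump = {avg:6.2f}  within-group corr = {corr:.2f}")
```

In this toy setup the belted group's average is lower by roughly the size of the belt, while the within-group correlation between genes and jumping is about the same in every group. Take the belt off and the averages line up again, but individuals still differ.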

Other factors complicate and attenuate the genetic influence on IQ and academic performance. Of course they do. I said so in the piece! The presence of other variables does not imply that there is no variation influenced by genetics. Some have cited Angela Lee Duckworth and the importance of conscientiousness as a counter to my post, but Duckworth herself explicitly says IQ/g/native intelligence are also important. I never disputed that. In fact, I wrote 2000 words on a study exploring this connection literally last week!

You’re a genetic determinist! This is eugenics! No I’m really, really not, and it really, really isn’t. In fact, I embedded that post with so many caveats and qualifiers that I am absolutely amazed that people are so affronted by it. I’m making a very mild version of a generally uncontroversial argument.

Here is what I am saying. Biological children tend to resemble their biological parents in all manner of academic outcomes, and this similarity increases over the course of life. This relationship is not perfect and no one has ever claimed that it is. However, it is powerful, particularly in the context of studying human variation. In contrast, adoptive children are not much more like their adoptive parents than they are like random strangers. Identical twins reared apart are more like each other than they are like adoptive siblings; adoptive siblings are not much more like each other than they are like random strangers. These observations have proven to be durable in a variety of studies over the course of decades conducted by established researchers at respected institutions. Perhaps new evidence will cast them into doubt; we’ll see. For now I can only work based on the information available. I think that these observations have obvious and important consequences for our educational policy, and I think it’s a good idea for progressive people to think about them. Yes, they have some potentially disturbing implications. But that’s all the more reason to be able to confront them clearly and rationally as we think about what kind of society we want to be.

disentangling race from intelligence and genetics, or how to rescue behavioral genetics from racists

Here are two things that I believe to be true:

  • Bigoted ideas about fundamental intellectual inequalities between demographic groups are wrong. Black people aren’t less intelligent than white, women aren’t bad at science, Asian people do not have natural facility for math, etc.
  • Genetics play a substantial role in essentially all human outcomes, including what we define as “intelligence” or academic ability.

Both of these things, I think, are true. The evidence for both seems very strong to me. And in fact it’s not hard at all to believe both of them at the same time. Yet I find it almost impossible for some progressive people to recognize that we can believe both things at the same time.

Take this recent piece about pseudoscientific racism. The author, Nicole Hemmer, is typical in that she seems to think that any discussion of genetics and intelligence implies racist notions of inherent inequalities between racial groups. At the very least, she does nothing to separate a belief in genetic influences on IQ from the notion that some races are inherently more intelligent than others, when those ideas must be carefully separated. Here’s a typical passage.

Murray and Herrnstein’s book, The Bell Curve, was published in 1994, generating immediate controversy for its arguments that IQ was heritable, to a significant degree, and unchangeable to that extent; that it was correlated to both race and to negative social behaviors; and that social policy should take those correlations into account.

I kept waiting for Hemmer to pull these separate claims apart and show what’s correct and what’s wrong, but she never does. Throughout the piece she moves through the claims of people like Charles Murray without bothering to identify the truths on which they then build lies. That’s perhaps understandable, as it’s easy to simply want to wash our hands of the whole thing. But that’s a mistake. That some races are genetically superior to others is a racist fiction. That IQ is significantly heritable and unchangeable is an empirical fact. On this essential intellectual task – untangling the difference between racist pseudoscience and the science of genetic influence on human psychological outcomes – Hemmer is silent. And she’s joined in that failure by far too many liberals I know, who often get visibly anxious any time genetics and intelligence are discussed at all, as if racist conclusions must necessarily follow. This is a problem.

I am, for context, not at all a genetic determinist, compared to many other people who talk about these issues. The world is filled with people who argue as if genetics is destiny. I’m largely an amateur when it comes to these questions, but I’m willing to say that I am skeptical of the confidence and universality with which some researchers assert genetic causes for human outcomes. And there are some real methodological challenges to typical procedures for identifying genetic influences. Still, as someone with a background in academic assessment and educational testing, I find it impossible to avoid the conclusion that there is significant genetic influence on essentially all measurable human traits, including academic outcomes. In particular, that IQ is significantly heritable is one of the most robust and well-replicated findings in the history of social science. That’s the reality.

If you’d like a recent study that aggregates a lot of the evidence, this by Plomin and Deary is a great place to start. If you’d like a broad overview of what genetics research has – and crucially, has not – found in recent years, I highly recommend this article by Eric Turkheimer on the weak genetic explanation, even for those without any background in psychometrics. Turkheimer is a poised and measured writer, one who has never spoken with the zealotry common to genetic behaviorists. I encourage you to read the article.

As time goes on, the evidence for the influence of genetics on individual human variation only grows. That includes intelligence and much more. Do racist conclusions necessarily follow? Not at all. Genetics is about parentage, not race. If I claim that a trait is heritable, I am making a claim about the transmission of that trait through biological parentage – mother and father to daughter and son. Extrapolating to the socially-mediated construct of race is irresponsible and unwarranted.

Simply consider the differences in the paths of genetic information we’re talking about here. While unsolved questions still abound in genetic research, the general mechanisms through which genetic information is passed down within families have been well understood for decades. We know how parents contribute genetic material to children, and we thus know how grandparents and great-grandparents influence genotype too. If we say that a particular trait runs in families, we can look through very clear lines of descent to show how genetic information is passed along. We know more or less how an individual genotype is formed, we know how various generational connections contribute different pieces of genetic data, and we know more and more about how genotype defines phenotype.

Contrast that with the construct of race. What does it mean to call two people “Asian”? The connection between, say, a third generation Hmong American college student whose family came to Santa Barbara as refugees from Vietnam and an Indian IT specialist whose family has lived in Madurai for generations seems, uh, unclear to me. Yes, I understand that there are phenotypical markers which often (but not always) indicate closer common ancestry between individuals. But “closer” here still can mean people whose families branched off the family tree hundreds of generations ago, making the genetic connections extremely distant. Low-cost genetic testing has revealed vast complexity in the genealogy of individuals and groups, with once simple stories of descent everywhere complicated by intermixing and the tangled lines of history. (I have bad news for the alt-right: the volk does not exist and never did.) Meanwhile, the concept of race entails vastly more baggage than just genetic lineage, all of the cultural and social and linguistic and political markers that we have, as a species, decided to package with certain phenotypical markers, historically for the purpose of maintaining white supremacy. To suggest that this process of racialization must be implied by acknowledging genetic influences on individual human outcomes is, well, thinking like Charles Murray.

If nothing else, I think it’s profoundly important that everyone understands that the belief that genetics influence intelligence does not imply a belief in “scientific” racism. In fact, most of the world’s foremost experts on genetic behaviorism believe the former and not the latter.

None of this is to deny that intelligence itself is a socially-mediated concept. What we think of as intelligence is always impacted by social and economic values. When Jews began to enter elite American colleges in large numbers, those colleges suddenly discovered the importance of “character” as a part of intelligence, conveniently grafting culture-specific ideas about what it means to be intelligent into their admissions processes in order to ensure that enough WASP men from “the right families” made it in. Right now, we favor a definition of intelligence that is high on the kind of raw abstract processing that enables one to make a living on Wall Street or in Silicon Valley. That we have disregarded emotional intelligence, social consciousness, or ethical reasoning tells you a lot about why those industries are filled with sociopathic profiteers. This does not mean that IQ testing doesn’t tell us anything meaningful; IQ tests measure consistent and durable traits and are predictive of a number of academic and social outcomes related to those traits. It does mean, though, that our decision to reward this particular set of abilities is a choice, and one that I would argue has had deeply pernicious impacts on our society. The ability to score highly on Raven’s Progressive Matrices does tell us something about the likelihood that you will pass high school algebra or be good at chess. It does not tell us your worth as a human being, as worth is a concept created by humans. We decide who has value. That we distribute that designation so stingily is a product of capitalism, not of genetics.

Nor do you have to adopt a depressing, Gattaca-style assumption that genes are destiny. Read the Plomin and Deary; read the Turkheimer. As Turkheimer points out, the strong explanation – “a gene for X” – has largely not come true. As Plomin and Deary point out, no traits are 100% heritable, with environment, opportunity, privilege, and chance all playing a role in outcomes. Besides, inherited human traits tend to be the product of the interaction between many genes. For this reason, geniuses are often the children of parents with no particularly unusual intellectual aptitude. We live in a world of variability. Nothing is certain. And again, one of our crucial social and political tasks must be to fight against the assumption that only those who can do complex equations are worthwhile human beings. No matter how hard I worked, I could never have been a research physicist; I simply do not have the facility for advanced math. Yet I maintain a stubborn belief that I have value and can contribute to the human race. So can everybody else, in their own particular ways. There are so many ways to be a good human being, but we reward very few, and to our shame. (And by the way: human quantitative processing powers are the most likely to be replaced by automation in the workplace of the future, so don’t get too comfortable, smarties.)

I also think people sometimes avoid this topic because they’re afraid it leads to conservative political conclusions. Some conservatives seem to think that too. I find that bizarre: if intellectual talent leads to financial security under capitalism, and intellectual talent is largely outside of the control of individuals, that amounts to one of the most powerful arguments for socialism I can imagine. An outcome individuals cannot control cannot morally be used to determine their basic material conditions.

In any event: as long as we value intelligence in the way we do, progressive people must be willing to be honest about the existence of inherent differences between individuals in academic traits. When we act as if good schooling and committed teachers can bring any student to the pinnacle of academic achievement, we are creating entirely unfair expectations. Meanwhile, failure to recognize the impact of genetics on academic outcomes leaves us unable to combat an increasingly rigid social hierarchy. I often ask people, what happens after we close the racial achievement gap? What becomes the task then? Precisely because I don’t believe in pseudoscientific racism, I believe that we will eventually close the racial achievement gap, if we are willing to confront socioeconomic inequality directly and with government intervention. But what happens then? We will still have a distribution of academic talent. It will simply be a distribution with proportional numbers of black people, of women, of LGBTQ people…. Does it therefore follow that those on the bottom of the talent distribution will deserve poverty, hopelessness, and marginalization? I can’t imagine how that could be perceived as a just outcome. But if progressive people fear getting involved in these discussions out of a vague sense that any link between genetics and academic ability is racist, they will not be able to help shape the future.

Liberals have flattered themselves, since the election, as the party of facts, truth tellers who are laboring against those who have rejected reason itself. And, on certain issues, I suspect they are right. But let’s be clear: the denial of the impact of genetics on human academic outcomes is fake news. It’s alternative facts. It’s not the sort of thing the reality-based community should be trafficking in. As I said, I’m not a zealot on these topics. I read critical pieces about genetic behaviorism with care. I find a lot of genetic determinists and IQ absolutists frustrating, occasionally downright creepy. And I am willing to be surprised by new evidence. But the strength of the current evidence is overwhelming. Denying that IQ and other metrics of academic and intellectual ability are substantially heritable is as contrary to scientific consensus as the denial of global warming. This belief does not at all imply belief in racist pseudoscience. It does, however, imply a willingness to trust scientific evidence in precisely the way progressive people insist we must.

Update: Do you have questions? I have answers.

Too Hot for Academic Journals: Lexical Diversity and Quality in L1 and L2 Student Essays

Today I’m printing a pilot study I wrote as a seminar paper for one of my PhD classes, a course in researching second language learning. It was one of the first times I did what I think of as real empirical research, using an actual data set. That data set came from a professor friend of mine. It was a corpus of essays written for a major test of writing in English, often used for entrance into English-language colleges and universities, and developed by a major testing company. The essays came packaged with metadata including the score they received, making them ideal for investigating the relationship between textual features and perceived quality, then as now a key interest of mine. And since the data had been used in real-world testing with high stakes for test takers, it added obvious exigence to the project. The data set was perfect – except for the very fact that it was from Big Testing Company, and thus proprietary and subject to their rules about using their data.

That’s why, when I got the data from my prof, she said “you probably won’t want to try and publish this.” She said that the process of getting permission would likely be so onerous that it wouldn’t be worth trying to send it out for review. That wasn’t a big deal, really – like I said, it was a pilot study, written for a class – but this points to broader problems with how independent researchers can vet and validate tests that are part of a big money, high-stakes industry.

Here’s the thing: often, to use data from testing companies like Big Testing Company, you have to submit your work for their prior review at every step of the revision process. And since you will have to make several rounds of changes for most journals, and get Big Testing Company to sign off on them, you could easily find yourself waiting years and years to get published. So for this article I would have had to send them the paper, wait months for them to say if they were willing to review it and then send me revisions, make those revisions and send the paper back, wait for them to see if they’d accept my revisions, submit it to a journal, wait for the journal to get back to me with revision requests, make the revisions for the journal, then send the revised paper back to the testing company to see if they were cool with the new revisions…. It would add a whole new layer of waiting and review to an already long and frustrating process.

So I said no thanks and moved on to new projects. I suspect I’m not alone in this; grad students and pre-tenure professors, after all, have time constraints on how long the publication process can take, and that process is professionally crucial. Difficulties in obtaining data on these tests amount to a powerful disincentive for conducting research on them, which in turn leaves us with less information about them than we should have, given the roles they play in our economy. Some of these testing companies are very good about doing rigorous research on their own products – ETS is notable in this regard – but I remain convinced that only truly independent validation can give us the confidence we need to use them, especially given the stakes for students.

As for the study – please be gentle. I was a second-semester PhD student when I put this together. I was still getting my sea legs in terms of writing research articles, and I hadn’t acquired a lot of the statistical and research methods knowledge that I developed over the course of my doctoral education. This study is small-n, with only 50 observations, though the results are still significant to the .05 alpha that is typical in applied linguistics. Today I’d probably do the whole set of essays. I’d also do a full-bore regression etc. rather than just correlations. Still, I can see the genesis of a lot of my research interests in this article. Anyway, check it out if you’re interested, and please bear in mind the context of this research.


Lexical Diversity and Quality in L1 and L2 Student Essays

Introduction and Rationale.

Traditionally, linguistics has recognized a broad division within the elementary composition of any language: the lexicon of words, parts of words, and idiomatic expressions that make up the basic units of that language’s meanings, and the computational system that structures them to make meaning possible. In college writing pedagogy, our general orientation is to higher-order concerns than either of these two elementary systems (Faigley and Witte). College composition scholars and instructors are more likely to concern themselves with rhetorical, communicative, and disciplinary issues than with the two elementary systems, which they reasonably believe to be too remedial to be appropriate for college level instruction. This prioritization of higher-order concerns persists despite tensions with students, who frequently focus on lower-order concerns themselves (Beach and Friedrich). Despite this resistance, one half of this division receives considerable attention in college writing pedagogy. Grammatical issues are enough of a concern that research has been continuously published concerning how to address them. Books are published that deal solely with issues of grammar and mechanics. This grudging attention persists, despite theoretical and disciplinary resistance to it, because of a perceived exigency: without adequate skills in basic English grammar and syntax, writers are unlikely to fulfill any of the higher-order requirements typical of academic writing.

In contrast, very little attention has been paid— theoretically, pedagogically, or empirically— to the lexical development of adult writers. Consideration of vocabulary is dominantly concentrated in scholarly literature of childhood education. Here, the lack of attention is likely a combination of both resistance based on the assumption that such concerns are too rudimentary to be appropriate for college instruction, as with grammatical issues, and also because of a perceived lack of need. Grammatical errors, after all, are typically systemic— they stem from a misunderstanding or ignorance of important grammatical “moves,” which means they tend to be replicated within assignments and across assignments. A lack of depth in vocabulary, meanwhile, does not result in observable systemic failures within student texts. Indeed, because a limited vocabulary results in problems of omission rather than of commission, it is unlikely to result in identifiable error at all. A student could have a severely limited vocabulary and still produce texts that are entirely mechanically correct.

But this lack of visibility in problems of vocabulary and lexical diversity should not lead us to imagine that limited vocabulary does not represent a problem for student writers. Academic writing often functions as a kind of signaling mechanism through which students and scholars demonstrate basic competencies and shared knowledge that indicates that they are part of a given discourse community (Spack). Utilizing specialized vocabulary is a part of that. Additionally, writing instructors and others who will evaluate a given student’s work often value and privilege complexity and diversity of expression. What’s more, the use of an expansive vocabulary is typically an important element of the type of precision in writing that many within composition identify as a key part of written fluency.

Issues with vocabulary are especially important when considering second language (L2) writers. Part of the reason that most writing instructors are not likely to consider vocabulary as an element of student writing lies in the fact that, for native speakers of a given language, vocabulary is principally acquired, not learned. Most adults are already in possession of a very large vocabulary in their native language, and those who are not are unlikely to have entered college. For L2 writers, however, we cannot expect similar levels of preexisting vocabulary. Vocabulary in a second language, research suggests, is more often learned than acquired. The lexical diversity of a given L2 writer is likely influenced by all of the factors that contribute to general second language fluency, such as amount of prior instruction, quality of instruction, opportunities for immersion, exposure to native speakers, access to resources, etc. Further, because some L2 writers return to their country of origin, or otherwise frequently converse in their L1, they may lack opportunities to develop their vocabulary equivalent to those of their L1 counterparts. In sum, the challenge of adequate vocabulary can reasonably be expected to be higher for L2 writers.

If it can be demonstrated that diversity in vocabulary in fact has a significant impact on perceptions of quality in student essays, we might be inspired to alter our pedagogy. Attention to development of vocabulary might be a necessary part of effective second language writing instruction. Such pedagogical evolution might entail formal vocabulary teaching with testing and memorization, or greater reading requirements, or any number of instruments to improve student vocabulary. But before such changes can be implemented, we first must understand whether diversity in vocabulary alters perceptions of essay quality and to what degree. This research is an attempt to contribute to that effort.

Theoretical Background.

The calculation of lexical diversity has proven difficult and controversial. The simplest method for measuring lexical diversity lies in simply counting the number of different words (NDW) that appear in a given text. (This figure is now typically referred to as types.) In some research, only words with different roots are counted, so that inflectional differences do not alter the NDW; in some research, each different type is counted separately. The problems with NDW are obvious. The figure is entirely dependent on the length of a given text. It’s impossible to meaningfully compare a text of 50 words to a text of 75 words, let alone to a text of 500 words or 3,000 words. Problems with scalability— the difficulty in making meaningful measures across texts of differing lengths— have been the most consistent issue with attempts to measure lexical diversity.

The most popular method to address this problem has been Type-to-Token Ratio, or TTR. TTR is a simple measurement where the number of types is divided by the number of tokens, giving a proportion between 0 and 1, with a higher figure indicating a more diverse range of vocabulary in the given sample. A large amount of research has been conducted utilizing TTR over a number of decades (see Literature Review). However, the discriminatory power of TTR, and thus its value as a descriptive statistic, has been seriously disputed. These criticisms are both empirical and theoretical in nature. Empirically, TTR has been shown in multiple studies to steadily decrease with sample size, making it impossible to use the statistic to discriminate between texts of different lengths and thus robbing it of explanatory value (Broeder; Chen and Leimkhuler; Richards). David Malvern et al explain the theoretical reason for this observed phenomenon:

It is true that a ratio provides better comparability than the simple raw value of one quantity when the quantities in the ratio come in fixed proportion regardless of their size. For example, in the case of the density of a substance, the ratio (mass/volume) remains the same regardless of the volume from which it is calculated. Adding half as much again to the volume will add half as much to the mass… and so on. Language production is not like that, however. Adding an extra word to a language sample always increases the token count (N) but will only increase the type count (V) if the word has not been used before. As more and more words are used, it becomes harder and harder to avoid repetition and the chance of the extra word being a new type decreases. Consequently, the type count (V) in the numerator increases at a slower rate than the token count (N) in the denominator and TTR inevitably falls. (22)

This loss of discriminatory power over sample size renders TTR an ineffective measure of lexical diversity. Many transformations of TTR have been proposed to address this issue, but none of them have proven consistently satisfying as alternative measures.
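The fall in TTR that Malvern et al describe is easy to reproduce. What follows is a minimal Python sketch and not part of the original study: the function names and the toy 300-word vocabulary are invented for illustration, and the "text" is a string of random draws rather than real prose.

```python
import random


def type_token_ratio(tokens):
    """Return types / tokens for a list of word tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def ttr_by_prefix_length(tokens, lengths):
    """Compute TTR over the first n tokens of the text for each n in lengths."""
    return {n: type_token_ratio(tokens[:n]) for n in lengths}


# Toy "text": 2,000 tokens drawn with replacement from a made-up 300-word vocabulary.
rng = random.Random(0)
vocabulary = [f"w{i}" for i in range(300)]
toy_text = [rng.choice(vocabulary) for _ in range(2000)]

ttr_curve = ttr_by_prefix_length(toy_text, [50, 500, 2000])
for n, ttr in ttr_curve.items():
    print(n, round(ttr, 3))
```

Because the vocabulary is capped at 300 types, the ratio for the full 2,000 tokens cannot exceed 0.15, while a 50-token prefix has almost no repetition yet. Real prose behaves less mechanically, but the same pressure applies.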

One of the most promising metrics for lexical diversity is D, derived from the vocd algorithm. Developed by Malvern et al, and inspired by theoretical statistics described by Thompson and Thompson in 1915, the process utilized in the generation of D avoids the problem of sample size through reference to ideal curves. The algorithm, implemented through a computer program, draws sets of random samples from the target text, beginning with samples of 35 tokens, then 36, etc., up to samples of 50 tokens. The TTRs of these samples are then compared to a series of ideal curves that are generated based on the highest and lowest possible lexical diversity for a given text. This relative position, derived from the curve fitting, is expressed as D, a figure that represents rising lexical diversity as it increases. Since the sampling is random, each run returns a slightly different value for D, and these values are averaged to reach Doptimum. See Figure 1 for a graphical representation of the curve fitting of vocd.

Figure 1. “Ideal TTR versus token curves.” Malvern et al. Lexical Diversity and Language Development, pg. 52

D has proven to be a more reliable statistic than those based on TTR, and it has not been subject to sample size issues to the same degree as other measures of lexical diversity. (See Limitations, however, for some criticisms that have been leveled against the statistic.) The calculation of D and the vocd algorithm are quite complex and go beyond the boundaries of this research. An in-depth explanation and demonstration of vocd and the generation of D, including a thorough literature review, can be found in Malvern et al’s Lexical Diversity and Language Development (2004).
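For readers who want a rough feel for the curve fitting, here is a simplified Python sketch. It is not the CLAN implementation and was not used in this study. It assumes the model curve given in Malvern et al, TTR = (D/N)[(1 + 2N/D)^(1/2) − 1], with 100 random samples at each size from 35 to 50 tokens. The coarse grid search, the 1 to 300 search range for D, and all function names are my own simplifications, and the final step of averaging D over repeated runs is left out.

```python
import math
import random


def model_ttr(n, d):
    """TTR predicted by the model curve for a sample of n tokens and diversity d."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)


def mean_sample_ttr(tokens, n, trials, rng):
    """Average TTR over `trials` random n-token samples drawn without replacement."""
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(tokens, n)
        total += len(set(sample)) / n
    return total / trials


def estimate_d(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Least-squares fit of d to the empirical mean-TTR curve (coarse grid search)."""
    if len(tokens) < max(sizes):
        raise ValueError("text is too short for the requested sample sizes")
    rng = random.Random(seed)
    observed = {n: mean_sample_ttr(tokens, n, trials, rng) for n in sizes}

    def squared_error(d):
        return sum((model_ttr(n, d) - ttr) ** 2 for n, ttr in observed.items())

    # Assumed search range: d from 1.0 to 300.0 in steps of 0.1.
    candidates = [step / 10 for step in range(10, 3001)]
    return min(candidates, key=squared_error)
```

The length check in estimate_d is also why very short essays cannot be scored: a text needs at least 50 tokens before the largest sample can be drawn at all.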

Research Questions.

My research questions for this project are multiple.

  • How diverse is the vocabulary of L1 and L2 writers in standardized essays, as operationalized through lexical diversity measures such as D?
  • What is the relationship between quality of student writing, as operationalized through essay rating, and the diversity of vocabulary, as operationalized through measures of lexical diversity such as D?
  • Is this relationship equivalent between L1 and L2 writers? Between writers of different first languages?

Literature Review.

As noted in the Rationale section, the consideration of diversity in vocabulary in composition studies generally and in second language writing specifically has been somewhat limited, at least relative to attention paid to strictly grammatical issues or to higher-order concerns such as rhetorical or communicative success. This lack of attention is interesting, as some standardized tests of writing that are important aspects of educational and economic success explicitly mention lexical command as an aspect of effective writing.

Some research has been conducted by second language researchers considering the importance of vocabulary to perceptions of writing quality. In 1995, Cheryl Engber published “The Relationship of Lexical Proficiency to the Quality of ESL Compositions.” This research involved the holistic scoring of 66 student essays and comparison to four measures of lexical diversity: lexical variation, error-free variation, percentage of lexical error, and lexical density. These measures considered not only the diversity of displayed vocabulary but also the degree to which the demonstrated vocabulary was used effectively and appropriately in its given context. Engber found that there was a robust and significant correlation between a student’s (appropriate and free of error) demonstrated lexical diversity and the rating of that student’s essay. However, the research utilized the conventional TTR measure for lexical diversity, which is flawed for the reasons previously discussed. In 2000, Yili Li published “Linguistic characteristics of ESL writing in task-based e-mail activities.” Li’s research considered 132 emails written by 22 ESL students, which addressed a variety of tasks and contexts. These emails were subjected to linguistic feature analysis, including lexical diversity, as well as syntactic complexity and grammatical accuracy. Li found that there were slight but statistically significant differences in the lexical diversity of different email tasks (Narrative, Information, Persuasive, Expressive). She also found that lexical diversity was essentially identical between structured and non-structured writing tasks. However, she too used the flawed TTR measure for lexical diversity. In the context of the period of time in which these researchers conducted their studies, the use of TTR was appropriate, but its flaws have eroded the confidence we can place in such research.

The most directly and obviously useful precedent for my current research was conducted by Guoxing Yu and published in 2009 under the title “Lexical Diversity in Writing and Speaking Task Performances.” Having been published within the last several years, Yu’s research is new enough to have assimilated and reacted to the many challenges to TTR and related measures of lexical diversity. Yu’s research utilizes D as measured via the vocd algorithm that also was used in this research. Yu also correlated D with essay rating. However, Yu’s research was primarily oriented towards comparing and contrasting written lexical diversity with spoken lexical diversity and the influence of each on perceptions of fluency or quality. My own research is oriented specifically towards written communication. Additionally, Yu’s research utilized essays that were written and rated specifically for the research, to approximate the type of essays typically written for standardized tests, but also understood more generally. My own research utilizes a data set of essays that were specifically written and rated within the administration of a real standardized test, [REDACTED] (see Research Subjects). Also in 2009, Pauline Foster and Parvaneh Tavakoli published a consideration of how narrative complexity affected certain textual features of complexity, fluency, and lexical diversity. Like Yu, Foster and Tavakoli utilized D as a measure of lexical diversity. Among other findings, their research demonstrated that the narrative complexity of a given task did not have a significant impact on lexical diversity.

Research Subjects.

For this research, I utilized the [REDACTED] archive, a database of essays that were submitted for the writing portion of the [REDACTED]. These essays were planned and composed by test takers, in a controlled environment, in 30 minutes. These essays were then rated by trained raters working for [REDACTED], holistically scored between 1 (the worst score) and 6 (the best score). Essays which earned the same rating by each rater are represented in this research through whole numbers ending in 0 (10, 20, 30, 40, 50, 60). Essays where one rater gave one score and the other rater gave one point higher or lower are represented through whole numbers ending in 5 (15, 25, 35, 45, 55). Essays where the raters assigned scores that were discordant by more than a point were rescored by [REDACTED] and are not included in this sample. According to [REDACTED], the inter-rater reliability of the [REDACTED] averages .790. A detailed explanation of the test can be found in the [REDACTED].
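The two-rater coding described above amounts to summing the two scores and multiplying by five. A small Python sketch of that mapping follows. The function name is hypothetical and this is my reconstruction of the coding scheme, not code supplied with the archive.

```python
def combined_rating(rater_a, rater_b):
    """Encode two 1-6 holistic scores as the 10-60 scale used in this study.

    Returns None when the raters differ by more than one point, since those
    essays were rescored and are not part of the sample.
    """
    for score in (rater_a, rater_b):
        if score not in range(1, 7):
            raise ValueError("scores must be whole numbers from 1 to 6")
    if abs(rater_a - rater_b) > 1:
        return None
    return 5 * (rater_a + rater_b)
```

Two scores of 3 give 30, while a 3 and a 4 give 35, so a code ending in 5 always marks an essay on which the raters split by one point.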

The corpus utilized in this research includes 1,737 essays from test administrations performed in 1990. Obviously, the age of the data should give us pause. However, as the [REDACTED] writing section and standardized essay test writing have not undergone major changes in the time since then, I believe the data remains viable. (See Limitations for more.) Within the archive, test subjects are represented from four language backgrounds: English, Spanish, Arabic, and Chinese. The essays in the archive are derived from two prompts, listed below:


The [REDACTED] archive exists as a set of .TXT files that lack file-internal metadata. Instead, the essays are identified via a complex number, and the number compared against a reference list to find information on the essay topic, language background, and score. Because of this, and because of the file extension necessary for use with the software utilized in this research (see Methods), this study utilized a small subsample of 50 essays. For this reason, this research represents a pilot study. In order to control for prompt effects, all of the included essays are drawn from the second prompt, about the writer’s preferred method of news delivery. I chose 25 essays from L1 English writers and 25 essays from L1 Chinese writers, for an n of 50. Each set of essays represents a range of scores from across the available sample. Because the essays rated 10 were almost all too short to possess adequate tokens to be measured by vocd, I eliminated them. I drew five essays each at random from those rated 20, 30, 40, 50, and 60 from the Chinese L1s, for a total of 25 essays from Chinese writers. Because L1 English writers are naturally more proficient at a test of English, the English essays have a restricted range, with very few 10s, 20s, and 30s. I therefore randomly drew five essays each from those rated 35, 40, 45, 50, and 60, for a total of 25 essays from English writers.
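The sampling scheme is a simple stratified draw, and it could be automated along the following lines. This is a Python sketch only: the record keys ("essay_id", "language", "prompt", "rating") are hypothetical stand-ins for whatever the archive's reference list actually contains, and the seed is arbitrary.

```python
import random
from collections import defaultdict


def draw_stratified_sample(records, language, ratings, prompt=2, per_rating=5, seed=0):
    """Draw `per_rating` essays at random from each rating level for one L1 group.

    `records` is a list of dicts with the hypothetical keys "essay_id",
    "language", "prompt", and "rating".
    """
    by_rating = defaultdict(list)
    for record in records:
        if record["language"] == language and record["prompt"] == prompt:
            by_rating[record["rating"]].append(record)

    rng = random.Random(seed)
    sample = []
    for rating in ratings:
        pool = by_rating[rating]
        if len(pool) < per_rating:
            raise ValueError(f"only {len(pool)} essays rated {rating} for {language}")
        sample.extend(rng.sample(pool, per_rating))
    return sample
```

Under these assumptions, the Chinese L1 subsample would come from draw_stratified_sample(records, "Chinese", [20, 30, 40, 50, 60]) and the English L1 subsample from the same call with [35, 40, 45, 50, 60].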


Methods.

This research utilized a computational linguistic approach. Due to the aforementioned problems with traditional statistics for measuring lexical diversity like TTR and NDW, I used an algorithm known as vocd to generate D, the previously-discussed measure of lexical diversity that compares random samples of a given text to a series of ideal curves to determine how diverse the vocabulary of that text is. This algorithm was implemented in the CLAN (Computerized Language Analysis) software suite, a product of the CHILDES (Child Language Data Exchange System) program at Carnegie Mellon University. CLAN is freeware that provides a graphical user interface (GUI) for running several programs for typical linguistics tasks, such as generating frequency lists or finding collocations. Previously, researchers using vocd would have to perform the operations using a command line system. The integration of vocd into CLAN makes vocd easier to use and more accessible.

CLAN uses a proprietary form of file extension, .CHA, as the program suite was originally developed for the study of transcribed audio data that maintains information about pausing and temporal features. In order to utilize the [REDACTED] archive files, they were converted to .CHA format using CLAN’s “textin” program. They were then analyzed using the vocd program, which returned information on types (NDW), tokens (word count), TTR, and the various Ds obtained with each sample, along with a Doptimum derived by averaging each. An example of the output provided by CLAN is below in Figure 2.

Figure 2. CLAN Interface.

Once these data were obtained, averages for type, token, TTR, and D were generated. Then, correlation matrices were developed for Chinese L1s, English L1s, and combined data, to find correlations between Types, Tokens, TTR, D, and rating. A scatter plot was developed from the combined data’s correlation of D and rating to represent that relationship graphically.
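For anyone who wants to rebuild the correlation matrices from Appendix A, a minimal Python sketch follows. It assumes Pearson product-moment correlations. The function names are my own, and the toy numbers are invented solely to show the shape of the call. They are not data from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name: {other_name: r}} for every pair of named columns."""
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Made-up values for four imaginary essays, purely to show the call shape.
toy = {
    "types": [40, 65, 90, 120],
    "tokens": [80, 150, 240, 350],
    "rating": [20, 30, 40, 50],
}
matrix = correlation_matrix(toy)
```

Each subgroup (Chinese L1s, English L1s, combined) would be passed through correlation_matrix separately, with one list per measure.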


Results.

There are several relevant, statistically significant results from my research. Raw data is attached as Appendix A.

Figure 3. Chinese L1s Correlation Matrix

As can be seen in the correlation matrix of results from essays by Chinese L1 students (Figure 3), there is a moderate, significant correlation between D and an essay’s rating, at .498 and significant at p<.05. This suggests that, for Chinese L1s (and perhaps L2 students in general), the demonstration of diversity in vocabulary is an important part of perceptions of essay quality. The highest correlations are with the simple measures of length, type (NDW) and token (essay length). These correlations (while unusually high for this sample) confirm longstanding empirical understanding that length of essay correlates powerfully with essay rating in standardized essay tests. The moderate negative correlation between TTR and tokens adds further evidence that TTR degrades with text length.
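The significance claim for the Chinese L1 correlation can be checked by hand. The Python sketch below assumes the usual two-tailed t-test on Pearson's r with n − 2 degrees of freedom. I am not certain this is exactly the test the statistics software applied, and the critical value is copied from a standard t table rather than computed.

```python
import math


def t_statistic_for_r(r, n):
    """t statistic for testing a Pearson correlation r from n paired observations."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Reported values for the Chinese L1 subsample: r = .498, n = 25 essays.
t_value = t_statistic_for_r(0.498, 25)

# Two-tailed critical value of t for alpha = .05 with 23 degrees of freedom,
# taken from a standard t table.
CRITICAL_T_DF23_ALPHA05 = 2.069

print(round(t_value, 2), t_value > CRITICAL_T_DF23_ALPHA05)
```

With only 25 essays per group, a correlation has to be fairly large to clear this bar, which is worth keeping in mind when reading the non-significant result for the English L1s.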

Figure 4. English L1s Correlation Matrix

The subsample of English L1s displays some similarities to the correlations found in Chinese L1s. Once again, there is a significant correlation between tokens (essay length) and rating, suggesting that writing enough remains an essential part of succeeding in a standardized essay test. The correlation between rating and D is somewhat smaller than that for Chinese L1s. This might perhaps owe to an assumed lower functional vocabulary for ESL students. We might imagine a threshold of minimum functional vocabulary usage that must be met before writers can demonstrate sufficient writing ability to score highly on a standardized essay test. If true, and if L1s are likely to be in possession of a vocabulary at least large enough to craft effective essay answers, lexical diversity could be more important at the lower end of the quality scale. This might prove true especially in a study with a restricted, negatively skewed range of ratings for L1 writers. More investigation is needed. Unfortunately, this correlation is not statistically significant, perhaps owing to the small sample size utilized in this research.

Figure 5. Combined Data Correlation Matrix.

The combined results show similar patterns, as is to be expected. These results are encouraging for this research. The moderate, statistically-significant (p<.01) correlation between rating and D demonstrates that lexical diversity is in fact an important part of student success at standardized essay tests. The high correlation between rating and both types and tokens confirms longstanding beliefs that writing a lot is the key to scoring highly on standardized essay tests. TTR’s negative correlation with tokens demonstrates again that it inevitably falls with text length; its negative correlation with rating demonstrates that it lacks value as a descriptive statistic that can contribute to understanding perceived quality.

Figure 6. Scatterplot of Doptimum-Rating Correlation for Combined Data.

This scatterplot demonstrates the key correlation in this research, between D and essay rating. A general progression from lower left to upper right, with many outliers, demonstrates the moderate correlation I previously identified. This correlation is intuitively satisfying. As I have argued, a diverse functional vocabulary is an important prerequisite of argumentative writing. However, while it is necessary, it is not sufficient; essays can be written that display many different words, without achieving rhetorical, mechanical, or communicative success. Likewise, it is possible to write an effective essay that is focused on a small number of arguments or ideas, resulting in a high rating with a low amount of demonstrated diversity in vocabulary.


Limitations.

There are multiple limitations to this research and research design.

First, my data set has limitations. While the data set comes directly from [REDACTED], and represents actual test administrations, because I did not collect the data myself, there is a degree of uncertainty about some of the details regarding its collection. For example, there is reason to believe that the native English speaking participants (who naturally were unlikely to undertake a test of English language ability like the [REDACTED]) took the test under a research or diagnostic directive. While this is standard practice in the test administration world (Fulcher 185), it might introduce construct-irrelevant variance into the sample, particularly given that those taking the test on a diagnostic or research basis might feel less pressure to perform well. Additionally, the particular database of essays contains samples that are over 20 years old. Whether this constitutes a major limitation of this study is a matter of interpretation. While that lack of timeliness might give us pause, it is worth pointing out that neither the [REDACTED]’s writing portion nor standardized tests of writing of the type employed in the [REDACTED] have changed dramatically in the time since this data was collected. The benefit of using actual data from a real standardized essay test, in my view, outweighs the downside of the age of that data. A final issue with the [REDACTED] archive as a dataset for this project lies in broad objections to the use of this kind of test to assess student writing. Arguments of this kind are common, and frequently convincing. However, exigence in utilizing this kind of data remains. Tests like the TOEFL, SAT, GRE, and similar high-stakes assessments of writing are hurdles that frequently must be cleared by both L1 and L2 students alike. High stakes assessments of this type are unlikely to go away, even given our resistance to them, and so should continue to be subject to empirical inquiry.

Additionally, while D does indeed appear to be a more robust, predictive, and widely-applicable measure of lexical diversity than traditional measures like NDW and TTR, it is not without problems. Recent scholarship has suggested that D, too, is subject to reduced discrimination above a certain sample size. For example, McCarthy and Jarvis (2007) use parallel sampling and comparison to a large variety of other indexes of lexical diversity to demonstrate that D’s ability to act as a unit of comparison degrades across texts that vary by more than perhaps 300 words (tokens). McCarthy and Jarvis argue that research utilizing vocd should be restricted to a “stable range” of texts comprising 100-400 words. The vast majority of essays in the TWE archive fall inside of this range, although many of the worst-scoring essays contain fewer than 100 words and a small handful exceed 400. Four essays in my sample do not meet the 100 word threshold, with word counts (tokens) of 51, 73, 86, and 90, and one essay exceeds the 400 word threshold, with 428 words. Given the small number of essays outside of the stable range, I feel my research maintains validity and reliability despite McCarthy and Jarvis’s concerns.
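To put a number on "small," the stable-range check can be written out in a few lines of Python. This is a sketch: the function name is hypothetical, the range is treated as inclusive at both ends, and only the five out-of-range token counts reported above are listed, since the remaining 45 are in Appendix A rather than in this text.

```python
def outside_stable_range(token_counts, low=100, high=400):
    """Return the token counts that fall outside the inclusive [low, high] range."""
    return [count for count in token_counts if count < low or count > high]


# The five out-of-range essays reported in the text; the full sample has 50 essays.
reported_outliers = [51, 73, 86, 90, 428]
sample_size = 50
share_outside = len(outside_stable_range(reported_outliers)) / sample_size
print(f"{share_outside:.0%} of the sample lies outside the stable range")
```

A replication on the full archive could run every essay's token count through the same filter and report results with and without the out-of-range essays.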

Directions for Further Research.

There are many ways in which this research can be improved and extended.

The most obvious direction for extending this research lies in expanding its sample size. Due to the aforementioned difficulty in incorporating the [REDACTED] archive with the CLAN software package, less than 3% of the total [REDACTED] archive was analyzed. Utilizing all of the data set, whether through automation or by hand, is an obvious next step. An additional further direction for this research might involve the incorporation of additional advanced metrics for lexical diversity, such as MTLD and HD-D. In a 2010 article in the journal Behavior Research Methods titled “MTLD, vocd-D, HD-D: A validation study of sophisticated approaches to lexical diversity assessment,” Philip McCarthy and Scott Jarvis advocate for the use of those three metrics together in order to increase the validity of the analysis of lexical diversity. In doing so, they argue, these various metrics can help to address the various shortcomings of each other. There may, however, be statistical and computational barriers to effecting this kind of statistical analysis.
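As a rough indication of what adding MTLD would involve, here is a simplified Python sketch based on my understanding of McCarthy and Jarvis's description: a running TTR is tracked through the text, a "factor" is counted each time it falls to the default threshold of .72, a partial factor is credited for the leftover tokens, and the forward and backward passes are averaged. The handling of a TTR exactly at the threshold and of texts that never complete a factor are my own assumptions, and none of this was run on the study's data.

```python
import math


def mtld_one_direction(tokens, threshold=0.72):
    """Tokens per factor for a single pass through the text."""
    if not tokens:
        raise ValueError("cannot compute MTLD for an empty text")
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= threshold:
            # Running TTR has dropped to the threshold: close this factor.
            factors += 1
            types = set()
            count = 0
    if count > 0:
        # Credit the leftover segment with a partial factor.
        leftover_ttr = len(types) / count
        factors += (1 - leftover_ttr) / (1 - threshold)
    if factors == 0:
        # Assumption: a text with no repetition at all has unbounded MTLD.
        return math.inf
    return len(tokens) / factors


def mtld(tokens, threshold=0.72):
    """Average of the forward and backward passes."""
    forward = mtld_one_direction(tokens, threshold)
    backward = mtld_one_direction(list(reversed(tokens)), threshold)
    return (forward + backward) / 2
```

Higher values mean the writer sustains a high TTR over longer stretches of text. Unlike vocd, nothing here depends on random sampling, so the same text always yields the same score.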

Another potential avenue for extending and improving this research would be to address the age of the essays analyzed by substituting a different corpus of student essays for the [REDACTED]. This would also help with publication and flexibility of presentation, as ETS places certain restrictions on the use of its data in publication. Finding a comparable corpus will not necessarily be easy. While there are a variety of corpora available to researchers, few are as specific and real world-applicable as the [REDACTED] archive. Many publicly available corpora do not use writing specifically generated by student writers; those that do often derive their essays from a variety of tasks, genres, and assignments, limiting the reliability of comparisons made statistically. Finally, there are few extant corpora that have quality ratings already assigned to individual essays, as with the [REDACTED]. Without ratings, no correlation with D (or other measures of lexical diversity) is possible. Ratings could be generated by researchers, but this would likely require the availability of funds with which to pay them.

Finally, this research could be expanded by turning from its current quantitative orientation to a mixed methods design that incorporates qualitative analysis as well. There are multiple ways in which such an expansion might be undertaken. For example, a subsample of essays might be evaluated or coded by researchers who could assess them for a qualitative analysis of their diversity or complexity of vocabulary use. This qualitative assessment of lexical diversity could then be compared to the quantitative measures. Researchers could also explore individual essays to see how lexical diversity contributes to the overall impression of that essay and its quality. Researchers could examine essays where the observed lexical diversity is highly correlative with its rating, in order to explore how the diversity of vocabulary contributes to its perceived quality. Or they could consider essays where the correlation is low, to show the limits of lexical diversity as a predictor of quality and to better understand how outliers like this are generated.


As discussed in the Introduction and Rationale statement, this research arose from exigence. I have identified a potential gap in instruction, the lack of attention paid to vocabulary in formal writing pedagogy for adult students. I have also suggested that this gap might be especially problematic for second language learners, who might be especially vulnerable to problems with displaying adequate vocabulary, and having access to correct terms, when compared to their L1 counterparts.

Given that this research utilized a data set drawn from a standardized test of English using timed essay writing, and that many scholars in composition dispute the validity of such tests for gauging overall writing ability, the most direct implications of this study must be restricted to those tasks. In those cases, the lessons of this research appear clear: students should try to write as much as possible in the time allotted, and they should attend to their vocabulary both in order to fill that space effectively and to be able to demonstrate a complex and diverse vocabulary. Precise methods for this kind of self-tutoring or instruction are beyond the boundaries of this research, but both direct vocabulary instruction (such as with word lists and definition quizzes) and indirect (such as through reading challenging material) should be considered. As noted, this kind of evolution in pedagogy and best practices within instruction is at odds with many conventional assumptions about the teaching of writing. Some resistance is to be expected.

As for the broader notion of lexical diversity as a key feature of quality in writing, further research is needed. While this pilot study cannot provide more than a limited suggestion that demonstrations of a wide vocabulary are important to perceptions of writing quality, the findings of this research coincide with intuition and assumptions about how writing works. Further research, in keeping with the suggestions outlined above, could be of potentially significant benefit to students, instructors, administrators, and researchers within composition studies alike.

Works Cited

Beach, Richard, and Tom Friedrich. “Response to writing.” Handbook of writing research. New York: The Guilford Press, 2006. 222-234. Print.

Broeder, Peter, Guus Extra, and R. van Hout. “Richness and variety in the developing lexicon.” Adult language acquisition: Cross-linguistic perspectives. Vol. I: Field methods (1993): 145-163. Print.

Chen, Ye‐Sho, and Ferdinand F. Leimkuhler. “A type‐token identity in the Simon‐Yule model of text.” Journal of the American Society for Information Science 40.1 (1989): 45-53. Print.

Engber, Cheryl A. “The relationship of lexical proficiency to the quality of ESL compositions.” Journal of second language writing 4.2 (1995): 139-155. Print.

Faigley, Lester, and Stephen Witte. “Analyzing revision.” College composition and communication 32.4 (1981): 400-414. Print.

Foster, Pauline, and Parvaneh Tavakoli. “Native speakers and task performance: Comparing effects on complexity, fluency, and lexical diversity.” Language Learning 59.4 (2009): 866-896. Print.

“IELTS Handbook.” The British Council. 2007. Web. 1 May 2013.

Li, Yili. “Linguistic characteristics of ESL writing in task-based e-mail activities.” System 28.2 (2000): 229-245. Print.

Malvern, David, et al. Lexical diversity and language development. New York: Palgrave Macmillan, 2004. Print.

McCarthy, Philip M., and Scott Jarvis. “vocd: A theoretical and empirical evaluation.” Language Testing 24.4 (2007): 459-488. Print.

—. “MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment.” Behavior research methods 42.2 (2010): 381-392.

Richards, Brian. “Type/token ratios: What do they really tell us.” Journal of Child Language 14.2 (1987): 201-209. Print.

Spack, Ruth. “Initiating ESL students into the academic discourse community: How far should we go?.” Tesol quarterly 22.1 (1988): 29-51. Print.

“TOEFL IBT.” The Educational Testing Service. nd. Web. 1 May 2013.

Yu, Guoxing. “Lexical diversity in writing and speaking task performances.” Applied linguistics 31.2 (2010): 236-259. Print.

if formal music education is a privilege, spread the privilege

I was encouraged by this open letter from musicians and music educators in the Guardian, responding to a deeply wrongheaded essay arguing that, since formal music education is increasingly restricted to the white and wealthy, formal music education (that is, notation and theory) is therefore somehow bad and we should stop trying to do it at all. That sounds ridiculous but look and see for yourself.

This is hardly a unique argument in education or outside of it. (A recent vintage I’ve heard, particularly ludicrous, is that the impending shutdown of the New York L train is a good thing, because the L train is ridden by privileged hipsters, which… I can’t even begin to tell you how immensely stupid that is.) French poetry and other “impractical” majors sometimes get this routine – they are disproportionately concentrated in elite private colleges, so therefore there is something inherently decadent about studying them. Their connection to privilege somehow renders them unclean. But of course the fact that these wonderful subjects are now the province of those who have less immediate pressure to achieve independent financial stability only means that we should spread that condition.

A more equitable and humane society is one in which more people, not fewer, can spend their time on beautiful, “impractical” pursuits. Yet there are those deluded leftists who sometimes take a similar tack; why, they ask, are we funding Shakespeare in the park when there are people without warm clothes for the winter? Why pay for museums when some people go hungry? But follow this thinking long enough and you realize what they’re really saying is “poor people have no inner life.” In a just society we recognize that nobody, rich or poor, lives on bread alone. If your socialism doesn’t spread access to music and art and theater and cathedrals and tree-lined boulevards, I have no use for it or for you.

The point of privilege analysis is to spread the privileges to everyone, not to end them, and since music is the food of love, let rich and poor kids play on.

journalists, beware the gambler’s fallacy

One persistent way that human beings misrepresent the world is through the gambler’s fallacy, and there’s a kind of implied gambler’s fallacy that works its way into journalism quite often. It’s hugely important to anyone who cares about research and journalism about research.

The gambler’s fallacy is when you expect a certain periodicity in outcomes when you have no reason to expect it. That is, you look at events that happened in the recent past, and say “that is an unusually high/low number of times for that event to happen, so therefore what will follow is an unusually low/high number of times for it to happen.” The classic case is roulette: you’re walking along the casino floor, and you see the electronic sign showing that a roulette table has hit black 10 times in a row. You know the odds of this are very small, so you rush over to place a bet on red. But of course that’s not justified: the table doesn’t “know” it has come up black 10 times in a row. You’ve still got the same (bad) odds of hitting red, 47.4%. You’re still playing with the same house edge. A coin that’s just come up heads 50 times in a row has the same odds of being heads again as being tails again. The expectation that non-periodic random events are governed by some sort of god of reciprocal probabilities is the source of tons of bad human reasoning – and journalism is absolutely stuffed with it. You see it any time people point out that a particular event hasn’t happened in a long time, so therefore we’ve got an increased chance of it happening in the future.
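
To make the roulette point concrete, here is a minimal simulation sketch in Python. It assumes an American wheel (18 red, 18 black, 2 green pockets), which is what the 47.4% figure corresponds to. The function name, seed, and spin counts are illustrative choices, and it watches for streaks of 5 blacks rather than 10 only so that enough streaks turn up in a modest number of spins.

```python
import random

RED, BLACK, GREEN = "red", "black", "green"
# Assumed American wheel: 18 red, 18 black, 2 green pockets (0 and 00).
WHEEL = [RED] * 18 + [BLACK] * 18 + [GREEN] * 2


def red_rate_after_black_streak(n_spins, streak_len, seed=0):
    """Share of spins landing on red, counting only spins that
    immediately follow at least `streak_len` consecutive blacks."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    run = 0                    # current number of consecutive blacks
    reds = total = 0
    for _ in range(n_spins):
        colour = rng.choice(WHEEL)
        if run >= streak_len:  # this spin follows a black streak
            total += 1
            reds += colour == RED
        run = run + 1 if colour == BLACK else 0
    return reds / total if total else float("nan")


baseline = 18 / 38  # about 0.474, the odds of red on any single spin
after_streak = red_rate_after_black_streak(400_000, streak_len=5)
print(f"red on any spin:         {baseline:.3f}")
print(f"red after 5 blacks:      {after_streak:.3f}")
```

If the table "remembered" its streak, the second number would sit noticeably above the first. In a simulation like this one it should hover around the same 47.4%, with only sampling noise separating the two.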

Perhaps the classic case of this was Kathryn Schulz’s Pulitzer Prize-winning, much-celebrated New Yorker article on the potential mega-earthquake in the Pacific northwest. This piece was a sensation when it appeared, thanks to its prominent placement in a popular publication, the deftness of Schulz’s prose, and the artful construction of her story – but also because of the gambler’s fallacy. At the time I heard about the article constantly, from a lot of smart, educated people, and it was all based on the idea that we were “overdue” for a huge earthquake in that region. People I know were considering selling their homes. Rational adults started stockpiling canned goods. The really big one was overdue.

Was Schulz responsible for this idea? After publication, she would go on to be dismissive of the idea that she had created the impression that we were overdue for such an earthquake. She wrote in a followup to the original article,

Are we overdue for the Cascadia earthquake?

No, although I heard that word a lot after the piece was published. As DOGAMI’s Ian Madin told me, “You’re not overdue for an earthquake until you’re three standard deviations beyond the mean”—which, in the case of the full-margin Cascadia earthquake, means eight hundred years from now. (In the case of the “smaller” Cascadia earthquake, the magnitude 8.0 to 8.6 that would affect only the southern part of the zone, we’re currently one standard deviation beyond the mean.) That doesn’t mean that the quake won’t happen tomorrow; it just means we are not “overdue” in any meaningful sense.

How did people get the idea that we were overdue? The original:

we now know that the Pacific Northwest has experienced forty-one subduction-zone earthquakes in the past ten thousand years. If you divide ten thousand by forty-one, you get two hundred and forty-three, which is Cascadia’s recurrence interval: the average amount of time that elapses between earthquakes. That timespan is dangerous both because it is too long—long enough for us to unwittingly build an entire civilization on top of our continent’s worst fault line—and because it is not long enough. Counting from the earthquake of 1700, we are now three hundred and fifteen years into a two-hundred-and-forty-three-year cycle.

By saying that there is a “two-hundred-and-forty-three-year cycle,” Schulz implied a regular periodicity. The definition of a cycle, after all, is “a series of events that are regularly repeated in the same order.” That simply isn’t how a recurrence interval functions, as Schulz would go on to clarify in her followup – which of course got vastly less attention. I appreciate that, in her followup, Schulz was more rigorous and specific, referring to an expert’s explanation, but it takes serious chutzpah to have written the preceding paragraph and then to later act as though there’s no reason your readers thought the next quake was “overdue.” The closest thing to a clarifying statement in the original article is as follows:

It is possible to quibble with that number. Recurrence intervals are averages, and averages are tricky: ten is the average of nine and eleven, but also of eighteen and two. It is not possible, however, to dispute the scale of the problem.

If we bother to explain that first sentence thoroughly, we can see it’s a remarkable to-be-sure statement – she is obliquely admitting that since there is no regular periodicity to a recurrence interval, there is no sense in which that “two-hundred-and-forty-three-year cycle” is actually a cycle. It’s just an average. Yes, the “really big one” could hit the Pacific northwest tomorrow – and if it did, it still wouldn’t imply that we’ve been overdue, as her later comments acknowledge. The earthquake might also happen 500 years from now. That’s not a quibble; it’s the root of the very panic she set off by publishing the piece. But by immediately leaping from such an under-explained discussion of what a recurrence interval is and isn’t to the irrelevant and vague assertion about “the scale of the problem,” Schulz ensured that her readers would misunderstand in the most sensationalistic way possible. However well crafted her story was, it left people getting a very basic fact wrong, and was thus bad science writing. I don’t think Schulz was being dishonest, but this was a major problem with a piece that received almost universal praise.
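
The "averages are tricky" point can be sketched in a few lines of Python. The two interval histories below are invented for illustration and are not Cascadia data; they are built only so that both average 243 years. The `overdue` helper is a hypothetical name that encodes the rule of thumb Madin gives in the followup (more than three standard deviations beyond the mean), and 315 is the elapsed time quoted from the original article.

```python
from statistics import mean, stdev


def overdue(elapsed_years, intervals, k=3.0):
    """Rule of thumb from the followup: 'overdue' only once the elapsed
    time exceeds the mean interval by more than k standard deviations."""
    return elapsed_years > mean(intervals) + k * stdev(intervals)


# HYPOTHETICAL gaps between quakes, in years. Both lists average 243.
regular = [233, 243, 253, 238, 248]   # tightly clustered: close to a real "cycle"
irregular = [40, 610, 95, 390, 80]    # same average, wildly uneven spacing

for name, gaps in (("regular", regular), ("irregular", irregular)):
    print(name, "mean:", mean(gaps), "overdue at 315 years:", overdue(315, gaps))
```

Both histories have the same recurrence interval, yet only the tightly clustered one would make year 315 look late. The average alone says nothing about which kind of history you are in, which is why "315 years into a 243-year cycle" misleads.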

I just read another good example of an implied gambler’s fallacy in a comprehensively irresponsible Gizmodo piece on supposed future pandemics. I am tempted to just fisk the whole thing, but I’ll spare you. For our immediate interests let’s just look at how a gambler’s fallacy can work by implication. George Dvorsky:

Experts say it’s not a matter of if, but when a global scale pandemic will wipe out millions of people…. Throughout history, pathogens have wiped out scores of humans. During the 20th century, there were three global-scale influenza outbreaks, the worst of which killed somewhere between 50 and 100 million people, or about 3 to 5 percent of the global population. The HIV virus, which went pandemic in the 1980s, has infected about 70 million people, killing 35 million.

Those specific experts are not named or quoted, so we’ll have to take Dvorsky’s word for it. But note the implication here: because we’ve had pandemics in the past that killed significant percentages of the population, we are likely to have more in the future. An-epidemic-is-upon-us stories are a dime a dozen in contemporary news media, given their obvious ability to drive clicks. Common to these pieces is the implication that we are overdue for another epidemic because epidemics used to happen regularly in the past. But of course, conditions change, and there are few fields where conditions have changed more in the recent past than infectious diseases. Dvorsky implies that they have changed for the worse:

Diseases, particularly those of tropical origin, are spreading faster than ever before, owing to more long-distance travel, urbanization, lack of sanitation, and ineffective mosquito control—not to mention global warming and the spread of tropical diseases outside of traditional equatorial confines.

Sure, those are concerns. But since he’s specifically set us up to expect more pandemics by referencing those in the early 20th century, maybe we should take a somewhat broader perspective and look at how infectious diseases have changed in the past 100 years. Let’s check with the CDC.

The most salient change, when it comes to infectious disease, has been the astonishing progress of modern medicine. We have a methodology for fighting infectious disease that has saved hundreds of millions of lives. Unsurprisingly, the diseases that keep getting nominated as the source of the next great pandemic keep failing to spread at expected rates. Dvorsky names diseases like SARS (global cases since 2004: zero) and Ebola (for which we just discovered a very promising vaccine), not seeming to realize that these are examples of victories for the control of infectious disease, as tragic as the loss of life has been. The actual greatest threats to human health remain what they have been for some time, the deeply unsexy threats of smoking, heart disease, and obesity.

Does the dramatically lower rate of deaths from infectious disease mean a pandemic is impossible? Of course not. But “this happened often in the past, and it hasn’t happened recently, so….” is fallacious reasoning. And you see it in all sorts of domains of journalism. “This winter hasn’t seen a lot of snow so far, so you know February will be rough.” “There hasn’t been a murder in Chicago in weeks, and police are on their toes for the inevitable violence to come.” “The candidate has been riding a crest of good polling numbers, but analysts expect he’s due for a swoon.” None of these are sound reasoning, even though they seem superficially correct based on our intuitions about the world. It’s something journalists in particular should watch out for.

another day, another charter school scandal

Stop me if you’ve heard this one before. A charter regime sweeps into town with grand ambitions, lofty rhetoric, and missionary zeal, promising to save underperforming kids with the magic of markets and by getting rid of those lazy teachers and their greedy unions. What results instead is no demonstrable learning gains, serial rule breaking, underhanded tactics to attract students, a failure to provide for students with disabilities, and a total lack of real accountability. That’s the story in Nashville. It’s not a new story.

The official narrative is that students and parents will flock to charters, given that they provide “choice,” and in so doing sprinkle students with magical capitalism dust that, somehow – the mechanism is never clear – results in sturdy learning gains. (That schools have both abundant ability to juice the numbers and direct incentive to do so usually goes undiscussed.) Yet the charters in Nashville, like those in the horrific mess in Detroit, are so driven by the need to get dollars – precisely the thing that was supposed to make charters better than public, in the neoliberal telling – that they have to resort to dirty tricks to get parents to sign their children up. And what kind of conditions do students face when they do go to these schools?

On March 7 WSMV-TV reported that California-based Rocketship isn’t providing legally required services to students with disabilities and English language learners. A report by the Tennessee Department of Education even found that Rocketship is forcing homeless students to scrape together money to pay for uniforms.

Lately you may have heard about “public charter schools.” But there is no such thing as a public charter school. Public schools entail public accountability. They involve local control. So-called public charters just take public money, tax dollars. The “flexibility” that is so often touted as part of the charter school magic really means that the citizens who fund these charters have none of the local control of schools that has been such an essential part of public education. And so in Nashville you have area citizens getting mass texted to send their kids to charters that benefit for-profit companies, citizens who in turn can’t actually respond directly through a local school board or municipal government. When parents feel they need to file a class action lawsuit to enforce some accountability on out-of-control local schools, we’ve officially gone around the bend.

I try – I really do – to have patience for the army of people who are charter school true believers, still, after all this scandal and all this failure. I remind myself that there are people who sincerely believe that charters are the best route forward to improve education. (Which of course means “raise test scores” in our current culture.) I try not to view them as cynically as I do, say, for-profit prison advocates claiming that they’re really in it to make society safer. I remind myself that missionary zeal and a dogged belief that all social problems can be solved if we just believe hard enough can really cloud people’s minds.

But the fact of the matter is that the charter school “movement” is absolutely stuffed to the gills with profiteers and grifters. Thanks to the nearly-universal credulousness of our news media towards the school reform movement, some greased palms in local, state, and federal government, and the powerful and pernicious influence of big-money philanthropic organizations like the Gates Foundation, the conditions have been perfect for rampant corruption and bad behavior. Charter school advocates rang the dinner bell for entire industries that seek to wring profit out of our commitment to universal free education, and the wolves predictably followed. Meanwhile, charters continue to be pushed based on bad research, attrition and survivorship bias, dubious quality metrics, and through undue focus on small, specific, atypical success stories whose conditions cannot possibly scale. And the people who have spent so much time flogging the profound moral need to save struggling children have been remarkably silent about the decades of failure in charter schools writ large. Where is Jonathan Alter to decry the corruption and failure in Nashville? Where is Jonathan Chait’s column admitting that the charter school movement has proven to be an unsalvageable mess? Where is the follow up to Waiting for “Superman,” titled Turns Out Superman Isn’t Real and Other Dispatches from Planet Earth? That’s the problem with zealots; they’re always too busy circling the wagons for their pet causes to actually look at them critically.

As the Betsy DeVoses of the world make policy, as companies get rich wringing profit out of poor school districts, and as writers make careers for themselves with soaring rhetoric and tough talk about accountability that, somehow, never changes in the light of new evidence, we’ll see more school reform disasters like those in Nashville, Detroit, and Newark. Will that do anything to slow the charter movement? Not on your life.

Study of the Week: the Gifted and the Grinders

Back in high school, I was a pretty classic example of a kid that teachers said was bright but didn’t apply himself. There were complex reasons for that, some of them owing to my home life, some to my failure to understand the stakes, and some to laziness and arrogance. Though I wasn’t under the impression that I was a genius, I did think that in the higher placement classes there were people who got by on talent and people who were striver types, the ones who gritted out high grades more through work than through being naturally bright.

This is, of course, reductive thinking, and was self-flattery on my part. (In my defense, I was a teenager.) Obviously, there’s a range of smarts and a range when it comes to perseverance and work ethic, and there are all sorts of aspects of these things that are interacting with each other. And clearly those at the very top of the academic game likely have both smarts and work ethic in spades. (And luck. And privilege.) But my old vague sense that some people were smarties and some were grinders seems pervasive to me. Our culture is full of those archetypes. Is it really the case that intelligence and work ethic are separate, and that they’re often found in quite different amounts in individuals?

Kind of, yeah.

At least, there’s evidence for that in a recent replication study performed by Clemens Lechner, Daniel Danner, and Beatrice Rammstedt of the Leibniz Institute for the Social Sciences, which I will talk about today for the first Study of the Week, and which I’ll use to take a quick look at a few core concepts.

Construct and Operationalization

Social sciences are hard, for a lot of reasons. One is the famously large number of variables that influence human behavior, which in turn makes it difficult to identify which variables (or interactions of variables) are responsible for a given outcome. Another is the concept of the construct.

In the physical sciences, we’re generally measuring things that are straightforward facets of the physical universe, things that are to one degree or another accessible and mutually defined by different people. We might have different standards of measure, we might have different tools to measure them, and we might need a great deal of experimental sophistication to obtain these measurements, but there is usually a fundamental simplicity to what we’re attempting to measure. Take length. You might measure it in inches or in centimeters. You might measure it with a yard stick or a laser system. You might have to use complex ideas like cosmic distance ladders. But fundamentally the concept of length, or temperature, or mass, or luminosity, is pretty easy to define in a way that most every scientist will agree with.

The social sciences are mostly not that way. Instead, we often have to look at concepts like intelligence, reading ability, tolerance, anxiety…. Each of these reflects a real-world phenomenon that most humans can agree exists, but what exactly they entail and how to measure them are matters of controversy. They just aren’t available to direct measurement in the ways common to the natural and physical sciences. So we need to define how we’re going to measure them in a way that will be regarded as valid by others – and that’s often not an uncomplicated task.

Take reading. Everybody knows what reading is, right? But testing reading ability turns out to be a complex task. If we want to test reading ability, how would we go about doing that? A simple way might be to have a test subject read a book out loud. We might then decide if the subject can be put into the CAN READ or CAN’T READ pile. But of course that’s quite lacking in granularity and leaves us with a lot of questions. If a reader mispronounces a word but understands its meaning, does that mean they can’t read that word? How many words can a reader fail to read correctly in a given text before we sort them into the CAN’T READ pile? Clearly, reading isn’t really a binary activity. Some people are better or worse readers and some people can read harder or easier texts. What we need is a scale and a test to assign readers to it. What form should that scale take? How many questions is best? Should the test involve reading passages or reading sentences? Fill in the blank or multiple choice? Is the ability to spot grammatical errors in a text an aspect of reading, or is that a different construct? Is vocabulary knowledge a part of the construct of reading ability or a separate construct?

You get the idea. It’s complicated stuff. We can’t just say “reading ability” and know that everyone is going to agree with what that is or how to measure it. Instead, we recognize the social processes inherent in defining such concepts by referring to them as a construct and to the way we are measuring that construct as an operationalization. (You are invited to roll your eyes at the jargon if you’d like.) So we might have the concept “reading ability” and operationalize it with a multiple choice test. Note that the operationalization isn’t merely an instrument or a metric but the whole sense of how we take the necessarily indistinct construct and make it something measurable.

Construct and operationalization, as clunky as the terms are and as convoluted as they seem, are essential concepts for understanding the social sciences. In particular, I find the difficulty merely in defining our variables of interest and how to measure them a key reason for epistemic humility in our research.

So back to the question of intelligence vs. work ethic. The construct “intelligence” is notoriously contested, with hundreds of books written about its definition, its measurement, and the presumed values inherent to how we talk about it. For our purposes, let’s accept merely that this is a subject of a huge body of research, and that we have the concepts of IQ and g in the public consciousness already. We’ll set aside all of the empirical and political issues with IQ for now. But what about work ethic/perseverance/“grinding”? How would we operationalize such a construct? Here we’ll have to talk about psychology’s Five Factor Model.

The “Big Five” or Five Factor Model

The Five Factor Model is a vision of human personality, particularly favored by those in behavioral genetics, that says there are essentially only five major factors in human personality: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism, sometimes abbreviated with the acronym OCEAN. To one degree or another, proponents of the Five Factor Model argue that all of our myriad terms for personality traits are really just synonyms for these five things. That’s right, that’s all there is to it – those are the traits that make up the human personality, and we’re all found along a range on those scales. I’m exaggerating, of course, but in the case of some true believers not by much. Steven Pinker, for example, flogs the concept relentlessly in his most famous book, The Blank Slate. That’s not a coincidence; behavioral genetics, as a field, loves the Five Factor Model because it fits empirically with the maximalist case for genetic determinism. (Pinker’s MO is to say that both genetics and other factors matter about equally and then to speak as if only genetics matter.) In other words the Five Factor Model helps people make a certain kind of argument about human nature, so it gets a lot of extra attention. I sometimes call this kind of thinking the validity of convenience.

The standard defense of the Five Factor Model is, hey, it replicates – that is, its experimental reliability tends to be high, in that different researchers using somewhat different methods to measure these traits will find similar results. But this is the tail wagging the dog; that something replicates doesn’t mean it’s a valid theoretical construct, only that it tracks with some persistent real-world quality. As Louis Menand put it in a New Yorker review of The Blank Slate that is quite entertaining if you find Pinker ponderous,

When Pinker and Harris say that parents do not affect their children’s personalities, therefore, they mean that parents cannot make a fretful child into a serene adult. It’s irrelevant to them that parents can make their children into opera buffs, water-skiers, food connoisseurs, bilingual speakers, painters, trumpet players, and churchgoers—that parents have the power to introduce their children to the whole supra-biological realm—for the fundamental reason that science cannot comprehend what it cannot measure.

That results of the Five Factor Model can be replicated does not mean that dividing the human psyche into five reductive factors and declaring them the whole of personality is valid. It simply means that our operationalizations of the construct are indeed measuring some consistent property of individuals. It’s like answering the question “what is a human being?” by saying “a human being is bipedal.” If you then send a team of observers out into the world to measure the number of legs that tend to be found on humans, you will no doubt find that different researchers are likely to obtain similar findings when counting the number of legs of an individual person. But this doesn’t provide evidence that bipedalism is the sum of mankind; it merely suggests that legs are a thing you can consistently measure, among many. Reliability is a necessary criterion for validity, but it isn’t sufficient. I don’t doubt that the Five Factor Model describes consistent and real aspects of human personality, but the way that Pinker and others treat that model as a more or less comprehensive catalog of what it means to be human is not justified. I’m sure that you could meet two different people who share the same outcomes on the five measured traits in the model, fall madly in love with one of them, and declare the other the biggest asshole you’ve ever met in your life. We’re a multivariate species.

That windup aside, for this particular kind of analysis, I think a construct like “conscientiousness” can be analytically useful. That is, I think that we can avoid the question of whether the Five Factors are actually a comprehensive catalog of essential personality traits while recognizing that there’s some such property of educational perseverance and that it is potentially measurable. (Angela Lee Duckworth’s “grit” concept has been a prominent rebranding of this basic human capacity, although it has begun to generate some criticism.) The question is, does this trait really exist independent of intelligence, and how effective a predictor is it compared to IQ testing?

Intelligence, Achievement, Test Scores, and Grades

In educational testing, it’s a constant debate: to what degree do various tests measure specific and independent qualities of tested subjects, and to what degree are they just rough approximations of IQ? You can find reams of studies concerning this question. The question hinges a great deal on the subject matter; obviously, a really high IQ isn’t going to mean much if you’re taking a Latin test and you’ve never studied Latin. On the other hand, tests like the SAT and its constituent sections tend to be very highly correlated with IQ tests, to the point where many argue that the test is simply a de facto test for g, the general intelligence factor that IQ tests are intended to measure. What makes these questions difficult, in part, is that we’re often going to be considering variables that are likely to be highly correlated within individuals. That is, the question of whether a given achievement test measures something other than g is harder to answer because people with a high g are also those who are likely to score highly on an achievement test even if that test effectively measures something other than g. Make sense?

Today’s study offers two main research questions:

first, whether achievement and intelligence tests are empirically distinct; second, how much variance in achievement measures is accounted for by intelligence vs. by personality, whereby R² increments of personality after adjusting for intelligence are the primary interest

I’m not going to wade into the broader debate about whether various achievement tests effectively measure properties distinct from IQ. I’m not qualified, statistically, to try and separate the various overlapping sums of squares in intelligence and achievement testing. And given that the g-men are known for being rather, ah, strident, I’d prefer to avoid the issue. Besides, I think the first question is of much more interest to professionals in psychometrics and assessment than the general public. (This week’s study is in fact a replication of a study that was in turn disputed by another researcher.) But the second question is interesting and relevant to everyone interested in education: how much of a given student’s outcomes is the product of intelligence and how much is the product of personality? In particular, can we see a difference in how intelligence (as measured with IQ and its proxies) influences test scores and grades and how personality (as operationalized through the Five Factor Model) influences them?

The Present Study

In the study at hand, the researchers utilized a data set of 13,648 German 9th graders. The student records included their grades; their results on academic achievement tests; their results on a commonly-used test of the Five Factors; and their performance on a test of reasoning/general intelligence (a Raven’s Standard Progressive Matrices analog) and a processing speed test, which are often used in this kind of cognitive research.

The researchers undertook a multivariable analysis of variance called “exploratory structural equation modeling.” I would love to tell you what that is and how it works but I have no idea. I’m not equipped, statistically, to explain the process or judge whether it was appropriate in this instance. We’re just going to have to trust the researchers and recognize that the process does what analysis of variance does generally, which is to look at the quantitative relationships between variables to explain how they predict, or fail to predict, each other. The nut of it is here:

First, we regressed each of the four cognitive skill measures on all Big Five dimensions. Second, we decomposed the variance of the achievement measures (achievement test scores and school grades) by regressing them on intelligence alone and then on personality and intelligence jointly.

(“Decomposing” variables, in statistics, is a fancy way of saying that you’re using mathematical techniques to identify and separate variables that might be otherwise difficult to separate thanks to their close quantitative relationships.)
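To make the quoted procedure a little more concrete, here is a minimal sketch of an R² increment on made-up data. It uses plain least-squares regression, not the exploratory structural equation modeling the researchers actually ran, and everything in it (the simulated scores, the effect sizes, the `r_squared` and `simulate` helpers) is my own assumption for illustration, not anything taken from the study.

```python
import numpy as np


def r_squared(predictors, outcome):
    """Share of variance in `outcome` explained by an OLS fit on `predictors`."""
    # Add an intercept column so the fit is not forced through the origin.
    design = np.column_stack([np.ones(len(outcome)), predictors])
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    residuals = outcome - design @ coefs
    return 1.0 - residuals.var() / outcome.var()


def simulate(n=5000, seed=0):
    """Synthetic students; the effect sizes are invented, not the study's."""
    rng = np.random.default_rng(seed)
    intelligence = rng.normal(size=n)
    conscientiousness = rng.normal(size=n)
    noise = rng.normal(size=n)
    grades = 0.3 * intelligence + 0.4 * conscientiousness + noise
    return intelligence, conscientiousness, grades


intelligence, conscientiousness, grades = simulate()

# Step 1: grades regressed on intelligence alone.
r2_intelligence = r_squared(intelligence, grades)

# Step 2: grades regressed on intelligence and personality jointly.
r2_joint = r_squared(np.column_stack([intelligence, conscientiousness]), grades)

# The increment is what personality adds after adjusting for intelligence.
r2_increment = r2_joint - r2_intelligence

print(f"intelligence alone: {r2_intelligence:.3f}")
print(f"joint model:        {r2_joint:.3f}")
print(f"increment:          {r2_increment:.3f}")
```

The joint model can never explain less variance than the intelligence-only model, so the increment is the extra share of variance that personality accounts for once intelligence has had its say.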

What did they find? The results are pretty intuitive. There is, as is to be expected, a strong (.76) correlation between performance on the intelligence test and performance on achievement tests. There’s also a considerable but much weaker relationship between achievement tests and grades (.44) and the general intelligence test and grades (.32). So kids who are smarter as defined by achievement and reasoning tests do get better grades, but the relationship isn’t super strong. There are other factors involved. And a big part of that unexplained variance, according to this research, is personality.

The Big Five explain a substantial, and almost identical, share of variance in grades and achievement tests, amounting to almost one-fifth. By comparison, they explain less than half as much—<.10—of the variance in reasoning, and almost none in processing speed (0.07%).

In other words, if you’re trying to predict how students will do on grades and achievement tests, their personalities are pretty strong predictors. But if you’re trying to predict their pure reasoning ability, personality is pretty useless. And the good-at-tests, bad grades students like high school Freddie are pretty plentiful:

the predictive power of intelligence is markedly different for these two achievement measures: it is much higher than that of personality in the case of achievement—but much lower in the case of school grades, where personality alone explains almost two times more variance than intelligence alone does.

So it would seem there may be some validity to the concept of the naturally bright and grinders after all. And the obverse, the less naturally bright but highly motivated grinder types?

Conscientiousness has a substantial positive relationship with grades—but negative relationships with both achievement test scores and reasoning.

In other words, the more conscientious you are, the better the grades you receive, even though you score lower on achievement and intelligence tests. Unsurprisingly, Conscientiousness (the “grit,” perseverance, stick-to-itiveness factor) correlated most highly with school grades, at .27. The ability to continue to work diligently and through adversity makes a huge difference on getting good grades but is much less important when it comes to raw intelligence testing.

What It Means

Ultimately, this research result is intuitive and matches with the personal experience of many. As someone who spent a lot of my life skating by on being bright, and only really became academically focused late in my undergraduate education, there’s something selfishly comforting here. But in the broader, more socially responsible sense, I think we should take care not to perpetuate any stigmas about the grinders. On the one hand, our culture is absolutely suffused with celebrations of conscientiousness and hard work, so it’s not like I think grinders get no credit. And it is important to say that there are certain scenarios where pure reasoning ability matters; if you’re intent on being a research physicist or mathematician, for example, or if you’re bent on being a chess Grandmaster, hard work will not be sufficient, no matter what Malcolm Gladwell says. On the other hand, I am eager to contribute in whatever way to undermining the Cult of Smartness. We’ve perpetuated the notion that those naturally gifted with high intelligence are our natural leaders for decades, and to show for it we have immense elite failures and a sickening lack of social responsibility on Wall Street and in Silicon Valley, where the supposed geniuses roam.

What we really need, ultimately, from both our educational system and our culture, is a theme I will return to in this blog again and again: a broader, more charitable, more humanistic definition of what it means to be a worthwhile human being.

(Thanks to SlateStarCodex for bringing this study to my attention.)


Success Academy Charter Schools accepted $550,000 from pro-Trump billionaires

The charter school movement retains a reputation for being at least nominally politically progressive. This is strange, as the movement entails defunding public institutions and removing public accountability and replacing them with for-profit or not-for-profit-in-name-only institutions, and in doing so causing cuts in stable, unionized public sector jobs. Charter schools are notorious union busters and in some locales have resulted in hugely disproportionate job cuts against black women. The basic assumptions of school “choice” rely on conservative economic arguments – that markets always improve quality. As terrible as Betsy Devos’s appointment to Secretary of Education in the Trump administration may be, it at least helps clarify what should already be obvious: that support for charter schools is conservative on its face.

Here’s a little more evidence for you, concerning school reform darling Success Academy Charter Schools, celebrated for their strong metrics and notorious for their abusive methods in achieving them.

Via my friend and comrade Mindy Rosier, a public school special education teacher and tireless activist, I learned about these financial disclosures from the Mercer Family Foundation. The Mercer Family Foundation is the funding arm of a secretive, powerful family of reactionary billionaires who have spent the last few years empowering Republicans generally and Donald Trump specifically. And in 2014 they donated $550,000 to Success Academy.

Overall, the Mercer Family Foundation’s donations are a veritable Who’s Who of reactionary conservatism, with large donations going to the Heritage Foundation, the Cato Institute, the George W. Bush Foundation, the Barry Goldwater Institute, the Manhattan Institute…. Breitbart also receives key financial backing from the Mercer Family Foundation. What kind of publication is Breitbart?

How does Success Academy justify taking money from people who fund such hateful rhetoric? I don’t know. Betsy Woodruff has written about this connection, pointing out that Donald Trump has become a staunch advocate of charter schools, but otherwise it’s failed to generate much attention. Might be time for an enterprising reporter to pick up the phone.

One way or another, this should be just another clear indication of the obvious: the charter school movement is part of the same conservative movement that brought us Donald Trump.