Study of the Week: Discipline Reform and Test Score Mania

This week’s study considers how quantitative educational indicators (read: test scores) are affected by serious disciplinary action against students.

The context

We’re in the midst of a criminal justice reform movement in this country. The regularity of police killings, particularly of black men, and our immense prison population have led to civic unrest and a widespread perception that something needs to change. We’ve even seen rare bipartisan admissions that something has gone wrong, at least to a point. But we’ve made frustratingly little actual progress thus far.

One of the salutary aspects of this movement has been the insistence, by activists, on seeing crime and policing as part of broader social systems. You can’t meaningfully talk about how crime happens, or why abusive policing or over-incarceration happen, without talking about the socioeconomic conditions of this country. In particular, there’s been a great deal of interest in the “school to prison pipeline,” the way that some kids – particularly poor students of color – are set up to fail by the system. One aspect of our school system that clearly connects with criminal justice reform is our discipline system. Students who receive suspensions and other serious disciplinary action frequently struggle academically, and they are disproportionately likely to be students of color. As activists have argued, in many ways these students begin their lives in an overly punitive system and continue to suffer in that condition into adulthood.

In an era of test score mania, it’s inevitable that people will ask – how does academic discipline impact test scores? And how can we assess this relationship when there are such obvious confounds? In this week’s paper, the University of Arkansas’s Kaitlin P. Anderson, Gary W. Ritter, and Gema Zamarro attempt to explore that relationship, and arrive at surprising results.

The data

What’s really remarkable about this research is the size and quality of the data set, summarized like this:

This study uses six years of de-identified demographic, achievement (test score), and disciplinary data from all K-12 schools in Arkansas provided by the Arkansas Department of Education (2008-09 through 2013-14). Demographic data include race, gender, grade, special education status, limited English proficiency-status, and free-and-reduced-lunch (FRL) status.

That’s some serious data! We’re talking in the hundreds of thousands of observed students, with longitudinal (multiple observations over time) information and a good amount of demographic data for stratification. I’d kill for a data set like this. (If I could get such anonymized data for students who go through NYC public schools and enroll in CUNY, man.)

Why does the longitudinal aspect matter? Because of endogeneity.

The arrow of causation again, or endogeneity

In last week’s Study of the Week, I pointed out that experimental research in education is rare. Sometimes this is a resource issue, but mostly it’s an issue of ethics and practicality. Suspensions and their impact on academic performance are a perfect example: you can’t go randomly assigning the experimental condition of suspension to kids. That means the negative academic outcomes typically associated with suspensions, discussed in the literature review of this study, might be caused by the suspensions themselves – or it may be that kids who struggle academically are more likely to be suspended in the first place. You might presume that if one follows the other that’s sufficient to prove causation, but what if there was some preceding event that caused both? (Parents divorcing, say.) It’s tricky. Because experimental designs physically intercede and are randomly controlled, they don’t have this problem, but again, nobody should be suspending kids at random in the name of science.

This research question has an endogeneity problem, in other words. Endogeneity is a fancy statistical term that, like many, is often not used in a particularly precise way. The Wikipedia article for it is awful, but the first couple pages here are a good resource. In general, endogeneity means that something in your model depends on something else in your model in a way that isn’t expressed in the relationship you’re studying. That is, there’s a hidden relationship within your model that potentially confounds your ability to assess causation. Often this is defined as your error term being correlated with your independent variable(s) (your input variables, the predictors, the variables you suspect may influence the output variable you’re looking at).

Say you’re running a simple linear regression analysis and your model looks at the relationship between income and happiness as expressed on some interval scale. Your model will always include an error term, which contains all the stuff impacting your variable of interest (here happiness) that’s not captured by your model. That’s OK – real world models are never fully explanatory. Uncontrolled variability is inevitable and fine in many research situations. The trouble is that some of the untested variables, the error portion, are likely to be correlated with income. If you’re healthy you’re likely to have a better income. If you’re healthy you’re likely to be happier (depending on type of illness). If you just plug income in as a predictor of happiness and income correlates with health and health correlates with happiness then you can end up overestimating the impact of income. If you’re really just looking for associations, no harm done. But if you want to make a causal inference, you’re asking for trouble. That’s (one type of) endogeneity.
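To make that concrete, here’s a minimal simulation of omitted-variable endogeneity. Everything here is invented for illustration – the variable names, effect sizes, and noise levels – but it shows how a naive regression of happiness on income alone absorbs the hidden health effect and overstates income’s impact:

```python
# Minimal simulation of omitted-variable endogeneity (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

health = rng.normal(size=n)                  # unobserved confound
income = 1.0 * health + rng.normal(size=n)   # being healthy raises income
happiness = 0.5 * income + 1.0 * health + rng.normal(size=n)  # both matter

# Naive regression: happiness on income alone.
naive_slope = np.polyfit(income, happiness, 1)[0]

# Regression that also controls for health.
X = np.column_stack([income, health, np.ones(n)])
controlled_slope = np.linalg.lstsq(X, happiness, rcond=None)[0][0]

print("true effect of income:        0.50")
print(f"naive estimate (biased up):   {naive_slope:.2f}")
print(f"estimate controlling health:  {controlled_slope:.2f}")
```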

Now, there, you have an endogeneity problem that could be solved by putting more variables into your model, like some sort of indicator of health. But sometimes you have endogeneity that stems from the kind of arrow of causation question that I’ve talked about in this space before. The resource I linked to above details a common example. Actors who have more status are perceived as having more skill at acting. But of course having more skill at acting gives you more status as an actor. Again, if you’re just looking for association, no problem. But there’s no way to really dig out causation – and it doesn’t matter if you add more variables to the model. That problem pops up again and again in education research.

Endogeneity is discussed at length in this paper. The study’s authors attempt to confront it, first, by throwing in demographic variables that may help control for unexplained variation, such as whether students qualify for federal school lunch assistance (a standard proxy for socioeconomic status). They also use a fixed effects panel data model. Fixed effects models attempt to account for unexplained variation by looking at how particular variables change over time within an individual research subject/observation. Fixed effects data are longitudinal, in other words, rather than cross-sectional (looking at each subject/observation only once). Why does this matter? There’s a great explanation by example in this resource here regarding demand for a service and that service’s cost. By using a fixed effects model, you can look at correlations over time within a given subject or observation.

Suppose I took a bunch of homeless kids and looked at the relationship between the calories they consumed and their self-reported happiness. I do a regression and I find, surprisingly, that even among kids with an unusually high chance of being malnourished, calories are inversely correlated with self-reported happiness – the more calories, the lower the happiness. But we know that different kids have different metabolisms and different caloric needs. So now I take multiple observations of the same kids. I find that for each individual kid, rising caloric intake is positively correlated with happiness. Kids who consume fewer calories might be happier, but the idea that consuming fewer calories causes higher happiness has proven to be an illusion. Looking longitudinally shows that for each kid higher calories are associated with higher happiness. That’s the benefit of a fixed effects model. Make sense?
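Here’s a toy sketch of that logic with entirely made-up numbers: pooled across kids, the calorie/happiness slope comes out negative, but subtracting each kid’s own averages (the “within” transformation that fixed effects estimators rely on) recovers the positive per-kid relationship.

```python
# Toy demonstration of a fixed-effects ("within") estimator for the
# calories/happiness example. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_kids, n_obs = 200, 6
kid = np.repeat(np.arange(n_kids), n_obs)        # kid id for each observation

# Unobserved per-kid differences: some kids consume more calories overall
# AND are less happy overall -- the confound that fools a pooled regression.
kid_effect = rng.normal(size=n_kids)
calories = 2000 + 300 * kid_effect[kid] + rng.normal(scale=100, size=n_kids * n_obs)
happiness = 0.01 * calories - 10 * kid_effect[kid] + rng.normal(size=n_kids * n_obs)

# Pooled (cross-sectional) regression: one slope across all observations.
pooled_slope = np.polyfit(calories, happiness, 1)[0]

# Within transformation: subtract each kid's own mean from their observations.
cal_dm = calories - np.bincount(kid, calories)[kid] / n_obs
hap_dm = happiness - np.bincount(kid, happiness)[kid] / n_obs
within_slope = np.polyfit(cal_dm, hap_dm, 1)[0]

print(f"pooled slope (across kids):   {pooled_slope:+.3f}")  # negative: the illusion
print(f"within slope (fixed effects): {within_slope:+.3f}")  # positive: the per-kid truth
```

In a real analysis you’d use a proper panel regression routine and include covariates, as the study’s authors do, rather than demeaning by hand, but the intuition is the same.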

The authors of this study use a panel data (read: contains longitudinal data) fixed effects model as an attempt to confront the obvious confounds here. As they say, most prior research is simply correlational, using cross-sectional approaches that merely compare incidence of suspensions to academic outcomes. By introducing longitudinal data, they can use a fixed effects model to look at variation within particular students, which helps address endogeneity concerns. Do I understand everything going on in their model, statistically? I most certainly do not. So if I’ve gotten something wrong about how they’re attempting to control endogeneity with a fixed effects model, please write me an email and I’ll run it and credit you by name.

The findings

What the authors find, having used their complex model and their longitudinal data, is counterintuitive and goes against the large pool of correlational studies: students who receive serious disciplinary actions don’t suffer academically, at least in terms of test scores, when you control for other variables. In fact there are statistically significant but very small increases in test scores associated with serious disciplinary action. This is true for both math and language arts. The effects are small enough that, in my view, they aren’t worth characterizing as a benefit. (This is why we need to report effect sizes.) But still, the research directly cuts against the idea that serious disciplinary action hurts test scores.

This is just one study and will need replication. But it utilizes a very large, high-quality data set and attempts methodologically to address consistent problems with prior correlational research. So I would lend a good deal of credence to its findings. The question is, what do we do with it?

Keeping results separate from conclusions

To state the obvious: it’s important that we do research like this. We need to know these relationships. But it’s also incredibly important that we recognize the difference between the empirical outcomes of a given study and what our policy response should be. We need, in other words, to separate results from conclusions.

This research occurs in a period of dogged fixation on test scores by policy types. This is, as I’ve argued many times, a mistake. The tail has clearly come to wag the dog when it comes to test scores, with those quantitative indicators of success now pursued so doggedly that they have overwhelmed our interest in the lives of the very children we are meant to be advocating for. And while I don’t think many people will come out and say, “suspensions don’t hurt test scores, so let’s keep suspending so many kids,” this research comes in a policy context where test scores loom so large that they dominate the conversation.

To their credit, the authors of this study express the direct conclusion in a limited and fair way: “Based on our results, if policymakers continue to push for changes to disciplinary policies, they should do so for reasons other than the hypothesized negative impacts of exclusionary discipline on all students.” This is, I think, precisely the right way to frame this. We should not change disciplinary policy out of a concern for test scores. We should change disciplinary policy out of a concern for justice. Do the authors agree? They are cagey, but I suspect they show their hands several times in this research. They caution that the discipline reform movement is leading to deteriorating “school climate” measures and in general concern troll their way through the final paragraphs of their paper. I wish they would state what seems to me to be the most important point: that while we should empirically assess the relationship between discipline and test scores, as they have just done admirably well, the moral question of discipline reform is simply not related to that empirical question. When it comes to asking if we’re suspending too many kids, test scores are simply irrelevant.

I am not a “no testing, ever” guy. That would be strange, given that I spend a considerable portion of my professional life researching educational testing. I see tests as a useful tool – that is, they exist to satisfy some specific pragmatic human purpose and are valuable to us as long as they fulfill that purpose and their side effects are not so onerous that they overwhelm that positive benefit. As I have said for years, “no testing” or “test everyone all the time” is a false binary; we enjoy the power of inferential statistics, which make it possible to know how our students are doing at scale with great precision. And since relative standardized testing outcomes (that is, individual student performance relative to peers) tend to be remarkably static over a lifetime, we don’t have much reason to worry about test data suddenly going obsolete. Careful, responsibly implemented random sampling with stratification can give us useful data without the social and emotional costs to children that limitless testing imposes. No kids lie awake at night crying because they’re stressed about having to take the NAEP.
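To illustrate the sampling point, here’s a toy sketch – the strata, sizes, and score distributions below are all made up – of how a modest stratified random sample can recover a population-level average without testing every student:

```python
# Toy sketch of estimating average scores from a stratified random sample
# instead of testing every student (all numbers and strata are invented).
import numpy as np

rng = np.random.default_rng(2)

# Pretend population: three strata with different score distributions.
strata = {
    "urban":    rng.normal(70, 10, size=60_000),
    "suburban": rng.normal(75, 10, size=30_000),
    "rural":    rng.normal(72, 10, size=10_000),
}
pop = np.concatenate(list(strata.values()))
total = sum(len(scores) for scores in strata.values())

# Test a small fixed number of students from each stratum, then weight
# each stratum's sample mean by the stratum's share of the population.
sample_per_stratum = 500
estimate = 0.0
for scores in strata.values():
    sample = rng.choice(scores, size=sample_per_stratum, replace=False)
    estimate += (len(scores) / total) * sample.mean()

print(f"true population mean:                      {pop.mean():.2f}")
print(f"estimate from testing {3 * sample_per_stratum} students total: {estimate:.2f}")
```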

The only people who are harmed by reducing the amount of testing in this way are the for-profit testing companies and ancillary businesses that suck up public funds for dubious value, and the politically motivated who use test scores as an instrument to bash teachers and schools. Whether you see their interests as equal to those of the people most directly affected – the students who must endure days of stress and boredom and the teachers who must turn their classes into little more than test prep factories – is an issue for you and your conscience.

Ultimately, the conclusions we must draw about the use of suspensions and other serious disciplinary actions must be moral and political in their nature. As such, good empiricism can function as evidence, context, and support, but it cannot solve the questions for us. To their credit, the researchers behind this study conclude by saying “as we seek to better understand these relationships, we must also consider the systemic effects.” Though I might not reach the same political conclusions as they do, I agree completely.

Many thanks to the American Prospect’s outstanding education reporter Rachel Cohen for bringing this study to my attention.

Study of the Week: Computers in the Home

A quickie today. It is fair to say that technology plays an enormous role in our educational discourse. Indeed, “technology will solve our educational problems” is a central part of the solutionism that dominates ed talk. From “teach a kid to code” to “who needs highly trained teachers when there’s Khan Academy?,” the idea that digital technology holds the key to the future of schooling is ubiquitous and unavoidable.

This is strange given that educational technology has done almost nothing but fail. Study after study has found no impact of technology on educational metrics.

(Now, let me say upfront: this blog post is not intended as a literature review for the vast body of work on the educational impacts of technology. It is instead using a large and indicative study to discuss a broader research trend. If you would like for me to write a real literature review, my PayPal is available at the right.)

Consider having a personal computer in the home. Many would assume that this would give kids an advantage in school. After all, they could play educational software, surf the Web, get help on their homework remotely…. And yet that appears to not be the case. Published in 2013, this week’s study comes from the National Bureau of Economic Research. Written by Robert W. Fairlie and Jonathan Robinson of UC Santa Cruz, the study finds in fact that a personal computer in the home simply makes no difference to student outcomes – not good, not bad, nothing.

The study is large (n = 1,123) and high quality. In particular, it offers the rare advantage of being a genuine randomized controlled experiment. That is, the researchers identified research subjects who, at baseline, did not own computers, and assigned them randomly to a control group (no computer) and a treatment group (given a computer). This is really not common in educational research. Typically, you’d have to do an observational/correlational study. That is, you’d try to identify research subjects, find which of them already had computers and which didn’t, and look for differences between the groups. These studies are often very useful and the best we have to go on given the nature of the questions we are likely to ask. You can’t, for example, assign poverty as a condition to some kids and not to others. (And, obviously, it would be unethical if you could.) But experiments, where researchers actually cause the difference between experimental and control groups – some methodologists say that there must be, in some sense, a physical intervention to manipulate independent variables – are the gold standard because they are the studies where we can most carefully assess cause and effect. Giving one set of kids computers certainly qualifies as a physical intervention.
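Here’s an illustrative sketch – invented data, not the study’s actual sample – of why random assignment licenses a causal reading: because treatment is assigned by coin flip, it can’t be correlated with anything about the students, so a simple difference in group means estimates the causal effect.

```python
# Illustrative randomized experiment with invented data (not the real study).
import numpy as np

rng = np.random.default_rng(3)
n = 1_123                                    # same order of magnitude as the study

baseline_gpa = rng.normal(2.8, 0.5, size=n)
gets_computer = rng.random(n) < 0.5          # random assignment by coin flip

true_effect = 0.0                            # suppose computers do nothing
followup_gpa = baseline_gpa + true_effect * gets_computer + rng.normal(0, 0.3, size=n)

# Because assignment was random, this difference in means is an unbiased
# estimate of the causal effect; here it comes out near zero.
diff = followup_gpa[gets_computer].mean() - followup_gpa[~gets_computer].mean()
print(f"treatment - control difference in follow-up GPA: {diff:+.3f}")
```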

And the results are clear: it just doesn’t matter. Grades, test scores, absenteeism and more… no impact. The study is accessible to a general audience, save for some discussion of their statistical controls, and I encourage you to peruse it on your own.

In its irrelevance for academic outcomes, owning a personal computer joins a whole host of other educational interventions via digital technology that have washed out completely. But hope springs eternal. I couldn’t help but laugh at this interview of Marc Andreessen in Vox, as it’s so indicative of how this conversation works. Andreessen makes outsized claims about the future impacts of technology. Timothy Lee points out that these claims have never come true in the past. Andreessen simply asserts that this will change, and Lee dutifully writes it down. That is the basic trend, always: the repeated failures of technology to make actually meaningful impacts on student outcomes will always be hand-waved away; progress is always coming, next year, or the year after that, or the next. Meanwhile, we had the internet in my junior high school classrooms in 1995. Maybe it’s time to stop waiting for technology to save us.

But then again, there are iPads to sell….

Study of the Week: The Gifted and the Grinders

Back in high school, I was a pretty classic example of a kid that teachers said was bright but didn’t apply himself. There were complex reasons for that: some owing to my home life, some to my failure to understand the stakes, and some to laziness and arrogance. Though I wasn’t under the impression that I was a genius, I did think that in the higher placement classes there were people who got by on talent and people who were striver types, the ones who gritted out high grades more through work than through being naturally bright.

This is, of course, reductive thinking, and was self-flattery on my part. (In my defense, I was a teenager.) Obviously, there’s a range of smarts and a range when it comes to perseverance and work ethic, and there are all sorts of ways these things interact with each other. And clearly those at the very top of the academic game likely have both smarts and work ethic in spades. (And luck. And privilege.) But that old, vague sense that some people are smarties and some are grinders seems pervasive. Our culture is full of those archetypes. Is it really the case that intelligence and work ethic are separate, and that they’re often found in quite different amounts in individuals?

Kind of, yeah.

At least, there’s evidence for that in a recent replication study performed by Clemens Lechner, Daniel Danner, and Beatrice Rammstedt of the Leibniz Institute for the Social Sciences, which I will talk about today for the first Study of the Week, and which I’ll use to take a quick look at a few core concepts.

Construct and Operationalization

Social sciences are hard, for a lot of reasons. One is the famously large number of variables that influence human behavior, which in turn makes it difficult to identify which variables (or interactions of variables) are responsible for a given outcome. Another is the concept of the construct.

In the physical sciences, we’re generally measuring things that are straightforward facets of the physical universe, things that are to one degree or another accessible and mutually defined by different people. We might have different standards of measure, we might have different tools to measure them, and we might need a great deal of experimental sophistication to obtain these measurements, but there is usually a fundamental simplicity to what we’re attempting to measure. Take length. You might measure it in inches or in centimeters. You might measure it with a yard stick or a laser system. You might have to use complex ideas like cosmic distance ladders. But fundamentally the concept of length, or temperature, or mass, or luminosity, is pretty easy to define in a way that most every scientist will agree with.

The social sciences are mostly not that way. Instead, we often have to look at concepts like intelligence, reading ability, tolerance, anxiety…. Each of these reflects a real-world phenomenon that most humans can agree exists, but what exactly each entails and how to measure it are matters of controversy. They just aren’t available to direct measurement in the ways common to the natural and physical sciences. So we need to define how we’re going to measure them in a way that will be regarded as valid by others – and that’s often not an uncomplicated task.

Take reading. Everybody knows what reading is, right? But testing reading ability turns out to be a complex task. If we want to test reading ability, how would we go about doing that? A simple way might be to have a test subject read a book out loud. We might then decide if the subject can be put into the CAN READ or CAN’T READ pile. But of course that’s quite lacking in granularity and leaves us with a lot of questions. If a reader mispronounces a word but understands its meaning, does that mean they can’t read that word? How many words can a reader fail to read correctly in a given text before we sort them into the CAN’T READ pile? Clearly, reading isn’t really a binary activity. Some people are better or worse readers and some people can read harder or easier texts. What we need is a scale and a test to assign readers to it. What form should that scale take? How many questions is best? Should the test involve reading passages or reading sentences? Fill in the blank or multiple choice? Is the ability to spot grammatical errors in a text an aspect of reading, or is that a different construct? Is vocabulary knowledge a part of the construct of reading ability or a separate construct?

You get the idea. It’s complicated stuff. We can’t just say “reading ability” and know that everyone is going to agree with what that is or how to measure it. Instead, we recognize the social processes inherent in defining such concepts by referring to them as a construct and to the way we are measuring that construct as an operationalization. (You are invited to roll your eyes at the jargon if you’d like.) So we might have the concept “reading ability” and operationalize it with a multiple choice test. Note that the operationalization isn’t merely an instrument or a metric but the whole sense of how we take the necessarily indistinct construct and make it something measurable.

Construct and operationalization, as clunky as the terms are and as convoluted as they seem, are essential concepts for understanding the social sciences. In particular, I find the difficulty merely in defining our variables of interest and how to measure them a key reason for epistemic humility in our research.

So back to the question of intelligence vs. work ethic. The construct “intelligence” is notoriously contested, with hundreds of books written about its definition, its measurement, and the presumed values inherent to how we talk about it. For our purposes, let’s accept merely that this is a subject of a huge body of research, and that we have the concepts of IQ and g in the public consciousness already. We’ll set aside all of the empirical and political issues with IQ for now. But what about work ethic/perseverance/”grinding”? How would we operationalize such a construct? Here we’ll have to talk about psychology’s Five Factor Model.

The “Big Five” or Five Factor Model

The Five Factor Model is a vision of human personality, particularly favored by those in behavioral genetics, that says there are essentially only five major factors in human personality: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism, often abbreviated with the acronym OCEAN. To one degree or another, proponents of the Five Factor Model argue that all of our myriad terms for personality traits are really just synonyms for these five things. That’s right, that’s all there is to it – those are the traits that make up the human personality, and we’re all found along a range on those scales. I’m exaggerating, of course, but in the case of some true believers not by much. Steven Pinker, for example, flogs the concept relentlessly in his most famous book, The Blank Slate. That’s not a coincidence; behavioral genetics, as a field, loves the Five Factor Model because it fits empirically with the maximalist case for genetic determinism. (Pinker’s MO is to say that both genetics and other factors matter about equally and then to speak as if only genetics matter.) In other words the Five Factor Model helps people make a certain kind of argument about human nature, so it gets a lot of extra attention. I sometimes call this kind of thinking the validity of convenience.

The standard defense of the Five Factor Model is, hey, it replicates – that is, its experimental reliability tends to be high, in that different researchers using somewhat different methods to measure these traits will find similar results. But this is the tail wagging the dog; that something replicates doesn’t mean it’s a valid theoretical construct, only that it tracks with some persistent real-world quality. As Louis Menand put it in a New Yorker review of The Blank Slate that is quite entertaining if you find Pinker ponderous,

When Pinker and Harris say that parents do not affect their children’s personalities, therefore, they mean that parents cannot make a fretful child into a serene adult. It’s irrelevant to them that parents can make their children into opera buffs, water-skiers, food connoisseurs, bilingual speakers, painters, trumpet players, and churchgoers—that parents have the power to introduce their children to the whole supra-biological realm—for the fundamental reason that science cannot comprehend what it cannot measure.

That the results of the Five Factor Model can be replicated does not mean that the idea of dividing the human psyche into five reductive factors and declaring them the whole of personality is valid. It simply means that our operationalizations of the construct are indeed measuring some consistent property of individuals. It’s like answering the question “what is a human being?” by saying “a human being is bipedal.” If you then send a team of observers out into the world to measure the number of legs that tend to be found on humans, you will no doubt find that different researchers are likely to obtain similar findings when counting the number of legs of an individual person. But this doesn’t provide evidence that bipedalism is the sum of mankind; it merely suggests that legs are a thing you can consistently measure, among many. Reliability is a necessary criterion for validity, but it isn’t sufficient. I don’t doubt that the Five Factor Model describes consistent and real aspects of human personality, but the way that Pinker and others treat that model as a more or less comprehensive catalog of what it means to be human is not justified. I’m sure that you could meet two different people who share the same outcomes on the five measured traits in the model, fall madly in love with one of them, and declare the other the biggest asshole you’ve ever met in your life. We’re a multivariate species.

That windup aside, for this particular kind of analysis, I think a construct like “conscientiousness” can be analytically useful. That is, I think that we can avoid the question of whether the Five Factors are actually a comprehensive catalog of essential personality traits while recognizing that there’s some such property of educational perseverance and that it is potentially measurable. (Angela Lee Duckworth’s “grit” concept has been a prominent rebranding of this basic human capacity, although it has begun to generate some criticism.) The question is, does this trait really exist independent of intelligence, and how effective a predictor is it compared to IQ testing?

Intelligence, Achievement, Test Scores, and Grades

In educational testing, it’s a constant debate: to what degree do various tests measure specific and independent qualities of tested subjects, and to what degree are they just rough approximations of IQ? You can find reams of studies concerning this question. The question hinges a great deal on the subject matter; obviously, a really high IQ isn’t going to mean much if you’re taking a Latin test and you’ve never studied Latin. On the other hand, tests like the SAT and its constituent sections tend to be very highly correlated with IQ tests, to the point where many argue that the test is simply a de facto test for g, the general intelligence factor that IQ tests are intended to measure. What makes these questions difficult, in part, is that we’re often going to be considering variables that are likely to be highly correlated within individuals. That is, the question of whether a given achievement test measures something other than g is harder to answer because people with a high g are also those who are likely to score highly on an achievement test even if that test effectively measures something other than g. Make sense?

Today’s study offers two main research questions:

first, whether achievement and intelligence tests are empirically distinct; second, how much variance in achievement measures is accounted for by intelligence vs. by personality, whereby R² increments of personality after adjusting for intelligence are the primary interest

I’m not going to wade into the broader debate about whether various achievement tests effectively measure properties distinct from IQ. I’m not qualified, statistically, to try and separate the various overlapping sums of squares in intelligence and achievement testing. And given that the g-men are known for being rather, ah, strident, I’d prefer to avoid the issue. Besides I think the first question is of much more interest to professionals in psychometrics and assessment than the general public. (This week’s study is in fact a replication of a study that was in turn disputed by another researcher.) But the second question is interesting and relevant to everyone interested in education: how much of a given student’s outcomes are the product of intelligence and how much is the product of personality? In particular, can we see a difference in how intelligence (as measured with IQ and its proxies) influences test scores and grades and how personality (as operationalized through the Five Factor Model) influences them?

The Present Study

In the study at hand, the researchers utilized a data set of 13,648 German 9th graders. The student records included their grades; their results on academic achievement tests; their results on a commonly-used test of the Five Factors; and their performance on a test of reasoning/general intelligence (a Raven’s Standard Progressive Matrices analog) and a processing speed test, which are often used in this kind of cognitive research.

The researchers undertook a multivariable analysis called “exploratory structural equation modeling.” I would love to tell you what that is and how it works but I have no idea. I’m not equipped, statistically, to explain the process or judge whether it was appropriate in this instance. We’re just going to have to trust the researchers and recognize that the process does what analysis of variance does generally, which is to look at the quantitative relationships between variables to explain how they predict, or fail to predict, each other. The nut of it is here:

First, we regressed each of the four cognitive skill measures on all Big Five dimensions. Second, we decomposed the variance of the achievement measures (achievement test scores and school grades) by regressing them on intelligence alone and then on personality and intelligence jointly.

(“Decomposing” variables, in statistics, is a fancy way of saying that you’re using mathematical techniques to identify and separate variables that might be otherwise difficult to separate thanks to their close quantitative relationships.)
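As a rough sketch of that decomposition idea – using invented data and plain least squares, not the authors’ exploratory structural equation model – you can regress an outcome on intelligence alone, then on intelligence plus personality, and read off the incremental R²:

```python
# Sketch of incremental R^2 variance decomposition with invented data
# (not the authors' ESEM; just the underlying regression logic).
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least-squares fit of y on X (plus an intercept)."""
    X = np.column_stack([X, np.ones(len(y))])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(4)
n = 13_648                                   # same n as the German 9th-grader sample

intelligence = rng.normal(size=n)
conscientiousness = rng.normal(size=n)       # assumed roughly independent of intelligence

# Invented weights: grades lean on conscientiousness, test scores on intelligence.
grades      = 0.3 * intelligence + 0.5 * conscientiousness + rng.normal(size=n)
test_scores = 0.7 * intelligence + 0.1 * conscientiousness + rng.normal(size=n)

for name, y in [("grades", grades), ("test scores", test_scores)]:
    r2_iq = r_squared(intelligence, y)
    r2_both = r_squared(np.column_stack([intelligence, conscientiousness]), y)
    print(f"{name}: R^2 intelligence alone = {r2_iq:.2f}, "
          f"increment from personality = {r2_both - r2_iq:.2f}")
```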

What did they find? The results are pretty intuitive. There is, as is to be expected, a strong (.76) correlation between performance on the intelligence test and performance on achievement tests. There’s also a considerable but much weaker relationship between achievement tests and grades (.44) and the general intelligence test and grades (.32). So kids who are smarter as defined by achievement and reasoning tests do get better grades, but the relationship isn’t super strong. There are other factors involved. And a big part of that unexplained variance, according to this research, is personality.

The Big Five explain a substantial, and almost identical, share of variance in grades and achievement tests, amounting to almost one-fifth. By comparison, they explain less than half as much—<.10—of the variance in reasoning, and almost none in processing speed (0.07%).

In other words, if you’re trying to predict how students will do on grades and achievement tests, their personalities are pretty strong predictors. But if you’re trying to predict their pure reasoning ability, personality is pretty useless. And the good-at-tests, bad-at-grades students like high school Freddie are pretty plentiful:

the predictive power of intelligence is markedly different for these two achievement measures: it is much higher than that of personality in the case of achievement—but much lower in the case of school grades, where personality alone explains almost two times more variance than intelligence alone does.

So it would seem there may be some validity to the concept of the naturally bright and the grinders after all. And what about the obverse, the less naturally bright but highly motivated grinder types?

Conscientiousness has a substantial positive relationship with grades—but negative relationships with both achievement test scores and reasoning.

In other words, the more conscientious you are, the better the grades you receive, even as you tend to score lower on achievement and intelligence tests. Unsurprisingly, Conscientiousness (the “grit,” perseverance, stick-to-itiveness factor) correlated most highly with school grades, at .27. The ability to continue to work diligently and through adversity makes a huge difference in getting good grades but is much less important when it comes to raw intelligence testing.

What It Means

Ultimately, this research result is intuitive and matches with the personal experience of many. As someone who spent a lot of his life skating by on being bright, and who only really became academically focused late in his undergraduate education, there’s something selfishly comforting here. But in the broader, more socially responsible sense, I think we should take care not to perpetuate any stigmas about the grinders. On the one hand, our culture is absolutely suffused with celebrations of conscientiousness and hard work, so it’s not like I think grinders get no credit. And it is important to say that there are certain scenarios where pure reasoning ability matters; if you’re intent on being a research physicist or mathematician, for example, or if you’re bent on being a chess Grandmaster, hard work will not be sufficient, no matter what Malcolm Gladwell says. On the other hand, I am eager to contribute in whatever way to undermining the Cult of Smartness. We’ve perpetuated the notion that those naturally gifted with high intelligence are our natural leaders for decades, and to show for it we have immense elite failures and a sickening lack of social responsibility on Wall Street and in Silicon Valley, where the supposed geniuses roam.

What we really need, ultimately, from both our educational system and our culture, is a theme I will return to in this blog again and again: a broader, more charitable, more humanistic definition of what it means to be a worthwhile human being.

(Thanks to SlateStarCodex for bringing this study to my attention.)