Too Hot for Academic Journals: Lexical Diversity and Quality in L1 and L2 Student Essays

Today I’m printing a pilot study I wrote as a seminar paper for one of my PhD classes, a course in researching second language learning. It was one of the first times I did what I think of as real empirical research, using an actual data set. That data set came from a professor friend of mine. It was a corpus of essays written for a major test of writing in English, often used for entrance into English-language colleges and universities, and developed by a major testing company. The essays came packaged with metadata include the score they received, making them ideal for investigating the relationship between textual features and perceived quality, then as now a key interest of mine. And since the data had been used in real-world testing with high stakes for test takers, it added obvious exigence to the project. The data set was perfect – except for the very fact that it was from Big Testing Company, and thus proprietary and subject to their rules about using their data.

That’s why, when I got the data from my prof, she said “you probably won’t want to try and publish this.” She said that the process of getting permission would likely be so onerous that it wouldn’t be worth trying to send it out for review. That wasn’t a big deal, really – like I said, it was a pilot study, written for a class – but this points to broader problems with how independent researchers can vet and validate tests that are part of a big money, high-stakes industry.

Here’s the thing: often, to use data from testing companies like Big Testing Company, you have to submit your work for their prior review at every step of the revision process. And since you will have to make several rounds of changes for most journals, and get Big Testing Company to sign off on them, you could easily find yourself waiting years and years to get published. So for this article I would have had to send them the paper, wait months for them to say if they were willing to review it and then send me revisions, make those revisions and send the paper back, wait for them to see if they’d accept my revisions, submit it to a journal, wait for the journal to get back to me with revision requests, make the revisions for the journal, then send the revised paper back to the testing company to see if they were cool with the new revisions…. It would add a whole new layer of waiting and review to an already long and frustrating process.

So I said no thanks and moved on to new projects. I suspect I’m not alone in this; grad students and pre-tenure professors, after all, have time constraints on how long the publication process can take, and that process is professionally crucial. Difficulties in obtaining data on these tests amounts to a powerful disincentive for conducting research on them, which in turn leaves us with less information about them than we should have, given the roles they play in our economy. Some of these testing companies are very good about doing rigorous research on their own products – ETS is notable in this regard – but I remain convinced that only truly independent validation can give us the confidence we need to use them, especially given the stakes for students.

As for the study – please be gentle. I was a second-semester PhD student when I put this together. I was still getting my sea legs in terms of writing research articles, and I hadn’t acquired a lot of the statistical and research methods knowledge that I developed over the course of my doctoral education. This study is small-n, with only 50 observations, though the results are still significant to the .05 alpha that is typical in applied linguistics. Today I’d probably do the whole set of essays. I’d also do a full-bore regression etc. rather than just correlations. Still, I can see the genesis of a lot of my research interests in this article. Anyway, check it out if you’re interested, and please bear in mind the context of this research.


Lexical Diversity and Quality in L1 and L2 Student Essays

Introduction and Rationale.

Traditionally, linguistics has recognized a broad division within the elementary composition of any language: the lexicon of words, parts of words, and idiomatic expressions that make up the basic units of that language’s meanings, and the computational system that structures them to make meaning possible. In college writing pedagogy, our general orientation is to higher-order concerns than either of these two elementary systems (Faigley and Witte). College composition scholars and instructors are more likely to concern themselves with rhetorical, communicative, and disciplinary issues than in the two elementary systems, which they reasonably believe to be too remedial to be appropriate for college level instruction. This prioritization of higher-order concerns persists despite tensions with students, who frequently focus on lower-order concerns themselves (Beach and Friedrich). Despite this resistance, one half of this division receives considerable attention in college writing pedagogy. Grammatical issues are enough of a concern that research has been continuously published concerning how to address them. Books are published that deal solely with issues of grammar and mechanics. This grudging attention persists, despite theoretical and disciplinary resistance to it, because of a perceived exigency: without adequate skills in basic English grammar and syntax, writers are unlikely to fulfill any of the higher-order requirements typical of academic writing.

In contrast, very little attention has been paid— theoretically, pedagogically, or empirically— to the lexical development of adult writers. Consideration of vocabulary is dominantly concentrated in scholarly literature of childhood education. Here, the lack of attention is likely a combination of both resistance based on the assumption that such concerns are too rudimentary to be appropriate for college instruction, as with grammatical issues, and also because of a perceived lack of need. Grammatical errors, after all, are typically systemic— they stem from a misunderstanding or ignorance of important grammatical “moves,” which means they tend to be replicated within assignments and across assignments. A lack of depth in vocabulary, meanwhile, does not result in observable systemic failures within student texts. Indeed, because a limited vocabulary results in problems of omission rather than of commission, it is unlikely to result in identifiable error at all. A student could have a severely limited vocabulary and still produce texts that are entirely mechanically correct.

But this lack of visibility in problems of vocabulary and lexical diversity should not lead us to imagine that limited vocabulary does not represent a problem for student writers. Academic writing often functions as a kind of signaling mechanism through which students and scholars demonstrate basic competencies and shared knowledge that indicates that they are part of a given discourse community (Spack). Utilizing specialized vocabulary is a part of that. Additionally, writing instructors and others who will evaluate a given student’s work often value and privilege complexity and diversity of expression. What’s more, the use of an expansive vocabulary is typically an important element of the type of precision in writing that many within composition identify as a key part of written fluency.

Issues with vocabulary are especially important when considering second language (L2) writers. Part of the reason that most writing instructors are not likely to consider vocabulary as an element of student writing lies in the fact that, for native speakers of a given language, vocabulary is principally acquired, not learned. Most adults are already in possession of a very large vocabulary in their native language, and those who are not are unlikely to have entered college. For L2 writers, however, we cannot expect similar levels of preexisting vocabulary. Vocabulary in a second language, research suggests, is more often learned than acquired. The lexical diversity of a given L2 writer is likely influenced by all of the factors that contribute to general second language fluency, such as amount of prior instruction, quality of instruction, opportunities for immersion, exposure to native speakers, access to resources, etc. Further, because some L2 writers return to their country of origin, or otherwise frequently converse in their L1, they may lack opportunities to continue to develop their vocabulary equivalent to their L1 counterparts. In sum, the challenge of adequate vocabulary can reasonably be expected to be higher for L2 writers.

If it can be demonstrated that diversity in vocabulary in fact has a significant impact on perceptions of quality in student essays, we might be inspired to alter our pedagogy. Attention to development of vocabulary might be a necessary part of effective second language writing instruction. Such pedagogical evolution might entail formal vocabulary teaching with testing and memorization, or greater reading requirements, or any number of instruments to improve student vocabulary. But before such changes can be implemented, we first must understand whether diversity in vocabulary alters perceptions of essay quality and to what degree. This research is an attempt to contribute to that effort.

Theoretical Background.

The calculation of lexical diversity has proven difficult and controversial. The simplest method for measuring lexical diversity lies in simply counting the number of different words (NDW) that appear in a given text. (This figure is now typically referred to as types.) In some research, only words with different roots are counted, so that inflectional differences do not alter the NDW; in some research, each different type is counted separately. The problems with NDW are obvious. The figure is entirely dependent on the length of a given text. It’s impossible to meaningfully compare a text of 50 words to a text of 75 words, let alone to a text of 500 words or 3,000 words. Problems with scalability— the difficulty in making meaningful measures across texts of differing lengths— have been the most consistent issue with attempts to measure lexical diversity.

The most popular method to address this problem has been Type-to-Token Ratio, or TTR. TTR is a simple measurement where the number of types is divided by the number of tokens, giving a proportion between 0 to 1, with a higher figure indicating a more diverse range of vocabulary in the given sample. A large amount of research has been conducted utilizing TTR over a number of decades (see Literature Review). However, the discriminatory power of TTR, and thus its value as a descriptive statistic, has been seriously disputed. These criticisms are both empirical and theoretical in nature. Empirically, TTR has been shown in multiple studies to steadily decrease with sample size, making it impossible to use the statistic to discriminate between texts and thus losing any explanatory value (Broeder; Chen and Leimkhuler; Richards). David Malvern et al explain the theoretical reason for this observed phenomenon:

It is true that a ratio provides better comparability than the simple raw value of one quantity when the quantities in the ration come in fixed proportion regardless of their size. For example, in the case of the density of a substance, the ratio (mass/volume) remains the same regardless of the volume from which it is calculated. Adding half as much again to the volume will add half as much to the mass… and so on. Language production is not like that, however. Adding an extra word to a language sample always increases the token count (N) but will only increase the type count (V) if the word has not been used before. As more and more words are used, it becomes harder and harder to avoid repetition and the chance of the extra word being a new type decreases. Consequently, the type count (V) in the numerator increases at a slower rate than the token count (N) in the denominator and TTR inevitably falls. (22)

This loss of discriminatory power over sample size renders TTR an ineffective measure of lexical diversity. Many transformations of TTR have been proposed to address this issue, but none of them have proven consistently satisfying as alternative measures.

One of the most promising metrics for lexical diversity is D, derived from the vocd algorithm. Developed by Malvern et al, and inspired by theoretical statistics described by Thompson and Thompson in 1915, the process utilized in the generation of D avoids the problem of sample size through reference to ideal curves. The algorithm, implemented through a computer program, draws a set of samples from the target text, beginning with 35 types, then 36, etc., until 50 samples are taken. Each sample is then compared to a series of ideal curves that are generated based on the highest and lowest possible lexical diversity for a given text. This relative position, derived from the curve fitting, is expressed as D, a figure that represents rising lexical diversity as it increases. Since each sample is slightly different, each one returns a slightly different value for D, which is averaged to reach Doptimum. See Figure 1 for a graphical representation of the curve fitting of vocd.

Figure 1. “Ideal TTR versus token curves.” Malvern et al. Lexical Diversity and Language Development, pg. 52

D has proven to be a more reliable statistic than those based on TTR, and it has not been subject to sample size issues to the same degree as other measures of lexical diversity. (See Limitations, however, for some criticisms that have been leveled against the statistic.) The calculation of D and the vocd algorithm are quite complex and go beyond the boundaries of this research. An in-depth explanation and demonstration of vocd and the generation of D, including a thorough literature review, can be found in Malvern et al’s Lexical Diversity and Language Development (2004).

Research Questions.

My research questions for this project are multiple.

  • How diverse is the vocabulary of L1 and L2 writers in standardized essays, as operationalized through lexical density measures such as D?
  • What is the relationship between quality of student writing, as operationalized through essay rating, and the diversity of vocabulary, as operationalized through measures of lexical density such as D?
  • Is this relationship equivalent between L1 and L2 writers? Between writers of different first languages?

Literature Review.

As noted in the Rationale section, the consideration of diversity in vocabulary in composition studies generally and in second language writing specifically has been somewhat limited, at least relative to attention paid to strictly grammatical issues or to higher-order concerns such as rhetorical or communicative success. This lack of attention is interesting, as some standardized tests of writing that are important aspects of educational and economic success explicitly mention lexical command as an aspect of effective writing.

Some research has been conducted by second language researchers considering the importance of vocabulary to perceptions of writing quality. In 1995, Cheryl Engber published “The Relationship of Lexical Proficiency to the Quality of ESL Compositions.” This research involved the holistic scoring of 66 student essays and comparison to four measures of lexical diversity: lexical variation, error-free variation, percentage of lexical error, and lexical density. These measures considered not only the diversity of displayed vocabulary but also the degree to which the demonstrated vocabulary was used effectively and appropriately in its given context. Engber found that there was a robust and significant correlation between a student’s (appropriate and free of error) demonstrated lexical diversity and the rating of that student’s essay. However, the research utilized the conventional TTR measure for lexical diversity, which is flawed for the reasons previously discussed.  In 2000, Yili Li published “Linguistic characteristics of ESL writing in task-based e-mail activities.” Li’s research considered 132 emails written by 22 ESL students, which addressed a variety of tasks and contexts. These emails were subjected to linguistic feature analysis, including lexical diversity, as well as syntactic complexity and grammatical accuracy. Li found that there were slight but statistically significant differences in the lexical diversity of different email tasks (Narrative, Information, Persuasive, Expressive). She also found that lexical diversity was essentially identical between structured and non-structured writing tasks. However, she too used the flawed TTR measure for lexical diversity. In the context of the period of time in which these researchers conducted their studies, the use of TTR was appropriate, but its flaws have eroded the confidence we can place in such research.

The most directly and obviously useful precedent for my current research was conducted by Guoxing Yu and published in 2009 under the title “Lexical Diversity in Writing and Speaking Task Performances.” Having been published within the last several years, Yu’s research is new enough to have assimilated and reacted to the many challenges to TTR and related measures of lexical diversity. Yu’s research utilizes D as measured via the vocd algorithm that also was used in this research. Yu also correlated D with essay rating. However, Yu’s research was primarily oriented towards comparing and contrasting written lexical diversity with spoken lexical diversity and the influence of each on perceptions of fluency or quality. My own research is oriented specifically towards written communication. Additionally, Yu’s research utilized essays that were written and rated specifically for the research, to approximate the type of essays typically written for standardized tests, but also understood more generally. My own research utilizes a data set of essays that were specifically written and rated within the administration of a real standardized test, [REDACTED] (see Research Subjects). Also in 2009, Pauline Foster and Parvaneh Tavakoli published a consideration of how narrative complexity affected certain textual features of complexity, fluency, and lexical diversity. Like Yu, Foster and Takavoli utilized D as a measure of lexical diversity. Among other findings, their research demonstrated that the narrative complexity of a given task did not have a significant impact on lexical diversity.

Research Subjects.

For this research, I utilized the [REDACTED] archive, a database of essays that were submitted for the writing portion of the [REDACTED]. These essays were planned and composed by test takers, in a controlled environment, in 30 minutes. These essays were then rated by trained raters working for [REDACTED], holistically scored between 1 (the worst score) and 6 (the best score). Essays which earned the same rating by each rater are represented in this research through whole numbers ending in 0 (10, 20, 30, 40, 50, 60). Essays where one rater gave one score and the other rater gave one point higher or lower are represented through whole numbers ending in 5 (15, 25, 35, 45, 55). Essays where the raters assigned scores that were discordant by more than a point were rescored by [REDACTED] and are not included in this sample. According to [REDACTED], the inter-rater reliability of the [REDACTED] averages .790. A detailed explanation of the test can be found in the [REDACTED].

The corpus utilized in this research includes 1,737 essays from test administrations performed in 1990. Obviously, the age of the data should give us pause. However, as the [REDACTED] writing section and standardized essay test writing have not undergone major changes in the time since then, I believe the data remains viable. (See Limitations for more.) Within the archive, test subjects are represented from four language backgrounds: English, Spanish, Arabic, and Chinese. The essays in the archive are derived from two prompts, listed below:


The [REDACTED] archive exists as a set of .TXT files that lack file-internal metadata. Instead, the essays are identified via a complex number, and the number compared against a reference list to find information on the essay topic, language background, and score. Because of this, and because of the file extension necessary for use with the software utilized in this research (see Methods), this study utilized a small subsample of 50 essays. For this reason, this research represents a pilot study. In order to control for prompt effects, all of the included essays are drawn from the second prompt, about the writer’s preferred method of news delivery. I chose 25 essays from L1 English writers and 25 essays from L1 Chinese writers, for an n of 50. Each set of essays represents a range of scores from across the available sample. Because the essays were almost all too short to possess adequate tokens to be measured by vocd, I eliminated essays rated 10. I drew five essays each at random from those rated 20, 30, 40, 50, and 60 from the Chinese L1s, for a total of 25 essays from Chinese writers. Because L1 English writers are naturally more proficient at a test of English, the English essays have a restricted range, with very few 10s, 20s, and 30s. I therefore randomly drew five essays each from those rated 35, 40, 45, 50, and 60, for a total of 25 essays from English writers.


This research utilized a computational linguistic approach. Due to the aforementioned problems with traditional statistics for measuring lexical density like TTR and NDW, I used an algorithm known as vocd to generate D, the previously-discussed measure of lexical density that compares a random sample of given texts to a series of ideal curves to determine how diverse the vocabulary of that text is. This algorithm was implemented in the CLAN (Computerized Language Analysis) software suite, a product of the CHILDES (Child Language Development Exchange System) program at Carnegie Mellon University. CLAN is a freeware software that provides a graphical user interface (GUI) for using several programs for typical linguistics uses, such as frequency lists or collocation. Previously, researchers using vocd would have to perform the operations using a command line system. The integration of vocd into CLAN makes vocd easier to use and more accessible.

CLAN uses a proprietary form of file extension, .CHA, as the program suite was originally developed for the study of transcribed audio data that maintains information about pausing and temporal features. In order to utilize the [REDACTED] archive files, they were converted to .CHA format using CLAN’s “textin” program. They were then analyzed using the vocd program, which returned information on types (NDW), tokens (word count), TTR, and the various Ds obtained with each sample, along with a Doptimum derived by averaging each. An example of the output provided by CLAN is below in Figure 2.

Figure 2. CLAN Interface.
Once these data were obtained, averages for type, token, TTR, and D were generated. Then, corrolation matrixes were developed for Chinese L1s, English L1s, and combined data, to find correlations between Types, Tokens, TTR, D, and rating. A scatter plot was developed from the combined data’s correlation of D and rating to represent that relationship graphically.


There are several relevant, statistically significant results from my research. Raw data is attached as Appendix A.

Figure 3. Chinese L1s Correlation Matrix

As can be seen in the correlation matrix of results from essays by Chinese L1 students (Figure 3), there is a moderate, significant correlation between D and an essay’s rating, at .498 and significant at p<.05. This suggests that, for Chinese L1s (and perhaps L2 students in general), the demonstration of diversity in vocabulary is an important part of perceptions of essay quality. The highest correlation are with the simple measures of length, type (NDW) and token (essay length). These correlations (while unusually high for this sample) confirm longstanding empirical understanding that length of essay correlates powerfully with essay rating in standardizd essay tests. The moderate negative correlation between TTR and tokens adds further evidence that TTR degrades with text length.

Figure 4. English L1s Correlation Matrix

The subsample of English L1s displays some similarities to the correlations found in Chinese L1s. Once again, there is a significant correlation between tokens (essay length) and rating, suggesting that writing enough remains an essential part of succeeding in a standardized essay test. The correlation between rating and D is somewhat smaller than that for Chinese L1s. This might perhaps owe to an assumed lower functional vocabulary for ESL students. We might imagine a threshold of minimum functional vocabulary usage that must be met before writers can demonstrate sufficient writing ability to score highly on a standardized essay test. If true, and if L1s are likely to be in possession of a vocabulary at least large enough to craft effective essay answers, lexical diversity could be more important at the lower end of the quality scale. This might prove true especially in a study with a restricted, negatively skewed range of ratings for L1 writers. More investigation is needed. Unfortunately, this correlation is not statistically significant, perhaps owing to the small sample size utilized in this research.

Figure 5. Combined Data Correlation Matrix.

The combined results show similar patterns, as is to be expected. These results are encouraging for this research. The moderate, statistically-significant (p<.01) correlaiton between rating and D demonstrates that lexical diversity is in fact an important part of student success at standardized essay tests. The high correlation between rating and both types and tokens confirms longstanding beliefs that writing a lot is the key to scoring highly on standardized essay tests. TTR’s negative correlation with tokens demonstrates again that it inevitably falls with text length; its negative correlation with rating demonstrates that it lacks value as a descriptive statistic that can contribute to understanding perceived quality.

Figure 6. Scatterplot of Doptimum-Rating Correlation for Combined Data.

This scatterplot demonstrates the key correlation in this research, between D and essay rating. A general progression from lower left to upper right, with many outliers, demonstrates the moderate correlation I previously identified. This correlation is intuitively satisfying. As I have argued, a diverse functional vocabulary is an important prerequisite of argumentative writing. However, while it is necessary, it is not sufficient; essays can be written that display many different words, without achieving rhetorical, mechanical, or communicative success. Likewise, it is possible to write an effective essay that is focused on a small number of arguments or ideas, resulting in a high rating with a low amount of demonstrated diversity in vocabulary.


There are multiple limitations to this research and research design.

First, my data set has limitations. While the data set comes directly from [REDACTED], and represents actual test administrations, because I did not collect the data myself, there is a degree of uncertainty about some of the details regarding its collection. For example, there is reason to believe that the native English speaking participants (who naturally were unlikely to undertake a test of English language ability like the [REDACTED]) took the test under a research or diagnostic directive. While this is standard practice in the test administration world (Fulcher 185), it might introduce construct-irrelevant variance into the sample, particularly given that those taking the test on a diagnostic or research basis might feel less pressure to perform well. Additionally, the particular database of essays contains samples that are over 20 years old. Whether this constitutes a major limitation of this study is a matter of interpretation. While that lack of timeliness might give us pause, it is worth pointing out that neither the [REDACTED]’s writing portion or standardized tests of writing of the type employed in the [REDACTED] have changed dramatically in the time since this data was collected. The benefit of using actual data from a real standardized essay test, in my view, outweighs the downside of the age of that data. A final issue with the [REDACTED] archive as a dataset for this project lies in broad objections to the use of this kind of test to assess student writing. Arguments of this kind are common, and frequently convincing. However, exigence in utilizing this kind of data remains. Tests like the TOEFL, SAT, GRE, and similar high-stakes assessments of writing are hurdles that frequently must be cleared by both L1 and L2 students alike. High stakes assessments of this type are unlikely to go away, even given our resistance to them, and so should continue to be subject to empirical inquiry.

Additionally, while D does indeed appear to be a more robust, predictive, and widely-applicable measure of lexical diversity than traditional measures like NDW and TTR, it is not without problems. Recent scholarship has suggested that D, too, is subject to reduced discrimination above a certain sample size. See, for example, McCarthy and Jarvis (2007) use parallel sampling and comparison to a large variety of other indexes of lexical diversity to demonstrate that D’s ability to act as a unit of comparison degrades across texts that vary by more than perhaps 300 words (tokens). McCarthy and Jarvis argue that research utilizing vocd should be restricted to a “stable range” of texts comprising 100-400 words. The vast majority of essays in the TWE archive fall inside of this range, although many of the worst-scoring essays contain less than 100 words and a small handful exceed 400. Four essays in my sample do not meet the 100 word threshold, with word counts (tokens) of 51, 73, 86, and 90, and one essay that exceeds the 400 word threshold, with 428 words. Given the small number of essays outside of the stable range, I feel my research maintains validity and reliability despite McCarthy and Jarvis’s concerns.

Directions for Further Research.

There are many ways in which this research can be improved and extended.

The most obvious direction for extending this research lies in expanding its sample size. Due to the aforementioned difficulty in incorporating the [REDACTED] archive with the CLAN software package, less than 3% of the total [REDACTED] archive was analyzed. Utilizing all of the data set, whether through automation or by hand, is an obvious next step. An additional further direction for this research might involve the incorporation of additional advanced metrics for lexical diversity, such as MTLD and HD-D. In a 2010 article in the journal Behavior Research Methods titled “MTLD, vocd-D, HD-D: A validation study of sophisticated approaches to lexical diversity assessment,” Philip McCarthy and Scott Jarvis advocate for the use of those three metrics together in order to increase the validity of the analysis of lexical diversity. In doing so, they argue, these various metrics can help to address the various shortcomings of each other. There may, however, be statistical and computational barriers to affecting this kind of statistical analysis.

Another potential avenue for extending and improving this research would be to address the age of the essays analyzed by substituting a difference corpus of student essays for the [REDACTED]. This would also help with publication and flexibility of presentation, as ETS places certain restrictions on the use of its data in publication. Finding a comparable corpus will not necessarily be easy. While there are a variety of corpora available to researchers, few are as specific and real world-applicable as the [REDACTED] archive. Many publicly available corpora do not use writing specifically generated by student writers; those that do often derive their essays from a variety of tasks, genres, and assignments, limiting the reliability of comparisons made statistically. Finally, there are few extant corpora that have quality ratings already assigned to individual essays, as with the [REDACTED]. Without ratings, no correlation with D (or other measures of lexical diversity) is possible. Ratings could be generated by researchers, but this would likely require the availability of funds with which to pay them.

Finally, this research could be expanded by turning from its current quantitative orientation to a mixed methods design that incorporates qualitative analysis as well. There are multiple ways in which such an expansion might be undertaken. For example, a subsample of essays might be evaluated or coded by researchers who could assess them for a qualitative analysis of their diversity or complexity of vocabulary use. This qualitative assessment of lexical diversity could then be compared to the quantitative measures. Researchers could also explore individual essays to see how lexical diversity contributes to the overall impression of that essay and its quality. Researchers could examine essays where the observed lexical diversity is highly correlative with its rating, in order to explore how the diversity of vocabulary contributes to its perceived quality. Or they could consider essays where the correlation is low, to show the limits of lexical diversity of a predictor of quality and to better understand how outliers like this are generated.


As discussed in the Introduction and Rationale statement, this research arose from exigence. I have identified a potential gap in instruction, the lack of attention paid to vocabulary in formal writing pedagogy for adult students. I have also suggested that this gap might be especially problematic for second language learners, who might be especially vulnerable to problems with displaying adequate vocabulary, and having access to correct terms, when compared to their L1 counterparts.

Given that this research utilized a data set drawn from a standardized test of English using time essay writing, and that many scolars in composition dispute the validity of such tests for gauging overall writing ability, the most direct implications of this study must be restricted to those tasks. In those cases, the lessons of this research appear clear: students should try to write as much as possible in the time alloted, and they should attend to their vocabulary both in order to fill that space effectively and to be able to demonstrate a complex and diverse vocabulary. Precise methods for this kind of self-tutoring or instruction are beyond the boundaries of this research, but both direct vocabulary instruction (such as with word lists and definition quizzes) and indirect (such as through reading challenging material) should be considered. As noted, this kind of evolution in pedagogy and best practices within instruction is at odds with many conventional assumptions about the teaching of writing. Some resistance is to be expected.

As for the broader notion of lexical diversity as a key feature of quality in writing, further research is needed. While this pilot study cannot provide more than a limited suggestion that demonstrations of a wide vocabulary are important to perceptions of writing quality, the findings of this research coincide with intuition and assumptions about how writing works. Further research, in keeping with the suggestions outline above, could be of potentially significant benefit to students, instructors, administrators, and researchers within composition studies alike.

Works Cited

Beach, Richard, and Tom Friedrich. “Response to writing.” Handbook of writing research  New York: The Guilford Press, 2006. 222-234. Print.

Broeder, Peter, Guus Extra, and R. van Hout. “Richness and variety in the developing lexicon.” Adult language acquisition: Cross-linguistic perspectives. Vol. I: Field methods (1993): 145-163. Print.

Chen, Ye‐Sho, and Ferdinand F. Leimkuhler. “A type‐token identity in the Simon‐Yule model of text.” Journal of the American Society for Information Science 40.1 (1989): 45-53. Print.

Engber, Cheryl A. “The relationship of lexical proficiency to the quality of ESL compositions.” Journal of second language writing 4.2 (1995): 139-155. Print.

Faigley, Lester, and Stephen Witte. “Analyzing revision.” College composition and communication 32.4 (1981): 400-414. Print.

Foster, Pauline, and Parvaneh Tavakoli. “Native speakers and task performance: Comparing effects on complexity, fluency, and lexical diversity.” Language Learning 59.4 (2009): 866-896. Print.

“IELTS Handbook.”; The British Council. 2007. Web. 1 May 2013.

Li, Yili. “Linguistic characteristics of ESL writing in task-based e-mail activities.” System 28.2 (2000): 229-245. Print.

Malvern, David, et al. Lexical diversity and language development. New York: Palgrave Macmillan, 2004. Print.

McCarthy, Philip M., and Scott Jarvis. “vocd: A theoretical and empirical evaluation.” Language Testing 24.4 (2007): 459-488. Print.

—. “MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment.” Behavior research methods 42.2 (2010): 381-392.

Richards, Brian. “Type/token ratios: What do they really tell us.” Journal of Child Language 14.2 (1987): 201-209. Print.

Spack, Ruth. “Initiating ESL students into the academic discourse community: How far should we go?.” Tesol quarterly 22.1 (1988): 29-51. Print.

“TOEFL IBT.”, The Educational Testing Service. nd. Web. 1 May 2013.

Yu, Guoxing. “Lexical diversity in writing and speaking task performances.” Applied linguistics 31.2 (2010): 236-259. Print.

if formal music education is a privilege, spread the privilege

I was encouraged by this open letter from musicians and music educators in the Guardian, responding to a deeply wrongheaded essay arguing that, since formal music education is increasingly restricted to the white and wealthy, formal music education (that is, notation and theory) is therefore somehow bad and we should stop trying to do it at all. That sounds ridiculous but look and see for yourself.

This is hardly a unique argument in education or outside of it. (A recent vintage I’ve heard, particularly ludicrous, is that the impending shutdown of the New York L train is a good thing, because the L train is ridden by privileged hipsters, which… I can’t even begin to tell you how immensely stupid that is.) French poetry and other “impractical” majors sometimes get this routine – they are disproportionately concentrated in elite private colleges, so therefore there is something inherently decadent about studying them. Their connection to privilege somehow renders them unclean. But of course the fact that these wonderful subjects are now the province of those who have less immediate pressure to achieve independent financial stability only means that we should spread that condition.

A more equitable and humane society is one in which more people, not fewer, can spend their time on beautiful, “impractical” pursuits. Yet there are those deluded leftists who sometimes take a similar tack; why, they ask, are we funding Shakespeare in the park when there are people without warm clothes for the winter? Why pay for museums when some people go hungry? But follow this thinking long enough and you realize what they’re really saying is “poor people have no inner life.” In a just society we recognize that nobody, rich or poor, lives on bread alone. If your socialism doesn’t spread access to music and art and theater and cathedrals and tree-lined boulevards, I have no use for it or for you.

The point of privilege analysis is to spread the privileges to everyone, not to end them, and since music is the food of love, let rich and poor kids play on.

journalists, beware the gambler’s fallacy

One persistent way that human beings misrepresent the world is through the gambler’s fallacy, and there’s a kind of implied gambler’s fallacy that works its way into journalism quite often. It’s hugely important to anyone who cares about research and journalism about research.

The gambler’s fallacy is when you expect a certain periodicity in outcomes when you have no reason to expect it. That is, you look at events that happened in the recent past, and say “that is an unusually high/low number of times for that event to happen, so therefore what will follow is an unusually low/high number of times for it to happen.” The classic case is roulette: you’re walking along the casino floor, and you see the electronic sign showing that a roulette table has hit black 10 times in a row. You know the odds of this are very small, so you rush over to place a bet on red. But of course that’s not justified: the table doesn’t “know” it has come up black 10 times in a row. You’ve still got the same (bad) odds of hitting red, 47.4%. You’re still playing with the same house edge. A coin that’s just come up heads 50 times in a row has the same odds of being heads again as being tails again. The expectation that non-periodic random events are governed by some sort of god of reciprocal probabilities is the source of tons of bad human reasoning – and journalism is absolutely stuffed with it. You see it any time people point out that a particular event hasn’t happened in a long time, so therefore we’ve got an increased chance of it happening in the future.

Perhaps the classic case of this was Kathryn Schulz’s Pulitzer Prize-winning, much-celebrated New Yorker article on the potential mega-earthquake in the Pacific northwest. This piece was a sensation when it appeared, thanks to its prominent placement in a popular publication, the deftness of Schulz’s prose, and the artful construction of her story – but also because of the gambler’s fallacy. At the time I heard about the article constantly, from a lot of smart, educated people, and it was all based on the idea that we were “overdue” for a huge earthquake in that region. People I know were considering selling their homes. Rational adults started stockpiling canned goods. The really big one was overdue.

Was Schulz responsible for this idea? After publication, she would go on to be dismissive of the idea that she had created the impression that we were overdue for such an earthquake. She wrote in a followup to the original article,

Are we overdue for the Cascadia earthquake?

No, although I heard that word a lot after the piece was published. As DOGAMI’s Ian Madin told me, “You’re not overdue for an earthquake until you’re three standard deviations beyond the mean”—which, in the case of the full-margin Cascadia earthquake, means eight hundred years from now. (In the case of the “smaller” Cascadia earthquake, the magnitude 8.0 to 8.6 that would affect only the southern part of the zone, we’re currently one standard deviation beyond the mean.) That doesn’t mean that the quake won’t happen tomorrow; it just means we are not “overdue” in any meaningful sense.

How did people get the idea that we were overdue? The original:

we now know that the Pacific Northwest has experienced forty-one subduction-zone earthquakes in the past ten thousand years. If you divide ten thousand by forty-one, you get two hundred and forty-three, which is Cascadia’s recurrence interval: the average amount of time that elapses between earthquakes. That timespan is dangerous both because it is too long—long enough for us to unwittingly build an entire civilization on top of our continent’s worst fault line—and because it is not long enough. Counting from the earthquake of 1700, we are now three hundred and fifteen years into a two-hundred-and-forty-three-year cycle.

By saying that there is a “two-hundred-and-forty-three-year cycle,” Schulz implied a regular periodicity. The definition of a cycle, after all, is “a series of events that are regularly repeated in the same order.” That simply isn’t how a recurrence interval functions, as Schulz would go on to clarify in her followup – which of course got vastly less attention. I appreciate that, in her followup, Schulz was more rigorous and specific, referring to an expert’s explanation, but it takes serious chutzpah to have written the preceding paragraph and then to later act as though there’s no reason your readers thought the next quake was “overdue.” The closest thing to a clarifying statement in the original article is as follows:

It is possible to quibble with that number. Recurrence intervals are averages, and averages are tricky: ten is the average of nine and eleven, but also of eighteen and two. It is not possible, however, to dispute the scale of the problem.

If we bother to explain that first sentence thoroughly, we can see it’s a remarkable to-be-sure statement – she is obliquely admitting that since there is no regular periodicity to a recurrence interval, there is no sense in which that “two-hundred-and-forty-three-year cycle” is actually a cycle. It’s just an average. Yes, the “really big one” could hit the Pacific northwest tomorrow – and if it did, it still wouldn’t imply that we’ve been overdue, as her later comments acknowledge. The earthquake might also happen 500 years from now. That’s not a quibble; it’s the root of the very panic she set off by publishing the piece. But by immediately leaping from such an under-explained discussion of what a recurrence interval is and isn’t to the irrelevant and vague assertion about “the scale of the problem,” Schulz ensured that her readers would misunderstand in the most sensationalistic way possible. However well crafted her story was, it left people getting a very basic fact wrong, and was thus bad science writing. I don’t think Schulz was being dishonest, but this was a major problem with a piece that received almost universal praise.

I just read another good example of an implied gambler’s fallacy in a comprehensively irresponsible Gizmodo piece on supposed future pandemics. I am tempted to just fisk the whole thing, but I’ll spare you. For our immediate interests let’s just look at how a gambler’s fallacy can work by implication. George Dvorsky:

Experts say it’s not a matter of if, but when a global scale pandemic will wipe out millions of people…. Throughout history, pathogens have wiped out scores of humans. During the 20th century, there were three global-scale influenza outbreaks, the worst of which killed somewhere between 50 and 100 million people, or about 3 to 5 percent of the global population. The HIV virus, which went pandemic in the 1980s, has infected about 70 million people, killing 35 million.

Those specific experts are not named or quoted, so we’ll have to take Dvorsky’s word for it. But note the implication here: because we’ve had pandemics in the past that killed significant percentages of the population, we are likely to have more in the future. An-epidemic-is-upon-us stories are a dime a dozen in contemporary news media, given their obvious ability to drive clicks. Common to these pieces are the implication that we are overdue for another epidemic because epidemics used to happen regularly in the past. But of course, conditions change, and there’s few fields where conditions have changed more in the recent past than infectious diseases. Dvorsky implies that they have changed for the worse:

Diseases, particularly those of tropical origin, are spreading faster than ever before, owing to more long-distance travel, urbanization, lack of sanitation, and ineffective mosquito control—not to mention global warming and the spread of tropical diseases outside of traditional equatorial confines.

Sure, those are concerns. But since he’s specifically set us up to expect more pandemics by referencing those in the early 20th century, maybe we should take a somewhat broader perspective and look at how infectious diseases have changed in the past 100 years. Let’s check with the CDC.

The most salient change, when it comes to infectious, has been the astonishing progress of modern medicine. We have a methodology for fighting infectious disease that has saved hundreds of millions of lives. Unsurprisingly, the diseases that keep getting nominated as the source of the next great pandemic keep failing to spread at expected rates. Dvorsky names diseases likes SARs (global cases since 2004: zero) and Ebola (for which we just discovered a very promising vaccine), not seeming to realize that these are examples of victories for the control of infectious disease, as tragic as the loss of life has been. The actual greatest threats to human health remain what they have been for some time, the deeply unsexy threats of smoking, heart disease, and obesity.

Does the dramatically lower rate of deaths from infectious disease mean a pandemic is impossible? Of course not. But “this happened often in the past, and it hasn’t happened recently, so….” is fallacious reasoning. And you see it in all sorts of domains of journalism. “This winter hasn’t seen a lot of snow so far, so you know February will be rough.” “There hasn’t been a murder in Chicago in weeks, and police are on their toes for the inevitable violence to come.” “The candidate has been riding a crest of good polling numbers, but analysts expect he’s due for a swoon.” None of these are sound reasoning, even though they seem superficially correct based on our intuitions about the world. It’s something journalists in particular should watch out for.

another day, another charter school scandal

Stop me if you’ve heard this one before. A charter regime sweeps into town with grand ambitions, lofty rhetoric, and missionary zeal, promising to save underperforming kids with the magic of markets and by getting rid of those lazy teachers and their greedy unions. What results instead is no demonstrable learning gains, serial rule breaking, underhanded  tactics to attract students, a failure to provide for students with disabilities, and a total lack of real accountability. That’s the story in Nashville. It’s not a new story.

The official narrative is that students and parents will flock to charters, given that they provide “choice,” and in so doing sprinkle students with magical capitalism dust that, somehow – the mechanism is never clear – results in sturdy learning gains. (That schools have both abundant ability to juice the numbers and direct incentive to do so usually goes undiscussed.) Yet the charters in Nashville, like those in the horrific mess in Detroit, are so driven by the need to get dollars – precisely the thing that was supposed to make charters better than public, in the neoliberal telling – that they have to resort to dirty tricks to get parents to sign their children up. And what kind of conditions do students face when they do go to these schools?

On March 7 WSMV-TV reported that California-based Rocketship isn’t providing legally required services to students with disabilities and English language learners. A report by the Tennessee Department of Education even found that Rocketship is forcing homeless students to scrape together money to pay for uniforms.

Lately you may have heard about “public charter schools.” But there is no such thing as a public charter school. Public schools entail public accountability. They involve local control. So-called public charters just take public money, tax dollars. The “flexibility” that is so often touted as part of the charter school magic really means that the citizens who fund these charters have none of the local control of schools that has been such an essential part of public education. And so in Nashville you have area citizens getting mass texted to send their kids to charters that benefit for-profit companies, who then in turn can’t actually directly respond through a local school board or municipal government. When parents feel they need to file a class action lawsuit to enforce some accountability on out-of-control local schools, we’ve officially gone around the bend.

I try – I really do – to have patience for the army of people who are charter school true believers, still, after all this scandal and all this failure. I remind myself that there are people who sincerely believe that charters are the best route forward to improve education. (Which of course means “raise test scores” in our current culture.) I try not to view them as cynically as I do, say, for-profit prison advocates claiming that they’re really in it to make society safer. I remind myself that missionary zeal and a dogged belief that all social problems can be solved if we just believe hard enough can really cloud people’s minds.

But the fact of the matter is that the charter school “movement” is absolutely stuffed to the gills with profiteers and grifters. Thanks to the nearly-universal credulousness of our news media towards the school reform movement, some greased palms in local, state, and federal government, and the powerful and pernicious influence of big-money philanthropic organizations like the Gates Foundation, the conditions have been perfect for rampant corruption and bad behavior. Charter school advocates rang the dinner bell for entire industries that seek to ring profit out of our commitment to universal free education, and the wolves predictably followed. Meanwhile, charters continue to be pushed based on bad research, attrition and survivorship bias, dubious quality metrics, and through undue focus on small, specific, atypical success stories whose conditions cannot possibly scale. And the people who have spent so much time flogging the profound moral need to save struggling children have been remarkably silent about the decades of failure in charter schools writ large. Where is Jonathan Alter to decry the corruption and failure in Nashville? Where is Jonathan Chait’s column admitting that the charter school movement has proven to be an unsalvageable mess? Where is the follow up to Waiting for “Superman,” titled Turns Out Superman Isn’t Real and Other Dispatches from Planet Earth? That’s the problem with zealots; they’re always too busy circling the wagons for their pet causes to actually look at them critically.

As the Betsy Devoses of the world make policy, as companies get rich wringing profit out of poor school districts, and as writers make careers for themselves with soaring rhetoric and tough talk about accountability that, somehow, never changes in the light of new evidence, we’ll see more school reform disasters like those in Nashville, Detroit, and Newark. Will that do anything to slow the charter movement? Not on your life.

Study of the Week: the Gifted and the Grinders

Back in high school, I was a pretty classic example of a kid that teachers said was bright but didn’t apply himself. There were complex reasons for that, some of them owing to my home life, some of it my failure to understand the stakes, and some of it laziness and arrogance. Though I wasn’t under the impression that I was a genius, I did think that in the higher placement classes there were people who got by on talent and people who were striver types, the ones who gritted out high grades more through work than through being naturally bright.

This is, of course, reductive thinking, and was self-flattery on my part. (In my defense, I was a teenager.) Obviously, there’s a range of smarts and a range when it comes to perseverance and work ethic, and there are all sorts of aspects of these things that are interacting with each other. And clearly those at the very top of the academic game likely have both smarts and work ethic in spades. (And luck. And privilege.) But my old vague sense that some people were smarties and some were grinders seems pervasive to me. Our culture is full of those archetypes. Is it really the case that intelligence and work ethic are separate, and that they’re often found in quite different amounts in individuals?

Kind of, yeah.

At least, there’s evidence for that in a recent replication study performed by Clemens Lechner, Daniel Danner, and Beatrice Rammstedt of the Leibniz Institute for the Social Sciences, which I will talk about today for the first Study of the Week, and which I’ll use to take a quick look at a few core concepts.

Construct and Operationalization

Social sciences are hard, for a lot of reasons. One is the famously complex number of variables that influence human behavior, which in turn makes it difficult to identify which variables (or interactions of variables) are responsible for a given outcome. Another is the concept of the construct.

In the physical sciences, we’re general measuring things that are straightforward facets of the physical universe, things that are to one degree or another accessible and mutually defined by different people. We might have different standards of measure, we might have different tools to measure them, and we might need a great deal of experimental sophistication to obtain these measurements, but there is usually a fundamental simplicity to what we’re attempting to measure. Take length. You might measure it in inches or in centimeters. You might measure it with a yard stick or a laser system. You might have to use complex ideas like cosmic distance ladders. But fundamentally the concept of length, or temperature, or mass, or luminosity, is pretty easy to define in a way that most every scientist will agree with.

The social sciences are mostly not that way. Instead, we often have to look at concepts like intelligence, reading ability, tolerance, anxiety…. Each of these reflect real-world phenomenon that most humans can agree exist, but what exactly they entail and how to measure them are matters of controversy. They just aren’t available to direct measurement in the ways common to the natural and physical sciences. So we need to define how we’re going to measure them in a way that will be regarded as valid by others – and that’s often not an uncomplicated task.

Take reading. Everybody knows what reading is, right? But testing reading ability turns out to be a complex task. If we want to test reading ability, how would we go about doing that? A simple way might be to have a a test subject read a book out loud. We might then decide if the subject can be put into the CAN READ or CAN’T READ pile. But of course that’s quite lacking in granularity and leaves us with a lot of questions. If a reader mispronounces a word but understands its meaning, does that mean they can’t read that word? How many words can a reader fail to read correctly in a given text before we sort them into the CAN’T READ pile? Clearly, reading isn’t really a binary activity. Some people are better or worse readers and some people can reader harder or easier texts. What we need is a scale and a test to assign readers to it. What form should that scale take? How many questions is best? Should the test involve reading passages or reading sentences? Fill in the blank or multiple choice? Is the ability to spot grammatical errors in a text an aspect of reading, or is that a different construct? Is vocabulary knowledge a part of the construct of reading ability or a separate construct?

You get the idea. It’s complicated stuff. We can’t just say “reading ability” and know that everyone is going to agree with what that is or how to measure it. Instead, we recognize the social processes inherent in defining such concepts by referring to them as a construct and to the way we are measuring that construct as an operationalization. (You are invited to roll your eyes at the jargon if you’d like.) So we might have the concept “reading ability” and operationalize it with a multiple choice test. Note that the operationalization isn’t merely an instrument or a metric but the whole sense of how we take the necessarily indistinct construct and make it something measurable.

Construct and operationalization, as clunky as the terms are and as convoluted as they seem, are essential concepts for understanding the social sciences. In particular, I find the difficulty merely in defining our variables of interest and how to measure them a key reason for epistemic humility in our research.

So back to the question of intelligence vs. work ethic. The construct “intelligence” is notoriously contested, with hundreds of books written about its definition, its measurement, and the presumed values inherent to how we talk about it. For our purposes, let’s accept merely that this is a subject of a huge body of research, and that we have the concepts of IQ and in the public consciousness already. We’ll set aside all of the empirical and political issues with IQ for now. But what about work ethic/perseverance/”grinding”? How would we operationalize such a construct? Here we’ll have to talk about psychology’s Five Factor Model.

The “Big Five” or Five Factor Model

The Five Factor Model is a vision of human personality, particularly favored by those in behavioral genetics, that says there are essentially only five major factors in human personality: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism, sometimes anagrammed to OCEAN. To one degree or another, proponents of the Five Factor Model argue that all of our myriad terms for personality traits are really just synonyms for these five things. That’s right, that’s all there is to it – those are the traits that make up the human personality, and we’re all found along a range on those scales. I’m exaggerating, of course, but in the case of some true believers not by much. Steven Pinker, for example, flogs the concept relentlessly in his most famous book, The Blank Slate. That’s not a coincidence; behavioral genetics, as a field, loves the Five Factor Model because it fits empirically with the maximalist case for genetic determinism. (Pinker’s MO is to say that both genetics and other factors matter about equally and then to speak as if only genetics matter.) In other words the Five Factor Model helps people make a certain kind of argument about human nature, so it gets a lot of extra attention. I sometimes call this kind of thinking the validity of convenience.

The standard defense of the Five Factor Model is, hey, it replicates – that is, its experimental reliability tends to be high, in that different researchers using somewhat different methods to measure these traits will find similar results. But this is the tail wagging the dog; that something replicates doesn’t mean it’s a valid theoretical construct, only that it tracks with some persistent real-world quality. As Louis Menand put it in New Yorker review of The Blank Slate that is quite entertaining if you find Pinker ponderous,

When Pinker and Harris say that parents do not affect their children’s personalities, therefore, they mean that parents cannot make a fretful child into a serene adult. It’s irrelevant to them that parents can make their children into opera buffs, water-skiers, food connoisseurs, bilingual speakers, painters, trumpet players, and churchgoers—that parents have the power to introduce their children to the whole supra-biological realm—for the fundamental reason that science cannot comprehend what it cannot measure.

That results of the Five Factor model can be replicated does not mean that the idea of dividing the human psyche into five reductive factors and declaring that the whole of personality is valid. It simply means that our operationalizations of the construct are indeed measuring some consistent property of individuals. It’s like answering the question “what is a human being?” by saying “a human being is bipedal.” If you then send a team of observers out into the world to measure the number of legs that tend to be found on humans, you will no doubt find that different researchers are likely to obtain similar findings when counting the number of legs of an individual person. But this doesn’t provide evidence that bipedalism is the sum of mankind; it merely suggests that legs are a thing you can consistently measure, among many. Reliability is a necessary criterion for validity, but it isn’t sufficient. I don’t doubt that the Five Factor Model describes consistent and real aspects of human personality, but the way that Pinker and others treat that model as a more or less comprehensive catalog of what it means to be human is not justified. I’m sure that you could meet two different people who share the same outcomes on the five measured traits in the model, fall madly in love with one of them, and declare the other the biggest asshole you’ve ever met in your life. We’re a multivariate species.

That windup aside, for this particular kind of analysis, I think a construct like “conscientiousness” can be analytically useful. That is, I think that we can avoid the question of whether the Five Factors are actually a comprehensive catalog of essential personality traits while recognizing that there’s some such property of educational perseverance and that it is potentially measurable. (Angela Lee Duckworth’s “grit” concept has been a prominent rebranding of this basic human capacity, although it has begun to generate some criticism.) The question is, does this trait really exist independent of intelligence, and how effective of a predictor is it compared to IQ testing?

Intelligence, Achievement, Test Scores, and Grades

In educational testing, it’s a constant debate: to what degree do various tests measure specific and independent qualities of tested subjects, and to what degree are they just rough approximations of IQ? You can find reams of studies concerning this question. The question hinges a great deal on the subject matter; obviously, a really high IQ isn’t going to mean much if you’re taking a Latin test and you’ve never studied Latin. On the other hand, tests like the SAT and its constituent sections tend to be very highly correlated with IQ tests, to the point where many argue that the test is simply a de facto test for g, the general intelligence factor that IQ tests are intended to measure. What makes these questions difficult, in part, is that we’re often going to be considering variables that are likely to be highly correlated within individuals. That is, the question of whether a given achievement test measures something other than is harder to answer because people with a high are also those who are likely to score highly on an achievement test even if that test effectively measures something other than g. Make sense?

Today’s study offers two main research questions:

first, whether achievement and intelligence tests are empirically distinct; second, how much variance in achievement measures is accounted for by intelligence vs. by personality, whereby R2 increments of personality after adjusting for intelligence are the primary interest

I’m not going to wade into the broader debate about whether various achievement tests effectively measure properties distinct from IQ. I’m not qualified, statistically, to try and separate the various overlapping sums of squares in intelligence and achievement testing. And given that the g-men are known for being rather, ah, strident, I’d prefer to avoid the issue. Besides I think the first question is of much more interest to professionals in psychometrics and assessment than the general public. (This week’s study is in fact a replication of a study that was in turn disputed by another researcher.) But the second question is interesting and relevant to everyone interested in education: how much of a given student’s outcomes are the product of intelligence and how much is the product of personality? In particular, can we see a difference in how intelligence (as measured with IQ and its proxies) influences test scores and grades and how personality (as operationalized through the Five Factor Model) influences them?

The Present Study

In the study at hand, the researchers utilized a data set of 13,648 German 9th graders. The student records included their grades; their results on academic achievement tests; their results on a commonly-used test of the Five Factors; and their performance on a test of reasoning/general intelligence (a Raven’s Standard Progressive Matrices analog) and a processing speed test, which are often used in this kind of cognitive research.

The researchers undertook a multivariable analysis of variance analysis called “exploratory structural equation modeling.” I would love to tell you what that is and how it works but I have no idea. I’m not equipped, statistically, to explain the process or judge whether it was appropriate in this instance. We’re just going to have to trust the researchers and recognize that the process does what analysis of variance does generally, which is to look at the quantitative relationships between variables to explain how they predict, or fail to predict, each other. The nut of it is here:

First, we regressed each of the four cognitive skill measures on all Big Five dimensions. Second, we decomposed the variance of the achievement measures (achievement test scores and school grades) by regressing them on intelligence alone and then on personality and intelligence jointly.

(“Decomposing” variables, in statistics, is a fancy way of saying that you’re using mathematical techniques to identify and separate variables that might be otherwise difficult to separate thanks to their close quantitative relationships.)

What did they find? The results are pretty intuitive. There is, as to be expected, a strong (.76) correlation between performance on the intelligence test and performance on achievement tests. There’s also a considerable but much weaker relationship between achievement tests and grades (.44) and the general intelligence test and grades (.32). So kids who are smarter as defined by achievement and reasoning tests do get better grades, but the relationship isn’t super strong. There are other factors involved. And a big part of that unexplained variance, according to that research, is personality.

The Big Five explain a substantial, and almost identical, share of variance in grades and achievement tests, amounting to almost one-fifth. By comparison, they explain less than half as much—<.10—of the variance in reasoning, and almost none in processing speed (0.07%).

In other words, if you’re trying to predict how students will do on grades and achievement tests, their personalities are pretty strong predictors. But if you’re trying to predict their pure reasoning ability, personality is pretty useless. And the good-at-tests, bad grades students like high school Freddie are pretty plentiful:

the predictive power of intelligence is markedly different for these two achievement measures: it is much higher than that of personality in the case of achievement—but much lower in the case of school grades, where personality alone explains almost two times more variance than intelligence alone does.

So it would seem there may be some validity to the concept of the naturally bright and grinders after all. And the obverse, the less naturally bright but highly motivated grinder types?

Conscientiousness has a substantial positive relationship with grades—but negative relationships with both achievement test scores and reasoning.

In other words, the more conscientious you are, the better the grades you receive, even though you score lower on achievement and intelligence tests. Unsurprisingly, Conscientiousness (the “grit,” perseverance, stick-to-itiveness factor) correlated most highly with school grades, at .27. The ability to continue to work diligently and through adversity makes a huge difference on getting good grades but is much less important when it comes to raw intelligence testing.

What It Means

Ultimately, this research result is intuitive and matches with the personal experience of many. As someone who spent a lot of his life skating by on being bright, and only really became academically focused late in my undergraduate education, there’s something selfishly comforting here. But in the broader, more socially responsible sense, I think we should take care not to perpetuate any stigmas about the grinders. On the one hand, our culture is absolutely suffused with celebrations of conscientiousness and hard work, so it’s not like I think grinders get no credit. And it is important to say that there are certain scenarios where pure reasoning ability matter; if you’re intent on being a research physicist or mathematician, for example, or if you’re bent on being a chess Grandmaster, hard work will not be sufficient, no matter what Malcolm Gladwell says. On the other hand, I am eager to contribute in whatever way to undermining the Cult of Smartness. We’ve perpetuated the notion that those naturally gifted with high intelligence are our natural leaders for decades, and to show for it we have immense elite failures and a sickening lack of social responsibility on Wall Street and in Silicon Valley, where the supposed geniuses roam.

What we really need, ultimately, from both our educational system and our culture, is a theme I will return to in this blog again and again: a broader, more charitable, more humanistic definition of what it means to be a worthwhile human being.

(Thanks to SlateStarCodex for bringing this study to my attention.)


Success Academy Charter Schools accepted $550,000 from pro-Trump billionaires

The charter school movement retains a reputation for being at least nominally politically progressive. This is strange, as the movement entails defunding public institutions and removing public accountability and replacing them with for-profit or not-for-profit-in-name-only institutions, and in doing so causing cuts in stable, unionized public sector jobs. Charter schools are notorious union busters and in some locales have resulted in hugely disproportionate job cuts against black women. The basic assumptions of school “choice” rely on conservative economic arguments – that markets always improve quality. As terrible as Betsy Devos’s appointment to Secretary of Education in the Trump administration may be, it at least helps clarify what should already be obvious: that support for charter schools is conservative on its face.

Here’s a little more evidence for you, concerning school reform darling Success Academy Charter Schools, celebrated for their strong metrics and notorious for their abusive methods in achieving them.

Via my friend and comrade Mindy Rosier, a public school special education teacher and tireless activist, I learned about these financial disclosures from the Mercer Family Foundation, provided by The Mercer Family Foundation is the funding arm of a secretive, powerful family of reactionary billionaires who have spent the last few years empowering Republicans generally and Donald Trump specifically. And in 2014 they donated $550,000 to Success Academy:

Overall, the Mercer Family Foundation’s donations are a veritable Who’s Who of reactionary conservatism, with large donations going to the Heritage Foundation, the Cato Institute, the George W. Bush Foundation, the Barry Goldwater Institute, the Manhattan Institute…. also receives key financial backing from the Mercer Family Foundation. What kind of publication is Breitbart?

How does Success Academy justify taking money from people who fund such hateful rhetoric? I don’t know. Betsy Woodruff has written about this connection, pointing out that Donald Trump has become a staunch advocate of charter schools, but otherwise it’s failed to generate much attention. Might be time for an enterprising reporter to pick up the phone.

One way or another, this should be just another clear indication of the obvious: the charter school movement is part of the same conservative movement that brought us Donald Trump.

do peer effects matter? nah

There’s long been a belief that peer effects play a significant role in how well students perform academically – that is, that learning alongside higher-achieving peers likely helps students achieve themselves, while learning alongside lower-achieving peers might drag them down. Is that the case?

Probably not. The newer, larger, higher-quality studies don’t show evidence for that in quantitative outcomes, anyway. A large study looking at exam schools in New York and Boston – that is, selective public high schools in large urban districts – found that even though enrolling in these institutions dramatically increased the average academic performance of peers (thanks to the screening process to get in), the impact on relative performance was essentially nil. That’s true in terms of test metrics like the PSAT, SAT, and AP Scores, and in terms of college outcomes after graduation. It just doesn’t matter much. A study among students transitioning from the primary school level to the secondary school level in England, where dramatic changes occur in peer group composition, found a significant but very small effect from peer group in quantitative indicators. Like, really small. Assuming the null is a pretty good bet in a lot of education research.

Of course, none of this means there’s no human value in sending your kids to school with elite peers. There are many things that matter in life beyond quantitative education indicators. (Though you’d never know that if you listen to some pundits.) Your kids may find their school experience more pleasant, and it may help them network later in life, if they attend school with high achievers. On the other hand, it will inevitably increase the homogeneity of their learning environment, which seems less than ideal to me in a multicultural democracy like ours. Public schools that are struggling desperately need financial secure parents who have the social capital necessary to advocate for them, too. Either way, though, if you’re worrying about how peer effects will impact your kid’s outcomes, you shouldn’t. Like an awful lot of things that parents worry about when it comes to their children, it just doesn’t matter much.