“Evaluating the Comparability of Two Measures of Lexical Diversity”

I’m excited and grateful to share with you the news that my article “Evaluating the Comparability of Two Measures of Lexical Diversity” has been accepted for publication in the applied linguistics journal System. The article should appear in the journal in the next several months. I would like to take a minute and explain it in language everyone can understand. I will attempt to define any complex or unusual terms and to explain processes as simply as possible, but these topics are complicated so getting to understanding will take some time and effort. This post will probably be of interest to very few people so I don’t blame you if you skip it.

Edit: Here is a link for those of you with online journal archive access. For those without, email me if you would like a copy.

What is lexical diversity?

“Lexical diversity” is a term used in applied linguistics and related fields to refer to the displayed range of vocabulary in a given text. (Related terms include lexical density, lexical richness, etc., which differ based on changes in how these terms are used and defined — for example, some systems use weightings based on the relative rarity of words.) Evaluating that range seems intuitively simple, and yet developing a valid, reliable metric for such evaluation has proven unusually tricky. A great many attempts to create such metrics have been undertaken with limited success. Some of the more exciting attempts now utilize complex algorithmic processes that would not have been practically feasible before the advent of the personal computer. My paper compares two of them and provides empirical justification for a claim about their mechanism made by other researchers.

Why do we care about it?

Lexical diversity and similar metrics have been used for a wide variety of applications. Being able to display a large vocabulary is often considered an important aspect of being a sophisticated language user. This is particularly true because we recognize a distinction between active vocabulary, or the words a language user can utilize effectively in their speech and writing, and a passive vocabulary, or the words a language user can define when challenged, such as in a test. This is an important distinction for real-world language use. For example, many tests of English as a second language involve students choosing the best definition for an English term from a list of possible definitions. But clearly, being able to choose a definition from a list and being able to effectively use a word in real-life situations are different skills. This is a particularly acute issue because of the existence of English-language “cram schools” where language learners study lists of vocabulary endlessly but get little language experience of value. Lexical diversity allows us to see how much vocabulary someone actually integrates into their production. This has been used to assess the proficiency of second language speakers; to detect learning disabilities and language impairments in children; and to assess texts for readability and grade-level appropriateness, among other things. Lexical diversity also has application to machine learning of language and natural language processing, such as is used in computerized translation services.

Why is it hard to measure?

The essential reason for the difficulty in assessing diversity in vocabulary lies in the recursive, repetitive nature of functional vocabulary. In English linguistics there is a distinction between functional and lexical vocabulary. Functional vocabulary contains grammatical information and is used to create syntactic form, and contains categories like the articles (determiners) and prepositions; lexical vocabulary delivers propositional content and contains categories like nouns and verbs. Different languages have different frequencies of functional vocabulary relative to lexical. Languages with a great deal of morphology — that is, languages where words change a great deal depending on their grammatical context– have less need for functional vocabulary, as essential grammatical information can be embedded in different word forms. Consider Latin and its notorious number of versions of every word, and then contrast with Mandarin, which has almost no similar morphological changes at all. English lies closer on the spectrum to Mandarin than to Latin; while we have both derivational morphology (that is, changes to words that change their parts of speech/syntactic category, the way -ness changes adjectives to nouns) and inflectional morphology (that is, changes to words that maintain certain grammatical functions like tense without changing parts of speech, the way -ed  changes present to past tense), in  comparison to a language like Latin we have a pretty morphologically inert language. To substitute for this, we have a) much stricter rules for word order than a language like Latin and b) more functional vocabulary to provide structure.

What does this have to do with assessing diversity in vocabulary? Well, first, we have a number of judgement calls to make when it comes to deciding what constitutes a word. There’s a whole vast literature about where to draw the line to determine what makes a word separate from another, utilizing terms like wordlemma, and word family. (We have a pretty good sense that dogs is the same word as dog but what about doglike or He was dogging me for days, etc.) I don’t want to get too far into that because it would take a book. It’s enough to say here that in most computerized attempts to measure lexical diversity, such as the ones I’m discussing here, all constructions that differ by even a single letter are classified as different terms. In part, this is a practical matter, as asking computers to tell the difference between inflectional grammar and derivational grammar is currently not practical. We would hope that any valid measure of lexical diversity would be sufficiently robust to account for the minor variations owing to different forms.

So: the simplest way to assess the amount of diversity would simply be to count the number of different terms in a sample. This measure has been referred to in the past as Number of Different Words (NDW) and is now conventionally referred to as Types. The problem here is obvious: you could not reliably compare a 75-word sample to a 100-word sample, let alone a 750-word sample. To account for this, researchers developed what’s called a Type-to-Token  Ratio (TTR). This figure simply places the number of unique words (Types) in the numerator and the number of total words (Tokens) in the denominator, to generate a ratio that is 1 or lower. The highest possible TTR, 1, is only possible if you never repeat a term, such as if you are counting (one, two, three…) without repeating. The lowest possible TTR, 1/tokens, is only possible if you say the same word over and over again (one,  one, one….). Clearly, in real-world language samples, TTR will lie somewhere in between those extremes. If half of all your terms are new words, your TTR would be .50, for example.

Sounds good, right? Well, there’s a problem, at least in English. Because language is repetitive by its nature, and particularly because functional vocabulary like articles and prepositions are used constantly– think of how many times you use the words “the” and “to” in a given conversation– TTR has an inevitable downward trajectory. And this is a problem because, as TTR inevitably falls, we lose the ability to discriminate between language samples of differing lengths, which is precisely why TTR was invented in the first place. For example, a 100-word children’s story might have the same TTR as a Shakespeare play, as the constant repetition of functional vocabulary overwhelms the greater diversity in absolute terms of the latter. We can therefore say that TTR is not robust to changes in sample size, and repeated empirical investigations have demonstrated that this sensitivity can apply even when the difference in text lengths are quite small. TTR fails to adequately control for the confounding variable it was expressly intended to control for.

A great many attempts have been made to adjust TTR mathematically– Guiraud’s Root TTR, Somer’s S– but none of them have worked.

What computational methods have been devised to measure lexical diversity?

Given the failure of straightforwardly mathematical attempts to  adjust TTR, and with the rise of increasingly powerful and accessible computer programs for processing text, researchers turned to algorithmic/computational models to solve the problem. One of the first such models was the vocd algorithm and the metric it returns, D. D stands today as one of the most popular metrics for assessing diversity in vocabulary. For clarity, I refer to D as “VOCD-D” in this research.

Developed primarily by the late David Malvern, Gerard McKee, and Brian Richards, along with others, the vocd algorithm in essence assesses the change in TTR as a function of text length and generates a measure, VOCD-D, that approximates how TTR changes as a text grows in length. Consider the image below, which I’ve photographed from Malvern et. al’s 2004 book Lexical diversity and language development: Quantification and assessment. (I apologize for the image quality.)

ttr over tokens

What you’re looking at is a series of ideal curves depicting changing TTR ratios over a given text length. As we move from the left to the right, we’re moving from a shorter to a longer text. As I said, the inevitable trajectory of these curves is downward. They all start in the same place, at 1, and fall from there. And if we extend these curves far enough, they would eventually end up in the same place, bunched together near the bottom, making it difficult to discriminate between different texts. But as these curves demonstrate, they do not fall at the same rate, and we can quantitatively assess the rate of downward movement in a TTR curve. This, in essence, is what vocd does.

The depicted curves here are ideal in the sense that they are artificial for the process of curve fitting. Curve fitting procedures are statistical methods to match real-world data, which is stochastic (that is, involves statistical distortion and noise), to approximations based on theoretical concepts. Real-world TTR curves are in fact far more jagged than this. But what we can do with software is to match real-world curves to these ideal curves to obtain a relative value, and that’s how vocd returns a VOCD-D measurement. The algorithm contains an equation for the relationship between text length, TTR, and VOCD-D, processes large collections of texts, and returns a value (typically between 40-120) that can be used to assess how diverse the vocabulary is in those texts. (VOCD-D values can really only be understood relative to each other.) The developers of the metric define the relationship between TTR, for tokens, and D for a given value along a TTR curve as TTR = D/N[(1+2n/d)^1/2 – 1]

Now, vocd uses a sampling procedure to obtain these figures. By default, the algorithm takes 100 random samples of 35 tokens, then 36 tokens, then 37, etc., until 50 tokens are taken in the last sample. In other words, the algorithm grabs 100 randomly-chosen samples of 35 words, then 36, etc., and returns an average figure for VOCD-D. The idea is that, because different segments of a language sample might have significantly different levels of displayed diversity in vocabulary, we should draw samples of differing sizes taken at random from throughout each text, in order to ensure that the obtained value is a valid measure. (The fact that lexical diversity is not consistent throughout a given text should give us pause, but that’s a whole other ball of wax.) Several programs that utilize the vocd algorithm also run through the whole process three times, averaging all returned results together for a figure called Doptimum.

VOCD-D is still affected by text length, and its developers caution that outside of an ideal range of perhaps 100-500 words, the figure is less reliable. Typical best practices involve combining VOCD-D with other measures, such as the Maas Index and MTLD (Measure of Textual Lexical Diversity), in order to make research more robust. Still, VOCD-D has shown itself to be far more robust across differing text lengths than TTR, and since the introduction of widely-available software that can measure it, notably the CLAN application from Carnegie Mellon’s CHILDES project, it has become one of the most commonly used metrics to assess lexical diversity.

So what’s the issue with vocd?

In a series of articles, Phillip McCarthy of the University of Memphis’s Institute for Intelligent Systems and Scott Jarvis of Ohio University identified a couple of issues with the vocd algorithm. They argue that the algorithm produces a metric which is in fact a complex approximation of another measure that is a) less computationally demanding and b) less variable. McCarthy and Jarvis argued that vocd‘s complex curve-fitting process actually approximates another value which can be statistically derived from a language sample based on hypergeometric sampling. Hypergeometric sampling is a kind of probability sampling that occurs “without replacement.” Imagine that you have a bag filled with black and white marbles. You know the number of marbles and the number of each color. You want to know the probability that you will withdraw a marble of a particular color each time you reach in, or what number of each color you can expect in a certain number of pulls, etc. If you are placing the marbles back in the bag after checking (with replacement), you use binomial sampling. If you don’t put the stones back (without replacement), you use hypergeometric sampling. McCarthy and Jarvis argued, in my view persuasively, that the computational procedure involved in vocd simply approximated a more direct, less variable value based on calculating the odds of any individual Type (unique word) appearing in a sample of a given length, which could be accomplished with hypergeometric sampling. VOCD-D, according to Jarvis and McCarthy, ultimately approximates the sum of the probabilities of a given type appearing in a sample of a given length. The curve-fitting process and repeated random sampling merely introduces computational complexity and statistical noise. McCarthy and Jarvis’s developed an alternative algorithm and metric. Though statistically complex, the operation is simple for a computer, and this metric has the additional benefit of allowing for exhaustive sampling (checking every type in every text) rather than random sampling. McCarthy and Jarvis named their metric HD-D, or Hypergeometric Distribution of Diversity.

(If you are interested in a deeper consideration of the exact statistical procedures involved, email me and I’ll send you some stuff.)

McCarthy and Jarvis found that HD-D functions similarly to VOCD-D, with less variability and requiring less computational effort. The latter isn’t really a big deal, as any modern laptop can easily churn through millions of words with vocd in a reasonable time frame. What’s more, McCarthy and Jarvis explicitly argued that research utilizing VOCD-D does not need to be thrown out, but rather that there is a simpler, less variable method to generate an equivalent value. But we should strive to use measures that are as direct and robust as possible, so they advocate for HD-D over VOCD-D, as well as calling for a concurrent approach utilizing other metrics.

McCarthy and Jarvis supported their theoretical claims of the comparability of VOCD-D and HD-D with a small empirical evaluation of the equivalence. They did a correlational study, demonstrating a very strong relationship between VOCD-D and HD-D, supporting their argument for the statistical comparability of the two measures. However, their data set was relatively small. In a 2012 article, Rei Koziumi and Yo In’nami argued that Jarvis and McCarthy’s data set suffered from several drawbacks:

(a) it used only spoken texts of one genre from L2 learners; (b) the number of original texts was limited (N = 38); and (c) only one segment was analyzed for 110-200 tokens, which prevented us from investigating correlations between LD measures in longer texts. Future studies should include spoken and written texts of multiple genres, employ more language samples, use longer original texts, and examine the effects of text lengths of more than 200 tokens and the relationships between LD measures of equal-sized texts of more than 100 tokens.

My article is an attempt to address each of these limitations. At heart, it is a replication study involving a vastly larger, more diverse data set.

What data and tools did you use?

I used the fantastic resource The International Corpus Network of Asian Learners of English, a very large, very focused corpus developed by Dr. Shin’ichiro Ishikawa of Kobe University. What makes the ICNALE a great resources is a) its size, b) its diversity, and c) its consistency in data collection. As the website says, “The ICNALE holds 1.3 M words of controlled essays written by 2,600 college students in 10 Asian countries and areas as well as 200 English Native Speakers.” Each writer in the ICNALE data set writes two essays, allowing for comparisons across prompts. And the standardization of the collection is almost unheard of, with each writer having the same prompts, the same time guidelines, and the same word processor. Many or most corpora have far less standardization of texts, making it much harder to draw valid inferences from the data. Significantly for lexical diversity research, the essays are spell checked, reducing the noise of misspelled words which can artificially inflate type counts.

For this research, I utilized the ICNALE’s Chinese, Korean, Japanese, and English-speaking writers, for a data set of 1,200 writers and 2,400 texts. This allowed me to compare results between first- and second-language writers, between writers of different language backgrounds, and between prompts. The texts contained a much larger range of word counts (token counts) than McCarthy and Jarvis’s original corpus.

I analyzed this data set with CLAN, in order to obtain VOCD-D values, and with McCarthy’s Gramulator software, in order to obtain HD-D values. I then used this data to generate Pearson product-moment correlation matrices comparing values for VOCD-D and HD-D across language backgrounds and prompts, utilizing the command-line statistical package SAS.

What did you find? 

My research provided strong empirical support for McCarthy and Jarvis’s prior research. All of the correlations I obtained in my study were quite high, above .90, and they came very close to the measures obtained by McCarthy and Jarvis. Indeed, the extremely tight groupings of correlations across language backgrounds, prompts, and research projects strongly suggests that the observed comparability identified in McCarthy and Jarvis’s work is the result of the mathematical equivalence they have identified. My replication study delivered very similar results using similar tools and a greatly expanded sample size, arguably confirming the previous results.

Why is this a big deal?

Well it isn’t, really. It’s small-bore, iterative work that is important to a small group of researchers and practitioners. It’s also a replication study, which means that it’s confirming and extending what prior researchers have already found. But that’s what a lot of necessary research is — slowly chipping away at problems and gradually generating greater confidence about our understanding of the world. I am also among a large number of researchers who believe that we desperately need to do more replication studies in language and education research, in order to confirm prior findings. I’ve always wanted to have one of my first peer-reviewed research articles be a replication study. I also think that these techniques have relevance and importance to the development of future systems of natural language processing and corpus linguistics.

As for lexical diversity, well, I still think that we’re fundamentally failing to think through this issue. As well as VOCD-D and HD-D work in comparison to a measure like TTR, they are still text-length dependent. The fact that best practices require us to use a number of metrics to validate each other suggests that we still lack a best metric of lexical diversity. My hunch (or more than a hunch, at this point) is not that we haven’t devised a metric of sufficient complexity, but that we are fundamentally undertheorizing the notion of lexical diversity. I think that we’re simply failing to adequately think through what we’re looking for and thus how to measure it. But that’s the subject of another article, so you’ll just have to stay tuned.

tl;dr version: Some researchers said that two ways to use a computer to measure diversity in vocabulary are in fact the same way, really, and provided some evidence to support that claim. Freddie threw a huge data set with a lot of diversity at the question and said “right on.”


1 Comment

Comments are closed.