do we want perfect inter-rater reliability?

In the social sciences and empirical humanities, it’s frequently easy to envy the directness of measurement that characterizes some work in the natural sciences. Because so much of what we measure in topics such as education, psychology, or sociology is incorporeal, it can be tempting to join a field where you can measure things in grams and meters. (This is not, obviously, to underestimate the amount of dedication and work good scientific inquiry requires.)

Instead of assessing directly measurable attributes like mass or length, researchers in my field and related fields typically have to define a construct and then operationalize that construct. The construct is the feature that we are interested in investigating. An example of a construct from recent research of mine is lexical diversity, defined by David Malvern and Brian Richards in The Encyclopedia of Applied Linguistics as “a complex property that summarizes the range of vocabulary and the avoidance of repetition” in a writing sample. You’ll note that, while this is a fairly straightforward definition of an intuitively simple idea, in and of itself it does not tell us how we could formalize or quantify this quality. For this, we need to operationalize, to come up with a particular research practice that somehow systematizes our construct and provides an output that can be interpreted.

(Although the terms construct and operationalization are most likely to be used in quantitative research contexts, they are also used in qualitative research, so operationalizing should not be thought of merely as quantification.)

So, in my own research, I operationalize lexical diversity as D, an interval measure generated from an algorithm called vocd which can analyze vast collections of written text at great speed. For another example, a psychologist studying the construct social confidence might operationalize that construct through a Likert scale, where test subjects evaluate a statement such as “I feel confident in social interactions” with a set range of responses from “I strongly agree with this statement” to “I strongly disagree with this statement.” In contrast, someone studying the mass of egret chicks doesn’t have to define a particular construct or come up with an operationalization. Mass is directly measurable. (There are, of course, very deep philosophical and epistemological conversations we could have here, but that’s for another time.)
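To make this concrete, here is a minimal sketch, in Python, of the kind of operationalization involved. This is not the actual vocd implementation; it’s a toy illustration of my own, with made-up function names and parameters, of why operationalization is needed at all: the raw type-token ratio shrinks as texts get longer, so sampling-based measures like D exist to estimate diversity in a way that is less sensitive to length.

```python
import random

def type_token_ratio(tokens):
    """Proportion of unique words (types) to total words (tokens)."""
    return len(set(tokens)) / len(tokens)

def sampled_lexical_diversity(tokens, sample_size=35, n_samples=100, seed=0):
    """Average TTR over many random fixed-size samples.

    This dampens the raw TTR's bias against longer texts; vocd's D goes
    further, fitting a curve to TTRs across a range of sample sizes.
    (Illustrative only -- not the vocd algorithm itself.)"""
    rng = random.Random(seed)
    ttrs = [type_token_ratio(rng.sample(tokens, sample_size))
            for _ in range(n_samples)]
    return sum(ttrs) / len(ttrs)

sample = ("the quick brown fox jumps over the lazy dog while the "
          "slow grey fox watches the quick dog from the old fence").split()
print(type_token_ratio(sample))                           # raw TTR
print(sampled_lexical_diversity(sample, sample_size=10))  # sampling-based estimate
```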

Frequently in research on oral or written language, our operationalization involves raters. When we are assessing language proficiency or quality, we’re investigating a complex and multivariate phenomenon that we nevertheless have to reduce to easily and quickly interpreted metrics. Typically, this is done through trained raters listening to or reading test samples and then scoring them on a preset scale defined by complex rubrics. This is the process through which, for example, the SAT’s Writing section and the TOEFL Speaking section are assessed, along with a large number of similar assessment instruments. I myself have worked as a rater in this type of testing, for both assessments of writing and assessments of speaking.

Such work may not be long for this world. Numerous technologies that utilize computers to automate the work of rating language samples, oral and written, have been and are being developed. These technologies are notoriously easy to “fool,” in the sense that they can be presented with syntactically and lexically correct samples that are nonetheless semantically nonsensical. This is likely to remain the case until there are truly major advances in artificial intelligence, thanks to the complexities of the semantic-syntactic interface.

(An example frequently used to demonstrate this difficulty: “The committee denied the group their permit, as they advocated violence” and “The committee denied the group their permit, as they feared violence” are superficially identical in structure and parsed without ambiguity or trouble by essentially all native speakers, yet they are interpreted differently: in the first, “they” refers to the group, and in the second, to the committee, a difference that can only be resolved with semantic information. Computers have enormous difficulties with these types of problems, and crucially these difficulties cannot be solved with more raw processing power.)

The possibility of these “false positives” is not typically considered a major impediment to the use of these tests, however, by the kind of people who push them. They point out, reasonably, that it’s exceedingly unlikely for a low-proficiency speaker or writer to produce these kinds of samples at random, and that the ability of experts to produce these false positives has very little to do with the software’s ability to detect correctly spelled words, grammatical competence, and similarly functional features. There is an extensive literature on the subject of automated scoring, particularly for written essays, and I don’t want to rehash it here. I do want to point out one of my own reasons for opposing this type of automation, and it in fact has to do with one of the very reasons this software exists: inter-rater reliability.

“Reliability,” in the social sciences, refers to the tendency of an assessment or experimental instrument to operate the same way across different administrations. If there are factors that influence the results of a given administration that are not dependent on what we’re actually trying to measure, we call this “construct-irrelevant” variance. If, for example, test takers perform worse on a test because the air conditioning made the site where they took the test uncomfortably cold, that factor obscures what we’re really trying to assess, and adds a degree of unfairness and chance. In rated assessments, we try to maintain reliability through training, through providing detailed rubrics that clearly delineate the reasons certain ratings are assigned, and through maintaining adequately consistent testing conditions.

One of the ways we check up on ourselves is through inter-rater reliability. In a typical rated testing scheme, two raters each rate the same language sample independently of one another. This provides a necessary check on ratings that remain, despite all of the structuring and training, rather subjective. Usually, if two raters disagree slightly in their ratings, those ratings are averaged or aggregated together; if they disagree beyond a certain threshold defined by the makers of the test, a third rater is brought in to adjudicate the dispute. The tendency of raters on the same test to assign similar ratings is analyzed with measures of consistency and consensus. These range from simple correlation coefficients that run from 0 (no agreement) to 1 (total agreement) to incredibly complex statistical measures and graphical representations of agreement, tendency, severity, etc. These all serve to answer some version of the question: do raters agree?
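For the curious, here is a toy Python sketch of the simplest versions of these checks: exact agreement and a correlation coefficient for two raters, plus an “average if close, send to a third rater if not” rule. The one-point threshold and the function names are my own stand-ins for illustration; real testing programs define their own statistics and adjudication procedures.

```python
def pearson_r(x, y):
    """Pearson correlation between two raters' score lists (a consistency measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def resolve(r1, r2, threshold=1):
    """Average ratings that are close; flag larger disagreements for a third rater."""
    if abs(r1 - r2) <= threshold:
        return (r1 + r2) / 2
    return "adjudicate"

rater_a = [4, 3, 5, 2, 4, 3]
rater_b = [4, 4, 5, 3, 2, 3]
exact = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print("exact agreement:", exact)                              # consensus
print("correlation:", round(pearson_r(rater_a, rater_b), 2))  # consistency
print("final scores:", [resolve(a, b) for a, b in zip(rater_a, rater_b)])
```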

A low correlation between raters, or low inter-rater reliability, is considered a major problem for test administrators, and for good reason. After all, if we are saying that our test identifies some useful or “real” notion of proficiency in a test subject’s language, we should expect that the people adjudicating that proficiency would be capable of agreeing. This is one of the reasons people pursue automated scoring (though surely a less important one than the economic incentive of eliminating rater pay!): a computer program that derives its assessment of language proficiency from an algorithm will never disagree with itself. Its assessment is perfectly mechanistic and perfectly internally reliable. The “inter-rater” reliability of a computer will always be 1.

But is this really a good thing? For test administrators, and companies like ETS and Pearson that have monetized testing consistency, it’s a silly question. Of course perfect inter-rater reliability is the goal! It eliminates the appearance of unfairness, it makes test results more easily interpretable, and it contributes to the validity of the test instrument itself. Human disagreement, in this reading, is necessarily the result of human imperfection, and the computerized age of language assessment will be an age of certainty.

But let’s ask this question: why do human beings disagree on the quality of a provided language sample? Yes, it’s perfectly possible, and probably common, for raters to disagree thanks to one or both having inadequate training, indifference to the quality of their rating, fatigue, poor rating conditions, rating drift over scoring of many samples…. But there’s another set of reasons for a lack of reliability that are less easily dismissed as the result of error: disagreement. Informed, responsible disagreement. As someone who works closely with other language raters, I can tell you that it’s not at all unusual for responsible raters who are both trying to fairly and accurately assess a given sample’s quality to disagree strongly, even with reference to a comprehensive and technical set of rating guidelines. It happens all the time.

This reflects a simple fact about language: however well linguists have been able to divide it into a large variety of constituent parts, and however specific a vocabulary (some would say jargon) they’ve developed for describing and dividing those parts, we still interpret language in a way that involves irreducible complexity. Some people might naturally find morphosyntactic errors more distracting than prosodic errors. Some might be more forgiving of limited idiomaticity if the same subject demonstrates syntactic complexity. Some raters of writing might be more lenient with grammatical or spelling mistakes if the writer shows rhetorical or stylistic sophistication. These differences are very hard to adjudicate. We might some day decide on a rigid hierarchy of importance for these and other features, but I doubt it, and if it happened it would necessarily come via authority and fiat, not consensus. Language is like that: perfectly subject to empirical investigation, yet perfectly fuzzy.

Think about one of the reasons for inter-rater disagreement. There’s research suggesting that, when non-native English speakers rate the spoken English of other non-native English speakers, the raters’ respective language backgrounds influence the ratings. In other words, native Mandarin speakers tend to rate other native Mandarin speakers more leniently, likely owing to familiarity, similar phrasal stress patterns, similarities in syntax…. Is this a problem? A conservative epistemology would say yes: the language background of the rater should not result in differences in rating, and this represents construct-irrelevant variance.

But what’s the purpose of tests of spoken English? To assess how well a speaker of English can make him- or herself understood to other speakers of English. Are there situations where a native Mandarin speaker might have to interpret the English of another native Mandarin speaker? Not only do those situations exist, I’ve been present in them many times. That’s the reality of the internationalizing university. In this context, then, the difference between a native Mandarin-speaking rater and a rater from a different native language background (their disagreement, which reduces inter-rater reliability) is a feature, not a bug. The use of multiple, potentially disagreeing raters reflects a communicative world wherein people do not assess language in exactly the same way.

Or, to put it more abstractly: linguistic communication as a holistic concept includes not merely the production of language but also its reception; both parts condition the success or failure of the communicative act, and attempts to eliminate this variation via algorithms actually degrade the accuracy of the whole system. A computer’s ability to assess all language identically across subjects and contexts is not a benefit but a hindrance. It is a classic example of higher reliability resulting in lower validity.

I’m arguing that, to a degree, we might see inter-rater disagreement not as a problem, but as a simple fact of an incredibly complex linguistic world, one where different people are always going to judge language in somewhat different ways. This isn’t empirical nihilism. It’s just a reflection of a simple reality: lots of people use language, and they all use it slightly differently, and any attempt to assess language fairly should reflect that. Yes, our standardized tests are going to continue to exist, and I will accept that they probably have to, given certain social and economic realities. They are going to continue to pare down an immensely complicated set of features and skills to numbers. What I am asking for, what I am saying we have to do, is to use them when absolutely necessary and, at the same time, to recognize their profound limitations. There’s no contradiction there.

There are, I believe, broader lessons to be drawn from this discussion, about empiricism, its limits, and the objective reality of subjective reality. I leave it to you to draw them.

1 Comment

  1. Would a panel of qualified but diverse raters be a good way of increasing inter-rater reliability while maintaining or perhaps increasing validity?

    Obviously, for the cost reasons you mentioned, this strategy isn’t widely employed in standardized testing. But I think it’s a solution that’s often been applied to the problem you raise when it arises in other contexts.
