Some people are getting overly worked up about this study, which shows a high correlation between machine scoring and human scoring of certain writing tasks. Some of the reaction is glee from the university-hating media set; some is rending of garments by those in the humanities who love nothing more than an excuse to rend them. As is typically the case, the reaction is out of proportion with the evidence.
First: this is not really news. For the types of writing tasks tested in this study, computer scoring has long correlated highly with human scoring. You occasionally hear that writing can't be reliably assessed quantitatively; sometimes this comes from people who want to squash writing and the humanities as respected disciplines, and sometimes from people who are afraid of quantitative assessment in the humanities. The truth is that, for that kind of test and with careful construction of a grading rubric, inter-rater reliability between human scorers can be extremely high, with correlations in the high .8s and low .9s fairly common. For short-order essays on particular prompts, without research requirements and oriented toward the five-paragraph format, organizations like ETS have reliability down pat. It's thus little wonder that computers can achieve similar correlations. (Such computerized assessments are notoriously susceptible to awarding high scores to deliberately nonsensical essays, but some have reasonably responded that you have to have a pretty strong grasp of sentence construction and paragraphing to create that kind of false positive in the first place.)
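To make concrete what a reliability figure like that means, here is a minimal sketch computing a Pearson correlation between two raters' holistic scores. The score lists are invented for illustration; they are not data from the study or from ETS.

```python
# Hypothetical 1-5 holistic essay scores from a human rater and a machine
# scorer. These numbers are made up purely to illustrate the statistic.
human = [4, 3, 5, 2, 4, 3, 5, 4, 2, 3]
machine = [4, 3, 5, 3, 4, 3, 4, 4, 2, 3]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(round(pearson(human, machine), 2))  # this invented sample lands around 0.91
```

Note that a number like this measures only agreement between raters; it says nothing about whether the thing being rated is worth rating, which is the validity question taken up below.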
No, the problem is not reliability but validity: short-order, limited, context-free essays like those typically employed on standardized tests simply have very little to do with what we want writing instruction to accomplish. The kinds of tests employed in the SAT, ACT, GRE, TOEFL, etc., can tell you a thing or two about a student's ability to form syntactically coherent sentences and paragraphs. But we want, and should want, more from writing instruction than that kind of remedial skill. What goes into a competent college essay operates at a far higher level of complexity: knowing how to form an intelligent position, how to research in order to develop that position and support it with responsibly generated evidence, how to express it in a responsibly limited and contingent way, how to organize the argument logically and with sophistication, how to demonstrate adequate attention to the other side's claims and arguments, and, one hopes, how to express all of that in a style and idiom that is elegant and personal. All of that is merely to rise to the level of competence; excellence is leagues beyond. And computers can currently test almost none of it, let alone teach it.
Look, I’m interested in quantitative research methods in composition myself. You’re not likely to find many people within my field more sympathetic to quantitative techniques than I am. I have plenty of arguments with peers who don’t support quantitative assessment even in limited contexts. But for the higher-order concerns of probabilistic argument, responsibly generated by sorting through claims of differing accuracy in the scrum of factual disagreement, we have to be modest in our application of numerical scoring. And we have quantitative reasons to think so! Take one recent example: research from the New Jersey Institute of Technology (published in the latest issue of Research in the Teaching of English) on the popular ACCUPLACER test for placement into freshman English. The test’s results showed very low correlation with students’ success in freshman composition. I don’t doubt that ACCUPLACER and similar tests can tell whether a student’s 25-minute (or whatever) essay is grammatically and structurally sound. But that’s just not sufficient to guarantee success as a collegiate writer. (If it were, our bar for success would be depressingly low indeed.)
I hold, like a lot of people in the field, that the best assessment of collegiate writing is a writing portfolio, filled with a variety of texts from different genres and contexts (and ideally classes from a variety of departments), scored by multiple raters who pay attention to local context and the purpose of the writing. Obviously, there are resource issues there. (See Brian Huot’s (Re)Articulating Writing Assessment for a long consideration of these issues.) But that’s what people who spend all their time thinking about this stuff believe is the best system, totally independent of pedagogical or philosophical resistance to quantitative or computerized assessment.
Unfortunately, some from outside the field are likely to see this as indicative of typical academic intransigence.