a simple way testing companies shirk transparency

I’m drawing close to completing and defending my dissertation. I’m going to write up some stuff about it soon, as I genuinely think there are some interesting things in there for people who are interested in educational policy. (You may roll your eyes if you choose.) Briefly, though, I want to say a quick word about testing companies as agents of accountability.

Educational testing companies are a fact of life. They exist and will continue to exist, and they will play a large role in various assessment and gatekeeping functions in our society. There are many problems with the industry writ large (and please note that in most cases the non-profit designation that some of these companies hold is meaningless), but they also have their virtues. I could complain about a dozen things with ETS, for example, but they also do a great deal of important research, some of which is critical of their own products. Testing companies of various stripes will remain necessary, as it’s simply not practically possible for all states or institutions to develop tests internally, particularly tests that can be mutually intelligible and aggregable with other tests. So while there are a lot of things that I lament about these companies, they’re here to stay and some of them do good work.

But there really are a lot of perverse incentives, and one of the most powerfully distorting is the compulsion to secrecy on the grounds of protecting industry secrets. Make no mistake: the competition between these companies and organizations is vicious, and they work relentlessly to position themselves as the industry standard in various domains. There’s a tipping point effect with tests like these: in order for their outcomes to be interpretable by most people, they have to attract a sufficient number of students to take the test. Otherwise there’s no popular context in which to place them. The “1600 Club” made sense to people because such a huge number of students take the SATs. (The “2400 Club” just never had that ring, though.) Meanwhile, even though the ACT has made deep inroads into the traditional turf of the SAT, telling someone you got a 30 on your ACT just doesn’t have adequate context to be meaningful to most. So the competition for market share is particularly important and particularly intense in this field.

One of the consequences is that a lot of these companies are fiercely jealous of their internal research and processes. That’s bad enough for test takers who have a vested interest in the validity and reliability of these instruments. But when it combines with a huge higher education assessment movement that’s coming from the heights of our policy apparatus, and that is deeply politicized, the problem multiplies. Obama’s proposal to create college “value” rankings, and to tie those rankings to access to federal education funds, depends deeply on the effectiveness of a set of competing tests of higher education. Yet the public is denied essential information about these instruments, thanks to the invocation of industry secrets and test security.

Theoretically, you can request data and do research on these tests, and it does happen. But the impediments that companies place in the way of doing such research and publishing it in a timely fashion amount to a serious disincentive. Given the incredibly competitive academic job market and the ever-growing tenure requirements at research universities, most researchers don’t have time to wait for their research to be approved.

So let me tell you this story. In my first year of coursework I developed an interest in the relationship between the displayed range of vocabulary in an essay and that essay’s score on the type of short-essay writing tests that are common to the SAT, ACT, GRE, and similar exams. I was teaching a freshman composition class that, due to random chance and Purdue’s giant international population, happened to include 14 non-native English speakers out of my 19 total students. My students often seemed to struggle to find the words they needed to express themselves effectively in their papers, which frequently resulted in repetitive and limited expression. However, vocabulary simply is not a major element of college writing pedagogy, as most native speakers are assumed to have adequate vocabulary to write effectively, and attention tends to be paid to different mechanical and higher-order concerns. So I investigated algorithms that are used to assess vocabulary in a writing sample and undertook a simple correlational study, investigating the relationship between the displayed diversity of vocabulary in a set of real test essays and the scores those essays received, and comparing the results between first language and second language writers. The professor I handed it in to for a class loved it and encouraged me to publish it.
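For readers curious what a study like this looks like in practice, here is a minimal sketch. It is not the method or the data from my paper: I’m assuming the simplest possible vocabulary measure, the type-token ratio (unique words divided by total words), and the essays and scores are invented toy examples. The function names are mine, made up for illustration.

```python
import math
import re


def type_token_ratio(text: str) -> float:
    """Unique words divided by total words; 0.0 for an empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: (essay text, score, writer group). Not real test essays.
essays = [
    ("the test was hard and the test was long", 2, "L2"),
    ("students often repeat familiar words when vocabulary is limited", 4, "L2"),
    ("a good essay is a good essay because it is good", 2, "L1"),
    ("varied diction lets writers express subtle distinctions clearly", 5, "L1"),
    ("the exam asks writers to argue a position in limited time", 4, "L1"),
    ("i think school is good and i think tests are good", 3, "L2"),
]

# Compare the diversity/score relationship separately for each group.
for group in ("L1", "L2"):
    subset = [(type_token_ratio(text), score) for text, score, g in essays if g == group]
    diversity = [d for d, _ in subset]
    scores = [s for _, s in subset]
    print(group, round(pearson_r(diversity, scores), 3))
```

One caveat worth knowing: the raw type-token ratio drops as texts get longer, so a real study would want essays of similar length or a length-corrected measure.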

However, I soon ran into a problem. The data set I had used was a set of real essays written for a popular, well-known test given to potential college students, developed by one of the largest testing companies in the world. I had thought that this would be both easier than trying to generate my own data set and more valuable, because I could see how essays produced under real test conditions and rated by official raters fared in my research. But then I was reminded by the professor who had given me the essays that I had to be very careful about how I shared my work. I can’t even tell you, if you can believe it, the name of the test or the company that made it.

Here’s what she told me. If I wanted to publish the paper, I would have to get the explicit permission of the testing company, or they would sue. In order to get permission, I would have to draft the paper and send it in to the company. They would then give it an initial review. This would typically take about 6 months, but theoretically they could take as long as a year. They would then send the paper back to me with requested revisions, which my professor told me would certainly be extensive. I would then make the revisions and send them in to the company for approval, which could again take months even if they turned around and approved the changes. Then I would submit the paper to a given journal. That journal’s initial editorial review to determine whether it should be sent out for peer review could take a month. The peer reviewers would typically be given a month to six weeks to get a response back to the editor, although it could take longer, and experience teaches me that the editor could sit on it for weeks before I got my response. Whether I got a revise and resubmit or an acceptance, revisions would be required before publication, as they always are. I would then take the time and effort to make the revisions required by the journal. But then I would be required to send the new version to the testing company for its own review, as any substantial change to the paper would have to go through their review process again. After I had made any additional changes they wanted, again over a time frame of months, I could send it back to the journal, whose editors might very well require a second round of review or edits of their own, putting the paper back in the cycle. It was very unlikely that the process would take less than several years.

It won’t surprise you to learn that I decided it just wasn’t worth it, despite the fact that I thought (and still think) that the research was worthwhile. It wasn’t a total bust; I did eventually develop my paper for System out of my initial research. But the problem is obvious: academics must publish constantly if they are to have any hope of a successful academic career, and the vast time frames of this kind of approval system provide a clear disincentive to using testing company data. And note that my own research was not at all intended as a diagnostic or critical examination of the test in question; it merely was designed using their data set. How amenable these companies might be to critical research during their internal review processes, I couldn’t say.

My dissertation concerns the Collegiate Learning Assessment+, its place within the broad assessment movement, and its proposed implementation here at Purdue. As I said above, I don’t think testing companies are the devil, and in fact I think that the Council for Aid to Education that develops the test is one of the testing organizations with the most responsible perspective and best people. But the test, like so many of these tests, is in essence a black box. At the beginning of my research I requested a set of real student responses to their essay-based instrument, the Performance Task. I was rebuffed: the CAE does not provide such data to anyone. Indeed, even the institutions that implement the test are not permitted to view real student answers, despite the fact that they pay $35/student for the test. I was told that there was a data set that I could access of student scores that provided institutional and demographic data in order to look for trends. That wasn’t really what I was looking for, but I requested it anyway, only to find that the previously mentioned accommodation was not going to be made. I was told that my research request was insufficiently specific for them to provide me with the data set. This, while I worked at a public university that was proposing to implement their test at considerable expenditure and effort. Realizing that I was unlikely to have an easy time of things generally, I changed the methodology of my dissertation to history and journalism. The risk of ending up with nothing to write about was just too great not to.

All of this happens in the middle of an assessment push that is constantly portrayed as a matter of accountability. In his book Measuring College Learning Responsibly, Richard Shavelson, one of the developers of the CLA+ and one of the true good guys in educational testing, writes that “‘Trust me’ is an inadequate response to a demand for accountability” from colleges and universities. But I would turn that point right back around: what, exactly, does the CAE offer us beyond a “trust me,” when they jealously guard access to their actual student-produced data? CAE, to their credit, produces a great deal of research on their test. But as they surely know, internally-generated or funded research is never going to meet the standard of true objectivity and accountability. Today, a large majority of the research performed on the test was produced under the auspices of the organization that developed it. That is an unhealthy condition by any lights, particularly when the potential stakes are so high. I am not uniformly opposed to higher education assessment in general or the CLA+ specifically, but these issues are precisely why so many faculty and community members distrust these tests and resist their implementation.

So please don’t think that educational assessment is merely a question of those who want accountability and those who obstruct it. There is no such thing as accountability without mutual transparency, and until true transparency is demanded of the test companies, we can’t say with confidence or rigor that their instruments measure what they intend to measure or do so reliably across contexts.