Sorry I’ve been so out of the loop. Here’s a description of some current research of mine, taken from something I wrote in a different forum.
I’m currently engaged in a research project investigating the composition processes of second language learners, in this case writers with Chinese and Hindi L1s. I’m using corpus linguistics software to mine a large archive of student essays for certain patterns of argumentative and rhetorical structure. The software reports back on frequency and position, and from these outputs I can statistically compare the use of such structures across demographic variables– language of origin, years of education in English, and so on. And since meaningful results in second language studies usually require a native-speaker point of comparison, I’m creating a baseline from an equivalently sized corpus of writing by L1 English students.
Much of corpus linguistics has focused on the level of morphosyntax, for the simple reason that the software is better equipped to look for certain word-level constructions or word pairings than it is to examine the larger, more complex, and more variable argumentative plane. English is notoriously morphologically inert; that is, our use of inflectional affixes is quite limited in comparison to many other languages. (Compare, for instance, to a language like Spanish.) For this reason, searching for particular syntactic structures with computers can be quite tricky. It’s also for this reason that formalist poets in other languages often have an easier go of it than poets writing in English– it’s much harder to write a villanelle or terza rima when words lack consistent inflectional endings. In a language like Latin, word order is vastly more malleable because the inflections carry so much of the information necessary for meaning. In English, word order allows some variation in absolute terms but is quite restricted in comparison to many languages. (There are exceptions, such as floating quantifiers, e.g. all– “All the soldiers will eat,” “The soldiers all will eat,” “The soldiers will all eat,” etc.)
But recently, researchers in composition have had some success in looking for certain idiomatic constructions as clues to the kinds of arguments that students are making. Some of these are obvious, such as the use of formal hedges (“to be sure”) or boosters (“without question”), and those are the types of features I’m searching for. Others are more complicated and require a little more finesse to search for effectively.
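Here’s the kind of search this amounts to, again sketched in Python; the hedge and booster lists are illustrative stand-ins for my actual inventory, and real concordancers produce this sort of keyword-in-context display for you:

```python
import re

# Illustrative stand-ins, not my actual feature inventory.
HEDGES = ["to be sure", "admittedly"]
BOOSTERS = ["without question", "undoubtedly"]

def kwic(text, phrase, width=25):
    """Keyword-in-context lines: each match with `width` characters of context."""
    lines = []
    for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"...{left}[{m.group()}]{right}...")
    return lines

sample = ("Admittedly, the essay wanders. Without question, though, "
          "its thesis is clear, and to be sure, clarity matters.")
for phrase in HEDGES + BOOSTERS:
    for line in kwic(sample, phrase):
        print(line)
```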
Code glosses, for example. A code gloss is an attempt by a writer to explain to readers what a particular word or term in his or her text is meant to convey in the context of that particular piece of writing. A code gloss is not, or is not merely, a definition; a definition provides denotative information that is accurate or inaccurate regardless of context. A code gloss, in contrast, has to provide the information necessary for a reader to follow the writer’s argument, and so a code gloss could fail as a general definition but succeed in its specific purpose. (This paragraph itself amounts to a code gloss.) The study of these kinds of features in writing, if you’re feeling fancy, is referred to as metadiscourse. Many types of metadiscourse have certain formal clues that can be used to search for them in large corpora.
Unfortunately, false positives are common. The further you get from a restricted set of idiomatic phrases, the more likely it becomes that the computer will return a string that is formally identical but argumentatively distinct– so a search for “to be sure” as a formal hedge will also return “I looked it up in a dictionary to be sure that I got it right,” which is not a hedge. The flexibility of language, one of our great strengths as a species, makes this sort of thing inevitable to a certain extent. The recourse is often just to sift through the returned results, weeding out the false positives. (Or, if you’re lucky enough to have one, making a research assistant do it!) You might ask why bother with the computer at all if you have to perform a reality check on most strings yourself. The answer is that it’s possible to look through the, say, 600 examples returned by a given search string and eliminate the false positives, but not to read through the 500,000-2,000,000 words of an entire corpus hunting for what you want to find.
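A sketch of what that sifting can look like, once more in Python. The “set off by punctuation” heuristic below is my own rough illustration, not a reliable rule– which is exactly why the results still need a human pass:

```python
import re

# Rough heuristic (mine, for illustration only): a hedging "to be sure"
# tends to stand apart, set off by punctuation, rather than working as
# an infinitive of purpose ("...to be sure that...").
def looks_like_hedge(match, text):
    before = text[:match.start()].rstrip()
    after = text[match.end():].lstrip()
    starts_clause = before == "" or before.endswith((".", ";", ":"))
    set_off_after = after.startswith(",")
    return starts_clause and set_off_after

text = ("To be sure, the argument has gaps. "
        "I looked it up in a dictionary to be sure that I got it right.")
hits = list(re.finditer("to be sure", text, flags=re.IGNORECASE))
for m in hits:
    verdict = "hedge?" if looks_like_hedge(m, text) else "false positive?"
    print(f"{m.group()} -> {verdict}")
```

Even a crude filter like this can shrink the pile a research assistant has to read, though it will misclassify plenty of real examples.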
Beyond that, your only recourse is building effective search strings within the interface of the particular corpus linguistics software you’re using. This requires carefully calibrating wildcards: positions in the search string where the software will accept any character or word. You can restrict these wildcards in a variety of ways– for example, you can allow a wildcard to match any single letter, or only one of a specified set of letters. Or you can bind the wildcard by proximity; that is, the wildcard can be formatted so that the software will only look a certain distance, in characters or words, from a particular search term. The more open-ended you make your search strings, the more likely you are to get false positives that have to be laboriously culled for accurate data; the more restrictive you are, the more likely you are to exclude relevant examples and thus jeopardize the quality of your research.
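Regular expressions are a reasonable stand-in for those wildcard options, so here’s the trade-off in Python; the verb argue and the sentences are invented for the example:

```python
import re

text = ("She argues well; the argument is arguably sound, "
        "and he argued it again yesterday.")

# Open-ended wildcard: "argu" plus any run of letters. High recall, but
# it also sweeps in "arguably", which may not be what I want.
print(re.findall(r"\bargu\w*\b", text))

# Restricted wildcard: only specific endings are allowed. Higher precision,
# but any form I failed to anticipate is now silently excluded.
print(re.findall(r"\bargu(?:e|es|ed|ing|ment)s?\b", text))

# Proximity-bound search: "be" within at most one word of "sure".
nearby = "You can be quite sure of that; be absolutely and completely sure."
print(re.findall(r"\bbe\b\W+(?:\w+\W+){0,1}\bsure\b", nearby))
```

The first pattern over-collects, the second under-collects, and the third shows how a distance bound silently drops matches that sit one word too far away– the same recall-versus-precision bargain in miniature.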
And that’s why I’m here on a Saturday morning: poking around with a particular search string, looking at the results it returns, and trying to fine-tune it to get closer to the results I want. All of this is in the service of producing research that can express certain qualification- and caveat-filled conclusions, responsibly presented, in order to make some small amount of progress in our understanding of second language literacy acquisition, which is one of my primary research interests. It’s what I love to do.