for want of data

One of the hardest parts of being a researcher is getting access to data. The problem is particularly acute if you, like me, work in research fields where there is very little grant funding available, which makes it difficult to offer language users incentives to provide samples produced under the controlled conditions that rigorous research requires. (There’s lots of money in education research, but it comes with a ton of strings attached, which I may write about someday.) I think this is changing, and as someone who does a lot of corpus linguistics, I find the availability of focused corpora like the ICNALE a great boon. Broader, less prescribed corpora like the MICUSP are also potentially useful if you’re prepared to do some work to standardize what you’re looking for. Still, now and for the foreseeable future, it’s hard to get access to the kind of texts that you need, particularly if you’re looking at a specific task, group, or instrument.

I’ve been feeling this difficulty acutely of late, as I experienced a setback with my dissertation along these lines. (I’m not trying to be arch. I’ll write about it soon; I just want time to process the next step and speak with fairness and friendliness about it.)

I actually thought of this in regard to the Satoshi Nakamoto story that has been lighting up the internet today. Some have expressed doubt about Newsweek’s outing of Dorian S. Nakamoto as the Satoshi Nakamoto, in part because of the comparison between these two emails. If I had access to more of the writing of Dorian S. Nakamoto, it would be easy for me to use the kind of textual processing tools I use every day to assess the similarity in writing styles between the two. Computerized textual processing is still remarkably limited in many ways, and it can’t readily assess things like stylistic or rhetorical quality. But one thing this software is very good at is analyzing texts for their indicative features and comparing them to a reference corpus. Expressed with the necessary caveats and with a responsibly generated statistical probability, such an analysis could be both useful and very quick to do.

Unfortunately, I’m not aware of any other available writing from Dorian S. Nakamoto, and this one email is not nearly sufficient to make a responsible comparison. I’m not saying that you shouldn’t do an eyeball test or use common sense to compare them. I’m just saying that this is not nearly enough textual information to do a responsible statistical, automated analysis. Of course, it wouldn’t surprise me if someone does one with just that email anyway, or if someone already has!
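For the curious, here is a minimal sketch of the general kind of comparison I have in mind. It is not the software I actually use, and everything in it is an assumption made for illustration: the tiny `FUNCTION_WORDS` list, the helper names, and the choice of cosine similarity over function-word frequencies as a stand-in for a proper stylometric measure. A responsible analysis would use a far larger feature set, a reference corpus, and a real statistical test.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny feature set. Real stylometric work
# typically uses hundreds of function words or other features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "is", "it", "for", "i", "not"]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z'’]+", text.lower())


def profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return [0.0] * len(FUNCTION_WORDS)
    counts = Counter(tokens)
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_similarity(text_a, text_b):
    """Compare two texts by their function-word profiles."""
    return cosine_similarity(profile(text_a), profile(text_b))
```

You would call `style_similarity(known_text, disputed_text)` on two strings. With samples as short as a single email, the frequencies behind that score rest on a handful of tokens, so the number it returns is unstable, which is exactly the problem described above.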

