Texas Sharpshooters and spreadsheet temptation

The Texas Sharpshooter Fallacy functions as both a joke and a warning. The idea is simple but powerful. A Texan decides he wants to prove his shooting prowess to his friends. He takes out his handgun and empties its magazine into the side of his barn. He then paints a bull’s-eye where his shots are clustered.

This joke is connected to empirical investigation in a very important way: when we go looking for any connection, rather than a specific one generated from theory and hypotheses, we are likely to find something. Right now, I’m working with a very large, data-rich corpus of essays by second language writers. A lot of my research involves using computer programs to mine large collections of texts for their patterns and features. To say that this work is bolstered by the ready availability of data would be understating things: it is largely only possible thanks to that availability. The collection of this corpus, under controlled conditions, is itself an artifact of the power of computing and the internet, as is my ability to access it. But access to data invites temptation. I have a spreadsheet with tons of information under a large number of categories. It would only take a couple of clicks for me to generate correlations for all of the quantitative data in the spreadsheet. Similarly, it would be trivially easy for me to run a chi-square test on all of the categorical data and look for associations.

To make matters worse, the nature of statistical significance tells us that if we look for significant relationships between enough variables, we will eventually find some by chance alone. At the conventional .05 threshold, roughly one in twenty tests of entirely unrelated variables will come up significant. I’ve never felt the temptation too keenly. But were I in desperate need of a publication, and aware that studies that show a correlation or association are far more likely to be published than those that don’t, I might be tempted to just see what’s in there. And while I can’t be sure, often when I read studies from education or experimental psychology (fields that, unlike my own, typically require statistically significant results in order to publish), I suspect that someone’s gone barn hunting. There are some statistical checks that we can do to help ascertain when someone’s done this sort of thing, but ultimately we are at the mercy of researchers to responsibly report the order of events in their chain of research and to tell us important details like how many variables they used in a multiple regression or fed into an ANOVA.
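To make the "barn hunting" problem concrete, here is a minimal sketch in Python. It is an illustration, not anything drawn from my actual corpus: the function names, the 20 variables, the 100 rows, and the .05 threshold are all assumptions chosen for the example, and the p-values use the Fisher z approximation rather than an exact test. Every column is pure random noise, so any "significant" correlation it finds is a bullet hole with a target painted around it.

```python
import itertools
import math
import random
from statistics import NormalDist


def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Approximate two-sided p-value for r via the Fisher z transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_spurious_hits(num_vars=20, num_rows=100, alpha=0.05, seed=0):
    """Correlate every pair of unrelated noise columns; count 'significant' pairs."""
    rng = random.Random(seed)
    # Hypothetical spreadsheet: every column is independent Gaussian noise.
    columns = [[rng.gauss(0, 1) for _ in range(num_rows)] for _ in range(num_vars)]
    pairs = list(itertools.combinations(range(num_vars), 2))
    hits = sum(
        1
        for i, j in pairs
        if approx_p_value(pearson_r(columns[i], columns[j]), num_rows) < alpha
    )
    return hits, len(pairs)


if __name__ == "__main__":
    hits, total = count_spurious_hits()
    print(f"{hits} of {total} pairs look 'significant' in pure noise")
    print(f"Chance of at least one false positive in 20 independent tests: "
          f"{familywise_error_rate(20):.2f}")
```

Twenty variables already yield 190 pairwise tests, so at a .05 threshold we should expect somewhere around nine or ten spurious hits from noise alone; the exact count depends on the random seed.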

All of this, incidentally, is another reason why it’s profoundly misguided to speak of only being interested in empiricism, not theory or ideology, a la Ezra Klein. There is no such thing as empiricism without theory; theory is necessary to generate, analyze, and understand data. Insisting on the necessity of theory doesn’t spring from any aesthetic or romantic commitments or place one on a humanistic-empirical divide. We embed empiricism in theory not because we choose to but because there is no alternative. The assumptions inherent to methods, methodologies, and epistemologies all impact both the collection and analysis of empirical results. And while we might identify the Texas Sharpshooter as a particularly pernicious or dishonest failure to interrogate the theories that underlie our empirical work, that failure exists on a continuum of sins that occur when we refuse to acknowledge the social, political, and theoretical framework in which all empiricism is embedded. We must remember that more data does not liberate us from the need for careful and skeptical epistemology; indeed, more data only makes the careful consideration of epistemological questions more important.
