Like a lot of terms that get bandied around, it’s not always clear what “data journalism” means, but I’ll risk the potential for being a bit vague and assume that most people know what I’m talking about. We’ve seen a rapid growth in the use of arguments based on statistics in the popular media in the last several years. In particular, we’ve seen growth in journalists and commentators running statistical analyses themselves, rather than just reporting the statistics that have been prepared by academics or government agencies. This is potentially a real boon to our ability to understand the world around us, but it carries with it all of the potential for misleading statistical arguments.

My request is pretty simple. All statistical techniques, particularly the basic parametric statistical techniques that are most likely to show up in data journalism, require the satisfaction of assumptions and checking of diagnostic measures to ensure that hidden bias isn’t misleading us. Many of these assumptions and diagnostics are ultimately judgment calls, relying on practitioners to make informed decisions about what degree of wiggle room is appropriate given the research scenario. There are, however, conventions and implied standards that people can use to guide their decisions. The most important and useful kind of check, though, is the eyes of other researchers. Given that the ability to host graphs, tables, and similar kinds of data online is simple and nearly free, I think that **data journalists should provide links to the graphs and tables they use to check assumptions and diagnostic measures**. I don’t expect to find these graphs and tables sitting square in the center of a blog post, and I expect that 90% of readers wouldn’t bother to look. But there’s nothing to risk in having them available, and transparency, accountability, and collaboration to gain.

*****

That’s the simple part, and you can feel free to close the tab. For a little more:

What kind of assumptions and diagnostics am I talking about? Let’s consider the case of one of the most common types of parametric methods, linear regression. Whether we have a single predictor for simple linear regression or multiple predictors for multiple linear regression, fundamentally regression is a matter of assessing the relationship between quantitative (continuous) predictor variables and a quantitative (continuous) outcome variable. For example, we might ask how well SAT scores predict college GPA; we might ask how well age, weight, and height predict blood pressure. The types of regression analysis, and the issues therein, are vast, and I’m little more than a dedicated beginner. But I know enough to talk about some of the assumptions we need to check and some problems we have to look out for. I want to talk a little bit about these not because I think I’m in a position to teach others statistics, or because regression is the only statistical process that we need to see assumptions and diagnostics for. Rather, I think regression is an illustrative example through which to explore *why* we need to check this stuff.

There are four assumptions that need to be true to run a linear (least squares) regression: independence of observations, linearity, constancy of variance, and normality. (Some purists add a fifth, existence, which, whatever.)

**Independence of Observations**

This is the biggie, and it’s why doing good research can be so hard and expensive. It’s the necessary assumption that one observation does not affect another. This is the assumption that requires randomness. Remember that in statistics *error*, or necessary and expected variation, is inevitable, but *bias*, or the systematic influence on observations, is lethal. Suppose you want to take the average height of the student body of your college. You get a sample size of 30. (Not necessarily too small!) If your sample is truly random, and you get a sample mean of 5’8, but your actual student population mean is 5’7, that’s error. That’s life. On the other hand, if you only sample people who are leaving basketball practice, and you get an average height of 6’2, that’s bias. The observations aren’t independent; they share a common feature which is influencing your results. When we talk about randomness in sampling, we mean that every individual in the population should have the same chance of being part of the sample. Practically, true randomness in this sense is often impossible, but there are standards for how random you can make things. Getting random samples is expensive because you have to find some way to compel or entice people in a large population to participate, which is why convenience samples, though inherently problematic, are so common.

Independence is scary because threats to it so often lurk out of sight. And the presumption of independence often prohibits certain kinds of analysis that we might find natural. For example, think of assigning control and test conditions to classes rather than individual students in educational research. This is often the only practical way to do it; you can’t fairly ask teachers to only teach half their students one technique and half another. You give one set of randomly-assigned classes a new pedagogical technique, while using the old standard with your control classes. You give a pre- and post-test to both and pop both sets of results in an ANOVA. You’ve just violated the assumption of independence: we know that there are clustering effects of children within classrooms; that is, their results are not entirely independent of each other. We can correct for this sort of thing using techniques like hierarchical modeling, but first we have to recognize that those dangers exist!

How would a lack of independence affect regression? Well, suppose you wanted to define the relationship between average number of hours of sleep per night and Body Mass Index. But say you chose your sample by asking people as they left the gym. Your sample is now made up primarily of people who exercise regularly. Maybe the relationship is different for the sedentary. Maybe people who exercise a lot can sleep less and stay trim, but those who are sedentary have a strong relationship between BMI and number of hours of sleep. If your sampling means you are only looking at the fit, you have no way to know.

Independence is the assumption that is least subject to statistical correction. It’s also the assumption that is the hardest to check just by looking at graphs. Confidence in independence stems mostly from rigorous and careful experimental design. You can check a graph of your observations (your actual data points) against your residuals (the distance between your observed values and the values predicted by your model’s line), which can sometimes provide clues. But ultimately, you’ve just got to know your data was collected appropriately. On this one, we’re largely on our own. However, I think **it’s a good idea for data journalists to provide a Residuals vs. Observations graph** when they run a regression.

Here’s a Residuals vs. Observations graph I pulled off of Google Images. This is what we want to see: snow. Clear nonrandom patterns in this plot are bad.
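For anyone who wants to see where such a plot comes from, here is a minimal sketch in plain Python. The sleep and BMI numbers are made up for illustration, `fit_line` and `residuals` are hypothetical helper names, and I am reading "observations" as the predictor values, which is only one way to draw this graph.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(xs, ys):
    """Observed y minus the y predicted by the fitted line."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the made-up data is reproducible
hours = [rng.uniform(4, 10) for _ in range(50)]          # hypothetical predictor
bmi = [30 - 0.8 * h + rng.gauss(0, 1.5) for h in hours]  # hypothetical outcome

res = residuals(hours, bmi)

# These (observation, residual) pairs are the points you would scatter-plot.
pairs = sorted(zip(hours, res))
for x, r in pairs[:5]:
    print(f"{x:5.2f}  {r:+.2f}")
```

Feed `pairs` to whatever plotting tool you like. Because the noise here was generated independently of everything else, the resulting cloud should look like snow.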

**Linearity**

The name of the technique is linear regression, which means that observed relationships should be roughly linear to be valid. In other words, you want your relationship to fall along a more or less linear path as you move across the *x* axis; the relationship can be weaker or it can be stronger, but you want it to be more or less as strong as you move across the line. This is particularly the case because curvilinear relationships can appear to regression analysis to be no relationship. Regression is all about interpolation: if I check my data and find a strong linear relationship, and my data has a range from A to B, I should be able to check any *x* value within A and B and have a pretty good prediction for *y*. (What “pretty good” means in practice is a matter of residuals and *r*-squared, or the portion of the variance in Y that’s explained by my Xs.) If my relationship isn’t linear, my confidence in that prediction is unfounded.

Take a look at these scatter plots. Both show close to zero linear relationship according to Pearson’s product-moment coefficient:

And yet clearly, there’s something very different going on from one plot to the next. The first is true random variance; there is no consistent relationship between our *x* and *y* variables. The second is a very clear association; it’s just not a linear relationship. The degree to which *y* varies along *x* changes over different values for *x*. Failure to recognize that non-linear relationship could compel us to think that there is no relationship at all. If the violation of linearity is as clear and consistent as in this scatter plot, it can be cleaned up fairly easily by transforming the data. I currently have the advantage of a statistical consulting service on campus, but I also find that the internet is full of sweet, generous nerds who enjoy helping with such things.
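To make the point concrete, here is a small sketch with invented data: a perfect parabola, where *y* is completely determined by *x*, still has a Pearson correlation of zero (up to rounding), and squaring the predictor recovers a perfectly linear relationship. The helper name `pearson_r` and the numbers are mine, not taken from the plots above.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [x / 2 for x in range(-20, 21)]  # -10.0 to 10.0, symmetric around zero
ys_curve = [x ** 2 for x in xs]       # a clear relationship, just not a linear one
ys_line = [3 * x + 1 for x in xs]     # a straight line, for contrast

r_curve = pearson_r(xs, ys_curve)
r_line = pearson_r(xs, ys_line)

# Transforming the predictor (here, squaring it) straightens the curve out.
r_transformed = pearson_r([x ** 2 for x in xs], ys_curve)

print(f"parabola, raw x:     r = {r_curve:.3f}")
print(f"straight line:       r = {r_line:.3f}")
print(f"parabola, x squared: r = {r_transformed:.3f}")
```

Real data will never be this tidy, and picking the right transformation is usually a judgment call rather than something the numbers hand you.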

Regression is fairly robust to violations of linearity, as well, and it’s worth noting that any relationship with a correlation sufficiently lower than 1 will be non-linear in the strict sense. But clear, consistent curves in data can invalidate our regression analyses.

Readers could check data journalism for linearity if **scatter plots are posted for simple linear regression**. For multiple regression, it’s a bit messier; you could plot every individual predictor, but I would be satisfied if you just mention that you checked linearity.

**Constancy of variance**

Also known by one of my very favorite ten-cent words, homoscedasticity. Constancy of variance means that, along your range of *x* predictors, your *y* varies about as much; it has as much spread, as much error. That is, if an SAT score predicts freshman year GPA with a certain degree of consistency for students scoring 600, it should be about as consistent for students scoring 1200, 1800, and 2400.

Why? Think again about interpolation. I run a regression because I want to understand a relationship between various quantitative variables, and often because I want to use my predictor variables to… predict. Regression is useful insofar as I can move along the axes of my *x* values and produce a meaningful, subject-to-error-but-still-useful value for *y*. Violating the assumption of constant variance means that you can’t predict *y* with equal confidence as you move around *x*(s).

Here’s a residuals plot showing the dreaded megaphone effect: the error (size of residuals, difference between observations and results expected from the regression equation) increases as we move from low to high values of *x*. The relationship is strong at low values of *x* and much weaker at high values.

We could check homoscedasticity by **having access to residual plots**. Violations of constant variance can often be fixed via transformation, although it may often be easier to use techniques that are more inherently robust to this violation, such as quantile regression.
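If you want a number to go with the picture, one crude check is to compare how spread out the residuals are in the lower and upper halves of *x*. This is only a sketch on simulated data that I built to have a megaphone in it (the noise grows with *x*); the split-at-the-median comparison is my own rough stand-in for eyeballing the plot, not a formal test.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

rng = random.Random(1)  # fixed seed for reproducibility
xs = [rng.uniform(1, 10) for _ in range(400)]
# The noise standard deviation grows with x: that is what makes the megaphone.
ys = [2.0 + 0.5 * x + rng.gauss(0, 0.3 * x) for x in xs]

intercept, slope = fit_line(xs, ys)
res = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Compare residual spread below and above the median of x.
median_x = statistics.median(xs)
low_half = [r for x, r in zip(xs, res) if x <= median_x]
high_half = [r for x, r in zip(xs, res) if x > median_x]
spread_low = statistics.pstdev(low_half)
spread_high = statistics.pstdev(high_half)

print(f"residual spread, low x:  {spread_low:.2f}")
print(f"residual spread, high x: {spread_high:.2f}")
```

With constant variance the two spreads would come out roughly equal; here the high-*x* half is noticeably noisier.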

**Normality**

The concept of the normal distribution is at once simple and counterintuitive, and I’ve spent a lot of my walks home trying to think of the best way to explain it. The “parametric” in parametric statistics refers to the assumption that there is a given underlying distribution for most observable data, and frequently this distribution is the normal distribution or bell curve. Think of yourself walking down the street and noticing that someone is unusually tall or unusually short. The fact that you notice is in and of itself a consequence of the normal distribution. When we think of someone that is unusually tall or short, we are implicitly assuming that we will find fewer and fewer people as we move further along the extremes of the height distribution. If you see a man in North America who is 5’10, he is above average height, but you wouldn’t bat an eye; if you see a man who is 6’3, you might think to yourself, that’s a tall guy; when you see someone who is 6’9, you say, wow, he is tall!, and when you see a 7 footer, you take out your cell phone. This is the central meaning of the normal distribution: that the average is more likely to occur than extremes, and that the relationship between position on the distribution and probability of occurrence is predictable.

Not everything in life is normally distributed. Poll 1,000 people and ask how much money they received in car insurance payments last year and it won’t look normal. But a remarkable amount of naturally occurring phenomena are normally distributed, simply thanks to the reality of numbers and extremes, and the central limit theorem teaches us that the averages of reasonably large samples are approximately normally distributed for essentially any population. (That is, if I take a 100-person sample of a population for a given quantitative trait, I will get a mean; if I take another 100-person sample, I will get a similar but not exact mean, and so on. If I plot those means, they will be close to normal even if the overall distribution is not.)
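Here is a rough simulation of that parenthetical, using an invented, badly lopsided "insurance payout" population. The exponential distribution, the sample counts, and the use of skewness as a yardstick for lopsidedness are all my assumptions for the sake of illustration.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: about 0 for a symmetric distribution like the normal."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(2)  # fixed seed for reproducibility

# A lopsided population: most payouts are small, a few are very large.
population = [rng.expovariate(1 / 500) for _ in range(20000)]

# Take many 100-person samples and record the mean of each one.
sample_means = [
    statistics.fmean(rng.sample(population, 100)) for _ in range(2000)
]

skew_raw = skewness(population)
skew_means = skewness(sample_means)

print(f"skewness of the raw payouts:  {skew_raw:.2f}")
print(f"skewness of the sample means: {skew_means:.2f}")
```

The raw payouts are heavily skewed, while the pile of sample means is much closer to symmetric, which is the bell shape creeping in.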

The assumption of normality in regression requires our residuals, the errors left over around the fitted line, to be roughly normally distributed; in order to assess the relationship of *y* as it moves across *x*s, we need to know the relative frequency of extreme errors to errors close to the mean. It’s a fairly robust assumption, and you’re never going to have perfectly normal data, but too strong of a violation will invalidate your analysis. We check normality with what’s called a qq plot. Here’s an almost-perfect one, again scraped from Google Images:

That strongly linear, nearly 45 degree angle is just what we want to see. Here’s a bad one, demonstrating the “fat tails” phenomenon, that is, too many observations clustered at the extremes relative to the mean:

I will confess that, when I work with my statistics instructor, I still can’t predict what he will deem a “good enough” quantile plot. But this is just another way to say that I’m a beginner. Data journalists would do a good deed by **posting publicly-accessible qq plots**.
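For the curious, a qq plot is not much more than sorting your values and pairing each one with the value a normal distribution would put at that rank. The sketch below does that with Python's standard library on two invented samples; the `(i - 0.5) / n` plotting position is one common convention among several, and `straightness` is my own crude summary number, not a formal normality test.

```python
import math
import random
import statistics

def qq_points(values):
    """Pair each sorted, standardized value with the normal quantile for its rank."""
    n = len(values)
    normal = statistics.NormalDist()  # standard normal: mean 0, sd 1
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [
        (normal.inv_cdf((i - 0.5) / n), (v - mean) / sd)
        for i, v in enumerate(sorted(values), start=1)
    ]

def straightness(points):
    """Correlation of the qq points; 1.0 would be a perfectly straight line."""
    ts = [t for t, _ in points]
    vs = [v for _, v in points]
    t_bar, v_bar = statistics.fmean(ts), statistics.fmean(vs)
    cov = sum((t - t_bar) * (v - v_bar) for t, v in points)
    var_t = sum((t - t_bar) ** 2 for t in ts)
    var_v = sum((v - v_bar) ** 2 for v in vs)
    return cov / math.sqrt(var_t * var_v)

rng = random.Random(3)  # fixed seed for reproducibility
normal_sample = [rng.gauss(100, 15) for _ in range(300)]
# Fat tails: mostly well-behaved values, with an occasional wild one mixed in.
fat_tailed = [
    rng.gauss(0, 5) if rng.random() < 0.15 else rng.gauss(0, 1)
    for _ in range(300)
]

r_normal = straightness(qq_points(normal_sample))
r_fat = straightness(qq_points(fat_tailed))
print(f"normal sample: {r_normal:.3f}")
print(f"fat tails:     {r_fat:.3f}")
```

Plot the pairs from `qq_points` and you have the graph itself. The single number is handy for comparison but throws away the shape, which is exactly the thing a trained eye is reading.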

**Diagnostics**

OK, so 2000 words into this thing, we’ve checked out four assumptions. Are we good? Well, not so fast. We need to check a few diagnostic measures, or what my stats instructor calls “the laundry list.” This is a matter of investigating *influence*. When we run an analysis like regression, we’re banking on the aggregate power of all of our observations to help us make responsible observations and inferences. We never want to rely too heavily on individual or small numbers of observations because that increases the influence of error in our analysis. Diagnostic measures in regression typically involve using statistical procedures to look for influential observations that have too much sway over our analysis.

The first thing to say about outliers is that you want a *systematic* reason for eliminating them. There are entire books about the identification and elimination of outliers, and I’m not remotely qualified to say what the best method is. But you never want to toss an observation simply because it would help your analysis. When you’ve got that one data point that’s dragging your line out of significance, it’s tempting to get rid of it, but you want to analyze that observation for a methodology-internal justification for eliminating it. On the other hand, sometimes you have the opposite situation: your purported effect is really the product of a single or small number of influential outliers that have dragged the line in your favor (that is, to a *p*-value you like). Then, of course, the temptation is simply to not mention the outlier and publish it anyway. Especially if a tenure review is in your future…

Some examples of influential observation diagnostics in regression include examining leverage, or outliers in your predictors that have a great deal of influence on your overall model; Cook’s Distance, which tells you how different your model will be if you delete a given observation; DFBetas, which tell you how much a given observation influences a particular parameter estimate; and more. Most modern statistical packages like SAS or R have built-in commands for checking diagnostic measures like these. While offering numbers would be nice, I would mostly like it if **data journalists reassured readers that they had run diagnostic measures for regression** and found acceptable results. Just let me know: I looked for outliers and influential observations and things came back fairly clean.
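To give a flavor of what those commands compute, here is a sketch of leverage and Cook’s Distance for the one-predictor case, using the standard textbook formulas. The data is invented: ten points close to a line plus one far-off point I planted on purpose. The function name `influence_measures` is hypothetical, and a real analysis would lean on the built-in routines of a statistical package rather than hand-rolled code like this.

```python
def influence_measures(xs, ys):
    """Return a (leverage, cooks_distance) pair for each observation."""
    n = len(xs)
    p = 2  # parameters estimated: intercept and slope
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    res = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(r * r for r in res) / (n - p)  # residual variance estimate
    measures = []
    for x, r in zip(xs, res):
        leverage = 1 / n + (x - x_bar) ** 2 / sxx
        cooks = (r * r / (p * mse)) * leverage / (1 - leverage) ** 2
        measures.append((leverage, cooks))
    return measures

# Ten well-behaved points near y = 2x, plus one planted outlier at the end.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0, 0.1]
xs = [float(x) for x in range(1, 11)] + [25.0]
ys = [2 * x + e for x, e in zip(range(1, 11), noise)] + [10.0]

measures = influence_measures(xs, ys)
worst = max(range(len(xs)), key=lambda i: measures[i][1])

for i, (leverage, cooks) in enumerate(measures):
    flag = "  <- largest Cook's Distance" if i == worst else ""
    print(f"obs {i:2d}: leverage={leverage:.3f}  cooks={cooks:8.3f}{flag}")
```

The planted point sits far out on *x* (high leverage) and far off the line, so it dominates the Cook’s Distance column. What counts as "too influential" is a convention and a judgment call, which is why I would rather read the author’s own sentence about it than a bare number.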

(Here’s a recent post I wrote about the frustration of researchers failing to speak about a potential outlier.)

*****

Regression is just one part of a large number of techniques and applications that are happening in data journalism right now. But essentially any statistical technique is going to involve checking assumptions and diagnostic measures. A typical ANOVA, for example, the categorical equivalent of regression, will involve checking some of the same assumptions. In the era of the internet, there is no reason not to provide a link to a brief, simple rundown of what quality controls were pursued in your analysis.

None of these things are foolproof. Sums of squares are spooky things; we get weird results as we add and remove predictors from our models. Individual predictors are strongly significant by themselves but not when added together; models are significant with no individual predictors significant; individual predictors are highly significant without model significance; the order you put your predictors in changes everything; and so on. It’s fascinating and complicated. We’re always at the mercy of how responsible and careful researchers are. But by sharing information, we raise the odds that what we’re looking at is a real effect.

This might all sound like an impossibly high bar to clear. There are so many ways things can go wrong. And it’s true that, in general, I worry that people today are too credulous towards statistical arguments, which are often advanced without sufficient qualifications. There are some questions that statistics certainly cannot answer. But there is a lot we can and do know. We know that age is highly predictive of height in children but not in adults; we know that there is a relationship between SAT scores and freshman year GPA; we know point differential is a better predictor of future win-loss record than past win-loss record. We can learn lots of things, but we always do it better together. So I think that data journalists should share their work to a greater degree than they do now. That requires a certain compromise. After all, it’s scary to have tons of strangers looking over your shoulder. So I propose that we get more skeptical and critical of statistical arguments as a media and readership, but more forgiving of individual researchers who are, after all, only human. That strikes me as a good bargain.

And one I’m willing to make myself, as I’m opening up my comments here so that you all can point out the mistakes I’ve inevitably made.

A reasonable quick-and-dirty test here is to simulate a bunch of quantile-quantile plots from actual normal distributions with the same sample size, and see how easy it is to pick out the QQ plot of your data from among the simulations. (This works for most other kinds of plots, also: if you’re unsure about whether the plot indicates a bad fit, just compare a plot of actual data with a plot of simulated data from your fitted model.)
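A minimal sketch of that lineup idea, without the actual plotting: score how straight each QQ plot is, then see where the real data ranks among simulated normal samples of the same size. The "real" data here is an invented skewed sample, the lineup size of 19 is arbitrary, and `qq_straightness` is a hypothetical stand-in for looking at the plots by eye.

```python
import math
import random
import statistics

def qq_straightness(values):
    """Correlation between sorted values and normal quantiles (1.0 = straight)."""
    n = len(values)
    normal = statistics.NormalDist()
    theo = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    obs = sorted(values)
    t_bar, o_bar = statistics.fmean(theo), statistics.fmean(obs)
    cov = sum((t - t_bar) * (o - o_bar) for t, o in zip(theo, obs))
    var_t = sum((t - t_bar) ** 2 for t in theo)
    var_o = sum((o - o_bar) ** 2 for o in obs)
    return cov / math.sqrt(var_t * var_o)

rng = random.Random(4)  # fixed seed for reproducibility
n = 100
my_data = [rng.lognormvariate(0, 1) for _ in range(n)]  # hypothetical skewed data

# Nineteen simulated samples from a true normal, same size as the data.
lineup = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(19)]

data_score = qq_straightness(my_data)
sim_scores = [qq_straightness(sample) for sample in lineup]

# Rank 1 means the real data has the least straight QQ plot in the lineup.
rank = 1 + sum(score < data_score for score in sim_scores)
print(f"data: {data_score:.3f}, simulated range: "
      f"{min(sim_scores):.3f} to {max(sim_scores):.3f}, rank {rank} of 20")
```

If the real data is easy to pick out of the lineup, the departure from normality is probably worth worrying about; if it blends in, the wobbles in its plot are the kind that sampling noise alone produces at that sample size.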

Great suggestions, thanks for them.

I support every suggestion in this piece, and I’d extend the recommendations beyond “data journalism”. It’s actually pretty surprising how lax many academic journals are with this stuff. (Here’s an interesting look at what might come to light if statistical analyses were scrutinized more closely.) The problem is that really checking someone’s analysis–to the point of going through the code they used–is a ton of work, and spending a ton of time on this stuff is a fairly low priority of most people who review articles. Especially since they’re not paid for it.

Freddie,

It’s very easy to do very robust regression when you have heteroscedastic data. It’s not even a particularly advanced technique – in the simplest case, weighted least squares works fine. You can also estimate confidence intervals quite easily even in the case of non-uniform variance.
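As a rough illustration of the weighted least squares idea for a single predictor: each point is weighted by the inverse of its noise variance, so the noisy points count for less. The sketch assumes we somehow know the noise standard deviation is proportional to *x*, which is true of this simulated data only because I built it that way; with real data the weights have to be estimated or justified. `weighted_fit` is a hypothetical helper name.

```python
import random

def weighted_fit(xs, ys, weights):
    """Weighted least squares with one predictor: returns (intercept, slope)."""
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

rng = random.Random(5)  # fixed seed for reproducibility
xs = [rng.uniform(1, 10) for _ in range(400)]
# True line: y = 2.0 + 0.5x, with noise whose spread grows in proportion to x.
ys = [2.0 + 0.5 * x + rng.gauss(0, 0.3 * x) for x in xs]

# Assumed-known variance structure: variance proportional to x squared.
weights = [1 / x ** 2 for x in xs]

wls_intercept, wls_slope = weighted_fit(xs, ys, weights)
# Equal weights reduce to ordinary least squares, for comparison.
ols_intercept, ols_slope = weighted_fit(xs, ys, [1.0] * len(xs))

print(f"WLS: intercept={wls_intercept:.2f}, slope={wls_slope:.2f}")
print(f"OLS: intercept={ols_intercept:.2f}, slope={ols_slope:.2f}")
```

Both fits aim at the same line; the weighting mostly buys more trustworthy uncertainty estimates and less sensitivity to the noisy end of the data.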

Aside from that, if you are discussing linear regression, you really should reference http://arxiv.org/abs/1008.4686 (I *highly* recommend it) which is a basic discussion of all the considerations you should take into account when fitting a simple straight line to data. It also proposes a very simple mechanism to take outliers into account by marginalizing over them.

Again, none of this is even remotely close to cutting edge techniques used by real data analysts. The problem is that data journalism is not done by PhD statisticians, but, if we are lucky, by well meaning journalists that have a little experience using R or Stata.

Of course it’s not cutting edge – I suggest it all precisely because it’s simple stuff we can reasonably expect to crowd source to dedicated amateur readers.

This isn’t really a criticism, but you know about heteroskedasticity-robust standard errors, right? That usually solves for the problem of heteroskedasticity if the variance of the error term IS heteroskedastic. Robust standard errors also work with homoskedastic errors, so for the purposes of data journalism (which I assume is not quite as rigorous as say, an economics journal article for publishing), one should always use heteroskedasticity-robust standard errors (at least for cross-sectional data).

Also, the biggest issue which you missed but I’m sure you know about is omitted variable bias (economics would be easy if omitted variables weren’t a thing).

Finally, you might be interested in checking out Tobit (not perfect by any means), which is meant to deal with the issue of selection bias.

Overall, good article. I don’t really agree with you on much except for civil liberties, but you’re always a pleasure to read.

I’m putting up a conference working paper tonight and following your recommendations in the appendix. Seems like a reasonable idea.