1. Statistical error is inevitable and necessary. Error, in its everyday usage, pretty much always means that something has gone wrong. But statistical error refers instead to the natural variation in a distribution and the consequences of that variation. If I take a sample of some quantifiable variable, like height, and compute an average or other measure, there is always going to be some difference between the average calculated from the sample and the true average of the population. (If we use inferential statistics to try to predict a particular measure, the difference between that prediction and an observation is called a residual.) Error is OK! If we have an adequate sample size, an appropriate and genuinely random sampling mechanism, and we use appropriate calculations of variance and error, we can often reach responsible quantitative conclusions that we can express with great confidence. Not always! But often.
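To make that gap concrete, here is a small simulation sketch in Python. The population size and the height numbers (mean 68 inches, standard deviation 3) are invented for illustration:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# A hypothetical "population" of 100,000 heights in inches.
population = [random.gauss(68, 3) for _ in range(100_000)]
true_mean = statistics.mean(population)

# A genuinely random sample of 100 people from that population.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

# The gap between the two is sampling error -- expected, not a mistake.
error = sample_mean - true_mean
print(f"true mean: {true_mean:.2f}, sample mean: {sample_mean:.2f}, error: {error:+.2f}")
```

Run it a few times with different seeds and the error bounces around, but it stays small relative to the spread of the data; that predictable bouncing is exactly what standard error quantifies.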
2. Bias is neither. When people hear “error,” what they often naturally think of is bias. Bias refers to systematic problems of data collection or sampling that cause discrepancies in descriptive and inferential statistics, discrepancies that cannot be accounted for with measurements of error or variance. If I decide to measure the average height of Purdue students, but I only sample from players on the basketball team, that’s bias. And bias is a big problem.
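A simulation makes the difference from ordinary error obvious. All of the numbers below (student body size, team size, heights) are made up for illustration:

```python
import random
import statistics

random.seed(1)

# Hypothetical student body: most students around 68 inches, plus a small
# tall subgroup standing in for the basketball team.
students = [random.gauss(68, 3) for _ in range(9_950)]
team = [random.gauss(78, 2) for _ in range(50)]
everyone = students + team

true_mean = statistics.mean(everyone)

# A biased "sample": only the team. The sample is large and precisely
# measured, yet no amount of error math will recover the true average.
biased_mean = statistics.mean(team)

print(f"true mean: {true_mean:.1f}, biased sample mean: {biased_mean:.1f}")
```

Note that the biased estimate is off by around ten inches, and taking a bigger sample of the team would not fix it; that is what makes bias systematic rather than random.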
3. Not everything is normally distributed, but averages (almost) always are. The normal distribution — the bell curve — is at once fundamental to statistics and not always easy to grasp. I can’t do a good job of explaining it all here. I do want to say, though, that there is a kind of misunderstanding about the normal distribution that’s easy to fall into. Lots of times people ask why things have to be normally distributed: why would nature require things to be distributed that way? But the normal distribution is less a product of nature and more a product of our conceptions of big, small, and average. First, some things aren’t normally distributed. If you sampled how much people had been paid out in car insurance reimbursements in the last year, for example, you’d likely find the biggest clump at zero, then a gap around common deductible sizes, then clusters of payments past those cutoffs.
But many, many things are normally distributed, and generally, the more variables contribute to a given quantitative result, the more likely the distribution is to be normal. Think about it this way. Think of the times you’ve seen someone walking down the street who was unusually tall or unusually short. What makes that height unusual? Why did you take notice? Because extremes are rare. If height were evenly distributed across the spectrum, you’d be no more surprised to see a 7-foot-tall person than you would be to see a 5’8 person. What the normal distribution says is that, for normally distributed variables, you’re going to find very few at the extremes and a large clump near the average, and further that a predictable portion of the distribution falls within defined distances of that average. (Defined, that is, by the standard deviation.)
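Those "defined distances" are the familiar 68–95–99.7 rule: roughly 68% of a normal distribution sits within one standard deviation of the mean, 95% within two, 99.7% within three. A quick check, using the same invented height numbers as before:

```python
import random

random.seed(2)

mu, sigma = 68, 3  # hypothetical mean and standard deviation for height
heights = [random.gauss(mu, sigma) for _ in range(100_000)]

# Fraction of the simulated distribution within 1, 2, and 3 standard deviations.
fractions = []
for k in (1, 2, 3):
    within = sum(mu - k * sigma <= h <= mu + k * sigma for h in heights) / len(heights)
    fractions.append(within)
    print(f"within {k} sd: {within:.3f}")
```

The printed fractions land very close to 0.683, 0.954, and 0.997, matching the rule.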
Why would height be normally distributed? Well, think of all the various factors that contribute to height: several different genes, nutrition, childhood health, random chance. The odds of all of them breaking in the direction of being short or tall are very low. Suppose for the sake of example that 10 genes contribute to height. (Totally making that number up.) If there’s a 50/50 chance that any one gene is expressed in the shorter or taller way, we could easily calculate the odds that all 10 ended up being short or tall, and those odds would be quite low. Instead, most people would get some short genes, some tall genes, and wind up near the middle. That doesn’t make very short or tall people impossible. With billions of repetitions, you get extremes. That’s how we get Yao Ming. But Yao is what we call a “three sigma” outlier, that is, he’s three standard deviations or more away from the average. That makes him roughly a one-in-740 case on the tall side, way rarer than one in a hundred.
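The 10-gene toy model from the text (again, the number 10 is made up) is just a binomial distribution, and it can be simulated in a few lines:

```python
import random
from collections import Counter

random.seed(3)

# Toy model: 10 hypothetical "height genes", each an independent 50/50
# coin flip toward tall. A person's score is how many break tall.
def tall_genes():
    return sum(random.random() < 0.5 for _ in range(10))

scores = [tall_genes() for _ in range(100_000)]
counts = Counter(scores)

# All-tall (10) has probability 0.5**10, about 1 in 1024; the middle
# score (5) can happen 252 different ways and so dominates.
print(f"all ten tall: {counts[10] / len(scores):.4f} (theory: {0.5**10:.4f})")
print(f"middle (5):   {counts[5] / len(scores):.4f} (theory: {252 * 0.5**10:.4f})")
```

There is only one way for all ten flips to break tall, but 252 ways to get exactly five, which is why the middle clumps and the extremes are rare: the bell shape falls out of the counting.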
Even things that are not normally distributed themselves, however, have normally distributed averages. That is, if I took a sample of 100 Americans and measured them on a non-normally distributed variable, those 100 measurements wouldn’t be normally distributed. But if I took another 100 Americans, measured them, and noted the average, and then another hundred and another hundred, and laid those averages out on a distribution, the averages themselves would be normally distributed. Think about what an average does: it pulls in the extremes. So you get a clumping effect that makes averages fall into a bell curve even when the underlying distributions don’t. That ends up being hugely important for inferential statistics like regression.
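This is the central limit theorem, and it is easy to watch happen. Below, the underlying variable is deliberately non-normal (an exponential distribution, heavily skewed to the right, chosen here just as a stand-in for any skewed quantity):

```python
import random
import statistics

random.seed(4)

# A decidedly non-normal variable: exponential with mean 1, all positive,
# with a long right tail.
def skewed_draw():
    return random.expovariate(1.0)

# Take 5,000 separate samples of 100, recording each sample's average.
averages = [statistics.mean(skewed_draw() for _ in range(100)) for _ in range(5_000)]

# The averages clump symmetrically near 1, even though the raw draws don't.
mean_of_averages = statistics.mean(averages)
sd_of_averages = statistics.stdev(averages)
print(f"mean of sample means: {mean_of_averages:.3f}")
print(f"sd of sample means:   {sd_of_averages:.3f} (theory: {1 / 100**0.5:.3f})")
```

The spread of those averages comes out close to the theoretical standard error of the mean, σ/√n = 1/√100 = 0.1, which previews point 6 below.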
4. Averages often need standard deviations to be understood. We use averages all the time by themselves, and they can be useful and necessary. But frequently, we need to combine them with measures of spread to really understand them. An average (arithmetic mean) is a measure of central location. That means we’re taking a distribution and trying to represent it as a single data point, which can be very misleading. Suppose you own a restaurant and you want to do a customer satisfaction survey. You use a scale from 0-10. What if the average is 5? What should you do? Well, it depends. If everybody is grading your restaurant around a 5, you know you’ve got a consistently mediocre establishment, and you’ll want to improve across the board. But if the votes are all clustered around the extremes of the spectrum, with lots of 0s and 10s, you’ll also get an average near 5. But how you interpret that result would be completely different. You might have a great night waitstaff but a terrible day staff, or you might be getting bad produce and meat on a certain day of the week. The average can’t help you by itself.
The standard deviation, a measure of spread, can help make the average more meaningful. The standard deviation is derived from the variance and is calculated so that deviations on either side of the average don’t cancel each other out — so that a high and a low difference from the average don’t combine to show up as zero difference — and so that it can be easily compared to the mean. So in the above example, the first situation might have a standard deviation of 1. You would know, in other words, that about 68% of the survey respondents rated the restaurant between 4 and 6. On the other hand, the second situation’s standard deviation would be closer to 5, telling you that the average couldn’t really be trusted on its own.
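Here are two hypothetical sets of survey scores, invented to match the restaurant scenario, with identical averages but very different spreads:

```python
import statistics

# Two made-up sets of 0-10 satisfaction scores with the same average.
consistent = [4, 5, 5, 5, 6, 5, 4, 6, 5, 5]      # everyone near 5
polarized = [0, 10, 0, 10, 1, 9, 0, 10, 1, 9]    # love-it-or-hate-it

mean_c, mean_p = statistics.mean(consistent), statistics.mean(polarized)
sd_c, sd_p = statistics.stdev(consistent), statistics.stdev(polarized)

print(f"means: {mean_c} vs {mean_p}")                       # both exactly 5
print(f"standard deviations: {sd_c:.2f} vs {sd_p:.2f}")     # small vs huge
```

Reporting "average satisfaction: 5" describes both restaurants identically; the standard deviations (about 0.67 versus nearly 5) are what tell you the two situations demand completely different responses.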
5. The proportional relationship between a sample and its population is irrelevant to standard error. This one is a mind-bender, but it’s true. Given a reasonable definition of a sample, and given that the sample is genuinely random, a sample of 100 is equally accurate whether the population is 100,000 or 100,000,000. The calculations of standard error are exactly the same. A 100-person sample is just as accurate a predictor for the total population of Fargo as for the total population of North Dakota, the United States, or the world. (Again, provided that the sample is a truly random sample from each of those populations.)
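You can check this by simulation. Below, two hypothetical populations differ in size by a factor of 100, but a 100-person sample from each gives essentially the same standard error, because the formula s/√n never mentions the population size:

```python
import math
import random
import statistics

random.seed(5)

def standard_error(sample):
    # Standard error of the mean: sample standard deviation over sqrt(n).
    # Note that the population size appears nowhere in this formula.
    return statistics.stdev(sample) / math.sqrt(len(sample))

ses = []
for pop_size in (10_000, 1_000_000):
    # Same made-up height distribution, wildly different population sizes.
    population = [random.gauss(68, 3) for _ in range(pop_size)]
    sample = random.sample(population, 100)
    se = standard_error(sample)
    ses.append(se)
    print(f"population {pop_size:>9,}: sample of 100 -> standard error {se:.2f}")
```

Both runs land near σ/√n = 3/10 = 0.3. What matters is the absolute size of the sample, not the fraction of the population it covers.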
That’s not true if your sample is a significant percentage of the total population (statisticians handle that case with a finite population correction), but then it’s hardly a sample in the usual sense, and in most cases you could never get a sample that’s a significant percentage of the population of interest anyway.
6. The value of adding more observations to your sample diminishes quickly. In the formula for calculating a standard error (which is what gets you those polling margins of error you see every election season), the n sits under a square root sign. That means error shrinks with the square root of the sample size, not in proportion to it: to cut your margin of error in half, you have to quadruple your sample. Your first 100 observations do more to shrink your margin of error than the next 900 combined. This is part of the reason you rarely see giant sample sizes in human subject research; the gain in accuracy is just too small compared to the work and cost.
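A sketch of that square-root law, assuming a unit standard deviation for simplicity:

```python
import math

sigma = 1.0  # assume a unit standard deviation for simplicity

# Standard error of the mean at several sample sizes: sigma / sqrt(n).
errors = {}
for n in (100, 400, 1600, 10_000):
    errors[n] = sigma / math.sqrt(n)
    print(f"n = {n:>6}: standard error = {errors[n]:.3f}")

# Quadrupling n (100 -> 400 -> 1600) only halves the error each time.
```

Going from 100 to 400 observations halves the error; getting the same improvement again costs another 1,200 observations. Each successive halving quadruples the required sample, which is why pollsters mostly stop around a thousand or so respondents.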