some stats stuff

1. Statistical error is inevitable and necessary. Error, in its everyday usage, pretty much always means that something has gone wrong. But statistical error refers instead to the natural variation in a distribution and the consequences of that variation. If I take a sample of some quantifiable variable, like height, and calculate an average or some other summary measure, there is always going to be some difference between the average calculated from the sample and the true average of the population. (If we use inferential statistics to try to predict a particular measure, the difference between that prediction and an observation is called a residual.) Error is OK! If we have an adequate sample size, a genuinely random sampling mechanism, and sound calculations of variance and error, we can often reach responsible quantitative conclusions that we can express with great confidence. Not always! But often.
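To make that concrete, here's a minimal R sketch with invented numbers (a made-up population of 10,000 heights): the sample average lands near the true average but not on it, and the standard error estimates the typical size of that gap.

    # a hypothetical population of 10,000 heights, in cm (numbers invented for illustration)
    set.seed(1)
    population <- rnorm(10000, mean = 170, sd = 10)

    # one genuinely random sample of 100 people
    s <- sample(population, 100)

    mean(population)          # the "true" average we usually can't observe
    mean(s)                   # the sample average: close, but not identical
    sd(s) / sqrt(length(s))   # the standard error: roughly how far off we expect to be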

2. Bias is neither. When people hear “error,” what they often naturally think of is bias. Bias refers to systematic problems in data collection or sampling that cause discrepancies in descriptive and inferential statistics which cannot be accounted for with measurements of error or variance. If I decide to measure the average height of Purdue students, but I only sample from players on the basketball team, that’s bias. And bias is a big problem.
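A quick sketch of the difference, in R with invented numbers (say the campus average is 175 cm but the basketball team runs around 198 cm): a random sample is off by a little, and more data fixes it; a biased sample is off by a lot, and more data doesn't help.

    set.seed(2)
    campus     <- rnorm(40000, mean = 175, sd = 9)   # hypothetical student body
    basketball <- rnorm(500,   mean = 198, sd = 7)   # hypothetical pool of basketball players

    mean(campus)                  # the number we actually want
    mean(sample(campus, 100))     # random sample: off by a centimeter or so (error)
    mean(sample(basketball, 100)) # biased sample: off by 20+ cm no matter the sample size (bias)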

3. Not everything is normally distributed, but averages (almost) always are. The normal distribution, the bell curve, is at once fundamental to statistics and not always easy to grasp. I can’t do a good job of explaining it all here. I do want to say, though, that there is a kind of misunderstanding about the normal distribution that’s easy to fall into. Lots of times people ask why things have to be normally distributed: why would nature require things to be distributed that way? But the normal distribution is less a product of nature and more a product of our conceptions of big, small, and average. First, some things aren’t normally distributed. If you sampled how much people had been paid out in car insurance reimbursements in the last year, for example, you’d likely find that the biggest clump would sit at zero, then a gap up to the size of common deductibles, then clusters of payments past those cutoffs.

But many, many things are normally distributed, and generally, the more independent factors contribute to a given quantitative result, the more likely the distribution is to be normal. Think about it this way: think of the times you’ve seen someone walking down the street who was unusually tall or unusually short. What makes that height unusual? Why did you take notice? Because extremes are rare. If height were evenly distributed across the spectrum, you’d be no more surprised to see a 7-foot-tall person than a 5’8” person. What the normal distribution says is that, for normally distributed variables, you’ll find very few cases at the extremes and a large clump near the average, and further that a predictable portion of the distribution falls within defined distances of that average. (Defined, that is, in units of the standard deviation.)

Why would height be normally distributed? Well, think of all the various factors that contribute to height: several different genes, nutrition, childhood health, random chance. The odds of all of them breaking in the direction of being short or tall are very low. Suppose for the sake of example that 10 genes contribute to height. (Totally making that number up.) If there’s a 50/50 chance that any one gene is expressed in the shorter or taller way, we could easily calculate the odds that all 10 ended up short or all 10 ended up tall, and those odds would be quite low. Instead, most people would get some short genes and some tall genes and wind up near the middle. That doesn’t make very short or very tall people impossible. With billions of repetitions, you get extremes. That’s how we get Yao Ming. But Yao is what we call a “three sigma” outlier: he’s three standard deviations or more away from the average. That means he’s way rarer than one in a hundred.
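To put rough numbers on that made-up 10-gene example: the chance that all 10 genes break the same way is 2 × (1/2)^10, about 1 in 512, while the bulk of outcomes pile up near 5-of-10. A minimal R sketch (invented setup, just coin flips standing in for genes):

    set.seed(3)
    # 100,000 hypothetical people, each with 10 fifty-fifty "tall genes"
    tall_genes <- rbinom(100000, size = 10, prob = 0.5)

    table(tall_genes)                 # piles up around 5, thins out toward 0 and 10
    mean(tall_genes %in% c(0, 10))    # roughly 2/1024, i.e. about 0.002

And for scale on the Yao Ming point: only about 0.13% of a normal distribution sits three or more standard deviations above the mean (pnorm(-3) in R is about 0.00135), which works out to roughly 1 in 740.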

Even things that are not normally distributed themselves, however, have (approximately) normally distributed averages; this is the central limit theorem. Meaning that if I took a sample of 100 Americans and measured them on a non-normally distributed variable, those measurements wouldn’t be normally distributed. But if I took another 100 Americans, measured them, and noted the average, and then another hundred, and another hundred, and laid all of those averages out on a distribution, the averages themselves would be (approximately) normally distributed. Think about what an average does: it pulls in the extremes. So you get a clumping effect that makes averages fall into a bell curve even when the underlying distribution doesn’t. That ends up being hugely important for inferential statistics like regression.
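A minimal sketch of that clumping effect in R, using a deliberately skewed made-up variable rather than any real data:

    set.seed(4)
    skewed <- rexp(100000, rate = 1)   # a decidedly non-normal "population": piled up near zero

    hist(skewed)                                              # lopsided, with a long right tail
    sample_means <- replicate(5000, mean(sample(skewed, 100)))
    hist(sample_means)                                        # a neat bell centered on the true mean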

4. Averages often need standard deviations to be understood. We use averages by themselves all the time, and they can be useful and necessary. But frequently we need to combine them with measures of spread to really understand them. An average (arithmetic mean) is a measure of central location: we’re taking a whole distribution and trying to represent it as a single number. That can be very misleading. Suppose you own a restaurant and you run a customer satisfaction survey on a scale from 0 to 10. What if the average is 5? What should you do? Well, it depends. If everybody is grading your restaurant around a 5, you’ve got a consistently mediocre establishment, and you’ll want to improve across the board. But if the votes are clustered at the extremes of the scale, with lots of 0s and 10s, you’ll also get an average near 5, and how you interpret that result would be completely different. You might have a great night waitstaff but a terrible day staff, or you might be getting bad produce and meat on a certain day of the week. The average can’t help you by itself.

The standard deviation, a measure of spread, can make the average more meaningful. The standard deviation is a measure of variation calculated in such a way that deviations on either side of the average don’t cancel each other out (a high score and a low score don’t combine to look like zero difference) and so that it comes out in the same units as the data, which makes it easy to compare to the mean. So in the example above, the first situation might have a standard deviation of 1. You would know, in other words, that roughly 68% of the survey respondents rated the restaurant between 4 and 6 (assuming the ratings are roughly bell-shaped). In the second situation, the standard deviation would be closer to 5, telling you that the average couldn’t really be trusted on its own.
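Here's the restaurant example as a tiny R sketch, with two invented sets of ten survey scores that share an average of 5:

    consistent <- c(3, 5, 5, 6, 5, 4, 6, 5, 6, 5)      # everyone thinks you're mediocre
    polarized  <- c(0, 10, 0, 10, 0, 10, 0, 10, 5, 5)  # people love you or hate you

    mean(consistent); sd(consistent)   # mean 5, sd just under 1
    mean(polarized);  sd(polarized)    # mean 5, sd around 4.7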

5. The proportional relationship between a sample and its population is irrelevant to standard error. This one is a mind-bender, but it’s true. Given a reasonable definition of a sample, and given that the sample is genuinely random, a sample of 100 gives you the same accuracy whether the population is 100,000 or 100,000,000. The calculations of standard error are exactly the same. A 100-person sample is an equally accurate predictor whether it’s drawn from the population of Fargo, of North Dakota, of the United States, or of the world. (Again, provided it’s a truly random sample from each of those populations.)

That stops being true if your sample is a significant percentage of the total population, but at that point you’re closer to a census than a sample, and in most cases you could never get a sample that large relative to the population of interest anyway.
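You can see this right in the formula: the usual standard error of a mean is just the sample standard deviation over the square root of the sample size, SE = s / √n, and the population size never appears. A sketch with two invented populations of very different sizes:

    se <- function(x) sd(x) / sqrt(length(x))   # note: no population size anywhere in here

    set.seed(5)
    fargo <- rnorm(125000,  mean = 170, sd = 10)   # invented "heights" for a small city
    world <- rnorm(1000000, mean = 170, sd = 10)   # a stand-in for a vastly larger population

    se(sample(fargo, 100))   # roughly 1
    se(sample(world, 100))   # also roughly 1

(The exception mentioned just above has a name, the finite population correction, and it only matters when the sample is a meaningful slice of the population.)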

6. The value of adding more observations to your sample diminishes quickly. In the formula for calculating a standard error (which is what gets you those polling margins of error you see every election season), the sample size is placed under a square root sign, which means that error shrinks with the square root of the sample size: to cut your margin of error in half, you need four times as many observations. For this reason, your first 100 observations do more to shrink your margin of error than the next 900 combined. This is part of the reason you rarely see giant sample sizes in human subjects research; the gain in accuracy is just too small relative to the work and cost.
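Concretely, using the standard 95% margin-of-error arithmetic for a polling proportion (1.96 × √(p(1−p)/n), taking the worst case p = 0.5):

    moe <- function(n, p = 0.5) 1.96 * sqrt(p * (1 - p) / n)   # 95% margin of error for a proportion

    round(100 * moe(c(100, 400, 1000, 10000)), 1)
    # about 9.8, 4.9, 3.1, 1.0 percentage points: each step in precision costs far more respondents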


9 Responses to some stats stuff

  1. 2) Good observation about bias. Not necessarily a great word.

    3) I think the statement needs to be a little more restricted. http://en.wikipedia.org/wiki/Sampling_distribution#Examples

    Everything taught in intro stats class is i.i.d. But it takes work to get i.i.d. samples in real life. Viz, the famous story of Literary Digest versus George Gallup.

    4) Excellent point. And maybe skew as well (or simply up-spread / down-spread).

    My go-to examples of this would be [1] race times for men -v- women, [2] test scores for boys -v- girls, [3] test scores for Blacks -v- Whites.

    • 6) you are absolutely right as far as i.i.d. goes, again. It takes a lot more work to add another decimal of precision, which is why Gallup uses n=1000. (cost/benefit tradeoff)

      BUT — think Wisdom of Crowds here — if you MISSED some area of the sample space (think: did you sample the Hmong? the gay men? the billionaires? the trans people? — anyone who doesn’t fit the “typicalstan”), then no amount of extra n will fix what you never sampled.

      - remember, in checking out AIDS, the government statisticians weren’t asking gay San Francisco boys how many partners they had a year. It was >100, whereas the rest of the population was around 1 or 2. (But not normal! Bunched at the left, both because you can’t go less than zero and because religiosity / ugliness / love / whatever factors are making people monogamous or virginal / celibate / not getting laid aren’t necessarily gaussian.)

      - remember, in asking the prostitutes how many clients they had a day, the statisticians (who came from another country and didn’t know the language/culture/common sense) were sampling the ugly ones, or at least the girls who were taking fewer clients. (The busy ones weren’t available to be counted. Correlated errors / collinearity.)

      I saw that story on Nate Silver’s site or Tyler Cowen’s or something.

  2. Also: this may interest you so I’ll just drop it off here.

    I no longer think about the normal as a Fat Bell Curve, or as “the ubiquitous distribution”. (It’s only ubiquitous if you’re looking for it everywhere. Try looking for Poissons and you’ll also see them everywhere. Try looking for linear dynamical systems and you’ll also see them everywhere. It’s true of any intellectual tool until you get enough tools.)

    Instead I think of it as “the default splat”.

    • It goes down really fast: exp(−x²) ≈ 1/2^(x²) = 1/1, 1/2, 1/16, 1/512, 1/65536, 1/33554432, 1/68719476736 (= “six sigma”). This doesn’t fit the “fat bell curve” pictures one often sees. Try plot( dnorm, -50, 50, lwd=3 ) instead of plotting from −3 to 3. We really are “picking out a point” (the mean), except just hedging ourselves a bit. Like fat-fingering.
    • It’s the “unit” of the Fourier transform ℱ(normal)=normal. So it’s “perfectly balanced” in some sense.
    • I now think of the Gaussian, instead of imagining a bunch of observations of some variable and hoping they’re gaussian, as using a Gaussian smoother on e.g. a time series or a photo image. (You can try this in Pinta or on Google Finance; the exponential moving average is, I believe, the Gaussian kernel.)

    By the way, in reality nothing is Gaussian, and that’s provable, since no actually existing variable ever attains −∞ as a value. More prosaically, look at require(nlme); data(Oxboys); plot( density( Oxboys$height ), lwd=3 ), or you’ll see it if you google my post “How not to draw a probability distribution”.

    Gaussians are a useful tool, not reality. You can prove that certain sampling distributions converge to Gaussian, and importantly that the convolution of many uniforms converges to Gaussian. You can also prove that you can build certain stuff up with Gaussians. But mostly it’s just a simple item which is useful for theory (e.g. building up covariance matrices) when there’s no reason, or no tools to reason with, for the more complicated things. (Remember, all of its cumulants above the second are zero, for example!)

  3. Michael Guarino says:

    I hate to be pedantic, but this is the sort of post that encourages it. For (6), s·n^(−0.5) does not decrease exponentially (s being the standard deviation). s·c^(−n) does. The point is still good though, just not as strong.

    • Freddie says:

      Ah damn. I will amend the post when I get to a real keyboard. Thanks!

      • Michael Guarino says:

        Actually, now that I think about it more, I was not quite right either! The parallel would be s/ln(n). That would require an exponential increase in n to get a unit increase in the denominator. s·n^(−0.5) requires a quadratic increase in n.

  4. Simeon says:

    This is all very good for platonists. But what of the nihilists who do Bayesian statistics?

    • Freddie says:

      Sadly, I lack the calculus chops to speak intelligently about Bayesian statistics.

      • Freddie, it requires no calculus. The Bayesian identity comes from Venn diagrams.

        P(A|B) is the fraction of B that is taken up by A∩B.

        You can work out Bayes’ theorem from there by trying to switch the order to P(B|A).
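        Spelled out (just the arithmetic of that switch): P(A|B) = P(A∩B)/P(B) and P(B|A) = P(A∩B)/P(A). Both contain the same P(A∩B), so P(A|B)·P(B) = P(B|A)·P(A), which rearranges to Bayes’ theorem: P(A|B) = P(B|A)·P(A)/P(B).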
