I recently came across an old book about statistics that is as pertinent today as it was when it was written about 65 years ago: How to Lie with Statistics. The book opens with a quote that is often mistakenly attributed to Mark Twain. The book credits it to the British statesman and author Benjamin Disraeli: “There are three kinds of lies: lies, damned lies, and statistics.”
The book delves into the inherent biases in surveys and survey taking. It points out that drawing a truly random sample is particularly expensive.
In the phrase “random sample,” random means that each person in the population is equally likely to be chosen. This is rarely the case. Virtually any method of sampling people has at least some inherent bias.
Suppose, for example, that you have the phone numbers of the entire population you intend to study. If you call a sample of them at random and ask them to take a survey, not every one of them will agree. For this reason, you will move on to the next “random” person on the list until you have enough respondents.
At this point, the members of the population are no longer equally likely to be chosen. Each person who declines has a zero percent chance of ending up in the sample, and everyone else has a slightly higher chance than before. Usually, this small increase in probability is not enough to bias the sample significantly, but it does add a little bias, and the bias grows if the people who decline differ systematically from those who agree.
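To put rough numbers on this, here is a minimal Python sketch. It assumes that refusers always decline and that calling continues until the quota is filled, so the final sample is effectively a simple random sample of the willing people. The function name and the population figures are made up for illustration and do not come from the book.

```python
from fractions import Fraction


def inclusion_probability(population_size: int, sample_size: int, refusers: int = 0) -> Fraction:
    """Chance that one willing person ends up in the final sample.

    Assumption: refusers always decline and are replaced by another
    randomly chosen person until `sample_size` people have answered.
    """
    willing = population_size - refusers
    if sample_size > willing:
        raise ValueError("not enough willing people to fill the sample")
    return Fraction(sample_size, willing)


# Hypothetical survey: 1,000 phone numbers, 100 completed surveys.
before = inclusion_probability(1000, 100)
after = inclusion_probability(1000, 100, refusers=1)
print(before, after)  # 1/10 100/999
```

With a single refuser, everyone else's chance rises only from 1/10 to 100/999. Passing a larger `refusers` value shows how quickly the gap widens.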
The book also has an entire chapter about how averages lie. Averages can be among the most misleading of statistics. Depending on the type of average taken, you can get completely different results.
For natural phenomena that follow a normal distribution, such as the height of people, the differences among the mean, median, and mode are not significant.
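As a quick check of that claim, the sketch below uses Python's standard statistics module. The height parameters (170 cm mean, 10 cm standard deviation) and the seed are assumptions chosen only for illustration.

```python
from statistics import NormalDist, mean, median

# Hypothetical adult heights in centimetres; 170 and 10 are made-up parameters.
heights = NormalDist(mu=170, sigma=10)

# For the theoretical normal distribution, the three averages coincide.
print(heights.mean, heights.median, heights.mode)  # 170.0 170.0 170.0

# A finite random sample will not match exactly, but its mean and median stay close.
sample = heights.samples(10_000, seed=42)
sample_mean = mean(sample)
sample_median = median(sample)
print(round(sample_mean, 1), round(sample_median, 1))
```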
Unless you know that the population is normally distributed, an average is meaningful only if its type is stated.
For example, if a report states an average income but does not specify which type of average it is, the three types of averages could be vastly different.
Say, for example, your sample happens to include Warren Buffett, one of the richest people in the world. His income from investments would be so high that it would completely skew the mean. It would be what I would call a black hole outlier: a value so extreme that it pulls the mean far above what it would otherwise have been.
In such a case, the more meaningful type of average would be the median. Half of the sample would have an income that is higher than the median, while the other half would be lower than the median.
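A small sketch shows how a single extreme value moves the mean while barely touching the median. All of the incomes here, including the billion-dollar one, are invented for illustration and are not figures from the book.

```python
from statistics import mean, median

# Made-up annual incomes for five ordinary earners.
incomes = [30_000, 40_000, 50_000, 60_000, 70_000]
print(mean(incomes), median(incomes))  # 50000 50000

# Add one extreme, hypothetical income of one billion dollars.
with_outlier = incomes + [1_000_000_000]
print(round(mean(with_outlier)), median(with_outlier))  # 166708333 55000.0
```

The mean jumps from 50,000 to more than 166 million, which describes nobody in the sample, while the median only moves to 55,000.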
The mode, the most often repeated number, might not be particularly instructive, because several different incomes could, in theory, occur the same number of times at any point in the sample’s range. So the sample could have several modes that are not necessarily related to the center of the sample’s range.
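The following sketch illustrates that point with `statistics.multimode` (available in Python 3.8 and later). The incomes are invented so that two values tie for most frequent, one near each end of the range.

```python
from statistics import median, multimode

# Made-up incomes: 20,000 and 150,000 each appear twice, everything else once.
incomes = [20_000, 20_000, 45_000, 60_000, 150_000, 150_000, 400_000]

modes = multimode(incomes)
print(modes)            # [20000, 150000]
print(median(incomes))  # 60000
```

Neither mode is close to the median of 60,000, so reporting "the mode" here would say little about a typical income.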
Another example the book gives of a misleading unspecified average is the average family size. Lots of families could have one or two people, some families could have three or four, and a few families could have six, ten, or even twelve people.
The mean in this case would not be as skewed as it was in the case of income. However, the mean of a sample that includes families with a high number of children might still be significantly higher than the sample’s median.
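To see the milder skew, here is a sketch with 100 hypothetical families. The counts per family size are assumptions that follow the rough shape described above, not data from the book.

```python
from statistics import mean, median

# Hypothetical sample of 100 families: many small ones, a few very large ones.
family_sizes = (
    [1] * 30 + [2] * 30 + [3] * 15 + [4] * 15 + [6] * 5 + [10] * 3 + [12] * 2
)

size_mean = mean(family_sizes)
size_median = median(family_sizes)
print(size_mean, size_median)  # 2.79 2.0
```

The ten large families pull the mean up to 2.79, while the median stays at 2. The gap is far smaller than in the income example, but an unlabeled "average family size" could still mean either number.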
Human Sources of Bias in Surveys
The book gives another survey example that demonstrates a human source of possible bias. A survey was taken of Yale graduates from a class that had graduated 30 years earlier. The survey found that the reported income of the graduates was extremely high.
Some may have reported more income than they actually made to boost their egos or brag.
It is also possible that some of the graduates reported less income than they had to make sure that the IRS did not learn about their higher income bracket.
If both were equally likely, they could cancel each other out. However, over-reporting income is probably more likely.
Additionally, in doing the survey, Yale probably had not kept track of where all of the graduates had lived in the 30 years since graduation. The wealthier and more successful graduates were probably easier to keep track of than the less successful members of the class. Furthermore, the less successful graduates might have been less likely to report where they lived and what they earned.
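The combined effect of missing respondents and misreported incomes can be sketched with a tiny, entirely hypothetical class of eight graduates. The `Graduate` type and every number below are assumptions made up for illustration. They are arranged so that lower earners tend to be missing and over-reporting outweighs under-reporting.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Graduate:
    true_income: int
    reported_income: Optional[int]  # None: could not be located or did not answer


# Hypothetical class: most of the lower earners are missing from the survey.
graduates = [
    Graduate(8_000, None),
    Graduate(10_000, None),
    Graduate(12_000, 12_000),
    Graduate(15_000, None),
    Graduate(20_000, 25_000),   # over-reports
    Graduate(30_000, 30_000),
    Graduate(50_000, 45_000),   # under-reports
    Graduate(80_000, 90_000),   # over-reports
]

true_mean = mean([g.true_income for g in graduates])
survey_mean = mean(
    [g.reported_income for g in graduates if g.reported_income is not None]
)
print(true_mean, survey_mean)  # 28125 40400
```

In this made-up class, the survey's mean is about 44 percent higher than the true mean, even though no single answer is wildly wrong.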
The human factor in surveys makes answers inherently unreliable, especially on sensitive questions.
Later sections of the book focus on lying with graphs, which I will cover at a later date.
Huff, D. and Geis, I. 1954. How to Lie with Statistics. New York: W. W. Norton & Company, Inc.