- Random Numbers and Probability Distributions
- Casino Royale: Roll the Dice
- Normal Distribution
- The Student Who Taught Everyone Else
- Statistical Distributions in Action
- Hypothetically Yours
- The Mean and Kind Differences
- Worked-Out Examples of Hypothesis Testing
- Exercises for Comparison of Means
- Regression for Hypothesis Testing
- Analysis of Variance
- Significantly Correlated
- Summary
Summary
Let me hypothesize in the concluding section of this chapter that you are now at least familiar with the statistical concepts about testing assumptions and hypotheses. The process of stating one’s assumptions and then using statistical methods to test them is at the core of statistical analysis. I would like to conclude this section with a warning or two about the limitations of statistical analysis. As budding data scientists, you may naively assume that the techniques you have learned can be applied to all problems. Such a conclusion would be erroneous.
Recall the story of European settlers who spotted a black swan in Western Australia that immediately contradicted their belief that all swans were white. The settlers could have treated the black swan as an outlier, a data point that is very different from the rest of the observations. They could have ignored this one observation. But that would have been a mistake, because in this particular case, a black swan challenged the existing knowledge base.
Let me explain this with an example of when outliers might be ignored. Assume you are working with housing sales data where the average sale price in the neighborhood is around $450,000. However, you may have a couple or more housing units in the same data set that sold for more than two million dollars each. Given the nature of the housing stock in the neighborhood, you might conclude that a very small number of housing units in the area are much larger in size than the rest of the housing stock and hence have transacted for a larger amount. Because you are interested in forecasting the average price of an average house in the neighborhood, you might declare the very expensive transactions as outliers and exclude those from the analysis.
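The housing example can be sketched in a few lines of Python. The sale prices below are hypothetical, made up only for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers; it is an assumption here, not a method prescribed in this chapter.

```python
from statistics import mean, median, quantiles

# Hypothetical sale prices in USD; the last two stand in for the
# few much larger homes that sold for over two million dollars.
prices = [410_000, 425_000, 440_000, 445_000, 450_000, 455_000,
          460_000, 470_000, 480_000, 2_100_000, 2_400_000]

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences using the k * IQR rule (an assumed convention)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(prices)
typical = [p for p in prices if low <= p <= high]
outliers = [p for p in prices if not (low <= p <= high)]

print(f"Mean of all sales:     {mean(prices):,.0f}")
print(f"Median of all sales:   {median(prices):,.0f}")
print(f"Mean without outliers: {mean(typical):,.0f}")
print(f"Flagged as outliers:   {outliers}")
```

With these made-up numbers, the two expensive sales pull the overall mean far above what a typical house sells for, while the median and the trimmed mean stay close to the bulk of the data. Whether dropping the flagged sales is appropriate still depends on the question being asked, which is the point of the discussion that follows.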
Now let us assume that you were (in a previous life) Charles Darwin’s assistant and were assigned the task of documenting the colors of swans found on the planet. As you landed in Western Australia with the rest of the settlers, you also spotted a black swan. Would you have treated the black swan as an outlier? The answer is emphatically no. Just one out-of-the-ordinary outcome or observation that could not be foreseen based on our prior body of knowledge is not an outlier, but the most important observation to ponder in detail.
Similarly, I would like to draw your attention again to Professor Jon Danielsson’s estimation that an S&P 500 single-day decline of 23% in 1987 would happen once out of every 12 universes. We know that financial market meltdowns of similar proportions happen more frequently than the statistical models would lead us to believe. Our continued reliance on the Gaussian distribution, which we refer to as the Normal distribution, erroneously leads us to believe that natural phenomena can be approximated using the Normal distribution. This erroneous assumption is behind our poor risk perception of natural disasters and overconfidence in financial markets.
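To see why the Normal distribution makes large single-day declines look nearly impossible, consider the rough sketch below. It assumes daily returns are Normal with a mean of zero and a standard deviation of 1%, and it assumes 252 trading days per year. Both figures are illustrative assumptions, not the inputs behind Professor Danielsson's estimate, so the output is not meant to reproduce his number.

```python
import math

ASSUMED_DAILY_SD = 0.01        # hypothetical daily standard deviation of returns
TRADING_DAYS_PER_YEAR = 252    # assumed convention

def normal_lower_tail(k):
    """P(Z <= -k) for a standard Normal Z.

    Uses erfc rather than 1 - cdf so that very small tail
    probabilities do not round to zero in floating point.
    """
    return 0.5 * math.erfc(k / math.sqrt(2))

def expected_years_between(decline, daily_sd=ASSUMED_DAILY_SD):
    """Average wait, in years, for a one-day decline at least this large."""
    k = decline / daily_sd                 # size of the decline in standard deviations
    p = normal_lower_tail(k)               # probability on any single day
    return 1.0 / (p * TRADING_DAYS_PER_YEAR)

for decline in (0.03, 0.05, 0.10, 0.23):
    years = expected_years_between(decline)
    print(f"{decline:.0%} one-day decline: roughly once in {years:.3g} years")
```

Under these assumptions the implied waiting time grows explosively as the decline gets larger, because the Normal tail shrinks faster than exponentially. Observed markets produce such days far more often, which is why a thin-tailed model understates this kind of risk.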
I submit that a data scientist is not one who believes the use of algorithms and statistical methods will provide him or her with “the” answer. Instead, I believe a data scientist is one who is fully cognizant of one’s innate inability to predict the future. A data scientist is one who appreciates that analytics will deliver one informed view of the future out of many possible incarnations. A data scientist is one who never becomes a victim of compound ignorance; that is, the state when one is ignorant of one’s own ignorance.