3.6 Summary
The purpose of exploratory data analysis is to get a sense of what the data is, looking for both patterns and anomalies, before hypothesizing about results. As John Tukey said,
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
Be approximately right rather than exactly wrong.
Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.
Awk is well worth learning as a core tool for exploratory data analysis, because you can use it for quick counting, summarization, and searching. It certainly won’t handle everything, but in conjunction with other tools, especially spreadsheets and plotting libraries, it’s excellent for getting a quick understanding of what a dataset contains.
A big part of getting that understanding is identifying anomalies and weirdnesses. As a colleague at Bell Labs told us long ago, “a third of all data is bad.” Although he perhaps exaggerated for rhetorical effect, we have seen plenty of datasets where a significant part really was flaky and untrustworthy. If you build a set of tools and techniques for looking at your data, you’ll be better able to find the places where it needs to be cleaned up or at least treated cautiously.
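As a reminder of how little code this kind of checking can take, here is a minimal sketch of two typical validation passes. The file sample.tsv and its contents are hypothetical, invented for illustration: five tab-separated lines that should each hold a name and an unsigned integer, two of which are deliberately bad.

```shell
# Hypothetical sample data (an assumption, not a real dataset):
# line 3 is missing a field and line 4 has a non-numeric count.
printf 'a\t1\nb\t2\nc\nd\tx\ne\t5\n' > sample.tsv

# Count how many lines have each number of fields.
# A well-formed file would show a single field count.
awk -F'\t' '{ nf[NF]++ } END { for (n in nf) print n, nf[n] }' sample.tsv | sort -n

# List the suspect lines: wrong number of fields,
# or a second field that is not an unsigned integer.
awk -F'\t' 'NF != 2 || $2 !~ /^[0-9]+$/ { print FILENAME ":" NR ": " $0 }' sample.tsv
```

The first command shows that one line has a single field while four have two; the second reports lines 3 and 4 with their filename and line number so they can be inspected by hand. The tests themselves are only examples; the right checks depend on what each column of your data is supposed to contain.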