3.5 Basic Graphs and Charts
Visualization is an important component of exploratory data analysis, and fortunately there are really good plotting libraries that make graphs and charts remarkably easy. This is especially true of Python, with packages like Matplotlib and Seaborn, but Gnuplot, which is available on Unix and macOS, is also good for quick plotting. And of course Excel and other spreadsheet programs create good charts. We’re not going to do much more here than to suggest minimal ways to plot data; after that, you should do your own experiments.
Is there a correlation between ABV and rating? Do reviewers prefer higher-alcohol beer? A scatter plot is one way to get a quick impression, but it’s hard to plot 1.5 million points. Let’s use Awk to grab a 0.1% sample (about 1,500 points), and plot that:
$ awk -F'\t' 'NR%1000 == 500 {print $2, $5}' rev.tsv >temp
$ gnuplot
plot 'temp'
$
This produces the graph in Figure 3.1. There appears to be at most a weak correlation between rating and ABV.
FIGURE 3.1 Beer rating as a function of ABV
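A scatter plot gives only a visual impression; one way to put a number on "weak correlation" is the Pearson correlation coefficient. The following is a minimal sketch, not part of the original analysis. It assumes the file temp has the format produced above (rating and ABV separated by a space), and the helper names pearson and read_pairs are made up for illustration. Lines with a missing or non-numeric ABV are skipped, which is our own choice.

```python
import math

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def read_pairs(lines):
    """Parse 'rating abv' lines; skip lines that lack two numeric fields."""
    ratings, abvs = [], []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            rating, abv = float(fields[0]), float(fields[1])
        except ValueError:
            continue
        ratings.append(rating)
        abvs.append(abv)
    return ratings, abvs

if __name__ == "__main__":
    # Made-up sample lines standing in for the contents of temp
    sample = ["4 5.0", "3.5 4.7", "4.5 8.2", "3", "2 5.0"]
    ratings, abvs = read_pairs(sample)
    print(f"r = {pearson(ratings, abvs):.3f} over {len(ratings)} points")
```

To run it on the real sample, replace the made-up list with open('temp'). A value near 0 means little linear relationship, and values near 1 or -1 mean a strong one. Python 3.10 and later also provide statistics.correlation, which computes the same coefficient.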
Tukey’s boxplot visualization shows the median, quartiles, and other properties of a dataset. A boxplot is sometimes called a box-and-whiskers plot because “whiskers” extend from each end of the box, typically to the most extreme data point that lies within one and a half times the interquartile range (the distance between the lower and upper quartiles). Points beyond the whiskers are outliers.
This short Python program generates a boxplot of beer ratings for the sample described above. The file temp contains the ratings and ABV, one pair per line, separated by a space, with no heading.
import matplotlib.pyplot as plt
import pandas as pd

# temp holds "rating ABV" pairs separated by a space, with no header row
df = pd.read_csv('temp', sep=' ', header=None)
plt.boxplot(df[0])   # column 0 is the rating
plt.show()
It produces the boxplot of Figure 3.2, which shows that the median rating is 4, and half the ratings are between the quartiles of 3.5 and 4.5. The whiskers extend to at most 1.5 times the interquartile range, and there are outliers at 1.5 and 1.0.
FIGURE 3.2 Boxplot of beer ratings sample.
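The numbers that a boxplot displays can also be computed directly, which is a useful check on what the picture shows. Here is a minimal sketch using only the standard library. The function name boxplot_stats and the sample ratings are made up for illustration, and the quartiles use the "inclusive" interpolation method of statistics.quantiles, which may differ slightly from what a particular plotting library does.

```python
import statistics

def boxplot_stats(values):
    """Return median, quartiles, whisker ends, and outliers (1.5 * IQR rule)."""
    data = sorted(values)
    # Quartiles by linear interpolation between data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points inside the fences
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

if __name__ == "__main__":
    # Made-up ratings, not the real sample
    ratings = [1.0, 1.5, 3.5, 3.5, 4.0, 4.0, 4.5, 4.5, 4.5, 5.0]
    for name, value in boxplot_stats(ratings).items():
        print(name, value)
```

Note that the whiskers stop at actual data values, so they can be shorter than 1.5 times the interquartile range when no data point lies that far from the box.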
It’s also possible to see how well any particular beer or brewery does, perhaps in comparison to mass-market American beers:
$ awk -F'\t' '/Budweiser/ { s += $2; n++ } END {print s/n, n }' rev.tsv
3.15159 3958
$ awk -F'\t' '/Coors/ { s += $2; n++ } END {print s/n, n }' rev.tsv
3.1044 9291
$ awk -F'\t' '/Hill Farmstead/ { s += $2; n++ } END {print s/n, n }' rev.tsv
4.29486 1555
This suggests a sizable ratings gap, more than a full rating point on average, between mass-produced beers and small-scale craft brews.
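Each Awk command above reads the whole file once per name. If you want several averages at once, a single pass is enough. The sketch below is one way to do it in Python, under the same assumptions as the Awk commands: tab-separated lines with the rating in the second field, and a name matched anywhere in the line. The function name mean_ratings and the sample lines are made up for illustration. Unlike Awk, it skips lines whose rating is not a number rather than counting them as zero.

```python
def mean_ratings(lines, names, rating_field=1):
    """Return {name: (mean rating, count)} over lines that contain each name."""
    totals = {name: [0.0, 0] for name in names}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= rating_field:
            continue
        try:
            rating = float(fields[rating_field])
        except ValueError:
            continue  # skip lines with a non-numeric rating
        for name in names:
            if name in line:  # substring match anywhere, like /name/ in Awk
                totals[name][0] += rating
                totals[name][1] += 1
    # Leave out names that never matched, to avoid dividing by zero
    return {name: (s / n, n) for name, (s, n) in totals.items() if n > 0}

if __name__ == "__main__":
    # Made-up lines; only the second field (the rating) matters here
    sample = [
        "Budweiser\t3.0\tx\ty\t5.0",
        "Budweiser Select\t2.0\tx\ty\t4.3",
        "Hill Farmstead Edward\t4.5\tx\ty\t5.2",
    ]
    for name, (mean, count) in mean_ratings(
            sample, ["Budweiser", "Coors", "Hill Farmstead"]).items():
        print(name, mean, count)
```

To run it on the real data, pass open('rev.tsv') as the first argument.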