Exploratory Data Analysis
- 3.1 The Sinking of the Titanic
- 3.2 Beer Ratings
- 3.3 Grouping Data
- 3.4 Unicode Data
- 3.5 Basic Graphs and Charts
- 3.6 Summary
The previous chapter described a number of small scripts for personal use, often idiosyncratic or specialized. In this chapter, we’re going to do something that is also typical of how Awk is used in real life: we’ll use it along with other tools to informally explore some real data, with the goal of seeing what it looks like. This is called exploratory data analysis or EDA, a term first used by the pioneering statistician John Tukey.
Tukey invented a number of basic data visualization techniques like boxplots, inspired the statistical programming language S that led to the widely-used R language, co-invented the Fast Fourier Transform, and coined the words “bit” and “software.” The authors knew John Tukey as a friend and colleague at Bell Labs in the 1970s and 1980s, where among a large number of very smart and creative people, he stood out as someone special.
The essence of exploratory data analysis is to play with the data before making hypotheses or drawing conclusions. As Tukey himself said,
“Finding the question is often more important than finding the answer. Exploratory data analysis is an attitude, a flexibility, and a reliance on display, NOT a bundle of techniques.”
In many cases, that involves counting things, computing simple statistics, arranging data in different ways, looking for patterns, commonalities, outliers and oddities, and drawing basic graphs and other visual displays. The emphasis is on small, quick experiments that might give some insight, rather than polish or refinement; those come later when we have a better sense of what the data might be telling us.
For EDA, we typically use standard Unix tools like the shell, wc, diff, sort, uniq, grep, and of course regular expressions. These combine well with Awk, and often with other languages like Python.
We will also encounter a variety of file formats, including comma- or tab-separated values (CSV and TSV), JSON, HTML, and XML. Some of these, like CSV and TSV, are easily processed in Awk, while others are sometimes better handled with other tools.
3.1 The Sinking of the Titanic
Our first dataset is based on the sinking of the Titanic on April 15, 1912. This example was chosen, not entirely by coincidence, by one of the authors, who was at the time on a trans-Atlantic boat trip, passing not far from the site where the Titanic sank.
Summary Data: titanic.tsv
The file titanic.tsv, adapted from Wikipedia, contains summary data about the Titanic’s passengers and crew. As is common with datasets in CSV and TSV format, the first line is a header that identifies the data in the lines that follow. Columns are separated by tabs.
Type    Class    Total   Lived   Died
Male    First      175      57    118
Male    Second     168      14    154
Male    Third      462      75    387
Male    Crew       885     192    693
Female  First      144     140      4
Female  Second      93      80     13
Female  Third      165      76     89
Female  Crew        23      20      3
Child   First        6       5      1
Child   Second      24      24      0
Child   Third       79      27     52
Many (perhaps all) datasets contain errors. As a quick check here, each line should have five fields, and the total in the third field should equal field four (lived) plus field five (died). This program prints any line where those conditions do not hold:
NF != 5 || $3 != $4 + $5
If the data is in the right format and the numbers are correct, this should produce a single line of output, the header:
Type Class Total Lived Died
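For the record, one way to run this check from the shell (a sketch, assuming the file is in the current directory and that the one-line program is given directly on the command line) is

    $ awk -F"\t" 'NF != 5 || $3 != $4 + $5' titanic.tsv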
Once we’ve done this minimal check, we can look at other things. For example, how many people are there in each category?
The categories that we want to count are not identified by numbers, but by words like Male and Crew. Fortunately, the subscripts or indices of Awk arrays can be arbitrary strings of characters, so gender["Male"] and class["Crew"] are valid expressions.
Arrays that allow arbitrary strings as subscripts are called associative arrays; other languages provide the same facility with names like dictionary, map or hashmap. Associative arrays are remarkably convenient and flexible, and we will use them extensively.
NR > 1 { gender[$1] += $3; class[$2] += $3 }
END {
    for (i in gender) print i, gender[i]
    print ""
    for (i in class) print i, class[i]
}
gives
Male 1690
Child 109
Female 425

Crew 908
First 325
Third 706
Second 285
Awk has a special form of the for statement for iterating over the indices of an associative array:
for (i in array) { statements }
sets the variable i in turn to each index of the array, and the statements are executed with that value of i. The elements of the array are visited in an unspecified order; you can’t count on any particular order.
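As a generic illustration of this kind of loop, not tied to the Titanic data, here is a small sketch that counts how often each word appears in its input and prints the counts at the end:

    # wordfreq - count how often each input word occurs
    { for (i = 1; i <= NF; i++) count[$i]++ }
    END { for (w in count) print count[w], w }

Because the order of the for loop is unspecified, the output can be piped through sort (for example, sort -nr) whenever a predictable order is wanted.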
What about survival rates? How did social class, gender and age affect the chance of survival among passengers? With this summary data we can do some simple experiments, for example, computing the survival rate for each category.
NR > 1 { printf("%6s %6s %6.1f%%\n", $1, $2, 100 * $4/$3) }
We can sort the output of this program by piping it through the Unix command sort -k3 -nr (sort by third field in reverse numeric order) to produce
 Child Second  100.0%
Female  First   97.2%
Female   Crew   87.0%
Female Second   86.0%
 Child  First   83.3%
Female  Third   46.1%
 Child  Third   34.2%
  Male  First   32.6%
  Male   Crew   21.7%
  Male  Third   16.2%
  Male Second    8.3%
Evidently women and children did survive better on average.
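For the record, the computation and the sort can be combined in a single pipeline; this sketch assumes the data is in titanic.tsv:

    $ awk -F"\t" 'NR > 1 { printf("%6s %6s %6.1f%%\n", $1, $2, 100*$4/$3) }' titanic.tsv | sort -k3 -nr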
Note that these examples treat the header line of the dataset as a special case. If you’re doing a lot of experiments, it may be easier to remove the header from the data file than to ignore it explicitly in every program.
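One way to make such a header-free copy is with a one-line Awk program; the output file name here is hypothetical:

    $ awk 'NR > 1' titanic.tsv >titanic_body.tsv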
Passenger Data: passengers.csv
The file passengers.csv is a larger file that contains detailed information about passengers, though it does not contain anything about crew members. The original file is a merger of a widely used machine-learning dataset with another list from Wikipedia. It has 11 columns including home town, lifeboat assignment, and ticket price:
"row.names","pclass","survived","name","age","embarked", "home.dest","room","ticket","boat","sex" ... "11","1st",0,"Astor, Colonel John Jacob",47,"Cherbourg", "New York, NY","","17754 L224 10s 6d","(124)","male" ...
How big is the file? We can use the Unix wc command to count lines, words and characters:
$ wc passengers.csv
    1314    6794  112466 passengers.csv
or a two-line Awk program like the one we saw in Chapter 1:
    { nc += length($0) + 1; nw += NF }
END { print NR, nw, nc, FILENAME }
Except for spacing, they produce the same results when the input is a single file.
The file format of passengers.csv is comma-separated values. Although CSV is not rigorously defined, one common definition says that any field that contains a comma or a double quote (") must be surrounded by double quotes. Any field may be surrounded by quotes, whether it contains commas and quotes or not. An empty field is just "", and a quote within a field is represented by a doubled quote, as in """,""", which represents ",". Input fields in CSV files may contain newline characters. For more details, see Section A.5.2.
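For instance, this made-up record has four fields: a plain one, one containing a comma, one containing quotes, and an empty one:

    abc,"def,ghi","a ""quoted"" word",""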
This is more or less the format used by Microsoft Excel and other spreadsheet programs like Apple Numbers and Google Sheets. It is also the default input format for data frames in Python’s Pandas library and in R.
In versions of Awk since 2023, the command-line argument --csv causes input lines to be split into fields according to this rule. Merely setting the field separator to a comma, for example with FS=",", splits on every comma but does not handle quoted fields, so that approach works only for the simplest form of CSV: no quotes. With older versions of Awk it may be easiest to convert the data to a different form using some other system, like an Excel spreadsheet or Python's csv module.
Another useful alternative format is tab-separated values or TSV. The idea is the same, but simpler: fields are separated by single tabs, and there is no quoting mechanism so fields may not contain embedded tabs or newlines. This format is easily handled by Awk, by setting the field separator to a tab with FS="\t" or equivalently with the command-line argument -F"\t".
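As a quick illustration, assuming the summary file titanic.tsv from the previous section, either of these equivalent forms prints its third column:

    $ awk -F"\t" '{ print $3 }' titanic.tsv
    $ awk 'BEGIN { FS = "\t" } { print $3 }' titanic.tsv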
As an aside, it’s wise to verify whether a file is in the proper format before relying on its contents. For example, to check whether all records have the same number of fields, you could use
awk '{print NF}' file | sort | uniq -c | sort -nr
The first sort command brings all instances of a particular value together; then the command uniq -c replaces each sequence of identical values by a single line with a count and the value; and finally sort -nr sorts the result numerically in reverse order, so the largest values come first.
For passengers.csv, using the --csv option to process CSV input properly, this produces
1314 11
Every record has the same number of fields, which is necessary for this dataset to be valid, though not sufficient. If some lines had a different number of fields, we could use Awk to find them, for example with the pattern NF != 11 in this case.
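A sketch of such a check, which prints the record number and field count of any offending line (and prints nothing for this file), is

    $ awk --csv 'NF != 11 { print NR, NF }' passengers.csv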
With a version of Awk that does not handle CSV, the output using -F, will be different:
 624 12
 517 13
 155 14
  15 15
   3 11
This shows that almost all records contain fields with embedded commas.
By the way, generating CSV is straightforward. Here’s a function to_csv that converts a string to a properly quoted string by doubling each quote and surrounding the result with quotes. It’s an example of a function that could go into a personal library.
# to_csv - convert s to proper "..."
function to_csv(s) {
    gsub(/"/, "\"\"", s)
    return "\"" s "\""
}
(Note how quotes are quoted with backslashes.)
We can use this function within a loop to insert commas between elements of an array to create a properly formatted CSV record for an associative array, or for an indexed array like the fields of a line, as illustrated in the functions rec_to_csv and arr_to_csv:
# rec_to_csv - convert a record to csv
function rec_to_csv( s, i) {
    for (i = 1; i < NF; i++)
        s = s to_csv($i) ","
    s = s to_csv($NF)
    return s
}

# arr_to_csv - convert an indexed array to csv
function arr_to_csv(arr,    s, i, n) {
    n = length(arr)
    for (i = 1; i <= n; i++)
        s = s to_csv(arr[i]) ","
    return substr(s, 1, length(s)-1)   # remove trailing comma
}
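As a usage sketch, the functions might be combined into a tiny program that reads tab-separated input and writes each record as CSV; the file names here are hypothetical:

    # tsv2csv - read TSV input, write each record as CSV
    # assumes to_csv and rec_to_csv are included, e.g. with -f csvlib.awk
    BEGIN { FS = "\t" }
    { print rec_to_csv() }

It could be run as awk -f csvlib.awk -f tsv2csv.awk titanic.tsv, where csvlib.awk holds the library functions above.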
The following program selects the five attributes class, survival, name, age, and gender from the original file and converts the output to tab-separated values.
NR > 1 { OFS="\t"; print $2, $3, $4, $5, $11 }
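Because the name fields contain quoted commas, this needs the --csv option; one possible invocation, with hypothetical file names, is

    $ awk --csv -f select5.awk passengers.csv >passengers.tsv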
It produces output like this:
1st  0  Allison, Miss Helen Loraine                      2       female
1st  0  Allison, Mr Hudson Joshua Creighton              30      male
1st  0  Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)  25      female
1st  1  Allison, Master Hudson Trevor                    0.9167  male
Most ages are integers, but a handful are fractions, like the last line above. Helen Allison was two years old; Master Hudson Allison appears to have been 11 months old, and was the only survivor in his family. (From other sources, we know that the Allisons' chauffeur, George Swane, age 18, also died, but the family's maid and cook both survived.)
How many infants were there? Running the command
$4 < 1
with tab as the field separator produces eight lines:
1st  1  Allison, Master Hudson Trevor            0.9167  male
2nd  1  Caldwell, Master Alden Gates             0.8333  male
2nd  1  Richards, Master George Sidney           0.8333  male
3rd  1  Aks, Master Philip                       0.8333  male
3rd  0  Danbom, Master Gilbert Sigvard Emanuel   0.3333  male
3rd  1  Dean, Miss Elizabeth Gladys (Millvena)   0.1667  female
3rd  0  Peacock, Master Alfred Edward            0.5833  male
3rd  0  Thomas, Master Assad Alexander           0.4167  male
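For the record, these lines came from a command like this, assuming the five-column extract was saved in passengers.tsv (the hypothetical name used above):

    $ awk -F"\t" '$4 < 1' passengers.tsv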
Exercise 3-1. Modify the word count program to produce a separate count for each of its input files, as the Unix wc command does.
Some Further Checking
Another set of questions to explore is how well the two data sources agree. They both come from Wikipedia, but it is not always a perfectly accurate source. Suppose we check something absolutely basic, like how many passengers there were in the passengers file:
$ awk 'END {print NR}' passengers.csv
1314
This count includes one header line, so there were 1313 passengers. On the other hand, this program adds up the counts for non-crew members from the third field of the summary file:
$ awk '!/Crew/ { s += $3 }; END { print s }' titanic.tsv
1316
That’s a discrepancy of three people, so something is wrong.
As another example, how many children were there?
awk --csv '$5 <= 12' passengers.csv
produces 100 lines, which doesn’t match the 109 children in titanic.tsv. Perhaps children are those 13 or younger? That gives 105. Younger than 14? That’s 112. We can guess what age is being used by counting passengers who are called “Master”:
awk --csv '/Master/ {print $5}' passengers.csv | sort -n
The largest age in this population is 13, so that’s perhaps the best guess, though not definitive.
In both of these cases, numbers that ought to be the same are in fact different, which suggests that the data is still flaky. When exploring data, you should always be prepared for errors and inconsistencies in form and content. A big part of the job is to be sure that you have identified and dealt with potential problems before starting to draw conclusions.
In this section, we’ve tried to show how simple computations can help identify such problems. If you collect a set of tools for common operations, like isolating fields, grouping by category, printing the most common and least common entries, and so on, you’ll be better able to perform such checks.
Exercise 3-2. Write some of these tools for yourself, according to your own needs and tastes.