6.6 Observational Units Across Multiple Tables
The last bit of data tidying relates to the situation in which the same type of data is spread across multiple data sets. This issue was also covered in Chapter 4, when we discussed data concatenation and merging. One reason data might be split across multiple files is file size: breaking the data into smaller parts makes each piece easier to share on the Internet or via email, since many services limit the size of a file that can be uploaded or shared. Another reason a data set might be split into multiple parts is the data collection process itself. For example, a separate data set containing stock information could be created for each day.
Since merging and concatenation have already been covered, this section will focus on techniques for quickly loading multiple data sources and assembling them together.
The Unified New York City Taxi and Uber Data is a good choice to illustrate these processes. The entire data set contains data on more than 1.3 billion taxi and Uber trips in New York City, and is organized into more than 140 files. For illustration purposes, we will work with only five of these data files. When the same data is broken into multiple parts, those parts typically follow a structured naming pattern.
First let’s download the data. Do not worry too much about the details in the following block of code. The raw_data_urls.txt file contains a list of URLs, where each URL is the download link to a part of the taxi data. We begin by opening the file and iterating through each line (i.e., each data URL). We download only the first five data sets, since the files are fairly large. We use some string manipulation (Chapter 8) to create the path where the data will be saved, and use the urllib library to download our data.
import os
import urllib.request

# code to download the data
# download only the first 5 data sets from the list of files
with open('../data/raw_data_urls.txt', 'r') as data_urls:
    for line, url in enumerate(data_urls):
        if line == 5:
            break
        url = url.strip()
        fn = url.split('/')[-1]
        fp = os.path.join('..', 'data', fn)
        print(url)
        print(fp)
        urllib.request.urlretrieve(url, fp)
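A small caveat: re-running the block above downloads every file again, even if it is already on disk. A minimal variation, assuming the same raw_data_urls.txt layout, skips files that already exist:

import os
import urllib.request

with open('../data/raw_data_urls.txt', 'r') as data_urls:
    for line, url in enumerate(data_urls):
        if line == 5:
            break
        url = url.strip()
        fn = url.split('/')[-1]
        fp = os.path.join('..', 'data', fn)
        # skip files that have already been downloaded
        if os.path.exists(fp):
            continue
        urllib.request.urlretrieve(url, fp)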
In this example, all of the raw taxi trip files follow the pattern fhv_tripdata_YYYY-XX.csv, where YYYY represents the year (e.g., 2015), and XX represents the part number. We can use the simple pattern-matching function from the glob library in Python to get a list of all the filenames that match a particular pattern.
import glob

# get a list of the csv files from the nyc-taxi data folder
nyc_taxi_data = glob.glob('../data/fhv_*')
print(nyc_taxi_data)
['../data/fhv_tripdata_2015-04.csv', '../data/fhv_tripdata_2015-05.csv', '../data/fhv_tripdata_2015-03.csv', '../data/fhv_tripdata_2015-01.csv', '../data/fhv_tripdata_2015-02.csv']
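Notice that the filenames do not come back in alphabetical order; glob makes no guarantees about the order of its results. The order does not matter for the concatenation below, but if it ever does, you can wrap the call in sorted(). This is a small addition on our part, not something glob does for you:

# get the filenames in a deterministic (alphabetical) order
nyc_taxi_data = sorted(glob.glob('../data/fhv_*'))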
Now that we have a list of the filenames we want to load, we can read each file into a dataframe. One option is to load each file individually, as we have been doing so far.
import pandas as pd

taxi1 = pd.read_csv(nyc_taxi_data[0])
taxi2 = pd.read_csv(nyc_taxi_data[1])
taxi3 = pd.read_csv(nyc_taxi_data[2])
taxi4 = pd.read_csv(nyc_taxi_data[3])
taxi5 = pd.read_csv(nyc_taxi_data[4])
We can look at the data and see how the dataframes can be nicely stacked (concatenated) on top of one another.
print(taxi1.head(n=2))
print(taxi2.head(n=2))
print(taxi3.head(n=2))
print(taxi4.head(n=2))
print(taxi5.head(n=2))

  Dispatching_base_num          Pickup_date  locationID
0               B00001  2015-04-01 04:30:00         NaN
1               B00001  2015-04-01 06:00:00         NaN
  Dispatching_base_num          Pickup_date  locationID
0               B00001  2015-05-01 04:30:00         NaN
1               B00001  2015-05-01 05:00:00         NaN
  Dispatching_base_num          Pickup_date  locationID
0               B00029  2015-03-01 00:02:00       213.0
1               B00029  2015-03-01 00:03:00        51.0
  Dispatching_base_num          Pickup_date  locationID
0               B00013  2015-01-01 00:30:00         NaN
1               B00013  2015-01-01 01:22:00         NaN
  Dispatching_base_num          Pickup_date  locationID
0               B00013  2015-02-01 00:00:00         NaN
1               B00013  2015-02-01 00:01:00         NaN
We can concatenate them just as we did in Chapter 4.
# shape of each dataframe
print(taxi1.shape)
print(taxi2.shape)
print(taxi3.shape)
print(taxi4.shape)
print(taxi5.shape)

(3917789, 3)
(4296067, 3)
(3281427, 3)
(2746033, 3)
(3126401, 3)
# concatenate the dataframes together
taxi = pd.concat([taxi1, taxi2, taxi3, taxi4, taxi5])

# shape of final concatenated taxi data
print(taxi.shape)
(17367717, 3)
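One detail to keep in mind: concat preserves the row labels of each original dataframe, so the index of taxi now contains repeated values (each file's rows start again at 0). If you would rather have a fresh 0-based index on the combined data, pandas lets you pass ignore_index=True. The variable name below is our own, used only to keep the original taxi dataframe intact for the comparisons later in this section:

# variant: concatenate and re-number the rows from 0
taxi_reindexed = pd.concat(
    [taxi1, taxi2, taxi3, taxi4, taxi5], ignore_index=True
)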
However, manually loading each file into its own dataframe becomes tedious when the data is split into many parts. As an alternative, we can automate the process using loops and list comprehensions.
6.6.1 Load Multiple Files Using a Loop
An easier way to load multiple files is to first create an empty list, use a loop to iterate through each of the CSV files, load each CSV file into a Pandas dataframe, and finally append the dataframe to the list. We want to end up with a list of dataframes because the concat function takes a list of dataframes to concatenate.
# create an empty list to append dataframes to
list_taxi_df = []

# loop through each CSV filename
for csv_filename in nyc_taxi_data:
    # you can choose to print the filename for debugging
    # print(csv_filename)

    # load the CSV file into a dataframe
    df = pd.read_csv(csv_filename)

    # append the dataframe to the list that will hold the dataframes
    list_taxi_df.append(df)

# print the number of dataframes in the list
print(len(list_taxi_df))
5
# type of the first element
print(type(list_taxi_df[0]))
<class 'pandas.core.frame.DataFrame'>
# look at the head of the first dataframe
print(list_taxi_df[0].head())
  Dispatching_base_num          Pickup_date  locationID
0               B00001  2015-04-01 04:30:00         NaN
1               B00001  2015-04-01 06:00:00         NaN
2               B00001  2015-04-01 06:00:00         NaN
3               B00001  2015-04-01 06:00:00         NaN
4               B00001  2015-04-01 06:15:00         NaN
Now that we have a list of dataframes, we can concatenate them.
taxi_loop_concat = pd.concat(list_taxi_df)
print(taxi_loop_concat.shape)
(17367717, 3)
# did we get the same results as the manual load and concatenation?
print(taxi.equals(taxi_loop_concat))
True
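If you need to remember which file each row came from, pd.concat can also label the pieces: the keys argument adds an outer index level, with one label per dataframe. A short sketch using the filenames as labels (the variable and level names here are our own choice, not part of the original example):

# tag each block of rows with its source file
taxi_keyed = pd.concat(list_taxi_df, keys=nyc_taxi_data,
                       names=['source_file', 'row'])

# pull out all rows that came from the first file
print(taxi_keyed.loc[nyc_taxi_data[0]].shape)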
6.6.2 Load Multiple Files Using a List Comprehension
Python has an idiom for looping through something and adding it to a list: the list comprehension. The loop given previously, shown again here without the comments, can be rewritten as a list comprehension (Appendix N).
# the loop code without comments
list_taxi_df = []
for csv_filename in nyc_taxi_data:
    df = pd.read_csv(csv_filename)
    list_taxi_df.append(df)

# same code in a list comprehension
list_taxi_df_comp = [pd.read_csv(data) for data in nyc_taxi_data]
The result of our list comprehension is a list of dataframes, just as in the earlier loop example.
print(type(list_taxi_df_comp))
<class 'list'>
Finally, we can concatenate the results just as we did earlier.
taxi_loop_concat_comp = pd.concat(list_taxi_df_comp)

# are the concatenated dataframes the same?
print(taxi_loop_concat_comp.equals(taxi_loop_concat))
True
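Putting all the pieces together, the entire load-and-concatenate workflow fits in a few lines. The snippet below is one way to write it, assuming the same file pattern as before; pd.concat accepts any iterable of dataframes, so a generator expression works just as well as a list comprehension here:

import glob
import pandas as pd

# find the files, load each one, and stack the results
taxi = pd.concat(
    (pd.read_csv(csv_filename)
     for csv_filename in sorted(glob.glob('../data/fhv_*'))),
    ignore_index=True,
)
print(taxi.shape)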