6.3 Columns Contain Multiple Variables
Sometimes columns in a data set may represent multiple variables. This format is commonly seen when working with health data, for example. To illustrate this situation, let’s look at the Ebola data set.
ebola = pd.read_csv('../data/country_timeseries.csv') print(ebola.columns)
Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone', 'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain', 'Cases_Mali', 'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone', 'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates', 'Deaths_Spain', 'Deaths_Mali'], dtype='object')
# print select rows print(ebola.iloc[:5, [0, 1, 2, 3, 10, 11]])
Date Day Cases_Guinea Cases_Liberia Deaths_Guinea 0 1/5/2015 289 2776.0 NaN 1786.0 1 1/4/2015 288 2775.0 NaN 1781.0 2 1/3/2015 287 2769.0 8166.0 1767.0 3 1/2/2015 286 NaN 8157.0 NaN 4 12/31/2014 284 2730.0 8115.0 1739.0 Deaths_Liberia 0 NaN 1 NaN 2 3496.0 3 3496.0 4 3471.0
The column names Cases_Guinea and Deaths_Guinea actually contain two variables. The individual status (cases and deaths, respectively) as well as the country name, Guinea. The data is also arranged in a wide format that needs to be unpivoted.
ebola_long = pd.melt(ebola, id_vars=['Date', 'Day']) print(ebola_long.head())
Date Day variable value 0 1/5/2015 289 Cases_Guinea 2776.0 1 1/4/2015 288 Cases_Guinea 2775.0 2 1/3/2015 287 Cases_Guinea 2769.0 3 1/2/2015 286 Cases_Guinea NaN 4 12/31/2014 284 Cases_Guinea 2730.0
print(ebola_long.tail())
Date Day variable value 1947 3/27/2014 5 Deaths_Mali NaN 1948 3/26/2014 4 Deaths_Mali NaN 1949 3/25/2014 3 Deaths_Mali NaN 1950 3/24/2014 2 Deaths_Mali NaN 1951 3/22/2014 0 Deaths_Mali NaN
6.3.1 Split and Add Columns Individually (Simple Method)
Conceptually, the column of interest can be split based on the underscore in the column name, _. The first part will be the new status column, and the second part will be the new country column. This will require some string parsing and splitting in Python (more on this in Chapter 8). In Python, a string is an object, similar to how Pandas has Series and DataFrame objects. Chapter 2 showed how Series can have method such as mean, and DataFrames can have methods such as to_csv. Strings have methods as well. In this case we will use the split method that takes a string and splits the string up based on a given delimiter. By default, split will split the string based on a space, but we can pass in the underscore, _, in our example. To get access to the string methods, we need to use the str accessor (see Chapter 8 for more on strings). This will give us access to the Python string methods and allow us to work across the entire column.
# get the variable column # access the string methods # and split the column based on a delimiter variable_split = ebola_long.variable.str.split('_')
print(variable_split[:5])
0 [Cases, Guinea] 1 [Cases, Guinea] 2 [Cases, Guinea] 3 [Cases, Guinea] 4 [Cases, Guinea] Name: variable, dtype: object
print(variable_split[-5:])
1947 [Deaths, Mali] 1948 [Deaths, Mali] 1949 [Deaths, Mali] 1950 [Deaths, Mali] 1951 [Deaths, Mali] Name: variable, dtype: object
After we split on the underscore, the values are returned in a list. We know it’s a list because that’s how the split method works,3 but the visual cue is that the results are surrounded by square brackets.
# the entire container print(type(variable_split))
<class 'pandas.core.series.Series'>
# the first element in the container print(type(variable_split[0]))
<class 'list'>
Now that the column has been split into the various pieces, the next step is to assign those pieces to a new column. First, however, we need to extract all the 0-index elements for the status column and the 1-index elements for the country column. To do so, we need to access the string methods again, and then use the get method to get the index we want for each row.
status_values = variable_split.str.get(0) country_values = variable_split.str.get(1)
print(status_values[:5])
0 Cases 1 Cases 2 Cases 3 Cases 4 Cases Name: variable, dtype: object
print(status_values[-5:])
1947 Deaths 1948 Deaths 1949 Deaths 1950 Deaths 1951 Deaths Name: variable, dtype: object
print(country_values[:5])
0 Guinea 1 Guinea 2 Guinea 3 Guinea 4 Guinea Name: variable, dtype: object
print(country_values[-5:])
1947 Mali 1948 Mali 1949 Mali 1950 Mali 1951 Mali Name: variable, dtype: object
Now that we have the vectors we want, we can add them to our dataframe.
ebola_long['status'] = status_values ebola_long['country'] = country_values
print(ebola_long.head())
Date Day variable value status country 0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea 1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea 2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea 3 1/2/2015 286 Cases_Guinea NaN Cases Guinea 4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea
6.3.2 Split and Combine in a Single Step (Simple Method)
In this subsection, we’ll exploit the fact that the vector returned is in the same order as our data. We can concatenate (see Chapter 4) the new vector or our original data.
variable_split = ebola_long.variable.str.split('_', expand=True) variable_split.columns = ['status', 'country'] ebola_parsed = pd.concat([ebola_long, variable_split], axis=1)
print(ebola_parsed.head())
Date Day variable value status country status 0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea Cases 1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea Cases 2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea Cases 3 1/2/2015 286 Cases_Guinea NaN Cases Guinea Cases 4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea Cases country 0 Guinea 1 Guinea 2 Guinea 3 Guinea 4 Guinea
print(ebola_parsed.tail())
Date Day variable value status country status 1947 3/27/2014 5 Deaths_Mali NaN Deaths Mali Deaths 1948 3/26/2014 4 Deaths_Mali NaN Deaths Mali Deaths 1949 3/25/2014 3 Deaths_Mali NaN Deaths Mali Deaths 1950 3/24/2014 2 Deaths_Mali NaN Deaths Mali Deaths 1951 3/22/2014 0 Deaths_Mali NaN Deaths Mali Deaths country 1947 Mali 1948 Mali 1949 Mali 1950 Mali 1951 Mali
6.3.3 Split and Combine in a Single Step (More Complicated Method)
In this subsection, we’ll again exploit the fact that the vector returned is in the same order as our data. We can concatenate (see Chapter 4) the new vector or our original data.
We can accomplish the same result in a single step by taking advantage of the fact that the split results return a list of two elements, where each element is a new column. We can combine the list of split items with the built-in zip function. zip takes a set of iterators (e.g., lists, tuples) and creates a new container that is made of the input iterators, but each new container created has the same index as the input containers. For example, if we have two lists of values,
constants = ['pi', 'e'] values = ['3.14', '2.718']
we can zip the values together:
# we have to call list on the zip function # to show the contents of the zip object # in Python 3, zip returns an iterator print(list(zip(constants, values)))
[('pi', '3.14'), ('e', '2.718')]
Each element now has the constant matched with its corresponding value. Conceptually, each container is like a side of a zipper. When we zip the containers, the indices are matched up and returned.
Another way to visualize what zip is doing is taking each container passed into zip and stacking the containers on top of each other (think about the row-wise concatenation described in Section 4.3.1), thereby creating a dataframe of sorts. zip then returns the values on a column-by-column basis in a tuple.
We can use the same ebola_long.variable.str.split(' ') to split the values in the column. However, since the result is already a container (a Series object), we need to unpack it so that we have the contents of the container (each status–country list), rather than the container itself (the series).
In Python, the asterisk operator, *, is used to unpack containers.4 When we zip the unpacked containers, the effect is the same as when we created the status values and the country values earlier. We can then assign the vectors to the columns simultaneously using multiple assignment (Appendix Q).
ebola_long['status'], ebola_long['country'] = zip(*ebola_long.variable.str.split('_'))
print(ebola_long.head())
Date Day variable value status country 0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea 1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea 2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea 3 1/2/2015 286 Cases_Guinea NaN Cases Guinea 4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea