- Learning Objectives
- 4.1 Columns Contain Values, Not Variables
- 4.2 Columns Contain Multiple Variables
- 4.3 Variables in Both Rows and Columns
- Conclusion
4.2 Columns Contain Multiple Variables
Sometimes columns in a data set may represent multiple variables. This format is commonly seen when working with health data, for example. To illustrate this situation, let’s look at the Ebola data set.
ebola = pd.read_csv('data/country_timeseries.csv') print(ebola.columns)
Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone', 'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain', 'Cases_Mali', 'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone', 'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates', 'Deaths_Spain', 'Deaths_Mali'], dtype='object')
# print select rows and columns print(ebola.iloc[:5, [0, 1, 2,10]])
Date Day Cases_Guinea Deaths_Guinea 0 1/5/2015 289 2776.0 1786.0 1 1/4/2015 288 2775.0 1781.0 2 1/3/2015 287 2769.0 1767.0 3 1/2/2015 286 NaN NaN 4 12/31/2014 284 2730.0 1739.0
The column names Cases_Guinea and Deaths_Guinea actually contain two variables. The individual status (cases and deaths, respectively) as well as the country name, Guinea. The data is also arranged in a wide format that needs to be reshaped (with the .melt() method).
First, let’s fix the problem we know how to fix, by melting the data into long format.
ebola_long = ebola.melt(id_vars=['Date', 'Day'])
print(ebola_long)
Date Day variable value 0 1/5/2015 289 Cases_Guinea 2776.0 1 1/4/2015 288 Cases_Guinea 2775.0 2 1/3/2015 287 Cases_Guinea 2769.0 3 1/2/2015 286 Cases_Guinea NaN 4 12/31/2014 284 Cases_Guinea 2730.0 ... ... ... ... ... 1947 3/27/2014 5 Deaths_Mali NaN 1948 3/26/2014 4 Deaths_Mali NaN 1949 3/25/2014 3 Deaths_Mali NaN 1950 3/24/2014 2 Deaths_Mali NaN 1951 3/22/2014 0 Deaths_Mali NaN [1952 rows x 4 columns]
Conceptually, the column of interest can be split based on the underscore in the column name, _. The first part will be the new status column, and the second part will be the new country column. This will require some string parsing and splitting in Python (more on this in Chapter 11). In Python, a string is an object, similar to how Pandas has Series and DataFrame objects. Chapter 2 showed how Series can have methods such as .mean(), and DataFrames can have methods such as .to_csv(). Strings have methods as well. In this case, we will use the .split() method that takes a string and “splits” it up based on a given delimiter. By default, .split() will split the string based on a space, but we can pass in the underscore, _, in our example. To get access to the string methods, we need to use the .str. attribute. .str. is a special type of attribute that Pandas calls an “accessor” because it can “access” string methods (see Chapter 11 for more on strings). Access to the Python string methods and allow us to work across the entire column. This will be the key to parting out the multiple bits of information stored in each value.
4.2.1 Split and Add Columns Individually
We can use the .str accessor to make a call to the .split() method and pass in the _ understore.
# get the variable column # access the string methods # and split the column based on a delimiter variable_split = ebola_long.variable.str.split('_')
print(variable_split[:5])
0 [Cases, Guinea] 1 [Cases, Guinea] 2 [Cases, Guinea] 3 [Cases, Guinea] 4 [Cases, Guinea] Name: variable, dtype: object
After we split on the underscore, the values are returned in a list. We can tell it’s a list by:
Knowing about the .split() method on base Python string objects4
Visually seeing the square brackets in the output, [ ]
Getting the type() of one of the items in the Series
# the entire container print(type(variable_split))
<class 'pandas.core.series.Series'>
# the first element in the container print(type(variable_split[0]))
<class 'list'>
Now that the column has been split into various pieces, the next step is to assign those pieces to a new column. First, however, we need to extract all the 0-index elements for the status column and the 1-index elements for the country column. To do so, we need to access the string methods again, and then use the .get() method to “get” the index we want for each row.
status_values = variable_split.str.get(0) country_values = variable_split.str.get(1)
print(status_values)
0 Cases 1 Cases 2 Cases 3 Cases 4 Cases ... 1947 Deaths 1948 Deaths 1949 Deaths 1950 Deaths 1951 Deaths Name: variable, Length: 1952, dtype: object
Now that we have the vectors we want, we can add them to our dataframe.
ebola_long['status'] = status_values ebola_long['country'] = country_values
print(ebola_long)
Date Day variable value status country 0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea 1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea 2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea 3 1/2/2015 286 Cases_Guinea NaN Cases Guinea 4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea ... ... ... ... ... ... ... 1947 3/27/2014 5 Deaths_Mali NaN Deaths Mali 1948 3/26/2014 4 Deaths_Mali NaN Deaths Mali 1949 3/25/2014 3 Deaths_Mali NaN Deaths Mali 1950 3/24/2014 2 Deaths_Mali NaN Deaths Mali 1951 3/22/2014 0 Deaths_Mali NaN Deaths Mali [1952 rows x 6 columns]
4.2.2 Split and Combine in a Single Step
We can actually do the above steps in a single step. If we look at the .str.split() method documentation (you can find this by looking by going to the Pandas API documentation > Series > String Handling (.str.) > .split() method5), there is a parameter named expand that defaults to False, but when we set it to True, it will return a DataFrame where each result of the split is in a separate column, instead of a Series of list containers.
# reset our ebola_long data ebola_long = ebola.melt(id_vars=['Date', 'Day']) # split the column by _ into a dataframe using expand variable_split = ebola_long.variable.str.split('_', expand=True) print(variable_split)
0 1 0 Cases Guinea 1 Cases Guinea 2 Cases Guinea 3 Cases Guinea 4 Cases Guinea ... ... ... 1947 Deaths Mali 1948 Deaths Mali 1949 Deaths Mali 1950 Deaths Mali 1951 Deaths Mali [1952 rows x 2 columns]
From here, we can actually use the Python and Pandas multiple assignment feature (Appendix Q), to directly assign the newly split columns into the original DataFrame. Since our output variable_split returned a DataFrame with two columns, we can assign two new columns to our ebola_long DataFrame.
ebola_long[['status', 'country']] = variable_split
print(ebola_long)
Date Day variable value status country 0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea 1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea 2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea 3 1/2/2015 286 Cases_Guinea NaN Cases Guinea 4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea ... ... ... ... ... ... ... 1947 3/27/2014 5 Deaths_Mali NaN Deaths Mali 1948 3/26/2014 4 Deaths_Mali NaN Deaths Mali 1949 3/25/2014 3 Deaths_Mali NaN Deaths Mali 1950 3/24/2014 2 Deaths_Mali NaN Deaths Mali 1951 3/22/2014 0 Deaths_Mali NaN Deaths Mali [1952 rows x 6 columns]
You can also opt to do this as a concatenation (pd.concat()) function call as well (Chapter 6).