Reshaping and Cleaning Common Data

By Daniel Y. Chen
Feb 21, 2018

📄 Contents

␡

6.1 Introduction
Concept Map
Objectives
6.2 Columns Contain Values, Not Variables
6.3 Columns Contain Multiple Variables
6.4 Variables in Both Rows and Columns
6.5 Multiple Observational Units in a Table (Normalization)
6.6 Observational Units Across Multiple Tables
6.7 Conclusion

⎙ Print

< Back Page 5 of 9 Next >

This chapter is from the book 

Pandas for Everyone: Python Data Analysis

Learn More Buy

6.3 Columns Contain Multiple Variables

Sometimes columns in a data set may represent multiple variables. This format is commonly seen when working with health data, for example. To illustrate this situation, let’s look at the Ebola data set.

ebola = pd.read_csv('../data/country_timeseries.csv')
print(ebola.columns)

Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia',
       'Cases_SierraLeone', 'Cases_Nigeria', 'Cases_Senegal',
       'Cases_UnitedStates', 'Cases_Spain', 'Cases_Mali',
       'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone',
       'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates',
       'Deaths_Spain', 'Deaths_Mali'],
     dtype='object')

# print select rows
print(ebola.iloc[:5, [0, 1, 2, 3, 10, 11]])

         Date  Day  Cases_Guinea  Cases_Liberia  Deaths_Guinea  0    1/5/2015  289        2776.0            NaN         1786.0
1    1/4/2015  288        2775.0            NaN         1781.0
2    1/3/2015  287        2769.0         8166.0         1767.0
3    1/2/2015  286           NaN         8157.0            NaN
4  12/31/2014  284        2730.0         8115.0         1739.0


   Deaths_Liberia
0             NaN
1             NaN
2          3496.0
3          3496.0
4          3471.0

The column names Cases_Guinea and Deaths_Guinea actually contain two variables. The individual status (cases and deaths, respectively) as well as the country name, Guinea. The data is also arranged in a wide format that needs to be unpivoted.

ebola_long = pd.melt(ebola, id_vars=['Date', 'Day'])
print(ebola_long.head())

         Date  Day      variable   value
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0

print(ebola_long.tail())

           Date  Day     variable  value
1947  3/27/2014    5  Deaths_Mali    NaN
1948  3/26/2014    4  Deaths_Mali    NaN
1949  3/25/2014    3  Deaths_Mali    NaN
1950  3/24/2014    2  Deaths_Mali    NaN
1951  3/22/2014    0  Deaths_Mali    NaN

6.3.1 Split and Add Columns Individually (Simple Method)

Conceptually, the column of interest can be split based on the underscore in the column name, _. The first part will be the new status column, and the second part will be the new country column. This will require some string parsing and splitting in Python (more on this in Chapter 8). In Python, a string is an object, similar to how Pandas has Series and DataFrame objects. Chapter 2 showed how Series can have method such as mean, and DataFrames can have methods such as to_csv. Strings have methods as well. In this case we will use the split method that takes a string and splits the string up based on a given delimiter. By default, split will split the string based on a space, but we can pass in the underscore, _, in our example. To get access to the string methods, we need to use the str accessor (see Chapter 8 for more on strings). This will give us access to the Python string methods and allow us to work across the entire column.

# get the variable column
# access the string methods
# and split the column based on a delimiter
variable_split = ebola_long.variable.str.split('_')

print(variable_split[:5])

0    [Cases, Guinea]
1    [Cases, Guinea]
2    [Cases, Guinea]
3    [Cases, Guinea]
4    [Cases, Guinea]
Name: variable, dtype: object

print(variable_split[-5:])

1947    [Deaths, Mali]
1948    [Deaths, Mali]
1949    [Deaths, Mali]
1950    [Deaths, Mali]
1951    [Deaths, Mali]
Name: variable, dtype: object

After we split on the underscore, the values are returned in a list. We know it’s a list because that’s how the split method works,³ but the visual cue is that the results are surrounded by square brackets.

# the entire container
print(type(variable_split))

<class 'pandas.core.series.Series'>

# the first element in the container
print(type(variable_split[0]))

<class 'list'>

Now that the column has been split into the various pieces, the next step is to assign those pieces to a new column. First, however, we need to extract all the 0-index elements for the status column and the 1-index elements for the country column. To do so, we need to access the string methods again, and then use the get method to get the index we want for each row.

status_values = variable_split.str.get(0)
country_values = variable_split.str.get(1)

print(status_values[:5])

0    Cases
1    Cases
2    Cases
3    Cases
4    Cases
Name: variable, dtype: object

print(status_values[-5:])

1947    Deaths
1948    Deaths
1949    Deaths
1950    Deaths
1951    Deaths
Name: variable, dtype: object

print(country_values[:5])

0    Guinea
1    Guinea
2    Guinea
3    Guinea
4    Guinea
Name: variable, dtype: object

print(country_values[-5:])

1947   Mali
1948   Mali
1949   Mali
1950   Mali
1951   Mali
Name: variable, dtype: object

Now that we have the vectors we want, we can add them to our dataframe.

ebola_long['status'] = status_values
ebola_long['country'] = country_values

print(ebola_long.head())

         Date  Day      variable   value status country
0    1/5/2015  289  Cases_Guinea  2776.0  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  Cases  Guinea

6.3.2 Split and Combine in a Single Step (Simple Method)

In this subsection, we’ll exploit the fact that the vector returned is in the same order as our data. We can concatenate (see Chapter 4) the new vector or our original data.

variable_split = ebola_long.variable.str.split('_', expand=True)
variable_split.columns = ['status', 'country']
ebola_parsed = pd.concat([ebola_long, variable_split], axis=1)

print(ebola_parsed.head())

         Date  Day      variable   value status country status  0    1/5/2015  289  Cases_Guinea  2776.0  Cases  Guinea  Cases
1    1/4/2015  288  Cases_Guinea  2775.0  Cases  Guinea  Cases
2    1/3/2015  287  Cases_Guinea  2769.0  Cases  Guinea  Cases
3    1/2/2015  286  Cases_Guinea     NaN  Cases  Guinea  Cases
4  12/31/2014  284  Cases_Guinea  2730.0  Cases  Guinea  Cases

  country
0  Guinea
1  Guinea
2  Guinea
3  Guinea
4  Guinea

print(ebola_parsed.tail())

           Date  Day     variable  value  status country  status  1947  3/27/2014    5  Deaths_Mali    NaN  Deaths    Mali  Deaths
1948  3/26/2014    4  Deaths_Mali    NaN  Deaths    Mali  Deaths
1949  3/25/2014    3  Deaths_Mali    NaN  Deaths    Mali  Deaths
1950  3/24/2014    2  Deaths_Mali    NaN  Deaths    Mali  Deaths
1951  3/22/2014    0  Deaths_Mali    NaN  Deaths    Mali  Deaths

     country
1947    Mali
1948    Mali
1949    Mali
1950    Mali
1951    Mali

6.3.3 Split and Combine in a Single Step (More Complicated Method)

In this subsection, we’ll again exploit the fact that the vector returned is in the same order as our data. We can concatenate (see Chapter 4) the new vector or our original data.

We can accomplish the same result in a single step by taking advantage of the fact that the split results return a list of two elements, where each element is a new column. We can combine the list of split items with the built-in zip function. zip takes a set of iterators (e.g., lists, tuples) and creates a new container that is made of the input iterators, but each new container created has the same index as the input containers. For example, if we have two lists of values,

constants = ['pi', 'e']
values = ['3.14', '2.718']

we can zip the values together:

# we have to call list on the zip function
# to show the contents of the zip object
# in Python 3, zip returns an iterator
print(list(zip(constants, values)))

[('pi', '3.14'), ('e', '2.718')]

Each element now has the constant matched with its corresponding value. Conceptually, each container is like a side of a zipper. When we zip the containers, the indices are matched up and returned.

Another way to visualize what zip is doing is taking each container passed into zip and stacking the containers on top of each other (think about the row-wise concatenation described in Section 4.3.1), thereby creating a dataframe of sorts. zip then returns the values on a column-by-column basis in a tuple.

We can use the same ebola_long.variable.str.split(' ') to split the values in the column. However, since the result is already a container (a Series object), we need to unpack it so that we have the contents of the container (each status–country list), rather than the container itself (the series).

In Python, the asterisk operator, *, is used to unpack containers.⁴ When we zip the unpacked containers, the effect is the same as when we created the status values and the country values earlier. We can then assign the vectors to the columns simultaneously using multiple assignment (Appendix Q).

ebola_long['status'], ebola_long['country'] =     zip(*ebola_long.variable.str.split('_'))

print(ebola_long.head())

         Date  Day      variable   value status country
0    1/5/2015  289  Cases_Guinea  2776.0  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  Cases  Guinea

< Back Page 5 of 9 Next >

🔖 Save To Your Account

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Email Address