Made by HP-Nunes @geocuriosity
import pandas as pd
url = "https://raw.githubusercontent.com/HP-Nunes/cheatSheets/master/input/hepData.csv" # [1]
df = pd.read_csv(url, error_bad_lines=False) # Caution: 'offending' lines will be skipped and not imported
# Note: error_bad_lines is deprecated since pandas 1.3; use on_bad_lines='skip' in newer versions
df.head() # Returns the first 5 records of the dataframe
 | Location | FIPS | Mortality Rate, 1980* | Mortality Rate, 1985* | Mortality Rate, 1990* | Mortality Rate, 1995* | Mortality Rate, 2000* | Mortality Rate, 2005* | Mortality Rate, 2010* | Mortality Rate, 2014* | % Change in Mortality Rate, 1980-2014 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | NaN | 0.49 (0.46, 0.53) | 0.53 (0.50, 0.56) | 0.78 (0.74, 0.82) | 1.42 (1.35, 1.49) | 1.84 (1.75, 1.94) | 0.86 (0.82, 0.91) | 0.30 (0.29, 0.32) | 0.29 (0.27, 0.32) | -40.74 (-48.49, -32.30) |
1 | Alabama | 1.0 | 0.50 (0.45, 0.55) | 0.53 (0.49, 0.58) | 0.81 (0.76, 0.87) | 1.45 (1.36, 1.55) | 1.88 (1.75, 2.00) | 0.81 (0.75, 0.87) | 0.25 (0.23, 0.28) | 0.24 (0.21, 0.27) | -51.75 (-60.02, -43.16) |
2 | Autauga County, Alabama | 1001.0 | 0.37 (0.29, 0.47) | 0.42 (0.33, 0.52) | 0.68 (0.54, 0.83) | 1.18 (0.94, 1.44) | 1.46 (1.16, 1.79) | 0.68 (0.52, 0.86) | 0.20 (0.14, 0.27) | 0.20 (0.13, 0.28) | -46.72 (-63.99, -24.18) |
3 | Baldwin County, Alabama | 1003.0 | 0.43 (0.35, 0.52) | 0.46 (0.39, 0.54) | 0.69 (0.59, 0.81) | 1.19 (1.03, 1.40) | 1.50 (1.28, 1.76) | 0.66 (0.55, 0.78) | 0.22 (0.17, 0.27) | 0.21 (0.16, 0.28) | -51.28 (-64.03, -34.52) |
4 | Barbour County, Alabama | 1005.0 | 0.43 (0.34, 0.53) | 0.45 (0.36, 0.54) | 0.66 (0.53, 0.81) | 1.22 (0.99, 1.47) | 1.69 (1.37, 2.06) | 0.70 (0.55, 0.89) | 0.21 (0.16, 0.29) | 0.20 (0.14, 0.28) | -52.71 (-66.28, -34.42) |
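When only a few of the eleven fields are needed, `read_csv` can load a subset with `usecols`, which skips parsing the rest entirely. A minimal sketch on an inline stand-in for the remote file (the header style mirrors the real file; the rows here are copied from the preview above):

```python
import io
import pandas as pd

# Inline stand-in for the remote CSV; note the quoted fields, since both the
# location names and the column names contain commas.
sample = io.StringIO(
    'Location,FIPS,"Mortality Rate, 2014*"\n'
    '"Autauga County, Alabama",1001.0,"0.20 (0.13, 0.28)"\n'
    '"Baldwin County, Alabama",1003.0,"0.21 (0.16, 0.28)"\n'
)

# usecols loads only the named fields; unlisted columns are never materialized.
subset = pd.read_csv(sample, usecols=['Location', 'FIPS'])
print(subset.columns.tolist())  # ['Location', 'FIPS']
print(len(subset))              # 2
```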
df.tail() # Returns the last 5 records of the dataframe
 | Location | FIPS | Mortality Rate, 1980* | Mortality Rate, 1985* | Mortality Rate, 1990* | Mortality Rate, 1995* | Mortality Rate, 2000* | Mortality Rate, 2005* | Mortality Rate, 2010* | Mortality Rate, 2014* | % Change in Mortality Rate, 1980-2014 |
---|---|---|---|---|---|---|---|---|---|---|---|
3189 | Sweetwater County, Wyoming | 56037.0 | 0.21 (0.17, 0.26) | 0.24 (0.20, 0.29) | 0.44 (0.36, 0.53) | 0.96 (0.80, 1.13) | 1.20 (0.98, 1.45) | 0.54 (0.42, 0.68) | 0.16 (0.12, 0.21) | 0.15 (0.10, 0.21) | -27.23 (-49.28, 2.16) |
3190 | Teton County, Wyoming | 56039.0 | 0.19 (0.15, 0.23) | 0.21 (0.17, 0.26) | 0.37 (0.31, 0.44) | 0.78 (0.66, 0.92) | 0.92 (0.76, 1.10) | 0.40 (0.32, 0.49) | 0.12 (0.09, 0.15) | 0.11 (0.08, 0.15) | -41.52 (-58.04, -19.65) |
3191 | Uinta County, Wyoming | 56041.0 | 0.21 (0.17, 0.27) | 0.25 (0.20, 0.31) | 0.45 (0.35, 0.56) | 0.96 (0.78, 1.16) | 1.20 (0.94, 1.49) | 0.53 (0.39, 0.68) | 0.15 (0.10, 0.21) | 0.15 (0.10, 0.22) | -31.09 (-54.60, 2.87) |
3192 | Washakie County, Wyoming | 56043.0 | 0.21 (0.17, 0.26) | 0.25 (0.19, 0.31) | 0.46 (0.37, 0.55) | 1.03 (0.84, 1.24) | 1.23 (0.97, 1.56) | 0.52 (0.38, 0.68) | 0.14 (0.10, 0.20) | 0.14 (0.09, 0.20) | -34.69 (-56.71, -3.32) |
3193 | Weston County, Wyoming | 56045.0 | 0.18 (0.14, 0.23) | 0.21 (0.17, 0.26) | 0.40 (0.32, 0.49) | 0.90 (0.74, 1.08) | 1.12 (0.89, 1.40) | 0.47 (0.36, 0.60) | 0.12 (0.08, 0.17) | 0.11 (0.07, 0.16) | -39.63 (-57.96, -13.14) |
%%time
# Cell magic: measures the processing time of this cell
print('______________________________________________________________________________________')
print(df.dtypes) # Observe the attribute type of each field
print('______________________________________________________________________________________')
print(df.info()) # Returns display information such as the number of rows and columns, the total memory usage,
# the data type of each column, and the number of non-NaN elements.
print(' ')
______________________________________________________________________________________
Location                                  object
FIPS                                     float64
Mortality Rate, 1980*                     object
Mortality Rate, 1985*                     object
Mortality Rate, 1990*                     object
Mortality Rate, 1995*                     object
Mortality Rate, 2000*                     object
Mortality Rate, 2005*                     object
Mortality Rate, 2010*                     object
Mortality Rate, 2014*                     object
% Change in Mortality Rate, 1980-2014     object
dtype: object
______________________________________________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3194 entries, 0 to 3193
Data columns (total 11 columns):
Location                                 3194 non-null object
FIPS                                     3193 non-null float64
Mortality Rate, 1980*                    3194 non-null object
Mortality Rate, 1985*                    3194 non-null object
Mortality Rate, 1990*                    3194 non-null object
Mortality Rate, 1995*                    3194 non-null object
Mortality Rate, 2000*                    3194 non-null object
Mortality Rate, 2005*                    3194 non-null object
Mortality Rate, 2010*                    3194 non-null object
Mortality Rate, 2014*                    3194 non-null object
% Change in Mortality Rate, 1980-2014    3194 non-null object
dtypes: float64(1), object(10)
memory usage: 274.6+ KB
None
CPU times: user 8.15 ms, sys: 0 ns, total: 8.15 ms
Wall time: 6.28 ms
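Ten of the eleven columns come back as `object` (strings). A quick way to list every column that still needs numeric cleanup is `select_dtypes`. A small sketch on a toy frame (the frame itself is made up; it just mimics the dtype mix above):

```python
import pandas as pd

# Toy frame mimicking the mix of dtypes seen above: one clean float column,
# two string (object) columns.
toy = pd.DataFrame({
    'Location': ['Alabama', 'Alaska'],
    'FIPS': [1.0, 2.0],
    'Mortality Rate, 2014*': ['0.24 (0.21, 0.27)', '0.30 (0.27, 0.33)'],
})

# select_dtypes filters columns by dtype; 'object' catches the string columns.
needs_cleaning = toy.select_dtypes(include='object').columns.tolist()
print(needs_cleaning)  # ['Location', 'Mortality Rate, 2014*']
```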
print('The dataframe has ' + str(len(df)) + ' rows') # Prints the number of rows/records
print('The dataframe has ' + str(len(df.columns)) + ' columns') # Prints the number of columns/fields
print('The dataframe has ' + str(df.shape) + ' rows and columns') # My personal favorite
print('The dataframe has ' + str(df.size) + ' total values') # I.e. rows * columns
The dataframe has 3194 rows
The dataframe has 11 columns
The dataframe has (3194, 11) rows and columns
The dataframe has 35134 total values
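The four counts above are related: `size` is simply the product of the two numbers in `shape`. A quick check on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rows, cols = toy.shape          # shape unpacks into (rows, columns)
print(rows, cols, toy.size)     # 3 2 6
assert toy.size == rows * cols  # size = rows * columns
```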
print('These are the values we want to split into three separate new fields:')
df['% Change in Mortality Rate, 1980-2014'].head()
These are the values we want to split into three separate new fields:
0    -40.74 (-48.49, -32.30)
1    -51.75 (-60.02, -43.16)
2    -46.72 (-63.99, -24.18)
3    -51.28 (-64.03, -34.52)
4    -52.71 (-66.28, -34.42)
Name: % Change in Mortality Rate, 1980-2014, dtype: object
df[['CI Lower Boundary','CI Upper Boundary']] = df['% Change in Mortality Rate, 1980-2014'].str.split(",", expand=True)
# Splits the values into separate fields at the comma.
df['CI Upper Boundary'] = df['CI Upper Boundary'].str.strip(')')
# Removes the closing parenthesis.
df['CI Lower Boundary'] = df['CI Lower Boundary'].str.split('(').str[1]
# Removes the parenthesis and everything left of it (the index [1] keeps the part after the split).
df['Mortality Rate'] = df['% Change in Mortality Rate, 1980-2014'].str.split('(').str[0]
# Removes the parenthesis and everything right of it (the index [0] keeps the part before the split).
print('Outcome successful:')
df[['% Change in Mortality Rate, 1980-2014','CI Lower Boundary','CI Upper Boundary','Mortality Rate']].head()
Outcome successful:
 | % Change in Mortality Rate, 1980-2014 | CI Lower Boundary | CI Upper Boundary | Mortality Rate |
---|---|---|---|---|
0 | -40.74 (-48.49, -32.30) | -48.49 | -32.30 | -40.74 |
1 | -51.75 (-60.02, -43.16) | -60.02 | -43.16 | -51.75 |
2 | -46.72 (-63.99, -24.18) | -63.99 | -24.18 | -46.72 |
3 | -51.28 (-64.03, -34.52) | -64.03 | -34.52 | -51.28 |
4 | -52.71 (-66.28, -34.42) | -66.28 | -34.42 | -52.71 |
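The three-step split above can also be done in a single pass with `str.extract`, using one named capture group per new field. A sketch on two of the values shown above (the regex is an assumption about the `rate (lower, upper)` pattern; the column names `rate`/`lower`/`upper` are illustrative):

```python
import pandas as pd

s = pd.Series(['-40.74 (-48.49, -32.30)', '-51.75 (-60.02, -43.16)'])

# One named capture group per output column: rate, CI lower, CI upper.
pattern = r'(?P<rate>-?\d+\.\d+) \((?P<lower>-?\d+\.\d+), (?P<upper>-?\d+\.\d+)\)'
parts = s.str.extract(pattern)

print(parts['rate'].tolist())   # ['-40.74', '-51.75'] -- still strings at this point
print(parts['lower'].tolist())  # ['-48.49', '-60.02']
```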
print('Observe the field type; we cannot do math if the values are recognized as strings (i.e. text):')
df[['CI Lower Boundary','CI Upper Boundary']].dtypes
Observe the field type; we cannot do math if the values are recognized as strings (i.e. text):
CI Lower Boundary    object
CI Upper Boundary    object
dtype: object
print('Both fields are stored as strings, so we need to convert them to floats in order to find the margin of error')
df['CI Upper Boundary'] = df['CI Upper Boundary'].astype(float)
df['CI Lower Boundary'] = df['CI Lower Boundary'].astype(float)
df['Margin of Error'] = (df['CI Upper Boundary'] - df['CI Lower Boundary']) / 2
df['Margin of Error'].head()
Both fields are stored as strings, so we need to convert them to floats in order to find the margin of error
0     8.095
1     8.430
2    19.905
3    14.755
4    15.930
Name: Margin of Error, dtype: float64
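One caveat: `astype(float)` raises an error if any value fails to parse. `pd.to_numeric` with `errors='coerce'` is a more forgiving alternative that turns unparseable entries into NaN instead of stopping the whole conversion. A sketch with a deliberately bad value (the sample data is made up):

```python
import pandas as pd

raw = pd.Series(['-48.49', ' -32.30', 'n/a'])  # stray whitespace and one bad value

# errors='coerce' converts what it can and maps the rest to NaN.
clean = pd.to_numeric(raw, errors='coerce')
print(clean.tolist())  # [-48.49, -32.3, nan]
```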
df1 = pd.read_csv("input/pool.csv")
print(df1.dtypes)
df1.head()
Aquatic Programs Count    float64
day                        object
Day of Week                object
Family Swim Count         float64
Lap Swim Count            float64
Outage Count              float64
tally_Time                 object
Water Excercise Count     float64
dtype: object
 | Aquatic Programs Count | day | Day of Week | Family Swim Count | Lap Swim Count | Outage Count | tally_Time | Water Excercise Count |
---|---|---|---|---|---|---|---|---|
0 | NaN | 03/03/2020 | Tuesday | NaN | NaN | NaN | 05:45:00 | NaN |
1 | NaN | 03/03/2020 | Tuesday | NaN | NaN | NaN | 06:15:00 | NaN |
2 | NaN | 03/03/2020 | Tuesday | NaN | NaN | NaN | 06:45:00 | NaN |
3 | NaN | 03/03/2020 | Tuesday | NaN | NaN | NaN | 07:15:00 | NaN |
4 | NaN | 03/03/2020 | Tuesday | NaN | NaN | NaN | 07:45:00 | NaN |
We have identified our fields of interest: "day" and "tally_Time"
df1['dateTime'] = pd.to_datetime(df1['day'] + ' ' + df1['tally_Time']) # Combine the fields and convert to datetime
df1 = df1.sort_values(by='dateTime') # Sort chronologically by date and hour
df1 = df1.drop(['day','tally_Time'], axis=1) # Remove the unwanted/redundant columns
print(df1.dtypes)
df1.head()
Aquatic Programs Count           float64
Day of Week                       object
Family Swim Count                float64
Lap Swim Count                   float64
Outage Count                     float64
Water Excercise Count            float64
dateTime                  datetime64[ns]
dtype: object
 | Aquatic Programs Count | Day of Week | Family Swim Count | Lap Swim Count | Outage Count | Water Excercise Count | dateTime |
---|---|---|---|---|---|---|---|
64 | NaN | Tuesday | NaN | 1.0 | NaN | NaN | 2019-01-08 05:45:00 |
65 | NaN | Tuesday | NaN | 5.0 | NaN | NaN | 2019-01-08 06:15:00 |
66 | NaN | Tuesday | NaN | 7.0 | NaN | NaN | 2019-01-08 06:45:00 |
67 | NaN | Tuesday | NaN | 9.0 | NaN | NaN | 2019-01-08 07:15:00 |
68 | NaN | Tuesday | NaN | 7.0 | NaN | NaN | 2019-01-08 07:45:00 |
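With dates like `03/03/2020`, pandas has to guess whether the text is month-first or day-first; passing `format=` makes the parse explicit (and faster on large columns). A sketch assuming the pool log uses US-style month/day/year dates (the two sample values are taken from the tables above):

```python
import pandas as pd

day = pd.Series(['03/03/2020', '01/08/2019'])
time = pd.Series(['05:45:00', '06:15:00'])

# Explicit format '%m/%d/%Y %H:%M:%S': no guessing between 3 March and March 3.
dt = pd.to_datetime(day + ' ' + time, format='%m/%d/%Y %H:%M:%S')
print(dt.dt.month.tolist())  # [3, 1]
print(dt.dt.hour.tolist())   # [5, 6]
```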
df.to_csv('output/hepData_edit.csv') # Write the edited dataframe to disk; add index=False to omit the row-index column
[1] United States Infectious Disease Mortality Rates by County 1980-2014 | GHDx. [online] Available at: http://ghdx.healthdata.org/record/ihme-data/united-states-infectious-disease-mortality-rates-county-1980-2014 [Accessed 5 Mar. 2020].