Data Cleaning in Python Pandas


Tips & Tricks to overcome the mundane grind of data cleaning.

Made by HP-Nunes @geocuriosity
Updated March 10th, 2020

Import Dependencies

In [2]:
import pandas as pd

Import Test Data (using pd.read_csv)

In [16]:
url = "https://raw.githubusercontent.com/HP-Nunes/cheatSheets/master/input/hepData.csv" # [1]
df = pd.read_csv(url, error_bad_lines=False) # Caution: 'Offending lines' will be skipped and not imported
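Note that `error_bad_lines` was deprecated in pandas 1.3 in favor of `on_bad_lines`; a minimal sketch of the newer spelling, run on a small inline CSV (the data here is made up for illustration):

```python
import pandas as pd
from io import StringIO

# A tiny CSV where the third line has too many fields
raw = "a,b\n1,2\n3,4,5\n6,7\n"

# pandas >= 1.3 replaced error_bad_lines=False with on_bad_lines='skip'
df = pd.read_csv(StringIO(raw), on_bad_lines="skip")
print(df)
```

The malformed row is dropped silently, so it is worth logging row counts before and after when data loss matters.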

Pandas Essentials: Examining your dataframe

In [42]:
df.head() # Returns the first 5 records of the dataframe
Out[42]:
Location FIPS Mortality Rate, 1980* Mortality Rate, 1985* Mortality Rate, 1990* Mortality Rate, 1995* Mortality Rate, 2000* Mortality Rate, 2005* Mortality Rate, 2010* Mortality Rate, 2014* % Change in Mortality Rate, 1980-2014
0 United States NaN 0.49 (0.46, 0.53) 0.53 (0.50, 0.56) 0.78 (0.74, 0.82) 1.42 (1.35, 1.49) 1.84 (1.75, 1.94) 0.86 (0.82, 0.91) 0.30 (0.29, 0.32) 0.29 (0.27, 0.32) -40.74 (-48.49, -32.30)
1 Alabama 1.0 0.50 (0.45, 0.55) 0.53 (0.49, 0.58) 0.81 (0.76, 0.87) 1.45 (1.36, 1.55) 1.88 (1.75, 2.00) 0.81 (0.75, 0.87) 0.25 (0.23, 0.28) 0.24 (0.21, 0.27) -51.75 (-60.02, -43.16)
2 Autauga County, Alabama 1001.0 0.37 (0.29, 0.47) 0.42 (0.33, 0.52) 0.68 (0.54, 0.83) 1.18 (0.94, 1.44) 1.46 (1.16, 1.79) 0.68 (0.52, 0.86) 0.20 (0.14, 0.27) 0.20 (0.13, 0.28) -46.72 (-63.99, -24.18)
3 Baldwin County, Alabama 1003.0 0.43 (0.35, 0.52) 0.46 (0.39, 0.54) 0.69 (0.59, 0.81) 1.19 (1.03, 1.40) 1.50 (1.28, 1.76) 0.66 (0.55, 0.78) 0.22 (0.17, 0.27) 0.21 (0.16, 0.28) -51.28 (-64.03, -34.52)
4 Barbour County, Alabama 1005.0 0.43 (0.34, 0.53) 0.45 (0.36, 0.54) 0.66 (0.53, 0.81) 1.22 (0.99, 1.47) 1.69 (1.37, 2.06) 0.70 (0.55, 0.89) 0.21 (0.16, 0.29) 0.20 (0.14, 0.28) -52.71 (-66.28, -34.42)
In [41]:
df.tail() # Returns the last 5 records of the dataframe
Out[41]:
Location FIPS Mortality Rate, 1980* Mortality Rate, 1985* Mortality Rate, 1990* Mortality Rate, 1995* Mortality Rate, 2000* Mortality Rate, 2005* Mortality Rate, 2010* Mortality Rate, 2014* % Change in Mortality Rate, 1980-2014
3189 Sweetwater County, Wyoming 56037.0 0.21 (0.17, 0.26) 0.24 (0.20, 0.29) 0.44 (0.36, 0.53) 0.96 (0.80, 1.13) 1.20 (0.98, 1.45) 0.54 (0.42, 0.68) 0.16 (0.12, 0.21) 0.15 (0.10, 0.21) -27.23 (-49.28, 2.16)
3190 Teton County, Wyoming 56039.0 0.19 (0.15, 0.23) 0.21 (0.17, 0.26) 0.37 (0.31, 0.44) 0.78 (0.66, 0.92) 0.92 (0.76, 1.10) 0.40 (0.32, 0.49) 0.12 (0.09, 0.15) 0.11 (0.08, 0.15) -41.52 (-58.04, -19.65)
3191 Uinta County, Wyoming 56041.0 0.21 (0.17, 0.27) 0.25 (0.20, 0.31) 0.45 (0.35, 0.56) 0.96 (0.78, 1.16) 1.20 (0.94, 1.49) 0.53 (0.39, 0.68) 0.15 (0.10, 0.21) 0.15 (0.10, 0.22) -31.09 (-54.60, 2.87)
3192 Washakie County, Wyoming 56043.0 0.21 (0.17, 0.26) 0.25 (0.19, 0.31) 0.46 (0.37, 0.55) 1.03 (0.84, 1.24) 1.23 (0.97, 1.56) 0.52 (0.38, 0.68) 0.14 (0.10, 0.20) 0.14 (0.09, 0.20) -34.69 (-56.71, -3.32)
3193 Weston County, Wyoming 56045.0 0.18 (0.14, 0.23) 0.21 (0.17, 0.26) 0.40 (0.32, 0.49) 0.90 (0.74, 1.08) 1.12 (0.89, 1.40) 0.47 (0.36, 0.60) 0.12 (0.08, 0.17) 0.11 (0.07, 0.16) -39.63 (-57.96, -13.14)
In [49]:
%%time
# To observe the processing time of a cell

print('______________________________________________________________________________________')
print(df.dtypes) # Observe the attribute type of each field
print('______________________________________________________________________________________')
print(df.info()) # Prints summary information: the number of rows and columns, the total memory usage,
                 # each column's data type, and the number of non-null values.
print(' ')
______________________________________________________________________________________
Location                                  object
FIPS                                     float64
Mortality Rate, 1980*                     object
Mortality Rate, 1985*                     object
Mortality Rate, 1990*                     object
Mortality Rate, 1995*                     object
Mortality Rate, 2000*                     object
Mortality Rate, 2005*                     object
Mortality Rate, 2010*                     object
Mortality Rate, 2014*                     object
% Change in Mortality Rate, 1980-2014     object
dtype: object
______________________________________________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3194 entries, 0 to 3193
Data columns (total 11 columns):
Location                                 3194 non-null object
FIPS                                     3193 non-null float64
Mortality Rate, 1980*                    3194 non-null object
Mortality Rate, 1985*                    3194 non-null object
Mortality Rate, 1990*                    3194 non-null object
Mortality Rate, 1995*                    3194 non-null object
Mortality Rate, 2000*                    3194 non-null object
Mortality Rate, 2005*                    3194 non-null object
Mortality Rate, 2010*                    3194 non-null object
Mortality Rate, 2014*                    3194 non-null object
% Change in Mortality Rate, 1980-2014    3194 non-null object
dtypes: float64(1), object(10)
memory usage: 274.6+ KB
None
 
CPU times: user 8.15 ms, sys: 0 ns, total: 8.15 ms
Wall time: 6.28 ms
In [54]:
print('The dataframe has ' + str(len(df)) + ' rows') # Prints the number of rows/records
print('The dataframe has ' + str(len(df.columns)) + ' columns') # Prints the number of columns/fields
print('The dataframe has ' + str(df.shape) + ' rows and columns') # My personal favorite
print('The dataframe has ' + str(df.size) + ' total records') # I.e. rows * columns
The dataframe has 3194 rows
The dataframe has 11 columns
The dataframe has (3194, 11) rows and columns
The dataframe has 35134 total records
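The relationship between `shape` and `size` noted above (size = rows * columns) can be checked directly on a toy frame:

```python
import pandas as pd

# Toy frame: 3 rows, 2 columns
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

rows, cols = df.shape
print(rows, cols, df.size)  # size == rows * cols
```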

Example 1: Splitting the numerical values of a field into new fields


For the field '% Change in Mortality Rate, 1980-2014', I'd like the confidence-interval bounds and the change in mortality rate in separate fields: 'CI Lower Boundary', 'CI Upper Boundary', and 'Mortality Rate', respectively.

In [58]:
print('These are the values we want to split into three separate new fields:')
df['% Change in Mortality Rate, 1980-2014'].head()
These are the values we want to split into three separate new fields:
Out[58]:
0    -40.74 (-48.49, -32.30)
1    -51.75 (-60.02, -43.16)
2    -46.72 (-63.99, -24.18)
3    -51.28 (-64.03, -34.52)
4    -52.71 (-66.28, -34.42)
Name: % Change in Mortality Rate, 1980-2014, dtype: object
In [65]:
df[['CI Lower Boundary','CI Upper Boundary']] = df['% Change in Mortality Rate, 1980-2014'].str.split(",", expand=True)
    # Split the values into separate fields at the comma.

df['CI Upper Boundary'] = df['CI Upper Boundary'].str.strip(')')
    # Remove the trailing parenthesis.
df['CI Lower Boundary'] = df['CI Lower Boundary'].str.split('(').str[1]
    # Keep only what follows the opening parenthesis (index [1] of the split).
df['Mortality Rate'] = df['% Change in Mortality Rate, 1980-2014'].str.split('(').str[0]
    # Keep only what precedes the opening parenthesis (index [0] of the split).
In [66]:
print('Outcome successful:')
df[['% Change in Mortality Rate, 1980-2014','CI Lower Boundary','CI Upper Boundary','Mortality Rate']].head()
Outcome successful:
Out[66]:
% Change in Mortality Rate, 1980-2014 CI Lower Boundary CI Upper Boundary Mortality Rate
0 -40.74 (-48.49, -32.30) -48.49 -32.30 -40.74
1 -51.75 (-60.02, -43.16) -60.02 -43.16 -51.75
2 -46.72 (-63.99, -24.18) -63.99 -24.18 -46.72
3 -51.28 (-64.03, -34.52) -64.03 -34.52 -51.28
4 -52.71 (-66.28, -34.42) -66.28 -34.42 -52.71
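The three splits above can also be done in a single pass with `str.extract` and named capture groups; a sketch on a couple of the values from the table (the group names here are illustrative, not the field names used above):

```python
import pandas as pd

s = pd.Series(["-40.74 (-48.49, -32.30)", "-51.75 (-60.02, -43.16)"])

# One named group per target field: the rate, then the lower/upper CI bounds
pattern = r"(?P<rate>-?[\d.]+)\s*\((?P<lower>-?[\d.]+),\s*(?P<upper>-?[\d.]+)\)"
parts = s.str.extract(pattern)
print(parts)
```

The extracted columns are still strings, so a numeric conversion (as in Example 2) is needed before doing math on them.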

Example 2: Converting field types and doing arithmetic between fields


I want to find the margin of error of each confidence interval and write it to a new field named 'Margin of Error'.

In [89]:
print('Observe the field types; we cannot do math while the values are stored as strings (i.e. text):')
df[['CI Lower Boundary','CI Upper Boundary']].dtypes
Observe the field types; we cannot do math while the values are stored as strings (i.e. text):
Out[89]:
CI Lower Boundary    object
CI Upper Boundary    object
dtype: object
In [93]:
print('Both fields are identified as strings, so we need to convert them to floats in order to compute the margin of error')

df['CI Upper Boundary'] = df['CI Upper Boundary'].astype(float)
df['CI Lower Boundary'] = df['CI Lower Boundary'].astype(float)
df['Margin of Error'] = (df['CI Upper Boundary'] - df['CI Lower Boundary']) / 2

df['Margin of Error'].head()
Both fields are identified as strings, so we need to convert them to floats in order to compute the margin of error
Out[93]:
0     8.095
1     8.430
2    19.905
3    14.755
4    15.930
Name: Margin of Error, dtype: float64
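One caveat with `astype(float)`: it raises an exception on any value it cannot parse. On messy data, `pd.to_numeric` with `errors='coerce'` is often safer, turning unparseable entries into NaN instead. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["-48.49", "-60.02", "n/a"])  # 'n/a' would break astype(float)

nums = pd.to_numeric(s, errors="coerce")  # unparseable entries become NaN
print(nums)
```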

Example 3: Combining days and hours into a new column, and converting to DateTime.


I want to combine two columns, a date and a time, into a new field, then convert that field to DateTime so I can sort it chronologically and plot time series.

In [3]:
df1 = pd.read_csv("input/pool.csv")
In [9]:
print(df1.dtypes)
df1.head()
Aquatic Programs Count    float64
day                        object
Day of Week                object
Family Swim Count         float64
Lap Swim Count            float64
Outage Count              float64
tally_Time                 object
Water Excercise Count     float64
dtype: object
Out[9]:
Aquatic Programs Count day Day of Week Family Swim Count Lap Swim Count Outage Count tally_Time Water Excercise Count
0 NaN 03/03/2020 Tuesday NaN NaN NaN 05:45:00 NaN
1 NaN 03/03/2020 Tuesday NaN NaN NaN 06:15:00 NaN
2 NaN 03/03/2020 Tuesday NaN NaN NaN 06:45:00 NaN
3 NaN 03/03/2020 Tuesday NaN NaN NaN 07:15:00 NaN
4 NaN 03/03/2020 Tuesday NaN NaN NaN 07:45:00 NaN

We have identified our fields of interest: "day" and "tally_Time"

In [12]:
df1['dateTime'] = pd.to_datetime(df1['day'] + ' ' + df1['tally_Time']) # Combine the fields and convert to DateTime
df1 = df1.sort_values(by='dateTime') # Sort chronologically by date and hour
df1 = df1.drop(['day','tally_Time'], axis=1) # Remove unwanted/redundant columns
print(df1.dtypes)

df1.head()
Aquatic Programs Count           float64
Day of Week                       object
Family Swim Count                float64
Lap Swim Count                   float64
Outage Count                     float64
Water Excercise Count            float64
dateTime                  datetime64[ns]
dtype: object
Out[12]:
Aquatic Programs Count Day of Week Family Swim Count Lap Swim Count Outage Count Water Excercise Count dateTime
64 NaN Tuesday NaN 1.0 NaN NaN 2019-01-08 05:45:00
65 NaN Tuesday NaN 5.0 NaN NaN 2019-01-08 06:15:00
66 NaN Tuesday NaN 7.0 NaN NaN 2019-01-08 06:45:00
67 NaN Tuesday NaN 9.0 NaN NaN 2019-01-08 07:15:00
68 NaN Tuesday NaN 7.0 NaN NaN 2019-01-08 07:45:00
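Without a format string, `pd.to_datetime` has to guess whether dates are month-first or day-first. Passing an explicit `format` avoids that ambiguity (and is faster on large frames). A sketch on made-up rows shaped like the 'day' and 'tally_Time' columns above:

```python
import pandas as pd

# Made-up rows in the same shape as the 'day' and 'tally_Time' columns
df = pd.DataFrame({
    "day": ["03/03/2020", "01/08/2019"],
    "tally_Time": ["05:45:00", "06:15:00"],
})

# An explicit format: month/day/year plus a 24-hour time
df["dateTime"] = pd.to_datetime(df["day"] + " " + df["tally_Time"],
                                format="%m/%d/%Y %H:%M:%S")
df = df.sort_values(by="dateTime")
print(df["dateTime"].tolist())
```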

Export the edited dataset to CSV

In [94]:
df.to_csv('output/hepData_edit.csv')
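By default `to_csv` writes the dataframe's index as an unnamed first column; pass `index=False` if you don't want it in the file. A sketch writing to an in-memory buffer instead of a path:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

buf = StringIO()
df.to_csv(buf, index=False)  # omit the index column from the output
print(buf.getvalue())
```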

About the Data

[1] United States Infectious Disease Mortality Rates by County 1980-2014 | GHDx. [online] Available at: http://ghdx.healthdata.org/record/ihme-data/united-states-infectious-disease-mortality-rates-county-1980-2014 [Accessed 5 Mar. 2020].