Made by HP-Nunes @geocuriosity
import pandas as pd
url = "https://raw.githubusercontent.com/HP-Nunes/cheatSheets/master/input/hepData.csv" # [1]
df = pd.read_csv(url, error_bad_lines=False) # Caution: 'offending' lines will be skipped and not imported
# Note: error_bad_lines is deprecated since pandas 1.3; use on_bad_lines='skip' in newer versions
df.head() # Returns the first 5 records of the dataframe
 | Location | FIPS | Mortality Rate, 1980* | Mortality Rate, 1985* | Mortality Rate, 1990* | Mortality Rate, 1995* | Mortality Rate, 2000* | Mortality Rate, 2005* | Mortality Rate, 2010* | Mortality Rate, 2014* | % Change in Mortality Rate, 1980-2014 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | NaN | 0.49 (0.46, 0.53) | 0.53 (0.50, 0.56) | 0.78 (0.74, 0.82) | 1.42 (1.35, 1.49) | 1.84 (1.75, 1.94) | 0.86 (0.82, 0.91) | 0.30 (0.29, 0.32) | 0.29 (0.27, 0.32) | -40.74 (-48.49, -32.30) |
1 | Alabama | 1.0 | 0.50 (0.45, 0.55) | 0.53 (0.49, 0.58) | 0.81 (0.76, 0.87) | 1.45 (1.36, 1.55) | 1.88 (1.75, 2.00) | 0.81 (0.75, 0.87) | 0.25 (0.23, 0.28) | 0.24 (0.21, 0.27) | -51.75 (-60.02, -43.16) |
2 | Autauga County, Alabama | 1001.0 | 0.37 (0.29, 0.47) | 0.42 (0.33, 0.52) | 0.68 (0.54, 0.83) | 1.18 (0.94, 1.44) | 1.46 (1.16, 1.79) | 0.68 (0.52, 0.86) | 0.20 (0.14, 0.27) | 0.20 (0.13, 0.28) | -46.72 (-63.99, -24.18) |
3 | Baldwin County, Alabama | 1003.0 | 0.43 (0.35, 0.52) | 0.46 (0.39, 0.54) | 0.69 (0.59, 0.81) | 1.19 (1.03, 1.40) | 1.50 (1.28, 1.76) | 0.66 (0.55, 0.78) | 0.22 (0.17, 0.27) | 0.21 (0.16, 0.28) | -51.28 (-64.03, -34.52) |
4 | Barbour County, Alabama | 1005.0 | 0.43 (0.34, 0.53) | 0.45 (0.36, 0.54) | 0.66 (0.53, 0.81) | 1.22 (0.99, 1.47) | 1.69 (1.37, 2.06) | 0.70 (0.55, 0.89) | 0.21 (0.16, 0.29) | 0.20 (0.14, 0.28) | -52.71 (-66.28, -34.42) |
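When only a few of the eleven fields are needed, `read_csv` can load a subset with `usecols`, which skips parsing the rest entirely. A minimal sketch on an inline stand-in for the remote file (the header style mirrors the real file; the rows here are copied from the preview above):

```python
import io
import pandas as pd

# Inline stand-in for the remote CSV; note the quoted fields, since both the
# location names and the column names contain commas.
sample = io.StringIO(
    'Location,FIPS,"Mortality Rate, 2014*"\n'
    '"Autauga County, Alabama",1001.0,"0.20 (0.13, 0.28)"\n'
    '"Baldwin County, Alabama",1003.0,"0.21 (0.16, 0.28)"\n'
)

# usecols loads only the named fields; unlisted columns are never materialized.
subset = pd.read_csv(sample, usecols=['Location', 'FIPS'])
print(subset.columns.tolist())  # ['Location', 'FIPS']
print(len(subset))              # 2
```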
df.tail() # Returns the last 5 records of the dataframe
 | Location | FIPS | Mortality Rate, 1980* | Mortality Rate, 1985* | Mortality Rate, 1990* | Mortality Rate, 1995* | Mortality Rate, 2000* | Mortality Rate, 2005* | Mortality Rate, 2010* | Mortality Rate, 2014* | % Change in Mortality Rate, 1980-2014 |
---|---|---|---|---|---|---|---|---|---|---|---|
3189 | Sweetwater County, Wyoming | 56037.0 | 0.21 (0.17, 0.26) | 0.24 (0.20, 0.29) | 0.44 (0.36, 0.53) | 0.96 (0.80, 1.13) | 1.20 (0.98, 1.45) | 0.54 (0.42, 0.68) | 0.16 (0.12, 0.21) | 0.15 (0.10, 0.21) | -27.23 (-49.28, 2.16) |
3190 | Teton County, Wyoming | 56039.0 | 0.19 (0.15, 0.23) | 0.21 (0.17, 0.26) | 0.37 (0.31, 0.44) | 0.78 (0.66, 0.92) | 0.92 (0.76, 1.10) | 0.40 (0.32, 0.49) | 0.12 (0.09, 0.15) | 0.11 (0.08, 0.15) | -41.52 (-58.04, -19.65) |
3191 | Uinta County, Wyoming | 56041.0 | 0.21 (0.17, 0.27) | 0.25 (0.20, 0.31) | 0.45 (0.35, 0.56) | 0.96 (0.78, 1.16) | 1.20 (0.94, 1.49) | 0.53 (0.39, 0.68) | 0.15 (0.10, 0.21) | 0.15 (0.10, 0.22) | -31.09 (-54.60, 2.87) |
3192 | Washakie County, Wyoming | 56043.0 | 0.21 (0.17, 0.26) | 0.25 (0.19, 0.31) | 0.46 (0.37, 0.55) | 1.03 (0.84, 1.24) | 1.23 (0.97, 1.56) | 0.52 (0.38, 0.68) | 0.14 (0.10, 0.20) | 0.14 (0.09, 0.20) | -34.69 (-56.71, -3.32) |
3193 | Weston County, Wyoming | 56045.0 | 0.18 (0.14, 0.23) | 0.21 (0.17, 0.26) | 0.40 (0.32, 0.49) | 0.90 (0.74, 1.08) | 1.12 (0.89, 1.40) | 0.47 (0.36, 0.60) | 0.12 (0.08, 0.17) | 0.11 (0.07, 0.16) | -39.63 (-57.96, -13.14) |
%%time
# Cell magic: measures the processing time of this cell
print('______________________________________________________________________________________')
print(df.dtypes) # Observe the attribute type of each field
print('______________________________________________________________________________________')
print(df.info()) # Returns display information such as the number of rows and columns, the total memory usage,
# the data type of each column, and the number of non-NaN elements.
print(' ')
______________________________________________________________________________________
Location                                  object
FIPS                                     float64
Mortality Rate, 1980*                     object
Mortality Rate, 1985*                     object
Mortality Rate, 1990*                     object
Mortality Rate, 1995*                     object
Mortality Rate, 2000*                     object
Mortality Rate, 2005*                     object
Mortality Rate, 2010*                     object
Mortality Rate, 2014*                     object
% Change in Mortality Rate, 1980-2014     object
dtype: object
______________________________________________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3194 entries, 0 to 3193
Data columns (total 11 columns):
Location                                 3194 non-null object
FIPS                                     3193 non-null float64
Mortality Rate, 1980*                    3194 non-null object
Mortality Rate, 1985*                    3194 non-null object
Mortality Rate, 1990*                    3194 non-null object
Mortality Rate, 1995*                    3194 non-null object
Mortality Rate, 2000*                    3194 non-null object
Mortality Rate, 2005*                    3194 non-null object
Mortality Rate, 2010*                    3194 non-null object
Mortality Rate, 2014*                    3194 non-null object
% Change in Mortality Rate, 1980-2014    3194 non-null object
dtypes: float64(1), object(10)
memory usage: 274.6+ KB
None
CPU times: user 8.15 ms, sys: 0 ns, total: 8.15 ms
Wall time: 6.28 ms
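Ten of the eleven columns come back as `object` (strings). A quick way to list every column that still needs numeric cleanup is `select_dtypes`. A small sketch on a toy frame (the frame itself is made up; it just mimics the dtype mix above):

```python
import pandas as pd

# Toy frame mimicking the mix of dtypes seen above: one clean float column,
# two string (object) columns.
toy = pd.DataFrame({
    'Location': ['Alabama', 'Alaska'],
    'FIPS': [1.0, 2.0],
    'Mortality Rate, 2014*': ['0.24 (0.21, 0.27)', '0.30 (0.27, 0.33)'],
})

# select_dtypes filters columns by dtype; 'object' catches the string columns.
needs_cleaning = toy.select_dtypes(include='object').columns.tolist()
print(needs_cleaning)  # ['Location', 'Mortality Rate, 2014*']
```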
print('The dataframe has ' + str(len(df)) + ' rows') # Prints the number of rows/records
print('The dataframe has ' + str(len(df.columns)) + ' columns') # Prints the number of columns/fields
print('The dataframe has ' + str(df.shape) + ' rows and columns') # My personal favorite
print('The dataframe has ' + str(df.size) + ' total values') # I.e. rows * columns
The dataframe has 3194 rows
The dataframe has 11 columns
The dataframe has (3194, 11) rows and columns
The dataframe has 35134 total values
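The four counts above are related: `size` is simply the product of the two numbers in `shape`. A quick check on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rows, cols = toy.shape          # shape unpacks into (rows, columns)
print(rows, cols, toy.size)     # 3 2 6
assert toy.size == rows * cols  # size = rows * columns
```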
print('These are the values we want to split into three separate new fields:')
df['% Change in Mortality Rate, 1980-2014'].head()
These are the values we want to split into three separate new fields:
0    -40.74 (-48.49, -32.30)
1    -51.75 (-60.02, -43.16)
2    -46.72 (-63.99, -24.18)
3    -51.28 (-64.03, -34.52)
4    -52.71 (-66.28, -34.42)
Name: % Change in Mortality Rate, 1980-2014, dtype: object
df[['CI Lower Boundary','CI Upper Boundary']] = df['% Change in Mortality Rate, 1980-2014'].str.split(",", expand=True)
# Splits the values into separate fields at the comma.
df['CI Upper Boundary'] = df['CI Upper Boundary'].str.strip(')')
# Removes the closing parenthesis.
df['CI Lower Boundary'] = df['CI Lower Boundary'].str.split('(').str[1]
# Removes the parenthesis and everything left of it (the index [1] keeps the part after the split).
df['Mortality Rate'] = df['% Change in Mortality Rate, 1980-2014'].str.split('(').str[0]
# Removes the parenthesis and everything right of it (the index [0] keeps the part before the split).
print('Outcome successful:')
df[['% Change in Mortality Rate, 1980-2014','CI Lower Boundary','CI Upper Boundary','Mortality Rate']].head()
Outcome successful:
 | % Change in Mortality Rate, 1980-2014 | CI Lower Boundary | CI Upper Boundary | Mortality Rate |
---|---|---|---|---|
0 | -40.74 (-48.49, -32.30) | -48.49 | -32.30 | -40.74 |
1 | -51.75 (-60.02, -43.16) | -60.02 | -43.16 | -51.75 |
2 | -46.72 (-63.99, -24.18) | -63.99 | -24.18 | -46.72 |
3 | -51.28 (-64.03, -34.52) | -64.03 | -34.52 | -51.28 |
4 | -52.71 (-66.28, -34.42) | -66.28 | -34.42 | -52.71 |
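The three-step split above can also be done in a single pass with `str.extract`, using one named capture group per new field. A sketch on two of the values shown above (the regex is an assumption about the `rate (lower, upper)` pattern; the column names `rate`/`lower`/`upper` are illustrative):

```python
import pandas as pd

s = pd.Series(['-40.74 (-48.49, -32.30)', '-51.75 (-60.02, -43.16)'])

# One named capture group per output column: rate, CI lower, CI upper.
pattern = r'(?P<rate>-?\d+\.\d+) \((?P<lower>-?\d+\.\d+), (?P<upper>-?\d+\.\d+)\)'
parts = s.str.extract(pattern)

print(parts['rate'].tolist())   # ['-40.74', '-51.75'] -- still strings at this point
print(parts['lower'].tolist())  # ['-48.49', '-60.02']
```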
print('Observe the field type; we cannot do math if the values are recognized as strings (i.e. text):')
df[['CI Lower Boundary','CI Upper Boundary']].dtypes
Observe the field type; we cannot do math if the values are recognized as strings (i.e. text):
CI Lower Boundary    object
CI Upper Boundary    object
dtype: object
print('Both fields are stored as strings, so we need to convert them to floats in order to find the margin of error')
df['CI Upper Boundary'] = df['CI Upper Boundary'].astype(float)
df['CI Lower Boundary'] = df['CI Lower Boundary'].astype(float)
df['Margin of Error'] = (df['CI Upper Boundary'] - df['CI Lower Boundary']) / 2
df['Margin of Error'].head()
Both fields are stored as strings, so we need to convert them to floats in order to find the margin of error
0     8.095
1     8.430
2    19.905
3    14.755
4    15.930
Name: Margin of Error, dtype: float64
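One caveat: `astype(float)` raises an error if any value fails to parse. `pd.to_numeric` with `errors='coerce'` is a more forgiving alternative that turns unparseable entries into NaN instead of stopping the whole conversion. A sketch with a deliberately bad value (the sample data is made up):

```python
import pandas as pd

raw = pd.Series(['-48.49', ' -32.30', 'n/a'])  # stray whitespace and one bad value

# errors='coerce' converts what it can and maps the rest to NaN.
clean = pd.to_numeric(raw, errors='coerce')
print(clean.tolist())  # [-48.49, -32.3, nan]
```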
df1 = pd.read_csv("input/pool.csv")
print(df1.dtypes)
df1.head()
Aquatic Programs Count    float64
day                        object
Day of Week                object
Family Swim Count         float64
Lap Swim Count            float64
Outage Count              float64
tally_Time                 object
Water Excercise Count     float64
dtype: object
 | Aquatic Programs Count | day | Day of Week | Family Swim Count | Lap Swim Count | Outage Count | tally_Time | Water Excercise Count |
---|---|---|---|---|---|---|---|---|
0 | NaN | 03/03/2020 | Tuesday | NaN | NaN | NaN | 05:45:00 | NaN |
1 | NaN | 03/03/2020 | Tuesday | NaN | NaN | NaN | 06:15:00 | NaN |
2 | NaN | 03/03/2020 | Tuesday | NaN | NaN | NaN | 06:45:00 | NaN |
3 | NaN | 03/03/2020 | Tuesday | NaN | NaN | NaN | 07:15:00 | NaN |
4 | NaN | 03/03/2020 | Tuesday | NaN | NaN | NaN | 07:45:00 | NaN |
We have identified our fields of interest: "day" and "tally_Time"
df1['dateTime'] = pd.to_datetime(df1['day'] + ' ' + df1['tally_Time']) # Combine the fields and convert to datetime
df1 = df1.sort_values(by='dateTime') # Sort chronologically by date and hour
df1 = df1.drop(['day','tally_Time'], axis=1) # Remove the unwanted/redundant columns
print(df1.dtypes)
df1.head()
Aquatic Programs Count           float64
Day of Week                       object
Family Swim Count                float64
Lap Swim Count                   float64
Outage Count                     float64
Water Excercise Count            float64
dateTime                  datetime64[ns]
dtype: object
 | Aquatic Programs Count | Day of Week | Family Swim Count | Lap Swim Count | Outage Count | Water Excercise Count | dateTime |
---|---|---|---|---|---|---|---|
64 | NaN | Tuesday | NaN | 1.0 | NaN | NaN | 2019-01-08 05:45:00 |
65 | NaN | Tuesday | NaN | 5.0 | NaN | NaN | 2019-01-08 06:15:00 |
66 | NaN | Tuesday | NaN | 7.0 | NaN | NaN | 2019-01-08 06:45:00 |
67 | NaN | Tuesday | NaN | 9.0 | NaN | NaN | 2019-01-08 07:15:00 |
68 | NaN | Tuesday | NaN | 7.0 | NaN | NaN | 2019-01-08 07:45:00 |
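With dates like `03/03/2020`, pandas has to guess whether the text is month-first or day-first; passing `format=` makes the parse explicit (and faster on large columns). A sketch assuming the pool log uses US-style month/day/year dates (the two sample values are taken from the tables above):

```python
import pandas as pd

day = pd.Series(['03/03/2020', '01/08/2019'])
time = pd.Series(['05:45:00', '06:15:00'])

# Explicit format '%m/%d/%Y %H:%M:%S': no guessing between 3 March and March 3.
dt = pd.to_datetime(day + ' ' + time, format='%m/%d/%Y %H:%M:%S')
print(dt.dt.month.tolist())  # [3, 1]
print(dt.dt.hour.tolist())   # [5, 6]
```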
df.to_csv('output/hepData_edit.csv') # Write the edited dataframe to disk; add index=False to omit the row-index column
[1] United States Infectious Disease Mortality Rates by County 1980-2014 | GHDx. [online] Available at: http://ghdx.healthdata.org/record/ihme-data/united-states-infectious-disease-mortality-rates-county-1980-2014 [Accessed 5 Mar. 2020].