Made by HP-Nunes @geocuriosity
Updated March 10th, 2020
import pandas as pd
url = "https://raw.githubusercontent.com/HP-Nunes/cheatSheets/master/input/hepData.csv" # [1]
df = pd.read_csv(url, error_bad_lines=False) # Caution: 'offending lines' are skipped and not imported. (In pandas >= 1.3, error_bad_lines is deprecated in favor of on_bad_lines='skip'.)
df.head() # Returns the first 5 records of the dataframe
df.tail() # Returns the last 5 records of the dataframe
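To see the bad-line skipping in action without the remote file, here is a minimal sketch using a small in-memory CSV (the data and column names are made up for illustration); it uses the newer `on_bad_lines` parameter available in pandas >= 1.3:

```python
import io
import pandas as pd

# Hypothetical malformed CSV: the second data row has an extra field.
raw = "a,b\n1,2\n3,4,5\n6,7\n"

# on_bad_lines="skip" drops rows whose field count doesn't match the header.
df_demo = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df_demo)  # Only the rows (1, 2) and (6, 7) survive.
```

The skipped row is silently lost, so this is best reserved for quick exploration rather than production pipelines.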
%%time
# To observe the processing time of a query
print('______________________________________________________________________________________')
print(df.dtypes) # Observe the attribute type of each field
print('______________________________________________________________________________________')
df.info() # Displays the number of rows and columns, the total memory usage,
# each column's dtype, and the count of non-null elements. (df.info() prints
# directly and returns None, so wrapping it in print() adds a stray 'None'.)
print(' ')
print('The dataframe has ' + str(len(df)) + ' rows') # Prints the number of rows/records.
print('The dataframe has ' + str(len(df.columns)) + ' columns') # Returns the number of columns/fields
print('The dataframe has ' + str(df.shape) + ' rows and columns') # My personal favorite
print('The dataframe has ' + str(df.size) + ' total cells') # I.e. rows * columns
For the field '% Change in Mortality Rate, 1980-2014', I'd like to split out both confidence-interval bounds and the change in mortality rate into three separate fields: 'CI Lower Boundary', 'CI Upper Boundary', and 'Mortality Rate' respectively.
print('These are the values we want to split into three separate new fields:')
df['% Change in Mortality Rate, 1980-2014'].head()
df[['CI Lower Boundary','CI Upper Boundary']] = df['% Change in Mortality Rate, 1980-2014'].str.split(",",expand=True,)
# This splits the values into separate fields at the comma.
df['CI Upper Boundary'] = df['CI Upper Boundary'].str.strip(')')
# Removing the parenthesis.
df['CI Lower Boundary'] = df['CI Lower Boundary'].str.split('(').str[1]
# Keeps everything to the right of the parenthesis ([1] selects the second piece of the split).
df['Mortality Rate'] = df['% Change in Mortality Rate, 1980-2014'].str.split('(').str[0]
# Keeps everything to the left of the parenthesis ([0] selects the first piece of the split).
print('Outcome successful:')
df[['% Change in Mortality Rate, 1980-2014','CI Lower Boundary','CI Upper Boundary','Mortality Rate']].head()
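The same splitting pattern can be checked on a couple of synthetic values (the strings below mimic the "rate (lower, upper)" format of the real column, but are invented for illustration):

```python
import pandas as pd

# Synthetic values in the same "rate (lower, upper)" shape as the real field.
s = pd.Series(["-12.5 (-20.1, -4.9)", "3.2 (1.0, 5.4)"])

parts = s.str.split(",", expand=True)       # e.g. ["-12.5 (-20.1", " -4.9)"]
lower = parts[0].str.split("(").str[1]      # text to the right of "("
upper = parts[1].str.strip(") ")            # strip ")" and surrounding spaces
rate = s.str.split("(").str[0].str.strip()  # text to the left of "("
print(rate.tolist(), lower.tolist(), upper.tolist())
```

Each step mirrors one of the transformations applied to the real dataframe above.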
I'm interested in finding the margin of error of the confidence intervals and writing it to a new field named 'Margin of Error'.
print('Observe the field types; we cannot do math while the values are recognized as strings (i.e. text):')
df[['CI Lower Boundary','CI Upper Boundary']].dtypes
print('Both fields are identified as strings, so we need to convert them to floats in order to compute the margin of error')
df['CI Upper Boundary'] = df['CI Upper Boundary'].astype(float)
df['CI Lower Boundary'] = df['CI Lower Boundary'].astype(float)
df['Margin of Error'] = (df['CI Upper Boundary'] - df['CI Lower Boundary']) / 2
df['Margin of Error'].head()
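Note that `astype(float)` raises an error if any value fails to parse. A sketch with synthetic bounds (the numbers are made up) shows a more forgiving alternative, `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN instead:

```python
import pandas as pd

ci = pd.DataFrame({"lo": ["-20.1", "1.0"], "hi": ["-4.9", "5.4"]})

# errors="coerce" replaces unparseable strings with NaN rather than raising.
ci["lo"] = pd.to_numeric(ci["lo"], errors="coerce")
ci["hi"] = pd.to_numeric(ci["hi"], errors="coerce")
ci["moe"] = (ci["hi"] - ci["lo"]) / 2
print(ci["moe"].tolist())
```

Whether NaN is acceptable depends on the downstream analysis; `astype(float)` failing loudly can be the safer choice when you expect every value to be clean.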
I want to combine two columns, dates and hours, into a new field, then convert that field to datetime so that I can sort the dataframe chronologically and plot time series.
df1 = pd.read_csv("input/pool.csv")
print(df1.dtypes)
df1.head()
We have identified our fields of interest: "day" and "tally_Time"
df1['dateTime'] = pd.to_datetime(df1['day'] + ' ' + df1['tally_Time']) # Combine the fields and convert to dateTime
df1 = df1.sort_values(by='dateTime') # Sort chronologically by date and hour
df1 = df1.drop(['day','tally_Time'], axis=1) # Remove unwanted/redundant columns
print(df1.dtypes)
df1.head()
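Since "input/pool.csv" may not be at hand, the combine-sort-drop sequence can be reproduced on a tiny synthetic frame (column names match the real file; the dates and times are invented):

```python
import pandas as pd

demo = pd.DataFrame({
    "day": ["2020-03-02", "2020-03-01"],
    "tally_Time": ["14:30", "09:15"],
})

# Concatenate date and time strings, then parse the result as datetime.
demo["dateTime"] = pd.to_datetime(demo["day"] + " " + demo["tally_Time"])
demo = demo.sort_values(by="dateTime").drop(["day", "tally_Time"], axis=1)
print(demo)
```

After sorting, the earlier timestamp (March 1st) comes first and only the combined `dateTime` column remains.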
df.to_csv('output/hepData_edit.csv') # Pass index=False to omit the dataframe's index column from the file
[1] United States Infectious Disease Mortality Rates by County 1980-2014 | GHDx. [online] Available at: http://ghdx.healthdata.org/record/ihme-data/united-states-infectious-disease-mortality-rates-county-1980-2014 [Accessed 5 Mar. 2020].