1. Introduction

Name: Jared Yu
Student ID: 914640019

In the following Jupyter Notebook, there will be a step-by-step process of Data Extraction, Data Wrangling, Exploratory Data Analysis, Modeling, and a Conclusion. The dataset itself comes from https://datasf.org/opendata/ and is queried using an API. The original dataset has over 4 million observations dating several years. The dataset is titled, 'Fire Department Calls for Service,' and gives up to several observations for each 9-11 phone call that requires San Francisco Fire Department assisstance. Each observation contains variables describing different vehicles such as Fire Trucks or Ambulances which arrive on scene. Other information included are: time, location, neighborhood, etc..

The initial Exploratory Data Analysis process shows that there are two variables of particular interest related to the dataset itself. The two variables are related to the response time of each observation, along with the priority level of each incident. However, despite there being over 30 original columns, there was still a lack of columns which could be used as predictors, for this reason additional Data Scraping is done on external data sources to possibly enhance the ability to perform further modeling.

From the two random variables of interest, response time and priority level, the former is categorized as binary and the latter as continuous. For this reason there will be different modeling techniques used to analyze any predictive powers. The main tools will be Logistic Regression and Linear Regression. Additional modeling includes trying to discern whether or not the data follows a Poisson distribution. A final conclusion of the process will be included at the end of the report.

[Back to Table of Concents](#toc)

1.1 Importing Necessary Packages

Below are all the packages which are used within the notebook.

[Back to Table of Concents](#toc)
In [1]:
import pandas as pd # Import necessary packages
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
import numpy as np
from numpy import array
from numpy import argmax
import datetime as dt
from datetime import datetime
from geopy.geocoders import Nominatim
from geopy import distance
import statsmodels.formula.api as sm
import statsmodels.api as sm
from scipy import stats
from scipy.stats import poisson
import warnings
warnings.simplefilter('ignore')
import seaborn as sns
from sodapy import Socrata
from bs4 import BeautifulSoup
import requests
import re
from mpl_toolkits.basemap import Basemap
from descartes import PolygonPatch
from shapely.geometry import Point, Polygon, MultiPoint, MultiPolygon
from shapely.prepared import prep
from itertools import chain
import fiona
import folium
from folium.plugins import HeatMap
import os
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegressionCV
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
from matplotlib.collections import PatchCollection
%matplotlib inline
%pylab inline 
pylab.rcParams['figure.figsize'] = (15, 9) # Change the size of plots
pd.set_option("display.max_columns", 1000)
pd.set_option("display.expand_frame_repr", True)
Populating the interactive namespace from numpy and matplotlib

2. Data Extraction

To extract the dataset from the website https://datasf.org/opendata/, their API is used. This allows the dataset to be narrowed down to roughly 1 million rows, down from an original size of over 4 million. The website uses a type of SQL called SoQL, which is short for Socrata Query Language. Additionally, an account on the website is required to query an unlimited amount of observations.

[Back to Table of Concents](#toc)

2.1 Querying the API

Below the api_query() function is used to query the website for the requested subset of data. The commands below request that only observations from after the year 2015 are used for the initial analysis.

[Back to Table of Concents](#toc)

In [2]:
def api_query(SoQL_query, client_key):
    """
    https://dev.socrata.com/foundry/data.sfgov.org/enhu-st7v
    The function is created from code given to query data off the
    API of the government website. Personal login information
    is used to access the API. Additionally, a SoQL query is
    used as an input to determine how the data should be accessed.
    input: SoQL_query, a SQL-like query using SocrataQL; client_key,
    a key which connects the SoQL to this particular dataset
    output: queried dataset in dataframe format
    """
    client = Socrata(domain='data.sfgov.org', 
                     app_token='LatHs7KifDEpxpxnlKMb9SFfy', 
                     username="qzyu999@gmail.com", 
                     password="SFData999")

    result_list = client.get(client_key, limit=5000000, where=SoQL_query)

    return(pd.DataFrame.from_records(result_list))
In [3]:
# Query the data using the proper key and SoQL command
post_2015 = 'watch_date>="2015-01-01T00:00:00"'
client_key = "enhu-st7v"
df_query = api_query(SoQL_query=post_2015, client_key=client_key)
df_query.head()
Out[3]:
address als_unit available_dttm battalion box call_date call_final_disposition call_number call_type call_type_group city dispatch_dttm entry_dttm final_priority fire_prevention_district hospital_dttm incident_number location neighborhoods_analysis_boundaries number_of_alarms on_scene_dttm original_priority priority received_dttm response_dttm rowid station_area supervisor_district transport_dttm unit_id unit_sequence_in_call_dispatch unit_type watch_date zipcode_of_incident
0 CHESTNUT ST/PIERCE ST True 2015-01-01T11:43:03.000 B04 3661 2015-01-01T00:00:00.000 Fire 150011745 Other Alarm San Francisco 2015-01-01T11:38:41.000 2015-01-01T11:38:35.000 3 4 NaN 15000315 {'type': 'Point', 'coordinates': [-122.4395041... Marina 1 2015-01-01T11:40:35.000 3 3 2015-01-01T11:38:06.000 2015-01-01T11:39:14.000 150011745-E16 16 2 NaN E16 1 ENGINE 2015-01-01T00:00:00.000 94123
1 1600 Block of ALABAMA ST True 2015-01-02T07:07:56.000 B06 5666 2015-01-02T00:00:00.000 Code 2 Transport 150020416 Medical Incident Potentially Life-Threatening San Francisco 2015-01-02T06:51:37.000 2015-01-02T06:51:30.000 3 6 NaN 15000611 {'type': 'Point', 'coordinates': [-122.4105228... Bernal Heights 1 2015-01-02T06:56:16.000 3 3 2015-01-02T06:49:39.000 2015-01-02T06:53:40.000 150020416-E09 09 9 NaN E09 1 ENGINE 2015-01-01T00:00:00.000 94110
2 CALL BOX: SF INTERNATIONAL AIRPORT True 2015-01-01T17:48:43.000 B09 6913 2015-01-01T00:00:00.000 Cancelled 150012799 Aircraft Emergency Fire NaN 2015-01-01T17:42:45.000 2015-01-01T17:41:52.000 3 None NaN 15000434 {'type': 'Point', 'coordinates': [-122.3840942... None 1 NaN 3 3 2015-01-01T17:41:44.000 2015-01-01T17:43:54.000 150012799-T15 44 None NaN T15 8 TRUCK 2015-01-01T00:00:00.000 NaN
3 2800 Block of WASHINGTON ST True 2015-01-01T22:41:18.000 B04 4162 2015-01-01T00:00:00.000 Code 2 Transport 150013389 Medical Incident Potentially Life-Threatening San Francisco 2015-01-01T21:09:07.000 2015-01-01T21:08:17.000 3 4 2015-01-01T21:43:56.000 15000512 {'type': 'Point', 'coordinates': [-122.4401357... Pacific Heights 1 2015-01-01T21:14:47.000 3 3 2015-01-01T21:05:18.000 2015-01-01T21:09:34.000 150013389-77 10 2 2015-01-01T21:38:27.000 77 2 MEDIC 2015-01-01T00:00:00.000 94115
4 400 Block of POWELL ST True 2015-01-01T18:23:20.000 B01 1362 2015-01-01T00:00:00.000 Code 2 Transport 150012903 Medical Incident Potentially Life-Threatening San Francisco 2015-01-01T18:18:39.000 2015-01-01T18:14:19.000 3 1 NaN 15000450 {'type': 'Point', 'coordinates': [-122.4085665... Nob Hill 1 NaN 2 3 2015-01-01T18:12:34.000 NaN 150012903-52 01 3 NaN 52 3 MEDIC 2015-01-01T00:00:00.000 94102

2.2 Data Scraping

Later on it will be shown that there are a plethora of categorical and non-numeric variable in this dataset. When doing modeling this may make it difficult to find suitable predictors. Therefore, Data Scraping is here utlized to gather additional information about the observations from various sources on the web. The two columns here which are being scraped is the median income of neighborhoods, and distance of the fire station from an incident. The package used for doing the web scraping is BeautifulSoup4.

[Back to Table of Concents](#toc)

2.2.1 Median Income of Neighborhoods

Below the median income is scraped from a website using BeautifulSoup4. The neighborhood column and median income are taken from a table. These new columns are then merged with the dataframe to create a new column which indicates what income level an observation belongs to. Certain neighborhoods lack an income level according to the table, however an income level is displayed graphically for all neighborhoods in a geographic plot. This geographic plot is used to impute the missing data for the neighborhoods which lack a median income level.

[Back to Table of Concents](#toc)

In [4]:
# Below two columns from a table are webscraped using bs4. The website
# is requested, and the table is subset according to html. The columns
# are also reformatted to remove unecessary characters.
req = requests.get('http://www.sfindicatorproject.org/indicators/view/162/')
soup = BeautifulSoup(req.content)
soup.table['id']
soup.find(id == "example_table")
table = soup.find_all('table')[0]
table_all = table.find_all("td")
table_len = len(table_all)
district = [table_all[i].get_text().replace('\xa0', ' ') for i in list(range(table_len))[0:210:5]]
med_income = [re.sub('[$, *NA]','',table_all[i+3].get_text()) for i in list(range(table_len))[0:210:5]]

To follow the ordering in a median income plot from the same website, varying levels of income are described as ranging from 0-5. Certain neighborhoods that are not included on the website are imputed by observing their apparent median income based on the level shown on a map. It is believed that certain neighborhoods overlap in the plot, so it is best to estimate a neighborhood's income based on the available data.

Website plot: http://www.sfindicatorproject.org/img/indicators/pdf/MedianHouseholdIncome.pdf

Income Range Income Level (0-5)
statistically unstable or small population 0
$11,536 - $39,000 1
$39,000 - $67,000 2
$67,000 - $91,000 3
$91,000 - $120,000 4
$120,000 - $164,135 5
In [5]:
# Below the two webscraped columns are recategorized by median income
# range. Also, values which are missing are estimated by the given
# plot on the website. The remaining neighborhoods are re-categorized.
sf_income = pd.DataFrame([district, med_income]).T
sf_income.columns = ['district', 'med_income']
sf_income.loc[0, 'district'] = 'Bayview Hunters Point'
sf_income['med_income'][7, 14] = 0 # Impute the missing values
sf_income['med_income'][17, 36] = 1
sf_income['med_income'][9, 12, 27, 37] = 2
sf_income['med_income'][6, 23, 35] = 3
sf_income['med_income'][5, 15, 19, 29, 38] = 4
sf_income.med_income = sf_income.med_income.apply(pd.to_numeric)
sf_income.sort_values(by=['med_income'], inplace=True)
sf_income = sf_income.reset_index(drop=True)
sf_income.loc[16, 'med_income'] = 1
sf_income.loc[17:22, 'med_income'] = 2
sf_income.loc[23:33, 'med_income'] = 3
sf_income.loc[34:39, 'med_income'] = 4
sf_income.loc[40:41, 'med_income'] = 5

df_merge = pd.merge(df_query, sf_income, how='left', left_on=['neighborhoods_analysis_boundaries'], right_on=['district'])
df_merge = df_merge.drop(columns=['district'])
df_merge.med_income.head()
Out[5]:
0    4.0
1    3.0
2    NaN
3    4.0
4    2.0
Name: med_income, dtype: float64

2.2.2 Distance from Incident

Another interesting variable may be to try and find out what the distance each fire department is from a particular scene. The variable which provides this information is 'station_area,' which tells which fire station it is that sent the particular observation to a scene. In this section, we utilize the fire station location data from the department's website. The process is to scrape the addresses based on the station number, and then determine the coordinates using the 'geopy' package. Certain values require some manual imputation, and others require dropping due to a lack of matching with the data. The resultant coordinates of each fire station are then measured against the coordinates of the incident scene for each observation.

[Back to Table of Concents](#toc)

Here the first step is to scrape the fire station addresses from the website of the fire department.

In [6]:
# Below the fire station addresses are scraped from the website.
req = requests.get('https://sf-fire.org/fire-station-locations')
soup = BeautifulSoup(req.content)
addresses = soup.findAll('table')[0].findAll('tr')
list_addresses = [([i.a for i in addresses][j].contents) for j in range(len(addresses)-1)]
geolocator = Nominatim(timeout=60)

list_addresses2 = [list_addresses[i][0].replace('\xa0', ' ').split(' at')[0] for i in range(len(list_addresses))]
list_addresses3 = [(list_addresses2[i] + ', San Francisco') for i in range(len(list_addresses2))]

Next we determine the coordinates for each of these addresses, while manualing imputing three that are missing from the website.

In [7]:
# Determine the coordinates of each address, and also
# impute certain coordinates using Google
geo_list = []
[geo_list.append(geolocator.geocode(y)) for y in list_addresses3]
geo_list = list(filter(None, geo_list))
geo_list2 = [geo_list[i][1] for i in range(len(geo_list))]
geo_list2.insert(3, (37.772780, -122.389050)) # 4
geo_list2.insert(32, (37.794610, -122.393260)) # 35
geo_list2.insert(42, (37.827160, -122.368300)) # 48
del geo_list2[43] # Delete incorrect rows
del geo_list2[43]

Here a small subset of fire stations which don't have known addresses according to the website are removed. There are less than 1% of the observations which fit this description, so the predictive power of these are likely not significant. Afterwards, the distance (in meters) between the fire station of each observation and the coordinates of the incident that they respond to are calculated using 'geopy' and list comprehension.

In [8]:
# Utilize the previously merged dataframe, and remove the outliers
# which have no known addresses. Take the coordinates of the fire
# stations and calculate the distance between them and the incident.
# Add this new column to the dataframe as 'distance.'
df2_2 = df_merge.copy()
df2_2 = df2_2.loc[(df2_2.station_area != '94') & 
                  (df2_2.station_area != 'E2') & 
                  (df2_2.station_area != '47')]
df2_2.dropna(subset=['station_area'], inplace=True)

stations = df2_2.station_area.unique()
stations = pd.DataFrame(np.sort(stations.astype(int)))
starting_loc = pd.concat([stations, pd.DataFrame(geo_list2)], axis=1)
starting_loc.columns = ['station', 'lats', 'longs']
starting_loc.station = [format(i, '02d') for i in starting_loc.station]

df2_2['starting_lat'] = df2_2['station_area'].map(starting_loc.set_index('station')['lats'])
df2_2['starting_long'] = df2_2['station_area'].map(starting_loc.set_index('station')['longs'])

df2_2.reset_index(drop=True, inplace=True)

df2_2['long'] = [i['coordinates'][0] for i in df2_2['location']]
df2_2['lat'] = [i['coordinates'][1] for i in df2_2['location']]

meter_distance = []
[meter_distance.append(distance.distance(((df2_2.loc[i, 'starting_lat']), 
        (df2_2.loc[i, 'starting_long'])), ((df2_2.loc[i, 'lat']), 
        (df2_2.loc[i, 'long']))).meters) for i in range(len(df2_2))]

df2_2['distance'] = meter_distance

3. Data Wrangling

Below the functions get_time_of_day() and clean_my_data() are used to add additional columns which may be helpful for the process of data analytics along with fixing variable formats so that they are easier to work with. An example of added columns are features such as: time of day, year, month, hour, day of the week, holiday, etcetera.

[Back to Table of Concents](#toc)

In [9]:
def get_time_of_day(hour):
    """
    The get_time_of_day() function subsets each observation by the time of day,
    whether it is morning, afternoon, evening, or night.

    input: hour, the hourly column from the dataframe
    output: tod, the time of day depending on what the hour is.
    """
    # By Hour Logic: 0-4 night; 5-11 morning; 12-16 afternoon; 17-19 evening; 20-24 night
    if (hour > 4) & (hour <= 11):
        tod = "morning"
    elif (hour > 11) & (hour <= 16):
        tod = "afternoon"
    elif (hour > 16) & (hour <= 19):
        tod = "evening"
    else:
        tod = "night"
    
    return tod
In [10]:
def clean_my_data(df, col_list):
    """
    Takes in the dataframe that needs to be cleaned. Also takes in the list of column names of
    complicated and long dates, for proper date conversion.
    
    input: 'df' (the dataframe to be cleaned)
          'col_list' (the list of complicated datetime column names)
    output: 'df' (the dataframe already cleaned)
    """
    # Simple data type conversions to int
    df["number_of_alarms"] = df["number_of_alarms"].astype(int)
    df["unit_sequence_in_call_dispatch"] = df["unit_sequence_in_call_dispatch"].astype(int)
    
    # Simple datetime conversions
    df["call_date"] = pd.to_datetime(df["call_date"], format="%Y-%m-%d")
    df["watch_date"] = pd.to_datetime(df["watch_date"], format="%Y-%m-%d")
    
    # For every value in the provided complicated datetime column list, change data types to datetimes
    for val in col_list:
        df[val] = pd.to_datetime(df[val], format="%Y-%m-%dT%H:%M:%S.%f")
        
    # Day of The Week & Time of Day columns
    df["year"] = pd.DatetimeIndex(df["call_date"]).year
    df["month"] = pd.DatetimeIndex(df["call_date"]).month
    df["dotw"] = df["call_date"].dt.day_name()
    df["hour"] = pd.DatetimeIndex(df["received_dttm"]).hour
    df["time_of_day"] = df["hour"].apply(lambda row: get_time_of_day(row))

    # Holidays
    cal = calendar()
    holidays = cal.holidays(start=df.call_date.min(), end = df.call_date.max())
    df["holiday"] = df["call_date"].isin(holidays)
    
    return(df)
In [11]:
# Create the inputs for the functions using the desired variables.
date_and_time_col_list = ["received_dttm", "entry_dttm", "dispatch_dttm", "response_dttm",
                    "on_scene_dttm", "transport_dttm", "hospital_dttm", "available_dttm"]
In [12]:
# Query, clean, and display the data.
df = clean_my_data(df=df2_2, col_list=date_and_time_col_list)
df.head()
Out[12]:
address als_unit available_dttm battalion box call_date call_final_disposition call_number call_type call_type_group city dispatch_dttm entry_dttm final_priority fire_prevention_district hospital_dttm incident_number location neighborhoods_analysis_boundaries number_of_alarms on_scene_dttm original_priority priority received_dttm response_dttm rowid station_area supervisor_district transport_dttm unit_id unit_sequence_in_call_dispatch unit_type watch_date zipcode_of_incident med_income starting_lat starting_long long lat distance year month dotw hour time_of_day holiday
0 CHESTNUT ST/PIERCE ST True 2015-01-01 11:43:03 B04 3661 2015-01-01 Fire 150011745 Other Alarm San Francisco 2015-01-01 11:38:41 2015-01-01 11:38:35 3 4 NaT 15000315 {'type': 'Point', 'coordinates': [-122.4395041... Marina 1 2015-01-01 11:40:35 3 3 2015-01-01 11:38:06 2015-01-01 11:39:14 150011745-E16 16 2 NaT E16 1 ENGINE 2015-01-01 94123 4.0 37.798667 -122.436734 -122.439504 37.800403 310.889757 2015 1 Thursday 11 morning True
1 1600 Block of ALABAMA ST True 2015-01-02 07:07:56 B06 5666 2015-01-02 Code 2 Transport 150020416 Medical Incident Potentially Life-Threatening San Francisco 2015-01-02 06:51:37 2015-01-02 06:51:30 3 6 NaT 15000611 {'type': 'Point', 'coordinates': [-122.4105228... Bernal Heights 1 2015-01-02 06:56:16 3 3 2015-01-02 06:49:39 2015-01-02 06:53:40 150020416-E09 09 9 NaT E09 1 ENGINE 2015-01-01 94110 3.0 37.745241 -122.401260 -122.410523 37.746636 830.896875 2015 1 Friday 6 morning False
2 CALL BOX: SF INTERNATIONAL AIRPORT True 2015-01-01 17:48:43 B09 6913 2015-01-01 Cancelled 150012799 Aircraft Emergency Fire NaN 2015-01-01 17:42:45 2015-01-01 17:41:52 3 None NaT 15000434 {'type': 'Point', 'coordinates': [-122.3840942... None 1 NaT 3 3 2015-01-01 17:41:44 2015-01-01 17:43:54 150012799-T15 44 None NaT T15 8 TRUCK 2015-01-01 NaN NaN 37.716665 -122.400266 -122.384094 37.616882 11166.428729 2015 1 Thursday 17 evening True
3 2800 Block of WASHINGTON ST True 2015-01-01 22:41:18 B04 4162 2015-01-01 Code 2 Transport 150013389 Medical Incident Potentially Life-Threatening San Francisco 2015-01-01 21:09:07 2015-01-01 21:08:17 3 4 2015-01-01 21:43:56 15000512 {'type': 'Point', 'coordinates': [-122.4401357... Pacific Heights 1 2015-01-01 21:14:47 3 3 2015-01-01 21:05:18 2015-01-01 21:09:34 150013389-77 10 2 2015-01-01 21:38:27 77 2 MEDIC 2015-01-01 94115 4.0 37.785610 -122.446821 -122.440136 37.790810 824.540333 2015 1 Thursday 21 night True
4 400 Block of POWELL ST True 2015-01-01 18:23:20 B01 1362 2015-01-01 Code 2 Transport 150012903 Medical Incident Potentially Life-Threatening San Francisco 2015-01-01 18:18:39 2015-01-01 18:14:19 3 1 NaT 15000450 {'type': 'Point', 'coordinates': [-122.4085665... Nob Hill 1 NaT 2 3 2015-01-01 18:12:34 NaT 150012903-52 01 3 NaT 52 3 MEDIC 2015-01-01 94102 2.0 37.779417 -122.404100 -122.408567 37.788750 1108.170192 2015 1 Thursday 18 evening True

4. Data Dictionary

Below is a list of the variables which are within the dataset, along with the descriptions provided by the website from which the data is sourced. Additionally, variables which are added have their own separate dictionary also.

[Back to Table of Concents](#toc)

Variable Description
Call Number A unique 9-digit number assigned by the 911 Dispatch Center (DEM) to this call. These number are used for both Police and Fire calls.
Unit ID N/A
Incident Number A unique 8-digit number assigned by DEM to this Fire incident.
Call Type Call Type
Call Date Date the call is received at the 911 Dispatch Center. Used for reporting purposes.
Watch Date Watch date when the call is received. Watch date starts at 0800 each morning and ends at 0800 the next day.
Received DtTm Date and time of call is received at the 911 Dispatch Center.
Entry DtTm Date and time the 911 operator submits the entry of the initical call information into the CAD system
Dispatch DtTm Date and time the 911 operator dispatches this unit to the call.
Response DtTm Date and time this unit acknowledges the dispatch and records that the unit is en route to the location of the call.
On Scene DtTm Date and time the unit records arriving to the location of the incident
Transport DtTm If this unit is an ambulance, date and time the unit begins the transport unit arrives to hospital
Hospital DtTm If this unit is an ambulance, date and time the unit arrives to the hospital.
Call Final Disposition Disposition of the call (Code). For example TH2: Transport to Hospital Code 2, FIR: Resolved by Fire Department
Available DtTm Date and time this unit is not longer assigned to this call and it is available for another dispatch.
Address Address of incident (note: address and location generalized to mid block of street, intersection or nearest call box location, to protect caller privacy).
City City of incident
Zipcode of Incident Zipcode of incident
Battalion Emergency Response District (There are 9 Fire Emergency Response Districts)
Station Area Fire Station First Response Area associated with the address of the incident
Box Fire box associated with the address of the incident. A box is the smallest area used to divide the City. Each box is associated with a unique unit dispatch order. The City is divided into more than 2,400 boxes.
Original Priority Initial call priority (Code 2: Non-Emergency or Code 3:Emergency).
Priority Call priority (Code 2: Non-Emergency or Code 3:Emergency).
Final Priority Final call priority (Code 2: Non-Emergency or Code 3:Emergency).
ALS Unit Does this unit includes ALS (Advance Life Support) resources? Is there a paramedic in this unit?
Call Type Group Call types are divided into four main groups: Fire, Alarm, Potential Life Threatening and Non Life Threatening.
Number of Alarms Number of alarms associated with the incident. This is a number between 1 and 5.
Unit Type Unit type
Unit sequence in call dispatch A number that indicates the order this unit was assigned to this call
Fire Prevention District Bureau of Fire Prevention District associated with this address
Supervisor District Supervisor District associated with this address (note: these are the districts created in 2012).
Neighborhooods Analysis Boundaries Neighborhood District associated with this address
Location Location of incident (note: address and location generalized to mid block of street, intersection or nearest call box location, to protect caller privacy).
RowID Unique Call Number and Unit ID combination

Added Columns

Below are variables which have been added using existing information that was gathered from the data. Certain variables which have not been mentioned yet are included, however they will be created later on in the notebook.

Variable Description
year The year of the observation.
month The month of the observation.
dotw The day of the week of the observation.
hour The hour of the observation (based on a 24-hour clock)
time_of_day The time of the day (e.g., morning, evening, night).
long The longitude of the observation.
lat The latitude of the observation.
holiday True False column that determines if a date is a holiday.
med_income Median income of a neighborhood on a scale of 0-5.
response_time An estimate of how long it takes for the fire department to respond to a 9-11 call.



[Back to Table of Concents](#toc)

5. Exploratory Data Analysis pt. 1

Dataset Exploration

The process of Exploratory Data Analysis (EDA) is broken down into several parts. The initial part 1 is used to do some initial variable analysis. By looking into what sort of information the dataset contains, curiosity can be developed to imagine further questions which may be of interest and worth exploring. Part 2 of the EDA will include some exploration into some of the dataset's observations to search for possible entries which match with events that occurred in real life. Part 3 of the EDA will move on towards exploring the response time variable. Part 4 of the EDA will look into the priority level variable.

[Back to Table of Concents](#toc)

5.1 Examining the Datatypes

It is apparent that much of the data is relate to datetime, also the other variables are often categorical, with few numeric variables. This can make modeling difficult, so it is practical to add further columns by either merging data from other sources or creating new features from previously existing columns. Below it is evident that 10 of the variables are datetime-related, 23 are objects, and only 8 are numeric variables.

[Back to Table of Concents](#toc)

In [13]:
df.info(verbose=False) # Examine the datatype of each columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1201264 entries, 0 to 1201263
Columns: 46 entries, address to holiday
dtypes: bool(2), datetime64[ns](10), float64(6), int32(2), int64(3), object(23)
memory usage: 396.4+ MB

5.2 The Shape of the Dataset

From the dates which were queried, there are 1,196,820 rows and 43 columns on 12/2/2018. Due to the dataset updating each time the data is queried, this will change anytime the dataset is queried on a new date.

[Back to Table of Concents](#toc)

In [14]:
print(' The number of rows\'s in data is:\t{} \n \
The number of columns in the data is:\t{}'.format(df.shape[0], df.shape[1])) # Print data shape
 The number of rows's in data is:	1201264 
 The number of columns in the data is:	46

5.3 Missing Values.

There are 2,006,000 cells which have Na's. Although there are a high number of Na's, it is possible that many are intentionally Na due to a column not necessarily needing to be filled for a particular observation. e.g., the vehicle never arrives at a location, therefore there is no arrival time for the given vehicle. Also, as the dataset updates, there will be an increasing number of Na's.

[Back to Table of Concents](#toc)

In [15]:
print('The number of Na\'s in data is:\t{}'.format(sum(df.isna().sum()))) # Print number of Na's
The number of Na's in data is:	2010799

Below is a table of all the Na's by column sorted from most to few:

In [16]:
df.isna().sum().sort_values(ascending=False).head() # Show Na categories
Out[16]:
hospital_dttm     869290
transport_dttm    861800
on_scene_dttm     239822
response_dttm      32485
city                2873
dtype: int64
Variable Description
On Scene DtTm Date and time the unit records arriving to the location of the incident
Transport DtTm If this unit is an ambulance, date and time the unit begins the transport unit arrives to hospital
Hospital DtTm If this unit is an ambulance, date and time the unit arrives to the hospital.

From the below printouts of Na's for when both 'on_scene_dttm' and 'hostpital_dttm' are Na, along with when both transport_dttm' and 'hospital_dttm' are Na show a great deal of overlap. These variables are also the largest Na categories from the dataset. It would make sense to think that if an ambulance does not do one of the steps in going to the hospital, it will never ultimately arrive at the hospital. Also, not all of the observations are ambulances. For instance, a fire truck engine would never travel to a hospital and would therefore lack a date time for arriving at one.

Therefore, it may not be necessary to try and impute an excessive number of cells when they don't require any input to begin with.

In [17]:
print(' The number of Na\'s for variables \'on_scene_ddtm\' as well as \'hospital_dttm\': \t{} \n \
The number of Na\'s for variables \'hospital_dttm\': \t{} \n \
The number of Na\'s for variables \'on_scene_ddtm\': \t{}'.format(len(df.loc[df.on_scene_dttm.isna() & 
                        df.hospital_dttm.isna()]),len(df.loc[df.hospital_dttm.isna()]),
                        len(df.loc[df.on_scene_dttm.isna()]))) # Print relation between on_scene, hospital_dttm
 The number of Na's for variables 'on_scene_ddtm' as well as 'hospital_dttm': 	239427 
 The number of Na's for variables 'hospital_dttm': 	869290 
 The number of Na's for variables 'on_scene_ddtm': 	239822
In [18]:
print(' The number of Na\'s for variables \'transport_dttm\' as well as \'hospital_dttm\': \t{} \n \
The number of Na\'s for variables \'hospital_dttm\': \t{} \n \
The number of Na\'s for variables \'transport_dttm\': \t{}'.format(len(df.loc[df.transport_dttm.isna() & 
                        df.hospital_dttm.isna()]), len(df.loc[df.hospital_dttm.isna()]),
                        len(df.loc[df.transport_dttm.isna()]))) # Print relation between transport_dttm, hospital_dttm
 The number of Na's for variables 'transport_dttm' as well as 'hospital_dttm': 	861800 
 The number of Na's for variables 'hospital_dttm': 	869290 
 The number of Na's for variables 'transport_dttm': 	861800

5.4 Call Types of Fire Department Incidents

An interesting question is to find out what kind of incidents does the fire department generally encounter when they are called out for assistance. Below is a table of the different call types, and the dominant occurance seems to be medical incidents. These are likely to be situations when someone calls 911 because they are going through some sort of medical distress.

[Back to Table of Concents](#toc)

In [19]:
df.call_type.value_counts().head() # Print call type counts
Out[19]:
Medical Incident     824428
Alarms               132836
Structure Fire       109998
Traffic Collision     52825
Other                 17828
Name: call_type, dtype: int64

After observing that the dominant type of call that is received are for medical incidents, the proportion of such incidents is viewed through a bar graph. This high level of occurences of one call type in comparison to others makes it seem that it could possibly be of use later on in the modeling process.

[Back to Table of Concents](#toc)

In [20]:
# First a copy of the original dataframe is made. Then a new column is
# created which indicates whether or not a Medical Indicident occurred.
# The proportion is then calculated, and they are finally plotted.
df5_4 = df.copy()
df5_4['Medical'] = np.where((df.call_type == 'Medical Incident'), 'yes', 'no')
med_ratio = sum(df5_4['Medical'] == 'yes') / len(df)
nonmed_ratio = 1 - med_ratio
proportion_med = pd.DataFrame([med_ratio, nonmed_ratio])
proportion_med_bool = pd.DataFrame(pd.Series(['Medical', 'Non-medical']))
med_ratio_df = pd.concat([proportion_med, proportion_med_bool], axis=1)
med_ratio_df.columns = ['Proportion', 'Medical or Not']
sns.barplot(x='Medical or Not', y='Proportion', data=med_ratio_df)
sns.set(font_scale=2)
plt.title(label='Proportion of Medical Incidents to Non-medical Incidents')
Out[20]:
Text(0.5, 1.0, 'Proportion of Medical Incidents to Non-medical Incidents')

By plotting the proportion of call types which are related to fires, it is possible to see that the majority of fire department incidents are not directly related to any actual fires. This actually is an intuitive result. The reason is that with the millions of observations, there are not an equivalent amount of fire-related emergencies that people generally will encounter. The more likely event is that someone has a physical accident where they require hospital treatment.

[Back to Table of Concents](#toc)

In [21]:
# The different fire call types are categorized in a separate category
# which acts as a boolean for when a fire incident occurs. The proportion
# of these incidents are plotted in a bargraph.
df5_4['Fire'] = np.where((df.call_type == 'Structure Fire') | 
                      (df.call_type == 'Outside Fire') | 
                      (df.call_type == 'Vehicle Fire') |
                      (df.call_type == 'Explosion') |
                      (df.call_type == 'Marine Fire'), 'yes', 'no')

fire_ratio = sum(df5_4['Fire'] == 'yes') / len(df)
nonfire_ratio = 1 - fire_ratio
proportion_fire = pd.DataFrame([fire_ratio, nonfire_ratio])
proportion_fire_bool = pd.DataFrame(pd.Series(['Fire', 'Not-fire']))
fire_ratio_df = pd.concat([proportion_fire, proportion_fire_bool], axis=1)
fire_ratio_df.columns = ['Proportion', 'Fire or Not']
sns.barplot(x='Fire or Not', y='Proportion', data=fire_ratio_df)
sns.set(font_scale=2)
plt.title(label='Proportion of Fire Incidents to Not-fire Incidents')
Out[21]:
Text(0.5, 1.0, 'Proportion of Fire Incidents to Not-fire Incidents')

5.4.3 Heatmap of Fire Department Incidents

Below is a heatmap of fire department incidents based on call type. The heatmap shows the distribution of call types by the hour of day. Each call type has the relative frequency plotted rather than the actual count. Also, the range of the frequency is adjusted so that it is easier to see the distribution for all call types.

The heatmap below shows that the most common occurences for call types are during times that aren't in the early morning when most people are asleep. This makes sense, since there is the least activity going on, so there would be the least likely chance that an individual needs help from the fire department. The calls seem to pickup around 7 a.m., and decline around 9 p.m..

[Back to Table of Concents](#toc)

In [22]:
# The code below first copies and subsets the relevant columns.
# Then it groups them into an appropriate format where it can
# first have the relative frequency calculated for each hour.
# Afterwards the data is pivoted so that it can be put into
# the Seaborn heatmap.
df5_4_3 = df.copy()
df5_4_3 = df5_4_3[['call_type', 'hour']]
df5_4_3 = df5_4_3.groupby(['call_type', 'hour']).size()
df5_4_3 = pd.DataFrame(df5_4_3)
df5_4_3.reset_index(inplace=True)
df5_4_3.columns = ['call_type', 'hour', 'count']
df5_4_3['rel_freq'] = ''

for i in df5_4_3.call_type.unique(): # Calculate relative frequency
    one_call_type = df5_4_3.loc[df5_4_3.call_type == i]
    df5_4_3['rel_freq'][df5_4_3.call_type == i] = one_call_type['count']/np.sum(one_call_type['count'])

df5_4_3 = df5_4_3.pivot('call_type', 'hour', 'rel_freq')

df5_4_3.fillna(0, inplace=True)
sns.set(font_scale=1.5)
ax = sns.heatmap(df5_4_3, vmin=0.017, vmax=0.28, cmap='Blues', xticklabels=True, yticklabels=True)
ax.set(xlabel='Hour', ylabel='Call Type', title='Relative Frequency Heatmap of Call Type by Hour')
Out[22]:
[Text(106.42187499999997, 0.5, 'Call Type'),
 Text(0.5, 52.5, 'Hour'),
 Text(0.5, 1.0, 'Relative Frequency Heatmap of Call Type by Hour')]

5.5 Distribution of Fire Department Incidents by San Francisco Neighborhood

From the table it is evident that the number of 911 calls which require fire department help is not perfectly balanced. There tends to be quite a high number in certain areas, notably within the inner city where social inequality may be high.

[Back to Table of Concents](#toc)

In [23]:
df.neighborhoods_analysis_boundaries.value_counts().head() # Display the counts per neighborhood
Out[23]:
Tenderloin                        164508
South of Market                   123499
Mission                           106880
Financial District/South Beach     88263
Bayview Hunters Point              60053
Name: neighborhoods_analysis_boundaries, dtype: int64

5.5.1 Analyze the Distribution Using Geospatial Data

In order to develop a heatmap of the observations, a random sample of 5,000 rows was taken. The reason is that this sort of visualziation is intense in Python, so limiting the amount of data will allow for a smooth visualization. Also, excess observations make it difficult to differentiate any densities between various neighborhoods. This is primarily just for trying to understand what the spread of the data may appear to look like, and won't be integral towards any modeling.

The package Folium package was mainly used for the visualization. There were not many neighborhoods to load, so they were just manually added by Googling their coordinates. The heatmap shows similarly what was seen in the above value counts table. From a far out distance, the entire city looks rather evenly split with the number of observations. However, zooming in, it becomes clear that there are concentrations of them in areas which may be expected. The highest frequency of incidents occur around the inner city. Other areas which are sometimes perceived as 'less safe' such as Mission or Bayview also have a high density. More quiet locations such as Sunset and around San Francisco State University there are fewer occurrences of fire department incidents.

[Back to Table of Concents](#toc)

In [24]:
# The data below shows a sample of 5,000 observations being plotted using Folium
# in San Francisco. A heatmap is generated showing the specific locations which
# are also added and their density of fire department incidents. Additionally,
# certain locations or neighborhoods are added as clickable points.
df5_5 = df.sample(5000).copy()
sf_heatmap = folium.Map(
    location=[37.7556, -122.4399],
    tiles='Stamen Terrain',
    zoom_start=12
)

sf_coords = ([37.7941, -122.4078], [37.7785, -122.4056], [37.7847, -122.4145],
             [37.7816, -122.4156], [37.7607, -122.4680],
             [37.7219, -122.4782], [37.7599, -122.4148],
             [37.7946, -122.3999], [37.7304, -122.3844])

sf_neighborhoods = ('Chinatown', 'South of Market', 'Tenderloin', 
                    'Civic Center', 'Inner Sunset', 
                    'San Francisco State University', 'Mission', 
                    'Financial District', 'Hunter\'s Point')

def folium_popup(coords, neighborhood_names, folium_map):
    """
    The folium_popup() function creates popups for specific
    locations on a map. It uses a vector of coordinates along
    with a vector of strings to create these inside the Folium
    package.
    input: coords, neighborhood_names, folium_map; these are the
    longitude/latitude, neighborhood names, and the heatmap itself
    output: N/A, the function applies a setting to the heatmap
    without directly outputting an object
    """
    tooltip = 'Click me!'
    for i in range(len(neighborhood_names)):
        folium.Marker([coords[i][0], coords[i][1]], 
                      popup='<i>' + neighborhood_names[i] +
                      '</i>', tooltip=tooltip).add_to(folium_map)

folium_popup(coords=sf_coords, neighborhood_names=sf_neighborhoods, folium_map=sf_heatmap)

scatter = list(map(list, zip(df5_5.lat.astype(float), df5_5.long.astype(float))))

HeatMap(scatter).add_to(sf_heatmap)
sf_heatmap.save('sf_heatmap.html')
sf_heatmap
Out[24]:

5.5.1 (Continued)

From the four different screenshots below of the Folium heatmap, it is possible to see how there are varying densities of fire department incidents. Using no zoom, the city looks rather evenly spread out. However, upon closer examination it becomes apparent that the number of occurrences vary depending on the neighborhood. In the Downtown Focus, looking at the most poor areas such as the Tenderloin and Civic Center, it is possible to see how there are some of the greatest occurrences of fire department calls. Looking closer towards the beach on the Western side of the city and the homes nearer to San Francisco State University in the bottom left corner, it's possible to see a much lower density of calls.

Later on in the modeling section, there will be a test to try an understand whether these occurrences follow a Poisson distribution.

[Back to Table of Concents](#toc)

Full View



[Back to Table of Concents](#toc)

Partial Zoom



[Back to Table of Concents](#toc)

Downtown Focus (top right)



[Back to Table of Concents](#toc)

Beach Focus (bottom left)



[Back to Table of Concents](#toc)

6. Exploratory Data Analysis pt. 2

Needle in a Haystack

From the midterm it was understood that it's possible to search for certain observations which are known to have occurred prior to analyzing the dataset. This is sort of akin to trying to finding a needle in a haystack. Given over a million rows, how difficult is it to find one particular observation?

The first 'needle' is a personal incident where the ambulance was called for a personal health situation. Another possibility is to search for the recent deadly fires in California.

[Back to Table of Concents](#toc)

6.1 Sprained Ankle

The exact date of the ambulance call is unknown, but it can be estimated. At the time I was attending City College of San Francisco. However, the exact day, month, or year is unknown, but other factors are still remembered. At the time I was interested in Chemical Engineering, and was beginning to take CHEM 101A classes. According to my unofficial transcript, this took place during the Fall semester of 2016. Also, an ambulance was responsible for sending me to the Chinese Hospital.

[Back to Table of Concents](#toc)

In [25]:
df6_1 = df.copy() # Subset the data during the Fall 2015 semester.
df6_1 = df6_1.loc[(df6_1['call_date'] > '2015-08-01') & 
                  (df6_1['call_date'] <= '2015-12-31')]

print('Within this subset there are {} observations'.format(len(df6_1)))
Within this subset there are 124346 observations

The next step is to subset where the address matches the street where I was staying at during the time. Due to being unable to recall the exact date, a phone call was made to the Chinese Hospital's Records Department, and it was determined that the date of the hospital incident occurred on September 22, 2015. Noticing that this observation closely matches the events that took place on that night, the incident number is subset from the data below.

In [26]:
df6_1 = df6_1.loc[df6_1['address'].str.contains('MONTANA')] # Search for the street
df6_1.head(3)
Out[26]:
address als_unit available_dttm battalion box call_date call_final_disposition call_number call_type call_type_group city dispatch_dttm entry_dttm final_priority fire_prevention_district hospital_dttm incident_number location neighborhoods_analysis_boundaries number_of_alarms on_scene_dttm original_priority priority received_dttm response_dttm rowid station_area supervisor_district transport_dttm unit_id unit_sequence_in_call_dispatch unit_type watch_date zipcode_of_incident med_income starting_lat starting_long long lat distance year month dotw hour time_of_day holiday
178599 MONTANA ST/CAPITOL AV True 2015-08-11 22:13:04 B09 8373 2015-08-11 Code 2 Transport 152233793 Medical Incident Potentially Life-Threatening San Francisco 2015-08-11 21:25:17 2015-08-11 21:24:17 3 9 2015-08-11 21:59:30 15085156 {'type': 'Point', 'coordinates': [-122.4590635... Oceanview/Merced/Ingleside 1 2015-08-11 21:31:07 3 3 2015-08-11 21:23:37 2015-08-11 21:25:22 152233793-53 33 11 2015-08-11 21:45:21 53 2 MEDIC 2015-08-11 94112 3.0 37.710962 -122.458897 -122.459064 37.716603 626.309947 2015 8 Tuesday 21 night False
178877 MONTANA ST/CAPITOL AV True 2015-08-11 21:34:07 B09 8373 2015-08-11 Code 2 Transport 152233793 Medical Incident Potentially Life-Threatening San Francisco 2015-08-11 21:25:17 2015-08-11 21:24:17 3 9 NaT 15085156 {'type': 'Point', 'coordinates': [-122.4590635... Oceanview/Merced/Ingleside 1 2015-08-11 21:29:19 3 3 2015-08-11 21:23:37 2015-08-11 21:26:06 152233793-E15 33 11 NaT E15 1 ENGINE 2015-08-11 94112 3.0 37.710962 -122.458897 -122.459064 37.716603 626.309947 2015 8 Tuesday 21 night False
213327 200 Block of MONTANA ST False 2015-09-22 20:45:20 B09 8373 2015-09-22 Code 2 Transport 152653417 Medical Incident Non Life-threatening San Francisco 2015-09-22 19:31:16 2015-09-22 19:31:06 2 9 2015-09-22 20:22:01 15101602 {'type': 'Point', 'coordinates': [-122.4614245... Oceanview/Merced/Ingleside 1 2015-09-22 19:41:07 2 2 2015-09-22 19:29:53 2015-09-22 19:31:52 152653417-AM20 33 11 2015-09-22 19:53:10 AM20 2 PRIVATE 2015-09-22 94112 3.0 37.710962 -122.458897 -122.461425 37.716637 668.093854 2015 9 Tuesday 19 evening False

Below it is quite evident that this is the exact account of the night when the hospital stay happened. To double check, the website https://www.gps-coordinates.net/ was used to look for the longitude and lattitude data for the address of the home where I was staying. The coordinates are quite similar, matching up to 3 decimal places. The website's coordinates are: 37.71677777695844,-122.46134222684145.

In [27]:
df6_1.loc[df6_1.incident_number == '15101602'] # Show the incident number
Out[27]:
address als_unit available_dttm battalion box call_date call_final_disposition call_number call_type call_type_group city dispatch_dttm entry_dttm final_priority fire_prevention_district hospital_dttm incident_number location neighborhoods_analysis_boundaries number_of_alarms on_scene_dttm original_priority priority received_dttm response_dttm rowid station_area supervisor_district transport_dttm unit_id unit_sequence_in_call_dispatch unit_type watch_date zipcode_of_incident med_income starting_lat starting_long long lat distance year month dotw hour time_of_day holiday
213327 200 Block of MONTANA ST False 2015-09-22 20:45:20 B09 8373 2015-09-22 Code 2 Transport 152653417 Medical Incident Non Life-threatening San Francisco 2015-09-22 19:31:16 2015-09-22 19:31:06 2 9 2015-09-22 20:22:01 15101602 {'type': 'Point', 'coordinates': [-122.4614245... Oceanview/Merced/Ingleside 1 2015-09-22 19:41:07 2 2 2015-09-22 19:29:53 2015-09-22 19:31:52 152653417-AM20 33 11 2015-09-22 19:53:10 AM20 2 PRIVATE 2015-09-22 94112 3.0 37.710962 -122.458897 -122.461425 37.716637 668.093854 2015 9 Tuesday 19 evening False
213596 200 Block of MONTANA ST True 2015-09-22 19:49:22 B09 8373 2015-09-22 Code 2 Transport 152653417 Medical Incident Non Life-threatening San Francisco 2015-09-22 19:31:16 2015-09-22 19:31:06 2 9 NaT 15101602 {'type': 'Point', 'coordinates': [-122.4614245... Oceanview/Merced/Ingleside 1 2015-09-22 19:35:54 2 2 2015-09-22 19:29:53 2015-09-22 19:32:43 152653417-E33 33 11 NaT E33 1 ENGINE 2015-09-22 94112 3.0 37.710962 -122.458897 -122.461425 37.716637 668.093854 2015 9 Tuesday 19 evening False

6.2 Wildfires

News headline reads, 'Bay Area Fire Crews Headed To Butte County To Battle Camp Fire.'

'The San Francisco Fire Department has sent 22 firefighters, five engines and a battalion chief to the town of Paradise as part of an urgent-need strike team to protect structures.' - CBS

According to the article, the SFFD had sent teams to help in battling the wildfire. The next step is to analyze the dataset and search for the possibility of these observations being within the dataset.

The article can be found at: https://sanfrancisco.cbslocal.com/2018/11/08/bay-area-fire-crews-headed-to-butte-county-to-battle-camp-fire/

[Back to Table of Concents](#toc)

The first step is to narrow down to the date of the wildfire and the dates before the wildfire ended.

In [28]:
df6_2 = df.copy() # Narrow down to possible dates.
df6_2 = df6_2.loc[(df6_2['call_date'] >= '2018-11-08') & (df6_2['call_date'] <= '2018-11-25')]

Looking at the city variable, none of them seem to have left the Bay Area.

In [29]:
df6_2.city.value_counts() # Analyze cities of incidents
Out[29]:
San Francisco    15305
Treasure Isla       77
Presidio            27
Yerba Buena         19
Daly City           14
Fort Mason          11
Hunters Point        8
Name: city, dtype: int64

To be more certain, a strategy is to plot all the points and look for any that are in the area of the wildfires. Checking for Na's, it is clear that all observations in the dataset have longitude and latitude information.

In [30]:
print('Are there any Na\'s?:\t{}'.format(any(df['location'].isna()))) # Check Na's
Are there any Na's?:	False

Below is an example of the mapping working as intended, where the subset data is plotted over San Francisco.

In [31]:
# Below a shape file is used alongside Basemap to help plot the neighborhoods
# of San Francisco along with the land and water. A scatterplot of the
# locations are also overlayed on the map.
DATA_DIR = "data"
SF_NEIGHBORHOODS = "sffind_neighborhoods"
sf_neighborhoods = os.path.join(DATA_DIR, "%s.shp" % SF_NEIGHBORHOODS)
shp = fiona.open(sf_neighborhoods)

fig = plt.figure()
sf_map = Basemap(projection='tmerc', 
                 lat_0 = 37, lon_0 = -122,
                 llcrnrlon=-122.518792, llcrnrlat=37.705627,
                 urcrnrlon=-122.354458, urcrnrlat=37.833067,
                 resolution='h', area_thresh=0.001)

sf_map.drawcoastlines()
sf_map.fillcontinents(color='#ddaa66',lake_color='aqua')
sf_map.drawmapboundary(fill_color='aqua')
sf_map.readshapefile(os.path.join(DATA_DIR, SF_NEIGHBORHOODS), 'SF')

x,y = sf_map(df6_2.long.tolist(), df6_2.lat.tolist())
plt.scatter(x, y, marker='.', zorder=10, alpha=1)
plt.title("Distribution of Fire Department Incidents in San Francisco During the Wildfires", size=30)
fig.set_size_inches(15,15)
plt.rc('font', size=20)
plt.show()

Before it was determined that there are no Na's for the location column. Therefore, if any of the units that were sent from the SFFD had gone to Butte County they would be visible. However, none of them are apparent in the plot below. Therefore, it would make sense to believe that the units that were assigned to fight the wildfires are not listed in the dataset. It also makes sense that since the dataset is for 9-11 calls, there wouldn't have been a 9-11 call for the fire department since the incident is outside of the SFFD area.

In [32]:
fig = plt.figure() # Applying the same mapping above to the Butte County
wildfire_map = Basemap(projection='tmerc', 
                       lat_0 = 37, lon_0 = -122,
                       llcrnrlon=-123.0708, llcrnrlat=38.7695,
                       urcrnrlon=-119.9638, urcrnrlat=40.452,
                       resolution='h', area_thresh=0.001)

wildfire_map.fillcontinents(color='#ddaa66',lake_color='aqua')
wildfire_map.drawmapboundary(fill_color='aqua')
wildfire_map.readshapefile(os.path.join(DATA_DIR, SF_NEIGHBORHOODS), 'SF')

x,y = wildfire_map(df6_2.long.tolist(), df6_2.lat.tolist())
plt.scatter(x, y, marker='.', zorder=10, alpha=1)
plt.title("Distribution of San Francisco Fire Department Incidents in Butte County During the Wildfires", size=30)
fig.set_size_inches(15,15)
plt.show()

Comparison

By using an image of the wildfire location from Google Maps, it's possible to compare the two areas and see by the bodies of water such as Lake Tahoe. The above mapped area is clearly the same as the below location on Google maps. Despite this, there is no evidence that the SFFD had ever sent any trucks that were accounted for within the dataset. This does not deny that any were sent, merely that such events were not recorded within this dataset.

[Back to Table of Concents](#toc)

Google Maps



[Back to Table of Concents](#toc)

7. Exploratory Data Analysis pt. 3

Analysis of Response Times

One of the first variables that will be examined for the purpose of potential modeling is Response Time. This new variable is calculated using two currently existing variables in the dataset. The two variables being used are: received_dttm and on_scene_dttm.

Variable Description
Received DtTm Date and time of call is received at the 911 Dispatch Center.
On Scene DtTm Date and time the unit records arriving to the location of the incident

By calculating the difference, the response_time variable tells us how long it took for the fire department or other emergency response vehicle to respond to a 9-11 call after receiving information from a dispatcher. This is a conintuous variable, and this EDA will serve to help better understand the potential of doing modeling later on using techniques like Linear Regression.

[Back to Table of Concents](#toc)

7.1 Response Times

Response times are calculated using the variables that indicate when a 911 call is made, and when the unit arrives at the scene. The difference is turned into seconds, and the null values are removed. A boxplot indicates that there are various outliers which heavily skew the data. When using a boxplot, it is not used for any deeper purpose other than to examine outliers. The response time is then subset into seconds which are non-negative, and times which fall under an hour. This is roughly 98% of the data.

[Back to Table of Concents](#toc)

In [33]:
# First a copy is made of the data, then the received_dttm and
# on_scene_dttm variables are used to determine the actual
# response time for the vehicles from the time of 911-call
# until time of arrival to the incident. The observations
# for where there are no values for either are removed. The
# datetime package is used to determine the total seconds
# from call to arrival.
df7_1 = df.copy()
called_911 = df7_1.received_dttm
arrived_911 = df7_1.on_scene_dttm

where_arrive_time_missing = arrived_911.notnull()

arrived_911_clean = arrived_911[where_arrive_time_missing]
called_911_clean = called_911[where_arrive_time_missing]
arrived_911_clean = arrived_911_clean.reset_index(drop = True)
called_911_clean = called_911_clean.reset_index(drop = True)

FMT = "%Y-%m-%d %H:%M:%S"
tdelta = [(datetime.datetime.strptime(str(arrived_911_clean[x]), FMT) - 
           datetime.datetime.strptime(str(called_911_clean[x]), FMT)).total_seconds() \
          for x in range(len(arrived_911_clean))]

ax = sns.distplot(tdelta) # Boxplot of response times
ax.axes.set_title('Density Plot of Response Times with KDE Estimate', fontsize=30)
ax.set_xlabel('Time in Seconds',fontsize=20)
ax.set_ylabel('Frequency',fontsize=20)
Out[33]:
Text(0, 0.5, 'Frequency')

It is apparent from the histogram that the response time of most of the observations spike around 8 minutes, before slowly taking longer evident by a heavy skew to the right. To further determine a better subset, certain times are subset from the original. It is apparent that a very small amount of observations are considerably long (longer than an hour). Also, It is apparent from the boxplot that certain values are negative. This does not make sense and are likely errors. So a new subset will ignore these observations. This subset includes almost 99% of the data.

In [34]:
print(' The probability that response time is longer than an hour:\t\t{} \n \
The probability that response time is positive and less than an hour:\t{}'.format(np.mean([i > 
    3600 for i in tdelta]), np.mean([((i > 0) & (i < 3600)) for i in tdelta]))) # Show density of data
 The probability that response time is longer than an hour:		0.0025097717803050004 
 The probability that response time is positive and less than an hour:	0.9863413497642084

Below is a histogram of the response times using the new subset. Also included is a KDE line drawn over the histogram. From the plot it is apparent that there is a strong right skew to the data. The data also appears unimodal with a peak near 500 seconds.

In [35]:
t_delta_false = [((i > 0) & (i < 3600)) for i in tdelta] # Subset observations that are positive and less than an hour
t_delta_false = pd.Series(tdelta).loc[t_delta_false].reset_index(drop=True)

ax = sns.distplot(t_delta_false) # Histogram with KDE estimate
ax.axes.set_title('Density Plot of Response Times with KDE Estimate with Subset',fontsize=30)
ax.set_xlabel('Time in Seconds',fontsize=20)
ax.set_ylabel('Density',fontsize=20)
Out[35]:
Text(0, 0.5, 'Density')

Noticing a strong skew, the log is taken of the response time to observe whether or not there would be a significant improvement. After apply log, it seems that the distribution of the response time appears far more normal.

In [36]:
# To do additional plotting, it would be simpler to subset the
# dataframe so that the response time is included as a column
# and the rows with Na's are not included.
df7_1 = df7_1.loc[df7_1.on_scene_dttm.notnull()]
df7_1['response_time'] = pd.DataFrame(tdelta)
df7_1 = df7_1.loc[(df7_1.response_time > 0) & (df7_1.response_time < 3600)]

df = df7_1.copy() # Use as new dataframe

# Below the response time is logged and then plotted.
log_response = sns.distplot(np.log(df7_1.response_time))
log_response.axes.set_title('Log of the Histogram of Response Times',fontsize=30)
log_response.set_xlabel('Time in Seconds',fontsize=20)
log_response.set_ylabel('Frequency',fontsize=20)
Out[36]:
Text(0, 0.5, 'Frequency')

7.1.1 Transformations

The previous data shows that the response time is likely not ready to be modeled due to the extreme skew of the data. Below the Box Cox Transformation method will be used as the primary tool to determine what transformation may best be used on the response time variable. First log is used to examine the effect. Next, Box Cox Transformation is done using the SciPy package. The Box Cox Transformation shows that the λ which is optimal is quite close to 0. Under the rules of Box Cox Transformation, when λ=0, the transformation to use is log y. This would explain why the initial use of log on the response time variable so greatly improved the appearance distribution.

[Back to Table of Concents](#toc)

To further determine which transformation would be best for the response time variable, a Box Cox Transformation is done to determine what λ exponent would be best.

In [37]:
# Box Cox Transformation determines the optimal lambda, very close to 0.
x_transform, boxcox_log = stats.boxcox(df7_1.response_time)
print('The optimal value for lambda in the Box Cox Transformation is :\t{}'.format(boxcox_log))
The optimal value for lambda in the Box Cox Transformation is :	-0.13279947389715804
In [38]:
# The new transformation is shown below, where the data
# appears closer to the normal distribution
y_lambda = sns.distplot(x_transform)
y_lambda.axes.set_title('Box Cox Transformation Density Plot of Response Times',fontsize=30)
y_lambda.set_xlabel('Time in Seconds',fontsize=20)
y_lambda.set_ylabel('Frequency',fontsize=20)
Out[38]:
Text(0, 0.5, 'Frequency')

The code below is taken from Scipy's normaltest function to find out if a sample dataset is normally distributed. The null and alternative are:

H0:The dataset comes from a normal distribution.

H1:The null is not true.

In [39]:
k2, p = stats.normaltest(x_transform)
alpha = 5e-2
print("p-value = {:g}".format(p))

if p < alpha:  # null hypothesis: x comes from a normal distribution
     print("The null hypothesis can be rejected")
else:
     print("The null hypothesis cannot be rejected")
p-value = 0
The null hypothesis can be rejected

The result is that we choose to reject the null hypothesis in favor of the alternative at α=0.05 confidence level. Although the dataset may appear bell-shaped, the strong peak in the data may be the factor which is ruining the ability for the data itself to follow a normal distribution. Despite this result, we will continue looking at the modeling process to try and understand if any appropriate model can be found.

Lastly, the transformed response is added to the dataframe.

In [40]:
df['new_response'] = pd.DataFrame(x_transform) # Add the Box Cox Transformation

7.2 Number of Alarms

The number of alarms in an incident refers to the number of personnel which may be involved in increasingly serious situations. For example, a 1 alarm fire can require a fewer amount of fire trucks and emergency response vehicles to attend to the scene. However, as the number of alarms increase, this indicates that a situation requires more and more units to respond. This is one of the other variables within the dataset which have more meaning to them apart from being related to the time or location of the incident.

It becomes apparent that when accounting for multiple alarm fires, certain vehicles must travel from further distances from a scene in order to arrive. This may account for some of the reason why certain observations have such long response times. So this idea will be investigated in this section.

[Back to Table of Concents](#toc)

In [41]:
df7_2 = df.copy() # It is apparent that most incidents are 1 alarm
df7_2.number_of_alarms.value_counts()
Out[41]:
1    755510
2      1184
3       320
4       205
5        74
Name: number_of_alarms, dtype: int64

Looking at the 5-alarm fire, it turns out they are all observations which are responses to the same situation. It turns out from: https://abc7news.com/news/two-injured-in-fire-that-damaged-6-buildings-in-san-francisco/1391617/ that there was a severe fire which required this many units. Several houses were in danger due to flames, and two people were injured.

In [42]:
df7_2.loc[df7_2.number_of_alarms == 5].call_date.value_counts() # Examine the 5-alarm fire
Out[42]:
2016-06-18    74
Name: call_date, dtype: int64

Looking at the 4-alarm fire, similarly numerous response vehicles had to arrive to help out in a dangerous situation. In this case, there were five separate incidents which each accounted for the 281 observations. In one incident, there was a massive fire in the Mission district in which one person sadly had passed away. https://abc7news.com/news/1-dead-6-injured-in-sf-mission-district-fire-/495372/ Also, in another case there was a large fire in North Beach. https://sanfrancisco.cbslocal.com/2018/03/17/firefighters-battling-3-alarm-fire-in-north-beach/ The article describes how 40 trucks and 140 firefighters had arrived to the scene.

In [43]:
df7_2.loc[df7_2.number_of_alarms == 4].call_date.value_counts() # Examine the 4-alarm fire
Out[43]:
2015-01-28    67
2017-06-16    49
2016-10-20    49
2015-01-31    40
Name: call_date, dtype: int64

7.3 Looking for Predictors

When choosing a model for Linear Regression, it is important that there are suitable predictors in the model. Within this section we will analyze the predictors and their relationship to the response time by plotting. Ideally we are looking for variables with a linear relationship with the response variable. With continuous variables we will use the pairplot() function from Seaborn. This is an effective way to examine numerous continuous variables to understand their relationship with the response variable. The other method for dealing with categorical variables will be to utilize boxplots to see if any differences amongst categories are significant enough to act as predictors (e.g., a difference in the distribution of male and female boxplots when examining gender).

[Back to Table of Concents](#toc)

Below is a pairs matrix of the continuous variables with the two possible response variables, 'new_response' and 'response_time.' A linear regression line is drawn through the plots to help identify which variables may serve as valuable predictors. A subset of the dataframe is taken for the pairs matrix, due to the process taking too long for the entire dataset. (I understand that the labels should be rotated, and that the title should be more centered, however this has not yet been achieved. Finding a solution online has been troublesome.)

In [44]:
df7_3 = df.sample(1000).copy() # Sample a subset due to time of function running

Looking at the top two rows, it is apparent that none of the numeric variables exhibit the desired linear diagonal relationship with either 'response_time' or 'new_response.' This is disappointing and shows that the dataset will have serious issues with trying to predict these two variables.

In [45]:
df7_3.rename(columns={'unit_sequence_in_call_dispatch':'unit_seq'}, inplace=True)
ax = sns.pairplot(df7_3[['new_response', 'response_time', 'year',
                         'month', 'hour', 'med_income', 'number_of_alarms', 
                         'unit_seq', 'distance']], kind="reg")
ax.fig.suptitle('Pairs Matrix of Possible Continuous Predictors')
plt.show() # Create a pairs matrix of the continuous variables

Below is an example of using boxplots to understand the possibility of categorical predictors within the dataset. Different categorical variables were tested, but not all are shown. It is apparent that a significan amount of outliers exist throughout the data, and yet most of the data centers around just below 500 seconds. Only in the 'city' variable does there seem to be any slight variation. However, the majority of the data is based in San Francisco, so it is hard to imagine what power such a parameter would have in the final modeling. The neighborhood variable is also surprisingly not showing any desirable distribution. It seems that no matter which neighborhood, the fire department will regularly have the same distribution of response times. Considering the geospatial heatmap of the city earlier, it was imagined that certain areas would be more efficient. It is possible that the design of the San Francisco Fire Dept. layout is to place themselves strategically so that they are within a similar response time throughout the area. This would make it efficient for them to arrive at a scene in an efficient manner.

In [46]:
ax = sns.boxplot(x='response_time', y='neighborhoods_analysis_boundaries', data=df)
ax.set_title('Boxplot of Neighborhoods vs. Response Time')
ax.set_ylabel('Neighborhoods')
ax.set_xlabel('Response Times') # Boxplot of neighborhoods vs. response times
Out[46]:
Text(0.5, 0, 'Response Times')
In [47]:
ax = sns.boxplot(x='response_time', y='battalion', data=df)
ax.set_title('Boxplot of Battalion vs. Response Time')
ax.set_ylabel('Battalion')
ax.set_xlabel('Response Times') # Boxplot of battalion vs. response times
Out[47]:
Text(0.5, 0, 'Response Times')
In [48]:
ax = sns.boxplot(x='response_time', y='city', data=df)
ax.set_title('Boxplot of City vs. Response Time')
ax.set_ylabel('City')
ax.set_xlabel('Response Times') # Boxplot of city vs. response times
Out[48]:
Text(0.5, 0, 'Response Times')

8. Exploratory Data Analysis pt. 4

Priority Level Analysis

Another variable which is of potential interest is the priority level. The priority level according to the variable description provides the information on how immediate a particular incident is. Below are the three priority variables that are within the dataset. The values which are possible range from numbers 1, 2, and 3, to letters A, B, C, E, and I. The variable descritions indicate that there are the two major values 2 and 3. The former represents a Non-Emergency, and the latter represents an Emergency.

Variable Description
Original Priority Initial call priority (Code 2: Non-Emergency or Code 3:Emergency).
Priority Call priority (Code 2: Non-Emergency or Code 3:Emergency).
Final Priority Final call priority (Code 2: Non-Emergency or Code 3:Emergency).

Due to the nature of these variables, it would be of interest to possibly predict if a new observation will be Code 2 or 3 depending on certain conditions.

After emailing the offices who are responsible for the SFFD dataset, a response gave back information on the other values of the variables other than Code 2 or 3. It was explained that variables such as A, B, C, etc., are related to different medical situations such as breathing status or allergic reaction. In the email, it was stated that levels A, B, and C generally are within Code 2, while D and E are generally Code 3. Level I has relatively few occurrences, and therefore will be dropped since there was no mention of it within the email. The same applies for Code 1 incidents.

[Back to Table of Concents](#toc)

8.1 Priority Level

The actual meaning of the priority levels are not clear from the dataset alone. However, after going through an email conversation with those responsible for the dataset, certain specific levels will be imputed with their more general priority level. Below is a table of changes:

Previous New
A, B, C 2 (driving without lights/sirens)
D, E 3 (driving with lights and sirens)
1, I dropped

This was done for the variables 'original_priority' and 'priority.' In the 'final_priority' variable, all values are already either 2 or 3.

[Back to Table of Concents](#toc)

In [49]:
df8 = df.copy() # Drop or change priority levels

df8.original_priority.replace(['A', 'B', 'C', 'D', 'E'], # Change 'original_priority'
                              ['2', '2', '2', '3', '3'], inplace=True)
df8 = df8[(df8.original_priority != 'I') & (df8.original_priority != '1')]
df8.original_priority.dropna(axis=0, inplace=True)

df8.priority.replace(['A', 'B', 'C', 'E'], # Change 'priority'
                     ['2', '2', '2', '3'], inplace=True)
df8 = df8[(df8.priority != 'I') & (df8.priority != '1')]

df = df8.copy()

8.2 Hierarchical Clustering

Previously, to develop a better understanding of the varying priority levels, an unsupervised learning technique called Hierarchical Clustering was going to be used. Hierarchical clustering in seaborn works in a bottom-top fashion. Rather than assigning labels prior to modeling, this method checks for groups which stand out when plotted on a heatmap. The function will also further sort these into groups so that certain categories can be either grouped or differentiated from others.

[Back to Table of Concents](#toc)

Prior to contacting the office that is responsible for the dataset, Hierarchical Clustering was used to attempt to understand what sort of pattern was underlying the priority level. At the time, the call types would be grouped into Code 3 or everything else into an 'other' group. However, after coming into contact with the office and finding out more information, this previous method is no longer desirable. Having true insight into the meaning of the data is better than merely following the patterns that may be random within the dataset.

It became apparent that the priority levels are separated into either Code 3 or not Code 3. Also, the call type groups were separated into situations which either required immediate response or not. This could have been thought of as Non Life-threatening and others. The analysis turned out to be incorrect, and is just from modeling patterns rather than knowing the meaning behind variables. The code below is just an example of old planning, and is just to demonstrate at this point the possibility of Hierarchical Clustering.

In [50]:
# The relevant columns are subset from the original dataframe. Afterwards the
# priority levels which are common are utilized. A groupby is done so that
# the data can be prepared for hierarchical clustering. Lastly the data
# is plotted below.
df8_2 = df[['priority', 'call_type_group']].copy()

df8_2 = df8_2.groupby('call_type_group')
priority = df8_2.priority.value_counts().unstack().fillna(0)
priority_normalized = priority.div(priority.sum(axis=1), axis=0)

h_cluster = sns.clustermap(priority_normalized, annot=True, cmap='Reds', fmt='g')
h_cluster.fig.suptitle("Hierarchical Clustering of Call Type Group vs. Priority", size=25)
ax = h_cluster.ax_heatmap
ax.set_xlabel('Priority Level')
ax.set_ylabel('Call Type Group')
plt.show()

9. Modeling pt. 1

Response Times

Previously some analysis was already done on the response time variable. Box Cox Transformation was used, and several predictor variables were analyzed. The model used to analyze this variable is Linear regression. Below is an analyze of the modeling using Linear regression and whether this model properly fit the data.

[Back to Table of Concents](#toc)

9.1 Linear Regression

A model which may be suited for the goal of the analysis is Linear Regression. More specifically, Multiple Linear Regression (MLR). The dependent variable which we would like to model is the response time. This is a continuous variable that was created as a feature from the dataset. It would be interesting to be able to understand what could cause an incident to take a long time or perhaps a short time.

First I will state some of the assumptions that are important towards a proper MLR model:

1. Linear relationship between outcome and independent variables.
2. Residuals are normally distributed.
3. There is no multicollinearity between the independent predictor variables.
4. Homoscedasticity, the variance of the error in the independent variables are similar.

The first assumption has been determined to be violated thus far in the process of trying to search for suitable predictor variables. Previously, a Seaborn pairplot() was shown, which showed that the response time and the log of the response time lacked a linear relationship with any of the other variables in the dataset. Another major issue is trying to add continuous variables to the X matrix. The resulting Adjusted R2 is 0. Despite these hurdles, we will move on with modeling to examine the results.



[Back to Table of Concents](#toc)

When doing MLR it is important to build a model which can properly estimate a response variable. These predictor variables can be of various types, including categorical. To use a categorical variable as a predictor requires more work, as they need to be factored into indicator variables. There is a scikit-learn function called OneHotEncoder, which does the process of One Hot Encoding (OHE). OHE helps to change variables which were previously categorical into indicator variables. The next step after creating the indicator variables is to subtract one of the levels to account for degrees of freedom.

In [51]:
def categorical_OHE(data):
    """
    The categorical_OHE() function takes in a categorical variable from a dataset
    and uses One Hot Encoding to change the column into a dataframe which is
    identifiable through an array of binary digits. The columns for each of the
    possible categories are also labeled within the dataframe.
    input: data, a categorical column from a dataset
    output: ohe_df, a dataframe which has been binarized using onehotencoding
    """
    values = array(data) # Binarize the categorical variable
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    
    ohe_df = pd.DataFrame(onehot_encoded) # Change binarize column to dataframe
    data_names = list(pd.Series(data.unique().tolist()).sort_values())
    ohe_df.columns = data_names
    return ohe_df

Below, the dataframe with the log of the response time is copied and used as the index for which observations may be utilized within MLR.

In [52]:
df9 = df.reset_index(drop=True).copy() # Subset the relevant columns
df9.drop_duplicates('incident_number', inplace=True)
df9 = df9[['new_response', 'call_type', 'time_of_day', 'hour', 'unit_type',
             'neighborhoods_analysis_boundaries', 'holiday']]
df9.dropna(inplace=True)

In the following block of code, certain variables are chosen from the dataset to be used as predictors. They are each put through the OHE function to turn them into indicator variables for each level of the predictor. The last column is also removed. Certain variables are not used for the reason that they have too many levels, and with excessive levels of a parameter, the R2 can become inflated.

In [53]:
call_type_indicator_var = categorical_OHE(data=df9.call_type).iloc[:, :-1] # OHE the variables
tod_indicator_var = categorical_OHE(data=df9.time_of_day).iloc[:, :-1]
unit_type_indicator_var = categorical_OHE(data=df9.unit_type).iloc[:, :-1]
hour_indicator_var = categorical_OHE(data=df9.hour).iloc[:, :-1]
district_indicator_var = categorical_OHE(data=df9.neighborhoods_analysis_boundaries).iloc[:, :-1]
holiday_indicator_var = categorical_OHE(data=df9.holiday).iloc[:, :-1]

The predictor variables are later combined into a single X matrix.

In [54]:
X = pd.concat([district_indicator_var, unit_type_indicator_var,
           tod_indicator_var, call_type_indicator_var, hour_indicator_var,
              holiday_indicator_var], axis=1)
y = df9.new_response
y.reset_index(drop=True, inplace=True) # Combine the variables and create y vector

Below is a printout of the summary from doing Ordinary Least Squares regression on the model. The measure which is of interest is the Adjusted R-squared, which in this case is surprisingly high. Previously, when analyzing individual variables and their ability to act as predictors for the response variable has been questionable, with not many appearing to fit linear the model. The result can possibly be a case of overfitting, where the numerous predictor variables are artificially inflating the Adjusted R-squared value.

In [55]:
model = sm.OLS(y, X) # Fit the model using statsmodels
fit = model.fit()

The p-values are not shown due to the length of the output. However, the Adjusted R2 is shown below (possibly over-inflated).

In [56]:
print('The Adjusted R-squared is:\t{}'.format(fit.rsquared_adj))
The Adjusted R-squared is:	0.9965046987632735

Further examining of the residuals and tests for homoscedasticity will help to determine whether or not the model is appropriate. Below, is some code found online which mimics some of the output seen in RStudio after doing a Linear regression model. The first step is to gather the residuals, Pearson standardized residuals, and fitted values.

In [57]:
# Put residuals, standardized, and fitted values into a data frame
results = pd.DataFrame({'resids': fit.resid,
                        'std_resids': fit.resid_pearson,
                        'fitted': fit.predict()})

Below is a plot of the residuals against the fitted values. However, the result is not what was desired. Usually in this type of plot what is desired is a random scatter of points. Yet in this plot, there is a distinct split between two groups within highly localized areas. This plot shows that there is non-linear relationship between the predictor variables and the response variables (Assumption 1 above). Previously during EDA, it was shown that this was already the case with all the predictor variables that were found. Therefore, this result should not be surprising.

In [58]:
# Below the residuals are plotted against the fitted values.
residsvfitted = plt.plot(results['fitted'], results['resids'],  'o')
l = plt.axhline(y = 0, color = 'grey', linestyle = 'dashed')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show(residsvfitted)

Below is a Normal Q-Q plot, it shows whether the residuals are normally distributed or not (Assumption 2 from above). The plot looks fine for the values between -2 and 3. However, the tails are quite heavy, with some seious outliers on both ends. This suggests that the residuals are not quite normal, an issue that was foreseen previously when looking at the sample distribution of the response variable. The response variable had a high peak and was tested to not follow a normal distribution. It is possible that is leading to the outliers seen below.

In [59]:
# Below is a Normal Q-Q plot
qqplot = sm.qqplot(results['std_resids'], line='s')
plt.title('Normal Q-Q')
plt.show(qqplot)

The following Scale-Location plot is used for tests of homoscedasticity. When there is a relatively random spread across the entire plot, it would be indicative of the residuals being equally shared by the predictors (Assumption 4 above). However, in this plot there are two separate distinct groups, violating this important assumption.

In [60]:
# Scale-Location plotted below
scalelocplot = plt.plot(results['fitted'], abs(results['std_resids'])**.5,  'o')
plt.xlabel('Fitted values')
plt.ylabel('Square Root of |standardized residuals|')
plt.title('Scale-Location')
plt.show(scalelocplot)

At this point it is clear that despite the impressive Adjusted R2, the resulting residual analysis shows that the model is unsuitable for the data. It would be possible to try and progress further with a Stepwise regression process previously found online (available Python packages have lacked this functionality). However, both a lack of available time, and the severity of the residuals above show that it is not likely that a Stepwise regression model would be able to make up for the serious model violations that have been shown.

10. Modeling pt. 2

Priority Level

Previously the priority level was shown to be broken down into a binary split of Code 2 or 3. This response variable fits the category of Logistic regression, and so a modeling process is shown below along with an accuracy test to show whether or not the model is appropriate.

[Back to Table of Concents](#toc)

10.1 Logistic Regression

Logistic Regression is different from Linear Regression in that the response variable that we are trying to predict is not a continuous variable. Linear Regression in the binary case (the one being used here) is trying to predict a categorical variable with something akin to a 'yes' or 'no' value. Previously it was seen that the Linear Regression model assumptions were being violated, and that would explain for the poor results during modeling. Logistic Regression has the advantage that such assumptions like linearity between dependent and independent variables or homoscedasticity are not important. However, there are other assumptions which are required.

Below are assumptions for the Logistic Regression model:

1. Binary Logistic Regression requires that the dependent variable is binary.
2. Observations must be independent.
3. Little to no multicollinearity between independent variables.
4. Linearity of independent variables and log odds.
5. High sample size.

In our case, the first assumption is fine. The dependent variable being modeled is the priority level, which has already been determined to be broken down into either Code 2 or 3. The second assumption is not certain. Whether a fire incident in one area is related to a fire incident in another area is not certain. It would be interesting to note that an observation can't occur in certain areas such as in the ocean or in areas with low populations. Certain independent variables are based off of others, so it would be important to choose one and ignore the other. The fourth assumption must be examined through plotting. In the last assumption there is no issue.

[Back to Table of Concents](#toc)

In [61]:
df10 = df.reset_index(drop=True).copy()

df10.drop_duplicates('incident_number', inplace=True)
df10 = df10[['priority', 'call_type', 'time_of_day', 'distance', 'hour', 'unit_type',
             'neighborhoods_analysis_boundaries', 'med_income', 'holiday', 'response_time']]
df10.dropna(inplace=True)

An important factor of Logistic regression is that the binary response variable is reasonably split in terms of proportion. Here it is possible to see that the split is roughly 60/40. Therefore, there is no issue in this case with the proportion. An undesirable split would be something similar to 90/10.

In [62]:
df10.priority.value_counts()[0] / (df10.priority.value_counts()[0] + 
        df10.priority.value_counts()[1]) # Show proportion of binary variable
Out[62]:
0.5912225734495408

Below are the categorical and continuous variables which are combined and used as the predictor variables. The categorical variables use the same OHE encoding as before in Linear regression. Unlike before, the continuous variables do not have an issue with prediction.

In [63]:
# Below the same categorical variables are encoded using OHE (different dataframe),
# and the continuous variables are added this time. They are combined into a dataframe
# where the y vector is a binary form of the priority variable.
continuous_var = pd.concat([df10.distance, df10.med_income, df10.response_time], axis=1).reset_index(drop=True)

call_type_indicator_var = categorical_OHE(data=df10.call_type).iloc[:, :-1]
tod_indicator_var = categorical_OHE(data=df10.time_of_day).iloc[:, :-1]
unit_type_indicator_var = categorical_OHE(data=df10.unit_type).iloc[:, :-1]
hour_indicator_var = categorical_OHE(data=df10.hour).iloc[:, :-1]
district_indicator_var = categorical_OHE(data=df10.neighborhoods_analysis_boundaries).iloc[:, :-1]
holiday_indicator_var = categorical_OHE(data=df10.holiday).iloc[:, :-1]


logit_test = linear_model.LogisticRegression(fit_intercept=False)
y_log_reg = np.where((df10.priority == '3'), 'Top', 'Not-top')
x = pd.concat([continuous_var, district_indicator_var, unit_type_indicator_var,
           tod_indicator_var, call_type_indicator_var, hour_indicator_var,
              holiday_indicator_var], axis=1)

Below is a printout of the mean accuracy rate after fitting the Logistic regression model. The mean accuracy is roughly 70%, which is not too poor.

In [64]:
logit_test = linear_model.LogisticRegression(fit_intercept=False)
fit = logit_test.fit(x, y_log_reg)
print('The mean accuracy rate is:\t{}'.format(fit.score(x, y_log_reg)))
The mean accuracy rate is:	0.7133054677487259

Next we will examine some of the βs that were created by the Logistic regression model.

Below is the β which is closest to 0. In other words, the 'distance' variable has the least effect compared to the rest of the other variables in terms of it's effect in predicting the response. This is logical for the reason that the least likely predictor when it comes towards the priority level of the situation is the distance to the location. If there is an emergency situation, it would only make sense that the priority level is assigned based off the incident and not in any way connected to the distance of the fire station to the incident itself.

In [65]:
x.columns[np.where((np.abs(fit.coef_) == np.min(np.abs(fit.coef_)))[0])] # Analyze the beta closest to 0
Out[65]:
Index(['distance'], dtype='object')

The largest β is the 'Alarms' category from the 'call_type' variable. This is a logical result, because when fire alarms go off, it makes sense that these situations require fire fighters to react right away by sending emergency vehicles before the fire or other danger spreads further.

In [66]:
x.columns[np.where((np.abs(fit.coef_) == np.max(np.abs(fit.coef_)))[0])] # Analyze the largest beta
Out[66]:
Index(['Alarms'], dtype='object')

The smallest β is the 'Traffic Collision' category from the 'call_type' variable. In Logistic regression both the large positive and large negative values have important effects on the result. It is the number closest to 0 which has the least effect. So, when traffic collisions occur the model predicts that there is more of an effect of the priority being a lower level. It is possible that in general, traffic accidents are not as dangerous. For that reason, 9-11 response vehicles don't always arrive with the greatest priority. However, when there are serious incidents, the frequency of those top priority incidents don't outweigh the effect of the more common traffic incidents which are not as dangerous.

In [67]:
x.columns[np.where((fit.coef_ == np.min(fit.coef_))[0])] # Analyze the smallest beta
Out[67]:
Index(['PRIVATE'], dtype='object')

A final test is to use Leave-5-out cross-validation to re-test the accuracy of the Logistic regression model. A serious error would be to perform an estimate using the data which the model was trained on. The advantage of using cross-validation is that it will partition the dataset, using new tests and training sets so that any outliers can possibly be accounted for when only testing the dataset using only one instance of partitioning.

In [68]:
clf = LogisticRegressionCV(cv=5, random_state=0, # Leave-5-out cross-validation
                            multi_class='multinomial').fit(x, y_log_reg)

The result is actually a slightly larger accuracy rate than the one that was predicted previously. This shows that the model may perhaps be useful in trying to predict whether or not a response variable would be a high or low priority incident.

In [69]:
clf.score(x, y_log_reg) # Mean accuracy of the L5O CV
Out[69]:
0.6843582890810173

Similar to the Linear regression case, it would be possible to utilize Stepwise regression to try to build a more accurate model. However, there is a lack of time to do so, and this will not be done for the project. A Stepwise regression model could possibly build a stronger model using select categories from the predictor variables, and they would be scored based on their AIC. Also, a Goodness-of-Fit test could also be done using Logistic regression to identify how appropriate the model is.

Thinking on the usefulness of the model, it is not certain at what level of accuracy would the test be worth sharing with others. For instance, it is possible that either the Sensitivity or the Specificity of the test make it such that the test is not useful in a practical situation. Also, there could be an issue with False Positives or False Negatives leading to errors which are not acceptable. Initially it seemed that predicting such an outcome would be desirable and practical for fire departments (e.g., extra units or higher alert level depending on area, time of day, etcetera). However, without a proper metric to compare this to, there could be unacceptable risks that are unaccounted for. An example would be predicting falsely at a time when the risk is too high.

11. Modeling pt. 3

Poisson Distribution

During World War 2, London was under attack by German rockets. The British wanted to determine whether or not the Germans were targeting any specific location in London. There was uncertainty to the level of technology that Germany rockets have achieved at the time. Therefore, a British statistician divided London into equal squares to determine the proportion of hits. The sample proportion followed highly closely a Poisson Distribution, and so the rockets were considered to be randomly falling on the city.

A similar strategy will be used to try and determine if the 9-11 calls are randomly occurring. According to Wikipedia, the Poisson distribution is described as such:

"...is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event."

In this situation, the events are 9-11 calls. However the fixed interval of time or space is somewhat more difficult. If using a fixed space, then the space would be San Francisco. However, if events were to occur independently in this space there may be some violation of the assumption. Within the city there are certain places such as Lake Merced where there are no 9-11 calls from within the area. Also, there are places such as the ocean where 9-11 calls can also not take place. Therefore, a fixed time interval will be used instead due to the possibility that the assumption will not be as extremely violated. In this case, one particular hour will be used of a given day.

[Back to Table of Concents](#toc)

The first step is to drop duplicates of incident numbers. Each 9-11 call can have several vehicles arrive at the scene. Each vehicle in the dataset will contain an observation. Therefore, to narrow down the data to individual cases, duplicates are dropped.

In [70]:
df11 = df.copy() # First a copy of the data is made, and the duplicate incidents are dropped.
df11.drop_duplicates('incident_number', inplace=True)
df11 = df11.loc[(df11.year == 2016) & (df11.month == 6) * (df11.hour == 15)]
df11 = df11.reset_index(drop=True)

Afterwards, a range of longitude and latitude data are taken to help create a grid. Once this is finished, four variablese lat_top/bottom and long_top/bottom are created. They represent dummy variables which sort through all the possible n×m grid coordinates. The purpose is that it would be simpler to count the grid itself and find out how many observations are within each grid rather than to go through each observation to count what grid each observation is within.

In [71]:
# The range of the coordinates are used, and a grid is created.
# Each cell of the grid is then scanned to save time over
# checking each observation. This creates a count for all
# of the samples in each grid.
long_range = np.linspace(min(df11.long), max(df11.long), 18)
lat_range = np.linspace(max(df11.lat), min(df11.lat), 20)

m = len(long_range)
n = len(lat_range)

lat_top = np.repeat(lat_range[:-1], [m-1]*(n-1), axis = 0)
lat_bottom = np.repeat(lat_range[1:], [m-1]*(n-1), axis = 0)
long_left = np.array(list(long_range[:-1])*(n-1))
long_right = np.array(list(long_range[1:])*(n-1))

grid_cord = pd.DataFrame({'lat_top':lat_top, 'lat_bottom':lat_bottom, 'long_left':long_left, 'long_right':long_right})

def cell_count(i):
    count = sum((df11.lat < grid_cord.lat_top[i]) & (df11.lat > grid_cord.lat_bottom[i]) & 
                (df11.long < grid_cord.long_right[i]) & (df11.long > grid_cord.long_left[i]))
    return(count)

counts = [cell_count(i) for i in range(grid_cord.shape[0])]

Once the sample counts are collected from the data, then it is possible to construct the data frame and perform the necessary statistics to find out the Poisson Probability and the Poisson Count for each cell.

The sample proportion is determined by taking the number of cell counts with x number of calls and dividing it by the total number of counts for all cells.

The sample mean, or ˉλ is taken by taking the total number of calls multiplied by the cell counts, summing, and dividing by the total number of cell counts.

The Poisson Probability is calculated by using the Poisson function from SciPy, and using as inputs the number of calls along with ˉλ.

The Poisson Count is calculated by multiplying each Poisson Probability with the total number of calls.

In [72]:
# Strange error, requires having import nearby.
from scipy.stats import poisson
# The dataframe is first created called sample_prop_table.
# Then the statistics for sample proportion, Poisson
# probability and Poisson count are calculated.
grid_cord['count'] = counts
grid_cord_counts = grid_cord['count'].value_counts()
sample_prop_table = pd.DataFrame({'number_of_calls':grid_cord_counts.index, 
                                  'cell_counts': np.array(grid_cord_counts)})

cell_sum = sum(sample_prop_table.cell_counts)
sample_prop_table['sample_prop'] = sample_prop_table.cell_counts.apply(lambda x: x / cell_sum)
lambda_bar = sum(sample_prop_table.number_of_calls * sample_prop_table.cell_counts) / cell_sum
sample_prop_table['pois_prob'] = poisson.pmf(sample_prop_table.number_of_calls, lambda_bar)
sample_prop_table['pois_count'] = sample_prop_table.pois_prob * cell_sum

sample_prop_table.head()
Out[72]:
number_of_calls cell_counts sample_prop pois_prob pois_count
0 0 177 0.547988 0.187904 60.692838
1 1 57 0.176471 0.314142 101.467902
2 2 28 0.086687 0.262596 84.818370
3 3 22 0.068111 0.146338 47.267203
4 4 12 0.037152 0.061163 19.755642

Ideally, if the data truly followed a Poisson distribution and could be thought of as possibly being random, then the Poisson Count would look similar to the cell counts. However, this is not the case which is evident in the table above. A possibility is that there is a case of a Zero-inflated Poisson Distribution. This is not a simple calculation in comparison, and is a rather advanced level topic which is not covered in undergraduate studies. Therefore, I will not go as deep into the testing. Instead, a Chi-squared Goodness of Fit test will be used to analyze whether the data follows a Poisson distribution.

H0: There is no significant difference between the observed and the expected value.

H1: There is a significant difference between the observed and expected value.

df=n2=5732=571

So therefore the decision rule is to reject H0 if χ2>644.8

The calculation for χ2 is as follows: χ2=[(OE)2E] where O is the observed value, and E is the expected value.

In [73]:
print('The calculated test statistic is:\t{}'.format(sum((sample_prop_table.cell_counts - 
                            sample_prop_table.pois_count)**2 / sample_prop_table.pois_count)))
The calculated test statistic is:	1.0865263262610718e+73

Since the value for χ2 is much larger than 644.8, we reject the null hypothesis in favor of the alternative hypothesis. The conclusion is then that the data does not follow a Poisson distribution.

However, it is recognized that certain cells are definitely 0, and not randomly 0. So this violates the assumption of independence. For instance, cells which are in the ocean will not have any call counts. There may be a case of overdispersion within the data, and therefore alternate models may be more fitting such as Negative Binomial or Zero-inflated Poisson Distributions.

12. Conclusion

When first selecting the dataset, this dataset appeared to be quite valuable. There were numerous observations, along with a healthy number of columns. As the time spent working with the dataset grew, it became apparent that utilizing this dataset would become quite difficult. The reason being that despite having numerous variables, many of them were highly correlated with each other. The effect is that this narrows down the number of usable variables. For instance, several variables are related to the date, time, and location of an incident. Although such information is useful, it is difficult to utilize every one of them when they are so like each other. Also, another major issue is a lack of numeric variables. Often in Statistics, when models are being created based on data, the data often contains various numeric values which help in the process of estimation and prediction. Not having a significant number of continuous variables can make the task of prediction quite daunting.

Overcoming these initial hurdles was not an easy task. A process of creating various new features was the initial goal. Trying to do feature engineering based off the currently existing data would often lead to variables which were not highly different from each other. For instance, creating new columns such as month, day, year, and hour does not prove to be incredibly beneficial in the modeling process. Although web scraping was done, it is not easy to find a great deal of publicly available data on San Francisco that would be useful in the process of data analysis. Going through the steps of EDA, however, helped to show more clearly what sort of variables existed in the dataset, and how they could more easily be understood for their use during modelling.

Some of the initial Data Visualization steps proved to be the most insightful. Creating a heatmap of the times of call types showed that most incidents would occur within a specific timeframe. The geospatial frequency heatmap also showed that incidents would mainly occur around certain areas rather than others. Looking at specific incidents which was like finding a needle in a haystack helped in developing interest and closeness with the dataset itself. It is also interesting to find out that most of the calls that occur are specifically Medical-related. The actual proportion of fire incidents is very low in relation to other call types. Also, when fires do occur, in may require over 50 emergency response vehicles to arrive at the scene, indicated by the number of alarms that are required.

When it comes down to the modeling itself, two variables became of interest. The response time of the vehicles to the scene, along with the priority level. It was interesting to note that emergency response vehicles could take up to an hour to arrive, something that seems impractical. However, the true reason for this is not yet known, while guesses could be made as to the reason why this is so. It is possible that the 9-11 call was made, but the actual request for the response vehicle was not made until later after police had already arrived. Also, the priority level is something of interest in that call types are all broken down ultimately into either Code 2 or 3. This binary result shows whether the emergency response vehicles will be driving with lights and sirens on. Trying to understand this situation and how it relates to the other variables in the dataset could provide valuable insight into what is happening on a regular basis. Lastly, after seeing that the frequency of calls in a geographic heatmap were not evenly spaced out, it was of interest to try and determine what sort of distribution the incidents followed.

Response times were a difficult variable to try and understand. Modeling them using Linear regression seemed to be a wise choice in the beginning. However, when analyzing the distribution, it was not possible to transform it such that it would follow a normal distribution. Also, searching for predictors within the dataset was surprisingly difficult. Variables that were imagined to be useful proved to be irrelevant. For example, the neighborhoods all share a similar distribution of response times. When doing a pairs matrix, there was a lack of meaningful continuous variables which could be modeled along with the response time. During the process of Linear regression, there were several obstacles along the way. Using sci-kit learn, there was an incredibly weak R2 that resulted from the fitting. However, when modeling with the statsmodels package, an overly inflated Adjusted R2 would result. Such an issue seems to be something that can happen often in Python. Considering that the language itself is rather bare-boned and relies primarily on third-parties for tools, having trust in the results of the modeling can be debatable. Even RStudio suffers from a similar issue where results are dependent on third-party packages that are not trusted. During the analysis of the residuals, none of the plots seemed to support any of the assumptions of Linear regression. It would have been possible to continue with a Stepwise regression method to search for a suitable model, but there was a lack of time in addition to assumptions already being violated severely.

The second part of the modeling proved to be more productive. In this case, there were no issues including continuous variables alongside categorical variables. The resulting mean accuracy using sci-kit learn's score() function showed that the accuracy was relatively decent, being above 0.7 in the Leave-5-out cross-validation test. It is possible that the model is effective in predicting whether an event would be a Code 2 or 3 incident based on certain inputs. A possibility moving forward (given more time) would be to use Stepwise regression to create a reduced model in comparison to the current 'full' model. The two models could be compared using the Likelihood Ratio test G2 to find out if the new reduced model was more suitable. However, an issue would be having trust in certain functionality which is not widely known or tested in Python. An easier solution would be to move the dataset to RStudio and perform the Stepwise regression to identify the predictors and their test statistics. However, Python in general seems most suited for Machine Learning algorithms rather than Statistics models which already have their own software (e.g., SAS, STATA, etcetera). The conclusion of the modeling showed that there was doubt about the practicality of the model due to accuracy issues. Although a 70% mean accuracy rate may appear good during the modeling process, the issue with False Positives or False Negatives in a real-life environment could prove to be too severe of a risk.

The third part of the modeling involving the distribution of the data was perhaps of greatest interest. The dataset in general appears rather scattered, with many variables describing categorical aspects that are difficult to judge. The reason being that they are more difficult to interpret in a logical manner compared to numeric variables. So, when trying to determine the geographic counts of the data, it was fascinating trying to find out whether the data followed a Poisson distribution as suspected. The results were somewhat disappointing, as they were unable to mimic the results of the British Statistician during World War II. However, looking at the Poisson counts, they seemed somewhat to follow the sample counts, despite the large Chi-squared test statistic. It is possible that more advanced modeling utilized a Zero-inflated Poisson distribution could make up for the lack of fit.

In the end a major frustration was possibly that the dataset chosen is not optimal for analysis. Despite a plentiful number of observations, it was still difficult to perform modeling. I believe that a more central issue is that the response variables which we were trying to predict are within too narrow of a scope in relation to other variables. To be more concise, there are already so many fire stations located in the city, and it takes on average only a few minutes for a response vehicle to arrive. In such a case, it is not necessarily sensible to try and do any modeling. In a system where events occur regularly and at a common pace within a narrow space and timeframe, it would not make sense to try and perform any prediction. This would be akin to trying to predict when a street light turns red or green when it already does so at regular intervals. Another view is that Data Science can be considered a tool for exploiting inefficiencies within a system so that they can be acted upon and used for the benefit of one or multiple parties. An example would be a business finding out that a demographic was split into multiple groups, and advertising to each of them differently to increase revenue. However, in this dataset, it is such that no matter where or when the incident takes place, the resulting response is quite similar. This is not necessarily a bad thing either, it shows that the San Francisco Fire Department has trained incredibly hard to make an efficient system where they are able to save lives in a fluid and practical manner.

Beyond the idea of choosing a new dataset to work with, perhaps if there were more time it would be possible to do additional EDA to better understand the variables. If this was done, there could be the discovery of more interesting response variables that could be modeled. Also, gathering more variables by Data Scraping or other methods could help to create more numeric variables for the purpose of estimation and prediction. If these additional columns could be found elsewhere, then it is possible that the models would be more effective. Another issue could be a lack of knowledge about advanced Machine Learning techniques. The class was cut short by two weeks, and several advanced topics were never covered. Unsupervised Learning seems to be a great tool for a dataset such as this where the categories seem so random and difficult to interpret logically. Using something like Neural Networks could possibly propel greater insight. However, a lack of understanding into these algorithms lead to only the use of Statistical models which have already been learned. In this case, Linear and Logistic regression proved to be difficult to use, and not entirely suited for the task at hand.

[Back to Table of Concents](#toc)

13. References

Below are all the additional references which were utilized throughout the project.

https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3
https://www.youtube.com/watch?v=gRvE94hO4TM
https://www.kdnuggets.com/2016/03/data-science-process.html
https://stackoverflow.com/questions/35077507/how-to-right-align-and-justify-align-in-markdown
https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed
https://medium.com/@sambozek/ipython-er-jupyter-table-of-contents-69bb72cf39d3
https://stackoverflow.com/questions/3939361/remove-specific-characters-from-a-string-in-python
https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas
http://www.sfindicatorproject.org/indicators/view/162
http://www.sfindicatorproject.org/img/indicators/pdf/MedianHouseholdIncome.pdf
https://stackoverflow.com/questions/52454807/set-pandas-column-values-based-on-index-range
https://stackoverflow.com/questions/35077507/how-to-right-align-and-justify-align-in-markdown
https://stackoverflow.com/questions/26266362/how-to-count-the-nan-values-in-a-column-in-pandas-dataframe
http://gradientdissent.com/blog/analyzing-2-months-of-real-crime-data-from-san-francisco-and-seattle.html#.W_3u1vZFxPY
https://stackoverflow.com/questions/19384532/how-to-count-number-of-rows-per-group-and-other-statistics-in-pandas-group-by
https://stackoverflow.com/questions/16327055/how-to-add-an-empty-column-to-a-dataframe
https://seaborn.pydata.org/generated/seaborn.heatmap.html
https://stackoverflow.com/questions/15923826/random-row-selection-in-pandas-dataframe
https://stackoverflow.com/questions/8372399/zip-with-list-output-instead-of-tuple
https://chrisalbon.com/python/data_wrangling/pandas_create_column_using_conditional/
https://stackoverflow.com/questions/30944577/check-if-string-is-in-a-pandas-dataframe
https://stackoverflow.com/questions/45414572/unable-to-use-scatter-plot-in-basemap
http://introtopython.org/visualization_earthquakes.html
https://stackoverflow.com/questions/17431441/matplotlib-scatter-plot-to-foreground-on-top-of-a-contour-plot
http://gradientdissent.com/blog/analyzing-2-months-of-real-crime-data-from-san-francisco-and-seattle.html#.W_4oPfZFxPb
https://stackoverflow.com/questions/3096953/how-to-calculate-the-time-interval-between-two-time-strings
https://stackoverflow.com/questions/37093454/typeerror-must-be-string-not-datetime-datetime-when-using-strptime
https://stackoverflow.com/questions/23747451/filtering-all-rows-with-nat-in-a-column-in-dataframe-python
https://stackoverflow.com/questions/32152233/getting-the-range-of-data-from-a-list-in-python
https://www.statisticssolutions.com/chi-square-goodness-of-fit-test/
https://www.medcalc.org/manual/chi-square-table.php
http://courses.wcupa.edu/rbove/Berenson/10th%20ed%20CD-ROM%20topics/section12_5.pdf
https://stackoverflow.com/questions/23307301/pandas-replacing-column-values-in-dataframe
https://stackoverflow.com/questions/40867977/pandas-drop-certain-values-from-a-column-with-a-string-title
https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-from-pandas-datetime-column-python
https://stackoverflow.com/questions/29688899/pandas-checking-if-a-date-is-a-holiday-and-assigning-boolean-value
https://stackoverflow.com/questions/50754471/seaborn-heatmap-not-displaying-all-xticks-and-yticks/50754993
https://www.dataquest.io/blog/web-scraping-tutorial-python/
https://stackoverflow.com/questions/16835449/python-beautifulsoup-extract-text-between-element
https://stackoverflow.com/questions/20522820/how-to-get-tbody-from-table-from-python-beautiful-soup
https://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/
https://stackoverflow.com/questions/20517650/how-to-delete-the-last-column-of-data-of-a-pandas-dataframe
https://stackoverflow.com/questions/20522820/how-to-get-tbody-from-table-from-python-beautiful-soup
https://stackoverflow.com/questions/16096754/remove-none-value-from-a-list-without-removing-the-0-value
https://developers.google.com/edu/python/lists
https://stackoverflow.com/questions/134934/display-number-with-leading-zeros
https://stackoverflow.com/questions/46789098/create-new-column-in-dataframe-with-match-values-from-other-dataframe
https://stackoverflow.com/questions/5807195/how-to-get-coordinates-of-address-from-python
https://stackoverflow.com/questions/19363881/replace-none-value-in-list
https://www.google.com/
https://www.wikipedia.org/
https://planspace.org/20150423-forward_selection_with_statsmodels/
https://stackoverflow.com/questions/36813396/how-to-show-the-title-for-the-diagram-of-seaborn-pairplot-or-pridgrid
https://stackoverflow.com/questions/19758364/rename-a-single-column-header-in-a-pandas-dataframe
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html
https://www.statisticssolutions.com/assumptions-of-logistic-regression/
https://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/
https://stackoverflow.com/questions/42406233/how-to-add-title-to-seaborn-boxplot
https://stackoverflow.com/questions/37508158/how-to-extract-a-particular-value-from-the-ols-summary-in-pandas/41212509
https://stackoverflow.com/questions/35417111/python-how-to-evaluate-the-residuals-in-statsmodels
https://underthecurve.github.io/jekyll/update/2016/07/01/one-regression-six-ways.html#Python
https://data.library.virginia.edu/diagnostic-plots/
https://en.wikipedia.org/wiki/Cross-validation_(statistics)



[Back to Table of Concents](#toc)