In the previous lessons, we learnt several ways of analyzing data visually, such as bar plots, scatter plots, and categorical plots. When a dataset contains geospatial information such as zip codes, states, or geographical coordinates, we can explore it further by overlaying the data on maps. In this lesson, you will learn to perform simple visual analysis of spatial data with the pandas and folium libraries, using on-time performance data for domestic flights. The data for this exercise was downloaded from Kaggle. We will explore it visually to see whether it contains a geospatial pattern.
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations.
Folium is a Python library that makes it easier to create maps using Leaflet, an open-source JavaScript library for interactive maps. Folium maps can be used for a range of purposes, from simple visualization to interactive dashboard applications.
To install folium with pip:
pip install folium
or with conda:
conda install -c conda-forge folium
import os
import pandas as pd
import folium
DATA_DIR = "/home/asimbanskota/t81_577_data_science/weekly_materials/week7/files"
airlines = os.path.join(DATA_DIR, 'airlines.csv')
airports = os.path.join(DATA_DIR, 'airports.csv')
flights = os.path.join(DATA_DIR, 'flights.csv')
df_air = pd.read_csv(airlines)
df_ap = pd.read_csv(airports)
df = pd.read_csv(flights)
DtypeWarning: Columns (7,8) have mixed types. Specify dtype option on import or set low_memory=False.
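The warning above means pandas found more than one value type within the same column. Below is a minimal sketch of one way to avoid it, using a small in-memory CSV rather than the real flights file. The sample rows and the choice to read the airport columns as strings are assumptions made for illustration.

```python
import io
import pandas as pd

# Toy stand-in for flights.csv: airport codes appear both as letters and as digits.
csv_text = """ORIGIN_AIRPORT,DESTINATION_AIRPORT,DEPARTURE_DELAY
ATL,SEA,5
10397,14747,-2
LAX,JFK,12
"""

# Declaring the dtype up front keeps every code as a string,
# so pandas does not have to infer the column type on its own.
toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"ORIGIN_AIRPORT": str, "DESTINATION_AIRPORT": str},
)
print(toy["ORIGIN_AIRPORT"].tolist())
```

The same `dtype` argument could be passed when reading the full file, if treating those columns as text suits the analysis.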
Let's create a feature that holds the total number of flights departing from each airport.
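# Count flights per origin airport and name both columns explicitly.
# Recent pandas versions name the value_counts() output differently, so renaming 'index' is not reliable.
count_flights = df['ORIGIN_AIRPORT'].value_counts().rename_axis('IATA_CODE').reset_index(name='count_flights')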
df_ap = df_ap.merge(count_flights, on = 'IATA_CODE')
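To make the count-and-merge step concrete, here is a small sketch on made-up data. The toy tables and the airport code "XYZ" are hypothetical. It also shows that `merge` performs an inner join by default, so airports without any flights drop out of the result.

```python
import pandas as pd

# Toy flight records: one row per flight, keyed by origin airport code.
toy_flights = pd.DataFrame({"ORIGIN_AIRPORT": ["ATL", "ATL", "SEA", "ATL", "SEA"]})
# Toy airport table; "XYZ" is a made-up airport with no flights.
toy_airports = pd.DataFrame({"IATA_CODE": ["ATL", "SEA", "XYZ"]})

# Count flights per airport and name both columns explicitly.
counts = (
    toy_flights["ORIGIN_AIRPORT"]
    .value_counts()
    .rename_axis("IATA_CODE")
    .reset_index(name="count_flights")
)

# merge() defaults to an inner join, so "XYZ" is not in the merged table.
merged = toy_airports.merge(counts, on="IATA_CODE")
print(merged)
```

If airports without flights should be kept, passing `how='left'` to `merge` would retain them with a missing count.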
Maps are defined as a folium.Map object. We start by creating a base map, providing the latitude and longitude of its center. This instantiates a map object for the given location (45 degrees north, 96 degrees west). Once the base map is created, other map objects can be added incrementally on top of the folium.Map.
m = folium.Map(location=[45, -96], zoom_start=4)
We can display the map object within the notebook simply by referring to its name, m.
m
Since we have geographical latitude and longitude associated with the airports, we can simply plot the locations on top of the map object as follows.
Before that, we need to remove NaN values from the location columns. We also filter out airports with relatively few flights to avoid visual clutter.
df_ap = df_ap.dropna(subset = ['LATITUDE','LONGITUDE'])
df_ap = df_ap[df_ap['count_flights'] > 5000]
# Draw one circle per airport; clicking a circle shows its flight count.
for lat, lon, ct in zip(df_ap['LATITUDE'], df_ap['LONGITUDE'], df_ap['count_flights']):
    folium.CircleMarker(
        [lat, lon],
        popup='Count_flights: ' + str(ct),
    ).add_to(m)
m
The above map is neither pretty nor intuitive. Maybe we can color-code the circles, with each color representing a range of flights departed from the airport. Let's first use the pandas qcut function to bin the flight counts into four quantile-based buckets, each holding roughly the same number of airports.
df_ap['count_cl'] = pd.qcut(df_ap['count_flights'], 4, labels=False)
colordict = {0: 'lightblue', 1: 'lightgreen', 2: 'orange', 3: 'red'}
# Fill each circle with the color assigned to its bin.
for lat, lon, count_cl, count_flights in zip(df_ap['LATITUDE'], df_ap['LONGITUDE'], df_ap['count_cl'], df_ap['count_flights']):
    folium.CircleMarker(
        [lat, lon],
        popup='Count_cl: ' + str(count_cl),
        color='blue',  # outline color; Leaflet expects a CSS color name or hex code
        fill_color=colordict[count_cl],
        fill=True,
        fill_opacity=0.7,
    ).add_to(m)
m
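The difference between quantile binning (`qcut`, used above) and equal-width binning (`cut`) is easy to see on a small example. The sketch below uses made-up flight counts with one very busy airport; the numbers are purely illustrative.

```python
import pandas as pd

# Made-up flight counts with one very busy airport.
counts = pd.Series([6000, 7000, 8000, 9000, 10000, 12000, 15000, 300000])

# qcut: bin edges are quantiles, so each bin gets about the same number of airports.
by_quantile = pd.qcut(counts, 4, labels=False)
# cut: bin edges are evenly spaced over the value range, so the outlier dominates.
by_width = pd.cut(counts, 4, labels=False)

print(by_quantile.value_counts().sort_index().tolist())
print(by_width.value_counts().sort_index().tolist())
```

On this toy series the quantile bins each hold two airports, while the equal-width bins place seven of the eight airports in the lowest bin. With skewed data such as flight counts, quantile bins tend to give a more balanced color map.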
The map above shows the values of interest as categories. You can go one step further and size each circle relative to the number of flights that departed from the respective airport.
latitude = 45
longitude = -96
m = folium.Map(location=[latitude, longitude], zoom_start=4)
# Scale the radius (in pixels) with the number of departing flights.
for lat, lon, count_cl, count_flights in zip(df_ap['LATITUDE'], df_ap['LONGITUDE'], df_ap['count_cl'], df_ap['count_flights']):
    folium.CircleMarker(
        [lat, lon],
        radius=count_flights / 25000,
        popup='Count_cl: ' + str(count_cl),
        color='blue',  # outline color; Leaflet expects a CSS color name or hex code
        fill_color='crimson',
        fill=True,
        fill_opacity=0.7,
    ).add_to(m)
m
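Dividing the count by a constant makes the radius grow linearly with the number of flights, so the circle area grows with its square and busy airports can look more dominant than they are. One possible alternative, sketched below, is square-root scaling. The helper `scaled_radius` and its default limits are hypothetical and not part of folium.

```python
import math

def scaled_radius(count, max_count, max_radius=25.0, min_radius=2.0):
    """Hypothetical helper: radius whose circle area is proportional to count,
    clamped so that small airports stay visible."""
    # Area is proportional to radius squared, so take the square root of the share.
    share = count / max_count
    return max(min_radius, max_radius * math.sqrt(share))

# Made-up counts for illustration.
for count in (5000, 100000, 400000):
    print(count, round(scaled_radius(count, max_count=400000), 1))
```

The returned value could be passed as the `radius` argument of `folium.CircleMarker`, with `max_count` set to the largest value in `df_ap['count_flights']`.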
In a similar vein, you can analyze other appropriate features in the data as well.
A choropleth map is a thematic map that uses differences in shading and coloring to indicate the corresponding values of interest.
To create a choropleth map, we need a dataset that defines the boundaries of the geographical units of interest, such as US states or zip codes. Such geographical files come in various formats. One of them is GeoJSON, which is simply JSON-formatted data with additional geographical information. The GeoJSON file for the US states can be obtained from the URL below.
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
state_geo = f'{url}/us-states.json'
df_geo = pd.read_json(state_geo)
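To get a feel for how a GeoJSON file is organized, the sketch below parses a small hand-written example. The single rectangular "state", its id "XX", and its name are made up for illustration and are not taken from the file above.

```python
import json

# Hand-written GeoJSON with one made-up rectangular "state".
# The coordinates are not real boundaries.
toy_geojson = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "id": "XX",
      "properties": {"name": "Example State"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-100.0, 40.0], [-95.0, 40.0], [-95.0, 45.0],
                         [-100.0, 45.0], [-100.0, 40.0]]]
      }
    }
  ]
}
"""

collection = json.loads(toy_geojson)
# Each feature carries an identifier, free-form properties, and a geometry.
for feature in collection["features"]:
    print(feature["id"], feature["properties"]["name"], feature["geometry"]["type"])
```

Note that GeoJSON stores coordinates as [longitude, latitude], the reverse of the [latitude, longitude] order that folium expects for map and marker locations. A feature's `id` is the kind of field a data table can be matched against when drawing a choropleth.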
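# Aggregate flight counts by state. NOTE: dfs is not defined earlier in the lesson;
# this assumes it holds one row per state, built from the STATE column of the airports table
# (df_ap at this point only contains the airports kept by the earlier filter).
dfs = df_ap.groupby('STATE', as_index=False)['count_flights'].sum()
# To create a base map, simply pass your starting coordinates to folium:
m = folium.Map(location=[48, -102], zoom_start=3)
# key_on points at the GeoJSON field that matches the first entry in columns.
folium.Choropleth(
    geo_data=state_geo,
    name='choropleth',
    data=dfs,
    columns=['STATE', 'count_flights'],
    key_on='feature.id',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Flight_counts'
).add_to(m)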
folium.LayerControl().add_to(m)
m