First, we need a graph. A graph is just a bunch of objects, typically called nodes which are connected to each other. The connections are called edges, and can have a direction, like Tom owes money to Andy, or can be non-directional, like a road connecting two cities.
A graph is a collection of nodes and edges. The BFS algo can tell us if two nodes are connected, and finds the shortest path b/w them.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from geopy.distance import great_circle
from collections import deque
To make the graph interesting, I am using the airlines and route information database from openflights.org.
So we have two datasets, one about airports, and one about routes b/w airports. For the purpose of the graph, the airports are the nodes, and the routes are the edges b/w them. The routes have a direction, and the cost or weight of the route is the distance.
cols = ["Airport ID", "Name", "City", "Country", "IATA", "ICAO", "Latitude", "Longitude", "Altitude",
"Timezone", "DST", "Tz database", "Type", "Source"]
airports = pd.read_csv("data/airports.dat", header=None, names=cols, index_col=0)
print(f"There are {airports.shape[0]} airports")
airports.head()
There are 7184 airports
Name | City | Country | IATA | ICAO | Latitude | Longitude | Altitude | Timezone | DST | Tz database | Type | Source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Airport ID | |||||||||||||
1 | Goroka Airport | Goroka | Papua New Guinea | GKA | AYGA | -6.081690 | 145.391998 | 5282 | 10 | U | Pacific/Port_Moresby | airport | OurAirports |
2 | Madang Airport | Madang | Papua New Guinea | MAG | AYMD | -5.207080 | 145.789001 | 20 | 10 | U | Pacific/Port_Moresby | airport | OurAirports |
3 | Mount Hagen Kagamuga Airport | Mount Hagen | Papua New Guinea | HGU | AYMH | -5.826790 | 144.296005 | 5388 | 10 | U | Pacific/Port_Moresby | airport | OurAirports |
4 | Nadzab Airport | Nadzab | Papua New Guinea | LAE | AYNZ | -6.569803 | 146.725977 | 239 | 10 | U | Pacific/Port_Moresby | airport | OurAirports |
5 | Port Moresby Jacksons International Airport | Port Moresby | Papua New Guinea | POM | AYPY | -9.443380 | 147.220001 | 146 | 10 | U | Pacific/Port_Moresby | airport | OurAirports |
Step one is is to get rid of all the info we don't need for our graph.
So for Airports, I am just keeping:
# dropping the columns we don't need
keep_cols = ["City", "Country", "Latitude", "Longitude"]
airports = airports[keep_cols]
print(f"There are {airports.shape[0]} airports")
airports.head()
There are 7184 airports
City | Country | Latitude | Longitude | |
---|---|---|---|---|
Airport ID | ||||
1 | Goroka | Papua New Guinea | -6.081690 | 145.391998 |
2 | Madang | Papua New Guinea | -5.207080 | 145.789001 |
3 | Mount Hagen | Papua New Guinea | -5.826790 | 144.296005 |
4 | Nadzab | Papua New Guinea | -6.569803 | 146.725977 |
5 | Port Moresby | Papua New Guinea | -9.443380 | 147.220001 |
# getting rid of null values
airports = airports.dropna(axis=0)
airports.shape
(7140, 4)
The second dataset is the routes flown by airlines.
As of January 2012, the OpenFlights/Airline Route Mapper Route Database contains 59036 routes between 3209 airports on 531 airlines spanning the globe, as shown in the map above.
cols = ["Airline", "Airline ID", "Source airport", "Source Airport ID", "Destination airport",
"Dest Airport ID", "Codeshare", "Stops", "Equipment"]
routes = pd.read_csv("data/routes.dat", header=None, names=cols)
print(f"There are {routes.shape[0]} routes")
routes.head()
There are 67663 routes
Airline | Airline ID | Source airport | Source Airport ID | Destination airport | Dest Airport ID | Codeshare | Stops | Equipment | |
---|---|---|---|---|---|---|---|---|---|
0 | 2B | 410 | AER | 2965 | KZN | 2990 | NaN | 0 | CR2 |
1 | 2B | 410 | ASF | 2966 | KZN | 2990 | NaN | 0 | CR2 |
2 | 2B | 410 | ASF | 2966 | MRV | 2962 | NaN | 0 | CR2 |
3 | 2B | 410 | CEK | 2968 | KZN | 2990 | NaN | 0 | CR2 |
4 | 2B | 410 | CEK | 2968 | OVB | 4078 | NaN | 0 | CR2 |
We just need:
# first drop all rows where stops aren't 0, as we only want direct connections
routes = routes[routes["Stops"] == 0]
keep_cols = ["Source Airport ID", "Dest Airport ID"]
routes = routes[keep_cols]
print(f"There are {routes.shape[0]} routes")
routes.head(10)
There are 67652 routes
Source Airport ID | Dest Airport ID | |
---|---|---|
0 | 2965 | 2990 |
1 | 2966 | 2990 |
2 | 2966 | 2962 |
3 | 2968 | 2990 |
4 | 2968 | 4078 |
5 | 4029 | 2990 |
6 | 4029 | 6969 |
7 | 4029 | \N |
8 | 4029 | 6160 |
9 | 6156 | 2952 |
Now there are some values which aren't numbers, so we need to clean them up, as you can see in row 7 above.
def make_int_or_null(x):
"""returns int or np.nan if can't return int"""
try:
return int(x)
except:
return np.nan
routes = routes.applymap(make_int_or_null)
print(f"There are {routes.shape[0]} routes before dropping null values")
routes = routes.dropna(axis=0)
print(f"there are {routes.shape[0]} after dropping null values")
There are 67652 routes before dropping null values there are 67229 after dropping null values
routes.head()
Source Airport ID | Dest Airport ID | |
---|---|---|
0 | 2965.0 | 2990.0 |
1 | 2966.0 | 2990.0 |
2 | 2966.0 | 2962.0 |
3 | 2968.0 | 2990.0 |
4 | 2968.0 | 4078.0 |
Now we need to be able to get the distance b/w two airports to be able to assign a weight in our graphs.
def route_distance(edge):
"""takes a route as a pandas row, and returns distance b/w them"""
src_airport = airports.iloc[int(edge["Source Airport ID"])]
dest_airport = airports.iloc[int(edge["Dest Airport ID"])]
src_lat = src_airport["Latitude"]
src_long = src_airport["Longitude"]
dest_lat = dest_airport["Latitude"]
dest_long = dest_airport["Longitude"]
src_loc = (float(src_lat), float(src_long))
dest_loc = (float(dest_lat), float(dest_long))
return great_circle(src_loc, dest_loc).km
#checking algo
print(route_distance(routes.iloc[100]))
204.03946015838426
so the data seems ready to go. We now have two pandas dataframes, airports and routes, and a function which returns the distance b/w airports.
there are many ways to make a graph. One way is to build out all the nodes and connections. So here I'll initiate the graph as a dictionary of nodes, with the data being the edges.
A blank graph where the key is each airport:
graph = {}
for i, row in airports.iterrows():
graph[i] = []
Now going through each route and adding (dest_airport, distance) to each src_airport in the graph:
from tqdm import tqdm
for i, row in tqdm(routes.iterrows(), total=len(routes), mininterval=0.5):
try:
src_airport = airports.iloc[int(row["Source Airport ID"])]
dest_airport = airports.iloc[int(row["Dest Airport ID"])]
except:
pass
dist = great_circle((src_airport["Latitude"], src_airport["Longitude"]),
(dest_airport["Latitude"],dest_airport["Longitude"]) ).km
n = int(row["Source Airport ID"])
d = int(row["Dest Airport ID"])
if n in graph.keys():
graph[n].append((d, int(dist)))
100%|██████████| 67229/67229 [01:22<00:00, 813.66it/s]
Now say I want to find out the graph for Karachi. Karachi has three airports, so only looking at the first one for now...
find = airports["City"] == "Karachi"
airports[find]
City | Country | Latitude | Longitude | |
---|---|---|---|---|
Airport ID | ||||
2206 | Karachi | Pakistan | 24.906500 | 67.160797 |
2213 | Karachi | Pakistan | 24.893600 | 66.938797 |
11911 | Karachi | Pakistan | 24.874201 | 67.118500 |
So I can see all the cities it's connected to, though since some cities have multiple airports they show up twice. So how do I deal with that? I can ignore multiple airports as the distance is the same, though a more complex graph would take into account the different $$ value.
for i, t in enumerate(graph[2206]):
if t[0] in graph.keys():
print(i, airports.loc[t[0]]["City"], t[1])
else:
continue
0 Abu Dhabi 1559 1 Chengdu 6496 2 Bangkok 3409 3 Dubai 1507 4 Abu Dhabi 1559 5 Dubai 1507 6 Sharjah 1415 7 Bahrain 7065 8 Bahawalpur 10912 9 Islamabad 626 10 Lahore 335 11 Faisalabad 1588 12 Multan 89 13 Peshawar 656 14 Quetta 621 15 Tehran 11793 16 Colombo 3970 17 Dubai 1507 18 Islamabad 626 19 Lahore 335 20 Multan 89 21 Peshawar 656 22 Quetta 621 23 Abu Dhabi 1559 24 Bahawalpur 10912 25 Mumbai 4096 26 Dhaka 5079 27 Dera Ghazi Khan 18538 28 Delhi 4405 29 Dammam 6480 30 Dubai 1507 31 Gwadar 37 32 Islamabad 626 33 Jeddah 7265 34 Kathmandu 6126 35 Kuala Lumpur 9354 36 Lahore 335 37 London 9212 38 Faisalabad 1588 39 Muscat 1514 40 Madinah 7420 41 Moenjodaro 61 42 Multan 89 43 Peshawar 656 44 Panjgur 577 45 Riyadh 6449 46 Rahim Yar Khan 679 47 Sharjah 1415 48 Sukkur 246 49 Turbat 8236 50 Quetta 621 51 Toronto 10578 53 Dammam 6480 54 Jeddah 7265 55 Riyadh 6449 56 Bangkok 3409 57 Muscat 1514 58 Istanbul 13782 59 Colombo 3970 60 Muscat 1514 61 Jeddah 7265
airports.shape
(7140, 4)