Analyzing Venues in the Suburbs of Johannesburg with Machine Learning

Introduction

This is an IBM Applied Data Science Project - The Battle of Neighborhoods

Johannesburg, informally known as Jozi, Joburg, or "The City of Gold", is the largest city in South Africa and one of the 50 largest urban areas in the world. It is the capital of Gauteng, the wealthiest province in South Africa, and the seat of the Constitutional Court, the country's highest court. The city lies in the mineral-rich Witwatersrand range of hills and is the centre of a large-scale gold and diamond trade. It was one of the host cities of the 2010 FIFA World Cup.

In this notebook, I analyze different kinds of venues with k-means clustering to uncover patterns in the most visited venues in each suburb of the City of Johannesburg municipality.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# import k-means from clustering stage
from sklearn.cluster import KMeans


#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')
Libraries imported.

1. Exploring the Data Set

In [ ]:
#import wget
#import os

#if os.path.exists('metropolitan_suburbs_region.geojson'):
#    os.remove('metropolitan_suburbs_region.geojson')

#wget.download('https://adi45.carto.com/tables/metropolitan_suburbs_region/public/map#/metropolitan_suburbs_region.geojson')

#print('\nData downloaded!')
In [ ]:
#with open('map') as json_data:
#    contents_jhb = json.load(json_data)
In [2]:
json_file_path = "C:/Users/vnbri/Documents/metropolitan_suburbs_region.geojson"

with open(json_file_path, 'r') as j:
    contents_jhb = json.load(j)
In [ ]:
#contents_jhb

The relevant data is under the features key, which is essentially a list of suburbs (also called neighborhoods). It therefore suffices to define a new variable that holds this list.

In [3]:
neighborhood_data = contents_jhb['features']

The first item of this list is:

In [4]:
neighborhood_data[0]
Out[4]:
{'type': 'Feature',
 'geometry': {'type': 'MultiPolygon',
  'coordinates': [[[[28.073783, -26.343133],
     [28.071239, -26.351536],
     [28.068717, -26.350644],
     [28.06663, -26.351362],
     [28.065161, -26.352135],
     [28.064671, -26.35399],
     [28.064877, -26.355691],
     [28.062172, -26.357752],
     [28.062041, -26.358277],
     [28.060888, -26.359223],
     [28.059649, -26.358614],
     [28.057891, -26.35947],
     [28.057638, -26.359593],
     [28.055997, -26.358723],
     [28.055519, -26.358016],
     [28.054823, -26.35756],
     [28.053975, -26.357049],
     [28.053888, -26.357053],
     [28.053743, -26.357021],
     [28.05365, -26.357011],
     [28.053533, -26.356974],
     [28.053417, -26.356923],
     [28.053267, -26.356899],
     [28.053174, -26.35689],
     [28.053071, -26.356797],
     [28.052969, -26.35668],
     [28.052884, -26.35654],
     [28.052796, -26.356395],
     [28.052777, -26.356344],
     [28.052703, -26.356297],
     [28.052661, -26.356218],
     [28.052656, -26.356134],
     [28.052595, -26.356027],
     [28.052488, -26.355933],
     [28.052413, -26.355863],
     [28.052325, -26.355821],
     [28.052259, -26.355765],
     [28.052175, -26.3557],
     [28.052105, -26.355681],
     [28.051984, -26.355639],
     [28.051895, -26.355583],
     [28.051853, -26.355504],
     [28.051807, -26.355411],
     [28.05176, -26.35528],
     [28.051736, -26.355177],
     [28.051694, -26.355056],
     [28.05167, -26.354893],
     [28.051656, -26.354786],
     [28.051656, -26.354721],
     [28.051633, -26.354651],
     [28.051586, -26.354586],
     [28.05146, -26.354474],
     [28.051339, -26.354381],
     [28.051195, -26.354302],
     [28.051158, -26.354274],
     [28.051064, -26.354194],
     [28.050925, -26.354197],
     [28.05085, -26.354218],
     [28.050775, -26.354241],
     [28.050687, -26.354316],
     [28.050515, -26.354316],
     [28.05081, -26.35153],
     [28.051068, -26.349079],
     [28.06033, -26.344981],
     [28.06707, -26.341999],
     [28.067085, -26.341972],
     [28.070609, -26.342068],
     [28.073783, -26.343133]]]]},
 'properties': {'cartodb_id': 1,
  'subplace_c': 761001001,
  'province': 'Gauteng',
  'wardid': '74202012',
  'district_m': 'Sedibeng',
  'local_muni': 'Midvaal',
  'main_place': 'Alberton',
  'mp_class': 'Settlement',
  'sp_name': 'Brenkondown',
  'suburb_nam': 'Brenkondown',
  'metro': 'Johannesburg',
  'african': 330,
  'white': 24,
  'asian': 0,
  'coloured': 2,
  'other': 0,
  'totalpop': 356}}
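
Before flattening the data, it can help to check that every feature has the same layout as the one shown above. The following is a minimal sketch, not part of the original notebook: summarize_features is a hypothetical helper, and sample is a made-up two-feature stand-in for contents_jhb that only mimics its structure.

```python
from collections import Counter


def summarize_features(feature_collection):
    """Return the feature count, geometry-type counts, and all property keys."""
    features = feature_collection["features"]
    geometry_types = Counter(f["geometry"]["type"] for f in features)
    property_keys = sorted({key for f in features for key in f["properties"]})
    return len(features), dict(geometry_types), property_keys


# Made-up stand-in for contents_jhb (illustrative values only)
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "MultiPolygon",
                "coordinates": [[[[28.0, -26.0], [28.1, -26.0], [28.1, -26.1], [28.0, -26.0]]]],
            },
            "properties": {"suburb_nam": "Example A", "local_muni": "Example Muni"},
        },
        {
            "type": "Feature",
            "geometry": {
                "type": "MultiPolygon",
                "coordinates": [[[[28.2, -26.2], [28.3, -26.2], [28.3, -26.3], [28.2, -26.2]]]],
            },
            "properties": {"suburb_nam": "Example B", "local_muni": "Example Muni"},
        },
    ],
}

count, geometry_types, property_keys = summarize_features(sample)
print(count, geometry_types, property_keys)
```

In the notebook, the same call would be summarize_features(contents_jhb). A geometry type other than MultiPolygon would mean that the fixed [0][0][0] indexing used later does not apply to that feature.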

1.1 Transforming the Data into a Pandas Data Frame

The next task is essentially transforming this data of nested Python dictionaries into a pandas dataframe. The first step is to create an empty pandas data frame.

In [5]:
# define the dataframe columns
column_names = ['Province', 'District', 'Local_municipality','Main Place', 'Suburb','Metro','Latitude','Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Then we have to examine the empty dataframe to make sure that the columns are as intended.

In [6]:
neighborhoods
Out[6]:
Province District Local_municipality Main Place Suburb Metro Latitude Longitude
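
The loop in the next cell grows the dataframe one row at a time with DataFrame.append, which was deprecated in pandas 1.4 and removed in pandas 2.0. A possible alternative, sketched here under assumptions, is to collect one dictionary per feature and build the dataframe once. The names feature_to_row, COLUMN_NAMES, and sample_feature are hypothetical, and sample_feature is the first feature from this notebook with its polygon shortened to three positions.

```python
COLUMN_NAMES = ['Province', 'District', 'Local_municipality', 'Main Place',
                'Suburb', 'Metro', 'Latitude', 'Longitude']


def feature_to_row(feature):
    """Flatten one GeoJSON feature into a dict keyed by COLUMN_NAMES."""
    props = feature['properties']
    # GeoJSON positions are [longitude, latitude]; take the first vertex
    lon, lat = feature['geometry']['coordinates'][0][0][0]
    return {
        'Province': props['province'],
        'District': props['district_m'],
        'Local_municipality': props['local_muni'],
        'Main Place': props['main_place'],
        'Suburb': props['suburb_nam'],
        'Metro': props['metro'],
        'Latitude': lat,
        'Longitude': lon,
    }


# First feature of the data set, with the polygon shortened for illustration
sample_feature = {
    'type': 'Feature',
    'geometry': {'type': 'MultiPolygon',
                 'coordinates': [[[[28.073783, -26.343133],
                                   [28.071239, -26.351536],
                                   [28.073783, -26.343133]]]]},
    'properties': {'province': 'Gauteng', 'district_m': 'Sedibeng',
                   'local_muni': 'Midvaal', 'main_place': 'Alberton',
                   'suburb_nam': 'Brenkondown', 'metro': 'Johannesburg'},
}

rows = [feature_to_row(f) for f in [sample_feature]]
print(rows[0])
```

In the notebook, rows would be built from neighborhood_data and passed to pd.DataFrame(rows, columns=column_names). Because the keys match the declared column names exactly, this variant would fill the Main Place column directly, so its result would differ from the output shown in this notebook.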
In [7]:
for data in neighborhood_data:
    province = data['properties']['province']
    district = data['properties']['district_m']
    local_muni_name = data['properties']['local_muni']
    main_place = data['properties']['main_place']
    suburb_name = data['properties']['suburb_nam']
    metro = data['properties']['metro']
    
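    # GeoJSON positions are [longitude, latitude]; the first vertex of the
    # first polygon's outer ring stands in for the suburb's location
    suburb_latlon = data['geometry']['coordinates']
    suburb_lat = suburb_latlon[0][0][0][1]
    suburb_lon = suburb_latlon[0][0][0][0]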
    neighborhoods = neighborhoods.append({'Province': province,
                                          'District': district,
                                          'Local_municipality': local_muni_name,
                                          'Main place': main_place,
                                          'Suburb': suburb_name,
                                          'Metro': metro,
                                          'Latitude': suburb_lat,
                                          'Longitude': suburb_lon}, ignore_index=True)
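
The loop above uses the first vertex of each polygon as the suburb's location, so every marker sits on a boundary corner rather than inside the suburb. One simple alternative, sketched here and not used in this notebook, is to average the vertices of the outer ring. This is a plain vertex average, not a true area-weighted centroid, and ring_vertex_mean and square are hypothetical names with made-up coordinates.

```python
def ring_vertex_mean(multipolygon_coords):
    """Return (latitude, longitude) averaged over the first polygon's outer ring."""
    ring = multipolygon_coords[0][0]
    # GeoJSON rings repeat the first position at the end; drop the duplicate
    if len(ring) > 1 and ring[0] == ring[-1]:
        ring = ring[:-1]
    mean_lon = sum(position[0] for position in ring) / len(ring)
    mean_lat = sum(position[1] for position in ring) / len(ring)
    return mean_lat, mean_lon


# Made-up square suburb, positions given as [longitude, latitude]
square = [[[[28.0, -26.0], [28.2, -26.0], [28.2, -26.2], [28.0, -26.2], [28.0, -26.0]]]]
lat, lon = ring_vertex_mean(square)
print(lat, lon)
```

In the loop, this would replace the two [0][0][0] lookups with suburb_lat, suburb_lon = ring_vertex_mean(data['geometry']['coordinates']).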
In [8]:
neighborhoods.head()
Out[8]:
   Province  District    Local_municipality  Main Place  Suburb           Metro         Latitude    Longitude  Main place
0  Gauteng   Sedibeng    Midvaal             NaN         Brenkondown      Johannesburg  -26.343133  28.073783  Alberton
1  Gauteng   Sedibeng    Lesedi              NaN         Masetjhaba View  Johannesburg  -26.388533  28.384250  Duduza
2  Gauteng   Sedibeng    Lesedi              NaN         Sonstraal AH     Johannesburg  -26.406613  28.361255  Sonstraal
3  Gauteng   West Rand   Mogale City         NaN         Ruimsig Noord    Johannesburg  -26.075359  27.865240  Krugersdorp
4  Gauteng   Ekurhuleni  Ekurhuleni          NaN         Germiston Ext 3  Johannesburg  -26.214897  28.181906  Germiston

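The Main Place column contains only missing values because the loop stored its values under the key 'Main place' (lowercase p), which pandas added as a separate column at the end. We therefore drop the empty Main Place column.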

In [9]:
neighborhoods = neighborhoods.drop(columns='Main Place')
neighborhoods.head()
Out[9]:
   Province  District    Local_municipality  Suburb           Metro         Latitude    Longitude  Main place
0  Gauteng   Sedibeng    Midvaal             Brenkondown      Johannesburg  -26.343133  28.073783  Alberton
1  Gauteng   Sedibeng    Lesedi              Masetjhaba View  Johannesburg  -26.388533  28.384250  Duduza
2  Gauteng   Sedibeng    Lesedi              Sonstraal AH     Johannesburg  -26.406613  28.361255  Sonstraal
3  Gauteng   West Rand   Mogale City         Ruimsig Noord    Johannesburg  -26.075359  27.865240  Krugersdorp
4  Gauteng   Ekurhuleni  Ekurhuleni          Germiston Ext 3  Johannesburg  -26.214897  28.181906  Germiston

Every row now has the required information. We can also count the distinct local municipalities and the suburbs:

In [10]:
print('The dataframe has {} local municipalities and {} suburbs.'.format(
        len(neighborhoods['Local_municipality'].unique()),
        neighborhoods.shape[0]
    )
)
The dataframe has 238 local municipalities and 3598 suburbs.

The geopy library is used to get the latitude and longitude values of Johannesburg.

In [11]:
address = 'Johannesburg'

geolocator = Nominatim(user_agent="jhb_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Johannesburg are {}, {}.'.format(latitude, longitude))
The geographical coordinates of Johannesburg are -26.205, 28.049722.
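
The geocode call returns None when the service finds no match, in which case location.latitude raises an AttributeError. The following is a minimal sketch of a guard, assuming a fallback coordinate is acceptable. geocode_or_default, FakeLocation, and fake_geocode are hypothetical names; the fake geocoder only stands in for geolocator.geocode so that the example runs without network access.

```python
from collections import namedtuple


def geocode_or_default(geocode, address, default):
    """Return (latitude, longitude) for address, or default if nothing is found."""
    location = geocode(address)
    if location is None:
        return default
    return location.latitude, location.longitude


# Stand-in geocoder for illustration; in the notebook, pass geolocator.geocode
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])


def fake_geocode(address):
    known = {'Johannesburg': FakeLocation(-26.205, 28.049722)}
    return known.get(address)


found = geocode_or_default(fake_geocode, 'Johannesburg', (0.0, 0.0))
missing = geocode_or_default(fake_geocode, 'Nowhere', (0.0, 0.0))
print(found, missing)
```

With the real geocoder the call would be latitude, longitude = geocode_or_default(geolocator.geocode, address, default), where the choice of default is left to the analyst.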

Folium is used to create a map of Johannesburg with suburbs superimposed on top.

In [12]:
# create map of Johannesburg using latitude and longitude values
map_johannesburg = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, local_municipality, suburb in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Local_municipality'], neighborhoods['Suburb']):
    label = '{}, {}'.format(suburb, local_municipality)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_johannesburg)  
    
map_johannesburg
Out[12]:
[Interactive Folium map of Johannesburg with one circle marker per suburb]
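
A fixed zoom_start=10 centred on the city may leave some of the markers outside the initial view. As a hedged sketch, the map could instead be fitted to the extent of the markers. marker_bounds is a hypothetical helper, and the coordinates below are the five suburbs shown by neighborhoods.head().

```python
def marker_bounds(latitudes, longitudes):
    """Return [[south, west], [north, east]] enclosing all points."""
    lats = list(latitudes)
    lons = list(longitudes)
    return [[min(lats), min(lons)], [max(lats), max(lons)]]


# Coordinates of the first five suburbs in the dataframe
lats = [-26.343133, -26.388533, -26.406613, -26.075359, -26.214897]
lons = [28.073783, 28.384250, 28.361255, 27.865240, 28.181906]

bounds = marker_bounds(lats, lons)
print(bounds)
```

In the notebook, the bounds would come from neighborhoods['Latitude'] and neighborhoods['Longitude'], and map_johannesburg.fit_bounds(bounds) would apply them before the map is displayed.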