#!/usr/bin/env python
# coding: utf-8

# ![title](https://www.nationsonline.org/gallery/USA/Golden-Gate-Bridge-San-Francisco.jpg)

# # An intelligent location study using machine learning algorithms to select locations for an Italian restaurant in the city of San Francisco
#
# Roque Leal
# The Italian restaurants of San Francisco are part of the culture of the city, the customs of its inhabitants and its tourist circuit. They have been the subject of study by different writers, the inspiration for countless artistic creations and traditional meeting places.
#
# In this project, the idea is to find an optimal location for a new Italian restaurant using machine learning algorithms, following "The Battle of Neighborhoods: Coursera Capstone Project" course (1).
#
# Starting from the association of Italian restaurants with restaurants in general, we will first try to detect locations based on the factors that will influence our decision:
#
# **1- Places that are not yet full of restaurants.**
#
# **2- Areas with little or no cafe nearby.**
#
# **3- Near the center, if possible, assuming the first two conditions are met.**
#
# With these simple parameters we will program an algorithm to discover what solutions can be obtained.

# ### Data Source
#
# The following data sources will be needed to extract and generate the required information:
#
# 1- The centers of the candidate areas will be generated automatically by the algorithm, and the approximate addresses of these centers will be obtained using one of the Geopy Geocoders packages. (2)
#
# 2- The number of restaurants, their type and their location in each neighborhood will be obtained using the Foursquare API. (3)
#
# The data will be used in the following scenarios:
#
# **1- To discover the density of all restaurants and cafes from the extracted data.**
#
# **2- To identify areas that are not very dense and not very competitive.**
#
# **3- To calculate the distances between competing restaurants.**

# ### Locate the candidates
#
# The target area will be the center of the city, where tourist attractions are more numerous than in other places.
# From this we will create a grid of cells covering the area of interest, about 12x12 kilometers centered on the center of the city of San Francisco.

# In[140]:

import requests
from geopy.geocoders import Nominatim

address = '199 Gough St, San Francisco, CA 94102, USA'
geolocator = Nominatim(user_agent="usa_explorer")
location = geolocator.geocode(address)
lat = location.latitude
lng = location.longitude
sf_center = [lat, lng]
print('Coordinate of {}: {}'.format(address, sf_center), ' location : ', location)

# We create a grid of equidistant candidate areas, centered on the city center and extending 6 km around this point. For this we work in a 2D Cartesian coordinate system, which lets us calculate distances in meters.
#
# Next, we project these coordinates back to latitude / longitude degrees to display them on maps with Mapbox and Folium (3).

# In[141]:

#!pip install shapely
import shapely.geometry
#!pip install pyproj
import pyproj
import math

def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=10, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=10, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate Verification')
print('-------------------------------')
print('San Francisco Center Union Square longitude={}, latitude={}'.format(sf_center[1], sf_center[0]))
x, y = lonlat_to_xy(sf_center[1], sf_center[0])
print('San Francisco Center Union Square UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('San Francisco Center Union Square longitude={}, latitude={}'.format(lo, la))

# We create a hexagonal grid of cells: **we offset every other row and adjust the vertical spacing so that each cell center is equidistant from all of its neighbors.**

# In[142]:

sf_center_x, sf_center_y = lonlat_to_xy(sf_center[1], sf_center[0])  # City center in Cartesian coordinates
k = math.sqrt(3) / 2  # Vertical spacing factor for hexagonal grid cells
x_min = sf_center_x - 6000
x_step = 600
y_min = sf_center_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k

latitude = []
longitude = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i % 2 == 0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(sf_center_x, sf_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitude.append(lat)
            longitude.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitude), 'Union Square San Francisco grid - SF')

# Let's look at the data we have so far: the city center location and the candidate neighborhood centers:

# In[143]:

import folium

# In[144]:

tileset = r'https://api.mapbox.com'
attribution = r'Map data © OpenStreetMap contributors, Imagery © MapBox'
map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
folium.Marker(sf_center, popup='San Francisco').add_to(map_sf)
for lat, lon in zip(latitude, longitude):
    folium.Circle([lat, lon], radius=300, color='purple', fill=False).add_to(map_sf)
map_sf

# At this point we have the coordinates of the candidate areas to be evaluated, equally spaced (the distance between each point and its neighbors is exactly the same) and within roughly 6 km of downtown San Francisco.
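The offset-row construction above can be checked with a small stdlib-only sketch (a hypothetical standalone helper, not part of the notebook's code): with a horizontal step d, shifting every other row by d/2 and spacing rows by d·√3/2 makes every cell center exactly d away from each of its six neighbors.

```python
import math

def hex_grid(x_min, y_min, cols, rows, step=600.0):
    """Centers of a hexagonal grid: even rows shifted by half a step,
    rows spaced step * sqrt(3)/2 apart (same scheme as the grid above)."""
    k = math.sqrt(3) / 2
    centers = []
    for i in range(rows):
        y = y_min + i * step * k
        x_offset = step / 2 if i % 2 == 0 else 0.0
        for j in range(cols):
            centers.append((x_min + j * step + x_offset, y))
    return centers

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

centers = hex_grid(0.0, 0.0, 5, 5)
c = (600.0, 600.0 * (math.sqrt(3) / 2))                 # an interior center on row 1
neighbors = [(0.0, c[1]), (1200.0, c[1]),               # same row
             (300.0, 0.0), (900.0, 0.0),                # row below
             (300.0, 2 * c[1]), (900.0, 2 * c[1])]      # row above
assert any(dist(c, p) < 1e-9 for p in centers)
assert all(abs(dist(c, n) - 600.0) < 1e-6 for n in neighbors)
print('all six neighbors are exactly 600 m away')
```

This equidistance is why a hexagonal grid gives a more uniform coverage of the search area than a square grid with the same number of candidate points.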
# In[145]:

def get_address(lat, lng):
    try:
        address = [lat, lng]
        geolocator = Nominatim(user_agent="usa_explorer")
        location = geolocator.geocode(address)
        return location[0]
    except:
        return 'nothing found'

addr = get_address(sf_center[0], sf_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(sf_center[0], sf_center[1], addr))

# In[146]:

print('Getting Locations: ', end='')
addresses = []
for lat, lon in zip(latitude, longitude):
    address = get_address(lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', United States', '')
    addresses.append(address)
    print(' .', end='')
print(' done.')

# In[180]:

import pandas as pd

df_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitude,
                             'Longitude': longitude,
                             'X': xs,
                             'Y': ys,
                             'Distance from centroid': distances_from_center})
df_locations.head()

# In[181]:

df_locations.shape

# In[182]:

df_locations.to_pickle('./Dataset/sf_locations.pkl')

# ## Foursquare

# Now we will use the Foursquare API to explore the number of restaurants available within these grid cells. We will limit the search to food categories to retrieve latitude and longitude data for restaurants in general and Italian restaurants in particular.

# In[183]:

client_id = 'xxx'
client_secret = 'xxx'
VERSION = 'xxx'

# We use the Foursquare API to explore the number of restaurants available within 6 km of downtown San Francisco, limiting the search to venues in the restaurant category and especially those that correspond to Italian restaurants.
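The Foursquare explore-endpoint URL can also be assembled with `urllib.parse.urlencode`, which escapes the parameters safely instead of interpolating them into the string by hand. A minimal sketch using the same endpoint and parameter names as this notebook (the coordinates and credentials below are placeholders):

```python
from urllib.parse import urlencode

def build_explore_url(lat, lon, category, client_id, client_secret,
                      radius=350, limit=100, version='20180724'):
    """Build a Foursquare v2 /venues/explore URL with properly escaped parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lon),
        'categoryId': category,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

url = build_explore_url(37.7749, -122.4194, '4d4b7105d754a06374d81259', 'xxx', 'xxx')
print(url)
```

Keeping the parameters in a dict also makes it easy to add or drop query options without re-editing a long format string.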
# In[184]:

food_category = '4d4b7105d754a06374d81259'  # Foursquare 'Food' category (restaurants and cafes)
sf_italian_categories = ['4bf58dd8d48988d110941735', '55a5a1ebe4b013909087cbb6', '55a5a1ebe4b013909087cb7c',
                         '55a5a1ebe4b013909087cba7', '55a5a1ebe4b013909087cba1', '55a5a1ebe4b013909087cba4',
                         '55a5a1ebe4b013909087cb95', '55a5a1ebe4b013909087cb89', '55a5a1ebe4b013909087cb9b',
                         '55a5a1ebe4b013909087cb98', '55a5a1ebe4b013909087cbbf', '55a5a1ebe4b013909087cb79',
                         '55a5a1ebe4b013909087cbb0', '55a5a1ebe4b013909087cbb3', '55a5a1ebe4b013909087cb74',
                         '55a5a1ebe4b013909087cbaa', '55a5a1ebe4b013909087cb83', '55a5a1ebe4b013909087cb8c',
                         '55a5a1ebe4b013909087cb92', '55a5a1ebe4b013909087cb8f', '55a5a1ebe4b013909087cb86',
                         '55a5a1ebe4b013909087cbb9', '55a5a1ebe4b013909087cb7f', '55a5a1ebe4b013909087cbbc',
                         '55a5a1ebe4b013909087cb9e', '55a5a1ebe4b013909087cbc2', '55a5a1ebe4b013909087cbad']

# In[185]:

def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'sushi', 'hamburger', 'seafood']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if specific_filter is not None and category_id in specific_filter:
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', USA', '')
    address = address.replace(', United States', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]
    except:
        venues = []
    return venues

# In[186]:

import pickle

def get_restaurants(lats, lons):
    restaurants = {}
    sf_italian = {}
    location_restaurants = []
    print('Obtaining the candidates', end='')
    for lat, lon in zip(lats, lons):
        venues = get_venues_near_location(lat, lon, food_category, client_id, client_secret, radius=350, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_italian = is_restaurant(venue_categories, specific_filter=sf_italian_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1],
                              venue_address, venue_distance, is_italian, x, y)
                if venue_distance <= 300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_italian:
                    sf_italian[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, sf_italian, location_restaurants

restaurants = {}
sf_italian = {}
location_restaurants = []
loaded = False
try:
    with open('./Dataset/restaurants_350.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    print('Restaurant data loaded.')
    with open('./Dataset/sf_italian_350.pkl', 'rb') as f:
        sf_italian = pickle.load(f)
    print('Italian restaurant data loaded.')
    with open('./Dataset/location_restaurants_350.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('Per-location restaurant data loaded.')
    loaded = True
except:
    print('Restaurant Data Downloading')

if not loaded:
    restaurants, sf_italian, location_restaurants = get_restaurants(latitude, longitude)

# In[187]:

import numpy as np

# In[188]:

print('**Results**')
print('Total Number of Restaurants:', len(restaurants))
print('Total Number of Italian restaurants:', len(sf_italian))
print('Percentage of Italian restaurants: {:.2f}%'.format(len(sf_italian) / len(restaurants) * 100))
print('Average of Venues per grid:', np.array([len(r) for r in location_restaurants]).mean())

# In[189]:

print('List of All Restaurants')
print('-----------------------')
for r in list(restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(restaurants))

# In[190]:

print('List of all Italian restaurants')
print('---------------------------')
for r in list(sf_italian.values())[:10]:
    print(r)
print('...')
print('Total:', len(sf_italian))

# In[191]:

print('Restaurants around selected locations')
print('---------------------------')
for i in range(100, 110):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Restaurants around location {}: {}'.format(i+1, names))

# All restaurants in the city of San Francisco are shown in gray, and the Italian restaurants are highlighted in red.

# In[192]:

map_sf = folium.Map(location=sf_center, zoom_start=13, tiles=tileset, attr=attribution)
folium.Marker(sf_center, popup='San Francisco').add_to(map_sf)
for res in restaurants.values():
    lat = res[2]
    lon = res[3]
    is_italian = res[6]
    color = 'red' if is_italian else 'grey'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_sf)
map_sf

# ## Analysis

# Now we calculate the distance **from each grid cell to the nearest Italian restaurant** (not only those located less than 300 m away, since we also want to know the distance to the nearest one).
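Before moving on: the try/except block above loads cached Foursquare results, but the matching save step is not shown. A compact cache-or-compute helper (a hypothetical stdlib-only sketch, not the notebook's code) illustrates the full round trip, with a lambda standing in for the actual download:

```python
import os
import pickle
import tempfile

def cached(path, compute):
    """Return pickled data from `path` if present; otherwise call `compute`,
    pickle the result to `path`, and return it."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (OSError, pickle.UnpicklingError):
        data = compute()
        with open(path, 'wb') as f:
            pickle.dump(data, f)
        return data

# Usage with a stand-in for the Foursquare download:
path = os.path.join(tempfile.mkdtemp(), 'restaurants_350.pkl')
first = cached(path, lambda: {'venue-1': 'Trattoria'})   # computes and saves
second = cached(path, lambda: {'unused': 'ignored'})     # loads from disk
print(first == second)  # True
```

Wrapping each expensive API sweep in such a helper keeps re-runs of the notebook fast and avoids burning through rate limits.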
# In[194]:

distances_to_sf_italian = []
for area_x, area_y in zip(xs, ys):
    min_distance = 10000  # large sentinel so the first restaurant always updates it
    for res in sf_italian.values():
        res_x = res[7]
        res_y = res[8]
        d = calc_xy_distance(area_x, area_y, res_x, res_y)
        if d < min_distance:
            min_distance = d
    distances_to_sf_italian.append(min_distance)
df_locations['Distance to Italian restaurant'] = distances_to_sf_italian

# We then restrict the analysis to a region of interest around the city center and build a finer grid of candidate points; for each point we count the restaurants within 250 m and compute the distance to the nearest Italian restaurant.

# In[195]:

from folium.plugins import HeatMap

restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]
roi_center = sf_center  # region of interest centered on the city center, 2.5 km radius
roi_center_x, roi_center_y = lonlat_to_xy(roi_center[1], roi_center[0])

roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
x_step = 100  # finer 100 m hexagonal grid over the region of interest
y_step = 100 * k
roi_y_min = roi_center_y - 2500
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i % 2 == 0 else 0
    for j in range(0, 51):
        x = roi_center_x - 2500 + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        if d <= 2501:
            lon, lat = xy_to_lonlat(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)

def count_restaurants_nearby(x, y, restaurants, radius=250):
    count = 0
    for res in restaurants.values():
        if calc_xy_distance(x, y, res[7], res[8]) <= radius:
            count += 1
    return count

def find_nearest_restaurant(x, y, restaurants):
    d_min = 100000
    for res in restaurants.values():
        d_min = min(d_min, calc_xy_distance(x, y, res[7], res[8]))
    return d_min

roi_restaurant_counts = [count_restaurants_nearby(x, y, restaurants) for x, y in zip(roi_xs, roi_ys)]
roi_italian_distances = [find_nearest_restaurant(x, y, sf_italian) for x, y in zip(roi_xs, roi_ys)]

df_roi_locations = pd.DataFrame({'Latitude': roi_latitudes,
                                 'Longitude': roi_longitudes,
                                 'X': roi_xs,
                                 'Y': roi_ys,
                                 'Restaurants nearby': roi_restaurant_counts,
                                 'Distance to Italian restaurant': roi_italian_distances})

# In[219]:

good_res_count = np.array(df_roi_locations['Restaurants nearby'] <= 2)
print('Locations with no more than two restaurants within 250 m:', good_res_count.sum())
good_ind_distance = np.array(df_roi_locations['Distance to Italian restaurant'] >= 400)
print('Grids without Italian restaurants within 400 m.:', good_ind_distance.sum())
good_locations = np.logical_and(good_res_count, good_ind_distance)
print('Places with both conditions met:', good_locations.sum())
df_good_locations = df_roi_locations[good_locations]

# In[220]:

good_latitudes = df_good_locations['Latitude'].values
good_longitudes = df_good_locations['Longitude'].values
good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]

map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
HeatMap(restaurant_latlons).add_to(map_sf)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.6).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
map_sf

# In[215]:

map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
HeatMap(good_locations, radius=25).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
map_sf

# Now we are going to **cluster** these locations with a machine learning algorithm, in this case K-means, to create **8 groups that contain good locations.** These areas, their centers and their addresses will be the final result of our analysis.
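K-means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A pure-Python sketch of this loop (an illustrative reimplementation with a deterministic initialization; the notebook itself uses scikit-learn's `KMeans`):

```python
import math

def kmeans(points, k, iterations=20):
    """Minimal Lloyd's algorithm on 2-D points; returns the centroids.
    Deterministic init (first k points) keeps the sketch reproducible."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c_i: math.dist(p, centroids[c_i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

# Two well-separated blobs: the centroids converge to the blob means
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(pts, 2))
print(centers)  # roughly [(0.33, 0.33), (10.33, 10.33)]
```

In production code scikit-learn is preferable (k-means++ initialization, multiple restarts), but the loop above is the whole idea behind grouping the good locations into 8 promotion zones.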
# In[221]:

from sklearn.cluster import KMeans

number_of_clusters = 8
good_xys = df_good_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)
cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]

map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
HeatMap(restaurant_latlons).add_to(map_sf)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_sf)
folium.Marker(sf_center).add_to(map_sf)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='gray', fill=True, fill_opacity=0.25).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
map_sf

# Let's look at these areas west and south of the city with a heatmap, using shaded areas to indicate the 8 clusters created:

# In[222]:

map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
folium.Marker(sf_center).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#00000000', fill=True, fill_color='#0066ff', fill_opacity=0.07).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='purple', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sf)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='white', fill=False).add_to(map_sf)
map_sf

# Now we are going to list the candidate locations

# In[223]:

candidate_area_addresses = []
print('==============================================================')
print('Addresses of recommended locations')
print('==============================================================\n')
for lon, lat in cluster_centers:
    addr = get_address(lat, lon)
    addr = addr.replace(', United States', '')
    addr = addr.replace(', San Francisco', '')
    addr = addr.replace(', USA', '')
    addr = addr.replace(', SF', '')
    addr = addr.replace("'", '')
    candidate_area_addresses.append(addr)
    x, y = lonlat_to_xy(lon, lat)
    d = calc_xy_distance(x, y, sf_center_x, sf_center_y)
    print('{}{} => {:.1f}km from downtown San Francisco'.format(addr, ' '*(50-len(addr)), d/1000))

# ## Results

# In[224]:

map_sf = folium.Map(location=sf_center, zoom_start=14, tiles=tileset, attr=attribution)
folium.Circle(sf_center, radius=50, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(map_sf)
for lonlat, addr in zip(cluster_centers, candidate_area_addresses):
    folium.Marker([lonlat[1], lonlat[0]], popup=addr).add_to(map_sf)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#0000ff00', fill=True, fill_color='#0066ff', fill_opacity=0.05).add_to(map_sf)
map_sf

# The above locations are quite close to downtown San Francisco, and each has no more than two restaurants within a radius of 250 m and no Italian restaurant within 400 m. Any of them is a potential candidate for the new restaurant, at least considering the nearby competition. The K-means unsupervised learning algorithm has allowed us to group the good locations into 8 areas from which interested parties can choose among the results presented above.

# # Conclusions

# The objective of this project was to identify areas of San Francisco near the center with a small number of restaurants (especially Italian restaurants), to help stakeholders narrow down the search for an optimal location for a new Italian restaurant.
#
# By calculating the distribution of restaurant density from the Foursquare API data, it is possible to generate a large collection of locations that meet certain basic requirements.
# This data was then grouped using a machine learning algorithm (K-means) to create the main areas of interest (those containing the greatest number of potential locations), and the addresses of these area centers were generated. This interpretation gives the interested parties a starting point for their final exploration.
#
# Interested parties will make the final decision on the optimal restaurant location based on the specific characteristics of the neighborhoods in each recommended area, taking into account additional factors such as the attractiveness of each location (proximity to a park or water), noise levels / main roads, real estate availability, price, and the social and economic dynamics of each neighborhood.
#
# Finally, a more complete analysis and future work should integrate data from other external databases.

# # References
#
# 1. The Battle of Neighborhoods: Coursera Capstone Project
#
# 2. Geopy Geocoders
#
# 3. Foursquare API
#
# 4. MapBox Location Data Visualization library for Jupyter Notebooks

# ## 👍👍
# I invite you to write me your ideas and comments, and above all to share your opinions 🌍

# In[ ]: