Scrape McDonalds locations across the United States, and animate the order in which they were built

Original URL: http://nbviewer.ipython.org/url/rsargent.cmucreatelab.org/mcdonalds/McDonalds%20across%20the%20U.S..ipynb Check here for updates.

For this assignment:

  • You'll need Python 2.7. (You should already have this for your previous App Engine work)
  • You'll need IPython 1.x. To install, follow the directions at the top of this page
  • You'll need your own copy of this notebook. Assuming you haven't already downloaded the notebook, you're probably just viewing this notebook in the notebook viewer, which doesn't let you make any changes. Look for the download icon on the top left to bring the notebook local.
  • Start up the IPython Notebook Dashboard on your local computer
  • Follow the directions at the top of the Dashboard page to import this notebook into your local IPython Notebook server. (If you have troubles loading the file on Windows, you might need to rename the notebook to a simpler filename first, like 'mcdonalds.ipynb')

If you've succeeded in all this, you should now be running this notebook inside your local IPython Notebook server. Instead of a download icon, you'll now be seeing a menubar and a number of tool icons at the top of the page. And you can edit.

Windows 7 installation directions, from Yen-Chia Hsu:

  • Download and install Anaconda from http://continuum.io/downloads (Because Anaconda includes Python 2.7, there is no need to download Python separately).
  • Go into the Windows Start menu, find the Anaconda folder, and click "Anaconda Command Prompt" to bring up the Anaconda console.
  • Type "easy_install Scrapy" in the console to install the package; ignore any errors for now.
  • Type "easy_install Twisted" in the console to install the package.
  • Type "easy_install w3lib" in the console to install the package.

A few tips on using IPython:

To execute a block of code, click on it, and then hit shift-return (or Cell/Run)

Generally, you'll want to execute blocks from top to bottom. If you see a message that something's undefined, double-check that perhaps you missed executing a block that defined what you need.

Notes on browser compatibility:

The explorable visualization in this notebook makes use of WebGL, which requires Chrome, Firefox, or IE 11+. IPython Notebook doesn't officially support IE 11, so you're probably best off using Chrome or Firefox.

Support and office hours:

You're playing with some really new stuff. Please don't hesitate to reach out to me by email [email protected], gchat [email protected], or to come by office hours at 4pm each day until the assignment due date in NSH 3220. Come by if you have questions, get stuck, or especially if you want to learn how to do something that's not covered.

Also, consider emailing me when you start, and I can be sure to send updates or FAQ answers to you as they develop.

Ready to go!

First, a few functions to keep a copy of the McDonalds locations on your local disk

(Read the code in each block below, and then put your cursor in each one in sequence, and hit shift-return to evaluate)

In [ ]:
#
# Functions to load and save scraped McDonalds locations to/from this local JSON file:
#

import os
import json

locations_path = os.path.expanduser('~/projects/mcdonalds/mcdonalds_locations.json')

# Read locations from your local json file.  If the file is missing, create it.
def read_or_create_locations():
    try:
        locations = json.load(open(locations_path))
        print 'Read %d locations from %s' % (len(locations), locations_path)
    except StandardError:
        locations = []
        print '%s does not exist; creating' % locations_path
        write_locations(locations)
    return locations

# Write locations to your local json file
def write_locations(locations):
    try:
        os.makedirs(os.path.dirname(locations_path))
    except StandardError:
        pass
    json.dump(locations, open(locations_path, 'w'), indent=2)
    print 'Wrote %d locations to %s' % (len(locations), locations_path)
In [ ]:
# Read previously scraped locations, if any
#
# Note:  if you need to restart python, you'll especially want to reevaluate this line
# to reload your current set of locations

locations = read_or_create_locations()

Geocode a street address into Lat/Long

In [ ]:
import urllib
import urllib2

# Geocode a street address using Google's geocoding API
# Returns {'lat': latitude, 'lng': longitude} on success
# Returns False if not found
# Raises exception if a problem occurs (such as running out of quota)
#
# Note:  Google's geolocation service is limited to around 2500 geolocations per day.  If you receive
# an error that you've run out of quota, you'll need to wait a day, or change your IP address.
#
# (For the purposes of this assignment, see below for a shortcut to download already geolocated addresses to
#  fill in your gaps)
#
#  If you run into quota limitation and don't want to wait a day, here are some ways to change your IP address:
#    - If you're on a laptop, move to a different wireless network
#    - Find an HTTP proxy at http://www.hidemyass.com/proxy-list and switch to that proxy with this python command:
#      urllib2.install_opener(urllib2.build_opener(urllib2.ProxyHandler({'http': '168.63.167.183'})))
#    - Use a VPN service (e.g. Hide My Ass's "Pro VPN" account, which costs $)

def geocode(address):
    # Perform API call to maps.googleapis.com, which will return address in JSON format
    url = 'http://maps.googleapis.com/maps/api/geocode/json?%s' % urllib.urlencode({'address': address, 'sensor': 'false'})
    response = json.loads(urllib2.urlopen(url).read())
    if response['status'] == 'OK' and response['results']:
        # Success!  Return the first result
        return response['results'][0]['geometry']['location']
    elif response['status'] != 'ZERO_RESULTS':
        # Something failed (maybe out of quota?).  Raise an exception.
        msg = 'When trying to geocode address %s by reading url %s, received a status != OK: %s' % (address, url, response['status'])
        print msg
        raise Exception(msg)
    else:
        # No results for this address.  Return False.
        return False
In [ ]:
# Try it out
geocode("5000 Forbes Avenue, Pittsburgh, PA")
In [ ]:
# Go ahead and try it out on some other addresses.  Does it work outside the U.S.?

Scrape the street address of a single McDonalds store

This uses the Scrapy library for parsing HTML. If you don't already have it installed, you'll see "No module named scrapy.selector"; see the comments in the code below for a link to download and installation instructions.

In [ ]:
from scrapy.selector import Selector

# Scrape an individual McDonalds store #
# Returns {'store_number': store_number, 'address': address, 'lat': latitude, 'lng': longitude} on success
# If McDonalds website doesn't show a store by that number, return False
# If geolocation fails, will return only 'store_number' and 'address'
#
# Things to watch out for:
#
# 1) If you see "No module named scrapy.selector", you'll need to install the scrapy library.
#   See http://scrapy.org/download/ for directions how to download and install.
#
# 2) If you see errors about zope.interface, try updating zope.interface to a newer version like so:
#    "pip install --upgrade zope.interface"
#    and then restart the kernel (Kernel/Restart, at the top of the notebook, above).
#
# Or, if you have troubles getting scrapy to install properly, consider switching to the
# BeautifulSoup-based scraper below

def scrape_store(store_number):
    # Oddly, www.mcdonalds.com's map application redirects to www.mc<statename>.com to show store details.
    # But, lucky for us, any of the state sites seem to work for all store #s.  So let's go with PA for no particular reason.
    url = "http://www.mcpennsylvania.com/%d" % store_number
    
    print "Trying to fetch store# %d from %s" % (store_number, url)
    html = urllib2.urlopen(url).read()
    print "  Read %d bytes" % len(html)
    
    # Using Scrapy, build an XPath selector to parse the received HTML
    selector = Selector(text = html)
    
    # The street address is buried in the <li> tag whose class contains the word "address" (usually address_3 but
    # sometimes address_1 and maybe others).
    # Inside that <li>, find all the <h3> tags, extract text from them, and then join together with newlines in-between.
    address = '\n'.join(selector.xpath('//li[contains(@class, "address")]/h3/text()').extract())
    
    if address:
        location = {'store_number': store_number, 'address': address}
        print '  Address is %s' % address.replace('\n', '|')
        latlng = geocode(address)
        if latlng:
            print '  Geocoded to %s' % latlng
            location['lat'] = latlng['lat']
            location['lng'] = latlng['lng']
        else:
            print '  No geocode'
        return location
    else:
        print '  Store not found'
        return False
In [ ]:
# Uncomment below and comment above if you'd like to switch to BeautifulSoup for
# parsing HTML.  You can install BeautifulSoup using easy_install:
# easy_install BeautifulSoup

# from bs4 import BeautifulSoup
#
# # Scrape an individual McDonalds store #
# # Returns {'store_number': store_number, 'address': address, 'lat': latitude, 'lng': longitude} on success
# # If McDonalds website doesn't show a store by that number, return False
# # If geolocation fails, will return only 'store_number' and 'address'
# #
# # If you don't already have BeautifulSoup installed, you can install like so:
# # easy_install BeautifulSoup
#
# def scrape_store(store_number):
#     # Oddly, www.mcdonalds.com's map application redirects to www.mc<statename>.com to show store details.
#     # But, lucky for us, any of the state sites seem to work for all store #s.  So let's go with PA for no particular reason.
#     url = "http://www.mcpennsylvania.com/%d" % store_number
#     
#     print "Trying to fetch store# %d from %s" % (store_number, url)
#     html = urllib2.urlopen(url).read()
#     print "  Read %d bytes" % len(html)
#     
#     parsed = BeautifulSoup(html)
# 
#     # The street address is buried in the <li> tag whose class contains the word "address" (usually address_3 but
#     # sometimes address_1 and maybe others).
#     li = parsed.select('li[class^="address"]')
# 
#     if li:
#         # Join together the strings with newlines to construct the address
#         address = '\n'.join(li[0].strings)
#     else:
#         address = ''
# 
#     if address:
#         location = {'store_number': store_number, 'address': address}
#         print '  Address is %s' % address.replace('\n', '|')
#         latlng = geocode(address)
#         if latlng:
#             print '  Geocoded to %s' % latlng
#             location['lat'] = latlng['lat']
#             location['lng'] = latlng['lng']
#         else:
#             print '  No geocode'
#         return location
#     else:
#         print '  Store not found'
#         return False
In [ ]:
# Try it out
scrape_store(100)
In [ ]:
# The highest McDonalds store # is a bit under 17000, as of February 2014.
# Try a few more stores.  What does the function return if a store by the requested ID doesn't exist?
In [ ]:
# Start scraping from scratch.  (Skip this cell if you've already scraped some
# locations and want to resume where you left off.)
locations = []
In [ ]:
# Time to scrape the stores!  This will take a while.
# As noted above, you'll run out of geolocation quota after around 2500 stores.  So let's scrape up to store 999
# for now, and see what it looks like.

# Scrape up to this store#:
last_store = 999

# If we have previously scraped data, start where we left off.  Otherwise start at store # 1.
starting_store = (max([location['store_number'] for location in locations]) + 1) if locations else 1

for store_number in range(starting_store, last_store + 1):
    location = scrape_store(store_number)
    if location:
        locations.append(location)
        if len(locations) % 10 == 0:
            # Checkpoint:  every ten, save all locations to disk
            write_locations(locations)

# Done.  Save all locations
write_locations(locations)

Take a peek at mcdonalds_locations.json

Fire up your favorite editor or file viewer and verify that mcdonalds_locations.json is looking reasonable.
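If you'd rather sanity-check the file programmatically, here's a minimal sketch (runs under Python 2 or 3; `check_locations` is a helper name invented here, and the bounding-box test is just a rough plausibility check):

```python
# Rough sanity checks on a list of scraped location records.
# Field names match the scraper above; records missing lat/lng are
# allowed, since geocoding may have failed for some stores.
def check_locations(locations):
    problems = []
    seen = set()
    for loc in locations:
        n = loc.get('store_number')
        if n in seen:
            problems.append('duplicate store_number %s' % n)
        seen.add(n)
        # Any coordinates present should at least be on the globe
        if 'lat' in loc and not (-90 <= loc['lat'] <= 90 and -180 <= loc['lng'] <= 180):
            problems.append('bad coordinates for store %s' % n)
    return problems

# Usage:
# problems = check_locations(json.load(open(locations_path)))
```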

Displaying locations

To display the McDonalds locations on a map, we're going to build a web page and display it inside an <iframe> inside this notebook. First, we'll need the utility function "iframe_with_source", which lets you pass HTML as a big string.

In [ ]:
# iframe_with_source is a utility function to let us insert HTML into an iframe
# as an IPython result

from IPython.display import HTML
import json

def iframe_with_source(source, height):
    name = 'iframe-%d' % get_ipython().execution_count
    source = json.dumps(source)
    source = source.replace('</script', '</scr"+"ipt')
    width = '100%'
    height = '%spx' % height
    template = """
<iframe id="%s" style="width:%s; height:%s"></iframe>
<script>
document.getElementById('%s').srcdoc = %s;
</script>
"""
    # Fill in the %s slots with id, width, height, and the HTML source
    return HTML(template % (name, width, height, name, source))
In [ ]:
# Try out iframe_with_source on a trivially simple web page.  Note Python's triple-quote syntax for multi-line strings.

source = """
<html>
  <body>
    Hello world!
  </body>
</html>"""
height = 100  # pixels
iframe_with_source(source, height)

Next, we need to convert the store information into the JSON format our page is going to need.

In [ ]:
# Convert to Javascript series:  {latlng: [lat0, lng0, lat1, lng1, lat2, lng2 ...], index: [index0, index1, index2 ...]}
javascript_latlng = []
javascript_index = []
for location in locations:
    if 'lat' in location:
        javascript_index.append(location['store_number'])
        javascript_latlng.append(location['lat'])
        javascript_latlng.append(location['lng'])
javascript_series = "// Data from %d locations, index %s - %s\n" % (len(javascript_index), javascript_index[0], javascript_index[-1])
javascript_series += 'var mcDonalds = {\n  latlng: new Float32Array(%s),\n\n  index: new Float32Array(%s)\n};' % (json.dumps(javascript_latlng), json.dumps(javascript_index))
print "javascript_series contains %d locations and is %s bytes long" % (len(javascript_index), len(javascript_series))

Here's the web page. We'll construct it by sandwiching the JSON above between a header and footer. Don't let the length of the JavaScript below scare you. You can do pretty much all of the customization you'll want inside the two functions init() and drawFrame().

When you evaluate this code below, you should see an iframe with a Google map displaying an animation of McDonalds being built.

In [ ]:
# Construct HTML source for viewing page, with javascript_series inserted
# Insert as an <iframe> to prevent conflicts between this code and the parent page.

src = """

<html style="height:100%">
<head>
    <script src="http://maps.googleapis.com/maps/api/js?sensor=false"></script>
    <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
    <script src="http://api.cmucreatelab.org/exp-0.1/js/CanvasLayer.js"></script>
    <script src="http://api.cmucreatelab.org/exp-0.1/js/utils.js"></script>
    <script src="http://api.cmucreatelab.org/exp-0.1/js/series.js"></script>
    <script>
"""

src += javascript_series

src += """
      var map;
      var canvasLayer;
      var gl;
      var pixelsToWebGLMatrix = new Float32Array(16);
      var mapMatrix = new Float32Array(16);

      var advancing = true; // true when in 'play' mode; false when paused

      var lastTime = 0;
      var totalElapsedTime = 0;
      var fps = 50;

      var currentIndex, minIndex, maxIndex;

      function init() {
        // initialize the map
        var mapOptions = {
          zoom: 4,
          center: new google.maps.LatLng(39.3, -95.8)
        };
        var mapDiv = document.getElementById('map-div');
        map = new google.maps.Map(mapDiv, mapOptions);

        if (!initWebGL()) {
          $('#map-div').append('<div align="center" style="position: absolute; z-index:1000000; margin:50px; padding:10px; border-color:black; border-style:solid; border-width:1px; box-shadow:5px 5px 5px grey; background-color:white; font-size:20px"><b>WebGL required</b><br>Please try using a browser that supports WebGL,<br>such as Chrome, Firefox, or Internet Explorer 11.</div>');
          return;
        }

        // Convert from latlng to xy for map, and set up shader programs
        prepareSeries(gl, mcDonalds);

        currentIndex = minIndex = mcDonalds.index[0];
        maxIndex = mcDonalds.index[mcDonalds.index.length - 1];

        initPlaybackControls();
      }

      function drawFrame() {
        if (advancing) {
          advanceCurrentIndex();
        }

        // Use additive blending mode.  As we draw overlapping pixels, color values
        // are added and then clamped to go no higher than 1.  This has the effect of
        // making overlapped areas brighter, approaching white (assuming use of colors
        // which are non-zero in each of R, G, B)

        gl.enable(gl.BLEND);
        gl.blendFunc( gl.SRC_ALPHA, gl.ONE );

        // Compute WebGL transform from world xy coords to screen
        // copy pixel->webgl matrix
        mapMatrix.set(pixelsToWebGLMatrix);

        var scale = canvasLayer.getMapScale();
        scaleMatrix(mapMatrix, scale, scale);

        var translation = canvasLayer.getMapTranslation();
        translateMatrix(mapMatrix, translation.x, translation.y);

        // Erase frame
        gl.clear(gl.COLOR_BUFFER_BIT);

        // Compute point diameter, in pixels, based on zoom level

        // map.zoom is approx 4 at country level
        // How many pixels in diameter should marker be when map zoomed to country level?
        var countryPointSizePixels = 3;

        // map.zoom is approx 18 at block level
        // How many pixels in diameter should marker be when map zoomed to block level?
        var blockPointSizePixels = 90;

        var pointSize = countryPointSizePixels * Math.pow(blockPointSizePixels / countryPointSizePixels, (map.zoom - 4) / (18 - 4));
        var color = [.82, .22, .07, 1.0]; // RGBA, orange

        // How much of the marker's radius do we draw at alpha=1, before starting the slope to alpha=0?
        // 0 is very soft, 0.95 is a hard circle with well-defined edge
        var hardFraction = 0.4;  

        drawPoints(gl, mapMatrix, mcDonalds, 0, findIndex(currentIndex, mcDonalds), 
                   {color: color, pointSize: pointSize, hardFraction: hardFraction});
      }
      
      function advanceCurrentIndex() {
        var timeNow = new Date().getTime();
        if (lastTime != 0) {
          var elapsed = timeNow - lastTime;
          totalElapsedTime += elapsed;
        }
        lastTime = timeNow;

        if (totalElapsedTime > 1000 / fps) {
          totalElapsedTime = 0;
          var newIndex = currentIndex + 100;
          if (newIndex >= maxIndex) {
            // TODO(rsargent): implement delay at beginning and end
            newIndex = 0;
          }
          playbackSetIndex(newIndex);
        }
      }

      function initWebGL() {
        // initialize the canvasLayer
        var canvasLayerOptions = {
          map: map,
          resizeHandler: resize,
          animate: true,
          updateHandler: drawFrame
        };
        canvasLayer = new CanvasLayer(canvasLayerOptions);

        window.addEventListener('resize', function () {  google.maps.event.trigger(map, 'resize') }, false);

        // initialize WebGL
        gl = canvasLayer.canvas.getContext('experimental-webgl');
        return !!gl;
      }

      function initPlaybackControls() {
        $('#playback-play-pause-button').click(function() {
          advancing = (this.textContent == 'Play');
          this.textContent = advancing ? 'Pause' : 'Play';
        });

        $('#playback-range')
          .on("input change", function() {
            console.log('change, yeah');
            playbackSetIndex(this.valueAsNumber);
          })
          .mousedown(function() {
            advancing = false;
          })
          .mouseup(function() {
            if ($('#playback-play-pause-button').text() == 'Pause') {
              advancing = true;
            }
          });
      }

      function playbackSetIndex(newIndex) {
        currentIndex = newIndex;
        $('#playback-display').text(currentIndex);
        $('#playback-range').val(currentIndex);
      }

      function resize() {
        var w = canvasLayer.canvas.width;
        var h = canvasLayer.canvas.height;

        // Extend viewport to entire canvas
        gl.viewport(0, 0, w, h);

        // Map canvas pixel coordinates to WebGL coordinates
        pixelsToWebGLMatrix.set([2/w, 0,   0, 0,
                                 0,  -2/h, 0, 0,
                                 0,   0,   1, 0,
                                -1,   1,   0, 1]);
      }

      $(init);
    </script>

    <script id="pointVertexShader" type="x-shader/x-vertex">
      attribute vec4 worldCoord;
      attribute float aPointSize;

      uniform mat4 mapMatrix;

      void main() {
        // transform world coordinate by matrix uniform variable
        gl_Position = mapMatrix * worldCoord;

        // a constant size for points, regardless of zoom level
        gl_PointSize = aPointSize;
      }
    </script>
    <script id="pointFragmentShader" type="x-shader/x-fragment">
      precision mediump float;
      uniform vec4 color;
      uniform float hardFraction;

      // Circle of radius 0.5 (in gl_PointCoord units), composed of a "hard" (alpha=1) center of
      // radius 0.5 * hardFraction, then transitioning to alpha=0 at radius 0.5
      void main() {
        float dist = length(gl_PointCoord.xy - vec2(.5, .5));
        // TODO(rsargent):  shouldn't we just be adjusting the alpha here?  Maybe we're taking
        // advantage of the double-multiplication to get something other than linear.
        // But multiplying all the channels will break if we do something other than an additive blend.
        gl_FragColor = color * clamp((0.5 - dist) / (0.5 - 0.5 * hardFraction), 0., 1.);
      }
    </script>
</head>
<body style="height:100%; margin:0; padding:0">
    <div id="map-div" style="height:100%"></div>
    <div style="position:relative; left:90px; top:-30px; width:550px">
      <button id="playback-play-pause-button" style="width:50px">Pause</button>
      <div align="right" id="playback-display" style="width:40px; display:inline-block"></div>
      <input type="range" style="width: 300px; position:relative; top:3px"  value="0" min="0" max="17000" list="number" id="playback-range"/>
    </div>
</body>
</html>
"""

# You might find it easier to debug changes to this code by writing everything out to a .html file
# and loading it separately into your browser;  uncomment the following to do so

# open(os.path.expanduser('~/Desktop/mcdonalds.html'), 'w').write(src)

height = 500
iframe_with_source(src, height)

If you only have up to store # 999, the map will be looking a bit sparse. You can scrape more stores directly by increasing "last_store = 999" above and rerunning the scrape loop, but it's going to take a while, and the workarounds for the geolocation quota limit are a bit annoying. So let's take a shortcut and grab store #s 1000 and above from a pre-baked JSON file.

This is going to require you to do some manual surgery on your mcdonalds_locations.json file, which you need to get comfortable with anyway if you want to be able to clean your data.

Bring up mcdonalds_locations.json in your favorite editor, and then splice in records from http://rsargent.cmucreatelab.org/mcdonalds/mcdonalds_locations_1000_and_above.json. Be sure to keep the stores in order (the display code relies on it), and avoid duplicates (in case you've scraped beyond 999 already). Also, be mindful of the commas and square brackets to keep your JSON correctly formed. (You can try pasting your creation into http://jsonlint.com/ for useful error messages if something goes wrong).
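If hand-editing makes you nervous, you can do the same splice in Python. A minimal sketch, assuming both files are JSON arrays of records shaped like the ones `write_locations` produces (`merge_locations` is my own naming; adjust the filename to wherever you saved the download):

```python
# Merge two lists of location records, dropping duplicates and keeping
# the result sorted by store_number (the display code relies on order)
def merge_locations(mine, prebaked):
    by_number = {}
    for loc in prebaked:
        by_number[loc['store_number']] = loc
    # Records you scraped yourself win over the pre-baked ones
    for loc in mine:
        by_number[loc['store_number']] = loc
    return [by_number[n] for n in sorted(by_number)]

# Usage:
# prebaked = json.load(open('mcdonalds_locations_1000_and_above.json'))
# locations = merge_locations(locations, prebaked)
# write_locations(locations)
```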

Once you believe you have a complete mcdonalds_locations.json file, you'll need to load the locations into this notebook by re-executing the "locations = read_or_create_locations()" line at the top of the notebook, and then re-execute "Convert to Javascript series" and "Construct HTML Source". If all goes well, you'll be able to explore all the McDonald's up to store # 16997.

Next steps

Seeing a store's ID

It's tricky to interact with a large number of markers on a Google map; navigating the map requires clicking, so making thousands of markers click-sensitive gets irritating fast. A better approach is something like an image tooltip: leaving the mouse still over a marker for a second or two brings up a small textbox.

Implement a way to view the ID from a store's marker, and perhaps also a link to that store's web page (remember the pattern "http://www.mcpennsylvania.com/ID" from the scraping code above). You're going to need this for the next step.

Update 2/17: There's a new primitive in ``series.js`` to assist you in this:

findClosestElement(gl, transform, series, pixelXY, maxDistInPixels):
gl               WebGL context
transform        WebGL transformation matrix (e.g. mapMatrix)
series           Data series to search (e.g. mcDonalds)
pixelXY          Pixel coords inside map div.  (If you subscribe to google maps mouse 
                 events, you can pass in event.pixel here)
maxDistInPixels  Don't return an element that's more than this distance in pixels from pixelXY

If an element is found, returns an object with i (sequence), lat, and lng.

If no element is found within maxDistInPixels, returns null.

You can look up the store # from sequence like so: mcDonalds.index[sequence].

Assessing the accuracy of your data

With the sheer number of sites, it's well outside the scope of this assignment to clean any sizeable fraction of the data. If you select the street map (instead of satellite) and zoom in enough on a marker, you can often see where Google has labeled a McDonald's colocated with your marker. But it's possible that errors in Google's layer are somewhat correlated with errors in your layer. In satellite mode, you might recognize the distinctive look of the restaurant, at least when it's free-standing. And of course there's always Street View. (Be aware that both sources of imagery could be older than the McDonald's you're trying to locate.)

Pick 30 placemarks at random and try to assess whether each is likely correct, likely incorrect, or impossible to tell. Use this to put very rough bounds on the accuracy of the rest of the dataset.
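`random.sample` is a handy way to draw your 30; a sketch (`sample_placemarks` is my own naming, and the seed parameter is optional but makes your sample reproducible when you revisit it):

```python
import random

# Pick k geocoded placemarks at random for manual accuracy checking
def sample_placemarks(locations, k, seed=None):
    rng = random.Random(seed)
    geocoded = [loc for loc in locations if 'lat' in loc]
    return rng.sample(geocoded, min(k, len(geocoded)))

# Usage:
# for loc in sample_placemarks(locations, 30, seed=42):
#     print loc['store_number'], loc['lat'], loc['lng']
```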

Learn from your data, and tell a story

Spend some time exploring this dataset. See any surprises? Any patterns that you notice? Tell and record a brief story in your own words, using screencast software (e.g. QuickTime Player for Mac or http://camstudio.org/ for Windows). Upload to a video sharing site such as YouTube, and embed into your notebook.

In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo('QH2-TGUlwu4') # paste your youtube video ID here from the embed link

Publish your notebook

To publish your notebook, you need to perform three steps:

  • Download your notebook in .ipynb format (using File/Download at the top of the page)
  • Place your .ipynb file somewhere publicly readable on the web
  • Use the IPython Notebook Viewer to translate and view your notebook by pasting in the URL to your notebook here

You can place your .ipynb file online using your Andrew account, or using your App Engine instance. An even better approach, if you're familiar with git and GitHub, is to create a GitHub repo and push your .ipynb file to it. This lets others build from your notebook more easily, and the notebook viewer has special support for pulling the .ipynb file directly from your repo's HEAD.

In [ ]: