Will Geary (wcg2111)
The Death and Life of Great American Cities is a critique of 1950s urban planning dogma. It was written in 1961 by journalist-activist Jane Jacobs when she was living in Manhattan's Greenwich Village. In the text, Jacobs compares and contrasts many different cities. The book had a tremendous impact on me when I first read it, so I thought it would be fun to make a map of every place mentioned in the book.
Full pdf available here. Don't ask how I found it :) https://github.com/willgeary/janejacobs/tree/master/PDF
As you can see below, the pdf is not in the best condition.
from IPython.display import Image
%matplotlib inline
Image("https://raw.githubusercontent.com/willgeary/janejacobs/master/Images/Cover.png")
The first challenge is to convert the pdf into a searchable, editable text file. This proved to be quite difficult. There are many open source tools which claim to be able to do this. Some failed entirely, others returned complete gibberish (presumably due to the poor quality of the pdf).
However, I finally found a tool that works: PDFMiner, a Python tool for extracting information from PDF documents. More info on PDFMiner here: http://www.unixuser.org/~euske/python/pdfminer/index.html#source
In short, it works like this:
"""
1. Install Python 2.4 or newer. (Python 3 is not supported.)
2. Download the PDFMiner source.
3. Unpack it.
4. Run setup.py to install:
# python setup.py install
5. Do the following test:
$ pdf2txt.py samples/simple1.pdf
Hello
World
"""
Once I installed PDFMiner, I was able to convert the entire pdf into a text file using the following command line instruction:
"""
pdf2txt.py -o output.txt death_and_life_of_great_american_cities.pdf
"""
Here is a gif of PDFMiner doing its thing. You can see that there is some gibberish and some misspellings in the output.
Image(url='https://raw.githubusercontent.com/willgeary/janejacobs/master/Images/screencast.gif')
Now we have the entire 'Death and Life' book in a searchable text file. Let's read it into memory and check it out.
# Read the text file into memory
with open('output.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')
# Take a look at the text
data[:1000]
', ,,/ THEDEATH AND LIFE OFGREAT AMERICAN CITIES JANE JACOBS V-241 $4.95 \x0cTHE DEATH AND LIFE OF GREAT AMERICAN CITIES \xe2\x80\xa2 Jane Jacobs VINTAGE BOOKS A Division of &ndo11l HOllse NEW YORK \x0cAcknowledgment permission ill their pages: Arc/)ilrctllf\'lli F"nnll, Hilrpcr\'r MI1SIF,.,jflC, i\xef\xbf\xbd made to the following publications of this book which first appeared Fo"nll, the Columbiil U\'fIi;;erJily to reprint portions for T/;c Reporter . \xe2\x80\xa2 @ Copyright, Ip61, by lillie Jacobi \xef\xbf\xbdIJ rights resen\'cd uncler Inlem\xef\xbf\xbdtional Published rrght COII,:,cntions. Inc., and snnult:lneously of Canada, Limited. and Pan-American Copy\xc2\xadill New York by Random House, ill Tvrunto, Dnacla, by Random House ,\\\'Ignufacrurcd in rhe United StlItcs uf America \x0cTo NEW YOAIt CITY where I ClIme [0 seek my fortune and found it by finding Bob, Jimmy, Ned and Mary for whom this book is written [00 \xe2\x80\xa2 \xe2\x80\xa2 \x0cAcknowledgment In particular I am grateful June Blythe, Gndy Clay, William for given by the: following So many scoru of'
Use the natural language processing library nltk to tokenize all of the words in the text. See here for more info: http://www.nltk.org/
# Decode the text as ASCII string, ignoring errors caused by some unrecognizable characters
s = data.decode('ascii', 'ignore')
# Word tokenize the string of characters using nltk module
import nltk
tokens = nltk.word_tokenize(s)
# Preview
tokens[:10]
[u',', u',', u',/', u'THEDEATH', u'AND', u'LIFE', u'OFGREAT', u'AMERICAN', u'CITIES', u'JANE']
Use the geotext library to extract country and cities from the text. More info on geotext here: https://pypi.python.org/pypi/geotext
from geotext import GeoText
places = GeoText(s)
Look below at a few of the cities found using this tool. Clearly, there are some mistakes: "Mary" and "Nelson" are personal names in this text, though geotext matches them because they are also real city names (in Turkmenistan and New Zealand, respectively), and capitalized common words like "Most" and "Of" slip through too. Still, some cities are correctly identified (Chicago, Philadelphia). Let's ignore this problem for now and come back to it later.
places.cities[:10]
[u'YORK', u'Mary', u'Nelson', u'Kent', u'Most', u'Of', u'Chicago', u'Chicago', u'York', u'Philadelphia']
places.countries[:10]
[u'Canada', u'Chad', u'Ireland', u'United States', u'Sweden', u'Solomon Islands', u'Jersey', u'United States', u'United States', u'Venezuela']
Try pulling latitude and longitude coordinates for one city.
import googlemaps
API_key = "YOUR_API_KEY"  # substitute your own Google Maps API key (don't publish a real one)
gmaps = googlemaps.Client(key=API_key)
Test the API by attempting to geocode one of the cities:
places.cities[10]
u'Boston'
geocode_result = gmaps.geocode(places.cities[10])
lat = geocode_result[0]['geometry']['location']['lat']
lon = geocode_result[0]['geometry']['location']['lng']
print (lat, lon)
(42.3600825, -71.0588801)
It works! Correctly returns the lat/lon coordinates for Boston City Hall.
Image("https://raw.githubusercontent.com/willgeary/janejacobs/master/Images/Boston.png")
Iterate through the list of cities and ask the Google Maps API for each city's lat/lon coordinates.
coords = {}
for city in places.cities:
    geocode_result = gmaps.geocode(city)
    # Skip cities for which Google Maps returns no results
    if geocode_result:
        lat = geocode_result[0]['geometry']['location']['lat']
        lon = geocode_result[0]['geometry']['location']['lng']
        coords[city] = (lat, lon)
coords['Baltimore']
(39.2903848, -76.6121893)
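Incidentally, `places.cities` contains duplicates (Chicago appears twice in the preview above), so the loop above geocodes the same city more than once. Here is a sketch of a variant that deduplicates first and pauses between requests; the function name `geocode_all`, the pause length, and the injectable `geocode` callable are my own assumptions, not part of the original notebook:

```python
import time

def geocode_all(cities, geocode, pause=0.1):
    # Geocode each unique city once, pausing between requests to be
    # gentle on API quotas (the 0.1s pause is a guess, not a
    # documented Google Maps limit).
    coords = {}
    for city in set(cities):
        result = geocode(city)
        if result:  # skip cities the geocoder can't find
            loc = result[0]['geometry']['location']
            coords[city] = (loc['lat'], loc['lng'])
        time.sleep(pause)
    return coords
```

Passing the geocoder in as a callable (e.g. `gmaps.geocode`) also makes the function easy to test without hitting the network.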
Create a table of each city and its latitude / longitude coordinates.
import pandas as pd
df = pd.DataFrame(coords).transpose()
df.rename(columns={0: "Latitude", 1: "Longitude"}, inplace=True)
df.head()
City | Latitude | Longitude
---|---|---
ALVIN | 29.423847 | -95.244101
AMI | 36.030795 | 140.214786
ATH | 52.991834 | -6.985728
Along | 28.162784 | 94.805414
Baltimore | 39.290385 | -76.612189
It would also be cool to count how many times each city is mentioned in the book. Let's do that.
city_counts = [tokens.count(city) for city in list(df.index)]
df['Frequency'] = city_counts
df.head()
City | Latitude | Longitude | Frequency
---|---|---|---
ALVIN | 29.423847 | -95.244101 | 2
AMI | 36.030795 | 140.214786 | 3
ATH | 52.991834 | -6.985728 | 18
Along | 28.162784 | 94.805414 | 6
Baltimore | 39.290385 | -76.612189 | 14
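A side note on performance: `tokens.count(city)` rescans the entire token list once per city. A single-pass tally with `collections.Counter` gives the same numbers; the token and city lists below are tiny stand-ins for the book's tokens and `df.index`, used purely for illustration:

```python
from collections import Counter

# Stand-in data; in the notebook, `tokens` is the whole book
# and the cities come from df.index.
tokens = ["Boston", "and", "Chicago", "met", "Boston"]
cities = ["Boston", "Chicago", "Baltimore"]

token_counts = Counter(tokens)  # one pass over all tokens
city_counts = [token_counts[city] for city in cities]
print(city_counts)  # [2, 1, 0]
```

A `Counter` returns 0 for missing keys, so cities that never appear in the tokens simply get a frequency of zero.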
Save this table to a csv.
df.to_csv('mapdata.csv')
I mapped the cities using Carto. The size of the bubbles represents the number of times that each city is mentioned in the text.
from IPython.display import IFrame
IFrame("https://willgeary.carto.com/viz/31c6b2d0-8db4-11e6-ab40-0ef7f98ade21/embed_map", 800, 500)
I'm excited to have somewhat successfully converted the pdf into text, parsed it for locations and created a map.
However, there are clearly some problems here. Namely, some of the cities parsed are not actually cities (or at least not in the context that Jacobs intended).
Some problems that I noticed include:
- Personal names matched as cities ("Mary", "Nelson"), because they happen to be real place names elsewhere.
- Capitalized common words matched as cities ("Most", "Of", "Along").
- All-caps fragments from the garbled OCR output geocoded to unrelated places ("ALVIN", "AMI", "ATH").
In order to be fully satisfied with this map, I would need to parse through the cities manually and remove the nonsensical ones.
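One way that manual cleanup could work, sketched with a hypothetical stoplist and a tiny stand-in for the `coords` dict (the real stoplist would come from reading through the full table by hand):

```python
# Hypothetical stoplist of false positives spotted in the tables above.
not_cities = {"Mary", "Nelson", "Most", "Of", "Along", "ALVIN", "AMI", "ATH"}

# Stand-in for the coords dict built earlier from the Google Maps API.
coords = {"Baltimore": (39.290385, -76.612189), "Along": (28.162784, 94.805414)}

# Keep only entries whose name is not on the stoplist.
cleaned = {city: latlon for city, latlon in coords.items() if city not in not_cities}
print(sorted(cleaned))  # ['Baltimore']
```

Filtering the dict before building the DataFrame means the frequency counts and the csv would only ever see the vetted city names.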