Will Geary (wcg2111)
The Death and Life of Great American Cities is a critique of 1950s urban planning dogma. It was written in 1961 by journalist-activist Jane Jacobs when she was living in Manhattan's Greenwich Village. In the text, Jacobs compares and contrasts many different cities. The book had a tremendous impact on me when I first read it, so I thought it would be fun to make a map of every place mentioned in the book.
Full pdf available here. Don't ask how I found it :) https://github.com/willgeary/janejacobs/tree/master/PDF
As you can see below, the pdf is not in the best condition.
from IPython.display import Image
%matplotlib inline
Image("https://raw.githubusercontent.com/willgeary/janejacobs/master/Images/Cover.png")
The first challenge is to convert the pdf into a searchable, editable text file. This proved to be quite difficult. There are many open source tools which claim to be able to do this. Some failed entirely, others returned complete gibberish (presumably due to the poor quality of the pdf).
However, I finally found a tool that works: PDFMiner, a Python tool for extracting information from PDF documents. More info on PDFMiner here: http://www.unixuser.org/~euske/python/pdfminer/index.html#source
In short, it works like this:
"""
1. Install Python 2.4 or newer. (Python 3 is not supported.)
2. Download the PDFMiner source.
3. Unpack it.
4. Run setup.py to install:
# python setup.py install
5. Do the following test:
$ pdf2txt.py samples/simple1.pdf
Hello
World
"""
Once I installed PDFMiner, I was able to convert the entire pdf into a text file using the following command line instruction:
"""
pdf2txt.py -o output.txt death_and_life_of_great_american_cities.pdf
"""
Here is a gif of PDFMiner doing its thing. You can see that there is some gibberish and some misspellings in the output.
Image(url='https://raw.githubusercontent.com/willgeary/janejacobs/master/Images/screencast.gif')
Now we have the entire 'Death and Life' book in a searchable text file. Let's read it into memory and check it out.
# Read the text file into memory
with open('output.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')
# Take a look at the text
data[:1000]
', ,,/ THEDEATH AND LIFE OFGREAT AMERICAN CITIES JANE JACOBS V-241 $4.95 \x0cTHE DEATH AND LIFE OF GREAT AMERICAN CITIES \xe2\x80\xa2 Jane Jacobs VINTAGE BOOKS A Division of &ndo11l HOllse NEW YORK \x0cAcknowledgment permission ill their pages: Arc/)ilrctllf\'lli F"nnll, Hilrpcr\'r MI1SIF,.,jflC, i\xef\xbf\xbd made to the following publications of this book which first appeared Fo"nll, the Columbiil U\'fIi;;erJily to reprint portions for T/;c Reporter . \xe2\x80\xa2 @ Copyright, Ip61, by lillie Jacobi \xef\xbf\xbdIJ rights resen\'cd uncler Inlem\xef\xbf\xbdtional Published rrght COII,:,cntions. Inc., and snnult:lneously of Canada, Limited. and Pan-American Copy\xc2\xadill New York by Random House, ill Tvrunto, Dnacla, by Random House ,\\\'Ignufacrurcd in rhe United StlItcs uf America \x0cTo NEW YOAIt CITY where I ClIme [0 seek my fortune and found it by finding Bob, Jimmy, Ned and Mary for whom this book is written [00 \xe2\x80\xa2 \xe2\x80\xa2 \x0cAcknowledgment In particular I am grateful June Blythe, Gndy Clay, William for given by the: following So many scoru of'
Use the natural language processing library nltk to tokenize all of the words in the text. See here for more info: http://www.nltk.org/
# Decode the text as ASCII string, ignoring errors caused by some unrecognizable characters
s = data.decode('ascii', 'ignore')
# Word tokenize the string of characters using nltk module
import nltk
tokens = nltk.word_tokenize(s)
# Preview
tokens[:10]
[u',', u',', u',/', u'THEDEATH', u'AND', u'LIFE', u'OFGREAT', u'AMERICAN', u'CITIES', u'JANE']
Use the geotext library to extract country and cities from the text. More info on geotext here: https://pypi.python.org/pypi/geotext
from geotext import GeoText
places = GeoText(s)
Look below at a few of the cities found using this tool. Clearly, there are some mistakes: "Mary" and "Nelson" are personal names in this text, though geotext matches them because they are also real city names (in Turkmenistan and New Zealand, respectively), and capitalized common words like "Most" and "Of" slip through too. Still, some cities are correctly identified (Chicago, Philadelphia). Let's ignore this problem for now and come back to it later.
places.cities[:10]
[u'YORK', u'Mary', u'Nelson', u'Kent', u'Most', u'Of', u'Chicago', u'Chicago', u'York', u'Philadelphia']
places.countries[:10]
[u'Canada', u'Chad', u'Ireland', u'United States', u'Sweden', u'Solomon Islands', u'Jersey', u'United States', u'United States', u'Venezuela']
Try pulling latitude and longitude coordinates for one city.
import googlemaps
API_key = "YOUR_API_KEY"  # substitute your own Google Maps API key (don't publish a real one)
gmaps = googlemaps.Client(key=API_key)
Test the API by attempting to geocode one of the cities:
places.cities[10]
u'Boston'
geocode_result = gmaps.geocode(places.cities[10])
lat = geocode_result[0]['geometry']['location']['lat']
lon = geocode_result[0]['geometry']['location']['lng']
print (lat, lon)
(42.3600825, -71.0588801)
It works! Correctly returns the lat/lon coordinates for Boston City Hall.
Image("https://raw.githubusercontent.com/willgeary/janejacobs/master/Images/Boston.png")
Iterate through the list of cities and ask the Google Maps API for each city's lat/lon coordinates.
coords = {}
for city in places.cities:
    geocode_result = gmaps.geocode(city)
    # Skip cities for which Google Maps returns no results
    if geocode_result:
        lat = geocode_result[0]['geometry']['location']['lat']
        lon = geocode_result[0]['geometry']['location']['lng']
        coords[city] = (lat, lon)
coords['Baltimore']
(39.2903848, -76.6121893)
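Incidentally, `places.cities` contains duplicates (Chicago appears twice in the preview above), so the loop above geocodes the same city more than once. Here is a sketch of a variant that deduplicates first and pauses between requests; the function name `geocode_all`, the pause length, and the injectable `geocode` callable are my own assumptions, not part of the original notebook:

```python
import time

def geocode_all(cities, geocode, pause=0.1):
    # Geocode each unique city once, pausing between requests to be
    # gentle on API quotas (the 0.1s pause is a guess, not a
    # documented Google Maps limit).
    coords = {}
    for city in set(cities):
        result = geocode(city)
        if result:  # skip cities the geocoder can't find
            loc = result[0]['geometry']['location']
            coords[city] = (loc['lat'], loc['lng'])
        time.sleep(pause)
    return coords
```

Passing the geocoder in as a callable (e.g. `gmaps.geocode`) also makes the function easy to test without hitting the network.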
Create a table of each city and its latitude / longitude coordinates.
import pandas as pd
df = pd.DataFrame(coords).transpose()
df.rename(columns={0: "Latitude", 1: "Longitude"}, inplace=True)
df.head()
City | Latitude | Longitude
---|---|---
ALVIN | 29.423847 | -95.244101
AMI | 36.030795 | 140.214786
ATH | 52.991834 | -6.985728
Along | 28.162784 | 94.805414
Baltimore | 39.290385 | -76.612189
It would also be cool to count how many times each city is mentioned in the book. Let's do that.
city_counts = [tokens.count(city) for city in list(df.index)]
df['Frequency'] = city_counts
df.head()
City | Latitude | Longitude | Frequency
---|---|---|---
ALVIN | 29.423847 | -95.244101 | 2
AMI | 36.030795 | 140.214786 | 3
ATH | 52.991834 | -6.985728 | 18
Along | 28.162784 | 94.805414 | 6
Baltimore | 39.290385 | -76.612189 | 14
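A side note on performance: `tokens.count(city)` rescans the entire token list once per city. A single-pass tally with `collections.Counter` gives the same numbers; the token and city lists below are tiny stand-ins for the book's tokens and `df.index`, used purely for illustration:

```python
from collections import Counter

# Stand-in data; in the notebook, `tokens` is the whole book
# and the cities come from df.index.
tokens = ["Boston", "and", "Chicago", "met", "Boston"]
cities = ["Boston", "Chicago", "Baltimore"]

token_counts = Counter(tokens)  # one pass over all tokens
city_counts = [token_counts[city] for city in cities]
print(city_counts)  # [2, 1, 0]
```

A `Counter` returns 0 for missing keys, so cities that never appear in the tokens simply get a frequency of zero.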
Save this table to a csv.
df.to_csv('mapdata.csv')
I mapped the cities using Carto. The size of the bubbles represents the number of times that each city is mentioned in the text.
from IPython.display import IFrame
IFrame("https://willgeary.carto.com/viz/31c6b2d0-8db4-11e6-ab40-0ef7f98ade21/embed_map", 800, 500)
I'm excited to have somewhat successfully converted the pdf into text, parsed it for locations and created a map.
However, there are clearly some problems here. Namely, some of the cities parsed are not actually cities (or at least not in the context that Jacobs intended).
Some problems that I noticed include:
- Personal names matched as cities ("Mary", "Nelson"), because they happen to be real place names elsewhere.
- Capitalized common words matched as cities ("Most", "Of", "Along").
- All-caps fragments from the garbled OCR output geocoded to unrelated places ("ALVIN", "AMI", "ATH").
In order to be fully satisfied with this map, I would need to parse through the cities manually and remove the nonsensical ones.
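One way that manual cleanup could work, sketched with a hypothetical stoplist and a tiny stand-in for the `coords` dict (the real stoplist would come from reading through the full table by hand):

```python
# Hypothetical stoplist of false positives spotted in the tables above.
not_cities = {"Mary", "Nelson", "Most", "Of", "Along", "ALVIN", "AMI", "ATH"}

# Stand-in for the coords dict built earlier from the Google Maps API.
coords = {"Baltimore": (39.290385, -76.612189), "Along": (28.162784, 94.805414)}

# Keep only entries whose name is not on the stoplist.
cleaned = {city: latlon for city, latlon in coords.items() if city not in not_cities}
print(sorted(cleaned))  # ['Baltimore']
```

Filtering the dict before building the DataFrame means the frequency counts and the csv would only ever see the vetted city names.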