Eleanor Tutt | tutte@carnegielibrary.org | Code for PGH Work Night | June 13, 2018
This is the same Jupyter notebook I will talk through. You can also use it as a starting point to explore digital equity data during the second half of the work night.
I have been working with Jupyter notebooks on and off for several years, but I still consider myself a beginner and this talk is designed for beginners!
I'll leave lots of time for you to explore Jupyter notebooks, Python, and digital equity data using this notebook as a starting point - you can slow down or speed up then, ideally in small groups with new friends. :)
# No code for this part - just talk!
# if you want to use any python packages, add them at the beginning of your notebook
# packages allow you to take "shortcuts" instead of coding everything yourself
# like if you are making a pizza and you buy dough instead of making it totally from scratch
# pandas is a python package popular with data scientists
# it is common to shorten pandas to pd when you import it
import pandas as pd
# folium is a python package that helps create interactive maps using leaflet
import folium
# later we are going to read in some geojson
import simplejson
# and this will help us with some colors for a choropleth map
from branca.colormap import linear
# pandas has a function called read_csv that can import csv data and store it as a dataframe
# a dataframe is a common pandas object that stores information in rows and columns
# you can think of it like a spreadsheet, especially if you are using .csv data
# here we are creating a dataframe named libraries from a previously downloaded csv file
# original data from https://data.wprdc.org/dataset/libraries
# you could also link directly to the wprdc download url in your code:
# https://data.wprdc.org/datastore/dump/14babf3f-4932-4828-8b49-3c9a03bae6d0
libraries = pd.read_csv("librarylocations.csv")
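(Not part of the original notebook: read_csv also accepts a URL or any file-like object, not just a local path. Here is a minimal sketch using a tiny in-memory CSV - the two rows are borrowed from the library locations data - so it runs without downloading anything.)

```python
import io
import pandas as pd

# a tiny in-memory CSV standing in for librarylocations.csv
csv_text = """Name,Lat,Lon
ALLEGHENY LIBRARY,40.456392,-80.006613
BEECHVIEW LIBRARY,40.407026,-80.027741
"""

# read_csv accepts a path, a URL, or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.shape)  # (2, 3)
```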
# here is library wifi use data - brand new on the WPRDC!
# original data from https://data.wprdc.org/dataset/clp-public-wifi
wifi = pd.read_csv("clp-public-wifi.csv")
# here we are creating a dataframe named fcc from previously downloaded data
# original data from
# we added the dtype parameter to make sure tract numbers are imported as text and not numbers
fcc = pd.read_csv("tract_map_dec_2016.csv", dtype={'tractcode': str})
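(A quick aside, not in the original notebook: the reason dtype matters is that tract codes can start with a zero, and if pandas reads them as integers that leading zero disappears. A minimal sketch with made-up rows:)

```python
import io
import pandas as pd

# without dtype=str, "01001020100" would be read as the integer 1001020100
csv_text = "tractcode,pcat_all\n01001020100,3\n42003468900,4\n"
fcc_demo = pd.read_csv(io.StringIO(csv_text), dtype={"tractcode": str})
print(fcc_demo["tractcode"][0])  # "01001020100" - leading zero preserved
```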
# adding .info() to the name of a dataframe gives us some information about the dataframe -
# how many rows of data we have, what the columns are named, and what the data types are
fcc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73767 entries, 0 to 73766
Data columns (total 3 columns):
tractcode    73767 non-null object
pcat_all     73767 non-null int64
pcat_10x1    73767 non-null int64
dtypes: int64(2), object(1)
memory usage: 1.7+ MB
# you also might want to look at the first few rows of your dataframe
libraries.head(5)
|   | _id | CLPID | Name | Address | City | State | Zip4 | County | Phone | SqFt | ... | ThOpen | ThClose | FrOpen | FrClose | SaOpen | SaClose | SuOpen | SuClose | Lat | Lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CLP01 | ALLEGHENY LIBRARY | 1230 FEDERAL ST | PITTSBURGH | PA | 15212-4704 | Allegheny | (412) 237-1890 | 15005.0 | ... | 10:00:00 | 20:00:00 | 10:00:00 | 17:00:00 | 10:00:00 | 17:00:00 | 12:00:00 | 17:00:00 | 40.456392 | -80.006613 |
| 1 | 2 | CLP02 | BEECHVIEW LIBRARY | 1910 BROADWAY AVE | PITTSBURGH | PA | 15216-3130 | Allegheny | (412) 563-2900 | 8000.0 | ... | 10:00:00 | 20:00:00 | 10:00:00 | 17:00:00 | 10:00:00 | 17:00:00 | NaN | NaN | 40.407026 | -80.027741 |
| 2 | 3 | CLP03 | BROOKLINE LIBRARY | 708 BROOKLINE BLVD | PITTSBURGH | PA | 15226-2102 | Allegheny | (412) 561-1003 | 12651.0 | ... | 10:00:00 | 20:00:00 | 10:00:00 | 17:00:00 | 10:00:00 | 17:00:00 | NaN | NaN | 40.394399 | -80.021427 |
| 3 | 4 | CLP04 | CARRICK LIBRARY | 1811 BROWNSVILLE RD | PITTSBURGH | PA | 15210-3907 | Allegheny | (412) 882-3897 | 4000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.397019 | -79.987547 |
| 4 | 5 | CLP05 | DOWNTOWN & BUSINESS LIBRARY | 612 SMITHFIELD ST | PITTSBURGH | PA | 15222-2506 | Allegheny | (412) 281-7141 | 12709.0 | ... | 08:30:00 | 18:00:00 | 08:30:00 | 17:00:00 | 10:00:00 | 17:00:00 | NaN | NaN | 40.441749 | -79.997112 |

5 rows × 26 columns
# or the last few rows
wifi.tail(5)
|   | CLPID | Name | Year | Month | WifiSessions | WifiMinutes |
|---|---|---|---|---|---|---|
| 527 | CLP19 | WOODS RUN LIBRARY | 2017 | 12 | 630 | 79279 |
| 528 | CLP19 | WOODS RUN LIBRARY | 2018 | 1 | 716 | 99671 |
| 529 | CLP19 | WOODS RUN LIBRARY | 2018 | 2 | 778 | 108100 |
| 530 | CLP19 | WOODS RUN LIBRARY | 2018 | 3 | 816 | 104073 |
| 531 | CLP19 | WOODS RUN LIBRARY | 2018 | 4 | 903 | 107865 |
# we can group by one or more columns in our data
# and then summarize by another column
# (try "mean" instead of "sum")
wifi.groupby(by=['Name', 'Year'])['WifiSessions'].sum()
Name                                            Year
ALLEGHENY LIBRARY                               2016     13224
                                                2017     19914
                                                2018      6872
BEECHVIEW LIBRARY                               2016      5924
                                                2017      8509
                                                2018      2886
BROOKLINE LIBRARY                               2016      6840
                                                2017      9076
                                                2018      2350
CARRICK LIBRARY                                 2016      4498
                                                2017      1932
                                                2018         0
DOWNTOWN & BUSINESS LIBRARY                     2016     36168
                                                2017     47660
                                                2018     16974
EAST LIBERTY LIBRARY                            2016     21521
                                                2017     29329
                                                2018      9712
HAZELWOOD LIBRARY                               2016      7968
                                                2017      8198
                                                2018      2638
HILL DISTRICT LIBRARY                           2016      7693
                                                2017     12886
                                                2018      3943
HOMEWOOD LIBRARY                                2016     11363
                                                2017     14144
                                                2018      4111
KNOXVILLE LIBRARY                               2016      4953
                                                2017     16859
                                                2018      4362
LAWRENCEVILLE LIBRARY                           2016      3877
                                                2017      5909
                                                2018      1788
LIBRARY FOR THE BLIND & PHYSICALLY HANDICAPPED  2016      2407
                                                2017      3712
                                                2018       739
MAIN (OAKLAND) LIBRARY                          2016    108945
                                                2017    134235
                                                2018     53078
MOUNT WASHINGTON LIBRARY                        2016      2959
                                                2017      4076
                                                2018       992
SHERADEN LIBRARY                                2016      8805
                                                2017     11926
                                                2018      3026
SOUTH SIDE LIBRARY                              2016      7946
                                                2017      9630
                                                2018      3264
SQUIRREL HILL LIBRARY                           2016     44102
                                                2017     57127
                                                2018     20003
WEST END LIBRARY                                2016      3999
                                                2017      4228
                                                2018      1147
WOODS RUN LIBRARY                               2016      8495
                                                2017     11395
                                                2018      3213
Name: WifiSessions, dtype: int64
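(A small add-on to the "try mean instead of sum" suggestion above, not in the original notebook: here is the same groupby pattern on a tiny made-up dataframe, so you can see how sum and mean differ without loading the full wifi data. The library names and numbers are invented for illustration.)

```python
import pandas as pd

# a tiny stand-in for the wifi dataframe (values are made up)
demo = pd.DataFrame({
    "Name": ["A LIBRARY", "A LIBRARY", "B LIBRARY", "B LIBRARY"],
    "Year": [2016, 2017, 2016, 2017],
    "WifiSessions": [100, 200, 50, 150],
})

# sum adds up sessions within each (Name, Year) group
totals = demo.groupby(by=["Name", "Year"])["WifiSessions"].sum()
# mean averages sessions within each Name group
averages = demo.groupby(by=["Name"])["WifiSessions"].mean()

print(totals[("A LIBRARY", 2017)])  # 200
print(averages["B LIBRARY"])        # 100.0
```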
m = folium.Map(location=[40.442, -79.997],
zoom_start=13,
tiles="Stamen Toner")
library_layer = folium.FeatureGroup(name='CLP Libraries')
for index, row in libraries.iterrows():
library_layer.add_child(folium.CircleMarker(location=(row["Lat"],
row["Lon"]),
radius=150,
popup=row["Name"],
color=False,
fill_opacity=1,
fill_color="#d4237d"))
m.add_child(library_layer)
folium.LayerControl().add_to(m)
m
# side note: folium has a "choropleth" shortcut I generally use
# but I just learned today they are thinking of deprecating it
# so I made this map in a slightly more complicated way
# our fcc dataframe has data for the entire US
# but all those census tracts would make for a large file
# we will load a previously downloaded geojson file
# original data from https://data.wprdc.org/dataset/allegheny-county-census-tracts-2016
# a good place to explore geojson as a data structure is http://geojson.io/
with open("Allegheny_County_Census_Tracts_2016.geojson") as f:
    allegheny_tracts = simplejson.load(f)
# create a colormap
colormap = linear.YlGn_09.scale(
fcc.pcat_all.min(),
fcc.pcat_all.max())
print(colormap(5.0))
colormap
#004529
# and a way to connect tractcode/geoids to the relevant data
fcc_dict = fcc.set_index('tractcode')['pcat_all']
fcc_dict['42003468900']
4
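(One caution, not in the original notebook: square-bracket lookup like `fcc_dict['42003468900']` raises a KeyError if a tract code is missing, which can crash the style_function below if a geojson feature has no matching FCC row. A pandas Series also has a .get method that returns a default instead. A minimal sketch with a stand-in Series - the second tract's value is made up:)

```python
import pandas as pd

# stand-in for fcc_dict: a Series indexed by tract code
lookup = pd.Series({"42003468900": 4, "42003010300": 5})

# .get returns a default for missing keys instead of raising KeyError
print(lookup.get("42003468900", 0))  # 4
print(lookup.get("99999999999", 0))  # 0
```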
m2 = folium.Map(location=[40.442, -79.997],
zoom_start=13,
tiles="Stamen Toner")
fcc_layer = folium.GeoJson(
allegheny_tracts,
name='Residential Internet',
style_function=lambda feature: {
'fillColor': colormap(fcc_dict[feature['properties']['GEOID']]),
'color': 'black',
'weight': 1,
'fillOpacity': 0.8,
}
)
m2.add_child(fcc_layer)
m2.add_child(library_layer)
folium.LayerControl().add_to(m2)
m2