Last week, you dug through the WPRDC and had a chance to see the vast and varied data available there. During this project, you'll be dealing with quite a few datasets and data sources, but luckily they'll all be open data. Open data is, according to Albert Lin of the WPRDC, "a complete set of primary data made easily and permanently available in a timely fashion using electronic, machine readable, open file formats. Cost should not pose a barrier to accessing information, and no unreasonable restrictions should limit accessibility, sharing and re-use."
Your primary source of data to analyze will be the Western Pennsylvania Regional Data Center, or WPRDC. We've used WPRDC data a lot over the past few weeks, but this is the first time you'll be exploring the vast expanses of its data on your own.
There's a lot of data available, so it's important to know how to search through it and find what you want.
The primary way you'll find relevant datasets is through keyword search. From the master list of all datasets available at data.wprdc.org, just search for what you're looking for! For example, when I was looking for a list of all of the bus stops on Port Authority routes, I searched "bus stops," and the first (and only) result was exactly what I wanted: this lovely dataset.
However, searching around can be fruitless if you're not exactly sure what kind of data are available; fortunately, the WPRDC groups their datasets into topic-centered categories, like health, public safety, and housing. The full list of categories is available here: data.wprdc.org/group. If you can't find the specific data you want by searching for it, or if you don't know what data you want exactly, try browsing through the relevant category or categories and see what you find.
Finally, you can also sort by publisher. Each government agency or private group that displays data through the WPRDC has their own page listing all of the data they provide. You can view all of the organizations here: data.wprdc.org/organization.
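If you'd rather search the catalog from code, the WPRDC portal runs on CKAN, which exposes a keyword-search API at /api/3/action/package_search. Here's a minimal sketch that builds a search URL you can open in a browser or fetch from a script; the search_url helper is just an illustration, not something from the WPRDC docs:

```python
from urllib.parse import urlencode

# Standard CKAN search endpoint on the WPRDC portal
BASE = "https://data.wprdc.org/api/3/action/package_search"

def search_url(keywords):
    """Build a CKAN keyword-search URL for the WPRDC catalog."""
    return BASE + "?" + urlencode({"q": keywords})

print(search_url("bus stops"))
# https://data.wprdc.org/api/3/action/package_search?q=bus+stops
```

The JSON that comes back lists matching datasets along with their download links, so this is handy if you ever want to automate finding data.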
Throughout the semester we imported a few different types of data files. Here is a listing of the ways we imported them:
# load pandas (always do this first)
import pandas as pd
import numpy as np
# load data from a downloaded data set
center_attendance_pandas = pd.read_csv("community-center-attendance.csv",
index_col="_id") # use the column named _id as the row index
# load the 311 data directly from the WPRDC and parse dates directly
pgh_311_data = pd.read_csv("https://data.wprdc.org/datastore/dump/76fda9d0-69be-4dd5-8108-0de7907fc5a4",
index_col="CREATED_ON",
parse_dates=True) # using the dates as the index
# read a tab-separated file by specifying the separator
chip = pd.read_csv("chipotle.tsv", sep="\t")
As you can see, we can just give pandas a link to a data file on the internet and it will handle the download itself; it's pretty great that way. That's especially useful for data that updates continuously, like the Pittsburgh 311 calls. (As I write this, the most recent update was 4 minutes ago.)
However, links on the web are unstable: they can move or be taken down, and there's no guarantee that there will be a good internet connection everywhere. So, it's always a good idea to have a copy of any online data as a backup.
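One way to get the best of both worlds is a small caching helper: read the local copy if you've already saved one, and otherwise download from the URL and write a backup for next time. This is just a sketch (load_with_backup is a made-up name, not a pandas function):

```python
import os
import pandas as pd

def load_with_backup(url, local_path, **read_kwargs):
    """Load a CSV from local_path if it exists; otherwise download
    it from url and save a local backup copy."""
    if os.path.exists(local_path):
        return pd.read_csv(local_path, **read_kwargs)
    df = pd.read_csv(url, **read_kwargs)
    df.to_csv(local_path, index=False)  # keep a backup for next time
    return df
```

After the first run, your analysis keeps working even if the link goes down, and the backup file is right there to commit to your repo.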
If you're dealing with data that isn't actively changing, like results from a diabetes study that occurred in 2015, it's probably best to use a local copy of the data for your analysis. You've done this a lot; it's why our data analysis labs have been in GitHub repos that you git clone instead of just downloading a notebook. Managing a bunch of files can be fiddly, but GitHub makes it easy to distribute and manage projects. You can download local copies of WPRDC data from the site and use them in your projects like any other file we've worked with, using the read_table() and read_csv() functions in pandas.
Note: You may encounter some weird filetypes that we haven't dealt with in class. One I ran into was a .dbf file, which is an older type of database. To deal with .dbf files, we can use a module called geopandas, which is normally used for doing spatial/map work in data analysis. (You'll most likely use geopandas elsewhere in your project.)
To read in a .dbf file (I'm using the Port Authority's bus stop data as an example), do the following:
import geopandas as gpd
dbf = r'PAAC_Stops_1611.dbf' # path to the .dbf file
table = gpd.read_file(dbf) # read the .dbf into a GeoDataFrame
pdtable = pd.DataFrame(table) # convert to a plain pandas DataFrame
pdtable.head()
geopandas can read some filetypes that pandas doesn't natively support. If you encounter any other filetypes and the normal functions don't work, remember to use Read-Search-Ask and try to find a solution on the internet. Someone's dealt with your situation before.
If you're still having trouble downloading your data (because of SSL certificate errors or other connection issues), you may need to use the ssl module to work around certificate verification, which can be done as follows:
# run this if downloading data over HTTPS raises certificate errors
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
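A word of caution: that line replaces Python's default HTTPS context with an unverified one, so certificate checking is turned off for every download your notebook makes. Here's a quick sketch of what "unverified" means in practice:

```python
import ssl

# An unverified context skips hostname checks and certificate validation,
# which is why the workaround above silences SSL errors -- at the cost
# of some security. Only use it for public, read-only data.
ctx = ssl._create_unverified_context()
print(ctx.check_hostname)                # False
print(ctx.verify_mode == ssl.CERT_NONE)  # True
```

Since WPRDC data is public anyway, this is usually an acceptable trade-off for a class project, but it's worth knowing what the workaround actually does.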
Again, there are pros and cons to downloading your data to your local machine. However, if you do download the data, upload it to your GitHub repo! Otherwise, people running your code can't verify your findings!