Happy Healthy Hungry -- San Francisco (H3)

In this analysis, I will walk you through my general data science process by analyzing the inspections of San Francisco restaurants using publicly available data from the Department of Public health. We will explore this data to map the cleanliness of the city, and get a better perspective on the relative meaning of these scores by looking at statistics of the data. Throughout this notebook, I show how to use a spectrum of powerful tools for data science (from UNIX shell to pandas and matplotlib) and provide some tips and data tricks. If you would just like to see what insights we garnered, read the associated blog post or simply jump to the bottom of this notebook. This notebook can be downloaded (with associated data) from its repo.

Our approach is documented here to encourage anyone (and everyone) to repeat our analyses and to provide complete transparency into our results. And because we love the scientific method

There's More!

This is one of many projects aimed at democratizing data and letting it roam free (free-range data?). Stay tuned to our feeds or blog if this is the kind of thing that excites you!

If you want to learn more or dive deeper into any of these subjects, we are always happy to discuss (and can talk for days). And if you just can't get enough of this stuff (and want a completely immersive environment), you can apply for our intensive data science bootcamp starting September 16th.

We would love to hear from all of you! Please reach out with any suggestions, feedback, or to just say hi: [email protected]

If you would like to get involved with our school please contact us! We are always looking for guest speakers/instructors, novel datasets, and companies to partner with either for internships/externships or for our hiring day

In [1]:
# Import pylab to provide scientific Python libraries (NumPy, SciPy, Matplotlib)
%pylab --no-import-all
#import pylab as pl
# import the Image display module
from IPython.display import Image

# inline allows us to embed matplotlib figures directly into the IPython notebook
%pylab inline
Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib
Populating the interactive namespace from numpy and matplotlib
In [2]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/animate.gif', width=700)
Out[2]:

Problem

The first step of the Process is to define the problem we want to address. To do so let us review what we have set out to accomplish and begin exploring questions we want answered.

How clean are SF restaurants?

It is often best to arrive at a simple yet illuminating question to give you direction. Of course there are a number of sub-questions we may have that relate to our over arching problem, we can address these when we determine our goals for the analysis.

Let us review the important points to keep in mind when defining our problem:

  • The question can be qualitative, but the approach, must be quantifiable
  • What am I looking for?
  • What do I want to learn?
  • Alright to be exploratory (and the best analysis often are)
In [3]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/goal.png', width=500)
Out[3]:

Determine Goal

Now that we have a problem we hope to solve, let us begin to quantify our analysis. Since our Problem Statement is often qualitative and broad, we can ask further questions to better define what we hope to achieve.

How does an individual restaurants' score compare to the whole/aggregate of SF?

Are SF's inspections better or worse than other cities?

If a restaurant has not yet been inspected, can we approximate/predict what score it will receive?

Some things to note about our goals and approach:

  • Determine what defines success, and to what degree.
  • Brainstorm metrics to visualize and/or calculate.
  • Ask questions that have (or can have) a definitive answer.
  • Be careful what you wish for, be aware of possible correlations, and take caution with how you measure it.
In [4]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/explore.png', width=500)
Out[4]:

Explore Data

To recap where we are in our analysis:

  • We have determined what we want to learn -- How clean are SF restaurants?
  • We explored quantifiable metrics to collect -- Individual scores, summary statistics about distribution of scores, other cities' inspection data to compare

The Explore stage of the analysis is where we will most likely spend most of our time. Now comes the fun part (in my opinion)! At this stage we will use a variety of tools (the documentation of each linked to inline) to figure out where and how to obtain data, what it looks like once we have it, and how to use it to answer of questions to achieve our goals.

Acquire

Luckily, San Francisco has much of its public government data freely accessible online. There are also great initiatives by SF companies collaborating with non-profits and government to develop open data standards. Such standardization allows for much more transparency, leading ultimately to a more engaged citizenry.

The relevant data has been downloaded for your convenience and can be found in the repo.

Examine

If you are working with this IPython notebook, download the data files into the same directory which you ran the ipython notebook command

Now that we have found the relevant data we can begin to peer inside to understand what we are working with. I recommend starting with an iterative approach, using the quickest/easiest tools first and slowly build to more complicated analyes. UNIX provides us with many powerful tools and can carry us quite far by itself. In our case the dataset came with documentation of its contents, but it still is essential to look at the raw data and compare it to the docs.

In [5]:
# Let us display a few lines of data to understand its format and fields

# Any command prefixed with '!' instructs IPython to run a shell command
# http://ipython.org/ipython-doc/rel-0.13.1/interactive/reference.html#system-shell-acces

!head -n 5 data/SFBusinesses/businesses.csv




head is a UNIX command to print only the first few lines of a file (in this case 5). This is very useful for exploring very large files (which happens a lot in data science) very quickly and easily. You can read more about it here) or by consulting its manual pages on the command-line: man head

In [6]:
# [PROTIP]: IPython has built in tab completion for commands.  
# Partially type a command or file name and hit tab.

!head -n 5 data/SFBusinesses/inspections.csv




In [7]:
!head -n 5 data/SFBusinesses/violations.csv




In [8]:
# It is a little hard to read since the headers are much
# shorter than the data.  Lets see if we can prettify it.

!head -n 5 data/SFBusinesses/violations.csv | column -t -s ','




column is used to format its input into multiple columns. It also is useful for formating columns already present in delimited data (CSV for example). With this command we used UNIX pipes), one of the most powerful and useful aspects of working in a terminal.

In [9]:
!head -n 5 data/SFBusinesses/ScoreLegend.csv | column -t -s ','




There are two different data directories, each of which has similar files. Let's try to figure out the difference between the two, since the documentation on the data does not mention anything.

In [10]:
%%bash

# We can use IPython cell 'magics' to run a cell in a subprocess. In this case we run
# the entire cell in a bash process (notice no exclamation point before the shell command)

head -n 5 data/SFFoodProgram_Complete_Data/businesses_plus.csv
"business_id","name","address","city","state","postal_code","latitude","longitude","phone_no","TaxCode","business_certificate","application_date","owner_name","owner_address","owner_city","owner_state","owner_zip"
10,"TIRAMISU KITCHEN","033 BELDEN PL","San Francisco","CA","94104","37.791116","-122.403816",,"H24",,,"Tiramisu LLC","33 Belden St","San Francisco","CA","94104"
12,"KIKKA","250 EMBARCADERO  7/F","San Francisco","CA","94105","37.788613","-122.393894",,"H24",,7/12/2002 0:00:00,"KIKKA ITO, INC.","431 South Isis Ave.","Inglewood","CA","90301"
17,"GEORGE'S COFFEE SHOP","2200 OAKDALE AVE ","San Francisco","CA","94124","37.741086","-122.401737","(141) 555-5314","H24",,4/5/1975 0:00:00,"LIEUW, VICTOR & CHRISTINA C","648 MACARTHUR DRIVE","DALY CITY","CA","94015"
19,"NRGIZE LIFESTYLE CAFE","1200 VAN NESS AVE, 3RD FLOOR","San Francisco","CA","94109","37.786848","-122.421547",,"H24",,,"24 Hour Fitness Inc","1200 Van Ness Ave, 3rd Floor","San Francisco","CA","94109"

IPython cell magics and other tricks

In [11]:
%%bash
# Can we somehow compare the columns?

head -n 1 data/SFFoodProgram_Complete_Data/businesses_plus.csv | awk -F, '{ print NF }'
head -n 1 data/SFBusinesses/businesses.csv | awk -F, '{ print NF }'
17
9

We see here that in SFFoodProgram_Complete_Data/ the files seem to be augmented with additional data. The file has almost twice as many fields present.

We used awk to figure this out by passing in the header row from the file to a simple awk script to count the number of fields (NF).

AWK is actually a programming language (in addition to a command line utility) and can by quite powerful if used correctly. This is going to be one of the standard tools at our disposal as a data scientist to explore and manipulate data.

In [12]:
# In addition to the extra fields, is anything else different?

!wc data/SFBusinesses/businesses.csv data/SFFoodProgram_Complete_Data/businesses_plus.csv
    6384   44426  656875 data/SFBusinesses/businesses.csv
    6353   85003 1215896 data/SFFoodProgram_Complete_Data/businesses_plus.csv
   12737  129429 1872771 total

wc is another standard UNIX utility that we will find ourself coming back to time and time again. Here we compare the line counts (number of records) and sizes of the files.

The first column is the number of newlines, or how many records are contained. The second column is word count and the last the number of bytes

Some interesting things to note from this very simple (yet illustrative) exploration of our data. As we might have guessed, the files in the SFFoodProgram_Complete_Data/ have more information added to the original SFBusinesses/ files. While the 'complete' data has more columns, there are actually fewer records (6353 compared to 6384) possibly due to the fact that it is harder to get the additional data for the businesses. But while a few businesses might be missing (~30), there is almost twice as much data (656KB compared to 1.2MB) in the 'complete' files if we look at byte counts.

We can endlessly explore and compare these files and contents, I encourage you to perform similar comparisons with the other extra files (SFFoodProgram_Complete_Data/) in the directories and dive deeper into each file itself. For our examination of the data, I am happy with what we have accomplished. Given these new insights, we have enough information to continue on with our analysis.

Prepare

This is typically what people refer to as data 'munging' (or 'wrangling') and often is the most tedious process when working with messy data. Due to increasing awareness of the importance of data quality, the city of SF has been making great strides in more open and accessible data. If you (the city of SF) know the format you will need going into the data collection process (inspecting restaurants) you can hopefully avoid a lot of pain later in the analysis process.

The preparation process of our analysis is not as long and cumbersome as it typically might be due to the high quality of the raw data. Because of this, I will spare you much of the tedium of this step so we can focus on the more interesting aspects of the analysis. If you want to see (and experience) the pain (all you masochists out there), we will get much deeper into data acquisition and scrubbing techniques in our data wrangling post of this series.

Transform

Now that we know the structure of our data, we can start to begin examining it statistically to get a macrosopic look at its distribution. This part of our tutorial will use much of the powerful built in functionality of NumPy, SciPy, matplotlib, and pandas. If you want to get more experience with these, there are great resources and tutorials covering these libraries in much more depth than I will here. I highly recommend taking a look at these if this analysis interests you even in the least bit.

In [13]:
'''
To perform some interesting statistical analyses, we first need to "join" our CSV files in order to associate businesses 
with their inspection scores. This data currently resides in SFBusinesses/businesses.csv and SFBusinesses/inspections.csv
'''

# import pandas library which provides an R like environment for python.
# if you do not have it installed: sudo easy_install pandas.
import pandas as pd
import scipy as sp
from scipy import stats

# store relevant file paths in variables since we may use them frequently
root_dir = 'data/SFBusinesses/'
businesses = root_dir + 'businesses.csv'
inspections = root_dir + 'inspections.csv'


# load each file into a Pandas DataFrame, pandas automatically converts the first line into a header for the columns

df_business = pd.read_csv(businesses)
df_inspection = pd.read_csv(inspections)

# inspect the first 10 rows of the DataFrame
df_inspection.head(10)
/Users/jonathandinu/anaconda/lib/python2.7/site-packages/pytz/__init__.py:35: UserWarning: Module argparse was already imported from /Users/jonathandinu/anaconda/python.app/Contents/lib/python2.7/argparse.pyc, but /Users/jonathandinu/anaconda/lib/python2.7/site-packages is being added to sys.path
  from pkg_resources import resource_stream
Out[13]:
business_id Score date type
0 10 98 20121114 routine
1 10 98 20120403 routine
2 10 100 20110928 routine
3 10 96 20110428 routine
4 10 100 20101210 routine
5 12 100 20121120 routine
6 12 98 20120420 routine
7 12 100 20111018 routine
8 12 100 20110401 routine
9 17 100 20120823 routine
In [14]:
'''
we can 'join' DataFrames just as we would database tables
pandas uses a left-outer join by default, meaning that all 
the records from the businesses will be present even if there
is not a corresponding row in the inspections.
'''

# join the two DataFrames on the 'business_id' column
big_table = df_business.merge(df_inspection, on='business_id')

# the joined DataFrame columns: frame1 columns + frame2 columns
# in our case it is the concatenation of df_business and df_inspection columns
print 'Business:\t' + str(df_business.columns) + '\n'
print 'Inspection:\t' + str(df_inspection.columns) + '\n'
print 'Big Table:\t' + str(big_table.columns)

# allows for row and column indexing succinctly
big_table.iloc[:10, :4]
Business:	Index([u'business_id', u'name', u'address', u'city', u'state', u'postal_code', u'latitude', u'longitude', u'phone_number'], dtype=object)

Inspection:	Index([u'business_id', u'Score', u'date', u'type'], dtype=object)

Big Table:	Index([u'business_id', u'name', u'address', u'city', u'state', u'postal_code', u'latitude', u'longitude', u'phone_number', u'Score', u'date', u'type'], dtype=object)
Out[14]:
business_id name address city
0 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco
1 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco
2 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco
3 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco
4 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco
5 12 KIKKA 250 EMBARCADERO 7/F San Francisco
6 12 KIKKA 250 EMBARCADERO 7/F San Francisco
7 12 KIKKA 250 EMBARCADERO 7/F San Francisco
8 12 KIKKA 250 EMBARCADERO 7/F San Francisco
9 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco

Now that we have our joined data, we can start exploring it

In [15]:
# let us first group our data by business so we can find its most recent score for the inspections

grouped_business = big_table.groupby('business_id')

# a funtion that takes a DataFrame and returns the row with the newest date
def most_recent(df, column='date'):
    return df.sort_index(by=column)[-1:]
    
# input to most_recent is the DataFrame of each group, in this case 
# all of the rows and columns for each business (grouped on business_id). 
most_recent_inspection_results = grouped_business.apply(most_recent)
 
# We applied the most_recent function to extract the row
# of the DataFrame with the most recent inspection.
most_recent_inspection_results.head()
Out[15]:
business_id name address city state postal_code latitude longitude phone_number Score date type
business_id
10 0 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 98 20121114 routine
12 5 12 KIKKA 250 EMBARCADERO 7/F San Francisco CA 94105 37.788613 -122.393894 NaN 100 20121120 routine
17 9 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 14155531470 100 20120823 routine
19 13 19 NRGIZE LIFESTYLE CAFE 1200 VAN NESS AVE, 3RD FLOOR San Francisco CA 94109 37.786848 -122.421547 NaN 100 20121127 routine
24 18 24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY 500 CALIFORNIA ST, 2ND FLOOR San Francisco CA 94104 37.792888 -122.403135 NaN 100 20121018 routine
In [16]:
# Filter out records without lat/long for mapping
r = most_recent_inspection_results

zero_filtered = r[(r['latitude'] != 0) & (r['latitude'] != 0)]

filtered = zero_filtered.dropna(subset=['latitude', 'longitude'])[['business_id','name', 'address', 'Score', 'date', 'latitude', 'longitude']]

filtered.to_csv('geolocated_rest.csv', index=False)
In [17]:
Image(url='http://inundata.org/R_talks/meetup/images/splitapply.png', width=500)
Out[17]:

Split-Apply-Combine

A visual representation of how group-by, aggregate, and apply semantics work

We can bin the restaurants by scores to understand the distribution of inspections better. Here we create a histogram to understand the distribution of scores better

In [18]:
from scipy.stats import expon

# create a matplotlib figure with size [15,7]
figure(figsize=[15,7])

# pandas built-in histogram function automatically distributes and counts bin values 
h = most_recent_inspection_results['Score'].hist(bins=100)

# create x-axis ticks of even numbers 0-100
xticks(np.arange(40, 100, 2))

# add a title to the current figure, our histogram
h.set_title("Histogram of Inspection Scores")
Out[18]:
<matplotlib.text.Text at 0x10ad87810>
In [19]:
# Now that we have a basic idea of the distribution, let us look at some more interesting statistics
scores =  most_recent_inspection_results['Score']
mean = scores.mean()
median = scores.median()

# compute descriptive summary statistics of the inspection scores
summary = scores.describe()

mode = sp.stats.mode(scores)
skew = scores.skew()

# compute quantiles
ninety = scores.quantile(0.9)
eighty = scores.quantile(0.8)
seventy = scores.quantile(0.7)
sixty = scores.quantile(0.6)

print "Skew: " + str(skew)
print "90%: " + str(ninety)
print "80%: " + str(eighty)
print "70%: " + str(seventy)
print "60%: " + str(sixty)
print summary
Skew: -1.38406008151
90%: 100.0
80%: 100.0
70%: 98.0
60%: 96.0
count    5824.000000
mean       91.590659
std         8.296706
min        44.000000
25%        88.000000
50%        94.000000
75%        98.000000
max       100.000000
dtype: float64
In [20]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/solutions.png', width=500)
Out[20]:

Propose Solutions

Since we have explored our data and have a better idea of its nature, we can begin to devise a plan to answer our questions. This is usually the most iterative part of the entire process: as we learn more about our data we modify our approach, and as modify our solutions we must re-examine our data.

Goals:

How does an individual restaurants' score compare to the whole/aggregate of SF?

Are SF's inspections better or worse than other cities?

If a restaurant has not yet been inspected, can we approximate/predict what score it will receive?

Solutions:

Collect summary statistics (mean, median, standard deviation, etc.) about distribution of scores.

Acquire data on inspection scores for other cities, compare distribution of cities.

Perform a linear regression on historic data on past inspections combined with scores from other 'similar' restaurants.

Some things to note about formulating solutions:

  • Ask, how can I address the problem?
  • In what ways can I use the data to achieve my goals?
  • The simplest solution is often best (and the one you should try first)
  • Quantize the metrics needed for your analysis
In [21]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/metrics.png', width=500)
Out[21]:

Collect Metrics

This is the step where derivative values are often calculated, including summary statistics, transformations on the data, and correlations. There also is a bit of traditional data mining involved as most machine learning occurs in the solutions and metrics stages (in our formulation). We could even go so far as to say that the results of predictive models are simply additional metrics: the probability of defaulting on a loan, the cluster a new product belongs in, or the score of a restaurant that hasn't been inspected yet.

The purpose of this part of the process is to calculate the information you need to begin evaluating and testing you solutions and hypotheses.

In [22]:
# recall that in the Score Legend, each numeric score corresponds to a more qualitative description
!head -n 5 data/SFBusinesses/ScoreLegend.csv | column -t -s ','




In [23]:
'''
let us compute a histogram of these descriptions as well
'''

# first we need to discretize the numerical values, this can be 
# thought of as converting a continuous variable into a categorical one.
descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]

# copy the scores from our grouped DataFrame, DataFrames manipulate
# in place if we do not explicitly copy them.
scores = most_recent_inspection_results['Score'].copy()
score_transform = most_recent_inspection_results.copy()

# built-in pandas funcion which assigns each data point in 
# 'scores' to an interval in 'bins' with labels of 'descriptions'
discretized_scores = pd.cut(scores, bins ,labels=descriptions)
In [24]:
# tranform the original DataFrame's "Score" column with the new descriptions
score_transform['Score'] = discretized_scores

score_transform[['name', 'date','Score']].head(15)
Out[24]:
name date Score
business_id
10 0 TIRAMISU KITCHEN 20121114 Good
12 5 KIKKA 20121120 Good
17 9 GEORGE'S COFFEE SHOP 20120823 Good
19 13 NRGIZE LIFESTYLE CAFE 20121127 Good
24 18 OMNI S.F. HOTEL - 2ND FLOOR PANTRY 20121018 Good
29 23 CHICO'S PIZZA 20121228 Adequate
31 28 NORMAN'S ICE CREAM AND FREEZES 20121217 Good
37 33 CAFE BISTRO 20130130 Good
40 38 MO'S COFFEE BAR 20121213 Good
45 42 CHARLIE'S DELI CAF� 20121126 Good
48 48 ART'S CAFE 20130327 Adequate
50 53 SUSHI ZONE 20130328 Good
54 59 RHODA GOLDMAN PLAZA 20130326 Good
56 64 CAFE X + O 20130327 Good
58 69 OASIS GRILL 20130207 Good

By quantizing the scores of the restaurant inspections, we can get a better qualitative insight into the ratings. Let us compare this new distribution of quantized scores to the raw numeric values.

In [25]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/evaluate.png', width=500)
Out[25]:

Evaluate

With the metrics we need properly calculated, it is time to draw some conclusions from our analyses. We need to evaluate whether the result we have arrived at:

  • Answers our original question to an acceptable level of confidence.
  • Has allowed us to achieve our goals?
In [26]:
'''
plot a histogram of the discretized scores
'''

# create a figure with 2 subplots
fig = figure(figsize=[30,7])
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

# count each occurance of descriptions in the 'Score' column,
# and reverse this result so 'Poor' is left most and 'Good' right most
counts = score_transform['Score'].value_counts()[::-1]
plt = counts.plot(kind='bar', ax=ax2)

# decorate the plot and axis with text
ax2.set_title("Restaurant Inspections (%i total)" % sum(counts))
ax2.set_ylabel("Counts")
ax2.set_xlabel("Description")

# let us add some labels to each bar
for x, y in enumerate(counts):
    plt.text(x + 0.5, y + 200, '%.f' % y, ha='left', va= 'top')
    
# plot the original raw scores (same grapoh as earlier)
most_recent_inspection_results['Score'].hist(bins=100, ax= ax1)
# create x-axis ticks of even numbers 0-100
ax1.set_xticks(np.arange(40, 100, 2))

# add a title to the current figure, our histogram
ax1.set_title("Histogram of Inspection Scores")
ax1.set_ylabel("Counts")
ax1.set_xlabel("Score")

savefig('histograms.png', bbox_inches=0)

We can see that a majority of restaurants are adequate or good, according to the quantiles only 25% have scores less than 88. While the histogram of the numeric scores gives us a more granular look at the data, it can be quite difficult to derive value from it. Is an 86 a filthy/unhealthy restaurant or did it simply forget a few nuanced inspection rules? The Score Legend provides us a mapping from a raw score to a meaningful value, similar to the scaling of standardized test raw scores.

If we are not satisfied with our evaluation, we need to iterate on our approach:

  • Do I need more/better data?
  • Do I need to try a different proposed solution?
  • Do I need to calculate different metrics?
In [27]:
# create a matplotlib figure with size [15,7]
figure(figsize=[15,7])

# pandas built-in histogram function automatically distributes and counts bin values 
h = most_recent_inspection_results['Score'].hist(bins=100)

# summary statistics vertical lines
axvline(x=mean,color='red',ls='solid', lw="3", label="mean")
axvline(x=median,color='green',ls='solid', lw="3", label="median")
axvline(x=mode[0][0],color='orange',ls='solid', lw="3", label="mode")

# 25th quantile
axvline(x=summary['25%'],color='maroon',ls='dashed', lw="3", label="25th")
axvspan(40, summary['25%'], facecolor="maroon", alpha=0.3)

# 75th quantile
axvline(x=summary['75%'],color='black',ls='dashed', lw="3", label="75th")
axvspan(40, summary['75%'], facecolor="yellow", alpha=0.3 )

# create x-axis ticks of even numbers 0-100
xticks(np.arange(40, 104, 2))

# add legend to graph
legend(loc=2)

# add a title to the current figure, our histogram
h.set_title("Histogram of Inspection Scores")

savefig('quantiles.png', bbox_inches=0, transparent=True)

print summary
count    5824.000000
mean       91.590659
std         8.296706
min        44.000000
25%        88.000000
50%        94.000000
75%        98.000000
max       100.000000
dtype: float64

Iterate

Now that we have a general idea of the distribution of these scores, let us see if we can find any correlation between score ranges and health violations.

In [28]:
import re as re
import collections as c
import pprint as pp

# first let us form a 'big table' by joining the violations to the most recent inspection scores
file="data/SFFoodProgram_Complete_Data/violations_plus.csv"

df_violations = pd.read_csv(file)

violation_table = most_recent_inspection_results.merge(df_violations, on=['business_id','date'])
violation_table.head()
Out[28]:
business_id name address city state postal_code latitude longitude phone_number Score date type violation_type_id ViolationSeverity description
0 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 98 20121114 routine 54 Minor Unclean or degraded floors walls or ceilings
1 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 14155531470 100 20120823 routine 42 Minor Unclean nonfood contact surfaces
2 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 14155531470 100 20120823 routine 24 Minor Inadequately cleaned or sanitized food contact...
3 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 14155531470 100 20120823 routine 3 Minor High risk food holding temperature
4 29 CHICO'S PIZZA 131 06TH ST San Francisco CA 94103 37.774722 -122.406761 14155251111 87 20121228 routine 42 Minor Unclean nonfood contact surfaces
In [29]:
# Let us see how the violations are distributed
figure(figsize=[18,7])

violation_hist = violation_table['description'].value_counts().plot(kind="bar")
In [30]:
# Let us see what violations a restaurant can have and still get a perfect score

figure(figsize=[18,7])

perfect_scores = violation_table[violation_table['Score'] == 100]

violation_hist = perfect_scores['description'].value_counts().plot(kind="bar")

perfect_scores
Out[30]:
business_id name address city state postal_code latitude longitude phone_number Score date type violation_type_id ViolationSeverity description
1 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 14155531470 100 20120823 routine 42 Minor Unclean nonfood contact surfaces
2 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 14155531470 100 20120823 routine 24 Minor Inadequately cleaned or sanitized food contact...
3 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 14155531470 100 20120823 routine 3 Minor High risk food holding temperature
5580 5312 THE WATERFRONT PIER 9 San Francisco CA 94111 NaN NaN NaN 100 20120912 routine 54 Minor Unclean or degraded floors walls or ceilings
7747 28016 GRIDIRON BAR MONSTER PARK MEZZ LEVEL San Francisco CA NaN 0.000000 0.000000 NaN 100 20130112 routine 33 Minor Foods not protected from contamination
8857 36744 EL CASTILLITO TAQUERIA STAR 250 GOLDEN GATE AVE San Francisco CA 94102 37.781788 -122.414704 NaN 100 20130426 routine 19 Minor Inadequate and inaccessible handwashing facili...
8858 36744 EL CASTILLITO TAQUERIA STAR 250 GOLDEN GATE AVE San Francisco CA 94102 37.781788 -122.414704 NaN 100 20130426 routine 42 Minor Unclean nonfood contact surfaces
8859 36744 EL CASTILLITO TAQUERIA STAR 250 GOLDEN GATE AVE San Francisco CA 94102 37.781788 -122.414704 NaN 100 20130426 routine 45 Minor Improper storage of equipment utensils or linens
8860 36744 EL CASTILLITO TAQUERIA STAR 250 GOLDEN GATE AVE San Francisco CA 94102 37.781788 -122.414704 NaN 100 20130426 routine 56 Minor Permit license or inspection report not posted
8861 36744 EL CASTILLITO TAQUERIA STAR 250 GOLDEN GATE AVE San Francisco CA 94102 37.781788 -122.414704 NaN 100 20130426 routine 5 Minor Improper cooling methods
9216 38231 WINES OF CALIFORNIA WINE BAR PIER 39 SPACE N-111 San Francisco CA 94133 37.808240 -122.410189 NaN 100 20120601 routine 33 Minor Foods not protected from contamination
9217 38231 WINES OF CALIFORNIA WINE BAR PIER 39 SPACE N-111 San Francisco CA 94133 37.808240 -122.410189 NaN 100 20120601 routine 14 Minor High risk vermin infestation
12095 66254 GAMBINO'S 1 MARKET ST San Francisco CA 94105 37.794347 -122.394932 NaN 100 20130301 routine 20 Minor Moderate risk food holding temperature
13883 69590 HOTEL ADAGIO 550 GEARY ST San Francisco CA 94102 NaN NaN 14155775000 100 20121001 routine 19 Minor Inadequate and inaccessible handwashing facili...
13884 69590 HOTEL ADAGIO 550 GEARY ST San Francisco CA 94102 NaN NaN 14155775000 100 20121001 routine 57 Minor Food safety certificate or food handler card n...
13885 69590 HOTEL ADAGIO 550 GEARY ST San Francisco CA 94102 NaN NaN 14155775000 100 20121001 routine 43 Minor Inadequate warewashing facilities or equipment
In [31]:
violation_table['description'].value_counts()[:10]
Out[31]:
Unclean or degraded floors walls or ceilings                    1321
Unclean nonfood contact surfaces                                1068
Inadequate and inaccessible handwashing facilities               915
Unapproved or unmaintained equipment or utensils                 880
Wiping cloths not clean or properly stored or inadequate sanitizer     835
Moderate risk food holding temperature                           798
Inadequately cleaned or sanitized food contact surfaces          765
Improper food storage                                            695
Foods not protected from contamination                           634
Moderate risk vermin infestation                                 515
dtype: int64
In [32]:
# Hmmm, apparently high risk vermin infestations are minor violations
# If that is minor, what is a severe violation

df_violations['ViolationSeverity'].drop_duplicates()

# well aparently there are only minor violations
Out[32]:
0    Minor
Name: ViolationSeverity, dtype: object
In [33]:
# Let us bin health violations using the cities quantizations

descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]

# copy the scores from our grouped DataFrame, DataFrames manipulate
# in place if we do not explicitly copy them.
scores = violation_table['Score'].copy()
violation_transform = violation_table.copy()

# built-in pandas funcion which assigns each data point in 
# 'scores' to an interval in 'bins' with labels of 'descriptions'
discretized_scores = pd.cut(scores, bins ,labels=descriptions)
violation_transform["Scores"] = discretized_scores
In [34]:
grouped = violation_transform.groupby('Scores')
In [35]:
# let us find the most common violations for each group

# a funtion that takes a DataFrame and returns the top violations
def common_offenses(df):
    return pd.DataFrame(df['description'].value_counts(normalize=True) * 100).head(10)

top_offenses = grouped.apply(common_offenses)
top_offenses
Out[35]:
0
Scores
Adequate Unclean or degraded floors walls or ceilings 8.720770
Unclean nonfood contact surfaces 7.482806
Inadequate and inaccessible handwashing facilities 7.180193
Moderate risk food holding temperature 6.987620
Inadequately cleaned or sanitized food contact surfaces 6.244842
Unapproved or unmaintained equipment or utensils 6.217331
Wiping cloths not clean or properly stored or inadequate sanitizer 6.052270
Improper food storage 4.869326
Moderate risk vermin infestation 4.401651
Foods not protected from contamination 4.236589
Good Unclean or degraded floors walls or ceilings 11.259932
Unclean nonfood contact surfaces 8.059024
Wiping cloths not clean or properly stored or inadequate sanitizer 7.219069
Inadequate and inaccessible handwashing facilities 6.220204
Unapproved or unmaintained equipment or utensils 6.197503
Moderate risk food holding temperature 5.947787
Food safety certificate or food handler card not available 5.380250
Improper food storage 5.357548
Inadequately cleaned or sanitized food contact surfaces 4.926220
Permit license or inspection report not posted 3.564132
Needs Improvement Unclean or degraded floors walls or ceilings 7.992315
Unclean nonfood contact surfaces 6.878002
Inadequate and inaccessible handwashing facilities 5.859750
Unapproved or unmaintained equipment or utensils 5.840538
Foods not protected from contamination 5.360231
Inadequately cleaned or sanitized food contact surfaces 5.341018
High risk food holding temperature 4.918348
Wiping cloths not clean or properly stored or inadequate sanitizer 4.726225
Moderate risk food holding temperature 4.630163
Improper food storage 4.380403
Poor Unclean or degraded floors walls or ceilings 6.932931
Unclean or unsanitary food contact surfaces 6.631500
Unclean nonfood contact surfaces 6.254710
High risk vermin infestation 6.028636
Unapproved or unmaintained equipment or utensils 5.802562
Foods not protected from contamination 5.727204
Inadequate and inaccessible handwashing facilities 5.651846
High risk food holding temperature 5.275057
Inadequate food safety knowledge or lack of certified food safety manager 5.048983
Improper food storage 4.069329
In [36]:
f = figure(figsize=[18,7])
colors = ['r', 'b', 'y', 'g']

for name, group in grouped:
    group['description'].value_counts().plot(kind="bar", axes=f, alpha=0.5, color=colors.pop())
In [37]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/communicate.png', width=500)
Out[37]:

Communicate Results

An important part of visualization that is often overlooked is informational accessibility, or rather how interpretable are the results. Interactivity can go a long way towards making the content of your visualization much more digestable by allowing the consumer to 'discover' the data at their own rate.

Each restaurant is geographically binned using the D3.js hexbin plugin. The color gradient of each hexagon reflects the median inspection score of the bin, and the radius of the hexagon is proportional to the number of restaurants in the bin. Binning is first computed with a uniform hexagon radius over the map, and then the radius of each individual hexagon is adjusted for how many restaurants ended up in its bin.

Large blue hexagons represent many high scoring restaurants and small red mean a few very poorly scoring restaurants. The controls on the map allow users to adjust the radius (Bin:) of the hexagon for computing the binning as well as the range (Score:) of scores to show/use on the map. The color of the Bin: slider represents the median color of the Score: range and its size represents the radius of the hexagons. The colors of each of the Score: sliders represent the threshold color for that score, i.e. if the range is 40 - 100 the left slider's color corresponds to a score of 40 and the right slider to a score of 100. The colors for every score in-between are computed using a linear gradient.

 

Credits

Thank you to Yelp and Code for America for spearheading the initiative to make civic data more open. And this would not have been possible without the city of SF and all the health inspectors, thank you for keeping us safe.

This has been a production of Zipfian Academy