Final Project

Kevin Li

Prithvi Narasimhan

Rajiv Pasricha

Andy Zhang

Matthew Ho

Introduction and Background

Crime is unfortunately prevalent in Chicago. With a crime index of 11, Chicago is considered more dangerous than 89% of cities in the United States. Per 1,000 residents, Chicago has a violent crime rate of 9.08 and a property crime rate of 30.15 (https://www.neighborhoodscout.com/il/chicago/crime). Since we believe that quality of life is closely tied to the socioeconomic status and crime rate of a community, we wanted to see whether the quality of certain facilities in Chicago correlates with the crime rate in the surrounding area. We chose to focus on food-centered facilities, primarily restaurants located in the city of Chicago.

In Chicago alone, there are more than 7,300 restaurants spread across 77 communities containing more than 100 neighborhoods, within a city area of 234 square miles (https://www.cityofchicago.org/city/en/about/facts.html). Because Chicago is such a large city with so many distinct communities, we expected its data to be plentiful and fine-grained enough to give a detailed picture of the communities within it.

Since people of lower socioeconomic status tend to have less access to high-quality, nutritious food (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4231366/), we wanted to extend this idea and ask whether such people also have less access to good-quality food in restaurants. To narrow the scale of the analysis, we chose Chicago, a varied city in which crime rates differ sharply from one area to another.

In [1]:
# Imports
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Number of decimal places to round latitude / longitude
num_decimal_places = 2

Datasets

Yelp data

Although restaurant health data is a good indicator of the overall safety of eating at a restaurant, a safety inspection alone cannot quantify the perceived quality of a restaurant to a real consumer. To address this, we use reviews from Yelp, a public website that crowdsources restaurant reviews from millions of users, as a way to quantitatively measure the "value" of a particular restaurant from the perspective of the local community. Although Yelp provides public data on restaurant reviews, the company does not provide Chicago-specific data in an easily digestible CSV or JSON file. Since our project focuses on trends within the Chicago area, our team instead used data collected through Yelp's open API by a third party. The author of the GitHub repository "jpvelez/restaurant_inspection_analysis" provides a JSON-formatted file containing review data and location information for 2,800 restaurants, all ingested through Yelp's open API (the file can be found at https://github.com/jpvelez/restaurant_inspection_analysis/blob/master/data/yelp_restaurants_0-2800.json).
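For reference, a JSON dump like this could be flattened into the CSV loaded below with a few lines of pandas. This is only a sketch: the nested field names ('location.coordinate.latitude', 'review_count', 'rating') are assumptions based on Yelp API v2 business objects, not verified against the file.

In [ ]:
# Sketch: flatten the Yelp JSON dump into a flat CSV.
# The nested field names below are assumptions based on Yelp API v2
# business objects; adjust them to match the actual file.
import json

with open('yelp_restaurants_0-2800.json') as f:
    businesses = json.load(f)

df_raw = pd.json_normalize(businesses)
df_raw = df_raw.rename(columns={
    'location.coordinate.latitude': 'Latitude',
    'location.coordinate.longitude': 'Longitude',
    'review_count': 'Review Count',
    'rating': 'Rating',
})
df_raw.to_csv('2646.csv', index=False)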

The Yelp data set consists of a list of Chicago restaurants with Yelp reviews. Each restaurant record has hundreds of attributes, although we only care about a few. Most relevant to this project, each restaurant has a location (latitude / longitude), which can be combined with other data sets to determine a general location area for the restaurant (see the section on neighborhoods). Such a location area lets us identify trends between different areas of the city. The Yelp data set also provides the number of reviews for each restaurant. We hypothesize that the review count is an indicator of foot traffic, under the assumption that people across demographic groups use Yelp as a review platform with similar frequency. The review count is also an indicator of how reliable a given Yelp rating is, since additional reviews mean more data behind the numeric rating. Finally, the data contains the Yelp rating itself, a quantitative representation of the community's perception of a restaurant. We hypothesize that the rating will correlate closely with the restaurant health data and can therefore serve as a secondary metric for identifying trends between restaurant quality and crime. At the same time, as discussed above, the rating could be skewed by the income demographics of the Yelp user base.

In [2]:
# Read the pre-processed Yelp review data
df_yelp = pd.read_csv('2646.csv').dropna()
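As a quick sanity check, we can peek at the handful of columns the analysis relies on. 'Latitude' and 'Longitude' are used in the bucketing step below; 'Rating' and 'Review Count' are assumed names for the review fields in this CSV.

In [ ]:
# Peek at the columns this analysis relies on. 'Rating' and
# 'Review Count' are assumed column names; 'Latitude' / 'Longitude'
# are used in the bucketing step below.
df_yelp[['Latitude', 'Longitude', 'Rating', 'Review Count']].head()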

Restaurant health data

In addition to Yelp reviews, we hypothesized that restaurant health data would be a clear secondary data source. The restaurant health data can be found at https://query.data.world/s/6z85omejrs5j4guql6w73zqr. We hypothesize that there will be a correlation between restaurant health data and crime: specifically, that areas with higher crime, which may have less per capita wealth, will have more restaurants failing their health inspections. As with the Yelp data, the latitude / longitude pair in each entry will let us determine, during data cleaning, the neighborhood of a particular restaurant. The neighborhood is a more relevant unit than a raw latitude and longitude, since neighborhoods are social groupings of areas with similar demographics. The data also contains a facility type for each entry, covering restaurants, grocery stores, hospitals, and other venues subject to inspection. We hypothesize that the correlations will center around restaurants, grocery stores, and centers of care (such as day cares). Finally, we are given a 3-level risk assessment for each facility.

We hypothesize that this information can help uncover trends between crime data and the overall quality of a restaurant from the perspective of a safety inspector. It is important to note that a restaurant rated "HIGH" risk can still pass its inspection; the lack of granularity in this field, and the lack of clear criteria for what each level means, could be a cause for concern. We are also given the inspection result, a binary value identifying whether a restaurant passed or failed. This gives us a solid metric, whether or not a restaurant is sanitary enough to serve food for human consumption, for identifying trends between crime and restaurant quality. In addition, we are given the inspection date, which helps us examine the seasonality of restaurant trends. One hypothesis is that restaurant quality correlates with time of year, since restaurants may be less safe during busy tourist seasons than during low-traffic seasons. Comparing this with the crimes occurring in each season could yield relevant insight, although the metric may be biased: restaurant inspections do not appear to follow a specific seasonal pattern, and a restaurant that is bad in winter may simply be a bad restaurant overall.

In [3]:
# Read the food-safety data
df_food = pd.read_csv('Food_Inspections_-_Map.csv').dropna()
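The seasonality hypothesis above can be eyeballed directly from the raw file. A minimal sketch, assuming the raw 'Results' column holds strings such as 'Pass' and 'Fail' (it is converted to a binary flag during cleaning below):

In [ ]:
# Sketch: inspection pass rate by calendar month, to eyeball seasonality.
# Assumes 'Results' still holds raw strings like 'Pass' / 'Fail' here.
months = pd.to_datetime(df_food['Inspection Date']).dt.month
pass_rate_by_month = (df_food['Results'] == 'Pass').groupby(months).mean()

pass_rate_by_month.plot(kind='bar')
plt.xlabel('Month')
plt.ylabel('Fraction of inspections passed')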

Crime data

The crime data comprises three data sets, one for each of three years of crimes in Chicago. The data sets can be accessed here: https://query.data.world/s/1dqli2sw6l2a1pk3xgrgxcz6s, https://query.data.world/s/8yjl03dp8xtu2kbcj86r8gb3x, https://query.data.world/s/ddg3u1rd1kedlmzz2j6k00pm8. As with the food-related data sets, the crime data contains a latitude and longitude, which, as mentioned above, can be used to determine the neighborhood in which a crime took place. The data set also provides a description of the type of crime that occurred; we hypothesize that understanding the crime type will let us identify trends between certain crime types. We are also given the date of each crime. We hypothesize that the date can be used to identify seasonal trends and, combined with demographics from the census data set, can provide insight into how different demographic groups interact with restaurants and how those interactions correlate with crime. The data set additionally includes a location description, i.e., the environment in which a given crime occurred (such as an apartment or a bar/tavern). We hypothesize that the crime location will correlate with food safety and food reviews, since crimes that occur outdoors may lead to a sharper decline in restaurant reviews than crimes that occur inside an apartment. Overall, combined with the food data sets, the crime data should let us identify trends between food quality, perceived food quality, food safety, and the types of crime that occur within Chicago communities. Since Chicago is notorious for crime, we expect it to provide especially granular data on how crime can shape the culture of a neighborhood; in this analysis, we use restaurants, as places of community gathering, as a proxy for culture.

In [4]:
df_crime1 = pd.read_csv('chicago_crime_2014.csv')
df_crime2 = pd.read_csv('chicago_crime_2015.csv')
df_crime3 = pd.read_csv('chicago_crime_2016.csv')

# Combine 3 years of crime data
df_crime = pd.concat([df_crime1, df_crime2, df_crime3]).dropna()
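Before cleaning, the mix of crime types discussed above can be inspected directly. A sketch, assuming these extracts follow the City of Chicago schema, in which the crime category lives in a 'Primary Type' column:

In [ ]:
# Sketch: the ten most frequent crime types across all three years.
# 'Primary Type' is the City of Chicago column for the crime category
# (an assumption about these particular extracts).
df_crime['Primary Type'].value_counts().head(10)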

Census data

In addition to the restaurant and crime data, we use data from the official United States census. We hypothesize that the census data will let us compute more granular statistics from the broad crime and food data. The census data is grouped by neighborhood, a grouping that we expect will make trends easier to identify, and it provides demographic information for each neighborhood. Each row includes a population count, which lets us normalize the food and crime data on a per capita basis; without such normalization, high-population neighborhoods would show higher crime counts given any baseline crime rate, and normalizing removes this population-based skew. The census also provides home occupancy and average household size. We hypothesize that vacant households may signal a weak local economy and more crime, and that household size may serve as an indicator of population density, which could yield further insights. The census data additionally gives a median age for each neighborhood. Median age lets us make inferences about food quality (we hypothesize that younger people tend to have less spending power) and about crime, since age may correlate with crime level and may therefore be a factor we need to normalize for in our analysis. Finally, the census provides demographic breakdowns of each neighborhood, but due to the limited sample size for certain demographics in some census districts, we decided not to analyze this granular demographic data. It is important to note that the census data is from 2010, while the crime and food data is from 2014 onward, so there may be some inconsistencies due to demographic and neighborhood changes. The data file is available at https://datahub.cmap.illinois.gov/dataset/5700ba1a-b173-4391-a26e-48b198e830c8/resource/b30b47bf-bb0d-46b6-853b-47270fb7f626/download/CCASF12010CMAP.xlsx.

In [5]:
# Read the census data
df_census = pd.read_excel('CCASF12010CMAP.xlsx')

Data Cleaning / Pre-processing

For each data set (crime, restaurant health, and Yelp), we pre-process the data using a separate open source library to identify the neighborhood that a given (latitude, longitude) location falls in. All of the data sets we use contain latitude and longitude, and we wanted to map these coordinates onto the cultural boundaries of Chicago, i.e., to determine the neighborhood associated with a given latitude and longitude. We used a forked version of the open source GitHub repository craigmbooth/chicago_neighborhood_finder, which uses census data on neighborhood boundaries and, given a CSV with latitude and longitude columns, determines the neighborhood associated with each location. In our fork, available on GitHub at "kevinjli/chicago_neighborhood_finder", we edited the source code so that the library outputs the neighborhood information as a CSV instead of a JSON object. Although we do not run this library in our IPython notebook, the input data files were pre-processed with it before the notebook runs.
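For illustration, the core of such a neighborhood lookup is a point-in-polygon test against neighborhood boundary polygons. A minimal sketch using shapely (not necessarily what the fork uses internally; the 'neighborhoods.geojson' file and 'PRI_NEIGH' property name are hypothetical):

In [ ]:
# Sketch of the point-in-polygon step behind the neighborhood finder.
# shapely is used here for illustration only; the file name
# 'neighborhoods.geojson' and property 'PRI_NEIGH' are hypothetical.
import json
from shapely.geometry import Point, shape

with open('neighborhoods.geojson') as f:
    features = json.load(f)['features']

def find_neighborhood(lat, lon):
    point = Point(lon, lat)  # GeoJSON uses (longitude, latitude) order
    for feature in features:
        if shape(feature['geometry']).contains(point):
            return feature['properties']['PRI_NEIGH']
    return None

find_neighborhood(41.8919, -87.6051)  # a point near Navy Pier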

For the Yelp data, we round the latitude and longitude in order to get "buckets" of nearby coordinates.

In [6]:
# Save the raw values for future use
df_yelp_unrounded = df_yelp.copy()

df_yelp['Latitude'] = df_yelp['Latitude'].round(num_decimal_places)
df_yelp['Longitude'] = df_yelp['Longitude'].round(num_decimal_places)

# A consistent string to represent the location bucket
df_yelp['Location'] = df_yelp['Latitude'].astype(str) + ',' + df_yelp['Longitude'].astype(str)
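Rounding to two decimal places gives buckets of a predictable physical size: a degree of latitude is roughly 111 km everywhere, while a degree of longitude shrinks with the cosine of the latitude. At Chicago's latitude, each bucket is therefore roughly 1.1 km tall and 0.8 km wide:

In [ ]:
# Approximate physical size of a 0.01-degree bucket at Chicago's latitude
chicago_lat = 41.88  # degrees north (approximate city center)
km_per_deg_lat = 111.32
km_per_deg_lon = 111.32 * np.cos(np.radians(chicago_lat))

print('Bucket height: %.2f km' % (0.01 * km_per_deg_lat))  # ~1.11 km
print('Bucket width:  %.2f km' % (0.01 * km_per_deg_lon))  # ~0.83 km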

For food safety, we only want to consider inspections in 2014-2016 because those are the years for which we have crime data.

In [7]:
df_food['Inspection Date'] = pd.to_datetime(df_food['Inspection Date'])
df_food = df_food[(df_food['Inspection Date'] >= '2014-01-01') & (df_food['Inspection Date'] <= '2016-12-31')]

We drop columns we won't be using, and convert the risk and results columns into a more machine-usable format.

In [8]:
df_food = df_food.drop(['DBA Name', 'AKA Name', 'License #', 'State', 'City', 'Zip', 'Violations', 'Location', 'Address', 'Unnamed: 0'], axis=1)

# Extract the risk level (e.g. 'High') from strings like 'Risk 1 (High)'
df_food['Risk'] = df_food['Risk'].apply(lambda x: x.split(' ')[2][1:-1])

# Convert the "Results" column to binary: 1 if the inspection passed, 0 otherwise
df_food['Results'] = df_food['Results'].apply(lambda x: int(x == 'Pass'))

Again we round the latitude and longitude in order to bucket.

In [9]:
# Save the raw values for future use
df_food_unrounded = df_food.copy()

df_food['Latitude'] = df_food['Latitude'].round(num_decimal_places)
df_food['Longitude'] = df_food['Longitude'].round(num_decimal_places)

# A consistent string to represent the location bucket
df_food['Location'] = df_food['Latitude'].astype(str) + ',' + df_food['Longitude'].astype(str)

For the crime data, again drop columns we won't be using.

In [10]:
# Save the raw values for future use
df_crime_unrounded = df_crime.copy()

# Remove irrelevant columns
df_crime = df_crime.drop(['Block', 'ID', 'Case Number', 'Domestic', 'Arrest', 'Beat', 'Ward', 'Community Area', 'FBI Code', 'IUCR', 'Description', 'District'], axis=1)

We also remove any crimes that occurred outside of Chicago, which for some reason appeared in the raw data.

In [11]:
df_crime = df_crime[(df_crime['Latitude'] > 41) &
                    (df_crime['Latitude'] < 43) &
                    (df_crime['Longitude'] > -88) &
                    (df_crime['Longitude'] < -87)]

Again we round the latitude and longitude in order to bucket.

In [12]:
df_crime['Latitude'] = df_crime['Latitude'].round(num_decimal_places)
df_crime['Longitude'] = df_crime['Longitude'].round(num_decimal_places)

# A consistent string to represent the location bucket
df_crime['Location'] = df_crime['Latitude'].astype(str) + ',' + df_crime['Longitude'].astype(str)

For the census data, we remove the top row, which contains the English descriptions of each column, and then apply our own readable names in place of the encoded column names.

In [13]:
# Remove the first row of English descriptions
df_census = df_census.iloc[1:]

# Rename the columns we will use
df_census['Neighborhood'] = df_census['GEOGNAME']
df_census['Total Population'] = df_census['P0050001']
df_census['Average Age'] = df_census['P0130001']
df_census['Average Household Size'] = df_census['P0170001']
df_census['Percent Housing Occupied'] = (df_census['H0030002'].astype(int) / df_census['H0030001'].astype(int)) * 100

# Remove all other columns
df_census = df_census[['Neighborhood','Total Population', 'Average Age', 'Average Household Size', 'Percent Housing Occupied']]
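With the census table reduced to these columns, the per capita normalization discussed in the census section is a merge and a divide. A sketch, assuming the pre-processed crime CSV carries a 'Neighborhood' column added by the neighborhood finder step:

In [ ]:
# Sketch: crimes per 1,000 residents by neighborhood. Assumes the
# pre-processed crime CSV includes a 'Neighborhood' column produced
# by the neighborhood finder described above.
crime_counts = (df_crime.groupby('Neighborhood').size()
                        .rename('Crime Count').reset_index())
df_norm = df_census.merge(crime_counts, on='Neighborhood')
df_norm['Crimes per 1,000 Residents'] = (
    1000 * df_norm['Crime Count'] / df_norm['Total Population'].astype(int)
)
df_norm.sort_values('Crimes per 1,000 Residents', ascending=False).head()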

Data Visualization

First, we visualize the collected crime data by plotting all of the observed crime locations. Interestingly, with a small enough point size, simply plotting the latitude and longitude values produces a detailed map of the City of Chicago. The map traces the overall street layout of the city and shows where crimes are concentrated. Crimes are most heavily concentrated in the downtown region, and their frequency decreases with distance from the city center. On the outskirts of the city, corresponding to more suburban areas, the dots are much sparser, indicating a lower frequency and density of crimes.

In [14]:
# Scatter plot of where observed crimes occur in Chicago
plt.figure(figsize=(10, 10))
plt.scatter(df_crime_unrounded['Longitude'], df_crime_unrounded['Latitude'], s=0.02)

axes = plt.gca()
axes.set_xlabel('Longitude')
axes.set_ylabel('Latitude')
Out[14]:
[Figure: scatter plot of crime locations, tracing the street layout of Chicago]

Next, we plot the same scatter plot as above, by plotting the latitude and longitude of observed crimes in Chicago. However, in this plot, we color the data points by the "District" column, reported by the City of Chicago. The main purpose of this and the following plots is to explore the boundaries of the reported regions associated with each crime.

In [15]:
# Scatter plot of where observed crimes occur, colored by District
plt.figure(figsize=(10, 10))
plt.scatter(df_crime_unrounded['Longitude'], df_crime_unrounded['Latitude'],
            s=0.02, c=df_crime_unrounded['District'])

axes = plt.gca()
axes.set_xlabel('Longitude')
axes.set_ylabel('Latitude')
Out[15]:
[Figure: scatter plot of crime locations, colored by police district]

Next, we repeat the same plot of the latitude and longitude of each crime in Chicago, but colored by the Beat reported by the Chicago police department, showing the beat boundaries.

In [16]:
# Scatter plot of where observed crimes occur, colored by Beat
plt.figure(figsize=(10, 10))
plt.scatter(df_crime_unrounded['Longitude'], df_crime_unrounded['Latitude'],
            s=0.02, c=df_crime_unrounded['Beat'])

axes = plt.gca()
axes.set_xlabel('Longitude')
axes.set_ylabel('Latitude')
Out[16]:
[Figure: scatter plot of crime locations, colored by police beat]

Next, we repeat the same plot of the latitude and longitude of each crime in Chicago, but colored by Chicago ward, showing the ward boundaries.

In [17]:
# Scatter plot of where observed crimes occur, colored by Ward
plt.figure(figsize=(10, 10))
plt.scatter(df_crime_unrounded['Longitude'], df_crime_unrounded['Latitude'],
            s=0.02, c=df_crime_unrounded['Ward'])

axes = plt.gca()
axes.set_xlabel('Longitude')
axes.set_ylabel('Latitude')
Out[17]:
[Figure: scatter plot of crime locations, colored by ward]