import pandas as pd
# pd.set_option('display.max_colwidth', 50)  # set this if you need wider columns
# The Health Department has developed an inspection report and scoring system.
# After conducting an inspection of the facility, the Health Inspector
# calculates a score based on the violations observed. Violations can fall into:
businesses = pd.read_csv('./data/businesses_plus.csv', parse_dates=True, dtype={'phone_number': str})
businesses.head()
# dtype casts the column as a specific data type
inspections = pd.read_csv('./data/inspections_plus.csv', parse_dates=True)
inspections.head()
violations = pd.read_csv('./data/violations_plus.csv', parse_dates=True)
violations.head()
# 1 Combine the three dataframes into one data frame called restaurant_scores
# Hint: http://pandas.pydata.org/pandas-docs/stable/merging.html
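One possible approach for question 1 is to chain two `merge` calls. This sketch uses toy stand-in frames; it assumes the three CSVs share a key like `business_id`, which you should verify against the real columns first.

```python
import pandas as pd

# Toy stand-ins for the three CSVs; real column names may differ,
# but a shared "business_id" key is the assumption here.
businesses = pd.DataFrame({"business_id": [1, 2], "name": ["Cafe A", "Diner B"]})
inspections = pd.DataFrame({"business_id": [1, 1, 2], "Score": [90, 85, 70]})
violations = pd.DataFrame({"business_id": [1, 2], "description": ["dirty floor", "no soap"]})

# Chain two merges: businesses <- inspections <- violations
restaurant_scores = (businesses.merge(inspections, on="business_id")
                               .merge(violations, on="business_id"))
print(restaurant_scores.shape)
```

Depending on how many rows you want to keep when keys are missing on one side, `how="left"` or `how="outer"` may be more appropriate than the default inner join.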
# 2 Which ten businesses have had the most inspections?
# 3 Group and count the inspections by type
# 4 Create a plot that shows number of inspections per month
# Bonus for creating a heatmap
# http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.heatmap.html?highlight=heatmap
# 5 Which zip code contains the most high risk violations?
# 6 If inspection is prompted by a change in restaurant ownership,
# is the inspection more likely to be categorized as higher or lower risk?
# 7 Examining the descriptions, what is the most common violation?
# 8 Create a hist of the scores with 10 bins
# 9 Can you predict risk category based on the other features in this dataset?
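For question 9, a minimal classification sketch with scikit-learn: one-hot encode the categorical features and fit a tree. The feature names here are assumptions, and scoring on the training data is only a sanity check, not an evaluation.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy features; in the real data you would encode columns such as
# inspection type, zip code, and score (names assumed, not verified).
df = pd.DataFrame({
    "Score": [95, 60, 88, 55],
    "zip": ["94110", "94103", "94110", "94103"],
    "risk_category": ["Low Risk", "High Risk", "Low Risk", "High Risk"],
})
X = pd.get_dummies(df[["Score", "zip"]])  # one-hot encode the zip column
y = df["risk_category"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))  # training accuracy only; use a train/test split in practice
```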
# 10 Extra Credit:
# Use Instagram location API to find pictures taken at the lat, long of the most High Risk restaurant
# https://www.instagram.com/developer/endpoints/locations/
############################
### A Little More Morbid ###
############################
killings = pd.read_csv('./data/police-killings.csv')
killings.head()
# 1. Make the following changes to column names:
# lawenforcementagency -> agency
# raceethnicity -> race
# 2. Show the count of missing values in each column
# 3. Replace each null value in the dataframe with the string "Unknown"
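Questions 1-3 combine three common idioms: `rename`, `isnull().sum()`, and `fillna`. A sketch on a toy frame (only the two column names given in the prompt are taken from the source):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the killings data.
killings = pd.DataFrame({
    "lawenforcementagency": ["PD A", np.nan],
    "raceethnicity": ["White", "Black"],
})

# 1. Rename the columns
killings = killings.rename(columns={"lawenforcementagency": "agency",
                                    "raceethnicity": "race"})
# 2. Count missing values per column
missing = killings.isnull().sum()
# 3. Replace nulls with "Unknown"
killings = killings.fillna("Unknown")
print(missing.to_dict(), killings.loc[1, "agency"])
```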
# 4. How many killings were there so far in 2015?
# 5. Of all killings, how many were male and how many female?
# 6. How many killings were of unarmed people?
# 7. What percentage of all killings were unarmed?
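For questions 6 and 7, the mean of a boolean mask gives a proportion directly. This assumes the `armed` column encodes unarmed as the string `"No"`; check the real values with `value_counts()` first.

```python
import pandas as pd

# Assumed encoding: "No" means unarmed (verify against the real data).
killings = pd.DataFrame({"armed": ["No", "Firearm", "No", "Knife"]})

unarmed_count = (killings["armed"] == "No").sum()
unarmed_pct = (killings["armed"] == "No").mean() * 100
print(unarmed_count, unarmed_pct)
```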
# 8. What are the 5 states with the most killings?
# 9. Show a value counts of deaths for each race
# 10. Display a histogram of ages of all killings
# 11. Show 6 histograms of ages by race
# 12. What is the average age of death by race?
# 13. Show a bar chart with counts of deaths every month
###################
### Less Morbid ###
###################
majors = pd.read_csv('./data/college-majors.csv')
majors.head()
# 1. Delete the columns (employed_full_time_year_round, major_code)
# 2. Show the count of missing values in each column
# 3. What are the top 10 highest paying majors?
# 4. Plot the data from the last question in a bar chart, include proper title, and labels!
# 5. What is the average median salary for each major category?
# 6. Show only the top 5 paying major categories
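Questions 5 and 6 are a groupby-mean followed by a sort. The column names `major_category` and `median` below are guessed from the prompt wording, not verified against the CSV.

```python
import pandas as pd

# Toy majors data with assumed column names.
majors = pd.DataFrame({
    "major_category": ["Engineering", "Engineering", "Arts", "Arts", "Business"],
    "median": [80000, 70000, 35000, 30000, 55000],
})

# 5. Average median salary per category
avg_by_cat = majors.groupby("major_category")["median"].mean()
# 6. Top 5 paying categories
top5 = avg_by_cat.sort_values(ascending=False).head(5)
print(top5.to_dict())
```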
# 7. Plot a histogram of the distribution of median salaries
# 8. Plot a histogram of the distribution of median salaries by major category
# 9. What are the top 10 most UNemployed majors?
# What are the unemployment rates?
# 10. What are the top 10 most UNemployed major CATEGORIES? Use the mean for each category
# What are the unemployment rates?
# 11. The total and employed columns refer to the people that were surveyed.
# Create a new column showing the employment rate of the people surveyed
# for each major; call it "sample_employment_rate"
# Example: the first row has total: 128148 and employed: 90245, so its
# sample_employment_rate should be 90245.0 / 128148.0 = 0.7042
# 12. Create a "sample_unemployment_rate" column
# this column should be 1 - "sample_employment_rate"
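Questions 11 and 12 are vectorized column arithmetic. This sketch uses the example figures from the prompt itself (total 128148, employed 90245):

```python
import pandas as pd

# One-row toy frame using the prompt's example numbers.
majors = pd.DataFrame({"total": [128148], "employed": [90245]})

# 11. Employment rate of the surveyed sample
majors["sample_employment_rate"] = majors["employed"] / majors["total"]
# 12. Its complement
majors["sample_unemployment_rate"] = 1 - majors["sample_employment_rate"]
print(round(majors.loc[0, "sample_employment_rate"], 4))
```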