In this notebook i explore San Francisco city employee salary data, which can be found on kaggle: https://www.kaggle.com/kaggle/sf-salaries
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style("whitegrid")
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['patch.force_edgecolor']=True
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
init_notebook_mode(connected=True)
cf.go_offline()
import plotly.graph_objs as go
salaries = pd.read_csv('Salaries.csv')
C:\Users\gilew\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (3,4,5,6,12) have mixed types. Specify dtype option on import or set low_memory=False.
Because of the DtypeWarning, I have to convert the types of the following columns: 3,4,5,6. I will take care of column 12 later.
for column in ['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']:
salaries[column]=pd.to_numeric(salaries[column], errors='coerce')
salaries.head()
Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411.18 | 0.00 | 400184.25 | NaN | 567595.43 | 567595.43 | 2011 | NaN | San Francisco | NaN |
1 | 2 | GARY JIMENEZ | CAPTAIN III (POLICE DEPARTMENT) | 155966.02 | 245131.88 | 137811.38 | NaN | 538909.28 | 538909.28 | 2011 | NaN | San Francisco | NaN |
2 | 3 | ALBERT PARDINI | CAPTAIN III (POLICE DEPARTMENT) | 212739.13 | 106088.18 | 16452.60 | NaN | 335279.91 | 335279.91 | 2011 | NaN | San Francisco | NaN |
3 | 4 | CHRISTOPHER CHONG | WIRE ROPE CABLE MAINTENANCE MECHANIC | 77916.00 | 56120.71 | 198306.90 | NaN | 332343.61 | 332343.61 | 2011 | NaN | San Francisco | NaN |
4 | 5 | PATRICK GARDNER | DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) | 134401.60 | 9737.00 | 182234.59 | NaN | 326373.19 | 326373.19 | 2011 | NaN | San Francisco | NaN |
I create a heatmap, to see the distribution of Nans in our data.
sns.heatmap(salaries.isnull(), yticklabels=False, cbar=False)
<matplotlib.axes._subplots.AxesSubplot at 0x1b756755e10>
There are two columns which are practically filled with Nans, so I decided to drop them. I also drop the Nan values in BasePay, since there is only few of them and Agency, since I do not need it. Next I will try to analize the Benefits column and decide whether I can fill in the values or drop the column.
salaries.drop(['Notes', 'Agency', 'Status'], axis=1, inplace=True)
salaries = salaries[np.isfinite(salaries['BasePay'])]
sns.distplot(salaries['Benefits'].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x1b75674f128>
We decite, that with a distribution like this, it is a good idea to fill the Nan benefits values with the mean values.
salaries['Benefits']=salaries['Benefits'].fillna(value=salaries['Benefits'].mean())
I decide to take a look at the datato make sure I got rid of all the Nan values.
sns.heatmap(salaries.isnull(), yticklabels=False, cbar=False)
<matplotlib.axes._subplots.AxesSubplot at 0x1b756661390>
Now I take a look at the data, to get familiar with it.
salaries.head()
Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411.18 | 0.00 | 400184.25 | 25114.34451 | 567595.43 | 567595.43 | 2011 |
1 | 2 | GARY JIMENEZ | CAPTAIN III (POLICE DEPARTMENT) | 155966.02 | 245131.88 | 137811.38 | 25114.34451 | 538909.28 | 538909.28 | 2011 |
2 | 3 | ALBERT PARDINI | CAPTAIN III (POLICE DEPARTMENT) | 212739.13 | 106088.18 | 16452.60 | 25114.34451 | 335279.91 | 335279.91 | 2011 |
3 | 4 | CHRISTOPHER CHONG | WIRE ROPE CABLE MAINTENANCE MECHANIC | 77916.00 | 56120.71 | 198306.90 | 25114.34451 | 332343.61 | 332343.61 | 2011 |
4 | 5 | PATRICK GARDNER | DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) | 134401.60 | 9737.00 | 182234.59 | 25114.34451 | 326373.19 | 326373.19 | 2011 |
salaries.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 148045 entries, 0 to 148653 Data columns (total 10 columns): Id 148045 non-null int64 EmployeeName 148045 non-null object JobTitle 148045 non-null object BasePay 148045 non-null float64 OvertimePay 148045 non-null float64 OtherPay 148045 non-null float64 Benefits 148045 non-null float64 TotalPay 148045 non-null float64 TotalPayBenefits 148045 non-null float64 Year 148045 non-null int64 dtypes: float64(6), int64(2), object(2) memory usage: 12.4+ MB
salaries.drop('Id', axis=1).describe()
BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | |
---|---|---|---|---|---|---|---|
count | 148045.000000 | 148045.000000 | 148045.000000 | 148045.000000 | 148045.000000 | 148045.000000 | 148045.000000 |
mean | 66325.448841 | 5086.740800 | 3645.661408 | 25114.344510 | 75057.851048 | 94038.185067 | 2012.520653 |
std | 42764.635495 | 11473.181557 | 8019.498636 | 13364.208347 | 50409.761170 | 62683.477905 | 1.119390 |
min | -166.010000 | -0.010000 | -7058.590000 | -13.800000 | -618.130000 | -618.130000 | 2011.000000 |
25% | 33588.200000 | 0.000000 | 0.000000 | 22747.200000 | 36958.160000 | 45077.410000 | 2012.000000 |
50% | 65007.450000 | 0.000000 | 817.170000 | 25114.344510 | 71644.710000 | 92675.510000 | 2013.000000 |
75% | 94691.050000 | 4695.710000 | 4242.900000 | 33507.310000 | 106028.000000 | 133085.410000 | 2014.000000 |
max | 319275.010000 | 245131.880000 | 400184.250000 | 96570.660000 | 567595.430000 | 567595.430000 | 2014.000000 |
I use the very convenient job title bin function from https://www.kaggle.com/mevanoff24/data-exploration-predicting-salaries
def findJobTitle(row):
police_title = ['police', 'sherif', 'probation', 'sergeant', 'officer', 'lieutenant']
fire_title = ['fire']
transit_title = ['mta', 'transit']
medical_title = ['anesth', 'medical', 'nurs', 'health', 'physician', 'orthopedic', 'pharm', 'care']
court_title = ['court', 'legal']
automotive_title = ['automotive', 'mechanic', 'truck']
engineer_title = ['engineer', 'engr', 'eng', 'program']
general_laborer_title = ['general laborer', 'painter', 'inspector', 'carpenter',
'electrician', 'plumber', 'maintenance']
aide_title = ['aide', 'assistant', 'secretary', 'attendant']
for police in police_title:
if police in row.lower():
return 'police'
for fire in fire_title:
if fire in row.lower():
return 'fire'
for aide in aide_title:
if aide in row.lower():
return 'assistant'
for transit in transit_title:
if transit in row.lower():
return 'transit'
for medical in medical_title:
if medical in row.lower():
return 'medical'
if 'airport' in row.lower():
return 'airport'
if 'worker' in row.lower():
return 'social worker'
if 'architect' in row.lower():
return 'architect'
for court in court_title:
if court in row.lower():
return 'court'
if 'major' in row.lower():
return 'mayor'
if 'librar' in row.lower():
return 'library'
if 'guard' in row.lower():
return 'guard'
if 'public' in row.lower():
return 'public works'
if 'attorney' in row.lower():
return 'attorney'
if 'custodian' in row.lower():
return 'custodian'
if 'account' in row.lower():
return 'account'
if 'garden' in row.lower():
return 'gardener'
if 'recreation' in row.lower():
return 'recreation leader'
for automotive in automotive_title:
if automotive in row.lower():
return 'automotive'
for engineer in engineer_title:
if engineer in row.lower():
return 'engineer'
for general_laborer in general_laborer_title:
if general_laborer in row.lower():
return 'general laborer'
if 'food serv' in row.lower():
return 'food service'
if 'clerk' in row.lower():
return 'clerk'
if 'porter' in row.lower():
return 'porter'
if 'analy' in row.lower():
return 'analyst'
if 'manager' in row.lower():
return 'manager'
else:
return 'other'
salaries['GeneralJobTitle'] = salaries['JobTitle'].map(findJobTitle)
pay_by_year=salaries[['Year', 'TotalPay']].groupby('Year').mean()
pay_by_year
TotalPay | |
---|---|
Year | |
2011 | 71744.103871 |
2012 | 74113.262265 |
2013 | 78808.285041 |
2014 | 75471.836912 |
sns.boxplot(x='Year', y='TotalPay', data=salaries, showfliers=False)
<matplotlib.axes._subplots.AxesSubplot at 0x1b755166ef0>
I can notice the upward trend in the TotalPay in years 2011-2013, but in 2014 the TotalPay mean slightly decreased.
salaries_jobs_years=salaries[['GeneralJobTitle', 'TotalPay', 'Year']].groupby(['GeneralJobTitle', 'Year']).mean()
salaries_jobs_years
TotalPay | ||
---|---|---|
GeneralJobTitle | Year | |
account | 2011 | 56177.342558 |
2012 | 63186.929962 | |
2013 | 66857.307417 | |
2014 | 63641.837870 | |
airport | 2011 | 92273.242673 |
2012 | 94104.593235 | |
2013 | 98160.944660 | |
2014 | 94067.513077 | |
analyst | 2011 | 79492.329936 |
2012 | 81830.453873 | |
2013 | 87654.166634 | |
2014 | 83168.412823 | |
architect | 2011 | 96071.390972 |
2012 | 113270.148969 | |
2013 | 125333.457143 | |
2014 | 124302.629417 | |
assistant | 2011 | 43198.079421 |
2012 | 40129.112925 | |
2013 | 40410.167382 | |
2014 | 39600.355957 | |
attorney | 2011 | 126626.434486 |
2012 | 133973.170860 | |
2013 | 144234.833244 | |
2014 | 138208.046967 | |
automotive | 2011 | 78395.460507 |
2012 | 81830.392473 | |
2013 | 84965.658302 | |
2014 | 83083.261233 | |
clerk | 2011 | 40999.051714 |
2012 | 42632.039917 | |
... | ... | ... |
medical | 2013 | 79212.934761 |
2014 | 77210.463099 | |
other | 2011 | 66222.015069 |
2012 | 75053.523929 | |
2013 | 80126.012794 | |
2014 | 78234.947040 | |
police | 2011 | 109707.702262 |
2012 | 110355.916599 | |
2013 | 119961.317699 | |
2014 | 113887.308085 | |
porter | 2011 | 40985.904961 |
2012 | 42379.496088 | |
2013 | 46010.055586 | |
2014 | 45553.476173 | |
public works | 2011 | 36133.666306 |
2012 | 37266.939199 | |
2013 | 35909.203407 | |
2014 | 34340.963239 | |
recreation leader | 2011 | 10438.966533 |
2012 | 11370.921782 | |
2013 | 12207.075927 | |
2014 | 12449.075208 | |
social worker | 2011 | 60714.044061 |
2012 | 61365.984920 | |
2013 | 63954.010630 | |
2014 | 62010.384171 | |
transit | 2011 | 70427.436251 |
2012 | 74160.880798 | |
2013 | 77607.464635 | |
2014 | 71134.737660 |
104 rows × 1 columns
salaries_jobs_by_year=salaries[['GeneralJobTitle', 'TotalPay', 'Year']].groupby(['Year', 'GeneralJobTitle']).mean()
job_names=salaries_jobs_by_year.xs(2011, level='Year').index.values.tolist()
trace1 = go.Bar(
x=job_names,
y=salaries_jobs_by_year.xs(2011, level='Year')['TotalPay'],
name='2011'
)
trace2 = go.Bar(
x=job_names,
y=salaries_jobs_by_year.xs(2012, level='Year')['TotalPay'],
name='2012'
)
trace3 = go.Bar(
x=job_names,
y=salaries_jobs_by_year.xs(2013, level='Year')['TotalPay'],
name='2013'
)
trace4 = go.Bar(
x=job_names,
y=salaries_jobs_by_year.xs(2014, level='Year')['TotalPay'],
name='2014'
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
barmode='group'
)
fig = go.Figure(data=data, layout=layout)
cf.iplot(fig, filename='grouped-bar')
Now I made sure that the Total Pay was building up in almost every group until 2014, when it dropped a little bit.
The job groups which were affected the most by whatever happened in 2014 are: Court and Firefighters.
We can also notice, that only the Library and Recreation Leader group maintained it's upward trend.
The jobs which remained almost unaffected were: Assistants and Guards.
salaries_jobs_benefits=salaries[['GeneralJobTitle', 'BasePay', 'OvertimePay', 'Benefits']].groupby(['GeneralJobTitle']).mean()
salaries_jobs_benefits
BasePay | OvertimePay | Benefits | |
---|---|---|---|
GeneralJobTitle | |||
account | 61613.504421 | 287.140250 | 25708.762000 |
airport | 84780.208073 | 6566.248195 | 30316.527111 |
analyst | 82264.012407 | 166.780416 | 29064.369347 |
architect | 114872.575135 | 73.958757 | 35758.150797 |
assistant | 37802.116122 | 1831.490932 | 16931.972103 |
attorney | 132583.724982 | 54.810232 | 36687.825837 |
automotive | 68174.963729 | 9124.607881 | 27314.619298 |
clerk | 42546.542254 | 1090.232306 | 20997.084503 |
court | 64585.883368 | 199.010068 | 25710.071452 |
custodian | 41832.585984 | 1526.100917 | 22156.731947 |
engineer | 80059.337953 | 2414.459648 | 27963.838882 |
fire | 107372.900930 | 25623.372684 | 34720.246376 |
food service | 42747.673862 | 1746.415862 | 23226.094070 |
gardener | 50754.503024 | 2010.765047 | 23284.807426 |
general laborer | 75538.164022 | 6700.997761 | 27975.448840 |
guard | 16689.208360 | 1791.476922 | 9832.407203 |
library | 39851.052761 | 1.457051 | 19297.451219 |
manager | 118890.980502 | 37.074164 | 39049.842377 |
medical | 68053.483246 | 2123.112259 | 23100.226239 |
other | 70091.195464 | 2890.044988 | 26482.241173 |
police | 93186.880378 | 11636.756749 | 30192.136074 |
porter | 38968.082877 | 3260.083322 | 20179.401089 |
public works | 30809.095124 | 3133.818901 | 14508.889048 |
recreation leader | 11374.130731 | 2.426723 | 8759.502923 |
social worker | 57299.132002 | 2809.181354 | 24439.990860 |
transit | 57626.474168 | 12847.276152 | 27828.666833 |
trace1 = go.Bar(
x=job_names,
y=salaries_jobs_benefits['BasePay'],
name='Base Pay'
)
trace2 = go.Bar(
x=job_names,
y=salaries_jobs_benefits['OvertimePay'],
name='Overtime Pay'
)
trace3 = go.Bar(
x=job_names,
y=salaries_jobs_benefits['Benefits'],
name='Benefits'
)
data = [trace1, trace2, trace3]
layout = go.Layout(
barmode='group'
)
fig = go.Figure(data=data, layout=layout)
cf.iplot(fig, filename='grouped-bar')
The most important insights from these plots are:
-Firefighters, Policeman and Transit are groups which work overtime the most.
-Attorney, Manager, Architect and Firefighters are gropus which earn the most Base Pay.
-The groups which earn the least are Guards and Recreation Leader.
import gender_guesser.detector as gender
d = gender.Detector(case_sensitive=False)
salaries['Gender']=salaries['EmployeeName'].apply(lambda x: x.split()[0]).apply(lambda x: d.get_gender(x.lower()))
salaries_job_gender=salaries[(salaries['Gender']=='male') | (salaries['Gender']=='female')][['GeneralJobTitle', 'TotalPay', 'Gender']].groupby(['GeneralJobTitle', 'Gender']).mean()
salaries_job_gender
TotalPay | ||
---|---|---|
GeneralJobTitle | Gender | |
account | female | 62720.881249 |
male | 61554.983552 | |
airport | female | 89959.928741 |
male | 97818.146010 | |
analyst | female | 81847.317095 |
male | 84359.228566 | |
architect | female | 113353.711455 |
male | 114559.383938 | |
assistant | female | 45963.404888 |
male | 37698.158883 | |
attorney | female | 134018.130759 |
male | 138007.626231 | |
automotive | female | 72252.454815 |
male | 81672.705192 | |
clerk | female | 47233.568245 |
male | 37452.355623 | |
court | female | 69189.877951 |
male | 65668.484733 | |
custodian | female | 45257.969013 |
male | 45336.296070 | |
engineer | female | 71264.697426 |
male | 89319.615244 | |
fire | female | 139360.142473 |
male | 149907.709934 | |
food service | female | 39156.752889 |
male | 54430.824388 | |
gardener | female | 50016.638519 |
male | 55155.304686 | |
general laborer | female | 80573.823189 |
male | 87667.235885 | |
guard | female | 16436.266490 |
male | 23580.319757 | |
library | female | 46529.025088 |
male | 36206.884371 | |
manager | female | 121272.730786 |
male | 122426.874990 | |
medical | female | 73540.648174 |
male | 80181.323528 | |
other | female | 70579.012167 |
male | 79993.082152 | |
police | female | 95458.979664 |
male | 120337.144226 | |
porter | female | 47195.294830 |
male | 44687.008480 | |
public works | female | 41433.817635 |
male | 32639.754413 | |
recreation leader | female | 11693.305312 |
male | 12549.577932 | |
social worker | female | 63067.066715 |
male | 62496.266117 | |
transit | female | 63954.873872 |
male | 76903.350026 |
salaries_gender=salaries[(salaries['Gender']=='male') | (salaries['Gender']=='female')][['GeneralJobTitle', 'TotalPay', 'Gender']].groupby(['Gender', 'GeneralJobTitle']).mean()
trace1 = go.Bar(
x=job_names,
y=salaries_gender.xs('male', level='Gender')['TotalPay'],
name='Male'
)
trace2 = go.Bar(
x=job_names,
y=salaries_gender.xs('female', level='Gender')['TotalPay'],
name='Female'
)
data = [trace1, trace2]
layout = go.Layout(
barmode='group'
)
fig = go.Figure(data=data, layout=layout)
cf.iplot(fig, filename='grouped-bar')
I can easily notice that there is evidence of pay discrimination based on gender.
In only 4 gropus (15%) women earn noticeably more than men.
In 9 groups (35%) women and men earn more or less the same amount of money.
In remaining 13 (50%!) groups men earn significantly more than women.
The biggest difference can be seen in the engineer group.
salaries_more_than_100k=salaries[salaries['TotalPay']>200000]['GeneralJobTitle'].value_counts().drop('other')
salaries_more_than_100k.iplot(kind='bar')
The job groups in which the most people earn supranormal amount of moneyare clearly Firemen, Policemen and Medical.