SF Salaries¶

In this notebook i explore San Francisco city employee salary data, which can be found on kaggle: https://www.kaggle.com/kaggle/sf-salaries

Importing libraries¶

In [1]:

import numpy as np

import pandas as pd

import seaborn as sns
sns.set_style("whitegrid")

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['patch.force_edgecolor']=True

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
init_notebook_mode(connected=True)
cf.go_offline()
import plotly.graph_objs as go

Getting data¶

Doing the imports¶

In [2]:

salaries = pd.read_csv('Salaries.csv')

C:\Users\gilew\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning:

Columns (3,4,5,6,12) have mixed types. Specify dtype option on import or set low_memory=False.

Because of the DtypeWarning, I have to convert the types of the following columns: 3,4,5,6. I will take care of column 12 later.

Converting types¶

In [3]:

for column in ['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']:
    salaries[column]=pd.to_numeric(salaries[column], errors='coerce')

Checking the head of the data¶

In [4]:

salaries.head()

Out[4]:

	Id	EmployeeName	JobTitle	BasePay	OvertimePay	OtherPay	Benefits	TotalPay	TotalPayBenefits	Year	Notes	Agency	Status
0	1	NATHANIEL FORD	GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY	167411.18	0.00	400184.25	NaN	567595.43	567595.43	2011	NaN	San Francisco	NaN
1	2	GARY JIMENEZ	CAPTAIN III (POLICE DEPARTMENT)	155966.02	245131.88	137811.38	NaN	538909.28	538909.28	2011	NaN	San Francisco	NaN
2	3	ALBERT PARDINI	CAPTAIN III (POLICE DEPARTMENT)	212739.13	106088.18	16452.60	NaN	335279.91	335279.91	2011	NaN	San Francisco	NaN
3	4	CHRISTOPHER CHONG	WIRE ROPE CABLE MAINTENANCE MECHANIC	77916.00	56120.71	198306.90	NaN	332343.61	332343.61	2011	NaN	San Francisco	NaN
4	5	PATRICK GARDNER	DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)	134401.60	9737.00	182234.59	NaN	326373.19	326373.19	2011	NaN	San Francisco	NaN

Cleaning the data¶

I create a heatmap, to see the distribution of Nans in our data.

In [5]:

sns.heatmap(salaries.isnull(), yticklabels=False, cbar=False)

Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b756755e10>

There are two columns which are practically filled with Nans, so I decided to drop them. I also drop the Nan values in BasePay, since there is only few of them and Agency, since I do not need it. Next I will try to analize the Benefits column and decide whether I can fill in the values or drop the column.

Dropping columns, which have too many nans (Notes, Status)¶

In [6]:

salaries.drop(['Notes', 'Agency', 'Status'], axis=1, inplace=True)
salaries = salaries[np.isfinite(salaries['BasePay'])]

Plotting a distribution of benefits¶

In [7]:

sns.distplot(salaries['Benefits'].dropna())

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b75674f128>

We decite, that with a distribution like this, it is a good idea to fill the Nan benefits values with the mean values.

Filling nans with mean values of benefits¶

In [8]:

salaries['Benefits']=salaries['Benefits'].fillna(value=salaries['Benefits'].mean())

Checking our data once again¶

I decide to take a look at the datato make sure I got rid of all the Nan values.

In [9]:

sns.heatmap(salaries.isnull(), yticklabels=False, cbar=False)

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b756661390>

Looking at our clean data¶

Now I take a look at the data, to get familiar with it.

In [10]:

salaries.head()

Out[10]:

	Id	EmployeeName	JobTitle	BasePay	OvertimePay	OtherPay	Benefits	TotalPay	TotalPayBenefits	Year
0	1	NATHANIEL FORD	GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY	167411.18	0.00	400184.25	25114.34451	567595.43	567595.43	2011
1	2	GARY JIMENEZ	CAPTAIN III (POLICE DEPARTMENT)	155966.02	245131.88	137811.38	25114.34451	538909.28	538909.28	2011
2	3	ALBERT PARDINI	CAPTAIN III (POLICE DEPARTMENT)	212739.13	106088.18	16452.60	25114.34451	335279.91	335279.91	2011
3	4	CHRISTOPHER CHONG	WIRE ROPE CABLE MAINTENANCE MECHANIC	77916.00	56120.71	198306.90	25114.34451	332343.61	332343.61	2011
4	5	PATRICK GARDNER	DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)	134401.60	9737.00	182234.59	25114.34451	326373.19	326373.19	2011

In [11]:

salaries.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148045 entries, 0 to 148653
Data columns (total 10 columns):
Id                  148045 non-null int64
EmployeeName        148045 non-null object
JobTitle            148045 non-null object
BasePay             148045 non-null float64
OvertimePay         148045 non-null float64
OtherPay            148045 non-null float64
Benefits            148045 non-null float64
TotalPay            148045 non-null float64
TotalPayBenefits    148045 non-null float64
Year                148045 non-null int64
dtypes: float64(6), int64(2), object(2)
memory usage: 12.4+ MB

In [12]:

salaries.drop('Id', axis=1).describe()

Out[12]:

	BasePay	OvertimePay	OtherPay	Benefits	TotalPay	TotalPayBenefits	Year
count	148045.000000	148045.000000	148045.000000	148045.000000	148045.000000	148045.000000	148045.000000
mean	66325.448841	5086.740800	3645.661408	25114.344510	75057.851048	94038.185067	2012.520653
std	42764.635495	11473.181557	8019.498636	13364.208347	50409.761170	62683.477905	1.119390
min	-166.010000	-0.010000	-7058.590000	-13.800000	-618.130000	-618.130000	2011.000000
25%	33588.200000	0.000000	0.000000	22747.200000	36958.160000	45077.410000	2012.000000
50%	65007.450000	0.000000	817.170000	25114.344510	71644.710000	92675.510000	2013.000000
75%	94691.050000	4695.710000	4242.900000	33507.310000	106028.000000	133085.410000	2014.000000
max	319275.010000	245131.880000	400184.250000	96570.660000	567595.430000	567595.430000	2014.000000

Putting people into different groups based on their job¶

I use the very convenient job title bin function from https://www.kaggle.com/mevanoff24/data-exploration-predicting-salaries

In [13]:

def findJobTitle(row):
    
    police_title = ['police', 'sherif', 'probation', 'sergeant', 'officer', 'lieutenant']
    fire_title = ['fire']
    transit_title = ['mta', 'transit']
    medical_title = ['anesth', 'medical', 'nurs', 'health', 'physician', 'orthopedic', 'pharm', 'care']
    court_title = ['court', 'legal']
    automotive_title = ['automotive', 'mechanic', 'truck']
    engineer_title = ['engineer', 'engr', 'eng', 'program']
    general_laborer_title = ['general laborer', 'painter', 'inspector', 'carpenter', 
                             'electrician', 'plumber', 'maintenance']
    aide_title = ['aide', 'assistant', 'secretary', 'attendant']
    
    for police in police_title:
        if police in row.lower():
            return 'police'    
    for fire in fire_title:
        if fire in row.lower():
            return 'fire'
    for aide in aide_title:
        if aide in row.lower():
            return 'assistant'
    for transit in transit_title:
        if transit in row.lower():
            return 'transit'
    for medical in medical_title:
        if medical in row.lower():
            return 'medical'
    if 'airport' in row.lower():
        return 'airport'
    if 'worker' in row.lower():
        return 'social worker'
    if 'architect' in row.lower():
        return 'architect'
    for court in court_title:
        if court in row.lower():
            return 'court'
    if 'major' in row.lower():
        return 'mayor'
    if 'librar' in row.lower():
        return 'library'
    if 'guard' in row.lower():
        return 'guard'
    if 'public' in row.lower():
        return 'public works'
    if 'attorney' in row.lower():
        return 'attorney'
    if 'custodian' in row.lower():
        return 'custodian'
    if 'account' in row.lower():
        return 'account'
    if 'garden' in row.lower():
        return 'gardener'
    if 'recreation' in row.lower():
        return 'recreation leader'
    for automotive in automotive_title:
        if automotive in row.lower():
            return 'automotive'
    for engineer in engineer_title:
        if engineer in row.lower():
            return 'engineer'
    for general_laborer in general_laborer_title:
        if general_laborer in row.lower():
            return 'general laborer'
    if 'food serv' in row.lower():
        return 'food service'
    if 'clerk' in row.lower():
        return 'clerk'
    if 'porter' in row.lower():
        return 'porter' 
    if 'analy' in row.lower():
        return 'analyst'
    if 'manager' in row.lower():
        return 'manager'
    else:
        return 'other'
    
salaries['GeneralJobTitle'] = salaries['JobTitle'].map(findJobTitle)

Exploring the data¶

Changes in salaries over time between different groups of people¶

In [14]:

pay_by_year=salaries[['Year', 'TotalPay']].groupby('Year').mean()
pay_by_year

Out[14]:

	TotalPay
Year
2011	71744.103871
2012	74113.262265
2013	78808.285041
2014	75471.836912

In [15]:

sns.boxplot(x='Year', y='TotalPay', data=salaries, showfliers=False)

Out[15]:

<matplotlib.axes._subplots.AxesSubplot at 0x1b755166ef0>

I can notice the upward trend in the TotalPay in years 2011-2013, but in 2014 the TotalPay mean slightly decreased.

Checking how the pays changed over the years in groups of people¶

In [16]:

salaries_jobs_years=salaries[['GeneralJobTitle', 'TotalPay', 'Year']].groupby(['GeneralJobTitle', 'Year']).mean()
salaries_jobs_years

Out[16]:

		TotalPay
GeneralJobTitle	Year
account	2011	56177.342558
	2012	63186.929962
	2013	66857.307417
	2014	63641.837870
airport	2011	92273.242673
	2012	94104.593235
	2013	98160.944660
	2014	94067.513077
analyst	2011	79492.329936
	2012	81830.453873
	2013	87654.166634
	2014	83168.412823
architect	2011	96071.390972
	2012	113270.148969
	2013	125333.457143
	2014	124302.629417
assistant	2011	43198.079421
	2012	40129.112925
	2013	40410.167382
	2014	39600.355957
attorney	2011	126626.434486
	2012	133973.170860
	2013	144234.833244
	2014	138208.046967
automotive	2011	78395.460507
	2012	81830.392473
	2013	84965.658302
	2014	83083.261233
clerk	2011	40999.051714
clerk	2012	42632.039917
...	...	...
medical	2013	79212.934761
medical	2014	77210.463099
other	2011	66222.015069
	2012	75053.523929
	2013	80126.012794
	2014	78234.947040
police	2011	109707.702262
	2012	110355.916599
	2013	119961.317699
	2014	113887.308085
porter	2011	40985.904961
	2012	42379.496088
	2013	46010.055586
	2014	45553.476173
public works	2011	36133.666306
	2012	37266.939199
	2013	35909.203407
	2014	34340.963239
recreation leader	2011	10438.966533
	2012	11370.921782
	2013	12207.075927
	2014	12449.075208
social worker	2011	60714.044061
	2012	61365.984920
	2013	63954.010630
	2014	62010.384171
transit	2011	70427.436251
	2012	74160.880798
	2013	77607.464635
	2014	71134.737660

104 rows × 1 columns

In [17]:

salaries_jobs_by_year=salaries[['GeneralJobTitle', 'TotalPay', 'Year']].groupby(['Year', 'GeneralJobTitle']).mean()
job_names=salaries_jobs_by_year.xs(2011, level='Year').index.values.tolist()

trace1 = go.Bar(
    x=job_names,
    y=salaries_jobs_by_year.xs(2011, level='Year')['TotalPay'],
    name='2011'
)
trace2 = go.Bar(
    x=job_names,
    y=salaries_jobs_by_year.xs(2012, level='Year')['TotalPay'],
    name='2012'
)
trace3 = go.Bar(
    x=job_names,
    y=salaries_jobs_by_year.xs(2013, level='Year')['TotalPay'],
    name='2013'
)
trace4 = go.Bar(
    x=job_names,
    y=salaries_jobs_by_year.xs(2014, level='Year')['TotalPay'],
    name='2014'
)

data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
cf.iplot(fig, filename='grouped-bar')

Now I made sure that the Total Pay was building up in almost every group until 2014, when it dropped a little bit.
The job groups which were affected the most by whatever happened in 2014 are: Court and Firefighters.
We can also notice, that only the Library and Recreation Leader group maintained it's upward trend.
The jobs which remained almost unaffected were: Assistants and Guards.

Looking at how base pay, overtime pay and benefits are allocated between job groups¶

In [18]:

salaries_jobs_benefits=salaries[['GeneralJobTitle', 'BasePay', 'OvertimePay', 'Benefits']].groupby(['GeneralJobTitle']).mean()
salaries_jobs_benefits

Out[18]:

	BasePay	OvertimePay	Benefits
GeneralJobTitle
account	61613.504421	287.140250	25708.762000
airport	84780.208073	6566.248195	30316.527111
analyst	82264.012407	166.780416	29064.369347
architect	114872.575135	73.958757	35758.150797
assistant	37802.116122	1831.490932	16931.972103
attorney	132583.724982	54.810232	36687.825837
automotive	68174.963729	9124.607881	27314.619298
clerk	42546.542254	1090.232306	20997.084503
court	64585.883368	199.010068	25710.071452
custodian	41832.585984	1526.100917	22156.731947
engineer	80059.337953	2414.459648	27963.838882
fire	107372.900930	25623.372684	34720.246376
food service	42747.673862	1746.415862	23226.094070
gardener	50754.503024	2010.765047	23284.807426
general laborer	75538.164022	6700.997761	27975.448840
guard	16689.208360	1791.476922	9832.407203
library	39851.052761	1.457051	19297.451219
manager	118890.980502	37.074164	39049.842377
medical	68053.483246	2123.112259	23100.226239
other	70091.195464	2890.044988	26482.241173
police	93186.880378	11636.756749	30192.136074
porter	38968.082877	3260.083322	20179.401089
public works	30809.095124	3133.818901	14508.889048
recreation leader	11374.130731	2.426723	8759.502923
social worker	57299.132002	2809.181354	24439.990860
transit	57626.474168	12847.276152	27828.666833

In [19]:

trace1 = go.Bar(
    x=job_names,
    y=salaries_jobs_benefits['BasePay'],
    name='Base Pay'
)
trace2 = go.Bar(
    x=job_names,
    y=salaries_jobs_benefits['OvertimePay'],
    name='Overtime Pay'
)
trace3 = go.Bar(
    x=job_names,
    y=salaries_jobs_benefits['Benefits'],
    name='Benefits'
)

data = [trace1, trace2, trace3]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
cf.iplot(fig, filename='grouped-bar')

The most important insights from these plots are:
-Firefighters, Policeman and Transit are groups which work overtime the most.
-Attorney, Manager, Architect and Firefighters are gropus which earn the most Base Pay.
-The groups which earn the least are Guards and Recreation Leader.

Checking if there is any evidence of pay discrimination based on gender¶

Creating gender column with some help of gender guesser library¶

In [20]:

import gender_guesser.detector as gender
d = gender.Detector(case_sensitive=False)
salaries['Gender']=salaries['EmployeeName'].apply(lambda x: x.split()[0]).apply(lambda x: d.get_gender(x.lower()))

In [21]:

salaries_job_gender=salaries[(salaries['Gender']=='male') | (salaries['Gender']=='female')][['GeneralJobTitle', 'TotalPay', 'Gender']].groupby(['GeneralJobTitle', 'Gender']).mean()

Taking a look at data and plotting it¶

In [22]:

salaries_job_gender

Out[22]:

		TotalPay
GeneralJobTitle	Gender
account	female	62720.881249
account	male	61554.983552
airport	female	89959.928741
airport	male	97818.146010
analyst	female	81847.317095
analyst	male	84359.228566
architect	female	113353.711455
architect	male	114559.383938
assistant	female	45963.404888
assistant	male	37698.158883
attorney	female	134018.130759
attorney	male	138007.626231
automotive	female	72252.454815
automotive	male	81672.705192
clerk	female	47233.568245
clerk	male	37452.355623
court	female	69189.877951
court	male	65668.484733
custodian	female	45257.969013
custodian	male	45336.296070
engineer	female	71264.697426
engineer	male	89319.615244
fire	female	139360.142473
fire	male	149907.709934
food service	female	39156.752889
food service	male	54430.824388
gardener	female	50016.638519
gardener	male	55155.304686
general laborer	female	80573.823189
general laborer	male	87667.235885
guard	female	16436.266490
guard	male	23580.319757
library	female	46529.025088
library	male	36206.884371
manager	female	121272.730786
manager	male	122426.874990
medical	female	73540.648174
medical	male	80181.323528
other	female	70579.012167
other	male	79993.082152
police	female	95458.979664
police	male	120337.144226
porter	female	47195.294830
porter	male	44687.008480
public works	female	41433.817635
public works	male	32639.754413
recreation leader	female	11693.305312
recreation leader	male	12549.577932
social worker	female	63067.066715
social worker	male	62496.266117
transit	female	63954.873872
transit	male	76903.350026

In [23]:

salaries_gender=salaries[(salaries['Gender']=='male') | (salaries['Gender']=='female')][['GeneralJobTitle', 'TotalPay', 'Gender']].groupby(['Gender', 'GeneralJobTitle']).mean()

trace1 = go.Bar(
    x=job_names,
    y=salaries_gender.xs('male', level='Gender')['TotalPay'],
    name='Male'
)
trace2 = go.Bar(
    x=job_names,
    y=salaries_gender.xs('female', level='Gender')['TotalPay'],
    name='Female'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
cf.iplot(fig, filename='grouped-bar')

I can easily notice that there is evidence of pay discrimination based on gender.
In only 4 gropus (15%) women earn noticeably more than men.
In 9 groups (35%) women and men earn more or less the same amount of money.
In remaining 13 (50%!) groups men earn significantly more than women.
The biggest difference can be seen in the engineer group.

Looking at the number of people in every job, that earn more than $200k¶

In [24]:

salaries_more_than_100k=salaries[salaries['TotalPay']>200000]['GeneralJobTitle'].value_counts().drop('other')
salaries_more_than_100k.iplot(kind='bar')

The job groups in which the most people earn supranormal amount of moneyare clearly Firemen, Policemen and Medical.