This analysis and the charts below should help aspiring data professionals make smarter decisions. The data was collected from the Glassdoor website, then cleaned and transformed before the analysis.
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
data = pd.read_csv("data_scientist_jobinfo.csv")
data.head()
  | job_title | Location | Sector | Python | R | Scala | Spark | AWS | SQL | Excel | PowerBI | Tableau | Tensorflow | Pytorch | Keras | Company_Size | Company_Age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Engineer | Winnipeg | Information Technology | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | Medium | 34 |
1 | Scientist | Toronto | Information Technology | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Small | 7 |
2 | Scientist | Toronto | Business Services | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Medium | 28 |
3 | Scientist | Vancouver | Information Technology | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | Medium | 10 |
4 | Analyst | Waterloo | -1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | Small | -1 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 532 entries, 0 to 531
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   job_title     532 non-null    object
 1   Location      532 non-null    object
 2   Sector        532 non-null    object
 3   Python        532 non-null    int64
 4   R             532 non-null    int64
 5   Scala         532 non-null    int64
 6   Spark         532 non-null    int64
 7   AWS           532 non-null    int64
 8   SQL           532 non-null    int64
 9   Excel         532 non-null    int64
 10  PowerBI       532 non-null    int64
 11  Tableau       532 non-null    int64
 12  Tensorflow    532 non-null    int64
 13  Pytorch       532 non-null    int64
 14  Keras         532 non-null    int64
 15  Company_Size  532 non-null    object
 16  Company_Age   532 non-null    int64
dtypes: int64(13), object(4)
memory usage: 70.8+ KB
The dataset has 532 rows and 17 columns. Note from the preview above that missing values were scraped as -1 (or the string '-1') rather than NaN.
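As a side note, one way to handle those -1 placeholders is to map them to NaN so pandas treats them as missing; here is a minimal sketch on a tiny made-up frame (the column names match the dataset, the values are invented):

```python
import pandas as pd
import numpy as np

# Tiny made-up frame with the dataset's sentinel values.
raw = pd.DataFrame({
    'Sector': ['Information Technology', '-1', 'Finance'],
    'Company_Age': [34, -1, 7],
})

# Map the -1 / '-1' placeholders to NaN so pandas' missing-value
# tooling (isna, dropna, etc.) can see them.
cleaned = raw.replace({'Sector': {'-1': np.nan}, 'Company_Age': {-1: np.nan}})
print(cleaned.isna().sum().sum())  # 2
```

The notebook below keeps the sentinels and filters them per analysis (e.g. `Sector != '-1'`), which is an equally valid choice.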
fig = px.pie(data, names='job_title', title='Job Title', color_discrete_sequence=px.colors.sequential.haline)
fig.update_traces(textposition='inside', textinfo='percent+label+value', pull=[0, 0.2, 0, 0, 0, 0],
marker=dict(line=dict(color='#000000', width=2)))
fig.show()
Based on the pie chart, roughly 38.2% of the posted data jobs are for Data Scientists. Data Analyst comes second with 26.3% and Data Engineer third with 18.4%. Other roles such as Research Scientist, Machine Learning Engineer and Director each account for under 10%, partly because job roles overlap: some companies fold MLE tasks into the Data Scientist role. Still, the chart clearly shows that Data Scientists are in demand.
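The percentages in the pie chart can be reproduced directly with `value_counts(normalize=True)`; a minimal sketch on a made-up `job_title` column:

```python
import pandas as pd

# Invented job titles; the real shares come from the full dataset.
titles = pd.Series(['Scientist', 'Scientist', 'Analyst', 'Engineer'])

# normalize=True converts counts to fractions -- the same
# percentages the pie chart reports.
shares = titles.value_counts(normalize=True)
print(shares['Scientist'])  # 0.5
```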
print("Top 10 Sectors which have the most Jobs: \n")
sector_data = data[data['Sector']!='-1']
print(sector_data['Sector'].value_counts()[:10])
Top 10 Sectors which have the most Jobs:

Information Technology       131
Business Services             55
Finance                       52
Biotech & Pharmaceuticals     36
Retail                        29
Media                         22
Manufacturing                 18
Insurance                     13
Telecommunications            11
Healthcare                    10
Name: Sector, dtype: int64
sector_wise = sector_data.groupby(by=['Sector'])['job_title'].count()
fig = go.Figure(data=[go.Bar(x=sector_wise.index, y=sector_wise.values)])
fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
marker_line_width=1.5, opacity=0.8)
fig.update_layout(xaxis={'categoryorder':'total descending'},
title="Sector wise Total Jobs",
xaxis_title="Sectors",
                  yaxis_title="Total Jobs (out of 532)")
fig.update_xaxes(tickangle=45, tickfont=dict(family='Rockwell', color='crimson', size=14))
fig.update_yaxes(tickfont=dict(family='Rockwell', color='darkblue', size=14))
fig.show()
The IT sector has more jobs than any other: over 100 postings, while Business Services, in second place, has only around 50. Finance, Biotech & Pharmaceuticals and Retail also have sizable numbers of postings. Based on this, aspiring data scientists can choose which sectors to target.
pivot_data = data[data['Sector']!='-1']
pd.set_option('display.max_rows', None)
pd.pivot_table(pivot_data, index =['Sector','job_title'],values='Company_Age', aggfunc='count').sort_values(
'Company_Age', ascending = False).rename(columns={'Company_Age':'Job Count'})[:20]
Sector | job_title | Job Count |
---|---|---|
Information Technology | Scientist | 41 |
Information Technology | Engineer | 37 |
Information Technology | Analyst | 31 |
Business Services | Analyst | 23 |
Biotech & Pharmaceuticals | Scientist | 22 |
Finance | Scientist | 19 |
Retail | Scientist | 18 |
Business Services | Scientist | 16 |
Finance | Analyst | 15 |
Finance | Engineer | 11 |
Information Technology | MLE | 10 |
Retail | Analyst | 10 |
Insurance | Scientist | 9 |
Business Services | Engineer | 8 |
Media | Scientist | 8 |
Media | Analyst | 7 |
Information Technology | Researcher | 7 |
Biotech & Pharmaceuticals | Researcher | 6 |
Manufacturing | Analyst | 6 |
Manufacturing | Engineer | 5 |
The table above shows which roles each sector demands most. For instance, Business Services needs more analysts than scientists, which makes sense since those firms focus on making smarter decisions by analysing data rather than building models.
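The same Sector-by-role counts can also be produced with `pd.crosstab`, which returns the table in wide form; a minimal sketch on invented rows:

```python
import pandas as pd

# Invented Sector / job_title pairs.
df = pd.DataFrame({
    'Sector': ['IT', 'IT', 'Finance'],
    'job_title': ['Scientist', 'Analyst', 'Scientist'],
})

# crosstab counts each (Sector, job_title) pair, like the
# pivot_table(aggfunc='count') above, but with roles as columns
# and missing combinations filled with 0.
table = pd.crosstab(df['Sector'], df['job_title'])
print(table.loc['IT', 'Scientist'])  # 1
```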
print(data['Company_Size'].value_counts())
pd.pivot_table(pivot_data, index =['Company_Size','job_title'],values='Company_Age', aggfunc='count').sort_values(
['Company_Size','Company_Age'], ascending = False).rename(columns={'Company_Age':'Job Count'})[:20]
Small     227
Medium    178
Large     127
Name: Company_Size, dtype: int64
Company_Size | job_title | Job Count |
---|---|---|
Small | Analyst | 41 |
Small | Scientist | 34 |
Small | Engineer | 25 |
Small | MLE | 8 |
Small | Researcher | 8 |
Small | Director | 3 |
Medium | Scientist | 65 |
Medium | Analyst | 47 |
Medium | Engineer | 31 |
Medium | Researcher | 18 |
Medium | Director | 7 |
Medium | MLE | 4 |
Large | Scientist | 58 |
Large | Engineer | 24 |
Large | Analyst | 22 |
Large | Researcher | 10 |
Large | MLE | 6 |
Large | Director | 5 |
The table above shows that it is not only big companies making use of data. Even smaller companies are starting to realize the power of data and how it can help them, and they are the ones posting the most jobs.
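To compare hiring mixes across company sizes on an equal footing, the counts can be normalized within each size band; a small sketch with made-up rows:

```python
import pandas as pd

# Invented rows: two Small-company and two Large-company postings.
df = pd.DataFrame({
    'Company_Size': ['Small', 'Small', 'Large', 'Large'],
    'job_title': ['Analyst', 'Scientist', 'Scientist', 'Scientist'],
})

# normalize='index' turns raw counts into within-size shares, so
# Small and Large hiring mixes are directly comparable.
mix = pd.crosstab(df['Company_Size'], df['job_title'], normalize='index')
print(mix.loc['Small', 'Analyst'])  # 0.5
```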
fig = px.histogram(data[data['Company_Age']>0], x="Company_Age",
opacity=.8, labels={'Company_Age':'Company Age'},
title='Histogram of Company\'s Age',
color_discrete_sequence=['rgb(0, 100, 100)'])
fig.show()
This histogram shows that even newer companies are hiring data professionals to make smarter decisions for their businesses. It also suggests that you don't need a huge amount of data to drive business profits; what matters is how you use what you have to solve business problems.
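The histogram's `Company_Age > 0` filter matters for summary statistics too; a tiny sketch with invented ages:

```python
import pandas as pd

# Invented company ages; -1 is the scrape's missing-value placeholder.
ages = pd.Series([34, 7, 28, 10, -1])

# Drop the placeholder before summarizing, mirroring the
# Company_Age > 0 filter used for the histogram above.
valid = ages[ages > 0]
print(valid.median())  # 19.0
```

Without the filter, the -1 sentinels would drag every statistic down.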
pd.pivot_table(data, index =['Location'],values='Company_Age', aggfunc='count').sort_values(
'Company_Age', ascending = False).rename(columns={'Company_Age':'Job_Count'})[:10]
Location | Job_Count |
---|---|
Toronto | 152 |
Vancouver | 72 |
Montreal | 68 |
Mississauga | 29 |
Ottawa | 25 |
Brampton | 21 |
Calgary | 15 |
Canada | 9 |
Waterloo | 8 |
Victoria | 8 |
The table above shows that most jobs are in the bigger cities; Toronto alone accounts for 152 of the 532 postings.
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]]
fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles=['Python', 'R', 'SQL', 'Scala'])
# reindex([0, 1]) pins the counts to the No/Yes label order; a bare
# value_counts() sorts by frequency, so labels could silently flip per column.
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Python'].value_counts().reindex([0, 1], fill_value=0),
                     name='Python', marker_colors=['#550000','#00FFFF']), 1, 1)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['R'].value_counts().reindex([0, 1], fill_value=0), name='R'), 1, 2)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['SQL'].value_counts().reindex([0, 1], fill_value=0), name='SQL'), 2, 1)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Scala'].value_counts().reindex([0, 1], fill_value=0), name='Scala'), 2, 2)
fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
marker=dict(line=dict(color='#000000', width=2)))
fig.update(layout_title_text='Languages Requirements',
layout_showlegend=True)
fig.update_layout(
autosize=False,
width=700,
height=700)
fig = go.Figure(fig)
fig.show()
The pie charts above illustrate that Python and SQL are the must-have languages for any data professional. Other languages depend on the company's requirements; Scala is also gaining popularity because of Apache Spark.
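Since the skill flags are 0/1, the share of postings requiring each language is just the column mean; a minimal sketch on made-up flags:

```python
import pandas as pd

# Invented 0/1 skill flags for four postings.
df = pd.DataFrame({
    'Python': [1, 1, 1, 0],
    'SQL':    [1, 1, 0, 0],
    'Scala':  [0, 1, 0, 0],
})

# With 0/1 flags, the column mean is the fraction of postings
# that mention the skill -- the same shares the pies display.
rates = df.mean().sort_values(ascending=False)
print(rates['Python'])  # 0.75
```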
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'},{'type':'domain'}]]
fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles=['Tensorflow', 'Pytorch', 'Keras'])
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Tensorflow'].value_counts().reindex([0, 1], fill_value=0),
                     name='Tensorflow', marker_colors=['#550000','#00FFFF']), 1, 1)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Pytorch'].value_counts().reindex([0, 1], fill_value=0), name='Pytorch'), 1, 2)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Keras'].value_counts().reindex([0, 1], fill_value=0), name='Keras'), 2, 1)
fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
marker=dict(line=dict(color='#000000', width=2)))
fig.update(layout_title_text='DL Framework Requirements',
layout_showlegend=True)
fig.update_layout(autosize=False,
width=800,
height=800)
fig = go.Figure(fig)
fig.show()
Most companies require that you know TensorFlow and its higher-level API, Keras. TensorFlow is more popular than PyTorch because of its deployment tooling; nevertheless, PyTorch is also popular for its ease of use.
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'},{'type':'domain'}]]
fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles=['Excel', 'Tableau', 'PowerBI'])
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Excel'].value_counts().reindex([0, 1], fill_value=0),
                     name='Excel', marker_colors=['#550000','#00FFFF']), 1, 1)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Tableau'].value_counts().reindex([0, 1], fill_value=0), name='Tableau'), 1, 2)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['PowerBI'].value_counts().reindex([0, 1], fill_value=0),
                     name='PowerBI'), 2, 1)
fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
marker=dict(line=dict(color='#000000', width=2)))
fig.update(layout_title_text='BI Tool Requirements',
layout_showlegend=True)
fig.update_layout(autosize=False,
width=800,
height=800)
fig = go.Figure(fig)
fig.show()
In terms of visualization tools, Excel is still popular, but Tableau is a more powerful tool that is easy to use and doesn't require any coding skills.
specs = [[{'type':'domain'}, {'type':'domain'}]]
fig = make_subplots(rows=1, cols=2, specs=specs, subplot_titles=['AWS', 'Spark'])
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['AWS'].value_counts().reindex([0, 1], fill_value=0),
                     name='AWS', marker_colors=['#550000','#00FFFF']), 1, 1)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Spark'].value_counts().reindex([0, 1], fill_value=0), name='Spark'), 1, 2)
fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
marker=dict(line=dict(color='#000000', width=2)))
fig.update(layout_title_text='AWS & Spark Requirements',
layout_showlegend=True)
fig = go.Figure(fig)
fig.show()
AWS and Spark are important technologies to know for better job prospects, especially at larger companies.
columns = ['Python', 'R', 'AWS', 'Scala', 'Excel', 'Tableau', 'PowerBI', 'Spark', 'SQL', 'Pytorch', 'Tensorflow', 'Keras']
# Each flag column is 0/1, so the column sum is the number of postings mentioning that tool.
count = data[columns].sum()
fig = go.Figure(data=[go.Bar(x=columns, y=count)])
fig.update_traces(marker_color='darkblue', marker_line_color='rgb(0,255,255)',
marker_line_width=1.5, opacity=.8)
fig.update_layout(xaxis={'categoryorder':'total descending'},
title="Number of times Tool & Technologies Mentioned in Job Descriptions",
xaxis_title="Tools & Technologies",
yaxis_title="Count(532)")
fig.update_xaxes(tickfont=dict(family='Rockwell', color='crimson', size=14))
fig.update_yaxes(tickfont=dict(family='Rockwell', color='darkblue', size=14))
fig.show()
This bar graph shows which tools you should focus on learning first. One more note: Keras is last here, but that doesn't mean it isn't required. Many companies simply don't list it in the job description because they expect you to already know such basic tools for model development.
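The Keras point can be probed directly with a co-occurrence check: among postings that do mention Keras, see how many also mention TensorFlow. A sketch on invented flags:

```python
import pandas as pd

# Invented 0/1 mention flags for four postings.
df = pd.DataFrame({
    'Tensorflow': [1, 1, 0, 1],
    'Keras':      [1, 0, 0, 0],
})

# Among postings that mention Keras, the share that also
# mention Tensorflow -- a quick co-occurrence check.
keras_jobs = df[df['Keras'] == 1]
print(keras_jobs['Tensorflow'].mean())  # 1.0
```

Run against the real dataset, this would show whether Keras is ever requested without TensorFlow.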