This analysis and the charts below should help aspiring data professionals make smarter decisions. The data was collected from the Glassdoor website, then cleaned and transformed before the analysis.
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
data = pd.read_csv("data_scientist_jobinfo.csv")
data.head()
  | job_title | Location | Sector | Python | R | Scala | Spark | AWS | SQL | Excel | PowerBI | Tableau | Tensorflow | Pytorch | Keras | Company_Size | Company_Age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Engineer | Winnipeg | Information Technology | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | Medium | 34 |
1 | Scientist | Toronto | Information Technology | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Small | 7 |
2 | Scientist | Toronto | Business Services | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Medium | 28 |
3 | Scientist | Vancouver | Information Technology | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | Medium | 10 |
4 | Analyst | Waterloo | -1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | Small | -1 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 532 entries, 0 to 531
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   job_title     532 non-null    object
 1   Location      532 non-null    object
 2   Sector        532 non-null    object
 3   Python        532 non-null    int64
 4   R             532 non-null    int64
 5   Scala         532 non-null    int64
 6   Spark         532 non-null    int64
 7   AWS           532 non-null    int64
 8   SQL           532 non-null    int64
 9   Excel         532 non-null    int64
 10  PowerBI       532 non-null    int64
 11  Tableau       532 non-null    int64
 12  Tensorflow    532 non-null    int64
 13  Pytorch       532 non-null    int64
 14  Keras         532 non-null    int64
 15  Company_Size  532 non-null    object
 16  Company_Age   532 non-null    int64
dtypes: int64(13), object(4)
memory usage: 70.8+ KB
The dataset has 532 rows and 17 columns. Note from the preview above that missing values were scraped as -1 (or the string '-1') rather than NaN.
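As a side note, one way to handle those -1 placeholders is to map them to NaN so pandas treats them as missing; here is a minimal sketch on a tiny made-up frame (the column names match the dataset, the values are invented):

```python
import pandas as pd
import numpy as np

# Tiny made-up frame with the dataset's sentinel values.
raw = pd.DataFrame({
    'Sector': ['Information Technology', '-1', 'Finance'],
    'Company_Age': [34, -1, 7],
})

# Map the -1 / '-1' placeholders to NaN so pandas' missing-value
# tooling (isna, dropna, etc.) can see them.
cleaned = raw.replace({'Sector': {'-1': np.nan}, 'Company_Age': {-1: np.nan}})
print(cleaned.isna().sum().sum())  # 2
```

The notebook below keeps the sentinels and filters them per analysis (e.g. `Sector != '-1'`), which is an equally valid choice.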
fig = px.pie(data, names='job_title', title='Job Title', color_discrete_sequence=px.colors.sequential.haline)
fig.update_traces(textposition='inside', textinfo='percent+label+value', pull=[0, 0.2, 0, 0, 0, 0],
marker=dict(line=dict(color='#000000', width=2)))
fig.show()
Based on the pie chart, roughly 38.2% of the posted data jobs are for Data Scientists. Data Analyst comes second with 26.3% and Data Engineer third with 18.4%. Other roles such as Research Scientist, Machine Learning Engineer and Director each account for under 10%, partly because job roles overlap: some companies fold MLE tasks into the Data Scientist role. Still, the chart clearly shows that Data Scientists are in demand.
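The percentages in the pie chart can be reproduced directly with `value_counts(normalize=True)`; a minimal sketch on a made-up `job_title` column:

```python
import pandas as pd

# Invented job titles; the real shares come from the full dataset.
titles = pd.Series(['Scientist', 'Scientist', 'Analyst', 'Engineer'])

# normalize=True converts counts to fractions -- the same
# percentages the pie chart reports.
shares = titles.value_counts(normalize=True)
print(shares['Scientist'])  # 0.5
```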
print("Top 10 Sectors which have the most Jobs: \n")
sector_data = data[data['Sector']!='-1']
print(sector_data['Sector'].value_counts()[:10])
Top 10 Sectors which have the most Jobs:

Information Technology       131
Business Services             55
Finance                       52
Biotech & Pharmaceuticals     36
Retail                        29
Media                         22
Manufacturing                 18
Insurance                     13
Telecommunications            11
Healthcare                    10
Name: Sector, dtype: int64
sector_wise = sector_data.groupby(by=['Sector'])['job_title'].count()
fig = go.Figure(data=[go.Bar(x=sector_wise.index, y=sector_wise.values)])
fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
marker_line_width=1.5, opacity=0.8)
fig.update_layout(xaxis={'categoryorder':'total descending'},
title="Sector wise Total Jobs",
xaxis_title="Sectors",
                  yaxis_title="Total Jobs (out of 532)")
fig.update_xaxes(tickangle=45, tickfont=dict(family='Rockwell', color='crimson', size=14))
fig.update_yaxes(tickfont=dict(family='Rockwell', color='darkblue', size=14))
fig.show()
The IT sector has more jobs than any other: over 100 postings, while Business Services, in second place, has only around 50. Finance, Biotech & Pharmaceuticals and Retail also have sizable numbers of postings. Based on this, aspiring data scientists can choose which sectors to target.
pivot_data = data[data['Sector']!='-1']
pd.set_option('display.max_rows', None)
pd.pivot_table(pivot_data, index =['Sector','job_title'],values='Company_Age', aggfunc='count').sort_values(
'Company_Age', ascending = False).rename(columns={'Company_Age':'Job Count'})[:20]
Sector | job_title | Job Count |
---|---|---|
Information Technology | Scientist | 41 |
Information Technology | Engineer | 37 |
Information Technology | Analyst | 31 |
Business Services | Analyst | 23 |
Biotech & Pharmaceuticals | Scientist | 22 |
Finance | Scientist | 19 |
Retail | Scientist | 18 |
Business Services | Scientist | 16 |
Finance | Analyst | 15 |
Finance | Engineer | 11 |
Information Technology | MLE | 10 |
Retail | Analyst | 10 |
Insurance | Scientist | 9 |
Business Services | Engineer | 8 |
Media | Scientist | 8 |
Media | Analyst | 7 |
Information Technology | Researcher | 7 |
Biotech & Pharmaceuticals | Researcher | 6 |
Manufacturing | Analyst | 6 |
Manufacturing | Engineer | 5 |
The table above shows which roles each sector demands most. For instance, Business Services needs more analysts than scientists, which makes sense since those firms focus on making smarter decisions by analysing data rather than building models.
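The same Sector-by-role counts can also be produced with `pd.crosstab`, which returns the table in wide form; a minimal sketch on invented rows:

```python
import pandas as pd

# Invented Sector / job_title pairs.
df = pd.DataFrame({
    'Sector': ['IT', 'IT', 'Finance'],
    'job_title': ['Scientist', 'Analyst', 'Scientist'],
})

# crosstab counts each (Sector, job_title) pair, like the
# pivot_table(aggfunc='count') above, but with roles as columns
# and missing combinations filled with 0.
table = pd.crosstab(df['Sector'], df['job_title'])
print(table.loc['IT', 'Scientist'])  # 1
```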
print(data['Company_Size'].value_counts())
pd.pivot_table(pivot_data, index =['Company_Size','job_title'],values='Company_Age', aggfunc='count').sort_values(
['Company_Size','Company_Age'], ascending = False).rename(columns={'Company_Age':'Job Count'})[:20]
Small     227
Medium    178
Large     127
Name: Company_Size, dtype: int64
Company_Size | job_title | Job Count |
---|---|---|
Small | Analyst | 41 |
Small | Scientist | 34 |
Small | Engineer | 25 |
Small | MLE | 8 |
Small | Researcher | 8 |
Small | Director | 3 |
Medium | Scientist | 65 |
Medium | Analyst | 47 |
Medium | Engineer | 31 |
Medium | Researcher | 18 |
Medium | Director | 7 |
Medium | MLE | 4 |
Large | Scientist | 58 |
Large | Engineer | 24 |
Large | Analyst | 22 |
Large | Researcher | 10 |
Large | MLE | 6 |
Large | Director | 5 |
The table above shows that it is not only big companies making use of data. Even smaller companies are starting to realize the power of data and how it can help them, and they are the ones posting the most jobs.
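To compare hiring mixes across company sizes on an equal footing, the counts can be normalized within each size band; a small sketch with made-up rows:

```python
import pandas as pd

# Invented rows: two Small-company and two Large-company postings.
df = pd.DataFrame({
    'Company_Size': ['Small', 'Small', 'Large', 'Large'],
    'job_title': ['Analyst', 'Scientist', 'Scientist', 'Scientist'],
})

# normalize='index' turns raw counts into within-size shares, so
# Small and Large hiring mixes are directly comparable.
mix = pd.crosstab(df['Company_Size'], df['job_title'], normalize='index')
print(mix.loc['Small', 'Analyst'])  # 0.5
```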
fig = px.histogram(data[data['Company_Age']>0], x="Company_Age",
opacity=.8, labels={'Company_Age':'Company Age'},
title='Histogram of Company\'s Age',
color_discrete_sequence=['rgb(0, 100, 100)'])
fig.show()
This histogram shows that even newer companies are hiring data professionals to make smarter decisions for their businesses. It also suggests that you don't need a huge amount of data to drive business profits; what matters is how you use what you have to solve business problems.
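The histogram's `Company_Age > 0` filter matters for summary statistics too; a tiny sketch with invented ages:

```python
import pandas as pd

# Invented company ages; -1 is the scrape's missing-value placeholder.
ages = pd.Series([34, 7, 28, 10, -1])

# Drop the placeholder before summarizing, mirroring the
# Company_Age > 0 filter used for the histogram above.
valid = ages[ages > 0]
print(valid.median())  # 19.0
```

Without the filter, the -1 sentinels would drag every statistic down.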
pd.pivot_table(data, index =['Location'],values='Company_Age', aggfunc='count').sort_values(
'Company_Age', ascending = False).rename(columns={'Company_Age':'Job_Count'})[:10]
Location | Job_Count |
---|---|
Toronto | 152 |
Vancouver | 72 |
Montreal | 68 |
Mississauga | 29 |
Ottawa | 25 |
Brampton | 21 |
Calgary | 15 |
Canada | 9 |
Waterloo | 8 |
Victoria | 8 |
The table above shows that most jobs are in the bigger cities; Toronto alone accounts for 152 of the 532 postings.
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]]
fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles=['Python', 'R', 'SQL', 'Scala'])
# reindex([0, 1]) pins the counts to the No/Yes label order; a bare
# value_counts() sorts by frequency, so labels could silently flip per column.
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Python'].value_counts().reindex([0, 1], fill_value=0),
                     name='Python', marker_colors=['#550000','#00FFFF']), 1, 1)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['R'].value_counts().reindex([0, 1], fill_value=0), name='R'), 1, 2)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['SQL'].value_counts().reindex([0, 1], fill_value=0), name='SQL'), 2, 1)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Scala'].value_counts().reindex([0, 1], fill_value=0), name='Scala'), 2, 2)
fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
marker=dict(line=dict(color='#000000', width=2)))
fig.update(layout_title_text='Languages Requirements',
layout_showlegend=True)
fig.update_layout(
autosize=False,
width=700,
height=700)
fig = go.Figure(fig)
fig.show()
The pie charts above illustrate that Python and SQL are the must-have languages for any data professional. Other languages depend on the company's requirements; Scala is also gaining popularity because of Apache Spark.
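Since the skill flags are 0/1, the share of postings requiring each language is just the column mean; a minimal sketch on made-up flags:

```python
import pandas as pd

# Invented 0/1 skill flags for four postings.
df = pd.DataFrame({
    'Python': [1, 1, 1, 0],
    'SQL':    [1, 1, 0, 0],
    'Scala':  [0, 1, 0, 0],
})

# With 0/1 flags, the column mean is the fraction of postings
# that mention the skill -- the same shares the pies display.
rates = df.mean().sort_values(ascending=False)
print(rates['Python'])  # 0.75
```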
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'},{'type':'domain'}]]
fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles=['Tensorflow', 'Pytorch', 'Keras'])
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Tensorflow'].value_counts().reindex([0, 1], fill_value=0),
                     name='Tensorflow', marker_colors=['#550000','#00FFFF']), 1, 1)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Pytorch'].value_counts().reindex([0, 1], fill_value=0), name='Pytorch'), 1, 2)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Keras'].value_counts().reindex([0, 1], fill_value=0), name='Keras'), 2, 1)
fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
marker=dict(line=dict(color='#000000', width=2)))
fig.update(layout_title_text='DL Framework Requirements',
layout_showlegend=True)
fig.update_layout(autosize=False,
width=800,
height=800)
fig = go.Figure(fig)
fig.show()
Most companies require that you know TensorFlow and its higher-level API, Keras. TensorFlow is more popular than PyTorch because of its deployment tooling; nevertheless, PyTorch is also popular for its ease of use.
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'},{'type':'domain'}]]
fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles=['Excel', 'Tableau', 'PowerBI'])
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Excel'].value_counts().reindex([0, 1], fill_value=0),
                     name='Excel', marker_colors=['#550000','#00FFFF']), 1, 1)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Tableau'].value_counts().reindex([0, 1], fill_value=0), name='Tableau'), 1, 2)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['PowerBI'].value_counts().reindex([0, 1], fill_value=0),
                     name='PowerBI'), 2, 1)
fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
marker=dict(line=dict(color='#000000', width=2)))
fig.update(layout_title_text='BI Tool Requirements',
layout_showlegend=True)
fig.update_layout(autosize=False,
width=800,
height=800)
fig = go.Figure(fig)
fig.show()
In terms of visualization tools, Excel is still popular, but Tableau is a more powerful tool that is easy to use and doesn't require any coding skills.
specs = [[{'type':'domain'}, {'type':'domain'}]]
fig = make_subplots(rows=1, cols=2, specs=specs, subplot_titles=['AWS', 'Spark'])
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['AWS'].value_counts().reindex([0, 1], fill_value=0),
                     name='AWS', marker_colors=['#550000','#00FFFF']), 1, 1)
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Spark'].value_counts().reindex([0, 1], fill_value=0), name='Spark'), 1, 2)
fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
marker=dict(line=dict(color='#000000', width=2)))
fig.update(layout_title_text='AWS & Spark Requirements',
layout_showlegend=True)
fig = go.Figure(fig)
fig.show()
AWS and Spark are important technologies to know for better job prospects, especially at larger companies.
columns = ['Python', 'R', 'AWS', 'Scala', 'Excel', 'Tableau', 'PowerBI', 'Spark', 'SQL', 'Pytorch', 'Tensorflow', 'Keras']
# Each flag column is 0/1, so the column sum is the number of postings mentioning that tool.
count = data[columns].sum()
fig = go.Figure(data=[go.Bar(x=columns, y=count)])
fig.update_traces(marker_color='darkblue', marker_line_color='rgb(0,255,255)',
marker_line_width=1.5, opacity=.8)
fig.update_layout(xaxis={'categoryorder':'total descending'},
title="Number of times Tool & Technologies Mentioned in Job Descriptions",
xaxis_title="Tools & Technologies",
yaxis_title="Count(532)")
fig.update_xaxes(tickfont=dict(family='Rockwell', color='crimson', size=14))
fig.update_yaxes(tickfont=dict(family='Rockwell', color='darkblue', size=14))
fig.show()
This bar graph shows which tools you should focus on learning first. One more note: Keras is last here, but that doesn't mean it isn't required. Many companies simply don't list it in the job description because they expect you to already know such basic tools for model development.
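The Keras point can be probed directly with a co-occurrence check: among postings that do mention Keras, see how many also mention TensorFlow. A sketch on invented flags:

```python
import pandas as pd

# Invented 0/1 mention flags for four postings.
df = pd.DataFrame({
    'Tensorflow': [1, 1, 0, 1],
    'Keras':      [1, 0, 0, 0],
})

# Among postings that mention Keras, the share that also
# mention Tensorflow -- a quick co-occurrence check.
keras_jobs = df[df['Keras'] == 1]
print(keras_jobs['Tensorflow'].mean())  # 1.0
```

Run against the real dataset, this would show whether Keras is ever requested without TensorFlow.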