SysArmy Surveys: Analysis of Argentine techies' salary evolution (2014 - 2017)

Datasets

The datasets used here have been published on the SysArmy blog since the surveys began. Each survey's columns differ, mostly through additions over time, but some names and their order have changed too.

In order to normalize the different structures, the next section includes a couple of functions that facilitate this process. Below is the list of datasets used at the moment.

Salaries

Period Path URL
2014.02 data/2014.02.csv http://sysarmy.com.ar/blog/wp-content/uploads/2015/01/sysarmy-resultados-encuesta-remuneracion-salarial.csv
2015.01 data/2015.01.csv http://sysarmy.com.ar/blog/wp-content/uploads/2015/09/sysarmy-encuesta-salarial-julio-2015.csv
2016.01 data/2016.01.csv https://sysarmy.com.ar/blog/wp-content/uploads/2016/02/sysarmy-encuesta-salarial-enero-2016.csv
2016.02 data/2016.02/ https://sysarmy.com.ar/blog/wp-content/uploads/2016/08/encuesta-sueldos-sysarmy-openqube.zip
2017.01 data/2017.01/ https://docs.google.com/spreadsheets/d/1diZG7ruHJ4OgO2-tbnO-ZnhjxjOnYjMVSxl7I-4LT54/edit#gid=1424918437
2017.02 - -

The datasets detailed above were downloaded and included "as is" in the data/ folder of this repository. The idea of this analysis is not to preprocess anything and just normalize the data on the fly; this is of course possible because the datasets are not very big.
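
For illustration only, here is a minimal sketch of this on-the-fly normalization idea: each period's raw columns are mapped into a common schema. The column names below are taken from the 2014.02 survey; the actual mapping is implemented by getPeriodDF in the Data preparation section.

import pandas as pd

# map the source columns of one survey edition into the common destination schema
common_schema = ['salary', 'location', 'role']
mapping_2014_02 = {
    'Salario bruto mensual (AR$)': 'salary',
    'Trabajo en': 'location',
    'Trabajo de': 'role'}

raw = pd.DataFrame({
    'Salario bruto mensual (AR$)': [21000.0],
    'Trabajo en': ['Ciudad Autónoma de Buenos Aires'],
    'Trabajo de': ['SysAdmin']})

# rename and reorder the columns so every period ends up with the same structure
normalized = raw.rename(columns=mapping_2014_02)[common_schema]
print(normalized)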

Argentine Peso AR\$ to US\$

In one of the analyses (GDP) we compare salaries over time using the dollar as the reference currency. We do this in order to understand the numbers within the global economy, putting aside the effects of inflation. Regarding inflation, the official figures are not good enough for a proper analysis: between 2014 and 2015 the methodology changed, and there was an entire year (April 2016 - April 2017) without measurements.

Period Path URL
5Y data/usdars-cur.json https://www.bloomberg.com/markets/api/bulk-time-series/price/USDARS%3ACUR?timeFrame=5_YEAR

Data preparation

In [1]:
import pandas as pd
import numpy as np
import re
from matplotlib import cm
from functools import reduce
from IPython.display import display, Markdown as md

def getPeriodDF(period, publish_date, csv_path, columns_src, columns_dst, transforms = []):
    # given a csv_path, extract the src columns into dst columns, tag rows with the period name and publish date, then apply the given transform functions
    df_src = pd.read_csv(csv_path)
    df_dst = pd.DataFrame(columns=['period', 'publish_date']+columns_dst)
    df_dst[columns_dst] = df_src[columns_src]
    df_dst['period'] = period
    df_dst['publish_date'] = pd.to_datetime(publish_date)
    # apply custom transformations
    for transform in transforms:
        df_src, df_dst = transform(df_src, df_dst)
    return df_dst

def transformNormalize(df_src, df_dst):
    # cast the salary column to float; some datasets get wrongly parsed due to messy user input
    df_dst['salary'] = df_dst['salary'].astype(float)
    df_dst['age'] = df_dst['age'].astype(str)
    # normalize the stack column: strip garbage and split free-text entries into lists
    re_separator = r' *, *| *; *| *\. *| *\| *| *\/ *| *- *'
    re_garbage = r'"""|\*|nan|='
    df_dst['stack'] = df_dst['stack'].apply(lambda s: re.split(re_separator, re.sub(re_garbage, '', str(s)).strip().lower()))
    # munge the dataset, removing entries that we consider unrealistic for our analysis
    munging_mask = (df_dst.salary > 5000) & (df_dst.salary < 300000)
    df_src = df_src.loc[munging_mask]
    df_dst = df_dst.loc[munging_mask]
    return df_src, df_dst

def transformNet2Brute(df_src, df_dst):
    # determine whether the reported income is net ('Neto') or gross ('Bruto')
    mask = np.array(df_src['Bruto o neto?'].str.contains('Neto'), dtype = bool)
    net_factor = np.array([1.0] * len(df_dst))
    # scale net salaries up to gross (factor 1/0.83), given Argentina's social security contributions
    net_factor[ mask ] = 1/0.83
    df_dst['salary'] = net_factor * df_dst['salary']
    return df_src, df_dst

src_common_cols = [
    'Tengo',
    'Años de experiencia',
    'Años en el puesto actual',
    'Trabajo de',
    'Tecnologías que utilizás']

dst_cols = [
    'salary',
    'location',
    'age',
    'yoe',
    'yip',
    'role',
    'stack']

df1 = getPeriodDF('2014.02', '2015-01-01', '../data/2014.02.csv', 
                  ['Salario bruto mensual (AR$)', 'Trabajo en'] + src_common_cols, dst_cols, 
                  [transformNormalize])

df2 = getPeriodDF('2015.01', '2015-09-01', '../data/2015.01.csv', 
                  ['Salario bruto mensual (AR$)', 'Trabajo en'] + src_common_cols, dst_cols, 
                  [transformNormalize])

df3 = getPeriodDF('2016.01', '2016-02-01', '../data/2016.01.csv', 
                  ['Salario mensual (AR$)', 'Trabajo en'] + src_common_cols, dst_cols, 
                  [transformNormalize, transformNet2Brute])

df4 = getPeriodDF('2016.02', '2016-08-01', '../data/2016.02/argentina.csv', 
                  ['Salario mensual (en tu moneda local)', 'Argentina'] + src_common_cols, dst_cols, 
                  [transformNormalize, transformNet2Brute])

df5 = getPeriodDF('2017.01', '2017-02-01', '../data/2017.01/argentina.csv', 
                  ['Salario mensual (en tu moneda local)', 'Argentina'] + src_common_cols, dst_cols, 
                  [transformNormalize, transformNet2Brute])

# compute the union of all the datasets as a single pandas dataframe
df = pd.concat([df1, df2, df3, df4, df5])

# extract the list of periods
periods = df.period.unique()

# extract the list of roles
roles = df.role.unique()

# generate a list of colors by period for easy visualization on next figures
colors = cm.rainbow(np.linspace(0, 1, len(periods)))

# extract the list of stacks along with their counts
all_stacks = reduce(lambda res, it: np.concatenate((res, it)), df['stack'], np.array([]))
data = np.transpose(np.unique(all_stacks, return_counts=True))
dfs = pd.DataFrame(data, columns=['stack', 'count'])
dfs['count'] = dfs['count'].astype(int)

# extract the list of roles along with their counts
data = np.transpose(np.unique(df['role'].apply(lambda s: s.strip()), return_counts=True))
dfr = pd.DataFrame(data, columns=['role', 'count'])
dfr['count'] = dfr['count'].astype(int)

# create a dataframe with the most frequent stack names and mask of rows in main df matching it
dfs = dfs.query('count > 100 & stack != ""').sort_values(by=['count'], ascending=[0])
dfs['df_mask'] = dfs['stack'].apply(lambda stack: np.array(df['stack'].apply(lambda s: np.isin(stack, s))))

# create a dataframe with the most frequent role names and mask of rows in main df matching it
dfr = dfr.query('count > 100 & role != ""').sort_values(by=['count'], ascending=[0])
dfr['df_mask'] = dfr['role'].apply(lambda role: np.array(df['role'] == role))

# show results
display(md('# Normalized data'))
display(md('## Survey entries'))
display(df.head())
# we take only the top stacks/roles because the fields were free text, and this resulted
# in many outliers, i.e. random text without continuity for a proper time-series analysis
display(md('## Top stacks: names, frequency and entries mask'))
display(dfs.head())
display(md('## Top roles: names, frequency and entries mask'))
display(dfr.head())

Normalized data

Survey entries

period publish_date salary location age yoe yip role stack
0 2014.02 2015-01-01 21000.00 Ciudad Autónoma de Buenos Aires 27 - 30 5 - 7 Menos de un año SysAdmin [linux, cloud]
1 2014.02 2015-01-01 10000.00 Ciudad Autónoma de Buenos Aires 35 - 40 10+ 2 - 4 SysAdmin [linux]
2 2014.02 2015-01-01 16000.00 Córdoba 27 - 30 3 - 5 1 - 2 DevOps [linux, cloud]
3 2014.02 2015-01-01 23771.13 Ciudad Autónoma de Buenos Aires 27 - 30 5 - 7 1 - 2 SysAdmin [linux]
4 2014.02 2015-01-01 16000.00 Ciudad Autónoma de Buenos Aires 30 - 33 10+ 2 - 4 DevOps [linux]

Top stacks: names, frequency and entries mask

stack count df_mask
664 windows 4499 [False, False, False, False, False, False, Fal...
300 linux 3480 [True, True, True, True, True, False, False, F...
645 vmware 1882 [False, False, False, False, False, False, Fal...
120 containers 621 [False, False, False, False, False, False, Fal...
148 docker 621 [False, False, False, False, False, False, Fal...

Top roles: names, frequency and entries mask

role count df_mask
388 Developer 4457 [False, False, False, False, False, False, Fal...
1100 SysAdmin / DevOps 2266 [False, False, False, False, False, False, Fal...
541 HelpDesk 1865 [False, False, False, False, False, False, Fal...
768 Networking 938 [False, False, False, False, False, False, Tru...
898 QA / Tester 542 [False, False, False, False, False, False, Fal...
In [2]:
import scipy.stats as sps
import matplotlib.pyplot as plt

def getPlotSalarySeries(currencySeries):
    # express incomes in thousands of AR$ (rounded to one decimal)
    s = currencySeries.apply(lambda x: float('{0:.2f}'.format(round(x/1000, 1))))
    mu, sigma = np.mean(s), np.std(s)
    median = np.median(s)
    # support grid of +/- 3 sigma around the mean, used to plot density curves
    t = np.linspace(mu - 3*sigma, mu + 3*sigma, 1000)
    # index of dispersion (variance-to-mean ratio)
    d = sigma**2/mu
    return mu, sigma, t, d, median, s


dfrt = pd.DataFrame(columns=['role', 'count', 'mu', 'sigma', 'period', 'publish_date'])

i = 0
for period in periods:
    for j, row in dfr.iterrows():
        dfrs = df.loc[ row['df_mask'] & (df['period'] == period) ]
        if len(dfrs) == 0: continue
        mu, sigma, t, d, m, s = getPlotSalarySeries(dfrs['salary'])
        dfrt.loc[i] = [row['role'], len(dfrs), mu, sigma, period, np.max(dfrs['publish_date'])]
        i = i+1

# the most frequent roles in the datasets, ordered by role and period (descending)
dfrt = dfrt.query('count > 1').sort_values(by=['role', 'period'], ascending=[0, 0])
display(md('## Top roles: numbers by period'))
dfrt.head()

Top roles: numbers by period

Out[2]:
role count mu sigma period publish_date
45 SysAdmin / DevOps 640.0 28.695000 14.261597 2017.01 2017-02-01
34 SysAdmin / DevOps 636.0 24.607075 11.653473 2016.02 2016-08-01
23 SysAdmin / DevOps 990.0 20.572828 13.491941 2016.01 2016-02-01
16 SysAdmin 248.0 18.471371 8.054694 2015.01 2015-09-01
5 SysAdmin 197.0 16.022335 6.551042 2014.02 2015-01-01
In [3]:
import matplotlib.patches as mpatches

%matplotlib inline
plt.style.use(['classic', 'ggplot'])
plt.rcParams.update({'figure.figsize': (10.0, 6.0), 'font.size': 8})

trending_roles = np.unique(dfrt['role'])
roles_colors = cm.rainbow(np.linspace(0, 1, len(trending_roles)))
publish_dates = np.unique(dfrt['publish_date'])

y = []
for i, role in enumerate(trending_roles):
    role_y = []
    for j, date in enumerate(publish_dates):
        dfrs = dfrt.query('role == @role & publish_date == @date')
        count = 0 if len(dfrs['count']) == 0 else dfrs['count'].values[0]
        role_y.append(count)
    y.append(role_y)

# define data points for stackplot
x = publish_dates
y = np.row_stack(y)
percent = np.divide(y, y.sum(axis=0).astype(float)) * 100

fig = plt.figure()
ax = fig.add_subplot(111)
ax.stackplot(x, percent, colors=roles_colors)
ax.set_title('Level of participation in surveys by role across time')
ax.set_ylabel('Percent (%)')

# creating the legend manually
plt.legend([mpatches.Patch(color=roles_colors[i]) for i, role in enumerate(trending_roles)],
           trending_roles, 
           bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

    
display(md('## Level of participation in surveys by role'))
display(md("""
This initial graph aims to help understand who participated in the surveys in each period.

It is very important to understand the universe of samples, which may be only a segment of the entire tech community in Argentina. There are reasons to think so.
A valid one is that SysArmy may have started mainly as a group of sysadmins/security people (perhaps judging by the topics they post about), and later spread to the developer community.

In order to test this hypothesis, we plot the percentage of participation by role (top roles only).
"""))
plt.show()
display(md("""
The results show that Developers were the majority group from the beginning.  This goes against the initial hypothesis, although neither the hypothesis nor the conclusions below are entirely off:
1. The Developers group has clearly grown over time, starting in September 2015.
2. HelpDesk, however, has decreased since February 2016.
3. "Storage / Backup" fell and then disappeared.
4. "Otro" (other) rose at the beginning and then disappeared, but it marks the tendency of the sample universe to become more diverse with respect to role.
5. The rest of the roles remain fairly consistent over time.
"""))

Level of participation in surveys by role

This initial graph aims to help understand who participated in the surveys in each period.

It is very important to understand the universe of samples, which may be only a segment of the entire tech community in Argentina. There are reasons to think so. A valid one is that SysArmy may have started mainly as a group of sysadmins/security people (perhaps judging by the topics they post about), and later spread to the developer community.

In order to test this hypothesis, we plot the percentage of participation by role (top roles only).

The results show that Developers were the majority group from the beginning. This goes against the initial hypothesis, although neither the hypothesis nor the conclusions below are entirely off:

  1. The Developers group has clearly grown over time, starting in September 2015.
  2. HelpDesk, however, has decreased since February 2016.
  3. "Storage / Backup" fell and then disappeared.
  4. "Otro" (other) rose at the beginning and then disappeared, but it marks the tendency of the sample universe to become more diverse with respect to role.
  5. The rest of the roles remain fairly consistent over time.

Histogram of Incomes by period (Argentina)

In [18]:
%matplotlib inline
plt.rcParams.update({'figure.figsize': (10.0, 18.0), 'font.size': 8})
fig, ax = plt.subplots(len(periods), 1)
for i, period in enumerate(periods):
    dft = df[df['period'] == period]
    mu, sigma, t, d, m, s = getPlotSalarySeries(dft.salary)
    n, bins, patches = ax[i].hist(s, 160, normed=1, alpha=0.75, color=colors[i])
    ax[i].set_ylabel('density')
    ax[i].set_title(period + ' Histogram of Income')
    ax[i].text(40, .025, '$\mu=%.2f,\ \sigma=%.2f,\ n=%d$' % (mu, sigma, len(s)))
    ax[i].axis([0, 160, 0, 0.1])
    ax[i].grid(True)
ax[len(periods)-1].set_xlabel('income [1000 AR$/mo]')
fig.canvas.draw()

display(md('## Density by range of salary'))
display(md("""
The following figures show the density by range of income for the universe (Argentina samples) across time.
The histograms help understand the probability density functions behind the data, which present some positive skew
and a tendency to shift to the right over time (inflation adjustments).

It is also noticeable that the level of dispersion increases from 2015 onwards, something we will analyze next.
"""))
plt.show()

Density by range of salary

The following figures show the density by range of income for the universe (Argentina samples) across time. The histograms help understand the probability density functions behind the data, which present some positive skew and a tendency to shift to the right over time (inflation adjustments).

It is also noticeable that the level of dispersion increases from 2015 onwards, something we will analyze next.


Mean and Standard deviation evolution

The following graph represents the income mean $\mu$ across time, along with the standard deviation $\sigma$. This is an interesting time series because it describes the rise of salaries in nominal currency terms (not purchasing power). The tendency is growth, which is certainly good, reflecting a correction of the figures over time due to inflation.

The next question would be: are these corrections enough to at least maintain some stability in purchasing power? To answer this, we would need information about inflation levels. As described in the Datasets section, there is no good inflation data available for Argentina. A workaround that helps answer this question from a different, but still valid, perspective is to base the analysis on the dollar and look at the evolution of income in that currency (continue reading).

In [5]:
%matplotlib inline
plt.rcParams.update({'figure.figsize': (10.0, 6.0), 'font.size': 8})
mus, sigmas, dates = [], [], []
for period in periods:
    dft = df[df['period'].str.contains(period)]
    mu, sigma, t, d, m, s = getPlotSalarySeries(dft.salary)
    mus.append(mu)
    sigmas.append(sigma)
    dates.append(np.max(dft['publish_date']))
    
plt.plot(dates, mus, '+', label='$\mu$', ls='-', c='b')
plt.plot(dates, sigmas, '+', label='$\sigma$', ls=':', c='g')
plt.xlabel('date'); plt.ylabel('income [1000 AR$/mo]')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[5]:
<matplotlib.legend.Legend at 0x111008ef0>

Currency evolution (Argentine Peso)

Continuing with the previous analysis, we obtained the AR\$ to US\$ exchange rate over the last 5 years. The next step is to sample from this series just a few reference points, one for each period our survey datasets correspond to.

So below we plot the peso evolution in green and, overlaid in blue, the sampled values of this series for the periods of interest.

In [6]:
import json
import datetime
with open('../data/usdars-cur.json') as data_file:    
    data = json.load(data_file)  
    df_currency = pd.DataFrame(data[0]['price'])
    df_currency.date = pd.to_datetime(df_currency.date)

%matplotlib inline
plt.rcParams.update({'figure.figsize': (10.0, 6.0), 'font.size': 8})
delta = datetime.timedelta(days=15)
currencies = []
for date in dates:
    currency_series = df_currency[(df_currency.date >= date-delta) & (df_currency.date <= date+delta)]
    currencies.append(float(currency_series.value[:1]))

plt.xlabel('date'); plt.ylabel('1US\$ [AR\$]')
plt.plot(dates, currencies, '+', label='sampled currency for the chosen periods', ls='-', c='b')
plt.plot(df_currency.date, df_currency.value, label='by date currency from Bloomberg.com', ls='-', c='g')
plt.legend()
Out[6]:
<matplotlib.legend.Legend at 0x113a40860>

Monthly nominal GDP

Once we have the peso value in dollars, we calculate the mean salary $\mu$ over time in dollars: this monthly nominal GDP figure, in US dollars, tells us the purchasing capacity of Argentine salaries in nominal terms. To compute the values, we simply divide each $\mu$ in pesos by the exchange rate at that moment.

$$ \sum_i{nom(GDP_i)} = \sum_i{\mu_i US\$} = \sum_i{\frac {\mu_i AR\$}{exchrate_i{\frac {AR\$}{US\$}}}} $$
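
As a quick sanity check of the formula, here is a minimal sketch with illustrative numbers (not taken from the survey data):

# illustrative values only: a hypothetical mean salary of 30,000 AR$/mo at an
# exchange rate of 15 AR$/US$ gives a nominal mean of 2,000 US$/mo
mu_ars = 30000.0   # hypothetical mean monthly salary in AR$
exch_rate = 15.0   # hypothetical exchange rate in AR$ per US$
mu_usd = mu_ars / exch_rate
print(mu_usd)      # 2000.0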

The results can be compared with World Bank data for Argentina provided here:

http://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD?end=2016&locations=AR&start=2010

In [7]:
%matplotlib inline
plt.rcParams.update({'figure.figsize': (13.0, 5.0), 'font.size': 8})

fig, ax = plt.subplots(1, 2)
ax[0].set_xlabel('date'); ax[0].set_ylabel('income [1000/mo]')
ax[0].plot(dates, mus, '+', label='$\mu$ AR\$', ls='-', c='b')
ax[0].legend()

ax[1].set_xlabel('date'); ax[1].set_ylabel('income [1000/mo]')
ax[1].plot(dates, np.divide(mus, currencies), '+', label='$\mu$ US\$', ls='-', c='r')
ax[1].legend()
fig.canvas.draw()

display(md("""
### Evolution of mean income in AR$ (left), in US\$ (right)

As the left figure shows, the salary in pesos grows continuously.
There is, however, an interesting plateau between the end of 2015 and the beginning of 2016.
The two periods occurred before and after the Argentine presidential elections of October/December 2015.
It can hardly be a coincidence that the numbers were practically frozen, perhaps due to the uncertainty caused by such a national event.

There is another clear insight that we can identify in the figure on the right side.
Sadly, the income in dollars fell dramatically right after the elections.  This was due to the removal of the "cepo cambiario",
which came along with a drastic devaluation of the local currency: http://www.bbc.com/mundo/noticias/2015/12/151217_argentina_fin_cepo_devaluacion_irm
"""))
plt.show()

Evolution of mean income in AR$ (left), in US\$ (right)

As the left figure shows, the salary in pesos grows continuously. There is, however, an interesting plateau between the end of 2015 and the beginning of 2016. The two periods occurred before and after the Argentine presidential elections of October/December 2015. It can hardly be a coincidence that the numbers were practically frozen, perhaps due to the uncertainty caused by such a national event.

There is another clear insight that we can identify in the figure on the right side. Sadly, the income in dollars fell dramatically right after the elections. This was due to the removal of the "cepo cambiario", which came along with a drastic devaluation of the local currency: http://www.bbc.com/mundo/noticias/2015/12/151217_argentina_fin_cepo_devaluacion_irm


Distribution of Income by period (Argentina)

Assuming Normal distribution

Despite the fact that the histograms suggest distributions closer to a $\Gamma$ (Gamma) density, with a marked positive skewness, for simplicity we will use the $N$ (Gaussian) distribution.

  • Probability density function
$$ \large {\displaystyle f(x\;|\;\mu ,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\;e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}}} $$
  • Gaussian density parameters (estimated from the sample; a quick NumPy check follows below)
$$ \large \begin{align} \mu &= \frac{1}{N} \sum_i x_i \\ \sigma^2 &= \frac{1}{N} \sum_i (x_i - \mu)^2 \end{align} $$
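
As a minimal check of these estimators with NumPy (synthetic incomes, for illustration only):

import numpy as np

x = np.array([16.0, 18.0, 21.0, 24.0, 30.0])  # synthetic incomes [1000 AR$/mo]
mu = x.mean()                                  # (1/N) * sum(x_i)
sigma2 = np.mean((x - mu) ** 2)                # (1/N) * sum((x_i - mu)^2), i.e. np.var(x)
print(mu, np.sqrt(sigma2))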

The following figure shows, for the full dataset by period, the mean ($\mu$) and the standard deviation ($\sigma$). The shift of the curves to the right across time is noticeable, indicating that salaries somehow followed the inflation tendency.

A new insight can be distinguished from this graph: the standard deviation of the salaries (the width of the Gaussian shape) starts increasing from 2016. Once again, after the presidential elections the course of the economy changed, and with it the rules and behavior of salaries. One interpretation is that incomes are now more spread out, meaning that for the same type of work and experience there are, in general, people with very different salaries (wider ranges).

In [8]:
%matplotlib inline
plt.rcParams.update({'figure.figsize': (10.0, 6.0), 'font.size': 8})
plt.xlim(0,100)
for i, period in enumerate(periods):
    dft = df[df.period.str.contains(period)]
    mu, sigma, t, d, m, s = getPlotSalarySeries(dft.salary)
    plt.plot(t, sps.norm.pdf(t, mu, sigma), label=period + '\n$\mu=%.2f,\ \sigma=%.2f,\ n=%d$\n' % (mu, sigma, len(s)), ls='-', c=colors[i])
    plt.axvline(x=mu, ls=':', c=colors[i])
plt.xlabel('income [1000 AR$/mo]'); plt.ylabel('density, $N$')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[8]:
<matplotlib.legend.Legend at 0x1148dbd30>

Interactive dynamic dashboard

The next figure is dynamic. It has controls that let you filter the data and refresh the visualization, comparing different segments (filters) of the data against the average by dataset, using the probability density functions as curves.

NOTE: in order to make this work, you may need to run this notebook locally using Jupyter Notebook.

In [16]:
from IPython.display import clear_output
import ipywidgets as widgets

filters = ['role', 'yoe', 'yip', 'location', 'age']
i_filters = [{'name': r, 'i': i} for i, r in enumerate(filters)]
filters_opts = list(map(lambda r: list(df[r].unique()), filters))
filters_checkboxes = list(map(lambda r: widgets.Checkbox(value=False, description='Use '+r, name='use_'+r, disabled=False), filters))
filters_dropdowns = list(map(lambda r: widgets.Dropdown(options=filters_opts[r['i']], value=filters_opts[r['i']][0], name=r['name'], description=r['name']+':', disabled=not filters_checkboxes[r['i']].value), i_filters))

%matplotlib notebook
plt.rcParams.update({'figure.figsize': (13.0, 6.0), 'font.size': 8})
fig, ax = plt.subplots()
ax.set_xlim(-20,100)
ax.set_ylim(0,0.08)
ax.set_xlabel('income [1000 AR$/mo]'); ax.set_ylabel('density, $N$')

def on_filters_change(change):
    for i in range(len(filters)):
        checkbox = filters_checkboxes[i]
        dropdown = filters_dropdowns[i]
        if change['owner'].name == checkbox.name:
            dropdown.disabled = not change['new']
            break
    plot_filtered_monthly_income()

def plot_filtered_monthly_income(change = None):
    plt.rcParams.update({'figure.figsize': (10.0, 8.0), 'font.size': 8})
    if not ax.lines:
        for i, period in enumerate(periods):
            dft = df[df.period.str.contains(period)]
            mu, sigma, t, d, m, s = getPlotSalarySeries(dft.salary)            
            ax.plot(t, sps.norm.pdf(t, mu, sigma), label=period, ls=':', c=colors[i])
            ax.plot(t, sps.norm.pdf(t, mu, sigma), label=period + ' $\mu=%.2f,\ \sigma=%.2f,\ n=%d$' % (mu, sigma, len(s)), ls='-', c=colors[i])
    else:
        for i, period in enumerate(periods):
            dft = df[df.period.str.contains(period)]
            for dropdown in filters_dropdowns:
                dft = dft if dropdown.disabled else dft[dft[dropdown.name].str.contains(dropdown.value)]
            mu, sigma, t, d, m, s = getPlotSalarySeries(dft.salary)
            line = ax.lines[2*i+1]
            if len(dft) > 2:
                line.set_xdata(t)
                line.set_ydata(sps.norm.pdf(t, mu, sigma))
                line.set_label(period + ' $\mu=%.2f,\ \sigma=%.2f,\ n=%d$' % (mu, sigma, len(s)))
            else:
                line.set_xdata([]), line.set_ydata([])
    title = 'Segment:' + ''.join([
        '' if dropdown.disabled else ' %s: %s' % (dropdown.name, dropdown.value)
            for dropdown in filters_dropdowns])
    ax.set_title(title)
    ax.legend(bbox_to_anchor=(1.05, 1), loc=1, borderaxespad=0.)
    ax.grid(True)
    fig.canvas.draw()
    plt.ion()
    plt.show()

boxes = []
for i in range(len(filters_dropdowns)):
    filters_checkboxes[i].observe(on_filters_change, names='value')
    filters_dropdowns[i].observe(plot_filtered_monthly_income, names='value')
    boxes.append(widgets.Box([filters_checkboxes[i], filters_dropdowns[i]]))

display(widgets.Box(boxes))
plot_filtered_monthly_income()

Comparative analysis by technology

Below, we rank technologies by frequency. We exclude technologies with fewer than 100 samples in total because, in most cases, they are isolated and do not follow clear trends.

From the resulting table we will plot the results in two ways:

  1. First, as a time-series comparison of the technologies by period, indicating the mean salary $\mu$.
  2. Secondly, as a scatter plot, with the salary mean $\mu$ on the X axis, the standard deviation $\sigma$ on the Y axis, and the number of samples as the size of the dot.
In [10]:
dfst = pd.DataFrame(columns=['stack', 'count', 'mu', 'sigma', 'period', 'publish_date'])

i = 0
for period in periods:
    for j, row in dfs.iterrows():
        dfss = df.loc[ row['df_mask'] & (df['period'] == period) ]
        if len(dfss) == 0: continue
        mu, sigma, t, d, m, s = getPlotSalarySeries(dfss['salary'])
        dfst.loc[i] = [row['stack'], len(dfss), mu, sigma, period, np.max(dfss['publish_date'])]
        i = i+1

# the most frequent technologies (stacks) in the datasets, ordered by stack and period (descending)
dfst = dfst.query('count >= 1').sort_values(by=['stack', 'period'], ascending=[0, 0])
display(md('## Top stacks: numbers by period'))
dfst.head()

Top stacks: numbers by period

Out[10]:
stack count mu sigma period publish_date
48 windows 2844.0 25.242827 13.018835 2017.01 2017-02-01
34 windows 512.0 23.590430 9.715138 2016.02 2016-08-01
19 windows 831.0 19.829122 13.524452 2016.01 2016-02-01
9 windows 177.0 18.333898 11.019761 2015.01 2015-09-01
0 windows 135.0 14.777037 5.476386 2014.02 2015-01-01
In [11]:
%matplotlib inline
plt.rcParams.update({'figure.figsize': (10.0, 6.0), 'font.size': 8})

trending_stacks = np.unique(dfst['stack'])
stacks_colors = cm.rainbow(np.linspace(0, 1, len(trending_stacks)))
for i, stack in enumerate(trending_stacks):
    dfp = dfst.query('stack == @stack')
    plt.plot(dfp['publish_date'], dfp['mu'], '+', label='$\mu$ AR\$ %s' % stack, ls='-', c=stacks_colors[i])
    plt.xlabel('date'); plt.ylabel('$\mu$')
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
display(md('## Mean salary of the most frequent technologies (stacks) through time'))

Mean salary of the most frequent technologies (stacks) through time

In [12]:
%matplotlib inline
plt.rcParams.update({'figure.figsize': (13.0, 30.0), 'font.size': 10})

fig, ax = plt.subplots(len(periods), 1)
for i, period in enumerate(periods):
    ## plot only the top ranked technologies in terms of salary for the period sample
    dftss = dfst.query("@period == period").sort_values(by=['mu', 'sigma'], ascending=[0, 0])[:30]
    ax[i].scatter(
        dftss['mu'], dftss['sigma'], marker='o', c=dftss.index, s=100*dftss['count']**(1/2), alpha=0.3,
        cmap=plt.get_cmap('Spectral'))

    for label, x, y in zip(dftss['stack'], dftss['mu'], dftss['sigma']):
        ax[i].annotate(
            label,
            xy=(x, y), xytext=(-30, 30),
            textcoords='offset points', ha='right', va='bottom',
            bbox=dict(boxstyle='round,pad=0.5', fc='white', alpha=0.5),
            arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0'))

    ax[i].set_xlabel('$\mu$'); ax[i].set_ylabel('$\sigma$')
    ax[i].set_title('%s - Technologies by salary, standard deviation and num of samples.' % period)
    fig.canvas.draw()
display(md('## Scatter of technologies by mean, dispersion, frequency and time'))
display(md('_NOTE: Drawing conclusions from these results is still pending (WIP)._'))
display(md('_NOTE: The circle size represents the frequency (number of entries) per stack; the marker area scales with the square root of the count._'))

Scatter of technologies by mean, dispersion, frequency and time

NOTE: Drawing conclusions from these results is still pending (WIP).

NOTE: The circle size represents the frequency (number of entries) per stack; the marker area scales with the square root of the count.