The goal of this tutorial is to present the features of the European labor market in Python consistently and with commentary. From a programming side, the main objective was working with data. From an economic side, the main objective was implementing an economic model and using programming to demonstrate features otherwise more difficult to demonstrate.
This tutorial will focus mainly on the learnings from a programming perspective. However, this introduction will set up the background from an economic side to orientate the reader. There will be additional economic commentary throughout the sections.
A simple model of the labor market that is used in this tutorial catagorizes three states a working-age person can be in:
Using three states you can analysze labor market indicators:
The flows of people from one state to another are called transitions. Transition rates are calculated from using data that shows the beginning and end value of people in states, and then is calculated from the change.
For example, UE transition rates represent the percentage of people unemployed at the beginning of the period that transition to being employed.
It can be used to approximate how likely it is for an unemployed person to find a job (Job finding probablity).
This will be elaborated upon further in the tutorial as it specific to certain sections.
In order to compare the different dynamics of countries related to transition rates and unemployment rates, data needs to be collected systematically and in a reliable manner. All data used in this tutorial is extracted from Eurostat which is an adminitrative branch of the European Commission located in Luxembourg. Its responsability is to provide statistical information to the institutions of the European Union and to encourage the harmonisation of statistical methods in order to ease comparison between data. In this section, we will discuss how Eurostat gather data and the degree of relability of its operations. Eurostat publishes its statistical database online for free on its website.
The data that will interest us in this tutorial are the one related to the European Labor market. The European Labor Force Survey is a survey conducted by Eurostat in order to find those data. The latter are obtained by interviewing a large sample of individuals directly. This data collection takes place over on a monthly, quarterly and annually basis. The European Labor Force Survey collects data by 4 manners:
The overall accuracy of these methods have been proved to be high. Retrospectively, the results have been found to lay in a 95% confidence interval. For more information related to how Eurostat collects its data, you can consult this page.
In Eurostat, you are able to download a .tsv file. Tsv files are similar to csv but use tabs to separate data instead of commas like in csv files. Or, you can use the Pyrostat API for python. However, the documentation at the moment of creating this tutorial is not clear enough to present in a clear way. It is based upon a json and unicode REST api from the eurostat page, but is beyond the scope of this tutorial. For more information see eurostat web services and pyrostat.
The data used for this tutorial comes from the Eurostat website. In particular, the data set (sdg_08_30) with information here.
Therefore, the rest of the tutorial will focus on data import, cleaning and analysis starting with the provided Excel files from this class.
This section will give a basic introduction into how data works in python and programming. Data in general can be stored in several different formats that organize it. The basic formats have been introduced in past tutorials (lists, dictionaries, tuples). More advanced ones are:
They can be imported into python and stored in a data frame using the python extension pandas. The advantage of data frames is the wide variety of operations you can perform on them in python since it is a python object (source).
As a pandas dataframe is structed by rows and column, it is easier to select data compared to a list or a dictionary. You can also easily filter by column or row in order to derive conclusions or structure the data set for analysis. You can also join different pandas datasets as well as clean data easier than if you were working with another data type. These examples are elaborated upon below.
It is often easy to import pandas as pd such so it easier to call it later on.
import pandas as pd
Additonally, it is common to also import it with the package numpy
import numpy as np
Numpy is used for analysis and computing in python.
In order to convert structured files into a pandas data frame there are a variety of options that you can find here: List of pandas functionalities. Some examples are the ability to convert excel, json, html, csv, pickle, and sql. These are generally formatted as:
read_excel
read_pickle
read_json
read_html
etc.
There is also the option with pandas to create a dataframe either manually or by using existing lists or dictionaries.
For example, dataframes can be created from dictionaries.
import pandas as pd
#here, creating a food dictionary setting categories
food_categories={"apples":"fruit",
"oranges":"fruit",
"cucumber":"vegetable",
"spinach":"vegetable",
"beef":"meat",
"pepper":"vegetable",
"banana":"fruit"
}
#created dataframe setting category as the index
food_cat_df=pd.DataFrame(food_categories, index=["category"])
food_cat_df
Some other useful functions for dataframe in pandas will be used below, but are:
The pandas documentation provides a very good description of what you can do with dataframes and if there is something that interests you beyond the application of this tutorial, it may very likely be found at Pandas Documentation
Here, we will import all the packages that are used in the program. As packages have not been introduced in past tutorials, we will explain them in Python briefly.
First, if running a Python program on your local hardrive for example in spydir, it will be necessary to install a package on your computer or a virtual enviorment (venv or virtualenv). Virtual enviorments allow you to separate packages and versions for individual projects. For more information see section for creating virtual enviorments. Otherwise, Python will not be able to understand to what you are refering as it will not be defined. It is suggested you use the functionality pip which allows you to install from the python packaging index.
Once a packages has been installed, you may then import the function into a piece of code.
However, for the purpose of this tutorial, jupyter does not require you to pip install the following packages. For convience and clarity, they will be imported in this section, but we will reference the specific libraries as a note throughout the tutorial.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from pandas import ExcelWriter
from pandas import ExcelFile
import math
import statsmodels as sm
Our data is in three different local folders:
Total/
for all the working-age population.Female/
for only the working-age females.Male/
for only the working-age males.In this case it makes sense to specify which folder to use in advance. If we want to change from Total/
to Female/
later on, we can easily change the variable sex instead of having to search for the name of the folder in the whole code. The same holds true for the variable myfolder, if the location of the data changes later on, we can apply that change right here instead of searching for all the times we used that filepath.
In general, the easiest way to find locally saved files with python on a mac is to use a relative filepath. This means Python starts searching for the file from the location of the Python code we are currently working with. Here, the folder EuropesLM_Data/ is located within the same folder as the notebook. From there, Python proceeds by entering the Male/ folder and then accesses whatever we specify later on. To execute everything the way we do in this tutorial, you should place the jupyter notebook in the same folder as the file country_codes.xlsx and the folder EuropesLM_Data.
If on Windows, the easiest way is to use a full filepath, starting at the C:/ directory and specifiing all steps to go from there.
sex = 'Male'
myfolder = "EuropesLM_Data/" + sex + "/"
Almost all countries in the world have developed a system and the infrastructure to record and approximate the unemployment rate of the labor force as accurately as possible. In order to understand its significance we shall agree on a clear definition in more detail than what was provided earlier. There are numerous definitions since the methodology for calculating the unemployment rate often varies among countries. Different definitions of employment and unemployment, as well as different data sources are used but the consensus is the following: unemployed people are those who are willing and available to work, and who have actively sought work within the past four weeks. Students, prisoners, or for example handicapped people do not match the definition and are not considered as unemployed but out of the labor force.
To calculate the unemployment rate, the number of unemployed people is divided by the number of people in the labor force, which consists of all employed and unemployed people.
$\text{Unemployment rate} = \frac{\text{Number of Unemployed people}}{\text{Labor Force}}$
Before we plot the unemployment rate for different countries, we want to give a little introduction to plotting in general.
The most common library for plotting in python is Matplotlib
. We are going to introduce its most important functionalities with some basic examples in the following section. In the following parts, we will further specify the functionalities when we use them.
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import math
import pandas as pd
To show the basic functionality of Matplotlib, here we create a sample plot
plt.plot
contain the x and y coordinates, to refer to it later, we can add the argument labelplt.show
functionplt.plot([1, 5, 3, 4, 7, 9],[1, 3, 3, 5, 7, 9], label = "Random Graph")
plt.show()
plt.ylabel
, we can give names to the axesplt.style.use()
. There's a variety of styles available here
plt.title()
plt.legend
, the location can be specified using the loc argumentplt.plot()
again before calling plt.show
plt.plot([1, 3, 6],[1, 3, 6], label = "Linear")
plt.plot([1, 4, 5],[1, 3, 6], label = "Not so Linear")
plt.ylabel("Some Random Numbers", fontsize = 15)
plt.style.use("seaborn")
plt.title("This is a random plot", fontsize = 15)
plt.legend(loc = "center right", fontsize = 10)
plt.show()
After this short introduction, we wanet to start by importing and plotting the measured unemployment rates.
The easiest way to handle our data is a pandas.DataFrame
. Fortunately, Pandas as a useful function to read Excel files and create DataFrames out of them: pandas.read_excel
However, depending on the shape of the data, there can be difficulties when handling Excel files with pandas. In our case that includes:
pandas
classes, it is important to use the right indexing. In this case we are dealing with (a set of) time series, so we want the index to have the proper functionality. In particular, our data refers to a full calendar year quarter which can be represented using a period. This sort of information cannot be read from the Excel file as the reference to the quarters there is just a string.At the beginning, let's load the first sheet containing the actual data into a pandas.DataFrame
. In this case, the pd.read_excel
function gets five inputs:
# We store the filename in a variable, that makes it easier to change it later
aux_name = myfolder + "lfsq_urgan.xls"
# Next, read the first Excel sheet
aux_data = pd.read_excel(aux_name, skiprows=11, index_col=0, sheet_name='Data')
aux_data
After creating the Pandas.DataFrame
, we want to start cleaning the data. As mentioned, there are two ways in which Eurostat records missing data. The first is by setting the otherwise empty cells to ":". We can change those to NaN easily by using pandas.DataFrame.replace and giving the function the inputs:
It will continue to find all ":" in aux_data and replace them with NaN. Note that those are not just different strings but a method of numpy.
aux_data = aux_data.replace(':', np.NaN)
After the first sheet, we now want to create a second Pandas.DataFrame
for the flags sheet of our excel file. The second sheet has the same format and contains flags for bad data. We will use it to set all those cells in aux_data to np.nan
.
Besides the inputs we used when reading the first sheet, here the function also gets the input:
read_excel
how many rows to parse and pandas.DataFrame.shape passes the number of rows of our DataFrame aux_data as an integer.aux_flags = pd.read_excel(aux_name, skiprows=11, index_col=0, nrows=aux_data.shape[0], sheet_name='Flags and Footnotes')
To prepare the DataFrames for later indexing, we have to transpose them. This is possible with the DataFrame.T
function. What it does is basically to exchange columns and rows of the DataFrame.
aux_data = aux_data.T
aux_flags = aux_flags.T
To find the cells in aux_data that correspond to flags in aux_flags and set them to NaN, we will use a function. This function will later:
The iteration process works as follows:
In our case that will be the row and column number (integers) of the values in the passed DataFrame.values. That also explains why we convert the DataFrames into Numpy arrays - we can't adress a cell in a DataFrame by calling DataFrame[row_number, column_number]
but we can do the same with a Numpy array.
# Our function get's two inputs, both of which will be sheets of an Excel file
def flags_to_NaN(data,flags):
# Iterate over the array, return rows with their number
for row, l in enumerate(flags):
# Iterate over the row from the last line, return a single value with its column number
for column, value in enumerate(l):
# Check for strings
if type(value) is str:
data[row,column] = np.nan
return data
Now we apply our function to the DataFrames we created in the previous steps. In detail, it does this:
for row, l in enumerate(flags)
: This returns row, the row number and l, what the row contains (Something similar to: ['b' nan nan nan nan])for column, value in enumerate(l)
: This returns column, the location of the value in l ('b' would be 0) and value, the value itself ('b')np.nan
We save the cleaned DataFrame as mydata.
aux_values = flags_to_NaN(aux_data.values, aux_flags.values)
mydata = pd.DataFrame(aux_values)
To make working with the final DataFrame easier, we'll now simplify the names of Germany and Macedonia. The DataFrame.rename
method takes input in form of a dictionary containing:
aux_data = aux_data.rename(columns = {'Former Yugoslav Republic of Macedonia, the':'Macedonia'})
aux_data = aux_data.rename(columns = {'Germany (until 1990 former territory of the FRG)':'Germany'})
Now, let's create a pd.period_range
called "dates" which we will use as an index for our final DataFrame. This practical function gets three important inputs:
pd.period_range
.It returns a Period Index with a fixed frequency that we can use as index for our DataFrame.
import pandas as pd
# Create the pd.period_range
dates = pd.period_range(start='2010Q2', end='2017Q4', freq='Q', name='Quarterly Frequence')
# Set mydata's index to be the Period "dates" we just created
mydata.index = dates
At last, some minor changes before we finish our DataFrame mydata:
# To set the names of the countries as column names,
mydata.columns = aux_data.columns
# Drop Germany
mydata = mydata.drop(['Germany'], axis=1)
# Update the name of the columns list
mydata.columns.name = 'Countries'
To prevent the data being influenced by strong seasonal deviations, we will use a "moving average". We calculate it by using:
In combination, those two calculate the moving average.
# Seasonally adjust data
mydata = mydata.rolling(4).mean()
mydata
Now we want to plot the Unemployment Rates of different countries against each other to see how they compare. We use some more functionalities of Matplotlib
here.
figsize(width, height)
, this defines the size of the figure in inches.# Modifies the defult size of plots [13.0,6.0] inches
matplotlib.rcParams['figure.figsize'] = [13.0,6.0]
#mydata['Germany'].plot(legend=True)
mydata['Sweden'].plot(legend=True) # Legend = True display the name of the country on the graph.
mydata['Finland'].plot(legend=True)
mydata['Switzerland'].plot(legend=True)
mydata['Macedonia'].plot(legend=True)
plt.ylim(0,35) # the y axis ranges from 0 to 35
plt.xlim("2011Q1","2017Q4")
plt.ylabel('Unemployment Rate (%)', size = 14) # Name x axis
plt.xlabel('Quaterly Frenquency', size = 14) # Name y axis
plt.title('Unemployment rate in Percent for different countries', weight = 'bold', size = 20)
plt.show()
Another type of data useful to provide us with insight concerning the European Labor Market is the transitions rates. They tell us how fast or how likely an individual move from a category to another. There are 9 different transitions formed by the 3 states a person can be in : 'U' for unemployed, 'E' for employed, and 'N' for out of the labor force. Therefore we can explicit all the different transitions from a period to another (t to t+1) by a matrix 3x3:
Today/tomorrrow | E | U | N |
---|---|---|---|
E | EE | EU | EN |
U | UE | UU | UN |
N | NE | NU | NN |
We can then analyse different indicators such as the EU + EN -> Job Destruction rate & UE + NE -> Job finding rate.
Nevertheless, it is important to mention that we calculate transition rates and not exactly the probabilitiy to move from 1 state to another because of the following reasons. The way data is collected tells us how many people move across states for a given period and with those data we can compute the transitions rates. For example the UE transition rate tell us the proportion of unemployed people that went from unemployment to employment at the end of a given period. To calculate it, we divide the number of people that from 1 state to another by the number of people in the initial state and we have our rate that we can call rate_UE for example.
Many transitions can happen at any time between the person is surveyed the 1st time and the second. The current data only measure the transition from point a to point b but not what happens inbetween. Since Eurostat do not register data in continuous but in a discret manner and changes happen in continuous, the transition rates are only an approximative of the probability to move from 1 state to another. Plus, since those rates should approximate probabilities, they can not be negative or greater than 1.
exec()
function¶Before we continue, we will briefly introduce the exec()
function.
It executes python code dynamically, that means it "writes" a string or executes other code every time the code passes it. In our case that makes it very useful for saving files or DataFrames with different names.
The two important parts are:
{}
brackets: They specify at which part of the string something from the .format()
section should be put in..format()
section: It specifies what should be put into the {}
when the exec function is called.In our case, in each iteration the first {}
takes the key of the dictionary and the second {}
takes the corresponding value.
example_dictionary = {"Apple" : "2€", "Cherry": "10€", "Peach": "6€", "Banana": "4€"}
# Using the exec-function, we can easily print out a "Pricelist" made up from the keys and values of a dictionary
for key in example_dictionary:
exec("print('One kilogram of {} costs {}')".format(key, example_dictionary[key]))
states = ['E', 'U', 'N']
transitions = [] # We set an empty list that will be filled with the 9 transitions
for today in states:
for tomorrow in states:
frame_name = today + tomorrow
transitions.append(frame_name)
file_name = 'lfsi_long_q_' + frame_name + '.xls'
# Load xls with data and flags
aux_name = myfolder + file_name
aux_data = pd.read_excel(aux_name, skiprows=11, index_col=0, sheet_name='Data')
aux_data = aux_data.replace(':', np.NaN)
aux_flags = pd.read_excel(aux_name, skiprows=11, index_col=0, nrows=aux_data.shape[0],
sheet_name='Flags and Footnotes')
# Transpose data and flags
aux_data = aux_data.T
aux_flags = aux_flags.T
# Change the name of Macedonia
aux_data = aux_data.rename(columns = {'Former Yugoslav Republic of Macedonia, the':'Macedonia'})
# Update the name of the columns list
aux_data.columns.name = 'Countries'
# Convert data flags into NaNs
aux_values = flags_to_NaN(aux_data.values, aux_flags.values)
mydata_transitions = pd.DataFrame(aux_values)
# Set proper period index
mydata_transitions.index = dates
# Set proper column names
mydata_transitions.columns = aux_data.columns
# Set an appropiate name to the DataFrame
mydata_transitions.name = frame_name + ' Transitions'
# Save DataFrame with proper name
exec('{} = mydata_transitions'.format(frame_name))
To compute the transitions rate we need to iterate over all the possible combinations of ['E','U','N']
and check that all the computed rates are in the [0,1] interval. In order to do the later, we can define a function that uses DataFrame.min()
and DataFrame.max()
. The function warns us if a value in the rows or column of the Dataframe is smaller than 0 or bigger than 1 since they represent probabilities. . max().max() will look at the maximum of the rows and maximum of the columns and similarily for min().min(). We don't have to check every single value because if the maximum and minimum values fulfill this criterium, the rest of the values will be okay, if not, we have to check the whole code anyway.
def check_rates(data):
if data.max().max() > 1: # If the max value of the data >1
print("ATENTION: In " + data.name + "at least one value bigger than 1")
if data.min().min() < 0: # If the mmin ax value of the data <0
print("ATENTION: In " + data.name + "at least one value smaller than 0")
The excel file names containing the data regarding the transition rates have very similar names except that they differ at the end. They end by the respective transition rate ("EE" or "UE"). The list created above containing the 9 transitions helps us to find fast way to go through all the excel files.
transition_rates = []
for today in states:
# Define E, U, and N stocks in the current period
exec(today + ' = ' + today+'E + ' + today+'U + ' + today+'N')
for tomorrow in states:
frame_name = 'rate_'+today+tomorrow
transition_rates.append(frame_name)
# Create an auxiliary data set with all the transition rates (easier to use .name)
# E.g UE/E: number of people that go from unemployment to employment/number of people employed
exec('aux_data = {}{}/{}'.format(today,tomorrow,today))
# Set an appropiate name to the DataFrame
aux_name = '{}{} Transitions rates'.format(today,tomorrow)
aux_data.name = aux_name
# Check values
check_rates(aux_data)
# Save DataFrame with proper name
exec(frame_name + ' = aux_data')
To see the results, we will plot some of our computed rates using Matplotlib
. Its important to include the sex
variable along with the data displayed in the title so there are no misunderstandings.
#import matplotlib.pyplot as plt
rate_UE.mean().sort_values(ascending=False).plot.bar(color='b', figsize=(14,7))
plt.title("Average UE Rates ({}), 2010Q2-2017Q4".format(sex))
plt.xlabel('Countries')
plt.show()
If we want to make our plot look a little fancier we can again use the broad functionality of Matplotlib
#import matplotlib
matplotlib.style.use('fivethirtyeight') # to see all style available do: print(plt.style.available)
# Display the mean of each country's unemployment rate between Q2 2010 and Q4 2017
# sort_values(ascending = False) the values are displayed from the highest to the lowest
ax = rate_UE.mean().sort_values(ascending=False).plot(kind='bar', figsize=(14,7), fontsize = 14, width=0.75)
plt.title("Average UE Rates ({}), 2010Q2-2017Q4".format(sex), fontsize=25)
plt.xlabel('') # set x label as an empty string for stylistic reason
# set individual bar lables
for p in ax.patches:
ax.annotate(str(round(p.get_height(),2)), # 2 is number of decimals after the comma displayed.
(p.get_x()+p.get_width()/2., p.get_height()-0.025), # set the location where to display the average UE rate
ha='center', va='center', xytext=(0, 10), # center the text.
textcoords='offset points',
rotation=90) # rotate the number by 90°
plt.show()
.xlsx
files¶pandas
offers built-in functionality to simplify this task. See:
This will create an Excel file that contains the same columns and rows our DataFrame does. The obvious advantage of this method is that the file can be read using Excel, while maintaining the option of importing it again in other notebooks. Pandas has a built-in functionality for this, called DataFrame.to_Excel. It works as follows:
for rate in transition_rates:
# Define file name
file_name = '{}{}.xlsx'.format(myfolder,rate)
# Store DataFrame in aux_data
exec('aux_data = {}'.format(rate))
# Define Excel Writer
writer = pd.ExcelWriter(file_name)
# Convert DataFrame to Excel object.
aux_data.to_excel(writer, sheet_name='Data')
exec("{}.to_excel(writer, sheet_name='Data')".format(rate))
# Close the Pandas Excel writer and output the Excel file
writer.save()
The Steady-State unemployment rate is given by
$$\frac{s}{s+f}$$A steady state in economics is the levels of independent variables in a model such that from one period to another, the dependent variable stays constant.
Here, this means that the steady state unemployment rate, which is determined by s and f are at a level such that that the unemployment rate stays constant from period $t$ to period $t+1$.
Steady states have a large impact when studying growth theory and other economic models. If you assume that an economy and world converges on a steady state given certain factors, then it becomes interesting to evaluate how the current levels are in comparison to the steady state.
This therefore calculated from setting the point where the same number of people lose and find a job in a period: $$sE=fU$$
And solving for $$\frac{U}{L}$$ the steady state unemployment rate
In our case, these rates are represented by the transition rates. $s = {\pi^{EN}\pi^{NU}+\pi^{NE}\pi^{EU}+\pi^{NU}\pi^{EU}}$ $f = {\pi^{UN}\pi^{NE}+\pi^{NU}\pi^{UE}+\pi^{NE}\pi^{UE}}$
That means we can calculate $UR_{SS}$ as follows:
$$UR_{SS} = \frac{\pi^{EN}\pi^{NU}+\pi^{NE}\pi^{EU}+\pi^{NU}\pi^{EU}}{(\pi^{UN}\pi^{NE}+\pi^{NU}\pi^{UE}+\pi^{NE}\pi^{UE})+(\pi^{EN}\pi^{NU}+\pi^{NE}\pi^{EU}+\pi^{NU}\pi^{EU})}$$We can use a temporary data set, where we copy the DataFrame of the country we are dealing with, to simplify the code.
To find out what the Steady State Unemployment Rate of a given Country is, visualize it and compare it to the measured unemployment rate, we will:
We start off with the DataFrame mydata_measured_unemployment we created out of the cleaned unemployment data before. To continue computing the Steady State Unemployment Rates, we will use this DataFrame to create separate DataFrames for each of our countries.
We do that by iterating over the countries in the column list of mydata and, at each iteration:
# This creates an empty list were we can store all the country names for future use
countries = []
# Now we iterate over all the column names (the countries)
for country in mydata.axes[1]:
# Now we chose the relevant column in aux_data
temp_data = pd.DataFrame(data = mydata[country])
# We rename the column which previously was the country name
temp_data = temp_data.rename(columns = {country:'Measured Unemployment Rate'})
# Here we replace the spaces in every country name with underscore
country = country.replace(' ', '_')
# Lastly, we store the country name in our "countries" list
countries.append(country)
# Save each DataFrame with the corresponding country name
exec("{} = temp_data".format(country))
Next, we will Merge all the transition rates from our .xlsx
files into the country DataFrame. From the previous step, we already have a pandas.DataFrame
for each country with measured unemployment rate. Now we need to add the transition rates. To do so, we need to import the .xlsx
files we created earlier that contain the transition rates and merge each column to the corresponding country's DataFrame. We also want to seasonally adjust the transition rates. Like previously, we use the DataFrame.rolling
method for this task.We proceed as follows:
pd.period_range
"dates" as index.states = ['E', 'U', 'N']
for today in states:
for tomorrow in states:
transition = today + tomorrow
file_name = "{}rate_{}{}.xlsx".format(myfolder,today,tomorrow)
# Load file in aux DataFrame
aux_frame = pd.read_excel(file_name, index_col=0)
aux_frame.index = dates
aux_frame
# Seasonally adjust data
aux_frame = aux_frame.rolling(4).mean()
# Iterate over all countries
for country in aux_frame.axes[1]:
frame_name = country.replace(' ', '_') # Replace spaces in country name by underscore
#aux_series = aux_frame[country]
exec("{}['{}'] = aux_frame['{}']".format(frame_name,transition,country))
Using the equation for steady state unemployment rates above, we can compute the rate for each country and period.
We can use a temporary dataset, where we copy the DataFrame of the country we are dealing with, to simplify the code. This temporary dataset is called f in our case, You can see that what is defined as $s$ and $f$ in the formula above corresponds to aux_E and aux_U.
for country in countries:
# This simplifies the code simply by exchanging "country" with "f"
exec("f = {}".format(country))
# Compute Steady State Unemployment Rate
aux_E = f['UN']*f['NE'] + f['NU']*f['UE'] + f['NE']*f['UE']
aux_U = f['EN']*f['NU'] + f['NE']*f['EU'] + f['NU']*f['EU']
f['Steady State Unemployment Rate'] = (aux_U/(aux_E+aux_U))*100
# Save f with appropiate name
exec("{} = f".format(country))
After having computed the Steady State Unemployment Rates we want to compare them to the measured data.
We start by plotting the two different rates for one example country.
exec
functions read as followsSweden[Measured Unemployment Rate].plot(legend = True)
Sweden[Steady State Unemployment Rate].plot(legend = True)
exec
gives us the advantage of only having to change the variable mycountry if we want to plot the comparison for a different countrymycountry = "Sweden"
exec("{}['Measured Unemployment Rate'].plot(legend=True)".format(mycountry))
exec("{}['Steady State Unemployment Rate'].plot(legend=True)".format(mycountry))
plt.title("Measured vs. Steady State Unemployment Rate, {} ({})".format(mycountry,sex), fontsize=14)
plt.show()
We proceed as follows. In order to be able to study in which conditions our model of the labor market performs better, we can summarise all the relevant information in one pandas.DataFrame
. To get the data in a compact format, we will calculate the means of each DataFrame using .mean
. This function adds up all the quarterly values and divides it by the number of measurements (quarters).
To create a DataFrame with the period means for all countries, the method pandas.DataFrame.append might be useful.
It lets us add the row containing the mean we calculate with the exec
and then store in the transposed aux_frame to the end of "Means" - DataFrame we create in the first line.
# Create empty "Means" DataFrame
Means = pd.DataFrame()
# Iterate over countries
for country in countries:
# Save DataFrame in aux_frame to make the code more readable
exec("aux_frame = pd.DataFrame({}.mean())".format(country))
# Transpose DataFrame
aux_frame = aux_frame.T
# Replace spaces
country = country.replace('_', ' ') # Replace underscores by spaces
# Set index to be the countries name
aux_frame.index = [country]
# Add row to "Means" - Dataframe
Means = Means.append(aux_frame)
For visualisation purposes, it is useful to add to the DataFrame the country code that identifies each country. For example, Switzerland is represented by CH. We can use the file country_codes.xlsx
saved locally to create a dictionary that tells us what the appropiate code for each country is. Then we can use that dictionary to set the index previously containing Switzerland to CH.
We proceed as follows:
pandas.DataFrame
out of the excel file with the country codes.# Create DataFrame out of Excel file
aux_codes = pd.read_excel(myfolder_gender_neutral + 'country_codes.xlsx')
# Create empty dictionary
country_code = {}
r.strip()
method removes all whitespaces at the end of the strings# Iterate over rows of DataFrame and fill dictionary
for index,row in aux_codes.iterrows(): # iterrows() Iterate over rows of DataFrame
code = row['Code'].rstrip() # removes the whitespaces
name = row['English'].rstrip()
country_code[name] = code
country_codes = pd.read_excel(myfolder_gender_neutral+'country_codes.xlsx')
country_codes.columns = ['Code', 'Own language', 'English', 'French', 'German']
# we transform the dataframe into a dictionary but we choose only the English names of the countries
# to_dict() convert a data dataframe into a dictionnary
dict = country_codes.set_index('English')['Code'].to_dict()
#dict
print(country_code)
Means['Code'] = ''
for index,row in Means.iterrows():
Means.at[index,'Code'] = country_code[index]
To save the DataFrame Means for future use there are several methods, two of them being pickle and pandas.DataFrame.to_excel
.
Basically, what Pickle does is to serialize python code. That means it converts the code in to a stream of characters that can be read and reconstructed later by another python skript. Then it saves this stream on the local drive.
Means.to_pickle("{}Means.pkl".format(myfolder))
pandas.DataFrame.to_excel
¶Another option to save our DataFrame Means on te local drive would be to create an Excel file like we did before. It contains the same columns and rows our DataFrame does. The obvious advantage of this method over DataFrame.to_pickle
is that the file can also be read using Excel, while maintaining the option of importing it again in other notebooks.
from pandas import ExcelWriter
from pandas import ExcelFile
writer = ExcelWriter('Means.xlsx')
Means.to_excel(writer,'data',index = True)
writer.save()
To test the accuracy of our computed $UR_{SS}$, we can compare the measured and steady state unemployment rates.
To start with, we will create a simple scatter plot using the matplotlib
library we introduced earlier. We'll use it to compare measured and steady-state unemployment rates.
Means.plot(x='Measured Unemployment Rate', y='Steady State Unemployment Rate', kind='scatter')
plt.show()
Using some more of matplotlib's
functionalities, we can also create a more complex scatter plot in which we display not only the data points but also the country codes and a 45 degree line that helps visualise the case in which model and data overlap. In order to add country codes, the documentation of the matplotlib
on annotating plots is very useful.
We are going to begin by creating a version of our DataFrame
Means without the NaN's. That makes it easier for Matplotlib to handle the data. The most convienient way to do this is to use pandas.DataFrame.dropna, which just removes all missing values in our DataFrame
.
Means_noNaN = Means.dropna()
Next, we assign relevant columns of Means to variables to improve readability.
Measured = Means_noNaN['Measured Unemployment Rate']
SteadyState = Means_noNaN['Steady State Unemployment Rate']
labels = Means_noNaN['Code']
To specify what our figure should look like, we give Matplotlib the necessary information.
figsize(width, height)
, this defines the size of the figure in inches.The actual data is entered in the next line, plt.scatter
. This takes the measured (X-axis) and steady state (Y-axis) unemployment rate for each country and ads them as a point in our figure. In our case, the scatterplot takes the following inputs:
After that, we add the labels (country codes) to the observations, again, we use the enumerate
function. It helps by assigning a number (i) to each country code (label). This number corresponds to the place of the observations in Measured and Steadystate and thereby enables us to assign the right country codes to the observations.
In the last steps, we:
# Here we define the figure
plt.figure(figsize=(12, 7), dpi=300)
# Input the data
plt.scatter(Measured, SteadyState, alpha=0.5, color='red')
# Assigns the country code as label to te corresponding measurement
for i, label in enumerate(labels):
plt.annotate(label, (Measured.iloc[i], SteadyState.iloc[i]))
# Store the axes-limits of our plot in a list
limits = [0,35]
# Set axes limits
plt.xlim(limits)
plt.ylim(limits)
# Create 45 degree line
plt.plot(limits,limits, alpha=0.1, color='black')
# Set labels and title
plt.xlabel('Measured')
plt.ylabel('Stedy State')
plt.title('Measured vs. Steady State Unemployment Rate. Mean 2010Q2-2017Q4.')
plt.show()
A useful method to compute the distance between the model and the data and tell how accurate our computations were is to use:
$$\text{Distance}=\log{\Big(\frac{\text{model}}{\text{data}}\Big)}^2$$We use the logarithm because it has the advantage of being magnitude-neutral, giving us the model's relative deviations from the data.
We can also regress a magnitude-neutral measure of the difference between the model and the data to see if any of the inputs of the data is correlated with the model-data discrepancy.
Means['Difference'] = 0.0 # Initiated as float
for index,row in Means.iterrows():
model = Means.at[index,'Steady State Unemployment Rate']
data = Means.at[index,'Measured Unemployment Rate']
Means.at[index,'Difference'] = (math.log(model/data)**2)
Means
Running a regression is relatively easy using the statsmodel
library. An example of how to run an OLS can be found here.
It is possible to execute most types of regressions in python.
This tutorial used OLS regression. OLS regression, meaning Ordinary Least Squares creates a regression line "minimzes the square distance between the predicted values and the actual values" source. It is a type of linear regression meaning that the model that a regression tries to fit comes from the equation:
$$y={\beta_0 +\beta_1 x+e_i}$$In this linear trend line:
When the regression is performed, it uses the OLS to fit the model by estimating the parameters ($\beta$). The results give you the values of each parameter along with the standard error. If $\beta$ is positive, it means there is a postive relationship between x and y. The p-value shows whether the effect is statistically signficant. It tests whether the p value is less than .05 (where the p-value is a hypothesis test). The R-squared value tells you how much variation in y is explained by x.
OLS regressions can be expanded by including multiple variables, called a multivariate regression. Multivariable regression means that the data cannot be multicollinearity (one independent variable being dependent on what should have been an independent variable) otherwise there is inaccurant results. More detailed, econometric limitations may be found in a statistics/econometrics 1 or 2 course or see lecture on linear regressions.
import statsmodels.api as sm
Means['constant'] = 1.0 # Create a constant
regression = sm.OLS(Means['Difference'], Means[['constant','EU','EN','UE','UN','NE','NU']], missing='drop')
regresults = regression.fit()
print(regresults.summary())
This OLS regression tests which elements account for the distance between the model and the data. The outcome of this regression means that the linear line of fit is such:
$$\text{difference}={.0768 +.0546\cdot EU -1.2626\cdot EN +.0663\cdot UE -.1830\cdot UN +.0911\cdot NE -.5086\cdot NU + e_i}$$This explains the composition of the difference between the model and the data. The composition is made up of the transition rates and their magintude effects. You can see that EN has the largest effect to decreasing the distance between the model and the data and NE has the largest.