How Can a Government Improve Its Nation's Overall Happiness?

Introduction

Importance

Happiness matters. It is one way to quantify how people experience their quality of life. Most people have heard, and probably pondered, this question: would you rather be rich or happy? There is no simple answer.

A government is an entity that aims to work for its people and their advancement as a whole. Its success, and its country's success, can therefore be measured by how happy its citizens are.

Goals

The main goal of this notebook is to determine what a given government should improve first if it wants to increase the average happiness of its citizens. This means finding the features that will give the largest immediate improvement in happiness. Using the resulting model, I will give examples for Kenya, a third-world country, and the United States, a first-world country, to determine which potential improvements (both immediate and long-term) each would benefit from most.

Data

I derived the core of the following data from the World Happiness Report of 2018, which encompasses statistics from most countries spanning 2005-2017. (The initial data was acquired by the Gallup World Poll, which I did not have access to thanks to an annual fee of $30,000.)

In the dataset, happiness is measured as a ladder. The question asked of the people surveyed was: “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”

Even though this dataset includes additional features (e.g. the perception of an individual's government and social support), it is still fairly limited, so I'll need to merge it with several other datasets.

Features pertaining to the goal, for reference:

Age dependency ratio - Ratio of dependents (people younger than 15 or older than 64) to the working-age population
Birth rate
Confidence in national government
Death rate
Delivery quality - Government Effectiveness, Regulatory Quality, Rule of Law, Control of Corruption
Democratic quality - Voice and Accountability, Political Stability and Absence of Violence
Freedom to make life choices - “Are you satisfied with your freedom to choose what you do with your life?”
GDP per capita
Gender Inequality Index - A composite measure of gender inequality across three dimensions: reproductive health, empowerment and the labor market. A value closer to 0 means higher equality.
GINI household income - Income inequality (0 if everyone had the same income, 1 if one person had all the income)
Health expenditure (% of govt.) - Percentage of government spending that goes to health care
Life expectancy
Life ladder
Negative affect - "Worry, sadness and anger."
Perceptions of corruption - "Is corruption widespread through your government or not?"
Positive affect - "Happiness, laugh and enjoyment."
School enrollment, tertiary - Enrollment percentage in higher education

Basic Imports

In [1]:

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd

In [2]:

### Creating a class to suppress output (specifically for imputing with Matrix Factorization) ###
import os, sys
import warnings

class HiddenOutput:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout

Data Cleaning

I have collected 4 different datasets on countries and have already removed all features unnecessary for this project, aiming to keep the ones that capture as much of the life ladder rating as possible. Keeping too many features also runs the risk of more missing data.

The first thing to do is reformat the datasets so they match each other before they are joined.
We want every other dataset to match the layout of the happiness dataset.

So we are going from this:

   Country        Feature           2005    2006    2007    2008    2009

0  Afghanistan    Age Dependency    99.0    99.5    100.0   100.2   100.1
1  Afghanistan    Birth Rate        44.9    43.9    42.8    41.6    40.3
2  Afghanistan    Death Rate        10.7    10.4    10.1    9.8     9.5
3  Albania        Age Dependency    53.5    52.2    50.9    49.7    48.7
4  Albania        Birth Rate        12.3    11.9    11.6    11.5    11.6
5  Albania        Death Rate        5.95    6.13    6.32    6.51    6.68

to this:


                       Age Dependency    Birth Rate    Death Rate

Afghanistan    2005    99.0              44.9          10.7
               2006    99.5              43.9          10.4
               2007    100.0             42.8          10.1   
               2008    100.2             41.6          9.8
               2009    100.1             40.3          9.5

Albania        2005    53.5              12.3          5.95
               2006    52.2              11.9          6.13
               2007    50.9              11.6          6.32
               2008    49.7              11.5          6.51
               2009    48.7              11.6          6.68

This requires creating columns out of the repeating Feature values and index entries out of the year columns.
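As a minimal sketch of that reshape, here is part of the toy table above rebuilt in pandas. The values are copied from the example (two years only, for brevity), and the names `long_df` and `wide_df` are made up for illustration. `melt` turns the year columns into rows, and `unstack` turns the repeating Feature values into columns:

```python
import pandas as pd

# Toy data copied from the example table above (2005 and 2006 only)
long_df = pd.DataFrame({
    'Country': ['Afghanistan'] * 3 + ['Albania'] * 3,
    'Feature': ['Age Dependency', 'Birth Rate', 'Death Rate'] * 2,
    '2005': [99.0, 44.9, 10.7, 53.5, 12.3, 5.95],
    '2006': [99.5, 43.9, 10.4, 52.2, 11.9, 6.13],
})

# Year columns -> rows, then Feature values -> columns
wide_df = (long_df.melt(['Country', 'Feature'], var_name='Year')
                  .set_index(['Country', 'Year', 'Feature'])['value']
                  .unstack())

print(wide_df)
```

Note that the Year level comes out as strings here, because it was built from column names.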

In [3]:

# The core happiness dataset
happiness_df = pd.read_csv("datasets/Happiness_2018.csv").rename(columns={'country':'Country',\
                                                                          'year': 'Year'})
happiness_df = happiness_df.drop(columns=[\
                      'GINI index (World Bank estimate)', 
                      'GINI index (World Bank estimate), average 2000-15', # Both missing too much data
                      'Healthy life expectancy at birth', # Redundant
                      'Positive affect',
                      'Negative affect', # Too similar to Life Ladder
                      'Generosity', # Amount donated, mostly attributed to GDP and culture
                      'Social support']) # Although a big factor, has greatly to do with family
happiness_df = happiness_df.set_index(['Country', 'Year'])

# Country dataset with 6 features
country_df = pd.read_csv("datasets/Country_Data_Clean.csv").rename(\
                                        columns={'Country Name':'Country'})
country_df = country_df.drop(columns=['Country Code', 'Indicator Code'])
country_df = country_df.melt(['Country', 'Indicator Name'], var_name='Year')\
                  .set_index(['Country', 'Year', 'Indicator Name'])['value']\
                  .unstack()

# Gender inequality dataset judged by reproductive health, empowerment, and labor market
inequality_df = pd.read_csv('datasets/gender_inequality_index.csv')
inequality_df = inequality_df.drop(columns=['HDI Rank (2015)'])
inequality_df = inequality_df.melt(['Country'], var_name='Year', value_name='HDI Inequality')\
            .sort_values(['Country', 'Year']).set_index(['Country', 'Year'])

# Population density per square kilometer
pop_df = pd.read_csv('datasets/Population_density.csv').rename(columns={'Country Name':'Country'})
pop_df = pop_df.drop(columns=['Indicator Code', 'Country Code'])
pop_df = pop_df.melt(['Country', 'Indicator Name'], var_name='Year')\
                  .set_index(['Country', 'Year', 'Indicator Name'])['value']\
                  .unstack()

Now that each dataset is correctly formatted, the Year index of every dataset has to be cast to int so that the keys match when joining.
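The cast matters because the string '2010' and the integer 2010 are different index keys. The sketch below wraps the reset/cast/set_index pattern in a small helper and joins two made-up frames; `year_to_int`, `left`, and `right` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def year_to_int(frame):
    """Return a copy of `frame` whose 'Year' index level is cast to int."""
    out = frame.reset_index()
    out['Year'] = out['Year'].astype(int)
    return out.set_index(['Country', 'Year'])

# Hypothetical frames: Year is an int on the left but a string on the right
left = pd.DataFrame({'Country': ['Albania'], 'Year': [2010],
                     'Life Ladder': [5.27]}).set_index(['Country', 'Year'])
right = pd.DataFrame({'Country': ['Albania'], 'Year': ['2010'],
                      'Birth Rate': [11.95]}).set_index(['Country', 'Year'])

# '2010' and 2010 are different keys, so cast before joining
joined = left.join(year_to_int(right))
print(joined)
```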

In [4]:

inequality_df = inequality_df.reset_index()
inequality_df['Year'] = inequality_df['Year'].astype(int)
inequality_df = inequality_df.set_index(['Country', 'Year'])

country_df = country_df.reset_index()
country_df['Year'] = country_df['Year'].astype(int)
country_df = country_df.set_index(['Country', 'Year'])

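# Note: pop_df is rebuilt from inequality_df here, so it becomes a copy of the
# inequality data and population density does not reach the final dataset below
pop_df = inequality_df.reset_index()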
pop_df['Year'] = pop_df['Year'].astype(int)
pop_df = pop_df.set_index(['Country', 'Year'])

Here we join to get the entire dataset with every feature we want.

In [5]:

df = happiness_df.join(inequality_df).join(country_df).join(pop_df, how='left', lsuffix='_left',\
                                                            rsuffix='_right')
df.drop(columns=['HDI Inequality_right'], inplace=True)
df.rename(columns={'HDI Inequality_left':'HDI Inequality'}, inplace=True)
df

Out[5]:

		Life Ladder	Log GDP per capita	Freedom to make life choices	Perceptions of corruption	Confidence in national government	Democratic Quality	Delivery Quality	gini of household income reported in Gallup, by wp5-year	HDI Inequality	Age dependency ratio (% of working-age population)	Birth rate, crude (per 1,000 people)	Death rate, crude (per 1,000 people)	Health expenditure, public (% of government expenditure)	Life expectancy at birth, total (years)	School enrollment, tertiary (% gross)
Country	Year
Afghanistan	2008	3.723590	7.168690	0.718114	0.881686	0.612072	-1.929690	-1.655084	NaN	NaN	100.215886	41.560	9.771	6.926090	58.225024	NaN
	2009	4.401778	7.333790	0.678896	0.850035	0.611545	-2.044093	-1.635025	0.441906	NaN	100.060480	40.265	9.475	12.728971	58.603683	3.903390
	2010	4.758381	7.386629	0.600127	0.706766	0.299357	-1.991810	-1.617176	0.327318	0.724	99.459839	38.940	9.193	14.404153	58.970829	NaN
	2011	3.831719	7.415019	0.495901	0.731109	0.307386	-1.919018	-1.616221	0.336764	0.713	97.667911	37.636	8.927	10.174108	59.327951	3.755980
	2012	3.782938	7.517126	0.530935	0.775620	0.435440	-1.842996	-1.404078	0.344540	0.701	95.312707	36.396	8.677	11.668976	59.679610	NaN
	2013	3.572100	7.503376	0.577955	0.823204	0.482847	-1.879709	-1.403036	0.304368	0.689	92.602785	35.253	8.445	10.591285	60.028268	NaN
	2014	3.130896	7.484583	0.508514	0.871242	0.409048	-1.773257	-1.312503	0.413974	0.676	89.773777	34.225	8.230	11.998628	60.374463	8.662800
	2015	3.982855	7.466215	0.388928	0.880638	0.260557	-1.844364	-1.291594	0.596918	0.667	86.954464	NaN	NaN	NaN	NaN	NaN
	2016	4.220169	7.461401	0.522566	0.793246	0.324990	-1.917693	-1.432548	0.418629	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	2017	2.661718	7.460144	0.427011	0.954393	0.261179	NaN	NaN	0.286599	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Albania	2007	4.634252	9.077325	0.528605	0.874700	0.300681	-0.045108	-0.420024	NaN	NaN	50.862987	11.631	6.321	8.880175	76.470293	30.653669
	2009	5.485470	9.161633	0.525223	0.863665	NaN	0.048114	-0.264635	0.617361	NaN	48.637067	11.679	6.684	8.463262	76.840366	33.400749
	2010	5.268937	9.203026	0.568958	0.726262	NaN	-0.033831	-0.246433	0.543528	0.273	47.885076	11.952	6.841	8.463262	77.036951	44.540649
	2011	5.867422	9.230898	0.487496	0.877003	NaN	-0.110023	-0.278413	0.407266	0.279	46.720288	12.325	6.981	9.850791	77.240585	49.670399
	2012	5.510124	9.246649	0.601512	0.847675	0.364894	-0.060784	-0.328862	0.568153	0.281	45.835739	12.730	7.109	9.710531	77.443976	58.565491
	2013	4.550648	9.258439	0.631830	0.862905	0.338095	0.070411	-0.330956	0.633796	0.272	45.247477	13.106	7.232	9.762421	77.640463	62.547760
	2014	4.813763	9.278097	0.734648	0.882704	0.498786	0.314873	-0.187407	0.417219	0.267	44.912168	13.414	7.350	9.369005	77.830463	62.706848
	2015	4.606651	9.303031	0.703851	0.884793	0.506978	0.251629	-0.152544	0.422627	0.267	44.806973	NaN	NaN	NaN	NaN	NaN
	2016	4.511101	9.337774	0.729819	0.901071	0.400910	0.208456	-0.139161	0.416540	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	2017	4.639548	9.373718	0.749611	0.876135	0.457738	NaN	NaN	0.410488	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Algeria	2010	5.463567	9.462701	0.592696	0.618038	NaN	-1.140853	-0.740093	0.492713	0.523	48.681853	24.643	5.108	9.646135	73.804049	29.834560
	2011	5.317194	9.471962	0.529561	0.637982	NaN	-1.182341	-0.776610	0.426202	0.514	49.233576	24.921	5.123	9.365661	74.070000	31.202591
	2012	5.604596	9.485086	0.586663	0.690116	NaN	-1.115535	-0.771172	0.421409	0.432	49.847713	24.946	5.130	9.988192	74.324098	32.231331
	2014	6.354898	9.509210	NaN	NaN	NaN	-1.002867	-0.783428	0.475492	0.429	51.536631	24.309	5.125	9.904623	74.808098	34.593811
	2016	5.340854	9.541166	NaN	NaN	NaN	-1.008262	-0.814304	0.604617	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	2017	5.248912	9.540244	0.436670	0.699774	NaN	NaN	NaN	0.527556	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Angola	2011	5.589001	8.684613	0.583702	0.911320	0.232387	-0.747358	-1.215250	NaN	NaN	102.106756	47.018	14.642	5.584801	51.059317	6.946090
	2012	4.360250	8.699287	0.456029	0.906300	0.237091	-0.732785	-1.124386	NaN	NaN	101.836900	46.499	14.329	5.573031	51.464000	NaN
	2013	3.937107	8.729884	0.409555	0.816375	0.547732	-0.752538	-1.213750	0.588065	NaN	101.315235	45.985	14.021	7.423221	51.866171	9.923570
	2014	3.794838	8.741957	0.374542	0.834076	0.572346	-0.739363	-1.168539	0.440699	NaN	100.637667	45.483	13.720	5.004294	52.266878	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Yemen	2011	3.746256	8.244134	0.638211	0.753882	0.387044	-1.910233	-1.124420	0.412353	0.772	80.193948	34.017	7.280	4.297796	63.053537	9.974600
	2012	4.060601	8.241021	0.705815	0.793233	0.598435	-1.892755	-1.119488	0.415892	0.767	78.902603	33.481	7.145	3.932508	63.327293	NaN
	2013	4.217679	8.261730	0.542547	0.885197	0.387677	-1.854220	-1.092611	0.429792	0.762	77.734373	32.947	7.024	3.932508	63.583512	NaN
	2014	3.967958	8.233983	0.663909	0.885429	0.344929	-1.983291	-1.264091	0.447447	0.757	76.644268	32.418	6.919	3.932508	63.818195	NaN
	2015	2.982674	7.878930	0.609981	0.829098	0.263297	-2.101797	-1.374387	0.445243	0.767	75.595147	NaN	NaN	NaN	NaN	NaN
	2016	3.825631	7.751505	0.532964	NaN	0.267581	-2.222766	-1.642179	0.411021	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	2017	3.253560	NaN	0.595191	NaN	0.247787	NaN	NaN	0.374522	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Zambia	2006	4.824455	7.866006	0.720972	0.785281	0.526590	0.077581	-0.642004	NaN	NaN	98.124971	43.463	13.598	14.142366	50.966610	NaN
	2007	3.998293	7.918941	0.682005	0.947914	0.404140	0.078216	-0.536810	NaN	NaN	98.223058	43.172	12.801	9.465782	52.477146	NaN
	2008	4.730263	7.966175	0.716994	0.890299	0.557462	0.154371	-0.499575	NaN	NaN	98.277253	42.810	12.060	11.217006	53.905634	NaN
	2009	5.260361	8.026193	0.696183	0.916553	0.413418	0.133997	-0.567933	0.581005	NaN	98.260795	42.382	11.391	12.632602	55.214171	NaN
	2011	4.999114	8.120028	0.662850	0.882150	0.397613	0.169676	-0.487765	0.521399	0.556	97.854237	41.415	10.283	11.132558	57.422195	NaN
	2012	5.013375	8.163204	0.787760	0.806394	0.594114	0.264068	-0.388092	0.610944	0.550	97.385802	40.928	9.822	11.350678	58.363317	NaN
	2013	5.243996	8.182191	0.769912	0.732268	0.552761	0.164946	-0.385220	0.514960	0.544	96.791310	40.471	9.402	11.008191	59.237366	NaN
	2014	4.345837	8.197678	0.811825	0.808841	0.606339	0.023306	-0.395449	0.621956	0.541	96.122165	40.052	9.018	11.310214	60.047049	NaN
	2015	4.843164	8.196217	0.758654	0.871020	0.631103	0.040718	-0.391482	0.671201	0.526	95.402326	NaN	NaN	NaN	NaN	NaN
	2016	4.347544	8.201650	0.811575	0.770644	0.696892	-0.058471	-0.460033	0.681393	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	2017	3.932777	8.211670	0.823169	0.739541	0.717004	NaN	NaN	0.612799	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Zimbabwe	2006	3.826268	7.366704	0.431110	0.904757	0.317073	-1.236102	-1.570760	NaN	NaN	81.702148	34.958	17.882	8.257254	42.810707	NaN
	2007	3.280247	7.313939	0.455957	0.946287	0.225752	-1.340245	-1.653740	NaN	NaN	81.272166	35.397	16.945	7.507345	44.177756	NaN
	2008	3.174264	7.102516	0.343556	0.963846	0.181594	-1.381488	-1.701545	NaN	NaN	81.024020	35.788	15.903	6.404965	45.804488	NaN
	2009	4.055914	7.197595	0.411089	0.930818	0.285287	-1.353181	-1.717821	0.545112	NaN	80.934968	36.094	14.809	10.578529	47.624659	NaN
	2010	4.681570	7.296330	0.664718	0.828361	0.471201	-1.289599	-1.693678	0.680030	0.581	80.985702	36.267	13.711	7.471234	49.574659	5.905600
	2011	4.845642	7.418864	0.632978	0.829800	0.425926	-1.204545	-1.621979	0.514646	0.575	80.740494	36.264	12.645	7.594706	51.600366	5.823760
	2012	4.955101	7.534424	0.469531	0.858691	0.407084	-1.125315	-1.555728	0.487203	0.569	80.579870	36.077	11.626	9.691281	53.643073	5.868670
	2013	4.690188	7.565154	0.575884	0.830937	0.527755	-1.026085	-1.526321	0.555439	0.532	80.499816	35.715	10.675	9.593592	55.633000	5.871750
	2014	4.184451	7.562753	0.642034	0.820217	0.566209	-0.985267	-1.484067	0.601080	0.535	80.456439	35.189	9.819	8.486653	57.498317	NaN
	2015	3.703191	7.556052	0.667193	0.810457	0.590012	-0.893078	-1.357514	0.655137	0.540	80.391033	NaN	NaN	NaN	NaN	NaN
	2016	3.735400	7.538829	0.732971	0.723612	0.699344	-0.863044	-1.371214	0.596690	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	2017	3.638300	7.538187	0.752826	0.751208	0.682647	NaN	NaN	0.581484	NaN	NaN	NaN	NaN	NaN	NaN	NaN

1562 rows × 15 columns

In [6]:

# Num entries and num non-null values
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1562 entries, (Afghanistan, 2008) to (Zimbabwe, 2017)
Data columns (total 15 columns):
Life Ladder                                                 1562 non-null float64
Log GDP per capita                                          1535 non-null float64
Freedom to make life choices                                1533 non-null float64
Perceptions of corruption                                   1472 non-null float64
Confidence in national government                           1401 non-null float64
Democratic Quality                                          1391 non-null float64
Delivery Quality                                            1391 non-null float64
gini of household income reported in Gallup, by wp5-year    1205 non-null float64
HDI Inequality                                              754 non-null float64
Age dependency ratio (% of working-age population)          1229 non-null float64
Birth rate, crude (per 1,000 people)                        1098 non-null float64
Death rate, crude (per 1,000 people)                        1098 non-null float64
Health expenditure, public (% of government expenditure)    1089 non-null float64
Life expectancy at birth, total (years)                     1098 non-null float64
School enrollment, tertiary (% gross)                       802 non-null float64
dtypes: float64(15)
memory usage: 189.1+ KB

Notice that there are many scattered missing values. (1) We fill these with the mean of the other values from the same country and column, since values barely change year over year and should be very close to one another.
(2) Some countries have no data at all for certain features. In this case it's best to impute using Matrix Factorization, which estimates each missing value from the patterns in the rest of the data.
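Here is a small sketch of the per-country mean fill on made-up values (the `toy` frame is hypothetical). It also shows why a second imputation step is needed: a country with no data at all in a column has no mean to fill with, so its gaps remain.

```python
import numpy as np
import pandas as pd

# Made-up values: country A has one gap, country B has no data at all for 'x'
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B'],
    'Year': [2010, 2011, 2012, 2010, 2011],
    'x': [1.0, np.nan, 3.0, np.nan, np.nan],
}).set_index(['Country', 'Year'])

# Fill each gap with that country's own mean for the column
filled = toy.groupby('Country').transform(lambda col: col.fillna(col.mean()))

print(filled)
```

Country A's gap is filled with the mean of its other two values, while country B stays empty and is left for the matrix factorization step.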

In [7]:

# (1)
df = df.groupby(['Country']).transform(lambda x: x.fillna(x.mean()))

# (2)
from fancyimpute import MatrixFactorization
with warnings.catch_warnings(): # Ignore deprecation warnings
    warnings.simplefilter("ignore")
    with HiddenOutput():
        df.iloc[:,:] = MatrixFactorization().fit_transform(df.iloc[:,:]);

Using TensorFlow backend.

In [8]:

df.isna().sum()

Out[8]:

Life Ladder                                                 0
Log GDP per capita                                          0
Freedom to make life choices                                0
Perceptions of corruption                                   0
Confidence in national government                           0
Democratic Quality                                          0
Delivery Quality                                            0
gini of household income reported in Gallup, by wp5-year    0
HDI Inequality                                              0
Age dependency ratio (% of working-age population)          0
Birth rate, crude (per 1,000 people)                        0
Death rate, crude (per 1,000 people)                        0
Health expenditure, public (% of government expenditure)    0
Life expectancy at birth, total (years)                     0
School enrollment, tertiary (% gross)                       0
dtype: int64

There aren't any more null values left, so we now have the final dataset.

In [9]:

df

Out[9]:

		Life Ladder	Log GDP per capita	Freedom to make life choices	Perceptions of corruption	Confidence in national government	Democratic Quality	Delivery Quality	gini of household income reported in Gallup, by wp5-year	HDI Inequality	Age dependency ratio (% of working-age population)	Birth rate, crude (per 1,000 people)	Death rate, crude (per 1,000 people)	Health expenditure, public (% of government expenditure)	Life expectancy at birth, total (years)	School enrollment, tertiary (% gross)
Country	Year
Afghanistan	2008	3.723590	7.168690	0.718114	0.881686	0.612072	-1.929690	-1.655084	0.385668	0.695000	100.215886	41.560000	9.771000	6.926090	58.225024	5.440723
	2009	4.401778	7.333790	0.678896	0.850035	0.611545	-2.044093	-1.635025	0.441906	0.695000	100.060480	40.265000	9.475000	12.728971	58.603683	3.903390
	2010	4.758381	7.386629	0.600127	0.706766	0.299357	-1.991810	-1.617176	0.327318	0.724000	99.459839	38.940000	9.193000	14.404153	58.970829	5.440723
	2011	3.831719	7.415019	0.495901	0.731109	0.307386	-1.919018	-1.616221	0.336764	0.713000	97.667911	37.636000	8.927000	10.174108	59.327951	3.755980
	2012	3.782938	7.517126	0.530935	0.775620	0.435440	-1.842996	-1.404078	0.344540	0.701000	95.312707	36.396000	8.677000	11.668976	59.679610	5.440723
	2013	3.572100	7.503376	0.577955	0.823204	0.482847	-1.879709	-1.403036	0.304368	0.689000	92.602785	35.253000	8.445000	10.591285	60.028268	5.440723
	2014	3.130896	7.484583	0.508514	0.871242	0.409048	-1.773257	-1.312503	0.413974	0.676000	89.773777	34.225000	8.230000	11.998628	60.374463	8.662800
	2015	3.982855	7.466215	0.388928	0.880638	0.260557	-1.844364	-1.291594	0.596918	0.667000	86.954464	37.753571	8.959714	11.213173	59.315690	5.440723
	2016	4.220169	7.461401	0.522566	0.793246	0.324990	-1.917693	-1.432548	0.418629	0.695000	95.255981	37.753571	8.959714	11.213173	59.315690	5.440723
	2017	2.661718	7.460144	0.427011	0.954393	0.261179	-1.904737	-1.485251	0.286599	0.695000	95.255981	37.753571	8.959714	11.213173	59.315690	5.440723
Albania	2007	4.634252	9.077325	0.528605	0.874700	0.300681	-0.045108	-0.420024	0.492998	0.273167	50.862987	11.631000	6.321000	8.880175	76.470293	30.653669
	2009	5.485470	9.161633	0.525223	0.863665	0.409726	0.048114	-0.264635	0.617361	0.273167	48.637067	11.679000	6.684000	8.463262	76.840366	33.400749
	2010	5.268937	9.203026	0.568958	0.726262	0.409726	-0.033831	-0.246433	0.543528	0.273000	47.885076	11.952000	6.841000	8.463262	77.036951	44.540649
	2011	5.867422	9.230898	0.487496	0.877003	0.409726	-0.110023	-0.278413	0.407266	0.279000	46.720288	12.325000	6.981000	9.850791	77.240585	49.670399
	2012	5.510124	9.246649	0.601512	0.847675	0.364894	-0.060784	-0.328862	0.568153	0.281000	45.835739	12.730000	7.109000	9.710531	77.443976	58.565491
	2013	4.550648	9.258439	0.631830	0.862905	0.338095	0.070411	-0.330956	0.633796	0.272000	45.247477	13.106000	7.232000	9.762421	77.640463	62.547760
	2014	4.813763	9.278097	0.734648	0.882704	0.498786	0.314873	-0.187407	0.417219	0.267000	44.912168	13.414000	7.350000	9.369005	77.830463	62.706848
	2015	4.606651	9.303031	0.703851	0.884793	0.506978	0.251629	-0.152544	0.422627	0.267000	44.806973	12.405286	6.931143	9.214207	77.214728	48.869367
	2016	4.511101	9.337774	0.729819	0.901071	0.400910	0.208456	-0.139161	0.416540	0.273167	46.863472	12.405286	6.931143	9.214207	77.214728	48.869367
	2017	4.639548	9.373718	0.749611	0.876135	0.457738	0.071527	-0.260937	0.410488	0.273167	46.863472	12.405286	6.931143	9.214207	77.214728	48.869367
Algeria	2010	5.463567	9.462701	0.592696	0.618038	0.492667	-1.140853	-0.740093	0.492713	0.523000	48.681853	24.643000	5.108000	9.646135	73.804049	29.834560
	2011	5.317194	9.471962	0.529561	0.637982	0.649212	-1.182341	-0.776610	0.426202	0.514000	49.233576	24.921000	5.123000	9.365661	74.070000	31.202591
	2012	5.604596	9.485086	0.586663	0.690116	0.455042	-1.115535	-0.771172	0.421409	0.432000	49.847713	24.946000	5.130000	9.988192	74.324098	32.231331
	2014	6.354898	9.509210	0.536398	0.661478	0.478946	-1.002867	-0.783428	0.475492	0.429000	51.536631	24.309000	5.125000	9.904623	74.808098	34.593811
	2016	5.340854	9.541166	0.536398	0.661478	0.453047	-1.008262	-0.814304	0.604617	0.474500	49.824943	24.704750	5.121500	9.726153	74.251561	31.965573
	2017	5.248912	9.540244	0.436670	0.699774	0.353288	-1.089971	-0.777121	0.527556	0.474500	49.824943	24.704750	5.121500	9.726153	74.251561	31.965573
Angola	2011	5.589001	8.684613	0.583702	0.911320	0.232387	-0.747358	-1.215250	0.514382	0.669352	102.106756	47.018000	14.642000	5.584801	51.059317	6.946090
	2012	4.360250	8.699287	0.456029	0.906300	0.237091	-0.732785	-1.124386	0.514382	0.651276	101.836900	46.499000	14.329000	5.573031	51.464000	8.434830
	2013	3.937107	8.729884	0.409555	0.816375	0.547732	-0.752538	-1.213750	0.588065	0.673532	101.315235	45.985000	14.021000	7.423221	51.866171	9.923570
	2014	3.794838	8.741957	0.374542	0.834076	0.572346	-0.739363	-1.168539	0.440699	0.681051	100.637667	45.483000	13.720000	5.004294	52.266878	8.434830
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Yemen	2011	3.746256	8.244134	0.638211	0.753882	0.387044	-1.910233	-1.124420	0.412353	0.772000	80.193948	34.017000	7.280000	4.297796	63.053537	9.974600
	2012	4.060601	8.241021	0.705815	0.793233	0.598435	-1.892755	-1.119488	0.415892	0.767000	78.902603	33.481000	7.145000	3.932508	63.327293	10.532505
	2013	4.217679	8.261730	0.542547	0.885197	0.387677	-1.854220	-1.092611	0.429792	0.762000	77.734373	32.947000	7.024000	3.932508	63.583512	10.532505
	2014	3.967958	8.233983	0.663909	0.885429	0.344929	-1.983291	-1.264091	0.447447	0.757000	76.644268	32.418000	6.919000	3.932508	63.818195	10.532505
	2015	2.982674	7.878930	0.609981	0.829098	0.263297	-2.101797	-1.374387	0.445243	0.767000	75.595147	34.061143	7.322143	4.140908	62.998774	10.532505
	2016	3.825631	7.751505	0.532964	0.833238	0.267581	-2.222766	-1.642179	0.411021	0.767333	80.339638	34.061143	7.322143	4.140908	62.998774	10.532505
	2017	3.253560	8.191046	0.595191	0.833238	0.247787	-1.888195	-1.154295	0.374522	0.767333	80.339638	34.061143	7.322143	4.140908	62.998774	10.532505
Zambia	2006	4.824455	7.866006	0.720972	0.785281	0.526590	0.077581	-0.642004	0.601957	0.543400	98.124971	43.463000	13.598000	14.142366	50.966610	7.134560
	2007	3.998293	7.918941	0.682005	0.947914	0.404140	0.078216	-0.536810	0.601957	0.543400	98.223058	43.172000	12.801000	9.465782	52.477146	18.847403
	2008	4.730263	7.966175	0.716994	0.890299	0.557462	0.154371	-0.499575	0.601957	0.543400	98.277253	42.810000	12.060000	11.217006	53.905634	8.398973
	2009	5.260361	8.026193	0.696183	0.916553	0.413418	0.133997	-0.567933	0.581005	0.543400	98.260795	42.382000	11.391000	12.632602	55.214171	34.661095
	2011	4.999114	8.120028	0.662850	0.882150	0.397613	0.169676	-0.487765	0.521399	0.556000	97.854237	41.415000	10.283000	11.132558	57.422195	13.709220
	2012	5.013375	8.163204	0.787760	0.806394	0.594114	0.264068	-0.388092	0.610944	0.550000	97.385802	40.928000	9.822000	11.350678	58.363317	7.605665
	2013	5.243996	8.182191	0.769912	0.732268	0.552761	0.164946	-0.385220	0.514960	0.544000	96.791310	40.471000	9.402000	11.008191	59.237366	8.960832
	2014	4.345837	8.197678	0.811825	0.808841	0.606339	0.023306	-0.395449	0.621956	0.541000	96.122165	40.052000	9.018000	11.310214	60.047049	8.092875
	2015	4.843164	8.196217	0.758654	0.871020	0.631103	0.040718	-0.391482	0.671201	0.526000	95.402326	41.836625	11.046875	11.532425	55.954186	18.551970
	2016	4.347544	8.201650	0.811575	0.770644	0.696892	-0.058471	-0.460033	0.681393	0.543400	97.382435	41.836625	11.046875	11.532425	55.954186	16.158530
	2017	3.932777	8.211670	0.823169	0.739541	0.717004	0.104841	-0.475436	0.612799	0.543400	97.382435	41.836625	11.046875	11.532425	55.954186	17.988343
Zimbabwe	2006	3.826268	7.366704	0.431110	0.904757	0.317073	-1.236102	-1.570760	0.579647	0.555333	81.702148	34.958000	17.882000	8.257254	42.810707	5.867445
	2007	3.280247	7.313939	0.455957	0.946287	0.225752	-1.340245	-1.653740	0.579647	0.555333	81.272166	35.397000	16.945000	7.507345	44.177756	5.867445
	2008	3.174264	7.102516	0.343556	0.963846	0.181594	-1.381488	-1.701545	0.579647	0.555333	81.024020	35.788000	15.903000	6.404965	45.804488	5.867445
	2009	4.055914	7.197595	0.411089	0.930818	0.285287	-1.353181	-1.717821	0.545112	0.555333	80.934968	36.094000	14.809000	10.578529	47.624659	5.867445
	2010	4.681570	7.296330	0.664718	0.828361	0.471201	-1.289599	-1.693678	0.680030	0.581000	80.985702	36.267000	13.711000	7.471234	49.574659	5.905600
	2011	4.845642	7.418864	0.632978	0.829800	0.425926	-1.204545	-1.621979	0.514646	0.575000	80.740494	36.264000	12.645000	7.594706	51.600366	5.823760
	2012	4.955101	7.534424	0.469531	0.858691	0.407084	-1.125315	-1.555728	0.487203	0.569000	80.579870	36.077000	11.626000	9.691281	53.643073	5.868670
	2013	4.690188	7.565154	0.575884	0.830937	0.527755	-1.026085	-1.526321	0.555439	0.532000	80.499816	35.715000	10.675000	9.593592	55.633000	5.871750
	2014	4.184451	7.562753	0.642034	0.820217	0.566209	-0.985267	-1.484067	0.601080	0.535000	80.456439	35.189000	9.819000	8.486653	57.498317	5.867445
	2015	3.703191	7.556052	0.667193	0.810457	0.590012	-0.893078	-1.357514	0.655137	0.540000	80.391033	35.749889	13.779444	8.398395	49.818558	5.867445
	2016	3.735400	7.538829	0.732971	0.723612	0.699344	-0.863044	-1.371214	0.596690	0.555333	80.858665	35.749889	13.779444	8.398395	49.818558	5.867445
	2017	3.638300	7.538187	0.752826	0.751208	0.682647	-1.154359	-1.568579	0.581484	0.555333	80.858665	35.749889	13.779444	8.398395	49.818558	5.867445

1562 rows × 15 columns

Data Analysis and Visualization

Choosing some interesting features, let's first see whether and how they correlate with Life Ladder.

In [10]:

with HiddenOutput():
    sm = pd.plotting.scatter_matrix(
        df[['Life Ladder', 'Log GDP per capita',
            'gini of household income reported in Gallup, by wp5-year',
            'HDI Inequality', 'Age dependency ratio (% of working-age population)']],
        figsize=(10, 10), diagonal='kde')

    for ax in sm.reshape(-1):
        # Rotate labels so they don't overlap
        ax.xaxis.label.set_rotation(45)
        ax.yaxis.label.set_rotation(0)

        # Prevent labels from overlapping plots
        ax.get_yaxis().set_label_coords(-1.5, 0.5)

        # Hide ticks
        ax.set_xticks(())
        ax.set_yticks(())

The most noticeable things:

It looks like most of these features do in fact correlate well with Life Ladder; household income inequality, however, doesn't look too promising.
The correlations of features with Life Ladder look very similar to the correlations of features with GDP.

In [11]:

# (1) Checking correlation coefficient R of household income inequality
df['Life Ladder'].corr(df['gini of household income reported in Gallup, by wp5-year'])

Out[11]:

-0.2999144917858168

There is in fact a weak downward trend (r ≈ -0.30): life satisfaction tends to fall as a nation's income inequality rises. Additionally, in the scatter plot income inequality appears to have a greater effect on happiness past a certain mark, which would suggest that inequality below that level makes no significant difference.

Let's get F-test p-values and fit a multiple linear regression to see some coefficients and get a feel for which individual qualities are generally important for a government to provide.
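For context, the F-test used here is univariate: scikit-learn's `f_regression` scores each feature against the target on its own, based on that feature's correlation with the target. A rough sketch of the statistic for a single feature, written with NumPy only (the function name `univariate_f` and the toy numbers are mine, not from the notebook):

```python
import numpy as np

def univariate_f(x, y):
    """F statistic for a one-feature linear fit of y on x (with intercept)."""
    r = np.corrcoef(x, y)[0, 1]   # Pearson correlation
    dof = len(y) - 2              # n minus slope and intercept
    return r ** 2 / (1 - r ** 2) * dof

# Made-up numbers, just to exercise the function
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])
print(univariate_f(x, y))
```

So a tiny p-value from this test says a feature is linearly related to Life Ladder by itself; it says nothing about that feature's coefficient once the other features are in the model.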

In [12]:

y = np.array(df['Life Ladder'])
X = np.array(df.iloc[:, 1:])

In [13]:

from sklearn.feature_selection import f_regression
from sklearn import linear_model

# Univariate F-test p-values (each feature tested against y on its own)
F, p = f_regression(X, y)

# Multiple linear regression coefficients (all features fitted together)
clf = linear_model.LinearRegression()
clf.fit(X, y)

# One row per feature, sorted by coefficient, with the p-values joined on
features = list(df.columns[1:])
corr = pd.DataFrame({'coef': clf.coef_}, index=features) \
         .sort_values('coef', ascending=False)

corr.join(pd.DataFrame({'p-value': p}, index=features))

Out[13]:

	coef	p-value
Freedom to make life choices	1.834539	1.224220e-108
Log GDP per capita	0.475416	6.149166e-318
Democratic Quality	0.076849	4.872091e-162
Health expenditure, public (% of government expenditure)	0.043470	3.039275e-76
Birth rate, crude (per 1,000 people)	0.008026	3.989166e-172
School enrollment, tertiary (% gross)	0.005124	2.875886e-183
Age dependency ratio (% of working-age population)	0.002051	1.016683e-124
Life expectancy at birth, total (years)	-0.002818	3.142558e-261
Death rate, crude (per 1,000 people)	-0.055781	1.700101e-34
Delivery Quality	-0.066200	8.043331e-240
HDI Inequality	-0.142111	1.533536e-203
Confidence in national government	-0.831529	3.602172e-03
Perceptions of corruption	-0.861237	4.503876e-74
gini of household income reported in Gallup, by wp5-year	-1.068271	7.887201e-34

In [14]:

plt.xkcd()
# The mark of a successful Government, apparently
fig, ax = plt.subplots(figsize=(7,5))
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
plt.xticks([])
plt.yticks([])
ax.set_ylim([-30, 10])

data = np.ones(100)
data[20:93] -= (np.arange(73) / 2.5)
data[93:100] = -28

plt.annotate(
    'THE MARK OF A SUCCESSFUL\n GOVERNMENT, APPARENTLY',
    xy=(1, -30), arrowprops=dict(arrowstyle='->'), xytext=(1, -25))

plt.annotate(
    'MAKE THEM LOSE ALL HOPE IN YOU',
    xy=(70, 1), xytext=(5, -45), fontsize=17)

plt.plot(data)

plt.xlabel('citizen\'s confidence in government')
plt.ylabel('citizen\'s happiness')

plt.title('HOW TO EFFECTIVELY RUN A COUNTRY')

plt.show()

Joking aside, even though an increase in confidence in the national government is associated with lower happiness in the multilinear regression, it has almost no correlation with Life Ladder directly. It's also not very useful, nor does it make sense, to advise a government to lower people's confidence in it if it wants them to be happier.

In [15]:

print(df['Life Ladder'].corr(df['Confidence in national government']))
df = df.drop(columns=['Confidence in national government']);

-0.07361513467078484

Anyway, back to the multilinear coefficients:

Freedom to make life choices, or whether or not an individual feels satisfied that they are able to choose what they do with their life, is at the top. A government could approach this in multiple ways, from providing people more opportunities, such as free college tuition, to increasing civil rights for all people.

The GINI of household income describes the country's income inequality and has the largest negative coefficient in the regression above. A country struggling with wealth imbalance should focus on a better distribution of wealth.
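
A caveat when ranking features by raw coefficients is that each coefficient is expressed in its feature's own units, so features measured on small scales tend to get large coefficients. The sketch below illustrates this on synthetic data (not the notebook's `df`); the variable names are made up for the example, and standardizing first is one common option, not necessarily what this notebook intends.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Two synthetic features with equal influence but very different units
small_units = rng.normal(0, 0.1, 400)    # e.g. an index between 0 and 1
large_units = rng.normal(0, 10.0, 400)   # e.g. a percentage
X_toy = np.column_stack([small_units, large_units])
y_toy = 10 * small_units + 0.1 * large_units + rng.normal(0, 0.1, 400)

# Coefficients on the raw features versus on standardized features
raw_coef = LinearRegression().fit(X_toy, y_toy).coef_
std_coef = LinearRegression().fit(StandardScaler().fit_transform(X_toy), y_toy).coef_

print('raw:', raw_coef)           # very different magnitudes
print('standardized:', std_coef)  # roughly equal magnitudes
```

After standardizing, each coefficient reads as the change in the target per one standard deviation of the feature, which makes magnitudes easier to compare across features.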

Now let's chart happiness using the 6 most distinguishing features given by the linear regression coefficients. I will min-max scale each feature and then express each one as a share of the country's Life Ladder value. This way we can see the makeup of a country's happiness based on how each of its features compares to other countries.

In [16]:

plt.rcdefaults()
fig, ax = plt.subplots(figsize=(15,30))

# Use the most significant variables
ranks = df[['Life Ladder', 'Freedom to make life choices', 'Log GDP per capita',
           'gini of household income reported in Gallup, by wp5-year', 
            'Perceptions of corruption', 'HDI Inequality', 'Delivery Quality'\
           ]].groupby(['Country']).mean().sort_values(by='Life Ladder', ascending=False)

# Scale each feature value (except for y) between 0.1 and 1 by column
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range=(0.1, 1)).fit(ranks.iloc[:, 1:])
ranks.iloc[:, 1:] = sc.transform(ranks.iloc[:, 1:])

# Multiply percentage of each value by Life Ladder, cumulative sum for stacked bar chart
totals = ranks.iloc[:, 1:].sum(axis=1)
ranks.iloc[:, 1:] = ranks.iloc[:, 1:].divide(totals, axis=0).multiply(ranks['Life Ladder'], \
                                                                      axis=0).cumsum(axis=1)

# Plot each feature
y_pos = np.arange(len(ranks))
# .values is used instead of .get_values(), which newer pandas versions no longer provide
p1=plt.barh(y_pos, ranks['Delivery Quality'].values, color='#FF961A')
p2=plt.barh(y_pos, ranks['HDI Inequality'].values, color = '#FF301A')
p3=plt.barh(y_pos, ranks['Perceptions of corruption'].values, color='#33cc33')
p4=plt.barh(y_pos, ranks['gini of household income reported in Gallup, by wp5-year']\
                        .values, color='#ffbf00')
p5=plt.barh(y_pos, ranks['Log GDP per capita'].values, color='#ff1a8c')
p6=plt.barh(y_pos, ranks['Freedom to make life choices'].values, color='#4AAAAA')

y_indices = np.array(ranks.index)
labels = np.append(np.arange(1, y_indices.shape[0] + 1).reshape(-1, 1), \
                                        y_indices.reshape(-1, 1), axis=1)
labels = [str(row[0]) + ' ' + row[1] for row in labels]

ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.invert_yaxis()
ax.set_xlabel('Life Ladder', fontsize=13)
ax.set_title(\
    'Country Happiness with each Feature\'s Relative Makeup - Mean of Data from 2005-2017', 
    fontsize=15)
plt.ylim(len(ranks) , -1)
plt.xlim(0, 10)
plt.legend((p1[0], p2[0], p3[0], p4[0], p5[0], p6[0]), ('Delivery Quality', 'HDI Inequality',
                                                        'Perceived Corruption', 
                                                        'GINI inequality', 'GDP per Capita', 
                                                        'Freedom of Choices'))
plt.show()

As expected, many of the features that contribute to a greater life ladder shrink in percentage of makeup, and the features that hurt the life ladder increase in percentage as the ranking decreases.
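
The scaling and stacking used for this chart can be checked on a tiny table. This is a minimal sketch with made-up numbers; the countries, `Feature A` and `Feature B` are hypothetical and only mirror the steps of the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical three-country table; the values are made up for illustration
toy = pd.DataFrame(
    {'Life Ladder': [7.5, 5.0, 3.5],
     'Feature A': [0.9, 0.6, 0.3],
     'Feature B': [11.0, 9.0, 7.0]},
    index=['Country X', 'Country Y', 'Country Z'])

# Scale every feature column to the range [0.1, 1]
features = toy.columns[1:]
scaled = pd.DataFrame(
    MinMaxScaler(feature_range=(0.1, 1)).fit_transform(toy[features]),
    index=toy.index, columns=features)

# Share of each feature within a country, expressed in Life Ladder units
shares = scaled.divide(scaled.sum(axis=1), axis=0).multiply(toy['Life Ladder'], axis=0)

# Cumulative sum across features gives the right edge of each stacked segment
stacked = shares.cumsum(axis=1)
print(stacked)
```

The last cumulative column always equals the country's Life Ladder value, which is why the chart draws that bar first and layers the shorter cumulative bars over it.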

Supervised Learning¶

Fitting the Model using an SVM¶

In [17]:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit

X = np.array(df.iloc[:, 1:])
y = np.array(df['Life Ladder'])
scaler = StandardScaler().fit(X)
stand_X = scaler.transform(X)

train_X, test_X, train_y, test_y = train_test_split(stand_X, y, train_size=0.7, 
                                                    test_size=0.3, shuffle=True)

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=123)
param_grid = {'C' : np.logspace(-3, 2, 6), 'gamma': np.logspace(-3, 2, 6)}

svr = SVR(kernel='rbf')
grid = GridSearchCV(svr, param_grid=param_grid, cv=cv, return_train_score=True)

grid.fit(train_X, train_y)

Out[17]:

GridSearchCV(cv=ShuffleSplit(n_splits=100, random_state=123, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), 'gamma': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [18]:

grid.best_params_

Out[18]:

{'C': 10.0, 'gamma': 0.1}

In [19]:

print('Train score:', grid.score(train_X, train_y))
print('Test score:', grid.score(test_X, test_y))

Train score: 0.9426830084552748
Test score: 0.8696420975335767
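
In the cell above the scaler is fit on every row before the train/test split, so the held-out rows have a small influence on the scaling. One possible alternative, sketched here on synthetic data from `make_regression` rather than the notebook's `df`, is to put the scaler and the SVR into a `Pipeline` so that scaling is refit inside each cross-validation fold. The grid values and names such as `pipe` and `search` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the notebook's X and y
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The scaler is a pipeline step, so it only ever sees the training part of each fold
pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR(kernel='rbf'))])
param_grid_demo = {'svr__C': np.logspace(-1, 3, 5), 'svr__gamma': np.logspace(-3, 0, 4)}

search = GridSearchCV(pipe, param_grid=param_grid_demo, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print('Test R^2:', search.score(X_te, y_te))
```

Fixing `random_state` in `train_test_split`, as done here, also makes the reported train and test scores reproducible between runs.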

Check for underfitting/overfitting now that we have the chosen parameters:

In [20]:

from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

plt.rcdefaults()
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.grid()

    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

cv = ShuffleSplit(n_splits=100, test_size=0.2)
plot_learning_curve(SVR(kernel='rbf', C=10, gamma=0.1),
                    'Learning Curve', stand_X, y, (0.5, 1.01), cv=cv, n_jobs=4)
plt.show()

There is a small gap between the CV and training scores, meaning more data would be useful. But since there is no way of getting more data short of waiting for years, this is fine. There isn't much overfitting happening anyway.

Finding a country's features that need to be improved¶

Here I am going to create a simple function that determines what a given government needs to work on first with the goal of improving happiness. I'm sure finding the gradient at a point would work with some models, but I decided to use an 'rbf' kernel SVM. So, to keep things simple, I check which feature gives the greatest change in predicted score, update that feature, and then check for the next greatest change.
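
For comparison, the gradient idea mentioned above can be approximated numerically for any fitted model that exposes `predict`. The sketch below uses central differences on an SVR trained on synthetic data; `numerical_gradient` is a hypothetical helper and is not part of the notebook or of scikit-learn.

```python
import numpy as np
from sklearn.svm import SVR

def numerical_gradient(model, x, eps=1e-3):
    """Central-difference estimate of d(prediction)/d(feature) at point x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for j in range(x.shape[0]):
        plus, minus = x.copy(), x.copy()
        plus[j] += eps
        minus[j] -= eps
        f_plus = model.predict(plus.reshape(1, -1))[0]
        f_minus = model.predict(minus.reshape(1, -1))[0]
        grad[j] = (f_plus - f_minus) / (2 * eps)
    return grad

# Synthetic target that rises with feature 0 and falls with feature 1
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 2))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1]

model = SVR(kernel='rbf', C=10, gamma=0.1).fit(X_demo, y_demo)
grad = numerical_gradient(model, np.zeros(2))
print(grad)
```

The sign of each entry indicates whether raising or lowering that feature increases the prediction near the given point. Unlike the step-by-step search, it only describes the model locally.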

In [21]:

def improve_which_features(X, tot_increase=4, step=0.1):
    '''
    Principal function for this notebook.
    Get the best features to improve when sent parameters of a country.
    
    Parameters
    ----------
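    X : 1-D array of length 13 (one value per feature)
        Must be scaled before being passed in
    
    tot_increase : Total number of standard deviations to increase the data by
    
    step : Amount each feature is changed by for each comparison
    
    Returns
    -------
    order : Feature index chosen at each step, in order
    scores : Total change (in standard deviations) applied to each feature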
    '''
    
    better_X = np.copy(X)
    num_features = X.shape[0]
    
    # Increment each feature by step tot_increase/step times to check 
    # which one increases the score the most each increment
    scores = np.zeros(num_features)
    additional_increase = 0
    order = np.array([])
    for i in np.arange(0, tot_increase, step):
        increase = 0
        best_feature = -1
        
        for j in range(num_features):
            temp_X_plus = np.copy(better_X)
            temp_X_plus[j] += (step + additional_increase)
            temp_X_minus = np.copy(better_X)
            temp_X_minus[j] -= (step + additional_increase)
            
            first = float(grid.predict(temp_X_plus.reshape(1, -1)))
            second = float(grid.predict(temp_X_minus.reshape(1, -1)))

            new_happ = np.maximum(first, second)
            temp_increase = new_happ - float(grid.predict(better_X.reshape(1, -1)))
            
            # Cap each feature's total change at 2 SD. Potential to create unrealistic goals otherwise
            if np.greater(temp_increase, increase) and scores[j] < 2:
                increase = temp_increase
                best_feature = j
                
        if best_feature != -1:
            scores[best_feature] += step + additional_increase
            better_X[best_feature] += step + additional_increase
            additional_increase = 0
            order = np.append(order, np.array([best_feature]))
        else:
            additional_increase += step
            
    return order, scores
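
The function above compares a step up and a step down for each feature but records the chosen change as an increase either way, so the direction has to be inferred afterwards. A minimal sketch of a single greedy step that also returns the direction is shown below. `best_single_step` and `toy_predict` are hypothetical names, and `predict` can be any callable that maps a feature vector to a score.

```python
import numpy as np

def best_single_step(predict, x, step=0.1):
    """Return (feature index, signed step, gain) for the best one-feature move."""
    x = np.asarray(x, dtype=float)
    base = predict(x)
    best = (-1, 0.0, 0.0)  # index -1 means no move beats staying put
    for j in range(x.shape[0]):
        for signed in (step, -step):
            trial = x.copy()
            trial[j] += signed
            gain = predict(trial) - base
            if gain > best[2]:
                best = (j, signed, gain)
    return best

# Toy score: rises with feature 0 and falls with feature 1
toy_predict = lambda v: 1.0 * v[0] - 3.0 * v[1]
feature, signed_step, gain = best_single_step(toy_predict, np.zeros(2))
print(feature, signed_step, gain)
```

With the fitted model from this notebook, `predict` could be something like `lambda v: float(grid.predict(v.reshape(1, -1))[0])`.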

Let's use the function on two countries, Kenya and the United States:

Kenya is a third world country but is well placed to transition to a first world country. Obviously this transition is a big factor in improving happiness, since first world countries clearly have much higher happiness rankings on average. Perhaps such a position could be reached more easily by focusing on happiness, then.
The United States is a first world country but has dropped far in its happiness ranking, all the way from 3rd in 2007 to 18th in 2017. It would be interesting to see what could help bring it back.

Here are the column indices, for reference:

In [68]:

pd.DataFrame(np.linspace(0, 12, 13, dtype=int).reshape(1, -1), columns=list(df.columns[1:]))\
                                                    .transpose().rename(columns={0 : 'Index'})

Out[68]:

	Index
Log GDP per capita	0
Freedom to make life choices	1
Perceptions of corruption	2
Democratic Quality	3
Delivery Quality	4
gini of household income reported in Gallup, by wp5-year	5
HDI Inequality	6
Age dependency ratio (% of working-age population)	7
Birth rate, crude (per 1,000 people)	8
Death rate, crude (per 1,000 people)	9
Health expenditure, public (% of government expenditure)	10
Life expectancy at birth, total (years)	11
School enrollment, tertiary (% gross)	12

In [69]:

# Kenya (2017)
order, scores = improve_which_features(scaler.transform(np.array( \
                                       df.loc[['Country', 'Kenya']])[11, 1:].reshape(1, -1))[0])
print('Order of largest increase in scores:\n', order)
print('\nTotal scores:\nFeature:   0  1   2   3    4  5   6   7   8   9   10  11  12', 
                                                                  '\n  Score:', scores)

Order of largest increase in scores:
 [0. 0. 0. 0. 0. 0. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 0. 4. 4. 0. 4. 4. 0. 7.
 4. 7. 0. 4. 7. 0. 5. 5. 5. 5. 5. 5. 5. 5. 5. 5.]

Total scores:
Feature:   0  1   2   3    4  5   6   7   8   9   10  11  12 
  Score: [1.1 0.  0.  0.  1.6 1.  0.  0.3 0.  0.  0.  0.  0. ]

For Kenya, it looks like Delivery Quality is the most important feature to improve upon. Delivery quality was judged on government effectiveness, regulatory quality, rule of law, and control of corruption. This means that a focus on the workings of the government, making sure it runs efficiently and for the people, may be essential for Kenya to break into the first world. It would also be useful to improve the GDP a little bit before and during the improvement of delivery quality. Afterwards it will be important for them to work on income equality. Stimulating the economy by increasing production and bringing more jobs could potentially solve these last two issues.
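
The raw `order` array is easier to read once consecutive repeats are collapsed and the indices are mapped to names. The helper below is a small sketch: `summarize_order` is a hypothetical name, and the short `toy_order` list is made up for the example rather than copied from the output above.

```python
from itertools import groupby

def summarize_order(order, names):
    """Collapse consecutive repeats into (feature name, number of steps) pairs."""
    return [(names[int(idx)], len(list(run))) for idx, run in groupby(order)]

# Hypothetical short example; the names follow the index table above
names = {0: 'Log GDP per capita', 4: 'Delivery Quality', 5: 'gini of household income'}
toy_order = [0., 0., 4., 4., 4., 0., 5.]
summary = summarize_order(toy_order, names)
print(summary)
```

For the full 13 features, `names` could be built with `dict(enumerate(df.columns[1:]))`, assuming `df` still has the column layout shown in the index table.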

In [75]:

# United States (2017)
order, scores = improve_which_features(scaler.transform(np.array( \
                        df.loc[['Country', 'United States']])[11, 1:].reshape(1, -1))[0])
print('Order of largest increase in scores:\n', order)
print('\nTotal scores:\nFeature:  0   1   2   3    4  5   6   7    8  9   10  11  12', 
                                                                  '\n  Score:', scores)

Order of largest increase in scores:
 [4. 4. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 4. 8.
 4. 8. 4. 4. 8. 4. 8. 4. 8. 4. 8. 4. 8. 3. 4. 3.]

Total scores:
Feature:  0   1   2   3    4  5   6   7    8  9   10  11  12 
  Score: [0.  0.  0.  0.2 1.1 0.  0.  2.  0.7 0.  0.  0.  0. ]

These results show that a slight increase in delivery quality could help first. More important, though, is a further decrease in the age dependency ratio, at least from the list of features we have access to. After the age dependency ratio is improved, an increase in birth rate, which goes hand in hand with age dependency, helps. Finally, continued improvement in delivery quality is needed.

Final Remarks¶

This solution is far from perfect; more features to cover every possible factor and more data to create a more stable model would be very useful. However, I hope this shows you some of my knowledge! The concepts in this notebook could even be applied to something along the lines of customer service.