Happiness matters. It's how we quantify the essentials necessary for appreciating our quality of life. Everyone has heard and probably pondered this question: Would you rather be rich or happy? There is no simple answer.
A government exists to work for its people and their collective advancement, so its success, and its country's success, can be measured by how happy its citizens are.
The main goal of this notebook is to determine what any given government should improve first if it is interested in increasing the average happiness of its citizens. This means identifying the features that will give the largest immediate improvement in happiness. I will apply the resulting model to Kenya, a developing country, and the United States, a developed country, to determine which potential improvements (both immediate and long-term) they would benefit most from.
I derived the core of the following data from the World Happiness Report of 2018, which encompasses statistics from most countries spanning 2005-2017. (The initial data was acquired by the Gallup World Poll, which I did not have access to thanks to an annual fee of $30,000.)
In the dataset, happiness is measured as a ladder. The question asked of the people surveyed was: “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”
Even though this dataset includes additional features (e.g. perceptions of the national government and social support), it is still fairly limited, so I'll need to merge it with several other datasets.
Features pertaining to the goal, for reference:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
### Creating a class to suppress output (specifically for imputing with Matrix Factorization) ###
import os, sys
import warnings
class HiddenOutput:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')
    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
I have collected 4 different datasets on countries and have already removed all features unnecessary for this project, aiming to keep the ones that capture as much of the Life Ladder rating as possible. Keeping too many features also runs the risk of introducing more missing data.
The first thing to do is reformat the datasets so they match each other before they are joined.
We want every other dataset to have the same layout as the happiness dataset.
So we are going from this:
| | Country | Feature | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Age Dependency | 99.0 | 99.5 | 100.0 | 100.2 | 100.1 |
| 1 | Afghanistan | Birth Rate | 44.9 | 43.9 | 42.8 | 41.6 | 40.3 |
| 2 | Afghanistan | Death Rate | 10.7 | 10.4 | 10.1 | 9.8 | 9.5 |
| 3 | Albania | Age Dependency | 53.5 | 52.2 | 50.9 | 49.7 | 48.7 |
| 4 | Albania | Birth Rate | 12.3 | 11.9 | 11.6 | 11.5 | 11.6 |
| 5 | Albania | Death Rate | 5.95 | 6.13 | 6.32 | 6.51 | 6.68 |
to this:
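| Country | Year | Age Dependency | Birth Rate | Death Rate |
|---|---|---|---|---|
| Afghanistan | 2005 | 99.0 | 44.9 | 10.7 |
| | 2006 | 99.5 | 43.9 | 10.4 |
| | 2007 | 100.0 | 42.8 | 10.1 |
| | 2008 | 100.2 | 41.6 | 9.8 |
| | 2009 | 100.1 | 40.3 | 9.5 |
| Albania | 2005 | 53.5 | 12.3 | 5.95 |
| | 2006 | 52.2 | 11.9 | 6.13 |
| | 2007 | 50.9 | 11.6 | 6.32 |
| | 2008 | 49.7 | 11.5 | 6.51 |
| | 2009 | 48.7 | 11.6 | 6.68 |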
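To make the reshape concrete, here is a minimal sketch on a made-up four-row frame (the `wide`, `long`, and `tidy` names and the short `Feature` column are mine, not from the real datasets). It uses the same `melt` then `unstack` pattern: the year columns become rows, and the feature names become columns.

```python
import pandas as pd

# Toy wide-format frame: one row per (country, feature), one column per year.
wide = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "Feature": ["Birth Rate", "Death Rate", "Birth Rate", "Death Rate"],
    "2005": [44.9, 10.7, 12.3, 5.95],
    "2006": [43.9, 10.4, 11.9, 6.13],
})

# melt: the year columns become rows -> Country, Feature, Year, value
long = wide.melt(["Country", "Feature"], var_name="Year")

# unstack: the features become columns, leaving a (Country, Year) index
tidy = long.set_index(["Country", "Year", "Feature"])["value"].unstack()
print(tidy)
```

Note that `Year` comes out of `melt` as a string (it was a column label), so it still has to be converted before it can line up with an integer year index.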
# The core happiness dataset
happiness_df = pd.read_csv("datasets/Happiness_2018.csv").rename(columns={'country':'Country',\
'year': 'Year'})
happiness_df = happiness_df.drop(columns=[\
'GINI index (World Bank estimate)',
'GINI index (World Bank estimate), average 2000-15', # Both missing too much data
'Healthy life expectancy at birth', # Redundant
'Positive affect',
'Negative affect', # Too similar to Life Ladder
'Generosity', # Amount donated, mostly attributed to GDP and culture
'Social support']) # Although a big factor, has greatly to do with family
happiness_df = happiness_df.set_index(['Country', 'Year'])
# Country dataset with 6 features
country_df = pd.read_csv("datasets/Country_Data_Clean.csv").rename(\
columns={'Country Name':'Country'})
country_df = country_df.drop(columns=['Country Code', 'Indicator Code'])
country_df = country_df.melt(['Country', 'Indicator Name'], var_name='Year')\
.set_index(['Country', 'Year', 'Indicator Name'])['value']\
.unstack()
# Gender inequality dataset judged by reproductive health, empowerment, and labor market
inequality_df = pd.read_csv('datasets/gender_inequality_index.csv')
inequality_df = inequality_df.drop(columns=['HDI Rank (2015)'])
inequality_df = inequality_df.melt(['Country'], var_name='Year', value_name='HDI Inequality')\
.sort_values(['Country', 'Year']).set_index(['Country', 'Year'])
# Population density per square kilometer
pop_df = pd.read_csv('datasets/Population_density.csv').rename(columns={'Country Name':'Country'})
pop_df = pop_df.drop(columns=['Indicator Code', 'Country Code'])
pop_df = pop_df.melt(['Country', 'Indicator Name'], var_name='Year')\
.set_index(['Country', 'Year', 'Indicator Name'])['value']\
.unstack()
Now that each dataset is correctly formatted, we have to convert every Year index level to int so that the indexes match and joining is possible.
inequality_df = inequality_df.reset_index()
inequality_df['Year'] = inequality_df['Year'].astype(int)
inequality_df = inequality_df.set_index(['Country', 'Year'])
country_df = country_df.reset_index()
country_df['Year'] = country_df['Year'].astype(int)
country_df = country_df.set_index(['Country', 'Year'])
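# Note: this line reads from inequality_df, not pop_df, so pop_df ends up as a copy
# of the inequality data. That is why the join below needs suffixes, and why no
# population density column appears in the joined frame.
pop_df = inequality_df.reset_index()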
pop_df['Year'] = pop_df['Year'].astype(int)
pop_df = pop_df.set_index(['Country', 'Year'])
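Since the same three steps repeat for every frame, they could be wrapped in a small helper, which also makes it harder to mix up which frame is being converted. This is only a sketch: `year_index_to_int` is a name I made up, and the two toy frames stand in for the real datasets.

```python
import pandas as pd

def year_index_to_int(frame):
    """Return a copy whose (Country, Year) index has an integer Year level."""
    out = frame.reset_index()
    out["Year"] = out["Year"].astype(int)
    return out.set_index(["Country", "Year"])

# Toy stand-ins: the happiness-style frame already has int years,
# the melted frame still has string years.
left = pd.DataFrame(
    {"Life Ladder": [3.72, 4.40]},
    index=pd.MultiIndex.from_tuples(
        [("Afghanistan", 2008), ("Afghanistan", 2009)],
        names=["Country", "Year"]),
)
right = pd.DataFrame(
    {"Birth Rate": [41.56, 40.27]},
    index=pd.MultiIndex.from_tuples(
        [("Afghanistan", "2008"), ("Afghanistan", "2009")],
        names=["Country", "Year"]),
)

joined = left.join(year_index_to_int(right))
print(joined)
```

With both Year levels as integers, every row of `left` finds its partner in `right`.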
Here we join to get the entire dataset with every feature we want.
df = happiness_df.join(inequality_df).join(country_df).join(pop_df, how='left', lsuffix='_left',\
rsuffix='_right')
df.drop(columns=['HDI Inequality_right'], inplace=True)
df.rename(columns={'HDI Inequality_left':'HDI Inequality'}, inplace=True)
df
Life Ladder | Log GDP per capita | Freedom to make life choices | Perceptions of corruption | Confidence in national government | Democratic Quality | Delivery Quality | gini of household income reported in Gallup, by wp5-year | HDI Inequality | Age dependency ratio (% of working-age population) | Birth rate, crude (per 1,000 people) | Death rate, crude (per 1,000 people) | Health expenditure, public (% of government expenditure) | Life expectancy at birth, total (years) | School enrollment, tertiary (% gross) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | Year | |||||||||||||||
Afghanistan | 2008 | 3.723590 | 7.168690 | 0.718114 | 0.881686 | 0.612072 | -1.929690 | -1.655084 | NaN | NaN | 100.215886 | 41.560 | 9.771 | 6.926090 | 58.225024 | NaN |
2009 | 4.401778 | 7.333790 | 0.678896 | 0.850035 | 0.611545 | -2.044093 | -1.635025 | 0.441906 | NaN | 100.060480 | 40.265 | 9.475 | 12.728971 | 58.603683 | 3.903390 | |
2010 | 4.758381 | 7.386629 | 0.600127 | 0.706766 | 0.299357 | -1.991810 | -1.617176 | 0.327318 | 0.724 | 99.459839 | 38.940 | 9.193 | 14.404153 | 58.970829 | NaN | |
2011 | 3.831719 | 7.415019 | 0.495901 | 0.731109 | 0.307386 | -1.919018 | -1.616221 | 0.336764 | 0.713 | 97.667911 | 37.636 | 8.927 | 10.174108 | 59.327951 | 3.755980 | |
2012 | 3.782938 | 7.517126 | 0.530935 | 0.775620 | 0.435440 | -1.842996 | -1.404078 | 0.344540 | 0.701 | 95.312707 | 36.396 | 8.677 | 11.668976 | 59.679610 | NaN | |
2013 | 3.572100 | 7.503376 | 0.577955 | 0.823204 | 0.482847 | -1.879709 | -1.403036 | 0.304368 | 0.689 | 92.602785 | 35.253 | 8.445 | 10.591285 | 60.028268 | NaN | |
2014 | 3.130896 | 7.484583 | 0.508514 | 0.871242 | 0.409048 | -1.773257 | -1.312503 | 0.413974 | 0.676 | 89.773777 | 34.225 | 8.230 | 11.998628 | 60.374463 | 8.662800 | |
2015 | 3.982855 | 7.466215 | 0.388928 | 0.880638 | 0.260557 | -1.844364 | -1.291594 | 0.596918 | 0.667 | 86.954464 | NaN | NaN | NaN | NaN | NaN | |
2016 | 4.220169 | 7.461401 | 0.522566 | 0.793246 | 0.324990 | -1.917693 | -1.432548 | 0.418629 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2017 | 2.661718 | 7.460144 | 0.427011 | 0.954393 | 0.261179 | NaN | NaN | 0.286599 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
Albania | 2007 | 4.634252 | 9.077325 | 0.528605 | 0.874700 | 0.300681 | -0.045108 | -0.420024 | NaN | NaN | 50.862987 | 11.631 | 6.321 | 8.880175 | 76.470293 | 30.653669 |
2009 | 5.485470 | 9.161633 | 0.525223 | 0.863665 | NaN | 0.048114 | -0.264635 | 0.617361 | NaN | 48.637067 | 11.679 | 6.684 | 8.463262 | 76.840366 | 33.400749 | |
2010 | 5.268937 | 9.203026 | 0.568958 | 0.726262 | NaN | -0.033831 | -0.246433 | 0.543528 | 0.273 | 47.885076 | 11.952 | 6.841 | 8.463262 | 77.036951 | 44.540649 | |
2011 | 5.867422 | 9.230898 | 0.487496 | 0.877003 | NaN | -0.110023 | -0.278413 | 0.407266 | 0.279 | 46.720288 | 12.325 | 6.981 | 9.850791 | 77.240585 | 49.670399 | |
2012 | 5.510124 | 9.246649 | 0.601512 | 0.847675 | 0.364894 | -0.060784 | -0.328862 | 0.568153 | 0.281 | 45.835739 | 12.730 | 7.109 | 9.710531 | 77.443976 | 58.565491 | |
2013 | 4.550648 | 9.258439 | 0.631830 | 0.862905 | 0.338095 | 0.070411 | -0.330956 | 0.633796 | 0.272 | 45.247477 | 13.106 | 7.232 | 9.762421 | 77.640463 | 62.547760 | |
2014 | 4.813763 | 9.278097 | 0.734648 | 0.882704 | 0.498786 | 0.314873 | -0.187407 | 0.417219 | 0.267 | 44.912168 | 13.414 | 7.350 | 9.369005 | 77.830463 | 62.706848 | |
2015 | 4.606651 | 9.303031 | 0.703851 | 0.884793 | 0.506978 | 0.251629 | -0.152544 | 0.422627 | 0.267 | 44.806973 | NaN | NaN | NaN | NaN | NaN | |
2016 | 4.511101 | 9.337774 | 0.729819 | 0.901071 | 0.400910 | 0.208456 | -0.139161 | 0.416540 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2017 | 4.639548 | 9.373718 | 0.749611 | 0.876135 | 0.457738 | NaN | NaN | 0.410488 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
Algeria | 2010 | 5.463567 | 9.462701 | 0.592696 | 0.618038 | NaN | -1.140853 | -0.740093 | 0.492713 | 0.523 | 48.681853 | 24.643 | 5.108 | 9.646135 | 73.804049 | 29.834560 |
2011 | 5.317194 | 9.471962 | 0.529561 | 0.637982 | NaN | -1.182341 | -0.776610 | 0.426202 | 0.514 | 49.233576 | 24.921 | 5.123 | 9.365661 | 74.070000 | 31.202591 | |
2012 | 5.604596 | 9.485086 | 0.586663 | 0.690116 | NaN | -1.115535 | -0.771172 | 0.421409 | 0.432 | 49.847713 | 24.946 | 5.130 | 9.988192 | 74.324098 | 32.231331 | |
2014 | 6.354898 | 9.509210 | NaN | NaN | NaN | -1.002867 | -0.783428 | 0.475492 | 0.429 | 51.536631 | 24.309 | 5.125 | 9.904623 | 74.808098 | 34.593811 | |
2016 | 5.340854 | 9.541166 | NaN | NaN | NaN | -1.008262 | -0.814304 | 0.604617 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2017 | 5.248912 | 9.540244 | 0.436670 | 0.699774 | NaN | NaN | NaN | 0.527556 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
Angola | 2011 | 5.589001 | 8.684613 | 0.583702 | 0.911320 | 0.232387 | -0.747358 | -1.215250 | NaN | NaN | 102.106756 | 47.018 | 14.642 | 5.584801 | 51.059317 | 6.946090 |
2012 | 4.360250 | 8.699287 | 0.456029 | 0.906300 | 0.237091 | -0.732785 | -1.124386 | NaN | NaN | 101.836900 | 46.499 | 14.329 | 5.573031 | 51.464000 | NaN | |
2013 | 3.937107 | 8.729884 | 0.409555 | 0.816375 | 0.547732 | -0.752538 | -1.213750 | 0.588065 | NaN | 101.315235 | 45.985 | 14.021 | 7.423221 | 51.866171 | 9.923570 | |
2014 | 3.794838 | 8.741957 | 0.374542 | 0.834076 | 0.572346 | -0.739363 | -1.168539 | 0.440699 | NaN | 100.637667 | 45.483 | 13.720 | 5.004294 | 52.266878 | NaN | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Yemen | 2011 | 3.746256 | 8.244134 | 0.638211 | 0.753882 | 0.387044 | -1.910233 | -1.124420 | 0.412353 | 0.772 | 80.193948 | 34.017 | 7.280 | 4.297796 | 63.053537 | 9.974600 |
2012 | 4.060601 | 8.241021 | 0.705815 | 0.793233 | 0.598435 | -1.892755 | -1.119488 | 0.415892 | 0.767 | 78.902603 | 33.481 | 7.145 | 3.932508 | 63.327293 | NaN | |
2013 | 4.217679 | 8.261730 | 0.542547 | 0.885197 | 0.387677 | -1.854220 | -1.092611 | 0.429792 | 0.762 | 77.734373 | 32.947 | 7.024 | 3.932508 | 63.583512 | NaN | |
2014 | 3.967958 | 8.233983 | 0.663909 | 0.885429 | 0.344929 | -1.983291 | -1.264091 | 0.447447 | 0.757 | 76.644268 | 32.418 | 6.919 | 3.932508 | 63.818195 | NaN | |
2015 | 2.982674 | 7.878930 | 0.609981 | 0.829098 | 0.263297 | -2.101797 | -1.374387 | 0.445243 | 0.767 | 75.595147 | NaN | NaN | NaN | NaN | NaN | |
2016 | 3.825631 | 7.751505 | 0.532964 | NaN | 0.267581 | -2.222766 | -1.642179 | 0.411021 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2017 | 3.253560 | NaN | 0.595191 | NaN | 0.247787 | NaN | NaN | 0.374522 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
Zambia | 2006 | 4.824455 | 7.866006 | 0.720972 | 0.785281 | 0.526590 | 0.077581 | -0.642004 | NaN | NaN | 98.124971 | 43.463 | 13.598 | 14.142366 | 50.966610 | NaN |
2007 | 3.998293 | 7.918941 | 0.682005 | 0.947914 | 0.404140 | 0.078216 | -0.536810 | NaN | NaN | 98.223058 | 43.172 | 12.801 | 9.465782 | 52.477146 | NaN | |
2008 | 4.730263 | 7.966175 | 0.716994 | 0.890299 | 0.557462 | 0.154371 | -0.499575 | NaN | NaN | 98.277253 | 42.810 | 12.060 | 11.217006 | 53.905634 | NaN | |
2009 | 5.260361 | 8.026193 | 0.696183 | 0.916553 | 0.413418 | 0.133997 | -0.567933 | 0.581005 | NaN | 98.260795 | 42.382 | 11.391 | 12.632602 | 55.214171 | NaN | |
2011 | 4.999114 | 8.120028 | 0.662850 | 0.882150 | 0.397613 | 0.169676 | -0.487765 | 0.521399 | 0.556 | 97.854237 | 41.415 | 10.283 | 11.132558 | 57.422195 | NaN | |
2012 | 5.013375 | 8.163204 | 0.787760 | 0.806394 | 0.594114 | 0.264068 | -0.388092 | 0.610944 | 0.550 | 97.385802 | 40.928 | 9.822 | 11.350678 | 58.363317 | NaN | |
2013 | 5.243996 | 8.182191 | 0.769912 | 0.732268 | 0.552761 | 0.164946 | -0.385220 | 0.514960 | 0.544 | 96.791310 | 40.471 | 9.402 | 11.008191 | 59.237366 | NaN | |
2014 | 4.345837 | 8.197678 | 0.811825 | 0.808841 | 0.606339 | 0.023306 | -0.395449 | 0.621956 | 0.541 | 96.122165 | 40.052 | 9.018 | 11.310214 | 60.047049 | NaN | |
2015 | 4.843164 | 8.196217 | 0.758654 | 0.871020 | 0.631103 | 0.040718 | -0.391482 | 0.671201 | 0.526 | 95.402326 | NaN | NaN | NaN | NaN | NaN | |
2016 | 4.347544 | 8.201650 | 0.811575 | 0.770644 | 0.696892 | -0.058471 | -0.460033 | 0.681393 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2017 | 3.932777 | 8.211670 | 0.823169 | 0.739541 | 0.717004 | NaN | NaN | 0.612799 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
Zimbabwe | 2006 | 3.826268 | 7.366704 | 0.431110 | 0.904757 | 0.317073 | -1.236102 | -1.570760 | NaN | NaN | 81.702148 | 34.958 | 17.882 | 8.257254 | 42.810707 | NaN |
2007 | 3.280247 | 7.313939 | 0.455957 | 0.946287 | 0.225752 | -1.340245 | -1.653740 | NaN | NaN | 81.272166 | 35.397 | 16.945 | 7.507345 | 44.177756 | NaN | |
2008 | 3.174264 | 7.102516 | 0.343556 | 0.963846 | 0.181594 | -1.381488 | -1.701545 | NaN | NaN | 81.024020 | 35.788 | 15.903 | 6.404965 | 45.804488 | NaN | |
2009 | 4.055914 | 7.197595 | 0.411089 | 0.930818 | 0.285287 | -1.353181 | -1.717821 | 0.545112 | NaN | 80.934968 | 36.094 | 14.809 | 10.578529 | 47.624659 | NaN | |
2010 | 4.681570 | 7.296330 | 0.664718 | 0.828361 | 0.471201 | -1.289599 | -1.693678 | 0.680030 | 0.581 | 80.985702 | 36.267 | 13.711 | 7.471234 | 49.574659 | 5.905600 | |
2011 | 4.845642 | 7.418864 | 0.632978 | 0.829800 | 0.425926 | -1.204545 | -1.621979 | 0.514646 | 0.575 | 80.740494 | 36.264 | 12.645 | 7.594706 | 51.600366 | 5.823760 | |
2012 | 4.955101 | 7.534424 | 0.469531 | 0.858691 | 0.407084 | -1.125315 | -1.555728 | 0.487203 | 0.569 | 80.579870 | 36.077 | 11.626 | 9.691281 | 53.643073 | 5.868670 | |
2013 | 4.690188 | 7.565154 | 0.575884 | 0.830937 | 0.527755 | -1.026085 | -1.526321 | 0.555439 | 0.532 | 80.499816 | 35.715 | 10.675 | 9.593592 | 55.633000 | 5.871750 | |
2014 | 4.184451 | 7.562753 | 0.642034 | 0.820217 | 0.566209 | -0.985267 | -1.484067 | 0.601080 | 0.535 | 80.456439 | 35.189 | 9.819 | 8.486653 | 57.498317 | NaN | |
2015 | 3.703191 | 7.556052 | 0.667193 | 0.810457 | 0.590012 | -0.893078 | -1.357514 | 0.655137 | 0.540 | 80.391033 | NaN | NaN | NaN | NaN | NaN | |
2016 | 3.735400 | 7.538829 | 0.732971 | 0.723612 | 0.699344 | -0.863044 | -1.371214 | 0.596690 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2017 | 3.638300 | 7.538187 | 0.752826 | 0.751208 | 0.682647 | NaN | NaN | 0.581484 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1562 rows × 15 columns
# Num entries and num non-null values
df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1562 entries, (Afghanistan, 2008) to (Zimbabwe, 2017)
Data columns (total 15 columns):
Life Ladder                                                 1562 non-null float64
Log GDP per capita                                          1535 non-null float64
Freedom to make life choices                                1533 non-null float64
Perceptions of corruption                                   1472 non-null float64
Confidence in national government                           1401 non-null float64
Democratic Quality                                          1391 non-null float64
Delivery Quality                                            1391 non-null float64
gini of household income reported in Gallup, by wp5-year    1205 non-null float64
HDI Inequality                                              754 non-null float64
Age dependency ratio (% of working-age population)          1229 non-null float64
Birth rate, crude (per 1,000 people)                        1098 non-null float64
Death rate, crude (per 1,000 people)                        1098 non-null float64
Health expenditure, public (% of government expenditure)    1089 non-null float64
Life expectancy at birth, total (years)                     1098 non-null float64
School enrollment, tertiary (% gross)                       802 non-null float64
dtypes: float64(15)
memory usage: 189.1+ KB
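# (1) Fill each country's missing values with that country's own mean for the feature
df = df.groupby(['Country']).transform(lambda x: x.fillna(x.mean()))
# (2) Impute whatever is still missing (features a country never reported) with matrix factorization
from fancyimpute import MatrixFactorization
with warnings.catch_warnings():  # Ignore deprecation warnings
    warnings.simplefilter("ignore")
    with HiddenOutput():
        df.iloc[:, :] = MatrixFactorization().fit_transform(df.iloc[:, :])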
Using TensorFlow backend.
df.isna().sum()
Life Ladder                                                 0
Log GDP per capita                                          0
Freedom to make life choices                                0
Perceptions of corruption                                   0
Confidence in national government                           0
Democratic Quality                                          0
Delivery Quality                                            0
gini of household income reported in Gallup, by wp5-year    0
HDI Inequality                                              0
Age dependency ratio (% of working-age population)          0
Birth rate, crude (per 1,000 people)                        0
Death rate, crude (per 1,000 people)                        0
Health expenditure, public (% of government expenditure)    0
Life expectancy at birth, total (years)                     0
School enrollment, tertiary (% gross)                       0
dtype: int64
There are no null values left, so we now have the final dataset.
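To see what the per-country mean fill does on its own, and why a second imputation pass is still needed afterwards, here is a small sketch on invented numbers (the `toy` frame and its short column names are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Toy frame: Afghanistan never reports Enrollment, Albania reports it twice.
toy = pd.DataFrame(
    {"Birth Rate": [44.9, np.nan, 42.8, 12.3, 11.9, np.nan],
     "Enrollment": [np.nan, np.nan, np.nan, 30.0, np.nan, 34.0]},
    index=pd.MultiIndex.from_product(
        [["Afghanistan", "Albania"], [2005, 2006, 2007]],
        names=["Country", "Year"]),
)

# Same pattern as step (1): fill gaps with the country's own column mean.
filled = toy.groupby("Country").transform(lambda col: col.fillna(col.mean()))
print(filled)
print(filled.isna().sum())
```

Afghanistan's missing birth rate becomes the mean of its two known values, but its Enrollment column stays empty because there is nothing to average. Gaps of that kind are what the second, model-based step has to fill.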
df
Life Ladder | Log GDP per capita | Freedom to make life choices | Perceptions of corruption | Confidence in national government | Democratic Quality | Delivery Quality | gini of household income reported in Gallup, by wp5-year | HDI Inequality | Age dependency ratio (% of working-age population) | Birth rate, crude (per 1,000 people) | Death rate, crude (per 1,000 people) | Health expenditure, public (% of government expenditure) | Life expectancy at birth, total (years) | School enrollment, tertiary (% gross) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | Year | |||||||||||||||
Afghanistan | 2008 | 3.723590 | 7.168690 | 0.718114 | 0.881686 | 0.612072 | -1.929690 | -1.655084 | 0.385668 | 0.695000 | 100.215886 | 41.560000 | 9.771000 | 6.926090 | 58.225024 | 5.440723 |
2009 | 4.401778 | 7.333790 | 0.678896 | 0.850035 | 0.611545 | -2.044093 | -1.635025 | 0.441906 | 0.695000 | 100.060480 | 40.265000 | 9.475000 | 12.728971 | 58.603683 | 3.903390 | |
2010 | 4.758381 | 7.386629 | 0.600127 | 0.706766 | 0.299357 | -1.991810 | -1.617176 | 0.327318 | 0.724000 | 99.459839 | 38.940000 | 9.193000 | 14.404153 | 58.970829 | 5.440723 | |
2011 | 3.831719 | 7.415019 | 0.495901 | 0.731109 | 0.307386 | -1.919018 | -1.616221 | 0.336764 | 0.713000 | 97.667911 | 37.636000 | 8.927000 | 10.174108 | 59.327951 | 3.755980 | |
2012 | 3.782938 | 7.517126 | 0.530935 | 0.775620 | 0.435440 | -1.842996 | -1.404078 | 0.344540 | 0.701000 | 95.312707 | 36.396000 | 8.677000 | 11.668976 | 59.679610 | 5.440723 | |
2013 | 3.572100 | 7.503376 | 0.577955 | 0.823204 | 0.482847 | -1.879709 | -1.403036 | 0.304368 | 0.689000 | 92.602785 | 35.253000 | 8.445000 | 10.591285 | 60.028268 | 5.440723 | |
2014 | 3.130896 | 7.484583 | 0.508514 | 0.871242 | 0.409048 | -1.773257 | -1.312503 | 0.413974 | 0.676000 | 89.773777 | 34.225000 | 8.230000 | 11.998628 | 60.374463 | 8.662800 | |
2015 | 3.982855 | 7.466215 | 0.388928 | 0.880638 | 0.260557 | -1.844364 | -1.291594 | 0.596918 | 0.667000 | 86.954464 | 37.753571 | 8.959714 | 11.213173 | 59.315690 | 5.440723 | |
2016 | 4.220169 | 7.461401 | 0.522566 | 0.793246 | 0.324990 | -1.917693 | -1.432548 | 0.418629 | 0.695000 | 95.255981 | 37.753571 | 8.959714 | 11.213173 | 59.315690 | 5.440723 | |
2017 | 2.661718 | 7.460144 | 0.427011 | 0.954393 | 0.261179 | -1.904737 | -1.485251 | 0.286599 | 0.695000 | 95.255981 | 37.753571 | 8.959714 | 11.213173 | 59.315690 | 5.440723 | |
Albania | 2007 | 4.634252 | 9.077325 | 0.528605 | 0.874700 | 0.300681 | -0.045108 | -0.420024 | 0.492998 | 0.273167 | 50.862987 | 11.631000 | 6.321000 | 8.880175 | 76.470293 | 30.653669 |
2009 | 5.485470 | 9.161633 | 0.525223 | 0.863665 | 0.409726 | 0.048114 | -0.264635 | 0.617361 | 0.273167 | 48.637067 | 11.679000 | 6.684000 | 8.463262 | 76.840366 | 33.400749 | |
2010 | 5.268937 | 9.203026 | 0.568958 | 0.726262 | 0.409726 | -0.033831 | -0.246433 | 0.543528 | 0.273000 | 47.885076 | 11.952000 | 6.841000 | 8.463262 | 77.036951 | 44.540649 | |
2011 | 5.867422 | 9.230898 | 0.487496 | 0.877003 | 0.409726 | -0.110023 | -0.278413 | 0.407266 | 0.279000 | 46.720288 | 12.325000 | 6.981000 | 9.850791 | 77.240585 | 49.670399 | |
2012 | 5.510124 | 9.246649 | 0.601512 | 0.847675 | 0.364894 | -0.060784 | -0.328862 | 0.568153 | 0.281000 | 45.835739 | 12.730000 | 7.109000 | 9.710531 | 77.443976 | 58.565491 | |
2013 | 4.550648 | 9.258439 | 0.631830 | 0.862905 | 0.338095 | 0.070411 | -0.330956 | 0.633796 | 0.272000 | 45.247477 | 13.106000 | 7.232000 | 9.762421 | 77.640463 | 62.547760 | |
2014 | 4.813763 | 9.278097 | 0.734648 | 0.882704 | 0.498786 | 0.314873 | -0.187407 | 0.417219 | 0.267000 | 44.912168 | 13.414000 | 7.350000 | 9.369005 | 77.830463 | 62.706848 | |
2015 | 4.606651 | 9.303031 | 0.703851 | 0.884793 | 0.506978 | 0.251629 | -0.152544 | 0.422627 | 0.267000 | 44.806973 | 12.405286 | 6.931143 | 9.214207 | 77.214728 | 48.869367 | |
2016 | 4.511101 | 9.337774 | 0.729819 | 0.901071 | 0.400910 | 0.208456 | -0.139161 | 0.416540 | 0.273167 | 46.863472 | 12.405286 | 6.931143 | 9.214207 | 77.214728 | 48.869367 | |
2017 | 4.639548 | 9.373718 | 0.749611 | 0.876135 | 0.457738 | 0.071527 | -0.260937 | 0.410488 | 0.273167 | 46.863472 | 12.405286 | 6.931143 | 9.214207 | 77.214728 | 48.869367 | |
Algeria | 2010 | 5.463567 | 9.462701 | 0.592696 | 0.618038 | 0.492667 | -1.140853 | -0.740093 | 0.492713 | 0.523000 | 48.681853 | 24.643000 | 5.108000 | 9.646135 | 73.804049 | 29.834560 |
2011 | 5.317194 | 9.471962 | 0.529561 | 0.637982 | 0.649212 | -1.182341 | -0.776610 | 0.426202 | 0.514000 | 49.233576 | 24.921000 | 5.123000 | 9.365661 | 74.070000 | 31.202591 | |
2012 | 5.604596 | 9.485086 | 0.586663 | 0.690116 | 0.455042 | -1.115535 | -0.771172 | 0.421409 | 0.432000 | 49.847713 | 24.946000 | 5.130000 | 9.988192 | 74.324098 | 32.231331 | |
2014 | 6.354898 | 9.509210 | 0.536398 | 0.661478 | 0.478946 | -1.002867 | -0.783428 | 0.475492 | 0.429000 | 51.536631 | 24.309000 | 5.125000 | 9.904623 | 74.808098 | 34.593811 | |
2016 | 5.340854 | 9.541166 | 0.536398 | 0.661478 | 0.453047 | -1.008262 | -0.814304 | 0.604617 | 0.474500 | 49.824943 | 24.704750 | 5.121500 | 9.726153 | 74.251561 | 31.965573 | |
2017 | 5.248912 | 9.540244 | 0.436670 | 0.699774 | 0.353288 | -1.089971 | -0.777121 | 0.527556 | 0.474500 | 49.824943 | 24.704750 | 5.121500 | 9.726153 | 74.251561 | 31.965573 | |
Angola | 2011 | 5.589001 | 8.684613 | 0.583702 | 0.911320 | 0.232387 | -0.747358 | -1.215250 | 0.514382 | 0.669352 | 102.106756 | 47.018000 | 14.642000 | 5.584801 | 51.059317 | 6.946090 |
2012 | 4.360250 | 8.699287 | 0.456029 | 0.906300 | 0.237091 | -0.732785 | -1.124386 | 0.514382 | 0.651276 | 101.836900 | 46.499000 | 14.329000 | 5.573031 | 51.464000 | 8.434830 | |
2013 | 3.937107 | 8.729884 | 0.409555 | 0.816375 | 0.547732 | -0.752538 | -1.213750 | 0.588065 | 0.673532 | 101.315235 | 45.985000 | 14.021000 | 7.423221 | 51.866171 | 9.923570 | |
2014 | 3.794838 | 8.741957 | 0.374542 | 0.834076 | 0.572346 | -0.739363 | -1.168539 | 0.440699 | 0.681051 | 100.637667 | 45.483000 | 13.720000 | 5.004294 | 52.266878 | 8.434830 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Yemen | 2011 | 3.746256 | 8.244134 | 0.638211 | 0.753882 | 0.387044 | -1.910233 | -1.124420 | 0.412353 | 0.772000 | 80.193948 | 34.017000 | 7.280000 | 4.297796 | 63.053537 | 9.974600 |
2012 | 4.060601 | 8.241021 | 0.705815 | 0.793233 | 0.598435 | -1.892755 | -1.119488 | 0.415892 | 0.767000 | 78.902603 | 33.481000 | 7.145000 | 3.932508 | 63.327293 | 10.532505 | |
2013 | 4.217679 | 8.261730 | 0.542547 | 0.885197 | 0.387677 | -1.854220 | -1.092611 | 0.429792 | 0.762000 | 77.734373 | 32.947000 | 7.024000 | 3.932508 | 63.583512 | 10.532505 | |
2014 | 3.967958 | 8.233983 | 0.663909 | 0.885429 | 0.344929 | -1.983291 | -1.264091 | 0.447447 | 0.757000 | 76.644268 | 32.418000 | 6.919000 | 3.932508 | 63.818195 | 10.532505 | |
2015 | 2.982674 | 7.878930 | 0.609981 | 0.829098 | 0.263297 | -2.101797 | -1.374387 | 0.445243 | 0.767000 | 75.595147 | 34.061143 | 7.322143 | 4.140908 | 62.998774 | 10.532505 | |
2016 | 3.825631 | 7.751505 | 0.532964 | 0.833238 | 0.267581 | -2.222766 | -1.642179 | 0.411021 | 0.767333 | 80.339638 | 34.061143 | 7.322143 | 4.140908 | 62.998774 | 10.532505 | |
2017 | 3.253560 | 8.191046 | 0.595191 | 0.833238 | 0.247787 | -1.888195 | -1.154295 | 0.374522 | 0.767333 | 80.339638 | 34.061143 | 7.322143 | 4.140908 | 62.998774 | 10.532505 | |
Zambia | 2006 | 4.824455 | 7.866006 | 0.720972 | 0.785281 | 0.526590 | 0.077581 | -0.642004 | 0.601957 | 0.543400 | 98.124971 | 43.463000 | 13.598000 | 14.142366 | 50.966610 | 7.134560 |
2007 | 3.998293 | 7.918941 | 0.682005 | 0.947914 | 0.404140 | 0.078216 | -0.536810 | 0.601957 | 0.543400 | 98.223058 | 43.172000 | 12.801000 | 9.465782 | 52.477146 | 18.847403 | |
2008 | 4.730263 | 7.966175 | 0.716994 | 0.890299 | 0.557462 | 0.154371 | -0.499575 | 0.601957 | 0.543400 | 98.277253 | 42.810000 | 12.060000 | 11.217006 | 53.905634 | 8.398973 | |
2009 | 5.260361 | 8.026193 | 0.696183 | 0.916553 | 0.413418 | 0.133997 | -0.567933 | 0.581005 | 0.543400 | 98.260795 | 42.382000 | 11.391000 | 12.632602 | 55.214171 | 34.661095 | |
2011 | 4.999114 | 8.120028 | 0.662850 | 0.882150 | 0.397613 | 0.169676 | -0.487765 | 0.521399 | 0.556000 | 97.854237 | 41.415000 | 10.283000 | 11.132558 | 57.422195 | 13.709220 | |
2012 | 5.013375 | 8.163204 | 0.787760 | 0.806394 | 0.594114 | 0.264068 | -0.388092 | 0.610944 | 0.550000 | 97.385802 | 40.928000 | 9.822000 | 11.350678 | 58.363317 | 7.605665 | |
2013 | 5.243996 | 8.182191 | 0.769912 | 0.732268 | 0.552761 | 0.164946 | -0.385220 | 0.514960 | 0.544000 | 96.791310 | 40.471000 | 9.402000 | 11.008191 | 59.237366 | 8.960832 | |
2014 | 4.345837 | 8.197678 | 0.811825 | 0.808841 | 0.606339 | 0.023306 | -0.395449 | 0.621956 | 0.541000 | 96.122165 | 40.052000 | 9.018000 | 11.310214 | 60.047049 | 8.092875 | |
2015 | 4.843164 | 8.196217 | 0.758654 | 0.871020 | 0.631103 | 0.040718 | -0.391482 | 0.671201 | 0.526000 | 95.402326 | 41.836625 | 11.046875 | 11.532425 | 55.954186 | 18.551970 | |
2016 | 4.347544 | 8.201650 | 0.811575 | 0.770644 | 0.696892 | -0.058471 | -0.460033 | 0.681393 | 0.543400 | 97.382435 | 41.836625 | 11.046875 | 11.532425 | 55.954186 | 16.158530 | |
2017 | 3.932777 | 8.211670 | 0.823169 | 0.739541 | 0.717004 | 0.104841 | -0.475436 | 0.612799 | 0.543400 | 97.382435 | 41.836625 | 11.046875 | 11.532425 | 55.954186 | 17.988343 | |
Zimbabwe | 2006 | 3.826268 | 7.366704 | 0.431110 | 0.904757 | 0.317073 | -1.236102 | -1.570760 | 0.579647 | 0.555333 | 81.702148 | 34.958000 | 17.882000 | 8.257254 | 42.810707 | 5.867445 |
2007 | 3.280247 | 7.313939 | 0.455957 | 0.946287 | 0.225752 | -1.340245 | -1.653740 | 0.579647 | 0.555333 | 81.272166 | 35.397000 | 16.945000 | 7.507345 | 44.177756 | 5.867445 | |
2008 | 3.174264 | 7.102516 | 0.343556 | 0.963846 | 0.181594 | -1.381488 | -1.701545 | 0.579647 | 0.555333 | 81.024020 | 35.788000 | 15.903000 | 6.404965 | 45.804488 | 5.867445 | |
2009 | 4.055914 | 7.197595 | 0.411089 | 0.930818 | 0.285287 | -1.353181 | -1.717821 | 0.545112 | 0.555333 | 80.934968 | 36.094000 | 14.809000 | 10.578529 | 47.624659 | 5.867445 | |
2010 | 4.681570 | 7.296330 | 0.664718 | 0.828361 | 0.471201 | -1.289599 | -1.693678 | 0.680030 | 0.581000 | 80.985702 | 36.267000 | 13.711000 | 7.471234 | 49.574659 | 5.905600 | |
2011 | 4.845642 | 7.418864 | 0.632978 | 0.829800 | 0.425926 | -1.204545 | -1.621979 | 0.514646 | 0.575000 | 80.740494 | 36.264000 | 12.645000 | 7.594706 | 51.600366 | 5.823760 | |
2012 | 4.955101 | 7.534424 | 0.469531 | 0.858691 | 0.407084 | -1.125315 | -1.555728 | 0.487203 | 0.569000 | 80.579870 | 36.077000 | 11.626000 | 9.691281 | 53.643073 | 5.868670 | |
2013 | 4.690188 | 7.565154 | 0.575884 | 0.830937 | 0.527755 | -1.026085 | -1.526321 | 0.555439 | 0.532000 | 80.499816 | 35.715000 | 10.675000 | 9.593592 | 55.633000 | 5.871750 | |
2014 | 4.184451 | 7.562753 | 0.642034 | 0.820217 | 0.566209 | -0.985267 | -1.484067 | 0.601080 | 0.535000 | 80.456439 | 35.189000 | 9.819000 | 8.486653 | 57.498317 | 5.867445 | |
2015 | 3.703191 | 7.556052 | 0.667193 | 0.810457 | 0.590012 | -0.893078 | -1.357514 | 0.655137 | 0.540000 | 80.391033 | 35.749889 | 13.779444 | 8.398395 | 49.818558 | 5.867445 | |
2016 | 3.735400 | 7.538829 | 0.732971 | 0.723612 | 0.699344 | -0.863044 | -1.371214 | 0.596690 | 0.555333 | 80.858665 | 35.749889 | 13.779444 | 8.398395 | 49.818558 | 5.867445 | |
2017 | 3.638300 | 7.538187 | 0.752826 | 0.751208 | 0.682647 | -1.154359 | -1.568579 | 0.581484 | 0.555333 | 80.858665 | 35.749889 | 13.779444 | 8.398395 | 49.818558 | 5.867445 |
1562 rows × 15 columns
Choosing some interesting features, let's first see whether, and how, they correlate with Life Ladder.
with HiddenOutput():
    sm = pd.plotting.scatter_matrix(df[['Life Ladder', 'Log GDP per capita',
                                        'gini of household income reported in Gallup, by wp5-year',
                                        'HDI Inequality', 'Age dependency ratio (% of working-age population)'
                                        ]], figsize=(10, 10), diagonal='kde')
    # Rotate labels so they don't overlap
    [s.xaxis.label.set_rotation(45) for s in sm.reshape(-1)]
    [s.yaxis.label.set_rotation(0) for s in sm.reshape(-1)]
    # Prevent labels from overlapping plots
    [s.get_yaxis().set_label_coords(-1.5, 0.5) for s in sm.reshape(-1)]
    # Hide ticks
    [s.set_xticks(()) for s in sm.reshape(-1)]
    [s.set_yticks(()) for s in sm.reshape(-1)]
The most noticeable things:
# (1) Checking correlation coefficient R of household income inequality
df['Life Ladder'].corr(df['gini of household income reported in Gallup, by wp5-year'])
-0.2999144917858168
There is in fact a weak downward trend (R ≈ -0.30): people tend to be less satisfied with their lives as their nation's income inequality increases. Additionally, it looks like income inequality has a greater effect on happiness past a certain mark, which would suggest that inequality up to a certain level makes no significant difference.
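One rough way to check the "past a certain mark" impression is to compute the correlation separately on each side of a cut-off. The sketch below does that on synthetic data that is flat below the cut and falling above it. The `corr_by_threshold` helper, the short column names, and the cut value of 0.525 are all assumptions of mine, not something taken from the dataset.

```python
import numpy as np
import pandas as pd

def corr_by_threshold(frame, x, y, cut):
    """Pearson R between x and y for rows with x <= cut and for rows with x > cut."""
    low = frame[frame[x] <= cut]
    high = frame[frame[x] > cut]
    return low[x].corr(low[y]), high[x].corr(high[y])

# Synthetic data: happiness wiggles around 5 up to gini = 0.5, then drops linearly.
idx = np.arange(13)
gini = 0.2 + 0.05 * idx
ladder = np.where(idx < 7, 5.0 + 0.1 * (-1.0) ** idx, 5.0 - 6.0 * (gini - 0.5))
synthetic = pd.DataFrame({"gini": gini, "Life Ladder": ladder})

low_r, high_r = corr_by_threshold(synthetic, "gini", "Life Ladder", cut=0.525)
print(low_r, high_r)
```

On the real frame the call would look like `corr_by_threshold(df, 'gini of household income reported in Gallup, by wp5-year', 'Life Ladder', cut=...)`, with the cut read off the scatter plot. A clearly stronger negative R above the cut than below it would support the threshold reading.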
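Let's get F-test p-values and fit a multiple linear regression to see some coefficients and get a feel for which individual qualities are generally important for a government to provide. Note that `f_regression` tests each feature against Life Ladder on its own, while the coefficients come from a single model fitted on all features together.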
y = np.array(df['Life Ladder'])
X = np.array(df.iloc[:, 1:])
from sklearn.feature_selection import f_regression
from sklearn import linear_model
# F-test p-values
F, p = f_regression(X, y)
# Multilinear regression coefficients
clf = linear_model.LinearRegression()
clf.fit(X, y)
corr = pd.DataFrame(clf.coef_.reshape(1, -1), columns=list(df.columns[1:])).transpose() \
.rename(columns={0 : 'coef'}).sort_values(['coef'], ascending=False)
corr.join(pd.DataFrame(p.reshape(1, -1), columns=list(df.columns[1:])).transpose() \
.rename(columns={0 : 'p-value'}).sort_values(['p-value'], ascending=False))
coef | p-value | |
---|---|---|
Freedom to make life choices | 1.834539 | 1.224220e-108 |
Log GDP per capita | 0.475416 | 6.149166e-318 |
Democratic Quality | 0.076849 | 4.872091e-162 |
Health expenditure, public (% of government expenditure) | 0.043470 | 3.039275e-76 |
Birth rate, crude (per 1,000 people) | 0.008026 | 3.989166e-172 |
School enrollment, tertiary (% gross) | 0.005124 | 2.875886e-183 |
Age dependency ratio (% of working-age population) | 0.002051 | 1.016683e-124 |
Life expectancy at birth, total (years) | -0.002818 | 3.142558e-261 |
Death rate, crude (per 1,000 people) | -0.055781 | 1.700101e-34 |
Delivery Quality | -0.066200 | 8.043331e-240 |
HDI Inequality | -0.142111 | 1.533536e-203 |
Confidence in national government | -0.831529 | 3.602172e-03 |
Perceptions of corruption | -0.861237 | 4.503876e-74 |
gini of household income reported in Gallup, by wp5-year | -1.068271 | 7.887201e-34 |
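When reading this table it helps to remember that the two columns answer different questions. As far as I understand `f_regression`, it tests each feature on its own against Life Ladder, while the coefficients come from one joint fit on unscaled features, so a feature can have a tiny p-value and still get a small or sign-flipped coefficient (life expectancy is an example). For a single predictor the F statistic follows directly from the Pearson correlation; the sketch below shows that relationship. The sample sizes are hypothetical, since the number of rows is not shown here.

```python
def f_statistic_from_r(r, n_samples):
    """Univariate regression F statistic implied by a Pearson r.

    Uses F = r^2 / (1 - r^2) * (n - 2), the single-predictor case
    with an intercept.
    """
    return (r ** 2) / (1.0 - r ** 2) * (n_samples - 2)

r_gini = -0.2999144917858168          # value printed earlier in the notebook
for n in (100, 500, 1500):            # hypothetical sample sizes
    print(n, round(f_statistic_from_r(r_gini, n), 1))
```

Even a weak correlation produces a large F (and so a very small p-value) once there are enough rows, which is why nearly every p-value in the table is tiny.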
plt.xkcd()
# The mark of a successful Government, apparently
fig, ax = plt.subplots(figsize=(7,5))
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
plt.xticks([])
plt.yticks([])
ax.set_ylim([-30, 10])
data = np.ones(100)
data[20:93] -= (np.arange(73) / 2.5)
data[93:100] = -28
plt.annotate(
'THE MARK OF A SUCCESSFUL\n GOVERNMENT, APPARENTLY',
xy=(1, -30), arrowprops=dict(arrowstyle='->'), xytext=(1, -25))
plt.annotate(
'MAKE THEM LOSE ALL HOPE IN YOU',
xy=(70, 1), xytext=(5, -45), fontsize=17)
plt.plot(data)
plt.xlabel('citizen\'s confidence in government')
plt.ylabel('citizen\'s happiness')
plt.title('HOW TO EFFECTIVELY RUN A COUNTRY')
plt.show()
Joking aside, even though higher confidence in the national government is associated with lower happiness in the multiple regression, it has almost no direct correlation with Life Ladder (computed below). The coefficient is not very useful, and it makes no sense to advise a government to lower people's confidence in it in order to make them happier, so I'll drop the feature.
print(df['Life Ladder'].corr(df['Confidence in national government']))
df = df.drop(columns=['Confidence in national government'])
-0.07361513467078484
Anyway, back to the multilinear coefficients:
Now let's chart happiness using 6 of the most distinguishing features given by the linear regression coefficients. I will min-max scale each feature and then express each one as a share of the country's Life Ladder value, in proportion to the scaled values. This way we can see the makeup of a country's happiness based on how each of its features compares to other countries.
plt.rcdefaults()
fig, ax = plt.subplots(figsize=(15,30))
# Use the most significant variables
ranks = df[['Life Ladder', 'Freedom to make life choices', 'Log GDP per capita',
'gini of household income reported in Gallup, by wp5-year',
'Perceptions of corruption', 'HDI Inequality', 'Delivery Quality'\
]].groupby(['Country']).mean().sort_values(by='Life Ladder', ascending=False)
# Scale each feature value (except for y) between 0.1 and 1 by column
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range=(0.1, 1)).fit(ranks.iloc[:, 1:])
ranks.iloc[:, 1:] = sc.transform(ranks.iloc[:, 1:])
# Multiply percentage of each value by Life Ladder, cumulative sum for stacked bar chart
totals = ranks.iloc[:, 1:].sum(axis=1)
ranks.iloc[:, 1:] = ranks.iloc[:, 1:].divide(totals, axis=0).multiply(ranks['Life Ladder'], \
axis=0).cumsum(axis=1)
# Plot each feature
y_pos = np.arange(len(ranks))
# .values replaces Series.get_values(), which was removed in pandas 1.0
p1=plt.barh(y_pos, ranks['Delivery Quality'].values, color='#FF961A')
p2=plt.barh(y_pos, ranks['HDI Inequality'].values, color = '#FF301A')
p3=plt.barh(y_pos, ranks['Perceptions of corruption'].values, color='#33cc33')
p4=plt.barh(y_pos, ranks['gini of household income reported in Gallup, by wp5-year']\
.values, color='#ffbf00')
p5=plt.barh(y_pos, ranks['Log GDP per capita'].values, color='#ff1a8c')
p6=plt.barh(y_pos, ranks['Freedom to make life choices'].values, color='#4AAAAA')
y_indices = np.array(ranks.index)
labels = np.append(np.arange(1, y_indices.shape[0] + 1).reshape(-1, 1), \
y_indices.reshape(-1, 1), axis=1)
labels = [str(row[0]) + ' ' + row[1] for row in labels]
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.invert_yaxis()
ax.set_xlabel('Life Ladder', fontsize=13)
ax.set_title(\
'Country Happiness with each Feature\'s Relative Makeup - Mean of Data from 2005-2017',
fontsize=15)
plt.ylim(len(ranks) , -1)
plt.xlim(0, 10)
plt.legend((p1[0], p2[0], p3[0], p4[0], p5[0], p6[0]), ('Delivery Quality', 'HDI Inequality',
'Perceived Corruption',
'GINI inequality', 'GDP per Capita',
'Freedom of Choices'))
plt.show()
As expected, as the ranking decreases, many of the features that contribute to a higher Life Ladder shrink as a share of the makeup, while the features that hurt the Life Ladder grow.
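The scaling-and-share computation behind the chart can be sketched on a toy table. This is a plain-Python illustration with made-up countries and values, not the notebook's pandas code; `min_max_scale` and the dictionary layout are my own naming.

```python
def min_max_scale(values, low=0.1, high=1.0):
    """Linearly rescale a column so its min maps to low and its max to high."""
    v_min, v_max = min(values), max(values)
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]

# Toy numbers, NOT taken from the dataset.
ladder = {"A": 7.5, "B": 5.0, "C": 3.5}
features = {
    "freedom": {"A": 0.95, "B": 0.80, "C": 0.60},
    "log_gdp": {"A": 10.9, "B": 9.2, "C": 7.4},
    "corruption": {"A": 0.30, "B": 0.75, "C": 0.90},
}

countries = list(ladder)
scaled = {
    name: dict(zip(countries, min_max_scale([col[c] for c in countries])))
    for name, col in features.items()
}

makeup = {}
for c in countries:
    total = sum(scaled[name][c] for name in features)
    running, bars = 0.0, {}
    for name in features:
        # Share of this feature in the country's total, in ladder units,
        # accumulated so each bar ends where the stacked segment ends.
        running += scaled[name][c] / total * ladder[c]
        bars[name] = running
    makeup[c] = bars

print(makeup["A"])
```

The last cumulative value always equals the country's Life Ladder, so each bar has the right total length and only the split between segments changes. A segment's width reflects the scaled feature value, not whether the feature helps or hurts.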
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
X = np.array(df.iloc[:, 1:])
y = np.array(df['Life Ladder'])
scaler = StandardScaler().fit(X)
stand_X = scaler.transform(X)
train_X, test_X, train_y, test_y = train_test_split(stand_X, y, train_size=0.7,
test_size=0.3, shuffle=True)
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=123)
param_grid = {'C' : np.logspace(-3, 2, 6), 'gamma': np.logspace(-3, 2, 6)}
svr = SVR(kernel='rbf')
grid = GridSearchCV(svr, param_grid=param_grid, cv=cv, return_train_score=True)
grid.fit(train_X, train_y)
GridSearchCV(cv=ShuffleSplit(n_splits=100, random_state=123, test_size=0.2, train_size=None), error_score='raise-deprecating', estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), fit_params=None, iid='warn', n_jobs=None, param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), 'gamma': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])}, pre_dispatch='2*n_jobs', refit=True, return_train_score=True, scoring=None, verbose=0)
grid.best_params_
{'C': 10.0, 'gamma': 0.1}
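To get a feel for what the search covered and what `gamma` controls, here is a small plain-Python sketch. It rebuilds the same 6 x 6 grid of exponents and evaluates the RBF kernel formula exp(-gamma * ||a - b||^2) on two toy points; `rbf_kernel` and the points are illustrative, not part of the notebook.

```python
import math
from itertools import product

def rbf_kernel(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2) between two points."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Same exponents as np.logspace(-3, 2, 6) in the grid above.
values = [10.0 ** e for e in range(-3, 3)]
grid_points = list(product(values, values))   # (C, gamma) pairs
n_splits = 100                                # from the ShuffleSplit above
print(len(grid_points), "combinations,", len(grid_points) * n_splits, "fits")

a, b = (0.0, 0.0), (1.0, 1.0)                 # two standardized toy points
for gamma in (0.001, 0.1, 100.0):
    print(gamma, rbf_kernel(a, b, gamma))
```

A small `gamma` makes distant points look similar (a smoother model), while a large `gamma` makes the model respond only to very close neighbours, which is where overfitting tends to come from.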
print('Train score:', grid.score(train_X, train_y))
print('Test score:', grid.score(test_X, test_y))
Train score: 0.9426830084552748
Test score: 0.8696420975335767
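Since no `scoring` was passed to `GridSearchCV`, these scores come from the estimator's own `score` method, which for `SVR` is the coefficient of determination R². A minimal plain-Python sketch of that metric, on toy values rather than the notebook's predictions:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [4.0, 5.0, 6.0, 7.0]             # toy ladder values
print(r2_score(y_true, y_true))            # perfect predictions
print(r2_score(y_true, [5.5] * 4))         # always predicting the mean
print(r2_score(y_true, [4.5, 4.5, 6.5, 6.5]))
```

Read this way, a test score of about 0.87 means the model explains roughly 87% of the variance in Life Ladder on held-out rows.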
Check for underfitting/overfitting now that we have the chosen parameters:
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
plt.rcdefaults()
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.grid()
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
cv = ShuffleSplit(n_splits=100, test_size=0.2)
plot_learning_curve(SVR(kernel='rbf', C=10, gamma=0.1),
'Learning Curve', stand_X, y, (0.5, 1.01), cv=cv, n_jobs=4)
plt.show()
There is a small gap between the cross-validation and training scores, which suggests more data would help, but since the only way to get more data is to wait for future years of surveys, this is fine. There isn't much overfitting happening anyway.
Here I am going to create a simple function that determines what a given government should work on first in order to improve happiness. Finding the gradient at a point would work with some models, but I decided to use an SVM with an 'rbf' kernel, so to keep things simple I use a greedy search instead: check which single feature change gives the greatest increase in predicted happiness, apply that change, and then repeat from the updated point.
def improve_which_features(X, tot_increase=4, step=0.1):
    '''
    Principal function for this notebook.
    Get the best features to improve when sent parameters of a country.

    Parameters
    ----------
    X : 1-D array of length 13 (one value per feature)
        Must be scaled before being sent
    tot_increase : total number of standard deviations to increase the data by
    step : amount each feature is changed by when comparing

    Returns
    -------
    order : feature index chosen at each increment
    scores : total change applied to each feature, in standard deviations
    '''
    better_X = np.copy(X)
    num_features = X.shape[0]
    # Increment each feature by step tot_increase/step times to check
    # which one increases the score the most each increment
    scores = np.zeros(num_features)
    additional_increase = 0
    order = np.array([])
    for i in np.arange(0, tot_increase, step):
        increase = 0
        best_feature = -1
        for j in range(num_features):
            # Probe the prediction with feature j nudged up and nudged down
            temp_X_plus = np.copy(better_X)
            temp_X_plus[j] += (step + additional_increase)
            temp_X_minus = np.copy(better_X)
            temp_X_minus[j] -= (step + additional_increase)
            first = float(grid.predict(temp_X_plus.reshape(1, -1)))
            second = float(grid.predict(temp_X_minus.reshape(1, -1)))
            new_happ = np.maximum(first, second)
            temp_increase = new_happ - float(grid.predict(better_X.reshape(1, -1)))
            # Limit the total change per feature to 2 SD. Potential to create unrealistic goals otherwise
            if np.greater(temp_increase, increase) and scores[j] < 2:
                increase = temp_increase
                best_feature = j
        if best_feature != -1:
            # The chosen feature is always moved in the + direction here,
            # whichever of the two probes scored higher
            scores[best_feature] += step + additional_increase
            better_X[best_feature] += step + additional_increase
            additional_increase = 0
            order = np.append(order, np.array([best_feature]))
        else:
            # Nothing improved the prediction: widen the probe for the next pass
            additional_increase += step
    return order, scores
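As written, `improve_which_features` compares an upward and a downward probe for each feature but then always adds the step to the chosen feature. A variant that records and applies whichever direction actually helped might look like the sketch below. It is plain Python, and `toy_model` is a hypothetical stand-in for `grid.predict`; nothing here was run against the notebook's model.

```python
def toy_model(x):
    """Hypothetical stand-in for grid.predict: peaks at x = (1.0, -0.5)."""
    return 6.0 - (x[0] - 1.0) ** 2 - 0.5 * (x[1] + 0.5) ** 2

def greedy_steps(predict, x, n_steps=10, step=0.1):
    """Repeatedly apply the single +/- step that raises the prediction most."""
    x = list(x)
    moves = []
    for _ in range(n_steps):
        base = predict(x)
        best_gain, best_move = 0.0, None
        for j in range(len(x)):
            for sign in (+1, -1):
                trial = list(x)
                trial[j] += sign * step
                gain = predict(trial) - base
                if gain > best_gain:
                    best_gain, best_move = gain, (j, sign)
        if best_move is None:      # no single step improves the prediction
            break
        j, sign = best_move
        x[j] += sign * step        # move in the direction that actually helped
        moves.append(best_move)
    return x, moves

final_x, moves = greedy_steps(toy_model, [0.0, 0.0])
print(moves)
print(final_x, toy_model(final_x))
```

Each entry in `moves` is a `(feature_index, direction)` pair, so the output says not only which feature to work on but also whether to raise or lower it.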
Let's use the function on two countries: Kenya and the United States:
Here are the column indices, for reference:
pd.DataFrame(np.linspace(0, 12, 13, dtype=int).reshape(1, -1), columns=list(df.columns[1:]))\
.transpose().rename(columns={0 : 'Index'})
Feature | Index |
---|---|
Log GDP per capita | 0 |
Freedom to make life choices | 1 |
Perceptions of corruption | 2 |
Democratic Quality | 3 |
Delivery Quality | 4 |
gini of household income reported in Gallup, by wp5-year | 5 |
HDI Inequality | 6 |
Age dependency ratio (% of working-age population) | 7 |
Birth rate, crude (per 1,000 people) | 8 |
Death rate, crude (per 1,000 people) | 9 |
Health expenditure, public (% of government expenditure) | 10 |
Life expectancy at birth, total (years) | 11 |
School enrollment, tertiary (% gross) | 12 |
# Kenya (2017)
order, scores = improve_which_features(scaler.transform(np.array( \
df.loc[['Country', 'Kenya']])[11, 1:].reshape(1, -1))[0])
print('Order of largest increase in scores:\n', order)
print('\nTotal scores:\nFeature: 0 1 2 3 4 5 6 7 8 9 10 11 12',
'\n Score:', scores)
Order of largest increase in scores:
 [0. 0. 0. 0. 0. 0. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 0. 4. 4. 0. 4. 4. 0. 7. 4. 7. 0. 4. 7. 0. 5. 5. 5. 5. 5. 5. 5. 5. 5. 5.]

Total scores:
Feature: 0 1 2 3 4 5 6 7 8 9 10 11 12
 Score: [1.1 0. 0. 0. 1.6 1. 0. 0.3 0. 0. 0. 0. 0. ]
For Kenya, Delivery Quality looks like the most important feature to improve (it has the largest total, 1.6 standard deviations). Delivery Quality was judged on government effectiveness, regulatory quality, rule of law, and control of corruption. This means that a focus on the workings of the government, making sure it runs efficiently and for the people, may be essential for Kenya to break into the first world. The order also shows that it would be useful to improve GDP a little before and during the improvement of Delivery Quality. Afterwards it will be important for them to work on income equality. Stimulating the economy by increasing production and bringing in more jobs could potentially address both GDP and income inequality.
# United States (2017)
order, scores = improve_which_features(scaler.transform(np.array( \
df.loc[['Country', 'United States']])[11, 1:].reshape(1, -1))[0])
print('Order of largest increase in scores:\n', order)
print('\nTotal scores:\nFeature: 0 1 2 3 4 5 6 7 8 9 10 11 12',
'\n Score:', scores)
Order of largest increase in scores:
 [4. 4. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 7. 4. 8. 4. 8. 4. 4. 8. 4. 8. 4. 8. 4. 8. 4. 8. 3. 4. 3.]

Total scores:
Feature: 0 1 2 3 4 5 6 7 8 9 10 11 12
 Score: [0. 0. 0. 0.2 1.1 0. 0. 2. 0.7 0. 0. 0. 0. ]
These results show that a slight increase in Delivery Quality could help first. More important, though, is a further decrease in the age dependency ratio, at least among the features we have access to. After the age dependency ratio is improved, an increase in the birth rate, which goes hand in hand with age dependency, helps. Finally, continued improvement in Delivery Quality is needed.
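The scores are in standard-deviation units because the inputs were standardized, so turning them into concrete targets means multiplying each one by that feature's standard deviation (available from the fitted `StandardScaler` as `scaler.scale_`). The sketch below shows the conversion; the standard deviations in it are hypothetical placeholders, not values from the dataset.

```python
def to_original_units(scores_in_sd, feature_sd):
    """Convert per-feature changes from SD units to original units."""
    return {name: scores_in_sd[name] * feature_sd[name] for name in scores_in_sd}

# Score values copied from the United States output above.
us_scores = {
    "Democratic Quality": 0.2,
    "Delivery Quality": 1.1,
    "Age dependency ratio": 2.0,
    "Birth rate, crude": 0.7,
}
# HYPOTHETICAL standard deviations, for illustration only; the real values
# would come from the fitted scaler (scaler.scale_).
assumed_sd = {
    "Democratic Quality": 0.9,
    "Delivery Quality": 1.0,
    "Age dependency ratio": 18.0,
    "Birth rate, crude": 10.0,
}
for name, change in to_original_units(us_scores, assumed_sd).items():
    print(f"{name}: {change:.2f}")
```

The scores record the size of each change, not its direction, so the sign still has to be read from the interpretation above.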
This solution is far from perfect: more features to cover every possible factor and more data to create a more stable model would be very useful. However, I hope this shows you some of my knowledge! The concepts in this notebook could even be applied to something along the lines of customer service.