import numpy as np
import pandas as pd
import plotly.express as px
import os
PATHS = []
for dirname, _, filenames in os.walk('./input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        PATHS.append(os.path.join(dirname, filename))
PATHS.sort()
./input\2015.csv
./input\2016.csv
./input\2017.csv
./input\2018.csv
./input\2019.csv
./input\2020.csv
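Later cells index into PATHS by position (PATHS[0] for 2015, PATHS[-1] for 2020), so the sort order matters. A minimal sketch of an alternative that collects only CSV files and sorts them by file name is shown here. The helper name `collect_csv_paths` and the temporary demo files are assumptions made for illustration and are not part of the original notebook.

```python
import tempfile
from pathlib import Path


def collect_csv_paths(root):
    """Return every CSV file under `root`, sorted by file name."""
    return sorted(Path(root).rglob("*.csv"), key=lambda p: p.name)


# Hypothetical demo: empty stand-in files created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for year in ("2017", "2015", "2016"):
        (Path(tmp) / f"{year}.csv").write_text("Country,Score\n")
    demo_names = [p.name for p in collect_csv_paths(tmp)]

print(demo_names)
```

Sorting on the file name rather than the full path keeps the year order stable even if the files end up in different subfolders.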
# 2020
df = pd.read_csv(PATHS[-1])
The World Happiness Report is a ranking of 156 countries. Nationally representative samples of respondents are asked to rate their own current lives on a 0 to 10 scale, with 10 being the best possible life for them and 0 the worst.
There are a few goals I have for this data set:
I would like to predict the happiness score, but the columns GDP per Capita, Family, Life Expectancy, Freedom, Generosity and Trust Government Corruption add up to the happiness score, which makes them unreliable to predict from. With that said, I do plan on extending this data set with educational data and trying to predict from that.
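To illustrate why a target that is the sum of its own feature columns is not a meaningful prediction problem, here is a small sketch on synthetic data. The random "component" columns are made up and only stand in for the real columns. A least-squares fit simply recovers a coefficient of 1 for each component, so the model learns the addition and nothing else.

```python
import numpy as np

# Synthetic stand-in data: three made-up component columns, 50 rows.
rng = np.random.default_rng(0)
components = rng.uniform(0.0, 2.0, size=(50, 3))

# The target is defined as the sum of the components,
# mirroring how the score is built from its parts.
score = components.sum(axis=1)

# Ordinary least squares without an intercept.
coef, *_ = np.linalg.lstsq(components, score, rcond=None)
max_error = np.max(np.abs(components @ coef - score))

print(coef)       # each coefficient is (numerically) 1
print(max_error)  # essentially zero
```

A perfect fit like this says nothing about what drives happiness, which is why I want to bring in outside data before attempting any prediction.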
The data was sourced from Kaggle at: https://www.kaggle.com/mathurinache/world-happiness-report, which itself is based on https://worldhappiness.report/ed/2020/ (Cite: Helliwell, John F., Richard Layard, Jeffrey Sachs, and Jan-Emmanuel De Neve, eds. 2020. World Happiness Report 2020. New York: Sustainable Development Solutions Network)
Let's get some general information for the 2020 data. From the data below we can see that we have no null data and the vast majority of the data is in numerical form, which is good as it means we can avoid cleaning the data.
print("The shape is: {}, {}".format(df.shape[0], df.shape[1]))
df.info()
## Rename
df.rename({"Ladder score": "Happiness score"},axis=1, inplace=True)
The shape is: 153, 20
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 20 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Country name                                153 non-null    object
 1   Regional indicator                          153 non-null    object
 2   Ladder score                                153 non-null    float64
 3   Standard error of ladder score              153 non-null    float64
 4   upperwhisker                                153 non-null    float64
 5   lowerwhisker                                153 non-null    float64
 6   Logged GDP per capita                       153 non-null    float64
 7   Social support                              153 non-null    float64
 8   Healthy life expectancy                     153 non-null    float64
 9   Freedom to make life choices                153 non-null    float64
 10  Generosity                                  153 non-null    float64
 11  Perceptions of corruption                   153 non-null    float64
 12  Ladder score in Dystopia                    153 non-null    float64
 13  Explained by: Log GDP per capita            153 non-null    float64
 14  Explained by: Social support                153 non-null    float64
 15  Explained by: Healthy life expectancy       153 non-null    float64
 16  Explained by: Freedom to make life choices  153 non-null    float64
 17  Explained by: Generosity                    153 non-null    float64
 18  Explained by: Perceptions of corruption     153 non-null    float64
 19  Dystopia + residual                         153 non-null    float64
dtypes: float64(18), object(2)
memory usage: 24.0+ KB
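The info() output already shows 153 non-null values in every column. A more compact check that lists only the columns with missing values could look like the sketch below. The helper `missing_report` and the toy frame are hypothetical and only serve to demonstrate the idea.

```python
import pandas as pd


def missing_report(frame):
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy frame with made-up values and one deliberately missing score.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [7.1, None, 3.2],
})
report = missing_report(toy)
print(report)
```

On a frame without any missing values the returned Series is empty, which is what I would expect for the 2020 data.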
Let us now observe the happiest and the unhappiest ranked countries in the dataframe.
We display the five happiest and the five unhappiest countries.
fig = px.bar(
data_frame = df.nlargest(5,"Happiness score"),
y="Country name",
x="Happiness score",
orientation='h',
color="Country name",
text="Happiness score")
fig.show()
fig = px.bar(
data_frame = df.nsmallest(5,"Happiness score"),
y="Country name",
x="Happiness score",
orientation='h',
color="Country name",
text="Happiness score")
fig.show()
We can see from the graphs that the happiest countries are all Nordic, with the exception of Switzerland, and the unhappiest are all in Sub-Saharan/Central Africa, with the exception of Afghanistan.
It is also worth noting that all the happiest countries are in Western Europe and are highly socialised countries (Switzerland can be regarded as either incredibly socialist or incredibly capitalist). Similarly, the unhappiest countries are not surprising, given Sub-Saharan Africa's history of war, disease and civil unrest, a history which Afghanistan shares.
Given these results and how tightly related to region the data appears to be, let us see how regions as a whole have been ranked on average.
regional = df.groupby("Regional indicator")
fig = px.bar(
data_frame = regional.median(numeric_only=True),
#y="",
x="Happiness score",
orientation='h',
#color=df.index,
text="Happiness score",)
fig.update_yaxes(categoryorder = "total ascending")
fig.show()
While our bottom region was as expected, our top region was North America and ANZ, not Western Europe, which suggests that there is a large variance in Western European happiness. Let's see how wide the range of scores is per region.
fig = px.box(
data_frame = df,
x="Regional indicator",
y="Happiness score")
fig.show()
As expected, while the happiest countries are in Western Europe, there is a large variance in its values, unlike North America and ANZ, which is more consistent. This is likely because Western Europe covers a much larger number of countries.
We can see from the graph below that the variables most strongly correlated with the happiness score are Log GDP per capita, Social Support and Healthy Life Expectancy, all of which are also very closely correlated with each other. One possible explanation is that a higher GDP per capita leads to more funding for social support and for health care (i.e. it improves life expectancy), which would explain why they are so closely related.
Interestingly, generosity had almost no correlation with the happiness score, nor with much of anything else.
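The box plot can be complemented with a small table of per-region statistics. The sketch below uses a toy frame with made-up region names and scores, so the numbers are illustrative only. The column names match the ones used in this notebook.

```python
import pandas as pd

# Toy data (made-up values): a large, varied region and a small, consistent one.
toy = pd.DataFrame({
    "Regional indicator": ["West", "West", "West", "North", "North"],
    "Happiness score": [7.8, 6.0, 5.1, 7.2, 7.1],
})

# Number of countries, median, and the extremes for each region.
spread = (
    toy.groupby("Regional indicator")["Happiness score"]
       .agg(["count", "median", "min", "max"])
)
spread["range"] = spread["max"] - spread["min"]
print(spread)
```

In the toy data the region with the single highest score still has the lower median, because its scores are spread over a much wider range. That is the same pattern as the one described above.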
cols = df[['Explained by: Log GDP per capita',
'Explained by: Social support',
'Explained by: Healthy life expectancy',
'Explained by: Freedom to make life choices',
'Explained by: Generosity',
'Explained by: Perceptions of corruption',
'Dystopia + residual',
'Happiness score']].corr()
fig = px.imshow(cols,
title = 'Correlation Map for 2020')
fig.show()
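A heatmap can be hard to read precisely, so one option is to pull out the single column of the correlation matrix that belongs to the score and sort it. The sketch below does this on synthetic data. The column names "gdp", "support", "generosity" and "score" are stand-ins I made up, with "generosity" generated as independent noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
gdp = rng.normal(size=n)

# Synthetic stand-in columns (not the real data).
toy = pd.DataFrame({
    "gdp": gdp,
    "support": gdp + 0.3 * rng.normal(size=n),  # closely tracks gdp
    "generosity": rng.normal(size=n),           # unrelated noise
})
toy["score"] = toy["gdp"] + toy["support"] + 0.1 * rng.normal(size=n)

# Correlation of each feature with the score, strongest first.
ranking = toy.corr()["score"].drop("score").sort_values(ascending=False)
print(ranking)
```

Applied to the real frame, the same pattern would be `cols["Happiness score"].drop("Happiness score").sort_values(ascending=False)`, since `cols` already holds the correlation matrix.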
We'll clean the data a little and then look at the general trend over time, showing only the 10 countries from the earlier bar charts by default (the others can be switched on from the legend).
## Add all the data to df from 2015 to 2020
df = []
years = ["2015", "2016", "2017", "2018", "2019", "2020"]
df.append(pd.read_csv(PATHS[0]))
df[0]["Year"] = years[0]
df.append(pd.read_csv(PATHS[1]))
df[1]["Year"] = years[1]
df.append(pd.read_csv(PATHS[2]))
df[2]["Year"] = years[2]
df[2].rename({"Happiness.Score": "Happiness Score"},axis=1, inplace=True)
df.append(pd.read_csv(PATHS[3]))
df[3]["Year"] = years[3]
df[3].rename({"Score": "Happiness Score"},axis=1, inplace=True)
df[3].rename({"Country or region": "Country"},axis=1, inplace=True)
df.append(pd.read_csv(PATHS[4]))
df[4]["Year"] = years[4]
df[4].rename({"Score": "Happiness Score"},axis=1, inplace=True)
df[4].rename({"Country or region": "Country"},axis=1, inplace=True)
df.append(pd.read_csv(PATHS[5]))
df[5]["Year"] = years[5]
df[5].rename({"Ladder score": "Happiness Score"},axis=1, inplace=True)
df[5].rename({"Country name": "Country"},axis=1, inplace=True)
result = pd.concat(df)
result = result.pivot(index = "Year", columns = "Country", values = "Happiness Score")
countries_list = ["Finland", "Denmark", "Switzerland", "Iceland", "Norway", "Afghanistan", "South Sudan", "Zimbabwe", "Rwanda", "Central African Republic"]
fig = px.line(result,
#x = "Year",
#y = "Happiness Score",
title = "Happiness Score of Countries over time.",
labels = {"value":"Happiness Score"})
fig.for_each_trace(lambda trace: trace.update(visible="legendonly")
if trace.name not in countries_list else ())
fig.show()
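The per-year renaming in the cell above could also be written as a lookup table plus one small function. This is only a sketch: the `RENAMES` mapping is copied from the rename calls above, while the helper `harmonise` and the tiny stand-in frames (made-up values, used in place of the real CSV files) are assumptions for illustration.

```python
import pandas as pd

# Column renames per year, taken from the rename calls above.
# Years that are not listed already use the target names.
RENAMES = {
    "2017": {"Happiness.Score": "Happiness Score"},
    "2018": {"Score": "Happiness Score", "Country or region": "Country"},
    "2019": {"Score": "Happiness Score", "Country or region": "Country"},
    "2020": {"Ladder score": "Happiness Score", "Country name": "Country"},
}


def harmonise(frame, year):
    """Rename year-specific columns and tag every row with its year."""
    out = frame.rename(columns=RENAMES.get(year, {}))
    out["Year"] = year
    return out[["Country", "Happiness Score", "Year"]]


# Stand-in frames with made-up values instead of pd.read_csv(...).
raw = {
    "2015": pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.0, 3.0]}),
    "2018": pd.DataFrame({"Country or region": ["A", "B"], "Score": [7.1, 2.9]}),
    "2020": pd.DataFrame({"Country name": ["A", "B"], "Ladder score": [7.2, 2.8]}),
}

long = pd.concat([harmonise(f, y) for y, f in raw.items()], ignore_index=True)
wide = long.pivot(index="Year", columns="Country", values="Happiness Score")
print(wide)
```

Keeping only the three shared columns before concatenating avoids carrying along the many columns that exist in some years but not in others.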
The top countries appear to be increasing very slowly. On the other hand, it's concerning to see that the worst countries are rapidly dropping in score.
We also see below that the mean world happiness is increasing, although very slowly, and the change is likely not statistically significant.
result = pd.concat(df)
result = result.pivot(index = "Year", columns = "Country", values = "Happiness Score")
result["Mean"] = result.mean(axis=1)
#print(result.head())
#print(result["Mean"])
fig = px.line(result["Mean"],
title = "World Average(Mean) Trend",
labels = {"value":"Happiness Score"})
fig.show()
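To put a number on "very slowly", one simple option is to fit a straight line through the yearly means and read off the slope. The sketch below uses made-up yearly means, not the values from the plot above, and the helper name `linear_trend` is hypothetical. It only estimates the slope and does not test for significance.

```python
import numpy as np


def linear_trend(years, means):
    """Least-squares slope (score change per year) and the fitted mean level."""
    centred = years - years.mean()  # centring keeps the fit well conditioned
    slope, level = np.polyfit(centred, means, deg=1)
    return slope, level


years = np.array([2015, 2016, 2017, 2018, 2019, 2020], dtype=float)
means = np.array([5.0, 5.1, 5.0, 5.2, 5.1, 5.3])  # made-up values

slope, level = linear_trend(years, means)
print(f"slope per year: {slope:.3f}")
```

With the real data, `result["Mean"]` would take the place of `means`, and the year labels would need converting to numbers first because they are stored as strings.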
To conclude, we have discovered which countries are the most and least happy, and looked at which features are most correlated with the happiness score. We have also identified the spread of this score across regions, and discussed the trends in the data from 2015 to 2020.
In the future I hope to append additional data to these data sets and try to predict happiness scores from them. In particular, I am interested in the effect that education levels and types have on a country's happiness and how they correlate with the features in the data here.