In [1]:

import numpy as np
import pandas as pd
import plotly.express as px

import os

PATHS = []

for dirname, _, filenames in os.walk('./input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        PATHS.append(os.path.join(dirname, filename))

PATHS.sort()

./input\2015.csv
./input\2016.csv
./input\2017.csv
./input\2018.csv
./input\2019.csv
./input\2020.csv

In [2]:

# 2020
df = pd.read_csv(PATHS[-1])

Analysis of World Happiness¶

Introduction¶

The World Happiness Report is a ranking of around 150 countries (153 appear in the 2020 data used here). Nationally representative samples of respondents are asked to rate their own current lives on a 0 to 10 scale, with 10 being the best possible life for them and 0 the worst.

Goals¶

There are a few goals I want to achieve with this data set:

Identify the happiest and unhappiest countries, and contributing features.
Identify any trends in the dataset and predict how countries' happiness scores will look in the future.

I would like to predict the happiness score, but the columns GDP per Capita, Family, Life Expectancy, Freedom, Generosity and Trust (Government Corruption) add up to the Happiness score, which makes predictions based on them unreliable. With that said, I do plan on extending this data set with educational data and trying to predict from that.
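The additive relationship described above can be checked directly once the data is loaded. The following is a minimal sketch, assuming the 2020 column layout (`Explained by: ...` columns, `Dystopia + residual`, `Ladder score`). The helper name `components_match_score` and the toy values are made up for illustration and are not taken from the report.

```python
import numpy as np
import pandas as pd

def components_match_score(frame, score_col="Ladder score", tol=1e-2):
    """Return True if the component columns sum to the score in every row.

    Assumes the 2020 layout: 'Explained by: ...' columns plus
    'Dystopia + residual'. `tol` allows for rounding in the published values.
    """
    parts = [c for c in frame.columns if c.startswith("Explained by:")]
    parts.append("Dystopia + residual")
    total = frame[parts].sum(axis=1)
    return bool(np.allclose(total, frame[score_col], atol=tol))

# Toy rows with invented numbers, only to show the call
toy = pd.DataFrame({
    "Explained by: Log GDP per capita": [1.0, 0.5],
    "Explained by: Social support": [1.2, 0.4],
    "Dystopia + residual": [2.0, 1.5],
    "Ladder score": [4.2, 2.4],
})
print(components_match_score(toy))
```

On the notebook's data this would be called as `components_match_score(df)` before the rename, or with `score_col="Happiness score"` after it.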

Source¶

The data was sourced from Kaggle at: https://www.kaggle.com/mathurinache/world-happiness-report, which itself is based on https://worldhappiness.report/ed/2020/ (Cite: Helliwell, John F., Richard Layard, Jeffrey Sachs, and Jan-Emmanuel De Neve, eds. 2020. World Happiness Report 2020. New York: Sustainable Development Solutions Network)

Basic Information¶

Let's get some general information for the 2020 data. The output below shows that there are no null values and that all but two columns are numeric, which is good as it means we can avoid most data cleaning.

In [3]:

print("The shape is: {}, {}".format(df.shape[0], df.shape[1]))
df.info()
## Rename
df.rename({"Ladder score": "Happiness score"},axis=1, inplace=True)

The shape is: 153, 20
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 20 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Country name                                153 non-null    object 
 1   Regional indicator                          153 non-null    object 
 2   Ladder score                                153 non-null    float64
 3   Standard error of ladder score              153 non-null    float64
 4   upperwhisker                                153 non-null    float64
 5   lowerwhisker                                153 non-null    float64
 6   Logged GDP per capita                       153 non-null    float64
 7   Social support                              153 non-null    float64
 8   Healthy life expectancy                     153 non-null    float64
 9   Freedom to make life choices                153 non-null    float64
 10  Generosity                                  153 non-null    float64
 11  Perceptions of corruption                   153 non-null    float64
 12  Ladder score in Dystopia                    153 non-null    float64
 13  Explained by: Log GDP per capita            153 non-null    float64
 14  Explained by: Social support                153 non-null    float64
 15  Explained by: Healthy life expectancy       153 non-null    float64
 16  Explained by: Freedom to make life choices  153 non-null    float64
 17  Explained by: Generosity                    153 non-null    float64
 18  Explained by: Perceptions of corruption     153 non-null    float64
 19  Dystopia + residual                         153 non-null    float64
dtypes: float64(18), object(2)
memory usage: 24.0+ KB

Country Happiness¶

Let us now observe the happiest and the unhappiest ranked countries in the dataframe.

We display the happiest 5 and the unhappiest 5 countries.

In [4]:

fig = px.bar(
    data_frame = df.nlargest(5,"Happiness score"),
    y="Country name",
    x="Happiness score",
    orientation='h',
    color="Country name",
    text="Happiness score")

fig.show()

fig = px.bar(
    data_frame = df.nsmallest(5,"Happiness score"),
    y="Country name",
    x="Happiness score",
    orientation='h',
    color="Country name",
    text="Happiness score")

fig.show()

We can see from the graphs that the happiest countries are all Nordic, with the exception of Switzerland, and the unhappiest are all in Sub-Saharan or central Africa, with the exception of Afghanistan.

It is also worth noting that all the happiest countries are in Western Europe and have strong social welfare systems (Switzerland can be regarded as either strongly socialist or strongly capitalist). Similarly, the unhappiest countries are not surprising, given Sub-Saharan Africa's history of war, disease and civil unrest, a history which Afghanistan shares.

Given these results and how tightly related to region the data appears to be, let us see how regions as a whole rank by their median score.

In [5]:

# Median happiness score per region, with the region kept as a column
regional = (
    df.groupby("Regional indicator")["Happiness score"]
    .median()
    .reset_index()
)

fig = px.bar(
    data_frame = regional,
    y="Regional indicator",
    x="Happiness score",
    orientation='h',
    text="Happiness score",)

fig.update_yaxes(categoryorder = "total ascending")

fig.show()

While our bottom region was as expected, our top region was North America and ANZ, not Western Europe, which suggests that there is a large variance in Western European happiness. Let's see how wide the range of scores is per region.

In [6]:

fig = px.box(
    data_frame = df, 
    x="Regional indicator",
    y="Happiness score")
fig.show()

As expected, while the happiest countries are in Western Europe, the region shows a large variance in scores, unlike North America and ANZ, where scores are more consistent. This is likely because Western Europe covers a much larger number of countries.
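The box plot can be backed up with numbers. The sketch below tabulates the country count, median, standard deviation and range per region. The helper name `regional_spread` is hypothetical, and the toy frame uses invented values so the block runs on its own.

```python
import pandas as pd

def regional_spread(frame, region_col="Regional indicator",
                    score_col="Happiness score"):
    """Per-region country count, median, standard deviation and range."""
    summary = frame.groupby(region_col)[score_col].agg(
        ["count", "median", "std", "min", "max"]
    )
    summary["range"] = summary["max"] - summary["min"]
    # Highest median first, matching the ordering of the bar chart
    return summary.sort_values("median", ascending=False)

# Toy data with invented scores, only to show the output shape
toy = pd.DataFrame({
    "Regional indicator": ["A", "A", "A", "B", "B"],
    "Happiness score": [7.0, 6.0, 5.0, 7.2, 7.0],
})
print(regional_spread(toy))
```

Calling `regional_spread(df)` on the 2020 frame would show whether the wider spread in Western Europe goes together with a larger `count`.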

Correlation¶

We can see from the graph below that the variables most strongly correlated with the happiness score are Log GDP per capita, Social Support and Healthy Life Expectancy, all of which are also closely correlated with each other. One possible explanation is that a higher GDP per capita leads to more funding for social support and for health care (which improves life expectancy), which would explain why the three are so closely related.

Interestingly, generosity had almost no correlation with the happiness score, nor with much of anything else.

In [7]:

cols = df[['Explained by: Log GDP per capita',
          'Explained by: Social support',
          'Explained by: Healthy life expectancy',
          'Explained by: Freedom to make life choices',
          'Explained by: Generosity',
          'Explained by: Perceptions of corruption',
          'Dystopia + residual',
          'Happiness score']].corr()

fig = px.imshow(cols,
                title = 'Correlation Map for 2020')
fig.show()
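Heatmap colours can be hard to compare by eye, so it may help to list the correlations with the happiness score as numbers. This is a small sketch under the assumption that the target column is named `Happiness score`. The helper `rank_by_correlation` and the toy columns are hypothetical.

```python
import pandas as pd

def rank_by_correlation(frame, target="Happiness score"):
    """Pearson correlation of each numeric column with `target`,
    ordered by absolute strength (strongest first)."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data with invented values, only to show the ordering
toy = pd.DataFrame({
    "Happiness score": [1.0, 2.0, 3.0, 4.0],
    "x_up": [2.0, 4.0, 6.0, 8.0],       # rises with the score
    "x_down": [4.0, 3.0, 2.0, 1.5],     # falls as the score rises
    "x_flat": [1.0, -1.0, -1.0, 1.0],   # no linear relationship
})
print(rank_by_correlation(toy))
```

Applied to the notebook, the same column subset used for the heatmap could be passed in place of `toy`.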

Trends¶

We'll clean the data a little and then look at the general trend over time. By default the plot shows only the 10 countries from the earlier charts (the 5 happiest and the 5 unhappiest).

In [8]:

## Add all the data from 2015 to 2020 to df (now a list with one DataFrame per year)
years = ["2015", "2016", "2017", "2018", "2019", "2020"]

# Column names differ between years, so map each variant to a common name
RENAMES = {
    "Happiness.Score": "Happiness Score",    # 2017
    "Score": "Happiness Score",              # 2018, 2019
    "Ladder score": "Happiness Score",       # 2020
    "Country or region": "Country",          # 2018, 2019
    "Country name": "Country",               # 2020
}

df = []
for path, year in zip(PATHS, years):
    frame = pd.read_csv(path).rename(columns=RENAMES)
    frame["Year"] = year
    df.append(frame)

result = pd.concat(df)

result = result.pivot(index = "Year", columns = "Country", values = "Happiness Score")
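After the pivot, a country that is missing from one year's file shows up as `NaN` in that row, which matters for any per-year aggregate. The sketch below lists such gaps. The helper name `countries_with_gaps` and the toy table are hypothetical and use invented values.

```python
import pandas as pd

def countries_with_gaps(wide):
    """Given a Year x Country table, return the number of missing years
    for each country that is missing at least one year."""
    missing = wide.isna().sum()
    return missing[missing > 0].sort_values(ascending=False)

# Toy long-format data: country C only appears in 2016
toy_long = pd.DataFrame({
    "Year": ["2015", "2016", "2015", "2016", "2016"],
    "Country": ["A", "A", "B", "B", "C"],
    "Happiness Score": [5.0, 5.1, 6.0, 6.2, 4.0],
})
toy_wide = toy_long.pivot(index="Year", columns="Country",
                          values="Happiness Score")
print(countries_with_gaps(toy_wide))
```

On the notebook's data the call would be `countries_with_gaps(result)`. Dropping those columns with `result.dropna(axis=1)` is one way to compare the same set of countries across all years.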

In [9]:

countries_list = ["Finland", "Denmark", "Switzerland", "Iceland", "Norway", "Afghanistan", "South Sudan", "Zimbabwe", "Rwanda", "Central African Republic"]
fig = px.line(result,
              #x = "Year",
              #y = "Happiness Score",
              title = "Happiness Score of Countries over time.",
              labels = {"value":"Happiness Score"})

fig.for_each_trace(lambda trace: trace.update(visible="legendonly") 
                   if trace.name not in countries_list else ())

fig.show()

The top countries appear to be increasing very slowly. On the other hand, it is concerning to see that the lowest-ranked countries are rapidly dropping in score.

We also see below that the mean world happiness is increasing, although very slowly, and the change is likely not statistically significant.

In [10]:

result = pd.concat(df)
result = result.pivot(index = "Year", columns = "Country", values = "Happiness Score")
result["Mean"] = result.mean(axis=1)

#print(result.head())
#print(result["Mean"])

fig = px.line(result["Mean"],
              title = "World Average(Mean) Trend",
              labels = {"value":"Happiness Score"})

fig.show()
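The remark that the increase is likely below significance can be tested roughly. The sketch below fits a least-squares slope to the yearly means and runs an exact permutation test on it. This test is a technique chosen here for illustration, not something the notebook itself uses, and with only six yearly points it has little power. The function names and toy values are hypothetical.

```python
import itertools
import numpy as np

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    xc = x - x.mean()
    return float(np.sum(xc * (y - y.mean())) / np.sum(xc ** 2))

def exact_permutation_pvalue(years, means):
    """Two-sided exact permutation test for a linear trend.

    Shuffles the means over the years in every possible order and counts
    how often the shuffled slope is at least as steep as the observed one.
    Only practical for a handful of points (6 points -> 720 orderings).
    """
    x = np.asarray(years, dtype=float)
    y = np.asarray(means, dtype=float)
    observed = ols_slope(x, y)
    slopes = [ols_slope(x, np.array(p)) for p in itertools.permutations(y)]
    extreme = sum(abs(s) >= abs(observed) - 1e-12 for s in slopes)
    return observed, extreme / len(slopes)

# Toy series with an invented, perfectly linear trend
toy_years = [2015, 2016, 2017, 2018, 2019, 2020]
toy_means = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
slope, p_value = exact_permutation_pvalue(toy_years, toy_means)
print(slope, p_value)
```

For the notebook's data, the call would be `exact_permutation_pvalue(result.index.astype(int), result["Mean"])`. A large p-value would support the reading that the upward trend cannot be distinguished from noise.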

Conclusion¶

To conclude, we have identified which countries are the most and least happy, and looked at which features are most correlated with the score. We have also examined the spread of the score across regions and discussed the trends in the 2015 to 2020 data.

In the future I hope to add further data to this data set and try to predict happiness scores from it. In particular, I am interested in the effect education levels and types have on a country's happiness and how they correlate with the features in the data here.
