%matplotlib inline
import pandas as pd
import numpy as np
import re
from requests import get
import json
import os
from collections import defaultdict
from sklearn.preprocessing import MinMaxScaler
#Plotting
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML, IFrame
import seaborn as sns
import plotly.express as px
import folium # conda install -c conda-forge folium
from folium import plugins
import geopandas as gpd
# Our helpers
from neural import prepare_future, predict_future # Wrapper to use RNNs
from helpfunc import *
from plots import *
# For retrieving prices of food
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import picos as pic # Library used to implement the convex optimization problem
import plotly.graph_objects as go
import plotly.offline as py
%load_ext autoreload
%autoreload 2
path_dict = {'food_balance_africa': 'data/raw/FoodBalanceSheets_E_Africa_1.csv',
'geoworld_json': 'data/raw/world-countries.json',
'africa_supply_rnn': 'data/processed/africa_cal.pkl',
'ages_calories_demand': 'data/raw/calories_demand.xlsx',
'african_countries_list': "data/raw/african_countries.txt",
'population_age_male': "data/raw/POPULATION_BY_AGE_MALE.xlsx",
'population_age_female': "data/raw/POPULATION_BY_AGE_FEMALE.xlsx",
'food_balance_europe': "data/raw/FoodBalanceSheets_E_Europe_1.csv",
'europe_supply_rnn': 'data/processed/europe_cal.pkl',
'european_countries_list': "data/raw/european_countries.txt",
'african_supply_map': 'visualization/africa_supply_map',
'african_demand_anim': 'visualization/african_cal_diff_animation.html',
'african_estimation_kcal': 'visualization/africa_est_kcal',
'african_kcal_need': "visualization/african_kcal_need",
"european_supply_map":"visualization/european_supply_map",
"african_pop_growth":"visualization/african_pop_growth.html",
'european_pop_growth': "visualization/european_pop_growth.html",
'european_estimation_kcal': 'visualization/europe_est_kcal',
'european_demand_anim': 'visualization/european_cal_diff_animation.html',
'european_kcal_surplus': "visualization/european_kcal_surplus",
'world_kcal_surplus': "visualization/world/world_kcal_surplus",
}
plt.rcParams["figure.figsize"] = (15,8) #set size of plot
To answer this important question, we will need to import data from the FAO Dataset. More specifically, we will focus on the section Food Balance Sheet with respect to African countries only.
food_balance_africa = pd.read_csv(path_dict['food_balance_africa'],encoding='latin-1');
Firstly, we will remove all the columns titled "Y----F", as they only record how each value was obtained (Calculated, Regression, Aggregate, FAO Estimation). Since FAO is a highly renowned agency, we assume these values are trustworthy without loss of generality. Furthermore, it is handier to have plain numbers as year columns instead of "Y----", so we proceed by removing the leading Y. The helper function clean_Fs_and_years
does this cleaning on the dataframe.
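The actual helper lives in helpfunc.py (imported above); as a minimal sketch, assuming the flag columns start with Y and end with F while year columns are Y followed by digits, it could look like this:

```python
import pandas as pd

def clean_Fs_and_years_sketch(df):
    # Drop the "Y----F" flag columns that only record how a value was obtained
    flag_cols = [c for c in df.columns if c.startswith('Y') and c.endswith('F')]
    df = df.drop(columns=flag_cols)
    # Strip the leading "Y" from the remaining year columns
    return df.rename(columns={c: c[1:] for c in df.columns
                              if c.startswith('Y') and c[1:].isdigit()})

demo = pd.DataFrame({'Area': ['Algeria'], 'Y1961': [1619.0], 'Y1961F': ['S']})
print(clean_Fs_and_years_sketch(demo).columns.tolist())  # ['Area', '1961']
```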
food_balance_africa = clean_Fs_and_years(food_balance_africa)
Secondly, we replace all the NaN values with 0, since a NaN simply means the item was not available.
food_balance_africa = food_balance_africa.fillna(0);
The third step to complete the cleaning of food_balance_africa consists of adapting country names in order to have consistency across our different dataframes. Since some countries changed their name over the years, we rename them: in particular, Swaziland to Eswatini and South Africa to Southern Africa. The function replace_names_of_countries
takes the dataframe and the pairs of country names to replace.
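A hedged sketch of what such a helper might do (the real implementation is in helpfunc.py):

```python
import pandas as pd

def replace_names_sketch(df, replacements):
    # Map each (old, new) country name pair onto the Area column
    return df.replace({'Area': dict(replacements)})

demo = pd.DataFrame({'Area': ['Swaziland', 'South Africa', 'Benin']})
out = replace_names_sketch(demo, [('Swaziland', 'Eswatini'),
                                  ('South Africa', 'Southern Africa')])
print(out['Area'].tolist())  # ['Eswatini', 'Southern Africa', 'Benin']
```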
food_balance_africa = replace_names_of_countries(food_balance_africa, [('Swaziland', 'Eswatini'), ('South Africa', 'Southern Africa')])
Our Dataframe looks like this:
food_balance_africa.head()
Area Code | Area | Item Code | Item | Element Code | Element | Unit | 1961 | 1962 | 1963 | ... | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | Algeria | 2501 | Population | 511 | Total Population - Both sexes | 1000 persons | 11561.0 | 11845.0 | 12136.0 | ... | 33461.0 | 33961.0 | 34507.0 | 35097.0 | 35725.0 | 36383.0 | 37063.0 | 37763.0 | 38482.0 | 39208.0 |
1 | 4 | Algeria | 2511 | Wheat and products | 5511 | Production | 1000 tonnes | 686.0 | 1507.0 | 1590.0 | ... | 2731.0 | 2415.0 | 2688.0 | 2319.0 | 1111.0 | 2953.0 | 2605.0 | 2555.0 | 3432.0 | 3299.0 |
2 | 4 | Algeria | 2511 | Wheat and products | 5611 | Import Quantity | 1000 tonnes | 469.0 | 501.0 | 374.0 | ... | 5123.0 | 5697.0 | 4987.0 | 4885.0 | 6508.0 | 5757.0 | 5109.0 | 7487.0 | 6385.0 | 6343.0 |
3 | 4 | Algeria | 2511 | Wheat and products | 5072 | Stock Variation | 1000 tonnes | 353.0 | -409.0 | -408.0 | ... | -456.0 | -606.0 | -459.0 | -1.0 | 156.0 | -484.0 | 515.0 | -1050.0 | -350.0 | -180.0 |
4 | 4 | Algeria | 2511 | Wheat and products | 5911 | Export Quantity | 1000 tonnes | 46.0 | 12.0 | 33.0 | ... | 7.0 | 13.0 | 7.0 | 14.0 | 24.0 | 37.0 | 39.0 | 5.0 | 5.0 | 2.0 |
5 rows × 60 columns
Analysing our dataframe food_balance_africa, we can see that it is already well structured, as it contains several key-value pairs such as Item Code - Item and Element Code - Element. We will take advantage of this structure to keep only the rows with Grand total as Item and Food supply (kcal/capita/day) as Element; the corresponding keys are (Item Code, 2901) and (Element Code, 664). The documentation on the FAO website explains the legend for Element Code and Item Code extensively.
In order to keep our original Dataframe food_balance_africa as a reference we create a new Dataframe food_supply_africa in which we just keep countries and food supplies for every year.
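The filtering described above (keeping only (Item Code, 2901) and (Element Code, 664)) might be sketched as follows; obtain_supply in helpfunc.py is the actual implementation:

```python
import pandas as pd

def obtain_supply_sketch(df):
    # Keep only the Grand Total / Food supply (kcal/capita/day) rows,
    # and only the Area column plus the year columns
    mask = (df['Item Code'] == 2901) & (df['Element Code'] == 664)
    year_cols = [c for c in df.columns if c.isdigit()]
    return df.loc[mask, ['Area'] + year_cols]

demo = pd.DataFrame({'Area': ['Algeria', 'Algeria'],
                     'Item Code': [2901, 2511],
                     'Element Code': [664, 5511],
                     '1961': [1619.0, 686.0]})
print(obtain_supply_sketch(demo)['1961'].tolist())  # [1619.0]
```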
food_supply_africa = obtain_supply(food_balance_africa)
We can now group by Area and see the food supply available in each country for every year.
food_supply_africa = food_supply_africa.set_index("Area")
food_supply_africa.head()
1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | ... | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Area | |||||||||||||||||||||
Algeria | 1619.0 | 1569.0 | 1528.0 | 1540.0 | 1591.0 | 1571.0 | 1647.0 | 1706.0 | 1705.0 | 1675.0 | ... | 2987.0 | 2958.0 | 3047.0 | 3041.0 | 3048.0 | 3110.0 | 3142.0 | 3217.0 | 3272.0 | 3296.0 |
Angola | 1798.0 | 1819.0 | 1853.0 | 1862.0 | 1877.0 | 1890.0 | 1921.0 | 1856.0 | 1946.0 | 1965.0 | ... | 2030.0 | 2077.0 | 2119.0 | 2173.0 | 2245.0 | 2303.0 | 2345.0 | 2407.0 | 2384.0 | 2473.0 |
Benin | 1736.0 | 1758.0 | 1703.0 | 1669.0 | 1812.0 | 1804.0 | 1833.0 | 1935.0 | 1871.0 | 1812.0 | ... | 2461.0 | 2435.0 | 2450.0 | 2564.0 | 2521.0 | 2565.0 | 2555.0 | 2598.0 | 2610.0 | 2619.0 |
Botswana | 1976.0 | 1909.0 | 1972.0 | 2001.0 | 2005.0 | 1947.0 | 2010.0 | 2052.0 | 2013.0 | 2035.0 | ... | 2191.0 | 2198.0 | 2150.0 | 2166.0 | 2184.0 | 2199.0 | 2234.0 | 2273.0 | 2342.0 | 2326.0 |
Burkina Faso | 1326.0 | 1308.0 | 1452.0 | 1555.0 | 1571.0 | 1560.0 | 1587.0 | 1602.0 | 1604.0 | 1546.0 | ... | 2505.0 | 2463.0 | 2559.0 | 2546.0 | 2588.0 | 2639.0 | 2681.0 | 2664.0 | 2707.0 | 2720.0 |
5 rows × 53 columns
In order to check for anomalies in our data, we would like to analyze the timeline. We therefore transpose the dataframe and plot how food supply evolved in each country. The legend is suppressed as it would be too large.
food_supply_africa = food_supply_africa.transpose();
#converting the year from string to int
food_supply_africa.index = food_supply_africa.index.astype(int)
timeline_supply(food_supply_africa, "African")
This analysis shows that there are two inconsistencies. We therefore check for countries containing values equal to zero.
food_supply_africa.columns.values[(food_supply_africa == 0).any()]
array(['Ethiopia', 'Ethiopia PDR', 'Sudan', 'Sudan (former)'], dtype=object)
We notice that Sudan and Ethiopia each appear twice, as "Sudan"/"Sudan (former)" and "Ethiopia"/"Ethiopia PDR" respectively. This is due to South Sudan gaining independence in 2011 (reference: https://en.wikipedia.org/wiki/South_Sudan) and the foundation of the Federal Democratic Republic of Ethiopia in 1991 (reference: https://en.wikipedia.org/wiki/Ethiopia); from then on, Ethiopia PDR was listed as Ethiopia. Since the food supply stays essentially constant even across the split, we assume the newly introduced "Sudan" accounts for both countries. For this reason, we will treat each pair as a single country.
Consequently, the two countries' data is merged into one continuous set each. The function merge_countries
takes care of this, by substituting each key in dictionary (the second argument) with its value(s).
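A minimal sketch of such a merge, assuming the old and new series never overlap in non-zero years so they can simply be added (the real merge_countries is in helpfunc.py):

```python
import pandas as pd

def merge_countries_sketch(df, mapping):
    # Fold each obsolete country column into its successor(s), then drop it
    for old, successors in mapping.items():
        for new in successors:
            df[new] = df[new] + df[old]
        df = df.drop(columns=old)
    return df

demo = pd.DataFrame({'Sudan (former)': [2000.0, 0.0],
                     'Sudan': [0.0, 2100.0]}, index=[2010, 2012])
out = merge_countries_sketch(demo, {'Sudan (former)': ['Sudan']})
print(out['Sudan'].tolist())  # [2000.0, 2100.0]
```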
food_supply_africa = merge_countries(food_supply_africa, {'Sudan (former)': ['Sudan'], 'Ethiopia PDR': ['Ethiopia']})
Let's plot the newly generated data:
timeline_supply(food_supply_africa, "African", " - Cleaned dataset")
Next, we add columns for the future years up to 2020, preparing the dataframe for the extrapolation we will use to predict possible scenarios. prepare_future
is used for this.
# Adding columns for the new years
food_supply_africa = prepare_future(food_supply_africa, 2014, 2020)
First of all, we want to simulate data until 2020 to match the population data. Furthermore, we also want to be able to make predictions for individual countries, to assess whether they might run into food shortages in the near future.
The predictions for the new years are produced by a Recurrent Neural Network (RNN) with a window of size 10: we feed the network each country's past history, split into blocks of 10 years each, and let it extrapolate the behaviour up to 2020. In our tests, the networks produced good estimates.
As we don't need precise figures, the estimates obtained through ML are in this case more than acceptable for our purpose.
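The windowing idea can be illustrated with a small sketch (the actual network lives in neural.py; here we only show how size-10 training pairs would be built from a series, not the RNN itself):

```python
import numpy as np

def make_windows(series, size=10):
    # Each training sample is a window of `size` consecutive past values;
    # the target is the value that immediately follows the window
    X = np.array([series[i:i + size] for i in range(len(series) - size)])
    y = np.array(series[size:])
    return X, y

X, y = make_windows(list(range(15)), size=10)
print(X.shape, y.shape)  # (5, 10) (5,)
```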
Credits: We don't know much about RNN, so the network used here is adapted from the Time series forecasting tutorial on Tensorflow, available here
Note: we already ran the networks on Colab and saved the results, since each full run takes more than an hour. For this reason, here we just load the pickle instead of running the network; this is handled by the function predict_future.
food_supply_africa = predict_future(food_supply_africa, path_dict['africa_supply_rnn'])
Plotting the results:
timeline_supply(food_supply_africa, "African", "- Predicted dataset")
Looking at the predicted period (after 2013), the forecasts are roughly linear, with some countries forecast to oscillate. This is because the neural network recognised recurring patterns in the data and extrapolates that behaviour. Among all the methods we tested, this one is expected to produce the most realistic data, as no odd developments can be observed.
#Geographic coordinates for visualizing
geojson_world = gpd.read_file(path_dict['geoworld_json'])
type(geojson_world)
geopandas.geodataframe.GeoDataFrame
We observe geojson_world
is a GeoDataFrame. Let's see how the data are organized:
geojson_world.head()
id | name | geometry | |
---|---|---|---|
0 | AFG | Afghanistan | POLYGON ((61.21082 35.65007, 62.23065 35.27066... |
1 | AGO | Angola | MULTIPOLYGON (((16.32653 -5.87747, 16.57318 -6... |
2 | ALB | Albania | POLYGON ((20.59025 41.85540, 20.46317 41.51509... |
3 | ARE | United Arab Emirates | POLYGON ((51.57952 24.24550, 51.75744 24.29407... |
4 | ARG | Argentina | MULTIPOLYGON (((-65.50000 -55.20000, -66.45000... |
We have to make sure that geojson_world
has data for every country under our analysis. In this case, we need to check for every country in Africa and more specifically every country taken into account in food_supply_africa
. Since the name in geojson_world
is not a good index to match and we don't have any comparable id in food_supply_africa
we have to do this job by hand, which is feasible since the size is small enough. As a result, we filter geojson_world
to geojson_africa
. Let's display this new GeoDataFrame.
african_country_codes = ["DZA","AGO","BEN","BWA","BFA","CMR","CAF","TCD","COD","CIV",
"DJI","EGY","SWZ","ETH","GAB","GMB","GHA","GNQ","GNB","KEN","LSO","LBR",
"MDG","MWI","MLI","MRT","MAR","MOZ","NAM","NER","NGA","RWA"
,"SEN","SLE","ZAF","SDN","TGO","TUN","UGA","TZA","ZMB","ZWE"]
african_country_names = ['Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso',
'Cameroon', 'Central African Republic', 'Chad', 'Congo',
"Côte d'Ivoire", 'Djibouti', 'Egypt', 'Eswatini','Ethiopia', 'Gabon',
'Gambia', 'Ghana', 'Guinea', 'Guinea-Bissau', 'Kenya', 'Lesotho',
'Liberia', 'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Morocco',
'Mozambique', 'Namibia', 'Niger', 'Nigeria', 'Rwanda', 'Senegal',
'Sierra Leone', 'Southern Africa', 'Sudan', 'Togo',
'Tunisia', 'Uganda', 'United Republic of Tanzania', 'Zambia',
'Zimbabwe']
african_country_kv = pd.DataFrame({'codes': african_country_codes,
'names': african_country_names
})
geojson_africa = geojson_world[geojson_world.id.isin(african_country_codes)]
geojson_africa.head()
id | name | geometry | |
---|---|---|---|
1 | AGO | Angola | MULTIPOLYGON (((16.32653 -5.87747, 16.57318 -6... |
13 | BEN | Benin | POLYGON ((2.69170 6.25882, 1.86524 6.14216, 1.... |
14 | BFA | Burkina Faso | POLYGON ((-2.82750 9.64246, -3.51190 9.90033, ... |
25 | BWA | Botswana | POLYGON ((25.64916 -18.53603, 25.85039 -18.714... |
26 | CAF | Central African Republic | POLYGON ((15.27946 7.42192, 16.10623 7.49709, ... |
The ordered list african_country_codes contains all the countries for which geometry is available and which can therefore be plotted. We found manually that Cabo Verde, Sao Tome and Principe and Mauritius are not in this list. For the sake of simplicity, we will remove these three countries, as they don't affect our analysis.
food_supply_africa = food_supply_africa.drop(columns=["Cabo Verde","Mauritius","Sao Tome and Principe"])
Now we can plot the food supply for each country. All of this is done in the plot_map
function, which draws a world map centered on the zone of interest, with interactive values shown when hovering over individual countries.
As we are particularly interested in the situation in 2020, we plot our supply prediction for this year in the notebook map. It is also interesting to look at how the situation evolved over the last 50 years, so we plot a map for each decade from 1970 to 2020.
Note: the map visualized here is just for 2020 and may not load. If this happens, click here
The maps for the previous decades are available here:
# Used to generate, don't run it
legend_name = "Food supply (kcal/person/day)"
for year in range(1970, 2030, 10):
africa_supply_map = plot_map(food_supply_africa.T, path_dict['geoworld_json'], \
african_country_kv, year, "Blues", legend_name, legend_name, path_dict['african_supply_map'] + str(year) + ".html")
africa_supply_map
# Printing 2020 map
IFrame(src='https://manuleo.github.io/mADAm_files/africa_supply_map2020.html', width = 800, height=600)
save_map_data(geojson_africa, african_country_kv, food_supply_africa, "docs/json/africa_supply/africa_supply_{}.geojson", "docs/json/africa_supply/africa_supply_ticks.json")
By analyzing the trend over the decades, we see that the situation in the African region partially improved over the years, especially when comparing 2020 with 1970.
Some countries, especially in North Africa, improved their supply by around 1000 kcal. In contrast, the situation in the central-southern region is almost the same as 50 years ago. Note also that the values fluctuate somewhat over time.
In this first part, we compute the kcal demand for males and females in every age group. Secondly, we will conduct an extensive analysis of African demographics. Finally, we will combine the kcal demand with African population data into a single dataframe that answers our initial question: What is the kcal demand of a regular person in order to be healthy?
First of all, we load the calories demand datasets scraped from the webpage Calories. This information will be matched with the population datasets to obtain the total caloric demand in each country, each year.
male_calory_demand = pd.read_excel(path_dict['ages_calories_demand'], header = None, sheet_name = 0, names = ['age', 'sedentary', 'moderate', 'active'])
female_calory_demand = pd.read_excel(path_dict['ages_calories_demand'], header = None, sheet_name = 1, names = ['age', 'sedentary', 'moderate', 'active'])
In order to work more easily with the information we have collected, we will simplify the data: we keep only the caloric demand for an active lifestyle (dropping the sedentary and moderate columns) and rename the remaining column to input kcal.
male_calories = male_calory_demand.drop(columns=['sedentary', 'moderate'])
female_calories = female_calory_demand.drop(columns=['sedentary', 'moderate'])
male_calories.rename(columns={'active':'input kcal'}, inplace=True)
female_calories.rename(columns={'active':'input kcal'}, inplace=True)
We have now obtained a single caloric demand value per age, stored in the two previous dataframes, which will simplify later calculations.
Now, we need a way to match the age groups in this dataframe to the ones in the population database we obtained. As such, let's analyse how ages are represented in our caloric demand dataframes.
male_calories['age'].unique()
array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, '19-20', '21-25', '26-30', '31-35', '36-40', '41-45', '46-50', '51-55', '56-60', '61-65', '66-70', '71-75', '76 and up', nan], dtype=object)
We can see that there are ranges of ages with different sizes (which makes sense, because different age groups have different caloric needs). The function explode_age
returns the dataframe with one row per individual age.
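A sketch of what explode_age might do; we assume here that an open-ended group like '76 and up' is capped at 101, the oldest age in the output shown below (the real helper is in helpfunc.py):

```python
import pandas as pd

def explode_age_sketch(df):
    rows = []
    for _, r in df.dropna(subset=['age']).iterrows():
        age = r['age']
        if isinstance(age, str) and '-' in age:      # a range like '19-20'
            lo, hi = map(int, age.split('-'))
            ages = range(lo, hi + 1)
        elif isinstance(age, str):                   # '76 and up' (assumed cap)
            ages = range(76, 102)
        else:                                        # a single integer age
            ages = [int(age)]
        rows += [{'age': a, 'input kcal': r['input kcal']} for a in ages]
    return pd.DataFrame(rows)

demo = pd.DataFrame({'age': [18, '19-20'], 'input kcal': [3200, 3000]})
print(explode_age_sketch(demo)['age'].tolist())  # [18, 19, 20]
```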
We apply the function to our two dataframes:
male_calories = explode_age(male_calories)
female_calories = explode_age(female_calories)
male_calories['age'].unique()
array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101])
Ages are now unique in each dataframe ( male_calories
and female_calories
) and there's a caloric input value for each of them.
The last step to allow the match with the population database is to build the same age groups we have in that set. The compress_ages
function takes care of the differences between datasets by grouping the ages into the same ranges as in the population dataset (and calculating the average needs).
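Assuming the population groups are plain 5-year bins (0-4, 5-9, ...; the real dataset's final "100+" bin would need an extra case), the regrouping can be sketched as:

```python
import pandas as pd

def compress_ages_sketch(df):
    # Label every age with its 5-year bin, then average the need inside each bin
    labels = (df['age'] // 5).map(lambda b: f"{5 * b}-{5 * b + 4}")
    return df.groupby(labels)['input kcal'].mean()

demo = pd.DataFrame({'age': [2, 3, 4, 5, 6],
                     'input kcal': [1000, 1400, 1600, 1800, 1880]})
out = compress_ages_sketch(demo)
print(round(out['0-4'], 2))  # 1333.33
```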
We can lastly apply the function to the dataframes:
male_calories = compress_ages(male_calories)
female_calories = compress_ages(female_calories)
We also use the age group as new index and rename the columns:
male_calories.index.name = 'age_group'
male_calories = male_calories.rename(columns={0: 'input kcal'})
female_calories.index.name = 'age_group'
female_calories = female_calories.rename(columns={0: 'input kcal'})
Let's have a look at the result we have achieved and collected in our matchable dataframe male_calories
and female_calories
. The unit here is kcal/person/day.
male_calories.head()
input kcal | |
---|---|
age_group | |
0-4 | 1333.333333 |
5-9 | 1840.000000 |
10-14 | 2440.000000 |
15-19 | 3120.000000 |
20-24 | 3000.000000 |
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].barh(male_calories.index, male_calories["input kcal"])
axes[1].barh(female_calories.index, female_calories["input kcal"],color="#ffc0cb")
axes[0].set_xlabel("Amount of kcal needed for group age in males [kcal/persona/day]")
fig.suptitle("Estimation of energy needed to live an active lifestyle for males and females divided in group ages")
axes[1].set_xlabel("Amount of kcal needed for group age in females [kcal/persona/day]")
axes[1].set_xlim(0, 3300)
axes[0].set_xlim(0, 3300)
axes[0].set_ylabel("age groups")
axes[1].set_ylabel("age groups");
As we can see, in general males need more calories than females. For both genders, the maximum demand occurs between the ages of 15 and 30.
In our data story, we present a pyramid plot to show the same results. The code used is presented below.
fig = go.Figure()
fig.add_trace(
go.Bar(
name="Male",
orientation="h",
marker_color="#AED6FF",
y=male_calories.index,
x=male_calories['input kcal'],
hovertemplate="Caloric need: %{x:.2f} kcal/day<br>Age group: %{y}"))
fig.add_trace(
go.Bar(
name="Female",
orientation="h",
marker_color="#FFDCFD",
y=female_calories.index,
x=-female_calories['input kcal'],
text=female_calories['input kcal'],
hovertemplate="Caloric need: %{text:.2f} kcal/day<br>Age group: %{y}"))
fig.update_layout(
barmode='overlay',
xaxis=dict(autorange=False, range=[-3500, 3500], ticks="",
tickvals=[-3000,-2000,-1000,0,1000,2000,3000],
ticktext=[3000,2000,1000,0,1000,2000,3000],
title="Caloric need [kcal/day]",
gridcolor="#eee"),
yaxis=dict(title="Age group"),
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)',
title="Caloric need per age and gender"
)
py.plot(fig, filename="docs/_includes/gender.html", include_plotlyjs=False)
fig.show()
In this second part of our analysis, we load the list of African countries. Secondly, we load the World Population Database (United Nations), obtaining two dataframes: one for males and one for females.
with open (path_dict['african_countries_list'],'r') as af_c:
af_countries = [line.rstrip() for line in af_c] #loading list
We need to check if the FAO Database contains data regarding every country in Africa. We will check the intersection with the list af_countries.
af_to_remove = list(set(af_countries) - set(food_supply_africa.columns.values))
print("List of countries for which no data is available: " + str(af_to_remove))
List of countries for which no data is available: ['Mauritius', 'Mayotte', 'South Sudan', 'Réunion', 'Libya', 'Sao Tome and Principe', 'Eritrea', 'Burundi', 'Comoros', 'Equatorial Guinea', 'Cabo Verde', 'Somalia', 'Seychelles', 'Democratic Republic of the Congo', 'Western Sahara']
As expected, many countries were not present in the FAO Database. The countries to remove are the following: Eritrea, Burundi, Comoros, Democratic Republic of the Congo, Equatorial Guinea, Libya, Seychelles, Western Sahara, South Sudan, and Somalia. Furthermore, Mayotte and Réunion are French territory islands so they will be removed as well.
af_countries = [i for i in af_countries if not i in af_to_remove]
af_to_remove = list(set(af_countries)- set(food_supply_africa.columns.values))
print("List of countries for which no data is available: "+ str(af_to_remove))
List of countries for which no data is available: []
Now we can proceed to load the population data and clean it
#loading datasets
pop_male = pd.read_excel(path_dict['population_age_male'], sheet_name="ESTIMATES")
pop_female = pd.read_excel(path_dict['population_age_female'], sheet_name="ESTIMATES")
The function clean_pop_df
takes care of all the required cleaning: it removes the unnecessary columns, keeps only the rows for the needed countries, and multiplies each value by 1000 (the population data is given per 1000 people, so this restores the real headcount).
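As a sketch, assuming a long format with a country column and per-1000 counts (the column names here are illustrative, not the UN file's real headers):

```python
import pandas as pd

def clean_pop_df_sketch(df, countries):
    # Keep only the required countries...
    df = df[df['country'].isin(countries)].copy()
    # ...and scale the per-1000 counts back to real headcounts
    age_cols = [c for c in df.columns if c not in ('country', 'year')]
    df[age_cols] = df[age_cols] * 1000
    return df

demo = pd.DataFrame({'country': ['Djibouti', 'France'],
                     'year': [1950, 1950],
                     '0-4': [6.0, 3000.0]})
print(clean_pop_df_sketch(demo, ['Djibouti'])['0-4'].tolist())  # [6000.0]
```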
pop_male_africa = clean_pop_df(pop_male, af_countries)
pop_female_africa = clean_pop_df(pop_female, af_countries)
The World Population Database (United Nations) is now loaded and cleaned. This preprocessing is necessary groundwork for the more complex steps that follow.
Let's have a look at the final version of male population data grouped by age:
pop_male_africa.head()
country | year | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | ... | 55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | 100+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
420 | Djibouti | 1950 | 6128 | 4593 | 3828 | 3247 | 2727 | 2266 | 1873 | 1538 | ... | 589 | 422 | 277 | 160 | 74 | 24 | 7 | 0 | 0 | 0 |
421 | Djibouti | 1955 | 5556 | 5527 | 4487 | 3773 | 3168 | 2639 | 2186 | 1800 | ... | 696 | 506 | 337 | 198 | 94 | 33 | 7 | 0 | 0 | 0 |
422 | Djibouti | 1960 | 7201 | 5360 | 5668 | 4640 | 3867 | 3218 | 2672 | 2203 | ... | 868 | 631 | 428 | 256 | 126 | 46 | 11 | 0 | 0 | 0 |
423 | Djibouti | 1965 | 10888 | 7814 | 6324 | 6545 | 5316 | 4389 | 3642 | 3006 | ... | 1191 | 872 | 591 | 358 | 179 | 71 | 16 | 2 | 0 | 0 |
424 | Djibouti | 1970 | 15553 | 11828 | 9094 | 7514 | 7519 | 6072 | 4994 | 4120 | ... | 1646 | 1208 | 826 | 503 | 259 | 102 | 24 | 3 | 0 | 0 |
5 rows × 23 columns
In this context, the population dataframes for males (pop_male) and females (pop_female) contain census measurements for the years 1950 to 2020, at a 5-year frequency. Next, the data is interpolated to obtain values for the intermediate years.
For a linear interpolation to be justified, we need to check whether the population evolves roughly linearly within these 5-year intervals. To do so, we visualize the growth of the population for a small group of countries (plotting all of them would be confusing): we select 3 countries at random and plot a simple animation of their growth over time.
Note: the code below is used to generate the HTML animation. If it doesn't load click on this link
IFrame(src='https://manuleo.github.io/mADAm_files/african_pop_growth.html', width = 1200, height=600)
# # Selecting countries
# countryrand = []
# n_countries = 3
# for i in range(0, n_countries):
# countryrand.append(random.choice(pop_male_africa.country.drop_duplicates().values))
# fig = plt.figure()
# animator = animation.FuncAnimation(fig, timeline_country_gender, frames=(range(1950, 2025, 5)),\
# fargs = (pop_male_africa, pop_female_africa, "30-34", countryrand), repeat=False)
# #HTML(animator.to_jshtml())
# with open(path_dict['african_pop_growth'], "w") as f:
# print(animator.to_html5_video(), file=f)
The animation shows an almost linear growth for the 3 countries considered (each on its own scale), so we can proceed with the interpolation.
Now we can apply our function interpolate_years
, a simple linear interpolation, in order to obtain a frequency of 1 year.
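A hedged sketch of such a per-country linear interpolation (the real interpolate_years is in helpfunc.py and may differ in details):

```python
import pandas as pd

def interpolate_years_sketch(df, start, end):
    out = []
    for country, grp in df.groupby('country'):
        # Reindex to every year in [start, end], then fill gaps linearly
        grp = (grp.drop(columns='country').set_index('year')
                  .reindex(range(start, end + 1)).interpolate(method='index'))
        grp.insert(0, 'country', country)
        out.append(grp.reset_index())
    return pd.concat(out, ignore_index=True)

demo = pd.DataFrame({'country': ['Djibouti', 'Djibouti'],
                     'year': [1950, 1955],
                     '0-4': [6128.0, 5556.0]})
out = interpolate_years_sketch(demo, 1950, 1955)
print(out.loc[out.year == 1951, '0-4'].item())  # ≈ 6013.6
```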
pop_male_africa = interpolate_years(pop_male_africa, 1950, 2020)
pop_female_africa = interpolate_years(pop_female_africa, 1950, 2020)
Let's see how the new dataframes for males and females population look like:
pop_male_africa.head()
country | year | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | ... | 55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | 100+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Djibouti | 1950 | 6128.0 | 4593.0 | 3828.0 | 3247.0 | 2727.0 | 2266.0 | 1873.0 | 1538.0 | ... | 589.0 | 422.0 | 277.0 | 160.0 | 74.0 | 24.0 | 7.0 | 0.0 | 0.0 | 0.0 |
1 | Djibouti | 1951 | 6013.6 | 4779.8 | 3959.8 | 3352.2 | 2815.2 | 2340.6 | 1935.6 | 1590.4 | ... | 610.4 | 438.8 | 289.0 | 167.6 | 78.0 | 25.8 | 7.0 | 0.0 | 0.0 | 0.0 |
2 | Djibouti | 1952 | 5899.2 | 4966.6 | 4091.6 | 3457.4 | 2903.4 | 2415.2 | 1998.2 | 1642.8 | ... | 631.8 | 455.6 | 301.0 | 175.2 | 82.0 | 27.6 | 7.0 | 0.0 | 0.0 | 0.0 |
3 | Djibouti | 1953 | 5784.8 | 5153.4 | 4223.4 | 3562.6 | 2991.6 | 2489.8 | 2060.8 | 1695.2 | ... | 653.2 | 472.4 | 313.0 | 182.8 | 86.0 | 29.4 | 7.0 | 0.0 | 0.0 | 0.0 |
4 | Djibouti | 1954 | 5670.4 | 5340.2 | 4355.2 | 3667.8 | 3079.8 | 2564.4 | 2123.4 | 1747.6 | ... | 674.6 | 489.2 | 325.0 | 190.4 | 90.0 | 31.2 | 7.0 | 0.0 | 0.0 | 0.0 |
5 rows × 23 columns
pop_female_africa.head()
country | year | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | ... | 55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | 100+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Djibouti | 1950 | 6065.0 | 4583.0 | 3819.0 | 3238.0 | 2736.0 | 2295.0 | 1911.0 | 1574.0 | ... | 655.0 | 487.0 | 334.0 | 202.0 | 100.0 | 34.0 | 11.0 | 2.0 | 0.0 | 0.0 |
1 | Djibouti | 1951 | 5948.6 | 4764.2 | 3949.6 | 3342.8 | 2824.4 | 2369.8 | 1973.8 | 1627.2 | ... | 679.0 | 505.8 | 347.6 | 211.2 | 105.4 | 37.0 | 11.0 | 2.0 | 0.0 | 0.0 |
2 | Djibouti | 1952 | 5832.2 | 4945.4 | 4080.2 | 3447.6 | 2912.8 | 2444.6 | 2036.6 | 1680.4 | ... | 703.0 | 524.6 | 361.2 | 220.4 | 110.8 | 40.0 | 11.0 | 2.0 | 0.0 | 0.0 |
3 | Djibouti | 1953 | 5715.8 | 5126.6 | 4210.8 | 3552.4 | 3001.2 | 2519.4 | 2099.4 | 1733.6 | ... | 727.0 | 543.4 | 374.8 | 229.6 | 116.2 | 43.0 | 11.0 | 2.0 | 0.0 | 0.0 |
4 | Djibouti | 1954 | 5599.4 | 5307.8 | 4341.4 | 3657.2 | 3089.6 | 2594.2 | 2162.2 | 1786.8 | ... | 751.0 | 562.2 | 388.4 | 238.8 | 121.6 | 46.0 | 11.0 | 2.0 | 0.0 | 0.0 |
5 rows × 23 columns
Lastly, we compute the total population per year. This new dataframe pop_tot_africa will be useful in the next section of our analysis. obtain_total_pop
does just this, combining the populations in pop_male_africa and pop_female_africa.
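Assuming the male and female frames share the same (country, year) rows, a minimal sketch of this combination:

```python
import pandas as pd

def obtain_total_pop_sketch(males, females):
    # Sum every age-group column within each gender, then add the two genders
    age_cols = [c for c in males.columns if c not in ('country', 'year')]
    tot = males[['country', 'year']].copy()
    tot['population'] = males[age_cols].sum(axis=1) + females[age_cols].sum(axis=1)
    return tot

males = pd.DataFrame({'country': ['Djibouti'], 'year': [1950],
                      '0-4': [6128.0], '5-9': [4593.0]})
females = pd.DataFrame({'country': ['Djibouti'], 'year': [1950],
                        '0-4': [6065.0], '5-9': [4583.0]})
print(obtain_total_pop_sketch(males, females)['population'].item())  # 21369.0
```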
pop_tot_africa = obtain_total_pop(pop_male_africa, pop_female_africa)
For the next analysis we will need to match this data with food_balance_africa
. We give our population data the same shape as the other datasets using the function reshape_pop_dataframe
.
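The reshaping can be sketched as a simple pivot (the 'population' column name is an assumption for illustration):

```python
import pandas as pd

def reshape_pop_sketch(df):
    # Turn long (country, year, population) rows into a country x year table
    return df.pivot(index='country', columns='year', values='population')

demo = pd.DataFrame({'country': ['Algeria', 'Algeria', 'Benin', 'Benin'],
                     'year': [1950, 1951, 1950, 1951],
                     'population': [8872250.0, 9052656.0, 2255222.0, 2264896.0]})
wide = reshape_pop_sketch(demo)
print(wide.loc['Algeria', 1951])  # 9052656.0
```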
pop_tot_africa = reshape_pop_dataframe(pop_tot_africa)
pop_tot_africa.head()
1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | |||||||||||||||||||||
Algeria | 8872250.0 | 9052656.0 | 9233062.0 | 9413468.0 | 9593874.0 | 9774280.0 | 10030996.8 | 10287713.6 | 10544430.4 | 10801147.2 | ... | 36727564.8 | 37477678.6 | 38227792.4 | 38977906.2 | 39728020.0 | 40552624.6 | 41377229.2 | 42201833.8 | 43026438.4 | 43851043.0 |
Angola | 4548021.0 | 4647067.0 | 4746113.0 | 4845159.0 | 4944205.0 | 5043251.0 | 5125588.4 | 5207925.8 | 5290263.2 | 5372600.6 | ... | 24261873.6 | 25167500.2 | 26073126.8 | 26978753.4 | 27884380.0 | 28880757.6 | 29877135.2 | 30873512.8 | 31869890.4 | 32866268.0 |
Benin | 2255222.0 | 2264896.0 | 2274570.0 | 2284244.0 | 2293918.0 | 2303592.0 | 2329197.0 | 2354802.0 | 2380407.0 | 2406012.0 | ... | 9474595.6 | 9749937.2 | 10025278.8 | 10300620.4 | 10575962.0 | 10885409.2 | 11194856.4 | 11504303.6 | 11813750.8 | 12123198.0 |
Botswana | 412541.0 | 422703.4 | 432865.8 | 443028.2 | 453190.6 | 463353.0 | 471229.0 | 479105.0 | 486981.0 | 494857.0 | ... | 2013828.0 | 2040550.0 | 2067272.0 | 2093994.0 | 2120716.0 | 2166897.8 | 2213079.6 | 2259261.4 | 2305443.2 | 2351625.0 |
Burkina Faso | 4284455.0 | 4330995.0 | 4377535.0 | 4424075.0 | 4470615.0 | 4517155.0 | 4579581.8 | 4642008.6 | 4704435.4 | 4766862.2 | ... | 16106292.0 | 16607373.0 | 17108454.0 | 17609535.0 | 18110616.0 | 18669148.4 | 19227680.8 | 19786213.2 | 20344745.6 | 20903278.0 |
5 rows × 71 columns
Now we multiply each age-group column of the population data by the matching age_group
row in the calories table (which is squeezed into a Series to enable the column-wise multiplication). All of this is done in get_calories_need
. We obtain two datasets, male_cal_need_africa
and female_cal_need_africa
, reporting the total calories needed per age group and gender for each country in each year.
The unit here is kcal/day.
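A hedged sketch of this multiplication, using toy numbers (the real get_calories_need is in helpfunc.py):

```python
import pandas as pd

def get_calories_need_sketch(pop, calories):
    # Multiply each age-group population by that group's daily need (kcal/day)
    need = calories['input kcal']              # Series indexed by age_group
    out = pop.copy()
    out[need.index] = pop[need.index].mul(need, axis=1)
    return out

calories = pd.DataFrame({'input kcal': [1333.0, 1840.0]},
                        index=pd.Index(['0-4', '5-9'], name='age_group'))
pop = pd.DataFrame({'country': ['Djibouti'], 'year': [1950],
                    '0-4': [6128.0], '5-9': [4593.0]})
out = get_calories_need_sketch(pop, calories)
print(out['0-4'].item())  # 8168624.0
```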
male_cal_need_africa = get_calories_need(pop_male_africa, male_calories)
male_cal_need_africa.head()
country | year | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | ... | 55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | 100+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Djibouti | 1950 | 8.170667e+06 | 8451120.0 | 9340320.0 | 10130640.0 | 8181000.0 | 6798000.0 | 5619000.0 | 4367920.0 | ... | 1554960.0 | 1097200.0 | 720200.0 | 416000.0 | 180560.0 | 57600.0 | 16800.0 | 0.0 | 0.0 | 0.0 |
1 | Djibouti | 1951 | 8.018133e+06 | 8794832.0 | 9661912.0 | 10458864.0 | 8445600.0 | 7021800.0 | 5806800.0 | 4516736.0 | ... | 1611456.0 | 1140880.0 | 751400.0 | 435760.0 | 190320.0 | 61920.0 | 16800.0 | 0.0 | 0.0 | 0.0 |
2 | Djibouti | 1952 | 7.865600e+06 | 9138544.0 | 9983504.0 | 10787088.0 | 8710200.0 | 7245600.0 | 5994600.0 | 4665552.0 | ... | 1667952.0 | 1184560.0 | 782600.0 | 455520.0 | 200080.0 | 66240.0 | 16800.0 | 0.0 | 0.0 | 0.0 |
3 | Djibouti | 1953 | 7.713067e+06 | 9482256.0 | 10305096.0 | 11115312.0 | 8974800.0 | 7469400.0 | 6182400.0 | 4814368.0 | ... | 1724448.0 | 1228240.0 | 813800.0 | 475280.0 | 209840.0 | 70560.0 | 16800.0 | 0.0 | 0.0 | 0.0 |
4 | Djibouti | 1954 | 7.560533e+06 | 9825968.0 | 10626688.0 | 11443536.0 | 9239400.0 | 7693200.0 | 6370200.0 | 4963184.0 | ... | 1780944.0 | 1271920.0 | 845000.0 | 495040.0 | 219600.0 | 74880.0 | 16800.0 | 0.0 | 0.0 | 0.0 |
5 rows × 23 columns
#total calories female
female_cal_need_africa = get_calories_need(pop_female_africa, female_calories)
female_cal_need_africa.head()
country | year | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | ... | 55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | 100+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Djibouti | 1950 | 7.682333e+06 | 7882760.0 | 8249040.0 | 7771200.0 | 6566400.0 | 5508000.0 | 4280640.0 | 3462800.0 | ... | 1441000.0 | 993480.0 | 668000.0 | 404000.0 | 200000.0 | 68000.0 | 22000.0 | 4000.0 | 0.0 | 0.0 |
1 | Djibouti | 1951 | 7.534893e+06 | 8194424.0 | 8531136.0 | 8022720.0 | 6778560.0 | 5687520.0 | 4421312.0 | 3579840.0 | ... | 1493800.0 | 1031832.0 | 695200.0 | 422400.0 | 210800.0 | 74000.0 | 22000.0 | 4000.0 | 0.0 | 0.0 |
2 | Djibouti | 1952 | 7.387453e+06 | 8506088.0 | 8813232.0 | 8274240.0 | 6990720.0 | 5867040.0 | 4561984.0 | 3696880.0 | ... | 1546600.0 | 1070184.0 | 722400.0 | 440800.0 | 221600.0 | 80000.0 | 22000.0 | 4000.0 | 0.0 | 0.0 |
3 | Djibouti | 1953 | 7.240013e+06 | 8817752.0 | 9095328.0 | 8525760.0 | 7202880.0 | 6046560.0 | 4702656.0 | 3813920.0 | ... | 1599400.0 | 1108536.0 | 749600.0 | 459200.0 | 232400.0 | 86000.0 | 22000.0 | 4000.0 | 0.0 | 0.0 |
4 | Djibouti | 1954 | 7.092573e+06 | 9129416.0 | 9377424.0 | 8777280.0 | 7415040.0 | 6226080.0 | 4843328.0 | 3930960.0 | ... | 1652200.0 | 1146888.0 | 776800.0 | 477600.0 | 243200.0 | 92000.0 | 22000.0 | 4000.0 | 0.0 | 0.0 |
5 rows × 23 columns
Once we have the calories needed for both genders, we can aggregate the total caloric need of African countries into total_cal_need_africa. The function obtain_total_cal_need does just this, returning a dataframe with the calories needed by the whole population of a country in a year, in kcal/year.
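A minimal sketch of this aggregation (assumed implementation; the real helper is in helpfunc.py, and the ×365 factor is our reading of the stated kcal/day → kcal/year conversion):

```python
import pandas as pd

def obtain_total_cal_need_sketch(male, female):
    """Sum the per-age-group needs of both genders and convert the daily
    total (kcal/day) into a yearly one (kcal/year)."""
    age_cols = [c for c in male.columns if c not in ("country", "year")]
    total = male[["country", "year"]].copy()
    total["Calories"] = (male[age_cols].sum(axis=1)
                         + female[age_cols].sum(axis=1)) * 365
    return total

# Toy data: one country, one age group
male = pd.DataFrame({"country": ["X"], "year": [1950], "0-4": [100.0]})
female = pd.DataFrame({"country": ["X"], "year": [1950], "0-4": [80.0]})
print(obtain_total_cal_need_sketch(male, female))
```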
total_cal_need_africa = obtain_total_cal_need(male_cal_need_africa, female_cal_need_africa)
Let's take a look at the total calories dataframe total_cal_need_africa:
total_cal_need_africa.sort_values(by="Calories", ascending=False).head()
country | year | Calories | |
---|---|---|---|
2413 | Nigeria | 2020 | 1.679755e+14 |
2412 | Nigeria | 2019 | 1.637743e+14 |
2411 | Nigeria | 2018 | 1.595730e+14 |
2410 | Nigeria | 2017 | 1.553718e+14 |
2409 | Nigeria | 2016 | 1.511706e+14 |
For the sake of consistency, we now reshape total_cal_need_africa according to the same schema seen for food_supply_africa.
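The reshape amounts to a long-to-wide pivot. A hypothetical re-implementation of reshape_calories_df (the real one is in helpfunc.py) could look like this:

```python
import pandas as pd

def reshape_calories_sketch(df):
    """Pivot the long (country, year, Calories) frame into a
    Country x year matrix, matching food_supply_africa's layout."""
    wide = df.pivot(index="country", columns="year", values="Calories")
    wide.index.name = "Country"
    return wide

long_df = pd.DataFrame({"country": ["A", "A", "B", "B"],
                        "year": [1950, 1951, 1950, 1951],
                        "Calories": [1.0, 2.0, 3.0, 4.0]})
print(reshape_calories_sketch(long_df))
```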
total_cal_need_africa = reshape_calories_df(total_cal_need_africa)
Drawing a sample of the final shaped dataframe total calories total_cal_need_africa
:
total_cal_need_africa.head()
1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | |||||||||||||||||||||
Algeria | 7.290425e+12 | 7.412354e+12 | 7.534284e+12 | 7.656213e+12 | 7.778143e+12 | 7.900072e+12 | 8.092341e+12 | 8.284610e+12 | 8.476879e+12 | 8.669149e+12 | ... | 3.150748e+13 | 3.205546e+13 | 3.260345e+13 | 3.315144e+13 | 3.369942e+13 | 3.433776e+13 | 3.497610e+13 | 3.561444e+13 | 3.625278e+13 | 3.689112e+13 |
Angola | 3.721975e+12 | 3.797306e+12 | 3.872637e+12 | 3.947968e+12 | 4.023299e+12 | 4.098630e+12 | 4.159526e+12 | 4.220422e+12 | 4.281318e+12 | 4.342214e+12 | ... | 1.941009e+13 | 2.013902e+13 | 2.086796e+13 | 2.159689e+13 | 2.232582e+13 | 2.315305e+13 | 2.398027e+13 | 2.480750e+13 | 2.563473e+13 | 2.646195e+13 |
Benin | 1.851458e+12 | 1.858390e+12 | 1.865323e+12 | 1.872255e+12 | 1.879187e+12 | 1.886119e+12 | 1.905149e+12 | 1.924180e+12 | 1.943210e+12 | 1.962241e+12 | ... | 7.691320e+12 | 7.921509e+12 | 8.151697e+12 | 8.381886e+12 | 8.612074e+12 | 8.874543e+12 | 9.137011e+12 | 9.399479e+12 | 9.661947e+12 | 9.924415e+12 |
Botswana | 3.377080e+11 | 3.450761e+11 | 3.524442e+11 | 3.598122e+11 | 3.671803e+11 | 3.745484e+11 | 3.799093e+11 | 3.852703e+11 | 3.906313e+11 | 3.959923e+11 | ... | 1.688976e+12 | 1.709664e+12 | 1.730351e+12 | 1.751039e+12 | 1.771726e+12 | 1.812568e+12 | 1.853409e+12 | 1.894250e+12 | 1.935091e+12 | 1.975932e+12 |
Burkina Faso | 3.538834e+12 | 3.574675e+12 | 3.610516e+12 | 3.646357e+12 | 3.682198e+12 | 3.718039e+12 | 3.765608e+12 | 3.813178e+12 | 3.860748e+12 | 3.908317e+12 | ... | 1.296491e+13 | 1.338135e+13 | 1.379779e+13 | 1.421422e+13 | 1.463066e+13 | 1.510462e+13 | 1.557857e+13 | 1.605253e+13 | 1.652648e+13 | 1.700044e+13 |
5 rows × 71 columns
Let's go on with an interactive visualization of the data in order to understand the trend across countries. The following map is based on data for 2020.
However, it is also important to understand how the situation changes over time, so we plot the same map over the range from 1970 to 2020 (in steps of 10 years).
If the map for 2020 doesn't show, click here
Links to the other maps:
# Code used just for generating the map
for year in range(1970, 2030, 10):
    legend_name = "Estimation of kcal/year [10^11 kcal/year]"
    africa_kcal_est_map = plot_map(total_cal_need_africa.divide(10**11), path_dict['geoworld_json'], \
                                   african_country_kv, year, "Greens", legend_name, legend_name, path_dict['african_estimation_kcal'] + str(year) + ".html")
africa_kcal_est_map
IFrame(src='https://manuleo.github.io/mADAm_files/africa_est_kcal2020.html', width = 800, height=600)
save_map_data(geojson_africa, african_country_kv, total_cal_need_africa.divide(10**11).T, "docs/json/africa_need/africa_need_{}.geojson", "docs/json/africa_need/africa_need_ticks.json")
By looking at how the scale changes over the years, we can see that the needs of the African population have grown constantly (linked to the growth of the population itself). This is not reflected in the African food supply, which does not increase at the same rate as the population.
Next, we introduce an interesting comparison between the two dataframes obtained in the first two parts of our analysis. More specifically, it takes into account the total population dataframe pop_tot_africa and food_supply_africa. As for the FAO food supply dataframe, we need to convert its unit to kcal/year in order to compare results appropriately.
The function obtain_difference takes our dataframes and computes which countries have enough caloric food supply to actually meet their needs.
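The comparison can be sketched as follows. The notebook converts the supply up to kcal/year; an equivalent formulation (assumed here, the real helper is in helpfunc.py) brings the yearly need down to kcal/person/day, which matches the unit of the plots below:

```python
import pandas as pd

def obtain_difference_sketch(pop_tot, food_supply, total_need):
    """Per-person daily surplus: supply (kcal/person/day) minus the
    yearly need (kcal/year) spread over population and 365 days."""
    need_per_person_day = total_need / (pop_tot * 365)
    # positive -> surplus, negative -> deficit (kcal/person/day)
    return food_supply - need_per_person_day

# Toy data: 1M people needing 2000 kcal/day each, supplied 2500
pop = pd.DataFrame({2020: [1_000_000]}, index=["A"])
supply = pd.DataFrame({2020: [2500.0]}, index=["A"])
need = pd.DataFrame({2020: [1_000_000 * 2000.0 * 365]}, index=["A"])
print(obtain_difference_sketch(pop, supply, need))  # 500 kcal/person/day
```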
caloric_difference_africa = obtain_difference(pop_tot_africa, food_supply_africa, total_cal_need_africa)
caloric_difference_africa.head()
1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Algeria | -573.589919 | -620.801820 | -659.153014 | -644.633318 | -591.233516 | -612.982117 | -538.630181 | -481.186136 | -483.657494 | -515.050972 | ... | 866.671209 | 928.653501 | 959.361778 | 998.680650 | 1009.590805 | 1041.856405 | 1058.409442 | 1066.538096 | 1083.352529 | 1086.631007 |
Angola | -413.161895 | -391.881557 | -357.607490 | -348.339486 | -333.077345 | -320.591047 | -290.100518 | -355.605811 | -266.106978 | -247.604067 | ... | 215.152631 | 191.672527 | 280.225775 | 186.029318 | 294.259093 | 203.038770 | 292.560777 | 210.272579 | 277.137533 | 214.003164 |
Benin | -492.836248 | -467.466793 | -519.203353 | -550.041004 | -403.975119 | -409.528255 | -378.181242 | -273.928090 | -335.763279 | -392.681714 | ... | 373.935633 | 384.061159 | 391.289648 | 386.664138 | 392.847960 | 387.777835 | 390.545646 | 388.001273 | 387.055618 | 384.440686 |
Botswana | -208.233159 | -272.356799 | -206.603106 | -174.964398 | -168.433620 | -225.305819 | -161.230054 | -118.202804 | -156.220859 | -133.281287 | ... | -24.779281 | 46.535339 | 32.790121 | 49.163878 | 88.124217 | 68.644236 | 103.269109 | 65.282088 | 99.284178 | 67.992986 |
Burkina Faso | -915.448084 | -930.741495 | -784.109191 | -678.548155 | -660.055530 | -668.265460 | -638.569119 | -620.961863 | -616.439348 | -671.997508 | ... | 458.632137 | 499.472919 | 510.440182 | 517.817483 | 514.063968 | 518.852721 | 521.753299 | 516.341587 | 517.861824 | 517.405554 |
5 rows × 60 columns
Let's start with a simple barplot of the deficit per person per day in each country. As our main point of interest is the present, we will start with a graph showing next year's situation:
caloric_difference_africa_sorted = caloric_difference_africa[2020].sort_values()
p = caloric_difference_africa_sorted.plot(kind='barh', color=(caloric_difference_africa_sorted > 0).map({True: 'g', False: 'red'}),alpha=0.75, rot=0);
p.set_xlabel("deficit/surplus in the African countries [kcal/person/day]")
p.set_ylabel("African countries")
plt.title('Estimation of net food availability in African countries in 2020' );
This plot already suggests that by smart redistribution alone, Africa could sustain its own food demand. However, having the capabilities and know-how to efficiently set up a food aid operation is harder than it seems. European countries, on the other hand, have much more experience in this field, and they are expected to have an even larger amount of excess food, making it easier to run such an operation. They will therefore be considered as the ones providing the difference for African countries.
For a better visualization, it is convenient to see the evolution of the kcal demand over all the years of interest. Using draw_demand_bar, we will plot the information for every year, combined with an animation to move back and forth in time.
Note: the code below is used to generate the HTML animation. If the animation doesn't work inside the notebook, click on this link
# # Code to generate the animation
# fig = plt.figure()
# animator = animation.FuncAnimation(fig, draw_demand_bar, frames=range(1961, 2021),
#                                    fargs=(caloric_difference_africa, ),
#                                    repeat=False)
# # HTML(animator.to_jshtml())
# with open(path_dict['african_demand_anim'], "w") as f:
#     print(animator.to_html5_video(), file=f)
IFrame(src='https://manuleo.github.io/mADAm_files/african_cal_diff_animation.html', width = 1100, height=600)
As the animation shows, the African situation has improved noticeably over the last 60 years, going from a share of 75% of countries starving in 1961 to "just" 21% in 2020.
The improvement was uneven over time: it was modest up to the nineties, then accelerated dramatically from 2000 onwards.
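The starving-country share quoted above can be computed directly from the difference dataframe, as in this small sketch (the helper name and toy values are ours):

```python
import pandas as pd

def deficit_share(diff_df, year):
    """Fraction of countries whose caloric surplus is negative
    (i.e. in deficit) in the given year."""
    col = diff_df[year]
    return (col < 0).sum() / len(col)

# Toy frame: 4 countries, 3 in deficit in 1961, 1 in 2020
demo = pd.DataFrame({1961: [-100.0, -50.0, -10.0, 20.0],
                     2020: [-100.0, 50.0, 10.0, 20.0]})
print(deficit_share(demo, 1961))  # 0.75
print(deficit_share(demo, 2020))  # 0.25
```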
For our data story we plotted the same animation with a slider, using the function below for both Africa's and Europe's surplus situation.
def get_plotly_progress(dataframe, zone, title, width=800, height=800, mini=-1100, maxi=1500, size=20):
    fig = go.Figure()
    # One hidden bar trace per year; the slider toggles their visibility
    for year in range(1961, 2021):
        x, y, colors = colors_and_order(dataframe[year])
        fig.add_trace(
            go.Bar(visible=False,
                   width=0.85,
                   orientation="h",
                   x=x,
                   y=y,
                   text=y,
                   hovertemplate="%{x} kcal",
                   textposition='outside',
                   textfont=dict(size=100),
                   marker_color=colors,
                   name=str(year)))
    fig.data[-1].visible = True
    steps = []
    for i in range(len(fig.data)):
        step = dict(
            method="restyle",
            args=["visible", [False] * len(fig.data)],
            label=str(i + 1961),
        )
        step["args"][1][i] = True  # Toggle i'th trace to "visible"
        steps.append(step)
    sliders = [dict(
        active=59,
        currentvalue={"prefix": "Year: "},
        pad={"t": 50},
        steps=steps,
    )]
    fig.update_layout(
        title=dict(text=title, y=0.94, x=0.03, font=dict(size=15)),
        sliders=sliders,
        xaxis=dict(autorange=False, range=[mini, maxi], title="caloric surplus (kcal/person/day)", showline=True),
        yaxis=dict(autorange=True, showgrid=False, ticks='', showticklabels=False),
        margin=dict(t=30, l=5, b=10, r=5),
        autosize=False,
        width=width,
        height=height,
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )
    py.plot(fig, filename='docs/_includes/cal_diff_{}.html'.format(zone), include_plotlyjs=False)
    fig.show()
legend_name = "Estimation of kcal/person/day deficit"
for year in range(1970, 2030, 10):
    bins = [min(caloric_difference_africa[year]), 0, 200, 450, max(caloric_difference_africa[year])]
    african_kcal_need_map = plot_map(caloric_difference_africa, path_dict['geoworld_json'], \
                                     african_country_kv, year, "RdYlGn", legend_name, legend_name, path_dict['african_kcal_need'] + str(year) + ".html", bins)
african_kcal_need_map
IFrame(src='https://manuleo.github.io/mADAm_files/african_kcal_need2020.html', width = 800, height=600)
We found an analysis here about GDP (Gross Domestic Product) in African countries, which we think is a good way to evaluate the correctness of our analysis. On the left we show the heat map of Africa in terms of each country's GDP per capita.
Observing the evolution over the last 50 years allows us to better understand and comment on the animation above. As the 1970 map shows, almost all of Africa lived in starving conditions, with the lowest point in Guinea-Bissau (a huge deficit of 700 kcal/person/day).
Individual case studies support the data's accuracy. For example, the Ethiopian famine of the 80s is strikingly reproduced, manifesting itself as Ethiopia drops to the lowest rank. In general, a very positive development can be observed, as more and more countries manage to reach a surplus regime in every single decade, with the largest increase occurring just recently, in the last 10 years.
Ever since 2001, more than half of the examined countries have shown a net surplus. As of 2019, all of the countries in red were determined to be either war-ridden or politically fragile. Exceptions to the rule are Namibia and Eswatini, both of which boast a relatively high GDP per capita (ranked 10th and 11th on the African continent, respectively). Thus, the only explanation would be inequality among the population or insufficient distribution of available resources.
To answer this important question, we will need to import data from the FAO Dataset. More specifically, we will focus on the section Food Balance Sheet with respect to European countries only.
food_balance_europe = pd.read_csv(path_dict['food_balance_europe'],encoding='latin-1', low_memory=False);
European countries will be analysed following the same strategy we used for African countries, so that we are also consistent in the way we assess whether countries are in deficit or have a surplus. To start off, we clean the dataframe and fill missing values:
food_balance_europe = clean_Fs_and_years(food_balance_europe)
food_balance_europe = food_balance_europe.fillna(0);
The third step in cleaning food_balance_europe consists of adapting country names, in order to have consistency across our different dataframes.
The change we need to make here is that The former Yugoslav Republic of Macedonia should become North Macedonia.
food_balance_europe = replace_names_of_countries(food_balance_europe, [("The former Yugoslav Republic of Macedonia", "North Macedonia")])
Our Dataframe looks like this:
food_balance_europe.head()
Area Code | Area | Item Code | Item | Element Code | Element | Unit | 1961 | 1962 | 1963 | ... | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | Albania | 2501 | Population | 511 | Total Population - Both sexes | 1000 persons | 1669.0 | 1719.0 | 1771.0 | ... | 3216.0 | 3196.0 | 3180.0 | 3166.0 | 3157.0 | 3151.0 | 3150.0 | 3154.0 | 3162.0 | 3173.0 |
1 | 3 | Albania | 2511 | Wheat and products | 5511 | Production | 1000 tonnes | 98.0 | 146.0 | 62.0 | ... | 253.0 | 260.0 | 231.0 | 250.0 | 335.0 | 333.0 | 295.0 | 293.0 | 300.0 | 294.0 |
2 | 3 | Albania | 2511 | Wheat and products | 5611 | Import Quantity | 1000 tonnes | 182.0 | 89.0 | 110.0 | ... | 465.0 | 417.0 | 414.0 | 406.0 | 355.0 | 343.0 | 362.0 | 389.0 | 377.0 | 360.0 |
3 | 3 | Albania | 2511 | Wheat and products | 5072 | Stock Variation | 1000 tonnes | -52.0 | -4.0 | 56.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 3 | Albania | 2511 | Wheat and products | 5911 | Export Quantity | 1000 tonnes | 0.0 | 0.0 | 0.0 | ... | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 4.0 | 4.0 | 4.0 |
5 rows × 60 columns
Since the structure of this dataset is equivalent to the one used for the African analysis, we can again filter on the pairs (Item Code, 2901) and (Element Code, 664) for our Europe analysis.
food_supply_europe = obtain_supply(food_balance_europe)
We can now index by Area and see the supply available in each country for each particular year.
food_supply_europe = food_supply_europe.set_index("Area")
food_supply_europe.head()
1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | ... | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Area | |||||||||||||||||||||
Albania | 2223.0 | 2242.0 | 2156.0 | 2270.0 | 2254.0 | 2254.0 | 2262.0 | 2343.0 | 2404.0 | 2415.0 | ... | 2792.0 | 2874.0 | 2855.0 | 2860.0 | 2947.0 | 2993.0 | 3076.0 | 3132.0 | 3184.0 | 3193.0 |
Austria | 3191.0 | 3193.0 | 3248.0 | 3270.0 | 3220.0 | 3232.0 | 3194.0 | 3221.0 | 3153.0 | 3217.0 | ... | 3606.0 | 3640.0 | 3719.0 | 3737.0 | 3717.0 | 3723.0 | 3724.0 | 3735.0 | 3739.0 | 3768.0 |
Belarus | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2901.0 | 2987.0 | 3084.0 | 3186.0 | 3200.0 | 3126.0 | 3196.0 | 3400.0 | 3400.0 | 3250.0 |
Belgium | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 3722.0 | 3716.0 | 3716.0 | 3713.0 | 3702.0 | 3697.0 | 3707.0 | 3720.0 | 3715.0 | 3733.0 |
Belgium-Luxembourg | 2923.0 | 2908.0 | 3007.0 | 3061.0 | 2966.0 | 3067.0 | 3088.0 | 3032.0 | 3156.0 | 3068.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 53 columns
In order to check for anomalies in our data, we would like to analyze the timeline. We therefore transpose the dataframe and plot the timeline of how food supply in different countries evolved. Legend was suppressed as it is too large.
food_supply_europe = food_supply_europe.transpose();
#converting the year from string to int
food_supply_europe.index = food_supply_europe.index.astype(int)
timeline_supply(food_supply_europe, "European")
We can observe a situation similar to the one in the African countries analysis. As such, we can assume that some countries changed names, or that other events occurred which caused their data to stop being recorded/collected.
Let's then see which countries have zeros in the values for food supplies.
food_supply_europe.columns.values[(food_supply_europe == 0).any()]
array(['Belarus', 'Belgium', 'Belgium-Luxembourg', 'Bosnia and Herzegovina', 'Croatia', 'Czechia', 'Czechoslovakia', 'Estonia', 'Latvia', 'Lithuania', 'Luxembourg', 'Montenegro', 'Republic of Moldova', 'Russian Federation', 'Serbia', 'Serbia and Montenegro', 'Slovakia', 'Slovenia', 'North Macedonia', 'Ukraine', 'USSR', 'Yugoslav SFR'], dtype=object)
Following the chronology, we observe that data stopped being collected for some countries, while for others data only started being collected after the initial collection year. After some research on these countries, we found out that:
Let's then take care of all these cases.
countries_to_merge = {'USSR': ['Belarus', 'Ukraine', 'Estonia', 'Republic of Moldova', 'Russian Federation', 'Latvia', 'Lithuania'],
'Yugoslav SFR': ['Bosnia and Herzegovina', 'Croatia', 'Serbia and Montenegro', 'North Macedonia', 'Slovenia'],
'Czechoslovakia': ['Czechia', 'Slovakia'],
'Serbia and Montenegro': ['Serbia', 'Montenegro'],
'Belgium-Luxembourg': ['Luxembourg', 'Belgium']}
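One plausible sketch of what merge_countries could do (the real helper is in helpfunc.py and may well use a different strategy, e.g. population-weighted averaging; here we exploit the fact that the values are per capita and simply backfill a successor's zero years with the dissolved entity's figure):

```python
import pandas as pd

def merge_countries_sketch(df, mapping):
    """For each dissolved entity, copy its per-capita value into the
    zero-valued years of its successor states, then drop its column.
    df has years as index and countries as columns."""
    out = df.copy()
    for old, successors in mapping.items():
        if old not in out.columns:
            continue
        for succ in successors:
            if succ in out.columns:
                mask = out[succ] == 0
                out.loc[mask, succ] = out.loc[mask, old]
        out = out.drop(columns=old)
    return out

# Toy data: Belarus has no figure before the USSR dissolution
demo = pd.DataFrame({"USSR": [3300.0, 0.0],
                     "Belarus": [0.0, 3100.0]}, index=[1990, 1995])
print(merge_countries_sketch(demo, {"USSR": ["Belarus"]}))
```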
food_supply_europe = merge_countries(food_supply_europe, countries_to_merge)
timeline_supply(food_supply_europe, "European", " - Cleaned dataset")
We observe an interesting drop in the food supply of most countries that emerged from the former USSR. While it may not apply to all of them, some of these countries are very poor; once separated from the world power that was the USSR, their ability to produce and provide for themselves may have dropped significantly, which is coherent with what we observe. As such, we continue our analysis with these values.
Next, we add columns for the future years up to 2020, preparing the cells for extrapolation so we can make predictions about possible scenarios.
food_supply_europe = prepare_future(food_supply_europe, 2014, 2020)
food_supply_europe = predict_future(food_supply_europe, path_dict['europe_supply_rnn'])
timeline_supply(food_supply_europe, "European", " - Predicted dataset")
Similar to the analysis of African countries, the same conclusions can be drawn about the neural network method.
Let's visualize our results with an interactive heatmap:
european_country_codes = ["ALB","AUT","BLR","BEL","BIH","BGR","HRV","CZE","DNK","EST","FIN","FRA","DEU","GRC","HUN","ISL","IRL",
"ITA","LVA","LTU","LUX","MALTAMISSING","MNE","NLD","NOR","POL","PRT","MDA","ROU","RUS","SRB","SWK",
"SVN","ESP","SWE","CHE","MKD","UKR","GBR"]
european_country_names = food_supply_europe.T.index.values
european_country_kv = pd.DataFrame({'codes': european_country_codes,
'names': european_country_names
})
geojson_europe = geojson_world[geojson_world.id.isin(european_country_codes)]
geojson_europe.head()
id | name | geometry | |
---|---|---|---|
2 | ALB | Albania | POLYGON ((20.59025 41.85540, 20.46317 41.51509... |
9 | AUT | Austria | POLYGON ((16.97967 48.12350, 16.90375 47.71487... |
12 | BEL | Belgium | POLYGON ((3.31497 51.34578, 4.04707 51.26726, ... |
16 | BGR | Bulgaria | POLYGON ((22.65715 44.23492, 22.94483 43.82379... |
18 | BIH | Bosnia and Herzegovina | POLYGON ((19.00549 44.86023, 19.36803 44.86300... |
After our manual analysis of the countries, we notice that Malta is not present in the json. Hence, we won't consider it in our analysis.
food_supply_europe = food_supply_europe.drop(columns=["Malta"])
# Plotting
legend_name = "Food supply in Europe (kcal/person/day)"
for year in range(1970, 2030, 10):
    europe_supply_map = plot_map(food_supply_europe.T, path_dict['geoworld_json'], \
                                 european_country_kv, year, "Blues", legend_name, legend_name, path_dict['european_supply_map'] + str(year) + ".html", bins=9)
europe_supply_map
IFrame(src='https://manuleo.github.io/mADAm_files/european_supply_map2020.html', width = 1000, height=600)
save_map_data(geojson_europe, european_country_kv, food_supply_europe, "docs/json/europe_supply/europe_supply_{}.geojson", "docs/json/europe_supply/europe_supply_ticks.json", bins=6)
Analyzing the data shows a trend similar to the African case, but scaled up. The available food supply in Europe over the last 50 years remained mostly stable, between 2500 and 3800 kcal/person/day.
It is interesting to note the drop the ex-USSR countries had around 2000 after the dissolution, a drop that had already emerged during the earlier cleaning of the dataset.
Once again, we do a very similar analysis on European countries as we did for the African ones.
In this second part of our analysis, we load the list of European countries. Then we load the World Population Database (United Nations) and obtain two dataframes: one for males and one for females.
with open(path_dict["european_countries_list"], 'r') as eu_c:
    eu_countries = [line.rstrip() for line in eu_c]  # loading list
We need to check whether the FAO database contains data for every country in Europe. We will check the intersection with the eu_countries list.
eu_to_remove = list(set(eu_countries) - set(food_supply_europe.columns.values))
print("List of countries for which no data is available: " + str(eu_to_remove))
List of countries for which no data is available: ['Channel Islands', 'Malta']
Because there are fewer countries in Europe, and because most European countries are part of the UN, we expected most of them to be present in both the FAO database and the population database. The Channel Islands are a small set of islands in the English Channel; because they are so small, we can safely remove them from our scope of interest.
eu_countries = [i for i in eu_countries if not i in eu_to_remove]
eu_to_remove = list(set(eu_countries) - set(food_supply_europe.columns.values))
print("List of countries for which no data is available: " + str(eu_to_remove))
List of countries for which no data is available: []
We now obtain the population of each gender in all European countries.
pop_male_europe = clean_pop_df(pop_male, eu_countries)
pop_female_europe = clean_pop_df(pop_female, eu_countries)
Let's have a look at the final version of male population data grouped by age:
pop_male_europe.head()
country | year | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | ... | 55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | 100+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3135 | Belarus | 1950 | 333700 | 282651 | 395174 | 382562 | 361905 | 237761 | 150660 | 206475 | ... | 138445 | 124045 | 99216 | 68726 | 41415 | 19537 | 7648 | 2113 | 365 | 35 |
3136 | Belarus | 1955 | 418684 | 315056 | 264730 | 374177 | 357223 | 337374 | 219617 | 136496 | ... | 128674 | 111772 | 95381 | 70127 | 42393 | 19507 | 8196 | 2262 | 415 | 47 |
3137 | Belarus | 1960 | 491730 | 406036 | 305601 | 255607 | 360905 | 342271 | 323037 | 208441 | ... | 150864 | 108601 | 89962 | 70827 | 45805 | 21818 | 8211 | 2371 | 420 | 50 |
3138 | Belarus | 1965 | 448162 | 482745 | 398975 | 299589 | 248620 | 350831 | 331524 | 311625 | ... | 158297 | 130151 | 89304 | 68499 | 47682 | 24484 | 9262 | 2393 | 448 | 52 |
3139 | Belarus | 1970 | 368381 | 450207 | 485335 | 400660 | 299407 | 247750 | 348188 | 327625 | ... | 147426 | 140703 | 110454 | 70311 | 47732 | 27537 | 10041 | 2554 | 418 | 51 |
5 rows × 23 columns
Similarly to the analysis performed on Africa, we once again interpolate pop_male_europe from a 5-year frequency to a 1-year frequency. Before doing so, we check again whether the evolution within these 5-year intervals is approximately linear.
Note: as usual, if the animation is not visible in the notebook, click here
IFrame(src='https://manuleo.github.io/mADAm_files/european_pop_growth.html', width = 1200, height=600)
# # Selecting countries
# countryrand = []
# n_countries = 2
# for i in range(0, n_countries):
#     countryrand.append(random.choice(pop_male_europe.country.drop_duplicates().values))
# fig = plt.figure()
# animator = animation.FuncAnimation(fig, timeline_country_gender, frames=(range(1950, 2025, 5)),
#                                    fargs=(pop_male_europe, pop_female_europe, "30-34", countryrand), repeat=False)
# # HTML(animator.to_jshtml())
# with open(path_dict['european_pop_growth'], "w") as f:
#     print(animator.to_html5_video(), file=f)
Even if the growth as a whole is not as linear as before, the animation clearly shows that it is linear within each 5-year group. As our interpolation works on each of these groups separately, we can proceed with it.
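For a single country, the interpolation step can be sketched as follows (the real interpolate_years helper in helpfunc.py handles the full multi-country frame; the function below is our simplified, single-country version):

```python
import pandas as pd

def interpolate_country(df, start, end):
    """Reindex 5-yearly observations onto a yearly grid and fill the
    gaps by linear interpolation, which the near-linear growth within
    each 5-year group justifies."""
    yearly = df.set_index("year").reindex(range(start, end + 1))
    yearly = yearly.interpolate(method="linear")
    return yearly.rename_axis("year").reset_index()

# Toy data: one age group observed in 1950 and 1955
demo = pd.DataFrame({"year": [1950, 1955], "0-4": [100.0, 200.0]})
out = interpolate_country(demo, 1950, 1955)
print(out)  # 1952 -> 140.0
```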
pop_male_europe = interpolate_years(pop_male_europe, 1950, 2020)
pop_female_europe = interpolate_years(pop_female_europe, 1950, 2020)
Let's see what the new dataframes for the male and female populations look like:
pop_male_europe.head()
country | year | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | ... | 55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | 100+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Belarus | 1950 | 333700.0 | 282651.0 | 395174.0 | 382562.0 | 361905.0 | 237761.0 | 150660.0 | 206475.0 | ... | 138445.0 | 124045.0 | 99216.0 | 68726.0 | 41415.0 | 19537.0 | 7648.0 | 2113.0 | 365.0 | 35.0 |
1 | Belarus | 1951 | 350696.8 | 289132.0 | 369085.2 | 380885.0 | 360968.6 | 257683.6 | 164451.4 | 192479.2 | ... | 136490.8 | 121590.4 | 98449.0 | 69006.2 | 41610.6 | 19531.0 | 7757.6 | 2142.8 | 375.0 | 37.4 |
2 | Belarus | 1952 | 367693.6 | 295613.0 | 342996.4 | 379208.0 | 360032.2 | 277606.2 | 178242.8 | 178483.4 | ... | 134536.6 | 119135.8 | 97682.0 | 69286.4 | 41806.2 | 19525.0 | 7867.2 | 2172.6 | 385.0 | 39.8 |
3 | Belarus | 1953 | 384690.4 | 302094.0 | 316907.6 | 377531.0 | 359095.8 | 297528.8 | 192034.2 | 164487.6 | ... | 132582.4 | 116681.2 | 96915.0 | 69566.6 | 42001.8 | 19519.0 | 7976.8 | 2202.4 | 395.0 | 42.2 |
4 | Belarus | 1954 | 401687.2 | 308575.0 | 290818.8 | 375854.0 | 358159.4 | 317451.4 | 205825.6 | 150491.8 | ... | 130628.2 | 114226.6 | 96148.0 | 69846.8 | 42197.4 | 19513.0 | 8086.4 | 2232.2 | 405.0 | 44.6 |
5 rows × 23 columns
pop_female_europe.head()
country | year | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | ... | 55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | 100+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Belarus | 1950 | 325456.0 | 287716.0 | 420006.0 | 403120.0 | 424672.0 | 338168.0 | 223359.0 | 326549.0 | ... | 178569.0 | 183039.0 | 152746.0 | 119379.0 | 85709.0 | 39512.0 | 20284.0 | 7706.0 | 1924.0 | 293.0 |
1 | Belarus | 1951 | 338789.2 | 291637.0 | 389910.6 | 402225.4 | 415571.6 | 350871.6 | 242416.2 | 302527.8 | ... | 179256.4 | 177849.8 | 153480.6 | 119761.2 | 85312.8 | 41588.0 | 19740.8 | 7352.0 | 1810.6 | 277.8 |
2 | Belarus | 1952 | 352122.4 | 295558.0 | 359815.2 | 401330.8 | 406471.2 | 363575.2 | 261473.4 | 278506.6 | ... | 179943.8 | 172660.6 | 154215.2 | 120143.4 | 84916.6 | 43664.0 | 19197.6 | 6998.0 | 1697.2 | 262.6 |
3 | Belarus | 1953 | 365455.6 | 299479.0 | 329719.8 | 400436.2 | 397370.8 | 376278.8 | 280530.6 | 254485.4 | ... | 180631.2 | 167471.4 | 154949.8 | 120525.6 | 84520.4 | 45740.0 | 18654.4 | 6644.0 | 1583.8 | 247.4 |
4 | Belarus | 1954 | 378788.8 | 303400.0 | 299624.4 | 399541.6 | 388270.4 | 388982.4 | 299587.8 | 230464.2 | ... | 181318.6 | 162282.2 | 155684.4 | 120907.8 | 84124.2 | 47816.0 | 18111.2 | 6290.0 | 1470.4 | 232.2 |
5 rows × 23 columns
Lastly, we compute the total population per year. This new dataframe, pop_tot_europe, will be useful for the next section of our analysis.
pop_tot_europe = obtain_total_pop(pop_male_europe, pop_female_europe)
For the next analysis we will need to match this data with food_balance_europe. We proceed to give our population data the same shape as the other dataset.
pop_tot_europe = reshape_pop_dataframe(pop_tot_europe)
pop_tot_europe.head()
1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | |||||||||||||||||||||
Albania | 1263164.0 | 1294529.4 | 1325894.8 | 1357260.2 | 1388625.6 | 1419991.0 | 1463211.0 | 1506431.0 | 1549651.0 | 1592871.0 | ... | 2936528.0 | 2925027.0 | 2913526.0 | 2902025.0 | 2890524.0 | 2887979.2 | 2885434.4 | 2882889.6 | 2880344.8 | 2877800.0 |
Austria | 6936442.0 | 6939834.4 | 6943226.8 | 6946619.2 | 6950011.6 | 6953404.0 | 6976877.8 | 7000351.6 | 7023825.4 | 7047299.2 | ... | 8463689.4 | 8517433.8 | 8571178.2 | 8624922.6 | 8678667.0 | 8744213.6 | 8809760.2 | 8875306.8 | 8940853.4 | 9006400.0 |
Belarus | 7745004.0 | 7745446.2 | 7745888.4 | 7746330.6 | 7746772.8 | 7747215.0 | 7822748.2 | 7898281.4 | 7973814.6 | 8049347.8 | ... | 9424345.6 | 9428115.2 | 9431884.8 | 9435654.4 | 9439424.0 | 9441403.4 | 9443382.8 | 9445362.2 | 9447341.6 | 9449321.0 |
Belgium | 8637521.0 | 8687471.8 | 8737422.6 | 8787373.4 | 8837324.2 | 8887275.0 | 8943292.2 | 8999309.4 | 9055326.6 | 9111343.8 | ... | 11008574.2 | 11078413.4 | 11148252.6 | 11218091.8 | 11287931.0 | 11348268.0 | 11408605.0 | 11468942.0 | 11529279.0 | 11589616.0 |
Bosnia and Herzegovina | 2661296.0 | 2716296.8 | 2771297.6 | 2826298.4 | 2881299.2 | 2936300.0 | 2994172.8 | 3052045.6 | 3109918.4 | 3167791.2 | ... | 3650254.8 | 3595031.6 | 3539808.4 | 3484585.2 | 3429362.0 | 3399652.6 | 3369943.2 | 3340233.8 | 3310524.4 | 3280815.0 |
5 rows × 71 columns
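The two helpers used above live in helpfunc; as a rough idea of what they do, here is a minimal sketch on a toy two-age-group table (the function names carry a `_sketch` suffix and the female figures are hypothetical, not the real implementation):

```python
import pandas as pd

# Toy long-format population tables, same schema as pop_male_europe / pop_female_europe:
# one row per country-year, one column per age group (female values are made up).
male = pd.DataFrame({"country": ["Belarus", "Belarus"], "year": [1950, 1951],
                     "0-4": [325456.0, 338789.2], "5-9": [287716.0, 291637.0]})
female = pd.DataFrame({"country": ["Belarus", "Belarus"], "year": [1950, 1951],
                       "0-4": [312000.0, 315000.0], "5-9": [280000.0, 284000.0]})

def obtain_total_pop_sketch(m, f):
    # Sum all age groups per row, then add male and female counts.
    age_cols = m.columns.drop(["country", "year"])
    total = m[["country", "year"]].copy()
    total["population"] = m[age_cols].sum(axis=1) + f[age_cols].sum(axis=1)
    return total

def reshape_pop_sketch(total):
    # Pivot to a Country x year matrix, matching the FAO food balance layout.
    return total.pivot(index="country", columns="year", values="population").rename_axis("Country")

pop_tot = reshape_pop_sketch(obtain_total_pop_sketch(male, female))
```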
Now we multiply each column of the population data by the matching age_group in the calories table (which we squeeze here to allow the multiplication, similar to transposing the rows and columns of the dataset).
We obtain two datasets, male_cal_need_europe and female_cal_need_europe, reporting the total calories needed for each country in each year, per age group and gender. The unit here is kcal/day.
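get_calories_need is defined in helpfunc; a minimal sketch of the broadcasting described above, with hypothetical per-age requirements (the calorie figures below are placeholders, not the real table):

```python
import pandas as pd

# Hypothetical daily requirements per age group (kcal/day), shaped like the
# one-row calories table, plus a one-row toy population table.
calories = pd.DataFrame({"0-4": [1367.0], "5-9": [1807.0]})
pop = pd.DataFrame({"country": ["Belarus"], "year": [1950],
                    "0-4": [325456.0], "5-9": [287716.0]})

def get_calories_need_sketch(pop, calories):
    # squeeze() collapses the one-row table into a Series indexed by age group,
    # so the multiplication broadcasts each requirement over its matching column.
    need = pop.copy()
    need[calories.columns] = pop[calories.columns] * calories.squeeze()
    return need

need = get_calories_need_sketch(pop, calories)
```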
#total calories male
male_cal_need_europe = get_calories_need(pop_male_europe, male_calories)
male_cal_need_europe.head()
country | year | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | ... | 55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | 100+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Belarus | 1950 | 4.449333e+08 | 520077840.0 | 964224560.0 | 1.193593e+09 | 1.085715e+09 | 713283000.0 | 451980000.0 | 586389000.0 | ... | 365494800.0 | 322517000.0 | 257961600.0 | 178687600.0 | 101052600.0 | 46888800.0 | 18355200.0 | 5071200.0 | 876000.0 | 84000.0 |
1 | Belarus | 1951 | 4.675957e+08 | 532002880.0 | 900567888.0 | 1.188361e+09 | 1.082906e+09 | 773050800.0 | 493354200.0 | 546640928.0 | ... | 360335712.0 | 316135040.0 | 255967400.0 | 179416120.0 | 101529864.0 | 46874400.0 | 18618240.0 | 5142720.0 | 900000.0 | 89760.0 |
2 | Belarus | 1952 | 4.902581e+08 | 543927920.0 | 836911216.0 | 1.183129e+09 | 1.080097e+09 | 832818600.0 | 534728400.0 | 506892856.0 | ... | 355176624.0 | 309753080.0 | 253973200.0 | 180144640.0 | 102007128.0 | 46860000.0 | 18881280.0 | 5214240.0 | 924000.0 | 95520.0 |
3 | Belarus | 1953 | 5.129205e+08 | 555852960.0 | 773254544.0 | 1.177897e+09 | 1.077287e+09 | 892586400.0 | 576102600.0 | 467144784.0 | ... | 350017536.0 | 303371120.0 | 251979000.0 | 180873160.0 | 102484392.0 | 46845600.0 | 19144320.0 | 5285760.0 | 948000.0 | 101280.0 |
4 | Belarus | 1954 | 5.355829e+08 | 567778000.0 | 709597872.0 | 1.172664e+09 | 1.074478e+09 | 952354200.0 | 617476800.0 | 427396712.0 | ... | 344858448.0 | 296989160.0 | 249984800.0 | 181601680.0 | 102961656.0 | 46831200.0 | 19407360.0 | 5357280.0 | 972000.0 | 107040.0 |
5 rows × 23 columns
#total calories female
female_cal_need_europe = get_calories_need(pop_female_europe, female_calories)
female_cal_need_europe.head()
country | year | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | ... | 55-59 | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | 100+ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Belarus | 1950 | 4.122443e+08 | 494871520.0 | 907212960.0 | 967488000.0 | 1.019213e+09 | 811603200.0 | 500324160.0 | 718407800.0 | ... | 392851800.0 | 373399560.0 | 305492000.0 | 238758000.0 | 171418000.0 | 79024000.0 | 40568000.0 | 15412000.0 | 3848000.0 | 586000.0 |
1 | Belarus | 1951 | 4.291330e+08 | 501615640.0 | 842206896.0 | 965340960.0 | 9.973718e+08 | 842091840.0 | 543012288.0 | 665561160.0 | ... | 394364080.0 | 362813592.0 | 306961200.0 | 239522400.0 | 170625600.0 | 83176000.0 | 39481600.0 | 14704000.0 | 3621200.0 | 555600.0 |
2 | Belarus | 1952 | 4.460217e+08 | 508359760.0 | 777200832.0 | 963193920.0 | 9.755309e+08 | 872580480.0 | 585700416.0 | 612714520.0 | ... | 395876360.0 | 352227624.0 | 308430400.0 | 240286800.0 | 169833200.0 | 87328000.0 | 38395200.0 | 13996000.0 | 3394400.0 | 525200.0 |
3 | Belarus | 1953 | 4.629104e+08 | 515103880.0 | 712194768.0 | 961046880.0 | 9.536899e+08 | 903069120.0 | 628388544.0 | 559867880.0 | ... | 397388640.0 | 341641656.0 | 309899600.0 | 241051200.0 | 169040800.0 | 91480000.0 | 37308800.0 | 13288000.0 | 3167600.0 | 494800.0 |
4 | Belarus | 1954 | 4.797991e+08 | 521848000.0 | 647188704.0 | 958899840.0 | 9.318490e+08 | 933557760.0 | 671076672.0 | 507021240.0 | ... | 398900920.0 | 331055688.0 | 311368800.0 | 241815600.0 | 168248400.0 | 95632000.0 | 36222400.0 | 12580000.0 | 2940800.0 | 464400.0 |
5 rows × 23 columns
Once we have the calories needed for both genders, we can simply add them together to obtain the total calories needed for each country in each year, which we collect in the dataframe total_cal_need_europe. The unit is kcal/year. All of this is done by the obtain_total_cal_need function.
total_cal_need_europe = obtain_total_cal_need(male_cal_need_europe, female_cal_need_europe)
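obtain_total_cal_need also comes from helpfunc; a plausible sketch, consistent with the stated kcal/year unit (the daily-need figures are toy values):

```python
import pandas as pd

# Toy per-gender daily-need tables (kcal/day); values are hypothetical.
male_need = pd.DataFrame({"country": ["Belarus"], "year": [1950],
                          "0-4": [4.0e8], "5-9": [5.2e8]})
female_need = pd.DataFrame({"country": ["Belarus"], "year": [1950],
                            "0-4": [4.1e8], "5-9": [4.9e8]})

def obtain_total_cal_need_sketch(m, f):
    age_cols = m.columns.drop(["country", "year"])
    total = m[["country", "year"]].copy()
    # Add genders, collapse the age groups, and convert kcal/day -> kcal/year.
    total["Calories"] = (m[age_cols] + f[age_cols]).sum(axis=1) * 365
    return total

total_need = obtain_total_cal_need_sketch(male_need, female_need)
```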
Let's take a look at the total calories dataframe total_cal_need_europe:
total_cal_need_europe.sort_values(by="Calories", ascending=False).head()
country | year | Calories | |
---|---|---|---|
547 | Russian Federation | 2000 | 1.277543e+14 |
546 | Russian Federation | 1999 | 1.277413e+14 |
545 | Russian Federation | 1998 | 1.277284e+14 |
544 | Russian Federation | 1997 | 1.277154e+14 |
543 | Russian Federation | 1996 | 1.277024e+14 |
For the sake of consistency, we will now reshape our dataframe total_cal_need_europe according to the same schema seen for food_supply_europe.
total_cal_need_europe = reshape_calories_df(total_cal_need_europe)
Drawing a sample of the final reshaped dataframe total_cal_need_europe:
total_cal_need_europe.head()
1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | |||||||||||||||||||||
Albania | 1.041791e+12 | 1.067157e+12 | 1.092523e+12 | 1.117889e+12 | 1.143255e+12 | 1.168621e+12 | 1.202214e+12 | 1.235807e+12 | 1.269400e+12 | 1.302993e+12 | ... | 2.558970e+12 | 2.551603e+12 | 2.544237e+12 | 2.536870e+12 | 2.529503e+12 | 2.526050e+12 | 2.522597e+12 | 2.519144e+12 | 2.515691e+12 | 2.512237e+12 |
Austria | 5.906169e+12 | 5.912127e+12 | 5.918085e+12 | 5.924043e+12 | 5.930001e+12 | 5.935959e+12 | 5.950704e+12 | 5.965449e+12 | 5.980194e+12 | 5.994939e+12 | ... | 7.368272e+12 | 7.414042e+12 | 7.459812e+12 | 7.505582e+12 | 7.551352e+12 | 7.603258e+12 | 7.655163e+12 | 7.707068e+12 | 7.758973e+12 | 7.810878e+12 |
Belarus | 6.581842e+12 | 6.573869e+12 | 6.565896e+12 | 6.557924e+12 | 6.549951e+12 | 6.541978e+12 | 6.585445e+12 | 6.628913e+12 | 6.672380e+12 | 6.715847e+12 | ... | 8.184568e+12 | 8.171641e+12 | 8.158714e+12 | 8.145787e+12 | 8.132860e+12 | 8.125860e+12 | 8.118860e+12 | 8.111861e+12 | 8.104861e+12 | 8.097861e+12 |
Belgium | 7.440138e+12 | 7.475404e+12 | 7.510669e+12 | 7.545935e+12 | 7.581201e+12 | 7.616467e+12 | 7.656335e+12 | 7.696202e+12 | 7.736070e+12 | 7.775938e+12 | ... | 9.508244e+12 | 9.565091e+12 | 9.621938e+12 | 9.678785e+12 | 9.735632e+12 | 9.787514e+12 | 9.839396e+12 | 9.891278e+12 | 9.943160e+12 | 9.995042e+12 |
Bosnia and Herzegovina | 2.207555e+12 | 2.252332e+12 | 2.297108e+12 | 2.341885e+12 | 2.386662e+12 | 2.431439e+12 | 2.479545e+12 | 2.527652e+12 | 2.575759e+12 | 2.623865e+12 | ... | 3.196515e+12 | 3.144872e+12 | 3.093230e+12 | 3.041587e+12 | 2.989945e+12 | 2.963127e+12 | 2.936309e+12 | 2.909492e+12 | 2.882674e+12 | 2.855856e+12 |
5 rows × 71 columns
We can proceed to plot the results on a map visualization. The most interesting year is, as usual, 2020, but analyzing past years is also crucial to understand how the trend changed.
Note: in case the 2020 map doesn't show up, it is available here
Links for other years:
for year in range(1970,2030,10):
legend_name = "Estimation of kcal/year [10^11 kcal/year]"
europe_kcal_est_map = plot_map(total_cal_need_europe.divide(10**11), path_dict['geoworld_json'], \
european_country_kv, year, "YlGn", legend_name, legend_name, path_dict['european_estimation_kcal'] + str(year) + ".html", bins=9)
europe_kcal_est_map
IFrame(src='https://manuleo.github.io/mADAm_files/europe_est_kcal2020.html', width = 1000, height=600)
save_map_data(geojson_europe, european_country_kv, total_cal_need_europe.divide(10**11).T, "docs/json/europe_need/europe_need_{}.geojson", "docs/json/europe_need/europe_need_ticks.json", bins=9)
As we can see, the increase in population is reflected in an increase of the needed kcal in Europe.
In this European evaluation, we flip the scope of our analysis completely. While for Africa we wanted to know which countries need the most help (i.e., are not producing enough to sustainably survive), for Europe we want to find out which countries are producing more food internally than they need. The point of this analysis is to find out who could help the African countries in need by giving away some of their production, while still keeping at least the minimum needed to stay healthy.
As with the African analysis, this analysis will take into account the total population dataframe pop_tot_europe and the food_supply_europe. With regards to the FAO dataframe of food supply, we will need to transform the unit into kcal/year in order to compare results appropriately.
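obtain_difference is implemented in helpfunc; here is a minimal, equivalent sketch on toy Country × year tables. The text converts supply to kcal/year; dividing the yearly need back to a per-person, per-day figure, as below, yields the same difference expressed in kcal/person/day (all numbers are illustrative):

```python
import pandas as pd

# Toy Country x year tables (a single year for brevity); values are illustrative.
pop = pd.DataFrame({2020: [9.4e6]}, index=["Belarus"])        # inhabitants
supply = pd.DataFrame({2020: [3250.0]}, index=["Belarus"])    # kcal/person/day (FAO unit)
need = pd.DataFrame({2020: [8.1e12]}, index=["Belarus"])      # kcal/year, computed earlier

def obtain_difference_sketch(pop, supply, need):
    # Bring the yearly need back to kcal/person/day and subtract it from the
    # supplied kcal/person/day: positive values are a surplus, negative a deficit.
    return supply - need / pop / 365

diff = obtain_difference_sketch(pop, supply, need)
```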
caloric_difference_europe = obtain_difference(pop_tot_europe, food_supply_europe, total_cal_need_europe)
caloric_difference_europe.head()
1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Albania | -13.834758 | 6.437203 | -78.364674 | 36.765861 | 21.834368 | 17.419790 | 21.230166 | 98.248729 | 155.460337 | 162.851283 | ... | 744.528672 | 794.041235 | 800.534160 | 815.804774 | 812.927833 | 818.292383 | 817.404589 | 821.619756 | 821.669430 | 824.708041 |
Austria | 865.618730 | 870.779595 | 928.899367 | 953.978842 | 907.018796 | 919.030973 | 881.043010 | 908.054910 | 840.066677 | 904.078311 | ... | 1349.864434 | 1354.192008 | 1383.515474 | 1328.758610 | 1390.383215 | 1339.989372 | 1390.463523 | 1343.677415 | 1386.151286 | 1353.341990 |
Belarus | 815.470694 | 877.200747 | 923.935723 | 925.675488 | 933.419914 | 921.544550 | 945.798567 | 999.178346 | 992.680403 | 1041.301381 | ... | 1020.686503 | 1025.394254 | 880.098242 | 996.816638 | 857.158851 | 958.721120 | 859.469478 | 966.582382 | 931.484184 | 988.133970 |
Belgium | 588.233931 | 574.260475 | 674.274851 | 729.277274 | 635.267953 | 734.632818 | 754.010615 | 696.401193 | 818.804401 | 729.220090 | ... | 1353.664146 | 1349.523251 | 1368.371593 | 1353.634612 | 1357.850225 | 1354.710149 | 1355.474792 | 1354.426316 | 1354.660523 | 1354.457022 |
Bosnia and Herzegovina | 777.060508 | 918.659699 | 933.346142 | 1007.115164 | 1012.962422 | 1111.095960 | 1136.325436 | 1102.647323 | 1125.058265 | 1124.555066 | ... | 734.831924 | 653.334435 | 759.915028 | 738.663968 | 795.658111 | 749.993127 | 789.841469 | 782.711656 | 800.878745 | 800.331316 |
5 rows × 60 columns
caloric_difference_europe[caloric_difference_europe[2020].values < 0].index
Index([], dtype='object')
When running the exact same analysis on European countries as we did on African ones, it's interesting to observe that no European country at all is producing less than what it actually needs. As such, every single European country's supply is greater than its demand, and Europe should in theory be able to solve the hunger issue in Africa.
We can now proceed to the same visualization we did before, this time with the scope of visualizing the 2020 situation and evaluating whether there has been a deficit of kcal over the past years.
caloric_difference_europe_sorted = caloric_difference_europe[2020].sort_values()
p = caloric_difference_europe_sorted.plot(kind='barh', color=(caloric_difference_europe_sorted > 0).map({True: 'g', False: 'red'}),alpha=0.75, rot=0);
p.set_xlabel("deficit/surplus in the European countries [kcal/person/day]")
p.set_ylabel("European countries")
plt.title('Estimation of net food availability in European countries in 2020' );
As already noticed before, European countries boast a very high surplus of supplied food. We can now use the same animation as before to assess the situation over the past years and see whether Europe had any deficits:
# fig = plt.figure()
# animator = animation.FuncAnimation(fig, draw_demand_bar, frames=range(1961, 2021),\
# fargs=(caloric_difference_europe, ),
# repeat=False);
# #HTML(animator.to_jshtml())
# with open(path_dict['european_demand_anim'], "w") as f:
# print(animator.to_html5_video(), file=f)
IFrame(src='https://manuleo.github.io/mADAm_files/european_cal_diff_animation.html', width = 1200, height=600)
As we can see, Europe has experienced periods of high affluence since 1961. Over the past 50 years, only in three years did some countries present a really small deficit in calories.
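Those deficit years can be recovered programmatically; a sketch on a toy surplus table (the Albania/Austria 1961–1963 figures are rounded from the head shown earlier):

```python
import pandas as pd

# Toy version of caloric_difference_europe: countries as rows, years as columns,
# values in kcal/person/day.
diff = pd.DataFrame({1961: [-13.83, 865.62], 1962: [6.44, 870.78], 1963: [-78.36, 928.90]},
                    index=["Albania", "Austria"])

# Years in which at least one country shows a negative value, i.e. a deficit:
deficit_years = [year for year in diff.columns if (diff[year] < 0).any()]
print(deficit_years)  # -> [1961, 1963]
```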
Finally, we draw a map of the European situation. Year 2020 is as usual the most important one, especially for Europe, since we are interested in the actual surplus for the rest of our analysis and since there has been basically no significant change over the last 50 years.
Note: map for 2020 is available here
Links for other years:
legend_name = "Estimation of kcal/person/day surplus"
for year in range(1970,2030,10):
europe_kcal_surplus_map = plot_map(caloric_difference_europe, path_dict['geoworld_json'], \
european_country_kv, year, "Greens", legend_name, legend_name, path_dict['european_kcal_surplus'] + str(year) + ".html", bins=8)
europe_kcal_surplus_map
IFrame(src='https://manuleo.github.io/mADAm_files/european_kcal_surplus2020.html', width = 1000, height=600)
As a consequence of the prosperity Europe has been living in since the post-war period, the available surplus hasn't changed much over the years. However, a slight increase is noticeable from 1970 to now.
With the information we gathered so far, we can proceed to a comparison between Africa and Europe, to then move on to an analysis on which European countries could potentially help African countries in need.
We have easily noticed that in Europe every country has more food than it needs to healthily survive. We will now plot a map that shows this difference more clearly.
# DataFrame.append is deprecated in recent pandas; pd.concat is equivalent here
afr_eu_country_kv = pd.concat([european_country_kv, african_country_kv])
afr_eu_country_kv = afr_eu_country_kv.sort_values(by='names')
geojson_afr_eu = geojson_world[geojson_world.id.isin(afr_eu_country_kv.codes)]
geojson_afr_eu.head()
id | name | geometry | |
---|---|---|---|
1 | AGO | Angola | MULTIPOLYGON (((16.32653 -5.87747, 16.57318 -6... |
2 | ALB | Albania | POLYGON ((20.59025 41.85540, 20.46317 41.51509... |
9 | AUT | Austria | POLYGON ((16.97967 48.12350, 16.90375 47.71487... |
12 | BEL | Belgium | POLYGON ((3.31497 51.34578, 4.04707 51.26726, ... |
13 | BEN | Benin | POLYGON ((2.69170 6.25882, 1.86524 6.14216, 1.... |
caloric_difference_world = pd.concat([caloric_difference_africa, caloric_difference_europe])
caloric_difference_world = caloric_difference_world.sort_index()
Below we plot the surplus/deficit of calories in Africa and Europe in the year of 2020, for the comparison mentioned before.
legend_name = "Estimation of kcal/person/day surplus"
bins = [min(caloric_difference_world[2020]), 0, 300, 600, 900, max(caloric_difference_world[2020])]
afr_eu_kcal_surplus_map = plot_map(caloric_difference_world, path_dict['geoworld_json'], \
afr_eu_country_kv, 2020, "RdYlGn", legend_name, legend_name, path_dict['world_kcal_surplus'] + str(2020) + ".html", bins)
afr_eu_kcal_surplus_map
IFrame(src='https://manuleo.github.io/mADAm_files/world/world_kcal_surplus2020.html', width = 1100, height=800)
save_map_data(geojson_afr_eu, afr_eu_country_kv, caloric_difference_world.T, "docs/json/cal_world/cal_world_{}.geojson", "docs/json/cal_world/cal_world_ticks.json",\
bins=[0, 300, 600, 900], year_start = 2020)
This map illustrates the previously stated observation that Europe is in a much better food situation than African countries in general. This is clearly visible: no country in Europe has a color "lower" than yellowish green, while in Africa we observe multiple countries colored in orange and even red! These African countries, which don't have enough food to feed some of their citizens, are the ones we consider in most dire need of help, and for them we will try to come up with opportunities for help in the future.
Another interesting thing to observe is that there seems to be a line dividing the hunger in the world into North and South. By looking at the plots generated by the above cell (located in visualization/world), we notice that this line has been moving further south over the decades, which means that the hunger situation is slowly improving as the years go by. The main reason we think such a line exists and is moving south is that the countries in the north of Africa are developing more rapidly than their sub-Saharan counterparts, and are subsequently able to provide a better quality of life for their citizens.
Furthermore, comparing the countries in Europe with the highest surplus to the African countries with the highest deficit is of interest as well. We consider the countries in need of help to be those that do not have a surplus of at least 300 kcal/person/day (orange and red in the map). This selection makes sense because the values we calculate are averaged over the whole population, and if a country has a low surplus, it may mean that a big slice of the population is still starving.
# The countries with deficit in calories
african_countries_to_help = caloric_difference_africa[caloric_difference_africa[2020].values < 300][2020].sort_values(ascending = False)
number_of_need_of_help = len(african_countries_to_help.index)
# The countries in Europe with highest surplus in calories
caloric_difference_highest_europe = caloric_difference_europe[2020].sort_values(ascending=False).head(number_of_need_of_help).sort_values(ascending=True)
grid = plt.GridSpec(1,2)
# negative african countries
plt.subplot(grid[0, 0]);
p = african_countries_to_help.plot(kind='barh', color=(african_countries_to_help > 0).map({True: 'orange', False: 'red'}),alpha=0.75, rot=0);
p.set_xlabel("deficit in the African countries [kcal/person/day]")
p.set_ylabel("African countries")
plt.title('Lowest surplus in African Countries in 2020' )
# highest surplus europe countries
plt.subplot(grid[0, 1]);
p1 = caloric_difference_highest_europe.plot(kind='barh', color=(caloric_difference_highest_europe > 900).map({True: 'g', False: 'lightgreen'}),alpha=0.75, rot=0);
p1.set_xlabel("surplus in the European countries [kcal/person/day]")
p1.set_ylabel("European countries")
plt.title('Highest surplus in European countries in 2020' )
plt.tight_layout()
plt.show()
By looking at the difference between the country most in need, Zambia, and the country with the highest surplus, Belgium, we see that Belgium alone has far more extra food than any of the lacking countries needs individually. As such, we think an interesting analysis is which European countries could solve African hunger on their own. Below we analyse how much Africa needs in total, and how much Belgium alone "can" provide.
need_in_africa = abs(african_countries_to_help[african_countries_to_help.values < 0].sum())
print("African countries with a deficit need {0:.2f} extra kcal/person/day in total to solve hunger.".format(need_in_africa))
African countries with a deficit need 1306.82 extra kcal/person/day in total to solve hunger.
print("Belgium has {0:.2f} kcal/person/day over their basic needs.".format(caloric_difference_highest_europe['Belgium']))
Belgium has 1354.46 kcal/person/day over their basic needs.
These numbers would need to be multiplied by the countries' populations to allow a direct comparison! We can say, however, that each person in Belgium could feed one person per day in each of the most starving countries in Africa, while still keeping some extra kcal (around 300). We proceed with the analysis of the total caloric need for the starving countries we picked out (less than 0 kcal/person/day).
# the african countries with negative surplus
deficit_africa = pd.DataFrame(african_countries_to_help[african_countries_to_help.values < 0])
# european countries with highest surplus (per person)
caloric_surplus_europe_high = pd.DataFrame(caloric_difference_europe[2020].sort_values(ascending=False).head(len(deficit_africa.index)).sort_values(ascending=True))
total_need_africa = deficit_africa.merge(pop_tot_africa[2020], left_index=True, right_index=True)
total_need_africa['Total deficit (kcal/year)'] = total_need_africa['2020_x'] * total_need_africa['2020_y']*365
total_need_africa = total_need_africa.drop(columns=['2020_x', '2020_y']).sort_values(by='Total deficit (kcal/year)', ascending=False)
total_need_africa
Total deficit (kcal/year) | |
---|---|
Namibia | -7.753115e+10 |
Central African Republic | -1.973362e+11 |
Congo | -2.194120e+11 |
Senegal | -2.408757e+11 |
Chad | -1.049659e+12 |
Kenya | -1.541405e+12 |
Zambia | -2.260161e+12 |
Madagascar | -2.498640e+12 |
Ethiopia | -5.256981e+12 |
total_extra_europe = caloric_surplus_europe_high.merge(pop_tot_europe[2020], left_index=True, right_index=True)
total_extra_europe['Total surplus (kcal/year)'] = total_extra_europe['2020_x'] * total_extra_europe['2020_y']*365
total_extra_europe = total_extra_europe.drop(columns=['2020_x', '2020_y']).sort_values(by='Total surplus (kcal/year)')
total_extra_europe
Total surplus (kcal/year) | |
---|---|
Luxembourg | 2.603685e+11 |
Norway | 2.203824e+12 |
Ireland | 2.230506e+12 |
Greece | 4.316255e+12 |
Portugal | 4.321679e+12 |
Austria | 4.448890e+12 |
Belgium | 5.729637e+12 |
Italy | 2.754328e+13 |
France | 2.854461e+13 |
grid = plt.GridSpec(1,2);
ax0 = plt.subplot(grid[0, 0]);
total_need_africa.plot(kind='barh', color='r',alpha=0.75, rot=0, ax = ax0 );
plt.xlabel("Deficit in the African countries [kcal/year]")
plt.ylabel("African countries")
plt.title('Highest total deficits in African countries in 2020' )
ax1 = plt.subplot(grid[0, 1]);
total_extra_europe.plot(kind='barh', color='g',alpha=0.75, rot=0, ax = ax1);
plt.xlabel("Surplus in the European countries [kcal/year]")
plt.ylabel("European countries")
plt.title('Highest total surplus in European countries in 2020' )
plt.tight_layout()
plt.show()
We now see that France is actually the one with the highest total surplus (among the countries with the highest surplus per person). This is due to the fact that France has a much larger population than Belgium. We now repeat the previous analysis to see whether France alone could cover the sum of African deficits.
need_in_africa = abs(total_need_africa['Total deficit (kcal/year)'].sum())
print("African countries with a deficit need {0:.2E} extra kcal/year in total to solve hunger.".format(need_in_africa))
African countries with a deficit need 1.33E+13 extra kcal/year in total to solve hunger.
surplus_france = total_extra_europe['Total surplus (kcal/year)']['France']
print("France has {0:.2E} kcal/year over their basic needs.".format(surplus_france))
France has 2.85E+13 kcal/year over their basic needs.
From this analysis, we see that the total surplus in France is more than double the sum of deficits in African countries. This result shows that the European countries with a high surplus per person are indeed very capable of feeding a large share of the starving people in Africa.
The final interesting analysis we want to look into is the actual total surplus of each European country: we multiply the surplus per person by the country's population and then select the countries with the highest values.
total_europe = pd.DataFrame(caloric_difference_europe[2020]).merge(pop_tot_europe[2020], left_index=True, right_index=True)
total_europe['Total surplus (kcal/year)'] = total_europe['2020_x'] * total_europe['2020_y']*365
total_europe = total_europe.drop(columns=['2020_x', '2020_y']).sort_values(by='Total surplus (kcal/year)', ascending=False).head(len(deficit_africa.index)).sort_values(by='Total surplus (kcal/year)')
total_europe
Total surplus (kcal/year) | |
---|---|
Romania | 6.409110e+12 |
Ukraine | 1.256577e+13 |
Spain | 1.312546e+13 |
Poland | 1.432861e+13 |
United Kingdom | 2.619960e+13 |
Italy | 2.754328e+13 |
France | 2.854461e+13 |
Germany | 3.301847e+13 |
Russian Federation | 5.501934e+13 |
p1 = total_europe.plot(kind='barh', color='g',alpha=0.75, rot=0);
p1.set_xlabel("Surplus in the European countries [kcal/year]")
p1.set_ylabel("European countries")
plt.title('Highest total surplus in European countries in 2020' )
plt.show()
Because the Russian Federation has such a large population, we see that, even though its surplus per person is not very high compared to other European countries, its total surplus actually exceeds the other countries' surpluses. For the sake of consistency, we print below the actual total surplus of the Russian Federation and compare it to the total deficit in Africa.
print("African countries with a deficit need {0:.2E} extra kcal/year in total to solve hunger.".format(need_in_africa))
African countries with a deficit need 1.33E+13 extra kcal/year in total to solve hunger.
total_surplus_russia = total_europe['Total surplus (kcal/year)']['Russian Federation']
print("The Russian Federation has {0:.2E} kcal/year over their basic needs.".format(total_surplus_russia))
The Russian Federation has 5.50E+13 kcal/year over their basic needs.
Shifting the focus to each country's total surplus, we arrive at the conclusion that the Russian Federation alone has a surplus roughly four times larger than what is needed to solve hunger in the African countries with a deficit.
With this analysis, we can see that Europe is very likely able to solve the hunger issue in Africa by giving up a small fraction of the extra calories that are currently not being put to use.
The results we came up with are satisfactory, as they accurately reproduce what was expected, and they allowed for interesting analyses. We decided to conduct an analysis on Europe too, as it's interesting to see how much spare food would be available for redistribution, European countries being the most likely to provide help. Furthermore, we decided to restrict further analysis to 2020, as the interest is in the contemporary situation. Nine African countries were found to show an overall deficit in supply, meaning that, unlike other African countries, they are not capable of solving their issues themselves.
The idea of optimizing the distribution of products from Europe to Africa presupposes knowing which countries will help and which countries will be helped. In this context, we have chosen Italy, France, Germany, the United Kingdom and Spain as the helper European countries. This decision is justified by several reasons:
As regards African countries, our plan is to primarily help the countries showing a deficit predicted by our initial model. In addition, we will extend the scope to other African countries that show a weak surplus. As such, we set a threshold of 300 kcal/person/day for food availability: if a country is below this upper bound, it will be helped to reach it. For the sake of completeness, we want to let the reader know that our model, while accurate in many ways, still doesn't take into account different phenomena that can alter our values, such as wealth distribution, civil wars, climate disasters, etc.
Let's have a look at the countries we need to help:
af_real_deficit = np.abs(300 - african_countries_to_help)
af_real_deficit = af_real_deficit.to_frame()
af_real_deficit.tail()
2020 | |
---|---|
Central African Republic | 411.940700 |
Ethiopia | 425.280446 |
Chad | 475.076246 |
Madagascar | 547.213290 |
Zambia | 636.827454 |
Plotting the countries vs the amount of kilocalories they need to receive in order to reach the desired availability of 300 kcal/person/day:
sns.barplot(y="index", x=2020, data=af_real_deficit.reset_index(), palette="Reds")
plt.xlabel("Amount of kilocalories needed [kcal/person/day]")
plt.ylabel("African countries")
plt.title("Calculated amount of kcal to reach a threshold of 300 kcal/person/day in every African country [kcal/person/day]");
In the plot above we see that the darker red coloured countries, like Zambia, Madagascar and Chad, are the most in need. Note that the meaning is flipped here compared to the surplus plots, so the reader must keep the context in mind to interpret it correctly.
Now we will scale the problem to its real dimensions, i.e. we will multiply these values by the population of each country and by 365 days/year. The results keep the same structure, but the unit becomes kcal/year.
af_real_deficit_tot = af_real_deficit.merge(pop_tot_africa[2020], left_index=True, right_index=True)
af_real_deficit_tot['Total deficit (kcal/year)'] = af_real_deficit_tot['2020_x'] * af_real_deficit_tot['2020_y']*365
af_real_deficit_tot = af_real_deficit_tot.drop(columns=['2020_x', '2020_y']).sort_values(by='Total deficit (kcal/year)', ascending=False)
af_real_deficit_tot.head()
Total deficit (kcal/year) | |
---|---|
Ethiopia | 1.784549e+13 |
Kenya | 7.429362e+12 |
United Republic of Tanzania | 6.083874e+12 |
Madagascar | 5.530806e+12 |
Zambia | 4.273204e+12 |
Since we have settled on the countries that will be "the helpers", we now want to understand how much each of them should contribute to the cause. In particular, the goal is to define an index by which we will weight the contribution of each European country. According to this website, "the GDP per capita is a measure of a country's economic output that accounts for its number of people. It is a good measurement of a country's standard of living". This is exactly what we are looking for. Starting from the GDP per capita, it will be relatively easy to compute an effective weight for how prosperous a country is. The goal here is to determine a measure for how much of its surplus each country should give up, and the idea is to model the optimization problem so that the countries with the highest GDP, which should indeed contribute the most, are penalized. We will initially show all the countries and then narrow them down as mentioned before.
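To make the weighting idea concrete, here is one simple scheme sketched on hypothetical GDP-per-capita figures. Both the numbers and the plain normalization are assumptions for illustration; the actual optimization problem is built later with picos:

```python
import pandas as pd

# Hypothetical GDP-per-capita values in US$ (for illustration only).
gdp_per_capita = pd.Series({"Germany": 44500.0, "United Kingdom": 39700.0,
                            "France": 38500.0, "Italy": 32000.0, "Spain": 28200.0})

# Normalize into weights summing to 1: richer countries carry a larger share
# of the contribution, which is the penalization idea described above.
weights = gdp_per_capita / gdp_per_capita.sum()
```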
We load the dataset from the FAO on Macro Statistics and Key Indicators:
gdp_europe = pd.read_csv("data/raw/Macro-Statistics_Key_Indicators_E_Europe.csv")
gdp_europe.head(2)
Area Code | Area | Item Code | Item | Element Code | Element | Unit | Y1970 | Y1970F | Y1970N | ... | Y2014N | Y2015 | Y2015F | Y2015N | Y2016 | Y2016F | Y2016N | Y2017 | Y2017F | Y2017N | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | Albania | 22008 | Gross Domestic Product | 6109 | Value Local Currency | millions | 11327.885280 | X | Data from UNSD AMA | ... | Data from UNSD AMA | 1.434307e+06 | X | Data from UNSD AMA | 1.475251e+06 | X | Data from UNSD AMA | 1.552983e+06 | X | Data from UNSD AMA |
1 | 3 | Albania | 22008 | Gross Domestic Product | 6110 | Value US$ | millions | 2265.577056 | X | Data from UNSD AMA | ... | Data from UNSD AMA | 1.138692e+04 | X | Data from UNSD AMA | 1.188368e+04 | X | Data from UNSD AMA | 1.303936e+04 | X | Data from UNSD AMA |
2 rows × 151 columns
We are interested in the Gross Domestic Product in US Dollars, so we now filter on Element Code and Item Code. In particular, Item Code 22008 corresponds to "Gross Domestic Product" and Element Code 6110 corresponds to "Value US$".
#filtering element code and item code
gdp_europe = gdp_europe[(gdp_europe["Element Code"]==6110) & (gdp_europe["Item Code"]==22008)]
A short note on the dataset: the most recent observation is from 2017, while elsewhere in our analysis we often predicted missing data up to 2020. Here we decide not to, since we are not interested in the exact value of the GDP but rather in an index; we therefore accept the error, which will be negligible.
We now clean up the dataframe, dropping the information we don't need:
#Delete all the columns but Area and Y2017, set the index as Area (name of the country)
gdp_europe = gdp_europe[["Area","Y2017"]].set_index("Area")
#Rename column
gdp_europe = gdp_europe.rename(columns={"Y2017": "GDP [millions US$]"})
#Drop Malta as it's not in our scope
gdp_europe = gdp_europe.drop("Malta")
#Filter down to the countries in our main list european_country_kv
gdp_europe = gdp_europe[gdp_europe.index.isin(european_country_kv.names)]
gdp_europe.head()
GDP [millions US$] | |
---|---|
Area | |
Albania | 13039.355905 |
Austria | 416835.975860 |
Belarus | 54441.189058 |
Belgium | 494763.551890 |
Bosnia and Herzegovina | 18169.075913 |
Let's visualize quickly this result:
p = sns.barplot(y="Area", x="GDP [millions US$]",\
data=gdp_europe.sort_values(by="GDP [millions US$]",ascending=True).reset_index()\
,palette="Greens")
p.set(xscale="log") #log axis: we are more interested in the order of magnitude than in the exact value
plt.xlabel("GDP [millions US$] (log scale)")
plt.ylabel("European countries")
plt.title("GDP for European countries [millions US$]");
As we can see, Germany, the United Kingdom, France, Italy and Spain are the countries with the highest GDP in Europe, which justifies our design choice of narrowing the helper countries down to these five. Since we are not yet accounting for population, the Russian Federation also shows up in the list of richest countries. Let's now analyse what happens if we divide by population: we expect Switzerland and Luxembourg to move to the top of the list, while the Russian Federation loses positions.
#create dataframe gdp_capita
gdp_capita = pd.DataFrame(index=gdp_europe.index)
#retrieve population data
pop_tot_2020 = pop_tot_europe[[2020]]
#dividing GDP by population
gdp_capita["GDP kUS$/capita"] = gdp_europe.values/pop_tot_2020.values*1000
gdp_capita.head()
GDP kUS$/capita | |
---|---|
Area | |
Albania | 4.531015 |
Austria | 46.282197 |
Belarus | 5.761386 |
Belgium | 42.690245 |
Bosnia and Herzegovina | 5.537976 |
Plotting the results:
p = sns.barplot(y="Area", x="GDP kUS$/capita",\
data=gdp_capita.sort_values(by="GDP kUS$/capita",ascending=True).reset_index()\
,palette="Greens")
plt.xlabel("GDP per capita [kUS$/capita]")
plt.ylabel("European countries")
plt.title("GDP per capita for European countries [kUS$/capita]");
Our expectations are met: Switzerland and Luxembourg are the richest countries once the well-being of each person is considered. From now on, we will use the GDP per capita as the weight indicating to what extent each country should contribute to the African cause.
In this chapter we introduce a method to redistribute food among the African countries in need. The idea behind the implementation is to minimize the amount of food that has to be delivered by our five best European countries, while respecting the demand constraint as well as the constraint on the food availability of each European country. It is very important to model the problem so that the richest countries with a higher food availability have to give up more, and vice versa. It is now much clearer how the analysis on GDP will come in handy.
The objective function to be minimized is a non-negative weighted quadratic sum of food [kcal/year]. More specifically, the weights we designed account for both the GDP of a country and its food availability [kcal/year]. This modelling choice is justified by the fact that a rich country with a large surplus should contribute more than a relatively poor country with fewer possibilities.
The problem we want to model is a Quadratic Program with linear constraints (QP). The choice of a quadratic objective function is due to the fact that we need to distribute the resources evenly: the marginal cost of giving food away grows with the amount already given. It makes sense to say that the more a country has given up of its surplus, the less that same country should contribute further while other countries have not yet contributed at all.
In particular:
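The program can be sketched as follows (a plausible reconstruction based on the inputs later passed to QPSolverEuropeAfrica; the exact weighting inside the solver may differ):

$$
\begin{aligned}
\min_{x_{ij}\ge 0}\quad & \sum_{i=1}^{m} w_i \sum_{j=1}^{n} x_{ij}^{2}\\
\text{subject to}\quad & \sum_{i=1}^{m} x_{ij} = d_j \quad \forall j=1,\dots,n\\
& \sum_{j=1}^{n} x_{ij} \le s_i \quad \forall i=1,\dots,m
\end{aligned}
$$

where $x_{ij}$ is the food [kcal/year] sent from European country $i$ to African country $j$, $s_i$ its surplus, $d_j$ the deficit of African country $j$, and $w_i = 1 - \tilde{g}_i$ with $\tilde{g}_i$ the min-max scaled GDP per capita, so that richer countries face a smaller quadratic penalty and are therefore driven to give away more.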
The first step is then to normalize the GDP per capita values, scaling them to the interval [0,1]:
scaler = MinMaxScaler()
scaler.fit(gdp_capita.values.reshape(-1,1)) #Min Max of sklearn used
gdp_europe_capita_scaled = pd.DataFrame(index=gdp_capita.index) #create new dataframe
gdp_europe_capita_scaled["GDP/capita scaled"] = np.round_(scaler.transform(gdp_capita.values.reshape(-1,1)),6) #round values in order to have clean numbers
gdp_europe_capita_scaled.head()
GDP/capita scaled | |
---|---|
Area | |
Albania | 0.025796 |
Austria | 0.453856 |
Belarus | 0.038410 |
Belgium | 0.417029 |
Bosnia and Herzegovina | 0.036120 |
fig = px.bar(gdp_europe_capita_scaled.reset_index(), x='Area', y='GDP/capita scaled')
fig.show()
We narrow down to the countries shown below in best_countries:
best_countries = ["Italy","France","Spain","United Kingdom","Germany"]
Let's see the scaled GDPs of our best countries:
gdp_europe_capita_scaled.loc[best_countries]
GDP/capita scaled | |
---|---|
Area | |
Italy | 0.308962 |
France | 0.384979 |
Spain | 0.267551 |
United Kingdom | 0.376729 |
Germany | 0.431279 |
In the code below, we present how the bubble plot used for helper country selection was generated.
maximum = max(total_cal_need_europe[2020])
sizes = [x/maximum*100 for x in total_cal_need_europe[2020]]
fig = go.Figure()
fig.add_trace(go.Scatter(
x=gdp_capita["GDP kUS$/capita"],
y=caloric_difference_europe[2020],
mode="markers",
marker=dict(size=sizes,
color=total_cal_need_europe[2020],
colorbar=dict(title="Total caloric surplus [kcal/year]")
),
text=total_cal_need_europe.index,
name="",
hovertemplate="<b>%{text}</b><br><br>GDP per capita: %{x:.2f}e+3 US$<br>Surplus per capita: %{y:.2f} kcal/person/day<br>Total surplus: %{marker.color:.2e} kcal/year"))
fig.update_layout(
title = dict(text="Viability of European donor nations", font=dict(size=20)),
plot_bgcolor='rgb(256,256,256)',
xaxis=dict(type="log", gridcolor="#eee", title="GDP per capita [thousands US$]"),
yaxis=dict(gridcolor="#eee", title="Caloric surplus [kcal/person/day]"),
)
py.plot(fig, filename='docs/_includes/gdp_surplus.html', include_plotlyjs=False)
IFrame(src='https://manuleo.github.io/mADAm_files/gdp_surplus.html', width = 1000, height=600)
Note that the design choice of selecting the 5 best European countries is to a large extent justified. The countries in the top right-hand side of the plot are the ones with the highest surplus and GDP. Moreover, we see that Italy, the UK, Spain, Germany and France are indeed the countries with the most total surplus (considering the population). Russia will not be considered because its position on the plot suggests that, even though its total surplus is high, its GDP is relatively low compared to the other countries.
Create auxiliary dataframes to store the coefficients:
- europe_plus_val and europe_plus_index will store respectively the European surplus and the corresponding country names
- africa_deficit_val and africa_deficit_index will store respectively the African deficit and the corresponding country names
- gdp_europe_val and gdp_europe_index will store respectively the normalized GDP of the European countries and the corresponding country names
europe_plus_val = total_europe.loc[best_countries].astype(float).values
europe_plus_index = total_europe.loc[best_countries].index
africa_deficit_val= np.abs(af_real_deficit_tot.values).astype(float)
africa_deficit_index = af_real_deficit_tot.index
gdp_europe_val = gdp_europe_capita_scaled.loc[best_countries].astype(float).values
gdp_europe_index = gdp_europe_capita_scaled.loc[best_countries].index
According to the documentation of PICOS, the library we adopted for convex optimization, quadratic programs cannot be implemented yet. As a consequence, we model the problem with a MATLAB function, QPSolverEuropeAfrica, located in the repository. It is possible to run MATLAB functions in Jupyter notebooks by importing matlab.engine: the function is then called and executed by a temporary MATLAB kernel created in the Jupyter session. In MATLAB we can use YALMIP, a very useful layer that translates the problem for different solvers. We have a free academic license for GUROBI, a well-known solver, so we will go with it.
import matlab.engine
euc = europe_plus_val; #european countries surplus matrix size 1xm
afc = africa_deficit_val; #african countries deficit matrix size 1xn
gwp = 1-gdp_europe_val; # weights: 1 - scaled GDP per capita, so richer countries are penalized less
#Casting to matlab variables
euc = matlab.double(euc.tolist())
afc = matlab.double(afc.tolist())
gwp = matlab.double(gwp.tolist())
#Start matlab engine
eng = matlab.engine.start_matlab()
#Initializing QPSolverEuropeAfrica that will store the outputs of the matlab function
QPSolverEuropeAfrica = eng.QPSolverEuropeAfrica(euc,afc,gwp,nargout=1)
#Quitting the matlab engine
eng.quit()
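For readers without a MATLAB license, an equivalent small-scale QP can be sketched in pure Python with scipy.optimize. The numbers below are toy values, not the real surpluses and deficits, and the exact weighting inside QPSolverEuropeAfrica may differ:

```python
# Illustrative cross-check of the weighted QP with toy data (not the real values).
import numpy as np
from scipy.optimize import minimize

surplus = np.array([100.0, 80.0])   # European surpluses (toy)
deficit = np.array([30.0, 40.0])    # African deficits (toy)
weights = np.array([0.4, 0.7])      # 1 - scaled GDP per capita (toy)
m, n = len(surplus), len(deficit)

def objective(x):
    X = x.reshape(m, n)
    return np.sum(weights[:, None] * X**2)  # sum_i w_i * sum_j x_ij^2

constraints = (
    # each African deficit must be covered exactly
    [{"type": "eq", "fun": lambda x, j=j: x.reshape(m, n)[:, j].sum() - deficit[j]}
     for j in range(n)]
    # no country can give more than its surplus
    + [{"type": "ineq", "fun": lambda x, i=i: surplus[i] - x.reshape(m, n)[i].sum()}
       for i in range(m)]
)

res = minimize(objective, x0=np.full(m * n, 15.0), bounds=[(0, None)] * (m * n),
               constraints=constraints, method="SLSQP")
X_opt = res.x.reshape(m, n)
print(np.round(X_opt, 2))
```

Within each African column, the optimum splits the deficit in proportion to 1/w_i, so the country with the lower weight (richer in our convention) gives more.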
#PICKLE
#food_opt_distribution_df.to_pickle('data/processed/food_opt_distribution_df.pkl') #create pickle
food_opt_distribution_df = pd.read_pickle('data/processed/food_opt_distribution_df.pkl')
The optimization was successful. We store the results in the dataframe food_opt_distribution_df. Let's have a look at them:
food_opt_distribution = np.round(np.asarray(QPSolverEuropeAfrica),4)
food_opt_distribution_df= pd.DataFrame(data=food_opt_distribution,index = europe_plus_index, columns=africa_deficit_index)
food_opt_distribution_df.head()
Ethiopia | Kenya | United Republic of Tanzania | Madagascar | Zambia | Sudan | Uganda | Chad | Mozambique | Senegal | ... | Congo | Central African Republic | Liberia | Namibia | Togo | Botswana | Guinea-Bissau | Eswatini | Lesotho | Djibouti | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Italy | 3.309959e+12 | 1.377777e+12 | 1.128190e+12 | 1.025596e+12 | 7.923125e+11 | 6.504120e+11 | 5.984561e+11 | 5.279925e+11 | 4.332332e+11 | 3.844247e+11 | ... | 1.524224e+11 | 1.343460e+11 | 7.543230e+10 | 6.563087e+10 | 6.366816e+10 | 3.657805e+10 | 3.001846e+10 | 2.303602e+10 | 8.877370e+09 | 8.567958e+09 |
France | 3.719690e+12 | 1.548689e+12 | 1.268253e+12 | 1.152979e+12 | 8.908609e+11 | 7.314214e+11 | 6.730436e+11 | 5.938707e+11 | 4.873991e+11 | 4.325579e+11 | ... | 1.718800e+11 | 1.515692e+11 | 8.537378e+10 | 7.436089e+10 | 7.215559e+10 | 4.171712e+10 | 3.434677e+10 | 2.650129e+10 | 1.059262e+10 | 1.024497e+10 |
Spain | 3.122538e+12 | 1.299598e+12 | 1.064122e+12 | 9.673289e+11 | 7.472343e+11 | 6.133565e+11 | 5.643380e+11 | 4.978583e+11 | 4.084565e+11 | 3.624075e+11 | ... | 1.435221e+11 | 1.264677e+11 | 7.088484e+10 | 6.163756e+10 | 5.978582e+10 | 3.422732e+10 | 2.803860e+10 | 2.145092e+10 | 8.092775e+09 | 7.800856e+09 |
United Kingdom | 3.670387e+12 | 1.528123e+12 | 1.251399e+12 | 1.137651e+12 | 8.790027e+11 | 7.216737e+11 | 6.640686e+11 | 5.859437e+11 | 4.808814e+11 | 4.267661e+11 | ... | 1.695387e+11 | 1.494968e+11 | 8.417754e+10 | 7.331042e+10 | 7.113431e+10 | 4.109874e+10 | 3.382595e+10 | 2.608432e+10 | 1.038623e+10 | 1.004317e+10 |
Germany | 4.022919e+12 | 1.675176e+12 | 1.371909e+12 | 1.247251e+12 | 9.637936e+11 | 7.913740e+11 | 7.282437e+11 | 6.426252e+11 | 5.274857e+11 | 4.681798e+11 | ... | 1.862799e+11 | 1.643157e+11 | 9.273118e+10 | 8.082172e+10 | 7.843688e+10 | 4.552040e+10 | 3.755002e+10 | 2.906583e+10 | 1.186203e+10 | 1.148607e+10 |
5 rows × 25 columns
Let's compute now the total amount that every European country has to give up to save Africa:
#summing over rows of optimal_distribution_df
eu_countries_giveup = food_opt_distribution.sum(axis=1)
#storing results in a new dataframe eu_countries_giveup
eu_countries_giveup = pd.DataFrame({'Total giveup [kcal/year]': eu_countries_giveup}, index=best_countries)
eu_countries_giveup.head()
Total giveup [kcal/year] | |
---|---|
Italy | 1.201073e+13 |
France | 1.351072e+13 |
Spain | 1.132461e+13 |
United Kingdom | 1.333023e+13 |
Germany | 1.462081e+13 |
The dataframe we load below contains nutritional information about a very wide variety of food products. The information we are most interested in for our analysis, per product, is its food group and name together with its protein, carbohydrate and fat content:
#load dataset USDA-Food
usda_foods = pd.read_excel("data/raw/USDA-Food.xlsx", sheet_name=0)
food_properties = pd.DataFrame(usda_foods[['Food Group', 'Food Name', 'Protein (g)', 'Carbohydrates (g)', 'Fat (g)']])
food_properties.head()
Food Group | Food Name | Protein (g) | Carbohydrates (g) | Fat (g) | |
---|---|---|---|---|---|
0 | Dairy and Egg Products | Butter, salted | 0.85 | 0.06 | 81.11 |
1 | Dairy and Egg Products | Butter, whipped, with salt | 0.49 | 2.87 | 78.30 |
2 | Dairy and Egg Products | Butter oil, anhydrous | 0.28 | 0.00 | 99.48 |
3 | Dairy and Egg Products | Cheese, blue | 21.40 | 2.34 | 28.74 |
4 | Dairy and Egg Products | Cheese, brick | 23.24 | 2.79 | 29.68 |
#checking size of the database
food_properties.index.size
6347
There are a lot of possible products in this database, but many of them could be unsuitable for our purpose. As a first step, we check the available 'Food Group' values and remove the ones we won't need.
We plot them for a clearer picture.
food_properties["Food Group"].value_counts().plot(kind="barh")
plt.title("Distribution of Food Group inside the USDA database")
plt.xlabel("Counts")
plt.ylabel("Food Group");
As expected, not all of the most represented groups are healthy, and others are not suitable for our purpose (for example "Baby Foods"). Therefore, we will no longer consider the following groups:
not_suit = ["Sweets","Snacks","Beverages", "Baked Products","Spices and Herbs", "Nut and Seed Products", \
"Dairy and Egg Products", "Breakfast Cereals", "Baby Foods", "Soups, Sauces, and Gravies", \
"Spices and Herbs", "Nut and Seed Products", "Fats and Oils"]
food_properties = food_properties[~food_properties["Food Group"].isin(not_suit)].reset_index(drop=True)
food_properties.head()
Food Group | Food Name | Protein (g) | Carbohydrates (g) | Fat (g) | |
---|---|---|---|---|---|
0 | Poultry Products | Chicken, broiler, rotisserie, BBQ, breast meat... | 28.04 | 0.00 | 3.57 |
1 | Poultry Products | Chicken, broilers or fryers, meat and skin and... | 18.33 | 0.13 | 14.83 |
2 | Poultry Products | Chicken, broilers or fryers, meat and skin and... | 22.84 | 9.03 | 17.53 |
3 | Poultry Products | Chicken, broilers or fryers, meat and skin and... | 28.57 | 3.27 | 15.27 |
4 | Poultry Products | Chicken, broilers or fryers, meat and skin and... | 26.78 | 0.06 | 13.27 |
New check on dimension:
food_properties.index.size
4240
From a first look at the head of the dataset we also see that USDA-Food
contains many processed products. As we are interested in raw products for our analysis, we keep only those:
food_properties = food_properties[food_properties["Food Name"].str.contains("raw")]
food_properties.index.size
1283
We have a set of 1283 products from which we have to choose the possible diet.
We keep this dataset as it is for now and move to the next part.
An important factor in our upcoming choice of products will be the cost of producing them in the European countries. For this reason, we now introduce two new datasets and use them to further filter the USDA-Food
data.
The new datasets are the European market prices (eu_prices, loaded from europe_food_prices.csv) and the FAO producer prices for cereals and meat (fao_prices). We use both because the FAO one alone is not complete enough for our purpose.
Note: the FAO data run from 2014 to 2018, while eu_prices goes from 2014 to 2019. We accept these prices even though the periods differ, since no better data are available online; we simply pick the most recent price of each available product.
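The "most recent price per product" rule can be sketched on a toy frame with the sort_values + groupby().first() pattern used throughout this section (the rows below are made up for illustration):

```python
import pandas as pd

# Toy frame: two observations for Chicken, one for Oranges
toy = pd.DataFrame({
    "Product": ["Chicken", "Chicken", "Oranges"],
    "Period":  [201801, 201910, 201905],
    "Price":   [250.0, 268.11, 80.0],
})

# Sort newest-first, then keep the first (i.e. most recent) row per product
latest = (toy.sort_values("Period", ascending=False)
             .groupby("Product").first().reset_index())
print(latest)
```

groupby preserves the row order within each group, so after the descending sort .first() returns the most recent observation for every product.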
# loading the new datasets
eu_prices = pd.read_csv("data/raw/europe_food_prices.csv", usecols=["Category", "Product desc", "Unit", "Country", "Period", "MP Market Price"])\
.rename(columns={"Product desc": "Product", "MP Market Price":"Price"})
eu_prices.head(1)
Category | Product | Unit | Country | Period | Price | |
---|---|---|---|---|---|---|
0 | Animal Products | Chicken | €/100 kg carcass weight | AT | 201910 | 268.11 |
To keep the prices reasonable, we consider only the last five years, from 2014 to October 2019:
eu_prices = eu_prices[eu_prices.Period.between(201400, 201910)]
Check the availability inside this dataset:
eu_prices.Product.unique()
array(['Chicken', 'Raw Milk', 'SMP', 'WMP', 'Whey Powder', 'Butter', 'Butter Oil', 'Emmental', 'Cheddar', 'Edam', 'Gouda', 'Apricots', 'Avocados', 'Cherries', 'Lemons', 'Clementines', 'Strawberries', 'Kiwis', 'Mandarins', 'Melons', 'Nectarines', 'Oranges', 'Watermelons', 'Peaches', 'Pears', 'Apples Braeburn', 'Apples Boskoop', 'Apples Cox', 'Apples Elstar', 'Apples Fuji', 'Apples Gala', 'Apples Golden', 'Apples Granny', 'Apples Idared', 'Apples Jonagold', 'Apples Red', 'Apples Shampion', 'Plums', 'Grapes', 'Satsumas', 'Garlic', 'Asparagus', 'Eggplants', 'Cabbages', 'Carrots', 'Mushrooms', 'Cauliflowers', 'Cucumbers', 'Courgettes', 'Beans', 'Lettuces', 'Leeks', 'Onions', 'Peppers', 'Tomatoes Cherry', 'Tomatoes Trusses', 'Tomatoes Round', 'Feed Oats', 'Milling Oats', 'Feed Wheat', 'Bread Wheat', 'Durum Wheat', 'Feed Maize', 'Malting Barley', 'Feed Barley', 'Feed Rye', 'Bread Rye', 'Lampante Olive Oil', 'Extra Virgin Olive Oil', 'Virgin Olive Oil', 'White Sugar'], dtype=object)
There is a good number of representatives from all the categories in the USDA-Food
dataset, but we are missing an important source of carbohydrates, rice, and there is no meat in the database.
For this reason, we scrape this information from FAO:
fao_prices = pd.read_csv("data/raw/fao_cereal_meat_prices_201418.csv", usecols=["Area", "Item", "Value", "Year"])
fao_prices
Area | Item | Year | Value | |
---|---|---|---|---|
0 | Albania | Barley | 2014 | 431.4 |
1 | Albania | Barley | 2015 | 341.4 |
2 | Albania | Barley | 2016 | 319.7 |
3 | Albania | Barley | 2017 | 303.6 |
4 | Albania | Barley | 2018 | 302.2 |
... | ... | ... | ... | ... |
1976 | United Kingdom | Wheat | 2014 | 256.7 |
1977 | United Kingdom | Wheat | 2015 | 189.4 |
1978 | United Kingdom | Wheat | 2016 | 162.0 |
1979 | United Kingdom | Wheat | 2017 | 158.3 |
1980 | United Kingdom | Wheat | 2018 | 218.8 |
1981 rows × 4 columns
Note: FAO data are in USD/ton
At this point we need to restrict our product analysis. In the Which European countries can help Africa? section we identified the countries with the highest surplus in Europe, which are able to help Africa by giving away only a small fraction of their surplus.
From that classification we now remove the Russian Federation, because its surplus is so large that it could solve the African problem alone (and we want a fair share among the richest European countries), together with Poland, Ukraine and Romania, because they are smaller nations.
We have thus built the set of the top 5 countries that will help Africa: France, Italy, United Kingdom, Germany and Spain.
# defining countries
best_countries = ["France", "Italy", "United Kingdom", "Germany", "Spain"]
best_countries_code = ["FR", "IT", "UK", "DE", "ES"] # codes for the eu_prices
# filtering our datasets
eu_prices = eu_prices[eu_prices.Country.isin(best_countries_code)]
fao_prices = fao_prices[fao_prices.Area.isin(best_countries)]
Considering only the most recent prices in the FAO cereal dataset
fao_prices = fao_prices.sort_values("Year", ascending=False).groupby(["Area", "Item"])\
.first().reset_index().drop(columns="Year")
Let's now take a look at the products in the FAO cereal dataset:
fao_prices.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO cereal and meat dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
Two fundamental problems emerge here: we don't have the price of rice and of meat (cattle) in all the countries of interest, so we need to impute them.
We start with the rice: we try to find a relationship between the products we have in all the countries (Wheat and Oats) and the rice.
rice_df = fao_prices[fao_prices.Item=="Rice, paddy"]
rice_df = rice_df.merge(fao_prices, on="Area")
rice_df = rice_df[rice_df.Item_y.str.contains("Wheat|Oats")]
rice_df["Coeff"] = rice_df.Value_x/rice_df.Value_y
rice_df.sort_values("Item_y")
Area | Item_x | Value_x | Item_y | Value_y | Coeff | |
---|---|---|---|---|---|---|
11 | France | Rice, paddy | 372.0 | Oats | 145.7 | 2.553191 |
19 | Italy | Rice, paddy | 403.7 | Oats | 272.3 | 1.482556 |
37 | Spain | Rice, paddy | 344.6 | Oats | 171.3 | 2.011675 |
15 | France | Rice, paddy | 372.0 | Wheat | 195.0 | 1.907692 |
21 | Italy | Rice, paddy | 403.7 | Wheat | 233.9 | 1.725951 |
42 | Spain | Rice, paddy | 344.6 | Wheat | 217.3 | 1.585826 |
mean_coeff_oats = rice_df[rice_df["Item_y"] == "Oats"].Coeff.mean()
mean_coeff_wheat = rice_df[rice_df["Item_y"] == "Wheat"].Coeff.mean()
print("The mean coefficient for oats is", mean_coeff_oats)
print("The mean coefficient for wheat is", mean_coeff_wheat)
The mean coefficient for oats is 2.0158076390008994 The mean coefficient for wheat is 1.7398232052849225
As wheat appears to be the product whose ratio fluctuates least, we use it for the imputation.
wheat_germany = fao_prices[(fao_prices.Area == "Germany") & (fao_prices.Item=="Wheat")].Value.values[0]
row_germ = pd.DataFrame(np.array([["Germany", "Rice, paddy", '%.1f'%(mean_coeff_wheat*wheat_germany)]]), columns=["Area", "Item", "Value"])
wheat_uk = fao_prices[(fao_prices.Area == "United Kingdom") & (fao_prices.Item=="Wheat")].Value.values[0]
row_uk = pd.DataFrame(np.array([["United Kingdom", "Rice, paddy", '%.1f'%(mean_coeff_wheat*wheat_uk)]]), columns=["Area", "Item", "Value"])
fao_prices = fao_prices.append(row_germ)
fao_prices = fao_prices.append(row_uk).reset_index(drop=True)
fao_prices[fao_prices.Item=="Barley"]
Area | Item | Value | |
---|---|---|---|
0 | France | Barley | 195.1 |
16 | Germany | Barley | 195.4 |
39 | Spain | Barley | 203.7 |
60 | United Kingdom | Barley | 197.4 |
For the meat, we first look at the countries for which we have data, considering the most general item, cattle, which we take to be beef:
fao_prices[fao_prices.Item=="Meat, cattle"]
Area | Item | Value | |
---|---|---|---|
2 | France | Meat, cattle | 4549.5 |
18 | Germany | Meat, cattle | 4033.1 |
43 | Spain | Meat, cattle | 2774.8 |
61 | United Kingdom | Meat, cattle | 5328.2 |
We need to impute the price for Italy:
We multiply this price by 1.10 (the current EUR/USD exchange rate) and by 1000 (to convert to tonnes):
meat_italy = pd.DataFrame(np.array([["Italy", "Meat, cattle", '%.1f'%(3.42*1.10*1000)]]), columns=["Area", "Item", "Value"])
fao_prices = fao_prices.append(meat_italy).reset_index(drop=True)
For pork, we already have all the data, so there is no need to impute.
We replot to see the final products in the FAO dataset:
fao_prices.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO cereal and meat dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
We need to bring everything in eu_prices to a single unit:
eu_prices.Unit.unique()
array(['€/100 kg carcass weight', '€/100 kg', '€/100 kg net weight', '€/t', '€/100 Kg'], dtype=object)
Since we are working with large quantities and the FAO data are in USD/tonne, we convert everything to USD/tonne as well: multiply by 10 (€/100 kg → €/t) and then by 1.10 (EUR → USD).
eu_prices.loc[eu_prices['Unit'].str.contains('100'), 'Price'] = eu_prices.loc[eu_prices['Unit'].str.contains('100'), 'Price']\
.apply(lambda x: x*1.10*10)
eu_prices = eu_prices.drop(columns="Unit")
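For clarity, the conversion applied above can be written as a tiny helper (the 1.10 exchange rate is the one assumed throughout this notebook):

```python
EUR_USD = 1.10  # EUR/USD exchange rate assumed in this notebook

def eur_per_100kg_to_usd_per_tonne(price):
    # x10: €/100 kg -> €/t, then EUR -> USD
    return price * 10 * EUR_USD

# Example with the chicken price from the first eu_prices row (268.11 €/100 kg)
print(round(eur_per_100kg_to_usd_per_tonne(268.11), 2))  # → 2949.21
```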
We now check, as we did for the fao_prices
, whether the prices we have are well represented across all our countries.
The first step to this end is to keep only the most recent data for each product in each country.
eu_prices = eu_prices.sort_values("Period", ascending=False).groupby(["Country", "Category", "Product"])\
.first().reset_index().drop(columns="Period")
plt.figure(figsize=(20, 15))
eu_prices.Product.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the eu_countries dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
eu_prices.Product = eu_prices.Product.apply(lambda x: "Apples" if "Apples" in x else x)
eu_prices.Product = eu_prices.Product.apply(lambda x: "Tomatoes" if "Tomatoes" in x else x)
eu_prices.Product = eu_prices.Product.apply(lambda x: "Bread" if "Bread" in x else x)
# drop the merged product rows and re-add their per-country mean
apples_eu = eu_prices[eu_prices.Product=="Apples"].groupby(["Country", "Category", "Product"]).agg("mean").reset_index()
eu_prices = eu_prices[~(eu_prices["Product"]=="Apples")].reset_index(drop=True)
tomatoes_eu = eu_prices[eu_prices.Product=="Tomatoes"].groupby(["Country", "Category", "Product"]).agg("mean").reset_index()
eu_prices = eu_prices[~(eu_prices["Product"]=="Tomatoes")].reset_index(drop=True)
bread_eu = eu_prices[eu_prices.Product=="Bread"].groupby(["Country", "Category", "Product"]).agg("mean").reset_index()
eu_prices = eu_prices[~(eu_prices["Product"]=="Bread")].reset_index(drop=True)
eu_prices = eu_prices.append(apples_eu)
eu_prices = eu_prices.append(tomatoes_eu)
eu_prices = eu_prices.append(bread_eu)
plt.figure(figsize=(20, 15))
eu_prices.Product.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the eu_countries dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
Merging the varieties didn't solve the situation.
As a consequence, we decide to drop all the products present in fewer than 4 countries and focus our imputing efforts on the remaining ones.
counts_eu = eu_prices.Product.value_counts()
possible_products = counts_eu[counts_eu >= len(best_countries) - 1].index
eu_prices = eu_prices[eu_prices.Product.isin(possible_products)]
# replot to zoom on our interested products
eu_prices.Product.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the eu_countries dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
Printing the counts to understand which countries need imputation. A country with all the products should count 13:
eu_prices.Country.value_counts()
ES 13 DE 12 IT 12 FR 11 UK 10 Name: Country, dtype: int64
Only Spain has all the needed products, so we must impute the others. We therefore print the missing products for each country:
# set difference between the products
need_it = set(possible_products) - set(eu_prices[eu_prices.Country=="IT"].Product.values)
need_de = set(possible_products) - set(eu_prices[eu_prices.Country=="DE"].Product.values)
need_fr = set(possible_products) - set(eu_prices[eu_prices.Country=="FR"].Product.values)
need_uk = set(possible_products) - set(eu_prices[eu_prices.Country=="UK"].Product.values)
# printing
print("Missing product in Italy:", need_it)
print("Missing product in Germany:", need_de)
print("Missing product in France:", need_fr)
print("Missing product in United Kingdom:", need_uk)
Missing product in Italy: {'Malting Barley'} Missing product in Germany: {'Onions'} Missing product in France: {'Butter', 'Lettuces'} Missing product in United Kingdom: {'Apples', 'Cherries', 'Feed Maize'}
barley_italy = pd.DataFrame(np.array([["IT", "Vegetal Products", "Malting Barley", '%.2f'%(189*1.10)]]), columns=["Country", "Category", "Product", "Price"])
eu_prices = eu_prices.append(barley_italy).reset_index(drop=True)
Germany is missing the Onions; since we can safely expect that onions won't be part of our diet, we simply drop them for all the countries. We do the same for Butter, which is missing in France:
eu_prices = eu_prices[~(eu_prices.Product=="Onions")]
eu_prices = eu_prices[~(eu_prices.Product=="Butter")]
lettuce_france = pd.DataFrame(np.array([["FR", "Vegetable Products", "Lettuces", '%.2f'%(1269.3)]]), columns=["Country", "Category", "Product", "Price"])
eu_prices = eu_prices.append(lettuce_france).reset_index(drop=True)
eu_prices[(eu_prices.Product=="Feed Maize")]
Country | Category | Product | Price | |
---|---|---|---|---|
7 | DE | Vegetal Products | Feed Maize | 161.88 |
16 | ES | Vegetal Products | Feed Maize | 170.38 |
24 | FR | Vegetal Products | Feed Maize | 153.65 |
33 | IT | Vegetal Products | Feed Maize | 149.64 |
apples_uk = pd.DataFrame(np.array([["UK", "Fruit Products", "Apples", '%.2f'%(1343.4)]]), columns=["Country", "Category", "Product", "Price"])
cherries_uk = pd.DataFrame(np.array([["UK", "Fruit Products", "Cherries", '%.2f'%(4574.9)]]), columns=["Country", "Category", "Product", "Price"])
maize_uk = pd.DataFrame(np.array([["UK", "Vegetal Products", "Feed Maize", '%.2f'%(195)]]), columns=["Country", "Category", "Product", "Price"])
eu_prices = eu_prices.append(apples_uk)
eu_prices = eu_prices.append(cherries_uk)
eu_prices = eu_prices.append(maize_uk).reset_index(drop=True)
Check if everything went right:
# replot to zoom on our interested products
eu_prices.Product.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the eu_countries dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
As we can see, the Vegetables category is not well represented in our current dataset. For this reason, we bring in FAO data for vegetables to fill the gaps:
fao_vegetables_prc = pd.read_csv("data/raw/fao_vegetables_prices_201418.csv", usecols=["Area", "Item", "Value", "Year"])
fao_vegetables_prc = fao_vegetables_prc[fao_vegetables_prc.Area.isin(best_countries)]
fao_vegetables_prc = fao_vegetables_prc.sort_values("Year", ascending=False).groupby(["Area", "Item"])\
.first().reset_index().drop(columns="Year")
# plotting
fao_vegetables_prc.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO vegetables dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
As we did before, we take only the products present in at least 4 countries for the next analysis:
counts_fao_veg = fao_vegetables_prc.Item.value_counts()
possible_products_veg = counts_fao_veg[counts_fao_veg >= len(best_countries) - 1].index
fao_vegetables_prc = fao_vegetables_prc[fao_vegetables_prc.Item.isin(possible_products_veg)]
# replot to zoom on our interested products
fao_vegetables_prc.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO vegetables dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
We can also drop "Carrots and turnips" and "Lettuce and chicory", which are already present in the previous dataset:
fao_vegetables_prc = fao_vegetables_prc[~(fao_vegetables_prc.Item.isin(["Carrots and turnips", "Lettuce and chicory"]))]
possible_products_veg = possible_products_veg[~(possible_products_veg.isin(["Carrots and turnips", "Lettuce and chicory"]))]
# replot to zoom on our interested products
fao_vegetables_prc.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO vegetables dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
Let's see how these products are represented:
fao_vegetables_prc.Area.value_counts()
Spain 8 United Kingdom 8 France 8 Germany 8 Italy 2 Name: Area, dtype: int64
The only country with missing products is Italy. We need to manually impute the prices of these products:
# set difference between the products
need_it = set(possible_products_veg) - set(fao_vegetables_prc[fao_vegetables_prc.Area=="Italy"].Item.values)
# printing
print("Missing product in Italy:", need_it)
Missing product in Italy: {'Cabbages and other brassicas', 'Cauliflowers and broccoli', 'Mushrooms and truffles', 'Cucumbers and gherkins', 'Beans, green', 'Peas, green'}
The prices for the cucumbers and the cauliflowers can be found in the eu_prices
dataset:
eu_prices2 = pd.read_csv("data/raw/europe_food_prices.csv", usecols=["Category", "Product desc", "Unit", "Country", "Period", "MP Market Price"])\
.rename(columns={"Product desc": "Product", "MP Market Price":"Price"})
eu_prices2 = eu_prices2[eu_prices2.Country.isin(best_countries_code)]
eu_prices2 = eu_prices2[eu_prices2.Period.between(201400, 201910)]
eu_prices2 = eu_prices2.sort_values("Period", ascending=False).groupby(["Country", "Category", "Product"])\
.first().reset_index().drop(columns="Period")
eu_prices2[eu_prices2.Product=="Cucumbers"]
Country | Category | Product | Unit | Price | |
---|---|---|---|---|---|
63 | ES | Vegetable Products | Cucumbers | €/100 kg net weight | 49.81 |
101 | FR | Vegetable Products | Cucumbers | €/100 kg net weight | 114.55 |
144 | IT | Vegetable Products | Cucumbers | €/100 kg net weight | 68.00 |
eu_prices2[eu_prices2.Product=="Cauliflowers"]
Country | Category | Product | Unit | Price | |
---|---|---|---|---|---|
61 | ES | Vegetable Products | Cauliflowers | €/100 kg net weight | 25.20 |
99 | FR | Vegetable Products | Cauliflowers | €/100 kg net weight | 40.68 |
142 | IT | Vegetable Products | Cauliflowers | €/100 kg net weight | 62.71 |
# prices are in EUR/100 kg; we need USD/tonne (EUR->USD rate of 1.10, x10 to go from 100 kg to a tonne)
cucumbers = pd.DataFrame(np.array([["Italy", "Cucumbers and gherkins", 68.00*1.10*10]]), columns=["Area", "Item", "Value"])
cauliflowers = pd.DataFrame(np.array([["Italy", "Cauliflowers and broccoli", 62.71*1.10*10]]), columns=["Area", "Item", "Value"])
fao_vegetables_prc = fao_vegetables_prc.append(cucumbers)
fao_vegetables_prc = fao_vegetables_prc.append(cauliflowers).reset_index(drop=True)
The 2017 prices for peas and beans in Italy can be found here (page in Italian).
The price is 120.33 €/100 kg for the peas and 175.56 €/100 kg for the beans.
# prices are in EUR/100 kg; we need USD/tonne (EUR->USD rate of 1.10, x10 to go from 100 kg to a tonne)
peas = pd.DataFrame(np.array([["Italy", "Peas, green", 120.33*1.10*10]]), columns=["Area", "Item", "Value"])
beans = pd.DataFrame(np.array([["Italy", "Beans, green", 175.56*1.10*10]]), columns=["Area", "Item", "Value"])
fao_vegetables_prc = fao_vegetables_prc.append(peas)
fao_vegetables_prc = fao_vegetables_prc.append(beans).reset_index(drop=True)
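Since the €/100 kg → USD/tonne conversion is repeated for every manual imputation, it can be factored into a small helper (a sketch; the 1.10 EUR→USD rate is the one assumed above):

```python
EUR_TO_USD = 1.10  # approximate exchange rate assumed throughout


def eur_per_100kg_to_usd_per_tonne(price_eur):
    """Convert a price in EUR/100 kg to USD/tonne.

    1 tonne = 10 x 100 kg, so we scale by 10 and apply the EUR->USD rate.
    """
    return price_eur * EUR_TO_USD * 10
```

For example, the Italian cucumber price above becomes `eur_per_100kg_to_usd_per_tonne(68.00)`.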
Unfortunately, we didn't find any reliable prices for cabbages and mushrooms, so we drop them:
fao_vegetables_prc = fao_vegetables_prc[~(fao_vegetables_prc.Item.isin(["Cabbages and other brassicas", "Mushrooms and truffles"]))]
A final check on these products:
fao_vegetables_prc.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO vegetable dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
We now also map the country codes in eu_prices
to proper country names before merging.
dict_country = dict(zip(best_countries_code, best_countries))
eu_prices = eu_prices.replace({"Country":dict_country})
The final step is combining the three dataframes. We rename the needed products from the FAO datasets and adapt them to match eu_prices
counts_fao = fao_prices.Item.value_counts()
fao_products = counts_fao[counts_fao == len(best_countries)].index
fao_prices = fao_prices[fao_prices.Item.isin(fao_products)]
# dropping the no longer needed category column from eu
eu_prices.drop(columns="Category", inplace=True)
fao_prices = fao_prices.replace({"Item":{"Rice, paddy": "Rice", "Meat, cattle": "Meat"}})
fao_prices = fao_prices[["Area", "Item", "Value"]]
fao_prices = fao_prices.rename(columns={"Item": "Product", "Area":"Country", "Value":"Price"})
fao_vegetables_prc = fao_vegetables_prc.replace({"Item":{"Cauliflowers and broccoli": "Cauliflowers", "Peas, green": "Peas", "Beans, green": "Beans", "Cucumbers and gherkins": "Cucumbers"}})
fao_vegetables_prc = fao_vegetables_prc[["Area", "Item", "Value"]]
fao_vegetables_prc = fao_vegetables_prc.rename(columns={"Item": "Product", "Area":"Country", "Value":"Price"})
prices = eu_prices.append(fao_prices).append(fao_vegetables_prc).sort_values(by="Product").reset_index(drop=True)
prices.head()
Country | Product | Price | |
---|---|---|---|
0 | United Kingdom | Apples | 1343.40 |
1 | Germany | Apples | 597.443 |
2 | Spain | Apples | 898.04 |
3 | France | Apples | 1029.79 |
4 | Italy | Apples | 713.743 |
print("The final products present in the dataset are: {}".format(prices.Product.unique()))
The final products present in the dataset are: ['Apples' 'Beans' 'Bread' 'Carrots' 'Cauliflowers' 'Cherries' 'Chicken' 'Cucumbers' 'Feed Barley' 'Feed Maize' 'Lettuces' 'Malting Barley' 'Meat' 'Meat, pig' 'Oats' 'Peas' 'Potatoes' 'Raw Milk' 'Rice' 'Strawberries' 'Tomatoes' 'Wheat']
We have finally built our final prices dataset!
Now that we know the set of products we can choose from, it's time to filter the USDA_food
data accordingly
# defining the set of foods we can choose from
foods = prices.Product.unique()
food_properties[food_properties["Food Name"].str.contains('|'.join(foods), case=False)]
Food Group | Food Name | Protein (g) | Carbohydrates (g) | Fat (g) | |
---|---|---|---|---|---|
1 | Poultry Products | Chicken, broilers or fryers, meat and skin and... | 18.33 | 0.13 | 14.83 |
6 | Poultry Products | Chicken, broilers or fryers, meat and skin, raw | 18.60 | 0.00 | 15.06 |
11 | Poultry Products | Chicken, broilers or fryers, meat only, raw | 21.39 | 0.00 | 3.08 |
15 | Poultry Products | Chicken, broilers or fryers, skin only, raw | 13.33 | 0.00 | 32.35 |
20 | Poultry Products | Chicken, broilers or fryers, giblets, raw | 17.88 | 1.80 | 4.47 |
... | ... | ... | ... | ... | ... |
4052 | Beef Products | Beef, ground, 95% lean meat / 5% fat, raw | 21.41 | 0.00 | 5.00 |
4057 | Beef Products | Beef, ground, 90% lean meat / 10% fat, raw | 20.00 | 0.00 | 10.00 |
4062 | Beef Products | Beef, ground, 85% lean meat / 15% fat, raw | 18.59 | 0.00 | 15.00 |
4067 | Beef Products | Beef, ground, 80% lean meat / 20% fat, raw | 17.17 | 0.00 | 20.00 |
4072 | Beef Products | Beef, ground, 75% lean meat / 25% fat, raw | 15.76 | 0.00 | 25.00 |
269 rows × 5 columns
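One caveat on the filter above: `'|'.join(foods)` is interpreted as a regular expression by str.contains, so a product name containing regex metacharacters (e.g. parentheses) would be misread. Our product names happen to be safe, but escaping makes the filter robust (a sketch with hypothetical product names):

```python
import re
import pandas as pd

# hypothetical names, one containing regex metacharacters
foods = ["Peas (green)", "Apples"]
# escape metacharacters so each name is matched literally
pattern = "|".join(map(re.escape, foods))
names = pd.Series(["Peas (green), raw", "Bananas, raw"])
matches = names.str.contains(pattern, case=False)
```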
269 rows, but there are multiple matches. Let's look at the elements that don't have a match:
# find the set of total foods and set to lower
total_foods = food_properties["Food Name"].unique()
total_foods = [t.lower() for t in total_foods]
# also lowercase foods to check the matches
foods = [f.lower() for f in foods]
# find set of not matching
not_matching = [f for f in foods if not any(f in t for t in total_foods)]
len(not_matching)
10
not_matching
['cauliflowers', 'cucumbers', 'feed barley', 'feed maize', 'lettuces', 'malting barley', 'meat, pig', 'oats', 'raw milk', 'wheat']
We can try splitting the multi-word names into single words, removing duplicates afterwards:
foods = [f.replace(",","").replace("(", "").replace(")", "").split() for f in foods]
foods = [l for sublist in foods for l in sublist]
foods = list(set(foods))
The second step is removing all generated stopwords and adding singular forms:
stop_words = stopwords.words('english')
stop_words += ["feed", "raw"]
foods = [f for f in foods if f not in stop_words]
lem = WordNetLemmatizer()
singular = [lem.lemmatize(f) for f in foods]
foods += singular
foods = list(set(foods))
foods = [f for f in foods if f not in stop_words]
# printing the set of food
foods
['bean', 'carrots', 'lettuces', 'cauliflower', 'strawberries', 'oats', 'barley', 'pig', 'lettuce', 'milk', 'rice', 'cucumbers', 'maize', 'tomato', 'beans', 'cherries', 'meat', 'tomatoes', 'cucumber', 'cherry', 'potato', 'carrot', 'cauliflowers', 'malting', 'chicken', 'apples', 'apple', 'peas', 'strawberry', 'pea', 'wheat', 'potatoes', 'oat', 'bread']
Note that "meat" is too generic for our purpose, as we considered beef when analyzing the prices. For this reason, we remove "meat" and add "cow" and "beef":
foods = [f for f in foods if f!="meat"]
foods += ["beef", "cow"]
New matching check:
food_properties[food_properties["Food Name"].str.contains('|'.join(foods), case=False)]
Food Group | Food Name | Protein (g) | Carbohydrates (g) | Fat (g) | |
---|---|---|---|---|---|
1 | Poultry Products | Chicken, broilers or fryers, meat and skin and... | 18.33 | 0.13 | 14.83 |
6 | Poultry Products | Chicken, broilers or fryers, meat and skin, raw | 18.60 | 0.00 | 15.06 |
11 | Poultry Products | Chicken, broilers or fryers, meat only, raw | 21.39 | 0.00 | 3.08 |
15 | Poultry Products | Chicken, broilers or fryers, skin only, raw | 13.33 | 0.00 | 32.35 |
20 | Poultry Products | Chicken, broilers or fryers, giblets, raw | 17.88 | 1.80 | 4.47 |
... | ... | ... | ... | ... | ... |
4142 | Beef Products | Beef, round, tip round, roast, separable lean ... | 21.38 | 0.00 | 3.35 |
4145 | Beef Products | Beef, flank, steak, separable lean only, trimm... | 21.57 | 0.00 | 5.47 |
4146 | Beef Products | Beef, flank, steak, separable lean only, trimm... | 21.43 | 0.00 | 5.00 |
4147 | Beef Products | Beef, brisket, flat half, separable lean and f... | 18.12 | 0.12 | 22.15 |
4148 | Beef Products | Beef, brisket, flat half, separable lean and f... | 17.77 | 0.00 | 22.21 |
602 rows × 5 columns
We now have a good set of products, so we can stop here for the next considerations:
poss_diet = food_properties[food_properties["Food Name"].str.contains('|'.join(foods), case=False)]
A check of our remaining food groups:
poss_diet["Food Group"].unique()
array(['Poultry Products', 'Fruits and Fruit Juices', 'Vegetables and Vegetable Products', 'Beef Products', 'Finfish and Shellfish Products', 'Legumes and Legume Products', 'Lamb, Veal, and Game Products', 'Cereal Grains and Pasta'], dtype=object)
Fish products seem a bit strange given what we analyzed so far, so let's check:
poss_diet[poss_diet["Food Group"] == "Finfish and Shellfish Products"]
Food Group | Food Name | Protein (g) | Carbohydrates (g) | Fat (g) | |
---|---|---|---|---|---|
2312 | Finfish and Shellfish Products | Fish, anchovy, european, raw | 20.35 | 0.0 | 4.84 |
2364 | Finfish and Shellfish Products | Fish, milkfish, raw | 20.53 | 0.0 | 6.73 |
2437 | Finfish and Shellfish Products | Fish, turbot, european, raw | 16.05 | 0.0 | 2.95 |
Finfish and Shellfish Products is not really represented in our prices dataset, so we drop the category:
poss_diet = poss_diet[~poss_diet["Food Group"].isin(["Finfish and Shellfish Products"])].reset_index(drop=True)
Let's now adjust the units. The grams of protein in a 100 g serving [g of protein/100 g] times the kilocalories in 1 gram of protein [kcal/g of protein] give the kilocalories obtained from protein in a 100 g serving [kcal/100 g]. The same argument can be extended to carbohydrates and fat.
According to the National Agriculture Library, protein and carbohydrates provide 4 kcal per gram, while fat provides 9 kcal per gram.
Following this, we multiply the values and obtain a new dataframe diet_kcal
in which values have units of [kcal/100g]
# multiply values by the kcal/g vector, merging to keep the Food columns
kcal_g = np.array([4,4,9])
diet_kcal = poss_diet[["Food Group", "Food Name"]].merge(poss_diet[poss_diet.columns[2:]].multiply(kcal_g),left_index=True, right_index=True)
diet_kcal.rename(columns={'Protein (g)':'Protein (kcal/100g)',
'Carbohydrates (g)':'Carbohydrates (kcal/100g)',
'Fat (g)':'Fat (kcal/100g)'},
inplace=True)
diet_kcal.head()
Food Group | Food Name | Protein (kcal/100g) | Carbohydrates (kcal/100g) | Fat (kcal/100g) | |
---|---|---|---|---|---|
0 | Poultry Products | Chicken, broilers or fryers, meat and skin and... | 73.32 | 0.52 | 133.47 |
1 | Poultry Products | Chicken, broilers or fryers, meat and skin, raw | 74.40 | 0.00 | 135.54 |
2 | Poultry Products | Chicken, broilers or fryers, meat only, raw | 85.56 | 0.00 | 27.72 |
3 | Poultry Products | Chicken, broilers or fryers, skin only, raw | 53.32 | 0.00 | 291.15 |
4 | Poultry Products | Chicken, broilers or fryers, giblets, raw | 71.52 | 7.20 | 40.23 |
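As a quick sanity check of the conversion, take the second chicken row (18.60 g of protein and 15.06 g of fat per 100 g in the original table, 74.40 and 135.54 kcal/100 g above):

```python
import numpy as np

kcal_g = np.array([4, 4, 9])            # kcal per gram: protein, carbs, fat
grams = np.array([18.60, 0.00, 15.06])  # chicken, broilers or fryers, meat and skin, raw
kcal = grams * kcal_g                   # -> kcal/100g per macronutrient
```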
Next, we give a rank to each product, taking into account the macronutrient shares that each person should have in their diet:
(Data adapted from here)
To do so, we apply a greedy ranking based on how well a particular food represents these shares.
def rank_food(food):
prot = food['Protein (kcal/100g)']
carb = food['Carbohydrates (kcal/100g)']
fat = food['Fat (kcal/100g)']
if (prot == 0 and carb == 0 and fat == 0):
return -1
tot = prot + carb + fat
err_prot = abs(tot*0.55/4 - prot) / 100
err_carb = abs(tot*0.25/4 - carb) / 100
err_fat = abs(tot*0.20/9 - fat) / 100
avg_err = (err_prot + err_carb + err_fat)/3
return avg_err
diet_kcal['rank'] = diet_kcal.apply(rank_food, axis=1)
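To make the ranking concrete, here is the computation written out step by step for "Cucumber, peeled, raw" (2.36 / 8.64 / 1.44 kcal/100 g), which turns out to be the best-ranked vegetable:

```python
# worked example of rank_food for "Cucumber, peeled, raw"
prot, carb, fat = 2.36, 8.64, 1.44           # kcal/100g
tot = prot + carb + fat                      # total caloric density
err_prot = abs(tot * 0.55 / 4 - prot) / 100  # deviation from the protein share
err_carb = abs(tot * 0.25 / 4 - carb) / 100  # deviation from the carb share
err_fat = abs(tot * 0.20 / 9 - fat) / 100    # deviation from the fat share
avg_err = (err_prot + err_carb + err_fat) / 3
```

A lower average error means the food's macronutrient split is closer to the target shares.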
In the next step, we take the best 3 out of each group (because there may be inconsistencies in the results due to the filtering process).
From these, we then choose a representative product for each group, which we will use to compute the final diet in the next and final step.
diet_kcal.groupby(["Food Group"]).apply(lambda x: x.sort_values(['rank'])\
.reset_index(drop=True).groupby("Food Group").head(3)).reset_index(drop=True)
Food Group | Food Name | Protein (kcal/100g) | Carbohydrates (kcal/100g) | Fat (kcal/100g) | rank | |
---|---|---|---|---|---|---|
0 | Beef Products | Beef, New Zealand, imported, variety meats and... | 59.44 | 0.00 | 17.82 | 0.232495 |
1 | Beef Products | Beef, variety meats and by-products, tripe, raw | 48.28 | 0.00 | 33.21 | 0.245225 |
2 | Beef Products | Beef, New Zealand, imported, variety meats and... | 62.72 | 0.00 | 23.76 | 0.260241 |
3 | Cereal Grains and Pasta | Wild rice, raw | 58.92 | 299.60 | 9.72 | 0.954696 |
4 | Cereal Grains and Pasta | Barley, pearled, raw | 39.64 | 310.88 | 10.44 | 1.002436 |
5 | Cereal Grains and Pasta | Oat bran, raw | 69.20 | 264.88 | 63.27 | 1.030167 |
6 | Fruits and Fruit Juices | Rose-apples, raw | 2.40 | 22.80 | 2.70 | 0.081908 |
7 | Fruits and Fruit Juices | Pitanga, (surinam-cherry), raw | 3.20 | 29.96 | 3.60 | 0.107667 |
8 | Fruits and Fruit Juices | Strawberries, raw | 2.68 | 30.72 | 2.70 | 0.108818 |
9 | Lamb, Veal, and Game Products | Lamb, New Zealand, imported, sweetbread, raw | 44.00 | 0.00 | 28.44 | 0.217991 |
10 | Lamb, Veal, and Game Products | Goat, raw | 82.40 | 0.00 | 20.79 | 0.310525 |
11 | Lamb, Veal, and Game Products | Game meat, beefalo, composite of cuts, raw | 93.20 | 0.00 | 43.20 | 0.410463 |
12 | Legumes and Legume Products | SILK Banana-Strawberry soy yogurt | 9.40 | 68.24 | 10.62 | 0.247061 |
13 | Legumes and Legume Products | SILK Strawberry soy yogurt | 9.40 | 72.96 | 10.62 | 0.263624 |
14 | Legumes and Legume Products | Beans, kidney, all types, mature seeds, raw | 94.32 | 240.04 | 7.47 | 0.887067 |
15 | Poultry Products | Chicken, gizzard, all classes, raw | 70.64 | 0.00 | 18.54 | 0.268366 |
16 | Poultry Products | Chicken, broilers or fryers, giblets, raw | 71.52 | 7.20 | 40.23 | 0.309951 |
17 | Poultry Products | Chicken, roasting, light meat, meat only, raw | 88.80 | 0.00 | 14.67 | 0.311368 |
18 | Vegetables and Vegetable Products | Cucumber, peeled, raw | 2.36 | 8.64 | 1.44 | 0.032252 |
19 | Vegetables and Vegetable Products | Lettuce, butterhead (includes boston and bibb ... | 5.40 | 8.92 | 1.98 | 0.042259 |
20 | Vegetables and Vegetable Products | Lettuce, red leaf, raw | 5.32 | 9.04 | 1.98 | 0.042363 |
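As a side note, the doubly nested groupby used above to take the best 3 per group can be written more compactly: sort once by rank, then take head(n) within each group (a sketch on toy data):

```python
import pandas as pd

# toy stand-in for diet_kcal: two groups with ranks
df = pd.DataFrame({
    "Food Group": ["A", "A", "A", "B", "B"],
    "rank": [0.3, 0.1, 0.2, 0.5, 0.4],
})
# sort globally, then keep the first 2 rows of each group
top2 = df.sort_values("rank").groupby("Food Group").head(2)
```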
Let's analyze these results further.
We can start from the category that is less represented (2 results instead of 3): Lamb, Veal, and Game Products
diet_kcal[diet_kcal["Food Group"] == "Lamb, Veal, and Game Products"]
Food Group | Food Name | Protein (kcal/100g) | Carbohydrates (kcal/100g) | Fat (kcal/100g) | rank | |
---|---|---|---|---|---|---|
298 | Lamb, Veal, and Game Products | Game meat, beefalo, composite of cuts, raw | 93.2 | 0.0 | 43.20 | 0.410463 |
299 | Lamb, Veal, and Game Products | Goat, raw | 82.4 | 0.0 | 20.79 | 0.310525 |
300 | Lamb, Veal, and Game Products | Lamb, New Zealand, imported, sweetbread, raw | 44.0 | 0.0 | 28.44 | 0.217991 |
As we can see, this category doesn't really have proper representation: the residual results are here only as a consequence of the filtering process (using str.contains
caused this).
For this reason, we drop it, then increase the range of possible products from 3 to 5 and reprint.
diet_kcal = diet_kcal[~diet_kcal["Food Group"].isin(["Lamb, Veal, and Game Products"])].reset_index(drop=True)
diet_kcal.groupby(["Food Group"]).apply(lambda x: x.sort_values(['rank'])\
.reset_index(drop=True).groupby("Food Group").head(5)).reset_index(drop=True)
Food Group | Food Name | Protein (kcal/100g) | Carbohydrates (kcal/100g) | Fat (kcal/100g) | rank | |
---|---|---|---|---|---|---|
0 | Beef Products | Beef, New Zealand, imported, variety meats and... | 59.44 | 0.00 | 17.82 | 0.232495 |
1 | Beef Products | Beef, variety meats and by-products, tripe, raw | 48.28 | 0.00 | 33.21 | 0.245225 |
2 | Beef Products | Beef, New Zealand, imported, variety meats and... | 62.72 | 0.00 | 23.76 | 0.260241 |
3 | Beef Products | Beef, variety meats and by-products, lungs, raw | 64.80 | 0.00 | 22.50 | 0.262708 |
4 | Beef Products | Beef, variety meats and by-products, kidneys, raw | 69.60 | 1.16 | 27.81 | 0.288889 |
5 | Cereal Grains and Pasta | Wild rice, raw | 58.92 | 299.60 | 9.72 | 0.954696 |
6 | Cereal Grains and Pasta | Barley, pearled, raw | 39.64 | 310.88 | 10.44 | 1.002436 |
7 | Cereal Grains and Pasta | Oat bran, raw | 69.20 | 264.88 | 63.27 | 1.030167 |
8 | Cereal Grains and Pasta | Rice, brown, medium-grain, raw | 30.00 | 304.68 | 24.12 | 1.059122 |
9 | Cereal Grains and Pasta | Rice, white, medium-grain, raw, enriched | 26.44 | 317.36 | 5.22 | 1.065442 |
10 | Fruits and Fruit Juices | Rose-apples, raw | 2.40 | 22.80 | 2.70 | 0.081908 |
11 | Fruits and Fruit Juices | Pitanga, (surinam-cherry), raw | 3.20 | 29.96 | 3.60 | 0.107667 |
12 | Fruits and Fruit Juices | Strawberries, raw | 2.68 | 30.72 | 2.70 | 0.108818 |
13 | Fruits and Fruit Juices | Acerola, (west indian cherry), raw | 1.60 | 30.76 | 2.70 | 0.112368 |
14 | Fruits and Fruit Juices | Strawberries, frozen, unsweetened | 1.72 | 36.52 | 0.99 | 0.126202 |
15 | Legumes and Legume Products | SILK Banana-Strawberry soy yogurt | 9.40 | 68.24 | 10.62 | 0.247061 |
16 | Legumes and Legume Products | SILK Strawberry soy yogurt | 9.40 | 72.96 | 10.62 | 0.263624 |
17 | Legumes and Legume Products | Beans, kidney, all types, mature seeds, raw | 94.32 | 240.04 | 7.47 | 0.887067 |
18 | Legumes and Legume Products | Lima beans, thin seeded (baby), mature seeds, raw | 82.48 | 251.32 | 8.37 | 0.887107 |
19 | Legumes and Legume Products | Beans, white, mature seeds, raw | 93.44 | 241.08 | 7.65 | 0.887107 |
20 | Poultry Products | Chicken, gizzard, all classes, raw | 70.64 | 0.00 | 18.54 | 0.268366 |
21 | Poultry Products | Chicken, broilers or fryers, giblets, raw | 71.52 | 7.20 | 40.23 | 0.309951 |
22 | Poultry Products | Chicken, roasting, light meat, meat only, raw | 88.80 | 0.00 | 14.67 | 0.311368 |
23 | Poultry Products | Chicken, roasting, meat only, raw | 81.32 | 0.00 | 24.30 | 0.317838 |
24 | Poultry Products | Chicken, dark meat, drumstick, meat only, with... | 76.76 | 0.00 | 29.34 | 0.319282 |
25 | Vegetables and Vegetable Products | Cucumber, peeled, raw | 2.36 | 8.64 | 1.44 | 0.032252 |
26 | Vegetables and Vegetable Products | Lettuce, butterhead (includes boston and bibb ... | 5.40 | 8.92 | 1.98 | 0.042259 |
27 | Vegetables and Vegetable Products | Lettuce, red leaf, raw | 5.32 | 9.04 | 1.98 | 0.042363 |
28 | Vegetables and Vegetable Products | Lettuce, iceberg (includes crisphead types), raw | 3.60 | 11.88 | 1.26 | 0.043400 |
29 | Vegetables and Vegetable Products | Cucumber, with peel, raw | 2.60 | 14.52 | 0.99 | 0.046952 |
Now the shape of our final diet is much clearer.
We can now build the final dataframe with the nutrients for each product.
prod_diet_final = pd.DataFrame(columns=["Product", "Protein (kcal/100g)", "Carbohydrates (kcal/100g)","Fat (kcal/100g)"])
# taking the median over the top 10 beef products
best_beef = diet_kcal[diet_kcal["Food Group"]=="Beef Products"].sort_values(by="rank").head(10)
beef_repr = [best_beef["Protein (kcal/100g)"].median(),\
best_beef["Carbohydrates (kcal/100g)"].median(),\
best_beef["Fat (kcal/100g)"].median()]
beef_list = ["Beef Meat"] + beef_repr
beef = pd.DataFrame(np.array([beef_list]), columns=["Product", "Protein (kcal/100g)", "Carbohydrates (kcal/100g)","Fat (kcal/100g)"])
prod_diet_final = prod_diet_final.append(beef)
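Note that wrapping the mixed string/float beef_list in np.array silently casts every value to a string, which is why an astype(float) pass is needed at the end of this section. Building the row from a dict instead keeps the numeric dtypes (a sketch using the median values computed above):

```python
import pandas as pd

# building the row from a dict preserves float dtypes
beef = pd.DataFrame([{
    "Product": "Beef Meat",
    "Protein (kcal/100g)": 70.24,
    "Carbohydrates (kcal/100g)": 0.0,
    "Fat (kcal/100g)": 28.98,
}])
```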
# taking the top 3 in cereals (rice, barley, oats)
best_cereal = diet_kcal[diet_kcal["Food Group"]=="Cereal Grains and Pasta"].sort_values(by="rank").head(3)
best_cereal = best_cereal.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
best_cereal = best_cereal.replace({"Product":{"Wild rice, raw": "Rice", "Barley, pearled, raw": "Barley", "Oat bran, raw": "Oats"}})
prod_diet_final = prod_diet_final.append(best_cereal)
Further analysis showed that cherries can also be added to the possible diet, so we keep those too.
# taking apples, cherries and strawberries
fruits = diet_kcal[(diet_kcal["Food Name"]=="Apples, raw, with skin") | (diet_kcal["Food Name"]=="Strawberries, raw") | (diet_kcal["Food Name"]=="Cherries, sour, red, raw")]
fruits = fruits.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
fruits = fruits.replace({"Product":{"Apples, raw, with skin": "Apples", "Cherries, sour, red, raw": "Cherries", "Strawberries, raw": "Strawberries"}})
prod_diet_final = prod_diet_final.append(fruits)
# just take Chicken, roasting, light meat, meat only, raw
poultry = diet_kcal[diet_kcal["Food Name"]=="Chicken, roasting, light meat, meat only, raw"]
poultry = poultry.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
poultry = poultry.replace({"Product":{"Chicken, roasting, light meat, meat only, raw": "Chicken"}})
prod_diet_final = prod_diet_final.append(poultry)
check_leg = diet_kcal[diet_kcal["Food Group"]=="Legumes and Legume Products"]
check_leg[check_leg["Food Name"].str.contains("bean", case=False)].head(5)
Food Group | Food Name | Protein (kcal/100g) | Carbohydrates (kcal/100g) | Fat (kcal/100g) | rank | |
---|---|---|---|---|---|---|
261 | Legumes and Legume Products | Beans, adzuki, mature seeds, raw | 79.48 | 251.60 | 4.77 | 0.888678 |
262 | Legumes and Legume Products | Beans, black, mature seeds, raw | 86.40 | 249.44 | 12.78 | 0.903830 |
263 | Legumes and Legume Products | Beans, black turtle, mature seeds, raw | 85.00 | 253.00 | 8.10 | 0.897296 |
264 | Legumes and Legume Products | Beans, cranberry (roman), mature seeds, raw | 92.12 | 240.20 | 11.07 | 0.890270 |
265 | Legumes and Legume Products | Beans, french, mature seeds, raw | 75.24 | 256.44 | 18.18 | 0.907044 |
check_leg[check_leg["Food Name"].str.contains("pea", case=False)].head(5)
Food Group | Food Name | Protein (kcal/100g) | Carbohydrates (kcal/100g) | Fat (kcal/100g) | rank | |
---|---|---|---|---|---|---|
278 | Legumes and Legume Products | Chickpeas (garbanzo beans, bengal gram), matur... | 81.88 | 251.80 | 54.36 | 1.006030 |
279 | Legumes and Legume Products | Cowpeas, catjang, mature seeds, raw | 95.40 | 238.56 | 18.63 | 0.914122 |
280 | Legumes and Legume Products | Cowpeas, common (blackeyes, crowder, southern)... | 94.08 | 240.12 | 11.34 | 0.895844 |
287 | Legumes and Legume Products | Peas, green, split, mature seeds, raw | 95.28 | 254.96 | 10.44 | 0.935096 |
288 | Legumes and Legume Products | Peanuts, all types, raw | 103.20 | 64.52 | 443.16 | 1.583763 |
As we can see, both of the products we mentioned before are present. Since only 2 types of legumes were chosen, and their nutrient provision is usually high, we decide to keep both of them for the next analysis.
# Taking one for beans and one for peas
beans = diet_kcal[diet_kcal["Food Name"]=="Beans, adzuki, mature seeds, raw"]
beans = beans.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
beans = beans.replace({"Product":{"Beans, adzuki, mature seeds, raw": "Beans"}})
peas = diet_kcal[diet_kcal["Food Name"]=="Peas, green, split, mature seeds, raw"]
peas = peas.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
peas = peas.replace({"Product":{"Peas, green, split, mature seeds, raw": "Peas"}})
prod_diet_final = prod_diet_final.append(beans)
prod_diet_final = prod_diet_final.append(peas)
diet_kcal[diet_kcal["Food Group"]=="Vegetables and Vegetable Products"].sort_values(by="rank").head(10)
Food Group | Food Name | Protein (kcal/100g) | Carbohydrates (kcal/100g) | Fat (kcal/100g) | rank | |
---|---|---|---|---|---|---|
104 | Vegetables and Vegetable Products | Cucumber, peeled, raw | 2.36 | 8.64 | 1.44 | 0.032252 |
107 | Vegetables and Vegetable Products | Lettuce, butterhead (includes boston and bibb ... | 5.40 | 8.92 | 1.98 | 0.042259 |
111 | Vegetables and Vegetable Products | Lettuce, red leaf, raw | 5.32 | 9.04 | 1.98 | 0.042363 |
109 | Vegetables and Vegetable Products | Lettuce, iceberg (includes crisphead types), raw | 3.60 | 11.88 | 1.26 | 0.043400 |
103 | Vegetables and Vegetable Products | Cucumber, with peel, raw | 2.60 | 14.52 | 0.99 | 0.046952 |
133 | Vegetables and Vegetable Products | Tomatoes, yellow, raw | 3.92 | 11.92 | 2.34 | 0.047133 |
110 | Vegetables and Vegetable Products | Lettuce, green leaf, raw | 5.44 | 11.48 | 1.35 | 0.047367 |
132 | Vegetables and Vegetable Products | Tomatoes, orange, raw | 4.64 | 12.72 | 1.71 | 0.049441 |
90 | Vegetables and Vegetable Products | Balsam-pear (bitter gourd), pods, raw | 4.00 | 14.80 | 1.53 | 0.052707 |
108 | Vegetables and Vegetable Products | Lettuce, cos or romaine, raw | 4.92 | 13.16 | 2.70 | 0.053874 |
As we can see, the list of the 10 most viable vegetables includes lettuce, cucumber and tomatoes (several kinds of the latter).
We decide to take all three (searching for the red tomato, as it is the most common).
# take Lettuce, green leaf, raw and Cucumber
lettuce = diet_kcal[diet_kcal["Food Name"]=="Lettuce, green leaf, raw"]
lettuce = lettuce.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
lettuce = lettuce.replace({"Product":{"Lettuce, green leaf, raw": "Lettuces"}})
cucumber = diet_kcal[diet_kcal["Food Name"]=="Cucumber, peeled, raw"]
cucumber = cucumber.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
cucumber = cucumber.replace({"Product":{"Cucumber, peeled, raw": "Cucumbers"}})
prod_diet_final = prod_diet_final.append(lettuce)
prod_diet_final = prod_diet_final.append(cucumber)
# looking at all the tomato entries
tomatoes = diet_kcal[diet_kcal["Food Name"].str.contains("tomatoe", case=False)]
tomatoes
Food Group | Food Name | Protein (kcal/100g) | Carbohydrates (kcal/100g) | Fat (kcal/100g) | rank | |
---|---|---|---|---|---|---|
125 | Vegetables and Vegetable Products | Tomatoes, green, raw | 4.80 | 20.40 | 1.80 | 0.070000 |
126 | Vegetables and Vegetable Products | Tomatoes, red, ripe, raw, year round average | 3.52 | 15.56 | 1.80 | 0.054133 |
132 | Vegetables and Vegetable Products | Tomatoes, orange, raw | 4.64 | 12.72 | 1.71 | 0.049441 |
133 | Vegetables and Vegetable Products | Tomatoes, yellow, raw | 3.92 | 11.92 | 2.34 | 0.047133 |
We found the name for the red tomato. Add it to the dataset:
tomatoes = diet_kcal[diet_kcal["Food Name"]=="Tomatoes, red, ripe, raw, year round average"]
tomatoes = tomatoes.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
tomatoes = tomatoes.replace({"Product":{"Tomatoes, red, ripe, raw, year round average": "Tomatoes"}})
prod_diet_final = prod_diet_final.append(tomatoes).reset_index(drop=True).set_index("Product")
We can finally look at our final diet dataframe prod_diet_final
. This dataframe will be used in the next section to build an optimal diet based on the production prices.
prod_diet_final
Protein (kcal/100g) | Carbohydrates (kcal/100g) | Fat (kcal/100g) | |
---|---|---|---|
Product | |||
Beef Meat | 70.24 | 0.0 | 28.98 |
Rice | 58.92 | 299.6 | 9.72 |
Barley | 39.64 | 310.88 | 10.44 |
Oats | 69.2 | 264.88 | 63.27 |
Apples | 1.04 | 55.24 | 1.53 |
Cherries | 4 | 48.72 | 2.7 |
Strawberries | 2.68 | 30.72 | 2.7 |
Chicken | 88.8 | 0 | 14.67 |
Beans | 79.48 | 251.6 | 4.77 |
Peas | 95.28 | 254.96 | 10.44 |
Lettuces | 5.44 | 11.48 | 1.35 |
Cucumbers | 2.36 | 8.64 | 1.44 |
Tomatoes | 3.52 | 15.56 | 1.8 |
# making sure all the values are float
prod_diet_final["Protein (kcal/100g)"] = prod_diet_final["Protein (kcal/100g)"].astype(float)
prod_diet_final["Carbohydrates (kcal/100g)"] = prod_diet_final["Carbohydrates (kcal/100g)"].astype(float)
prod_diet_final["Fat (kcal/100g)"] = prod_diet_final["Fat (kcal/100g)"].astype(float)
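The three per-column casts above can also be written as a single frame-wide cast, since Product is in the index and all remaining columns are numeric (sketched on a toy frame shaped like prod_diet_final):

```python
import pandas as pd

# toy frame with string values, mimicking prod_diet_final before the cast
df = pd.DataFrame(
    {"Protein (kcal/100g)": ["70.24"], "Fat (kcal/100g)": ["28.98"]},
    index=pd.Index(["Beef Meat"], name="Product"),
)
df = df.astype(float)  # one cast for every column
```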
Saving the result into a pickle:
prod_diet_final.to_pickle("data/processed/prod_diet_final.pkl")
prod_diet_final = pd.read_pickle("data/processed/prod_diet_final.pkl")
Plotting the results:
#Creating an exploded version of prod_diet_final
plot_diet_exploxed = prod_diet_final.copy()
#Unstack result, converting to dataframe and reset index
plot_diet_exploxed = plot_diet_exploxed.unstack().to_frame().reset_index(level=['Product',0])
#Casting values column to float
#plot_diet_exploxed[0]= plot_diet_exploxed[0].astype(float)
#renaming columns
plot_diet_exploxed = plot_diet_exploxed.rename(columns={"level_0":"Macronutrient",0:"Value"})
#showing results
plot_diet_exploxed.head()
Macronutrient | Product | Value | |
---|---|---|---|
0 | Protein (kcal/100g) | Beef Meat | 70.24 |
1 | Protein (kcal/100g) | Rice | 58.92 |
2 | Protein (kcal/100g) | Barley | 39.64 |
3 | Protein (kcal/100g) | Oats | 69.20 |
4 | Protein (kcal/100g) | Apples | 1.04 |
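The unstack-based reshaping above can equivalently be done with melt, which goes straight from wide to long format (a sketch on a toy frame shaped like prod_diet_final):

```python
import pandas as pd

# wide frame: one row per product, one column per macronutrient
wide = pd.DataFrame(
    {"Protein (kcal/100g)": [70.24], "Fat (kcal/100g)": [28.98]},
    index=pd.Index(["Beef Meat"], name="Product"),
)
# long frame: one row per (product, macronutrient) pair
long = wide.reset_index().melt(
    id_vars="Product", var_name="Macronutrient", value_name="Value"
)
```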
#Initialize bar plot
bp = sns.barplot(x='Product', y="Value", hue='Macronutrient', data=plot_diet_exploxed)
#Set a log axis to show more values, since we have both very large and very small numbers
bp.set(yscale="log")
plt.ylabel('Log of caloric density [kcal/100g]')
plt.xlabel('Food items')
plt.title('Caloric density of different food items with respect to their share of macronutrients');
prod_diet_final.plot(kind='bar',stacked=True)
plt.ylabel('Caloric density [kcal/100g]')
plt.xlabel('Food items')
plt.title('Stacked barplot of caloric density of different food items with respect to their share of macronutrients');
The code which generates the 3D plot used for the macronutrient representation is as follows.
t = np.linspace(0, 20, 13)
fig = go.Figure(data=[go.Scatter3d(
name="",
text=prod_diet_final.index.values,
x=prod_diet_final["Protein (kcal/100g)"],
y=prod_diet_final["Carbohydrates (kcal/100g)"],
z=prod_diet_final["Fat (kcal/100g)"],
hovertemplate="Item: %{text}<br />Protein: %{x:.2f} kcal/100g<br />Carbohydrates: %{y:.2f} kcal/100g<br />Fat: %{z:.2f} kcal/100g",
mode="markers",
marker=dict(
size=8,
color=t, # set color to an array/list of desired values
colorscale='Viridis', # choose a colorscale
opacity=0.8
)
)])
camera = dict(
eye=dict(x=-1.4, y=-1.4, z=0.05),
center=dict(x=0,y=0,z=-0.2)
)
name = 'Shares of macronutrients in each food item'
# tight layout
fig.update_layout(
margin=dict(l=0, r=0, b=0, t=0),
title=name,
scene = dict(
camera=camera,
xaxis = dict(title='Protein [kcal/100g]'),
yaxis = dict(title='Carbohydrates [kcal/100g]'),
zaxis = dict(title='Fat [kcal/100g]'),
dragmode = 'turntable'),
paper_bgcolor = "rgba(0,0,0,0)"
)
py.plot(fig, filename='docs/_includes/3d_macros.html', include_plotlyjs=False)
IFrame(src="https://manuleo.github.io/mADAm_files/3d_macros.html", width=800, height=600)
As we can see from the 3D spatial representation of the shares of macronutrients, chicken and beef meat occupy the region in which protein and fat density is high. On the other hand, oats have a very high carbohydrate caloric density, while they don't really have a strong protein and fat footprint. In addition, most of the vegetables and fruits have quite low caloric density. The products listed will be the basis of the perfect diet that we will compute.
We now need to prepare the prices matrix for the next section. To do so, we filter the products in the prices
dataframe according to what we outlined in the last part.
Theoretically, we should end up with 65 rows in the final dataframe (5 countries × 13 products)
# convert all prices to float, just to be sure
prices.Price = prices.Price.apply(lambda x: float(x))
prices_final = prices[prices.Product.isin(prod_diet_final.index.values)]
prices_final.index.size
55
We are missing 2 products. Let's investigate:
print("Set of missing product in prices:", set(prod_diet_final.index.values) - set(prices_final.Product.unique()))
Set of missing product in prices: {'Barley', 'Beef Meat'}
These products probably have a different name in the prices dataframe (for example, "Beef Meat" appears there simply as "Meat").
We look into both cases and fix them manually
# check for Meat
prices[prices.Product.str.contains("Meat")]
Country | Product | Price | |
---|---|---|---|
60 | France | Meat | 4549.5 |
61 | Germany | Meat | 4033.1 |
62 | Italy | Meat | 3762.0 |
63 | United Kingdom | Meat | 5328.2 |
64 | Spain | Meat | 2774.8 |
65 | Spain | Meat, pig | 1350.8 |
66 | Germany | Meat, pig | 1685.6 |
67 | Italy | Meat, pig | 2844.8 |
68 | United Kingdom | Meat, pig | 1983.0 |
69 | France | Meat, pig | 1469.0 |
As expected, we have "Meat" instead of "Beef Meat". Fixing it:
prices = prices.replace({"Product":{"Meat": "Beef Meat"}})
# check for Barley
prices[prices.Product.str.contains("Barley")]
Country | Product | Price | |
---|---|---|---|
40 | Italy | Feed Barley | 172.92 |
41 | Germany | Feed Barley | 156.54 |
42 | United Kingdom | Feed Barley | 155.11 |
43 | Spain | Feed Barley | 175.67 |
44 | France | Feed Barley | 171.22 |
55 | France | Malting Barley | 195.27 |
56 | Germany | Malting Barley | 209.30 |
57 | Spain | Malting Barley | 180.00 |
58 | Italy | Malting Barley | 207.90 |
59 | United Kingdom | Malting Barley | 177.60 |
In this case we have two variants to choose from, "Feed Barley" and "Malting Barley". As we don't know which kind of barley the USDA_food entry we are considering is derived from,
we decide to take the average of the two prices
barleys = prices[prices.Product.str.contains("Barley")]
# average the two barley prices per country (select Price explicitly so the
# non-numeric Product column doesn't break the mean)
barleys = barleys.groupby("Country", as_index=False)["Price"].mean()
barleys["Product"] = "Barley"
barleys = barleys[["Country", "Product", "Price"]]
prices = pd.concat([prices, barleys], ignore_index=True)  # DataFrame.append is deprecated
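The averaging above can be replayed on a tiny hypothetical frame built from the two barley tables shown earlier; the per-country means match the values that end up in the final prices matrix.

```python
import pandas as pd

# Hypothetical replay of the barley averaging, using the numbers from the
# tables above (not the full prices dataframe).
toy = pd.DataFrame({
    "Country": ["Italy", "France", "Italy", "France"],
    "Product": ["Feed Barley", "Feed Barley", "Malting Barley", "Malting Barley"],
    "Price":   [172.92, 171.22, 207.90, 195.27],
})
avg = toy.groupby("Country", as_index=False)["Price"].mean()
avg["Product"] = "Barley"
print(avg[["Country", "Product", "Price"]])
# France: (171.22 + 195.27) / 2 = 183.245, Italy: (172.92 + 207.90) / 2 = 190.41
```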
After this final fix, we can recompute and print the prices_final
dataframe with the prices of our products of interest for all the countries
prices_final = prices[prices.Product.isin(prod_diet_final.index.values)]
prices_final.head()
Country | Product | Price | |
---|---|---|---|
0 | United Kingdom | Apples | 1343.400000 |
1 | Germany | Apples | 597.443000 |
2 | Spain | Apples | 898.040000 |
3 | France | Apples | 1029.792500 |
4 | Italy | Apples | 713.742857 |
We need to elaborate on this dataframe. For the next section, we need a matrix with countries as index and products as columns. We therefore reshape the last dataframe a bit and produce a final one named final_prices
# reshaping with pivot
final_prices = pd.pivot_table(prices_final, index="Country", columns="Product", aggfunc="first")
#cleaning
final_prices.index = final_prices.index.to_flat_index()
final_prices.columns = final_prices.columns.get_level_values(1)
final_prices.columns.name = None
# final print
final_prices
Apples | Barley | Beans | Beef Meat | Cherries | Chicken | Cucumbers | Lettuces | Oats | Peas | Rice | Strawberries | Tomatoes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | |||||||||||||
France | 1029.792500 | 183.245 | 262.90 | 4549.5 | 4450.60 | 2530.00 | 951.7 | 1269.30 | 145.7 | 356.80 | 372.0 | 3438.60 | 779.0 |
Germany | 597.443000 | 182.920 | 2347.80 | 4033.1 | 3648.37 | 3174.38 | 753.1 | 473.22 | 182.7 | 2848.30 | 345.9 | 3404.50 | 1945.3 |
Italy | 713.742857 | 190.410 | 1931.16 | 3762.0 | 2134.00 | 2244.77 | 748.0 | 1159.29 | 272.3 | 1323.63 | 403.7 | 1952.83 | 1003.4 |
Spain | 898.040000 | 177.835 | 1921.40 | 2774.8 | 1815.00 | 1671.23 | 644.8 | 906.62 | 171.3 | 506.40 | 344.6 | 1259.06 | 372.4 |
United Kingdom | 1343.400000 | 166.355 | 2907.00 | 5328.2 | 4574.90 | 1845.69 | 1218.0 | 893.75 | 193.4 | 2861.60 | 380.7 | 2949.54 | 1320.7 |
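A quick note on the reshaping used above, shown on a hypothetical mini example: even with a single value column ("Price"), `pd.pivot_table` produces MultiIndex columns such as `("Price", "Apples")`, which is why the `get_level_values(1)` flattening step is needed.

```python
import pandas as pd

# Hypothetical mini example of the pivot step (made-up subset of the data).
toy = pd.DataFrame({
    "Country": ["France", "France", "Italy", "Italy"],
    "Product": ["Apples", "Rice", "Apples", "Rice"],
    "Price":   [1029.79, 372.0, 713.74, 403.7],
})
wide = pd.pivot_table(toy, index="Country", columns="Product", aggfunc="first")
print(wide.columns.tolist())   # MultiIndex: [('Price', 'Apples'), ('Price', 'Rice')]
wide.columns = wide.columns.get_level_values(1)   # keep only the product level
wide.columns.name = None
print(wide)
```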
Final footnote: the values reported here are in USD/tonne
Saving the result to a pickle (and reloading it)
final_prices.to_pickle('data/processed/final_prices.pkl')
final_prices = pd.read_pickle('data/processed/final_prices.pkl')
The composition of the diet is one of the key steps of our analysis. We decide to compute the amount of each product by minimizing the actual shipment costs that every European country has to meet. To model the problem we need to define an objective function, which in this case is the non-negative weighted sum of product costs.
The problem we want to model is a Linear Program. The library we will use to solve it is PICOS, and the solver will be GUROBI.
In particular:
The strategy is to take advantage of the results we obtained from the previous optimization problem (food_opt_distribution_df
) and use them as a starting point for the computation of the ideal diet for every country. The prices come from the final_prices
dataframe. A last note: the problem will be solved 5 times, once for each country in best_countries
, so we will have 5 outputs to analyse. Let's do it.
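Before diving into the code, here is a compact sketch of the LP in our own notation, matching the code below: $c \in \mathbb{R}^m$ is the donor country's price vector over the $m = 13$ products, $M \in \mathbb{R}^{m \times 3}$ the macronutrient density matrix, $s = (0.55, 0.25, 0.2)$ the target macro shares, $d \in \mathbb{R}^n$ the kilocalorie give-up over the $n = 25$ receiving countries, and $Y \in \mathbb{R}^{m \times n}$ the decision matrix of tons shipped:

$$
\begin{aligned}
\min_{Y \ge 0} \quad & \sum_{k=1}^{n} c^{\top} Y_{:,k} \\
\text{s.t.} \quad & \sum_{i=1}^{m} M_{ij}\, Y_{ik} = s_j\, d_k \qquad \forall j \in \{1,2,3\},\ \forall k \\
& 10^{-4}\, d_k \;\le\; Y_{ik} \sum_{j=1}^{3} M_{ij} \;\le\; 0.355\, d_k \qquad \forall i,\ \forall k
\end{aligned}
$$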
italy_giveup_val = food_opt_distribution_df.loc["Italy"].values
italy_giveup_index = food_opt_distribution_df.loc["Italy"].index
italy_prices_val = final_prices.loc["Italy"].values.reshape(-1,1)
italy_prices_index = final_prices.loc["Italy"].index
final_prices
Apples | Barley | Beans | Beef Meat | Cherries | Chicken | Cucumbers | Lettuces | Oats | Peas | Rice | Strawberries | Tomatoes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | |||||||||||||
France | 1029.792500 | 183.245 | 262.90 | 4549.5 | 4450.60 | 2530.00 | 951.7 | 1269.30 | 145.7 | 356.80 | 372.0 | 3438.60 | 779.0 |
Germany | 597.443000 | 182.920 | 2347.80 | 4033.1 | 3648.37 | 3174.38 | 753.1 | 473.22 | 182.7 | 2848.30 | 345.9 | 3404.50 | 1945.3 |
Italy | 713.742857 | 190.410 | 1931.16 | 3762.0 | 2134.00 | 2244.77 | 748.0 | 1159.29 | 272.3 | 1323.63 | 403.7 | 1952.83 | 1003.4 |
Spain | 898.040000 | 177.835 | 1921.40 | 2774.8 | 1815.00 | 1671.23 | 644.8 | 906.62 | 171.3 | 506.40 | 344.6 | 1259.06 | 372.4 |
United Kingdom | 1343.400000 | 166.355 | 2907.00 | 5328.2 | 4574.90 | 1845.69 | 1218.0 | 893.75 | 193.4 | 2861.60 | 380.7 | 2949.54 | 1320.7 |
prod_diet_final = prod_diet_final.sort_values(by="Product")
prod_diet_index = prod_diet_final.index
prod_diet_val = prod_diet_final.to_numpy()/10**-6
prod_diet_final
Protein (kcal/100g) | Carbohydrates (kcal/100g) | Fat (kcal/100g) | |
---|---|---|---|
Product | |||
Apples | 1.04 | 55.24 | 1.53 |
Barley | 39.64 | 310.88 | 10.44 |
Beans | 79.48 | 251.60 | 4.77 |
Beef Meat | 70.24 | 0.00 | 28.98 |
Cherries | 4.00 | 48.72 | 2.70 |
Chicken | 88.80 | 0.00 | 14.67 |
Cucumbers | 2.36 | 8.64 | 1.44 |
Lettuces | 5.44 | 11.48 | 1.35 |
Oats | 69.20 | 264.88 | 63.27 |
Peas | 95.28 | 254.96 | 10.44 |
Rice | 58.92 | 299.60 | 9.72 |
Strawberries | 2.68 | 30.72 | 2.70 |
Tomatoes | 3.52 | 15.56 | 1.80 |
First off, we initialize the problem prob
. Secondly, we add the decision variable $Y \in \mathbb{R}^{m \times n}$:
prob = pic.Problem() # initialize the convex problem
Y = prob.add_variable('Y', (italy_prices_val.size,italy_giveup_val.size)) # decision matrix of variables (m products x n receiving countries)
We then set the objective function obj
and declare the problem as a minimization
:
obj = pic.sum(italy_prices_val.T * Y) #define obj function
prob.set_objective("min", obj) #set objective function
With regard to the constraints:
#Initialize constraints
constraints = []
#Define non-negativity constraint
constraints.append(prob.add_constraint(Y>=0))
#Define constraints proteins,carbs and fats
#Define shares of proteins,carbs and fat as an absolute variable (not subject to optimization)
shares = np.array([0.55,0.25,0.2])
for i in range(0,italy_giveup_val.size):
for j in range(0,shares.size):
constraints.append(prob.add_constraint(Y[:,i].T*prod_diet_val[:,j].reshape(-1,1)==shares[j]*italy_giveup_val[i]))
#Define bounds: every product covers at most 35.5% and at least 0.01% of the final demand
for i in range(0,italy_giveup_val.size):
for j in range(0,italy_prices_val.size):
constraints.append(prob.add_constraint(pic.sum(Y[j,i]*prod_diet_val[j,:].reshape(1,-1))<=0.355*italy_giveup_val[i]))
constraints.append(prob.add_constraint(pic.sum(Y[j,i]*prod_diet_val[j,:].reshape(1,-1))>=0.0001*italy_giveup_val[i]))
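To make the constraint structure concrete without a GUROBI license, here is a hypothetical miniature of the same LP (3 products, 1 receiving country, made-up densities and prices, not the real data) solved with `scipy.optimize.linprog` instead of PICOS:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical miniature of the diet LP above (toy numbers only).
# Rows of A: protein / carbs / fat delivered per ton of each product.
A = np.array([
    [70.0, 10.0,  5.0],
    [10.0, 80.0, 15.0],
    [20.0,  5.0, 60.0],
])
prices = np.array([300.0, 120.0, 180.0])  # toy cost per ton of each product
shares = np.array([0.55, 0.25, 0.20])     # protein / carbs / fat shares
demand = 1000.0                           # toy kcal the country gives up

# minimize prices @ y  subject to  A @ y == shares * demand,  y >= 0
res = linprog(c=prices, A_eq=A, b_eq=shares * demand,
              bounds=[(0, None)] * 3, method="highs")
assert res.success
print("tons per product:", np.round(res.x, 4))
print("macros delivered:", np.round(A @ res.x, 4))  # equals shares * demand
```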
Let's have a look at the problem:
#print(prob)
#Solving problem with gurobi solver (License available for free academic use)
solution = prob.solve(verbose=0, solver = 'gurobi')
Y_opt_ITdiet = Y.value
print('The solution of the problem is:')
print(Y_opt_ITdiet);
Using license file /Users/riccardovasapollo/gurobi.lic
Academic license - for non-commercial use only
Reset all parameters
The solution of the problem is:
[ 5.73e+00 2.38e+00 1.95e+00 1.77e+00 1.37e+00 1.13e+00 1.04e+00 ... ]
[ 8.86e+01 3.69e+01 3.02e+01 2.75e+01 2.12e+01 1.74e+01 1.60e+01 ... ]
[ 9.86e-01 4.10e-01 3.36e-01 3.05e-01 2.36e-01 1.94e-01 1.78e-01 ... ]
[ 1.18e+04 4.93e+03 4.04e+03 3.67e+03 2.83e+03 2.33e+03 2.14e+03 ... ]
[ 5.97e+00 2.49e+00 2.04e+00 1.85e+00 1.43e+00 1.17e+00 1.08e+00 ... ]
[ 8.74e+03 3.64e+03 2.98e+03 2.71e+03 2.09e+03 1.72e+03 1.58e+03 ... ]
[ 1.68e+03 7.01e+02 5.74e+02 5.22e+02 4.03e+02 3.31e+02 3.05e+02 ... ]
[ 1.81e+01 7.54e+00 6.18e+00 5.61e+00 4.34e+00 3.56e+00 3.28e+00 ... ]
[ 2.96e+03 1.23e+03 1.01e+03 9.16e+02 7.08e+02 5.81e+02 5.35e+02 ... ]
[ 9.18e-01 3.82e-01 3.13e-01 2.84e-01 2.20e-01 1.80e-01 1.66e-01 ... ]
[ 8.99e-01 3.74e-01 3.06e-01 2.79e-01 2.15e-01 1.77e-01 1.63e-01 ... ]
[ 9.17e+00 3.82e+00 3.13e+00 2.84e+00 2.19e+00 1.80e+00 1.66e+00 ... ]
[ 1.59e+01 6.60e+00 5.40e+00 4.91e+00 3.79e+00 3.12e+00 2.87e+00 ... ]
Converting the solution into a more convenient dataframe:
result_ITdiet = np.array(Y_opt_ITdiet)
result_ITdiet_df= pd.DataFrame(data=result_ITdiet, index = prod_diet_index)
result_ITdiet_df.columns = italy_giveup_index
result_ITdiet_df
Ethiopia | Kenya | United Republic of Tanzania | Madagascar | Zambia | Sudan | Uganda | Chad | Mozambique | Senegal | ... | Congo | Central African Republic | Liberia | Namibia | Togo | Botswana | Guinea-Bissau | Eswatini | Lesotho | Djibouti | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product | |||||||||||||||||||||
Apples | 5.725581 | 2.383284 | 1.951548 | 1.774081 | 1.370546 | 1.125086 | 1.035212 | 0.913324 | 0.749409 | 0.664980 | ... | 0.263661 | 0.232392 | 0.130483 | 0.113529 | 0.110133 | 0.063273 | 0.051926 | 0.039848 | 0.015356 | 0.014821 |
Barley | 88.599557 | 36.879736 | 30.198907 | 27.452727 | 21.208282 | 17.409951 | 16.019216 | 14.133077 | 11.596601 | 10.290117 | ... | 4.079979 | 3.596116 | 2.019140 | 1.756779 | 1.704242 | 0.979106 | 0.803521 | 0.616618 | 0.237626 | 0.229343 |
Beans | 0.985547 | 0.410236 | 0.335921 | 0.305373 | 0.235913 | 0.193661 | 0.178191 | 0.157211 | 0.128996 | 0.114463 | ... | 0.045384 | 0.040002 | 0.022460 | 0.019542 | 0.018957 | 0.010891 | 0.008938 | 0.006859 | 0.002643 | 0.002551 |
Beef Meat | 11842.726527 | 4929.557696 | 4036.559600 | 3669.489481 | 2834.821002 | 2327.114171 | 2141.220573 | 1889.108393 | 1550.068388 | 1375.436180 | ... | 545.353426 | 480.677541 | 269.889804 | 234.821184 | 227.798787 | 130.872874 | 107.403290 | 82.420742 | 31.762410 | 30.655362 |
Cherries | 5.972498 | 2.486064 | 2.035709 | 1.850589 | 1.429651 | 1.173605 | 1.079856 | 0.952711 | 0.781727 | 0.693657 | ... | 0.275031 | 0.242414 | 0.136110 | 0.118425 | 0.114883 | 0.066002 | 0.054165 | 0.041566 | 0.016018 | 0.015460 |
Chicken | 8739.780757 | 3637.950550 | 2978.929374 | 2708.036319 | 2092.061654 | 1717.380504 | 1580.193406 | 1394.137840 | 1143.930651 | 1015.054314 | ... | 402.463856 | 354.733879 | 199.175224 | 173.295031 | 168.112592 | 96.582507 | 79.262255 | 60.825454 | 23.440253 | 22.623265 |
Cucumbers | 1684.407936 | 701.138043 | 574.125647 | 521.916739 | 403.200647 | 330.988778 | 304.548865 | 268.690589 | 220.468444 | 195.630256 | ... | 77.566398 | 68.367454 | 38.386813 | 33.398953 | 32.400147 | 18.614236 | 15.276124 | 11.722820 | 4.517613 | 4.360156 |
Lettuces | 18.116906 | 7.541197 | 6.175096 | 5.613555 | 4.336686 | 3.560000 | 3.275622 | 2.889942 | 2.371282 | 2.104131 | ... | 0.834277 | 0.735337 | 0.412875 | 0.359228 | 0.348485 | 0.200208 | 0.164305 | 0.126087 | 0.048590 | 0.046896 |
Oats | 2957.179630 | 1230.931709 | 1007.946253 | 916.287269 | 707.866968 | 581.090394 | 534.671965 | 471.718472 | 387.058728 | 343.452316 | ... | 136.177091 | 120.027245 | 67.392642 | 58.635857 | 56.882334 | 32.679518 | 26.819062 | 20.580813 | 7.931210 | 7.654775 |
Peas | 0.917700 | 0.381994 | 0.312795 | 0.284351 | 0.219672 | 0.180329 | 0.165924 | 0.146388 | 0.120116 | 0.106583 | ... | 0.042260 | 0.037248 | 0.020914 | 0.018196 | 0.017652 | 0.010141 | 0.008323 | 0.006387 | 0.002461 | 0.002376 |
Rice | 0.898859 | 0.374152 | 0.306374 | 0.278513 | 0.215162 | 0.176627 | 0.162518 | 0.143383 | 0.117650 | 0.104395 | ... | 0.041392 | 0.036483 | 0.020485 | 0.017823 | 0.017290 | 0.009933 | 0.008152 | 0.006256 | 0.002411 | 0.002327 |
Strawberries | 9.168861 | 3.816556 | 3.125180 | 2.840987 | 2.194771 | 1.801695 | 1.657773 | 1.462583 | 1.200092 | 1.064888 | ... | 0.422223 | 0.372150 | 0.208954 | 0.181803 | 0.176366 | 0.101324 | 0.083154 | 0.063812 | 0.024591 | 0.023734 |
Tomatoes | 15.852292 | 6.598547 | 5.403209 | 4.911860 | 3.794600 | 3.115000 | 2.866169 | 2.528700 | 2.074872 | 1.841115 | ... | 0.729993 | 0.643419 | 0.361266 | 0.314324 | 0.304924 | 0.175182 | 0.143767 | 0.110326 | 0.042516 | 0.041034 |
13 rows × 25 columns
Let's now have a look at the results yielded by the optimization with Italy as the donor country. The products used the most are beef meat, chicken, oats and cucumbers.
We derive from these values that the optimization is choosing the products that cost the least while contributing the most to the kilocalorie demand. Moreover, the constraint on macronutrient shares is respected, so our diets are close to what one would biologically need.
In conclusion, the results are quite striking: we have obtained a tangible number of tons of food items that every European country should send in order to meet the demand in Africa.
The reader can inspect result_ITdiet_df
to see how every food item is distributed over the 25 African countries for the given donor country (Italy in this example). The problem shown above is an example of how we solve the problem for a single country. For the sake of simplicity, a function LPSolverDietEurope
will be created and run on the 5 countries. The function will output 5 dataframes with the respective solutions. For more details, have a look at LPSolverDietEurope.py in the repository.
import LPSolverDietEurope as lp
diet_dict_eu_countries = {name: lp.LPSolverDietEurope(name) for name in best_countries}
for countries in best_countries:
diet_dict_eu_countries[countries].to_pickle("data/processed/" + countries + "_opt_diet.pkl")
Reset all parameters Reset all parameters Reset all parameters Reset all parameters Reset all parameters
#Loading pickle of solutions
import pandas as pd
Final_Diet_It = pd.read_pickle("data/processed/Italy_opt_diet.pkl")
Final_Diet_Ge = pd.read_pickle("data/processed/Germany_opt_diet.pkl")
Final_Diet_Fr = pd.read_pickle("data/processed/France_opt_diet.pkl")
Final_Diet_Sp = pd.read_pickle("data/processed/Spain_opt_diet.pkl")
Final_Diet_UK = pd.read_pickle("data/processed/United Kingdom_opt_diet.pkl")
We now have our five final dataframes containing the tons of each product per country, which can be considered our final diets.
Final_Diet_It.head()
Ethiopia | Kenya | United Republic of Tanzania | Madagascar | Zambia | Sudan | Uganda | Chad | Mozambique | Senegal | ... | Congo | Central African Republic | Liberia | Namibia | Togo | Botswana | Guinea-Bissau | Eswatini | Lesotho | Djibouti | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product | |||||||||||||||||||||
Apples | 5.725581 | 2.383284 | 1.951548 | 1.774081 | 1.370546 | 1.125086 | 1.035212 | 0.913324 | 0.749409 | 0.664980 | ... | 0.263661 | 0.232392 | 0.130483 | 0.113529 | 0.110133 | 0.063273 | 0.051926 | 0.039848 | 0.015356 | 0.014821 |
Barley | 88.599557 | 36.879736 | 30.198907 | 27.452727 | 21.208282 | 17.409951 | 16.019216 | 14.133077 | 11.596601 | 10.290117 | ... | 4.079979 | 3.596116 | 2.019140 | 1.756779 | 1.704242 | 0.979106 | 0.803521 | 0.616618 | 0.237626 | 0.229343 |
Beans | 0.985547 | 0.410236 | 0.335921 | 0.305373 | 0.235913 | 0.193661 | 0.178191 | 0.157211 | 0.128996 | 0.114463 | ... | 0.045384 | 0.040002 | 0.022460 | 0.019542 | 0.018957 | 0.010891 | 0.008938 | 0.006859 | 0.002643 | 0.002551 |
Beef Meat | 11842.726527 | 4929.557696 | 4036.559600 | 3669.489481 | 2834.821002 | 2327.114171 | 2141.220573 | 1889.108393 | 1550.068388 | 1375.436180 | ... | 545.353426 | 480.677541 | 269.889804 | 234.821184 | 227.798787 | 130.872874 | 107.403290 | 82.420742 | 31.762410 | 30.655362 |
Cherries | 5.972498 | 2.486064 | 2.035709 | 1.850589 | 1.429651 | 1.173605 | 1.079856 | 0.952711 | 0.781727 | 0.693657 | ... | 0.275031 | 0.242414 | 0.136110 | 0.118425 | 0.114883 | 0.066002 | 0.054165 | 0.041566 | 0.016018 | 0.015460 |
5 rows × 25 columns
Final_Diet_Ge.head()
Ethiopia | Kenya | United Republic of Tanzania | Madagascar | Zambia | Sudan | Uganda | Chad | Mozambique | Senegal | ... | Congo | Central African Republic | Liberia | Namibia | Togo | Botswana | Guinea-Bissau | Eswatini | Lesotho | Djibouti | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product | |||||||||||||||||||||
Apples | 6.958864 | 2.897727 | 2.373135 | 2.157500 | 1.667174 | 1.368922 | 1.259719 | 1.111616 | 0.912447 | 0.809860 | ... | 0.322228 | 0.284234 | 0.160407 | 0.139806 | 0.135680 | 0.078741 | 0.064954 | 0.050278 | 0.020519 | 0.019869 |
Barley | 107.683788 | 44.840387 | 36.722686 | 33.385876 | 25.798414 | 21.183160 | 19.493315 | 17.201517 | 14.119511 | 12.532036 | ... | 4.986260 | 4.398331 | 2.482189 | 2.163401 | 2.099565 | 1.218470 | 1.005123 | 0.778022 | 0.317518 | 0.307454 |
Beans | 1.197832 | 0.498787 | 0.408489 | 0.371371 | 0.286971 | 0.235633 | 0.216836 | 0.191343 | 0.157060 | 0.139401 | ... | 0.055465 | 0.048925 | 0.027611 | 0.024065 | 0.023355 | 0.013554 | 0.011181 | 0.008654 | 0.003532 | 0.003420 |
Beef Meat | 14393.634574 | 5993.624089 | 4908.565479 | 4462.548186 | 3448.364366 | 2831.463085 | 2605.588641 | 2299.253730 | 1887.295094 | 1675.104069 | ... | 666.492244 | 587.906269 | 331.783607 | 289.172645 | 280.639914 | 162.867774 | 134.350497 | 103.994858 | 42.441245 | 41.096099 |
Cherries | 7.258967 | 3.022692 | 2.475477 | 2.250543 | 1.739072 | 1.427957 | 1.314045 | 1.159555 | 0.951797 | 0.844785 | ... | 0.336124 | 0.296492 | 0.167324 | 0.145835 | 0.141532 | 0.082137 | 0.067755 | 0.052446 | 0.021404 | 0.020725 |
5 rows × 25 columns
Final_Diet_Fr.head()
Ethiopia | Kenya | United Republic of Tanzania | Madagascar | Zambia | Sudan | Uganda | Chad | Mozambique | Senegal | ... | Congo | Central African Republic | Liberia | Namibia | Togo | Botswana | Guinea-Bissau | Eswatini | Lesotho | Djibouti | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product | |||||||||||||||||||||
Apples | 6.434336 | 2.678929 | 2.193830 | 1.994428 | 1.541015 | 1.265216 | 1.164234 | 1.027280 | 0.843105 | 0.748241 | ... | 0.297319 | 0.262185 | 0.147680 | 0.128630 | 0.124815 | 0.072162 | 0.059413 | 0.045842 | 0.018323 | 0.017722 |
Barley | 65.543362 | 27.288907 | 22.347446 | 20.316240 | 15.697550 | 12.888122 | 11.859468 | 10.464390 | 8.588291 | 7.621952 | ... | 3.028637 | 2.670749 | 1.504342 | 1.310287 | 1.271429 | 0.735083 | 0.605212 | 0.466970 | 0.186649 | 0.180523 |
Beans | 1.107545 | 0.461125 | 0.377625 | 0.343302 | 0.265256 | 0.217782 | 0.200400 | 0.176826 | 0.145124 | 0.128795 | ... | 0.051178 | 0.045130 | 0.025420 | 0.022141 | 0.021484 | 0.012421 | 0.010227 | 0.007891 | 0.003154 | 0.003050 |
Beef Meat | 13308.706472 | 5541.065427 | 4537.692180 | 4125.251953 | 3187.417901 | 2616.958171 | 2408.088003 | 2124.814591 | 1743.869019 | 1547.652154 | ... | 614.970627 | 542.300760 | 305.459520 | 266.056395 | 258.166025 | 149.260006 | 122.889566 | 94.819159 | 37.899424 | 36.655543 |
Cherries | 6.711818 | 2.794458 | 2.288439 | 2.080438 | 1.607472 | 1.319779 | 1.214442 | 1.071582 | 0.879464 | 0.780509 | ... | 0.310141 | 0.273492 | 0.154049 | 0.134177 | 0.130198 | 0.075274 | 0.061975 | 0.047819 | 0.019113 | 0.018486 |
5 rows × 25 columns
Final_Diet_Sp.head()
Ethiopia | Kenya | United Republic of Tanzania | Madagascar | Zambia | Sudan | Uganda | Chad | Mozambique | Senegal | ... | Congo | Central African Republic | Liberia | Namibia | Togo | Botswana | Guinea-Bissau | Eswatini | Lesotho | Djibouti | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product | |||||||||||||||||||||
Apples | 5.401381 | 2.248050 | 1.840723 | 1.673290 | 1.292569 | 1.060987 | 0.976194 | 0.861198 | 0.706550 | 0.626894 | ... | 0.248265 | 0.218764 | 0.122617 | 0.106621 | 0.103418 | 0.059207 | 0.048501 | 0.037106 | 0.013999 | 0.013494 |
Barley | 55.021166 | 22.899758 | 18.750524 | 17.044967 | 13.166756 | 10.807742 | 9.944005 | 8.772588 | 7.197270 | 6.385857 | ... | 2.528954 | 2.228443 | 1.249037 | 1.086094 | 1.053465 | 0.603108 | 0.494058 | 0.377979 | 0.142600 | 0.137456 |
Beans | 0.929742 | 0.386958 | 0.316844 | 0.288024 | 0.222490 | 0.182628 | 0.168033 | 0.148238 | 0.121619 | 0.107908 | ... | 0.042734 | 0.037656 | 0.021106 | 0.018353 | 0.017801 | 0.010191 | 0.008349 | 0.006387 | 0.002410 | 0.002323 |
Beef Meat | 11172.154280 | 4649.840286 | 3807.330271 | 3461.013461 | 2673.535239 | 2194.532996 | 2019.149400 | 1781.291069 | 1461.419614 | 1296.660714 | ... | 513.508951 | 452.489693 | 253.619414 | 220.533495 | 213.908128 | 122.462185 | 100.319518 | 76.749427 | 28.955202 | 27.910743 |
Cherries | 5.634317 | 2.344997 | 1.920105 | 1.745451 | 1.348312 | 1.106742 | 1.018293 | 0.898337 | 0.737020 | 0.653929 | ... | 0.258972 | 0.228199 | 0.127905 | 0.111219 | 0.107878 | 0.061760 | 0.050593 | 0.038706 | 0.014603 | 0.014076 |
5 rows × 25 columns
Final_Diet_UK.head()
Ethiopia | Kenya | United Republic of Tanzania | Madagascar | Zambia | Sudan | Uganda | Chad | Mozambique | Senegal | ... | Congo | Central African Republic | Liberia | Namibia | Togo | Botswana | Guinea-Bissau | Eswatini | Lesotho | Djibouti | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product | |||||||||||||||||||||
Apples | 6.349053 | 2.643354 | 2.164676 | 1.967914 | 1.520503 | 1.248354 | 1.148709 | 1.013568 | 0.831831 | 0.738222 | ... | 0.293269 | 0.258600 | 0.145611 | 0.126813 | 0.123048 | 0.071093 | 0.058512 | 0.045121 | 0.017966 | 0.017373 |
Barley | 64.674623 | 26.926528 | 22.050475 | 20.046156 | 15.488601 | 12.716361 | 11.701323 | 10.324710 | 8.473444 | 7.519897 | ... | 2.987382 | 2.634231 | 1.483263 | 1.291777 | 1.253433 | 0.724187 | 0.596035 | 0.459623 | 0.183012 | 0.176967 |
Beans | 1.092865 | 0.455002 | 0.372607 | 0.338738 | 0.261725 | 0.214880 | 0.197728 | 0.174466 | 0.143183 | 0.127070 | ... | 0.050480 | 0.044513 | 0.025064 | 0.021828 | 0.021180 | 0.012237 | 0.010072 | 0.007767 | 0.003093 | 0.002990 |
Beef Meat | 13132.307414 | 5467.483665 | 4477.391686 | 4070.410773 | 3144.990472 | 2582.081699 | 2375.976265 | 2096.452436 | 1720.549294 | 1526.929677 | ... | 606.593699 | 534.885735 | 301.179472 | 262.297911 | 254.511983 | 147.047511 | 121.026126 | 93.327277 | 37.160966 | 35.933551 |
Cherries | 6.622857 | 2.757350 | 2.258029 | 2.052781 | 1.586075 | 1.302190 | 1.198247 | 1.057278 | 0.867704 | 0.770058 | ... | 0.305916 | 0.269752 | 0.151890 | 0.132282 | 0.128355 | 0.074159 | 0.061036 | 0.047067 | 0.018741 | 0.018122 |
5 rows × 25 columns
The code for the chord plot lives in a separate notebook (Chord.ipynb in our repository); we didn't include it here to keep this notebook readable.
The cells below contain the code for the sunburst plot generation.
def build_hierarchical_dataframe(df, levels, value_column, color_columns=None, prod=None):
"""
Build a hierarchy of levels for Sunburst or Treemap charts.
Levels are given starting from the bottom to the top of the hierarchy,
i.e. the last level corresponds to the root.
"""
df_all_trees = pd.DataFrame(columns=['id', 'label', 'parent', 'value', 'log_val','color'])
for i, level in enumerate(levels):
df_tree = pd.DataFrame(columns=['id', 'parent', 'value','log_val', 'color'])
dfg = df.groupby(levels[i:]).sum(numeric_only=True)
dfg = dfg.reset_index()
df_tree['id'] = dfg[level].copy()
if(i==0):
df_tree['label'] = dfg['Product'].copy()
else:
df_tree['label'] = dfg[level].copy()
if i < len(levels) - 1:
df_tree['parent'] = dfg[levels[i+1]].copy()
else:
df_tree['parent'] = 'Total'
df_tree['log_val'] = np.log(dfg[value_column]+2)
df_tree['value'] = dfg[value_column]
df_tree['color'] = dfg[color_columns[0]]
df_all_trees = pd.concat([df_all_trees, df_tree], ignore_index=True)  # DataFrame.append is deprecated
total = pd.Series(dict(id='Total', parent='',
value=df_tree['value'].sum(),
color=df_tree['color'].sum(),
label="Total",
# log_val=np.log1p(df[value_column].sum())))
log_val=sum(df_all_trees[df_all_trees.id.isin(prod)]['log_val'])))
df_all_trees = pd.concat([df_all_trees, total.to_frame().T], ignore_index=True)
return df_all_trees
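To see what build_hierarchical_dataframe is aiming for, here is a hypothetical miniature of the id/label/parent structure a sunburst trace needs (two products, two receiving countries, made-up values); every non-root parent must itself appear as an id, and the values of a node's children should sum to the node's value.

```python
import pandas as pd

# Hypothetical miniature of the sunburst hierarchy (not the real helper output).
rows = [
    {"id": "Total",     "label": "Total",    "parent": "",       "value": 150.0},
    {"id": "Barley",    "label": "Barley",   "parent": "Total",  "value": 100.0},
    {"id": "Rice",      "label": "Rice",     "parent": "Total",  "value": 50.0},
    {"id": "Barley-ET", "label": "Ethiopia", "parent": "Barley", "value": 70.0},
    {"id": "Barley-KE", "label": "Kenya",    "parent": "Barley", "value": 30.0},
    {"id": "Rice-ET",   "label": "Ethiopia", "parent": "Rice",   "value": 50.0},
]
tree = pd.DataFrame(rows)
# every non-root parent must exist as an id, or the sunburst renders empty
assert set(tree.parent) - {""} <= set(tree.id)
print(tree)
```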
def make_sunburst(dataset, path, EU=False):
df_val = dataset.copy()
prod = dataset.index.values
df_val = df_val.apply(pd.to_numeric)
df_val = df_val.stack().to_frame().reset_index()
df_val = df_val.rename(columns = {"level_1": "af_countries", 0:"val"})
levels = ['id', 'af_countries'] # levels used for the hierarchical chart
color_columns = ['val']
value_column = 'val'
id_n = df_val.index.values
df_val['id'] = df_val.Product + id_n.astype(str)
df_val = df_val[['Product', 'id', 'af_countries', 'val']]
df_all_trees = build_hierarchical_dataframe(df_val, levels, value_column, color_columns, prod)
df_all_trees = df_all_trees.replace({'United Republic of Tanzania':'Tanzania', 'Central African Republic':'Centr. Afr. Rep.'})
if (EU):
color_scale = "YlGnBu"
else:
color_scale = "RdBu"
fig =go.Figure(go.Sunburst(
ids=df_all_trees.id,
labels=df_all_trees.label,
parents=df_all_trees.parent,
maxdepth=2,
values=df_all_trees.log_val,
customdata = df_all_trees.value,
hovertemplate='<b>%{label}</b><br><br>Need: %{customdata:.3f} tons',
name='',
marker = dict(colorscale=color_scale, line=dict(color="#777", width=0.5)),
# branchvalues='total'
))
fig.update_layout(margin = dict(t=0, l=0, r=0, b=20), autosize = False, height=500, width=500, paper_bgcolor = "rgba(0,0,0,0)", plot_bgcolor="rgba(0,0,0,0)")
py.plot(fig, filename=path, include_plotlyjs=False)
The final important step of our analysis is to compute a more general diet that takes into account all the countries contributing to the cause. The first optimization problem gave us the kilocalories that every European country has to give up to save every African country. Given this, the second optimization problem could have been carried out on the total calories given jointly by all of Europe, but increasing the granularity enabled us to determine appropriate individual diets. In this final part we look at the global diet that combines the effort of all European countries. The result is simply obtained by summing up the 5 dataframes generated.
FINAL_Diet_df = Final_Diet_It + Final_Diet_UK + Final_Diet_Ge + Final_Diet_Sp + Final_Diet_Fr
FINAL_Diet_df.head()
Ethiopia | Kenya | United Republic of Tanzania | Madagascar | Zambia | Sudan | Uganda | Chad | Mozambique | Senegal | ... | Congo | Central African Republic | Liberia | Namibia | Togo | Botswana | Guinea-Bissau | Eswatini | Lesotho | Djibouti | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product | |||||||||||||||||||||
Apples | 30.869216 | 12.851344 | 10.523913 | 9.567214 | 7.391807 | 6.068565 | 5.584069 | 4.926986 | 4.043342 | 3.588196 | ... | 1.424742 | 1.256176 | 0.706798 | 0.615398 | 0.597095 | 0.344476 | 0.283307 | 0.218195 | 0.086163 | 0.083278 |
Barley | 381.522496 | 158.835317 | 130.070038 | 118.245966 | 91.359603 | 75.005337 | 69.017327 | 60.896283 | 49.975117 | 44.349860 | ... | 17.611212 | 15.527871 | 8.737970 | 7.608339 | 7.382133 | 4.259953 | 3.503950 | 2.699212 | 1.067404 | 1.031744 |
Beans | 5.313531 | 2.212107 | 1.811485 | 1.646808 | 1.272355 | 1.044585 | 0.961188 | 0.848084 | 0.695982 | 0.617638 | ... | 0.245241 | 0.216226 | 0.121661 | 0.105929 | 0.102778 | 0.059295 | 0.048766 | 0.037558 | 0.014831 | 0.014335 |
Beef Meat | 63849.529268 | 26581.571162 | 21767.539216 | 19788.713854 | 15289.128980 | 12552.150121 | 11550.022882 | 10190.920219 | 8363.201409 | 7421.782794 | ... | 2946.918947 | 2598.259997 | 1461.931816 | 1272.881630 | 1235.024837 | 712.510350 | 585.988997 | 451.311463 | 178.219247 | 172.251297 |
Cherries | 32.200458 | 13.405561 | 10.977759 | 9.979802 | 7.710581 | 6.330273 | 5.824883 | 5.139463 | 4.217712 | 3.742938 | ... | 1.486184 | 1.310349 | 0.737278 | 0.641937 | 0.622845 | 0.359332 | 0.295525 | 0.227604 | 0.089879 | 0.086869 |
5 rows × 25 columns
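A quick note on the `+` used above, shown on a hypothetical two-frame example: DataFrame addition aligns on both index and columns, and any label missing from one operand yields NaN. It works here only because the five diet dataframes share identical products and countries; `add(..., fill_value=0)` would be the safe variant otherwise.

```python
import pandas as pd

# Hypothetical illustration of label alignment in DataFrame addition.
a = pd.DataFrame({"Ethiopia": [1.0, 2.0]}, index=["Apples", "Rice"])
b = pd.DataFrame({"Ethiopia": [10.0]},     index=["Apples"])
print(a + b)                   # Rice row becomes NaN (missing in b)
print(a.add(b, fill_value=0))  # Rice row kept as 2.0
```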
Let's have a look at them and then visualize the data with an interactive Chord plot and a SunBurst plot.
IFrame(src="https://manuleo.github.io/mADAm_files/chord.html", width=800, height=600)
This chord plot provides a general overview of how food could be redistributed with minimal expenditure. Germany and France were found to contribute the most, each providing slightly more than 100,000 tons, mainly due to their higher GDP. On the other end, Ethiopia claims the highest share of food aid of all examined countries.
IFrame(src="https://manuleo.github.io/mADAm_files/sunburst_EU.html", width=800, height=600)
The SunBurst plot essentially shows the final result of our whole analysis: it wraps up the global diet that the analysis proposes for every African country. It is important to understand that by a diet we mean a list of food items together with the amount of food (in tons) that has to be shipped.
More specifically, we can analyse the Ethiopian case. Beyond the main items, the remaining products are considered only to a negligible extent, as they have either high prices or low caloric density: barley, cucumbers, lettuces, strawberries, cherries, apples, beans, peas and rice jointly contribute roughly 150 tons.
Notice that the diet is proportional to the food deficit of each African country. This is important because it means the previous analysis applies to any other African country and can serve as a guide for interpreting the results in a different context.
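The proportionality claim can be illustrated on a hypothetical miniature: if every country column is a common product profile scaled by that country's deficit, then normalising each column by its sum yields identical product shares everywhere (all numbers below are made up).

```python
import pandas as pd

# Hypothetical rank-1 diet matrix: one product profile scaled per country.
base = pd.Series({"Apples": 5.7, "Barley": 88.6, "Chicken": 8739.8})
deficits = {"Ethiopia": 1.0, "Kenya": 0.416, "Djibouti": 0.0026}
diet = pd.DataFrame({c: base * d for c, d in deficits.items()})
shares = diet.div(diet.sum(axis=0), axis=1)  # per-country product shares
print(shares.round(4))  # identical columns, i.e. proportional diets
```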