import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
Every year in France, just over 810,000 students in the fourth year of collège ("classe de troisième"), aged around 15, take the brevet des collèges, a diploma that completes the first cycle of secondary education. Since schooling is compulsory up to the age of 16 in France, all teenagers take the brevet exams, making it a universal measure of the level of French students of that age. In 2019, 86.5% of the 813,200 candidates obtained the national brevet diploma. Here is the evolution of the success rate in recent years.
tx = pd.read_csv("data/college/tx_succes.csv", sep=";", decimal=',')
plt.figure(figsize=(10,7))
plt.plot(tx.annee, tx.taux, '-o', color='orange')
plt.xlabel("Year")
plt.ylabel("Success rate (%)")
plt.title("Evolution of the national success rate in the 'brevet des collèges'", fontsize=14)
plt.ylim(50,100)
plt.show()
# Source : https://www.data.gouv.fr/fr/datasets/le-diplome-national-du-brevet-00000000/#_
However, this rate is not the same for everyone. There are two streams in the examination: the general stream, in which 90% of the students are enrolled, and the vocational stream, the so-called "filière professionnelle". In 2019, the pass rate was 87.8% in the general stream compared with 77.2% in the vocational stream. There is a similarly large gap between girls and boys, with a pass rate of 90% for girls compared with 83% for boys. Likewise, there are major inequalities among graduating students. Here is the distribution of the types of honors, the "mentions", awarded during the last session of the brevet.
mentions = pd.read_csv("data/college/tx_mentions.csv", sep=";", decimal=',')
plt.figure(figsize=(10,7))
plt.pie(mentions.taux,
        labels=mentions.mention,
        colors=['darkgreen', 'green', 'lightgreen', 'lightyellow', 'red'],
        autopct='%1.1f%%')
plt.title("Breakdown of students in the 2019 'brevet' according to their mention")
plt.axis('equal')
plt.show()
Education is a major issue and is a central element of the policy of a country like France. France spends more than 150 billion euros a year on education. Around 20% of this sum, 30 billion euros, is allocated to training at the first cycle of secondary education, in other words, at the collège.
In 2017, the French national education system identified 7,200 public and private collèges. 79% of secondary school students were enrolled in a public school.
First of all, predicting the brevet pass rate per collège is important because it makes it possible to grasp spatial inequalities across the territory and to answer questions such as whether there is a divide between urban and rural collèges. Above all, however, predicting the success rate will make it possible to identify the factors determining the success of pupils in classe de troisième in the brevet des collèges and, more generally, the determining factors in educational success. This will enable those in charge of education policy to (re)allocate human and financial resources where these essential factors of success are weaker than elsewhere, thus enabling the national education system to achieve its pedagogical objectives in secondary education. For example, small classes could be set up in a targeted manner outside the REP/REP+ zones (a kind of priority education network).
This subject, through the evaluation of the brevet success rate, aims to identify the criteria for success in schools with high success rates so that investment can be made in collèges where students are on average less successful. The social impact of this system of allocating teaching and educational resources could help to move closer to the republican principle of equal opportunities nationwide.
If this resource allocation program works well, we can expect to see the lowest-performing collèges catch up. In other words, collèges that are currently, or repeatedly year after year, below the national average should be expected to catch up with the national average.
To capture some of this "catch-up" effect, one can first look at the evolution of the variance of the distribution of brevet success rates. Furthermore, it is also possible to look at the evolution between two dates of the gap between the national average and the mean of the group located below it: $$ \frac{\bar{y}^{1} - \bar{y}_{low}^{1}}{\bar{y}^{0} - \bar{y}_{low}^{0}}$$ where $\bar{y}^0$ is the national average success rate at time 0, $\bar{y}_{low}^{0}$ is the average success rate among the schools in the first quartile of the distribution (the 25% of schools with the lowest success rates), and similarly at time 1.
Thus, we expect this metric to be as small as possible, meaning that the bottom of the distribution is now more concentrated around the national average. Conversely, a value close to 1 would mean that nothing has changed between the two dates: even if the average of the first quartile has increased, this may only reflect a general upward trend.
Another possible indicator would be to do the same thing but with a fixed group of collèges. For example, at date 0 we identify a group of collèges in difficulty (i.e., with a very low success rate compared to the others), and it is within this group that the low average $\bar{y}_{low}^{0}$ is calculated. At date 1, the low average $\bar{y}_{low}^{1}$ is calculated not on a new group of collèges but on the same group as at date 0. The metric is thus more accurate in the sense that it tracks the specific evolution of a targeted group of collèges.
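As an illustration, here is a minimal sketch of this catch-up metric on invented success rates (the function name and the toy numbers are ours, not taken from the data). The low group is fixed at date 0, as in the fixed-group variant:

```python
import numpy as np

def catch_up_ratio(rates_t0, rates_t1):
    """Gap between the national mean and the mean of the bottom-quartile
    group at date 1, divided by the same gap at date 0. The group is
    identified at date 0; values well below 1 indicate catch-up."""
    rates_t0, rates_t1 = np.asarray(rates_t0), np.asarray(rates_t1)
    low_group = rates_t0 <= np.quantile(rates_t0, 0.25)  # bottom 25% at date 0
    gap_t0 = rates_t0.mean() - rates_t0[low_group].mean()
    gap_t1 = rates_t1.mean() - rates_t1[low_group].mean()
    return gap_t1 / gap_t0

# Toy example: the weakest collèges close half of their gap
t0 = np.array([60.0, 70.0, 85.0, 90.0, 95.0])
t1 = np.array([75.0, 80.0, 85.0, 90.0, 95.0])
print(round(catch_up_ratio(t0, t1), 3))  # 0.5
```

Note that with ties at the quartile boundary the "bottom 25%" group may contain slightly more than a quarter of the schools; this does not change the interpretation of the ratio.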
However, one must be aware that such a simplistic metric as this one does not make it possible to assess the effectiveness of a public policy as a whole, because on the one hand it takes only two dates, while on the other hand it is necessary to look over several years at the effects of a public policy. Moreover, in order to identify the real effects of a policy, it is necessary to evaluate a more advanced econometric model.
In order to achieve our goal, it seems crucial to predict the brevet exam pass rate as accurately as possible. For this, we can use the "root mean square error" defined as follows: $$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$
In fact we will use the normalized RMSE, which is the RMSE divided by the standard deviation of the target variable $\sigma_y$: $$NRMSE = \frac{\sqrt{\frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{y}_i)^2}}{\sigma_y}$$ The standard deviation of the target variable can be seen as the RMSE of a model that always predicts the average value. Thus, dividing the classic RMSE by $\sigma_y$ gives a ratio that allows us to easily compare the performance of our model against this naive baseline.
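A minimal sketch of this metric (with synthetic numbers, not our model's predictions) makes the baseline interpretation concrete: a model that always predicts the mean gets an NRMSE of exactly 1.

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Root mean square error divided by the standard deviation of the target."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)

# Toy example: predicting the mean everywhere gives NRMSE = 1
y = np.array([80.0, 85.0, 90.0, 95.0])
print(nrmse(y, np.full_like(y, y.mean())))  # 1.0
```

Any model worth keeping should therefore score well below 1 on this scale.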
import geopandas as gpd
import folium
pd.set_option('display.max_columns', 500)
pd.set_option("display.max_rows", 500)
import warnings
warnings.filterwarnings("ignore")
import sys
sys.path.insert(0, './tools')
First of all, data related to the collèges and their exam pass rates are loaded.
data_college = pd.read_csv('./data/college/data_college.csv', index_col=0)
data_college.head()
columns_to_drop = ['Nb 3èmes insertion',
                   'Nb dispositifs relais',
                   'Nb dispositifs initiation aux métiers en alternance',
                   'Nb 2nd cycle général ou technologique',
                   'Nb CAP ou BAC professionnel',
                   'Nb 3émes générales et insertion rentrée précédente',
                   'Inscription à DCOL',
                   'Inscription à DCOL Rs N-1',
                   'Situation relative à un QPV (quartier prioritaire de la Ville)']
data_college_filtered = data_college.drop(columns_to_drop, axis=1)
data_college_filtered.to_csv('./data/college/data_college_filtered.csv')
data_college = pd.read_csv('./data/college/data_college_filtered.csv', index_col=0)
print('shape of the college table:', data_college.shape)
data_college.head()
shape of the college table: (4186, 44)
Appartenance EP | Name | Coordonnée X | Coordonnée Y | Etablissement sensible | CATAEU2010 | Situation relative à une zone rurale ou autre | Commune code | City_name | Commune et arrondissement code | Commune et arrondissement nom | Département code | Département nom | Académie code | Académie nom | Région code | Région nom | Région 2016 code | Région 2016 nom | Nb élèves | Nb 6èmes | Nb 5èmes | Nb 4èmes générales | Nb 3èmes générales | Nb 6ème SEGPA | Nb 5ème SEGPA | Nb 4ème SEGPA | Nb 3ème SEGPA | Nb SEGPA | Nb 3èmes générales retardataires | Nb divisions | Nb 6èmes provenant d'une école EP | Nb 5èmes 4èmes et 3èmes générales Latin ou Grec | Nb 5èmes 4èmes et 3èmes générales | Nb élèves pratiquant langue rare | Nb 6èmes bilangues | Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales | Nb 6èmes 5èmes 4èmes et 3èmes générales | Nb 3émes générales et insertion rentrée précédente passés en 2nde GT | Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel | Longitude | Latitude | Position | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HEP | jules simon | 267972.0 | 6744464.8 | NON | 111.0 | urbain | 56260 | vannes | 56260 | VANNES | 56 | MORBIHAN | 14.0 | RENNES | 53 | BRETAGNE | 53 | BRETAGNE | 774.0 | 175.0 | 185.0 | 198.0 | 216.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 37.0 | 30.0 | 43.0 | 32.0 | 0.0 | 774.0 | 147.0 | 38.0 | 214.0 | -2.760887 | 47.658668 | -2.760887 | 47.658668 | 47.6586681136,-2.76088651324 | 95.6 |
1 | HEP | raoul blanchard | 942287.5 | 6538643.7 | NON | 111.0 | urbain | 74010 | annecy | 74010 | ANNECY | 74 | HAUTE SAVOIE | 8.0 | GRENOBLE | 82 | RHONE-ALPES | 84 | AUVERGNE-ET-RHONE-ALPES | 829.0 | 211.0 | 200.0 | 192.0 | 213.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 51.0 | 31.0 | 144.0 | 0.0 | 0.0 | 816.0 | 150.0 | 41.0 | 211.0 | 6.125997 | 45.904308 | 6.125997 | 45.904308 | 45.9043076805,6.12599674762 | 88.9 |
2 | HEP | georges duhamel | 637692.1 | 6878029.4 | NON | 111.0 | urbain | 95306 | herblay | 95306 | HERBLAY | 95 | VAL-D'OISE | 25.0 | VERSAILLES | 11 | ILE-DE-FRANCE | 11 | ILE-DE-FRANCE | 372.0 | 102.0 | 85.0 | 84.0 | 85.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 14.0 | 0.0 | 12.0 | 0.0 | 356.0 | 57.0 | 20.0 | 84.0 | 2.148468 | 48.999161 | 2.148468 | 48.999161 | 48.9991613692,2.14846829582 | 76.3 |
3 | HEP | edouard herriot | 585352.4 | 6816567.4 | NON | 111.0 | urbain | 28218 | luce | 28218 | LUCE | 28 | EURE-ET-LOIR | 18.0 | ORLEANS-TOURS | 24 | CENTRE-VAL-DE-LOIRE | 24 | CENTRE-VAL-DE-LOIRE | 517.0 | 124.0 | 131.0 | 133.0 | 116.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 | 20.0 | 0.0 | 23.0 | 0.0 | 504.0 | 76.0 | 29.0 | 119.0 | 1.449799 | 48.439256 | 1.449799 | 48.439256 | 48.4392557302,1.44979875281 | 89.2 |
4 | HEP | henri dheurle | 372116.5 | 6400997.6 | NON | 111.0 | urbain | 33529 | la teste de buch | 33529 | LA TESTE-DE-BUCH | 33 | GIRONDE | 4.0 | BORDEAUX | 72 | AQUITAINE | 75 | NOUVELLE-AQUITAINE | 777.0 | 194.0 | 194.0 | 174.0 | 201.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 | 27.0 | 1.0 | 29.0 | 0.0 | 763.0 | 111.0 | 40.0 | 189.0 | -1.135637 | 44.630780 | -1.135637 | 44.630780 | 44.6307801744,-1.13563658848 | 87.4 |
Separate the database into train and test sets.
from sklearn.model_selection import train_test_split
y = data_college['target'].values
X = data_college.drop('target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train['target'] = y_train
X_train.to_csv('./data/data_college_filtered_TRAIN.csv')
X_test['target'] = y_test
X_test.to_csv('./data/data_college_filtered_TEST.csv')
Let's have a look at the main variables that we have:
Important note: in this database there are no missing data, so we don't have to worry about this problem afterwards.
# number of unique values
data_college.nunique()
Appartenance EP                                        3
Name                                                2355
Coordonnée X                                        4183
Coordonnée Y                                        4182
Etablissement sensible                                 2
CATAEU2010                                             8
Situation relative à une zone rurale ou autre          3
Commune code                                        2771
City_name                                           2748
Commune et arrondissement code                      2812
Commune et arrondissement nom                       2789
Département code                                     101
Département nom                                      101
Académie code                                         31
Académie nom                                          31
Région code                                           27
Région nom                                            27
Région 2016 code                                      18
Région 2016 nom                                       18
Nb élèves                                            733
Nb 6èmes                                             236
Nb 5èmes                                             223
Nb 4èmes générales                                   220
Nb 3èmes générales                                   218
Nb 6ème SEGPA                                         34
Nb 5ème SEGPA                                         36
Nb 4ème SEGPA                                         35
Nb 3ème SEGPA                                         36
Nb SEGPA                                             118
Nb 3èmes générales retardataires                      79
Nb divisions                                          46
Nb 6èmes provenant d'une école EP                    267
Nb 5èmes 4èmes et 3èmes générales Latin ou Grec      151
Nb 5èmes 4èmes et 3èmes générales                     98
Nb élèves pratiquant langue rare                     704
Nb 6èmes bilangues                                   180
Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales     93
Nb 6èmes 5èmes 4èmes et 3èmes générales              211
Nb 3émes générales et insertion rentrée précédente passés en 2nde GT              4184
Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel  4184
Longitude                                           4184
Latitude                                            4184
Position                                            4184
target                                               359
dtype: int64
# Python type of each feature
data_college.dtypes
Appartenance EP                                       object
Name                                                  object
Coordonnée X                                         float64
Coordonnée Y                                         float64
Etablissement sensible                                object
CATAEU2010                                           float64
Situation relative à une zone rurale ou autre         object
Commune code                                          object
City_name                                             object
Commune et arrondissement code                        object
Commune et arrondissement nom                         object
Département code                                      object
Département nom                                       object
Académie code                                        float64
Académie nom                                          object
Région code                                            int64
Région nom                                            object
Région 2016 code                                       int64
Région 2016 nom                                       object
Nb élèves                                            float64
Nb 6èmes                                             float64
Nb 5èmes                                             float64
Nb 4èmes générales                                   float64
Nb 3èmes générales                                   float64
Nb 6ème SEGPA                                        float64
Nb 5ème SEGPA                                        float64
Nb 4ème SEGPA                                        float64
Nb 3ème SEGPA                                        float64
Nb SEGPA                                             float64
Nb 3èmes générales retardataires                     float64
Nb divisions                                         float64
Nb 6èmes provenant d'une école EP                    float64
Nb 5èmes 4èmes et 3èmes générales Latin ou Grec      float64
Nb 5èmes 4èmes et 3èmes générales                    float64
Nb élèves pratiquant langue rare                     float64
Nb 6èmes bilangues                                   float64
Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales    float64
Nb 6èmes 5èmes 4èmes et 3èmes générales              float64
Nb 3émes générales et insertion rentrée précédente passés en 2nde GT               float64
Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel   float64
Longitude                                            float64
Latitude                                             float64
Position                                              object
target                                               float64
dtype: object
In order to add socio-economic context to our analysis, it was decided to use a second database, this time with city-level information.
cities_data = pd.read_csv("./data/donnees_geographiques/cities_data.csv", index_col=0)
keep_col = ['insee_code',
            'LIBGEO',
            'REG',
            'DEP',
            'population',
            'SUPERF',
            'med_std_living',
            'poverty_rate',
            'unemployment_rate']
cities_data_filtered = cities_data[keep_col]
cities_data_filtered.to_csv('./data/donnees_geographiques/cities_data_filtered.csv')
cities_data = pd.read_csv("./data/donnees_geographiques/cities_data_filtered.csv", index_col=0)
print('Shape of the cities table', cities_data.shape)
cities_data.head()
Shape of the cities table (36729, 9)
insee_code | LIBGEO | REG | DEP | population | SUPERF | med_std_living | poverty_rate | unemployment_rate | |
---|---|---|---|---|---|---|---|---|---|
0 | 01001 | L'Abergement-Clémenciat | 84 | 01 | 767.0 | 15.95 | 22228.000000 | NaN | 0.087766 |
1 | 01002 | L'Abergement-de-Varey | 84 | 01 | 241.0 | 9.15 | 22883.333333 | NaN | 0.081301 |
2 | 01004 | Ambérieu-en-Bugey | 84 | 01 | 14127.0 | 24.60 | 19735.200000 | 17.227132 | 0.158234 |
3 | 01005 | Ambérieux-en-Dombes | 84 | 01 | 1619.0 | 15.92 | 23182.666667 | NaN | 0.078759 |
4 | 01006 | Ambléon | 84 | 01 | 109.0 | 5.88 | NaN | NaN | 0.137931 |
Let us briefly present the main variables of this second database.
# proportion of NaN values
cities_data.isna().sum() / cities_data.shape[0]
insee_code           0.000000
LIBGEO               0.000000
REG                  0.000000
DEP                  0.000000
population           0.035068
SUPERF               0.035068
med_std_living       0.121974
poverty_rate         0.880476
unemployment_rate    0.035258
dtype: float64
The variables in this database are quite well filled in, except for the poverty rate, which is missing for more than 88% of the cities. We will later fill in the missing values with the average value at the department level, so that no missing data remain.
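The intended imputation can be sketched on a toy table (the values here are invented for illustration; the actual columns are those of our database):

```python
import numpy as np
import pandas as pd

# Toy data: poverty_rate is missing for one city in department '01'
toy = pd.DataFrame({
    'DEP': ['01', '01', '02', '02'],
    'poverty_rate': [10.0, np.nan, 20.0, 30.0],
})
# Replace each missing value by the mean of its department
toy['poverty_rate'] = (toy.groupby('DEP')['poverty_rate']
                          .transform(lambda x: x.fillna(x.mean())))
print(toy['poverty_rate'].tolist())  # [10.0, 10.0, 20.0, 30.0]
```

The same groupby/transform pattern is applied later when the two databases are merged.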
# Python type of each feature
cities_data.dtypes
insee_code            object
LIBGEO                object
REG                    int64
DEP                   object
population           float64
SUPERF               float64
med_std_living       float64
poverty_rate         float64
unemployment_rate    float64
dtype: object
As mentioned earlier, the objective here is to arrive at an accurate estimate of a college's pass rate on the Brevet exam. Therefore, the corresponding variable is named 'target'.
First, let's look at the simple distribution of these success rates.
plt.figure(figsize=(10,7))
plt.hist(data_college.target, bins=40)
plt.axvline(x=data_college.target.mean(), color="orange", label='mean')
plt.axvline(x=data_college.target.quantile(.5), color="red", label='median')
plt.xlabel("Success rate (%)")
plt.ylabel("Count")
plt.title("Distribution of the brevet success rates", fontsize=14)
plt.legend(loc='best')
plt.show()
data_college.target.describe()
count    4186.000000
mean       87.429025
std         7.611328
min        41.300000
25%        82.925000
50%        88.600000
75%        93.100000
max       100.000000
Name: target, dtype: float64
print("The standard deviation of the success rates in 2019 is %.2f" % data_college.target.std())
The standard deviation of the success rates in 2019 is 7.61
In order to integrate the socio-economic data available in our second database, it is possible to join the two databases at the city level, thanks to their unique code.
# merge on the city code
data_college = pd.merge(data_college, cities_data,
                        left_on='Commune et arrondissement code', right_on='insee_code', how='left')
# fill na by taking the average value at the departement level
city_col_with_na = []
for col in cities_data.columns:
    if cities_data[col].isna().sum() > 0:
        city_col_with_na.append(col)
#print(city_col_with_na)
for col in city_col_with_na:
    data_college[col] = (data_college[['Département code', col]]
                         .groupby('Département code')
                         .transform(lambda x: x.fillna(x.mean())))
data_college.head()
Appartenance EP | Name | Coordonnée X | Coordonnée Y | Etablissement sensible | CATAEU2010 | Situation relative à une zone rurale ou autre | Commune code | City_name | Commune et arrondissement code | Commune et arrondissement nom | Département code | Département nom | Académie code | Académie nom | Région code | Région nom | Région 2016 code | Région 2016 nom | Nb élèves | Nb 6èmes | Nb 5èmes | Nb 4èmes générales | Nb 3èmes générales | Nb 6ème SEGPA | Nb 5ème SEGPA | Nb 4ème SEGPA | Nb 3ème SEGPA | Nb SEGPA | Nb 3èmes générales retardataires | Nb divisions | Nb 6èmes provenant d'une école EP | Nb 5èmes 4èmes et 3èmes générales Latin ou Grec | Nb 5èmes 4èmes et 3èmes générales | Nb élèves pratiquant langue rare | Nb 6èmes bilangues | Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales | Nb 6èmes 5èmes 4èmes et 3èmes générales | Nb 3émes générales et insertion rentrée précédente passés en 2nde GT | Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel | Longitude | Latitude | Position | target | insee_code | LIBGEO | REG | DEP | population | SUPERF | med_std_living | poverty_rate | unemployment_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HEP | jules simon | 267972.0 | 6744464.8 | NON | 111.0 | urbain | 56260 | vannes | 56260 | VANNES | 56 | MORBIHAN | 14.0 | RENNES | 53 | BRETAGNE | 53 | BRETAGNE | 774.0 | 175.0 | 185.0 | 198.0 | 216.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 37.0 | 30.0 | 43.0 | 32.0 | 0.0 | 774.0 | 147.0 | 38.0 | 214.0 | -2.760887 | 47.658668 | -2.760887 | 47.658668 | 47.6586681136,-2.76088651324 | 95.6 | 56260 | Vannes | 53.0 | 56 | 53200.0 | 32.30 | 21046.666667 | 16.432284 | 0.172886 |
1 | HEP | raoul blanchard | 942287.5 | 6538643.7 | NON | 111.0 | urbain | 74010 | annecy | 74010 | ANNECY | 74 | HAUTE SAVOIE | 8.0 | GRENOBLE | 82 | RHONE-ALPES | 84 | AUVERGNE-ET-RHONE-ALPES | 829.0 | 211.0 | 200.0 | 192.0 | 213.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 51.0 | 31.0 | 144.0 | 0.0 | 0.0 | 816.0 | 150.0 | 41.0 | 211.0 | 6.125997 | 45.904308 | 6.125997 | 45.904308 | 45.9043076805,6.12599674762 | 88.9 | 74010 | Annecy | 84.0 | 74 | 125694.0 | 66.93 | 22531.250000 | 12.108427 | 0.107961 |
2 | HEP | georges duhamel | 637692.1 | 6878029.4 | NON | 111.0 | urbain | 95306 | herblay | 95306 | HERBLAY | 95 | VAL-D'OISE | 25.0 | VERSAILLES | 11 | ILE-DE-FRANCE | 11 | ILE-DE-FRANCE | 372.0 | 102.0 | 85.0 | 84.0 | 85.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 14.0 | 0.0 | 12.0 | 0.0 | 356.0 | 57.0 | 20.0 | 84.0 | 2.148468 | 48.999161 | 2.148468 | 48.999161 | 48.9991613692,2.14846829582 | 76.3 | 95306 | Herblay | 11.0 | 95 | 28341.0 | 12.74 | 25128.857143 | 9.647151 | 0.096764 |
3 | HEP | edouard herriot | 585352.4 | 6816567.4 | NON | 111.0 | urbain | 28218 | luce | 28218 | LUCE | 28 | EURE-ET-LOIR | 18.0 | ORLEANS-TOURS | 24 | CENTRE-VAL-DE-LOIRE | 24 | CENTRE-VAL-DE-LOIRE | 517.0 | 124.0 | 131.0 | 133.0 | 116.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 | 20.0 | 0.0 | 23.0 | 0.0 | 504.0 | 76.0 | 29.0 | 119.0 | 1.449799 | 48.439256 | 1.449799 | 48.439256 | 48.4392557302,1.44979875281 | 89.2 | 28218 | Lucé | 24.0 | 28 | 16107.0 | 6.06 | 17900.333333 | 21.074758 | 0.200058 |
4 | HEP | henri dheurle | 372116.5 | 6400997.6 | NON | 111.0 | urbain | 33529 | la teste de buch | 33529 | LA TESTE-DE-BUCH | 33 | GIRONDE | 4.0 | BORDEAUX | 72 | AQUITAINE | 75 | NOUVELLE-AQUITAINE | 777.0 | 194.0 | 194.0 | 174.0 | 201.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 | 27.0 | 1.0 | 29.0 | 0.0 | 763.0 | 111.0 | 40.0 | 189.0 | -1.135637 | 44.630780 | -1.135637 | 44.630780 | 44.6307801744,-1.13563658848 | 87.4 | 33529 | La Teste-de-Buch | 75.0 | 33 | 26110.0 | 180.20 | 21434.666667 | 10.666966 | 0.152706 |
plt.figure(figsize=(10,7))
plt.scatter(data_college['unemployment_rate'], data_college['target'])
plt.xlabel("Unemployment rate in the city")
plt.ylabel("Success rate (%)")
plt.title("Success rate according to the unemployment rate in the city", fontsize=14)
plt.ylim(50,100)
plt.show()
At first glance, it would appear that the unemployment rate in the city has a negative effect on the exam pass rate. This supports the idea that socio-economic variables in the environment in which the school is located may be crucial explanatory factors.
Let's look at the distribution of the success rate by department.
# Create a dynamic map
def plot_per_department(column, data_college,
                        cmap='OrRd_r',
                        path_geo_data='./data/donnees_geographiques/fichiers_geopandas/'):
    '''
    Returns an interactive map of France colored by department according
    to the mean value of the feature "column" (for example the target).

    Parameters:
        column (string): the name of the column to plot per department
        data_college: the data frame with the data per college
        path_geo_data: path to a geojson file of France
    '''
    dep_df = data_college.groupby('Département code').agg({column: 'mean'}).reset_index()
    # rename the department code because it is the key of the geojson file
    dep_df.columns = ['code', column]
    m = folium.Map(location=[46.45, 1], zoom_start=6)  # map centered on France
    # Add the choropleth layer
    folium.Choropleth(
        geo_data=path_geo_data + 'departements.geojson.txt',  # geoJSON file or url to geojson
        name='choropleth',
        data=dep_df,  # Pandas dataframe
        columns=['code', column],  # key and value of interest from the dataframe
        key_on='feature.properties.code',  # key linking the json file and the dataframe
        fill_color=cmap,  # colormap
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=column
    ).add_to(m)
    display(m)
    return None
plot_per_department(column='target', data_college=data_college)
Now we can look in more detail, at the city level.
Thanks to the function below, you can choose a department and display an interactive map of its collèges. You can click on the little circles (each of them corresponds to a collège) and see some information about it:
# Create a dynamic map for the cities in each department
def plot_cities_in_dep(column, dep_code, dep_name,
                       data_college,
                       url_parent='https://france-geojson.gregoiredavid.fr/repo/departements/',
                       cmap='OrRd_r'):
    '''
    column (string): name of the column to plot
    dep_code (string): the code of the department to plot
    dep_name (string): the name of the department
    data_college: the data frame with data per college
    url_parent: the url of France GEOJSON
    '''
    cities_df = data_college[data_college['Département code'] == dep_code]
    cities_df_group = cities_df.groupby('Commune et arrondissement code').agg({column: 'mean'}).reset_index()
    cities_df_group.columns = ['code', column]
    # url to the geojson of the department
    url = url_parent + dep_code + '-' + dep_name + '/communes-' + dep_code + '-' + dep_name + '.geojson'
    coords = gpd.read_file(url).loc[0].geometry.centroid.coords[0]  # coordinates where the map is centered
    m = folium.Map(location=[coords[1], coords[0]], zoom_start=10)  # map centered on the department
    # Add the choropleth layer
    folium.Choropleth(
        geo_data=url,  # url to the geojson of the department
        name='choropleth',
        data=cities_df_group,  # Pandas dataframe
        columns=['code', column],  # key and value of interest from the dataframe
        key_on='feature.properties.code',  # key linking the json file and the dataframe
        fill_color=cmap,  # colormap
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=column
    ).add_to(m)
    # add the college markers
    for ix, row in cities_df.iterrows():
        # Create a popup tab with the college name and its success rate
        popup_df = pd.DataFrame(data=[['College', row['Name']],
                                      ['Success rate', str(row['target']) + '%'],
                                      ['City', row['Commune et arrondissement nom']],
                                      ['Department', row['Département nom']],
                                      ['Appartenance EP', row['Appartenance EP']]])
        popup_html = popup_df.to_html(classes='table table-striped table-hover table-condensed table-responsive',
                                      index=False, header=False)
        # Create a marker on the map
        folium.CircleMarker(location=[row['Latitude'], row['Longitude']],
                            radius=2, popup=folium.Popup(popup_html),
                            color='red', alpha=0.5, fill_color='#0000FF').add_to(m)
    display(m)
    return None
plot_cities_in_dep(column='target',
                   dep_code='75',
                   dep_name='paris',
                   data_college=data_college)
The possible links between the following features and the target are worth investigating:
print('Mean median standard of living: %.5f' % cities_data.med_std_living.mean())
Mean median standard of living: 20633.28419
plot_per_department(column='med_std_living', data_college=data_college, cmap='Blues_r')
plot_cities_in_dep(column='med_std_living', dep_code='75', dep_name='paris',
data_college=data_college, cmap='Blues_r')
plot_cities_in_dep(column='med_std_living', dep_code='93', dep_name='seine-saint-denis',
data_college=data_college, cmap='Blues_r')
print('Average unemployment rate: %.2f%%' %(cities_data.unemployment_rate.mean()*100))
Average unemployment rate: 11.05%
plot_per_department(column='unemployment_rate', data_college=data_college, cmap='Purples')
plot_cities_in_dep(column='unemployment_rate', dep_code='75', dep_name='paris',
data_college=data_college, cmap='Purples')
print('Average poverty rate: %.02f%%' %cities_data.poverty_rate.mean())
Average poverty rate: 13.85%
plot_per_department('poverty_rate', data_college, cmap='Greens')
plot_cities_in_dep('poverty_rate', dep_code='62',
dep_name='pas-de-calais', data_college=data_college, cmap="Greens")
fig, ax = plt.subplots(1,3, figsize=(20,5))
sns.scatterplot(x='med_std_living', y='target', data=data_college, ax=ax[0])
sns.scatterplot(x='unemployment_rate', y='target', color='purple',
data=data_college.dropna(subset=['unemployment_rate']), ax=ax[1])
sns.scatterplot(x='poverty_rate', y='target', color='green',
data=data_college.dropna(subset=['poverty_rate']), ax=ax[2])
plt.show()
The priority education policy aims to reduce the achievement gaps between pupils enrolled in priority education and those who are not. Two types of networks have been defined: the REP+, which covers neighbourhoods or isolated sectors with the greatest concentration of social difficulties that have a strong impact on educational success, and the REPs, which are more socially mixed but still face more significant social difficulties than collèges and schools outside priority education. Not all collèges are in a priority education network; those that are not are labelled "HEP" (Hors éducation prioritaire, outside the priority education network).
ax = sns.boxplot(x="Appartenance EP", y="target", data=data_college)
ax.axes.set_title("Success rate according to the type of education network")
ax.set_xlabel("Type of education network")
ax.set_ylabel("Success rate");
It can easily be observed that the success rate is significantly lower when the collège is part of a priority education network (this is all the more the case for the REP+ network). However, no causal link can be drawn from this, since collèges are placed in a priority education system precisely when they have a certain amount of ground to make up in relation to other collèges.
Alongside the priority education networks, there exists another label for collèges in difficulty: the établissements sensibles. These so-called "sensitive" schools are secondary schools in which a climate of insecurity prevails that seriously compromises pupils' schooling. They are not necessarily in priority education.
The development of violence in schools has led the Ministers of National Education and of the Interior to strengthen their collaboration. Since 1992, this collaboration has led to the classification of certain public secondary schools as "sensitive" schools, without implying, however, that violence is present only in these schools.
Like REPs, sensitive establishments benefit from special measures. They are the subject of exceptional efforts in terms of innovative and adapted pedagogy, by strengthening the timetable potential (class splitting, tutoring, directed studies, etc.), by the allocation of additional jobs and a strengthened presence of adults (more senior educational advisers, boarding school teachers, day school supervisors, etc.), and by appointing two head teachers per class.
sns.catplot(x="Appartenance EP", y="target", hue="Etablissement sensible", kind="box", data=data_college)
plt.title('Success rate according to the type of education network');
Another determinant of academic success is the number of students per class. It is easy to understand that a smaller class size makes it much easier for the teacher to spend more time with each student, thus ensuring that all students progress through the classroom without some being left behind. This is true at all levels of education but mainly during the first years of schooling.
Unfortunately, the number of students per class is not directly available in our database. However, it is possible to create an "average number of pupils per class" variable from the total number of pupils in the school and the number of classes ("Nb divisions").
data_college['average_class_size'] = data_college['Nb élèves'] / data_college['Nb divisions']
plt.scatter(data_college['average_class_size'], data_college['target'])
plt.xlabel("Average class size")
plt.ylabel("Success rate (%)")
plt.title("Success rate according to the average class size", fontsize=14)
plt.ylim(50,100)
plt.show()
In this case, the effect of class size on the exam pass rate cannot be determined directly because the rest of the variables must be controlled for. However, it is interesting to keep such a variable, given its importance in the literature.
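To see why controlling for other variables matters, here is a purely synthetic illustration (invented numbers, nothing from our data): a confounder that drives both class size and success rates, such as an urban location, can flip the sign of a naive fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
# Hypothetical confounder: urban collèges have both larger classes
# and, in this toy setup, higher pass rates
urban = rng.binomial(1, 0.5, n)
class_size = 22 + 4 * urban + rng.normal(0, 1, n)
target = 80 + 5 * urban - 0.5 * class_size + rng.normal(0, 1, n)

# Naive fit on class size alone mixes the two effects...
naive = LinearRegression().fit(class_size.reshape(-1, 1), target)
# ...while adding the confounder as a control recovers the direct effect (-0.5)
controlled = LinearRegression().fit(np.column_stack([class_size, urban]), target)
print(naive.coef_[0], controlled.coef_[0])
```

With this data-generating process the naive coefficient comes out positive while the controlled one is close to the true -0.5, which is exactly why the scatter plot above cannot be read causally.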
In the same way, it is possible to create other new variables, such as the share of pupils who are in a general stream or the share of pupils who are in a European or international section.
# percentage of pupils in the general stream
data_college['percent_general_stream'] = data_college['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / data_college['Nb élèves']
# percentage of pupils in a European or international section
data_college['percent_euro_int_section'] = data_college['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / data_college['Nb élèves']
# percentage of pupils doing Latin or Greek
sum_global_5_to_3 = data_college['Nb 5èmes'] + data_college['Nb 4èmes générales'] + data_college['Nb 3èmes générales']
data_college['percent_latin_greek'] = data_college['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
# percentage of pupils that are in a SEGPA class
data_college['percent_segpa'] = data_college['Nb SEGPA'] / data_college['Nb élèves']
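One caveat with ratio features of this kind: a zero denominator produces inf (or NaN for 0/0), which many estimators refuse. A defensive variant, sketched here on a hypothetical mini-frame with made-up numbers, maps infinities to NaN so that a downstream imputer can handle them:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; the column names mirror data_college but the
# numbers are made up, including a zero denominator on the last row
df = pd.DataFrame({"Nb élèves": [500, 300, 0],
                   "Nb SEGPA":  [20, 0, 5]})

# naive division yields inf on the last row (5 / 0)
ratio = df["Nb SEGPA"] / df["Nb élèves"]

# guard: convert +/-inf to NaN so an imputer can later fill the gap
df["percent_segpa"] = ratio.replace([np.inf, -np.inf], np.nan)
print(df["percent_segpa"].tolist())  # [0.04, 0.0, nan]
```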
quant_features = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb élèves pratiquant langue rare',
'Nb 3ème SEGPA',
'average_class_size', 'percent_general_stream', 'percent_euro_int_section']
data_college[quant_features].describe()
| | Nb élèves | Nb 3èmes générales | Nb 3èmes générales retardataires | Nb 5èmes 4èmes et 3èmes générales Latin ou Grec | Nb élèves pratiquant langue rare | Nb 3ème SEGPA | average_class_size | percent_general_stream | percent_euro_int_section |
|---|---|---|---|---|---|---|---|---|---|
| count | 4186.000000 | 4186.000000 | 4186.000000 | 4186.000000 | 4186.000000 | 4186.000000 | 4186.000000 | 4186.000000 | 4186.000000 |
| mean | 516.225036 | 119.627568 | 18.273053 | 17.682991 | 494.123268 | 4.247731 | 24.452214 | 0.238136 | 0.056751 |
| std | 163.357700 | 40.049045 | 11.141567 | 25.031394 | 156.442065 | 7.434012 | 2.502211 | 0.033351 | 0.022781 |
| min | 120.000000 | 0.000000 | 0.000000 | 0.000000 | 111.000000 | 0.000000 | 12.866667 | 0.101480 | 0.000000 |
| 25% | 397.000000 | 91.000000 | 11.000000 | 0.000000 | 378.000000 | 0.000000 | 22.736656 | 0.216216 | 0.040541 |
| 50% | 504.000000 | 116.000000 | 16.000000 | 13.000000 | 482.000000 | 0.000000 | 24.666667 | 0.237145 | 0.054127 |
| 75% | 623.750000 | 145.000000 | 24.000000 | 25.000000 | 599.000000 | 0.000000 | 26.380952 | 0.257856 | 0.069565 |
| max | 1711.000000 | 419.000000 | 211.000000 | 211.000000 | 1698.000000 | 35.000000 | 32.095238 | 0.591667 | 0.175000 |
data_college[quant_features].hist(figsize=(16, 20), bins = 50, xlabelsize=8, ylabelsize=8)
plt.show()
We can observe that most of our quantitative features have a roughly Gaussian shape.
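Beyond eyeballing the histograms, skewness gives a quick numeric check of symmetry: values near 0 are consistent with a Gaussian-like shape, while large positive values indicate a long right tail. A sketch on synthetic columns (the data below is simulated, not ours):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# synthetic stand-ins: one roughly Gaussian column, one right-skewed column
toy = pd.DataFrame({
    "gaussian_like": rng.normal(500, 160, size=5000),
    "right_skewed": rng.exponential(20, size=5000),  # theoretical skew = 2
})

print(toy.skew())  # near 0 for the first column, near 2 for the second
```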
Finally, we can have a look at the correlation between our quantitative features and the target.
features_for_corr = ['target', 'average_class_size',
'percent_general_stream', 'percent_euro_int_section',
'med_std_living', 'poverty_rate', 'unemployment_rate']
sns.heatmap(data_college[features_for_corr].corr(), cmap='YlGn')
plt.title('Analysis of correlations via heatmap');
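The heatmap gives a global picture; to rank features, it can also be handy to sort the absolute correlations with the target directly. A sketch on toy data (the feature names and values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
signal = rng.normal(size=n)

# made-up frame: one feature strongly tied to the target, one pure noise
toy = pd.DataFrame({
    "target": signal + rng.normal(scale=0.3, size=n),
    "strong_feature": signal,
    "noise_feature": rng.normal(size=n),
})

# rank features by absolute correlation with the target
corr_with_target = (toy.corr()["target"]
                    .drop("target").abs().sort_values(ascending=False))
print(corr_with_target)
```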
The workflow is composed of two essential elements that make up the submission: the feature extractor and the regressor. The first prepares the initial data and creates new variables. The second trains a supervised learning model to predict the exam success rate. This model is trained on part of the dataset produced by the feature extractor, then evaluated on the remaining part.
We will use a Random Forest regressor to predict the success rates.
data_college = pd.read_csv('./data/college/data_college_filtered.csv', index_col=0)
y_array = data_college['target'].values
X_df = data_college.drop('target', axis=1)
cities_data = pd.read_csv("./data/donnees_geographiques/cities_data_filtered.csv", index_col=0)
keep_col_cities = ['population', 'SUPERF', 'med_std_living', 'poverty_rate', 'unemployment_rate']
def process_students(X):
"""Create new features linked to the pupils"""
# average class size
X['average_class_size'] = X['Nb élèves'] / X['Nb divisions']
# percentage of pupils in the general stream
X['percent_general_stream'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / X['Nb élèves']
# percentage of pupils in a European or international section
X['percent_euro_int_section'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / X['Nb élèves']
# percentage of pupils doing Latin or Greek
sum_global_5_to_3 = X['Nb 5èmes'] + X['Nb 4èmes générales'] + X['Nb 3èmes générales']
X['percent_latin_greek'] = X['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
# percentage of pupils that are in a SEGPA class
X['percent_segpa'] = X['Nb SEGPA'] / X['Nb élèves']
return np.c_[X['average_class_size'].values,
X['percent_general_stream'].values,
X['percent_euro_int_section'].values,
X['percent_latin_greek'].values,
X['percent_segpa'].values]
def merge_naive(X):
# merge the two databases at the city level
df = pd.merge(X, cities_data,
left_on='Commune et arrondissement code', right_on='insee_code', how='left')
# fill na by taking the average value at the departement level
for col in keep_col_cities:
if cities_data[col].isna().sum() > 0:
df[col] = df[['Département code', col]].groupby('Département code').transform(lambda x: x.fillna(x.mean()))
return df[keep_col_cities]
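The logic of merge_naive (left merge at the city level, then filling gaps with the département mean) can be checked on a tiny hypothetical example; the codes and rates below are invented:

```python
import numpy as np
import pandas as pd

# invented mini-versions of the two tables
schools = pd.DataFrame({
    "Commune et arrondissement code": ["75101", "75102", "13001"],
    "Département code": ["75", "75", "13"],
})
cities = pd.DataFrame({
    "insee_code": ["75101", "75102", "13001"],
    "poverty_rate": [10.0, np.nan, 18.0],  # one missing city-level value
})

df = schools.merge(cities, left_on="Commune et arrondissement code",
                   right_on="insee_code", how="left")

# fill the gap with the mean of the same département (here 75 -> 10.0)
df["poverty_rate"] = (df.groupby("Département code")["poverty_rate"]
                        .transform(lambda s: s.fillna(s.mean())))
print(df["poverty_rate"].tolist())  # [10.0, 10.0, 18.0]
```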
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
# Transformers
students_col = ['Nb élèves', 'Nb divisions', 'Nb 6èmes 5èmes 4èmes et 3èmes générales',
'Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales',
'Nb 5èmes', 'Nb 4èmes générales', 'Nb 3èmes générales',
'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb SEGPA']
students_transformer = FunctionTransformer(process_students, validate=False)
num_cols = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
"Nb 6èmes provenant d'une école EP"]
numeric_transformer = Pipeline(steps=[('scale', StandardScaler())])
cat_cols = ['Appartenance EP', 'Etablissement sensible', 'CATAEU2010',
'Situation relative à une zone rurale ou autre']
categorical_transformer = Pipeline(steps=[('encode', OneHotEncoder(handle_unknown='ignore'))])
merge_col = ['Commune et arrondissement code', 'Département code']
merge_transformer = FunctionTransformer(merge_naive, validate=False)
drop_cols = ['Name', 'Coordonnée X', 'Coordonnée Y', 'Commune code', 'City_name',
'Commune et arrondissement code', 'Commune et arrondissement nom',
'Département nom', 'Académie nom', 'Région nom', 'Région 2016 nom',
'Longitude', 'Latitude', 'Position']
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, num_cols),
('cat', categorical_transformer, cat_cols),
('students', make_pipeline(students_transformer, SimpleImputer(strategy='mean'), StandardScaler()), students_col),
('merge', make_pipeline(merge_transformer, SimpleImputer(strategy='mean')), merge_col),
('drop cols', 'drop', drop_cols),
], remainder='drop') # remainder='drop' or 'passthrough'
# check it works
preprocessor.fit_transform(X_df)
array([[ 1.57816718e+00,  2.40664777e+00,  1.68101893e+00, ...,
         2.10466667e+04,  1.64322843e+01,  1.72886328e-01],
       [ 1.91489186e+00,  2.33173067e+00,  2.93772485e+00, ...,
         2.25312500e+04,  1.21084267e+01,  1.07960805e-01],
       [-8.82984188e-01, -8.64732351e-01,  1.55018891e-01, ...,
         2.51288571e+04,  9.64715116e+00,  9.67644240e-02],
       ...,
       [-6.87226673e-02,  8.42176087e-02, -9.22157608e-01, ...,
         2.26595000e+04,  1.08124686e+01,  1.17145873e-01],
       [ 4.51670034e-01,  5.83664956e-01, -2.04039942e-01, ...,
         1.90816667e+04,  1.95091294e+01,  1.80443702e-01],
       [-1.00542953e+00, -1.31423496e+00, -1.01192232e+00, ...,
         2.13980000e+04,  9.49384985e+00,  1.34922608e-01]])
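A detail worth flagging in the categorical transformer above: handle_unknown='ignore' means a category seen only at prediction time is encoded as an all-zero row instead of raising an error. A minimal illustration with invented labels:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# fit on three invented categories
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["A"], ["B"], ["C"]]))

# an unseen category becomes an all-zero row instead of an error
row = enc.transform(np.array([["UNSEEN"]])).toarray()
print(row)  # [[0. 0. 0.]]
```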
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=5, max_depth=50, max_features=10)
from sklearn.metrics import make_scorer, mean_squared_error
def normalized_rmse(y_true, y_pred):
"""Normalized RMSE"""
if isinstance(y_true, pd.Series):
y_true = y_true.values
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
return rmse / np.std(y_true)
custom_loss = make_scorer(normalized_rmse, greater_is_better=False)
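As a sanity check, the metric behaves as expected on toy arrays: 0 for perfect predictions, and exactly 1 when always predicting the mean (since the RMSE then equals the standard deviation). The function is restated here in pure NumPy so the snippet is self-contained:

```python
import numpy as np

def normalized_rmse(y_true, y_pred):
    """RMSE divided by the standard deviation of the true values."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)

y_true = np.array([1.0, 2.0, 3.0, 4.0])

print(normalized_rmse(y_true, y_true))                     # 0.0
print(normalized_rmse(y_true, np.full(4, y_true.mean())))  # 1.0
```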
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
clf = Pipeline(steps=[
('preprocessing', preprocessor),
('classifier', regressor)])
cv = ShuffleSplit(n_splits=5, test_size=0.25)
scores_Xdf = -cross_val_score(clf, X_df, y_array, cv=cv, scoring=custom_loss)
print("mean: %.2e (+/- %.2e)" % (scores_Xdf.mean(), scores_Xdf.std()))
mean: 9.41e-01 (+/- 1.83e-02)
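Note that ShuffleSplit, unlike KFold, draws n_splits independent random train/test splits, so a sample may appear in several test sets (or in none). A minimal check of the split sizes:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 5 independent 75% / 25% splits over 8 dummy samples
cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
X_dummy = np.zeros((8, 1))

sizes = [(len(train), len(test)) for train, test in cv.split(X_dummy)]
print(sizes)  # five (6, 2) pairs
```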
In order to run a whole submission, we need to put the feature extractor and the regressor into two different Python files. Basically, we create a subfolder inside the 'submissions' folder, name it 'starting_kit' for example, and place in this subfolder the two files feature_extractor.py and regressor.py. These two files simply contain the code presented above, in a more structured way, as presented below.
Note: the metric used to evaluate our model is not defined in either of those two files but is defined in a more general file, problem.py, placed at the root of the project.
# feature_extractor.py
import os
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
class FeatureExtractor(object):
def __init__(self):
self.path = os.path.dirname(__file__)
# read the database with the city informations
self.cities_data = pd.read_csv(os.path.join(self.path, 'cities_data_filtered.csv'), index_col=0)
self.keep_col_cities = ['population', 'SUPERF', 'med_std_living', 'poverty_rate', 'unemployment_rate']
# Transformers
self.students_col = ['Nb élèves', 'Nb divisions', 'Nb 6èmes 5èmes 4èmes et 3èmes générales',
'Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales',
'Nb 5èmes', 'Nb 4èmes générales', 'Nb 3èmes générales',
'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb SEGPA']
self.students_transformer = FunctionTransformer(self.process_students, validate=False)
self.num_cols = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
"Nb 6èmes provenant d'une école EP"]
self.numeric_transformer = Pipeline(steps=[('scale', StandardScaler())])
self.cat_cols = ['Appartenance EP', 'Etablissement sensible', 'CATAEU2010',
'Situation relative à une zone rurale ou autre']
self.categorical_transformer = Pipeline(steps=[('encode', OneHotEncoder(handle_unknown='ignore'))])
self.merge_col = ['Commune et arrondissement code', 'Département code']
self.merge_transformer = FunctionTransformer(self.merge_naive, validate=False)
self.drop_cols = ['Name', 'Coordonnée X', 'Coordonnée Y', 'Commune code', 'City_name',
'Commune et arrondissement code', 'Commune et arrondissement nom',
'Département nom', 'Académie nom', 'Région nom', 'Région 2016 nom',
'Longitude', 'Latitude', 'Position']
def fit(self, X_df, y_array):
X_encoded = X_df
self.preprocessor = ColumnTransformer(
transformers=[
('num', self.numeric_transformer, self.num_cols),
('cat', self.categorical_transformer, self.cat_cols),
('students', make_pipeline(self.students_transformer, SimpleImputer(strategy='mean'), StandardScaler()), self.students_col),
('merge', make_pipeline(self.merge_transformer, SimpleImputer(strategy='mean')), self.merge_col),
('drop cols', 'drop', self.drop_cols),
], remainder='passthrough') # remainder='drop' or 'passthrough'
self.preprocessor.fit(X_encoded, y_array)
def transform(self, X_df):
X_encoded = X_df
X_array = self.preprocessor.transform(X_encoded)
return X_array
@staticmethod
def process_students(X):
"""Create new features linked to the pupils"""
# average class size
X['average_class_size'] = X['Nb élèves'] / X['Nb divisions']
# percentage of pupils in the general stream
X['percent_general_stream'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / X['Nb élèves']
# percentage of pupils in a European or international section
X['percent_euro_int_section'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / X['Nb élèves']
# percentage of pupils doing Latin or Greek
sum_global_5_to_3 = X['Nb 5èmes'] + X['Nb 4èmes générales'] + X['Nb 3èmes générales']
X['percent_latin_greek'] = X['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
# percentage of pupils that are in a SEGPA class
X['percent_segpa'] = X['Nb SEGPA'] / X['Nb élèves']
return np.c_[X['average_class_size'].values,
X['percent_general_stream'].values,
X['percent_euro_int_section'].values,
X['percent_latin_greek'].values,
X['percent_segpa'].values]
def merge_naive(self, X):
# merge the two databases at the city level
df = pd.merge(X, self.cities_data,
left_on='Commune et arrondissement code', right_on='insee_code', how='left')
# fill na by taking the average value at the departement level
for col in self.keep_col_cities:
if self.cities_data[col].isna().sum() > 0:
df[col] = df[['Département code', col]].groupby('Département code').transform(lambda x: x.fillna(x.mean()))
return df[self.keep_col_cities]
# regressor.py
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator
class Regressor(BaseEstimator):
def __init__(self):
self.reg = RandomForestRegressor(
n_estimators=5, max_depth=50, max_features=10)
def fit(self, X, y):
self.reg.fit(X, y)
def predict(self, X):
return self.reg.predict(X)
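As a quick smoke test, the Regressor class above can be fitted on random data; the shapes are chosen so that max_features=10 is valid (at least 10 features). The class is restated so the snippet runs standalone, and everything else is synthetic:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestRegressor

class Regressor(BaseEstimator):
    def __init__(self):
        self.reg = RandomForestRegressor(
            n_estimators=5, max_depth=50, max_features=10)

    def fit(self, X, y):
        self.reg.fit(X, y)

    def predict(self, X):
        return self.reg.predict(X)

# synthetic data: 100 samples, 20 features, targets in [50, 100]
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.uniform(50, 100, size=100)

reg = Regressor()
reg.fit(X, y)
pred = reg.predict(X)
print(pred.shape)  # (100,)
```

Since random-forest predictions are averages of training targets, they necessarily stay within the range of y.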
!ramp_test_submission
Testing Prediction of succes rates at the Brevet exam
Reading train and test files from ./data ...
Reading cv ...
Training submissions/starting_kit ...
CV fold 0
	score  normalized rmse      time
	train          0.43705  1.028025
	valid          0.95440  0.448320
	test           0.94854  0.411667
CV fold 1
	score  normalized rmse      time
	train          0.45110  0.943786
	valid          0.90906  0.412820
	test           0.94620  0.418762
CV fold 2
	score  normalized rmse      time
	train          0.41144  0.971130
	valid          0.95395  0.476197
	test           0.94920  0.471500
CV fold 3
	score  normalized rmse      time
	train          0.42130  1.039829
	valid          0.92765  0.470854
	test           0.95002  0.415969
CV fold 4
	score  normalized rmse      time
	train          0.42068  0.950585
	valid          0.94567  0.410397
	test           0.94812  0.407459
CV fold 5
	score  normalized rmse      time
	train          0.42068  1.008203
	valid          0.94154  0.415711
	test           0.94138  0.417803
CV fold 6
	score  normalized rmse      time
	train          0.43042  0.934333
	valid          0.98223  0.416615
	test           0.93851  0.413901
CV fold 7
	score  normalized rmse      time
	train          0.43065  0.994172
	valid          0.91149  0.413359
	test           0.93771  0.406948
----------------------------
Mean CV scores
----------------------------
	score    normalized rmse        time
	train  0.42792 ± 0.011481  1.0 ± 0.04
	valid  0.94075 ± 0.022744  0.4 ± 0.03
	test   0.94496 ± 0.004677  0.4 ± 0.02
----------------------------
Bagged scores
----------------------------
	score  normalized rmse
	valid          0.90990
	test           0.86526