Data Camp Project

Prediction of the success rate of the French "brevet des collèges"

Authors : Allain Cédric - Gerbeaux Alexis - Miura Clotilde - Muliukov Artem

Table of content¶

Introduction
Data exploration
Workflow
Submission

Import¶

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

Introduction¶

Every year in France, just over 810,000 students in the fourth year of collège ("classe de troisième"), aged around 15, take the brevet des collèges, a diploma that completes the first cycle of secondary education. Since schooling is compulsory up to the age of 16 in France, all teenagers take the brevet exams, making it a universal measure of the level of French students of that age. In 2019, 86.5% of the 813,200 candidates obtained the national brevet diploma. Here is the evolution of the success rate in recent years.

In [2]:

tx = pd.read_csv("data/college/tx_succes.csv", sep=";", decimal=',')
plt.figure(figsize=(10,7))
plt.plot(tx.annee, tx.taux, '-o', color='orange')
plt.xlabel("Year")
plt.ylabel("Success rate (%)")
plt.title("Evolution of the national success rate in the 'brevet des collèges'", fontsize=14)
plt.ylim(50,100)
plt.show()
# Source : https://www.data.gouv.fr/fr/datasets/le-diplome-national-du-brevet-00000000/#_

However, this rate is not the same for everyone. There are two streams in the examination, the general stream, in which 90% of the students are enrolled, and the vocational stream, so called "fillière professionnelle". For 2019, the pass rate is 87.8% in the general stream compared to 77.2% for the vocational stream. There is a similarly large gap between girls and boys, with a pass rate of 90% compared to 83% for boys. Similarly, there are major inequalities among graduating students. Here is the distribution of the types of honors, the "mentions", during the last session of the secondary school certificate.

In [3]:

mentions = pd.read_csv("data/college/tx_mentions.csv", sep=";", decimal=',')
plt.figure(figsize=(10,7))
plt.pie(mentions.taux,
        labels=mentions.mention,
        colors=['darkgreen', 'green', 'lightgreen', 'lightyellow', 'red'],
        autopct='%1.1f%%')
plt.title("Breakdown of students in the 2019 'brevet' according to their mention")
plt.axis('equal')
plt.show()

Education is a major issue and is a central element of the policy of a country like France. France spends more than 150 billion euros a year on education. Around 20% of this sum, 30 billion euros, is allocated to training at the first cycle of secondary education, in other words, at the collège.

In 2017, the French national education system identified 7,200 public and private collèges. 79% of secondary school students were enrolled in a public school.

First of all, predicting the brevet pass rate per collège is important because it makes it possible to apprehend spatial inequalities in the territory and to answer questions such as whether there is a divide between urban and rural colleges. However, predicting the success rate will above all, make it possible to identify the factors determining the success of pupils in classe de troisième in the brevet des collèges and, more generally, the determining factors in educational success. This will thus enable those in charge of education policy to (re)allocate human and financial resources where these essential factors of success are weaker than elsewhere, thus enabling the national education system to achieve its pedagogical objectives in secondary education. For example, small classes could be set up outside the REP/REP+ zones (a kind of priority education network) in a targeted manner.

This subject, through the evaluation of the brevet success rate, aims to identify the criteria for success in schools with high success rates so that investment can be made in collèges where students are on average less successful. The social impact of this system of allocating teaching and educational resources could help to move closer to the republican principle of equal opportunities nationwide.

Performance indicators for public policy action¶

If this resource allocation program works well, we can expect to see the lowest-performing collèges catch up. In other words, collèges that are currently, or repeatedly year after year, below the national average should be expected to catch up with the national average.

To capture some of this "catch-up" effect, one can first look at the evolution of the variance of the distribution of patent success rates. Furthermore, it is also possible to look at the evolution between two dates of the mean for the group located below the national average with respect to the national average: $$ \frac{\bar{y}^{1} - \bar{y}_{low}^{1}}{\bar{y}^{0} - \bar{y}_{low}^{0}}$$ where $\bar{y}^0$ is the national average success rate at time 0, and $\bar{y}_{low}^{0}$ is the average success rate among all schools that are in the first quartile of the distribution (the 25% of schools that have the lowest success rates), and similarly at time 1.

Thus, we expect this metric to be as small as possible, meaning that the below of the distribution is now more concentraded around the national average. On the contrary, a value close to 1 would mean that nothing has changed between the two dates, because even if the average of the last quartile would have increased, it is possible that this is only due to a general upward trend.

Another possible indicator would be to do the same thing but taking a specific group of collèges. For example, at date 0 we identify a group of collèges that are in difficulty (i.e., they have a very low success rate compared to other collèges), and it is within this group that we calculate the low average $\bar{y}_{low}^{0}$. At date 1, the low average $\bar{y}_{low}^{1}$ is calculated not on a new group of collèges but on the same as the previous data. Thus, the metric is more accurate in the sense that it can be used to track the specific evolution of a targeted group of collèges.

However, one must be aware that such a simplistic metric as this one does not make it possible to assess the effectiveness of a public policy as a whole, because on the one hand it takes only two dates, while on the other hand it is necessary to look over several years at the effects of a public policy. Moreover, in order to identify the real effects of a policy, it is necessary to evaluate a more advanced econometric model.

Definition of the problem: Predict the success rate per collège¶

Our target : The success rate per collège in 2017.

In order to achieve our goal, it seems crucial to achieve the most accurate patent examination pass rate possible. For this, we can use the "root mean square error" defined as follows: $$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$

In fact we will use the normalized RMSE, which is the RMSE divided by the standard deviation of the target variable $\sigma_y$: $$RMSE = \frac{\sqrt{\frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{y}_i)^2}}{\sigma_y}$$ The standard deviation of the target variable can be seen as a RMSE for a model that always predicts the average value. Thus, deviding the classic RMSE by $\sigma_y$ gives us a ratio which allow us to easily compare the performance of our model.

Data exploration¶

Import¶

In [4]:

import geopandas as gpd
import folium

pd.set_option('display.max_columns', 500)
pd.set_option("display.max_rows", 500)

import warnings
warnings.filterwarnings("ignore")

import sys
sys.path.insert(0, './tools')

Load data¶

First of all, data related to the collèges and their exam pass rates are loaded.

In [ ]:

data_college = pd.read_csv('./data/college/data_college.csv', index_col=0)
data_college.head()

In [ ]:

columns_to_drop = ['Nb 3èmes insertion',
                   'Nb dispositifs relais',
                   'Nb dispositifs initiation aux métiers en alternance',
                   'Nb 2nd cycle général ou technologique',
                   'Nb CAP ou BAC professionnel',
                   'Nb 3émes générales et insertion rentrée précédente',
                   'Inscription à DCOL',
                   'Inscription à DCOL Rs N-1',
                   'Situation relative à un QPV (quartier prioritaire de la Ville)']
data_college_filtered = data_college.drop(columns_to_drop, axis=1)
data_college_filtered.to_csv('./data/college/data_college_filtered.csv')

In [5]:

data_college = pd.read_csv('./data/college/data_college_filtered.csv', index_col=0)
print('shape of the college table:', data_college.shape)
data_college.head()

shape of the college table: (4186, 44)

Out[5]:

	Appartenance EP	Name	Coordonnée X	Coordonnée Y	Etablissement sensible	CATAEU2010	Situation relative à une zone rurale ou autre	Commune code	City_name	Commune et arrondissement code	Commune et arrondissement nom	Département code	Département nom	Académie code	Académie nom	Région code	Région nom	Région 2016 code	Région 2016 nom	Nb élèves	Nb 6èmes	Nb 5èmes	Nb 4èmes générales	Nb 3èmes générales	Nb 3èmes générales retardataires	Nb divisions	Nb 6èmes provenant d'une école EP	Nb 5èmes 4èmes et 3èmes générales Latin ou Grec	Nb élèves pratiquant langue rare	Nb 6èmes bilangues	Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales	Nb 6èmes 5èmes 4èmes et 3èmes générales	Nb 3émes générales et insertion rentrée précédente passés en 2nde GT	Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel	Longitude	Latitude	Position	target
0	HEP	jules simon	267972.0	6744464.8	NON	111.0	urbain	56260	vannes	56260	VANNES	56	MORBIHAN	14.0	RENNES	53	BRETAGNE	53	BRETAGNE	774.0	175.0	185.0	198.0	216.0	37.0	30.0	43.0	32.0	774.0	147.0	38.0	214.0	-2.760887	47.658668	-2.760887	47.658668	47.6586681136,-2.76088651324	95.6
1	HEP	raoul blanchard	942287.5	6538643.7	NON	111.0	urbain	74010	annecy	74010	ANNECY	74	HAUTE SAVOIE	8.0	GRENOBLE	82	RHONE-ALPES	84	AUVERGNE-ET-RHONE-ALPES	829.0	211.0	200.0	192.0	213.0	51.0	31.0	144.0	0.0	816.0	150.0	41.0	211.0	6.125997	45.904308	6.125997	45.904308	45.9043076805,6.12599674762	88.9
2	HEP	georges duhamel	637692.1	6878029.4	NON	111.0	urbain	95306	herblay	95306	HERBLAY	95	VAL-D'OISE	25.0	VERSAILLES	11	ILE-DE-FRANCE	11	ILE-DE-FRANCE	372.0	102.0	85.0	84.0	85.0	20.0	14.0	0.0	12.0	356.0	57.0	20.0	84.0	2.148468	48.999161	2.148468	48.999161	48.9991613692,2.14846829582	76.3
3	HEP	edouard herriot	585352.4	6816567.4	NON	111.0	urbain	28218	luce	28218	LUCE	28	EURE-ET-LOIR	18.0	ORLEANS-TOURS	24	CENTRE-VAL-DE-LOIRE	24	CENTRE-VAL-DE-LOIRE	517.0	124.0	131.0	133.0	116.0	16.0	20.0	0.0	23.0	504.0	76.0	29.0	119.0	1.449799	48.439256	1.449799	48.439256	48.4392557302,1.44979875281	89.2
4	HEP	henri dheurle	372116.5	6400997.6	NON	111.0	urbain	33529	la teste de buch	33529	LA TESTE-DE-BUCH	33	GIRONDE	4.0	BORDEAUX	72	AQUITAINE	75	NOUVELLE-AQUITAINE	777.0	194.0	194.0	174.0	201.0	24.0	27.0	1.0	29.0	763.0	111.0	40.0	189.0	-1.135637	44.630780	-1.135637	44.630780	44.6307801744,-1.13563658848	87.4

Separate the database into train and split

In [6]:

from sklearn.model_selection import train_test_split

y = data_college['target'].values
X = data_college.drop('target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

X_train['target'] = y_train
X_train.to_csv('./data/data_college_filtered_TRAIN.csv')

X_test['target'] = y_test
X_test.to_csv('./data/data_college_filtered_TEST.csv')

Only public collèges (middle schools) which presented at least 50 students to the exam
4.186 collèges after the merge of the table with the success rates per collège and a table with a lot of explainatory variables per collège found on data.gouv
44 features per collège
geographic data per city to enrich te data. Data of 2015 and 2010. We will work with 2015. The table comes from INSEE and can be found at this link
The scrapping code and the merge code can be found in the notebook 'data_camp_get_data'.

Let's have a look at the main variables that we have:

Appartenance EP: categorical, the level of priority education network
Name: name of the collège
Etablissement sensible: boolean, True if the collège is "sensitive"
Situation relative à une zone rurale ou autre: categorical, situation relating to a rural or other area
Département code: code for the department the collège is in
Académie code: code for the academy the collège is in
Région code: code for the region the collège is in
Nb 6èmes (resp. 5èmes, 4èmes générales, 3èmes générales): number of pupils in each level ("générales": general stream)

Important note: in this database there are no missing data, so we don't have to worry about this problem afterwards.

In [7]:

# number of unique values
data_college.nunique()

Out[7]:

Appartenance EP                                                                        3
Name                                                                                2355
Coordonnée X                                                                        4183
Coordonnée Y                                                                        4182
Etablissement sensible                                                                 2
CATAEU2010                                                                             8
Situation relative à une zone rurale ou autre                                          3
Commune code                                                                        2771
City_name                                                                           2748
Commune et arrondissement code                                                      2812
Commune et arrondissement nom                                                       2789
Département code                                                                     101
Département nom                                                                      101
Académie code                                                                         31
Académie nom                                                                          31
Région code                                                                           27
Région nom                                                                            27
Région 2016 code                                                                      18
Région 2016 nom                                                                       18
Nb élèves                                                                            733
Nb 6èmes                                                                             236
Nb 5èmes                                                                             223
Nb 4èmes générales                                                                   220
Nb 3èmes générales                                                                   218
Nb 6ème SEGPA                                                                         34
Nb 5ème SEGPA                                                                         36
Nb 4ème SEGPA                                                                         35
Nb 3ème SEGPA                                                                         36
Nb SEGPA                                                                             118
Nb 3èmes générales retardataires                                                      79
Nb divisions                                                                          46
Nb 6èmes provenant d'une école EP                                                    267
Nb 5èmes 4èmes et 3èmes générales Latin ou Grec                                      151
Nb 5èmes 4èmes et 3èmes générales                                                     98
Nb élèves pratiquant langue rare                                                     704
Nb 6èmes bilangues                                                                   180
Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales       93
Nb 6èmes 5èmes 4èmes et 3èmes générales                                              211
Nb 3émes générales et insertion rentrée précédente passés en 2nde GT                4184
Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel    4184
Longitude                                                                           4184
Latitude                                                                            4184
Position                                                                            4184
target                                                                               359
dtype: int64

In [8]:

# Python type of each feature
data_college.dtypes

Out[8]:

Appartenance EP                                                                      object
Name                                                                                 object
Coordonnée X                                                                        float64
Coordonnée Y                                                                        float64
Etablissement sensible                                                               object
CATAEU2010                                                                          float64
Situation relative à une zone rurale ou autre                                        object
Commune code                                                                         object
City_name                                                                            object
Commune et arrondissement code                                                       object
Commune et arrondissement nom                                                        object
Département code                                                                     object
Département nom                                                                      object
Académie code                                                                       float64
Académie nom                                                                         object
Région code                                                                           int64
Région nom                                                                           object
Région 2016 code                                                                      int64
Région 2016 nom                                                                      object
Nb élèves                                                                           float64
Nb 6èmes                                                                            float64
Nb 5èmes                                                                            float64
Nb 4èmes générales                                                                  float64
Nb 3èmes générales                                                                  float64
Nb 6ème SEGPA                                                                       float64
Nb 5ème SEGPA                                                                       float64
Nb 4ème SEGPA                                                                       float64
Nb 3ème SEGPA                                                                       float64
Nb SEGPA                                                                            float64
Nb 3èmes générales retardataires                                                    float64
Nb divisions                                                                        float64
Nb 6èmes provenant d'une école EP                                                   float64
Nb 5èmes 4èmes et 3èmes générales Latin ou Grec                                     float64
Nb 5èmes 4èmes et 3èmes générales                                                   float64
Nb élèves pratiquant langue rare                                                    float64
Nb 6èmes bilangues                                                                  float64
Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales     float64
Nb 6èmes 5èmes 4èmes et 3èmes générales                                             float64
Nb 3émes générales et insertion rentrée précédente passés en 2nde GT                float64
Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel    float64
Longitude                                                                           float64
Latitude                                                                            float64
Position                                                                             object
target                                                                              float64
dtype: object

In order to add socio-economic context to our analysis, it was decided to use a second database, this time with city-level information.

In [ ]:

cities_data = pd.read_csv("./data/donnees_geographiques/cities_data.csv", index_col=0)
keep_col = ['insee_code',
            'LIBGEO',
            'REG',
            'DEP',
            'population',
            'SUPERF',
            'med_std_living',
            'poverty_rate',
            'unemployment_rate']
cities_data = cities_data[keep_col]
cities_data_filtered.to_csv('./data/donnees_geographiques/cities_data_filtered.csv')

In [9]:

cities_data = pd.read_csv("./data/donnees_geographiques/cities_data_filtered.csv", index_col=0)
print('Shape of the cities table', cities_data.shape)
cities_data.head()

shape of the cities table (36729, 9)

Out[9]:

	insee_code	LIBGEO	REG	DEP	population	SUPERF	med_std_living	poverty_rate	unemployment_rate
0	01001	L'Abergement-Clémenciat	84	01	767.0	15.95	22228.000000	NaN	0.087766
1	01002	L'Abergement-de-Varey	84	01	241.0	9.15	22883.333333	NaN	0.081301
2	01004	Ambérieu-en-Bugey	84	01	14127.0	24.60	19735.200000	17.227132	0.158234
3	01005	Ambérieux-en-Dombes	84	01	1619.0	15.92	23182.666667	NaN	0.078759
4	01006	Ambléon	84	01	109.0	5.88	NaN	NaN	0.137931

Let us briefly present the main variables of this second database.

insee_code: the Insee code of the city
LIBGEO: name of the city
REG: code for the région
DEP: code for the département
population: number of inhabitants in the city
poverty_rate: poverty rate in the city, in percentage
unemployment_rate: unemployment rate in the city, in percentage

In [10]:

# proportion of NaN values
cities_data.isna().sum() / cities_data.shape[0]

Out[10]:

insee_code           0.000000
LIBGEO               0.000000
REG                  0.000000
DEP                  0.000000
population           0.035068
SUPERF               0.035068
med_std_living       0.121974
poverty_rate         0.880476
unemployment_rate    0.035258
dtype: float64

The variables in this database are quite well filled out, except for the variable coding for the poverty rate, which has more than 88% missing data. Subsequently, we will fill the missing data by taking the average value at the department level, so that there are no more missing data.

In [11]:

# Python type of each feature
cities_data.dtypes

Out[11]:

insee_code            object
LIBGEO                object
REG                    int64
DEP                   object
population           float64
SUPERF               float64
med_std_living       float64
poverty_rate         float64
unemployment_rate    float64
dtype: object

Target visualization¶

As mentioned earlier, the objective here is to arrive at an accurate estimate of a college's pass rate on the Brevet exam. Therefore, the corresponding variable is named 'target'.

First, let's look at the simple distribution of these success rates.

In [12]:

plt.figure(figsize=(10,7))
plt.hist(data_college.target, bins=40)
plt.axvline(x=data_college.target.mean(), color="orange", label='mean')
plt.axvline(x=data_college.target.quantile(.5), color="red", label='median')
plt.xlabel("Success rate (%)")
plt.ylabel("Count")
plt.title("Distribution of the revet uccess rates", fontsize=14)
plt.legend(loc='best')
plt.show()

In [13]:

data_college.target.describe()

Out[13]:

count    4186.000000
mean       87.429025
std         7.611328
min        41.300000
25%        82.925000
50%        88.600000
75%        93.100000
max       100.000000
Name: target, dtype: float64

In [14]:

print("The variance of the success rates in 2019 is %.2f" %data_college.target.std())

The variance of the success rates in 2019 is 7.61

Naive merge with the city database¶

In order to integrate the socio-economic data available in our second database, it is possible to join the two databases at the city level, thanks to their unique code.

In [15]:

# merge on the city code
data_college = pd.merge(data_college, cities_data,
                        left_on='Commune et arrondissement code', right_on='insee_code', how='left')

# fill na by taking the average value at the departement level
city_col_with_na = []
for col in cities_data.columns:
    if cities_data[col].isna().sum() > 0:
        city_col_with_na.append(col)

#print(city_col_with_na)
for col in city_col_with_na:
    data_college[col] = data_college[['Département code', col]].groupby('Département code').transform(lambda x: x.fillna(x.mean()))
    
data_college.head()

Out[15]:

	Appartenance EP	Name	Coordonnée X	Coordonnée Y	Etablissement sensible	CATAEU2010	Situation relative à une zone rurale ou autre	Commune code	City_name	Commune et arrondissement code	Commune et arrondissement nom	Département code	Département nom	Académie code	Académie nom	Région code	Région nom	Région 2016 code	Région 2016 nom	Nb élèves	Nb 6èmes	Nb 5èmes	Nb 4èmes générales	Nb 3èmes générales	Nb 3èmes générales retardataires	Nb divisions	Nb 6èmes provenant d'une école EP	Nb 5èmes 4èmes et 3èmes générales Latin ou Grec	Nb élèves pratiquant langue rare	Nb 6èmes bilangues	Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales	Nb 6èmes 5èmes 4èmes et 3èmes générales	Nb 3émes générales et insertion rentrée précédente passés en 2nde GT	Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel	Longitude	Latitude	Position	target	insee_code	LIBGEO	REG	DEP	population	SUPERF	med_std_living	poverty_rate	unemployment_rate
0	HEP	jules simon	267972.0	6744464.8	NON	111.0	urbain	56260	vannes	56260	VANNES	56	MORBIHAN	14.0	RENNES	53	BRETAGNE	53	BRETAGNE	774.0	175.0	185.0	198.0	216.0	37.0	30.0	43.0	32.0	774.0	147.0	38.0	214.0	-2.760887	47.658668	-2.760887	47.658668	47.6586681136,-2.76088651324	95.6	56260	Vannes	53.0	56	53200.0	32.30	21046.666667	16.432284	0.172886
1	HEP	raoul blanchard	942287.5	6538643.7	NON	111.0	urbain	74010	annecy	74010	ANNECY	74	HAUTE SAVOIE	8.0	GRENOBLE	82	RHONE-ALPES	84	AUVERGNE-ET-RHONE-ALPES	829.0	211.0	200.0	192.0	213.0	51.0	31.0	144.0	0.0	816.0	150.0	41.0	211.0	6.125997	45.904308	6.125997	45.904308	45.9043076805,6.12599674762	88.9	74010	Annecy	84.0	74	125694.0	66.93	22531.250000	12.108427	0.107961
2	HEP	georges duhamel	637692.1	6878029.4	NON	111.0	urbain	95306	herblay	95306	HERBLAY	95	VAL-D'OISE	25.0	VERSAILLES	11	ILE-DE-FRANCE	11	ILE-DE-FRANCE	372.0	102.0	85.0	84.0	85.0	20.0	14.0	0.0	12.0	356.0	57.0	20.0	84.0	2.148468	48.999161	2.148468	48.999161	48.9991613692,2.14846829582	76.3	95306	Herblay	11.0	95	28341.0	12.74	25128.857143	9.647151	0.096764
3	HEP	edouard herriot	585352.4	6816567.4	NON	111.0	urbain	28218	luce	28218	LUCE	28	EURE-ET-LOIR	18.0	ORLEANS-TOURS	24	CENTRE-VAL-DE-LOIRE	24	CENTRE-VAL-DE-LOIRE	517.0	124.0	131.0	133.0	116.0	16.0	20.0	0.0	23.0	504.0	76.0	29.0	119.0	1.449799	48.439256	1.449799	48.439256	48.4392557302,1.44979875281	89.2	28218	Lucé	24.0	28	16107.0	6.06	17900.333333	21.074758	0.200058
4	HEP	henri dheurle	372116.5	6400997.6	NON	111.0	urbain	33529	la teste de buch	33529	LA TESTE-DE-BUCH	33	GIRONDE	4.0	BORDEAUX	72	AQUITAINE	75	NOUVELLE-AQUITAINE	777.0	194.0	194.0	174.0	201.0	24.0	27.0	1.0	29.0	763.0	111.0	40.0	189.0	-1.135637	44.630780	-1.135637	44.630780	44.6307801744,-1.13563658848	87.4	33529	La Teste-de-Buch	75.0	33	26110.0	180.20	21434.666667	10.666966	0.152706

In [16]:

plt.figure(figsize=(10,7))
plt.scatter(data_college['unemployment_rate'], data_college['target'])
plt.xlabel("Average unemployment rate in the departement")
plt.ylabel("Success rate (%)")
plt.title("Success rate according to the unemployment rate in the departement", fontsize=14)
plt.ylim(50,100)
plt.show()

At first glance, it would appear that the unemployment rate in the city has a negative effect on the exam pass rate. This supports the idea that socio-economic variables in the environment in which the school is located may be crucial explanatory factors.

Geographic visualization¶

Important note: In order to correctly see the different dynamic maps, please run this notebook making sure you have an internet connection (because it's dynamic).
An other solution to visualize the maps correctly, is to look at the compiled notebook here.

Let's look at the distribution of the success rate by department.

In [17]:

# Create a dynamic map
def plot_per_department(column, data_college,
                        cmap='OrRd_r',
                        path_geo_data = './data/donnees_geographiques/fichiers_geopandas/'):
    '''
    Functions which returns an interactive map of France colored by department according
    to the value of the feature "column" (for example the target).
    
    Parameters:
        column (string): the name of the column to plot per department
        data_college: the data frame with the data per college
        path_geo_data: path to a geojson file of france
    '''
    
    dep_df = data_college.groupby('Département code').agg({column:'mean'}).reset_index()
    dep_df.columns = ['code', column] # we rename the department code because it's the key of the geojson file
    
    
    m = folium.Map(location=[46.45,1], zoom_start=6) #map centered on France

    # Add the chloropleth
    m.choropleth(
       geo_data=path_geo_data+'departements.geojson.txt', # geoJSON file or url to geojson
       name='choropleth',
       data=dep_df, # Pandas dataframe
       columns=['code',column], # key and value of interest from the dataframe
       key_on='feature.properties.code', # key to link the json file and the dataframe
       fill_color=cmap, # colormap
       fill_opacity=0.7,
       line_opacity=0.2,
       legend_name=column
    )
    
    display(m)
    
    return None

In [18]:

plot_per_department(column='target', data_college=data_college)

Now we can look in detail, at the city level.

Thanks to the function below, you can choose a department and display an interactive map of the collèges. You can click on the little little circles (each of them corresponds to a collège) and see some information about it:

its name
its success rate
its city and department
Its 'Appartenance EP':
- 'HEP' = it's not a priority education college
- 'REP' : it is
- 'REPPLUS' : the highest level of priority

In [19]:

# Create a dynamic map for the cities in each department
def plot_cities_in_dep(column, dep_code, dep_name,
                       data_college,
                       url_parent='https://france-geojson.gregoiredavid.fr/repo/departements/',
                       cmap='OrRd_r'):
    '''
    column (string): name of the column to plot
    dep_code(string): the code of the department to plot 
    dep_name (string): the name of the department
    data_college : the data frame with data per college
    url_parent : the url of France GEOJSON
    '''

    cities_df = data_college[data_college['Département code']==dep_code]
    cities_df_group = cities_df.groupby('Commune et arrondissement code').agg({column:'mean'}).reset_index()
    cities_df_group.columns = ['code', column]
    
    #url to geojson of the department
    url = url_parent+dep_code+'-'+dep_name+'/communes-'+dep_code+'-'+dep_name+'.geojson'
    coords = gpd.read_file(url).loc[0].geometry.centroid.coords[0] #coordinates where the maps is centered
    
    m = folium.Map(location=[coords[1],coords[0]], zoom_start=10) #map centered on France
    
    # Add the chloropleth
    m.choropleth(
       geo_data=url,# url to geojson of the department
       name='choropleth',
       data=cities_df_group, # Pandas dataframe
       columns=['code',column], # key and value of interest from the dataframe
       key_on='feature.properties.code', # key to link the json file and the dataframe
       fill_color=cmap, # colormap
       fill_opacity=0.7,
       line_opacity=0.2,
       legend_name=column
    )
    
    #add the colleges marker 
    for ix, row in cities_df.iterrows():
        # Create a popup tab with the college name and its success_rate
        popup_df = pd.DataFrame(data=[['College', row['Name']], 
                                      ['Sucess rate', str(row['target'])+'%'], 
                                      ['City', row['Commune et arrondissement nom']],
                                      ['Department', row['Département nom']],
                                      ['Appartenance EP', row['Appartenance EP']]])
        popup_html = popup_df.to_html(classes='table table-striped table-hover table-condensed table-responsive', index=False, header=False)
        # Create a marker on the map
        folium.CircleMarker(location = [row['Latitude'],row['Longitude']], radius=2, popup=folium.Popup(popup_html), color='red', alpha=0.5, fill_color='#0000FF').add_to(m)

    display(m)
    
    return None

In [20]:

plot_cities_in_dep(column='target',
                   dep_code='75',
                   dep_name='paris',
                   data_college=data_college)

Without any surprise, some of the richest arrondissements like the $5^{th}$, the $7^{th}$or the $8^{th}$ have the highest average success rates (more than 94%)
Surprisingly the $10^{th}$ arrondissement has an average success rate (76-80%) lower than the national average (87.4 %). This is because the collège La Grange aux Belles which is 'REP' has a very bad success rate of 67% and there are only 4 collèges in the arrondissement. One must be careful of the interpretation of those rates in some cities.

Macro socio-economic features¶

Some of the following features' possible link with the target are worth investigating:

the median standard of living
the poverty rate
the unemployment rate

The median standard of living.¶

The national average is approximatively of 20 000 euros. In average, french cities have half of their households wich live with less than 20 000 euros per year

In [21]:

print('Mean median living wage: %.5f' %cities_data.med_std_living.mean())

Mean median living wage: 20633.28419

In [22]:

plot_per_department(column='med_std_living', data_college=data_college, cmap='Blues_r')

In [23]:

plot_cities_in_dep(column='med_std_living', dep_code='75', dep_name='paris',
                   data_college=data_college, cmap='Blues_r')

In [24]:

plot_cities_in_dep(column='med_std_living', dep_code='93', dep_name='seine-saint-denis',
                   data_college=data_college, cmap='Blues_r')

As expected by clicking on somme collèges in the darked blue zones which correspond to cities where the median standard of living is very low, the sucess rates of collèges are not good.
Example, the collège Jean Lurcat in Saint Denis has a success rate of 72.7%

The unemployment rate¶

National average (per city) in 2015 : 11%

In [25]:

print('Average unemployment rate: %.2f%%' %(cities_data.unemployment_rate.mean()*100))

Average unemployment rate: 11.05%

In [26]:

plot_per_department(column='unemployment_rate', data_college=data_college, cmap='Purples')

In [27]:

plot_cities_in_dep(column='unemployment_rate', dep_code='75', dep_name='paris',
                   data_college=data_college, cmap='Purples')

In the $19^{th}$ arrondissement, the unemployment rate is higher and the collèges have a success rate under the national average (87.3%)

The poverty rate¶

Definition : percentage of people who live with less than 60% of the median standard of living.
National average per city : 13.9%

In [28]:

print('Average poverty rate: %.02f%%' %cities_data.poverty_rate.mean())

Average poverty rate: 13.85%

In [29]:

plot_per_department('poverty_rate', data_college, cmap='Greens')

In [30]:

plot_cities_in_dep('poverty_rate', dep_code='62', 
                   dep_name='pas-de-calais', data_college=data_college, cmap="Greens")

For the department Pas-de-Calais where there are a lot of cities with a high poverty rate, it's not clear wether the poverty rate of the city has a clear impact on the success rate of a collège.

In [31]:

fig, ax = plt.subplots(1,3, figsize=(20,5))
sns.scatterplot(x='med_std_living', y='target', data=data_college, ax=ax[0])
sns.scatterplot(x='unemployment_rate', y='target', color='purple',
                data=data_college.dropna(subset=['unemployment_rate']), ax=ax[1])
sns.scatterplot(x='poverty_rate', y='target', color='green',
                data=data_college.dropna(subset=['poverty_rate']), ax=ax[2])
plt.show()

The median standard of living and the poverty rate seem to be discriminant
The influence of the unemployment rate is less clear

Priority Education Network and "sensitive" schools¶

The priority education policy aims to reduce the gaps in achievement between pupils enrolled in priority education and those who are not. Two types of networks have been identified: the REP+, which concern neighbourhoods or isolated sectors with the greatest concentration of social difficulties that have a strong impact on educational success, and the more socially mixed REPs that are more socially mixed but encounter more significant social difficulties than those of collèges and schools outside priority education. Not every collèges are in the priority education network, and in this case they are labelled as "HEP" (outside the priority education network, Hors éducation prioritaire).

In [32]:

ax = sns.boxplot(x="Appartenance EP", y="target", data=data_college)
ax.axes.set_title("Success rate according to the type of education network")
ax.set_xlabel("Type of education network")
ax.set_ylabel("Success rate");

It can easily be observed that the success rate is significantly lower when the collège is part of a priority education network (this is all the more the case for the REP+ network). However, no causal link can be drawn from this, since collèges are placed in a priority education system precisely when they have a certain amount of ground to make up in relation to other collèges.

Allongside the priority education networks, there exists another label for collège in difficulties, the établissements sensibles. The so-called "sensitive" schools are secondary schools in which a climate of insecurity prevails that seriously compromises pupils' schooling. They are not necessarily in priority education.

The development of violence in schools has led the Ministers of National Education and the Interior to strengthen their collaboration. The latter has led, since 1992, to the classification of certain public secondary schools as "sensitive" schools, without saying, however, that violence is present only in these schools.

Like REPs, sensitive establishments benefit from special measures. They are the subject of exceptional efforts in terms of innovative and adapted pedagogy, by strengthening the timetable potential (class splitting, tutoring, tutoring, directed studies, etc.) and by the allocation of additional jobs, by strengthening the presence of adults (increase in the number of senior educational advisers, boarding school teachers, day school supervisors, etc.) and by appointing two head teachers per class.

In [33]:

sns.catplot(x="Appartenance EP", y="target", hue="Etablissement sensible", kind="box", data=data_college)
plt.title('Success rate according to the type of education network');

Analysis of class size¶

Another determinant of academic success is the number of students per class. It is easy to understand that a smaller class size makes it much easier for the teacher to spend more time with each student, thus ensuring that all students progress through the classroom without some being left behind. This is true at all levels of education but mainly during the first years of schooling.

Unfortunately, the variable "number of students" per class is not present in our database. However, it is possible to create a variable "average number of pupils per class" from the variables "total number of pupils in the school" and "number of classes".

In [34]:

data_college['average_class_size'] = data_college['Nb élèves'] / data_college['Nb divisions']

In [35]:

plt.scatter(data_college['average_class_size'], data_college['target'])
plt.xlabel("Average class size")
plt.ylabel("Success rate (%)")
plt.title("Success rate according to the average class size", fontsize=14)
plt.ylim(50,100)
plt.show()

In this case, the effect of class size on the exam pass rate cannot be determined directly because the rest of the variables must be controlled for. However, it is interesting to keep such a variable, given its importance in the literature.

In the same way, it is possible to create other new variables, such as the share of pupils who are in a general stream or the share of pupils who are in a European or international section.

In [36]:

# percentage of pupils in the general stream
data_college['percent_general_stream'] = data_college['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / data_college['Nb élèves']
# percentage of pupils in an european or international section
data_college['percent_euro_int_section'] = data_college['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / data_college['Nb élèves']
# percentage of pupils doing Latin or Greek
sum_global_5_to_3 = data_college['Nb 5èmes'] + data_college['Nb 4èmes générales'] + data_college['Nb 3èmes générales']
data_college['percent_latin_greek'] = data_college['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
# percentage of pupils that are in a SEGPA class
data_college['percent_segpa'] = data_college['Nb SEGPA'] / data_college['Nb élèves']

Analysis of the quantitative features¶

In [37]:

quant_features = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
                  'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb élèves pratiquant langue rare',
                  'Nb 3ème SEGPA',
                  'average_class_size', 'percent_general_stream', 'percent_euro_int_section']
data_college[quant_features].describe()

Out[37]:

	Nb élèves	Nb 3èmes générales	Nb 3èmes générales retardataires	Nb 5èmes 4èmes et 3èmes générales Latin ou Grec	Nb élèves pratiquant langue rare	Nb 3ème SEGPA	average_class_size	percent_general_stream	percent_euro_int_section
count	4186.000000	4186.000000	4186.000000	4186.000000	4186.000000	4186.000000	4186.000000	4186.000000	4186.000000
mean	516.225036	119.627568	18.273053	17.682991	494.123268	4.247731	24.452214	0.238136	0.056751
std	163.357700	40.049045	11.141567	25.031394	156.442065	7.434012	2.502211	0.033351	0.022781
min	120.000000	0.000000	0.000000	0.000000	111.000000	0.000000	12.866667	0.101480	0.000000
25%	397.000000	91.000000	11.000000	0.000000	378.000000	0.000000	22.736656	0.216216	0.040541
50%	504.000000	116.000000	16.000000	13.000000	482.000000	0.000000	24.666667	0.237145	0.054127
75%	623.750000	145.000000	24.000000	25.000000	599.000000	9.000000	26.380952	0.257856	0.069565
max	1711.000000	419.000000	211.000000	211.000000	1698.000000	35.000000	32.095238	0.591667	0.175000

In [38]:

data_college[quant_features].hist(figsize=(16, 20), bins = 50, xlabelsize=8, ylabelsize=8)
plt.show()

We can observe that most of our quantitative features have a Gaussian distribution shape.

Correlation study¶

Finally, we can have a look on the correlation between our different quantitative features and our target.

In [39]:

features_for_corr = ['target', 'average_class_size',
                     'percent_general_stream', 'percent_euro_int_section',
                     'med_std_living', 'poverty_rate', 'unemployment_rate']
sns.heatmap(data_college[features_for_corr].corr(), cmap='YlGn')
plt.title('Anlaysis of correlations via heatmap');

Workflow¶

The work flow is composed of two essential elements that make up the submission: the feature extractor and the regressor. The first allows both the preparation of initial data and the creation of new variables. The second, on the other hand, allows a supervised learning model to be trained so that the success rate on the exam can be correctly predicted. This model is trained on a part of the base obtained from the feature extractor output, then is evaluated on the remaining part.

We will use a Random Forest Regressor model in order to predict the different success rates.

In [40]:

data_college = pd.read_csv('./data/college/data_college_filtered.csv', index_col=0)
y_array = data_college['target'].values
X_df = data_college.drop('target', axis=1)

cities_data = pd.read_csv("./data/donnees_geographiques/cities_data_filtered.csv", index_col=0)
keep_col_cities = ['population', 'SUPERF', 'med_std_living', 'poverty_rate', 'unemployment_rate']

In [41]:

def process_students(X):
    """Create new features linked to the pupils"""
    # average class size
    X['average_class_size'] = X['Nb élèves'] / X['Nb divisions']
    # percentage of pupils in the general stream
    X['percent_general_stream'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / X['Nb élèves']
    # percentage of pupils in an european or international section
    X['percent_euro_int_section'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / X['Nb élèves']
    # percentage of pupils doing Latin or Greek
    sum_global_5_to_3 = X['Nb 5èmes'] + X['Nb 4èmes générales'] + X['Nb 3èmes générales']
    X['percent_latin_greek'] = X['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
    # percentage of pupils that are in a SEGPA class
    X['percent_segpa'] = X['Nb SEGPA'] / X['Nb élèves']

    return np.c_[X['average_class_size'].values,
                 X['percent_general_stream'].values,
                 X['percent_euro_int_section'].values,
                 X['percent_latin_greek'].values,
                 X['percent_segpa'].values]
        
        
def merge_naive(X):
    # merge the two databases at the city level
    df = pd.merge(X, cities_data,
                  left_on='Commune et arrondissement code', right_on='insee_code', how='left')

    # fill na by taking the average value at the departement level
    for col in keep_col_cities:
        if cities_data[col].isna().sum() > 0:
            df[col] = df[['Département code', col]].groupby('Département code').transform(lambda x: x.fillna(x.mean()))

    return df[keep_col_cities]

In [42]:

from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer


# Transformers
students_col = ['Nb élèves', 'Nb divisions', 'Nb 6èmes 5èmes 4èmes et 3èmes générales',
                'Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales',
                'Nb 5èmes', 'Nb 4èmes générales', 'Nb 3èmes générales',
                'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb SEGPA']
students_transformer = FunctionTransformer(process_students, validate=False)

num_cols = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
            "Nb 6èmes provenant d'une école EP"]
numeric_transformer = Pipeline(steps=[('scale', StandardScaler())])

cat_cols = ['Appartenance EP', 'Etablissement sensible', 'CATAEU2010',
            'Situation relative à une zone rurale ou autre']
categorical_transformer = Pipeline(steps=[('encode', OneHotEncoder(handle_unknown='ignore'))])

merge_col = ['Commune et arrondissement code', 'Département code']
merge_transformer = FunctionTransformer(merge_naive, validate=False)

drop_cols = ['Name', 'Coordonnée X', 'Coordonnée Y', 'Commune code', 'City_name',
             'Commune et arrondissement code', 'Commune et arrondissement nom',
             'Département nom', 'Académie nom', 'Région nom', 'Région 2016 nom',
             'Longitude', 'Latitude', 'Position']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols),
        ('students', make_pipeline(students_transformer, SimpleImputer(strategy='mean'), StandardScaler()), students_col),
        ('merge', make_pipeline(merge_transformer, SimpleImputer(strategy='mean')), merge_col),
        ('drop cols', 'drop', drop_cols),
        ], remainder='drop') # remainder='drop' or 'passthrough'

In [43]:

# check it works
preprocessor.fit_transform(X_df)

Out[43]:

array([[ 1.57816718e+00,  2.40664777e+00,  1.68101893e+00, ...,
         2.10466667e+04,  1.64322843e+01,  1.72886328e-01],
       [ 1.91489186e+00,  2.33173067e+00,  2.93772485e+00, ...,
         2.25312500e+04,  1.21084267e+01,  1.07960805e-01],
       [-8.82984188e-01, -8.64732351e-01,  1.55018891e-01, ...,
         2.51288571e+04,  9.64715116e+00,  9.67644240e-02],
       ...,
       [-6.87226673e-02,  8.42176087e-02, -9.22157608e-01, ...,
         2.26595000e+04,  1.08124686e+01,  1.17145873e-01],
       [ 4.51670034e-01,  5.83664956e-01, -2.04039942e-01, ...,
         1.90816667e+04,  1.95091294e+01,  1.80443702e-01],
       [-1.00542953e+00, -1.31423496e+00, -1.01192232e+00, ...,
         2.13980000e+04,  9.49384985e+00,  1.34922608e-01]])

In [44]:

from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=5, max_depth=50, max_features=10)

In [45]:

from sklearn.metrics import make_scorer, mean_squared_error

def normalized_rmse(y_true, y_pred):
    """Normalized RMSE"""
    if isinstance(y_true, pd.Series):
        y_true = y_true.values

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return rmse / np.std(y_true) 
    
custom_loss = make_scorer(normalized_rmse, greater_is_better=False)

In [46]:

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

clf = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', regressor)])

cv = ShuffleSplit(n_splits=5, test_size=0.25)

scores_Xdf = -cross_val_score(clf, X_df, y_array, cv=cv, scoring=custom_loss)

print("mean: %.2e (+/- %.2e)" % (scores_Xdf.mean(), scores_Xdf.std()))

mean: 9.41e-01 (+/- 1.83e-02)

Submission¶

In order to run a whole submission, we need to compile into two differents python files the feature extractor and the regressor. Basically, we create a subfolder inside the folder 'submission', we name it 'starting_kit' for example, and we place into this subfolder the two files feature_extrator.py and regressor.py. These two files simply contain the codes presented above, in a more structured way, as presented later.

Note: the metric used to evaluate our model is not defined in either of those two files but is defined in a more general file, problem.py, placed at the root of the project.

In [47]:

# feature_extractor.py

import os
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline 

class FeatureExtractor(object):
    def __init__(self):
        
        self.path = os.path.dirname(__file__)
        
        # read the database with the city informations
        self.cities_data = pd.read_csv(os.path.join(self.path, 'cities_data_filtered.csv'), index_col=0)
        self.keep_col_cities = ['population', 'SUPERF', 'med_std_living', 'poverty_rate', 'unemployment_rate']
        
        # Transformers
        
        self.students_col = ['Nb élèves', 'Nb divisions', 'Nb 6èmes 5èmes 4èmes et 3èmes générales',
                             'Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales',
                             'Nb 5èmes', 'Nb 4èmes générales', 'Nb 3èmes générales',
                             'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb SEGPA']
        self.students_transformer = FunctionTransformer(self.process_students, validate=False)
        
        self.num_cols = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
                         "Nb 6èmes provenant d'une école EP"]
        self.numeric_transformer = Pipeline(steps=[('scale', StandardScaler())])
        
        self.cat_cols = ['Appartenance EP', 'Etablissement sensible', 'CATAEU2010',
                         'Situation relative à une zone rurale ou autre']
        self.categorical_transformer = Pipeline(steps=[('encode', OneHotEncoder(handle_unknown='ignore'))])
        
        self.merge_col = merge_col = ['Commune et arrondissement code', 'Département code']
        self.merge_transformer = FunctionTransformer(self.merge_naive, validate=False)
        
        self.drop_cols = ['Name', 'Coordonnée X', 'Coordonnée Y', 'Commune code', 'City_name',
                          'Commune et arrondissement code', 'Commune et arrondissement nom',
                          'Département nom', 'Académie nom', 'Région nom', 'Région 2016 nom',
                          'Longitude', 'Latitude', 'Position']
        pass

    def fit(self, X_df, y_array):
        X_encoded = X_df
        
        self.preprocessor = ColumnTransformer(
            transformers=[
                ('num', self.numeric_transformer, self.num_cols),
                ('cat', self.categorical_transformer, self.cat_cols),
                ('students', make_pipeline(self.students_transformer, SimpleImputer(strategy='mean'), StandardScaler()), self.students_col),
                ('merge', make_pipeline(self.merge_transformer, SimpleImputer(strategy='mean')), self.merge_col),
                ('drop cols', 'drop', self.drop_cols),
                ], remainder='passthrough') # remainder='drop' or 'passthrough'

        self.preprocessor.fit(X_encoded, y_array)
        pass

    def transform(self, X_df):
        X_encoded = X_df
        X_array = self.preprocessor.transform(X_encoded)
        return X_array
    
    @staticmethod
    def process_students(X):
        """Create new features linked to the pupils"""
        # average class size
        X['average_class_size'] = X['Nb élèves'] / X['Nb divisions']
        # percentage of pupils in the general stream
        X['percent_general_stream'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / X['Nb élèves']
        # percentage of pupils in an european or international section
        X['percent_euro_int_section'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / X['Nb élèves']
        # percentage of pupils doing Latin or Greek
        sum_global_5_to_3 = X['Nb 5èmes'] + X['Nb 4èmes générales'] + X['Nb 3èmes générales']
        X['percent_latin_greek'] = X['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
        # percentage of pupils that are in a SEGPA class
        X['percent_segpa'] = X['Nb SEGPA'] / X['Nb élèves']
        
        return np.c_[X['average_class_size'].values,
                     X['percent_general_stream'].values,
                     X['percent_euro_int_section'].values,
                     X['percent_latin_greek'].values,
                     X['percent_segpa'].values]
        
        
    def merge_naive(self, X):
        # merge the two databases at the city level
        df = pd.merge(X, self.cities_data,
                      left_on='Commune et arrondissement code', right_on='insee_code', how='left')

        # fill na by taking the average value at the departement level
        for col in self.keep_col_cities:
            if self.cities_data[col].isna().sum() > 0:
                df[col] = df[['Département code', col]].groupby('Département code').transform(lambda x: x.fillna(x.mean()))
    
        return df[self.keep_col_cities]
        

In [48]:

# regressor.py

from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator


class Regressor(BaseEstimator):
    def __init__(self):
        self.reg = RandomForestRegressor(
            n_estimators=5, max_depth=50, max_features=10)

    def fit(self, X, y):
        self.reg.fit(X, y)

    def predict(self, X):
        return self.reg.predict(X)

In [49]:

!ramp_test_submission

Testing Prediction of succes rates at the Brevet exam
Reading train and test files from ./data ...
Reading cv ...
Training submissions/starting_kit ...
CV fold 0
	score  normalized rmse      time
	train          0.43705  1.028025
	valid          0.95440  0.448320
	test           0.94854  0.411667
CV fold 1
	score  normalized rmse      time
	train          0.45110  0.943786
	valid          0.90906  0.412820
	test           0.94620  0.418762
CV fold 2
	score  normalized rmse      time
	train          0.41144  0.971130
	valid          0.95395  0.476197
	test           0.94920  0.471500
CV fold 3
	score  normalized rmse      time
	train          0.42130  1.039829
	valid          0.92765  0.470854
	test           0.95002  0.415969
CV fold 4
	score  normalized rmse      time
	train          0.42068  0.950585
	valid          0.94567  0.410397
	test           0.94812  0.407459
CV fold 5
	score  normalized rmse      time
	train          0.42068  1.008203
	valid          0.94154  0.415711
	test           0.94138  0.417803
CV fold 6
	score  normalized rmse      time
	train          0.43042  0.934333
	valid          0.98223  0.416615
	test           0.93851  0.413901
CV fold 7
	score  normalized rmse      time
	train          0.43065  0.994172
	valid          0.91149  0.413359
	test           0.93771  0.406948
----------------------------
Mean CV scores
----------------------------
	score     normalized rmse        time
	train  0.42792 ± 0.011481  1.0 ± 0.04
	valid  0.94075 ± 0.022744  0.4 ± 0.03
	test   0.94496 ± 0.004677  0.4 ± 0.02
----------------------------
Bagged scores
----------------------------
	score  normalized rmse
	valid          0.90990
	test           0.86526

Data Camp Project

Prediction of the success rate of the French "brevet des collèges"

Table of content¶

Import¶

Introduction¶

Social impact¶

Performance indicators for public policy action¶

Definition of the problem: Predict the success rate per collège¶

Data exploration¶

Import¶

Load data¶

Target visualization¶

Naive merge with the city database¶

Geographic visualization¶

Macro socio-economic features¶

The median standard of living.¶

The unemployment rate¶

The poverty rate¶

Priority Education Network and "sensitive" schools¶

Analysis of class size¶

Analysis of the quantitative features¶

Correlation study¶

Workflow¶

Submission¶