Premiers pas en pandas


Auteur: Joseph Salmon

[email protected] , adapté en francais du travail de Joris Van den Bossche: https://github.com/jorisvandenbossche/pandas-tutorial/blob/master/01-pandas_introduction.ipynb

Introduction et présentation

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_rows = 8

Cas 1: Survie sur le Titanic

In [3]:
from download import download

url = "http://josephsalmon.eu/enseignement/datasets/titanic.csv"
path_target = "./titanic.csv"
download(url, path_target, replace=False)

# df: data frame
df_titanic_raw = pd.read_csv("titanic.csv")
file_sizes: 100%|██████████████████████████| 61.2k/61.2k [00:00<00:00, 1.38MB/s]
Downloading data from http://josephsalmon.eu/enseignement/datasets/titanic.csv (60 kB)

Successfully downloaded file to ./titanic.csv

In [122]:
df_titanic_raw.tail(n=3)
df_titanic_raw.describe()
Out[122]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 183.000000 183.000000 183.000000 183.000000 183.000000 183.000000 183.000000
mean 455.366120 0.672131 1.191257 35.674426 0.464481 0.475410 78.682469
std 247.052476 0.470725 0.515187 15.643866 0.644159 0.754617 76.347843
min 2.000000 0.000000 1.000000 0.920000 0.000000 0.000000 0.000000
25% 263.500000 0.000000 1.000000 24.000000 0.000000 0.000000 29.700000
50% 457.000000 1.000000 1.000000 36.000000 0.000000 0.000000 57.000000
75% 676.000000 1.000000 1.000000 47.500000 1.000000 1.000000 90.000000
max 890.000000 1.000000 3.000000 80.000000 3.000000 4.000000 512.329200

Valeurs manquantes:

Pour faciliter la suite on ne garde que les observations qui sont complètes, on enlève donc ici les valeurs manquantes

In [8]:
df_titanic = df_titanic_raw.dropna()
df_titanic.tail(3)
Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

Description succinte des variables:

  • Survival - Survie (0 = Non; 1 = Oui).
  • Pclass - Passenger Class / Classe du passager (1 = 1ere; 2 = 2nde; 3 = 3ème)
  • Name - Nom
  • Sex - Sexe
  • Age - Age
  • Sibsp - Nombre de frères / soeurs / maris / épouses à bord
  • Parch - Nombre de parents ascendants / enfants à bord
  • Ticket - Numéro du ticket
  • Fare - Prix du ticket (British pound)
  • Cabin - Cabine
  • Embarked - Port d'embarquation (C = Cherbourg; Q = Queenstown; S = Southampton)

Descriptif rapide:

  • count - effectif
  • mean - moyenne
  • std (standard deviation - écart-type)
In [11]:
df_titanic.describe()
Out[11]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 183.000000 183.000000 183.000000 183.000000 183.000000 183.000000 183.000000
mean 455.366120 0.672131 1.191257 35.674426 0.464481 0.475410 78.682469
std 247.052476 0.470725 0.515187 15.643866 0.644159 0.754617 76.347843
min 2.000000 0.000000 1.000000 0.920000 0.000000 0.000000 0.000000
25% 263.500000 0.000000 1.000000 24.000000 0.000000 0.000000 29.700000
50% 457.000000 1.000000 1.000000 36.000000 0.000000 0.000000 57.000000
75% 676.000000 1.000000 1.000000 47.500000 1.000000 1.000000 90.000000
max 890.000000 1.000000 3.000000 80.000000 3.000000 4.000000 512.329200

Compréhension visualisation de la base de données:

Quelle est la répartition par âge des passagers?

In [28]:
plt.figure(figsize=(3,3))
plt.hist(df_titanic['Age'], density=False,bins=50)
plt.xlabel('Age')
plt.ylabel('Proportion')
plt.title("Histogramme de l'âge des passagers")
Out[28]:
Text(0.5,1,"Histogramme de l'âge des passagers")
In [26]:
plt.figure()
ax = sns.kdeplot(df_titanic['Age'], shade=True, cut=0, bw=3)
plt.xlabel('Proportion')
plt.ylabel('Age')
ax.legend().set_visible(False)
plt.title("Estimation de la densité de l'âge des passagers")
Out[26]:
Text(0.5,1,"Estimation de la densité de l'âge des passagers")
In [29]:
ax = sns.kdeplot

Comment le taux de survie des passagers diffère-t-il entre les sexes?

In [31]:
df_titanic_raw.groupby('Sex')[['Survived']].aggregate(lambda x: x.mean())
Out[31]:
Survived
Sex
female 0.742038
male 0.188908

Ou en quoi diffère-t-il entre les différentes classes?

In [36]:
df_titanic.columns
Out[36]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [51]:
plt.figure()
df_titanic.groupby('Pclass')['Survived'].aggregate(lambda x: x.mean()).plot(kind='bar')
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbe903c3550>
In [53]:
sns.catplot(x=df_titanic_raw.columns[2], y="Age",
            hue="Sex", data=df_titanic_raw, kind="violin", legend=False)
plt.title("Taux de survie par classe")
plt.legend?
plt.tight_layout()
/home/jo/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

Cas 2: qualité de l'air

In [54]:
url = "http://josephsalmon.eu/enseignement/datasets/20080421_20160927-PA13_auto.csv"
path_target = "./20080421_20160927-PA13_auto.csv"
download(url, path_target, replace=False)
Replace is False and data exists, so doing nothing. Use replace==True to re-download the data.
Out[54]:
'./20080421_20160927-PA13_auto.csv'
In [81]:
polution_df = pd.read_csv('20080421_20160927-PA13_auto.csv', sep=';',
                          comment='#', na_values="n/d",
                          converters={'heure': str})
# check issues with 24:00:
# https://www.tutorialspoint.com/python/time_strptime.htm


# Pré-traitement:
polution_df['heure'] = polution_df['heure'].replace('24', '0')
time_improved = pd.to_datetime(polution_df['date'] +
                               ' ' + polution_df['heure'] + ':00',
                               format='%d/%m/%Y %H:%M')

polution_df['DateTime'] = time_improved
del polution_df['heure']
del polution_df['date']

polution_ts = polution_df.set_index(['DateTime'])
polution_ts = polution_ts.sort_index()
polution_ts.head()
Out[81]:
NO2 O3
DateTime
2008-04-21 00:00:00 28.0 36.0
2008-04-21 01:00:00 13.0 74.0
2008-04-21 02:00:00 11.0 73.0
2008-04-21 03:00:00 13.0 64.0
2008-04-21 04:00:00 23.0 46.0
In [56]:
## Pollution sur Paris au cours des années, Source: Airparif
polution_ts.describe()
Out[56]:
NO2 O3
count 71008.000000 71452.000000
mean 34.453414 39.610046
std 20.380702 28.837333
min 1.000000 0.000000
25% 19.000000 16.000000
50% 30.000000 38.000000
75% 46.000000 58.000000
max 167.000000 211.000000
In [61]:
fig, axes = plt.subplots(2, 1, figsize=(6, 4), sharex=True)

axes[0].plot(polution_ts['O3'].resample('D').mean())
axes[0].set_title("Polution à l'ozone: moyenne journalière sur Paris")
axes[0].set_ylabel("Concentration (µg/m³)")

axes[1].plot(polution_ts['NO2'].resample('D').mean())
axes[1].set_title("Polution à l'azote: moyenne journalière sur Paris")
axes[1].set_ylabel("Concentration (µg/m³)")

plt.show()

La pollution atmosphérique montre-t-elle une tendance à la baisse au fil des ans?

In [64]:
ax = polution_ts['2008':].resample('A').mean().plot(figsize=(4,4))  # échantillone par année (A pour Annual)
plt.ylim(0,50)
plt.title("Evolution de la pollution: \n moyenne annuelle sur Paris")
plt.ylabel("Concentration (µg/m³)")
plt.xlabel("Années")
Out[64]:
Text(0.5,0,'Années')
In [65]:
# Chargement des couleurs
sns.set_palette("GnBu_d", n_colors=7)
polution_ts['weekday'] = polution_ts.index.weekday  # Monday=0, Sunday=6

# polution_ts['weekend'] = polution_ts['weekday'].isin([5, 6])

days = ['Lundi', 'Mardi', 'Mercredi',
        'Jeudi', 'Vendredi', 'Samedi', 'Dimanche']

polution_week_no2 = polution_ts.groupby(['weekday', polution_ts.index.hour])[
    'NO2'].mean().unstack(level=0)
polution_week_03 = polution_ts.groupby(['weekday', polution_ts.index.hour])[
    'O3'].mean().unstack(level=0)
In [67]:
 
Out[67]:
Int64Index([4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
            ...
            9, 9, 9, 9, 9, 9, 9, 9, 9, 9],
           dtype='int64', name='DateTime', length=73920)
In [76]:
fig, axes = plt.subplots(2, 1, figsize=(7, 7), sharex=True)

polution_week_no2.plot(ax=axes[0])
axes[0].set_ylabel("Concentration (µg/m³)")
axes[0].set_xlabel("Heure de la journée")
axes[0].set_title(
    "Profil journalier de la pollution au NO2: effet du weekend?")
axes[0].set_xticks(np.arange(0, 24))
axes[0].set_xticklabels(np.arange(0, 24), rotation=45)
axes[0].set_ylim(0, 60)

polution_week_03.plot(ax=axes[1])
axes[1].set_ylabel("Concentration (µg/m³)")
axes[1].set_xlabel("Heure de la journée")
axes[1].set_title("Profil journalier de la pollution au O3: effet du weekend?")
axes[1].set_xticks(np.arange(0, 24))
axes[1].set_xticklabels(np.arange(0, 24), rotation=45)
axes[1].set_ylim(0, 70)
axes[0].legend().set_visible(False)
# ax.legend()
axes[1].legend(labels=days, loc='lower left', bbox_to_anchor=(1, 0.1))

plt.tight_layout()
In [17]:
# XXX TODO quid des saisons?
In [82]:
import calendar
polution_ts['month'] = polution_ts.index.month  # Janvier=0, .... Decembre=12
polution_ts['month'] = polution_ts['month'].apply(lambda x: calendar.month_abbr[x])
polution_ts.head()
Out[82]:
NO2 O3 month
DateTime
2008-04-21 00:00:00 28.0 36.0 Apr
2008-04-21 01:00:00 13.0 74.0 Apr
2008-04-21 02:00:00 11.0 73.0 Apr
2008-04-21 03:00:00 13.0 64.0 Apr
2008-04-21 04:00:00 23.0 46.0 Apr
In [107]:
days = []

polution_month_no2 = polution_ts.groupby(['month', polution_ts.index.hour])[
    'NO2'].mean().unstack(level=0)
polution_month_03 = polution_ts.groupby(['month', polution_ts.index.hour])[
    'O3'].mean().unstack(level=0)
In [113]:
sns.set_palette("GnBu_d", n_colors=12)

fig, axes = plt.subplots(2, 1, figsize=(7, 7), sharex=True)

polution_month_no2.plot(ax=axes[0])
axes[0].set_ylabel("Concentration (µg/m³)")
axes[0].set_xlabel("Heure de la journée")
axes[0].set_title(
    "Profil journalier de la pollution au NO2: effet du weekend?")
axes[0].set_xticks(np.arange(0, 24))
axes[0].set_xticklabels(np.arange(0, 24), rotation=45)
axes[0].set_ylim(0, 90)

polution_month_03.plot(ax=axes[1])
axes[1].set_ylabel("Concentration (µg/m³)")
axes[1].set_xlabel("Heure de la journée")
axes[1].set_title("Profil journalier de la pollution au O3: effet du weekend?")
axes[1].set_xticks(np.arange(0, 24))
axes[1].set_xticklabels(np.arange(0, 24), rotation=45)
axes[1].set_ylim(0, 90)
axes[0].legend().set_visible(False)
# ax.legend()
axes[1].legend(labels=calendar.month_name[1:], loc='lower left', bbox_to_anchor=(1, 0.1))

plt.tight_layout()