HLMA 408: Premiers pas en Python/Pandas


Auteur: Joseph Salmon

[email protected]

Introduction et présentation

Import des packages usuels:

In [1]:
import platform
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
import seaborn as sns
from scipy.stats import norm
from download import download

Verifier les versions des packages utilisées:

In [2]:
print('The python version is', platform.python_version())
print('numpy {}.'.format(np.__version__))
The python version is 3.6.8
numpy 1.15.2.

Commande "magique" pour un affichage plus avancé en Jupyter:

In [3]:
%matplotlib notebook

Préparation pour l'affichage graphique et sauvegarder les images:

In [4]:
# saving tools for the course:
sns.set_context("paper", font_scale=1)
sns.set_style("ticks")
sns.set_palette("colorblind")

plt.rcParams.update({'figure.max_open_warning': 50})  # avoid warning when too many plots are opened

# colors 
brown = (0.64, 0.16, 0.16)
purple = (148. / 255, 0, 211. / 255)


dirname = "../prebuiltimages/"
imageformat = ".pdf"

Grossesses et cigarettes, impact sur la santé du nouveau né

Description des données:

"This dataset is found from http://www.stat.berkeley.edu/users/statlabs/labs.html. It accompanies the excellent text Stat Labs: Mathematical Statistics through Applications Springer-Verlag (2001) by Deborah Nolan and Terry Speed."

Plus de détails: https://www.stat.berkeley.edu/users/statlabs/papers/sample.pdf (notamment sur les biais de collectes des données...)

Téléchargement et import pour sauvegarder les données

In [5]:
# to use the default values of utils for instance
saving = False
path_target = "./utils.py"
url_shared_files = "http://josephsalmon.eu/enseignement/Montpellier/HLMA408/sharedcode/utils.py"
download(url_shared_files, path_target, replace=False)
from utils import my_saving_display
Replace is False and data exists, so doing nothing. Use replace==True to re-download the data.

Téléchargement et import des données

In [6]:
url = "http://josephsalmon.eu/enseignement/datasets/babies23.data"
# url = "http://www.stat.berkeley.edu/users/statlabs/data/babies23.data" # backup url, without header.
path_target = "./babies23.data"
download(url, path_target, replace=False)
Replace is False and data exists, so doing nothing. Use replace==True to re-download the data.
Out[6]:
'./babies23.data'
In [7]:
?pd.read_csv  # uncomment for help

Option de pandas et de pré-traitement:

In [8]:
# Preoprocessing: only run once or big trouble (think about it!)
is_preprocessing_done = 0  # init at 0,if greater don't redo it
pd.options.display.max_rows = 8  # set not to display to many lines in pandas
pd.set_option('precision', 0)  # set to display number at precision 0 in pandas

Lecture de la base de données et constructions d'un dataframe:

In [9]:
df_babies = pd.read_csv("babies23.data", skiprows=38, sep='\s+') # \s+ : for hanlding spaces
df_babies.head(n=10)  # df stands for Data Frame
Out[9]:
id pluralty outcome date gestation sex wt parity race age ... drace dage ded dht dwt marital inc smoke time number
0 15 5 1 1411 284 1 120 1 8 27 ... 8 31 5 65 110 1 1 0 0 0
1 20 5 1 1499 282 1 113 2 0 33 ... 0 38 5 70 148 1 4 0 0 0
2 58 5 1 1576 279 1 128 1 0 28 ... 5 32 1 99 999 1 2 1 1 1
3 61 5 1 1504 999 1 123 2 0 36 ... 3 43 4 68 197 1 8 3 5 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6 102 5 1 1449 244 1 138 4 7 33 ... 7 37 4 99 999 1 98 0 0 0
7 129 5 1 1562 245 1 132 2 7 23 ... 7 23 4 71 192 1 2 0 0 0
8 142 5 1 1408 289 1 120 3 0 25 ... 3 26 1 70 180 0 2 0 0 0
9 148 5 1 1568 299 1 143 3 0 30 ... 0 34 5 99 999 1 2 1 1 4

10 rows × 23 columns

Statistiques descriptives élémentaires:

In [10]:
df_babies.describe()
Out[10]:
id pluralty outcome date gestation sex wt parity race age ... drace dage ded dht dwt marital inc smoke time number
count 1236 1236 1236 1236 1236 1236 1236 1236 1236 1236 ... 1236 1236 1236 1236 1236 1e+03 1236 1e+03 1236 1236
mean 6001 5 1 1536 287 1 120 2 3 27 ... 4 31 3 82 505 1e+00 13 9e-01 2 3
std 2257 0 0 107 75 0 18 2 4 6 ... 7 9 2 14 407 3e-01 28 1e+00 9 9
min 15 5 1 1350 148 1 55 0 0 15 ... 0 18 0 60 110 0e+00 0 0e+00 0 0
25% 5286 5 1 1444 272 1 109 0 0 23 ... 0 25 2 70 165 1e+00 2 0e+00 0 0
50% 6730 5 1 1540 280 1 120 1 3 26 ... 3 29 4 73 190 1e+00 4 1e+00 1 1
75% 7583 5 1 1627 288 1 131 3 7 31 ... 7 35 5 99 999 1e+00 7 1e+00 1 3
max 9263 5 1 1714 999 1 176 13 99 99 ... 99 99 9 99 999 5e+00 98 9e+00 99 98

8 rows × 23 columns

Données nan (ou Not A Number / données manquantes):

In [11]:
(df_babies.isnull().any()) # no missing values?
Out[11]:
id          False
pluralty    False
outcome     False
date        False
            ...  
inc         False
smoke       False
time        False
number      False
Length: 23, dtype: bool
In [12]:
plt.figure(figsize=(5,5))
ax = sns.kdeplot(df_babies['gestation'], shade=True)
plt.xlabel('Durée de la grossesse (en jours)')
plt.ylabel('Proportion')
ax.legend().set_visible(False)
plt.title("Estimation de la densité de la durée de la grossesse")
/home/jo/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[12]:
Text(0.5,1,'Estimation de la densité de la durée de la grossesse')

Commentaires sur la base de données:

Bizarre Bizarre des grossesses qui durent 1000 jours ....

In [13]:
percentage_na = np.sum(df_babies['gestation']
                       == 999) / df_babies['gestation'].count() * 100
print("Il y a {0:.2f}% des grossesses de la base de donnée qui durent 999 jours...".format(
    percentage_na))  # {0:.2f} use to display only 2 digits after the sign "." 
Il y a 1.05% des grossesses de la base de donnée qui durent 999 jours...

Pré-traitement:

Lire l'entête du fichier pour comprendre tout ça... les unités, les données manquantes etc.

In [14]:
if is_preprocessing_done<1:
    print("You have to do the pre-processing only once, to avoid unit issues")
    # Remark: use inplace option to avoid useless copies for nans
    df_babies['gestation'].replace(999, np.nan, inplace=True) 
    df_babies['smoke'].replace(9, np.nan, inplace=True)  # handle missing values

    df_babies['ht'].replace(99, np.nan, inplace=True)  # handle missing values
    df_babies['dht'].replace(99, np.nan, inplace=True)  # handle missing values
    df_babies['ht'] = df_babies['ht'] * 2.54  # inches -> cm
    df_babies['dht'] = df_babies['dht'] * 2.54  # inches -> cm
    
    df_babies['wt'] * 28.3495 / 1000  # onces -> kg
    df_babies['dwt'].replace(999, np.nan, inplace=True)  # handle missing values
    df_babies['wt.1'].replace(999, np.nan, inplace=True)  # handle missing values

    df_babies['wt'] = df_babies['wt'] * 0.453592  # onces -> kg
    df_babies['wt.1'] = df_babies['wt.1'] * 0.453592  # onces -> kg
    df_babies['dwt'] = df_babies['dwt'] * 0.453592  # onces -> kg

    df_babies['number'].replace(9, np.nan, inplace=True)
    df_babies['number'].replace(98, np.nan, inplace=True)
    df_babies['number'].replace(99, np.nan, inplace=True)

    df_babies['number'].replace(1, 2.5, inplace=True)
    df_babies['number'].replace(3, 12, inplace=True)
    df_babies['number'].replace(4, 17, inplace=True)
    df_babies['number'].replace(5, 24.5, inplace=True)
    df_babies['number'].replace(6, 34.5, inplace=True)
    df_babies['number'].replace(7, 50, inplace=True)
    df_babies['number'].replace(8, 70, inplace=True)
    df_babies['number'].replace(2, 7, inplace=True)

    df_babies.dropna(inplace=True)
    print("This is done only because {} < 1".format(is_preprocessing_done))
    is_preprocessing_done +=1
You have to do the pre-processing only once, to avoid unit issues
This is done only because 0 < 1
In [15]:
plt.figure(figsize=(5, 5))
ax = sns.kdeplot(df_babies['gestation'], shade=True)
plt.xlabel('Durée de la grossesse (en jours)')
plt.ylabel('Proportion')
plt.axvline(40 * 7, c='k')  # bar à 40 semaines
ax.legend().set_visible(False)
plt.title("Estimation de la densité de la durée de la grossesse")
print("Maintenant il y a {} des valeurs manquantes identifiées.".format(

    sum(df_babies["gestation"].isnull())))
Maintenant il y a 0 des valeurs manquantes identifiées.
/home/jo/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
In [16]:
np.quantile(df_babies['gestation'],np.linspace(0.05,0.95,num=19)) 
Out[16]:
array([250. , 262.4, 267. , 270. , 272. , 274. , 275. , 277. , 278. ,
       280. , 282. , 283. , 284. , 286. , 288. , 290. , 292. , 295. ,
       302. ])
In [17]:
pd.set_option('precision',3)  # display 3 digits now in pandas
In [18]:
df_babies['dwt']  # mothers pre-pregnancy weight in kg
Out[18]:
0       49.895
1       67.132
5       58.967
7       87.090
         ...  
1232    77.111
1233    81.647
1234    74.843
1235    78.018
Name: dwt, Length: 695, dtype: float64
In [19]:
df_babies['wt'] # babies weight in kg
Out[19]:
0       54.431
1       51.256
5       61.689
7       59.874
         ...  
1232    58.060
1233    58.967
1234    56.699
1235    53.070
Name: wt, Length: 695, dtype: float64
In [20]:
df_babies['wt.1']  # father weight in kg
Out[20]:
0       45.359
1       61.235
5       42.184
7       63.503
         ...  
1232    54.431
1233    68.039
1234    49.895
1235    58.513
Name: wt.1, Length: 695, dtype: float64
In [21]:
# Exercice: afficher les poids des parents et visualiser les différences?
In [22]:
# Réponse:
# plt.figure(figsize=(3,3))
# plt.title("Poids des parents (densité)")
# sns.kdeplot(df_babies['wt.1'], label="Père")
# sns.kdeplot(df_babies['wt'], label="Mère")
# plt.xlabel("Poids (kg)")
# plt.tight_layout()
In [23]:
fig_hist_height = plt.figure(figsize=(6,4))
plt.hist(df_babies['ht'], normed=True, bins=19)
plt.xlabel('Taille de la mère (en cm)')
plt.ylabel('Proportion')
plt.title("Histogramme de la taille des mères")
my_saving_display(fig_hist_height, dirname, "hist_height_mother", imageformat,saving=saving)
/home/jo/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6571: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
In [24]:
fig_hist_height = plt.figure(figsize=(6,4))
ax = sns.kdeplot(df_babies['ht'], shade=True)
plt.xlabel('Taille de la mère (en cm)')
plt.ylabel('Proportion')
plt.title("Densité de la taille des mères")
ax.get_legend().remove()
my_saving_display(fig_hist_height, dirname,
                  "kde_height_mother", imageformat, saving=saving)
/home/jo/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
In [25]:
n_samples = df_babies['ht'].count()  # put 10 to visualize better
sample = df_babies['ht'].iloc[0:n_samples]
y = np.zeros(n_samples,) + 0.05 * np.random.randn(n_samples,)
bandwidth = 1.5
x = np.linspace(sample.min(), sample.max(), num=200)
z = np.zeros(200)

fig, ax = plt.subplots(3, 1, figsize=(8, 6))
for i in range(n_samples):
    current_density = norm.pdf(x, sample.iloc[i], bandwidth)
    z += current_density / n_samples
    ax[1].plot(x, current_density, '-',
               color=sns.color_palette()[0], lw=1, alpha=0.05)
ax[1].set_title(
    "Gaussiennes centrées sur les observations (écart-type {0})".format(bandwidth))
ax[1].set_ylim(0, .3)

ax[0].scatter(sample, y, c='black', s=100,
              marker='o', edgecolors=brown, lw='1')
ax[0].set_title("Taille des mères: observations (avec frémissement/jittering)")

ax[2].plot(x, z, '-', color=sns.color_palette()[0], lw=1)
ax[2].set_title("Moyenne des gaussiennes précédentes: KDE")
# ax[2].set_ylim(0,.06)

plt.tight_layout()
my_saving_display(fig, dirname,
                  "kde_for_dummies", imageformat, saving=saving)

Remarque:

Dans la partie ci-dessus le coefficient alpha permet d'afficher de manière plus opaque les gaussiennes qui sont plus fréquentes que les autres

Histogramme et densité gaussienne : le théorème de la limite centrale en action sur données réelles (pour la taille des mères)

In [26]:
fig_hist_height, axs = plt.subplots(1, 1, figsize=(6,4))

mean = np.mean(df_babies['ht'])
sd = np.std(df_babies['ht'])

print("Moyenne={0} (cm) et écart-type={1} (cm)".format(mean,sd))

plt.hist((df_babies['ht']-mean)/sd, density=True, bins=19)
x = np.linspace(((df_babies['ht']-mean)/sd).min(), ((df_babies['ht']-mean)/sd).max(), 100)
plt.xlabel('Taille de la mère standardisée')
plt.ylabel('Densité')
plt.title("Histogramme de la taille des mères")

plt.plot(x, norm.pdf(x, 0, 1), 'k-', lw=5)
my_saving_display(fig_hist_height, dirname,
                  "hist_n_gauss_height_mother", imageformat, saving=saving)
Moyenne=162.71715107913673 (cm) et écart-type=6.4288819931782255 (cm)
In [27]:
# Exercice: affiché la taille des parents (densité):
In [ ]:
 
In [28]:
# Réponse:
plt.figure(figsize=(3,3))
plt.title("Taille des parents (densité)")
sns.kdeplot(df_babies['dht'], label="Père")
sns.kdeplot(df_babies['ht'], label="Mère")
plt.xlabel("Taille (cm)")
plt.tight_layout()
/home/jo/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
In [29]:
fig_hist_cigs = plt.figure(figsize=(6, 4))

hist_manual = df_babies.loc[df_babies["number"] > 0, 'number']
hist_manual.hist(bins=[0, 1, 5, 10, 15, 20, 30, 40, 60, 71], density=False)
plt.xlabel('Nombre de cigarettes fumées par une mère fumeuse')
plt.ylabel('Proportion')
plt.title("Histogramme de la consommation de cigarettes (mères fumeuses)")
my_saving_display(fig_hist_cigs, dirname,
                  "hist_cigs_mother", imageformat, saving=saving)
In [30]:
# hist_manual
hist_manual.value_counts() / hist_manual.count() * 100
Out[30]:
24.5    31.081
2.5     23.243
7.0     23.243
12.0    11.892
34.5     5.135
17.0     3.243
50.0     2.162
Name: number, dtype: float64
In [31]:
# check sum =100%
np.sum(hist_manual.value_counts() / hist_manual.count() * 100)
Out[31]:
99.99999999999999
In [32]:
# histogram of height by type of smoker: 0 never, 1 now, 2 during pregnancy, 3 within 1yr
df_babies.hist(column='ht', by='smoke',density=True, bins=20, grid=False, layout=(4,1), sharex=True)
Out[32]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f6734d3f2b0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f6733d12940>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f6733cbf2e8>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f6733ce5a20>],
      dtype=object)
In [33]:
plt.subplots?
In [34]:
nrow = 4
ncol = 1
fig, axs = plt.subplots(nrow, ncol,figsize=(5,5),sharex=True)
for i, group in enumerate(df_babies.groupby("smoke")):
    if i is 0:
        axs[i].set_title("Densité de la taille des mères")    
    sns.kdeplot(group[1]["ht"],ax=axs[i],label=group[0])
    if i is 3:
        axs[i].set_xlabel("Densité de la taille des mères")    
plt.tight_layout()