HLMA 408: Échantillonage aléatoire¶

Auteur: Joseph Salmon joseph.salmon@umontpellier.fr

Sommaire¶

Introduction et présentation

Introduction et présentation¶

Import des packages usuels:¶

In [1]:

import platform
import sys
from download import download
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact  # widget manipulation
from scipy.special import comb, binom
from scipy.stats import norm

Commande "magique" pour un affichage plus avancé en Jupyter:¶

In [2]:

%matplotlib inline

Préparation pour l'affichage graphique et sauvegarder les images:¶

In [3]:

# saving tools for the course:
sns.set_context("paper", font_scale=1)
sns.set_style("ticks")
sns.set_palette("colorblind")

# colors 
brown = (0.64, 0.16, 0.16)
purple = (148. / 255, 0, 211. / 255)


dirname = "../prebuiltimages/"
imageformat = ".pdf"

Grossesses et cigarettes, impact sur la santé du nouveau né¶

Description des données:

"This dataset is found from http://www.stat.berkeley.edu/users/statlabs/labs.html. It accompanies the excellent text Stat Labs: Mathematical Statistics through Applications Springer-Verlag (2001) by Deborah Nolan and Terry Speed."

Plus de détails: https://www.stat.berkeley.edu/users/statlabs/papers/sample.pdf (notamment sur les biais de collectes des données...)

Téléchargement et import pour sauvegarder les données¶

In [4]:

# to use the default values of utils for instance
saving = True
path_target = "./utils.py"
url_shared_files = "http://josephsalmon.eu/enseignement/Montpellier/HLMA408/sharedcode/utils.py"
download(url_shared_files, path_target, replace=False)
from utils import my_saving_display

Replace is False and data exists, so doing nothing. Use replace==True to re-download the data.

Téléchargement et import des données¶

In [5]:

url = "http://josephsalmon.eu/enseignement/datasets/babies23.data"
# url = "http://www.stat.berkeley.edu/users/statlabs/data/babies23.data" # backup url, without header.
path_target = "./babies23.data"
download(url, path_target, replace=False)

Replace is False and data exists, so doing nothing. Use replace==True to re-download the data.

Out[5]:

'./babies23.data'

Option de pandas et de pré-traitement:¶

In [6]:

# Preoprocessing: only run once or big trouble (think about it!)
is_preprocessing_done = 0  # init at 0,if greater don't redo it
pd.options.display.max_rows = 8  # set not to display to many lines in pandas
pd.set_option('precision', 0)  # set to display number at precision 0 in pandas

Lecture de la base de données et constructions d'un dataframe:¶

In [7]:

df_babies = pd.read_csv("babies23.data", skiprows=38, sep='\s+') # \s+ : for hanlding spaces
df_babies.head(n=10)  # df stands for Data Frame

Out[7]:

	id	pluralty	outcome	date	gestation	sex	wt	parity	race	age	...	drace	dage	ded	dht	dwt	marital	inc	smoke	time	number
0	15	5	1	1411	284	1	120	1	8	27	...	8	31	5	65	110	1	1	0	0	0
1	20	5	1	1499	282	1	113	2	0	33	...	0	38	5	70	148	1	4	0	0	0
2	58	5	1	1576	279	1	128	1	0	28	...	5	32	1	99	999	1	2	1	1	1
3	61	5	1	1504	999	1	123	2	0	36	...	3	43	4	68	197	1	8	3	5	5
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6	102	5	1	1449	244	1	138	4	7	33	...	7	37	4	99	999	1	98	0	0	0
7	129	5	1	1562	245	1	132	2	7	23	...	7	23	4	71	192	1	2	0	0	0
8	142	5	1	1408	289	1	120	3	0	25	...	3	26	1	70	180	0	2	0	0	0
9	148	5	1	1568	299	1	143	3	0	30	...	0	34	5	99	999	1	2	1	1	4

10 rows × 23 columns

Pré-traitement:¶

Lire l'entête du fichier pour comprendre tout ça... les unités, les données manquantes etc.

In [8]:

if is_preprocessing_done<1:
    print("You have to do the pre-processing only once, to avoid unit issues")
    # Remark: use inplace option to avoid useless copies for nans
    df_babies['smoke'].replace(9, np.nan, inplace=True)  # handle missing values
    df_babies['smoke'].replace(0, False, inplace=True)
    df_babies['smoke'].replace(1, True, inplace=True)
    df_babies['smoke'].replace(2, True, inplace=True)
    df_babies['smoke'].replace(3, True, inplace=True)

    df_babies.dropna(inplace=True)
    print("This is done only because {} < 1".format(is_preprocessing_done))
    is_preprocessing_done +=1

You have to do the pre-processing only once, to avoid unit issues
This is done only because 0 < 1

Analyse du sexe de l'enfant¶

In [9]:

df_babies.head(20)

Out[9]:

	id	pluralty	outcome	date	gestation	sex	wt	parity	race	age	...	drace	dage	ded	dht	dwt	marital	inc	smoke	time	number
0	15	5	1	1411	284	1	120	1	8	27	...	8	31	5	65	110	1	1	False	0	0
1	20	5	1	1499	282	1	113	2	0	33	...	0	38	5	70	148	1	4	False	0	0
2	58	5	1	1576	279	1	128	1	0	28	...	5	32	1	99	999	1	2	True	1	1
3	61	5	1	1504	999	1	123	2	0	36	...	3	43	4	68	197	1	8	True	5	5
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
16	207	5	1	1481	255	1	92	3	4	25	...	7	28	1	99	999	1	1	True	1	5
17	217	5	1	1605	261	1	115	3	3	33	...	3	33	2	70	140	1	4	True	1	5
18	239	5	1	1431	261	1	144	2	0	33	...	5	35	4	99	999	1	7	False	0	0
19	240	5	1	1647	288	1	119	3	0	43	...	0	45	2	73	195	1	5	True	1	6

20 rows × 23 columns

In [10]:

plt.figure(figsize=(6,4))
df_babies.groupby("smoke").wt.plot(kind='kde')
plt.legend()
plt.xlabel('Poids du bébé (en kg)')
plt.ylabel('Proportion')
plt.title("Densité du poids du bébé selon le statut tabagique")

Out[10]:

Text(0.5,1,'Densité du poids du bébé selon le statut tabagique')

In [11]:

smoke_by_type = df_babies.smoke.value_counts()
n_total = df_babies.smoke.count()
print("Il y a {} bébés dont on connait le statut tabagique de la mère".format(n_total))

Il y a 1226 bébés dont on connait le statut tabagique de la mère

In [12]:

smoke_by_type

Out[12]:

True     682
False    544
Name: smoke, dtype: int64

In [13]:

smoke_ratio= smoke_by_type[True] / df_babies.smoke.count()

In [14]:

print("La proportion de mères fumeuses est de {0:.2f} % dans la popultation totale.".format(smoke_ratio * 100))

La proportion de mères fumeuses est de 55.63 % dans la popultation totale.

Tirage aléatoire de n_samples échantillons (parmi les 1226)¶

In [15]:

n_samples = 91

# change the random_state value to get another random sample
# df_extract = df_babies.sample(n=n_samples, random_state=45)
df_extract = df_babies.sample(n=n_samples)
smoke_by_type_extract = df_extract.smoke.value_counts()
smoke_ratio_extract = smoke_by_type_extract[True] / df_extract.smoke.count()
print("La proportion de mères fumeuses est de {0:.2f} % dans la popultation extraite.".format(smoke_ratio_extract * 100))

La proportion de mères fumeuses est de 52.75 % dans la popultation extraite.

In [16]:

comb(n_total, n_samples, exact=True)

Out[16]:

27163361289825550523268206145916032768087536721887922928079532500193847739178087753157271974542740511341704937942015497847278125901041363840

In [17]:

n_digits = sys.getsizeof(comb(n_total, n_samples, exact=True))

In [18]:

print(r"Il y a environ 10^{2} façon de choisir {0} nombres parmi {1}".format(n_samples, n_total, n_digits))

Il y a environ 10^88 façon de choisir 91 nombres parmi 1226

Intervalle de confiance avec le TCL¶

In [19]:

n_samples = 1000
X = np.random.random(n_samples)
meanX = np.mean(X)
sdX = np.std(X)

In [20]:

fig = plt.figure(figsize=(10,4))
ax = sns.stripplot(x=X,alpha=0.3,jitter = 0.1, size=10, marker=".")
plt.title(r"Échantillon (loi uniforme, $n ={0}$)".format(n_samples)) 
plt.axvline(meanX, linestyle='--', color='k', label="$\\bar{x}_n$")
plt.legend()
plt.show()

In [21]:

Z = (X - meanX) / sdX

In [22]:

fig = plt.figure(figsize=(10,4))
ax = sns.stripplot(x=Z,alpha=0.3,jitter = 0.1, size=10, marker=".")
plt.title(r"Échantillon standardiseé (loi uniforme, $n ={0}$)".format(n_samples)) 
plt.axvline(0, linestyle='--', color='k', label="$\\bar{x}_n$")
plt.legend()
plt.show()

In [23]:

alpha = 0.00001
c =  norm.ppf(1- alpha/ 2)
IC = np.array([meanX - c * sdX / np.sqrt(n_samples), meanX + c * sdX / np.sqrt(n_samples)])

In [ ]:

In [24]:

IC, meanX, c

Out[24]:

(array([0.46672845, 0.54562877]), 0.5061786072452197, 4.417173413467605)

In [25]:

from matplotlib.ticker import StrMethodFormatter
fig = plt.figure(figsize=(10, 4))
p = plt.axvspan(IC[0], IC[1], facecolor=sns.color_palette()[
                5], alpha=0.5, label="Intervalle à {}%".format(1-alpha))
ax = sns.stripplot(x=X, alpha=0.3, jitter=0.1, size=10, marker=".")
plt.title(r"Échantillon (loi uniforme, $n ={0}$)".format(n_samples))
plt.axvline(meanX, linestyle='--', color='k', label="$\\bar{x}_n$")
bottom, top = plt.ylim()
xtixcks_val = np.arange(0, 1.1, 0.1)
plt.xticks(xtixcks_val, xtixcks_val)
plt.gca().xaxis.set_major_formatter(StrMethodFormatter('{x:,.2f}')) # 2 decimal places
plt.legend()
plt.show()
my_saving_display(fig, dirname,
                  "jitter_IC_TCL", imageformat, saving=saving)

In [26]:

def IC_by_TCL(alpha=0.05, n_samples=1000):

    X = np.random.random(n_samples)
    meanX = np.mean(X)
    sdX = np.std(X)
    Z = (X - meanX) / sdX
    c = norm.ppf(1 - alpha / 2)
    IC = np.array([meanX - c * sdX / np.sqrt(n_samples),
                   meanX + c * sdX / np.sqrt(n_samples)])
    fig = plt.figure(figsize=(10, 4))
    p = plt.axvspan(IC[0], IC[1], facecolor=sns.color_palette()[
                    5], alpha=0.5, label="Intervalle à {}%".format(1 - alpha))
    ax = sns.stripplot(x=X, alpha=0.3, jitter=0.1, size=10, marker=".")
    plt.title(r"Échantillon (loi uniforme, $n ={0}$)".format(n_samples))
    plt.axvline(meanX, linestyle='--', color='k', label="$\\bar{x}_n$")
    bottom, top = plt.ylim()
    xtixcks_val = np.arange(0, 1.1, 0.1)
    plt.xticks(xtixcks_val, xtixcks_val)
    plt.gca().xaxis.set_major_formatter(
        StrMethodFormatter('{x:,.2f}'))  # 2 decimal places
    plt.legend()
    plt.show()

Visualiseur de l'impact de $\alpha$ et de $n$ sur les IC du TCL¶

In [27]:

interact(IC_by_TCL, alpha=(0.001, .999, 0.001),n_samples=(100,2000,100))  # change the first and second value to check more quantiles 

Out[27]:

<function __main__.IC_by_TCL>

In [28]:

# np.shape(np.where(np.logical_and(X>=IC[0], X<=IC[1])))[1]

On répète la création de la statisque sur 500 tirages d'échantillons de taille 1000:¶

In [29]:

is_IC_has_param = 0 # before starting no IC contains the true parameter 
n_repetition = 500
for i in range(n_repetition):
    n_samples = 1000
    X = np.random.random(n_samples)
    meanX = np.mean(X)
    sdX = np.std(X)
    Z = (X - meanX) / sdX
    alpha = 0.04
    c =  norm.ppf(1- alpha/ 2)
    IC = np.array([meanX - c * sdX / np.sqrt(n_samples), meanX + c * sdX / np.sqrt(n_samples)])
    if (IC[0]<0.5) and (IC[1]>0.5):
        is_IC_has_param += 1
#     else:
#         print(IC)
"Proportion de fois que l'échantiollon contient le vrai paramètre (=espérance): {}".format(is_IC_has_param/n_repetition)

Out[29]:

"Proportion de fois que l'échantiollon contient le vrai paramètre (=espérance): 0.96"

In [ ]: