Author: Pratik Sharma ¶

Project 1 - Applied Statistics¶

Data Description: The data at hand contains medical costs of people characterized by certain attributes.

Domain: Healthcare

Context: Leveraging customer information is paramount for most businesses. In the case of an insurance company, attributes of customers like the ones mentioned below can be crucial in making business decisions. Hence, knowing to explore and generate value out of such data can be an invaluable skill to have.

Attribute Information

age: age of primary beneficiary
sex: insurance contractor gender, female, male
bmi: Body mass index, providing an understanding of body weights that are relatively high or low relative to height, objective index of body weight (kg/m^2) using the ratio of height to weight, ideally 18.5 to 24.9
children: Number of children covered by health insurance / Number of dependents
smoker: Smoking
region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest
charges: Individual medical costs billed by health insurance.

Learning Outcomes

Exploratory Data Analysis
Practicing statistics using Python
Hypothesis testing

In [1]:

# Importing packages
import pandas as pd, numpy as np, scipy.stats as stats, matplotlib.pyplot as plt, seaborn as sns
import matplotlib.style as style; style.use('fivethirtyeight')
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.formula.api import ols
import statsmodels.api as sm
pd.options.display.max_rows = 4000
from scipy.stats import chi2

In [2]:

# Reading the data as a dataframe
insurance = pd.read_csv('insurance.csv')

In [3]:

# first five rows of insurance dataframe
insurance.head()

Out[3]:

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

In [4]:

# Printing out shape of the data
insurance.shape

Out[4]:

(1338, 7)

The dataset is of 1338 rows and 7 columns

In [5]:

# Data type of each attribute
insurance.dtypes

Out[5]:

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

Numeric attributes: age, bmi, children, charges

Object attributes: sex, smoker, region

In [6]:

# Checking the presence of missing values
insurance.isnull().sum()

Out[6]:

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

The dataset has no null values

In [7]:

# Five point summary of numerical attributes
insurance.describe()

Out[7]:

	age	bmi	children	charges
count	1338.000000	1338.000000	1338.000000	1338.000000
mean	39.207025	30.663397	1.094918	13270.422265
std	14.049960	6.098187	1.205493	12110.011237
min	18.000000	15.960000	0.000000	1121.873900
25%	27.000000	26.296250	0.000000	4740.287150
50%	39.000000	30.400000	1.000000	9382.033000
75%	51.000000	34.693750	2.000000	16639.912515
max	64.000000	53.130000	5.000000	63770.428010

Measure of CT (3Ms) and Distributions¶

Measure of central tendency describes the entire dataset with a single value or metric which represents the middle or center of distribution. It is also known as measure of center or central location.

Determining 3Ms and checking the distribution of age, bmi and charges columns

In [8]:

# Distribution of 'Age' alongwith measure of CT (3Ms)
age_mean = insurance['age'].mean()
age_median = insurance['age'].median()
age_mode = insurance['age'].mode()

fig, ax_hist = plt.subplots(figsize = (12.8, 6))
ax_hist = sns.distplot(insurance['age'])

ax_hist.axvline(age_mean, color = 'r', linestyle = '--', label = 'Mean')
ax_hist.axvline(age_median, color = 'g', linestyle = '-', label = 'Median')
ax_hist.axvline(age_mode[0], color = 'b', linestyle = '-', label = 'Mode')
ax_hist.set_title('3Ms and Distribution of Age')

plt.legend(); plt.show()

In [9]:

# Distribution of 'BMI' alongwith measure of CT (3Ms)
bmi_mean = insurance['bmi'].mean()
bmi_median = insurance['bmi'].median()
bmi_mode = insurance['bmi'].mode()

fig, ax_hist = plt.subplots(figsize = (12.8, 6))
ax_hist = sns.distplot(insurance['bmi'])

ax_hist.axvline(bmi_mean, color = 'r', linestyle = '--', label = 'Mean')
ax_hist.axvline(bmi_median, color = 'g', linestyle = '-', label = 'Median')
ax_hist.axvline(bmi_mode[0], color = 'b', linestyle = '-', label = 'Mode')
ax_hist.set_title('3Ms and Distribution of BMI')

plt.legend(); plt.show()

In [10]:

# Distribution of 'Charges' alongwith measure of CT (3Ms)
charges_mean = insurance['charges'].mean()
charges_median = insurance['charges'].median()
charges_mode = insurance['charges'].mode()

fig, ax_hist = plt.subplots(figsize = (12.8, 6))
ax_hist = sns.distplot(insurance['charges'])

ax_hist.axvline(charges_mean, color = 'r', linestyle = '--', label = 'Mean')
ax_hist.axvline(charges_median, color = 'g', linestyle = '-', label = 'Median')
ax_hist.axvline(charges_mode[0], color = 'b', linestyle = '-', label = 'Mode')
ax_hist.set_title('3Ms and Distribution of Charges')

plt.legend(); plt.show()

Measure of skewness¶

Skewness is a measure of extent to which a distribution differs from a normal distribution.

The rule of thumb I would use here:

If the skewness is between -0.5 and 0.5, the data are fairly symmetrical
If the skewness is between -1 and -0.5(negatively skewed) or between 0.5 and 1(positively skewed), the data are moderately skewed
If the skewness is less than -1(negatively skewed) or greater than 1(positively skewed), the data are highly skewed

Determining skewness of age, bmi and charges columns and ploting the results after fitting a norm distribution

In [11]:

# Measure of Skewness and kurtosis for 'age', 'bmi' and 'charges' columns
print("Skewness of 'Age': {}\n".format(insurance['age'].skew().round(3)))

print("Skewness of 'BMI': {}\n".format(insurance['bmi'].skew().round(3)))

print("Skewness of 'Charges': {}\n".format(insurance['charges'].skew().round(3)))

Skewness of 'Age': 0.056

Skewness of 'BMI': 0.284

Skewness of 'Charges': 1.516

In [12]:

# Fitting a normal distribution and ploting the result
h = np.asarray(insurance['age'])
h = sorted(h)
 
#use the scipy stats module to fit a normal distirbution with same mean and standard deviation
fit = stats.norm.pdf(h, np.mean(h), np.std(h)) 
 
#plot both series on the histogram
fig, ax_hist = plt.subplots(figsize = (12.8, 6))
plt.plot(h, fit, '-', linewidth = 2, color = 'goldenrod', label = 'Normal distribution')
plt.hist(h, density = True, color = 'lightsteelblue', label = 'Actual distribution')

ax_hist.axvline(age_mean, color = 'r', linestyle = '--', label = 'Mean')
ax_hist.axvline(age_median, color = 'g', linestyle = '-', label = 'Median')
ax_hist.axvline(age_mode[0], color = 'b', linestyle = '-', label = 'Mode')
ax_hist.set_title('Fitting a Normal Distribution and Checking Skewness of Age')

plt.legend(); plt.show()

We can see in the above graph that age is positively (right, Mode < Median < Mean) skewed with skewness score 0.056 (fairly symmetrical).

In [13]:

# Fitting a normal distribution and ploting the result
h = np.asarray(insurance['bmi'])
h = sorted(h)
 
#use the scipy stats module to fit a normal distirbution with same mean and standard deviation
fit = stats.norm.pdf(h, np.mean(h), np.std(h)) 
 
#plot both series on the histogram
fig, ax_hist = plt.subplots(figsize = (12.8, 6))
plt.plot(h, fit, '-', linewidth = 2, color = 'goldenrod', label = 'Normal distribution')
plt.hist(h, density = True, color = 'lightsteelblue', label = 'Actual distribution')

ax_hist.axvline(bmi_mean, color = 'r', linestyle = '--', label = 'Mean')
ax_hist.axvline(bmi_median, color = 'g', linestyle = '-', label = 'Median')
ax_hist.axvline(bmi_mode[0], color = 'b', linestyle = '-', label = 'Mode')
ax_hist.set_title('Fitting a Normal Distribution and Checking Skewness of BMI')

plt.legend(); plt.show()

We can see in the above graph that bmi is positively (right, median < mean) skewed with skewness score 0.284 (fairly symmetrical).

In [14]:

# Fitting a normal distribution and ploting the result
h = np.asarray(insurance['charges'])
h = sorted(h)
 
#use the scipy stats module to fit a normal distirbution with same mean and standard deviation
fit = stats.norm.pdf(h, np.mean(h), np.std(h)) 
 
#plot both series on the histogram
fig, ax_hist = plt.subplots(figsize = (12.8, 6))
plt.plot(h, fit, '-', linewidth = 2, color = 'goldenrod', label = 'Normal distribution')
plt.hist(h, density = True, color = 'lightsteelblue', label = 'Actual distribution')

ax_hist.axvline(charges_mean, color = 'r', linestyle = '--', label = 'Mean')
ax_hist.axvline(charges_median, color = 'g', linestyle = '-', label = 'Median')
ax_hist.axvline(charges_mode[0], color = 'b', linestyle = '-', label = 'Mode')
ax_hist.set_title('Fitting a Normal Distribution and Checking Skewness of Charges')

plt.legend(); plt.show()

We can see in the above graph that charges is positively (right, mode < median < mean) skewed with skewness score 1.516 (highly skewed).

Outliers¶

In statistics, an outlier is an observation point that is distant from other observations.

Ways to detect outliers

*Box Plot*: In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.
*Scatter Plot*: A scatter plot , is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
*Z-score*: The Z-score is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured. In most of the cases a threshold of 3 or -3 is used i.e if the Z-score value is greater than or less than 3 or -3 respectively, that data point will be identified as outliers.
*IQR score*: The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data. It is a measure of the dispersion similar to standard deviation or variance, but is much more robust against outliers.

I would be using Boxplot (to visualize), IQR (to display the columns) for outliers

In [15]:

# Outliers in Age
Q3 = insurance['age'].quantile(0.75)
Q1 = insurance['age'].quantile(0.25)
IQR = Q3 - Q1 
display(insurance.loc[(insurance['age'] < (Q1 - 1.5 * IQR)) | (insurance['age'] > (Q3 + 1.5 * IQR))])

plt.figure(figsize = (12.8 , 6))
sns.boxplot(insurance['age'], palette = 'copper').set_title('Outlier in Age')

	age	sex	bmi	children	smoker	region	charges

Out[15]:

Text(0.5, 1.0, 'Outlier in Age')

In [16]:

# Outliers in BMI
Q3 = insurance['bmi'].quantile(0.75)
Q1 = insurance['bmi'].quantile(0.25)
IQR = Q3 - Q1 
display(insurance.loc[(insurance['bmi'] < (Q1 - 1.5 * IQR)) | (insurance['bmi'] > (Q3 + 1.5 * IQR))])

plt.figure(figsize = (12.8 , 6))
sns.boxplot(insurance['bmi'], palette = 'copper').set_title('Outlier in BMI')

	age	sex	bmi	children	smoker	region	charges
116	58	male	49.06	0	no	southeast	11381.32540
286	46	female	48.07	2	no	northeast	9432.92530
401	47	male	47.52	1	no	southeast	8083.91980
543	54	female	47.41	0	yes	southeast	63770.42801
847	23	male	50.38	1	no	southeast	2438.05520
860	37	female	47.60	2	yes	southwest	46113.51100
1047	22	male	52.58	1	yes	southeast	44501.39820
1088	52	male	47.74	1	no	southeast	9748.91060
1317	18	male	53.13	0	no	southeast	1163.46270

Out[16]:

Text(0.5, 1.0, 'Outlier in BMI')

In [17]:

# Outliers in Charges
Q3 = insurance['charges'].quantile(0.75)
Q1 = insurance['charges'].quantile(0.25)
IQR = Q3 - Q1 
display(insurance.loc[(insurance['charges'] < (Q1 - 1.5 * IQR)) | (insurance['charges'] > (Q3 + 1.5 * IQR))].head(5))

plt.figure(figsize = (12.8 , 6))
sns.boxplot(insurance['charges'], palette = 'copper').set_title('Outlier in Charges Using Boxplot')

	age	sex	bmi	children	smoker	region	charges
14	27	male	42.13	0	yes	southeast	39611.7577
19	30	male	35.30	0	yes	southwest	36837.4670
23	34	female	31.92	1	yes	northeast	37701.8768
29	31	male	36.30	2	yes	southwest	38711.0000
30	22	male	35.60	0	yes	southwest	35585.5760

Out[17]:

Text(0.5, 1.0, 'Outlier in Charges Using Boxplot')

age column has no outliers while the columns such as bmi and charges have the outliers

Distribution of categorical columns (including children)¶

In [18]:

# Convert children to categorical
insurance['children'] = pd.Categorical(insurance['children'])

# Replace non-smoker with 0 and smoker with 1
insurance['smoker'] = insurance['smoker'].replace({'no': 0, 'yes': 1})

# Replace male with 1 and female with 0
insurance['sex'] = insurance['sex'].replace({'female': 0, 'male': 1})

In [19]:

# Distribution of sex - Male - 1, Female - 0
plt.figure(figsize = (12.8 , 6))

insurance['sex'].value_counts().plot.bar(color = sns.color_palette('deep', 2))

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x20c43b01860>

In [20]:

# Distribution of children
plt.figure(figsize = (12.8 , 6))

insurance['children'].value_counts().plot.bar(color = sns.color_palette('deep', 6))

Out[20]:

<matplotlib.axes._subplots.AxesSubplot at 0x20c43a79748>

In [21]:

# Distribution of region
plt.figure(figsize = (12.8 , 6))

insurance['region'].value_counts().plot.bar(color = sns.color_palette('deep', 4))

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x20c439e0588>

In [22]:

# Count of smokers vs sex, Male - 1, Female - 0
fig = plt.figure(figsize = (12.8, 6))

sns.countplot(x = 'smoker', hue = 'sex', palette = 'hls', data = insurance).set_title('Count of Smokers by Sex')

Out[22]:

Text(0.5, 1.0, 'Count of Smokers by Sex')

In [23]:

# Distribution of charges for smokers category
fig = plt.figure(figsize=(12.8, 6))

ax = fig.add_subplot(121)
sns.distplot(insurance[(insurance['smoker'] == 1)]['charges'], color = 'c', 
             ax = ax).set_title('Distribution of charges for smokers')

ax= fig.add_subplot(122)
sns.distplot(insurance[(insurance['smoker'] == 0)]['charges'], color = 'b', 
             ax = ax).set_title('Distribution of charges for non-smokers')

Out[23]:

Text(0.5, 1.0, 'Distribution of charges for non-smokers')

In [24]:

# Count of smokers vs age (<18 and >18)
# Sex: Male - 1, Female - 0
fig = plt.figure(figsize = (12.8, 6))

ax = fig.subplots(1, 2)
sns.countplot(x = 'smoker', hue = 'sex', palette = 'Dark2', 
                  data = insurance[(insurance['age'] <= 18)], ax = ax[0]).set_title('Count of Smoker for Age <=18 by Sex')

sns.countplot(x = 'smoker', hue = 'sex', palette = 'Dark2', 
                  data = insurance[(insurance['age'] > 18)], ax = ax[1]).set_title('Count of Smoker for Age >18 by Sex')

Out[24]:

Text(0.5, 1.0, 'Count of Smoker for Age >18 by Sex')

In [25]:

# Distribution of charges for male and female smoker and non-smokers

fig = plt.figure(figsize = (12.8, 6))

ax = sns.boxplot(x = 'smoker', y = 'charges', hue = 'sex', palette = 'Accent', data = insurance)
ax = sns.stripplot(x = 'smoker', y = 'charges', hue = 'sex', palette = 'Accent', data = insurance, 
              jitter = True, dodge = True, linewidth = 0.5)
ax.set_title('Distribution of Charges for Male and Female by Smoker and Non-smokers')

Out[25]:

Text(0.5, 1.0, 'Distribution of Charges for Male and Female by Smoker and Non-smokers')

In [26]:

# Boxplot for medical charges by number of children
fig = plt.figure(figsize = (12.8, 6))

ax = sns.boxplot(x = 'children', y = 'charges', palette = 'afmhot', data = insurance)
ax = sns.stripplot(x = 'children', y = 'charges', palette = 'afmhot', data = insurance, 
              jitter = True, dodge = True, linewidth = 0.5)
ax.set_title('Distribution of Charges by Number of Children')

Out[26]:

Text(0.5, 1.0, 'Distribution of Charges by Number of Children')

In [27]:

# Boxplot for medical charges by region
fig = plt.figure(figsize = (12.8, 6))

ax = sns.boxplot(x = 'region', y = 'charges', palette = 'afmhot', data = insurance)
ax = sns.stripplot(x = 'region', y = 'charges', palette = 'afmhot', data = insurance, 
              jitter = True, dodge = True, linewidth = 0.5)
ax.set_title('Distribution of Charges by Region')

Out[27]:

Text(0.5, 1.0, 'Distribution of Charges by Region')

In [28]:

# Distribution of charges for patients with BMI greater than 30
fig = plt.figure(figsize = (12.8, 6))

sns.distplot(insurance[(insurance['bmi'] >= 30)]['charges'], color = 'b').set_title(
    'Distribution of charges for patients with BMI greater than 30')

Out[28]:

Text(0.5, 1.0, 'Distribution of charges for patients with BMI greater than 30')

In [29]:

# Distribution of charges for patients with BMI less than 30
fig = plt.figure(figsize = (12.8, 6))

sns.distplot(insurance[(insurance['bmi'] < 30)]['charges'], color = 'm').set_title(
    'Distribution of charges for patients with BMI less than 30')

Out[29]:

Text(0.5, 1.0, 'Distribution of charges for patients with BMI less than 30')

There are almost equal number of male and female records in the dataset. People with zero children are the most in the dataset. Most of the people are from southwest region. Smokers spend most on the treatments and the count of male smokers is higher in the dataset against the female smokers. People with 5 children, on average, has less medical expenditures compared to the other groups.

Pair plot that includes all the columns of the data frame¶

In [30]:

# Replace 0 with no and 1 with yes
insurance['smoker'] = insurance['smoker'].replace({0: 'no', 1: 'yes'})

# Replace 1 with male and 0 with female
insurance['sex'] = insurance['sex'].replace({0: 'female', 1: 'male'})

In [31]:

sns.pairplot(insurance, hue = 'smoker')

Out[31]:

<seaborn.axisgrid.PairGrid at 0x20c43fe2b00>

In [32]:

sns.pairplot(insurance, hue = 'sex', palette = 'husl')

Out[32]:

<seaborn.axisgrid.PairGrid at 0x20c46fa0f60>

In [33]:

sns.pairplot(insurance, hue = 'region', palette = 'Set1')

Out[33]:

<seaborn.axisgrid.PairGrid at 0x20c47708dd8>

Statistical evidence: Do charges of people who smoke differs significantly from the people who don't¶

T-test: A t-test is a type of inferential statistic which is used to determine if there is a significant difference between the means of two groups which may be related in certain features.

H0 = Charges for smokers and non-smokers don't differ significantly

H1 = Charges for smokers and non-smokers differs significantly

In [34]:

smokers = np.array(insurance[insurance['smoker'] == 'yes']['charges'])
non_smokers = np.array(insurance[insurance['smoker'] == 'no']['charges'])

print('Mean of Charges for Smokers: {}'.format(smokers.mean().round(2)))
print('Mean of Charges for Non-smokers: {}\n'.format(non_smokers.mean().round(2)))

#performing an independent T-test
t, p_value = stats.ttest_ind(smokers, non_smokers, axis = 0)
if p_value <0.05:
    
    print(f'With a p-value of {round(p_value, 4)} the difference is significant. aka |We reject the null|')
    print('Charges differs significantly.')

else:
    
    print(f'With a p-value of {round(p_value, 4)} the difference is not significant. aka |We fail to reject the null|')
    print("Charges don't differ significantly.")

Mean of Charges for Smokers: 32050.23
Mean of Charges for Non-smokers: 8434.27

With a p-value of 0.0 the difference is significant. aka |We reject the null|
Charges differs significantly.

Statistical evidence: Does bmi of males differ significantly from that of females?¶

H0 = BMI of male and female don't differ significantly

H1 = BMI of male and female differs significantly

In [35]:

male_bmi = np.array(insurance[insurance['sex'] == 'male']['bmi'])
female_bmi = np.array(insurance[insurance['sex'] == 'female']['bmi'])

print('Mean of Charges for Smokers: {}'.format(male_bmi.mean().round(2)))
print('Mean of Charges for Non-smokers: {}\n'.format(female_bmi.mean().round(2)))

#performing an independent T-test
t, p_value = stats.ttest_ind(male_bmi, female_bmi, axis = 0)
if p_value <0.05:
    
    print(f'With a p-value of {round(p_value, 4)} the difference is significant. aka |We reject the null|')
    print('BMI of male and female differs significantly.')

else:
    
    print(f'With a p-value of {round(p_value, 4)} the difference is not significant. aka |We fail to reject the null|')
    print("BMI of male and female don't differ significantly.")

Mean of Charges for Smokers: 30.94
Mean of Charges for Non-smokers: 30.38

With a p-value of 0.09 the difference is not significant. aka |We fail to reject the null|
BMI of male and female don't differ significantly.

In [36]:

# Visualizing
fig = plt.figure(figsize=(12.8, 6))

ax = fig.add_subplot(121)
sns.distplot(insurance[(insurance['sex'] == 'male')]['bmi'], color = 'orange', 
             ax = ax).set_title('Distribution of BMI of Male')

ax= fig.add_subplot(122)
sns.distplot(insurance[(insurance['sex'] == 'female')]['bmi'], color = 'green', 
             ax = ax).set_title('Distribution of BMI of Female')

Out[36]:

Text(0.5, 1.0, 'Distribution of BMI of Female')

Statistical evidence: Is the proportion of smokers significantly different in different genders?¶

H0: Proportion of smokers in male and female are equal

H1: Proportion of smokers in male and female are not equal

In [37]:

female_smokers = insurance[insurance['sex'] == 'female'].smoker.value_counts()[1]  # number of female smokers
male_smokers = insurance[insurance['sex'] == 'male'].smoker.value_counts()[1] # number of male smokers
n_females = insurance['sex'].value_counts()['female'] # number of females in the data
n_males = insurance['sex'].value_counts()['male'] #number of males in the data

In [38]:

print([female_smokers, male_smokers] , [n_females, n_males])
print(f'Proportion of smokers in females, males = {round(115/662,2)}%, {round(159/676,2)}% respectively')

[115, 159] [662, 676]
Proportion of smokers in females, males = 0.17%, 0.24% respectively

Proportions are different, but are they statistically significant?

In [39]:

stat, p_value = proportions_ztest([female_smokers, male_smokers] , [n_females, n_males])

if p_value < 0.05:
    print(f'With a p-value of {round(p_value, 4)} the difference is significant. aka |We reject the null|')
    print('Proportion of smokers in male and female are not equal.')

else:
    print(f'With a p-value of {round(p_value, 4)} the difference is not significant. aka |We fail to reject the null|')
    print('Proportion of smokers in male and female are equal.')

With a p-value of 0.0053 the difference is significant. aka |We reject the null|
Proportion of smokers in male and female are not equal.

Broader category¶

Chi-square test is used to compare categorical variables.

H0: Smoker and Gender are independent

H1: Smoker and Gender are not independent

In [40]:

contingency_table = pd.crosstab(insurance['sex'], insurance['smoker'])
print('Contigency Table:')
display(np.array(contingency_table))

#Observed Values
Observed_Values = contingency_table.values
print('Observed Values:')
display(Observed_Values)

b = stats.chi2_contingency(contingency_table)
Expected_Values = b[3]
print('Expected Values:')
display(Expected_Values)

Contigency Table:

array([[547, 115],
       [517, 159]], dtype=int64)

Observed Values:

array([[547, 115],
       [517, 159]], dtype=int64)

Expected Values:

array([[526.43348281, 135.56651719],
       [537.56651719, 138.43348281]])

In [41]:

no_of_rows = len(contingency_table.iloc[0:2, 0])

no_of_columns = len(contingency_table.iloc[0, 0:2])

ddof = (no_of_rows - 1)*(no_of_columns - 1) #degree of freedom

alpha = 0.05 #alpha value

chi_square = sum([(o-e)**2./e for o,e in zip(Observed_Values, Expected_Values)]) #chi square

chi_square_statistic = chi_square[0] + chi_square[1] #chi square statistic

critical_value = chi2.ppf(q = 1-alpha, df = ddof) #critical value

p_value = 1-chi2.cdf(x = chi_square_statistic,df = ddof) #p-value

In [42]:

print('Significance level: ', alpha)
print('Degree of Freedom: ', ddof)
print('chi-square statistic: ', chi_square_statistic)
print('critical_value: ', critical_value)
print('p-value: ', p_value)

Significance level:  0.05
Degree of Freedom:  1
chi-square statistic:  7.765921028604451
critical_value:  3.841458820694124
p-value:  0.005324114164320548

In [43]:

if chi_square_statistic >= critical_value:
    print('Reject H0, there is a relationship between 2 categorical variables based on critical values.')
    print('Proportion of Smokers and Gender are independent.')
    
else:
    print('Fail to reject H0, there is no relationship between 2 categorical variables based on critical values.')
    print('Proportion of Smokers and Gender are not independent.')
    
if p_value <= alpha:
    print('Reject H0, there is a relationship between 2 categorical variables based on p-values.')
    print('Proportion of Smokers and Gender are independent.')

else:
    print('Fail to reject H0, there is no relationship between 2 categorical variables based on p-values.')
    print('Proportion of Smokers and Gender are not independent.')

Reject H0, there is a relationship between 2 categorical variables based on critical values.
Proportion of Smokers and Gender are independent.
Reject H0, there is a relationship between 2 categorical variables based on p-values.
Proportion of Smokers and Gender are independent.

Statistical evidence: Is the distribution of bmi across women with no children, one child and two children, the same?¶

ANOVA, also known as analysis of variance, is used to compare multiple (three or more) samples with a single test. There are 2 major flavors of ANOVA

One-way ANOVA: It is used to compare the difference between the three or more samples/groups of a single independent variable.
MANOVA: MANOVA allows us to test the effect of one or more independent variable on two or more dependent variables. In addition, MANOVA can also detect the difference in co-relation between dependent variables given the groups of independent variables.

The hypothesis being tested in ANOVA is

H0: All pairs of samples are same i.e. all sample means are equal
H1: At least one pair of samples is significantly different

In [44]:

anova = insurance[['bmi', 'sex', 'children']].copy()
anova = anova[anova['sex'] == 'female']
anova.drop('sex', axis = 1, inplace = True)

anova = anova.loc[(anova['children'] == 0) | (anova['children'] == 1) | (anova['children'] == 2)]
anova['children'] = anova['children'].replace({0: 'No Child', 1: '1 Child', 2: '2 Child'})
anova = anova.reset_index(drop = True)

groups = anova.groupby('children').groups

no_child = anova['bmi'][groups['No Child']]
one_child = anova['bmi'][groups['1 Child']]
two_child = anova['bmi'][groups['2 Child']]

# Perform the ANOVA
stats.f_oneway(no_child, one_child, two_child)

Out[44]:

F_onewayResult(statistic=0.3344720147757968, pvalue=0.7158579926754841)

In [45]:

# Another way
model = ols('bmi ~ children', data = anova).fit()
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    bmi   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.3345
Date:                Wed, 14 Aug 2019   Prob (F-statistic):              0.716
Time:                        02:18:13   Log-Likelihood:                -1821.7
No. Observations:                 566   AIC:                             3649.
Df Residuals:                     563   BIC:                             3662.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               30.0527      0.482     62.305      0.000      29.105      31.000
children[T.2 Child]      0.5971      0.736      0.811      0.417      -0.848       2.043
children[T.No Child]     0.3089      0.600      0.515      0.607      -0.869       1.487
==============================================================================
Omnibus:                        6.792   Durbin-Watson:                   1.891
Prob(Omnibus):                  0.034   Jarque-Bera (JB):                6.671
Skew:                           0.234   Prob(JB):                       0.0356
Kurtosis:                       2.747   Cond. No.                         4.24
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [46]:

# ANOVA table
aov_table = sm.stats.anova_lm(model, typ=2)
print(aov_table)

                sum_sq     df         F    PR(>F)
children     24.590123    2.0  0.334472  0.715858
Residual  20695.661583  563.0       NaN       NaN

Higher F statistic implies a relationship between the variables. Generally, we take the cutoff for p-value as 0.05 (which is 95% significance level). We fail to reject our null hypothesis and conclude that the distribution of bmi across women with no children, one child and two children is same i.e. means are ~equal.

In [47]:

# Visualizing
fig, ax_hist = plt.subplots(figsize = (12.8, 6))

sns.distplot(insurance[(insurance['sex'] == 'female') & (insurance['children'] == 0)]['bmi'], 
             color = 'orange', hist = False, label = 'No Child')
ax_hist.axvline(insurance[(insurance['sex'] == 'female') & (insurance['children'] == 0)]['bmi'].mean(), 
                color = 'orange', linestyle = '--', label = 'No Child Mean')

sns.distplot(insurance[(insurance['sex'] == 'female') & (insurance['children'] == 1)]['bmi'], 
             color = 'green', hist = False, label = '1 Child')
ax_hist.axvline(insurance[(insurance['sex'] == 'female') & (insurance['children'] == 1)]['bmi'].mean(), 
                color = 'green', linestyle = '--', label = '1 Child Mean')

sns.distplot(insurance[(insurance['sex'] == 'female') & (insurance['children'] == 2)]['bmi'], 
             color = 'blue', hist = False, label = '2 Child')
ax_hist.axvline(insurance[(insurance['sex'] == 'female') & (insurance['children'] == 2)]['bmi'].mean(), 
                color = 'blue', linestyle = '--', label = '2 Child Mean')

plt.suptitle('Distribution of BMI of Female with 0, 1, 2 Children'); plt.legend(); plt.show()

Author: Pratik Sharma¶