Phi_K advanced tutorial

This notebook guides you through the more advanced functionality of the phik package. It does not cover the underlying theory, but aims to give an overview of all the options that are available. For a theoretical description the user is referred to our paper.

The package offers functionality on three related topics:

  1. Phik correlation matrix
  2. Significance matrix
  3. Outlier significance matrix
In [1]:
# import standard packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import itertools

import phik

from phik import resources
from phik.binning import bin_data
from phik.decorators import *
from phik.report import plot_correlation_matrix

%matplotlib inline
In [2]:
# if one changes something in the phik-package one can automatically reload the package or module
%load_ext autoreload
%autoreload 2

Load data

A simulated dataset is part of the phik-package. The dataset concerns car insurance data. Load the dataset here:

In [3]:
data = pd.read_csv( resources.fixture('fake_insurance_data.csv.gz') )
In [4]:
data.head()
Out[4]:
car_color driver_age area mileage car_size
0 black 26.377219 suburbs 156806.288398 XXL
1 black 58.976840 suburbs 74400.323559 XL
2 multicolor 55.744988 downtown 267856.748015 XXL
3 metalic 57.629139 downtown 259028.249060 XXL
4 green 21.490637 downtown 110712.216080 XL

Specify bin types

The phik-package offers a way to calculate correlations between variables of mixed types. Variable types can be inferred automatically, although we recommend that the variable types be specified by the user.

Because interval-type variables need to be binned in order to calculate phik and the significance, a list of interval variables is created.

In [5]:
data_types = {'severity': 'interval',
             'driver_age':'interval',
             'satisfaction':'ordinal',
             'mileage':'interval',
             'car_size':'ordinal',
             'car_use':'ordinal',
             'car_color':'categorical',
             'area':'categorical'}

interval_cols = [col for col, v in data_types.items() if v=='interval' and col in data.columns]
interval_cols
# interval_cols is used below
Out[5]:
['driver_age', 'mileage']

Phik correlation matrix

Now let's start calculating the phik correlation between pairs of variables.

Note that the original dataset is used as input; the binning of interval variables is done automatically.

In [6]:
phik_overview = data.phik_matrix(interval_cols=interval_cols)
phik_overview
Out[6]:
var2 area car_color car_size driver_age mileage
var1
area 1.000000 0.590456 0.000000 0.105506 0.000000
car_color 0.590456 1.000000 0.000000 0.389671 0.000000
car_size 0.000000 0.000000 1.000000 0.000000 0.768589
driver_age 0.105506 0.389671 0.000000 1.000000 0.000000
mileage 0.000000 0.000000 0.768589 0.000000 1.000000

Specify binning per interval variable

Binning can be set per interval variable individually. One can set the number of bins, or specify a list of bin edges. Note that the measured phik correlation is dependent on the chosen binning. The default binning is uniform between the min and max values of the interval variable.

In [7]:
bins = {'mileage':5, 'driver_age':[18,25,35,45,55,65,125]}
phik_overview = data.phik_matrix(interval_cols=interval_cols, bins=bins)
phik_overview
Out[7]:
var2 area car_color car_size driver_age mileage
var1
area 1.000000 0.590456 0.000000 0.071189 0.000000
car_color 0.590456 1.000000 0.000000 0.388350 0.000000
car_size 0.000000 0.000000 1.000000 0.000000 0.665845
driver_age 0.071189 0.388350 0.000000 1.000000 0.000000
mileage 0.000000 0.000000 0.665845 0.000000 1.000000

Do not apply noise correction

For low statistics samples a correlation larger than zero is often measured when no correlation is actually present in the true underlying distribution. This is not only the case for phik, but also for the Pearson correlation and Cramer's phi (see figure 4 in XX). In the phik calculation a noise correction is applied by default, to account for erroneous correlation values caused by low statistics. To switch off this noise correction (not recommended), do:

In [8]:
phik_overview = data.phik_matrix(interval_cols=interval_cols, noise_correction=False)
phik_overview
Out[8]:
var2 area car_color car_size driver_age mileage
var1
area 1.000000 0.594173 0.067452 0.190390 0.149679
car_color 0.594173 1.000000 0.096630 0.407860 0.136265
car_size 0.067452 0.096630 1.000000 0.121585 0.770836
driver_age 0.190390 0.407860 0.121585 1.000000 0.199606
mileage 0.149679 0.136265 0.770836 0.199606 1.000000

Statistical significance of the correlation

When assessing correlations it is good practice to evaluate both the correlation and its significance: a large correlation may be statistically insignificant, and vice versa a small correlation may be very significant. For instance, scipy.stats.pearsonr returns both the Pearson correlation and the p-value. Similarly, the phik package offers functionality to calculate a significance matrix. Significance is defined as:

$$Z = \Phi^{-1}(1-p)\ ;\quad \Phi(z)=\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^{2}/2}\,{\rm d}t $$
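For illustration, here is a minimal scipy sketch of this definition; the p-value is made up and this is not part of the phik API:

from scipy import stats

p_value = 0.0027             # example p-value
Z = stats.norm.isf(p_value)  # Phi^{-1}(1 - p): significance in units of sigma
print(Z)                     # ~2.78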

Several corrections to the 'standard' p-value calculation are taken into account, making the method more robust for low statistics and sparse data cases. The user is referred to our paper for more details.

Due to the corrections, the significance calculation can take a few seconds.

In [9]:
significance_overview = data.significance_matrix(interval_cols=interval_cols)
significance_overview
Out[9]:
var2 area car_color car_size driver_age mileage
var1
area 72.422354 37.609814 -0.321721 1.844649 -0.703350
car_color 37.609814 85.463792 -0.618694 19.813735 -0.617533
car_size -0.321721 -0.618694 69.045839 -0.534410 49.209947
driver_age 1.844649 19.813735 -0.534410 84.322135 -0.721688
mileage -0.703350 -0.617533 49.209947 -0.721688 91.224339

Specify binning per interval variable

Binning can be set per interval variable individually. One can set the number of bins, or specify a list of bin edges. Note that the measured phik correlation is dependent on the chosen binning.

In [10]:
bins = {'mileage':5, 'driver_age':[18,25,35,45,55,65,125]}
significance_overview = data.significance_matrix(interval_cols=interval_cols, bins=bins)
significance_overview
Out[10]:
var2 area car_color car_size driver_age mileage
var1
area 72.386151 37.605468 -0.311415 2.345344 -0.320132
car_color 37.605468 85.449436 -0.539453 20.518680 -0.327428
car_size -0.311415 -0.539453 69.053634 -0.562903 47.013748
driver_age 2.345344 20.518680 -0.562903 83.331359 -0.586750
mileage -0.320132 -0.327428 47.013748 -0.586750 77.765892

Specify significance method

The recommended method to calculate the significance of the correlation is a hybrid approach, which uses the G-test statistic. The number of degrees of freedom and an analytical, empirical description of the $\chi^2$ distribution are used, based on Monte Carlo simulations. This method works well for both high and low statistics samples.

Other approaches to calculate the significance are implemented:

  • asymptotic: fast, but over-estimates the number of degrees of freedom for low statistics samples, leading to erroneous values of the significance
  • MC: many simulated samples are needed to accurately measure significances larger than 3, making this method computationally expensive (see the sketch after this list).
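For example, a Monte Carlo evaluation would look like this; a sketch only, expect long runtimes:

significance_overview = data.significance_matrix(interval_cols=interval_cols,
                                                 significance_method='MC')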
In [11]:
significance_overview = data.significance_matrix(interval_cols=interval_cols, significance_method='asymptotic')
significance_overview
Out[11]:
var2 area car_color car_size driver_age mileage
var1
area 72.440209 37.661844 -0.123678 1.742050 -0.465002
car_color 37.661844 85.526574 -0.333340 19.681564 -0.385023
car_size -0.123678 -0.333340 69.107448 -0.793434 49.332305
driver_age 1.742050 19.681564 -0.793434 84.014654 -0.947153
mileage -0.465002 -0.385023 49.332305 -0.947153 91.301129

Simulation method

The $\chi^2$ of a contingency table is calculated by comparing the observed frequencies with the expected frequencies in the table. The expected frequencies can be simulated in a variety of ways. The following methods are implemented (a short numpy sketch of the expected frequencies under independence follows the list):

  • multinominal: Only the total number of records is fixed. (default)
  • row_product_multinominal: The row totals are fixed in the sampling.
  • col_product_multinominal: The column totals are fixed in the sampling.
  • hypergeometric: Both the row and column totals are fixed in the sampling. (Note that this type of sampling is only available when the row and column totals are integers, which is usually the case.)
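As a reference, the expected frequencies under the independence hypothesis can be computed directly from the margins of the contingency table. A minimal numpy sketch, using a made-up table and not the phik internals:

import numpy as np

observed = np.array([[10., 20.], [30., 40.]])        # example contingency table
row_totals = observed.sum(axis=1, keepdims=True)     # row margins
col_totals = observed.sum(axis=0, keepdims=True)     # column margins
expected = row_totals * col_totals / observed.sum()  # expected frequencies under independence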
In [12]:
# --- Warning, can be slow
#     turned off here by default for unit testing purposes

#significance_overview = data.significance_matrix(interval_cols=interval_cols, simulation_method='hypergeometric')
#significance_overview

Expected frequencies

In [13]:
from phik.simulation import sim_2d_data_patefield, sim_2d_product_multinominal, sim_2d_data
In [14]:
inputdata = data[['driver_age', 'area']].hist2d(interval_cols=['driver_age'])
inputdata
Out[14]:
area country_side downtown hills suburbs unpaved_roads
driver_age
1 11.0 86.0 123.0 147.0 21.0
2 9.0 77.0 137.0 125.0 31.0
3 7.0 102.0 131.0 130.0 18.0
4 17.0 83.0 130.0 95.0 14.0
5 13.0 68.0 120.0 72.0 8.0
6 7.0 30.0 51.0 47.0 9.0
7 1.0 11.0 23.0 14.0 7.0
8 0.0 4.0 7.0 8.0 2.0
9 0.0 0.0 1.0 1.0 0.0
10 0.0 1.0 1.0 0.0 0.0

Multinominal

In [15]:
simdata = sim_2d_data(inputdata.values)
print('data total:', inputdata.sum().sum())
print('sim  total:', simdata.sum().sum())
print('data row totals:', inputdata.sum(axis=0).values)
print('sim  row totals:', simdata.sum(axis=0))
print('data column totals:', inputdata.sum(axis=1).values)
print('sim  column totals:', simdata.sum(axis=1))
data total: 2000.0
sim  total: 2000.0
data row totals: [ 65. 462. 724. 639. 110.]
sim  row totals: [ 71. 492. 728. 607. 102.]
data column totals: [388. 379. 388. 339. 281. 144.  56.  21.   2.   2.]
sim  column totals: [423. 380. 382. 332. 278. 126.  58.  16.   2.   3.]

product multinominal

In [16]:
simdata = sim_2d_product_multinominal(inputdata.values, 'rows')
print('data total:', inputdata.sum().sum())
print('sim  total:', simdata.sum().sum())
print('data row totals:', inputdata.sum(axis=0).astype(int).values)
print('sim  row totals:', simdata.sum(axis=0).astype(int))
print('data column totals:', inputdata.sum(axis=1).astype(int).values)
print('sim  column totals:', simdata.sum(axis=1).astype(int))
data total: 2000.0
sim  total: 2000.0
data row totals: [ 65 462 724 639 110]
sim  row totals: [ 65 462 724 639 110]
data column totals: [388 379 388 339 281 144  56  21   2   2]
sim  column totals: [389 371 393 382 248 143  43  24   2   5]

hypergeometric

In [17]:
simdata = sim_2d_data_patefield(inputdata.values)
print('data total:', inputdata.sum().sum())
print('sim  total:', simdata.sum().sum())
print('data row totals:', inputdata.sum(axis=0).astype(int).values)
print('sim  row totals:', simdata.sum(axis=0))
print('data column totals:', inputdata.sum(axis=1).astype(int).values)
print('sim  column totals:', simdata.sum(axis=1))
data total: 2000.0
sim  total: 2000
data row totals: [ 65 462 724 639 110]
sim  row totals: [ 65 462 724 639 110]
data column totals: [388 379 388 339 281 144  56  21   2   2]
sim  column totals: [388 379 388 339 281 144  56  21   2   2]

Outlier significance

The regular Pearson correlation between two interval variables is easy to interpret. However, the phik correlation between two variables of mixed type is not always easy to interpret, especially when it concerns categorical variables. Therefore, functionality is provided to detect "outliers": excesses and deficits over the expected frequencies in the contingency table of two variables.

Example 1: mileage versus car_size

For the variable pair mileage - car_size we measured:

$$\phi_k = 0.77 \, ,\quad\quad \mathrm{significance} = 49.2$$

Let's use the outlier significance functionality to gain a better understanding of this significant correlation between mileage and car size.

In [18]:
c0 = 'mileage'
c1 = 'car_size'

tmp_interval_cols = ['mileage']
In [19]:
outlier_signifs, binning_dict = data[[c0,c1]].outlier_significance_matrix(interval_cols=tmp_interval_cols, 
                                                                          retbins=True)
outlier_signifs
Out[19]:
car_size L M S XL XS XXL
53.5_30047.0 6.882155 21.483476 18.076204 -8.209536 10.820863 -22.423985
30047.0_60040.5 20.034528 -0.251737 -3.408409 2.534277 -1.973628 -8.209536
60040.5_90033.9 1.627610 -3.043497 -2.265809 10.215936 -1.246784 -8.209536
90033.9_120027.4 -3.711579 -3.827278 -2.885475 12.999048 -1.638288 -7.185622
120027.4_150020.9 -7.665861 -6.173001 -4.746762 9.629145 -2.841508 -0.504521
150020.9_180014.4 -7.533189 -6.063786 -4.660049 1.559370 -2.785049 6.765549
180014.4_210007.8 -5.541940 -4.425929 -3.360023 -4.802787 -1.942469 10.520540
210007.8_240001.3 -3.496905 -2.745103 -2.030802 -5.850529 -1.100873 8.723925
240001.3_269994.8 -5.275976 -4.207164 -3.186534 -8.616464 -1.830944 13.303101
269994.8_299988.2 -8.014016 -6.458253 -4.973240 -12.868389 -2.989055 20.992824

Specify binning per interval variable

Binning can be set per interval variable individually. One can set the number of bins, or specify a list of bin edges.

Note: in case a bin is created without any records, this bin will be automatically dropped in the phik and (outlier) significance calculations. However, in the outlier significance calculation this will currently lead to an error, as the number of provided bin edges no longer matches the number of bins.
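A minimal sketch of one way to spot empty bins up front, assuming the custom bin edges used in the next cell:

import numpy as np

edges = [0, 1E2, 1E3, 1E4, 1E5, 1E6]
counts, _ = np.histogram(data['mileage'].dropna(), bins=edges)
print(counts)  # a zero entry signals an empty bin; remove or merge the corresponding edges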

In [20]:
bins = [0,1E2, 1E3, 1E4, 1E5, 1E6]
outlier_signifs, binning_dict = data[[c0,c1]].outlier_significance_matrix(interval_cols=tmp_interval_cols, 
                                                                          bins=bins, retbins=True)
outlier_signifs
Out[20]:
car_size L M S XL XS XXL
0.0_100.0 -0.223635 -0.153005 -0.096640 -0.504167 2.150837 -1.337308
100.0_1000.0 -0.742899 -0.533211 2.164954 -1.469996 5.704340 -3.272689
1000.0_10000.0 -3.489668 3.499856 18.061724 -6.831062 11.617394 -13.063085
10000.0_100000.0 25.086723 15.956527 -0.251877 5.162309 -3.896807 -8.209536
100000.0_1000000.0 -8.209536 -17.223164 -13.626621 -2.140870 -8.688844 44.933133

Specify binning per interval variable -- dealing with underflow and overflow

When specifying custom bins, a situation can occur where the minimal (maximal) value in the data is smaller (larger) than the minimum (maximum) bin edge. Data points outside the specified range will be collected in the underflow (UF) and overflow (OF) bins. One can choose how to deal with these under/overflow bins by setting the drop_underflow and drop_overflow variables.

Note that the drop_underflow and drop_overflow options are also available for the calculation of the phik matrix and the significance matrix.
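For instance, a sketch for the phik matrix with the same keywords:

phik_overview = data.phik_matrix(interval_cols=interval_cols,
                                 bins={'mileage': [1E2, 1E3, 1E4, 1E5]},
                                 drop_underflow=False, drop_overflow=False)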

In [21]:
bins = [1E2, 1E3, 1E4, 1E5]
outlier_signifs, binning_dict = data[[c0,c1]].outlier_significance_matrix(interval_cols=tmp_interval_cols, 
                                                                          bins=bins, retbins=True, 
                                                                          drop_underflow=False,
                                                                          drop_overflow=False)
outlier_signifs
Out[21]:
car_size L M S XL XS XXL
100.0_1000.0 -0.742899 -0.533211 2.164954 -1.469996 5.704340 -3.272689
1000.0_10000.0 -3.489668 3.499856 18.061724 -6.831062 11.617394 -13.063085
10000.0_100000.0 25.086723 15.956527 -0.251877 5.162309 -3.896807 -8.209536
OF -8.209536 -17.223164 -13.626621 -2.140870 -8.688844 44.933133
UF -0.223635 -0.153005 -0.096640 -0.504167 2.150837 -1.337308

Dealing with NaN's in the data

Let's add some missing values to our data

In [22]:
data.loc[np.random.choice(range(len(data)), size=10), 'car_size'] = np.nan
data.loc[np.random.choice(range(len(data)), size=10), 'mileage'] = np.nan

Sometimes there can be information in the missing values, in which case you might want to consider the NaN values as a separate category. This can be achieved by setting the dropna argument to False.

In [23]:
bins = [1E2, 1E3, 1E4, 1E5]
outlier_signifs, binning_dict = data[[c0,c1]].outlier_significance_matrix(interval_cols=tmp_interval_cols, 
                                                                          bins=bins, retbins=True, 
                                                                          drop_underflow=False,
                                                                          drop_overflow=False,
                                                                          dropna=False)
outlier_signifs
Out[23]:
car_size L M NaN S XL XS XXL
100.0_1000.0 -0.736212 -0.533211 -0.053620 2.175087 -1.461969 5.704340 -3.262069
1000.0_10000.0 -4.077375 3.532244 0.667050 17.885209 -6.765709 11.645544 -12.961110
10000.0_100000.0 24.702076 15.781631 -0.320348 -0.156340 5.229720 -3.880851 -8.209536
NaN 1.029392 0.488424 -0.073439 -0.460500 -0.124655 -0.211155 -0.633265
OF -8.209536 -17.158980 -0.283391 -13.484761 -2.208915 -8.651800 43.991714
UF -0.221302 -0.153005 -0.013130 -0.095429 -0.500819 2.150837 -1.332095

Here UF and OF are the underflow and overflow bins of mileage, respectively.

To just ignore records with missing values set dropna to True (default).

In [24]:
bins = [1E2, 1E3, 1E4, 1E5]
outlier_signifs, binning_dict = data[[c0,c1]].outlier_significance_matrix(interval_cols=tmp_interval_cols, 
                                                                          bins=bins, retbins=True, 
                                                                          drop_underflow=False,
                                                                          drop_overflow=False,
                                                                          dropna=True)
outlier_signifs
Out[24]:
car_size L M S XL XS XXL
100.0_1000.0 -0.735678 -0.534179 2.167263 -1.468203 5.695755 -3.279400
1000.0_10000.0 -4.052167 3.559705 17.912001 -6.751717 11.651568 -12.952671
10000.0_100000.0 25.052541 15.868135 -0.186055 5.221619 -3.896177 -8.209536
OF -8.209536 -17.164792 -13.548056 -2.235086 -8.695547 44.737489
UF -0.221109 -0.153312 -0.096377 -0.503408 2.146765 -1.340589

Note that the dropna option is also available for the calculation of the phik matrix and the significance matrix.
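For example, a sketch treating NaN as its own category in the full correlation matrix:

phik_overview = data.phik_matrix(interval_cols=interval_cols, dropna=False)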