Phi_K spark tutorial

This notebook shows how to obtain the Phi_K correlation matrix for a Spark DataFrame. Calculating the Phi_K matrix consists of two steps:

  • Obtain the 2d contingency tables for all variable pairs. To make these we use the popmon package, which relies on the Spark version of the histogrammar package.
  • Calculate the Phi_K value for each variable pair from its contingency table.
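To illustrate what step 1 produces for a single variable pair, a 2d contingency table can be built in plain pandas with `pd.crosstab`. This is only a conceptual sketch on made-up toy data; the notebook below uses popmon histograms instead, which scale to Spark DataFrames:

```python
import numpy as np
import pandas as pd

# Toy example (hypothetical data): one categorical variable pair.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=100),
    "shape": rng.choice(["circle", "square"], size=100),
})

# Step 1 for this pair: the observed counts per (color, shape) combination.
contingency = pd.crosstab(df["color"], df["shape"])
print(contingency.shape)  # (3, 2): one row per color, one column per shape
```

The Phi_K value for the pair (step 2) is then computed from this table of observed counts.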

Make sure the popmon package is installed; it is used to make the 2d histograms from which the Phi_K values are then calculated.

In [ ]:
!pip install popmon
In [ ]:
import itertools

import pandas as pd
import popmon
from popmon.analysis.hist_numpy import get_2dgrid

import phik
from phik import resources
from phik.phik import spark_phik_matrix_from_hist2d_dict

Histogramming in popmon is done using the histogrammar library.

In [ ]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.config('spark.jars.packages','org.diana-hep:histogrammar-sparksql_2.11:1.0.4').getOrCreate()
sc = spark.sparkContext

Load data

A simulated dataset is part of the phik package. The dataset contains fake car insurance data. Load the dataset here:

In [ ]:
data = pd.read_csv( resources.fixture('fake_insurance_data.csv.gz') )
sdf = spark.createDataFrame(data)
In [ ]:
combis = itertools.combinations_with_replacement(sdf.columns, 2)
combis = [list(c) for c in combis]
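As a sanity check on the pair list: `combinations_with_replacement` includes each column paired with itself, so for n columns it yields n*(n+1)/2 pairs. A small self-contained example (with hypothetical column names):

```python
import itertools

# Hypothetical column names, for illustration only.
columns = ["driver_age", "car_value", "mileage"]

combis = [list(c) for c in itertools.combinations_with_replacement(columns, 2)]

# n*(n+1)/2 = 3*4/2 = 6 pairs, including the diagonal ("self") pairs.
print(len(combis))  # 6
print(combis[0])    # ['driver_age', 'driver_age']
```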
Step 1: create histograms (this runs Spark histogrammar in the background)

In [ ]:
# see the doc-string of pm_make_histograms() for binning options.
hists = sdf.pm_make_histograms(combis)
In [ ]:
grids = {k:get_2dgrid(h) for k,h in hists.items()}
In [ ]:
# Optionally store the 2d grids for later reuse; set to True to enable.
if False:
    import pickle

    with open('grids.pkl', 'wb') as outfile:
        pickle.dump(grids, outfile)

    with open('grids.pkl', 'rb') as handle:
        grids = pickle.load(handle)

Step 2: calculate the Phi_K matrix (runs RDD parallelization over all 2d histograms)

In [ ]:
phik_matrix = spark_phik_matrix_from_hist2d_dict(sc, grids)
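The result is, to my understanding, a pandas DataFrame with one row and one column per variable. As a sketch of how it can be inspected, the following uses a made-up toy matrix (not real output; Phi_K values lie in [0, 1], with 1 on the diagonal) to locate the most correlated variable pair:

```python
import numpy as np
import pandas as pd

# Toy stand-in for phik_matrix, with hypothetical values.
cols = ["a", "b", "c"]
phik_matrix = pd.DataFrame(
    [[1.0, 0.3, 0.7],
     [0.3, 1.0, 0.1],
     [0.7, 0.1, 1.0]],
    index=cols, columns=cols,
)

# Mask the diagonal (self-correlations) and find the strongest pair.
off_diag = phik_matrix.where(~np.eye(len(cols), dtype=bool))
pair = off_diag.stack().idxmax()
print(pair)  # ('a', 'c')
```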