Michael Defferrard and Hermina Petric Maretic, based on exercises from the 2016 edition of NTDS.
%matplotlib inline
import sys
import os
import warnings
import numpy as np
import pandas as pd
warnings.filterwarnings('ignore')
filename = os.path.join('..', 'data', 'credit_card_defaults.csv')
data = pd.read_csv(filename, index_col=0)
attributes = data.columns.tolist()
# Transform from numerical to categorical variables.
data['SEX'] = data['SEX'].astype('category').cat.rename_categories(['MALE', 'FEMALE'])
data['MARRIAGE'] = data['MARRIAGE'].astype('category').cat.rename_categories(['UNK', 'MARRIED', 'SINGLE', 'OTHERS'])
data['EDUCATION'] = data['EDUCATION'].astype('category').cat.rename_categories(['UNK', 'GRAD SCHOOL', 'UNIVERSITY', 'HIGH SCHOOL', 'OTHERS', 'UNK1', 'UNK2'])
data.loc[:6, ['LIMIT', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'DEFAULT']]
| ID | LIMIT | SEX | EDUCATION | MARRIAGE | AGE | DEFAULT |
|---|---|---|---|---|---|---|
| 1 | 20000 | FEMALE | UNIVERSITY | MARRIED | 24 | 1 |
| 2 | 120000 | FEMALE | UNIVERSITY | SINGLE | 26 | 1 |
| 3 | 90000 | FEMALE | UNIVERSITY | SINGLE | 34 | 0 |
| 4 | 50000 | FEMALE | UNIVERSITY | MARRIED | 37 | 0 |
| 5 | 50000 | MALE | UNIVERSITY | MARRIED | 57 | 0 |
| 6 | 50000 | MALE | GRAD SCHOOL | SINGLE | 37 | 0 |
data.iloc[:5, 4:10]
| ID | AGE | DELAY1 | DELAY2 | DELAY3 | DELAY4 | DELAY5 |
|---|---|---|---|---|---|---|
| 1 | 24 | 2 | 2 | -1 | -1 | -2 |
| 2 | 26 | -1 | 2 | 0 | 0 | 0 |
| 3 | 34 | 0 | 0 | 0 | 0 | 0 |
| 4 | 37 | 0 | 0 | 0 | 0 | 0 |
| 5 | 57 | -1 | 0 | -1 | 0 | 0 |
data.iloc[:5, 11:23]
| ID | BILL1 | BILL2 | BILL3 | BILL4 | BILL5 | BILL6 | PAY1 | PAY2 | PAY3 | PAY4 | PAY5 | PAY6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3913 | 3102 | 689 | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 |
| 2 | 2682 | 1725 | 2682 | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 |
| 3 | 29239 | 14027 | 13559 | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 |
| 4 | 46990 | 48233 | 49291 | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 |
| 5 | 8617 | 5670 | 35835 | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 |
Export as an HTML table for manual inspection.
data[:1000].to_html('subset.html')
Problems come in two flavours: missing values, i.e. unknown or undocumented entries, and erroneous values. The actions to be taken in each case are highly data and problem specific.
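As a quick first screen, here is a minimal sketch (assuming the data frame loaded above) that counts missing values per column and lists the defined categories, where undocumented codes show up:
# Count missing values (NaN) per column.
print(data.isna().sum())
# List the categories actually defined for each categorical column.
for col in ['SEX', 'EDUCATION', 'MARRIAGE']:
    print(col, data[col].cat.categories.tolist())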
Example: the marital status column has an undocumented category (UNK). Let's remove those clients.
print(data['MARRIAGE'].value_counts())
data = data[data['MARRIAGE'] != 'UNK']
data['MARRIAGE'] = data['MARRIAGE'].cat.remove_unused_categories()
print('\nWe are left with {} clients\n'.format(data.shape[0]))
print(data['MARRIAGE'].unique())
SINGLE     15964
MARRIED    13659
OTHERS       323
UNK           54
Name: MARRIAGE, dtype: int64

We are left with 29946 clients

[MARRIED, SINGLE, OTHERS]
Categories (3, object): [MARRIED, SINGLE, OTHERS]
Example: the education column has three undocumented categories (UNK, UNK1 and UNK2). Let's merge them into a single UNK category.
print(data['EDUCATION'].value_counts())
data.loc[data['EDUCATION'] == 'UNK1', 'EDUCATION'] = 'UNK'
data.loc[data['EDUCATION'] == 'UNK2', 'EDUCATION'] = 'UNK'
data['EDUCATION'] = data['EDUCATION'].cat.remove_unused_categories()
print(data['EDUCATION'].value_counts())
UNIVERSITY     14024
GRAD SCHOOL    10581
HIGH SCHOOL     4873
UNK1             280
OTHERS           123
UNK2              51
UNK               14
Name: EDUCATION, dtype: int64

UNIVERSITY     14024
GRAD SCHOOL    10581
HIGH SCHOOL     4873
UNK              345
OTHERS           123
Name: EDUCATION, dtype: int64
attributes_numerical = ['LIMIT', 'AGE']
attributes_numerical.extend(attributes[11:23])
data.loc[:, attributes_numerical].describe().astype(int)
| | LIMIT | AGE | BILL1 | BILL2 | BILL3 | BILL4 | BILL5 | BILL6 | PAY1 | PAY2 | PAY3 | PAY4 | PAY5 | PAY6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 29946 | 29946 | 29946 | 29946 | 29946 | 29946 | 29946 | 29946 | 29946 | 29946 | 29946 | 29946 | 29946 | 29946 |
| mean | 167546 | 35 | 51278 | 49224 | 47063 | 43306 | 40352 | 38911 | 5659 | 5926 | 5227 | 4829 | 4804 | 5220 |
| std | 129807 | 9 | 73682 | 71219 | 69393 | 64374 | 60836 | 59592 | 16552 | 23060 | 17618 | 15677 | 15290 | 17791 |
| min | 10000 | 21 | -165580 | -69777 | -157264 | -170000 | -81334 | -339603 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25% | 50000 | 28 | 3570 | 2988 | 2684 | 2335 | 1770 | 1261 | 1000 | 836 | 390 | 298 | 255 | 122 |
| 50% | 140000 | 34 | 22400 | 21221 | 20108 | 19066 | 18121 | 17098 | 2100 | 2010 | 1800 | 1500 | 1500 | 1500 |
| 75% | 240000 | 41 | 67263 | 64108 | 60240 | 54601 | 50244 | 49248 | 5007 | 5000 | 4511 | 4015 | 4040 | 4000 |
| max | 1000000 | 79 | 964511 | 983931 | 1664089 | 891586 | 927171 | 961664 | 873552 | 1684259 | 896040 | 621000 | 426529 | 528666 |
Let's plot a histogram of the ages to get a better impression of who our clients are. That may even be an end goal, e.g. if your marketing team asks which customer groups to target.
Then a box plot of the bills, which may serve to verify the quality of the acquired data.
data.loc[:, 'AGE'].plot.hist(bins=20, figsize=(15,5))
ax = data.iloc[:, 11:17].plot.box(logy=True, figsize=(15,5))
Simple question: what proportion of our clients defaults?
percentage = data['DEFAULT'].value_counts()[1] / data.shape[0] * 100
print('Percentage of defaults: {:.2f}%'.format(percentage))
Percentage of defaults: 22.14%
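Equivalently, since DEFAULT is a 0/1 indicator at this point, its mean is the default rate; a one-line sketch:
# The mean of a 0/1 indicator is the proportion of ones.
print('Percentage of defaults: {:.2f}%'.format(data['DEFAULT'].mean() * 100))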
Another question: who is more susceptible to default, males or females?
observed = pd.crosstab(data['SEX'], data['DEFAULT'], margins=True)
observed
| SEX \ DEFAULT | 0 | 1 | All |
|---|---|---|---|
| MALE | 9003 | 2871 | 11874 |
| FEMALE | 14312 | 3760 | 18072 |
| All | 23315 | 6631 | 29946 |
It seems like females are a better risk. Let's verify with a chi-squared test of independence, using scipy.stats.
from scipy import stats
_, p, _, expected = stats.chi2_contingency(observed.iloc[:2,:2])
print('p-value = {:.2e}'.format(p))
print('expected values:\n{}'.format(expected))
p-value = 6.75e-12
expected values:
[[ 9244.71749148  2629.28250852]
 [14070.28250852  4001.71749148]]
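As a sanity check, the expected counts under independence are the outer product of the margins divided by the grand total; a minimal sketch computed from the crosstab above:
# Expected counts: row margin * column margin / grand total.
row_totals = observed.iloc[:2, 2].values  # MALE and FEMALE totals
col_totals = observed.iloc[2, :2].values  # DEFAULT=0 and DEFAULT=1 totals
print(np.outer(row_totals, col_totals) / observed.iloc[2, 2])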
Intuition: people who pay late present a higher risk of defaulting. Let's verify! Verifying some intuitions will also help you identify mistakes. E.g. it would be suspicious if that intuition were not confirmed by the data: did we select the right column, or did we miscompute a result?
group = data.groupby('DELAY1').mean(numeric_only=True)
corr = data['DEFAULT'].corr(data['DELAY1'], method='pearson')
group['DEFAULT'].plot(grid=True, title='Pearson correlation: {:.4f}'.format(corr), figsize=(15,5));
group = data.groupby('AGE').mean(numeric_only=True)
group['LIMIT'].plot(grid=True, title='Mean credit card limit per age', figsize=(15,5));
print(data['AGE'].value_counts())
29    1604
27    1475
28    1408
30    1393
26    1253
31    1213
25    1186
34    1160
32    1157
33    1145
24    1125
35    1111
36    1106
37    1035
39     953
38     943
23     930
40     868
41     821
42     793
44     699
43     670
45     613
46     569
22     560
47     499
48     466
49     451
50     409
51     338
53     324
52     304
54     247
55     208
56     177
58     122
57     122
59      83
60      67
21      67
61      56
62      44
63      31
64      31
66      25
65      24
67      16
69      15
70      10
68       5
73       4
71       3
72       3
75       3
74       1
79       1
Name: AGE, dtype: int64
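The counts get thin above age 60, so the per-age means above are noisy there. A hedged sketch (the decade-wide bins are an arbitrary choice) that aggregates ages before averaging:
# Bin the ages into decades for more robust group statistics.
bins = pd.cut(data['AGE'], bins=range(20, 90, 10))
print(data.groupby(bins)['LIMIT'].mean())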
Statsmodels is similar to scikit-learn, with a much stronger emphasis on parameter estimation and (statistical) testing. It is similar in spirit to other statistical packages such as R, SPSS, SAS and Stata. That split reflects the two statistical modeling cultures: (1) statistics, which wants to know how well a given model fits the data and which variables "explain" or affect the outcome, and (2) machine learning, where the main supported task is choosing the "best" model for prediction.
# Back to numeric values.
data['SEX'] = data['SEX'].cat.rename_categories([-1, 1]).astype(int)
data['MARRIAGE'] = data['MARRIAGE'].cat.rename_categories([-1, 1, 0]).astype(int)
data['EDUCATION'] = data['EDUCATION'].cat.rename_categories([-2, 2, 1, 0, -1]).astype(int)
data['DEFAULT'] = data['DEFAULT'] * 2 - 1 # [0,1] --> [-1,1]
# Observations and targets.
X = data.values[:,:23]
y = data.values[:,23]
n, d = X.shape
print('The data is a {} with {} samples of dimensionality {}.'.format(type(X), n, d))
The data is a <class 'numpy.ndarray'> with 29946 samples of dimensionality 23.
import statsmodels.api as sm
# Fit the Ordinary Least Squares regression model (note that no intercept column is added to X).
results = sm.OLS(y, X).fit()
# Inspect the results.
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.384
Model:                            OLS   Adj. R-squared:                  0.383
Method:                 Least Squares   F-statistic:                     809.4
Date:                Mon, 27 Nov 2017   Prob (F-statistic):               0.00
Time:                        15:09:14   Log-Likelihood:                 -35248.
No. Observations:               29946   AIC:                         7.054e+04
Df Residuals:                   29923   BIC:                         7.073e+04
Df Model:                          23
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1         -2.364e-07   4.35e-08     -5.428      0.000   -3.22e-07   -1.51e-07
x2            -0.0337      0.005     -7.280      0.000      -0.043      -0.025
x3            -0.0265      0.006     -4.656      0.000      -0.038      -0.015
x4            -0.0801      0.005    -16.708      0.000      -0.090      -0.071
x5            -0.0103      0.000    -42.750      0.000      -0.011      -0.010
x6             0.1853      0.006     33.159      0.000       0.174       0.196
x7             0.0352      0.007      5.221      0.000       0.022       0.048
x8             0.0196      0.007      2.695      0.007       0.005       0.034
x9             0.0060      0.008      0.747      0.455      -0.010       0.022
x10            0.0114      0.009      1.314      0.189      -0.006       0.029
x11            0.0068      0.007      0.959      0.338      -0.007       0.021
x12        -1.372e-06   2.31e-07     -5.946      0.000   -1.82e-06    -9.2e-07
x13         3.537e-07   3.24e-07      1.091      0.275   -2.82e-07    9.89e-07
x14         3.364e-08   3.05e-07      0.110      0.912   -5.65e-07    6.32e-07
x15        -1.465e-07   3.18e-07     -0.461      0.645    -7.7e-07    4.77e-07
x16         1.527e-08   3.73e-07      0.041      0.967   -7.16e-07    7.46e-07
x17         2.117e-07   2.95e-07      0.717      0.473   -3.67e-07     7.9e-07
x18        -1.583e-06   3.58e-07     -4.421      0.000   -2.29e-06   -8.81e-07
x19        -4.521e-07   2.94e-07     -1.536      0.125   -1.03e-06    1.25e-07
x20        -3.437e-08   3.41e-07     -0.101      0.920   -7.04e-07    6.35e-07
x21        -6.357e-07   3.72e-07     -1.710      0.087   -1.36e-06    9.28e-08
x22        -6.479e-07   3.86e-07     -1.680      0.093    -1.4e-06    1.08e-07
x23        -1.959e-07   2.76e-07     -0.710      0.478   -7.37e-07    3.45e-07
==============================================================================
Omnibus:                     4479.171   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6851.842
Skew:                           1.169   Prob(JB):                         0.00
Kurtosis:                       3.160   Cond. No.                     6.14e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.14e+05. This might indicate that there are strong multicollinearity or other numerical problems.
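For contrast with the prediction-oriented culture, here is a minimal sketch of the same regression with scikit-learn (the intercept is disabled to match the statsmodels call above, which did not add a constant):
from sklearn.linear_model import LinearRegression
# scikit-learn's workflow centers on fit/predict/score, not on coefficient tests.
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
print('R^2 on the training data: {:.3f}'.format(model.score(X, y)))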
Data visualization is a key aspect of exploratory data analysis.
To start slowly, let's make a static line plot from some time series. Reproduce the plots below, first with matplotlib directly, then with the built-in plotting of pandas (which wraps matplotlib).
import matplotlib.pyplot as plt
# Random time series.
n = 1000
rs = np.random.RandomState(42)
data = rs.randn(n, 4).cumsum(axis=0)
plt.figure(figsize=(15,5))
plt.plot(data[:, 0], label='A')
plt.plot(data[:, 1], '.-k', label='B')
plt.plot(data[:, 2], '--m', label='C')
plt.plot(data[:, 3], ':', label='D')
plt.legend(loc='upper left')
plt.xticks(range(0, 1000, 50))
plt.ylabel('Value')
plt.xlabel('Day')
plt.grid()
idx = pd.date_range('1/1/2000', periods=n)
df = pd.DataFrame(data, index=idx, columns=list('ABCD'))
df.plot(figsize=(15,5));
A frequency plot is a graph that shows the pattern in a set of data by plotting how often particular values of a measure occur. It often takes the form of a histogram or a box plot.
Seaborn is a statistical visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Its advantage is that you can still modify the produced plots with matplotlib, so you lose nothing.
import seaborn as sns
df = sns.load_dataset('iris', data_home=os.path.join('..', 'data'))
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
g = sns.histplot(df['petal_width'], kde=True, ax=axes[0])
g.set(title='Distribution of petal width')
g = sns.boxplot(x='species', y='petal_width', data=df, ax=axes[1])
g.set(title='Distribution of petal width by species');
Scatter plots are widely used to assess the correlation between two variables. Pair plots are then a useful way of displaying the pairwise relations between the variables in a dataset. Use the seaborn pairplot() function to analyze how separable the iris dataset is.
sns.pairplot(df, hue="species");
Humans can only comprehend up to 3 dimensions (in space; then there are e.g. color and size), so dimensionality reduction is often needed to explore high-dimensional datasets. Analyze how separable the iris dataset is by visualizing it in a 2D scatter plot after reduction from 4 to 2 dimensions with two popular methods: PCA and t-SNE (both used below).
Hint: you can plot the 2D embeddings with the seaborn swarmplot() function.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
pca = PCA(n_components=2)
X = pca.fit_transform(df.values[:, :4])
df['pca1'] = X[:, 0]
df['pca2'] = X[:, 1]
tsne = TSNE(n_components=2)
X = tsne.fit_transform(df.values[:, :4])
df['tsne1'] = X[:, 0]
df['tsne2'] = X[:, 1]
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.swarmplot(x='pca1', y='pca2', data=df, hue='species', ax=axes[0])
sns.swarmplot(x='tsne1', y='tsne2', data=df, hue='species', ax=axes[1]);
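As a side check, PCA exposes how much of the total variance the two kept components capture (t-SNE offers no such measure):
# Fraction of the total variance explained by each principal component.
print(pca.explained_variance_ratio_)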
%matplotlib notebook
import time
import networkx as nx
import scipy.linalg
n = 20
T = 50
er = nx.erdos_renyi_graph(n, 0.3)
er.remove_nodes_from(list(nx.isolates(er)))  # nx.isolates returns a generator; materialize it before removing
n = len(er)
nx.draw(er)
Time series data can also live on the nodes of our graph (e.g. temperatures in cities). Create random time series data for our graph and plot it on the graph, with the node size representing the signal value.
You can also experiment with different layouts for drawing the graph. The default layout positions nodes using the Fruchterman-Reingold force-directed algorithm (spring layout). Draw the graph using the spectral layout.
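For instance, a minimal one-line sketch of the spectral layout, which positions nodes using the eigenvectors of the graph Laplacian:
nx.draw(er, pos=nx.spectral_layout(er))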
sigma_eps = 1
epsilon = stats.norm.rvs(scale=sigma_eps, size=[n, T])  # noise
def draw_graph_time_series(x):
    pos = nx.spectral_layout(er)  # fix the node positions across frames
    for i in range(T):
        ax.clear()
        nx.draw_networkx(er, ax=ax, pos=pos, node_size=x[:, i] * 100)
        ax.text(0, 0, i)  # show the current time step
        fig.canvas.draw()
        time.sleep(0.03)
fig, ax = plt.subplots()
plt.ion()
fig.show()
fig.canvas.draw()
draw_graph_time_series(epsilon)
In the last experiment we plotted noise. Create two events on your graph at times 10 and 30. Each event should be centered in a randomly selected node and increase the signal value of that node by 50. Both events should then propagate through the graph with the heat kernel model. When an event occurs, color the source node of the event to a different color to help visualise it better. Draw the graph using the spring layout.
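The heat kernel model solves the heat equation dx/dτ = -Lx on the graph, so a signal x(0) evolves as x(τ) = expm(-τL) x(0). A minimal sketch of a single diffusion step (node 0 and τ = 5 are arbitrary choices):
# A unit of heat on node 0, diffused for tau time units.
tau = 5
delta = np.zeros(n)
delta[0] = 1
L = nx.normalized_laplacian_matrix(er).toarray()
print(scipy.linalg.expm(-tau * L).dot(delta))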
def create_x(Lapl):
    x = epsilon.copy()
    peak = np.zeros((n, 2))  # one delta signal per event
    for i in range(1, T):
        l = 0
        # Add the contribution of every event that started before time i.
        while t[l] < i:
            peak[source[l], l] = 1
            # Heat kernel: diffuse the delta at the source node for i - t[l] time units.
            event = np.dot(scipy.linalg.expm((t[l] - i) * Lapl), peak[:, l])
            x[:, i] = x[:, i] + 50 * event
            l += 1
            if l >= 2:  # no more than two events
                break
    return x
t = [10, 30]
source = stats.randint.rvs(low=0, high=n, size=2)
Lapl = nx.normalized_laplacian_matrix(er).toarray()  # dense array for scipy.linalg.expm
x = create_x(Lapl)
def draw_graph_diffusion(x):
    pos = nx.spring_layout(er)  # fix the node positions across frames
    node_col = ['r'] * n  # one color per node
    l = 0
    for i in range(T):
        ax.clear()
        if l < 2:
            if i == t[l]:
                node_col[source[l]] = 'b'  # color the source of the event
                l = l + 1
        nx.draw_networkx(er, ax=ax, node_color=node_col, pos=pos, node_size=x[:, i] * 100)
        ax.text(0, 0, i)  # show the current time step
        fig.canvas.draw()
        time.sleep(0.03)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.ion()
fig.show()
fig.canvas.draw()
draw_graph_diffusion(x)