Skope-Rules Demo: application to cluster description

This notebook demonstrates skope-rules applied to describing segments obtained from a clustering analysis. The dataset describes football players and their performance attributes.

The notebook performs a hierarchical clustering with 4 segments. Skope-rules is then used to interpret each segment with a one-vs-all approach (one segment versus all the others).

The notebook is structured into 4 parts:

  1. Imports
  2. Data preparation
  3. Clustering of data
  4. Interpretation of clusters

1. Imports

In [1]:
# Import skope-rules
from skrules import SkopeRules

# Import libraries
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
import warnings

# Import the football players dataset
data = pd.read_csv('../data/CompleteDataset.csv')

2. Data preparation

In [2]:
data = data.query("Overall>=85") # Keep players whose overall rating is at least 85/100.

column_to_keep = ['Name', 'Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control',
       'Composure', 'Crossing', 'Curve', 'Dribbling', 'Finishing',
       'Free kick accuracy', 'GK diving', 'GK handling', 'GK kicking',
       'GK positioning', 'GK reflexes', 'Heading accuracy', 'Preferred Positions']
data = data[column_to_keep] # Keep only performance attributes and names.

data.columns = [x.replace(' ', '_') for x in data.columns] # Replace whitespace in column names, so they are valid identifiers in rules and queries

feature_names = data.drop(['Name', 'Preferred_Positions'], axis=1).columns.tolist()
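
Note that, depending on the version of the CSV, the attribute columns may be parsed as strings (some exports of this dataset encode values such as '85+3'). The snippet below is a defensive sketch based on that assumption; it is a no-op if the columns are already numeric.

# Assumption: attributes may be strings like '85+3'; keep the base value only.
for col in feature_names:
    data[col] = pd.to_numeric(data[col].astype(str).str.split('+').str[0],
                              errors='coerce')
data = data.dropna(subset=feature_names) # Drop rows where coercion failed.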

3. Clustering of data

In [3]:
clust = AgglomerativeClustering(n_clusters=4) # with Euclidean distance and Ward linkage (the scikit-learn defaults)

data['cluster'] = clust.fit_predict(data.drop(['Name', 'Preferred_Positions'], axis=1))
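
Before interpreting the segments, a quick sanity check on their sizes (a minimal sketch):

# Number of players in each of the 4 clusters (labels 0 to 3).
print(data['cluster'].value_counts().sort_index())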

4. Interpretation of clusters

The 4 clusters obtained have to be interpreted. This can be done with a graphical analysis (parallel coordinates, comparison of means, reasoning from example players). This notebook presents an approach that consists in looking for a simple way to separate a cluster from the rest of the population (and repeating the process for each cluster).

With this one-vs-all approach, interpretation becomes a supervised binary classification task. Skope-rules is very useful here because a good interpretation of a cluster relies on a simple expression of the frontier that isolates it. And that is exactly skope-rules' scope!

In [4]:
warnings.filterwarnings('ignore') # To silence the warning raised by max_samples=1 (see below).
# With max_samples=1, there is no out-of-bag sample to evaluate performance (it is evaluated
# on all samples). As there are fewer than 100 samples and this is a clustering-oriented task,
# the risk of overfitting is not dramatic here.

for i_cluster in range(4):
    X_train = data.drop(['Name', 'Preferred_Positions', 'cluster'], axis=1)
    y_train = (data['cluster'] == i_cluster) * 1 # one-vs-all binary target
    skope_rules_clf = SkopeRules(feature_names=feature_names, random_state=42, n_estimators=5,
                                 recall_min=0.5, precision_min=0.5, max_depth_duplication=0,
                                 max_samples=1., max_depth=3)
    skope_rules_clf.fit(X_train, y_train)
    print('Cluster '+str(i_cluster)+':')
    print(skope_rules_clf.rules_)
    #print(data.query('cluster=='+str(i_cluster))[['Name', 'Preferred_Positions']])
Cluster 0:
[('Agility <= 81.5 and Free_kick_accuracy > 56.0 and Heading_accuracy > 58.5', (0.93548387096774188, 0.8529411764705882, 10))]
Cluster 1:
[('Aggression <= 76.5 and Agility > 81.5 and Balance > 66.5', (1.0, 0.77419354838709675, 8))]
Cluster 2:
[('Curve <= 61.5 and Heading_accuracy > 82.5', (1.0, 0.7857142857142857, 8))]
Cluster 3:
[('Curve <= 28.0', (1.0, 1.0, 4))]
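
Each element of these lists is a (rule, (precision, recall, n_extractions)) tuple. A small sketch to unpack the output of the last fitted classifier into a more readable form:

# Pretty-print each rule with its performance tuple.
for rule, (precision, recall, n_extractions) in skope_rules_clf.rules_:
    print('%s -> precision=%.2f, recall=%.2f, extracted %d times'
          % (rule, precision, recall, n_extractions))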

In cluster 0, we find players with good heading and free-kick accuracy, but who are not the most agile (<= 81/100). This rule is a good description of cluster 0: it captures 85% of cluster 0, with a precision of 93% (7% of the players selected by the rule are not in cluster 0). The third element of the performance tuple (10) is the number of times this rule was extracted from the trees built during skope-rules' fitting.

In cluster 1, we find very agile players who are not the most aggressive but are well balanced. This rule is perfectly precise (100%) but misses about 23% of the cluster.

In cluster 2, we find players who are accurate with their heads but less skilled at dribbling (low Curve).

In cluster 3, we find players who are very bad at dribbling. This rule perfectly defines the cluster (100% precision, 100% recall): this is the goal-keeper cluster.
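
Because whitespace in the column names was replaced with underscores earlier, each rule string is also a valid pandas query, which makes it easy to double-check a rule by hand. For example, for the cluster 3 rule:

# Re-apply the cluster 3 rule as a pandas query and recompute its performance.
selected = data.query('Curve <= 28.0')
precision = (selected['cluster'] == 3).mean()
recall = (selected['cluster'] == 3).sum() / (data['cluster'] == 3).sum()
print(precision, recall) # Both should be 1.0 for this rule.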

In [5]:
for i_cluster in range(4):
    print('5 players from cluster '+str(i_cluster)+':')
    print(data.query("cluster=="+str(i_cluster))['Name'].sample(5, random_state=42).tolist()) # Get 5 random players per cluster
5 players from cluster 0:
['M. Hamšík', 'Alex Sandro', 'Casemiro', 'K. Benzema', 'Z. Ibrahimović']

5 players from cluster 1:
['H. Mkhitaryan', 'David Silva', 'F. Ribéry', 'J. Rodríguez', 'P. Dybala']

5 players from cluster 2:
['Pepe', 'K. Glik', 'G. Chiellini', 'V. Kompany', 'Piqué']

5 players from cluster 3:
['M. ter Stegen', 'D. Subašić', 'M. Neuer', 'K. Navas', 'H. Lloris']

In brief, cluster 0 tends to concentrate strikers and midfielders who are talented with their heads. Cluster 1 tends to group the other midfielders. Cluster 2 focuses on defenders, while goal-keepers are gathered in cluster 3!

For a visual analysis (a kind of parallel coordinates plot) of these clusters, you can check the companion Kaggle kernel.