What "types" of cell towers do we have?

In this notebook we will consider the question of whether we can find different 'types' of cell towers based on their usage activity.

As part of this we will have a few things to figure out:

  • What feature representation will we use for each cell?
  • What clustering approach will we use?
  • How will we then try to make sense of and interpret the clusters?

We will touch on each of these considerations as we go. And like any DS project there are countless ways we could go about formulating the problem; below is an initial approach which could no doubt be iterated on and improved.

Setup

Imports and some settings...

In [1]:
import pandas as pd
import numpy as np
import glob
import hdbscan
from sklearn.cluster import KMeans
import seaborn as sns
import matplotlib.pyplot as plt
import geojson
import matplotlib.colors as colors
import matplotlib.cm as cmx
from descartes import PolygonPatch

%matplotlib inline

# set path to where data and files are
MY_DIR = r'C:\Users\Andrew\github\mobile-phone-activity'

Get data

Pull in each csv file one by one, append them all into df_big, and take a smaller sample of the full data as df_small (for experimentation or quick development).

In [2]:
# get data

# get file paths
path = MY_DIR
# raw string so the backslashes are not treated as escape sequences
files = glob.glob(path + r"\data\sms-call-internet-mi-2013-11-*")

# empty list to store individual df's into 
list_ = []

# loop through files
for file_ in files:
    df_tmp = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df_tmp)

# concat all the df's into one
df_big = pd.concat(list_)

# make a small version of the full data for developing with
df_small = df_big.sample(n = 10000)

Data prep

First we will just rename some columns and print out some summary stats and info on our data...

In [3]:
# pick big or small df to work with
#df = df_small.copy()
df = df_big.copy()

# clean up some col names
df.rename(columns={'CellID':'cell_id'}, inplace=True)

# ensure cell_id is a string (for count uniques in describe())
df['cell_id'] = df['cell_id'].astype(str)

# ensure datetime is correct format
df['datetime'] = pd.to_datetime(df['datetime'])

print(df.shape)
df.sample(5)
(15089165, 8)
Out[3]:
datetime cell_id countrycode smsin smsout callin callout internet
1869372 2013-11-05 18:00:00 8140 355 0.0299 NaN NaN NaN NaN
1394976 2013-11-01 18:00:00 1636 43 NaN NaN NaN NaN 0.0082
1072241 2013-11-05 12:00:00 9907 39 3.0518 1.1450 4.0324 4.3305 26.2501
2202317 2013-11-07 21:00:00 4943 40 0.0717 0.1076 NaN 0.0359 NaN
1115129 2013-11-02 15:00:00 7629 48 NaN NaN 0.0657 NaN NaN

Print out some info on the data...

In [4]:
# look at some info about df
df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15089165 entries, 0 to 2407383
Data columns (total 8 columns):
datetime       datetime64[ns]
cell_id        object
countrycode    int64
smsin          float64
smsout         float64
callin         float64
callout        float64
internet       float64
dtypes: datetime64[ns](1), float64(5), int64(1), object(1)
memory usage: 1.0+ GB

Print out summary stats for each column...

In [5]:
df.describe(include='all')
Out[5]:
datetime cell_id countrycode smsin smsout callin callout internet
count 15089165 15089165 1.508916e+07 6.023545e+06 3.546510e+06 4.109309e+06 6.484303e+06 6.719964e+06
unique 168 10000 NaN NaN NaN NaN NaN NaN
top 2013-11-07 16:00:00 6064 NaN NaN NaN NaN NaN NaN
freq 143769 6578 NaN NaN NaN NaN NaN NaN
first 2013-11-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN
last 2013-11-07 23:00:00 NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN 3.625260e+02 7.553846e+00 7.248224e+00 7.348474e+00 5.340862e+00 1.023244e+02
std NaN NaN 4.373834e+03 2.646614e+01 2.440298e+01 2.865058e+01 2.528570e+01 4.195864e+02
min NaN NaN 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% NaN NaN 3.300000e+01 1.044000e-01 9.390000e-02 5.880000e-02 5.650000e-02 5.790000e-02
50% NaN NaN 4.300000e+01 6.233000e-01 5.766500e-01 2.540000e-01 2.115000e-01 2.084000e-01
75% NaN NaN 8.600000e+01 4.659600e+00 4.475600e+00 2.362000e+00 1.049800e+00 9.899125e+00
max NaN NaN 9.725900e+04 2.288739e+03 2.270682e+03 1.328088e+03 1.511686e+03 3.174863e+04

Missing data

In this case we are safe to set missing values to 0.

That is, we are assuming the absence of a measurement for a given cell in a specific hour represents a lack of traffic, as opposed to any issue with the data...

In [6]:
# get missing value percentages
print("% Missing Values")
print(df.isnull().sum()/len(df)*100)
% Missing Values
datetime        0.000000
cell_id         0.000000
countrycode     0.000000
smsin          60.080329
smsout         76.496314
callin         72.766492
callout        57.026761
internet       55.464971
dtype: float64

...it looks from the above that there is indeed quite a bit of missing data here. Given the data is already aggregated up to cell and hour level, I'd probably want to ask a few more questions about this. Is it really common enough for a cell to go a whole hour without any calls in or out?

Not much we can do about that here so we will take it at face value...
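
One quick probe we could run on that question (a sketch, not part of the original analysis) is to look at how often all five metrics are missing at once - if all-null rows are rare, each recorded row carries at least one measurement and the nulls mostly read as "no traffic of that type in that hour":

# probe the missingness pattern before filling: rows where every metric
# is null look more like "no traffic recorded" than isolated failures
metrics = ['smsin', 'smsout', 'callin', 'callout', 'internet']
all_null = df[metrics].isnull().all(axis=1).mean()
any_null = df[metrics].isnull().any(axis=1).mean()
print("% rows with all metrics missing: {:.2f}%".format(all_null * 100))
print("% rows with at least one metric missing: {:.2f}%".format(any_null * 100))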

In [7]:
# fill all nulls with 0
df.fillna(value=0, inplace=True)
In [8]:
# look at a sample of data and keep track of the df shape as we do stuff
print(df.shape)
df.sample(5)
(15089165, 8)
Out[8]:
datetime cell_id countrycode smsin smsout callin callout internet
734575 2013-11-07 10:00:00 5316 1 0.0000 0.0 0.0000 0.0000 0.0465
1701043 2013-11-05 17:00:00 6198 33 0.3131 0.0 0.1904 0.0000 0.4869
741661 2013-11-04 10:00:00 9959 32 0.0000 0.0 0.0000 0.0000 0.1045
1072278 2013-11-02 15:00:00 3966 230 0.0000 0.0 0.1350 0.0000 0.0000
1662969 2013-11-06 17:00:00 1236 40 0.0000 0.0 0.0000 0.0579 0.0000

Design considerations

We will build some features to help us aggregate up from day and hour to "weekday" or "weekend" and "time of day" (night, day, or evening).

This will help us form a representation that should capture certain ways we think the data might behave, while staying as compact as possible so as to avoid feeding lots of features into the clustering, which would be subject to the curse of dimensionality.

This is one point where we are making a specific design decision; as always it's one we could play around with and revisit as we iterate (e.g. why not aggregate at a daily level, or slice the day into different time bins?).

In [9]:
# get if weekend or weekday from datetime
df['weekday'] = np.where((df['datetime'].dt.dayofweek < 5) , 'weekday', 'weekend')

# segment hours into times of day
def timeofday(hour):
    if hour <= 6:
        return 'night'
    elif hour <= 17:
        return 'day'
    elif hour <= 23:
        return 'evening'
df['timeofday'] = df['datetime'].dt.hour.apply(timeofday)

# make a single key field to describe day and time of traffic data
df['day_time'] = df['weekday'] + '_' + df['timeofday']

# drop some columns we no longer need
del df['datetime']
del df['countrycode']
del df['weekday']
del df['timeofday']

print(df.shape)
df.sample(5)
(15089165, 7)
Out[9]:
cell_id smsin smsout callin callout internet day_time
614220 6202 0.0000 0.0000 0.0000 0.0 0.0504 weekday_day
979574 4825 0.0000 0.4174 0.0000 0.0 0.0000 weekend_day
1922317 666 0.0000 0.0000 0.0226 0.0 0.0000 weekday_evening
1582299 3489 0.0892 0.0000 0.0000 0.0 0.2835 weekday_day
111118 5302 0.0000 0.0000 0.0000 0.0 0.0450 weekday_night

Metrics as percentages

We will transform our data to be percentages - the idea here is to help with interpretation, and it also acts as a way to normalize all our data onto a similar scale, which is key for the distance measures used in clustering.

Again, this is a design choice. If we are really interested in the behaviour of the traffic in terms of its ebbs and flows, then it's reasonable to focus on percentages. However, one trade-off is that any notion of the size of a cell, in terms of the volume of traffic it handles, is lost.

So we will first get the total values for each metric across the whole week and then use those to transform into percentages - for example, "callin" will now become the % of that cell's total call traffic (in plus out) over the sample period that falls on that specific "day_time" slice.

In [10]:
# get cell totals for the whole week
df_cell_totals = df.groupby(['cell_id'], as_index=False).sum()

# get overall volumes regardless of direction
df_cell_totals['calltotal'] = abs(df_cell_totals['callin']) + abs(df_cell_totals['callout'])
df_cell_totals['smstotal'] = abs(df_cell_totals['smsin']) + abs(df_cell_totals['smsout'])
df_cell_totals['internettotal'] = df_cell_totals['internet']

print(df_cell_totals.shape)
df_cell_totals.sample(5)
(10000, 9)
Out[10]:
cell_id smsin smsout callin callout internet calltotal smstotal internettotal
8367 8529 1426.5506 848.5671 993.1165 1113.5407 21195.3750 2106.6572 2275.1177 21195.3750
3525 4170 9724.5614 4760.3308 6540.3675 7709.8688 163272.4982 14250.2363 14484.8922 163272.4982
9848 9862 2540.1626 2607.4815 1801.0932 2050.2716 53953.3427 3851.3648 5147.6441 53953.3427
829 1744 696.2743 356.3576 517.5223 1343.6358 14487.8532 1861.1581 1052.6319 14487.8532
366 1327 297.4898 172.4379 240.7452 297.0120 3290.7352 537.7572 469.9277 3290.7352

Now we will aggregate up the low level data to cell and day_time level - summing all metrics as we go...

In [11]:
# sum up to day_time, cell level
df = df.groupby(['day_time','cell_id'], as_index=False).sum()

print(df.shape)
df.sample(5)
(60000, 7)
Out[11]:
day_time cell_id smsin smsout callin callout internet
9238 weekday_day 9312 775.4674 385.1564 574.5711 677.0885 8332.2649
23462 weekday_night 4113 31.2968 43.0553 11.4247 264.3781 883.5716
21657 weekday_night 249 45.9027 27.1676 20.9051 17.5383 6392.2075
33903 weekend_day 4510 71.5260 43.4356 52.6224 50.7444 2198.6433
52474 weekend_night 3224 9.5462 22.1460 4.2935 6.5087 407.9685

Now we will create the df_pct dataframe, where the metrics have all been transformed into percentages of that cell's total traffic for each metric.

In [12]:
# merge in cell totals
df_pct = pd.merge(df,df_cell_totals[['cell_id','internettotal','calltotal','smstotal']],how='left', on='cell_id')

# fill all nulls with 0
df_pct.fillna(value=0, inplace=True)

# convert measures to percentages of totals for that cell for the sample
df_pct['smsin'] = df_pct['smsin']/df_pct['smstotal']
df_pct['smsout'] = df_pct['smsout']/df_pct['smstotal']
df_pct['callin'] = df_pct['callin']/df_pct['calltotal']
df_pct['callout'] = df_pct['callout']/df_pct['calltotal']
df_pct['internet'] = df_pct['internet']/df_pct['internettotal']

# fill all nulls with 0
df_pct.fillna(value=0, inplace=True)

print(df_pct.shape)
df_pct.sample(5)
(60000, 10)
Out[12]:
day_time cell_id smsin smsout callin callout internet internettotal calltotal smstotal
58875 weekend_night 8987 0.006428 0.006417 0.001715 0.001678 0.047045 37468.9915 4129.9736 5220.7609
39305 weekend_day 9373 0.084907 0.034470 0.046111 0.056889 0.115054 47730.9309 7027.8047 6382.4671
5098 weekday_day 5587 0.303393 0.125747 0.215647 0.228417 0.337414 11851.0190 1583.2358 1548.9269
20938 weekday_night 1842 0.012467 0.007371 0.004577 0.009463 0.068141 60434.6257 3467.1937 2955.8705
52338 weekend_night 3101 0.013913 0.007136 0.001829 0.002053 0.044494 2161.8513 179.9286 171.9246
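
As a quick check on the transformation (a sketch, not in the original run): for each cell, the sms shares (in plus out) should sum to roughly 1 across the six day_time slices, and likewise for calls and for internet (a cell with zero total traffic of a given type will sum to 0 after the null fill):

# sanity check: per cell, the shares of each traffic type should sum to
# ~1 across the six day_time slices (or 0 where a cell had no such traffic)
sums = df_pct.groupby('cell_id')[['smsin', 'smsout', 'callin', 'callout', 'internet']].sum()
print((sums['smsin'] + sums['smsout']).describe())    # sms shares
print((sums['callin'] + sums['callout']).describe())  # call shares
print(sums['internet'].describe())                    # internet shares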
In [13]:
# summary stats
df_pct.describe(include='all')
Out[13]:
day_time cell_id smsin smsout callin callout internet internettotal calltotal smstotal
count 60000 60000 60000.000000 60000.000000 60000.000000 60000.000000 60000.000000 6.000000e+04 60000.000000 60000.000000
unique 6 10000 NaN NaN NaN NaN NaN NaN NaN NaN
top weekend_day 2794 NaN NaN NaN NaN NaN NaN NaN NaN
freq 10000 6 NaN NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN 0.106865 0.059769 0.077695 0.088972 0.166667 6.876161e+04 6482.891662 7120.683090
std NaN NaN 0.122447 0.058692 0.100155 0.111023 0.124953 1.087701e+05 10070.213236 11534.245352
min NaN NaN 0.000000 0.000000 0.000000 0.000084 0.002555 1.008427e+02 24.827400 0.000000
25% NaN NaN 0.016025 0.015161 0.005192 0.007094 0.077333 1.381663e+04 1260.042025 1326.116200
50% NaN NaN 0.055309 0.037762 0.031854 0.040174 0.131537 3.383344e+04 3128.946250 3289.565600
75% NaN NaN 0.135500 0.085894 0.092985 0.105332 0.211785 7.263600e+04 6954.698125 7500.070025
max NaN NaN 0.662643 0.462058 0.513609 0.582798 0.855056 1.514767e+06 163262.387700 188800.341400

Long to wide

As we are interested in clustering cell towers, we need to transform our data to one row per cell. This is typically called moving from a long format to a wide format: we pivot the data so that each (metric, day_time) combination becomes its own column.

So really we want to make our data wider, with a group of metrics for each day_time value, for each cell...

In [14]:
# now pivot to go wide with metrics for each day_time and one row per cell
df_wide = df_pct[['cell_id',
                  'day_time',
                  'smsin',
                  'smsout',
                  'callin',
                  'callout',
                  'internet']].pivot(index='cell_id', columns='day_time')

# fill in nulls
df_wide.fillna(0, inplace=True)

print(df_wide.shape)
df_wide.sample(5)
(10000, 30)
Out[14]:
smsin smsout ... callout internet
day_time weekday_day weekday_evening weekday_night weekend_day weekend_evening weekend_night weekday_day weekday_evening weekday_night weekend_day ... weekday_night weekend_day weekend_evening weekend_night weekday_day weekday_evening weekday_night weekend_day weekend_evening weekend_night
cell_id
2410 0.309592 0.117197 0.017140 0.085646 0.037540 0.005460 0.198182 0.073157 0.047916 0.071133 ... 0.006716 0.075427 0.026652 0.001764 0.369554 0.214750 0.142597 0.136018 0.083075 0.054005
878 0.350139 0.119102 0.017611 0.074815 0.037311 0.008588 0.243746 0.069434 0.010735 0.040039 ... 0.007481 0.055769 0.023323 0.002356 0.476836 0.181482 0.088657 0.135821 0.083040 0.034163
8746 0.139565 0.062666 0.011680 0.045858 0.023352 0.004965 0.238558 0.066679 0.162412 0.121567 ... 0.010888 0.075594 0.034554 0.002853 0.356139 0.218175 0.151265 0.137103 0.084742 0.052576
9934 0.390052 0.138023 0.015407 0.091390 0.040478 0.004379 0.167363 0.079890 0.012896 0.033843 ... 0.005853 0.062999 0.021757 0.001326 0.479580 0.236414 0.083497 0.111424 0.058273 0.030812
7233 0.302893 0.132436 0.012866 0.118957 0.049589 0.007197 0.143446 0.097011 0.022801 0.059346 ... 0.015574 0.093064 0.026424 0.005729 0.339153 0.201182 0.117996 0.200385 0.083268 0.058015

5 rows × 30 columns

...We now have a hierarchical column index after our pivot - to make things easier we will flatten this and rename the columns to be more meaningful...

In [15]:
# flatten the hierarchical column index into single meaningful names,
# e.g. ('smsin', 'weekday_day') -> 'smsin_weekday_day'
# (joining the actual column tuples avoids relying on the ordering of
# MultiIndex.levels, which is sorted and need not match the column order)
df_wide.columns = ['_'.join(col) for col in df_wide.columns]

print(df_wide.shape)
df_wide.sample(5)
(10000, 30)
Out[15]:
smsin_weekday_day smsin_weekday_evening smsin_weekday_night smsin_weekend_day smsin_weekend_evening smsin_weekend_night smsout_weekday_day smsout_weekday_evening smsout_weekday_night smsout_weekend_day ... callout_weekday_night callout_weekend_day callout_weekend_evening callout_weekend_night internet_weekday_day internet_weekday_evening internet_weekday_night internet_weekend_day internet_weekend_evening internet_weekend_night
cell_id
3861 0.323158 0.161531 0.019782 0.085641 0.047232 0.006608 0.144269 0.100916 0.025348 0.042191 ... 0.008592 0.068668 0.030928 0.003110 0.359679 0.217869 0.160439 0.123133 0.084068 0.054812
1215 0.358138 0.122989 0.012736 0.081982 0.038316 0.005742 0.180258 0.092714 0.020538 0.049605 ... 0.006924 0.067318 0.024325 0.002293 0.407380 0.204986 0.132531 0.133016 0.074407 0.047680
7491 0.162901 0.070198 0.005600 0.048417 0.021509 0.002425 0.240868 0.126340 0.113916 0.100942 ... 0.004349 0.074312 0.029282 0.001573 0.359620 0.213747 0.143537 0.147655 0.081934 0.053508
9545 0.344179 0.128916 0.017091 0.089986 0.038323 0.005851 0.153588 0.114789 0.019845 0.060097 ... 0.008543 0.071444 0.028715 0.002444 0.385310 0.209087 0.137680 0.134027 0.083326 0.050570
7438 0.381686 0.144020 0.023272 0.089683 0.044593 0.011021 0.151888 0.076634 0.014309 0.035380 ... 0.011354 0.059357 0.028383 0.003307 0.419652 0.187295 0.126300 0.146349 0.082185 0.038219

5 rows × 30 columns

...note: we now have 10,000 rows of data as hoped - one per cell...

In [16]:
df_wide.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 1 to 9999
Data columns (total 30 columns):
smsin_weekday_day           10000 non-null float64
smsin_weekday_evening       10000 non-null float64
smsin_weekday_night         10000 non-null float64
smsin_weekend_day           10000 non-null float64
smsin_weekend_evening       10000 non-null float64
smsin_weekend_night         10000 non-null float64
smsout_weekday_day          10000 non-null float64
smsout_weekday_evening      10000 non-null float64
smsout_weekday_night        10000 non-null float64
smsout_weekend_day          10000 non-null float64
smsout_weekend_evening      10000 non-null float64
smsout_weekend_night        10000 non-null float64
callin_weekday_day          10000 non-null float64
callin_weekday_evening      10000 non-null float64
callin_weekday_night        10000 non-null float64
callin_weekend_day          10000 non-null float64
callin_weekend_evening      10000 non-null float64
callin_weekend_night        10000 non-null float64
callout_weekday_day         10000 non-null float64
callout_weekday_evening     10000 non-null float64
callout_weekday_night       10000 non-null float64
callout_weekend_day         10000 non-null float64
callout_weekend_evening     10000 non-null float64
callout_weekend_night       10000 non-null float64
internet_weekday_day        10000 non-null float64
internet_weekday_evening    10000 non-null float64
internet_weekday_night      10000 non-null float64
internet_weekend_day        10000 non-null float64
internet_weekend_evening    10000 non-null float64
internet_weekend_night      10000 non-null float64
dtypes: float64(30)
memory usage: 2.4+ MB
In [17]:
df_wide.describe(include='all')
Out[17]:
smsin_weekday_day smsin_weekday_evening smsin_weekday_night smsin_weekend_day smsin_weekend_evening smsin_weekend_night smsout_weekday_day smsout_weekday_evening smsout_weekday_night smsout_weekend_day ... callout_weekday_night callout_weekend_day callout_weekend_evening callout_weekend_night internet_weekday_day internet_weekday_evening internet_weekday_night internet_weekend_day internet_weekend_evening internet_weekend_night
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 ... 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 0.355898 0.131249 0.017283 0.087370 0.042485 0.006902 0.169967 0.083553 0.020242 0.047308 ... 0.009220 0.067528 0.028419 0.003332 0.409744 0.207268 0.128618 0.129385 0.078477 0.046508
std 0.059264 0.022366 0.010072 0.021773 0.014163 0.004027 0.038110 0.019684 0.018230 0.015025 ... 0.013461 0.016832 0.009288 0.005396 0.073839 0.024509 0.026315 0.026748 0.017782 0.010604
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000335 0.014076 0.002534 0.000084 0.211661 0.073861 0.007569 0.024169 0.006822 0.002555
25% 0.320724 0.119579 0.013130 0.075288 0.035945 0.005011 0.145874 0.071985 0.010823 0.038352 ... 0.005717 0.058084 0.023574 0.001866 0.356893 0.192468 0.115723 0.115746 0.069779 0.041097
50% 0.351739 0.134834 0.015978 0.089621 0.042733 0.006305 0.160757 0.083750 0.014555 0.046639 ... 0.006927 0.068447 0.028428 0.002363 0.393943 0.211158 0.131891 0.130987 0.079651 0.047597
75% 0.387713 0.146895 0.019392 0.100449 0.048789 0.007742 0.183156 0.094318 0.022045 0.053425 ... 0.008812 0.077330 0.032463 0.003217 0.446194 0.223858 0.145832 0.142594 0.087131 0.053011
max 0.662643 0.188492 0.273735 0.254822 0.341694 0.065288 0.462058 0.234634 0.200371 0.209437 ... 0.200000 0.190371 0.200566 0.082486 0.855056 0.291208 0.362664 0.380707 0.245081 0.128641

8 rows × 30 columns

Do Clustering

Now we are ready to perform our clustering and explore the results...

In [18]:
# make a new df to use from here on
df_final = df_wide.copy()

print(df_final.shape)
df_final.sample(5)
(10000, 30)
Out[18]:
smsin_weekday_day smsin_weekday_evening smsin_weekday_night smsin_weekend_day smsin_weekend_evening smsin_weekend_night smsout_weekday_day smsout_weekday_evening smsout_weekday_night smsout_weekend_day ... callout_weekday_night callout_weekend_day callout_weekend_evening callout_weekend_night internet_weekday_day internet_weekday_evening internet_weekday_night internet_weekend_day internet_weekend_evening internet_weekend_night
cell_id
3905 0.366254 0.129563 0.010662 0.105149 0.038177 0.005043 0.160309 0.088219 0.018058 0.049914 ... 0.004345 0.071438 0.022981 0.001232 0.366007 0.222772 0.138736 0.138797 0.087777 0.045911
3958 0.323442 0.152488 0.021054 0.098130 0.048824 0.006912 0.148161 0.094865 0.015185 0.055090 ... 0.010205 0.079407 0.034062 0.004752 0.338538 0.239963 0.165639 0.128009 0.074411 0.053439
3387 0.401024 0.122640 0.010230 0.056927 0.035951 0.004609 0.199441 0.085905 0.016206 0.032849 ... 0.004684 0.050211 0.020204 0.002003 0.508480 0.184865 0.112634 0.095007 0.061659 0.037355
9148 0.336236 0.140544 0.015380 0.110796 0.043323 0.007099 0.136007 0.100445 0.017050 0.061658 ... 0.006696 0.078241 0.030014 0.002787 0.356985 0.218367 0.143340 0.142349 0.083326 0.055634
9456 0.339872 0.136966 0.016477 0.100763 0.044910 0.007594 0.162871 0.086293 0.012583 0.054934 ... 0.005974 0.074639 0.031968 0.002579 0.380629 0.217142 0.129843 0.140365 0.080708 0.051314

5 rows × 30 columns

Here we will start simple with a k-means approach. We have iterated on a few different choices of k and settled on a final value (one way to run that comparison is sketched further below).

We could use something a bit fancier like HDBSCAN, which can fit more flexibly shaped clusters and takes a more data-driven approach to the number of clusters. It could also be useful for picking out outlier cells - those for which no cluster was found to be close enough.

But to keep it simple for now we will stick with k-means...
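
As a rough illustration of how that comparison over k could be run, here is a minimal sketch (the range of k and the random_state are illustrative assumptions, not the settings actually used) plotting inertia and looking for an "elbow" where the gains flatten:

# compare candidate k values via inertia (within-cluster sum of squares);
# an "elbow" where the curve flattens suggests a reasonable k
inertias = []
k_values = list(range(2, 11))
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42).fit(df_final)
    inertias.append(km.inertia_)
plt.plot(k_values, inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('inertia')
plt.show()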

In [19]:
# define clustering approach
#clusterer = hdbscan.HDBSCAN()
clusterer = KMeans(n_clusters=5)
In [20]:
# do the clustering
clusterer.fit(df_final)
print(clusterer.labels_.max())
4
In [21]:
# add the cluster labels back onto the input data
df_final['cluster'] = clusterer.labels_
#df_final['cluster_prob'] = clusterer.probabilities_ # only relevant for HDBSCAN
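
For reference, had we gone with HDBSCAN the equivalent steps might look something like this (a sketch only - min_cluster_size is an illustrative guess, not a tuned value):

# HDBSCAN alternative: no k to choose up front, and points that fit no
# cluster get the label -1, which is handy for spotting outlier cells
hdb = hdbscan.HDBSCAN(min_cluster_size=50)
hdb.fit(df_wide)  # fit on the unlabelled feature matrix
print('clusters found:', hdb.labels_.max() + 1)
print('cells labelled as noise:', (hdb.labels_ == -1).sum())
# hdb.probabilities_ holds the soft cluster membership strengths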
In [22]:
print(df_final.shape)
df_final.sample(5)
(10000, 31)
Out[22]:
smsin_weekday_day smsin_weekday_evening smsin_weekday_night smsin_weekend_day smsin_weekend_evening smsin_weekend_night smsout_weekday_day smsout_weekday_evening smsout_weekday_night smsout_weekend_day ... callout_weekend_day callout_weekend_evening callout_weekend_night internet_weekday_day internet_weekday_evening internet_weekday_night internet_weekend_day internet_weekend_evening internet_weekend_night cluster
cell_id
2396 0.354738 0.158180 0.018989 0.100286 0.042696 0.008849 0.135989 0.084658 0.011113 0.050326 ... 0.081103 0.029726 0.001552 0.381625 0.210351 0.146584 0.134017 0.074631 0.052792 0
980 0.479957 0.127667 0.017448 0.068282 0.031964 0.005876 0.159496 0.050880 0.008267 0.026772 ... 0.039487 0.019832 0.002185 0.498139 0.183656 0.090403 0.120937 0.076971 0.029894 3
5076 0.382451 0.141270 0.017527 0.079588 0.047267 0.007341 0.156078 0.082137 0.018185 0.037249 ... 0.095843 0.051014 0.003548 0.395459 0.211378 0.156609 0.117818 0.065280 0.053455 2
9301 0.285212 0.140814 0.020248 0.091014 0.043654 0.007989 0.153068 0.109056 0.039449 0.053574 ... 0.078512 0.031500 0.002147 0.327174 0.224378 0.143027 0.145120 0.096457 0.063843 0
1494 0.315375 0.153181 0.019710 0.103524 0.052917 0.007347 0.144202 0.091270 0.020587 0.052061 ... 0.079879 0.033808 0.002200 0.318878 0.235176 0.161452 0.132248 0.092549 0.059698 0

5 rows × 31 columns

Let's look at the number of cells in each cluster to get a feel for how the cells have been distributed across the clusters...

In [23]:
# look at counts and % in each cluster
cluster_counts = df_final['cluster'].value_counts()
cluster_percents = cluster_counts / len(df_final['cluster'])
print(cluster_counts)
print(cluster_percents)
2    3675
0    3431
3    1797
4     603
1     494
Name: cluster, dtype: int64
2    0.3675
0    0.3431
3    0.1797
4    0.0603
1    0.0494
Name: cluster, dtype: float64

Explore Clustering

Now we will do some plots to help us explore and understand each cluster as well as get some feel for what the main differences between the clusters are...

In [24]:
# for each input metric, plot a boxplot and a violin plot to get a feel
# for how the distributions differ by cluster
for var in df_final.columns[0:30]:
    fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12,6))
    ax1.set_title(var)
    ax2.set_title(var)
    sns.boxplot(x="cluster", y=var, data=df_final, ax=ax1)
    sns.violinplot(x="cluster", y=var, data=df_final, ax=ax2)
    plt.show()
    plt.close(fig)  # close each figure so we don't hold 30 open at once
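
To complement the per-metric plots, one compact summary (a sketch) is each cluster's mean feature profile shown as a heatmap, so the main differences between the clusters stand out at a glance:

# summarize each cluster by its mean profile across all 30 features
cluster_profiles = df_final.groupby('cluster').mean()
plt.figure(figsize=(16, 4))
sns.heatmap(cluster_profiles, cmap='viridis',
            cbar_kws={'label': 'mean share of cell traffic'})
plt.show()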