## Finding Natural Breaks in Data with the Fisher-Jenks Algorithm

Notebook accompanying the article on PB Python.

In [1]:
import pandas as pd
import jenkspy


Create an example DataFrame of account sales totals

In [2]:
sales = {
    'account': [
        'Jones Inc', 'Alpha Co', 'Blue Inc', 'Super Star Inc', 'Wamo',
        'Next Gen', 'Giga Co', 'IniTech', 'Beta LLC'
    ],
    'Total': [1500, 2100, 50, 20, 75, 1100, 950, 1300, 1400]
}
df = pd.DataFrame(sales)

In [3]:
df.sort_values(by='Total')

Out[3]:
account Total
3 Super Star Inc 20
2 Blue Inc 50
4 Wamo 75
6 Giga Co 950
5 Next Gen 1100
7 IniTech 1300
8 Beta LLC 1400
0 Jones Inc 1500
1 Alpha Co 2100

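Try cutting the data into two buckets using qcut, which places the boundary at a quantile (here the median) so each bucket holds roughly the same number of rows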

In [4]:
df['quantile'] = pd.qcut(df['Total'], q=2, labels=['bucket_1', 'bucket_2'])

In [5]:
df.sort_values(by='Total')

Out[5]:
account Total quantile
3 Super Star Inc 20 bucket_1
2 Blue Inc 50 bucket_1
4 Wamo 75 bucket_1
6 Giga Co 950 bucket_1
5 Next Gen 1100 bucket_1
7 IniTech 1300 bucket_2
8 Beta LLC 1400 bucket_2
0 Jones Inc 1500 bucket_2
1 Alpha Co 2100 bucket_2
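
To see where qcut placed the boundary, the bin edges can be requested with retbins=True. This is a minimal sketch that rebuilds only the Total column from the example above; the names totals, buckets and edges are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same Total values as the example dataframe above
totals = pd.Series([1500, 2100, 50, 20, 75, 1100, 950, 1300, 1400])

# retbins=True returns the bin edges alongside the bucket assignments
buckets, edges = pd.qcut(
    totals, q=2, labels=['bucket_1', 'bucket_2'], retbins=True)

print(edges)                   # the middle edge is the median of Total
print(buckets.value_counts())  # bucket sizes are as equal as the data allows
```

The middle edge falls at 1100, which is why Giga Co (950) and Next Gen (1100) share a bucket with the three very small accounts.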

Compare with cut, which splits the range of values into equal-width bins regardless of how many rows land in each

In [6]:
df['cut_bins'] = pd.cut(df['Total'],
                        bins=2,
                        labels=['bucket_1', 'bucket_2'])

In [7]:
df.sort_values(by='Total')

Out[7]:
account Total quantile cut_bins
3 Super Star Inc 20 bucket_1 bucket_1
2 Blue Inc 50 bucket_1 bucket_1
4 Wamo 75 bucket_1 bucket_1
6 Giga Co 950 bucket_1 bucket_1
5 Next Gen 1100 bucket_1 bucket_2
7 IniTech 1300 bucket_2 bucket_2
8 Beta LLC 1400 bucket_2 bucket_2
0 Jones Inc 1500 bucket_2 bucket_2
1 Alpha Co 2100 bucket_2 bucket_2
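
The equal-width behaviour of cut can be checked the same way. The sketch below again uses only the Total values from the example (variable names are illustrative) and compares the interior edge returned by cut with the midpoint of the range.

```python
import pandas as pd

# Same Total values as the example dataframe above
totals = pd.Series([1500, 2100, 50, 20, 75, 1100, 950, 1300, 1400])

# bins=2 splits the range from min to max into two intervals of equal width
buckets, edges = pd.cut(
    totals, bins=2, labels=['bucket_1', 'bucket_2'], retbins=True)

midpoint = (totals.min() + totals.max()) / 2
print(midpoint)  # halfway between 20 and 2100
print(edges)     # the interior edge sits at that midpoint
```

pandas nudges the lowest edge slightly below the minimum so that the smallest value is included, which is why the first edge printed is a little under 20.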

Show how jenkspy works: jenks_breaks returns the bin edges, starting with the minimum value and ending with the maximum

In [8]:
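# nb_class is the number of groups to find
# (newer jenkspy releases name this parameter n_classes)
breaks = jenkspy.jenks_breaks(df['Total'], nb_class=2)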
print(breaks)

[20.0, 75.0, 2100.0]
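
To build some intuition for why the break lands at 75, here is a brute-force sketch of the idea behind natural breaks for the two-class case: try every split of the sorted values and keep the one with the smallest total of squared deviations from each group's mean. This is an illustration only, not how jenkspy is implemented, and the helper names sum_sq_dev and best_two_class_split are hypothetical.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the mean of one group."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_two_class_split(values):
    """Try every split of the sorted values and keep the one with the
    smallest total within-group sum of squared deviations."""
    ordered = sorted(values)
    best = None
    for i in range(1, len(ordered)):
        lower, upper = ordered[:i], ordered[i:]
        cost = sum_sq_dev(lower) + sum_sq_dev(upper)
        if best is None or cost < best[0]:
            best = (cost, lower, upper)
    return best


# Same Total values as the example dataframe above
totals = [1500, 2100, 50, 20, 75, 1100, 950, 1300, 1400]
cost, lower, upper = best_two_class_split(totals)
print(lower)  # [20, 50, 75]
print(upper)  # [950, 1100, 1300, 1400, 1500, 2100]
```

The largest value in the lower group, 75, matches the interior break printed above. Checking every split only works for small inputs and two classes; for more classes the number of combinations grows quickly, which is where a dedicated implementation helps.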

In [9]:
df['cut_jenks'] = pd.cut(df['Total'],
                         bins=breaks,
                         labels=['bucket_1', 'bucket_2'])
df.sort_values(by='Total')

Out[9]:
account Total quantile cut_bins cut_jenks
3 Super Star Inc 20 bucket_1 bucket_1 NaN
2 Blue Inc 50 bucket_1 bucket_1 bucket_1
4 Wamo 75 bucket_1 bucket_1 bucket_1
6 Giga Co 950 bucket_1 bucket_1 bucket_2
5 Next Gen 1100 bucket_1 bucket_2 bucket_2
7 IniTech 1300 bucket_2 bucket_2 bucket_2
8 Beta LLC 1400 bucket_2 bucket_2 bucket_2
0 Jones Inc 1500 bucket_2 bucket_2 bucket_2
1 Alpha Co 2100 bucket_2 bucket_2 bucket_2

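The smallest value (20) is NaN because cut builds bins that are open on the left by default, so a value equal to the lowest edge falls outside the first bin. Fix the NaN by using include_lowest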

In [10]:
df['cut_jenksv2'] = pd.cut(df['Total'],
                           bins=breaks,
                           labels=['bucket_1', 'bucket_2'],
                           include_lowest=True)
df.sort_values(by='Total')

Out[10]:
account Total quantile cut_bins cut_jenks cut_jenksv2
3 Super Star Inc 20 bucket_1 bucket_1 NaN bucket_1
2 Blue Inc 50 bucket_1 bucket_1 bucket_1 bucket_1
4 Wamo 75 bucket_1 bucket_1 bucket_1 bucket_1
6 Giga Co 950 bucket_1 bucket_1 bucket_2 bucket_2
5 Next Gen 1100 bucket_1 bucket_2 bucket_2 bucket_2
7 IniTech 1300 bucket_2 bucket_2 bucket_2 bucket_2
8 Beta LLC 1400 bucket_2 bucket_2 bucket_2 bucket_2
0 Jones Inc 1500 bucket_2 bucket_2 bucket_2 bucket_2
1 Alpha Co 2100 bucket_2 bucket_2 bucket_2 bucket_2

Try the same comparison with four buckets

In [11]:
df['quantilev2'] = pd.qcut(
    df['Total'], q=4, labels=['bucket_1', 'bucket_2', 'bucket_3', 'bucket_4'])

df['cut_jenksv3'] = pd.cut(
    df['Total'],
    bins=jenkspy.jenks_breaks(df['Total'], nb_class=4),
    labels=['bucket_1', 'bucket_2', 'bucket_3', 'bucket_4'],
    include_lowest=True)

df.sort_values(by='Total')

Out[11]:
account Total quantile cut_bins cut_jenks cut_jenksv2 quantilev2 cut_jenksv3
3 Super Star Inc 20 bucket_1 bucket_1 NaN bucket_1 bucket_1 bucket_1
2 Blue Inc 50 bucket_1 bucket_1 bucket_1 bucket_1 bucket_1 bucket_1
4 Wamo 75 bucket_1 bucket_1 bucket_1 bucket_1 bucket_1 bucket_1
6 Giga Co 950 bucket_1 bucket_1 bucket_2 bucket_2 bucket_2 bucket_2
5 Next Gen 1100 bucket_1 bucket_2 bucket_2 bucket_2 bucket_2 bucket_2
7 IniTech 1300 bucket_2 bucket_2 bucket_2 bucket_2 bucket_3 bucket_3
8 Beta LLC 1400 bucket_2 bucket_2 bucket_2 bucket_2 bucket_3 bucket_3
0 Jones Inc 1500 bucket_2 bucket_2 bucket_2 bucket_2 bucket_4 bucket_3
1 Alpha Co 2100 bucket_2 bucket_2 bucket_2 bucket_2 bucket_4 bucket_4