Binning Data with Pandas cut and qcut

This notebook accompanies the article posted on pbpython.com

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
In [2]:
sns.set_style('whitegrid')
In [3]:
raw_df = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=true')
In [4]:
df = raw_df.groupby(['account number', 'name'])['ext price'].sum().reset_index()
In [5]:
df.head()
Out[5]:
account number name ext price
0 141962 Herman LLC 63626.03
1 146832 Kiehn-Spinka 99608.77
2 163416 Purdy-Kunde 77898.21
3 218895 Kulas Inc 137351.96
4 239344 Stokes LLC 91535.92

A histogram is an example of binning data and showing the visual representation of the data distribution

Introducing qcut

In [6]:
df['ext price'].plot(kind='hist')
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f61c1ebae80>

Describe shows how data can be cut by percentiles

In [7]:
df['ext price'].describe()
Out[7]:
count        20.000000
mean     101711.287500
std       27037.449673
min       55733.050000
25%       89137.707500
50%      100271.535000
75%      110132.552500
max      184793.700000
Name: ext price, dtype: float64

Here is an example of using qcut

In [8]:
pd.qcut(df['ext price'], q=4)
Out[8]:
0     (55733.049000000006, 89137.708]
1             (89137.708, 100271.535]
2     (55733.049000000006, 89137.708]
3              (110132.552, 184793.7]
4             (89137.708, 100271.535]
5             (89137.708, 100271.535]
6     (55733.049000000006, 89137.708]
7            (100271.535, 110132.552]
8              (110132.552, 184793.7]
9              (110132.552, 184793.7]
10            (89137.708, 100271.535]
11    (55733.049000000006, 89137.708]
12    (55733.049000000006, 89137.708]
13            (89137.708, 100271.535]
14           (100271.535, 110132.552]
15             (110132.552, 184793.7]
16           (100271.535, 110132.552]
17             (110132.552, 184793.7]
18           (100271.535, 110132.552]
19           (100271.535, 110132.552]
Name: ext price, dtype: category
Categories (4, interval[float64]): [(55733.049000000006, 89137.708] < (89137.708, 100271.535] < (100271.535, 110132.552] < (110132.552, 184793.7]]

Assign the results of the values back to the original dataframe

In [9]:
df['quantile_ex_1'] = pd.qcut(df['ext price'], q=4)
df['quantile_ex_2'] = pd.qcut(df['ext price'], q=10, precision=0)
In [10]:
df.head()
Out[10]:
account number name ext price quantile_ex_1 quantile_ex_2
0 141962 Herman LLC 63626.03 (55733.049000000006, 89137.708] (55732.0, 76471.0]
1 146832 Kiehn-Spinka 99608.77 (89137.708, 100271.535] (95908.0, 100272.0]
2 163416 Purdy-Kunde 77898.21 (55733.049000000006, 89137.708] (76471.0, 87168.0]
3 218895 Kulas Inc 137351.96 (110132.552, 184793.7] (124778.0, 184794.0]
4 239344 Stokes LLC 91535.92 (89137.708, 100271.535] (90686.0, 95908.0]

Look at the distribution

In [11]:
df['quantile_ex_1'].value_counts()
Out[11]:
(110132.552, 184793.7]             5
(100271.535, 110132.552]           5
(89137.708, 100271.535]            5
(55733.049000000006, 89137.708]    5
Name: quantile_ex_1, dtype: int64
In [12]:
df['quantile_ex_2'].value_counts()
Out[12]:
(124778.0, 184794.0]    2
(112290.0, 124778.0]    2
(105938.0, 112290.0]    2
(103606.0, 105938.0]    2
(100272.0, 103606.0]    2
(95908.0, 100272.0]     2
(90686.0, 95908.0]      2
(87168.0, 90686.0]      2
(76471.0, 87168.0]      2
(55732.0, 76471.0]      2
Name: quantile_ex_2, dtype: int64
In [13]:
bin_labels_5 = ['Bronze', 'Silver', 'Gold', 'Platinum', 'Diamond']
df['quantile_ex_3'] = pd.qcut(df['ext price'],
                              q=[0, .2, .4, .6, .8, 1],
                              labels=bin_labels_5)
df.head()
Out[13]:
account number name ext price quantile_ex_1 quantile_ex_2 quantile_ex_3
0 141962 Herman LLC 63626.03 (55733.049000000006, 89137.708] (55732.0, 76471.0] Bronze
1 146832 Kiehn-Spinka 99608.77 (89137.708, 100271.535] (95908.0, 100272.0] Gold
2 163416 Purdy-Kunde 77898.21 (55733.049000000006, 89137.708] (76471.0, 87168.0] Bronze
3 218895 Kulas Inc 137351.96 (110132.552, 184793.7] (124778.0, 184794.0] Diamond
4 239344 Stokes LLC 91535.92 (89137.708, 100271.535] (90686.0, 95908.0] Silver
In [14]:
df['quantile_ex_3'].value_counts()
Out[14]:
Diamond     4
Platinum    4
Gold        4
Silver      4
Bronze      4
Name: quantile_ex_3, dtype: int64
In [15]:
results, bin_edges = pd.qcut(df['ext price'],
                             q=[0, .2, .4, .6, .8, 1],
                             labels=bin_labels_5,
                             retbins=True)

results_table = pd.DataFrame(zip(bin_edges, bin_labels_5),
                             columns=['Threshold', 'Tier'])
In [16]:
results_table
Out[16]:
Threshold Tier
0 55733.050 Bronze
1 87167.958 Silver
2 95908.156 Gold
3 103605.970 Platinum
4 112290.054 Diamond
In [17]:
df.describe(include='category')
Out[17]:
quantile_ex_1 quantile_ex_2 quantile_ex_3
count 20 20 20
unique 4 10 5
top (110132.552, 184793.7] (124778.0, 184794.0] Diamond
freq 5 2 4

We can pass the percentiles to use to describe

In [18]:
df.describe(percentiles=[0, 1/3, 2/3, 1])
Out[18]:
account number ext price
count 20.000000 20.000000
mean 476998.750000 101711.287500
std 231499.208970 27037.449673
min 141962.000000 55733.050000
0% 141962.000000 55733.050000
33.3% 332759.333333 91241.493333
50% 476006.500000 100271.535000
66.7% 662511.000000 104178.580000
100% 786968.000000 184793.700000
max 786968.000000 184793.700000

labels=False will return integers for each bin

In [19]:
df['quantile_ex_4'] = pd.qcut(df['ext price'],
                            q=[0, .2, .4, .6, .8, 1],
                            labels=False)
df.head()
Out[19]:
account number name ext price quantile_ex_1 quantile_ex_2 quantile_ex_3 quantile_ex_4
0 141962 Herman LLC 63626.03 (55733.049000000006, 89137.708] (55732.0, 76471.0] Bronze 0
1 146832 Kiehn-Spinka 99608.77 (89137.708, 100271.535] (95908.0, 100272.0] Gold 2
2 163416 Purdy-Kunde 77898.21 (55733.049000000006, 89137.708] (76471.0, 87168.0] Bronze 0
3 218895 Kulas Inc 137351.96 (110132.552, 184793.7] (124778.0, 184794.0] Diamond 4
4 239344 Stokes LLC 91535.92 (89137.708, 100271.535] (90686.0, 95908.0] Silver 1

cut

Remove the added columns to make the examples shorter

In [20]:
df = df.drop(columns = ['quantile_ex_1','quantile_ex_2', 'quantile_ex_3', 'quantile_ex_4'])
In [21]:
pd.cut(df['ext price'], bins=4)
Out[21]:
0       (55603.989, 87998.212]
1      (87998.212, 120263.375]
2       (55603.989, 87998.212]
3     (120263.375, 152528.538]
4      (87998.212, 120263.375]
5      (87998.212, 120263.375]
6       (55603.989, 87998.212]
7      (87998.212, 120263.375]
8      (87998.212, 120263.375]
9       (152528.538, 184793.7]
10     (87998.212, 120263.375]
11      (55603.989, 87998.212]
12      (55603.989, 87998.212]
13     (87998.212, 120263.375]
14     (87998.212, 120263.375]
15    (120263.375, 152528.538]
16     (87998.212, 120263.375]
17     (87998.212, 120263.375]
18     (87998.212, 120263.375]
19     (87998.212, 120263.375]
Name: ext price, dtype: category
Categories (4, interval[float64]): [(55603.989, 87998.212] < (87998.212, 120263.375] < (120263.375, 152528.538] < (152528.538, 184793.7]]
In [22]:
pd.cut(df['ext price'], bins=4).value_counts()
Out[22]:
(87998.212, 120263.375]     12
(55603.989, 87998.212]       5
(120263.375, 152528.538]     2
(152528.538, 184793.7]       1
Name: ext price, dtype: int64
In [23]:
cut_labels_4 = ['silver', 'gold', 'platinum', 'diamond']
cut_bins = [0, 70000, 100000, 130000, 200000]
df['cut_ex1'] = pd.cut(df['ext price'], bins=cut_bins, labels=cut_labels_4)
In [24]:
df.head()
Out[24]:
account number name ext price cut_ex1
0 141962 Herman LLC 63626.03 silver
1 146832 Kiehn-Spinka 99608.77 gold
2 163416 Purdy-Kunde 77898.21 gold
3 218895 Kulas Inc 137351.96 diamond
4 239344 Stokes LLC 91535.92 gold

We can use nump.linspace to define the ranges

In [25]:
np.linspace(0, 200000, 9)
Out[25]:
array([     0.,  25000.,  50000.,  75000., 100000., 125000., 150000.,
       175000., 200000.])
In [26]:
pd.cut(df['ext price'], bins=np.linspace(0, 200000, 9))
Out[26]:
0       (50000.0, 75000.0]
1      (75000.0, 100000.0]
2      (75000.0, 100000.0]
3     (125000.0, 150000.0]
4      (75000.0, 100000.0]
5      (75000.0, 100000.0]
6      (75000.0, 100000.0]
7     (100000.0, 125000.0]
8     (100000.0, 125000.0]
9     (175000.0, 200000.0]
10     (75000.0, 100000.0]
11      (50000.0, 75000.0]
12     (75000.0, 100000.0]
13     (75000.0, 100000.0]
14    (100000.0, 125000.0]
15    (100000.0, 125000.0]
16    (100000.0, 125000.0]
17    (100000.0, 125000.0]
18    (100000.0, 125000.0]
19    (100000.0, 125000.0]
Name: ext price, dtype: category
Categories (8, interval[float64]): [(0.0, 25000.0] < (25000.0, 50000.0] < (50000.0, 75000.0] < (75000.0, 100000.0] < (100000.0, 125000.0] < (125000.0, 150000.0] < (150000.0, 175000.0] < (175000.0, 200000.0]]

numpy arange is another option

In [27]:
np.arange(0, 200000, 10000)
Out[27]:
array([     0,  10000,  20000,  30000,  40000,  50000,  60000,  70000,
        80000,  90000, 100000, 110000, 120000, 130000, 140000, 150000,
       160000, 170000, 180000, 190000])
In [28]:
pd.interval_range(start=0, freq=10000, end=200000, closed='left')
Out[28]:
IntervalIndex([[0, 10000), [10000, 20000), [20000, 30000), [30000, 40000), [40000, 50000) ... [150000, 160000), [160000, 170000), [170000, 180000), [180000, 190000), [190000, 200000)],
              closed='left',
              dtype='interval[int64]')
In [29]:
interval_range = pd.interval_range(start=0, freq=10000, end=200000)
df['cut_ex2'] = pd.cut(df['ext price'], bins=interval_range, labels=[1,2,3])
df.head()
Out[29]:
account number name ext price cut_ex1 cut_ex2
0 141962 Herman LLC 63626.03 silver (60000, 70000]
1 146832 Kiehn-Spinka 99608.77 gold (90000, 100000]
2 163416 Purdy-Kunde 77898.21 gold (70000, 80000]
3 218895 Kulas Inc 137351.96 diamond (130000, 140000]
4 239344 Stokes LLC 91535.92 gold (90000, 100000]