Lenta Uplift Modeling Dataset

Lenta is a Russian food retailer.

The Lenta dataset for uplift modeling contains data about Lenta customers' grocery shopping and related marketing campaigns.

The dataset was originally released for the BIGTARGET Hackathon by LENTA and Microsoft and is accessible through the fetch_lenta function of the sklift.datasets module.

Read more about the dataset in the API docs.

Load Lenta dataset

In [ ]:
import sys

# install uplift library scikit-uplift and other libraries 
!{sys.executable} -m pip install scikit-uplift catboost scikit-learn seaborn matplotlib pandas numpy
In [2]:
from sklift.datasets import fetch_lenta
from sklift.models import ClassTransformation
from sklift.metrics import uplift_at_k
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)

%matplotlib inline
In [3]:
# returns sklearn Bunch object
# with data, target, treatment keys
# data features (pd.DataFrame), target (pd.Series), treatment (pd.Series) values 
dataset = fetch_lenta()

print(f"Dataset type: {type(dataset)}\n")
print(f"Dataset features shape: {dataset.data.shape}")
print(f"Dataset target shape: {dataset.target.shape}")
print(f"Dataset treatment shape: {dataset.treatment.shape}")

dataset.keys()
Dataset type: <class 'sklearn.utils.Bunch'>

Dataset features shape: (687029, 193)
Dataset target shape: (687029,)
Dataset treatment shape: (687029,)
Out[3]:
dict_keys(['data', 'target', 'treatment', 'DESCR', 'feature_names', 'target_name', 'treatment_name'])

Dataset is a dictionary-like object with the following attributes:

  • data (DataFrame object): Dataset without target and treatment.
  • target (Series object): Target column values.
  • treatment (Series object): Treatment column values.
  • DESCR (str): Description of the Lenta dataset.
  • feature_names (list): Names of the features.
  • target_name (str): Name of the target.
  • treatment_name (str): Name of the treatment.

Major columns:

  • treatment group (str): test/control group flag
  • target response_att (binary): target
  • data gender (str): customer gender
  • data age (float): customer age
  • data main_format (int): store type (1 - grocery store, 0 - superstore)

A detailed feature description can be found here.

We can specify the destination folder and the name of the subfolder where the dataset is stored with the data_home and dest_subdir parameters. By default the path is /.

In [4]:
# data_home, dest_subdir = "/etc", "data"
# dataset = fetch_lenta(data_home=data_home, dest_subdir=dest_subdir)

We can load and return data, target, and treatment as separate objects by setting the return_X_y_t parameter to True. By default return_X_y_t=False.

In [5]:
# data, target, treatment = fetch_lenta(return_X_y_t=True)

Target share for treatment / control

In [6]:
fig, ax = plt.subplots(1,2, sharey=True, figsize=(15,4))

treatment = dataset["treatment"]
target = dataset["target"]

sns.countplot(x=treatment, ax=ax[0])
sns.countplot(x=target, ax=ax[1])
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d634c2400>

The current sample is unbalanced in terms of both treatment and target.

In [7]:
def crosstab_plot(treatment, target):
    ct = pd.crosstab(treatment, target, normalize='index')
    
    sns.heatmap(ct, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r')
    plt.ylabel('Treatment')
    plt.xlabel('Target')
    plt.title("Treatment - Target", size = 15)
    
crosstab_plot(dataset.treatment, dataset.target)
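
With normalize='index', each row of the crosstab sums to 1, so the target = 1 column is the response rate of that group, and the difference between the two rows is a naive estimate of the average uplift. A minimal dependency-free sketch of that calculation follows; the rows list and the response_rate helper are hypothetical illustrations, not part of the dataset or of sklift.

```python
# Hypothetical toy sample of (treatment group, target) pairs, not real Lenta rows.
rows = [
    ("test", 1), ("test", 0), ("test", 1), ("test", 0),
    ("control", 0), ("control", 1), ("control", 0), ("control", 0),
]

def response_rate(rows, group):
    """Share of customers in `group` whose target equals 1."""
    targets = [y for g, y in rows if g == group]
    return sum(targets) / len(targets)

rate_test = response_rate(rows, "test")
rate_control = response_rate(rows, "control")

# Naive average uplift: difference in response rates between the groups.
naive_uplift = rate_test - rate_control
print(f"test: {rate_test:.3f}, control: {rate_control:.3f}, uplift: {naive_uplift:.3f}")
```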

Distributions of some features by treatment

In [8]:
fig, ax = plt.subplots(1,2, figsize=(15,4))

test_index = dataset.treatment[dataset.treatment == 'test'].index
control_index = dataset.treatment[dataset.treatment == 'control'].index

sns.distplot(dataset.data.loc[test_index, 'response_sms'], label='test', ax=ax[0])
sns.distplot(dataset.data.loc[control_index, 'response_sms'], label='control', ax=ax[0])
ax[0].title.set_text('Test & Control response SMS Distribution')
ax[0].legend()

sns.distplot(dataset.data.loc[test_index, 'age'], label='test', ax=ax[1])
sns.distplot(dataset.data.loc[control_index, 'age'], label='control', ax=ax[1])
ax[1].title.set_text('Test & Control age Distribution')
ax[1].legend()
Out[8]:
<matplotlib.legend.Legend at 0x7f9d399d1100>

Clients from the test group tend to respond to SMS with a slightly higher probability than clients from the control group. The age distributions of the test and control groups do not differ noticeably.

Data analysis

In [9]:
dataset.data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 687029 entries, 0 to 687028
Columns: 193 entries, age to stdev_discount_depth_1m
dtypes: float64(191), int64(1), object(1)
memory usage: 1011.6+ MB
In [10]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([dataset.data.head(), dataset.data.tail()])
Out[10]:
age cheque_count_12m_g20 cheque_count_12m_g21 cheque_count_12m_g25 cheque_count_12m_g32 cheque_count_12m_g33 cheque_count_12m_g38 cheque_count_12m_g39 cheque_count_12m_g41 cheque_count_12m_g42 cheque_count_12m_g45 cheque_count_12m_g46 cheque_count_12m_g48 cheque_count_12m_g52 cheque_count_12m_g56 cheque_count_12m_g57 cheque_count_12m_g58 cheque_count_12m_g79 cheque_count_3m_g20 cheque_count_3m_g21 cheque_count_3m_g25 cheque_count_3m_g42 cheque_count_3m_g45 cheque_count_3m_g52 cheque_count_3m_g56 cheque_count_3m_g57 cheque_count_3m_g79 cheque_count_6m_g20 cheque_count_6m_g21 cheque_count_6m_g25 cheque_count_6m_g32 cheque_count_6m_g33 cheque_count_6m_g38 cheque_count_6m_g39 cheque_count_6m_g40 cheque_count_6m_g41 cheque_count_6m_g42 cheque_count_6m_g45 cheque_count_6m_g46 cheque_count_6m_g48 cheque_count_6m_g52 cheque_count_6m_g56 cheque_count_6m_g57 cheque_count_6m_g58 cheque_count_6m_g79 children crazy_purchases_cheque_count_12m crazy_purchases_cheque_count_1m crazy_purchases_cheque_count_3m crazy_purchases_cheque_count_6m crazy_purchases_goods_count_12m crazy_purchases_goods_count_6m disc_sum_6m_g34 food_share_15d food_share_1m gender k_var_cheque_15d k_var_cheque_3m k_var_cheque_category_width_15d k_var_cheque_group_width_15d k_var_count_per_cheque_15d_g24 k_var_count_per_cheque_15d_g34 k_var_count_per_cheque_1m_g24 k_var_count_per_cheque_1m_g27 k_var_count_per_cheque_1m_g34 k_var_count_per_cheque_1m_g44 k_var_count_per_cheque_1m_g49 k_var_count_per_cheque_3m_g24 k_var_count_per_cheque_3m_g27 k_var_count_per_cheque_3m_g32 k_var_count_per_cheque_3m_g34 k_var_count_per_cheque_3m_g41 k_var_count_per_cheque_3m_g44 k_var_count_per_cheque_6m_g24 k_var_count_per_cheque_6m_g27 k_var_count_per_cheque_6m_g32 k_var_count_per_cheque_6m_g44 k_var_days_between_visits_15d k_var_days_between_visits_1m k_var_days_between_visits_3m k_var_disc_per_cheque_15d k_var_disc_share_12m_g32 k_var_disc_share_15d_g24 k_var_disc_share_15d_g34 k_var_disc_share_15d_g49 k_var_disc_share_1m_g24 
k_var_disc_share_1m_g27 k_var_disc_share_1m_g34 k_var_disc_share_1m_g40 k_var_disc_share_1m_g44 k_var_disc_share_1m_g49 k_var_disc_share_1m_g54 k_var_disc_share_3m_g24 k_var_disc_share_3m_g26 k_var_disc_share_3m_g27 k_var_disc_share_3m_g32 k_var_disc_share_3m_g33 k_var_disc_share_3m_g34 k_var_disc_share_3m_g38 k_var_disc_share_3m_g40 k_var_disc_share_3m_g41 k_var_disc_share_3m_g44 k_var_disc_share_3m_g46 k_var_disc_share_3m_g48 k_var_disc_share_3m_g49 k_var_disc_share_3m_g54 k_var_disc_share_6m_g24 k_var_disc_share_6m_g27 k_var_disc_share_6m_g32 k_var_disc_share_6m_g34 k_var_disc_share_6m_g44 k_var_disc_share_6m_g46 k_var_disc_share_6m_g49 k_var_disc_share_6m_g54 k_var_discount_depth_15d k_var_discount_depth_1m k_var_sku_per_cheque_15d k_var_sku_price_12m_g32 k_var_sku_price_15d_g34 k_var_sku_price_15d_g49 k_var_sku_price_1m_g24 k_var_sku_price_1m_g26 k_var_sku_price_1m_g27 k_var_sku_price_1m_g34 k_var_sku_price_1m_g40 k_var_sku_price_1m_g44 k_var_sku_price_1m_g49 k_var_sku_price_1m_g54 k_var_sku_price_3m_g24 k_var_sku_price_3m_g26 k_var_sku_price_3m_g27 k_var_sku_price_3m_g32 k_var_sku_price_3m_g33 k_var_sku_price_3m_g34 k_var_sku_price_3m_g40 k_var_sku_price_3m_g41 k_var_sku_price_3m_g44 k_var_sku_price_3m_g46 k_var_sku_price_3m_g48 k_var_sku_price_3m_g49 k_var_sku_price_3m_g54 k_var_sku_price_6m_g24 k_var_sku_price_6m_g26 k_var_sku_price_6m_g27 k_var_sku_price_6m_g32 k_var_sku_price_6m_g41 k_var_sku_price_6m_g42 k_var_sku_price_6m_g44 k_var_sku_price_6m_g48 k_var_sku_price_6m_g49 main_format mean_discount_depth_15d months_from_register perdelta_days_between_visits_15_30d promo_share_15d response_sms response_viber sale_count_12m_g32 sale_count_12m_g33 sale_count_12m_g49 sale_count_12m_g54 sale_count_12m_g57 sale_count_3m_g24 sale_count_3m_g33 sale_count_3m_g57 sale_count_6m_g24 sale_count_6m_g25 sale_count_6m_g32 sale_count_6m_g33 sale_count_6m_g44 sale_count_6m_g54 sale_count_6m_g57 sale_sum_12m_g24 sale_sum_12m_g25 sale_sum_12m_g26 sale_sum_12m_g27 
sale_sum_12m_g32 sale_sum_12m_g44 sale_sum_12m_g54 sale_sum_3m_g24 sale_sum_3m_g26 sale_sum_3m_g32 sale_sum_3m_g33 sale_sum_6m_g24 sale_sum_6m_g25 sale_sum_6m_g26 sale_sum_6m_g32 sale_sum_6m_g33 sale_sum_6m_g44 sale_sum_6m_g54 stdev_days_between_visits_15d stdev_discount_depth_15d stdev_discount_depth_1m
0 47.0 3.0 22.0 19.0 3.0 28.0 8.0 7.0 6.0 1.0 13.0 12.0 16.0 3.0 15.0 11.0 0.0 4.0 0.0 7.0 8.0 0.0 5.0 1.0 6.0 6.0 1.0 0.0 12.0 9.0 1.0 6.0 4.0 2.0 5.0 1.0 0.0 5.0 5.0 6.0 1.0 6.0 9.0 0.0 1.0 0.0 13.0 3.0 5.0 8.0 16.0 11.0 153.09 0.6488 0.3254 Ж 0.7288 1.8741 0.5263 0.7692 NaN NaN 0.2917 NaN 0.6682 0.5592 0.400 0.5871 0.4654 NaN 0.6055 0.0000 0.5590 0.6183 0.4845 NaN 0.5471 0.4554 0.6479 0.8240 1.4055 1.4080 NaN NaN NaN 0.5208 NaN 0.5462 NaN 0.1559 0.0449 0.0000 0.8300 0.0115 0.3846 NaN 0.7418 0.5004 1.2014 1.3485 0.0000 1.2304 0.7229 0.5943 1.5156 0.0147 0.8036 0.6366 NaN 0.7793 1.2143 1.0723 1.3947 0.0123 0.4621 0.4864 0.7067 0.0589 NaN NaN 0.5946 0.0823 NaN 0.1414 NaN 0.8669 0.3707 0.0000 0.7177 0.0866 1.3485 NaN 0.4640 0.3956 0.1930 0.0000 0.8019 0.1895 0.6128 2.1596 0.6810 0.6546 0.1300 1.2374 NaN NaN 0.0000 0.8756 0.6718 2.0876 0 0.6055 18.0 1.3393 0.5821 0.923077 0.071429 10.0 84.314 98.0 16.0 11.0 137.282 28.776 6.0 169.658 10.680 7.0 28.776 21.0 8.0 9.0 4469.86 658.85 1286.32 7736.05 418.80 3233.31 811.73 2321.61 182.82 283.84 3648.23 3141.25 356.67 237.25 283.84 3648.23 1195.37 535.42 1.7078 0.2798 0.3008
1 57.0 1.0 0.0 2.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 2.0 1.0 1.0 1.0 0.0 3.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 55.99 0.0000 1.0000 Ж 0.0000 0.9630 0.0000 0.0000 NaN NaN 0.0000 0.0 0.0000 0.0000 0.000 0.0000 0.0000 NaN 1.0102 0.0000 0.0000 NaN NaN NaN 0.0000 0.0000 0.0000 1.0027 0.0000 NaN NaN NaN NaN 0.0000 0.0 0.0000 NaN 0.0000 0.0000 0.0000 0.0000 0.1094 0.0000 NaN NaN 1.1289 0.0000 0.6188 0.0000 0.0000 0.0000 NaN 0.4981 0.6382 NaN NaN NaN 1.1289 0.0000 0.0000 0.4981 0.6382 0.0000 0.0000 0.0000 NaN NaN NaN 0.0000 0.0000 0.0 0.0000 NaN 0.0000 0.0000 0.0000 0.0000 0.2072 0.0000 NaN NaN 0.3993 0.8333 0.0000 0.0000 0.0000 NaN 0.6192 0.5405 NaN 0.2072 NaN NaN NaN 0.0000 0.0000 NaN 0.6192 1 0.0000 4.0 0.0000 0.0000 1.000000 0.000000 1.0 1.000 2.0 2.0 0.0 0.000 1.000 0.0 1.744 2.000 1.0 1.000 0.0 2.0 0.0 113.39 62.69 58.71 93.35 87.01 0.00 122.98 0.00 58.71 87.01 179.83 113.39 62.69 58.71 87.01 179.83 0.00 122.98 0.0000 0.0000 0.0000
2 38.0 7.0 0.0 15.0 4.0 9.0 5.0 9.0 14.0 7.0 6.0 10.0 14.0 5.0 11.0 0.0 3.0 2.0 2.0 0.0 3.0 2.0 1.0 1.0 0.0 0.0 2.0 6.0 0.0 9.0 2.0 5.0 1.0 7.0 7.0 8.0 3.0 2.0 6.0 6.0 3.0 4.0 0.0 0.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 290.00 0.3739 0.4768 М NaN 0.3295 NaN NaN 0.0 NaN 0.0000 0.0 0.4159 0.8485 0.000 0.0000 0.0302 0.0 0.6009 0.6205 1.0035 0.5712 0.5762 0.4714 0.9830 0.0000 NaN 0.5559 NaN 0.9780 0.0 NaN NaN 0.0000 0.0 0.0078 NaN 0.8362 1.3183 0.8560 0.0000 0.0000 0.6077 0.0 0.7665 1.2056 NaN 1.0002 0.0541 0.8461 0.5812 0.6945 1.7252 0.7579 0.6608 0.8560 0.9266 1.1554 0.7782 0.7471 2.0674 0.8871 NaN 0.1201 NaN 0.6629 NaN NaN 0.0000 0.0000 0.0 0.0666 NaN 0.4668 1.3422 0.3536 0.0000 0.0000 0.0100 0.0 0.0457 0.2615 0.5856 0.2870 0.5238 0.2017 0.2840 1.8758 0.6338 0.2654 0.0000 0.4481 0.7673 0.2393 0.2851 0.5170 0.2407 2.5227 0 0.7256 34.0 0.0000 0.7256 1.000000 0.250000 5.0 21.102 50.0 109.0 0.0 0.000 7.594 0.0 25.294 11.084 3.0 11.158 31.0 59.0 0.0 1564.91 971.09 177.93 3257.49 975.21 2555.27 6351.29 0.00 0.00 0.00 783.87 1239.19 533.46 83.37 593.13 1217.43 1336.83 3709.82 0.0000 NaN 0.0803
3 65.0 6.0 3.0 25.0 2.0 10.0 14.0 11.0 8.0 1.0 0.0 2.0 6.0 7.0 2.0 0.0 0.0 0.0 1.0 0.0 5.0 0.0 0.0 1.0 0.0 0.0 0.0 2.0 1.0 11.0 2.0 3.0 5.0 5.0 4.0 2.0 1.0 0.0 1.0 3.0 1.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 3.0 0.0 51.81 0.0000 1.0000 Ж 0.0000 1.4933 0.0000 0.0000 NaN NaN 0.0000 0.0 0.0000 0.0000 0.000 0.0000 NaN NaN 0.0000 0.0000 NaN NaN 0.3295 0.0000 0.0000 0.0000 0.0000 0.7432 0.0000 0.1315 NaN NaN NaN 0.0000 0.0 0.0000 0.0000 0.0000 0.0000 NaN 0.0000 NaN NaN NaN 0.7904 0.0050 NaN NaN 0.0000 NaN 0.0000 NaN 0.0166 0.5362 NaN 0.5780 0.1315 0.3219 1.1290 NaN 1.7975 1.2530 0.0000 0.0000 0.0000 0.2354 NaN NaN 0.0000 0.0000 0.0 0.0000 0.0000 0.0000 0.0000 NaN 0.0000 NaN NaN NaN 0.1655 0.6560 NaN 0.0000 NaN 0.0000 NaN 0.1326 0.1477 NaN 0.7469 0.2352 0.2354 0.6846 NaN 0.2671 0.1028 3.0736 1 0.0000 40.0 0.0000 0.0000 0.909091 0.000000 2.0 12.544 49.0 39.0 0.0 0.000 2.778 0.0 2.000 34.212 2.0 3.778 2.0 13.0 0.0 358.22 3798.18 680.93 1425.07 175.73 602.81 3544.76 0.00 119.99 73.24 346.74 139.68 1849.91 360.40 175.73 496.73 172.58 1246.21 0.0000 0.0000 0.0000
4 61.0 0.0 1.0 2.0 0.0 2.0 1.0 0.0 3.0 2.0 1.0 1.0 5.0 5.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 2.0 0.0 0.0 1.0 0.0 1.0 2.0 0.0 2.0 1.0 0.0 8.0 2.0 2.0 1.0 1.0 4.0 3.0 0.0 0.0 0.0 1.0 2.0 4.0 1.0 1.0 2.0 4.0 2.0 161.12 0.2882 0.2882 Ж 0.9301 0.9014 0.8165 0.7542 0.0 NaN 0.0000 0.0 NaN 0.0000 0.000 0.0000 0.6682 0.0 0.7781 0.0000 0.0000 0.4826 0.7526 0.0000 NaN 0.4714 0.4714 0.9980 1.3497 0.0000 0.0 NaN 0.0000 0.0000 0.0 NaN NaN 0.0000 0.0000 1.2273 0.0000 0.0249 0.9172 0.0 NaN 0.6549 0.0000 0.6684 0.4566 0.0000 0.0000 0.0138 1.2899 1.6070 1.0710 0.9058 0.0000 0.5955 NaN NaN 1.6232 1.3194 0.4903 0.4903 0.9423 0.0000 NaN 0.0000 0.0000 0.0000 0.0 NaN NaN 0.0000 0.0000 0.2284 0.0000 0.1325 0.2146 0.0 NaN 0.1934 0.5189 0.0058 0.0000 0.0000 0.3868 2.0293 0.6860 0.6113 1.0926 0.2211 0.0000 0.0058 0.2496 NaN 0.2195 1.4917 0 0.7128 20.0 0.0000 0.7865 1.000000 0.100000 0.0 1.454 25.0 25.0 0.0 0.000 0.454 0.0 3.036 12.000 0.0 1.454 8.0 23.0 0.0 226.98 168.05 960.37 1560.21 0.00 342.45 1039.85 0.00 66.18 0.00 87.94 226.98 168.05 461.37 0.00 237.93 225.51 995.27 1.4142 0.3495 0.3495
687024 35.0 0.0 0.0 4.0 0.0 2.0 0.0 1.0 0.0 3.0 2.0 2.0 3.0 2.0 1.0 0.0 1.0 0.0 0.0 0.0 3.0 2.0 1.0 2.0 1.0 0.0 0.0 0.0 0.0 3.0 0.0 2.0 0.0 0.0 5.0 0.0 2.0 2.0 2.0 2.0 2.0 1.0 0.0 1.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 130.90 0.6214 0.6924 Ж 1.0931 1.1344 0.8240 0.8281 NaN 0.4714 0.4359 NaN 0.4714 NaN 0.441 0.4359 NaN 0.0 0.4714 0.0000 NaN 0.4359 NaN 0.0000 NaN 0.6614 0.5162 0.5162 1.0710 0.0000 NaN 1.1557 1.1622 0.0058 NaN 1.1557 1.0531 NaN 1.3709 0.0000 0.0058 NaN NaN 0.0 0.2821 1.1557 0.0000 1.0531 0.0000 NaN 0.8648 1.2152 1.3709 0.0000 0.0058 NaN 0.0000 0.7422 NaN 0.8648 1.4424 NaN 0.6335 0.6177 0.9331 0.0000 0.0437 1.3395 0.0223 NaN NaN 0.0437 0.5187 NaN 1.6498 0.0000 0.0223 NaN NaN 0.0 0.0444 0.0437 0.5187 0.0000 NaN 0.4438 1.0347 1.6498 0.0000 0.0223 NaN NaN 0.0000 0.0000 0.1156 NaN 1.0347 1.7847 0 0.5756 59.0 1.3333 0.4002 0.000000 0.166667 0.0 3.000 14.0 2.0 0.0 19.856 3.000 0.0 19.856 29.000 0.0 3.000 15.0 1.0 0.0 550.09 695.32 111.87 114.21 0.00 1173.84 147.68 550.09 111.87 0.00 330.96 550.09 669.33 111.87 0.00 330.96 1173.84 119.99 2.6458 0.3646 0.3282
687025 33.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0000 0.0000 М 0.0000 0.0000 0.0000 0.0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 NaN NaN 0.0000 0.0000 0.0000 0.0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.2382 NaN 0 0.0000 4.0 0.0000 0.0000 1.000000 0.000000 0.0 0.000 1.0 1.0 0.0 NaN NaN NaN 0.000 0.000 0.0 0.000 0.0 1.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 28.01 NaN NaN NaN NaN 0.00 0.00 0.00 0.00 0.00 0.00 28.01 0.0000 0.0000 0.0000
687026 36.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.9847 0.9847 М NaN NaN NaN NaN 0.0 0.0000 0.0000 0.0 0.0000 NaN NaN 0.0000 0.0000 0.0 0.0000 0.0000 NaN 0.0000 0.0000 0.0000 NaN 0.0000 0.0000 0.0000 NaN 0.0000 0.0 0.0000 NaN 0.0000 0.0 0.0000 0.0000 NaN NaN 0.0000 0.0000 0.0000 0.0000 0.0 0.0000 0.0000 0.0000 0.0000 0.0000 NaN 0.0000 0.0000 NaN 0.0000 0.0000 0.0000 0.0000 0.0000 NaN 0.0000 NaN 0.0000 NaN NaN NaN 0.0000 0.0000 NaN 0.0000 0.0000 0.0 0.0000 0.0000 NaN NaN 0.0000 0.0000 0.0000 0.0000 0.0 0.0000 0.0000 0.0000 0.0000 NaN 0.0000 0.0000 NaN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 NaN 0.0000 NaN 1 0.9847 66.0 0.0000 0.9847 1.000000 0.000000 0.0 0.000 5.0 3.0 0.0 0.000 0.000 0.0 0.000 0.000 0.0 0.000 15.0 0.0 0.0 0.00 155.97 23.99 41.51 0.00 615.77 87.47 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 449.01 0.00 0.0000 NaN NaN
687027 37.0 0.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.3545 0.7263 М NaN 0.2269 NaN NaN 0.0 0.0000 0.0000 0.0 0.0000 0.0000 NaN 0.0000 0.0000 0.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 NaN NaN NaN 0.0000 0.0 0.0000 0.0000 0.0000 0.0 0.0000 NaN 0.0000 NaN 0.0000 0.0000 0.0000 0.0000 0.0 0.0000 0.0000 0.0000 NaN 0.0000 0.0000 0.0000 0.0000 NaN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 NaN 0.0000 NaN NaN NaN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0 0.0000 NaN 0.0000 NaN 0.0000 0.0000 0.0000 0.0000 0.0 0.0000 0.0000 NaN 0.0000 0.0000 0.0000 0.0000 NaN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 NaN 0 0.8318 9.0 0.0000 0.8318 1.000000 0.000000 0.0 0.000 1.0 0.0 0.0 0.000 0.000 0.0 0.000 0.476 0.0 0.000 0.0 0.0 0.0 0.00 81.90 29.82 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 46.72 0.00 0.00 0.00 0.00 0.00 0.0000 NaN NaN
687028 40.0 0.0 1.0 0.0 0.0 2.0 0.0 0.0 2.0 2.0 2.0 2.0 3.0 1.0 1.0 2.0 1.0 4.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 3.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 2.0 2.0 0.0 1.0 1.0 0.0 3.0 3.0 1.0 0.0 0.0 0.0 1.0 0.0 0.00 0.0000 0.0000 Ж 0.0000 0.8408 0.0000 0.0000 NaN NaN NaN NaN NaN NaN NaN 0.9895 0.9970 0.0 0.0000 0.0000 0.6667 0.9895 0.9970 0.0000 0.6667 0.0000 0.0000 0.3536 0.0000 0.0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.5264 0.0000 0.2564 0.0 NaN 0.0000 0.0000 0.0000 0.0000 1.1059 0.5745 0.2423 0.6481 0.3924 0.5264 0.2564 0.0000 0.0000 1.1059 0.5745 0.6481 0.3924 0.0000 0.0000 0.0000 0.0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.5179 0.0000 0.2672 0.0 NaN 0.0000 0.0000 0.0000 0.5834 0.1777 0.2541 1.2352 0.3381 0.5179 0.0000 0.2672 0.0000 0.0000 NaN 0.5834 0.2541 1.2352 0 0.0000 13.0 0.0000 0.0000 1.000000 0.100000 0.0 6.452 25.0 17.0 3.0 6.660 1.344 1.0 6.660 0.000 0.0 1.344 18.0 4.0 1.0 531.25 0.00 0.00 916.44 0.00 2407.56 1304.03 290.01 0.00 0.00 228.47 290.01 0.00 0.00 0.00 228.47 752.32 596.86 0.0000 0.0000 0.0000
  • There are 193 columns in the dataset
  • The dataset contains:
    • basic information about clients (age, number of children)
    • information about some groups of goods
    • statistical information (variation of discounts, prices)

Missing values

In [11]:
# check NaN values ratio
pd.DataFrame({"Total" : dataset.data.isna().sum().sort_values(ascending = False),
              "Percentage" : round(dataset.data.isna().sum().sort_values(ascending = False) / len(dataset.data), 3)}).head(20)
              
Out[11]:
Total Percentage
k_var_sku_price_15d_g49 496259 0.722
k_var_disc_share_15d_g49 496159 0.722
k_var_count_per_cheque_15d_g34 468551 0.682
k_var_sku_price_15d_g34 468551 0.682
k_var_disc_share_15d_g34 468467 0.682
k_var_count_per_cheque_15d_g24 442121 0.644
k_var_disc_share_15d_g24 442054 0.643
k_var_sku_price_1m_g49 414473 0.603
k_var_count_per_cheque_1m_g49 414473 0.603
k_var_disc_share_1m_g49 414369 0.603
k_var_sku_price_1m_g54 388217 0.565
k_var_disc_share_1m_g54 388139 0.565
k_var_sku_price_1m_g34 385078 0.560
k_var_count_per_cheque_1m_g34 385078 0.560
k_var_disc_share_1m_g34 384997 0.560
k_var_sku_price_1m_g44 383315 0.558
k_var_count_per_cheque_1m_g44 383315 0.558
k_var_disc_share_1m_g44 383219 0.558
k_var_sku_price_1m_g40 380641 0.554
k_var_disc_share_1m_g40 380559 0.554
In [12]:
print('Total missed data percentage:', 
      round(100*dataset.data.isna().sum().sum()/(dataset.data.shape[0]*dataset.data.shape[1]), 2), '%')
Total missed data percentage: 19.34 %

Data transformation

Transform categorical columns gender and treatment into binary.

In [13]:
# make treatment binary
treat_dict = {
    'test': 1,
    'control': 0
}
dataset.treatment = dataset.treatment.map(treat_dict)

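# make gender binary
# keys are the Cyrillic letters used in the data: 'М' (male), 'Ж' (female);
# any value not listed here is mapped to NaN
gender_dict = {
    'М': 1,
    'Ж': 0
}
dataset.data.gender = dataset.data.gender.map(gender_dict)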

Feature correlation

In [14]:
f = plt.figure(figsize=(19, 15))
plt.matshow(dataset.data.corr(), fignum=f.number)
cb = plt.colorbar()
cb.ax.tick_params(axis=u'both', which=u'both',length=0)
plt.title('Correlation Matrix', fontsize=16);

Train/test split

  • stratify by two columns: treatment and target.

Intuition: in a binary classification problem we stratify the train/test split by the single 0/1 target column. In uplift modeling there are two such columns, treatment and target, so we stratify by both.

In [15]:
stratify_cols = pd.concat([dataset.treatment, dataset.target], axis=1)

X_train, X_val, trmnt_train, trmnt_val, y_train, y_val = train_test_split(
    dataset.data,
    dataset.treatment,
    dataset.target,
    stratify=stratify_cols,
    test_size=0.3,
    random_state=42
)

print(f"Train shape: {X_train.shape}")
print(f"Validation shape: {X_val.shape}")
Train shape: (480920, 193)
Validation shape: (206109, 193)
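
To illustrate what stratifying by both columns means, here is a dependency-free sketch on hypothetical toy data: row indices are grouped by their (treatment, target) pair and every group is split in the same proportion. The stratified_split helper is an assumption made for illustration; it shows the idea only and is not how train_test_split is implemented.

```python
import random
from collections import defaultdict

def stratified_split(treatment, target, test_size=0.3, seed=42):
    """Split row indices so each (treatment, target) cell keeps its share."""
    cells = defaultdict(list)
    for i, pair in enumerate(zip(treatment, target)):
        cells[pair].append(i)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for pair in sorted(cells):
        idx = cells[pair]
        rng.shuffle(idx)
        n_val = round(len(idx) * test_size)  # validation rows taken from this cell
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Hypothetical toy data: 10 rows in each of the four (treatment, target) cells.
treatment = [1] * 20 + [0] * 20
target = ([1] * 10 + [0] * 10) * 2

train_idx, val_idx = stratified_split(treatment, target)
print(len(train_idx), len(val_idx))
```

Each of the four cells contributes the same fraction of its rows to the validation part, so both the treatment share and the target share are preserved in the two subsets.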

Pipeline with CatBoostClassifier

In [16]:
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
estimator = CatBoostClassifier(verbose=100,
                               random_state=42,
                               thread_count=1)
ct_model = ClassTransformation(estimator=estimator)

my_pipeline = Pipeline([
    ('imputer', imp_mode),
    ('model', ct_model)
])

This is the usual pipeline fit, but with an additional treatment parameter: model__treatment=trmnt_train.

In [17]:
my_pipeline = my_pipeline.fit(
    X=X_train, 
    y=y_train, 
    model__treatment=trmnt_train
)
Learning rate set to 0.143939
0:	learn: 0.6695107	total: 421ms	remaining: 7m
100:	learn: 0.5950043	total: 34.2s	remaining: 5m 4s
200:	learn: 0.5908539	total: 1m 5s	remaining: 4m 21s
300:	learn: 0.5870115	total: 1m 39s	remaining: 3m 51s
400:	learn: 0.5835003	total: 2m 13s	remaining: 3m 19s
500:	learn: 0.5800551	total: 2m 47s	remaining: 2m 46s
600:	learn: 0.5768127	total: 3m 21s	remaining: 2m 13s
700:	learn: 0.5736896	total: 3m 54s	remaining: 1m 39s
800:	learn: 0.5706878	total: 4m 27s	remaining: 1m 6s
900:	learn: 0.5676374	total: 5m 2s	remaining: 33.2s
999:	learn: 0.5647908	total: 5m 35s	remaining: 0us
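
For intuition about what the classifier inside ClassTransformation is trained on, the class transformation approach replaces the target with a new label z that equals 1 for treated customers who responded and for control customers who did not. Under the assumption that treatment and control groups are equally likely, the uplift is then 2 * P(z = 1 | x) - 1. The sketch below uses hypothetical helper names and toy values; it illustrates the general technique rather than the library's internal code, and the balance assumption should be kept in mind because this sample is unbalanced in treatment.

```python
def transform_class(y, w):
    """Return z = 1 for (treated and responded) or (control and did not respond)."""
    return [yi * wi + (1 - yi) * (1 - wi) for yi, wi in zip(y, w)]

def uplift_from_proba(p_z):
    """Convert P(z = 1 | x) into an uplift estimate, assuming balanced groups."""
    return [2 * p - 1 for p in p_z]

# Hypothetical toy labels: y is the target, w is the binary treatment flag.
y = [1, 0, 1, 0]
w = [1, 1, 0, 0]

z = transform_class(y, w)
uplift = uplift_from_proba([0.5, 0.75])
print(z, uplift)
```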

Predict uplift and calculate a basic uplift metric, uplift@30% (uplift at the first 30% of the sample ranked by predicted uplift). Read more about the metric in the docs.

In [18]:
uplift_predictions = my_pipeline.predict(X_val)

# k=0.3 is the default; it is passed explicitly to make the 30% cut-off visible
uplift_30 = uplift_at_k(y_val, uplift_predictions, trmnt_val, strategy='overall', k=0.3)
print(f'uplift@30%: {uplift_30:.4f}')
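
As a rough illustration of what this metric measures with strategy='overall', the sketch below ranks all customers by predicted uplift, keeps the top k share, and takes the difference between the treated and control response rates within that slice. The function name uplift_at_k_overall and the toy lists are hypothetical; this reflects my reading of the strategy and does not reproduce the library's handling of ties or edge cases.

```python
def uplift_at_k_overall(y_true, uplift, treatment, k=0.3):
    """Response-rate difference within the top-k share ranked by predicted uplift."""
    # Rank all customers (both groups together) by predicted uplift, highest first.
    order = sorted(range(len(uplift)), key=lambda i: uplift[i], reverse=True)
    top = order[:int(len(order) * k)]

    treated = [y_true[i] for i in top if treatment[i] == 1]
    control = [y_true[i] for i in top if treatment[i] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Hypothetical toy data for 10 customers, not taken from the Lenta dataset.
y_true    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
treatment = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
uplift    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

score = uplift_at_k_overall(y_true, uplift, treatment, k=0.4)
print(f"uplift@40%: {score:.2f}")
```

A higher value means that the customers ranked at the top by the model respond noticeably more often when treated than when left in the control group.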