Use few features and pre-select them. My best submission used only 6 features!
Do NOT fit very complex estimators. I mainly used different gradient boosting implementations (CatBoost / XGBoost / LightGBM); the best results came from models with trees of depth 1-2.
Use the class transformation approach for uplift modelling.
Build a more accurate local validation and do not rely only on the public validation score. To test hypotheses, I made N (N = 30, 50) train-test splits and used the mean and standard deviation of the test scores to make decisions.
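The repeated-split validation from the last tip can be sketched roughly as follows. This is a minimal illustration, not the code used in the competition: `repeated_split_scores` and the `score_fn` callback are hypothetical names, and `score_fn` is assumed to fit a model on the train indices and return a test metric such as uplift@30%. Stratification by treatment or target is left out for brevity.

```python
import numpy as np


def repeated_split_scores(score_fn, n_rows, n_splits=30, test_size=0.3, seed=0):
    """Run `score_fn` on several random train/test splits.

    `score_fn(train_idx, test_idx)` is a hypothetical callback that fits a
    model on the train rows and returns one test score as a float.
    Returns the mean and the sample standard deviation of the scores.
    """
    rng = np.random.default_rng(seed)
    n_test = int(n_rows * test_size)
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(n_rows)          # fresh random split each time
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        scores.append(score_fn(train_idx, test_idx))
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)


# Dummy callback: returns the share of rows that ended up in the test part.
mean_score, std_score = repeated_split_scores(
    lambda train_idx, test_idx: len(test_idx) / 100, n_rows=100
)
```

A change to features or hyperparameters is then worth keeping only if the mean improves by clearly more than the spread between splits.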
Package with different uplift model implementations:
uplift modelling package
Tutorial about uplift and approaches (in Russian):
uplift modeling approaches article
Additional code with my feature engineering, experiments and validation:
retail hero research
import gc
import pandas as pd
import numpy as np
from datetime import datetime
from xgboost import XGBClassifier
from sklift.models import ClassTransformation
from sklift.metrics import uplift_at_k
import matplotlib.pyplot as plt
%matplotlib inline
df_clients = pd.read_csv(
'data/clients.csv',
index_col='client_id',
parse_dates=['first_issue_date', 'first_redeem_date']
)
df_purchases = pd.read_csv(
'data/purchases.csv',
# keep 'client_id' as a regular column: it is selected by name below
parse_dates=['transaction_datetime']
)
df_train = pd.read_csv('data/uplift_train.csv', index_col='client_id')
df_test = pd.read_csv('data/uplift_test.csv', index_col='client_id')
'first_redeem_date': fill missing dates with 2019-03-19, then convert the date to an integer feature (days since 1970-01-01).
df_clients['first_redeem_date'] =\
df_clients['first_redeem_date']\
.fillna(datetime(2019, 3, 19, 0, 0))
df_clients.loc[:, 'first_redeem_date'] =\
((df_clients['first_redeem_date'] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1d'))
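As a small check of the conversion above, here is the same idiom on two toy values. The series `dates` is made up for illustration; the fill date is the one used in the notebook.

```python
import pandas as pd

# Toy data: one real redeem date and one missing value.
dates = pd.Series(pd.to_datetime(["2018-01-04", None]))

# Missing dates get the same sentinel date as in the notebook.
filled = dates.fillna(pd.Timestamp("2019-03-19"))

# Whole days elapsed since the Unix epoch, as integers.
days = (filled - pd.Timestamp("1970-01-01")) // pd.Timedelta("1d")
print(days.tolist())
```

Because the sentinel is later than any real date in the toy data, clients who never redeemed end up with the largest value of the feature.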
Aggregate df_purchases data
# two-level aggregation:
# 1. aggregate to (client, transaction), keeping one row only
# ('express_points_spent' and 'purchase_sum' are the same for every row of a transaction)
# 2. aggregate all transactions to the client level.
df_purch = df_purchases\
.groupby(['client_id','transaction_id'])[['express_points_spent', 'purchase_sum']]\
.last()\
.groupby('client_id')\
.agg({
'express_points_spent': ['mean', 'sum'],
'purchase_sum': ['sum']
})
# set readable column names:
df_purch.columns =\
['express_spent_mean', 'express_points_spent_sum', 'purchase_sum__sum']
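The reason for the two-level aggregation can be seen on a toy frame. This sketch assumes, as the comments above state, that `purchase_sum` is repeated on every product row of a transaction; the data is invented for illustration.

```python
import pandas as pd

# Transaction t1 has two product rows, each carrying the full transaction sum.
toy = pd.DataFrame({
    "client_id": ["a", "a", "a", "b"],
    "transaction_id": ["t1", "t1", "t2", "t3"],
    "purchase_sum": [100.0, 100.0, 50.0, 30.0],
})

# Naive sum over rows counts t1 twice.
naive = toy.groupby("client_id")["purchase_sum"].sum()

# Two-level aggregation: one row per transaction first, then sum per client.
two_level = (
    toy.groupby(["client_id", "transaction_id"])["purchase_sum"]
    .last()
    .groupby(level="client_id")
    .sum()
)
print(naive.to_dict(), two_level.to_dict())
```

Here `groupby(level="client_id")` is used for the second step; grouping by the index level name, as in the notebook code, should be equivalent.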
#'regular_points_received_sum_last_m'
reg_points_last_m = df_purchases[
df_purchases['transaction_datetime'] > '2019-02-18'
]\
.groupby(['client_id', 'transaction_id'])['regular_points_received']\
.last()\
.groupby('client_id')\
.sum()
reg_points_last_m = pd.DataFrame({
'regular_points_received_sum_last_m': reg_points_last_m
})
Convert 'transaction_datetime' from the purchases data frame to an integer (days since 1970-01-01).
Join the clients and purchases data frames and calculate the difference between 'first_redeem_date' and the purchase day (date_diff).
df_purchases.loc[:,'purch_day'] =\
((df_purchases['transaction_datetime'] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1d'))
Calculate features from purchases data for all transactions after redeem date:
# 'after_redeem_sum'
df_purch_joined = pd.merge(
df_purchases[
['client_id', 'purch_day', 'transaction_id', 'purchase_sum']
],
df_clients\
.reset_index()[
['client_id', 'first_redeem_date']
], on='client_id', how='left')
df_purch_joined = df_purch_joined\
.assign(date_diff=\
df_purch_joined['first_redeem_date'] - df_purch_joined['purch_day']
)
df_purch_agg = df_purch_joined[
df_purch_joined['date_diff'] <= 0
]\
.groupby(
['client_id', 'transaction_id']
)\
.last()\
.groupby('client_id')['purchase_sum']\
.sum()
after_redeem_sum = pd.DataFrame(data={
'after_redeem_sum': df_purch_agg
})
del df_purch_joined, df_purch_agg
gc.collect();
'purch_delta': number of days between the first and the last transaction, counting both days
df_purch_delta_agg = df_purchases\
.groupby('client_id')\
.agg({
'purch_day': ['max', 'min']
})
df_purch_delta = pd.DataFrame(
data=df_purch_delta_agg['purch_day']['max'] -\
df_purch_delta_agg['purch_day']['min'] + 1,
columns=['purch_delta']
)
del df_purch_delta_agg
gc.collect();
Concat all data together:
df_feats = pd.concat([
df_clients[['first_redeem_date']],
df_purch,
df_purch_delta,
reg_points_last_m,
after_redeem_sum
], axis=1, sort=False)
df_feats = df_feats\
.assign(
avg_spent_perday=\
df_feats['purchase_sum__sum'] / df_feats['purch_delta'],
after_redeem_sum_perday =\
df_feats['after_redeem_sum'] / df_feats['purch_delta']
)\
.drop([
'purch_delta', 'purchase_sum__sum', 'after_redeem_sum'
], axis=1
)
# train and test features
print(df_feats.shape)
df_feats.head(2)
(400162, 6)
 | first_redeem_date | express_spent_mean | express_points_spent_sum | regular_points_received_sum_last_m | avg_spent_perday | after_redeem_sum_perday |
---|---|---|---|---|---|---|
000012768d | 17535 | 0.0 | 0.0 | 10.0 | 26.951923 | 26.951923 |
000036f903 | 17279 | 0.0 | 0.0 | 13.7 | 89.136364 | 89.136364 |
Make train and test data: join the features to the client ids (plus treatment and target for train).
# train data:
df_train_feats = df_train\
.join(
df_feats,
how='left'
)
df_test_feats = df_test\
.join(
df_feats,
how='left'
)
print(f'Train df shape: {df_train_feats.shape}')
print(f'Test df shape: {df_test_feats.shape}')
Train df shape: (200039, 8)
Test df shape: (200123, 6)
# save final train / test data
df_train_feats.to_csv('data/retail_hero_final_model_train_data.csv')
df_test_feats.to_csv('data/retail_hero_final_model_test_data.csv')
Use the class transformation approach for uplift prediction, with XGBoost as the estimator.
xgb_est_params = {
'max_depth':2,
'learning_rate': 0.2,
'n_estimators': 100,
'nthread':40,
'n_gpus':0,
'seed':42
}
estimator = XGBClassifier(
**xgb_est_params
)
uplift_model_cl_tr = ClassTransformation(
estimator=estimator
)
uplift_model_cl_tr.fit(
X=df_train_feats\
.drop(
['treatment_flg', 'target'],
axis=1
),
y=df_train_feats['target'],
treatment=df_train_feats['treatment_flg']
)
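For readers new to the class transformation approach, here is a minimal sketch of the textbook formulation. It is not the library's code: the function names are hypothetical, and the formulas assume a randomized treatment with roughly equal treatment and control group sizes.

```python
import numpy as np


def class_transform_target(y, treatment):
    """New binary target Z: 1 when the outcome equals the treatment flag.

    Z = Y * W + (1 - Y) * (1 - W), so Z = 1 for treated responders and
    for control non-responders.
    """
    y = np.asarray(y)
    treatment = np.asarray(treatment)
    return y * treatment + (1 - y) * (1 - treatment)


def uplift_from_proba(p_z):
    """Convert P(Z = 1 | x) into an uplift estimate: 2 * P(Z = 1 | x) - 1."""
    return 2 * np.asarray(p_z, dtype=float) - 1


# Toy example: two treated and two control clients.
z = class_transform_target(y=[1, 0, 1, 0], treatment=[1, 1, 0, 0])
uplift = uplift_from_proba([0.5, 0.75])
print(z.tolist(), uplift.tolist())
```

An ordinary classifier is then trained on `z`, which is why a plain `XGBClassifier` can serve as the estimator.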
The competition metric is calculated as follows. All test data are sorted in descending order of predicted uplift. Then the conversion (share of positive targets) is calculated in the top 30% of the treatment group and in the top 30% of the control group SEPARATELY. The metric is the difference between these two conversions.
Usually the uplift@30% metric is calculated differently: first the top 30% of ALL test data is selected, and only then are the conversions of the treatment and control groups within that selection calculated and subtracted.
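The two variants described above can be sketched in NumPy as follows. This is an illustration of the definitions under simple assumptions (binary target, both groups present in every selection), not the scoring code of the competition or of any library; the function names are hypothetical.

```python
import numpy as np


def uplift_at_k_by_group(y, uplift, treatment, k=0.3):
    """Top-k share is taken inside each group separately."""
    order = np.argsort(-uplift, kind="stable")   # descending by predicted uplift
    y, treatment = y[order], treatment[order]
    y_trmnt, y_ctrl = y[treatment == 1], y[treatment == 0]
    n_trmnt, n_ctrl = int(len(y_trmnt) * k), int(len(y_ctrl) * k)
    return y_trmnt[:n_trmnt].mean() - y_ctrl[:n_ctrl].mean()


def uplift_at_k_overall(y, uplift, treatment, k=0.3):
    """Top-k share is taken over all rows, then split into groups."""
    order = np.argsort(-uplift, kind="stable")
    n_top = int(len(y) * k)
    top_y, top_t = y[order][:n_top], treatment[order][:n_top]
    return top_y[top_t == 1].mean() - top_y[top_t == 0].mean()


# Toy data, already sorted by predicted uplift.
uplift = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])
treatment = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0])
y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])

by_group = uplift_at_k_by_group(y, uplift, treatment, k=0.5)
overall = uplift_at_k_overall(y, uplift, treatment, k=0.5)
print(by_group, overall)
```

On this toy data the two variants disagree, because the overall top half contains four treated clients but only one control client.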
# predicts
uplift_tr = uplift_model_cl_tr.predict(
df_train_feats\
.drop(['treatment_flg', 'target'], axis=1),
)
uplift_ts = uplift_model_cl_tr.predict(
df_test_feats
)
# score on train:
df_train_scores = df_train_feats[['treatment_flg', 'target']]\
.assign(uplift_score=uplift_tr)
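train_score = uplift_at_k(
y_true=df_train_scores['target'],
uplift=df_train_scores['uplift_score'],
treatment=df_train_scores['treatment_flg'],
# recent scikit-uplift releases require `strategy`;
# 'by_group' corresponds to the competition metric described above
strategy='by_group',
k=0.3
)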
print(f'Uplift@30% on train data: {train_score:.3f}')
Uplift@30% on train data: 0.105
df_submit = df_test_feats\
.assign(uplift=uplift_ts)[['uplift']]
print(f'Submit data shape: {df_submit.shape}\n')
df_submit.head(2)
df_submit.to_csv('submissions/retail_hero_2nd_place_submit.csv')
!head -5 submissions/retail_hero_2nd_place_submit.csv
Submit data shape: (200123, 1)

client_id,uplift
000048b7a6,0.03721845
000073194a,0.019023776
00007c7133,0.032182574
00007f9014,0.016914487
How to calculate feature importance in uplift models is still an open area of research.
Here feature importance is just the vanilla information gain of the XGBoost estimator used inside the class transformation approach.
df_feat_imp = pd.DataFrame([
uplift_model_cl_tr\
.estimator\
.get_booster()\
.get_score(importance_type='gain')
]
).T.reset_index()
df_feat_imp.columns =\
["feature", "value"]
df_feat_imp\
.sort_values('value', ascending=False)
 | feature | value |
---|---|---|
4 | first_redeem_date | 64.531791 |
2 | express_points_spent_sum | 14.171003 |
3 | express_spent_mean | 14.151961 |
5 | regular_points_received_sum_last_m | 7.514503 |
1 | avg_spent_perday | 5.645311 |
0 | after_redeem_sum_perday | 4.776374 |
plt.figure(figsize=(20, 8));
plt.hist(
df_train_scores['uplift_score'],
bins=200,
color='red',
);
plt.xlim(-0.2, 0.75);
plt.title('Uplift predictions on train data', size=15);
plt.figure(figsize=(20, 8));
plt.hist(
df_submit['uplift'],
bins=200,
color='blue',
);
plt.xlim(-0.2, 0.75);
plt.title('Uplift predictions on test data', size=15);
The two distributions look quite similar.