This notebook is based on the data provided in the Kaggle competition "Google Analytics Customer Revenue Prediction". https://www.kaggle.com/c/ga-customer-revenue-prediction.
The tast was done as part of Penn Data Science Group Kaggle team. http://penndsg.com/.
In this notebook, I trained the model in two steps. In the first step, I used LightGBM classified to predict probabilities of making a purchase for each row. In the second step, I "filtered out" the rows with very low probability of purchase and trained the LightGBM regression model on the remaining data. Since majority of purchases in GStore are made by the small group of customers, such two-step process will allow us to filter out the non-purchasing customers and let us focus on the purchasing customers during the regression model training.
This notebook goes through the following steps to make a final prediction of a future revenue for each customer based on the provided data.
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import datetime
import lightgbm as lgb
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
import json
import os
from pandas.io.json import json_normalize
import datetime
Some of the features are present in json format. We need to expand them in order to use in calculation. The following function does exactly that.
def load_df(csv_path='train.csv', nrows=None):
JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
df = pd.read_csv(csv_path,
converters={column: json.loads for column in JSON_COLUMNS},
dtype={'fullVisitorId': 'str'}, # Important!!
nrows=nrows)
for column in JSON_COLUMNS:
column_as_df = json_normalize(df[column])
column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
return df
# expand the test and train data as pandas dataframes
# this step takes ~5 minutes on my laptop, so instead I created a separate file where I ran these steps once and
# saved the outputs into csv files, so that I don't have to repeate this step every single time.
# uncomment the lines below when running for the first time
#%%time
#train_df = load_df()
#test_df = load_df("test.csv")
#train_df.to_csv('train_expanded.csv')
#test_df.to_csv('test_expanded.csv')
# load data with expanded json features from files created using the code above
train_df = pd.read_csv('train_v2_expanded.csv', low_memory=False, dtype={'fullVisitorId': 'str'})
test_df = pd.read_csv('test_v2_expanded.csv', low_memory=False, dtype={'fullVisitorId': 'str'})
constant_columns = [c for c in train_df.columns if train_df[c].nunique(dropna = False) == 1]
constant_columns
['socialEngagementType', 'device.browserSize', 'device.browserVersion', 'device.flashVersion', 'device.language', 'device.mobileDeviceBranding', 'device.mobileDeviceInfo', 'device.mobileDeviceMarketingName', 'device.mobileDeviceModel', 'device.mobileInputSelector', 'device.operatingSystemVersion', 'device.screenColors', 'device.screenResolution', 'geoNetwork.cityId', 'geoNetwork.latitude', 'geoNetwork.longitude', 'geoNetwork.networkLocation', 'totals.visits', 'trafficSource.adwordsClickInfo.criteriaParameters']
# features with constant values should be excluded, since they don't carry any useful information.
columns_to_exclude = constant_columns
# in addition, I will also exclude 'sessionId' feature, since it carries no useful information and 'Unnamed: 0',
# since it is an artifact I created when I expanded json features.
# columns_to_exclude = columns_to_exclude + ['sessionId']
columns_to_exclude = columns_to_exclude
# I'll remove these features from both test and training data.
train_df = train_df.drop(columns_to_exclude, axis = 1)
test_df = test_df.drop(columns_to_exclude, axis = 1)
set(train_df.columns) - set(test_df.columns)
{'trafficSource.campaignCode'}
train_df = train_df.drop(['trafficSource.campaignCode'], axis = 1)
print('done')
done
test_df.shape
(401589, 39)
train_df.head()
channelGrouping | customDimensions | date | fullVisitorId | visitId | visitNumber | visitStartTime | device.browser | device.deviceCategory | device.isMobile | device.operatingSystem | geoNetwork.city | geoNetwork.continent | geoNetwork.country | geoNetwork.metro | geoNetwork.networkDomain | geoNetwork.region | geoNetwork.subContinent | totals.bounces | totals.hits | totals.newVisits | totals.pageviews | totals.sessionQualityDim | totals.timeOnSite | totals.totalTransactionRevenue | totals.transactionRevenue | totals.transactions | trafficSource.adContent | trafficSource.adwordsClickInfo.adNetworkType | trafficSource.adwordsClickInfo.gclId | trafficSource.adwordsClickInfo.isVideoAd | trafficSource.adwordsClickInfo.page | trafficSource.adwordsClickInfo.slot | trafficSource.campaign | trafficSource.isTrueDirect | trafficSource.keyword | trafficSource.medium | trafficSource.referralPath | trafficSource.source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Organic Search | [{'index': '4', 'value': 'EMEA'}] | 20171016 | 3162355547410993243 | 1508198450 | 1 | 1508198450 | Firefox | desktop | False | Windows | not available in demo dataset | Europe | Germany | not available in demo dataset | (not set) | not available in demo dataset | Western Europe | 1.0 | 1 | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | (not set) | NaN | water bottle | organic | NaN | |
1 | Referral | [{'index': '4', 'value': 'North America'}] | 20171016 | 8934116514970143966 | 1508176307 | 6 | 1508176307 | Chrome | desktop | False | Chrome OS | Cupertino | Americas | United States | San Francisco-Oakland-San Jose CA | (not set) | California | Northern America | NaN | 2 | NaN | 2.0 | 2.0 | 28.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | (not set) | NaN | NaN | referral | /a/google.com/transportation/mtv-services/bike... | sites.google.com |
2 | Direct | [{'index': '4', 'value': 'North America'}] | 20171016 | 7992466427990357681 | 1508201613 | 1 | 1508201613 | Chrome | mobile | True | Android | not available in demo dataset | Americas | United States | not available in demo dataset | windjammercable.net | not available in demo dataset | Northern America | NaN | 2 | 1.0 | 2.0 | 1.0 | 38.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | (not set) | True | NaN | (none) | NaN | (direct) |
3 | Organic Search | [{'index': '4', 'value': 'EMEA'}] | 20171016 | 9075655783635761930 | 1508169851 | 1 | 1508169851 | Chrome | desktop | False | Windows | not available in demo dataset | Asia | Turkey | not available in demo dataset | unknown.unknown | not available in demo dataset | Western Asia | NaN | 2 | 1.0 | 2.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | (not set) | NaN | (not provided) | organic | NaN | |
4 | Organic Search | [{'index': '4', 'value': 'Central America'}] | 20171016 | 6960673291025684308 | 1508190552 | 1 | 1508190552 | Chrome | desktop | False | Windows | not available in demo dataset | Americas | Mexico | not available in demo dataset | prod-infinitum.com.mx | not available in demo dataset | Central America | NaN | 2 | 1.0 | 2.0 | 1.0 | 52.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | (not set) | NaN | (not provided) | organic | NaN |
Let's look at some categorical features in more details.
At first, let's go through the categorical features one by one and use the groupby function to see the total revenues, mean revenues and counts of non-zero transaction.
# Define a function to group by a certain feature, aggregate total revenue sum, mean and count and show the head of
# resulting dataframe (sorted)
def describe_function(col, df = train_df):
display(df.groupby(col)['totals.transactionRevenue'].agg(['sum', 'mean','count']).sort_values(by="count", ascending=False).head())
features = ["channelGrouping", "device.browser",
"device.deviceCategory", "device.operatingSystem",
"geoNetwork.city", "geoNetwork.continent",
"geoNetwork.country", "geoNetwork.metro",
"geoNetwork.networkDomain", "geoNetwork.region",
"geoNetwork.subContinent", "trafficSource.adContent"]
for col in features:
describe_function(col, df = train_df)
sum | mean | count | |
---|---|---|---|
channelGrouping | |||
Referral | 9.960210e+11 | 1.160052e+08 | 8586 |
Organic Search | 5.394990e+11 | 9.785941e+07 | 5513 |
Direct | 5.973243e+11 | 1.799170e+08 | 3320 |
Paid Search | 6.696862e+10 | 9.418934e+07 | 711 |
Display | 1.082893e+11 | 5.282403e+08 | 205 |
sum | mean | count | |
---|---|---|---|
device.browser | |||
Chrome | 2.068165e+12 | 1.238645e+08 | 16697 |
Safari | 7.670099e+10 | 6.338925e+07 | 1210 |
Firefox | 1.502462e+11 | 4.432043e+08 | 339 |
Internet Explorer | 1.195796e+10 | 8.079703e+07 | 148 |
Edge | 8.067300e+09 | 1.090176e+08 | 74 |
sum | mean | count | |
---|---|---|---|
device.deviceCategory | |||
desktop | 2.229097e+12 | 1.318602e+08 | 16905 |
mobile | 7.429866e+10 | 5.447116e+07 | 1364 |
tablet | 1.294908e+10 | 5.285339e+07 | 245 |
sum | mean | count | |
---|---|---|---|
device.operatingSystem | |||
Macintosh | 1.315231e+12 | 1.270264e+08 | 10354 |
Windows | 5.750670e+11 | 1.610381e+08 | 3571 |
Chrome OS | 2.696522e+11 | 1.583395e+08 | 1703 |
Linux | 7.025809e+10 | 5.404468e+07 | 1300 |
iOS | 3.588012e+10 | 4.312514e+07 | 832 |
sum | mean | count | |
---|---|---|---|
geoNetwork.city | |||
not available in demo dataset | 9.397274e+11 | 1.320213e+08 | 7118 |
New York | 3.258931e+11 | 1.357322e+08 | 2401 |
Mountain View | 2.035964e+11 | 9.854617e+07 | 2066 |
San Francisco | 1.429065e+11 | 1.255769e+08 | 1138 |
Sunnyvale | 6.908219e+10 | 8.146485e+07 | 848 |
sum | mean | count | |
---|---|---|---|
geoNetwork.continent | |||
Americas | 2.265098e+12 | 1.247988e+08 | 18150 |
Asia | 2.339772e+10 | 1.231459e+08 | 190 |
Europe | 1.124745e+10 | 8.926548e+07 | 126 |
Oceania | 6.781100e+09 | 2.338310e+08 | 29 |
Africa | 9.025990e+09 | 7.521658e+08 | 12 |
sum | mean | count | |
---|---|---|---|
geoNetwork.country | |||
United States | 2.197885e+12 | 1.244697e+08 | 17658 |
Canada | 4.269923e+10 | 1.355531e+08 | 315 |
Venezuela | 1.390126e+10 | 2.044303e+08 | 68 |
Taiwan | 2.934200e+09 | 9.169375e+07 | 32 |
Mexico | 3.763080e+09 | 1.393733e+08 | 27 |
sum | mean | count | |
---|---|---|---|
geoNetwork.metro | |||
not available in demo dataset | 9.397274e+11 | 1.320213e+08 | 7118 |
San Francisco-Oakland-San Jose CA | 5.062230e+11 | 1.015696e+08 | 4984 |
New York NY | 3.274385e+11 | 1.351376e+08 | 2423 |
Chicago IL | 1.104102e+11 | 1.665313e+08 | 663 |
Los Angeles CA | 9.190494e+10 | 1.598347e+08 | 575 |
sum | mean | count | |
---|---|---|---|
geoNetwork.networkDomain | |||
(not set) | 1.305843e+12 | 1.201659e+08 | 10867 |
comcast.net | 1.773147e+11 | 1.178171e+08 | 1505 |
verizon.net | 7.967340e+10 | 1.055277e+08 | 755 |
unknown.unknown | 5.825398e+10 | 8.759997e+07 | 665 |
rr.com | 5.322676e+10 | 9.052170e+07 | 588 |
sum | mean | count | |
---|---|---|---|
geoNetwork.region | |||
not available in demo dataset | 9.397274e+11 | 1.320213e+08 | 7118 |
California | 6.091657e+11 | 1.079124e+08 | 5645 |
New York | 3.262089e+11 | 1.357507e+08 | 2403 |
Illinois | 1.104102e+11 | 1.665313e+08 | 663 |
Washington | 4.921693e+10 | 9.612682e+07 | 512 |
sum | mean | count | |
---|---|---|---|
geoNetwork.subContinent | |||
Northern America | 2.240584e+12 | 1.246639e+08 | 17973 |
South America | 1.855388e+10 | 1.520810e+08 | 122 |
Eastern Asia | 1.399618e+10 | 1.686287e+08 | 83 |
Southeast Asia | 6.019380e+09 | 1.020234e+08 | 59 |
Western Europe | 4.259950e+09 | 8.874896e+07 | 48 |
sum | mean | count | |
---|---|---|---|
trafficSource.adContent | |||
Google Merchandise Collection | 2.064188e+10 | 1.358018e+08 | 152 |
Official Google Merchandise | 1.719470e+09 | 6.877880e+07 | 25 |
Display Ad created 3/11/14 | 5.697200e+08 | 6.330222e+07 | 9 |
Google Online Store | 2.109800e+08 | 3.516333e+07 | 6 |
Placement Accessores 300 x 250 | 5.771900e+08 | 1.154380e+08 | 5 |
As we can see, using tables to visualize our data is very tricky, so next, I will draw the graphs instead. For each of the following features, I calculated the sum and count of transactionRevenue and sorted in descending order (based on count) before plotting.
#temp_df = train_df.groupby('geoNetwork.continent')['totals.transactionRevenue'].agg(['sum', 'count']).sort_values(by="count", ascending=False).head(10)
plt.rcParams.update({'font.size': 20})
fig, axes = plt.subplots(len(features), 2, figsize=(30,85))
# sns.barplot(y=temp_df.index, x=temp_df['sum'], ax = axes[0,0])
# sns.barplot(y=temp_df.index, x=temp_df['count'], ax = axes[0,1])
for n in range(len(features)):
col = features[n]
temp_df = train_df.groupby(col)['totals.transactionRevenue'].agg(['sum', 'count']).sort_values(by="count", ascending=False).head(10)
# Some names were too long, so I shortened them to just 10 characters
y_temp = list(map(lambda x: x[:10], list(temp_df.index)))
sns.barplot(y=y_temp, x=temp_df['sum'], ax = axes[n,0])
sns.barplot(y=y_temp, x=temp_df['count'], ax = axes[n,1])
axes[n,0].set_title(col + ': sum')
axes[n,1].set_title(col + ': count')
axes[n,0].set_xlabel('')
axes[n,1].set_xlabel('')
#axes[n,0].tick_params(labelsize=20)
#axes[n,1].tick_params(labelsize=20)
As you can see, while there is some strong imbalance for some features (e.g. continent, country, etc.), there is a quite good variation in some other features (e.g. region, metro, city, etc.). Such variation is very useful when training models.
train_df['date'] = pd.to_datetime(train_df['visitStartTime'], unit='s')
train_df['day_of_week'] = train_df['date'].dt.dayofweek
train_df['hour'] = train_df['date'].dt.hour
train_df['day_of_month'] = train_df['date'].dt.day
train_df['month'] = train_df['date'].dt.month
test_df['date'] = pd.to_datetime(test_df['visitStartTime'], unit='s')
test_df['day_of_week'] = test_df['date'].dt.dayofweek
test_df['hour'] = test_df['date'].dt.hour
test_df['day_of_month'] = test_df['date'].dt.day
test_df['month'] = test_df['date'].dt.month
train_df['next_session_1'] = (
train_df['date'] - train_df[['fullVisitorId', 'date']].groupby('fullVisitorId')['date'].shift(1)
).astype(np.int64) // 1e9 // 60 // 60
train_df['next_session_2'] = (
train_df['date'] - train_df[['fullVisitorId', 'date']].groupby('fullVisitorId')['date'].shift(-1)
).astype(np.int64) // 1e9 // 60 // 60
test_df['next_session_1'] = (
test_df['date'] - test_df[['fullVisitorId', 'date']].groupby('fullVisitorId')['date'].shift(1)
).astype(np.int64) // 1e9 // 60 // 60
test_df['next_session_2'] = (
test_df['date'] - test_df[['fullVisitorId', 'date']].groupby('fullVisitorId')['date'].shift(-1)
).astype(np.int64) // 1e9 // 60 // 60
# delete the "date" feature
train_df.drop('date', axis = 1, inplace = True)
test_df.drop('date', axis = 1, inplace = True)
print('done')
done
train_df["totals.transactionRevenue"].fillna(0, inplace=True)
# test_id will be used in the end when generating the submission file.
test_id = test_df["fullVisitorId"].values
# define the categorical features
cat_cols = ["channelGrouping", "device.browser",
"device.deviceCategory", "device.operatingSystem",
"geoNetwork.city", "geoNetwork.continent",
"geoNetwork.country", "geoNetwork.metro",
"geoNetwork.networkDomain", "geoNetwork.region",
"geoNetwork.subContinent", "trafficSource.adContent",
"trafficSource.adwordsClickInfo.adNetworkType",
"trafficSource.adwordsClickInfo.gclId",
"trafficSource.adwordsClickInfo.page",
"trafficSource.adwordsClickInfo.slot", "trafficSource.campaign",
"trafficSource.keyword", "trafficSource.medium",
"trafficSource.referralPath", "trafficSource.source",
'trafficSource.adwordsClickInfo.isVideoAd', 'trafficSource.isTrueDirect',
'day_of_week', 'hour', 'day_of_month', 'month']
# label encode the categorical features
for col in cat_cols:
print(col)
lbl = preprocessing.LabelEncoder()
lbl.fit(list(train_df[col].values.astype('str')) + list(test_df[col].values.astype('str')))
train_df[col] = lbl.transform(list(train_df[col].values.astype('str')))
test_df[col] = lbl.transform(list(test_df[col].values.astype('str')))
# define categorical values as 'categorical'. Might be a bit redundant, but I want to make sure that they are defined as such.
for c in cat_cols:
train_df[c] = train_df[c].astype('category')
test_df[c] = test_df[c].astype('category')
# make sure that the numerical columns are float type
num_cols = ["totals.hits", "totals.pageviews", "visitNumber", "visitStartTime", 'totals.bounces', 'totals.newVisits', 'next_session_1', 'next_session_2']
for col in num_cols:
train_df[col] = train_df[col].astype(float)
test_df[col] = test_df[col].astype(float)
print('done')
channelGrouping device.browser device.deviceCategory device.operatingSystem geoNetwork.city geoNetwork.continent geoNetwork.country geoNetwork.metro geoNetwork.networkDomain geoNetwork.region geoNetwork.subContinent trafficSource.adContent trafficSource.adwordsClickInfo.adNetworkType trafficSource.adwordsClickInfo.gclId trafficSource.adwordsClickInfo.page trafficSource.adwordsClickInfo.slot trafficSource.campaign trafficSource.keyword trafficSource.medium trafficSource.referralPath trafficSource.source trafficSource.adwordsClickInfo.isVideoAd trafficSource.isTrueDirect day_of_week hour day_of_month month done
Majority of visits don't result in a purchase and performing such classification first might help us remove the instances of visits with very low probability of visit. For this purpose, I will identify the visitors based on their 'fullVisitorId' and see if they have ever made even a single purchase. Then I will add a new identifier 'purchasing_customer' to each row. 0 if this customer has never made a purchase and 1 if they have. Then I will use this feature for the classification task. Next, I will have to use the results of the classification to predict the feature for the test set (since we don't know who makes a purchase in the test set).
# calculate the total revenue per visitor. I'll use this later to see if a visitor has ever made a purchase
total_purchase_df = train_df[['fullVisitorId','totals.transactionRevenue']].groupby('fullVisitorId').sum()
total_purchase_df = total_purchase_df['totals.transactionRevenue'].to_dict()
# create a feature purchasing_customer.
# 1 indicates that the visitor has made a purchase at least once. 0 indicates no purchase ever.
train_df['purchasing_customer'] = train_df["fullVisitorId"].apply(lambda x: 1 if total_purchase_df[x] > 0 else 0)
print("there are ", train_df['purchasing_customer'].sum(), "entres for customers who made a purchase at some point")
there are 61319 entres for customers who made a purchase at some point
Parameters for grid search and the fixed parameters need to be in separate dictionaries. For the parameters for which I already performed the grid search, I will put the tested options next to them as comments so that I don't need to test them again.
# Set fixed params
params_classification = {#"objective" : "regression",
"metric" : "rmse",
'max_depth' : -1,
'max_bin': 512,
"min_child_samples" : 100,
"bagging_fraction" : 0.7,
"feature_fraction" : 0.5,
"bagging_frequency" : 5,
"bagging_seed" : 2018,
"verbosity" : -1}
# Create parameters to search
gridParams_classification = {
'learning_rate': [0.1], #0.1, 0.05, 0.2
'n_estimators': [100], # 70, 50
'num_leaves': [70], # 30, 50
'colsample_bytree': [0.3], # 0.1, 1
'subsample': [0.3], # 0.1, 1
'reg_alpha': [1.4], # 1, 1.2
'reg_lambda': [1], # 1.2, 1.4
}
X_train_df = train_df[cat_cols + num_cols]
y_train_df = train_df['purchasing_customer']
# X_test_df will be used to predict the 'purchasing_customer' feature for the test set.
X_test_df = test_df[cat_cols + num_cols]
# Create classifier to use. Note that parameters have to be input manually, not as a dict!
mdl_classifier = lgb.LGBMClassifier(
objective = 'binary',
n_jobs = 3,
silent = True,
max_depth = params_classification['max_depth'],
max_bin = params_classification['max_bin'],
feature_fraction = params_classification['feature_fraction'],
min_child_samples = params_classification['min_child_samples'],
bagging_fraction = params_classification['bagging_fraction']
)
# Create the grid search with 4 fold cross validation
grid_classification = GridSearchCV(mdl_classifier, gridParams_classification,
verbose=0,
cv=5,
n_jobs=2)
# Run the grid
grid_classification.fit(X_train_df, y_train_df)
# Print the best parameters found
print(grid_classification.best_params_)
print(grid_classification.best_score_)
{'colsample_bytree': 0.3, 'learning_rate': 0.1, 'n_estimators': 100, 'num_leaves': 70, 'reg_alpha': 1.4, 'reg_lambda': 1, 'subsample': 0.3} 0.9701522591853949
# check the quality of calculation
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(y_train_df, grid_classification.predict_proba(X_train_df)[:,1]))
print('rms = ', rms)
from sklearn.metrics import accuracy_score
accur = accuracy_score(y_train_df, grid_classification.predict(X_train_df))
print('accuracy = ', accur)
rms = 0.14208548895989545 accuracy = 0.9724708883551665
C:\Users\Leo\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty. if diff:
train_df['proba_prediction_train'] = grid_classification.predict_proba(X_train_df)[:,1]
test_df['proba_prediction_train'] = grid_classification.predict_proba(X_test_df)[:,1]
I will use the results of the classification task in two ways. First, I will use the new feature of 'proba_prediction_train' which predicts the probability of purchase for each row. Second, I will used this feature additionally to filter out the rows with very low probability of purchase. I do not know if such action will be useful, or what is the optimal cut off level for such filter, thus I will test several cut off levels and see if such treatment improves the score.
Similar to the classification task, I will employ the grid search for the regression task as well.
# Set params
params_regression = {"objective" : "regression",
"metric" : "rmse",
'max_depth' : -1,
'max_bin': 512,
"min_child_samples" : 100,
"bagging_fraction" : 0.7,
"feature_fraction" : 0.5,
"bagging_frequency" : 5,
"bagging_seed" : 2018,
"verbosity" : -1}
# Create parameters to search
gridParams_regression = {
'learning_rate': [0.10], # 0.15, 0.10
'n_estimators': [70],
'num_leaves': [50],
'colsample_bytree': [0.3],
'subsample': [0.3],
'reg_alpha': [1], #[1,1.2, 1.4],
'reg_lambda': [1], #[1,1.2,1.4]
}
# Create classifier to use. Note that parameters have to be input manually
# not as a dict!
mdl_regression = lgb.LGBMRegressor(
objective = 'regression',
n_jobs = 3, # Updated from 'nthread'
silent = True,
max_depth = params_regression['max_depth'],
max_bin = params_regression['max_bin'],
feature_fraction = params_regression['feature_fraction'],
min_child_samples = params_regression['min_child_samples'],
bagging_fraction = params_regression['bagging_fraction']
)
# cuot off levels for the probability of making a purchase
# 0.0 will indicate that no data will be cut off
probability_cutoff_list = [0.0, 0.01, 0.05, 0.1, 0.25]
grid_regression = []
for probability_cutoff in probability_cutoff_list:
X_train_df = train_df[train_df['proba_prediction_train']>probability_cutoff][cat_cols + num_cols + ['proba_prediction_train']]
y_train_df = np.log1p(train_df[train_df['proba_prediction_train']>probability_cutoff]["totals.transactionRevenue"].values)
X_test_df = test_df[test_df['proba_prediction_train']>probability_cutoff][cat_cols + num_cols + ['proba_prediction_train']]
# Create the grid search with 4 fold cross validation
grid = GridSearchCV(mdl_regression, gridParams_regression,
verbose=0,
cv=5,
n_jobs=2)
# Run the grid
grid.fit(X_train_df, y_train_df)
# add results of grid search to the list of grid_regression
grid_regression = grid_regression + [grid]
# Print the best parameters found
print(grid.best_params_)
print(grid.best_score_)
{'colsample_bytree': 0.3, 'learning_rate': 0.1, 'n_estimators': 70, 'num_leaves': 50, 'reg_alpha': 1, 'reg_lambda': 1, 'subsample': 0.3} 0.366094783813078 {'colsample_bytree': 0.3, 'learning_rate': 0.1, 'n_estimators': 70, 'num_leaves': 50, 'reg_alpha': 1, 'reg_lambda': 1, 'subsample': 0.3} 0.3433724709301465 {'colsample_bytree': 0.3, 'learning_rate': 0.1, 'n_estimators': 70, 'num_leaves': 50, 'reg_alpha': 1, 'reg_lambda': 1, 'subsample': 0.3} 0.335454834793921 {'colsample_bytree': 0.3, 'learning_rate': 0.1, 'n_estimators': 70, 'num_leaves': 50, 'reg_alpha': 1, 'reg_lambda': 1, 'subsample': 0.3} 0.33146897589710855 {'colsample_bytree': 0.3, 'learning_rate': 0.1, 'n_estimators': 70, 'num_leaves': 50, 'reg_alpha': 1, 'reg_lambda': 1, 'subsample': 0.3} 0.3310650060859367
# predict revenue for the test data used in the regression test.
# Since the 0.0 cut off level gives the best results, the whole test set will be used.
X_test_df['prediction'] = grid_regression[0].predict(X_test_df)
# following lines are technically not necessary since I ended up using the 0.0 cut off level, but I'll still keep them.
test_df['prediction'] = 0
test_df.loc[X_test_df.index.values,'prediction'] = X_test_df[['prediction']]
# replace all the negative predictions with 0, as there can be no negative value
test_df['prediction'] = test_df['prediction'].apply(lambda x: x if x > 0 else 0)
# Load submission template file
submission = pd.read_csv('sample_submission_v2.csv')
submission.head()
C:\Users\Leo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2785: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
fullVisitorId | PredictedLogRevenue | |
---|---|---|
0 | 0000018966949534117 | 0.0 |
1 | 0000039738481224681 | 0.0 |
2 | 0000073585230191399 | 0.0 |
3 | 0000087588448856385 | 0.0 |
4 | 0000149787903119437 | 0.0 |
sub_df = pd.DataFrame({"fullVisitorId":test_id})
sub_df["PredictedLogRevenue"] = np.expm1(test_df['prediction'].values)
sub_df = sub_df.groupby("fullVisitorId")["PredictedLogRevenue"].sum().reset_index()
sub_df.columns = ["fullVisitorId", "PredictedLogRevenue"]
sub_df["PredictedLogRevenue"] = np.log1p(sub_df["PredictedLogRevenue"])
now = datetime.datetime.now()
now = now.strftime("%Y-%m-%d")
sub_df.to_csv("predict_revenues_" + now + ".csv", index=False)
print('Finished')
Finished
In this notebook, I successfully used the LGB classification and regression models to predict the future revenue for the users. There are number of additional methods we could use to improve prediction by our model.