Prediction of online shoppers’ purchasing intention

by Georgy Lazarev (mlcourse slackname: jorgy)

As the title suggests, the task is to predict whether a user intends to make a purchase in an online shop. Data for this project can be found here.

Dataset and features description

We have a binary classification problem: predicting whether the user intends to finalize a transaction. This dataset was originally used in research that attempted to build a system of two modules. The first module determines the visitor's likelihood of leaving the site; if that probability is higher than a set threshold, the second module predicts whether this person has commercial intention. As the authors of the paper state, the data is real and was collected and provided by a retailer. A company might be interested in a system that can, in real time, make a special offer to a client with positive commercial intention.
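The two-module idea can be sketched roughly as follows. Everything here is a hypothetical illustration: the two "models" are stand-in functions and the 0.7 threshold is arbitrary, not the paper's actual system.

```python
# Sketch of the two-module cascade: module 1 estimates the probability the
# visitor will leave; only if that exceeds a threshold does module 2 decide
# whether the visitor has commercial intention (and so deserves an offer).

def flag_for_offer(session, leave_proba, has_intent, threshold=0.7):
    if leave_proba(session) < threshold:
        return False               # visitor likely stays: no offer needed
    return has_intent(session)    # about to leave: check purchase intent

# Toy stand-ins: a "session" is just a dict of features.
leave_proba = lambda s: 0.9 if s['ExitRates'] > 0.1 else 0.2
has_intent = lambda s: s['PageValues'] > 0

print(flag_for_offer({'ExitRates': 0.15, 'PageValues': 12.0}, leave_proba, has_intent))  # True
print(flag_for_offer({'ExitRates': 0.02, 'PageValues': 12.0}, leave_proba, has_intent))  # False
```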

The data was formed in such a way that each session corresponds to a different user over a 1-year period, to avoid any tendency toward a specific campaign or user profile. The target variable is called 'Revenue' and takes two values, 0 and 1, indicating whether or not the session ended with a purchase. There are 10 numeric and 7 categorical features:

Numeric:

The first six features were derived from the URL information of the pages visited by the user. They were updated each time the visitor moved from one page to another, until the end of the session.

  • Administrative - Number of account-management pages visited by the user
  • Administrative duration - Total time (in seconds) spent by the visitor on administrative pages
  • Informational - Number of pages visited in the session about the Web site, and communication and address information of the shopping site
  • Informational duration - Time (in seconds) spent on informational pages
  • Product related - Number of product-related pages visited
  • Product related duration - Time (in seconds) spent on product-related pages

The next three features were measured by Google Analytics for each page of the online-shop website:

  • Bounce rate - Average bounce rate of the pages visited by the visitor. The bounce rate of a page is the percentage of visitors who enter the site from that page and then leave
  • Exit rate - Average exit rate of the pages visited by the visitor. The exit rate of a page is the percentage of all views of that page that were the last in the session
  • Page value - Average page value of the pages visited. Indicates how valuable a specific page is to the shop owner in monetary terms
  • Special day - Closeness of the visit time to a special day. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and delivery date. For Valentine's Day, this value takes a nonzero value between February 2 and February 12, zero before and after this range unless it is close to another special day, and its maximum value of 1 on February 8.

Categorical:

  • OperatingSystems - Operating system of the visitor
  • Browser - Browser of the visitor
  • Region - Geographic region from which the session has been started by the visitor
  • TrafficType - Traffic source by which the visitor has arrived at the Web site (e.g., banner, SMS, direct)
  • VisitorType - Whether the visitor is new or returning (or not specified)
  • Weekend - Boolean value indicating whether the date of the visit is a weekend
  • Month - Month value of the visit date


Exploratory data analysis

In [466]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
In [467]:
#load data
df=pd.read_csv('online_shoppers_intention (1).csv')
In [468]:
df.shape
Out[468]:
(12330, 18)

Let's look at dataset:

In [469]:
df.head()
Out[469]:
Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues SpecialDay Month OperatingSystems Browser Region TrafficType VisitorType Weekend Revenue
0 0 0.0 0 0.0 1 0.000000 0.20 0.20 0.0 0.0 Feb 1 1 1 1 Returning_Visitor False False
1 0 0.0 0 0.0 2 64.000000 0.00 0.10 0.0 0.0 Feb 2 2 1 2 Returning_Visitor False False
2 0 0.0 0 0.0 1 0.000000 0.20 0.20 0.0 0.0 Feb 4 1 9 3 Returning_Visitor False False
3 0 0.0 0 0.0 2 2.666667 0.05 0.14 0.0 0.0 Feb 3 2 2 4 Returning_Visitor False False
4 0 0.0 0 0.0 10 627.500000 0.02 0.05 0.0 0.0 Feb 3 3 1 4 Returning_Visitor True False
In [470]:
df.columns
Out[470]:
Index(['Administrative', 'Administrative_Duration', 'Informational',
       'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
       'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month',
       'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType',
       'Weekend', 'Revenue'],
      dtype='object')
In [471]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
Administrative             12330 non-null int64
Administrative_Duration    12330 non-null float64
Informational              12330 non-null int64
Informational_Duration     12330 non-null float64
ProductRelated             12330 non-null int64
ProductRelated_Duration    12330 non-null float64
BounceRates                12330 non-null float64
ExitRates                  12330 non-null float64
PageValues                 12330 non-null float64
SpecialDay                 12330 non-null float64
Month                      12330 non-null object
OperatingSystems           12330 non-null int64
Browser                    12330 non-null int64
Region                     12330 non-null int64
TrafficType                12330 non-null int64
VisitorType                12330 non-null object
Weekend                    12330 non-null bool
Revenue                    12330 non-null bool
dtypes: bool(2), float64(7), int64(7), object(2)
memory usage: 1.4+ MB

There is no missing data in the dataset.

Now let's look at the distribution of the target variable:

In [472]:
sns.countplot(df.Revenue)
Out[472]:
<matplotlib.axes._subplots.AxesSubplot at 0x17fae510>
In [473]:
df.Revenue.value_counts(normalize=True)
Out[473]:
False    0.845255
True     0.154745
Name: Revenue, dtype: float64

It seems we are dealing with somewhat imbalanced classes: most visitors leave the shop website without purchasing anything, which is not surprising.

The target variable will be converted to binary type.
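One way to do this conversion, sketched on a small toy frame standing in for the real df:

```python
import pandas as pd

# Toy frame mimicking the dataset: 'Revenue' is stored as a boolean.
toy = pd.DataFrame({'Revenue': [False, True, False]})

# Booleans map cleanly onto 0/1 integers.
toy['Revenue'] = toy['Revenue'].astype(int)
print(toy['Revenue'].tolist())  # [0, 1, 0]
```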

In [474]:
#list of numeric features
num_feats=['Administrative', 'Administrative_Duration', 'Informational',
       'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
       'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay']
In [475]:
df[num_feats].describe()
Out[475]:
Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues SpecialDay
count 12330.000000 12330.000000 12330.000000 12330.000000 12330.000000 12330.000000 12330.000000 12330.000000 12330.000000 12330.000000
mean 2.315166 80.818611 0.503569 34.472398 31.731468 1194.746220 0.022191 0.043073 5.889258 0.061427
std 3.321784 176.779107 1.270156 140.749294 44.475503 1913.669288 0.048488 0.048597 18.568437 0.198917
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 7.000000 184.137500 0.000000 0.014286 0.000000 0.000000
50% 1.000000 7.500000 0.000000 0.000000 18.000000 598.936905 0.003112 0.025156 0.000000 0.000000
75% 4.000000 93.256250 0.000000 0.000000 38.000000 1464.157213 0.016813 0.050000 0.000000 0.000000
max 27.000000 3398.750000 24.000000 2549.375000 705.000000 63973.522230 0.200000 0.200000 361.763742 1.000000

As we can see, the numerical features are on very different scales, so we will certainly scale them.
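The scaling itself could look like this. The matrix below is made-up toy data, with two columns mimicking e.g. Administrative (small page counts) vs ProductRelated_Duration (thousands of seconds); sklearn's StandardScaler performs the same operation and additionally remembers the statistics for transforming a test set.

```python
import numpy as np

# Two features on wildly different scales.
X = np.array([[1.0, 100.0],
              [2.0, 2000.0],
              [3.0, 60000.0]])

# Standard scaling: subtract the column mean, divide by the column std,
# so every feature ends up with zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```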

Now let's look at categorical features:

In [476]:
cat_feats=['Month','OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType','Weekend']
In [477]:
df[cat_feats].head()
Out[477]:
Month OperatingSystems Browser Region TrafficType VisitorType Weekend
0 Feb 1 1 1 1 Returning_Visitor False
1 Feb 2 2 1 2 Returning_Visitor False
2 Feb 4 1 9 3 Returning_Visitor False
3 Feb 3 2 2 4 Returning_Visitor False
4 Feb 3 3 1 4 Returning_Visitor True

As we see, some features are already label-encoded, while others are still strings. Weekend will be converted to binary.
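These conversions could be sketched as follows, again on a toy frame standing in for df; `pd.get_dummies` one-hot encodes the string categoricals.

```python
import pandas as pd

# Toy frame mimicking the dataset: string categoricals plus a boolean.
toy = pd.DataFrame({'Month': ['Feb', 'Mar', 'Feb'],
                    'VisitorType': ['Returning_Visitor', 'New_Visitor', 'Other'],
                    'Weekend': [False, True, False]})

# Boolean -> 0/1; string categories -> dummy indicator columns.
toy['Weekend'] = toy['Weekend'].astype(int)
toy = pd.get_dummies(toy, columns=['Month', 'VisitorType'])

print(sorted(toy.columns))
```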

In [478]:
df[cat_feats].astype('category').describe()
Out[478]:
Month OperatingSystems Browser Region TrafficType VisitorType Weekend
count 12330 12330 12330 12330 12330 12330 12330
unique 10 8 13 9 20 3 2
top May 2 2 1 2 Returning_Visitor False
freq 3364 6601 7961 4780 3913 10551 9462

There are two interesting observations here: the number of months present and the number of visitor types.

In [479]:
df.Month.unique()
Out[479]:
array(['Feb', 'Mar', 'May', 'Oct', 'June', 'Jul', 'Aug', 'Nov', 'Sep',
       'Dec'], dtype=object)

January and April are missing.

In [480]:
df.VisitorType.unique()
Out[480]:
array(['Returning_Visitor', 'New_Visitor', 'Other'], dtype=object)

'Other'? Let's see how many such values there are in our dataset:

In [481]:
df.VisitorType.value_counts()
Out[481]:
Returning_Visitor    10551
New_Visitor           1694
Other                   85
Name: VisitorType, dtype: int64

That category makes little sense on its own, though. We'll get back to it later.

In [482]:
df.groupby('VisitorType')['Revenue'].mean()
Out[482]:
VisitorType
New_Visitor          0.249115
Other                0.188235
Returning_Visitor    0.139323
Name: Revenue, dtype: float64

A bit surprising. I expected the percentage of potentially profitable clients to be higher among returning visitors than among new ones.

In [483]:
sum(df.loc[df.Revenue==1].Administrative==0)
Out[483]:
514
In [484]:
sum(df.loc[df.Revenue==1].Informational==0)
Out[484]:
1295
In [485]:
sum(df.loc[df.Revenue==1].ProductRelated==0)
Out[485]:
6

That makes sense: only six people made a purchase without visiting any product-related pages.

In [486]:
(df.Administrative==0).sum()
Out[486]:
5768
In [487]:
(df.Administrative_Duration==0).sum()
Out[487]:
5903

So there were cases where the number of pages visited was greater than 0 but the time spent was 0.

In [488]:
df.loc[df.Administrative>0].loc[df.Administrative_Duration==0].Administrative.value_counts()
Out[488]:
1    131
2      4
Name: Administrative, dtype: int64

So it does happen: 135 sessions opened one or two administrative pages yet have zero recorded duration (perhaps the visits were too brief to register).

The Special day feature shows closeness to... special days, right. We might expect this feature to positively affect the target variable.

In [489]:
df.loc[df.SpecialDay>0].Revenue.value_counts(normalize=True)
Out[489]:
False    0.938449
True     0.061551
Name: Revenue, dtype: float64

How come? That's again not what I expected.

In [490]:
df[['Revenue','SpecialDay']].corr()
Out[490]:
Revenue SpecialDay
Revenue 1.000000 -0.082305
SpecialDay -0.082305 1.000000

That's actually strange: the correlation is even slightly negative.

Primary visual data analysis

Here is the pairwise Pearson correlation of the numerical features:

In [491]:
corrl=num_feats.copy()
corrl.append('Revenue')
In [492]:
sns.heatmap(df[corrl].corr())
Out[492]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f7c5950>

Yes, some features are indeed highly correlated!

In [376]:
df[['ProductRelated', 'ProductRelated_Duration','BounceRates', 'ExitRates']].corr()
Out[376]:
ProductRelated ProductRelated_Duration BounceRates ExitRates
ProductRelated 1.000000 0.860927 -0.204578 -0.292526
ProductRelated_Duration 0.860927 1.000000 -0.184541 -0.251984
BounceRates -0.204578 -0.184541 1.000000 0.913004
ExitRates -0.292526 -0.251984 0.913004 1.000000
In [377]:
fig, axes = plt.subplots(ncols=4, nrows = 2, figsize=(24, 18))
for i in range(len(cat_feats)):
    sns.countplot(df[cat_feats[i]],ax=axes[i//4, i%4])

Well, it's difficult to draw any concrete conclusions from this plot; each group has its clear leaders. Now let's explore some features a bit more with respect to the target variable:

In [378]:
sns.countplot(df.Weekend,hue=df.Revenue)
Out[378]:
<matplotlib.axes._subplots.AxesSubplot at 0x1dbe1230>

plt.figure(figsize=(15,15))
plt.subplot(321)
df.groupby('Month').Revenue.mean().plot.bar()
plt.subplot(322)
df.groupby('Browser').Revenue.mean().plot.bar()
plt.subplot(323)
df.groupby('TrafficType').Revenue.mean().plot.bar()
plt.subplot(324)
df.groupby('OperatingSystems').Revenue.mean().plot.bar()

The percentage of visitors who made purchases in November seems a bit higher compared to other months. In February there was a small number of visitors, and very few of them ended up buying something; perhaps poor advertising or pricing policy was the reason.

As for the other features, the distribution of session outcomes looks consistent. These results are hard to interpret, since the feature values are already label-encoded and we don't know the real meanings behind them.

In [295]:
tmp=['Revenue','Administrative_Duration','Informational_Duration','ProductRelated_Duration','BounceRates','ExitRates','PageValues']
In [32]:
r=['Revenue','Administrative','Administrative_Duration','Informational','Informational_Duration','ProductRelated','ProductRelated_Duration','BounceRates','ExitRates','PageValues']
In [33]:
sns.pairplot(df[r],hue='Revenue',diag_kind='hist')
Out[33]:
<seaborn.axisgrid.PairGrid at 0x180081f0>

In general, the trends here make sense. Lower Bounce and Exit Rates correspond to more frequent transactions. On the other hand, higher PageValues do not always lead to a purchase. Also, in most distributions and pairplots related to website pages we see cases where a visitor spent a lot of time on the website but still left without purchasing; that happens in real life too. So as for outliers, I'd assume there are none.

In [34]:
plt.figure(figsize=(10,20))
for i in range(len(num_feats)):
    plt.subplot(len(num_feats), 1, i + 1)
    sns.distplot(df[num_feats[i]])