mlcourse.ai – Open Machine Learning Course

Author: Syrovatskiy Ilya Igorevich, ODS Slack nickname : bokomaru

Individual data analysis project "Rossmann store sales"

Research plan

 - 1. Dataset and features description
 - 2. Primary data analysis
 - 3. Primary visual analysis of the features
 - 4. Patterns, insights, peculiarities of data
 - 5. Metrics selection
 - 6. Model selection
 - 7. Data preprocessing
 - 8. Cross-validation and adjustment of model hyperparameters
 - 9. Creation of new features and description of this process
 - 10. Plotting training and validation curves 
 - 11. Prediction for test or hold-out samples
 - 12. Conclusions

Part 1. Dataset and features description

Abstract:

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

In their first Kaggle competition, Rossmann is challenging us to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation.

https://www.kaggle.com/c/rossmann-store-sales

We are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

Files :

Downloaded from: https://www.kaggle.com/c/rossmann-store-sales/data

train.csv - historical data including Sales

test.csv - historical data excluding Sales

sample_submission.csv - a sample submission file in the correct format

store.csv - supplemental information about the stores

Some preliminary thoughts :

A sales forecast is a tool that can help almost any company. Many companies rely on human forecasts of inconsistent quality, while others use standard tools that are not flexible enough for their needs. This competition gives researchers the opportunity to build their own solution that genuinely improves the business.

Since the data consists of daily sales indexed by date, we are essentially working with time series, so it will be useful to extract date/time features for further analysis; a minimal sketch of such an extraction is shown below.
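(This is only a sketch; the helper name add_date_features is my own, and it assumes a DataFrame indexed by a DatetimeIndex, which is how train and test are loaded in Part 2. The resulting Year/Month/Day/WeekOfYear columns show up in the tables later on.)

def add_date_features(df):
    # assumes df is indexed by a DatetimeIndex (as train/test are after loading)
    df['Year'] = df.index.year
    df['Month'] = df.index.month
    df['Day'] = df.index.day
    df['WeekOfYear'] = df.index.weekofyear   # use df.index.isocalendar().week in newer pandas
    return df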

Input variables :

Data fields

Self-explanatory fields :

  1. Date
  1. DayOfWeek

The following fields have descriptions :

  1. Id - an Id that represents a (Store, Date) duple within the test set
  1. Store - a unique Id for each store
  1. Open - an indicator for whether the store was open: 0 = closed, 1 = open
  1. StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
  1. SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
  1. StoreType - differentiates between 4 different store models: a, b, c, d
  1. Assortment - describes an assortment level: a = basic, b = extra, c = extended
  1. CompetitionDistance - distance in meters to the nearest competitor store
  1. CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
  1. Promo - indicates whether a store is running a promo on that day
  1. Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
  1. Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
  1. PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

Output variable (desired target):

  1. Sales - the turnover for any given day

We also have a variable that is obviously highly correlated with the target:

  1. Customers - the number of customers on a given day

Evaluation:

Submissions are evaluated on the Root Mean Square Percentage Error (RMSPE):

$$\textrm{RMSPE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2}$$

where $y_i$ denotes the sales of a single store on a single day and $\hat{y}_i$ denotes the corresponding prediction. Any day and store with 0 sales is ignored in scoring.

It is also worth noting that closed stores obviously have no sales on those days, so we could simply put 0 in the test target for the rows where the store is closed. But since the competition description states that sales of closed stores are ignored in the final score, we will not pay special attention to these rows in the test data.
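For reference, here is a minimal sketch of the metric in NumPy (the function name rmspe is my own, not part of any competition kit); rows with zero sales are masked out, exactly as the scoring rules describe:

import numpy as np

def rmspe(y_true, y_pred):
    # Root Mean Square Percentage Error; rows with zero true sales are ignored
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2))

# quick sanity check: the zero-sales row is ignored
print(rmspe([100, 200, 0], [110, 190, 50]))   # ≈ 0.079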

Part 2. Primary data analysis

Loading packages :

In [329]:
#system
import os
import sys
import  gc
import warnings
warnings.filterwarnings("ignore")


# basic
import numpy as np
import pandas as pd
from pandas import datetime
import pickle
import scipy
import math
import collections
import sklearn


# data preproc
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import itertools


# statistics
import scipy.stats as st
from statsmodels.distributions.empirical_distribution import ECDF
#from sklearn import metrics
from sklearn.metrics import mean_squared_error, make_scorer


# models
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV, validation_curve
from sklearn.model_selection import KFold, learning_curve
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

Importing:

In [2]:
# Path to our data
PATH_TO_alldata = os.path.join(os.path.abspath(os.curdir), 'all_data')

# importing train and test data
initial_train = pd.read_csv(os.path.join(PATH_TO_alldata, "train.csv"), parse_dates = True, index_col = 'Date')
train = initial_train.copy()
initial_test = pd.read_csv(os.path.join(PATH_TO_alldata, "test.csv"), parse_dates = True, index_col = 'Date')
test = initial_test.copy()

# additional store data
store = pd.read_csv(os.path.join(PATH_TO_alldata, "store.csv"))
# As we did with train and test, create a copy of the initial frame and work with it:
initial_store = store.copy()

# reading sample_submission 
sample_submission = pd.read_csv(os.path.join(PATH_TO_alldata, "sample_submission.csv")) 

Main info about our data:

In [3]:
print("\n----------------------------train ----------------------------\n")
train.info()
print("\n----------------------------store ----------------------------\n")
store.info()
print("\n----------------------------test ----------------------------\n")
test.info()
----------------------------train ----------------------------

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1017209 entries, 2015-07-31 to 2013-01-01
Data columns (total 8 columns):
Store            1017209 non-null int64
DayOfWeek        1017209 non-null int64
Sales            1017209 non-null int64
Customers        1017209 non-null int64
Open             1017209 non-null int64
Promo            1017209 non-null int64
StateHoliday     1017209 non-null object
SchoolHoliday    1017209 non-null int64
dtypes: int64(7), object(1)
memory usage: 69.8+ MB

----------------------------store ----------------------------

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
Store                        1115 non-null int64
StoreType                    1115 non-null object
Assortment                   1115 non-null object
CompetitionDistance          1112 non-null float64
CompetitionOpenSinceMonth    761 non-null float64
CompetitionOpenSinceYear     761 non-null float64
Promo2                       1115 non-null int64
Promo2SinceWeek              571 non-null float64
Promo2SinceYear              571 non-null float64
PromoInterval                571 non-null object
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB

----------------------------test ----------------------------

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 41088 entries, 2015-09-17 to 2015-08-01
Data columns (total 7 columns):
Id               41088 non-null int64
Store            41088 non-null int64
DayOfWeek        41088 non-null int64
Open             41077 non-null float64
Promo            41088 non-null int64
StateHoliday     41088 non-null object
SchoolHoliday    41088 non-null int64
dtypes: float64(1), int64(5), object(1)
memory usage: 2.5+ MB
In [4]:
print("Number of Rows TRAIN:", train.shape[0])
print("Number of Columns TRAIN:", train.shape[1])
print("---------------------------")
print("Number of Rows TEST:", test.shape[0])
print("Number of Columns TEST:", test.shape[1])
print("---------------------------")
print("Number of Rows STORE:", store.shape[0])
print("Number of Columns STORE:", store.shape[1])
Number of Rows TRAIN: 1017209
Number of Columns TRAIN: 8
---------------------------
Number of Rows TEST: 41088
Number of Columns TEST: 7
---------------------------
Number of Rows STORE: 1115
Number of Columns STORE: 10

Let's look at the heads:

In [5]:
train.head()
Out[5]:
Store DayOfWeek Sales Customers Open Promo StateHoliday SchoolHoliday
Date
2015-07-31 1 5 5263 555 1 1 0 1
2015-07-31 2 5 6064 625 1 1 0 1
2015-07-31 3 5 8314 821 1 1 0 1
2015-07-31 4 5 13995 1498 1 1 0 1
2015-07-31 5 5 4822 559 1 1 0 1
In [6]:
test.head()
Out[6]:
Id Store DayOfWeek Open Promo StateHoliday SchoolHoliday
Date
2015-09-17 1 1 4 1.0 1 0 0
2015-09-17 2 3 4 1.0 1 0 0
2015-09-17 3 7 4 1.0 1 0 0
2015-09-17 4 8 4 1.0 1 0 0
2015-09-17 5 9 4 1.0 1 0 0
In [7]:
store.head()
Out[7]:
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
1 2 a a 570.0 11.0 2007.0 1 13.0 2010.0 Jan,Apr,Jul,Oct
2 3 a a 14130.0 12.0 2006.0 1 14.0 2011.0 Jan,Apr,Jul,Oct
3 4 c c 620.0 9.0 2009.0 0 NaN NaN NaN
4 5 a a 29910.0 4.0 2015.0 0 NaN NaN NaN

Some basic statistics:

In [8]:
train.describe()
Out[8]:
Store DayOfWeek Sales Customers Open Promo SchoolHoliday
count 1.017209e+06 1.017209e+06 1.017209e+06 1.017209e+06 1.017209e+06 1.017209e+06 1.017209e+06
mean 5.584297e+02 3.998341e+00 5.773819e+03 6.331459e+02 8.301067e-01 3.815145e-01 1.786467e-01
std 3.219087e+02 1.997391e+00 3.849926e+03 4.644117e+02 3.755392e-01 4.857586e-01 3.830564e-01
min 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 2.800000e+02 2.000000e+00 3.727000e+03 4.050000e+02 1.000000e+00 0.000000e+00 0.000000e+00
50% 5.580000e+02 4.000000e+00 5.744000e+03 6.090000e+02 1.000000e+00 0.000000e+00 0.000000e+00
75% 8.380000e+02 6.000000e+00 7.856000e+03 8.370000e+02 1.000000e+00 1.000000e+00 0.000000e+00
max 1.115000e+03 7.000000e+00 4.155100e+04 7.388000e+03 1.000000e+00 1.000000e+00 1.000000e+00

Counting the number of NaNs in each column of each DataFrame:

In [5]:
def nans_counter(df):
    for column in df.columns.values:
        print(column,  "NANs : ", pd.isnull(df.loc[:, column]).sum())
    
In [6]:
print("\n TRAIN ")
nans_counter(train)
print("\n TEST ")
nans_counter(test)
print("\n STORE ")
nans_counter(store)
 TRAIN 
Store NANs :  0
DayOfWeek NANs :  0
Sales NANs :  0
Customers NANs :  0
Open NANs :  0
Promo NANs :  0
StateHoliday NANs :  0
SchoolHoliday NANs :  0

 TEST 
Id NANs :  0
Store NANs :  0
DayOfWeek NANs :  0
Open NANs :  11
Promo NANs :  0
StateHoliday NANs :  0
SchoolHoliday NANs :  0

 STORE 
Store NANs :  0
StoreType NANs :  0
Assortment NANs :  0
CompetitionDistance NANs :  3
CompetitionOpenSinceMonth NANs :  354
CompetitionOpenSinceYear NANs :  354
Promo2 NANs :  0
Promo2SinceWeek NANs :  544
Promo2SinceYear NANs :  544
PromoInterval NANs :  544

So we see that the train df doesn't have any NaNs, the test df has 11 NaNs in the Open column, and the store df has many NaNs. We will examine and handle these NaNs later.

Now we will consider the influence of different columns on the target variable.

First, let's take the Open column and look at the additional information in SchoolHoliday and StateHoliday.

In [11]:
print("Over entire train period, {} - is the number of times that different stores closed on given days."
       .format(train[(train.Open == 0)].count()[0]), "\n")

print("{} - is the number of closed stores because of a school holiday. "
       .format(train[(train.Open == 0) &
                     (train.SchoolHoliday == 1) &
                     (train.StateHoliday == '0')]
               .count()[0]), "\n")

print("{} - is the number of closed stores  because of either a bank holiday or easter or christmas."
       .format(train[(train.Open == 0) &
                     ((train.StateHoliday == 'a') |
                      (train.StateHoliday == 'b') | 
                      (train.StateHoliday == 'c'))]
               .count()[0]), "\n")

print("{} - is the number of times, when shops were closed on days for no apparent reason when no holiday was announced."
      .format(train[(train.Open == 0) &
                    (train.StateHoliday == "0") &
                    (train.SchoolHoliday == 0)]
              .count()[0]))
Over entire train period, 172817 - is the number of times that different stores closed on given days. 

2263 - is the number of closed stores because of a school holiday.  

30140 - is the number of closed stores  because of either a bank holiday or easter or christmas. 

121482 - is the number of times, when shops were closed on days for no apparent reason when no holiday was announced.

From the description of this task we know that Rossmann stores were sometimes closed for refurbishment. Most probably, the days when stores were closed for no reason given in the dataset correspond to these refurbishments.

Since we don't want to bias our models, we will not keep these exceptions: we will drop the closed-store rows so that the models don't train on them and pick up false guidance.

So we will analyse only open stores, since a closed store yields sales of 0. And, as stated in the beginning, we will not pay any attention to similar rows in the test data.

The distribution of open/closed stores per day of week is shown here:

'Open' with respect to DayOfWeek:

In [161]:
fig, axis = plt.subplots(1,1,figsize=(15,4))
sns.countplot(x='Open',hue='DayOfWeek', data=train,
              palette=sns.color_palette("Set1", n_colors=7), ax=axis)
plt.title("Train Data")
Out[161]:
Text(0.5, 1.0, 'Train Data')
In [7]:
# getting rid of closed stores
train = train[(train.Open != 0) & (train.Sales != 0)]

As we remember, we have 11 NaNs in the Open column of the test data.

So it's time to deal with them. First of all, we need to find out which store has these NaNs. Then we will explore when this store was closed. After that we will decide whether each NaN should be replaced with 0 or not.

In [162]:
fig, axis = plt.subplots(1,1,figsize=(15,4))
sns.countplot(x='Open',hue='DayOfWeek', data=test,
              palette=sns.color_palette("Set1", n_colors=7), ax=axis)
plt.title("Test Data")
Out[162]:
Text(0.5, 1.0, 'Test Data')
In [170]:
test[(test['Open'] != 0) & (test['Open']!=1)]
Out[170]:
Id Store DayOfWeek Open Promo StateHoliday SchoolHoliday Year Month Day WeekOfYear
Date
2015-09-17 480 622 4 NaN 1 0 0 2015 9 17 38
2015-09-16 1336 622 3 NaN 1 0 0 2015 9 16 38
2015-09-15 2192 622 2 NaN 1 0 0 2015 9 15 38
2015-09-14 3048 622 1 NaN 1 0 0 2015 9 14 38
2015-09-12 4760 622 6 NaN 0 0 0 2015 9 12 37
2015-09-11 5616 622 5 NaN 0 0 0 2015 9 11 37
2015-09-10 6472 622 4 NaN 0 0 0 2015 9 10 37
2015-09-09 7328 622 3 NaN 0 0 0 2015 9 9 37
2015-09-08 8184 622 2 NaN 0 0 0 2015 9 8 37
2015-09-07 9040 622 1 NaN 0 0 0 2015 9 7 37
2015-09-05 10752 622 6 NaN 0 0 0 2015 9 5 36
In [174]:
print("Ok, that's only 622 store, that has NANs. Let's consider days with NAN :\n", 
      sorted(test[(test['Open'] != 0) & (test['Open']!=1)].DayOfWeek.unique()))

fig, axis = plt.subplots(1,1,figsize=(15,4))
sns.countplot(x='Open',hue='DayOfWeek', data=test[test['Store']==622],
              palette=sns.color_palette("Set1", n_colors=7), ax=axis)
plt.title("Test Data for store 622")
OK, only store 622 has NaNs. Let's consider the days with NaN:
 [1, 2, 3, 4, 5, 6]
Out[174]:
Text(0.5, 1.0, 'Test Data for store 622')

Looking at the plot of the Open column in the test data for store 622, we see that on days [1..6] this store is always open.

This leads us to fill all the NaNs in this column with 1.

In [8]:
# replacing NANs with OPEN-state
test.Open.fillna(1, inplace=True)

Now our data is free of the useless rows for closed stores.

We can group by date and get the average sales, as well as the percentage change between the current and the prior element:

In [185]:
average_sales    = train.groupby('Date')["Sales"].mean()
pct_change_sales = train.groupby('Date')["Sales"].sum().pct_change()

fig, (axis1,axis2) = plt.subplots(2,1, sharex=True, figsize=(15,8))

# average sales
ax1 = average_sales.plot(legend=True, ax=axis1, marker='p', title="Average Sales")
ax1.set_xticks(range(len(average_sales)))
ax1.set_xticklabels(average_sales.index.tolist(), rotation=90)

# percent change for sales
ax2 = pct_change_sales.plot(legend=True,ax=axis2, marker='o', rot=90, colormap="summer", title="Sales Percent Change")

From these plots we clearly see an increase in sales in the period around New Year, which seems plausible.

The average-sales pattern repeats every year - OK, nothing strange, nothing super interesting.

So what about the distribution of Sales across the days of the week?

In [17]:
fig = plt.figure()
ax1 = fig.add_subplot(131)
sns.barplot(x=train.groupby('DayOfWeek')['Sales'].sum().index, y=train.groupby('DayOfWeek')['Sales'].sum().values)
ax1.set_title("Summing sales over each day")

ax2 = fig.add_subplot(132)
sns.barplot(x=train.groupby('DayOfWeek')['Sales'].mean().index, y=train.groupby('DayOfWeek')['Sales'].mean().values)
ax2.set_title("Mean sales in each day")

ax3 = fig.add_subplot(133)
sns.countplot(train.DayOfWeek)
ax3.set_title("Total count of sales in each day")

plt.rcParams['figure.figsize'] = 18, 7
sns.set(font_scale=1)

plt.show()
In [18]:
sns.boxplot(x='DayOfWeek', y='Sales', data=train, palette="muted");

OK, from the plots we can see that the highest total sales occur on day 1.

On day 7 there are very few sales records (most probably because of closed stores), but the mean sale on this day is one of the highest! So there are two leaders by the mean measure: days 1 and 7.

We can also look at such an interesting variable as Promo (whether a store is running a promo on that day or not).

Below we can see how sales and customers depend on the day of the week and on whether a promo is running that day:

In [76]:
print("\t\t\t----------Dependency of Sales on DayOfWeek with/without Promo--------\n")
print("1nd bar is for day with Promo, 2nd bar is for day without Promo")
fig = plt.figure()
# Bar width
width = 0.4

ax1 = fig.add_subplot(121)
Day_Promo = train[train['Promo'] == 1].groupby('DayOfWeek')['Sales'].sum()
Day_NoPromo = train[train['Promo'] != 1].groupby('DayOfWeek')['Sales'].sum()
Day_Promo.plot(kind='bar', width=width, color='#FF0000', position=1, label='Day_Promo', rot=0)
Day_NoPromo.plot(kind='bar', width=width, color='#0000FF', position=0, label='Day_NoPromo', rot=0)
plt.xlabel('Day of week')
plt.ylabel('sum')
ax1.set_title("Summing Sales over each day")

ax2 = fig.add_subplot(122)
Day_Promo = train[train['Promo'] == 1].groupby('DayOfWeek')['Sales'].mean()
Day_NoPromo = train[train['Promo'] != 1].groupby('DayOfWeek')['Sales'].mean()
Day_Promo.plot(kind='bar', width=width, color='#FF0000', position=1, label='Day_Promo', rot=0)
Day_NoPromo.plot(kind='bar', width=width, color='#0000FF', position=0, label='Day_NoPromo', rot=0)
plt.xlabel('Day of week')
plt.ylabel('mean')
ax2.set_title("Mean Sales in each day")

plt.rcParams['figure.figsize'] = 18, 8
sns.set(font_scale=1)
plt.show()
			----------Dependency of Sales on DayOfWeek with/without Promo--------

1st bar is for day with Promo, 2nd bar is for day without Promo
In [77]:
print("\t\t\t----------Dependency of Customers on DayOfWeek with/without Promo--------\n")
print("1nd bar is for day with Promo, 2nd bar is for day without Promo")
fig = plt.figure()
# Bar width
width = 0.4

ax1 = fig.add_subplot(121)
Day_Promo = train[train['Promo'] == 1].groupby('DayOfWeek')['Customers'].sum()
Day_NoPromo = train[train['Promo'] != 1].groupby('DayOfWeek')['Customers'].sum()
Day_Promo.plot(kind='bar', width=width, color='#FF0000', position=1, label='Day_Promo', rot=0)
Day_NoPromo.plot(kind='bar', width=width, color='#0000FF', position=0, label='Day_NoPromo', rot=0)
plt.xlabel('Day of week')
plt.ylabel('sum')
ax1.set_title("Summing Customers over each day")

ax2 = fig.add_subplot(122)
Day_Promo = train[train['Promo'] == 1].groupby('DayOfWeek')['Customers'].mean()
Day_NoPromo = train[train['Promo'] != 1].groupby('DayOfWeek')['Customers'].mean()
Day_Promo.plot(kind='bar', width=width, color='#FF0000', position=1, label='Day_Promo', rot=0)
Day_NoPromo.plot(kind='bar', width=width, color='#0000FF', position=0, label='Day_NoPromo', rot=0)
plt.xlabel('Day of week')
plt.ylabel('mean')
ax2.set_title("Mean Customers in each day")

plt.rcParams['figure.figsize'] = 18, 8
sns.set(font_scale=1)
plt.show()
			----------Dependency of Customers on DayOfWeek with/without Promo--------

1st bar is for day with Promo, 2nd bar is for day without Promo

We got some interesting results here. Sales on Saturday without Promo really tend to peak. On Sunday, I think, many stores are closed, hence the minimal totals. On the other days we can see that sales on promo days are significantly bigger.

Interestingly, the mean sales on Sundays do not differ much from the other days, but the number of customers on those days is very large, which pushes the total sum of sales up.

In the beginning I mentioned the obvious assumption of a strong connection between Sales and Customers; now it's time to check it precisely:

In [8]:
print("Pearson correlation : ", round(st.pearsonr(train.Customers, train.Sales)[0], 3))
Pearson correlation :  0.824

The Pearson correlation coefficient of 0.824 indicates a strong positive correlation between Sales and Customers: in general, the more customers a store has on a day, the higher its sales for that day.

So it seems interesting to create a new variable:

In [9]:
# creating a new variable: sales per customer
train['SalePerCustomer'] = train['Sales'] / train['Customers']

Distribution analysis of target variables:

Creating some useful functions for distribution analysis:

In [10]:
def norm_testing(vector):
    print("Skewness value :", np.round(st.skew(vector.values),decimals=2))
    print("The kurtosis for Fisher’s definition  :", np.round(st.kurtosis(vector.values) , decimals=2))
    print("Assymetry: ", np.abs(np.round(st.skew(vector.values),decimals=2)))
    print("Excess: ", np.abs(np.round(st.kurtosis(vector.values) ,decimals=2)))
    
    print("The kurtosis test that it's a normal distribution :", 
         "\nZ-score: {}  \nThe 2-sided p-value for the hypothesis test: {}".
         format(round(st.kurtosistest(vector.values)[0], 4), st.kurtosistest(vector.values)[1]))
    
    # for better accuracy in the Shapiro-Wilk test, take a random sample of 4500 values;
    # the documentation gives 5000 as the maximum number of values for accurate results
    vector = vector.reset_index(drop=True) #reindexing to [0,1,2,3 etc.] 
    indices = np.random.choice(a=vector.index, size=4500, replace=False)
    
    print("The Shapiro-Wilk test for the null hypothesis, that it's a normal distribution: ",
          "\nThe test statistic: {} \nThe p-value for the hypothesis test {}".
          format(round(st.shapiro(vector[indices].values)[0], 4), st.shapiro(vector[indices].values)[1]))
    

def get_distrib_stats(data):
    norm_testing(data)
    
    sns.set(rc={"figure.figsize": (20, 5)})
    sns.set(font_scale=1.2)
    fig = plt.figure()
    ax1 = fig.add_subplot(121)
    p_x = sns.distplot(data, fit=st.norm, kde=True,ax=ax1, bins=50)
    #ax1.legend()
    ax1.set_title("Probability density function")
    ax2 = fig.add_subplot(122)
    prob = st.probplot(data, dist=st.norm, plot=ax2)
    ax2.set_title('Probability plot')
    #ax2.legend()
    plt.show()
    
In [23]:
print("\t\t\t-----------Distribution analysis for Sales-----------")
get_distrib_stats(train['Sales'])

print("\t\t\t-----------Distribution analysis for Customers-----------")
get_distrib_stats(train['Customers'])

print("\t\t\t-----------Distribution analysis for SalePerCustomer-----------")
get_distrib_stats(train['SalePerCustomer'])
			-----------Distribution analysis for Sales-----------
Skewness value : 1.59
The kurtosis for Fisher’s definition  : 4.85
Assymetry:  1.59
Excess:  4.85
The kurtosis test that it's a normal distribution : 
Z-score: 300.4692  
The 2-sided p-value for the hypothesis test: 0.0
The Shapiro-Wilk test for the null hypothesis, that it's a normal distribution:  
The test statistic: 0.8855 
The p-value for the hypothesis test 0.0
			-----------Distribution analysis for Customers-----------
Skewness value : 2.79
The kurtosis for Fisher’s definition  : 13.32
Assymetry:  2.79
Excess:  13.32
The kurtosis test that it's a normal distribution : 
Z-score: 412.8111  
The 2-sided p-value for the hypothesis test: 0.0
The Shapiro-Wilk test for the null hypothesis, that it's a normal distribution:  
The test statistic: 0.7972 
The p-value for the hypothesis test 0.0
			-----------Distribution analysis for SalePerCustomer-----------
Skewness value : 0.59
The kurtosis for Fisher’s definition  : 2.76
Assymetry:  0.59
Excess:  2.76
The kurtosis test that it's a normal distribution : 
Z-score: 234.0123  
The 2-sided p-value for the hypothesis test: 0.0
The Shapiro-Wilk test for the null hypothesis, that it's a normal distribution:  
The test statistic: 0.9832 
The p-value for the hypothesis test 9.89589402453757e-23
OK, it's not normally distributed.

It would be better for our future modeling to have the target variable normally distributed.

How can we deal with this? We can remove the "inconvenient" values, such as the large tails, which gives a somewhat different picture: the data becomes closer to normal, but then we have to worry about data leakage. We can also apply a log() transform to the target, which again brings the variable somewhat closer to a normal distribution.

Of course, many other techniques could be considered; I tried several, and to save space and time I will show only the one that worked best on this data for making the variable more normally distributed (a quick check of the log alternative is sketched below).
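For comparison, a minimal sketch of the log alternative (assuming train is already restricted to open stores with positive sales, as was done above):

import numpy as np
import scipy.stats as st

# compare skewness of the raw target and its log1p transform
log_sales = np.log1p(train['Sales'])
print("Skewness of Sales:        ", round(st.skew(train['Sales'].values), 2))
print("Skewness of log1p(Sales): ", round(st.skew(log_sales.values), 2))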

We will use the Box-Cox transform.

But first let's look at boxplots of Sales and Customers:

In [24]:
fig = plt.figure()
ax1 = fig.add_subplot(121)
sns.violinplot(x='Sales', scale='count',data=train[train['Sales'] > 0])
ax1.set_title("violinplot of sales")

ax2 = fig.add_subplot(122)
sns.boxplot( x='Sales', data=train[train['Sales'] > 0], orient="h")
ax2.set_title("Boxplot of sales")

plt.rcParams['figure.figsize'] = 18, 7
sns.set(font_scale=3)

plt.show()
In [45]:
fig = plt.figure()
ax1 = fig.add_subplot(121)
sns.violinplot(x='Customers', scale='count',data=train[train['Customers'] > 0])
ax1.set_title("violinplot of Customers")

ax2 = fig.add_subplot(122)
sns.boxplot( x='Customers', data=train[train['Customers'] > 0], orient="h")
ax2.set_title("Boxplot of Customers")

plt.rcParams['figure.figsize'] = 18, 7
sns.set(font_scale=1)

plt.show()

There are some outliers in the target variables; we will decide how to deal with them in the next chapters.

And now let's apply the Box-Cox transform to the targets:

(we will also save the lambdas from the Box-Cox transform so that the inverse transform can be applied later, once predictions are computed, to recover predictions on the original scale)

In [11]:
# apply the Box-Cox transform and save the lambdas:
lambdas_coxbox = []

coxboxed, lam = boxcox(train['Sales'].values)
sales_coxboxed = pd.Series(coxboxed, index=train.Sales.index)
lambdas_coxbox.append(lam)

coxboxed, lam = boxcox(train['Customers'].values)
cust_coxboxed = pd.Series(coxboxed, index=train.Customers.index)
lambdas_coxbox.append(lam)

coxboxed, lam = boxcox(train['SalePerCustomer'].values)
sale_per_cust_coxboxed = pd.Series(coxboxed, index=train.SalePerCustomer.index)
lambdas_coxbox.append(lam)

print('Lambda for Sales: %f' % lambdas_coxbox[0])
print('Lambda for Customers: %f' % lambdas_coxbox[1])
print('Lambda for SalePerCustomer: %f' % lambdas_coxbox[2])
Lambda for Sales: 0.069340
Lambda for Customers: -0.173125
Lambda for SalePerCustomer: 0.353622
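As a quick sanity check (a minimal sketch using inv_boxcox, which was imported above), the saved lambdas indeed let us map the transformed values back to the original scale:

# inverse Box-Cox with the saved lambda should recover the original Sales values
recovered_sales = inv_boxcox(sales_coxboxed.values, lambdas_coxbox[0])
print(np.allclose(recovered_sales, train['Sales'].values))   # expected: True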
In [65]:
print("\t\t\t-----------Distribution analysis for coxboxed Sales-----------")
get_distrib_stats(sales_coxboxed)

print("\t\t\t-----------Distribution analysis for coxboxed Customers-----------")
get_distrib_stats(cust_coxboxed)

print("\t\t\t-----------Distribution analysis for coxboxed SalePerCustomer-----------")
get_distrib_stats(sale_per_cust_coxboxed)
			-----------Distribution analysis for coxboxed Sales-----------
Skewness value : 0.01
The kurtosis for Fisher’s definition  : 0.57
Assymetry:  0.01
Excess:  0.57
The kurtosis test that it's a normal distribution : 
Z-score: 84.3933  
The 2-sided p-value for the hypothesis test: 0.0
The Shapiro-Wilk test for the null hypothesis, that it's a normal distribution:  
The test statistic: 0.9974 
The p-value for the hypothesis test 7.20592254310759e-07
			-----------Distribution analysis for coxboxed Customers-----------
Skewness value : -0.03
The kurtosis for Fisher’s definition  : 1.07
Assymetry:  0.03
Excess:  1.07
The kurtosis test that it's a normal distribution : 
Z-score: 133.6594  
The 2-sided p-value for the hypothesis test: 0.0
The Shapiro-Wilk test for the null hypothesis, that it's a normal distribution:  
The test statistic: 0.9912 
The p-value for the hypothesis test 3.7189243600018113e-16
			-----------Distribution analysis for coxboxed SalePerCustomer-----------
Skewness value : 0.01
The kurtosis for Fisher’s definition  : 0.41
Assymetry:  0.01
Excess:  0.41
The kurtosis test that it's a normal distribution : 
Z-score: 63.8305  
The 2-sided p-value for the hypothesis test: 0.0
The Shapiro-Wilk test for the null hypothesis, that it's a normal distribution:  
The test statistic: 0.9956 
The p-value for the hypothesis test 2.4582183510979405e-10

OK, now it looks better. Also, we haven't done anything with the outliers yet, so perhaps removing those values will give an even better distribution without hurting forecasting accuracy.

Let's look one more time at the distribution statistics, and also check how much data we would keep after removing the outliers (a quick count of the retained share is sketched after the output below):

In [166]:
# finding outliers
fig = plt.figure()

ax1 = fig.add_subplot(131)
sns.boxplot(data=sales_coxboxed, orient="h")
ax1.set_title("Boxplot of coxboxed sales")

ax2 = fig.add_subplot(132)
sns.boxplot(data=cust_coxboxed, orient="h")
ax2.set_title("Boxplot of coxboxed Customers")

ax3 = fig.add_subplot(133)
sns.boxplot(data=sale_per_cust_coxboxed, orient="h")
ax3.set_title("Boxplot of coxboxed SalePerCustomer")



plt.rcParams['figure.figsize'] = 17, 14
sns.set(font_scale=1)

plt.show()
In [167]:
print("\t\t\t-----------Distribution analysis for cutted coxboxed Sales-----------")
get_distrib_stats(sales_coxboxed.where((sales_coxboxed > 8) & (sales_coxboxed < 15.5 ))
                            .sort_values().dropna())

print("\t\t\t-----------Distribution analysis for cutted coxboxed Customers-----------")
get_distrib_stats(cust_coxboxed.where((cust_coxboxed > 3.1) & (cust_coxboxed < 4.5))
                            .sort_values().dropna())

print("\t\t\t-----------Distribution analysis for cutted coxboxed SalePerCustomer-----------")
get_distrib_stats(sale_per_cust_coxboxed.where(sale_per_cust_coxboxed < 6)
                            .sort_values().dropna())
			-----------Distribution analysis for cutted coxboxed Sales-----------
Skewness value : 0.01
The kurtosis for Fisher’s definition  : 0.54
Assymetry:  0.01
Excess:  0.54
The kurtosis test that it's a normal distribution : 
Z-score: 80.1003  
The 2-sided p-value for the hypothesis test: 0.0
The Shapiro-Wilk test for the null hypothesis, that it's a normal distribution:  
The test statistic: 0.9965 
The p-value for the hypothesis test 7.86324871882016e-09
			-----------Distribution analysis for cutted coxboxed Customers-----------
Skewness value : -0.01
The kurtosis for Fisher’s definition  : 0.9
Assymetry:  0.01
Excess:  0.9
The kurtosis test that it's a normal distribution : 
Z-score: 118.5853  
The 2-sided p-value for the hypothesis test: 0.0
The Shapiro-Wilk test for the null hypothesis, that it's a normal distribution:  
The test statistic: 0.9939 
The p-value for the hypothesis test 6.712827993124448e-13
			-----------Distribution analysis for cutted coxboxed SalePerCustomer-----------
Skewness value : -0.01
The kurtosis for Fisher’s definition  : 0.21
Assymetry:  0.01
Excess:  0.21
The kurtosis test that it's a normal distribution : 
Z-score: 35.4084  
The 2-sided p-value for the hypothesis test: 1.2674128957791634e-274
The Shapiro-Wilk test for the null hypothesis, that it's a normal distribution:  
The test statistic: 0.996 
The p-value for the hypothesis test 1.079785594271243e-09
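Finally, to quantify how much data such a cut would keep, a minimal sketch (reusing the thresholds from the cell above) could be:

# share of observations that survive the cuts used above
sales_mask = (sales_coxboxed > 8) & (sales_coxboxed < 15.5)
cust_mask  = (cust_coxboxed > 3.1) & (cust_coxboxed < 4.5)
spc_mask   = sale_per_cust_coxboxed < 6
print("Sales kept:           {:.2%}".format(sales_mask.mean()))
print("Customers kept:       {:.2%}".format(cust_mask.mean()))
print("SalePerCustomer kept: {:.2%}".format(spc_mask.mean()))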