**Research plan**

```
- 1. Dataset and features description
- 2. Primary data analysis
- 3. Primary visual analysis of the features
- 4. Patterns, insights, peculiarities of data
- 5. Metrics selection
- 6. Model selection
- 7. Data preprocessing
- 8. Cross-validation and adjustment of model hyperparameters
- 9. Creation of new features and description of this process
- 10. Plotting training and validation curves
- 11. Prediction for test or hold-out samples
- 12. Conclusions
```

## Part 1. Dataset and features description

Abstract: Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

In their first Kaggle competition, Rossmann is challenging us to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation.

We are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

## Files:

Downloaded from: https://www.kaggle.com/c/rossmann-store-sales/data

train.csv - historical data including Sales

test.csv - historical data excluding Sales

sample_submission.csv - a sample submission file in the correct format

store.csv - supplemental information about the stores

## Some preliminary thoughts:

A sales forecast is a tool that can help almost any company. Many companies rely on human forecasts of inconsistent quality, while others use standard tools that are not flexible enough for their needs. This competition gives researchers the opportunity to build their own solution, one that genuinely improves the business.

Since the data consists of daily sales indexed by date, we can treat it as time series data, so it will be useful to extract date-time features for further analysis (a small sketch follows below).
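
As a minimal sketch (assuming the Date column is parsed into a DatetimeIndex, as done in the loading code below), extracting calendar features could look like this; the feature names are illustrative, not final:

```
# Hypothetical example: deriving calendar features from a DatetimeIndex
import pandas as pd

dates = pd.to_datetime(["2015-07-31", "2015-08-01"])
toy = pd.DataFrame({"Sales": [5263, 0]}, index=dates)
toy["Year"] = toy.index.year
toy["Month"] = toy.index.month
toy["Day"] = toy.index.day
toy["DayOfWeek"] = toy.index.dayofweek + 1  # 1 = Monday, matching the competition's encoding
print(toy)
```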

## Input variables:

Self-explanatory fields :

- Date

- DayOfWeek

The following fields have descriptions :

- Id - an Id that represents a (Store, Date) duple within the test set

- Store - a unique Id for each store

- Open - an indicator for whether the store was open: 0 = closed, 1 = open

- StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

- SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools

- StoreType - differentiates between 4 different store models: a, b, c, d

- Assortment - describes an assortment level: a = basic, b = extra, c = extended

- CompetitionDistance - distance in meters to the nearest competitor store

- CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened

- Promo - indicates whether a store is running a promo on that day

- Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

- Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2

- PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

## Output variable (desired target):

- Sales - the turnover for any given day

We are also given a variable that is obviously highly correlated with the target:

- Customers - the number of customers on a given day

## Evaluation:

Submissions are evaluated on the Root Mean Square Percentage Error (RMSPE).

It is also worth noting: obviously there will be no Sales in closed stores on particular days, so we could simply put 0 in the test target for rows where the store is closed. But since the competition description says that Sales of closed stores are ignored in the final score, we will not pay special attention to these rows in the test data.
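
For reference, a minimal NumPy sketch of RMSPE (the function name is ours; days with zero sales are skipped, in line with the scoring rule above):

```
import numpy as np

def rmspe(y_true, y_pred):
    """RMSPE = sqrt(mean(((y - yhat) / y) ** 2)), ignoring rows where y == 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return np.sqrt(np.mean(pct_err ** 2))

print(rmspe([100, 0, 200], [110, 5, 190]))  # ~0.079
```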

## Part 2. Primary data analysis

## Loading packages:

In [329]:

```
#system
import os
import sys
import gc
import warnings
warnings.filterwarnings("ignore")
# basic
import numpy as np
import pandas as pd
from datetime import datetime  # note: "from pandas import datetime" was removed in newer pandas
import pickle
import scipy
import math
import collections
import sklearn
# data preproc
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import boxcox
from scipy.special import inv_boxcox
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import itertools
# statistics
import scipy.stats as st
from statsmodels.distributions.empirical_distribution import ECDF
#from sklearn import metrics
from sklearn.metrics import mean_squared_error, make_scorer
# models
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV, validation_curve
from sklearn.model_selection import KFold, learning_curve
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
```

## Importing:

In [2]:

```
# Path to our data
PATH_TO_alldata = os.path.join(os.path.abspath(os.curdir), 'all_data')
# importing train and test data
initial_train = pd.read_csv(os.path.join(PATH_TO_alldata, "train.csv"), parse_dates = True, index_col = 'Date')
train = initial_train.copy()
initial_test = pd.read_csv(os.path.join(PATH_TO_alldata, "test.csv"), parse_dates = True, index_col = 'Date')
test = initial_test.copy()
# additional store data
store = pd.read_csv(os.path.join(PATH_TO_alldata, "store.csv"))
# As we did with train and test, create a copy of the initial frame and work with it:
initial_store = store.copy()
# reading sample_submission
sample_submission = pd.read_csv(os.path.join(PATH_TO_alldata, "sample_submission.csv"))
```

## Main info about our data:

In [3]:

```
print("\n----------------------------train ----------------------------\n")
train.info()
print("\n----------------------------store ----------------------------\n")
store.info()
print("\n----------------------------test ----------------------------\n")
test.info()
```

In [4]:

```
print("Number of Rows TRAIN:", train.shape[0])
print("Number of Columns TRAIN:", train.shape[1])
print("---------------------------")
print("Number of Rows TEST:", test.shape[0])
print("Number of Columns TEST:", test.shape[1])
print("---------------------------")
print("Number of Rows STORE:", store.shape[0])
print("Number of Columns STORE:", store.shape[1])
```

## Let's look at the heads:

In [5]:

```
train.head()
```

Out[5]:

In [6]:

```
test.head()
```

Out[6]:

In [7]:

```
store.head()
```

Out[7]:

Some easy statistics:

In [8]:

```
train.describe()
```

Out[8]:

## Counting the number of NaNs in each column of a DataFrame:

In [5]:

```
def nans_counter(df):
    # print the number of NaNs in each column of the given DataFrame
    for column in df.columns.values:
        print(column, "NANs : ", pd.isnull(df.loc[:, column]).sum())
```
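
For reference, the same per-column counts can also be obtained with pandas' built-in vectorized call, shown here only as an equivalent one-liner:

```
# Equivalent vectorized alternative to the helper above
print(train.isnull().sum())
```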

In [6]:

```
print("\n TRAIN ")
nans_counter(train)
print("\n TEST ")
nans_counter(test)
print("\n STORE ")
nans_counter(store)
```

So we see that the train df doesn't have any NaNs, the test df has 11 NaNs in the Open column, and the store df has many NaNs. We will consider and remove these NaNs later.

## Now we will consider the influence of different columns on the target variable:

First, we will take the **Open** column and look at additional information from **SchoolHoliday** and **StateHoliday**.

In [11]:

```
print("Over entire train period, {} - is the number of times that different stores closed on given days."
.format(train[(train.Open == 0)].count()[0]), "\n")
print("{} - is the number of closed stores because of a school holiday. "
.format(train[(train.Open == 0) &
(train.SchoolHoliday == 1) &
(train.StateHoliday == '0')]
.count()[0]), "\n")
print("{} - is the number of closed stores because of either a bank holiday or easter or christmas."
.format(train[(train.Open == 0) &
((train.StateHoliday == 'a') |
(train.StateHoliday == 'b') |
(train.StateHoliday == 'c'))]
.count()[0]), "\n")
print("{} - is the number of times, when shops were closed on days for no apparent reason when no holiday was announced."
.format(train[(train.Open == 0) &
(train.StateHoliday == "0") &
(train.SchoolHoliday == 0)]
.count()[0]))
```

From the description of the task we know that Rossmann stores sometimes underwent refurbishments and were closed during that time. Most probably the cases where stores were closed for no apparent reason (in the dataset) correspond to these refurbishments.

Since we don't want to bias our models, we will not consider those exceptions: we will get rid of the closed stores and prevent the models from training on them and picking up false guidance.

So we will analyse only open stores, since a closed store yields sales of 0. And as stated in the beginning, we will not pay any attention to similar situations in the test data.

The distribution of **open vs. closed stores per day of week** is shown here:

## 'Open' option with respect to day of week:

In [161]:

```
fig, axis = plt.subplots(1, 1, figsize=(15, 4))
sns.countplot(x='Open', hue='DayOfWeek', data=train,
              palette=sns.color_palette("Set1", n_colors=7), ax=axis)
plt.title("Train Data")
```

Out[161]:

In [7]:

```
# getting rid of closed stores
train = train[(train.Open != 0) & (train.Sales != 0)]
```

As we remember, we have **11 NaNs in the Open column of the test data**.

So it's time to deal with them. First of all, we have to find out which store has these NaNs. Then we will explore when this store was closed. After that we will decide whether each NaN should be replaced with 0 or not.

In [162]:

```
fig, axis = plt.subplots(1, 1, figsize=(15, 4))
sns.countplot(x='Open', hue='DayOfWeek', data=test,
              palette=sns.color_palette("Set1", n_colors=7), ax=axis)
plt.title("Test Data")
```

Out[162]:

In [170]:

```
test[(test['Open'] != 0) & (test['Open']!=1)]
```

Out[170]:

In [174]:

```
print("Ok, that's only 622 store, that has NANs. Let's consider days with NAN :\n",
sorted(test[(test['Open'] != 0) & (test['Open']!=1)].DayOfWeek.unique()))
fig, axis = plt.subplots(1,1,figsize=(15,4))
sns.countplot(x='Open',hue='DayOfWeek', data=test[test['Store']==622],
palette=sns.color_palette("Set1", n_colors=7), ax=axis)
plt.title("Test Data for store 622")
```

Out[174]:

Looking at the plot of the Open column in the test data for store 622, we see that on days 1..6 this store is always open.

This leads us to **put 1 in all the NaNs in this column**.

In [8]:

```
# replacing NANs with OPEN-state
test.Open.fillna(1, inplace=True)
```

Now our data is free of the useless information about closed stores.

We can group by **date** and get the **average sales** and the **percentage change** between the current and the prior element:

In [185]:

```
average_sales = train.groupby('Date')["Sales"].mean()
pct_change_sales = train.groupby('Date')["Sales"].sum().pct_change()
fig, (axis1,axis2) = plt.subplots(2,1, sharex=True, figsize=(15,8))
# average sales
ax1 = average_sales.plot(legend=True, ax=axis1, marker='p', title="Average Sales")
ax1.set_xticks(range(len(average_sales)))
ax1.set_xticklabels(average_sales.index.tolist(), rotation=90)
# percent change for sales
ax2 = pct_change_sales.plot(legend=True,ax=axis2, marker='o', rot=90, colormap="summer", title="Sales Percent Change")
```

From these plots we clearly see an increase in Sales in the period around New Year, which looks plausible.

The pattern of the average sales curve repeats every year: nothing strange, nothing particularly surprising.

## So what about the distribution of Sales in different days of week?

In [17]:

```
fig = plt.figure()
ax1 = fig.add_subplot(131)
sns.barplot(x=train.groupby('DayOfWeek')['Sales'].sum().index, y=train.groupby('DayOfWeek')['Sales'].sum().values)
ax1.set_title("Summing sales over each day")
ax2 = fig.add_subplot(132)
sns.barplot(x=train.groupby('DayOfWeek')['Sales'].mean().index, y=train.groupby('DayOfWeek')['Sales'].mean().values)
ax2.set_title("Mean sales in each day")
ax3 = fig.add_subplot(133)
sns.countplot(train.DayOfWeek)
ax3.set_title("Total count of sales in each day")
plt.rcParams['figure.figsize'] = 18, 7
sns.set(font_scale=1)
plt.show()
```

In [18]:

```
sns.boxplot(x='DayOfWeek', y='Sales', data=train, palette="muted");
```

OK, from the plots we can see that the highest total sales occur on day 1.

On day 7 there are very few sales records (most probably because of closed stores), but the mean sale on this day is one of the highest! By the mean measure there are two leaders: days 1 and 7.

We can also consider such an interesting variable as **Promo** (whether a store is running a promo on that day or not).

## Here we can see how Sales and Customers depend on the day of week and on whether a Promo is running that day:

In [76]:

```
print("\t\t\t----------Dependency of Sales on DayOfWeek with/without Promo--------\n")
print("1nd bar is for day with Promo, 2nd bar is for day without Promo")
fig = plt.figure()
# Bar width
width = 0.4
ax1 = fig.add_subplot(121)
Day_Promo = train[train['Promo'] == 1].groupby('DayOfWeek')['Sales'].sum()
Day_NoPromo = train[train['Promo'] != 1].groupby('DayOfWeek')['Sales'].sum()
Day_Promo.plot(kind='bar', width=width, color='#FF0000', position=1, label='Day_Promo', rot=0)
Day_NoPromo.plot(kind='bar', width=width, color='#0000FF', position=0, label='Day_NoPromo', rot=0)
plt.xlabel('Day of week')
plt.ylabel('sum')
ax1.set_title("Summing Sales over each day")
ax2 = fig.add_subplot(122)
Day_Promo = train[train['Promo'] == 1].groupby('DayOfWeek')['Sales'].mean()
Day_NoPromo = train[train['Promo'] != 1].groupby('DayOfWeek')['Sales'].mean()
Day_Promo.plot(kind='bar', width=width, color='#FF0000', position=1, label='Day_Promo', rot=0)
Day_NoPromo.plot(kind='bar', width=width, color='#0000FF', position=0, label='Day_NoPromo', rot=0)
plt.xlabel('Day of week')
plt.ylabel('mean')
ax2.set_title("Mean Sales in each day")
plt.rcParams['figure.figsize'] = 18, 8
sns.set(font_scale=1)
plt.show()
```

In [77]:

```
print("\t\t\t----------Dependency of Customers on DayOfWeek with/without Promo--------\n")
print("1nd bar is for day with Promo, 2nd bar is for day without Promo")
fig = plt.figure()
# Bar width
width = 0.4
ax1 = fig.add_subplot(121)
Day_Promo = train[train['Promo'] == 1].groupby('DayOfWeek')['Customers'].sum()
Day_NoPromo = train[train['Promo'] != 1].groupby('DayOfWeek')['Customers'].sum()
Day_Promo.plot(kind='bar', width=width, color='#FF0000', position=1, label='Day_Promo', rot=0)
Day_NoPromo.plot(kind='bar', width=width, color='#0000FF', position=0, label='Day_NoPromo', rot=0)
plt.xlabel('Day of week')
plt.ylabel('sum')
ax1.set_title("Summing Customers over each day")
ax2 = fig.add_subplot(122)
Day_Promo = train[train['Promo'] == 1].groupby('DayOfWeek')['Customers'].mean()
Day_NoPromo = train[train['Promo'] != 1].groupby('DayOfWeek')['Customers'].mean()
Day_Promo.plot(kind='bar', width=width, color='#FF0000', position=1, label='Day_Promo', rot=0)
Day_NoPromo.plot(kind='bar', width=width, color='#0000FF', position=0, label='Day_NoPromo', rot=0)
plt.xlabel('Day of week')
plt.ylabel('mean')
ax2.set_title("Mean Customers in each day")
plt.rcParams['figure.figsize'] = 18, 8
sns.set(font_scale=1)
plt.show()
```

We got some interesting results here. Total sales on Saturdays without Promo really do reach a maximum. On Sundays, I think, many stores are closed, so the totals are minimal. On the other days we can see that sales on Promo days are significantly higher.

Interestingly, the mean sales on Promo Sundays do not differ much from the other days, but the number of customers on that day is very large, which is what drives the total sum of sales up.

In the beginning I mentioned the obvious assumption of a strong connection between **Sales** and **Customers**; now it's time to verify it precisely:

In [8]:

```
print("Pearson correlation : ", round(st.pearsonr(train.Customers, train.Sales)[0], 3))
```

A Pearson correlation of 0.824 indicates a strong positive relationship between Sales and Customers. In general, the more customers a store has, the higher its sales for the day.

So it seems interesting to create a new variable:

In [9]:

```
# combining Sales and Customers into a new per-customer variable
train['SalePerCustomer'] = train['Sales'] / train['Customers']
```

## Distribution analysis of target variables:

Creating some useful functions for distribution analysis:

In [10]:

```
def norm_testing(vector):
    print("Skewness value :", np.round(st.skew(vector.values), decimals=2))
    print("The kurtosis for Fisher's definition :", np.round(st.kurtosis(vector.values), decimals=2))
    print("Asymmetry: ", np.abs(np.round(st.skew(vector.values), decimals=2)))
    print("Excess: ", np.abs(np.round(st.kurtosis(vector.values), decimals=2)))
    print("The kurtosis test that it's a normal distribution :",
          "\nZ-score: {} \nThe 2-sided p-value for the hypothesis test: {}".
          format(round(st.kurtosistest(vector.values)[0], 4), st.kurtosistest(vector.values)[1]))
    # for better accuracy in the Shapiro-Wilk test, take e.g. 4500 random values from the data
    # (the documentation gives 5000 as the maximum sample size for accurate results)
    vector = vector.reset_index(drop=True)  # reindexing to [0, 1, 2, 3, ...]
    indices = np.random.choice(a=vector.index, size=4500, replace=False)
    print("The Shapiro-Wilk test for the null hypothesis that it's a normal distribution: ",
          "\nThe test statistic: {} \nThe p-value for the hypothesis test {}".
          format(round(st.shapiro(vector[indices].values)[0], 4), st.shapiro(vector[indices].values)[1]))

def get_distrib_stats(data):
    norm_testing(data)
    sns.set(rc={"figure.figsize": (20, 5)})
    sns.set(font_scale=1.2)
    fig = plt.figure()
    ax1 = fig.add_subplot(121)
    p_x = sns.distplot(data, fit=st.norm, kde=True, ax=ax1, bins=50)
    ax1.set_title("Probability density function")
    ax2 = fig.add_subplot(122)
    prob = st.probplot(data, dist=st.norm, plot=ax2)
    ax2.set_title("Probability plot")
    plt.show()
```

In [23]:

```
print("\t\t\t-----------Distribution analysis for Sales-----------")
get_distrib_stats(train['Sales'])
print("\t\t\t-----------Distribution analysis for Customers-----------")
get_distrib_stats(train['Customers'])
print("\t\t\t-----------Distribution analysis for SalePerCustomer-----------")
get_distrib_stats(train['SalePerCustomer'])
```

It is better for our future modeling to have a normally distributed target variable.

How can we deal with this? We can remove "inconvenient" values, such as the long tails, which gives a somewhat different picture: the data becomes closer to normal, but at the cost of throwing some data away. We can also apply a log() transform to the target, which again makes the variable look more normally distributed.

We could certainly consider other techniques; I tried many, and to save space and time I will show the one that worked best for making the target variable normally distributed.
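
For instance, a quick check of how a simple log transform changes the skewness of Sales might look like this (the use of np.log1p here is just an illustrative choice):

```
# Illustrative check: skewness of raw vs. log-transformed Sales
import numpy as np
import scipy.stats as st

print("Skewness of raw Sales    :", round(st.skew(train['Sales'].values), 2))
print("Skewness of log1p(Sales) :", round(st.skew(np.log1p(train['Sales'].values)), 2))
```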

We will use the **Box-Cox transform** (scipy's `boxcox`).

But first let's look at **boxplots** of Sales and Customers:

In [24]:

```
fig = plt.figure()
ax1 = fig.add_subplot(121)
sns.violinplot(x='Sales', scale='count',data=train[train['Sales'] > 0])
ax1.set_title("violinplot of sales")
ax2 = fig.add_subplot(122)
sns.boxplot( x='Sales', data=train[train['Sales'] > 0], orient="h")
ax2.set_title("Boxplot of sales")
plt.rcParams['figure.figsize'] = 18, 7
sns.set(font_scale=3)
plt.show()
```

In [45]:

```
fig = plt.figure()
ax1 = fig.add_subplot(121)
sns.violinplot(x='Customers', scale='count',data=train[train['Customers'] > 0])
ax1.set_title("violinplot of Customers")
ax2 = fig.add_subplot(122)
sns.boxplot( x='Customers', data=train[train['Customers'] > 0], orient="h")
ax2.set_title("Boxplot of Customers")
plt.rcParams['figure.figsize'] = 18, 7
sns.set(font_scale=1)
plt.show()
```

There are some outliers in the target variables; we will decide how to deal with them in the next chapters.

And now, **Box-Cox transforming the targets**:

(we will also save the **lambdas** from the Box-Cox transform so that later, once predictions are calculated, we can apply the **inverse Box-Cox** function and recover predictions on the original scale)

In [11]:

```
# Box-Cox transform the targets and save the lambdas:
lambdas_coxbox = []
coxboxed, lam = boxcox(train['Sales'].values)
sales_coxboxed = pd.Series(coxboxed, index=train.Sales.index)
lambdas_coxbox.append(lam)
coxboxed, lam = boxcox(train['Customers'].values)
cust_coxboxed = pd.Series(coxboxed, index=train.Customers.index)
lambdas_coxbox.append(lam)
coxboxed, lam = boxcox(train['SalePerCustomer'].values)
sale_per_cust_coxboxed = pd.Series(coxboxed, index=train.SalePerCustomer.index)
lambdas_coxbox.append(lam)
print('Lambda for Sales: %f' % lambdas_coxbox[0])
print('Lambda for Customers: %f' % lambdas_coxbox[1])
print('Lambda for SalePerCustomer: %f' % lambdas_coxbox[2])
```
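
A round-trip with scipy's inv_boxcox and the saved lambda should recover the original Sales, which is exactly what we will rely on later when converting predictions back; a minimal check:

```
# Sanity check: inverse Box-Cox with the saved lambda recovers the original scale
from scipy.special import inv_boxcox

recovered = inv_boxcox(sales_coxboxed.values, lambdas_coxbox[0])
print(np.allclose(recovered, train['Sales'].values))  # expected: True
```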

In [65]:

```
print("\t\t\t-----------Distribution analysis for coxboxed Sales-----------")
get_distrib_stats(sales_coxboxed)
print("\t\t\t-----------Distribution analysis for coxboxed Customers-----------")
get_distrib_stats(cust_coxboxed)
print("\t\t\t-----------Distribution analysis for coxboxed SalePerCustomer-----------")
get_distrib_stats(sale_per_cust_coxboxed)
```

OK, now it looks better. Also, we haven't done anything with the outliers yet, so perhaps removing those values will lead to an even better distribution without hurting forecasting accuracy.

Let's look at the distribution statistics one last time for now, and also check how much data would remain after **removing the outliers**:

In [166]:

```
# finding outliers
fig = plt.figure()
ax1 = fig.add_subplot(131)
sns.boxplot(data=sales_coxboxed, orient="h")
ax1.set_title("Boxplot of coxboxed sales")
ax2 = fig.add_subplot(132)
sns.boxplot(data=cust_coxboxed, orient="h")
ax2.set_title("Boxplot of coxboxed Customers")
ax3 = fig.add_subplot(133)
sns.boxplot(data=sale_per_cust_coxboxed, orient="h")
ax3.set_title("Boxplot of coxboxed SalePerCustomer")
plt.rcParams['figure.figsize'] = 17, 14
sns.set(font_scale=1)
plt.show()
```

In [167]:

```
print("\t\t\t-----------Distribution analysis for cutted coxboxed Sales-----------")
get_distrib_stats(sales_coxboxed.where((sales_coxboxed > 8) & (sales_coxboxed < 15.5 ))
.sort_values().dropna())
print("\t\t\t-----------Distribution analysis for cutted coxboxed Customers-----------")
get_distrib_stats(cust_coxboxed.where((cust_coxboxed > 3.1) & (cust_coxboxed < 4.5))
.sort_values().dropna())
print("\t\t\t-----------Distribution analysis for cutted coxboxed SalePerCustomer-----------")
get_distrib_stats(sale_per_cust_coxboxed.where(sale_per_cust_coxboxed < 6)
.sort_values().dropna())
```
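
The cut-off thresholds above (8 and 15.5 for Sales, and so on) are hand-picked from the boxplots; an alternative, sketched here purely as an illustration rather than the approach used above, is the standard 1.5 × IQR rule:

```
# Illustrative alternative to hand-picked cut-offs: the 1.5 * IQR rule
def iqr_bounds(series, k=1.5):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)

low, high = iqr_bounds(sales_coxboxed)
kept = sales_coxboxed[(sales_coxboxed > low) & (sales_coxboxed < high)]
print("Kept {} of {} rows ({:.1%})".format(len(kept), len(sales_coxboxed), len(kept) / len(sales_coxboxed)))
```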