Authors: Yury Kashnitsky (@yorko), Yury Isakov. Edited by Anna Tarelina (@feuerengel), Aditya Soni (@ecdrid) and Kolchenko Sergey (@KolchenkoSergey). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

(beating baselines in the "Alice" competition)

Today we are going to practice working with sparse matrices, training Logistic Regression models, and doing feature engineering. We will reproduce a couple of baselines in the Kaggle Inclass competition "Catch Me If You Can: Intruder Detection through Webpage Session Tracking" (a.k.a. "Alice"). More credits will be given for beating stronger baselines.

Prior to working on the assignment, you'd better check out the corresponding course material:

- Classification, Decision Trees and k Nearest Neighbors, the same as an interactive web-based Kaggle Kernel (basics of machine learning are covered here)
- Linear classification and regression in 5 parts:
- You can also practice with demo assignments, which are simpler and already shared with solutions:
- "Sarcasm detection with logistic regression": assignment + solution
- "Linear regression as optimization": assignment (solution cannot be officially shared)
- "Exploring OLS, Lasso and Random Forest in a regression task": assignment + solution

- Alice baseline with logistic regression and "bag of sites", Kernel
- Correct time-aware cross-validation scheme, more features, and hyperparameter optimization, Kernel
**"Model validation in a competition"**, Kernel - this one reproduces a solution with**0.95055**Public LB ROC AUC and gives a lot of hints how to proceed with this competition- Other Kernels in this competition. You can share yours as well, but not high-performing ones (Public LB ROC AUC shall be < 0.95055). Please don't spoil the competitive spirit.
- If that's still not enough, watch two videos on logistic regression: mlcourse.ai/lectures

**Your task:**

- "Follow me". Complete the missing code and submit your answers via the google form. Use
**the same email and name**as in A1 and A2 (for newcomers: remember your email and use it for all forms during the course). 10 credits max. for this part - "Freeride". Come up with good features to beat the baselines "A3 baseline (10 credits)" (
**0.95343**Public LB ROC-AUC, press "Load more" in the bottom of the Leaderboard to actually see it) and "A3 strong baseline (20 credits)" (**0.95965**Public LB ROC-AUC). As names suggest, you'll get 10 more credits for beating the first one, and 10 more (20 in total) for beating the second one. You need to name your team (out of 1 person) in full accordance with the course rating (for newcomers: you need to name your team with your real full name). You can think of it as a part of the assignment. - If you've beaten "A3 baseline (10 credits)" or performed better, you need to upload your solution as described in course roadmap ("Kaggle Inclass Competition Alice" -> Rules). For all baselines that you see on Public Leaderboard, it's OK to beat them on Public LB as well. But 10 winners will be defined according to the private LB, which will be revealed by @yorko on October 27.

*image credit @muradosmann*

In [1]:

```
# Import libraries and set desired options
import pickle
from pathlib2 import Path
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
%config InlineBackend.figure_format = 'retina'
```

In [2]:

```
# Read the training and test data sets, change paths if needed
PATH_TO_DATA = Path('../../data/alice/')
times = ['time%s' % i for i in range(1, 11)]
train_df = pd.read_csv(PATH_TO_DATA / 'train_sessions.csv',
index_col='session_id', parse_dates=times)
test_df = pd.read_csv(PATH_TO_DATA / 'test_sessions.csv',
index_col='session_id', parse_dates=times)
# Sort the data by time
train_df = train_df.sort_values(by='time1')
# Look at the first rows of the training set
train_df.head()
```

Out[2]:

The training data set contains the following features:

**site1**– id of the first visited website in the session**time1**– visiting time for the first website in the session- ...
**site10**– id of the tenth visited website in the session**time10**– visiting time for the tenth website in the session**target**– target variable, 1 for Alice's sessions, and 0 for the other users' sessions

User sessions are chosen in the way that they are shorter than 30 min. long and contain no more than 10 websites. I.e. a session is considered over either if a user has visited 10 websites or if a session has lasted over 30 minutes.

There are some empty values in the table, it means that some sessions contain less than ten websites. Replace empty values with 0 and change columns types to integer. Also load the websites dictionary and check how it looks like:

In [3]:

```
# Change site1, ..., site10 columns type to integer and fill NA-values with zeros
sites = ['site%s' % i for i in range(1, 11)]
train_df[sites] = train_df[sites].fillna(0).astype(np.uint16)
test_df[sites] = test_df[sites].fillna(0).astype(np.uint16)
# Load websites dictionary
with open(PATH_TO_DATA / 'site_dic.pkl', "rb") as input_file:
site_dict = pickle.load(input_file)
# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()),
columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()
```

Out[3]:

Before we start training models, we have to perform Exploratory Data Analysis (EDA). Today, we are going to perform a shorter version, but we will use other techniques as we move forward. Let's check which websites in the training data set are the most visited. As you can see, they are Google services and a bioinformatics website (a website with 'zero'-index is our missed values, just ignore it):

In [4]:

```
# Top websites in the training data set
top_sites = pd.Series(train_df[sites].values.flatten()
).value_counts().sort_values(ascending=False).head(5)
print(top_sites)
sites_dict.loc[top_sites.drop(0).index]
```

Out[4]:

In [5]:

```
# You code here
```

Now let us look at the timestamps and try to characterize sessions as timeframes:

In [6]:

```
# Create a separate dataframe where we will work with timestamps
time_df = pd.DataFrame(index=train_df.index)
time_df['target'] = train_df['target']
# Find sessions' starting and ending
time_df['min'] = train_df[times].min(axis=1)
time_df['max'] = train_df[times].max(axis=1)
# Calculate sessions' duration in seconds
time_df['seconds'] = (time_df['max'] - time_df['min']) / np.timedelta64(1, 's')
time_df.head()
```

Out[6]:

In order to perform the next task, generate descriptive statistics as you did in the first assignment.

*In the next question, we are using the notion of "approximately the same". To be strict, let's define it: $a$ is approximately the same as $b$ ($a \approx b $) if their difference is less than or equal to 5% of the maximum between $a$ and $b$, i.e. $a \approx b \leftrightarrow \frac{|a-b|}{max(a,b)} \leq 0.05$.*

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #a3_part1_fall2019*

- on average, Alice's session is shorter than that of other users
- more than 1% of all sessions in the dataset belong to Alice
- minimum and maximum durations of Alice's and other users' sessions are approximately the same
- standard deviation of Alice's sessions duration is approximately the same as for non-Alice's sessions
- less than a quarter of Alice's sessions are greater than or equal to 40 seconds

In [7]:

```
# You code here
```

In order to train our first model, we need to prepare the data. First of all, exclude the target variable from the training set. Now both training and test sets have the same number of columns, therefore aggregate them into one dataframe. Thus, all transformations will be performed simultaneously on both training and test data sets.

On the one hand, it leads to the fact that both data sets have one feature space (you don't have to worry that you forgot to transform a feature in some data sets). On the other hand, processing time will increase. For the enormously large sets it might turn out that it is impossible to transform both data sets simultaneously (and sometimes you have to split your transformations into several stages only for train/test data set). In our case, with this particular data set, we are going to perform all the transformations for the whole united dataframe at once, and before training the model or making predictions we will just take its appropriate part.

In [8]:

```
# Our target variable
y_train = train_df['target']
# United dataframe of the initial data
full_df = pd.concat([train_df.drop('target', axis=1), test_df])
# Index to split the training and test data sets
idx_split = train_df.shape[0]
```

For the very basic model, we will use only the visited websites in the session (but we will not take into account timestamp features). The point behind this data selection is: *Alice has her favorite sites, and the more often you see these sites in the session, the higher probability that this is Alice's session, and vice versa.*

Let us prepare the data, we will take only features `site1, site2, ... , site10`

from the whole dataframe. Keep in mind that the missing values are replaced with zero. Here is how the first rows of the dataframe look like:

In [9]:

```
# Dataframe with indices of visited websites in session
full_sites = full_df[sites]
full_sites.head()
```

Out[9]:

Sessions are sequences of website indices, and data in this representation is useless for machine learning method (just think, what happens if we switched all ids of all websites).

According to our hypothesis (Alice has favorite websites), we need to transform this dataframe so each website has a corresponding feature (column) and its value is equal to number of this website visits in the session. It can be done in two lines:

In [10]:

```
# sequence of indices
sites_flatten = full_sites.values.flatten()
# and the matrix we are looking for
# (make sure you understand which of the `csr_matrix` constructors is used here)
# a further toy example will help you with it
full_sites_sparse = csr_matrix(([1] * sites_flatten.shape[0],
sites_flatten,
range(0, sites_flatten.shape[0] + 10, 10)))[:, 1:]
```

In [11]:

```
full_sites_sparse.shape
```

Out[11]:

If you understand what just happened here, then you can skip the next passage (perhaps, you can handle logistic regression too?), If not, then let us figure it out.

Let us estimate how much memory it will require to store our data in the example above. Our united dataframe contains 336 thousand samples of 48 thousand integer features in each. It's easy to calculate the required amount of memory, roughly:

$$336\ K * 48\ K * 8\ bytes \approx 16* 10^9 * 8\ bytes = 130\ GB,$$(that's the exact value). Obviously, ordinary mortals have no such volumes (strictly speaking, Python may allow you to create such a matrix, but it will not be easy to do anything with it). The interesting fact is that most of the elements of our matrix are zeros. If we count non-zero elements, then it will be about 1.8 million, i.е. slightly more than 10% of all matrix elements. Such a matrix, where most elements are zeros, is called sparse, and the ratio between the number of zero elements and the total number of elements is called the sparsity of the matrix.

For the work with such matrices you can use `scipy.sparse`

library, check documentation to understand what possible types of sparse matrices are, how to work with them and in which cases their usage is most effective. You can learn how they are arranged, for example, in Wikipedia article.
Note, that a sparse matrix contains only non-zero elements, and you can get the allocated memory size like this (significant memory savings are obvious):

In [12]:

```
# How much memory does a sparse matrix occupy?
print('{0} elements * {1} bytes = {2} bytes'.format(full_sites_sparse.count_nonzero(), 8,
full_sites_sparse.count_nonzero() * 8))
# Or just like this:
print('sparse_matrix_size = {0} bytes'.format(full_sites_sparse.data.nbytes))
```

Let us explore how the matrix with the websites has been formed using a mini example. Suppose we have the following table with user sessions:

id | site1 | site2 | site3 |
---|---|---|---|

1 | 1 | 0 | 0 |

2 | 1 | 3 | 1 |

3 | 2 | 3 | 4 |

There are 3 sessions, and no more than 3 websites in each. Users visited four different sites in total (there are numbers from 1 to 4 in the table cells). And let us assume that the mapping is:

- vk.com
- habrahabr.ru
- yandex.ru
- ods.ai

If the user has visited less than 3 websites during the session, the last few values will be zero. We want to convert the original dataframe in a way that each session has a corresponding row which shows the number of visits to each particular site. I.e. we want to transform the previous table into the following form:

id | vk.com | habrahabr.ru | yandex.ru | ods.ai |
---|---|---|---|---|

1 | 1 | 0 | 0 | 0 |

2 | 2 | 0 | 1 | 0 |

3 | 0 | 1 | 1 | 1 |

To do this, use the constructor: `csr_matrix ((data, indices, indptr))`

and create a frequency table (see examples, code and comments on the links above to see how it works). Here we set all the parameters explicitly for greater clarity:

In [13]:

```
# data, create the list of ones, length of which equal to the number of elements in the initial dataframe (9)
# By summing the number of ones in the cell, we get the frequency,
# number of visits to a particular site per session
data = [1] * 9
# To do this, you need to correctly distribute the ones in cells
# Indices - website ids, i.e. columns of a new matrix. We will sum ones up grouping them by sessions (ids)
indices = [1, 0, 0, 1, 3, 1, 2, 3, 4]
# Indices for the division into rows (sessions)
# For example, line 0 is the elements between the indices [0; 3) - the rightmost value is not included
# Line 1 is the elements between the indices [3; 6)
# Line 2 is the elements between the indices [6; 9)
indptr = [0, 3, 6, 9]
# Aggregate these three variables into a tuple and compose a matrix
# To display this matrix on the screen transform it into the usual "dense" matrix
csr_matrix((data, indices, indptr)).todense()
```

Out[13]:

As you might have noticed, there are not four columns in the resulting matrix (corresponding to number of different websites) but five. A zero column has been added, which indicates if the session was shorter (in our mini example we took sessions of three). This column is excessive and should be removed from the dataframe (do that yourself).

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #a3_part1_fall2019*

- 42%
- 47%
- 50%
- 53%

In [14]:

```
# You code here
```

Another benefit of using sparse matrices is that there are special implementations of both matrix operations and machine learning algorithms for them, which sometimes allows to significantly accelerate operations due to the data structure peculiarities. This applies to logistic regression as well. Now everything is ready to build our first model.

So, we have an algorithm and data for it. Let us build our first model, using logistic regression implementation from `Sklearn`

with default parameters. We will use the first 90% of the data for training (the training data set is sorted by time), and the remaining 10% for validation. Let's write a simple function that returns the quality of the model and then train our first classifier:

In [15]:

```
def get_auc_lr_valid(X, y, C=1.0, seed=17, ratio = 0.9):
# Split the data into the training and validation sets
idx = int(round(X.shape[0] * ratio))
# Classifier training
lr = LogisticRegression(C=C, random_state=seed, solver='liblinear').fit(X[:idx, :], y[:idx])
# Prediction for validation set
y_pred = lr.predict_proba(X[idx:, :])[:, 1]
# Calculate the quality
score = roc_auc_score(y[idx:], y_pred)
return score
```

In [16]:

```
%%time
# Select the training set from the united dataframe (where we have the answers)
X_train = full_sites_sparse[:idx_split, :]
# Calculate metric on the validation set
print(get_auc_lr_valid(X_train, y_train))
```

The first model demonstrated the quality of 0.92 on the validation set. Let's take it as the first baseline and starting point. To make a prediction on the test data set **we need to train the model again on the entire training data set** (until this moment, our model used only part of the data for training), which will increase its generalizing ability:

In [17]:

```
# Function for writing predictions to a file
def write_to_submission_file(predicted_labels, out_file,
target='target', index_label="session_id"):
predicted_df = pd.DataFrame(predicted_labels,
index = np.arange(1, predicted_labels.shape[0] + 1),
columns=[target])
predicted_df.to_csv(out_file, index_label=index_label)
```

In [18]:

```
# Train the model on the whole training data set
# Use random_state=17 for repeatability
# Parameter C=1 by default, but here we set it explicitly
lr = LogisticRegression(C=1.0, random_state=17, solver='liblinear').fit(X_train, y_train)
# Make a prediction for test data set
X_test = full_sites_sparse[idx_split:,:]
y_test = lr.predict_proba(X_test)[:, 1]
# Write it to the file which could be submitted
write_to_submission_file(y_test, 'baseline_1.csv')
```

Create a feature that will be a number in YYYYMM format from the date when the session was held, for example 201407 -- year 2014 and 7th month. Thus, we will take into account the monthly linear trend for the entire period of the data provided.

In [19]:

```
# Dataframe for new features
full_new_feat = pd.DataFrame(index=full_df.index)
# Add start_month feature
full_new_feat['start_month'] = full_df['time1'].apply(lambda ts:
100 * ts.year + ts.month).astype('float64')
```

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #a3_part1_fall2019*

- Alice wasn't online at all for the entire period
- From the beginning of 2013 to mid-2014, the number of Alice's sessions per month decreased
- The number of Alice's sessions per month is generally constant for the entire period
- From the beginning of 2013 to mid-2014, the number of Alice's sessions per month increased

*Hint: the graph will be more explicit if you treat start_month as a categorical ordinal variable*.

In [20]:

```
# You code here
```

In this way, we have an illustration and thoughts about the usefulness of the new feature, add it to the training sample and check the quality of the new model:

In [21]:

```
# Add the new feature to the sparse matrix
tmp = full_new_feat[['start_month']].values
X_train = csr_matrix(hstack([full_sites_sparse[:idx_split,:], tmp[:idx_split,:]]))
# Compute the metric on the validation set
print(get_auc_lr_valid(X_train, y_train))
```

The quality of the model has decreased significantly. We added a feature that definitely seemed useful to us, but its usage only worsened the model. Why did it happen?

Here we give an intuitive reasoning (a rigorous mathematical justification for one or another aspect in linear models you can easily find on the internet). Consider the features more closely: those of them that correspond to the number of visits to a particular web-site per session vary from 0 to 10. The feature `start_month`

has a completely different range: from 201301 to 201412, this means the contribution of this variable is significantly greater than the others. It would seem that problem can be avoided if we put less weight in a linear combination of attributes in this case, but in our case logistic regression with regularization is used (by default, this parameter is `C = 1`

), which penalizes the model the stronger the greater its weights are. Therefore, for linear methods with regularization, it is recommended to convert features to the same scale (you can read more about the regularization, for example, here).

One way to do this is standardization: for each observation you need to subtract the average value of the feature and divide this difference by the standard deviation:

$$ x^{*}_{i} = \dfrac{x_{i} - \mu_x}{\sigma_x}$$The following practical tips can be given:

- It is recommended to scale features if they have essentially different ranges or different units of measurement (for example, the country's population is indicated in units, and the country's GNP in trillions)
- Scale features if you do not have a reason/expert opinion to give a greater weight to any of them
- Scaling can be excessive if the ranges of some of your features differ from each other, but they are in the same system of units (for example, the proportion of middle-aged people and people over 80 among the entire population)
- If you want to get an interpreted model, then build a model without regularization and scaling (most likely, its quality will be worse)
- Binary features (which take only values of 0 or 1) are usually left without conversion, (but)
- If the quality of the model is crucial, try different options and select one where the quality is better

Getting back to `start_month`

, let us rescale the new feature and train the model again. This time the quality has increased:

In [22]:

```
# Add the new standardized feature to the sparse matrix
tmp = StandardScaler().fit_transform(full_new_feat[['start_month']])
X_train = csr_matrix(hstack([full_sites_sparse[:idx_split,:], tmp[:idx_split,:]]))
# Compute metric on the validation set
print(get_auc_lr_valid(X_train, y_train))
```

**#a3_part1_fall2019**

- It has decreased. It is better not to add a new feature.
- It has not changed
- It has decreased. The new feature should be scaled.
- I am confused, and I do not know if it's necessary to scale a new feature.

*Tips: use the nunique() function from pandas. Do not forget to include the start_month in the set. Will you scale a new feature? Why?*

In [23]:

```
# You code here
```

So, the new feature has slightly decreased the quality, so we will not use it. Nevertheless, do not rush to throw features out because they haven't performed well. They can be useful in a combination with other features (for example, when a new feature is a ratio or a product of two others).

The `start_hour`

feature is the hour at which the session started (from 0 to 23), and the binary feature `morning`

is equal to 1 if the session started in the morning and 0 if the session started later (we assume that morning means `start_hour`

is equal to 11 or less).

Will you scale the new features? Make your assumptions and test them in practice.

**#a3_part1_fall2019**

- None of the features gave an improvement :(
`start_hour`

feature gave an improvement, and`morning`

did not`morning`

feature gave an improvement, and`start_hour`

did not- Both features gave an improvement

*Tip: find suitable functions for working with time series data in documentation. Do not forget to include the start_month feature.*

In [24]:

```
# You code here
```

We have introduced features that improve the quality of our model in comparison with the first baseline. Can we do even better? After we have changed the training and test sets, it almost always makes sense to search for the optimal hyperparameters - the parameters of the model that do not change during training.

For example, in week 3, you learned that, in decision trees, the depth of the tree is a hyperparameter, but the feature by which splitting occurs and its threshold is not.

In the logistic regression that we use, the weights of each feature are changing, and we find their optimal values during training; meanwhile, the regularization parameter remains constant. This is the hyperparameter that we are going to optimize now.

Calculate the quality on a validation set with a regularization parameter, which is equal to 1 by default:

In [ ]:

```
# Compose the training set
tmp_scaled = StandardScaler().fit_transform(full_new_feat[['start_month',
'start_hour',
'morning']])
X_train = csr_matrix(hstack([full_sites_sparse[:idx_split,:],
tmp_scaled[:idx_split,:]]))
# Capture the quality with default parameters
score_C_1 = get_auc_lr_valid(X_train, y_train)
print(score_C_1)
```

We will try to beat this result by optimizing the regularization parameter. We will take a list of possible values of C and calculate the quality metric on the validation set for each of C-values:

In [26]:

```
# List of possible C-values
Cs = np.logspace(-3, 1, 10)
# You code here
```

Plot the graph of the quality metric (AUC-ROC) versus the value of the regularization parameter. The value of quality metric corresponding to the default value of C=1 is represented by a horizontal dotted line:

In [27]:

```
# You code here
```

In [28]:

```
# You code here
```

For the last task in this assignment: train the model using the optimal regularization parameter you found (do not round up to two digits like in the last question). If you do everything correctly and submit your solution, you should see `ROC AUC = 0.92784`

on the public leaderboard ("A3 baseline 2"):

In [ ]:

```
# Prepare the training and test data
tmp_scaled = StandardScaler().fit_transform(full_new_feat[['start_month', 'start_hour',
'morning']])
X_train = csr_matrix(hstack([full_sites_sparse[:idx_split,:],
tmp_scaled[:idx_split,:]]))
X_test = csr_matrix(hstack([full_sites_sparse[idx_split:,:],
tmp_scaled[idx_split:,:]]))
# Train the model on the whole training data set using optimal regularization parameter
lr = LogisticRegression(C=C, random_state=17, solver='liblinear').fit(X_train, y_train)
# Make a prediction for the test set
y_test = lr.predict_proba(X_test)[:, 1]
# Write it to the submission file
write_to_submission_file(y_test, 'baseline_2.csv')
```

In this part of the assignment, you have learned how to use sparse matrices, train logistic regression models, create new features and selected the best ones, learned why you need to scale features, and how to select hyperparameters. That's a lot!

*Yorko in Sheregesh, the best place in Russia for snowboarding and skiing.*

In this part, you'll need to beat the 2 more baselines mentioned in the beginning of this assignment. No more step-by-step instructions. But it'll be very helpful for you to study the Kernel "Model validation in a competition".

Here are a few tips for finding new features: think about what you can come up with using existing features, try multiplying or dividing two of them, justify or decline your hypotheses with plots, extract useful information from time series data (time1 ... time10), do not hesitate to convert an existing feature (for example, take a logarithm), etc. Checkout other Kernels. We encourage you to try new ideas and models - it's fun!