Here we describe some baseline strategies for tackling the Mangaki Data Challenge.
jj@mangaki.fr
import pandas as pd
df_watched = pd.read_csv('../data/mdc/watched.csv')
df_train = pd.read_csv('../data/mdc/train.csv')
df_test = pd.read_csv('../data/mdc/test.csv')
df_watched.head()
| | user_id | work_id | rating |
|---|---|---|---|
| 0 | 717 | 8025 | dislike |
| 1 | 1106 | 1027 | neutral |
| 2 | 1970 | 3949 | neutral |
| 3 | 1685 | 9815 | like |
| 4 | 1703 | 3482 | like |
df_train.head()
| | user_id | work_id | rating |
|---|---|---|---|
| 0 | 50 | 4041 | wontsee |
| 1 | 508 | 1713 | wontsee |
| 2 | 1780 | 7053 | willsee |
| 3 | 658 | 8853 | wontsee |
| 4 | 1003 | 9401 | wontsee |
df_test.head()
| | user_id | work_id |
|---|---|---|
| 0 | 486 | 1086 |
| 1 | 1509 | 3296 |
| 2 | 617 | 1086 |
| 3 | 270 | 9648 |
| 4 | 459 | 3647 |
from collections import Counter
import numpy as np
nb = Counter()
for user_id, work_id, choice in np.array(pd.concat([df_watched, df_train])):
nb[('user', user_id, choice)] += 1
nb[('work', work_id, choice)] += 1
nb.most_common(5)
[(('user', 488, 'neutral'), 1119), (('work', 9815, 'like'), 1050), (('work', 991, 'like'), 927), (('work', 4487, 'like'), 893), (('work', 1701, 'like'), 822)]
For example, this means that user 488 rated 1119 works as neutral, and work 9815 was liked 1050 times.
This will be useful: BC's solution (ranked #5) was solely based on those basic count features, not even on the watched dataset!
rating_values = {'love': 2, 'like': 2, 'dislike': -2, 'neutral': -1, 'willsee': 1, 'wontsee': 0}
df_watched['value'] = df_watched['rating'].map(rating_values)
df_train['value'] = df_train['rating'].map(rating_values)
This adds a numeric value column to both datasets.
df_watched.head()
| | user_id | work_id | rating | value |
|---|---|---|---|---|
| 0 | 717 | 8025 | dislike | -2 |
| 1 | 1106 | 1027 | neutral | -1 |
| 2 | 1970 | 3949 | neutral | -1 |
| 3 | 1685 | 9815 | like | 2 |
| 4 | 1703 | 3482 | like | 2 |
X_watched = np.array(df_watched[['user_id', 'work_id']])
y_watched = df_watched['value']
y_text = df_watched['rating']
X_train = np.array(df_train[['user_id', 'work_id']])
y_train = df_train['value']
X_test = np.array(df_test)
print('Watched dataset:', X_watched.shape, '/'.join(set(y_text)))
print('Train dataset:', X_train.shape, 'willsee/wontsee')
print('Test dataset:', X_test.shape)
Watched dataset: (198970, 2) dislike/love/neutral/like
Train dataset: (11112, 2) willsee/wontsee
Test dataset: (100015, 2)
# We assume every user and work in the train/test sets also appears in watched.csv
nb_users = 1 + max(df_watched['user_id'])
nb_works = 1 + max(df_watched['work_id'])
from mangaki.algo.als import MangakiALS
als = MangakiALS(20)
als.set_parameters(nb_users, nb_works)
als.fit(X_watched, y_watched)
Computing M: (1983 × 9897)
Chrono: fill and center matrix [0q, 1619ms]
Shapes (1983, 20) (20, 9897)
Chrono: factor matrix [0q, 9049ms]
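For readers without the `mangaki` package, the alternating-least-squares factorization used above can be sketched in plain NumPy. This is a minimal illustration, not the actual `MangakiALS` implementation; the rank, regularization strength, and iteration count below are arbitrary choices.

```python
import numpy as np

def als(R, mask, rank=2, reg=0.1, n_iters=10, seed=0):
    """Factor R ≈ U @ V.T using only the entries where mask is True,
    alternating ridge-regression solves for user and work factors."""
    rng = np.random.default_rng(seed)
    n_users, n_works = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_works, rank))
    for _ in range(n_iters):
        for u in range(n_users):           # fix V, solve for each user row
            idx = mask[u]
            if idx.any():
                Vu = V[idx]
                A = Vu.T @ Vu + reg * np.eye(rank)
                U[u] = np.linalg.solve(A, Vu.T @ R[u, idx])
        for i in range(n_works):           # fix U, solve for each work row
            idx = mask[:, i]
            if idx.any():
                Ui = U[idx]
                A = Ui.T @ Ui + reg * np.eye(rank)
                V[i] = np.linalg.solve(A, Ui.T @ R[idx, i])
    return U, V
```

In the real setting, `R` would be the sparse user × work matrix of rating values and `mask` its observed entries; `U` and `V` then play the roles of `als.U` and `als.VT.T` used below.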
# This part is only executable by an admin who has truth.csv; it is used for evaluation
y_test = np.array(pd.read_csv('../data/mdc/truth.csv')['rating'].map({'willsee': 1, 'wontsee': 0}))
# Dummy prediction that constantly predicts 0 (wontsee)
y_pred = [0] * len(y_test)
from sklearn.metrics import accuracy_score, roc_auc_score
accuracy_score(y_test, y_pred)
0.59558066290056488
If you always predict wontsee (0), you will get 59.6% accuracy…
roc_auc_score(y_test, y_pred)
0.5
… but only 50% AUC.
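To see why: AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting one half. A constant predictor ties every pair, so it scores exactly 0.5 no matter the class balance. A pure-Python illustration of this pairwise definition:

```python
# AUC as a pairwise ranking probability: for every (positive, negative) pair,
# count 1 if the positive is scored higher, 0.5 on a tie, 0 otherwise.
def pairwise_auc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

pairwise_auc([1, 1, 0, 0], [0, 0, 0, 0])  # constant scores tie all pairs → 0.5
```

This is why the challenge ranks submissions by AUC: it rewards a useful ordering of the test pairs, which a constant baseline cannot provide.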
We first have to build features for each user-work pair.
For this, let's use the features extracted by the ALS algorithm earlier, together with the basic count features:
How many favorite/like/neutral/dislike ratings does this user/work have?
from sklearn.linear_model import LogisticRegression
def build_features(user_id, work_id):
return np.concatenate((als.U[user_id] * als.VT.T[work_id],
als.U[user_id],
als.VT.T[work_id],
# N.B. ratings in this dataset are love/like/neutral/dislike, so the
# 'favorite' counts below are always zero
[nb[('user', user_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike']],
[nb[('work', work_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike']]))
X_train_reg = [build_features(user_id, work_id) for user_id, work_id in X_train]
X_test_reg = [build_features(user_id, work_id) for user_id, work_id in X_test]
clf = LogisticRegression()
clf.fit(X_train_reg, y_train) # 2 s
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
y_pred_binary = clf.predict(X_test_reg)
y_pred = clf.predict_proba(X_test_reg)[:, 1]
Counter(y_pred_binary)
Counter({0: 69450, 1: 30565})
So it predicted 69k wontsee and 31k willsee.
accuracy_score(y_test, y_pred_binary)
0.653641953706944
roc_auc_score(y_test, y_pred)
# Best AUC: 0.70123, so ranked #18 / 33
0.69347185479684292
Logistic regression achieves 65% accuracy and 69% AUC, enough to be ranked #18.
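As a side note, readers who do not have `truth.csv` can still compare classifiers by holding out part of the training pairs for validation. A minimal sketch, where the 80/20 ratio and the seed are arbitrary assumptions:

```python
import numpy as np

def holdout_split(X, y, test_frac=0.2, seed=42):
    """Shuffle the row indices once and split X, y into train/validation parts."""
    idx = np.random.default_rng(seed).permutation(len(X))
    cut = int(len(X) * (1 - test_frac))
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]
```

One would then fit on the first part and measure accuracy/AUC on the second, instead of on the hidden test set (here `X` and `y` are assumed to be NumPy arrays, so the feature lists above would need an `np.array(...)` wrap first).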
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=300)
gbc.fit(X_train_reg, y_train) # 7 s if 100 estimators, 18 s if 200, 20 s if 300
GradientBoostingClassifier(criterion='friedman_mse', init=None, learning_rate=0.1, loss='deviance', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=300, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False)
y_pred_binary = gbc.predict(X_test_reg)
accuracy_score(y_test, y_pred_binary)
# Accuracy is 0.72589 if 100 estimators, 0.73984 if 200, 0.74768 if 300
0.74767784832275164
y_pred = gbc.predict_proba(X_test_reg)[:, 1]
roc_auc_score(y_test, y_pred)
# Best AUC: 0.78795 if 100 estimators, 0.80553 if 200, 0.81260 if 300
0.81260462171306735
This nonlinear classifier achieves 75% accuracy and 81% AUC, enough to be ranked #8.
Remember, the winning solution by GeniusIke had 86% AUC.
It can help to track where the mistakes were: what did the classifier get wrong?
nb_errors = Counter()
for (user_id, work_id), y_p, y_t in zip(X_test, y_pred_binary, y_test):
if y_p != y_t:
nb_errors[('user', user_id)] += 1
nb_errors[('work', work_id)] += 1
for (error_type, error_id), mistakes in nb_errors.most_common(5):
if error_type == 'user':
print('User #{} got {} mistakes'.format(error_id, mistakes))
print('They rated', {choice: nb[('user', error_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike', 'willsee', 'wontsee']})
else:
print('Work #{} got {} mistakes'.format(error_id, mistakes))
print('It was rated', {choice: nb[('work', error_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike', 'willsee', 'wontsee']})
User #425 got 274 mistakes
They rated {'favorite': 0, 'like': 304, 'neutral': 32, 'dislike': 50, 'willsee': 34, 'wontsee': 81}
User #1550 got 274 mistakes
They rated {'favorite': 0, 'like': 115, 'neutral': 162, 'dislike': 106, 'willsee': 31, 'wontsee': 111}
User #130 got 272 mistakes
They rated {'favorite': 0, 'like': 279, 'neutral': 17, 'dislike': 12, 'willsee': 31, 'wontsee': 43}
User #1799 got 261 mistakes
They rated {'favorite': 0, 'like': 133, 'neutral': 59, 'dislike': 38, 'willsee': 25, 'wontsee': 38}
User #459 got 253 mistakes
They rated {'favorite': 0, 'like': 115, 'neutral': 132, 'dislike': 8, 'willsee': 28, 'wontsee': 37}
Quite surprisingly, this solution got ranked #5 without using the watched dataset.
We try to reproduce it here.
The idea: for each user (and later each work), compute the rate of willsee answers in the training set and use it as the prediction.
y_pred_bc = []
for user_id in X_test[:, 0]:
y_pred_bc.append(1 if nb[('user', user_id, 'willsee')] >= nb[('user', user_id, 'wontsee')] else 0)
roc_auc_score(y_test, y_pred_bc)
0.72802620226714432
y_pred_bc = []
for user_id, work_id in X_test:
user_yes, user_no = nb[('user', user_id, 'willsee')], nb[('user', user_id, 'wontsee')]
work_yes, work_no = nb[('work', work_id, 'willsee')], nb[('work', work_id, 'wontsee')]
if user_yes + user_no > 0:
y_pred_bc.append(user_yes / (user_yes + user_no))
else:
y_pred_bc.append(0)
roc_auc_score(y_test, y_pred_bc)
0.77708060193706996
y_pred_bc = []
for user_id, work_id in X_test:
user_yes, user_no = nb[('user', user_id, 'willsee')], nb[('user', user_id, 'wontsee')]
work_yes, work_no = nb[('work', work_id, 'willsee')], nb[('work', work_id, 'wontsee')]
user_rate = user_yes / (user_yes + user_no) if user_yes + user_no > 0 else 0
work_rate = work_yes / (work_yes + work_no) if work_yes + work_no > 0 else 0
if user_yes + user_no > 0:
y_pred_bc.append(0.73 * user_rate + 0.27 * work_rate)
elif work_yes + work_no > 0:
y_pred_bc.append(work_rate)
else:
y_pred_bc.append(0)
roc_auc_score(y_test, y_pred_bc)
0.81446118251383193
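These rates are brittle for users or works with few willsee/wontsee ratings: a single rating already pins the rate to 0 or 1, and unseen users fall back to a hard 0. A possible refinement, not part of BC's solution, is Laplace smoothing, which pulls sparse rates toward 0.5:

```python
# Laplace (add-alpha) smoothing: with no data the rate is 0.5, and it moves
# toward the empirical rate yes / (yes + no) as the counts grow.
# alpha=1.0 is an arbitrary choice here.
def smoothed_rate(yes, no, alpha=1.0):
    return (yes + alpha) / (yes + no + 2 * alpha)

smoothed_rate(0, 0)   # no data → 0.5 instead of a hard 0
```

For example, `smoothed_rate(user_yes, user_no)` could replace the `user_rate` expression above, removing the need for the zero-count fallback branches.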
Hope you had fun competing!