Mangaki Data Challenge

Here we describe some baseline strategies to tackle the Mangaki Data Challenge.

Load the data

In [1]:
import pandas as pd

df_watched = pd.read_csv('../data/mdc/watched.csv')
df_train = pd.read_csv('../data/mdc/train.csv')
df_test = pd.read_csv('../data/mdc/test.csv')
In [2]:
df_watched.head()
Out[2]:
   user_id  work_id   rating
0      717     8025  dislike
1     1106     1027  neutral
2     1970     3949  neutral
3     1685     9815     like
4     1703     3482     like
In [3]:
df_train.head()
Out[3]:
   user_id  work_id   rating
0       50     4041  wontsee
1      508     1713  wontsee
2     1780     7053  willsee
3      658     8853  wontsee
4     1003     9401  wontsee
In [4]:
df_test.head()
Out[4]:
   user_id  work_id
0      486     1086
1     1509     3296
2      617     1086
3      270     9648
4      459     3647

Compute basic count features

In [6]:
from collections import Counter
import numpy as np

# Count, for every user and every work, how many times each rating choice occurs
nb = Counter()
for user_id, work_id, choice in np.array(pd.concat([df_watched, df_train])):
    nb[('user', user_id, choice)] += 1
    nb[('work', work_id, choice)] += 1
In [7]:
nb.most_common(5)
Out[7]:
[(('user', 488, 'neutral'), 1119),
 (('work', 9815, 'like'), 1050),
 (('work', 991, 'like'), 927),
 (('work', 4487, 'like'), 893),
 (('work', 1701, 'like'), 822)]

For example, this means that user 488 rated 1119 works as neutral, and work 9815 was liked 1050 times.
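
You can check this directly by querying the Counter:

nb[('user', 488, 'neutral')]  # 1119
nb[('work', 9815, 'like')]    # 1050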

This will be useful: BC's solution (ranked #5) was based solely on these basic count features, without even using the watched dataset!

Embed ratings into (ad-hoc) values

In [9]:
rating_values = {'love': 2, 'like': 2, 'dislike': -2, 'neutral': -1, 'willsee': 1, 'wontsee': 0}
In [10]:
df_watched['value'] = df_watched['rating'].map(rating_values)
In [12]:
df_train['value'] = df_train['rating'].map(rating_values)

This adds a numeric value column to both datasets.

In [13]:
df_watched.head()
Out[13]:
   user_id  work_id   rating  value
0      717     8025  dislike     -2
1     1106     1027  neutral     -1
2     1970     3949  neutral     -1
3     1685     9815     like      2
4     1703     3482     like      2
In [14]:
X_watched = np.array(df_watched[['user_id', 'work_id']])
y_watched = df_watched['value']
y_text = df_watched['rating']
X_train = np.array(df_train[['user_id', 'work_id']])
y_train = df_train['value']
X_test = np.array(df_test)
print('Watched dataset:', X_watched.shape, '/'.join(set(y_text)))
print('Train dataset:', X_train.shape, 'willsee/wontsee')
print('Test dataset:', X_test.shape)
Watched dataset: (198970, 2) dislike/love/neutral/like
Train dataset: (11112, 2) willsee/wontsee
Test dataset: (100015, 2)
In [15]:
# Ids start at 0, so the number of users/works is 1 + the max id seen in watched
nb_users = 1 + max(df_watched['user_id'])
nb_works = 1 + max(df_watched['work_id'])
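
This only scans the watched dataset, so it silently assumes that every id in train and test also appears there. A minimal sanity check of that assumption:

assert df_train['user_id'].max() < nb_users and df_train['work_id'].max() < nb_works
assert df_test['user_id'].max() < nb_users and df_test['work_id'].max() < nb_works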

Train Alternating Least Squares on the watched dataset

In [16]:
from mangaki.algo.als import MangakiALS

als = MangakiALS(20)
als.set_parameters(nb_users, nb_works)
als.fit(X_watched, y_watched)
Computing M: (1983 × 9897)
Chrono: fill and center matrix [0q, 1619ms]
Shapes (1983, 20) (20, 9897)
Chrono: factor matrix [0q, 9049ms]
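
As the log shows, ALS factors the centered rating matrix into a user embedding als.U of shape (1983, 20) and a work embedding als.VT of shape (20, 9897). A sketch of what a reconstructed rating looks like, using the same attributes as build_features below (up to the per-user centering mentioned in the log):

user_id, work_id = 717, 8025
als.U[user_id].dot(als.VT[:, work_id])  # reconstructed (centered) rating value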

Classifier #0: Dummy classifier with constant prediction (AUC = 50%, ranked 28/33)

In [28]:
# This part can only be run by an admin who has truth.csv; it is used for evaluation
y_test = np.array(pd.read_csv('../data/mdc/truth.csv')['rating'].map({'willsee': 1, 'wontsee': 0}))
In [29]:
# Dummy prediction that constantly predicts 0 (wontsee)
y_pred = [0] * len(y_test)
In [30]:
from sklearn.metrics import accuracy_score, roc_auc_score

accuracy_score(y_test, y_pred)
Out[30]:
0.59558066290056488

If you always predict wontsee (0), you will get 59.6% accuracy…

In [32]:
roc_auc_score(y_test, y_pred)
Out[32]:
0.5

… but only 50% AUC: a constant score cannot rank any positive example above any negative one, so the ROC curve stays on the diagonal.

Classifier #1: Logistic Regression (AUC = 69%, ranked 18/33)

We first have to build features for each user-work pair.

For this, let's use the features extracted by the ALS algorithm earlier, together with the basic count features:

How many favorite/like/neutral/dislike ratings does this user/work have? (Careful: watched.csv actually uses love rather than favorite, as the set printed above shows, so the favorite counts below are always zero.)

In [33]:
from sklearn.linear_model import LogisticRegression

def build_features(user_id, work_id):
    # Concatenate: elementwise product of the two embeddings, each embedding,
    # then the per-user and per-work rating counts.
    # Note: watched.csv contains 'love', not 'favorite', so the 'favorite' counts stay 0.
    return np.concatenate((als.U[user_id] * als.VT.T[work_id],
                           als.U[user_id],
                           als.VT.T[work_id],
                           [nb[('user', user_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike']],
                           [nb[('work', work_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike']]))
In [34]:
X_train_reg = [build_features(user_id, work_id) for user_id, work_id in X_train]
X_test_reg = [build_features(user_id, work_id) for user_id, work_id in X_test]
In [119]:
clf = LogisticRegression()
clf.fit(X_train_reg, y_train)  # 2 s
Out[119]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [120]:
y_pred_binary = clf.predict(X_test_reg)
In [121]:
y_pred = clf.predict_proba(X_test_reg)[:, 1]
In [124]:
Counter(y_pred_binary)
Out[124]:
Counter({0: 69450, 1: 30565})

So it predicted 69k wontsee and 31k willsee.

In [128]:
accuracy_score(y_test, y_pred_binary)
Out[128]:
0.653641953706944
In [130]:
roc_auc_score(y_test, y_pred)
# Best AUC: 0.70123, so ranked #18 / 33
Out[130]:
0.69347185479684292

Logistic regression achieves 65% accuracy and 69% AUC, enough to be ranked #18.
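
One caveat: the count features live on a much larger scale than the ALS embeddings, which can hurt an L2-penalized linear model. A minimal sketch of a scaled variant (untested here, so the scores above may shift):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize all features before the logistic regression
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression())
scaled_clf.fit(X_train_reg, y_train)
y_pred_scaled = scaled_clf.predict_proba(X_test_reg)[:, 1]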

Classifier #2: Gradient Boosting Trees (AUC = 81%, ranked 8/33)

In [144]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=300)
gbc.fit(X_train_reg, y_train)  # 7 s if 100 estimators, 18 s if 200, 20 s if 300
Out[144]:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=300, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
In [145]:
y_pred_binary = gbc.predict(X_test_reg)
accuracy_score(y_test, y_pred_binary)
# Accuracy is 0.72589 if 100 estimators, 0.73984 if 200, 0.74768 if 300
Out[145]:
0.74767784832275164
In [146]:
y_pred = gbc.predict_proba(X_test_reg)[:, 1]
In [148]:
roc_auc_score(y_test, y_pred)
# Best AUC: 0.78795 if 100 estimators, 0.80553 if 200, 0.81260 if 300
Out[148]:
0.81260462171306735

This nonlinear classifier achieves 75% accuracy and 81% AUC, enough to be ranked #8.

Remember, the winning solution by GeniusIke had 86% AUC.
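
To understand which of the 68 features (20 + 20 + 20 + 4 + 4) drive this gain, you can inspect the fitted model's feature_importances_; a quick sketch, grouping them in the same order as build_features:

importances = gbc.feature_importances_
for name, chunk in [('U * V', importances[:20]), ('U', importances[20:40]),
                    ('V', importances[40:60]), ('user counts', importances[60:64]),
                    ('work counts', importances[64:68])]:
    print(name, chunk.sum())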

Tip: Locate the errors

It can help to track where the mistakes were. What did the classifier get wrong?

In [158]:
nb_errors = Counter()
for (user_id, work_id), y_p, y_t in zip(X_test, y_pred_binary, y_test):
    if y_p != y_t:
        nb_errors[('user', user_id)] += 1
        nb_errors[('work', work_id)] += 1

for (error_type, error_id), mistakes in nb_errors.most_common(5):
    if error_type == 'user':
        print('User #{} got {} mistakes'.format(error_id, mistakes))
        print('They rated', {choice: nb[('user', error_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike', 'willsee', 'wontsee']})
    else:
        print('Work #{} got {} mistakes'.format(error_id, mistakes))
        print('It was rated', {choice: nb[('work', error_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike', 'willsee', 'wontsee']})
User #425 got 274 mistakes
They rated {'favorite': 0, 'like': 304, 'neutral': 32, 'dislike': 50, 'willsee': 34, 'wontsee': 81}
User #1550 got 274 mistakes
They rated {'favorite': 0, 'like': 115, 'neutral': 162, 'dislike': 106, 'willsee': 31, 'wontsee': 111}
User #130 got 272 mistakes
They rated {'favorite': 0, 'like': 279, 'neutral': 17, 'dislike': 12, 'willsee': 31, 'wontsee': 43}
User #1799 got 261 mistakes
They rated {'favorite': 0, 'like': 133, 'neutral': 59, 'dislike': 38, 'willsee': 25, 'wontsee': 38}
User #459 got 253 mistakes
They rated {'favorite': 0, 'like': 115, 'neutral': 132, 'dislike': 8, 'willsee': 28, 'wontsee': 37}

Classifier #3: BC's solution (AUC = 82.6%, ranked #5)

Quite surprisingly, this solution got ranked #5 without using the watched dataset.

We try to reproduce it here.

The idea: compute average willsee rates from the training set and use them for prediction.

Variant 1: Predict 1 if the user rated willsee at least as often as wontsee, 0 otherwise (AUC = 72.8%, ranked 17/33)

In [177]:
y_pred_bc = []
for user_id in X_test[:, 0]:
    y_pred_bc.append(1 if nb[('user', user_id, 'willsee')] >= nb[('user', user_id, 'wontsee')] else 0)
In [178]:
roc_auc_score(y_test, y_pred_bc)
Out[178]:
0.72802620226714432

Variant 2: Predict the willsee rate of the user (AUC = 77.7%, ranked 13/33)

In [179]:
y_pred_bc = []
for user_id in X_test[:, 0]:
    user_yes, user_no = nb[('user', user_id, 'willsee')], nb[('user', user_id, 'wontsee')]
    if user_yes + user_no > 0:
        y_pred_bc.append(user_yes / (user_yes + user_no))
    else:
        y_pred_bc.append(0)  # fallback for users with no willsee/wontsee ratings
In [181]:
roc_auc_score(y_test, y_pred_bc)
Out[181]:
0.77708060193706996
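
For users with no willsee/wontsee ratings at all, this falls back to 0. A possible refinement, not part of BC's solution, is add-one (Laplace) smoothing, which also pulls small-sample rates toward 1/2:

y_pred_smooth = []
for user_id in X_test[:, 0]:
    yes = nb[('user', user_id, 'willsee')]
    no = nb[('user', user_id, 'wontsee')]
    y_pred_smooth.append((yes + 1) / (yes + no + 2))  # add-one smoothing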

Variant 3: Predict some combination of the willsee rates of the user and the work (AUC = 81%, ranked 8/33)

In [182]:
y_pred_bc = []
for user_id, work_id in X_test:
    user_yes, user_no = nb[('user', user_id, 'willsee')], nb[('user', user_id, 'wontsee')]
    work_yes, work_no = nb[('work', work_id, 'willsee')], nb[('work', work_id, 'wontsee')]
    user_rate = user_yes / (user_yes + user_no) if user_yes + user_no > 0 else 0
    work_rate = work_yes / (work_yes + work_no) if work_yes + work_no > 0 else 0
    if user_yes + user_no > 0:
        y_pred_bc.append(0.73 * user_rate + 0.27 * work_rate)
    elif work_yes + work_no > 0:
        y_pred_bc.append(work_rate)
    else:
        y_pred_bc.append(0)
In [184]:
roc_auc_score(y_test, y_pred_bc)
Out[184]:
0.81446118251383193
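
The 0.73/0.27 mixture looks hand-tuned. If you have access to truth.csv like the cells above, a simple grid search over the weight goes as follows (in the actual challenge you would tune it on a held-out part of df_train instead):

def mix_predict(alpha):
    # Same rule as Variant 3, with a free mixture weight alpha
    preds = []
    for user_id, work_id in X_test:
        user_yes, user_no = nb[('user', user_id, 'willsee')], nb[('user', user_id, 'wontsee')]
        work_yes, work_no = nb[('work', work_id, 'willsee')], nb[('work', work_id, 'wontsee')]
        user_rate = user_yes / (user_yes + user_no) if user_yes + user_no > 0 else 0
        work_rate = work_yes / (work_yes + work_no) if work_yes + work_no > 0 else 0
        if user_yes + user_no > 0:
            preds.append(alpha * user_rate + (1 - alpha) * work_rate)
        elif work_yes + work_no > 0:
            preds.append(work_rate)
        else:
            preds.append(0)
    return preds

best_alpha = max(np.linspace(0, 1, 21), key=lambda alpha: roc_auc_score(y_test, mix_predict(alpha)))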

Hope you had fun competing!