Here we describe some baseline strategies for tackling the Mangaki Data Challenge.
jj@mangaki.fr
import pandas as pd
df_watched = pd.read_csv('../data/mdc/watched.csv')
df_train = pd.read_csv('../data/mdc/train.csv')
df_test = pd.read_csv('../data/mdc/test.csv')
df_watched.head()
| | user_id | work_id | rating |
|---|---|---|---|
| 0 | 717 | 8025 | dislike |
| 1 | 1106 | 1027 | neutral |
| 2 | 1970 | 3949 | neutral |
| 3 | 1685 | 9815 | like |
| 4 | 1703 | 3482 | like |
df_train.head()
| | user_id | work_id | rating |
|---|---|---|---|
| 0 | 50 | 4041 | wontsee |
| 1 | 508 | 1713 | wontsee |
| 2 | 1780 | 7053 | willsee |
| 3 | 658 | 8853 | wontsee |
| 4 | 1003 | 9401 | wontsee |
df_test.head()
| | user_id | work_id |
|---|---|---|
| 0 | 486 | 1086 |
| 1 | 1509 | 3296 |
| 2 | 617 | 1086 |
| 3 | 270 | 9648 |
| 4 | 459 | 3647 |
from collections import Counter
import numpy as np
nb = Counter()
for user_id, work_id, choice in np.array(pd.concat([df_watched, df_train])):
nb[('user', user_id, choice)] += 1
nb[('work', work_id, choice)] += 1
nb.most_common(5)
[(('user', 488, 'neutral'), 1119), (('work', 9815, 'like'), 1050), (('work', 991, 'like'), 927), (('work', 4487, 'like'), 893), (('work', 1701, 'like'), 822)]
For example, this means that user 488 rated 1119 works as neutral, and work 9815 was liked 1050 times.
This will be useful: BC's solution (ranked #5) was solely based on those basic count features, not even on the watched dataset!
rating_values = {'love': 2, 'like': 2, 'dislike': -2, 'neutral': -1, 'willsee': 1, 'wontsee': 0}
df_watched['value'] = df_watched['rating'].map(rating_values)
df_train['value'] = df_train['rating'].map(rating_values)
This adds a numeric value column to both datasets.
df_watched.head()
| | user_id | work_id | rating | value |
|---|---|---|---|---|
| 0 | 717 | 8025 | dislike | -2 |
| 1 | 1106 | 1027 | neutral | -1 |
| 2 | 1970 | 3949 | neutral | -1 |
| 3 | 1685 | 9815 | like | 2 |
| 4 | 1703 | 3482 | like | 2 |
X_watched = np.array(df_watched[['user_id', 'work_id']])
y_watched = df_watched['value']
y_text = df_watched['rating']
X_train = np.array(df_train[['user_id', 'work_id']])
y_train = df_train['value']
X_test = np.array(df_test)
print('Watched dataset:', X_watched.shape, '/'.join(set(y_text)))
print('Train dataset:', X_train.shape, 'willsee/wontsee')
print('Test dataset:', X_test.shape)
Watched dataset: (198970, 2) dislike/love/neutral/like
Train dataset: (11112, 2) willsee/wontsee
Test dataset: (100015, 2)
# We assume every user and work in the train/test sets also appears in watched.csv
nb_users = 1 + max(df_watched['user_id'])
nb_works = 1 + max(df_watched['work_id'])
from mangaki.algo.als import MangakiALS
als = MangakiALS(20)
als.set_parameters(nb_users, nb_works)
als.fit(X_watched, y_watched)
Computing M: (1983 × 9897)
Chrono: fill and center matrix [0q, 1619ms]
Shapes (1983, 20) (20, 9897)
Chrono: factor matrix [0q, 9049ms]
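For readers without the `mangaki` package, the alternating-least-squares factorization used above can be sketched in plain NumPy. This is a minimal illustration, not the actual `MangakiALS` implementation; the rank, regularization strength, and iteration count below are arbitrary choices.

```python
import numpy as np

def als(R, mask, rank=2, reg=0.1, n_iters=10, seed=0):
    """Factor R ≈ U @ V.T using only the entries where mask is True,
    alternating ridge-regression solves for user and work factors."""
    rng = np.random.default_rng(seed)
    n_users, n_works = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_works, rank))
    for _ in range(n_iters):
        for u in range(n_users):           # fix V, solve for each user row
            idx = mask[u]
            if idx.any():
                Vu = V[idx]
                A = Vu.T @ Vu + reg * np.eye(rank)
                U[u] = np.linalg.solve(A, Vu.T @ R[u, idx])
        for i in range(n_works):           # fix U, solve for each work row
            idx = mask[:, i]
            if idx.any():
                Ui = U[idx]
                A = Ui.T @ Ui + reg * np.eye(rank)
                V[i] = np.linalg.solve(A, Ui.T @ R[idx, i])
    return U, V
```

In the real setting, `R` would be the sparse user × work matrix of rating values and `mask` its observed entries; `U` and `V` then play the roles of `als.U` and `als.VT.T` used below.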
# This part is only executable by an admin who has truth.csv; it is used for evaluation
y_test = np.array(pd.read_csv('../data/mdc/truth.csv')['rating'].map({'willsee': 1, 'wontsee': 0}))
# Dummy prediction that constantly predicts 0 (wontsee)
y_pred = [0] * len(y_test)
from sklearn.metrics import accuracy_score, roc_auc_score
accuracy_score(y_test, y_pred)
0.59558066290056488
If you always predict wontsee (0), you will get 59.6% accuracy…
roc_auc_score(y_test, y_pred)
0.5
… but only 50% AUC.
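To see why: AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting one half. A constant predictor ties every pair, so it scores exactly 0.5 no matter the class balance. A pure-Python illustration of this pairwise definition:

```python
# AUC as a pairwise ranking probability: for every (positive, negative) pair,
# count 1 if the positive is scored higher, 0.5 on a tie, 0 otherwise.
def pairwise_auc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

pairwise_auc([1, 1, 0, 0], [0, 0, 0, 0])  # constant scores tie all pairs → 0.5
```

This is why the challenge ranks submissions by AUC: it rewards a useful ordering of the test pairs, which a constant baseline cannot provide.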
We first have to build features for each user-work pair.
For this, let's use the features extracted by the ALS algorithm earlier, together with the basic count features:
How many favorite/like/neutral/dislike ratings does this user/work have?
from sklearn.linear_model import LogisticRegression
def build_features(user_id, work_id):
return np.concatenate((als.U[user_id] * als.VT.T[work_id],
als.U[user_id],
als.VT.T[work_id],
# N.B. ratings in this dataset are love/like/neutral/dislike, so the
# 'favorite' counts below are always zero
[nb[('user', user_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike']],
[nb[('work', work_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike']]))
X_train_reg = [build_features(user_id, work_id) for user_id, work_id in X_train]
X_test_reg = [build_features(user_id, work_id) for user_id, work_id in X_test]
clf = LogisticRegression()
clf.fit(X_train_reg, y_train) # 2 s
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
y_pred_binary = clf.predict(X_test_reg)
y_pred = clf.predict_proba(X_test_reg)[:, 1]
Counter(y_pred_binary)
Counter({0: 69450, 1: 30565})
So it predicted 69k wontsee and 31k willsee.
accuracy_score(y_test, y_pred_binary)
0.653641953706944
roc_auc_score(y_test, y_pred)
# Best AUC: 0.70123, so ranked #18 / 33
0.69347185479684292
Logistic regression achieves 65% accuracy and 69% AUC, enough to be ranked #18.
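As a side note, readers who do not have `truth.csv` can still compare classifiers by holding out part of the training pairs for validation. A minimal sketch, where the 80/20 ratio and the seed are arbitrary assumptions:

```python
import numpy as np

def holdout_split(X, y, test_frac=0.2, seed=42):
    """Shuffle the row indices once and split X, y into train/validation parts."""
    idx = np.random.default_rng(seed).permutation(len(X))
    cut = int(len(X) * (1 - test_frac))
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]
```

One would then fit on the first part and measure accuracy/AUC on the second, instead of on the hidden test set (here `X` and `y` are assumed to be NumPy arrays, so the feature lists above would need an `np.array(...)` wrap first).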
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=300)
gbc.fit(X_train_reg, y_train) # 7 s if 100 estimators, 18 s if 200, 20 s if 300
GradientBoostingClassifier(criterion='friedman_mse', init=None, learning_rate=0.1, loss='deviance', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=300, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False)
y_pred_binary = gbc.predict(X_test_reg)
accuracy_score(y_test, y_pred_binary)
# Accuracy is 0.72589 if 100 estimators, 0.73984 if 200, 0.74768 if 300
0.74767784832275164
y_pred = gbc.predict_proba(X_test_reg)[:, 1]
roc_auc_score(y_test, y_pred)
# Best AUC: 0.78795 if 100 estimators, 0.80553 if 200, 0.81260 if 300
0.81260462171306735
This nonlinear classifier achieves 75% accuracy and 81% AUC, enough to be ranked #8.
Remember, the winning solution by GeniusIke had 86% AUC.
It can help to track where the mistakes were: what did the classifier get wrong?
nb_errors = Counter()
for (user_id, work_id), y_p, y_t in zip(X_test, y_pred_binary, y_test):
if y_p != y_t:
nb_errors[('user', user_id)] += 1
nb_errors[('work', work_id)] += 1
for (error_type, error_id), mistakes in nb_errors.most_common(5):
if error_type == 'user':
print('User #{} got {} mistakes'.format(error_id, mistakes))
print('They rated', {choice: nb[('user', error_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike', 'willsee', 'wontsee']})
else:
print('Work #{} got {} mistakes'.format(error_id, mistakes))
print('It was rated', {choice: nb[('work', error_id, choice)] for choice in ['favorite', 'like', 'neutral', 'dislike', 'willsee', 'wontsee']})
User #425 got 274 mistakes
They rated {'favorite': 0, 'like': 304, 'neutral': 32, 'dislike': 50, 'willsee': 34, 'wontsee': 81}
User #1550 got 274 mistakes
They rated {'favorite': 0, 'like': 115, 'neutral': 162, 'dislike': 106, 'willsee': 31, 'wontsee': 111}
User #130 got 272 mistakes
They rated {'favorite': 0, 'like': 279, 'neutral': 17, 'dislike': 12, 'willsee': 31, 'wontsee': 43}
User #1799 got 261 mistakes
They rated {'favorite': 0, 'like': 133, 'neutral': 59, 'dislike': 38, 'willsee': 25, 'wontsee': 38}
User #459 got 253 mistakes
They rated {'favorite': 0, 'like': 115, 'neutral': 132, 'dislike': 8, 'willsee': 28, 'wontsee': 37}
Quite surprisingly, this solution got ranked #5 without using the watched dataset.
We try to reproduce it here.
The idea: for each user (and later each work), compute the rate of willsee answers in the training set and use it as the prediction.
y_pred_bc = []
for user_id in X_test[:, 0]:
y_pred_bc.append(1 if nb[('user', user_id, 'willsee')] >= nb[('user', user_id, 'wontsee')] else 0)
roc_auc_score(y_test, y_pred_bc)
0.72802620226714432
y_pred_bc = []
for user_id, work_id in X_test:
user_yes, user_no = nb[('user', user_id, 'willsee')], nb[('user', user_id, 'wontsee')]
work_yes, work_no = nb[('work', work_id, 'willsee')], nb[('work', work_id, 'wontsee')]
if user_yes + user_no > 0:
y_pred_bc.append(user_yes / (user_yes + user_no))
else:
y_pred_bc.append(0)
roc_auc_score(y_test, y_pred_bc)
0.77708060193706996
y_pred_bc = []
for user_id, work_id in X_test:
user_yes, user_no = nb[('user', user_id, 'willsee')], nb[('user', user_id, 'wontsee')]
work_yes, work_no = nb[('work', work_id, 'willsee')], nb[('work', work_id, 'wontsee')]
user_rate = user_yes / (user_yes + user_no) if user_yes + user_no > 0 else 0
work_rate = work_yes / (work_yes + work_no) if work_yes + work_no > 0 else 0
if user_yes + user_no > 0:
y_pred_bc.append(0.73 * user_rate + 0.27 * work_rate)
elif work_yes + work_no > 0:
y_pred_bc.append(work_rate)
else:
y_pred_bc.append(0)
roc_auc_score(y_test, y_pred_bc)
0.81446118251383193
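These rates are brittle for users or works with few willsee/wontsee ratings: a single rating already pins the rate to 0 or 1, and unseen users fall back to a hard 0. A possible refinement, not part of BC's solution, is Laplace smoothing, which pulls sparse rates toward 0.5:

```python
# Laplace (add-alpha) smoothing: with no data the rate is 0.5, and it moves
# toward the empirical rate yes / (yes + no) as the counts grow.
# alpha=1.0 is an arbitrary choice here.
def smoothed_rate(yes, no, alpha=1.0):
    return (yes + alpha) / (yes + no + 2 * alpha)

smoothed_rate(0, 0)   # no data → 0.5 instead of a hard 0
```

For example, `smoothed_rate(user_yes, user_no)` could replace the `user_rate` expression above, removing the need for the zero-count fallback branches.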
Hope you had fun competing!