# coding: utf-8
# In[1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Regression metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
np.random.seed(17)
# ## Intro
# ** Leaderboard probing ** - it's a competition specific technick tightly connected with data leakages. There are two tipes of LB probing:
# * extraction ground thruth from public part of LB by changing small subset of predictions in submission;
# * extraction information about private part of LB by submitting to public LB (based on consistent categories).
#
# In this tutorial we will concentrate on the first type of probing. In some cases specific exploit makes possible to find all y-values of the public leaderboard (LB) score. In other cases we can obtain some imformation about public LB y-values distribution.
# ## Creating test environment functions
# Now, we are going to create some functions, which will help us in testing exploits.
# In[2]:
def generate_leaderboards(values):
"""
Generate public and private leaderboard from values.
"""
df = pd.DataFrame(values, columns=['target'])
public_lb, private_pb = train_test_split(df, test_size=0.3, shuffle=False)
return public_lb, private_pb
# In[3]:
def generate_submission(values, leaderboard):
"""
Generate sample submission from values for leaderboard.
"""
sample_submission = pd.DataFrame(leaderboard, copy=True)
sample_submission['target'] = values
return sample_submission
# In[4]:
def generate_data(values):
"""
Generate experimental environment: public and private leaderboards, zero sample submissio from values.
"""
public_lb, private_pb = generate_leaderboards(values)
zero_submission = generate_submission(0, public_lb)
n = public_lb.size
return public_lb, private_pb, zero_submission, n
# In[5]:
def make_submission(predicted_values, metric, leaderboard):
"""
Evaluate predicted values with metric for leaderboard.
"""
return metric(leaderboard['target'], predicted_values['target'])
# ## Regression evaluation metrics
# We gonna iterate through several main regression metrics and find some vulnerabilities related to each of them.
# ### 1. MAE
# Let's start with mean absolute error metric:
# $$\large{MAE(y, \hat{y}) = \frac{\sum_{i=1}^n |y_i - \hat{y}_i|}{n} }$$
# where
# - $y$ - vector with true components;
# - $y_i$ - true y-value;
# - $\hat{y}$ - vector with predicted (i.e. submitted) values;
# - $\hat{y_i}$ - predicted y-value;
# - $n$ - number of data points.
# In case of all target values are **non negative** we can make zero submission to **obtain mean target value**:
# $$\large{MAE(y, 0) = \frac{\sum_{i=1}^n |y_i - 0|}{n} = \frac{\sum_{i=1}^n y_i}{n} }$$
# In[6]:
# Generate environment with non negative target values
target_values = np.random.randint(0, 10, size=1000)
public_lb, private_lb, sample_submission, n = generate_data(target_values)
sample_submission.head(3)
# In[7]:
# Make zero submission
p_z = make_submission(sample_submission, mean_absolute_error, public_lb)
print('Zero submission score:', p_z)
print('Public leaderbord target mean:', public_lb['target'].mean())
# Using this information we can modify our predictions to improve public LB score.
# ### 2. MSE
# Next probing method makes possible to **find all y-values** for all data points used in computation of public LB score. Now we will talk about mean squared error:
#
# $$\large{ MSE(y, \hat{y}) = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n} }$$
# At first, let's make all zeros submission and denote result score as $p_z$:
# $$\large{ p_{z} = \frac{\sum_{i=1}^n (y_i - 0)^2}{n} } = \frac{(y_1 - 0)^2}{n} + \frac{\sum_{i=2}^n (y_i - 0)^2}{n}$$
# In[8]:
# Generate environment
target_values = np.random.randint(-10, 10, size=1000)
public_lb, private_lb, sample_submission, n = generate_data(target_values)
sample_submission.head(3)
# In[9]:
# Make zero submission
p_z = make_submission(sample_submission, mean_squared_error, public_lb)
print('Zero submission score:', p_z)
# Other submissions will contain exactly one value differs from zero submission. Here we will change first y-value from 0 to 100:
# In[10]:
# Set first y-value to 100
sample_submission['target'][0] = 100
sample_submission.head(3)
# We make a new submission where 1st y-value set to 100 and denote result score as $p^1_{100}$:
# $$ \large{ p_{100}^{1} = \frac{(y_1 - 100)^2}{n} + \frac{\sum_{i=2}^n (y_i - 0)^2}{n} }$$
# In[11]:
p_1_100 = make_submission(sample_submission, mean_squared_error, public_lb)
print('Submission score:', p_1_100)
# Now, we have a system of last two equations ($p_z$ and $p^1_{100}$) which can be solved over $y_1$, by subtracting one from another:
# $$\large{ y_1 = \frac{100^2 - n*(p_{100}^1 - p_{z})}{2*100} }$$
# In[12]:
# Calculate y_1
y1 = (100**2 - n * (p_1_100-p_z)) / 200
print('Obtained y_1:', y1)
print('Actual y_1:', public_lb['target'][0])
# More generally we can obtain $y_i$ by this formula:
# $$\large{ y_i = \frac{100^2 - n*(p_{100}^i - p_{z})}{2*100} }$$
# For example, for $y_2$:
# In[13]:
# Set second y-value to 100
sample_submission['target'][0] = 0
sample_submission['target'][1] = 100
# Make submission
p_2_100 = make_submission(sample_submission, mean_squared_error, public_lb)
# Calculate y_2
y2 = (100**2 - n * (p_2_100-p_z)) / 200
print('Obtained y_2:', y2)
print('Actual y_2:', public_lb['target'][1])
# So, in this method we can use exactly one LB probe in order to get the true y-value (up to numerical accuracy) of one data point. It's worth mentioning, that applying such technic for mean absolute error metric (MAE) and root mean squared metric (RMSE) also makes possible to find all y-values for all data points used in computation of public LB score
# ### 3. R squared (the coefficient of determination)
# Vulnerability related to $R^2$ metric allows us to find **variance of public LB** and **find all y-values** for all data points used in computation of public LB score. Metric definition:
# $$\large{ R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y}_i)^2} }$$
# The denominator in this formula is called total sum of squares, which proportional to the variance of the data.
# $$\large{ S_{tot} = \sum_{i=1}^n (y_i - \bar{y}_i)^2 }$$
# Let's look on the variance formula:
# $$\large{ Var(y) = \frac{\sum_{i=1}^n (y_i - \bar{y}_i)^2}{n} = \frac{S_{tot}}{n} }$$
# So, if we can find $S_{tot}$ in $R^2$ metric, then we can obtain the variance of public LB.
# As always, let's start with zero submission:
# $$\large{ p_z = 1 - \frac{\sum_{i=1}^n (y_i - 0)^2}{S_{tot}} = 1 - \frac{y_1^2}{S_{tot}} - \frac{\sum_{i=2}^n (y_i - 0)^2}{S_{tot}} }$$
# In[14]:
## Generate environment
data = np.random.randint(-15, 15, size=1000)
public_lb, private_pb, sample_submission, n = generate_data(data)
sample_submission.head(3)
# In[15]:
# Make zero submission
p_z = make_submission(sample_submission, r2_score, public_lb)
print('Zero submission score:', p_z)
# Second probe will be with 1st y-value set to 100 (no suprise here):
# $$\large{ p^1_{100} = 1 - \frac{(y_1 - 100)^2}{S_{tot}} - \frac{\sum_{i=2}^n (y_i - 0)^2}{S_{tot}} }$$
# In[16]:
# Set first y-value to 100
sample_submission['target'][0] = 100
sample_submission.head(3)
# In[17]:
# Make submission
p_1_100 = make_submission(sample_submission, r2_score, public_lb)
print('Submission score:', p_1_100)
# As in the previous example, now we have a system of last two equations ($p_z$ and $p^1_{100}$) which can be solved over $y_1$:
# $$\large{ y_1 = \frac{S_{tot}(p^1_{100} - p_z) + 100^2}{2*100} }$$
# But, the only left unknown is $S_{tot}$, which can be calculated with one more submission. So we will make a submission with 1st y-value set to 200.
# $$\large{ p^1_{200} = 1 - \frac{(y_1 - 200)^2}{S_{tot}} - \frac{\sum_{i=2}^n (y_i - 0)^2}{S_{tot}} }$$
# In[18]:
# Set first y-value to 200
sample_submission['target'][0] = 200
sample_submission.head(3)
# In[19]:
# Make submission
p_1_200 = make_submission(sample_submission, r2_score, public_lb)
print('Submission score:', p_1_200)
# Then combine equation with $p_z$ and $p^1_{200}$ into system and solve it over $y_1$ (similary to provious step):
# $$\large{ y_1 = \frac{S_{tot}(p^1_{200} - p_z) + 200^2}{2*200} }$$
# Now we have two equations that can be solved over $S_{tot}$, by excluding $y_1$:
# $$\large{ S_{tot} = \frac{200^2 - 2*100^2}{2*p^1_{100} - p_z - p^1_{200}} }$$
# In[20]:
# Calculate total sum of squares
s_tot = (200.0**2 - 2*100.0**2) / (2*p_1_100 - p_z - p_1_200)
print('Total sum of squares:', s_tot)
# Now we are ready to calculate the variance:
# In[21]:
# Calculate variance
var_y = s_tot / n
print('Obtained variance of public LB:', var_y)
print('Actual variance:', public_lb['target'].var(ddof=0))
# And $y_1$:
# In[22]:
# Calculate y_1
y1 = (s_tot*(p_1_100 - p_z) + 100**2) / (2 * 100.0)
print('Obtained y_1:', y1)
print('Actual y_1:', public_lb['target'][0])
# Having now $S_{tot}$ we can find $y_i$ by this formula:
# $$\large{ y_i = \frac{S_{tot}(p^i_{100} - p_z) + 100^2}{2*100} }$$
# ## Conclusion
# So base idea behind all of described methods is to making submissions which defferent by one value from each other and then solving equations (mostly by substracting one from another).
#
# Knowledge obtained by described methods cannot be used to directly improve result in private LB, since we cannot probe the y-values from there. However if public and private LB data cames from the same distribution, obtained mean or variance of public LB can be useful.
#
# And, of course, obtaining all y-values methods are limited by some number of submissions per day.
# ## References
# 1. [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)
# 2. [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
# 3. [Coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)
# 4. [How to get the exact y-values of all data points used for the computation of the public leaderboard score in the ā€¯Mercedes-Benz Greener Manufacturingā€¯ competition on Kaggle](https://crowdstats.eu/Kaggle_MercedesBenz_LBprobing.pdf)
# 5. [How to Win a Data Science Competition: Learn from Top Kagglers (Week 2)](https://www.coursera.org/learn/competitive-data-science?specialization=aml)
# 6. [The "Perfect Score" Script](https://www.kaggle.com/olegtrott/the-perfect-score-script)