In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

# Regression metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

np.random.seed(17)


## Intro¶

Leaderboard probing - it's a competition specific technick tightly connected with data leakages. There are two tipes of LB probing:

• extraction ground thruth from public part of LB by changing small subset of predictions in submission;
• extraction information about private part of LB by submitting to public LB (based on consistent categories).

In this tutorial we will concentrate on the first type of probing. In some cases specific exploit makes possible to find all y-values of the public leaderboard (LB) score. In other cases we can obtain some imformation about public LB y-values distribution.

## Creating test environment functions¶

Now, we are going to create some functions, which will help us in testing exploits.

In [2]:
def generate_leaderboards(values):
"""
Generate public and private leaderboard from values.
"""
df = pd.DataFrame(values, columns=['target'])
public_lb, private_pb = train_test_split(df, test_size=0.3, shuffle=False)

return public_lb, private_pb

In [3]:
def generate_submission(values, leaderboard):
"""
Generate sample submission from values for leaderboard.
"""
sample_submission['target'] = values

return sample_submission

In [4]:
def generate_data(values):
"""
Generate experimental environment: public and private leaderboards, zero sample submissio from values.
"""
zero_submission = generate_submission(0, public_lb)
n = public_lb.size

return public_lb, private_pb, zero_submission, n

In [5]:
def make_submission(predicted_values, metric, leaderboard):
"""
Evaluate predicted values with metric for leaderboard.
"""


## Regression evaluation metrics¶

We gonna iterate through several main regression metrics and find some vulnerabilities related to each of them.

### 1. MAE¶

$$\large{MAE(y, \hat{y}) = \frac{\sum_{i=1}^n |y_i - \hat{y}_i|}{n} }$$

where

• $y$ - vector with true components;
• $y_i$ - true y-value;
• $\hat{y}$ - vector with predicted (i.e. submitted) values;
• $\hat{y_i}$ - predicted y-value;
• $n$ - number of data points.

In case of all target values are non negative we can make zero submission to obtain mean target value:

$$\large{MAE(y, 0) = \frac{\sum_{i=1}^n |y_i - 0|}{n} = \frac{\sum_{i=1}^n y_i}{n} }$$
In [6]:
# Generate environment with non negative target values
target_values = np.random.randint(0, 10, size=1000)
public_lb, private_lb, sample_submission, n = generate_data(target_values)


Out[6]:
target
0 0
1 0
2 0
In [7]:
# Make zero submission
p_z = make_submission(sample_submission, mean_absolute_error, public_lb)
print('Zero submission score:', p_z)

Zero submission score: 4.468571428571429


Using this information we can modify our predictions to improve public LB score.

### 2. MSE¶

Next probing method makes possible to find all y-values for all data points used in computation of public LB score. Now we will talk about mean squared error:

$$\large{ MSE(y, \hat{y}) = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n} }$$

At first, let's make all zeros submission and denote result score as $p_z$:

$$\large{ p_{z} = \frac{\sum_{i=1}^n (y_i - 0)^2}{n} } = \frac{(y_1 - 0)^2}{n} + \frac{\sum_{i=2}^n (y_i - 0)^2}{n}$$
In [8]:
# Generate environment
target_values = np.random.randint(-10, 10, size=1000)
public_lb, private_lb, sample_submission, n = generate_data(target_values)


Out[8]:
target
0 0
1 0
2 0
In [9]:
# Make zero submission
p_z = make_submission(sample_submission, mean_squared_error, public_lb)
print('Zero submission score:', p_z)

Zero submission score: 33.66571428571429


Other submissions will contain exactly one value differs from zero submission. Here we will change first y-value from 0 to 100:

In [10]:
# Set first y-value to 100
sample_submission['target'][0] = 100

Out[10]:
target
0 100
1 0
2 0

We make a new submission where 1st y-value set to 100 and denote result score as $p^1_{100}$:

$$\large{ p_{100}^{1} = \frac{(y_1 - 100)^2}{n} + \frac{\sum_{i=2}^n (y_i - 0)^2}{n} }$$
In [11]:
p_1_100 = make_submission(sample_submission, mean_squared_error, public_lb)
print('Submission score:', p_1_100)

Submission score: 48.23714285714286


Now, we have a system of last two equations ($p_z$ and $p^1_{100}$) which can be solved over $y_1$, by subtracting one from another:

$$\large{ y_1 = \frac{100^2 - n*(p_{100}^1 - p_{z})}{2*100} }$$
In [12]:
# Calculate y_1
y1 = (100**2 - n * (p_1_100-p_z)) / 200
print('Obtained y_1:', y1)
print('Actual y_1:', public_lb['target'][0])

Obtained y_1: -0.9999999999999909
Actual y_1: -1


More generally we can obtain $y_i$ by this formula:

$$\large{ y_i = \frac{100^2 - n*(p_{100}^i - p_{z})}{2*100} }$$

For example, for $y_2$:

In [13]:
# Set second y-value to 100
sample_submission['target'][0] = 0
sample_submission['target'][1] = 100

# Make submission
p_2_100 = make_submission(sample_submission, mean_squared_error, public_lb)

# Calculate y_2
y2 = (100**2 - n * (p_2_100-p_z)) / 200
print('Obtained y_2:', y2)
print('Actual y_2:', public_lb['target'][1])

Obtained y_2: 5.000000000000009
Actual y_2: 5


So, in this method we can use exactly one LB probe in order to get the true y-value (up to numerical accuracy) of one data point. It's worth mentioning, that applying such technic for mean absolute error metric (MAE) and root mean squared metric (RMSE) also makes possible to find all y-values for all data points used in computation of public LB score

### 3. R squared (the coefficient of determination)¶

Vulnerability related to $R^2$ metric allows us to find variance of public LB and find all y-values for all data points used in computation of public LB score. Metric definition:

$$\large{ R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y}_i)^2} }$$

The denominator in this formula is called total sum of squares, which proportional to the variance of the data.

$$\large{ S_{tot} = \sum_{i=1}^n (y_i - \bar{y}_i)^2 }$$

Let's look on the variance formula:

$$\large{ Var(y) = \frac{\sum_{i=1}^n (y_i - \bar{y}_i)^2}{n} = \frac{S_{tot}}{n} }$$

So, if we can find $S_{tot}$ in $R^2$ metric, then we can obtain the variance of public LB.

$$\large{ p_z = 1 - \frac{\sum_{i=1}^n (y_i - 0)^2}{S_{tot}} = 1 - \frac{y_1^2}{S_{tot}} - \frac{\sum_{i=2}^n (y_i - 0)^2}{S_{tot}} }$$
In [14]:
## Generate environment
data = np.random.randint(-15, 15, size=1000)
public_lb, private_pb, sample_submission, n = generate_data(data)


Out[14]:
target
0 0
1 0
2 0
In [15]:
# Make zero submission
p_z = make_submission(sample_submission, r2_score, public_lb)
print('Zero submission score:', p_z)

Zero submission score: -0.0029571155108647496


Second probe will be with 1st y-value set to 100 (no suprise here):

$$\large{ p^1_{100} = 1 - \frac{(y_1 - 100)^2}{S_{tot}} - \frac{\sum_{i=2}^n (y_i - 0)^2}{S_{tot}} }$$
In [16]:
# Set first y-value to 100
sample_submission['target'][0] = 100

Out[16]:
target
0 100
1 0
2 0
In [17]:
# Make submission
p_1_100 = make_submission(sample_submission, r2_score, public_lb)
print('Submission score:', p_1_100)

Submission score: -0.21438609655518914


As in the previous example, now we have a system of last two equations ($p_z$ and $p^1_{100}$) which can be solved over $y_1$:

$$\large{ y_1 = \frac{S_{tot}(p^1_{100} - p_z) + 100^2}{2*100} }$$

But, the only left unknown is $S_{tot}$, which can be calculated with one more submission. So we will make a submission with 1st y-value set to 200.

$$\large{ p^1_{200} = 1 - \frac{(y_1 - 200)^2}{S_{tot}} - \frac{\sum_{i=2}^n (y_i - 0)^2}{S_{tot}} }$$
In [18]:
# Set first y-value to 200
sample_submission['target'][0] = 200

Out[18]:
target
0 200
1 0
2 0
In [19]:
# Make submission
p_1_200 = make_submission(sample_submission, r2_score, public_lb)
print('Submission score:', p_1_200)

Submission score: -0.7903478035380036


Then combine equation with $p_z$ and $p^1_{200}$ into system and solve it over $y_1$ (similary to provious step):

$$\large{ y_1 = \frac{S_{tot}(p^1_{200} - p_z) + 200^2}{2*200} }$$

Now we have two equations that can be solved over $S_{tot}$, by excluding $y_1$:

$$\large{ S_{tot} = \frac{200^2 - 2*100^2}{2*p^1_{100} - p_z - p^1_{200}} }$$
In [20]:
# Calculate total sum of squares
s_tot = (200.0**2 - 2*100.0**2) / (2*p_1_100 - p_z - p_1_200)
print('Total sum of squares:', s_tot)

Total sum of squares: 54864.75857142858


Now we are ready to calculate the variance:

In [21]:
# Calculate variance
var_y = s_tot / n
print('Obtained variance of public LB:', var_y)
print('Actual variance:', public_lb['target'].var(ddof=0))

Obtained variance of public LB: 78.37822653061225
Actual variance: 78.37822653061235


And $y_1$:

In [22]:
# Calculate y_1
y1 = (s_tot*(p_1_100 - p_z) + 100**2) / (2 * 100.0)
print('Obtained y_1:', y1)
print('Actual y_1:', public_lb['target'][0])

Obtained y_1: -8.000000000000036
Actual y_1: -8


Having now $S_{tot}$ we can find $y_i$ by this formula:

$$\large{ y_i = \frac{S_{tot}(p^i_{100} - p_z) + 100^2}{2*100} }$$

## Conclusion¶

So base idea behind all of described methods is to making submissions which defferent by one value from each other and then solving equations (mostly by substracting one from another).

Knowledge obtained by described methods cannot be used to directly improve result in private LB, since we cannot probe the y-values from there. However if public and private LB data cames from the same distribution, obtained mean or variance of public LB can be useful.

And, of course, obtaining all y-values methods are limited by some number of submissions per day.