QRT Challenge Data 2021

Summary

This exploratory notebook is a good starting point to help you take your first steps in the challenge.

As a reminder, the aim of the 2021 QRT Challenge Data is to determine the link between two types of assets: liquid and illiquid. We provide the returns of 100 illiquid assets, and the aim is to predict, for the same day, the sign of the returns of 100 liquid assets.

In the following, we propose a very simple approach: for each liquid asset, we determine the illiquid asset with maximum absolute correlation. We then measure the $\beta$ (see definition here) between these two assets and predict the sign of the liquid return as $\operatorname{sign}(\beta \, r_{\text{illiquid}})$.

This notebook is very straightforward, but if you have any questions or comments, please ask them in the forum.

In [1]:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

Loading the data

In [2]:
X_train = pd.read_parquet('./X_train.parquet')
Y_train = pd.read_csv('./y_train.csv', index_col=0)
X_test = pd.read_parquet('./X_test.parquet')
X_train.head()
Out[2]:
ID_DAY RET_216 RET_238 RET_45 RET_295 RET_230 RET_120 RET_188 RET_260 RET_15 ... RET_122 RET_194 RET_72 RET_293 RET_281 RET_193 RET_95 RET_162 RET_297 ID_TARGET
ID
0 3316 0.004024 0.009237 0.004967 NaN 0.01704 0.013885 0.041885 0.015207 -0.003143 ... 0.007596 0.01501 0.014733 -0.000476 0.006539 -0.010233 0.001251 -0.003102 -0.094847 139
1 3316 0.004024 0.009237 0.004967 NaN 0.01704 0.013885 0.041885 0.015207 -0.003143 ... 0.007596 0.01501 0.014733 -0.000476 0.006539 -0.010233 0.001251 -0.003102 -0.094847 129
2 3316 0.004024 0.009237 0.004967 NaN 0.01704 0.013885 0.041885 0.015207 -0.003143 ... 0.007596 0.01501 0.014733 -0.000476 0.006539 -0.010233 0.001251 -0.003102 -0.094847 136
3 3316 0.004024 0.009237 0.004967 NaN 0.01704 0.013885 0.041885 0.015207 -0.003143 ... 0.007596 0.01501 0.014733 -0.000476 0.006539 -0.010233 0.001251 -0.003102 -0.094847 161
4 3316 0.004024 0.009237 0.004967 NaN 0.01704 0.013885 0.041885 0.015207 -0.003143 ... 0.007596 0.01501 0.014733 -0.000476 0.006539 -0.010233 0.001251 -0.003102 -0.094847 217

5 rows × 102 columns
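As a quick optional sanity check (not part of the benchmark itself), we can inspect the dimensions of the three datasets; X_train and Y_train are aligned row by row.

In [ ]:
# Optional: inspect the dimensions of the loaded datasets.
print(X_train.shape, Y_train.shape, X_test.shape)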

Reshaping the data

We transform the data so that each row corresponds to a specific day: the 100 illiquid returns (the original RET_* feature columns) are concatenated with the 100 liquid returns (the targets) observed on that day.

In [3]:
# Positions and names of the illiquid-return feature columns.
idx_ret_features = np.where(X_train.columns.str.contains('RET'))[0]
init_ret_features = X_train.columns[idx_ret_features]
# Names of the liquid (target) return columns; ID_TARGET is stored as a
# string, so the concatenation works directly.
target_ret_features = 'RET_' + X_train['ID_TARGET'].unique()

returns = {}
for day in tqdm(X_train.ID_DAY.unique()):
    u = X_train.loc[X_train.ID_DAY == day]
    # The illiquid returns are repeated on every row of a given day,
    # so the first row is enough.
    a = u.iloc[0, idx_ret_features]
    # The liquid returns of that day, re-indexed by their column name.
    b = Y_train[X_train.ID_DAY == day]['RET_TARGET']
    b.index = 'RET_' + u.ID_TARGET
    returns[day] = pd.concat([a, b])
returns = pd.DataFrame(returns).T.astype(float)
returns.head()

Out[3]:
RET_0 RET_1 RET_102 RET_105 RET_106 RET_108 RET_109 RET_110 RET_114 RET_115 ... RET_88 RET_9 RET_90 RET_91 RET_93 RET_95 RET_96 RET_97 RET_98 RET_99
3316 -0.016501 0.018693 0.005316 NaN 0.001352 0.027374 -0.010890 0.057911 0.014155 0.021104 ... 0.027113 0.008602 0.018250 0.038581 0.027588 0.001251 0.037917 -0.002689 0.002140 0.023026
3355 0.000674 0.005759 0.007938 NaN -0.005017 -0.007413 -0.005980 0.011900 -0.011755 0.032401 ... 0.017933 0.001658 0.061274 -0.018643 0.000372 -0.010232 -0.013496 0.013819 -0.036501 0.017736
1662 -0.000919 0.010430 0.007901 NaN 0.006959 0.005593 -0.066666 0.049822 0.021599 0.019816 ... 0.016645 -0.016289 0.027690 -0.026479 0.006664 0.023721 0.013057 -0.008237 0.014655 0.011614
3405 -0.004989 0.004558 0.004325 NaN -0.007720 0.002788 -0.038432 -0.021308 0.000595 0.001613 ... -0.039065 0.021897 -0.031456 0.029652 0.016171 -0.024835 -0.020391 0.006790 0.015796 -0.015763
1602 0.002468 -0.024584 0.008947 NaN -0.004735 -0.031061 0.012366 0.045117 -0.015403 -0.015327 ... -0.028080 0.016530 0.019826 0.014627 0.010884 -0.025205 0.006157 -0.006526 0.002125 0.036186

5 rows × 200 columns
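Each row of returns now corresponds to one day, with the 100 illiquid columns followed by the 100 liquid ones (200 columns in total, as shown above). An optional check:

In [ ]:
# Optional: one row per day, 200 return columns in total.
print(returns.shape)
assert returns.index.is_unique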

Beta computation

We compute the $\beta$ between all pairs of assets: for assets $i$ and $j$, $\beta_{ij} = \operatorname{Cov}(r_i, r_j) / \operatorname{Var}(r_j)$. This matrix captures the linear link between all assets.

This step is not strictly necessary and could be folded into the next one, but it is a good way to introduce covariance matrix shrinkage, a technique widely used in finance when dealing with noisy data. See here for more information.

In [4]:
# OAS (Oracle Approximating Shrinkage) gives a shrunk, better-conditioned
# estimate of the covariance matrix; missing returns are filled with 0.
from sklearn.covariance import oas

features = returns.columns
cov = pd.DataFrame(oas(returns.fillna(0))[0], index=features, columns=features)
# Dividing by the diagonal broadcasts column-wise:
# beta.loc[i, j] = Cov(r_i, r_j) / Var(r_j).
beta = cov / np.diag(cov)
beta.head()
Out[4]:
RET_0 RET_1 RET_102 RET_105 RET_106 RET_108 RET_109 RET_110 RET_114 RET_115 ... RET_88 RET_9 RET_90 RET_91 RET_93 RET_95 RET_96 RET_97 RET_98 RET_99
RET_0 1.000000 0.126513 0.108238 0.128353 0.117609 0.070110 0.194425 0.080396 0.174791 0.118757 ... 0.146664 -0.012388 0.155055 0.081016 0.181609 0.172702 0.063350 0.138673 0.104377 0.165404
RET_1 0.184228 1.000000 0.122461 0.214639 0.131930 0.117362 0.196036 0.174030 0.256237 0.223324 ... 0.242727 -0.067669 0.228235 0.132136 0.177939 0.253013 0.105388 0.160021 0.109545 0.206384
RET_102 0.086988 0.067585 1.000000 0.149505 0.381845 0.146912 0.064788 0.142368 0.090649 0.104195 ... 0.101701 0.008833 0.111743 0.043866 0.090732 0.162930 0.033348 0.124352 0.450246 0.230993
RET_105 0.141652 0.162670 0.205304 1.000000 0.195409 0.126792 0.134372 0.238152 0.197308 0.179270 ... 0.165933 -0.003716 0.200567 0.221046 0.223728 0.219295 0.154519 0.205005 0.245215 0.320027
RET_106 0.093577 0.072087 0.378043 0.140882 1.000000 0.099451 0.076619 0.123011 0.100518 0.105586 ... 0.096279 -0.018555 0.104224 0.020172 0.121545 0.133660 0.011020 0.144167 0.393937 0.196698

5 rows × 200 columns
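As an optional sanity check of the broadcast above, a single entry of beta can be recomputed by hand:

In [ ]:
# Optional: beta.loc[i, j] should equal Cov(r_i, r_j) / Var(r_j).
i, j = 'RET_0', 'RET_1'
assert np.isclose(beta.loc[i, j], cov.loc[i, j] / cov.loc[j, j])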

Determine the pairs and beta coefficients

For each target asset (liquid asset), we determine the illiquid asset that has the maximum absolute correlation with it, and we save its id together with the associated beta coefficient.

In [5]:
# beta.T.loc[i, t] = Cov(r_t, r_i) / Var(r_i): the regression coefficient
# of the liquid (target) return t on the illiquid return i.
proj_matrix = beta.T.loc[init_ret_features, target_ret_features]
corr = returns.corr().loc[init_ret_features, target_ret_features]

coeffs = {}
for id_target in tqdm(target_ret_features):
    x = proj_matrix[id_target]
    c = corr[id_target]
    # Illiquid asset with the largest absolute correlation to this target.
    id_init_max = c.abs().idxmax()
    j = id_target.replace('RET_', '')
    coeffs[j] = (id_init_max, x[id_init_max])
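Each entry of coeffs maps a target id to its chosen illiquid column and beta coefficient; an optional peek at a few of them:

In [ ]:
# Optional: show the (illiquid column, beta) pair chosen for a few targets.
for j in list(coeffs)[:3]:
    print(j, coeffs[j])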

Prediction on test data

We then simply make predictions on the test data set using the pairs we saved and their beta coefficients.

If the selected illiquid return is missing for a row, we replace it with the mean of that row's illiquid returns.

In [6]:
pred = {}
for idx, row in tqdm(X_test.iterrows()):
    j = row['ID_TARGET']
    # Retrieve the paired illiquid column and its beta for this target.
    i, val = coeffs[j]
    x = row[i]
    # Missing illiquid return: fall back to the row's mean illiquid return.
    if np.isnan(x):
        x = row[init_ret_features].mean()
    pred[idx] = x * val
pred = pd.Series(pred, name="RET_TARGET")

# If every illiquid return of a row is missing, the prediction is still NaN;
# those are filled with the mean prediction of the same day.
pred_mean_day = pred.groupby(X_test['ID_DAY']).transform('mean')
pred = pred.fillna(pred_mean_day)
pred = np.sign(pred)
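Before saving, a quick optional look at the distribution of predicted signs (note that np.sign can also return 0 for an exactly-zero prediction):

In [ ]:
# Optional: distribution of the predicted signs.
print(pred.value_counts())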

Save the result before submission

In [7]:
# The series is already named RET_TARGET; cast the signs to integers
# and write the submission file.
pred = pred.astype(int)
pred.to_csv('./benchmark.csv')
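Finally, an optional check that the saved file can be read back with one prediction per test row:

In [ ]:
# Optional: reload the submission and check its length against X_test.
sub = pd.read_csv('./benchmark.csv', index_col=0)
assert len(sub) == len(X_test)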