In this notebook we show how we can solve a hard problem using some reformulations.
The Gini mean difference (GMD) is a measure of dispersion and it was introduced in the context of portfolio optimization by Yitzhaki (1982). However, this model is not used by practitioners due to the original formulation having a number of variables that increases proportional to $T(T-1)/2$, where $T$ is the number of observations. The original model is presented as follows:
$$ \begin{equation} \begin{aligned} & \underset{x,\, d}{\text{min}} & & \frac{1}{T(T-1)} \sum^{T}_{i=1} \sum^{T}_{j > i} d_{i,j} \\ & \text{s.t.} & & \mu x \geq \bar{\mu} \\ & & & d_{i,j} \geq (r_{i} -r_{j})x \; ; \; \forall \; i,j =1, \ldots, T \; ; \; i < j \\ & & & d_{i,j} \geq -(r_{i} -r_{j})x \\ & & & \sum_{i=1}^{N} x_i = 1 \\ & & & x_{i} \geq 0 \; ; \; \forall \; i =1, \ldots, N \\ & & & d_{i,j} \geq 0 \; ; \; \forall \; i,j =1, \ldots, T \\ \end{aligned} \end{equation} $$Where $r_{i}$ is the vector of returns in period $i$ and $d$ is an auxiliary variable.
To increase the efficiency of the problem above, Murray (2022) proposed the following reformulation:
$$ \begin{equation} \begin{aligned} & \underset{x,\, d}{\text{min}} & & \frac{1}{T(T-1)} \sum^{T}_{i=1} \sum^{T}_{j > i} d_{i,j} \\ & \text{s.t.} & & \mu x \geq \bar{\mu} \\ & & & y = r x \\ & & & d \geq M y \\ & & & d \geq -M y \\ & & & \sum_{i=1}^{N} x_i = 1 \\ & & & x_{i} \geq 0 \; ; \; \forall \; i =1, \ldots, N \\ & & & d_{i,j} \geq 0 \; ; \; \forall \; i,j =1, \ldots, T \\ \end{aligned} \end{equation} $$where $$ M = \begin{bmatrix} \left. \begin{matrix} -1 & 1 & 0 & 0 & \ldots & 0 & 0\\ -1 & 0 & 1 & 0 & \ldots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ -1 & 0 & 0 & 0 & \ldots & 0 & 1 \\ \end{matrix} \right \} & T-1\\ \left. \begin{matrix} 0 & -1 & 1 & 0 & \ldots & 0 & 0\\ 0 & -1 & 0 & 1 & \ldots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & -1 & 0 & 0 & \ldots & 0 & 1 \\ \end{matrix} \right \} & T-2\\ \vdots \\ \underbrace{ \left. \begin{matrix} 0 & 0 & 0 & 0 & \ldots & -1 & 1 \\ \end{matrix} \right \} }_{T} & 1 \\ \end{bmatrix} $$
This reformulation is more efficient for medium scale problems (T<800).
Cajas (2021) proposed an alternative reformulation based on the ordered weighted averaging (OWA) operator optimization model for monotonic weights proposed by Chassein and Goerigk (2015). This formulation works better for large scale problems (T>=800). This formulation is presented as follows:
$$ \begin{equation} \begin{aligned} & \min_{\alpha, \, \beta, \, x, \, y} & & \sum^{T}_{i=1} \alpha_{i} + \beta_{i} \\ & \text{s.t.} & & \mu x \geq \bar{\mu} \\ & & & r x = y \\ & & & \alpha_{i} + \beta_{j} \geq w_{i} y_{j} \;\;\;\; \forall \; i,j =1, \ldots, T \\ & & & \sum_{i=1}^{N} x_i = 1 \\ & & & x_i \geq 0 \; ; \; \forall \; i =1, \ldots, N \\ \end{aligned} \end{equation} $$where $w_{i} = 2 \left ( \frac{2i - 1 - T}{T(T-1)} \right )$.
import numpy as np
import pandas as pd
import cvxpy as cp
import mosek
import scipy.stats as st
from timeit import default_timer as timer
from datetime import timedelta
import warnings
warnings.filterwarnings("ignore")
def gini(mu, returns, D, assets, lift=0):
(T, N) = returns.shape
d = cp.Variable((int(T * (T - 1) / 2),1))
w = cp.Variable((N,1))
constraints = []
if lift in ['Murray', 'Yitzhaki']: # use Murray's reformulation
if lift == 'Murray':
ret_w = cp.Variable((T,1))
constraints.append(ret_w == returns @ w)
mat = np.zeros((d.shape[0], T))
"""
We need to create a vector that has the following entries:
ret_w[i] - ret_w[j]
for j in range(T), for i in range(j+1, T).
We do this by building a numpy array of mostly 0's and 1's.
(It would be better to use SciPy sparse matrix objects.)
"""
ell = 0
for j in range(T):
for i in range(j+1, T):
# write to mat so that (mat @ ret_w)[ell] == var_i - var_j
mat[ell, i] = 1
mat[ell, j] = -1
ell += 1
all_pairs_ret_diff = mat @ ret_w
elif lift == 'Yitzhaki': # use the original formulation
all_pairs_ret_diff = D @ w
constraints += [d >= all_pairs_ret_diff,
d >= -all_pairs_ret_diff,
w >= 0,
cp.sum(w) == 1,
]
risk = cp.sum(d) / ((T - 1) * T)
elif lift == 'Cajas':
a = cp.Variable((T,1))
b = cp.Variable((T,1))
y = cp.Variable((T,1))
owa_w = []
for i in range(1,T+1):
owa_w.append(2*i - 1 - T)
owa_w = np.array(owa_w) / (T * (T-1))
constraints = [returns @ w == y,
w >= 0,
cp.sum(w) == 1]
for i in range(T):
constraints += [a[i] + b >= cp.multiply(owa_w[i], y)]
risk = cp.sum(a + b)
objective = cp.Minimize(risk * 1000)
prob = cp.Problem(objective, constraints)
try:
prob.solve(solver=cp.MOSEK,
mosek_params={'MSK_IPAR_NUM_THREADS': 2},
verbose=False,)
w = pd.DataFrame(w.value)
w.index = assets
w = w / w.sum()
except:
w = None
return w
####################################
# Calculating the Gini Portfolio
# using three formulations
####################################
rs = np.random.RandomState(123)
sizes = [100]
data = {}
weights = {}
lifts = ['Yitzhaki', 'Murray', 'Cajas']
k = 0
T = [200, 500, 700, 1000]
for t in T:
for n in sizes:
cov = rs.rand(n,n) * 1.5 - 0.5
cov = cov @ cov.T/1000 + np.diag(rs.rand(n) * 0.7 + 0.3)/1000
mean = np.zeros(n) + 1/1000
Y = st.multivariate_normal.rvs(mean=mean, cov=cov, size=t, random_state=rs)
Y = pd.DataFrame(Y)
assets = ['Asset '+ str(i) for i in range(1,n+1)]
mu = Y.mean().to_numpy()
returns = Y.to_numpy()
D = np.array([]).reshape(0,len(assets))
for j in range(0, returns.shape[0]-1):
D = np.concatenate((D, returns[j+1:] - returns[j,:]), axis=0)
for lift in lifts:
name = str(lift) + '-' + str(t) + '-' + str(n)
data[name] = []
weights[name] = []
if t >= 700 and lift == 'Yitzhaki':
continue
else:
start = timer()
w = gini(mu, returns, D, assets, lift=lift)
end = timer()
data[name].append(timedelta(seconds=end-start).total_seconds())
weights[name].append(w)
k += 1
print(name)
Yitzhaki-200-100 Murray-200-100 Cajas-200-100 Yitzhaki-500-100 Murray-500-100 Cajas-500-100 Murray-700-100 Cajas-700-100 Murray-1000-100 Cajas-1000-100
keys = list(data.keys())
for i in keys:
if len(data[i]) == 0:
del data[i]
pd.options.display.float_format = '{:.2f}'.format
a = pd.DataFrame(data).T
a.columns = ['Time in Seconds']
display(a)
Time in Seconds | |
---|---|
Yitzhaki-200-100 | 9.55 |
Murray-200-100 | 0.88 |
Cajas-200-100 | 1.59 |
Yitzhaki-500-100 | 99.33 |
Murray-500-100 | 11.49 |
Cajas-500-100 | 13.67 |
Murray-700-100 | 27.19 |
Cajas-700-100 | 29.12 |
Murray-1000-100 | 96.62 |
Cajas-1000-100 | 80.50 |
As we can see, as the number of observations $T$ increases the formulation proposed by Cajas (2021) becomes more efficient than Yitzhaki's and Murray's formulations.
b = pd.DataFrame([])
for i in weights.keys():
if len(weights[i]) == 0:
continue
weights[i][0].columns = [i]
b = pd.concat([b , mu @ weights[i][0]],axis=0)
pd.options.display.float_format = '{:.4%}'.format
display(b)
0 | |
---|---|
Yitzhaki-200-100 | -0.0413% |
Murray-200-100 | -0.0413% |
Cajas-200-100 | -0.0414% |
Yitzhaki-500-100 | -0.2315% |
Murray-500-100 | -0.2315% |
Cajas-500-100 | -0.2313% |
Murray-700-100 | 0.1601% |
Cajas-700-100 | 0.1602% |
Murray-1000-100 | 0.0339% |
Cajas-1000-100 | 0.0339% |
If we calculate the expected returns for each formulation, we can see that the three models give us the same results, which means that these formulations give us the same solution.
For more portfolio optimization models and applications, you can see the CVXPY based library Riskfolio-Lib.
Finally, I hope you liked this example.