Using integer programming and portfolio optimization to pick a March Madness bracket¶

In this notebook I explore the use of an integer program solver in Python to pick an optimal March Madness bracket.

I use the cvxopt Python interface to the glpk solver, both freely available. I also use pandas and numpy. Installing cvxopt and glpk was simple but not completely straightforward; see the appendix at the end for how I did it.

Scoring¶

The pool I was invited to had an interesting set of rules. You ignore the brackets and simply choose teams, but each team has a price and you have a limited budget:

Pick any set of teams you want as long as you stay within your $2000 budget. And all of your teams earn you points as long as they're still in the tournament.

0 points for play-in wins
1 point for every first round win
2 points for every second round win
3 points for every third round win
5 points for every quarterfinal win
8 points for every semifinal win
13 points for a finals win

Pricing structure:

1 seeds: $500
2 seeds: $300
3 seeds: $225
4 seeds: $175
5 seeds: $125
6 seeds: $125
7 seeds: $95
8 seeds: $85
9 seeds: $60
10 seeds: $65
11 seeds: $60
12 seeds: $55
13 seeds: $25
14 seeds: $20
15 seeds: $5
16 seeds: $1```

I'm not sure if the price for 9 seeds was a typo.

Data¶

fivethirtyeight.com (Nate Silver's website) has a model for the tournament and provided data on the output of that model. The data includes the predicted probability that each team will win in each round, which is what we're after. I selected the most recent data set after all the play-in games had completed.

In [7]:

data = pd.read_csv('bracket-05.tsv', sep='\t')
data = data.\
    query('rd1_win > 0').\
    rename(columns=dict(rd1_win=1, rd2_win=2, rd3_win=3, rd4_win=4, rd5_win=5, rd6_win=6, rd7_win=7))\
    [['team_name', 'team_seed', 1, 2, 3, 4, 5, 6, 7]]
data.head()

Out[7]:

	team_name	team_seed	1	2	3	4	5	6	7
0	Kentucky	1	1	0.998983	0.940664	0.855849	0.732191	5.351611e-01	4.131957e-01
2	Hampton	16b	1	0.001017	0.000103	0.000012	0.000001	5.387219e-08	4.307362e-09
3	Cincinnati	8	1	0.536887	0.034189	0.017280	0.007178	2.065072e-03	7.358615e-04
4	Purdue	9	1	0.463113	0.025044	0.012093	0.004774	1.382256e-03	4.958285e-04
5	West Virginia	5	1	0.682463	0.392345	0.052872	0.025248	7.441731e-03	2.693933e-03

The numbered columns represent the probability that a team will win in that round of the tournament. This of course means that they had to win all previous rounds, so you can see that the numbers are always decreasing from left to right.

I found it useful to think about the score you get from each team as a discrete random variable. In this case we need to change the above probabilities so that they represent the chance of winning exactly that number of games, so that for each team the sum of probabilities is 1.

In [8]:

data[8] = 0
for col in range(8, 1, -1):
    data[col] = data[col-1] - data[col]
data = data.drop(labels=1, axis=1)
data = data.rename(columns=dict(zip(range(2,9), range(7))))
data.head()

Out[8]:

	team_name	team_seed	0	1	2	3	4	5	6
0	Kentucky	1	0.001017	0.058320	0.084815	0.123658	0.197030	1.219654e-01	4.131957e-01
2	Hampton	16b	0.998983	0.000913	0.000091	0.000011	0.000001	4.956483e-08	4.307362e-09
3	Cincinnati	8	0.463113	0.502698	0.016909	0.010102	0.005113	1.329211e-03	7.358615e-04
4	Purdue	9	0.536887	0.438069	0.012951	0.007319	0.003392	8.864273e-04	4.958285e-04
5	West Virginia	5	0.317537	0.290118	0.339473	0.027624	0.017807	4.747797e-03	2.693933e-03

Now we set up the data for the scoring rules of the pool:

In [9]:

rounds = range(7)
scores = [0, 1, 2, 3, 5, 8, 13]
cumscores = np.cumsum(scores)
prices =   {'1': 500,
            '2': 300,
            '3': 225,
            '4': 175,
            '5': 125,
            '6': 125,
            '7': 95,
            '8': 85,
            '9': 60,
            '10': 65,
            '11': 60,
            '11a': 60,
            '11b': 60,
            '12': 55,
            '13': 25,
            '14': 20,
            '15': 5,
            '16': 1,
            '16a': 1,
            '16b': 1}
budget = 2000
n = len(data)
data['price'] = [prices[seed] for seed in data.team_seed]

A few quantities which we'll be interested in are:

The expected score for each team. This is the average of the number of points for winning each possible number of games (0 through 6), weighted by the probability of winning that number of games.
The variance of the score for each team.
What I call the efficiency of each team. This is simply the expected score divided by the price.

In [10]:

def get_expected_score(team):
    return sum(team[r]*cumscores[r] for r in rounds)
data['expected_score'] = data.apply(get_expected_score, axis=1)

def get_variance(team):
    return sum(team[r]*(cumscores[r]-team['expected_score'])**2 for r in rounds)
data['variance'] = data.apply(get_variance, axis=1)

data['efficiency'] = data.expected_score/data.price

In [11]:

cols = ['team_name', 'team_seed', 'price', 'expected_score', 'variance', 'efficiency']
data[cols].sort(columns=['efficiency'], ascending=False).head()

Out[11]:

	team_name	team_seed	price	expected_score	variance	efficiency
0	Kentucky	1	500	18.761645	144.256509	0.037523
16	New Mexico State	15	5	0.166190	0.390123	0.033238
39	Utah	5	125	4.123377	27.834924	0.032987
32	Arizona	2	300	9.649058	76.986045	0.032164
36	Robert Morris	16b	1	0.027626	0.065626	0.027626

Choosing the optimal set of teams¶

Maximizing expected score¶

A naive approach to maximizing the expected score is to continue adding the most efficient remaining team to your list until you run out of budget. But this could leave you with a little bit of unused budget. You might be better off sacrificing a little bit of efficiency so that you can spend your whole budget and get a higher total expected score.

In fact this is an example of the knapsack problem which is known to be NP-hard. There are specialized algorithms that solve the knapsack problem, but this is a particularly easy one and I wanted to play with the available free Python integer program solvers.

You don't actually want that¶

But that isn't necessarily the best way to maximize either your chance of winning or your expected earnings from the pool, which are more likely to be what you actually want. The right way to do it probably involves understanding how large and diverse your competition is, and throwing in some unpopular picks without sacrificing much efficiency. I don't know how to do that though.

Add in a risk penalty¶

But even if your goal is something like "get a lot of points in this tournament", picking the set of teams with the largest EV isn't necessarily the right play. The max-EV-set could be very risky in that it's expected to get a bad score most of the time, but occasionally gets a huge score, so that on average the score is pretty high. You might be willing to sacrifice a bit of EV in exchange for reduced risk.

A method for doing this from portfolio optimization is to maximize the expected return plus a variance penalty. You can tune the variance penalty based on your risk tolerance, setting it to zero to get the usual max-EV-set.

Notation¶

The problem I'll solve can be written as follows:

$$ \max_x \left\{ \mu^T x - \epsilon v^T x : p^T x \leq B, x \in \{0,1\} \right\} $$

The optimization variable is $x$, which is a binary vector of length 64. Each element represents a team, with a 1 indicating that you select that team to be in your set.

$\mu$ is the vector of expected scores for each team, $v$ the variance, and $p$ the price. $B$ is the budget, 2000 in this case. $\epsilon$ is the risk penalty factor.

Solving the optimization problem¶

I'll use the integer linear program solver ilp from glpk. I use the cvxopt Python interface to glpk to access it.

In [12]:

from cvxopt import matrix
from cvxopt.glpk import ilp

In [16]:

def solve_binary_program(eps):
    """
    Uses the integer linear program solver ilp from glpk:

    (status, x) = ilp(c, G, h, A, b, I, B)

        minimize    c'*x
        subject to  G*x <= h
                    A*x = b
                    x[k] is integer for k in I
                    x[k] is binary for k in B

    c            nx1 dense 'd' matrix with n>=1
    G            mxn dense or sparse 'd' matrix with m>=1
    h            mx1 dense 'd' matrix
    A            pxn dense or sparse 'd' matrix with p>=0
    b            px1 dense 'd' matrix
    I            set of indices of integer variables
    B            set of indices of binary variables
    """
    c = data.expected_score - eps*data.variance
    c = matrix(c)
    G = matrix(data.price[:, np.newaxis].T, tc='d')
    h = matrix(budget, tc='d')
    A = matrix(np.zeros((1, n)), tc='d')
    b = matrix(0.)
    I = set(range(n))
    B = set(range(n))

    (status, x) = ilp(-c, G, h, A, b, I, B)
    if status != 'optimal':
        raise
    return x

Results¶

Notice how as we increase the risk penalty, the optimal set of teams increases in size. And of course, both the expected score and the total variance decrease.

In [20]:

def solve_and_display(eps=0):
    x = solve_binary_program(eps)
    print('number of teams', sum(x))

    data['selected'] = x
    expected_score = data[data.selected == 1].expected_score.sum()
    total_variance = data[data.selected == 1].variance.sum()
    print('expected score %.2f' % expected_score)
    print('total variance %.2f' % total_variance)
    return data\
        [data.selected == 1]\
        [['team_name', 'team_seed', 'price', 'expected_score', 'variance', 'efficiency']].\
        sort(columns='price', ascending=False)

The maximum expected score solution:

In [21]:

solve_and_display()

number of teams 15.0
expected score 57.23
total variance 441.66

Out[21]:

	team_name	team_seed	price	expected_score	variance	efficiency
0	Kentucky	1	500	18.761645	144.256509	0.037523
51	Villanova	1	500	10.798966	90.892268	0.021598
32	Arizona	2	300	9.649058	76.986045	0.032164
66	Virginia	2	300	7.598104	71.553645	0.025327
39	Utah	5	125	4.123377	27.834924	0.032987
13	Wichita State	7	95	2.275967	11.880365	0.023958
31	Ohio State	10	65	1.350079	8.044640	0.020770
10	Texas	11	60	1.499421	7.395559	0.024990
8	Valparaiso	13	25	0.481423	1.126213	0.019257
29	Georgia State	14	20	0.453335	1.129402	0.022667
16	New Mexico State	15	5	0.166190	0.390123	0.033238
2	Hampton	16b	1	0.001266	0.002283	0.001266
18	Coastal Carolina	16	1	0.021691	0.049848	0.021691
36	Robert Morris	16b	1	0.027626	0.065626	0.027626
52	Lafayette	16	1	0.022346	0.056766	0.022346

With just a little bit of risk penalty, Villanova drops out of the optimal set. The extra 500 of budget is spent on Gonzaga, North Carolina, and UC Irvine, with only a tiny loss of expected score.

Either this set of teams or the next one might be good choices for the pool, since they expect to get close to the same score without quite as much variance.

In [22]:

solve_and_display(eps=.03)

number of teams 17.0
expected score 56.87
total variance 416.10

Out[22]:

	team_name	team_seed	price	expected_score	variance	efficiency
0	Kentucky	1	500	18.761645	144.256509	0.037523
32	Arizona	2	300	9.649058	76.986045	0.032164
49	Gonzaga	2	300	6.418647	44.254151	0.021395
66	Virginia	2	300	7.598104	71.553645	0.025327
23	North Carolina	4	175	3.598615	20.148501	0.020564
39	Utah	5	125	4.123377	27.834924	0.032987
13	Wichita State	7	95	2.275967	11.880365	0.023958
31	Ohio State	10	65	1.350079	8.044640	0.020770
10	Texas	11	60	1.499421	7.395559	0.024990
8	Valparaiso	13	25	0.481423	1.126213	0.019257
58	UC Irvine	13	25	0.421511	0.925401	0.016860
29	Georgia State	14	20	0.453335	1.129402	0.022667
16	New Mexico State	15	5	0.166190	0.390123	0.033238
18	Coastal Carolina	16	1	0.021691	0.049848	0.021691
2	Hampton	16b	1	0.001266	0.002283	0.001266
36	Robert Morris	16b	1	0.027626	0.065626	0.027626
52	Lafayette	16	1	0.022346	0.056766	0.022346

Increasing the risk penalty typically increases the number of teams in the optimal set, spreading the eggs across many baskets to mitigate risk. But the relationship isn't strictly monotone as we see here, going down from 17 to 16 teams after increasing $\epsilon$ from .03 to .05:

In [26]:

solve_and_display(eps=.05)

number of teams 16.0
expected score 54.39
total variance 364.04

Out[26]:

	team_name	team_seed	price	expected_score	variance	efficiency
0	Kentucky	1	500	18.761645	144.256509	0.037523
32	Arizona	2	300	9.649058	76.986045	0.032164
49	Gonzaga	2	300	6.418647	44.254151	0.021395
23	North Carolina	4	175	3.598615	20.148501	0.020564
39	Utah	5	125	4.123377	27.834924	0.032987
55	Northern Iowa	5	125	2.231097	10.160632	0.017849
13	Wichita State	7	95	2.275967	11.880365	0.023958
31	Ohio State	10	65	1.350079	8.044640	0.020770
10	Texas	11	60	1.499421	7.395559	0.024990
20	Oklahoma State	9	60	0.947926	2.718533	0.015799
27	Ole Miss	11b	60	0.927713	2.705765	0.015462
61	Dayton	11b	60	1.084642	4.081855	0.018077
8	Valparaiso	13	25	0.481423	1.126213	0.019257
58	UC Irvine	13	25	0.421511	0.925401	0.016860
29	Georgia State	14	20	0.453335	1.129402	0.022667
16	New Mexico State	15	5	0.166190	0.390123	0.033238

In [27]:

solve_and_display(eps=.1)

number of teams 26.0
expected score 45.79
total variance 268.87

Out[27]:

	team_name	team_seed	price	expected_score	variance	efficiency
0	Kentucky	1	500	18.761645	144.256509	0.037523
23	North Carolina	4	175	3.598615	20.148501	0.020564
21	Arkansas	5	125	1.669730	5.633805	0.013358
55	Northern Iowa	5	125	2.231097	10.160632	0.017849
5	West Virginia	5	125	1.846568	7.557176	0.014773
39	Utah	5	125	4.123377	27.834924	0.032987
13	Wichita State	7	95	2.275967	11.880365	0.023958
37	San Diego State	8	85	1.036467	4.378130	0.012194
31	Ohio State	10	65	1.350079	8.044640	0.020770
44	UCLA	11	60	0.716904	2.787515	0.011948
38	St. John's	9	60	0.600717	1.537617	0.010012
27	Ole Miss	11b	60	0.927713	2.705765	0.015462
20	Oklahoma State	9	60	0.947926	2.718533	0.015799
10	Texas	11	60	1.499421	7.395559	0.024990
4	Purdue	9	60	0.590854	1.707142	0.009848
61	Dayton	11b	60	1.084642	4.081855	0.018077
6	Buffalo	12	55	0.633698	1.796136	0.011522
8	Valparaiso	13	25	0.481423	1.126213	0.019257
42	Eastern Washington	13	25	0.301771	0.500441	0.012071
58	UC Irvine	13	25	0.421511	0.925401	0.016860
29	Georgia State	14	20	0.453335	1.129402	0.022667
16	New Mexico State	15	5	0.166190	0.390123	0.033238
2	Hampton	16b	1	0.001266	0.002283	0.001266
18	Coastal Carolina	16	1	0.021691	0.049848	0.021691
36	Robert Morris	16b	1	0.027626	0.065626	0.027626
52	Lafayette	16	1	0.022346	0.056766	0.022346

As we increase the risk penalty even further, the optimization problem no longer really suits our purpose. It becomes so afraid of risk that it spends far below the budget, choosing mostly terrible teams that are likely to lose in the first round, contributing very little uncertainty to our result, but also very little value.

In [47]:

solve_and_display(eps=.4)

number of teams 14.0
expected score 3.00
total variance 6.62

Out[47]:

	team_name	team_seed	price	expected_score	variance	efficiency
22	Wofford	12	55	0.332589	0.586285	0.006047
56	Wyoming	12	55	0.439177	1.026169	0.007985
8	Valparaiso	13	25	0.481423	1.126213	0.019257
42	Eastern Washington	13	25	0.301771	0.500441	0.012071
58	UC Irvine	13	25	0.421511	0.925401	0.016860
12	Northeastern	14	20	0.153953	0.360553	0.007698
29	Georgia State	14	20	0.453335	1.129402	0.022667
63	Albany	14	20	0.149620	0.355647	0.007481
16	New Mexico State	15	5	0.166190	0.390123	0.033238
33	Texas Southern	15	5	0.008298	0.014673	0.001660
50	North Dakota State	15	5	0.038325	0.090304	0.007665
2	Hampton	16b	1	0.001266	0.002283	0.001266
18	Coastal Carolina	16	1	0.021691	0.049848	0.021691
36	Robert Morris	16b	1	0.027626	0.065626	0.027626

In [48]:

import matplotlib.pyplot as plt
%matplotlib inline

In [53]:

f = lambda eps: sum(solve_binary_program(eps))
eps = np.linspace(0, .75)
num_teams = [f(_) for _ in eps]
plt.plot(eps, num_teams)
plt.ylim(0, 40)
plt.xlabel('Risk penalty $\epsilon$')
plt.ylabel('Optimal number of teams');

Appendix¶

Installing cvxopt and glpk¶

I think this should do the trick.

Ubuntu¶

sudo apt-get install libglpk36

export CVXOPT_BUILD_GLPK=1
export CVXOPT_GLPK_LIB_DIR=/usr/lib/x86_64-linux-gnu/
export CVXOPT_GLPK_INC_DIR=/usr/include/
pip install cvxopt

python
> from cvxopt.glpk import ilp

Mac¶

brew install homebrew/science/glpk

export CVXOPT_BUILD_GLPK=1
export CVXOPT_GLPK_LIB_DIR=/usr/local/Cellar/glpk/4.52/lib
export CVXOPT_GLPK_INC_DIR=/usr/local/Cellar/glpk/4.52/include
pip install cvxopt

python
> from cvxopt.glpk import ilp