# Using integer programming and portfolio optimization to pick a March Madness bracket¶

In this notebook I explore the use of an integer program solver in Python to pick an optimal March Madness bracket.

I use the cvxopt Python interface to the glpk solver, both freely available. I also use pandas and numpy. Installing cvxopt and glpk was simple but not completely straightforward; see the appendix at the end for how I did it.

## Scoring¶

The pool I was invited to had an interesting set of rules. You ignore the brackets and simply choose teams, but each team has a price and you have a limited budget:

Pick any set of teams you want as long as you stay within your $2000 budget. And all of your teams earn you points as long as they're still in the tournament. 0 points for play-in wins 1 point for every first round win 2 points for every second round win 3 points for every third round win 5 points for every quarterfinal win 8 points for every semifinal win 13 points for a finals win Pricing structure: 1 seeds:$500
2 seeds: $300 3 seeds:$225
4 seeds: $175 5 seeds:$125
6 seeds: $125 7 seeds:$95
8 seeds: $85 9 seeds:$60
10 seeds: $65 11 seeds:$60
12 seeds: $55 13 seeds:$25
14 seeds: $20 15 seeds:$5
16 seeds: $1 I'm not sure if the price for 9 seeds was a typo. ## Data¶ fivethirtyeight.com (Nate Silver's website) has a model for the tournament and provided data on the output of that model. The data includes the predicted probability that each team will win in each round, which is what we're after. I selected the most recent data set after all the play-in games had completed. In [7]: data = pd.read_csv('bracket-05.tsv', sep='\t') data = data.\ query('rd1_win > 0').\ rename(columns=dict(rd1_win=1, rd2_win=2, rd3_win=3, rd4_win=4, rd5_win=5, rd6_win=6, rd7_win=7))\ [['team_name', 'team_seed', 1, 2, 3, 4, 5, 6, 7]] data.head() Out[7]: team_name team_seed 1 2 3 4 5 6 7 0 Kentucky 1 1 0.998983 0.940664 0.855849 0.732191 5.351611e-01 4.131957e-01 2 Hampton 16b 1 0.001017 0.000103 0.000012 0.000001 5.387219e-08 4.307362e-09 3 Cincinnati 8 1 0.536887 0.034189 0.017280 0.007178 2.065072e-03 7.358615e-04 4 Purdue 9 1 0.463113 0.025044 0.012093 0.004774 1.382256e-03 4.958285e-04 5 West Virginia 5 1 0.682463 0.392345 0.052872 0.025248 7.441731e-03 2.693933e-03 The numbered columns represent the probability that a team will win in that round of the tournament. This of course means that they had to win all previous rounds, so you can see that the numbers are always decreasing from left to right. I found it useful to think about the score you get from each team as a discrete random variable. In this case we need to change the above probabilities so that they represent the chance of winning exactly that number of games, so that for each team the sum of probabilities is 1. In [8]: data[8] = 0 for col in range(8, 1, -1): data[col] = data[col-1] - data[col] data = data.drop(labels=1, axis=1) data = data.rename(columns=dict(zip(range(2,9), range(7)))) data.head() Out[8]: team_name team_seed 0 1 2 3 4 5 6 0 Kentucky 1 0.001017 0.058320 0.084815 0.123658 0.197030 1.219654e-01 4.131957e-01 2 Hampton 16b 0.998983 0.000913 0.000091 0.000011 0.000001 4.956483e-08 4.307362e-09 3 Cincinnati 8 0.463113 0.502698 0.016909 0.010102 0.005113 1.329211e-03 7.358615e-04 4 Purdue 9 0.536887 0.438069 0.012951 0.007319 0.003392 8.864273e-04 4.958285e-04 5 West Virginia 5 0.317537 0.290118 0.339473 0.027624 0.017807 4.747797e-03 2.693933e-03 Now we set up the data for the scoring rules of the pool: In [9]: rounds = range(7) scores = [0, 1, 2, 3, 5, 8, 13] cumscores = np.cumsum(scores) prices = {'1': 500, '2': 300, '3': 225, '4': 175, '5': 125, '6': 125, '7': 95, '8': 85, '9': 60, '10': 65, '11': 60, '11a': 60, '11b': 60, '12': 55, '13': 25, '14': 20, '15': 5, '16': 1, '16a': 1, '16b': 1} budget = 2000 n = len(data) data['price'] = [prices[seed] for seed in data.team_seed] A few quantities which we'll be interested in are: • The expected score for each team. This is the average of the number of points for winning each possible number of games (0 through 6), weighted by the probability of winning that number of games. • The variance of the score for each team. • What I call the efficiency of each team. This is simply the expected score divided by the price. In [10]: def get_expected_score(team): return sum(team[r]*cumscores[r] for r in rounds) data['expected_score'] = data.apply(get_expected_score, axis=1) def get_variance(team): return sum(team[r]*(cumscores[r]-team['expected_score'])**2 for r in rounds) data['variance'] = data.apply(get_variance, axis=1) data['efficiency'] = data.expected_score/data.price In [11]: cols = ['team_name', 'team_seed', 'price', 'expected_score', 'variance', 'efficiency'] data[cols].sort(columns=['efficiency'], ascending=False).head() Out[11]: team_name team_seed price expected_score variance efficiency 0 Kentucky 1 500 18.761645 144.256509 0.037523 16 New Mexico State 15 5 0.166190 0.390123 0.033238 39 Utah 5 125 4.123377 27.834924 0.032987 32 Arizona 2 300 9.649058 76.986045 0.032164 36 Robert Morris 16b 1 0.027626 0.065626 0.027626 ## Choosing the optimal set of teams¶ ### Maximizing expected score¶ A naive approach to maximizing the expected score is to continue adding the most efficient remaining team to your list until you run out of budget. But this could leave you with a little bit of unused budget. You might be better off sacrificing a little bit of efficiency so that you can spend your whole budget and get a higher total expected score. In fact this is an example of the knapsack problem which is known to be NP-hard. There are specialized algorithms that solve the knapsack problem, but this is a particularly easy one and I wanted to play with the available free Python integer program solvers. ### You don't actually want that¶ But that isn't necessarily the best way to maximize either your chance of winning or your expected earnings from the pool, which are more likely to be what you actually want. The right way to do it probably involves understanding how large and diverse your competition is, and throwing in some unpopular picks without sacrificing much efficiency. I don't know how to do that though. ### Add in a risk penalty¶ But even if your goal is something like "get a lot of points in this tournament", picking the set of teams with the largest EV isn't necessarily the right play. The max-EV-set could be very risky in that it's expected to get a bad score most of the time, but occasionally gets a huge score, so that on average the score is pretty high. You might be willing to sacrifice a bit of EV in exchange for reduced risk. A method for doing this from portfolio optimization is to maximize the expected return plus a variance penalty. You can tune the variance penalty based on your risk tolerance, setting it to zero to get the usual max-EV-set. ### Notation¶ The problem I'll solve can be written as follows: $$\max_x \left\{ \mu^T x - \epsilon v^T x : p^T x \leq B, x \in \{0,1\} \right\}$$ The optimization variable is$x$, which is a binary vector of length 64. Each element represents a team, with a 1 indicating that you select that team to be in your set.$\mu$is the vector of expected scores for each team,$v$the variance, and$p$the price.$B$is the budget, 2000 in this case.$\epsilon$is the risk penalty factor. ## Solving the optimization problem¶ I'll use the integer linear program solver ilp from glpk. I use the cvxopt Python interface to glpk to access it. In [12]: from cvxopt import matrix from cvxopt.glpk import ilp In [16]: def solve_binary_program(eps): """ Uses the integer linear program solver ilp from glpk: (status, x) = ilp(c, G, h, A, b, I, B) minimize c'*x subject to G*x <= h A*x = b x[k] is integer for k in I x[k] is binary for k in B c nx1 dense 'd' matrix with n>=1 G mxn dense or sparse 'd' matrix with m>=1 h mx1 dense 'd' matrix A pxn dense or sparse 'd' matrix with p>=0 b px1 dense 'd' matrix I set of indices of integer variables B set of indices of binary variables """ c = data.expected_score - eps*data.variance c = matrix(c) G = matrix(data.price[:, np.newaxis].T, tc='d') h = matrix(budget, tc='d') A = matrix(np.zeros((1, n)), tc='d') b = matrix(0.) I = set(range(n)) B = set(range(n)) (status, x) = ilp(-c, G, h, A, b, I, B) if status != 'optimal': raise return x ## Results¶ Notice how as we increase the risk penalty, the optimal set of teams increases in size. And of course, both the expected score and the total variance decrease. In [20]: def solve_and_display(eps=0): x = solve_binary_program(eps) print('number of teams', sum(x)) data['selected'] = x expected_score = data[data.selected == 1].expected_score.sum() total_variance = data[data.selected == 1].variance.sum() print('expected score %.2f' % expected_score) print('total variance %.2f' % total_variance) return data\ [data.selected == 1]\ [['team_name', 'team_seed', 'price', 'expected_score', 'variance', 'efficiency']].\ sort(columns='price', ascending=False) The maximum expected score solution: In [21]: solve_and_display() number of teams 15.0 expected score 57.23 total variance 441.66 Out[21]: team_name team_seed price expected_score variance efficiency 0 Kentucky 1 500 18.761645 144.256509 0.037523 51 Villanova 1 500 10.798966 90.892268 0.021598 32 Arizona 2 300 9.649058 76.986045 0.032164 66 Virginia 2 300 7.598104 71.553645 0.025327 39 Utah 5 125 4.123377 27.834924 0.032987 13 Wichita State 7 95 2.275967 11.880365 0.023958 31 Ohio State 10 65 1.350079 8.044640 0.020770 10 Texas 11 60 1.499421 7.395559 0.024990 8 Valparaiso 13 25 0.481423 1.126213 0.019257 29 Georgia State 14 20 0.453335 1.129402 0.022667 16 New Mexico State 15 5 0.166190 0.390123 0.033238 2 Hampton 16b 1 0.001266 0.002283 0.001266 18 Coastal Carolina 16 1 0.021691 0.049848 0.021691 36 Robert Morris 16b 1 0.027626 0.065626 0.027626 52 Lafayette 16 1 0.022346 0.056766 0.022346 With just a little bit of risk penalty, Villanova drops out of the optimal set. The extra 500 of budget is spent on Gonzaga, North Carolina, and UC Irvine, with only a tiny loss of expected score. Either this set of teams or the next one might be good choices for the pool, since they expect to get close to the same score without quite as much variance. In [22]: solve_and_display(eps=.03) number of teams 17.0 expected score 56.87 total variance 416.10 Out[22]: team_name team_seed price expected_score variance efficiency 0 Kentucky 1 500 18.761645 144.256509 0.037523 32 Arizona 2 300 9.649058 76.986045 0.032164 49 Gonzaga 2 300 6.418647 44.254151 0.021395 66 Virginia 2 300 7.598104 71.553645 0.025327 23 North Carolina 4 175 3.598615 20.148501 0.020564 39 Utah 5 125 4.123377 27.834924 0.032987 13 Wichita State 7 95 2.275967 11.880365 0.023958 31 Ohio State 10 65 1.350079 8.044640 0.020770 10 Texas 11 60 1.499421 7.395559 0.024990 8 Valparaiso 13 25 0.481423 1.126213 0.019257 58 UC Irvine 13 25 0.421511 0.925401 0.016860 29 Georgia State 14 20 0.453335 1.129402 0.022667 16 New Mexico State 15 5 0.166190 0.390123 0.033238 18 Coastal Carolina 16 1 0.021691 0.049848 0.021691 2 Hampton 16b 1 0.001266 0.002283 0.001266 36 Robert Morris 16b 1 0.027626 0.065626 0.027626 52 Lafayette 16 1 0.022346 0.056766 0.022346 Increasing the risk penalty typically increases the number of teams in the optimal set, spreading the eggs across many baskets to mitigate risk. But the relationship isn't strictly monotone as we see here, going down from 17 to 16 teams after increasing$\epsilon$from .03 to .05: In [26]: solve_and_display(eps=.05) number of teams 16.0 expected score 54.39 total variance 364.04 Out[26]: team_name team_seed price expected_score variance efficiency 0 Kentucky 1 500 18.761645 144.256509 0.037523 32 Arizona 2 300 9.649058 76.986045 0.032164 49 Gonzaga 2 300 6.418647 44.254151 0.021395 23 North Carolina 4 175 3.598615 20.148501 0.020564 39 Utah 5 125 4.123377 27.834924 0.032987 55 Northern Iowa 5 125 2.231097 10.160632 0.017849 13 Wichita State 7 95 2.275967 11.880365 0.023958 31 Ohio State 10 65 1.350079 8.044640 0.020770 10 Texas 11 60 1.499421 7.395559 0.024990 20 Oklahoma State 9 60 0.947926 2.718533 0.015799 27 Ole Miss 11b 60 0.927713 2.705765 0.015462 61 Dayton 11b 60 1.084642 4.081855 0.018077 8 Valparaiso 13 25 0.481423 1.126213 0.019257 58 UC Irvine 13 25 0.421511 0.925401 0.016860 29 Georgia State 14 20 0.453335 1.129402 0.022667 16 New Mexico State 15 5 0.166190 0.390123 0.033238 In [27]: solve_and_display(eps=.1) number of teams 26.0 expected score 45.79 total variance 268.87 Out[27]: team_name team_seed price expected_score variance efficiency 0 Kentucky 1 500 18.761645 144.256509 0.037523 23 North Carolina 4 175 3.598615 20.148501 0.020564 21 Arkansas 5 125 1.669730 5.633805 0.013358 55 Northern Iowa 5 125 2.231097 10.160632 0.017849 5 West Virginia 5 125 1.846568 7.557176 0.014773 39 Utah 5 125 4.123377 27.834924 0.032987 13 Wichita State 7 95 2.275967 11.880365 0.023958 37 San Diego State 8 85 1.036467 4.378130 0.012194 31 Ohio State 10 65 1.350079 8.044640 0.020770 44 UCLA 11 60 0.716904 2.787515 0.011948 38 St. John's 9 60 0.600717 1.537617 0.010012 27 Ole Miss 11b 60 0.927713 2.705765 0.015462 20 Oklahoma State 9 60 0.947926 2.718533 0.015799 10 Texas 11 60 1.499421 7.395559 0.024990 4 Purdue 9 60 0.590854 1.707142 0.009848 61 Dayton 11b 60 1.084642 4.081855 0.018077 6 Buffalo 12 55 0.633698 1.796136 0.011522 8 Valparaiso 13 25 0.481423 1.126213 0.019257 42 Eastern Washington 13 25 0.301771 0.500441 0.012071 58 UC Irvine 13 25 0.421511 0.925401 0.016860 29 Georgia State 14 20 0.453335 1.129402 0.022667 16 New Mexico State 15 5 0.166190 0.390123 0.033238 2 Hampton 16b 1 0.001266 0.002283 0.001266 18 Coastal Carolina 16 1 0.021691 0.049848 0.021691 36 Robert Morris 16b 1 0.027626 0.065626 0.027626 52 Lafayette 16 1 0.022346 0.056766 0.022346 As we increase the risk penalty even further, the optimization problem no longer really suits our purpose. It becomes so afraid of risk that it spends far below the budget, choosing mostly terrible teams that are likely to lose in the first round, contributing very little uncertainty to our result, but also very little value. In [47]: solve_and_display(eps=.4) number of teams 14.0 expected score 3.00 total variance 6.62 Out[47]: team_name team_seed price expected_score variance efficiency 22 Wofford 12 55 0.332589 0.586285 0.006047 56 Wyoming 12 55 0.439177 1.026169 0.007985 8 Valparaiso 13 25 0.481423 1.126213 0.019257 42 Eastern Washington 13 25 0.301771 0.500441 0.012071 58 UC Irvine 13 25 0.421511 0.925401 0.016860 12 Northeastern 14 20 0.153953 0.360553 0.007698 29 Georgia State 14 20 0.453335 1.129402 0.022667 63 Albany 14 20 0.149620 0.355647 0.007481 16 New Mexico State 15 5 0.166190 0.390123 0.033238 33 Texas Southern 15 5 0.008298 0.014673 0.001660 50 North Dakota State 15 5 0.038325 0.090304 0.007665 2 Hampton 16b 1 0.001266 0.002283 0.001266 18 Coastal Carolina 16 1 0.021691 0.049848 0.021691 36 Robert Morris 16b 1 0.027626 0.065626 0.027626 In [48]: import matplotlib.pyplot as plt %matplotlib inline In [53]: f = lambda eps: sum(solve_binary_program(eps)) eps = np.linspace(0, .75) num_teams = [f(_) for _ in eps] plt.plot(eps, num_teams) plt.ylim(0, 40) plt.xlabel('Risk penalty$\epsilon\$')
plt.ylabel('Optimal number of teams');

# Appendix¶

## Installing cvxopt and glpk¶

I think this should do the trick.

### Ubuntu¶

sudo apt-get install libglpk36

export CVXOPT_BUILD_GLPK=1
export CVXOPT_GLPK_LIB_DIR=/usr/lib/x86_64-linux-gnu/
export CVXOPT_GLPK_INC_DIR=/usr/include/
pip install cvxopt

python
> from cvxopt.glpk import ilp

### Mac¶

brew install homebrew/science/glpk

export CVXOPT_BUILD_GLPK=1
export CVXOPT_GLPK_LIB_DIR=/usr/local/Cellar/glpk/4.52/lib
export CVXOPT_GLPK_INC_DIR=/usr/local/Cellar/glpk/4.52/include
pip install cvxopt

python
> from cvxopt.glpk import ilp