#!/usr/bin/env python
# coding: utf-8

# #Using integer programming and portfolio optimization to pick a March Madness bracket
# 
# In this notebook I explore the use of an [integer program](http://en.wikipedia.org/wiki/Integer_programming) solver in Python to pick an optimal March Madness bracket.
# 
# I use the [cvxopt](http://cvxopt.org/) Python interface to the [glpk](https://www.gnu.org/software/glpk/) solver, both freely available. I also use pandas and numpy. Installing cvxopt and glpk was simple but not completely straightforward; see the appendix at the end for how I did it.

# ## Scoring
# The pool I was invited to had an interesting set of rules. You ignore the brackets and simply choose teams, but each team has a price and you have a limited budget:
# 
# >Pick any set of teams you want as long as you stay within your $2000 budget. And all of your teams earn you points as long as they're still in the tournament.
# ```
# 0 points for play-in wins
# 1 point for every first round win
# 2 points for every second round win
# 3 points for every third round win
# 5 points for every quarterfinal win
# 8 points for every semifinal win
# 13 points for a finals win
# ```
# 
# >Pricing structure:
# ```
# 1 seeds: $500
# 2 seeds: $300
# 3 seeds: $225
# 4 seeds: $175
# 5 seeds: $125
# 6 seeds: $125
# 7 seeds: $95
# 8 seeds: $85
# 9 seeds: $60
# 10 seeds: $65
# 11 seeds: $60
# 12 seeds: $55
# 13 seeds: $25
# 14 seeds: $20
# 15 seeds: $5
# 16 seeds: $1```
# 
# I'm not sure if the price for 9 seeds was a typo.

# ## Data

# [fivethirtyeight.com](fivethirtyeight.com) (Nate Silver's website) has a model for the tournament and provided [data](https://github.com/fivethirtyeight/data/tree/master/march-madness-predictions-2015) on the output of that model. The data includes the predicted probability that each team will win in each round, which is what we're after. I selected the most recent data set after all the play-in games had completed.

# In[7]:


data = pd.read_csv('bracket-05.tsv', sep='\t')
data = data.\
    query('rd1_win > 0').\
    rename(columns=dict(rd1_win=1, rd2_win=2, rd3_win=3, rd4_win=4, rd5_win=5, rd6_win=6, rd7_win=7))\
    [['team_name', 'team_seed', 1, 2, 3, 4, 5, 6, 7]]
data.head()


# The numbered columns represent the probability that a team will win in that round of the tournament. This of course means that they had to win all previous rounds, so you can see that the numbers are always decreasing from left to right.
# 
# I found it useful to think about the score you get from each team as a discrete random variable. In this case we need to change the above probabilities so that they represent the chance of winning exactly that number of games, so that for each team the sum of probabilities is 1.

# In[8]:


data[8] = 0
for col in range(8, 1, -1):
    data[col] = data[col-1] - data[col]
data = data.drop(labels=1, axis=1)
data = data.rename(columns=dict(zip(range(2,9), range(7))))
data.head()


# Now we set up the data for the scoring rules of the pool:

# In[9]:


rounds = range(7)
scores = [0, 1, 2, 3, 5, 8, 13]
cumscores = np.cumsum(scores)
prices =   {'1': 500,
            '2': 300,
            '3': 225,
            '4': 175,
            '5': 125,
            '6': 125,
            '7': 95,
            '8': 85,
            '9': 60,
            '10': 65,
            '11': 60,
            '11a': 60,
            '11b': 60,
            '12': 55,
            '13': 25,
            '14': 20,
            '15': 5,
            '16': 1,
            '16a': 1,
            '16b': 1}
budget = 2000
n = len(data)
data['price'] = [prices[seed] for seed in data.team_seed]


# A few quantities which we'll be interested in are:
# 
# * The expected score for each team. This is the average of the number of points for winning each possible number of games (0 through 6), weighted by the probability of winning that number of games.
# * The variance of the score for each team.
# * What I call the efficiency of each team. This is simply the expected score divided by the price.

# In[10]:


def get_expected_score(team):
    return sum(team[r]*cumscores[r] for r in rounds)
data['expected_score'] = data.apply(get_expected_score, axis=1)

def get_variance(team):
    return sum(team[r]*(cumscores[r]-team['expected_score'])**2 for r in rounds)
data['variance'] = data.apply(get_variance, axis=1)

data['efficiency'] = data.expected_score/data.price


# In[11]:


cols = ['team_name', 'team_seed', 'price', 'expected_score', 'variance', 'efficiency']
data[cols].sort(columns=['efficiency'], ascending=False).head()


# ## Choosing the optimal set of teams
# 
# ### Maximizing expected score
# 
# A naive approach to maximizing the expected score is to continue adding the most efficient remaining team to your list until you run out of budget. But this could leave you with a little bit of unused budget. You might be better off sacrificing a little bit of efficiency so that you can spend your whole budget and get a higher total expected score.
# 
# In fact this is an example of the [knapsack problem](http://en.wikipedia.org/wiki/Knapsack_problem) which is known to be NP-hard. There are specialized algorithms that solve the knapsack problem, but this is a particularly easy one and I wanted to play with the available free Python integer program solvers.
# 
# ### You don't actually want that
# 
# But that isn't necessarily the best way to maximize either your chance of winning or your expected earnings from the pool, which are more likely to be what you actually want. The right way to do it probably involves understanding how large and diverse your competition is, and throwing in some unpopular picks without sacrificing much efficiency. I don't know how to do that though.
# 
# ### Add in a risk penalty
# 
# But even if your goal is something like "get a lot of points in this tournament", picking the set of teams with the largest EV isn't necessarily the right play. The max-EV-set could be very risky in that it's expected to get a bad score most of the time, but occasionally gets a huge score, so that on average the score is pretty high. You might be willing to sacrifice a bit of EV in exchange for reduced risk.
# 
# A method for doing this from portfolio optimization is to maximize the expected return plus a variance penalty. You can tune the variance penalty based on your risk tolerance, setting it to zero to get the usual max-EV-set.
# 
# ### Notation
# 
# The problem I'll solve can be written as follows:
# 
# $$ \max_x \left\{ \mu^T x - \epsilon v^T x : p^T x \leq B, x \in \{0,1\} \right\} $$
# 
# The optimization variable is $x$, which is a binary vector of length 64. Each element represents a team, with a 1 indicating that you select that team to be in your set.
# 
# $\mu$ is the vector of expected scores for each team, $v$ the variance, and $p$ the price. $B$ is the budget, 2000 in this case. $\epsilon$ is the risk penalty factor.

# ##Solving the optimization problem
# 
# I'll use the integer linear program solver `ilp` from glpk. I use the cvxopt Python interface to glpk to access it.

# In[12]:


from cvxopt import matrix
from cvxopt.glpk import ilp


# In[16]:


def solve_binary_program(eps):
    """
    Uses the integer linear program solver ilp from glpk:

    (status, x) = ilp(c, G, h, A, b, I, B)

        minimize    c'*x
        subject to  G*x <= h
                    A*x = b
                    x[k] is integer for k in I
                    x[k] is binary for k in B

    c            nx1 dense 'd' matrix with n>=1
    G            mxn dense or sparse 'd' matrix with m>=1
    h            mx1 dense 'd' matrix
    A            pxn dense or sparse 'd' matrix with p>=0
    b            px1 dense 'd' matrix
    I            set of indices of integer variables
    B            set of indices of binary variables
    """
    c = data.expected_score - eps*data.variance
    c = matrix(c)
    G = matrix(data.price[:, np.newaxis].T, tc='d')
    h = matrix(budget, tc='d')
    A = matrix(np.zeros((1, n)), tc='d')
    b = matrix(0.)
    I = set(range(n))
    B = set(range(n))

    (status, x) = ilp(-c, G, h, A, b, I, B)
    if status != 'optimal':
        raise
    return x


# ## Results
# 
# Notice how as we increase the risk penalty, the optimal set of teams increases in size. And of course, both the expected score and the total variance decrease.

# In[20]:


def solve_and_display(eps=0):
    x = solve_binary_program(eps)
    print('number of teams', sum(x))

    data['selected'] = x
    expected_score = data[data.selected == 1].expected_score.sum()
    total_variance = data[data.selected == 1].variance.sum()
    print('expected score %.2f' % expected_score)
    print('total variance %.2f' % total_variance)
    return data\
        [data.selected == 1]\
        [['team_name', 'team_seed', 'price', 'expected_score', 'variance', 'efficiency']].\
        sort(columns='price', ascending=False)


# The maximum expected score solution:

# In[21]:


solve_and_display()


# With just a little bit of risk penalty, Villanova drops out of the optimal set. The extra 500 of budget is spent on Gonzaga, North Carolina, and UC Irvine, with only a tiny loss of expected score.
# 
# Either this set of teams or the next one might be good choices for the pool, since they expect to get close to the same score without quite as much variance.

# In[22]:


solve_and_display(eps=.03)


# Increasing the risk penalty typically increases the number of teams in the optimal set, spreading the eggs across many baskets to mitigate risk. But the relationship isn't strictly monotone as we see here, going down from 17 to 16 teams after increasing $\epsilon$ from .03 to .05:

# In[26]:


solve_and_display(eps=.05)


# In[27]:


solve_and_display(eps=.1)


# As we increase the risk penalty even further, the optimization problem no longer really suits our purpose. It becomes so afraid of risk that it spends far below the budget, choosing mostly terrible teams that are likely to lose in the first round, contributing very little uncertainty to our result, but also very little value.

# In[47]:


solve_and_display(eps=.4)


# In[48]:


import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')


# In[53]:


f = lambda eps: sum(solve_binary_program(eps))
eps = np.linspace(0, .75)
num_teams = [f(_) for _ in eps]
plt.plot(eps, num_teams)
plt.ylim(0, 40)
plt.xlabel('Risk penalty $\epsilon$')
plt.ylabel('Optimal number of teams');


# <hr>
# 
# #Appendix
# 
# ##Installing cvxopt and glpk
# 
# I think this should do the trick.
# 
# ### Ubuntu
# 
# ```
# sudo apt-get install libglpk36
# 
# export CVXOPT_BUILD_GLPK=1
# export CVXOPT_GLPK_LIB_DIR=/usr/lib/x86_64-linux-gnu/
# export CVXOPT_GLPK_INC_DIR=/usr/include/
# pip install cvxopt
# 
# python
# > from cvxopt.glpk import ilp
# ```
# 
# ### Mac
# 
# ```
# brew install homebrew/science/glpk
# 
# export CVXOPT_BUILD_GLPK=1
# export CVXOPT_GLPK_LIB_DIR=/usr/local/Cellar/glpk/4.52/lib
# export CVXOPT_GLPK_INC_DIR=/usr/local/Cellar/glpk/4.52/include
# pip install cvxopt
# 
# python
# > from cvxopt.glpk import ilp
# ```