$$\newcommand{\Rv}{\mathbf{R}} \newcommand{\rv}{\mathbf{r}} \newcommand{\Qv}{\mathbf{Q}} \newcommand{\Qnv}{\mathbf{Qn}} \newcommand{\Av}{\mathbf{A}} \newcommand{\Aiv}{\mathbf{Ai}} \newcommand{\av}{\mathbf{a}} \newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\betav}{\mathbf{\beta}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\vv}{\mathbf{v}} \newcommand{\Uv}{\mathbf{U}} \newcommand{\uv}{\mathbf{u}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\TDv}{\mathbf{TD}} \newcommand{\Tiv}{\mathbf{Ti}} \newcommand{\Sv}{\mathbf{S}} \newcommand{\Gv}{\mathbf{G}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\Norm}{\mathcal{N}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}} \newcommand{\grad}{\mathbf{\nabla}} \newcommand{\ebx}[1]{e^{\betav_{#1}^T \xv_n}} \newcommand{\eby}[1]{e^{y_{n,#1}}} \newcommand{\Fv}{\mathbf{F}} \newcommand{\ones}[1]{\mathbf{1}_{#1}}$$

# Reinforcement Learning for Two-Player Games

How does Tic-Tac-Toe differ from the maze problem?

- Different state and action sets.
- Two players rather than one.
- Reinforcement is 0 until end of game, when it is 1 for win, 0 for draw, or -1 for loss.
- Maximizing sum of reinforcement rather than minimizing.
- Anything else?

## Representing the Q Table

The state is the board configuration. There are $3^9$ of them, since each of the nine cells can hold X, O, or a space, though not all are reachable. Is this too big?

It is a bit less than 20,000. Not bad. Is this the full size of the Q table?

No. We must add the action dimension. There are at most 9 actions, one for each cell on the board. So the Q table will contain at most about $20,000 \cdot 9$ values, or roughly 180,000. No worries.
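Just to check the arithmetic, here is a quick throwaway computation (not one of the notebook's cells):

    print(3**9)      # 19683 board configurations
    print(3**9 * 9)  # 177147 (board, move) pairs, at most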

Instead of thinking about the Q table as a three-dimensional array, as we did last time, let's be more pythonic and use a dictionary. Use the (state, action) pair, meaning the current board together with the move taken from it, as the key, and store the Q value for that pair as the dictionary's value.

We still need a way to represent a board.

How about an array of characters? So

     X |   | O
    ---------
       | X | O
    ---------
     X |   |



would be

 board = np.array(['X',' ','O', ' ','X','O', 'X',' ',' '])



The initial board would be

 board = np.array([' ']*9)



We can represent a move as an index, 0 to 8, into this array.

What should the reinforcement values be?

How about 0 for every move, except a reinforcement of 1 when X wins and -1 when O wins?

For the above board, let's say we, meaning Player X, prefer the move to index 3. In fact, this immediately wins the game by completing the left column. So the Q value for the move to 3 should be 1. What other Q values do you know?

The move to index 8 also wins immediately, by completing the 0-4-8 diagonal, so its Q value should be 1 as well. If we play move 1 or 7 instead, O can win in one move by playing 8, so against a skillful Player O those moves would have Q values of -1. In the following discussion we will be using a random player for O, who finds that winning reply only some of the time, so the learned Q values for moves other than 3 or 8 will be well above -1.
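We can verify which moves win immediately with a quick check. This is a throwaway snippet, not part of the notebook's code; the xWins helper here exists only for this check, and the notebook's own winner function appears below.

    import numpy as np

    board = np.array(['X',' ','O', ' ','X','O', 'X',' ',' '])

    # The eight winning lines: three rows, three columns, two diagonals.
    lines = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

    def xWins(b):
        return any(all(b[i] == 'X' for i in line) for line in lines)

    for move in np.where(board == ' ')[0]:
        b = board.copy()
        b[move] = 'X'
        print(move, xWins(b))   # only moves 3 and 8 print True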

## Agent-World Interaction Loop

For our agent to interact with its world, we must implement the following steps:

1. Initialize Q.
2. Set the initial state to the empty board.
3. Repeat:
   1. Agent chooses next X move.
   2. If X wins, set Q(board,move) to 1.
   3. Else, if the board is full, set Q(board,move) to 0.
   4. Else, let O take a move.
   5. If O won, update Q(board,move) by adding $\rho \, (-1 - Q(board,move))$.
   6. In all cases, if this is not the first move of the game, update Q(oldboard,oldmove) by adding $\rho \, (Q(board,move) - Q(oldboard,oldmove))$.
   7. Shift the current board and move to the old ones.

## Now in Python

First, let's get some function definitions out of the way. Then we will run tons of games and look at the results.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from copy import copy


Let's write a function to print a board in the usual Tic-Tac-Toe style.

In [2]:
def printBoard(board):
    print('''
{}|{}|{}
-----
{}|{}|{}
------
{}|{}|{}'''.format(*tuple(board)))

printBoard(np.array(['X',' ','O', ' ','X','O', 'X',' ',' ']))

X| |O
-----
|X|O
------
X| |


Let's write a function that returns True if the current board is a win for either player. (We will be Player X.) What does the value of combos represent?

In [3]:
def winner(board):
    # combos holds the eight winning lines: three rows, three columns, two diagonals.
    combos = np.array((0,1,2, 3,4,5, 6,7,8, 0,3,6, 1,4,7, 2,5,8, 0,4,8, 2,4,6))
    if np.any(np.logical_or(np.all('X' == board[combos].reshape((-1,3)), axis=1),
                            np.all('O' == board[combos].reshape((-1,3)), axis=1))):
        return True
    else:
        return False

In [4]:
board = np.array(['X',' ','O', ' ','X','O', 'X',' ',' '])
printBoard(board), print(winner(board))
board = np.array(['X',' ','X', ' ','X','O', 'X',' ',' '])
printBoard(board), print(winner(board))

X| |O
-----
|X|O
------
X| |
False

X| |X
-----
|X|O
------
X| |
True

Out[4]:
(None, None)

How can we find all valid moves from a board? Just find all of the spaces in the board representation.

In [5]:
np.where(board == ' ')

Out[5]:
(array([1, 3, 7, 8]),)
In [6]:
np.where(board == ' ')[0]

Out[6]:
array([1, 3, 7, 8])

And how do we pick one at random and make that move?

In [7]:
board = np.array(['X',' ','O', ' ','X','O', 'X',' ',' '])
validMoves = np.where(board == ' ')[0]
move = np.random.choice(validMoves)
boardNew = copy(board)
boardNew[move] = 'X'
print('From this board')
printBoard(board)
print('\n  Move',move)
print('\nresults in board')
printBoard(boardNew)

From this board

X| |O
-----
|X|O
------
X| |

Move 1

results in board

X|X|O
-----
|X|O
------
X| |


If X just won, we want to set the Q value for the previous state (board) to 1, because X will always win from that state and that action (move).

First we must figure out how to implement the Q table. We want to associate a value with each board and move. We can use a Python dictionary for this, with a (board, move) tuple as the key. We know how to represent a board. A move can be an integer from 0 to 8 that indexes into the board array for the location to place a marker.

In [8]:
Q = {}  # empty table
Q[(tuple(board),1)] = 0
Q

Out[8]:
{(('X', ' ', 'O', ' ', 'X', 'O', 'X', ' ', ' '), 1): 0}
In [9]:
Q[(tuple(board),1)]

Out[9]:
0

What if we try to look up a Q value for a state-action pair we have not encountered yet? It will not be in the dictionary. We can use the dictionary's get method, whose second argument is the default value returned when the key does not exist.

In [10]:
board[1] = 'X'
Q[(tuple(board),1)]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-10-f4e07b043832> in <module>()
1 board[1] = 'X'
----> 2 Q[(tuple(board),1)]

KeyError: (('X', 'X', 'O', ' ', 'X', 'O', 'X', ' ', ' '), 1)
In [11]:
Q.get((tuple(board),1), 42)

Out[11]:
42

Now we can set the Q value for (board,move) to 1.

In [12]:
Q[(tuple(board),move)] = 1


If the board is full, then the previous state and action should be assigned a Q value of 0.

In [13]:
Q[(tuple(board),move)] = 0


If the board is not full, better check to see if O just won. If O did just win, then we should adjust the Q value of the previous state and X action to be closer to -1, because we just received a -1 reinforcement and the game is over.
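In symbols, with learning rate $\rho$, this end-of-game adjustment is

$$Q(\text{board}, \text{move}) \leftarrow Q(\text{board}, \text{move}) + \rho \, \big( -1 - Q(\text{board}, \text{move}) \big)$$

which moves the value a fraction $\rho$ of the way toward $-1$.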

In [14]:
rho = 0.1 # learning rate
Q[(tuple(board),move)] += rho * (-1 - Q[(tuple(board),move)])


If nobody won yet, let's calculate the temporal difference error and use it to adjust the Q value of the previous board,move. We do this only if we are not at the first move of a game.
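Since the reinforcement for a move in the middle of a game is 0, the temporal difference error is simply the difference between the Q values of consecutive moves, giving the update

$$Q(\text{boardOld}, \text{moveOld}) \leftarrow Q(\text{boardOld}, \text{moveOld}) + \rho \, \big( Q(\text{board}, \text{move}) - Q(\text{boardOld}, \text{moveOld}) \big)$$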

In [15]:
step = 0
if step > 0:
    Q[(tuple(boardOld),moveOld)] += rho * (Q[(tuple(board),move)] - Q[(tuple(boardOld),moveOld)])


Initially, taking random moves is a good strategy, because we know nothing about how to play Tic-Tac-Toe. But, once we have gained some experience and our Q table has acquired some good predictions of the sum of future reinforcement, we should rely on our Q values to pick good moves. For a given board, which move is predicted to lead to the best possible future using the current Q table?

In [16]:
validMoves = np.where(board == ' ')[0]
print('Valid moves are',validMoves)
Qs = np.array([Q.get((tuple(board),m), 0) for m in validMoves])
print('Q values for validMoves are',Qs)
bestMove = validMoves[np.argmax(Qs)]
print('Best move is',bestMove)

Valid moves are [3 7 8]
Q values for validMoves are [0 0 0]
Best move is 3


To slowly transition from taking random actions to taking the action currently believed to be best, called the greedy action, we slowly decay a parameter, $\epsilon$, from 1 down towards 0 as the probability of selecting a random action. This is called the $\epsilon$-greedy policy.
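In symbols, for board $s$ the $\epsilon$-greedy policy chooses the move

$$a = \begin{cases} \text{a random valid move} & \text{with probability } \epsilon \\ \argmax{a'} \; Q(s, a') & \text{otherwise} \end{cases}$$

In the implementation below we shuffle the valid moves before taking the argmax, so that ties among equal Q values are broken at random; the next cell simply demonstrates np.random.shuffle.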

In [75]:
s=np.array([4,2,5])
np.random.shuffle(s)
s

Out[75]:
array([5, 2, 4])
In [59]:
def epsilonGreedy(epsilon, Q, board):
    validMoves = np.where(board == ' ')[0]
    if np.random.uniform() < epsilon:
        # Random move
        return np.random.choice(validMoves)
    else:
        # Greedy move. Shuffle first so ties among equal Q values are broken at random.
        np.random.shuffle(validMoves)
        Qs = np.array([Q.get((tuple(board),m), 0) for m in validMoves])
        return validMoves[np.argmax(Qs)]

epsilonGreedy(0.8, Q, board)

Out[59]:
5

Now let's write a function to plot the results of some games. Say the variable outcomes is a vector of 1's, 0's, and -1's, for games in which X wins, draws, and loses, respectively.

In [42]:
outcomes = np.random.choice([-1,0,1],replace=True,size=(1000))
outcomes[:10]

Out[42]:
array([-1, -1,  1,  0,  1,  0, -1,  1, -1,  0])
In [43]:
def plotOutcomes(outcomes, epsilons, maxGames, nGames):
    if nGames == 0:
        return
    nBins = 100
    nPer = int(maxGames/nBins)
    outcomeRows = outcomes.reshape((-1, nPer))
    outcomeRows = outcomeRows[:int(nGames/float(nPer))+1, :]
    avgs = np.mean(outcomeRows, axis=1)

    plt.subplot(3, 1, 1)
    xs = np.linspace(nPer, nGames, len(avgs))
    plt.plot(xs, avgs)
    plt.xlabel('Games')
    plt.ylabel('Mean of Outcomes\n(0=draw, 1=X win, -1=O win)')
    plt.title('Bins of {:d} Games'.format(nPer))

    plt.subplot(3, 1, 2)
    plt.plot(xs, np.sum(outcomeRows == -1, axis=1), 'r-', label='Losses')
    plt.plot(xs, np.sum(outcomeRows == 0, axis=1), 'b-', label='Draws')
    plt.plot(xs, np.sum(outcomeRows == 1, axis=1), 'g-', label='Wins')
    plt.legend(loc="center")
    plt.ylabel('Number of Games\nin Bins of {:d}'.format(nPer))

    plt.subplot(3, 1, 3)
    plt.plot(epsilons[:nGames])
    plt.ylabel(r'$\epsilon$')

In [44]:
plt.figure(figsize=(8,8))
plotOutcomes(outcomes,np.zeros(1000),1000,1000)


Finally, let's write the whole Tic-Tac-Toe learning loop!

In [45]:
from IPython.display import display, clear_output

In [153]:
maxGames = 10000
rho = 0.5        # learning rate
epsilonDecayRate = 0.9999
epsilon = 1.0
graphics = True
showMoves = not graphics

outcomes = np.zeros(maxGames)
epsilons = np.zeros(maxGames)
Q = {}

if graphics:
    fig = plt.figure(figsize=(10,10))

for nGames in range(maxGames):
    epsilon *= epsilonDecayRate
    epsilons[nGames] = epsilon
    step = 0
    board = np.array([' '] * 9)  # empty board
    done = False

    while not done:
        step += 1

        # X's turn
        move = epsilonGreedy(epsilon, Q, board)
        boardNew = copy(board)
        boardNew[move] = 'X'
        if (tuple(board),move) not in Q:
            Q[(tuple(board),move)] = 0  # initial Q value for new board,move
        if showMoves:
            printBoard(boardNew)

        if winner(boardNew):
            # X won!
            if showMoves:
                print('        X Won!')
            Q[(tuple(board),move)] = 1
            done = True
            outcomes[nGames] = 1

        elif not np.any(boardNew == ' '):
            # Game over. No winner.
            if showMoves:
                print('        draw.')
            Q[(tuple(board),move)] = 0
            done = True
            outcomes[nGames] = 0

        else:
            # O's turn.  O is a random player!
            moveO = np.random.choice(np.where(boardNew == ' ')[0])
            boardNew[moveO] = 'O'
            if showMoves:
                printBoard(boardNew)
            if winner(boardNew):
                # O won!
                if showMoves:
                    print('        O Won!')
                Q[(tuple(board),move)] += rho * (-1 - Q[(tuple(board),move)])
                done = True
                outcomes[nGames] = -1

        if step > 1:
            Q[(tuple(boardOld),moveOld)] += rho * (Q[(tuple(board),move)] - Q[(tuple(boardOld),moveOld)])

        boardOld, moveOld = board, move  # remember board and move so Q(boardOld,moveOld) can be updated after the next step
        board = boardNew

    if graphics and (nGames % (maxGames // 10) == 0 or nGames == maxGames - 1):
        fig.clf()
        plotOutcomes(outcomes, epsilons, maxGames, nGames - 1)
        clear_output(wait=True)
        display(fig)

if graphics:
    clear_output(wait=True)

print('Outcomes: {:d} X wins {:d} O wins {:d} draws'.format(int(np.sum(outcomes == 1)),
                                                            int(np.sum(outcomes == -1)),
                                                            int(np.sum(outcomes == 0))))

Outcomes: 7171 X wins 1883 O wins 946 draws
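With training done, we can watch the learned policy play. Here is a minimal sketch, not from the original notebook, that plays one game with X choosing greedily ($\epsilon = 0$) against the same random O, using the functions defined above:

    board = np.array([' '] * 9)
    while True:
        board[epsilonGreedy(0, Q, board)] = 'X'   # greedy X move, since epsilon is 0
        printBoard(board)
        if winner(board):
            print('X won!')
            break
        if not np.any(board == ' '):
            print('Draw.')
            break
        board[np.random.choice(np.where(board == ' ')[0])] = 'O'  # random O reply
        printBoard(board)
        if winner(board):
            print('O won!')
            break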

In [154]:
Q[(tuple([' ']*9),0)]

Out[154]:
0.44758640765803587
In [155]:
Q[(tuple([' ']*9),1)]

Out[155]:
0.430007917702212
In [156]:
Q.get((tuple([' ']*9),0), 0)

Out[156]:
0.44758640765803587
In [157]:
[Q.get((tuple([' ']*9),m), 0) for m in range(9)]

Out[157]:
[0.44758640765803587,
0.430007917702212,
0.36224246798591586,
0.3538047190944087,
0.669464921320719,
0.4192150309661714,
0.48834465403376426,
0.11156529838525431,
0.18115877280086978]
In [158]:
board = np.array([' ']*9)
Qs = [Q.get((tuple(board),m), 0) for m in range(9)]
printBoard(board)
print('''{:.2f} | {:.2f} | {:.2f}
------------------
{:.2f} | {:.2f} | {:.2f}
------------------
{:.2f} | {:.2f} | {:.2f}'''.format(*Qs))

 | |
-----
| |
------
| |
0.45 | 0.43 | 0.36
------------------
0.35 | 0.67 | 0.42
------------------
0.49 | 0.11 | 0.18

In [159]:
def printBoardQs(board, Q):
    printBoard(board)
    Qs = [Q.get((tuple(board),m), 0) for m in range(9)]
    print()
    print('''{:.2f} | {:.2f} | {:.2f}
------------------
{:.2f} | {:.2f} | {:.2f}
------------------
{:.2f} | {:.2f} | {:.2f}'''.format(*Qs))

In [160]:
board[0] = 'X'
board[1] = 'O'
printBoardQs(board,Q)

X|O|
-----
| |
------
| |

0.00 | 0.00 | 0.45
------------------
0.43 | 0.23 | 0.56
------------------
0.69 | 0.22 | 0.37

In [161]:
board[4] = 'X'
board[3] = 'O'
printBoardQs(board,Q)

X|O|
-----
O|X|
------
| |

0.00 | 0.00 | 0.56
------------------
0.00 | 0.00 | 0.50
------------------
0.88 | 0.77 | 1.00

In [162]:
board = np.array([' ']*9)
printBoardQs(board,Q)

 | |
-----
| |
------
| |

0.45 | 0.43 | 0.36
------------------
0.35 | 0.67 | 0.42
------------------
0.49 | 0.11 | 0.18

In [163]:
board[0] = 'X'
board[4] = 'O'
printBoardQs(board,Q)

X| |
-----
|O|
------
| |

0.00 | 0.50 | 0.25
------------------
0.79 | 0.00 | 0.28
------------------
0.82 | 0.11 | 0.14

In [164]:
board[2] = 'X'
board[1] = 'O'
printBoardQs(board,Q)

X|O|X
-----
|O|
------
| |

0.00 | 0.00 | 0.00
------------------
-0.03 | 0.00 | 0.00
------------------
0.41 | -0.25 | 0.00

In [165]:
board[7] = 'X'
board[3] = 'O'
printBoardQs(board,Q)

X|O|X
-----
O|O|
------
|X|

0.00 | 0.00 | 0.00
------------------
0.00 | 0.00 | 0.19
------------------
0.00 | 0.00 | -0.50

In [166]:
board[5] = 'X'
board[6] = 'O'
printBoardQs(board,Q)

X|O|X
-----
O|O|X
------
O|X|

0.00 | 0.00 | 0.00
------------------
0.00 | 0.00 | 0.00
------------------
0.00 | 0.00 | 1.00
