# Reinforcement Learning Solution to the Towers of Hanoi Puzzle

For this assignment, you will use reinforcement learning to solve the Towers of Hanoi puzzle with three pegs and five disks.

To accomplish this, modify the code discussed in lecture for learning to play Tic-Tac-Toe so that it learns to solve the five-disk, three-peg Towers of Hanoi puzzle. In some ways, this will be simpler than the Tic-Tac-Toe code.

Steps required to do this include the following:

• Represent the state and the move, and combine them into a tuple to use as a key into the Q dictionary.
• Make sure only valid moves are tried from each state.
• Assign a reinforcement of $1$ to every move, including the move that results in the goal state.

Make a plot of the number of steps required to reach the goal for each trial. Each trial starts from the same initial state. Decay epsilon as in the Tic-Tac-Toe code.
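The steps above can be sketched as follows. This is only a minimal illustration, not the lecture code: it assumes that, with a reinforcement of 1 per move, Q estimates the number of steps remaining to the goal, so the greedy move is the one with the smallest Q value. The helper names `choose_move` and `td_update`, and the default of 0 for unseen keys, are hypothetical choices made for this sketch.

```python
import random


def choose_move(Q, state_key, moves, epsilon, rng=random):
    """Epsilon-greedy choice (hypothetical helper).

    With probability epsilon pick a random valid move; otherwise pick the
    move with the smallest estimated number of steps to the goal.
    """
    if rng.random() < epsilon:
        return rng.choice(moves)
    # Assumption: unseen (state, move) keys default to 0.
    return min(moves, key=lambda m: Q.get((state_key, tuple(m)), 0.0))


def td_update(Q, key, next_key, learning_rate, reinforcement=1.0, at_goal=False):
    """One temporal-difference update of Q[key] (hypothetical helper).

    The target is the reinforcement alone when the move reached the goal,
    otherwise the reinforcement plus the current estimate for the next key.
    """
    old = Q.get(key, 0.0)
    if at_goal:
        target = reinforcement
    else:
        target = reinforcement + Q.get(next_key, 0.0)
    Q[key] = old + learning_rate * (target - old)
    return Q[key]


# Decaying epsilon at the start of each repetition, assuming a multiplicative decay.
epsilon = 1.0
epsilon_decay_factor = 0.7
for _ in range(3):
    epsilon *= epsilon_decay_factor
```

Because every move costs 1, a converged Q value under these assumptions can be read directly as "moves remaining if I take this move and then act greedily".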

## Requirements

First, how should we represent the state of this puzzle? We need to keep track of which disks are on which pegs. Name the disks 1, 2, 3, 4, and 5, with 1 being the smallest disk and 5 being the largest. The set of disks on a peg can be represented as a list of integers. Then the state can be a list of three lists.

For example, the starting state with all disks being on the left peg would be [[1, 2, 3, 4, 5], [], []]. After moving disk 1 to peg 2, we have [[2, 3, 4, 5], [1], []].

To represent that move we just made, we can use a list of two peg numbers, like [1, 2], representing a move of the top disk on peg 1 to peg 2.

Now on to some functions. Define at least the following functions. Examples showing required output appear below.

• print_state(state): prints the state in the form shown below
• get_valid_moves(state): returns list of moves that are valid from state
• make_move(state, move): returns new (copy of) state after move has been applied.
• train_Q(n_repetitions, learning_rate, epsilon_decay_factor, get_valid_moves, make_move): trains the Q function for n_repetitions repetitions, decaying epsilon at the start of each repetition. Returns Q and a list or array containing the number of steps taken to reach the goal in each repetition.
• test_Q(Q, max_steps, get_valid_moves, make_move): without updating Q, uses Q to choose the greedy action at each step until the goal is reached. Returns the path of states visited.
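One possible shape for the three state-handling functions is sketched below. It assumes the list-of-lists state and 1-based peg numbers described above; the print layout is inferred from the example output in this notebook, and `format_state` is a hypothetical helper introduced only so the layout can be checked as a string.

```python
import copy


def get_valid_moves(state):
    """Return [source, destination] pairs (1-based pegs) that are legal.

    A move is legal when the source peg is non-empty and its top disk is
    smaller than the top disk of the destination peg, or the destination
    peg is empty.
    """
    moves = []
    for src, src_peg in enumerate(state, start=1):
        if not src_peg:
            continue
        for dst, dst_peg in enumerate(state, start=1):
            if dst != src and (not dst_peg or src_peg[0] < dst_peg[0]):
                moves.append([src, dst])
    return moves


def make_move(state, move):
    """Return a new state with the top disk of move[0] placed on move[1]."""
    src, dst = move
    new_state = copy.deepcopy(state)  # leave the caller's state untouched
    disk = new_state[src - 1].pop(0)
    new_state[dst - 1].insert(0, disk)
    return new_state


def format_state(state):
    """Build the text layout (hypothetical helper): one column per peg."""
    height = max(len(peg) for peg in state)
    rows = []
    for r in range(height):
        cells = []
        for peg in state:
            i = r - (height - len(peg))  # index into this peg, or negative if empty here
            cells.append(str(peg[i]) if i >= 0 else ' ')
        rows.append(' '.join(cells).rstrip())
    rows.append('------')
    return '\n'.join(rows)


def print_state(state):
    print(format_state(state))
```

Index 0 of each peg list is treated as the top of the peg, which matches the example `[[2, 3, 4, 5], [1], []]` given earlier.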

A function that you might choose to implement is

• state_move_tuple(state, move): returns tuple of state and move.

This is useful for converting state and move to a key to be used for the Q dictionary.
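The conversion is needed because Python lists are not hashable and so cannot be dictionary keys. A minimal sketch, assuming the list-of-lists state and two-element move described above:

```python
def state_move_tuple(state, move):
    """Convert a state (list of lists) and a move (list) to nested tuples."""
    return (tuple(tuple(peg) for peg in state), tuple(move))


# The resulting key can index a plain dict used as the Q table.
Q = {}
key = state_move_tuple([[1, 2, 3, 4, 5], [], []], [1, 2])
Q[key] = 0.0
```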

Show the code and results for testing each function. Then experiment with various values of n_repetitions, learning_rate, and epsilon_decay_factor to find values that work reasonably well, meaning that eventually the minimum solution path of 31 steps is found consistently.

Make a plot of the number of steps in the solution path versus number of repetitions. The plot should clearly show the number of steps in the solution path eventually reaching the minimum of 31 steps, though the decrease will not be monotonic. Also plot a horizontal, dashed line at a height (value on y axis) of 31 to show the optimal path length.
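The value 31 is the minimum number of moves for five disks on three pegs, $2^5 - 1$. The plot could be produced along the following lines. This is a sketch that assumes matplotlib is available in the notebook; `min_moves` and `plot_steps_to_goal` are hypothetical helper names.

```python
def min_moves(n_disks):
    """Minimum number of moves for the three-peg puzzle: 2**n - 1."""
    return 2 ** n_disks - 1


def plot_steps_to_goal(steps_to_goal, n_disks=5):
    """Plot steps per repetition with a dashed line at the optimal length."""
    import matplotlib.pyplot as plt  # imported here so min_moves works without it

    optimal = min_moves(n_disks)
    fig, ax = plt.subplots()
    ax.plot(steps_to_goal, label='steps to goal')
    ax.axhline(optimal, linestyle='--', color='gray',
               label=f'optimal ({optimal} steps)')
    ax.set_xlabel('Repetition')
    ax.set_ylabel('Steps to reach goal')
    ax.legend()
    return ax
```

Calling `plot_steps_to_goal(steps_to_goal)` with the array returned by train_Q would draw both the learning curve and the reference line.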

Use a total of at least 15 sentences to address the following:

• Add markdown cells in which you describe the Q learning algorithm and your implementation of Q learning as applied to the Towers of Hanoi problem.
• Add code cells to examine several Q values from the start state with different moves and discuss if the Q values make sense.
• Also add code cells to examine several Q values from one or two states that are two steps away from the goal and discuss if these Q values make sense.
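One way to inspect Q values for a single state is sketched below. The helper `q_values_for_state` is hypothetical, and the Q entries are illustrative numbers written by hand, not results from a training run. They assume Q estimates the number of steps remaining and has fully converged, in which case the state `[[1], [2], [3, 4, 5]]`, which is two steps from the goal, would have a value of 2 for its best move.

```python
def q_values_for_state(Q, state, moves):
    """Return {move: Q value} for the given moves; None marks unseen keys."""
    state_key = tuple(tuple(peg) for peg in state)
    return {tuple(move): Q.get((state_key, tuple(move))) for move in moves}


state = [[1], [2], [3, 4, 5]]
moves = [[1, 2], [1, 3], [2, 3]]  # the valid moves from this state

# Illustrative, hand-written values (not training output).
illustrative_Q = {
    (((1,), (2,), (3, 4, 5)), (1, 2)): 4.0,
    (((1,), (2,), (3, 4, 5)), (1, 3)): 4.0,
    (((1,), (2,), (3, 4, 5)), (2, 3)): 2.0,
}

values = q_values_for_state(illustrative_Q, state, moves)
best_move = min(values, key=values.get)  # smallest estimated steps to goal
for move, value in values.items():
    print(move, value)
```

In a real run the values will typically deviate from these ideal numbers for moves that were rarely tried, which is worth discussing.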

# Examples

In [13]:
state = [[1, 2, 3, 4, 5], [], []]
print_state(state)

1
2
3
4
5
------


In [14]:
move = [1, 2]  # Move top (smallest) disk from first peg to second peg

state_move_tuple(state, move)

Out[14]:
(((1, 2, 3, 4, 5), (), ()), (1, 2))
In [15]:
new_state = make_move(state, move)
new_state

Out[15]:
[[2, 3, 4, 5], [1], []]
In [16]:
get_valid_moves(new_state)

Out[16]:
[[1, 3], [2, 1], [2, 3]]
In [17]:
print_state(new_state)

2
3
4
5 1
------


In [18]:
Q, steps_to_goal = train_Q(200, 0.5, 0.7, get_valid_moves, make_move)

In [19]:
steps_to_goal

Out[19]:
array([1302.,  425.,  900., 1919.,  528.,  907.,  982.,  834.,  280.,
314.,  519.,  307.,  325.,  704.,  246.,  618.,  330.,  420.,
144.,  241.,  179.,  199.,  405.,  379.,  556.,  183.,  559.,
167.,  162.,  257.,   89.,  160.,  425.,  262.,  109.,  164.,
322.,  217.,  164.,   96.,  198.,  104.,  113.,  196.,   93.,
94.,  288.,  121.,  372.,   89.,  146.,  343.,  233.,   82.,
63.,   58.,  260.,   90.,   96.,   70.,  342.,  124.,   67.,
115.,   86.,  360.,   93.,  189.,   80.,   67.,   70.,  253.,
60.,   97.,   41.,  362.,   80.,  109.,  174.,  249.,   81.,
63.,   52.,  207.,  275.,   51.,   43.,   92.,   48.,  123.,
57.,  157.,   48.,   42.,   50.,   42.,   55.,  149.,  228.,
60.,   68.,   37.,   36.,   68.,   38.,   50.,  261.,   93.,
126.,   60.,   36.,   50.,   46.,   40.,  126.,   50.,   82.,
108.,   35.,   39.,   33.,  102.,   33.,   45.,   36.,   37.,
49.,  224.,   31.,   39.,   32.,  157.,   39.,   34.,   53.,
31.,   31.,   32.,   31.,   31.,   35.,   31.,  103.,   31.,
79.,   31.,   31.,   31.,   40.,   31.,   31.,   31.,   31.,
31.,   31.,   31.,   31.,   31.,   31.,   31.,   31.,   31.,
31.,   31.,   31.,   31.,   31.,   31.,   31.,   31.,   31.,
31.,   31.,   31.,   31.,   31.,   31.,   31.,   31.,   31.,
31.,   31.,   31.,   31.,   31.,   31.,   31.,   31.,   31.,
31.,   31.,   31.,   31.,   31.,   31.,   31.,   31.,   31.,
31.,   31.])
In [20]:
path = test_Q(Q, 100, get_valid_moves, make_move)

In [21]:
path

Out[21]:
[[[1, 2, 3, 4, 5], [], []],
[[2, 3, 4, 5], [], [1]],
[[3, 4, 5], [2], [1]],
[[3, 4, 5], [1, 2], []],
[[4, 5], [1, 2], [3]],
[[1, 4, 5], [2], [3]],
[[1, 4, 5], [], [2, 3]],
[[4, 5], [], [1, 2, 3]],
[[5], [4], [1, 2, 3]],
[[5], [1, 4], [2, 3]],
[[2, 5], [1, 4], [3]],
[[1, 2, 5], [4], [3]],
[[1, 2, 5], [3, 4], []],
[[2, 5], [3, 4], [1]],
[[5], [2, 3, 4], [1]],
[[5], [1, 2, 3, 4], []],
[[], [1, 2, 3, 4], [5]],
[[1], [2, 3, 4], [5]],
[[1], [3, 4], [2, 5]],
[[], [3, 4], [1, 2, 5]],
[[3], [4], [1, 2, 5]],
[[3], [1, 4], [2, 5]],
[[2, 3], [1, 4], [5]],
[[1, 2, 3], [4], [5]],
[[1, 2, 3], [], [4, 5]],
[[2, 3], [], [1, 4, 5]],
[[3], [2], [1, 4, 5]],
[[3], [1, 2], [4, 5]],
[[], [1, 2], [3, 4, 5]],
[[1], [2], [3, 4, 5]],
[[1], [], [2, 3, 4, 5]],
[[], [1], [2, 3, 4, 5]],
[[], [], [1, 2, 3, 4, 5]]]
In [22]:
for s in path:
    print_state(s)
    print()

1
2
3
4
5
------

2
3
4
5   1
------

3
4
5 2 1
------

3
4 1
5 2
------

4 1
5 2 3
------

1
4
5 2 3
------

1
4   2
5   3
------

    1
4   2
5   3
------

    1
    2
5 4 3
------

  1 2
5 4 3
------

2 1
5 4 3
------

1
2
5 4 3
------

1
2 3
5 4
------

2 3
5 4 1
------

  2
  3
5 4 1
------

  1
  2
  3
5 4
------

  1
  2
  3
  4 5
------

  2
  3
1 4 5
------

  3 2
1 4 5
------

    1
  3 2
  4 5
------

    1
    2
3 4 5
------

  1 2
3 4 5
------

2 1
3 4 5
------

1
2
3 4 5
------

1
2   4
3   5
------

    1
2   4
3   5
------

    1
    4
3 2 5
------

  1 4
3 2 5
------

    3
  1 4
  2 5
------

    3
    4
1 2 5
------

    2
    3
    4
1   5
------

    2
    3
    4
  1 5
------

    1
    2
    3
    4
    5
------



Download and extract A4grader.py from A4grader.tar.

In [3]:
%run -i A4grader.py

======================= Code Execution =======================

['Anderson-A4.ipynb']
Extracting python code from notebook named 'Anderson-A4.ipynb' and storing in notebookcode.py
Removing all statements that are not function or class defs or import statements.

Testing

state = [[1], [2,3], [4, 5]]
moves = get_valid_moves(state)

--- 5/5 points. Correctly returned [[1, 2], [1, 3], [2, 3]]

Testing

state = [[], [], [1, 2, 3, 4, 5]]
moves = get_valid_moves(state)

--- 5/5 points. Correctly returned [[3, 1], [3, 2]]

Testing

state = [[], [], [1, 2, 3, 4, 5]]
new_state = make_move(state, [3, 1])

--- 5/5 points. Correctly returned [[1], [], [2, 3, 4, 5]]

Testing

state = [[1, 2], [3], [4, 5]]
new_state = make_move(state, [1, 3])

--- 5/5 points. Correctly returned [[2], [3], [1, 4, 5]]

Testing

Q, steps = train_Q(1000, 0.5, 0.7, get_valid_moves, make_move)

--- 10/10 points. Correctly returned list of steps that has 1000 elements.

--- 10/10 points. Correctly returned a Q dictionary with at least 700 elements.

Testing

path = test_Q(Q, 20, get_valid_moves, make_move)

--- 20/20 points. Correctly returned path with fewer than 40 states.

======================================================================
A4 Execution Grade is 60 / 60
======================================================================

___ / 10 points. Correct plot of the number of steps in the solution path versus the
number of repetitions.

___ / 10 points. Add markdown cells in which you describe the Q learning algorithm
and your implementation of Q learning as applied to the
Towers of Hanoi problem.

___ / 10 points. Add code cells to examine several Q values from the start state
with different moves and discuss if the Q values make sense.

___ / 10 points. Also add code cells to examine several Q values from one or two states
that are two steps away from the goal and discuss if these Q values make sense.

======================================================================
======================================================================

======================================================================
A4 FINAL GRADE is  _  / 100
======================================================================

Extra Credit: Earn one point of extra credit for your code for solving
the Towers of Hanoi puzzle with four pegs and five disks and the experiments
and discussion of results.

A4 EXTRA CREDIT is 0 / 1


## Extra Credit

Modify your code to solve the Towers of Hanoi puzzle with four pegs and five disks. Name your functions

• print_state_4pegs
• get_valid_moves_4pegs
• make_move_4pegs

Find values for number of repetitions, learning rate, and epsilon decay factor for which train_Q learns a Q function that test_Q can use to find the shortest solution path. Include the output from the successful calls to train_Q and test_Q.
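To know what "shortest" means for four pegs, it can help to compute the shortest path length independently of Q learning. The sketch below uses breadth-first search for that check. It assumes the four-peg state is a list of four lists and that the goal is all disks on the last peg; neither is specified above. `shortest_path_length` is a hypothetical helper and is not part of the required functions.

```python
from collections import deque


def get_valid_moves_4pegs(state):
    """Legal [source, destination] pairs (1-based) for any number of pegs."""
    moves = []
    for src, src_peg in enumerate(state, start=1):
        if not src_peg:
            continue
        for dst, dst_peg in enumerate(state, start=1):
            if dst != src and (not dst_peg or src_peg[0] < dst_peg[0]):
                moves.append([src, dst])
    return moves


def make_move_4pegs(state, move):
    """Return a copy of state with the top disk moved from move[0] to move[1]."""
    src, dst = move
    new_state = [list(peg) for peg in state]
    new_state[dst - 1].insert(0, new_state[src - 1].pop(0))
    return new_state


def shortest_path_length(start, goal, get_valid_moves, make_move):
    """Breadth-first search; returns the minimum number of moves, or None."""
    def as_key(s):
        return tuple(tuple(peg) for peg in s)

    goal_key = as_key(goal)
    seen = {as_key(start)}
    frontier = deque([(start, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if as_key(state) == goal_key:
            return depth
        for move in get_valid_moves(state):
            next_state = make_move(state, move)
            next_key = as_key(next_state)
            if next_key not in seen:
                seen.add(next_key)
                frontier.append((next_state, depth + 1))
    return None


start_4 = [[1, 2, 3, 4, 5], [], [], []]
goal_4 = [[], [], [], [1, 2, 3, 4, 5]]  # assumption: goal is the last peg
n_moves_4 = shortest_path_length(start_4, goal_4,
                                 get_valid_moves_4pegs, make_move_4pegs)
print(n_moves_4)
```

Under these assumptions the search reports 13 moves for four pegs and five disks, so a path of 14 states from test_Q would indicate the optimum has been learned. The same two move functions also work unchanged on three-peg states, where the search reports 31.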