Childish Random TSP Approximation

Author: Vincent D. Warmerdam @ http://koaning.github.io

This notebook will explain a dead simple approach to TSP that approximates an optimal solution and runs in linear time.

Background TSP

If you need a reminder of what a TSP is and why it is such an interesting problem in computer science I can recommend this this video.

Packages

Always load the packages first. In this case, both the R-stuff (for plotting) and the python-stuff.

In [663]:
from ggplot import *
import pandas as pd
import numpy as np
%pylab inline
%load_ext rpy2.ipython
Populating the interactive namespace from numpy and matplotlib
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
WARNING: pylab import has clobbered these variables: ['xlim', 'sign', 'ylim']
`%matplotlib` prevents importing * from pylab and numpy
In [664]:
%%R
library(ggplot2)

Example Data

I'll be using ggplot for the visualisation so most of my data will always need to be transported via dataframes.

In [665]:
x = np.random.uniform(0,1,200)
y = np.random.uniform(0,1,200)
df = pd.DataFrame({'x':x, 'y':y})
%Rpush df
In [666]:
%%R -w 1200 -o df
ggplot() + geom_point(aes(x,y), data=df, size=5) + ggtitle("TSP points")

Theoretically there are a lot of possible paths to choose from. All permutations of this problem leaves us with $99!$ options. First, let's just visualize a random permutation.

In [667]:
df = pd.DataFrame({'x':x, 'y':y})
perm = np.random.permutation(np.arange(0,df.shape[0],1))
df['order'] = perm
df_perm = df.sort('order')
%Rpush df_perm
In [668]:
%%R -w 1200 -o df_perm
p = ggplot() + geom_path(aes(x,y), data=df_perm, size=1) 
p + geom_point(aes(x,y), data=df_perm, size=4) + ggtitle("TSP points")

Just by looking at this plot I get the idea that we might be able to make things better. Let's define a cost function so we can start to compare different permutations.

In [669]:
def distance(df):
    dx = df['x'] - df['x'].shift()
    dy = df['y'] - df['y'].shift()
    return sum(sqrt(dx**2+dy**2))

rand_scores = [] 
for i in range(1000):
    df['order'] = np.random.permutation(np.arange(0,df.shape[0],1))
    df = df.sort('order')
    rand_scores.append(distance(df))

dist_df = pd.DataFrame({'dist': np.array(rand_scores)})

%Rpush dist_df
In [670]:
%%R -w 1200 -o dist_df
p = ggplot() + geom_histogram(aes(dist), data=dist_df, binwidth=0.5) 
p + ggtitle("Histogram of Random RSP Distances")

We can see that there is a bit of spread in just 1000 samples. This will help us to give a general impression on how well a tsp allocation performs.

Block-Heuristic

Let's approximate blocks in the 2d space we've created and go through them.

Alt text

Just as an approximation, we could visit all the points in one block before we move on to the next one.

Alt text

It is a simple heuristic that immediately removes a lot of expensive city connections because it both locally and globally puts a limit on distances that are travelled. It is a very simple, childlike algorithm, but as we will see, it has some nice characteristics.

The pandas cut method helps out a lot with the implementation.

In [671]:
dim = 6
df_perm = df
df_perm['xcut'] = pd.cut(df_perm.x, dim, labels=range(dim))
df_perm['ycut'] = pd.cut(df_perm.y, dim, labels=range(dim))
df_perm = df_perm.sort(['xcut','ycut'])
df_perm.order = np.arange(0, df_perm.shape[0], 1)
df_perm = pd.concat([df_perm, df_perm.head(1)])
print(df_perm.head(15))
print(distance(df_perm))
%Rpush df_perm
            x         y  order  xcut  ycut
6    0.166565  0.088334      0     0     0
159  0.038724  0.134862      1     0     0
173  0.081720  0.092598      2     0     0
117  0.048155  0.071911      3     0     0
13   0.130920  0.090656      4     0     0
43   0.104134  0.155605      5     0     0
112  0.100763  0.303163      6     0     1
103  0.078253  0.200736      7     0     1
63   0.005040  0.188670      8     0     1
50   0.057423  0.220988      9     0     1
131  0.160861  0.220250     10     0     1
133  0.010323  0.267152     11     0     1
32   0.140946  0.435007     12     0     2
55   0.106370  0.369894     13     0     2
71   0.023560  0.439148     14     0     2

[15 rows x 5 columns]
25.7295874131
In [672]:
%%R -w 1200 -o df_perm
p = ggplot() + geom_path(aes(x,y), data=df_perm, size=1) 
p + geom_point(aes(x,y), data=df_perm, size=4) + ggtitle("TSP points")

Already, by just binning all the points together in order, we have gotten a result that is far better that something we got out of a random sample.

Notice that we have one variable that we could tune in this case, namely the dimensions by which we are cutting up our search space. Let's turn that into a function and check what the performance does.

In [673]:
def boxer1(dim):
    df_perm = df
    df_perm['xcut'] = pd.cut(df_perm.x, dim, labels=range(dim))
    df_perm['ycut'] = pd.cut(df_perm.y, dim, labels=range(dim))
    df_perm = df_perm.sort(['xcut','ycut'])
    df_perm.order = np.arange(0, df_perm.shape[0], 1)
    return(pd.concat([df_perm, df_perm.head(1)]))

pltr = pd.DataFrame({'x':range(1,50), 
                     'y':[distance(boxer1(i)) for i in range(1,50)]})

%Rpush pltr
In [674]:
%%R -w 1200 -o pltr
p = ggplot() + geom_path(aes(x,y), data=pltr, size=1) 
p + ggtitle("TSP performance over box dimensions")

There seems to be a sweet spot at having a dimension around 8, after that the results seem to be getting worce. Let's compare a solution of dimension 8 and a solution of dimension 40.

In [675]:
pltr = boxer1(8)
%Rpush pltr
In [676]:
%%R -w 1200 -o pltr
p = ggplot() + geom_path(aes(x,y), data=pltr, size=1) 
p + geom_point(aes(x,y), data=pltr, size=4) + ggtitle("TSP points")
In [677]:
pltr = boxer1(40)
%Rpush pltr
In [678]:
%%R -w 1200 -o pltr
p = ggplot() + geom_path(aes(x,y), data=pltr, size=1) 
p + geom_point(aes(x,y), data=pltr, size=4) + ggtitle("TSP points")