Python Is Not C: Take Two

Jean-François Puget

An exercise in optimising Python code.

This notebook is commented in Python Is Not C: Take Two

First, let's import useful tools and packages

In [1]:
%load_ext Cython
%load_ext line_profiler
from numba import jit,autojit

import random
import numpy as np
import math
import pandas as pd

import sys

if sys.version_info < (3,):
    range = xrange
    
from matplotlib import pyplot as plt
%matplotlib inline

Data sets

We will use two data sets, a random one, and one taken from RAAM. Each data set has a set of test points and a set of track points. The problem is to compute the closest point among the track points for each point in the test points.

Random

Track and waypoints are uniformly drawn in the North East part of the world: 0<= Latitude <= 90 and 0 <= Longitude <= 180. It has 25,000 track points and 400 waypoints.

In [2]:
def create_points(n):
    return pd.DataFrame({'Latitude': np.random.uniform(0.0,90.0,n), \
                         'Longitude': np.random.uniform(0.0,180.0,n)})

track_points = create_points(25000)
test_points = create_points(400)

Raam

Track and waypoints from the 2015 RAAM competition. It has 25,092 track points and 428 waypoints.

In [3]:
def read_points(file):
    points = pd.read_csv(file, sep='\t')
    points = points[['Lat','Lon']]
    points.columns = ['Latitude','Longitude']
    return points

track_points2= read_points('trkpts.tsv')
points2 = np.zeros((len(track_points2),2))
points2[:,0] = track_points2['Latitude']
points2[:,1] = track_points2['Longitude']

test_points2 = read_points('wpts3.tsv')

Let's display these data sets. We display the test_points

In [4]:
datasets = [test_points, test_points2]
titles = ['random', 'raam']

fig, ax = plt.subplots(1, 2, figsize=(8, 3.5))

for axi, title, dataset in zip(ax, titles, datasets):
    axi.plot(dataset['Longitude'], dataset['Latitude'], '.k')
    axi.set_title(title, size=14)
    axi.set_xlabel('Longitude')
    axi.set_ylabel('Latitude')
    
plt.savefig('nneighbors.png')

Reference Code In Python

This is the code I wrote almost a year ago, and that is described in Python Is Not C

We use the following function is from John D Cook. It implements the great circle distance

In [5]:
def distance_on_unit_sphere(lat1, long1, lat2, long2):
 
    # Convert latitude and longitude to 
    # spherical coordinates in radians.
    degrees_to_radians = math.pi/180.0
         
    # phi = 90 - latitude
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - lat2)*degrees_to_radians
         
    # theta = longitude
    theta1 = long1*degrees_to_radians
    theta2 = long2*degrees_to_radians
         
    # Compute spherical distance from spherical coordinates.
         
    # For two locations in spherical coordinates 
    # (1, theta, phi) and (1, theta', phi')
    # cosine( arc length ) = 
    #    sin phi sin phi' cos(theta-theta') + cos phi cos phi'
    # distance = rho * arc length
     
    cos = (math.sin(phi1)*math.sin(phi2)*math.cos(theta1 - theta2) + 
           math.cos(phi1)*math.cos(phi2))
    arc = math.acos( cos )
 
    # Remember to multiply arc by the radius of the earth 
    # in your favorite set of units to get length.
    return arc

We use it to find the closest point in a list of points.

In [6]:
def closest_distance(latitude,longitude,points):
    d = 100000.0
    best = -1
    r = points.index
    for i in r:
        latitude_i = points.ix[i,'Latitude']
        longitude_i = points.ix[i,'Longitude']
        md = distance_on_unit_sphere(latitude, longitude, latitude_i, longitude_i)
        if d > md:
            best = i
            d = md
    return best,d

Profiling

Let's run the above code for each test point

In [9]:
def test(track_points,test_points):
    test_points['Closest_point'] = 0
    test_points['Closest_distance'] = 0.0
    for i in range(len(test_points)): 
        latitude_i = test_points.ix[i,'Latitude']
        longitude_i = test_points.ix[i,'Longitude']
        best, distance = closest_distance(latitude_i, longitude_i, track_points)
        test_points.ix[i,'Closest_point'] = best
        test_points.ix[i,'Closest_distance'] = distance
    return test_points  
In [10]:
%timeit test(track_points,test_points)
%timeit test(track_points2,test_points2)
1 loops, best of 3: 3min 44s per loop
1 loops, best of 3: 3min 42s per loop

Where is this pending time? Let's see with a smaller data set.

In [10]:
%lprun -s -f closest_distance -f test test(create_points(2500),create_points(40))

We see most time is spent on .ix and distance_on_unit_sphere

Series As Numpy Arrays

It is an advice found on pandas documentation to move to Numpy arrays to speed up random access to elements. Let's do it here.

In [11]:
def closest_distance_numpy(latitude,longitude,m,points_latitude,points_longitude):
    d = 100000.0
    best = -1
    for i in range(m):
        latitude_i = points_latitude[i]
        longitude_i = points_longitude[i]
        md = distance_on_unit_sphere(latitude, longitude, latitude_i, longitude_i)
        if d > md:
            best = i
            d = md
    return best,d

def test_numpy(track_points,test_points):
    track_latitudes = track_points['Latitude'].values
    track_longitudes = track_points['Longitude'].values

    test_latitudes = test_points['Latitude'].values
    test_longitudes = test_points['Longitude'].values

    n = len(test_points)
    test_closest_point = np.ndarray(n,dtype=int)
    test_closest_distance = np.ndarray(n,dtype=float)
     
    m = len(track_points)
    for i in range(n): 
        latitude_i = test_latitudes[i]
        longitude_i = test_latitudes[i]
        best, distance = closest_distance_numpy(latitude_i, longitude_i, \
                                                m, track_latitudes,track_longitudes)
        test_closest_point[i] = best
        test_closest_distance[i] = distance
    
    test_points['Closest_point'] = test_closest_point
    test_points['Closest_distance'] = test_closest_distance
    return test_points  

How does it performs?

In [12]:
%timeit test_numpy(track_points,test_points)
%timeit test_numpy(track_points2,test_points2)
1 loops, best of 3: 33.9 s per loop
1 loops, best of 3: 34.7 s per loop

Much better.

Profiling it shows that the bulk of the time is in the distance_on_unit_sphere function.

In [13]:
%lprun -f closest_distance_numpy -f test_numpy test_numpy(track_points,test_points)

Compiling With Numba

Let's see if compiling the distance function speeds up the code.

In [14]:
@jit
def distance_on_unit_sphere_numba(lat1, long1, lat2, long2):
 
    # Convert latitude and longitude to 
    # spherical coordinates in radians.
    degrees_to_radians = math.pi/180.0
         
    # phi = 90 - latitude
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - lat2)*degrees_to_radians
         
    # theta = longitude
    theta1 = long1*degrees_to_radians
    theta2 = long2*degrees_to_radians
         
    # Compute spherical distance from spherical coordinates.
         
    # For two locations in spherical coordinates 
    # (1, theta, phi) and (1, theta', phi')
    # cosine( arc length ) = 
    #    sin phi sin phi' cos(theta-theta') + cos phi cos phi'
    # distance = rho * arc length
     
    cos = (math.sin(phi1)*math.sin(phi2)*math.cos(theta1 - theta2) + 
           math.cos(phi1)*math.cos(phi2))
    arc = math.acos( cos )
 
    # Remember to multiply arc by the radius of the earth 
    # in your favorite set of units to get length.
    return arc

@jit
def closest_distance_numba(latitude,longitude,m,points_latitude,points_longitude):
    d = 100000.0
    best = -1
    for i in range(m):
        latitude_i = points_latitude[i]
        longitude_i = points_longitude[i]
        md = distance_on_unit_sphere_numba(latitude, longitude, latitude_i, longitude_i)
        if d > md:
            best = i
            d = md
    return best,d

def test_numba(track_points,test_points):
    track_latitudes = track_points['Latitude'].values
    track_longitudes = track_points['Longitude'].values

    test_latitudes = test_points['Latitude'].values
    test_longitudes = test_points['Longitude'].values

    n = len(test_points)
    test_closest_point = np.ndarray(n,dtype=int)
    test_closest_distance = np.ndarray(n,dtype=float)
     
    m = len(track_points)
    for i in range(n): 
        latitude_i = test_latitudes[i]
        longitude_i = test_latitudes[i]
        best, distance = closest_distance_numba(latitude_i, longitude_i, m, track_latitudes,track_longitudes)
        test_closest_point[i] = best
        test_closest_distance[i] = distance
    
    test_points['Closest_point'] = test_closest_point
    test_points['Closest_distance'] = test_closest_distance
    return test_points
In [15]:
%timeit test_numba(track_points,test_points)
%timeit test_numba(track_points2,test_points2)
1 loops, best of 3: 943 ms per loop
1 loops, best of 3: 1.23 s per loop

Wow, about 30 times faster!

Let's see if moving to Numpy functions is better

In [16]:
@jit
def distance_on_unit_sphere_numba(lat1, long1, lat2, long2):
 
    # Convert latitude and longitude to 
    # spherical coordinates in radians.
    degrees_to_radians = math.pi/180.0
         
    # phi = 90 - latitude
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - lat2)*degrees_to_radians
         
    # theta = longitude
    theta1 = long1*degrees_to_radians
    theta2 = long2*degrees_to_radians
         
    # Compute spherical distance from spherical coordinates.
         
    # For two locations in spherical coordinates 
    # (1, theta, phi) and (1, theta', phi')
    # cosine( arc length ) = 
    #    sin phi sin phi' cos(theta-theta') + cos phi cos phi'
    # distance = rho * arc length
     
    cos = (np.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) + 
           np.cos(phi1)*np.cos(phi2))
    arc = np.acos( cos )
 
    # Remember to multiply arc by the radius of the earth 
    # in your favorite set of units to get length.
    return arc
In [17]:
%timeit test_numba(track_points,test_points)
%timeit test_numba(track_points2,test_points2)
1 loops, best of 3: 937 ms per loop
1 loops, best of 3: 1.23 s per loop

No change in timing. Let's profile our top function.

In [18]:
%lprun -s -f test_numba test_numba(track_points,test_points)

99.7 % of the time is spent in the closest_distance_numba function

Cython

What about cython performance?

In [68]:
%%cython 
import pandas as pd
import numpy as np
import math

cimport cython

cdef double distance_on_unit_sphere_cython(double lat1, double long1, double lat2, double long2):
    cdef double degrees_to_radians, phi1, phi2, theta1, theta2, cos_value, arc_value, pi
        
    # Convert latitude and longitude to 
    # spherical coordinates in radians.
    pi = math.pi
    degrees_to_radians = pi/180.0
         
    # phi = 90 - latitude
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - lat2)*degrees_to_radians
         
    # theta = longitude
    theta1 = long1*degrees_to_radians
    theta2 = long2*degrees_to_radians
         
    # Compute spherical distance from spherical coordinates.
         
    # For two locations in spherical coordinates 
    # (1, theta, phi) and (1, theta', phi')
    # cosine( arc length ) = 
    #    sin phi sin phi' cos(theta-theta') + cos phi cos phi'
    # distance = rho * arc length
     
    cos_value = (np.sin(phi1)*math.sin(phi2)*math.cos(theta1 - theta2) + 
           math.cos(phi1)*math.cos(phi2))
    arc_value = math.acos( cos_value )
 
    # Remember to multiply arc by the radius of the earth 
    # in your favorite set of units to get length.
    return arc_value


@cython.boundscheck(False)
@cython.wraparound(False)
cdef closest_distance_cython( double latitude, double longitude, long m,\
                             double[:] points_latitude,\
                             double[:] points_longitude):
    cdef:
        double d, md, latitude_i, longitude_i
        long bext, i
        
    d = 100000.0
    best = -1
    for i in range(m):
        latitude_i = points_latitude[i]
        longitude_i = points_longitude[i]
        md = distance_on_unit_sphere_cython(latitude, longitude, latitude_i, longitude_i)
        if d > md:
            best = i
            d = md
    return best,d

def test_cython(track_points,test_points):
    track_latitudes = track_points['Latitude'].values
    track_longitudes = track_points['Longitude'].values

    test_latitudes = test_points['Latitude'].values
    test_longitudes = test_points['Longitude'].values

    n = len(test_points)
    test_closest_point = np.ndarray(n,dtype=int)
    test_closest_distance = np.ndarray(n,dtype=float)
     
    m = len(track_points)
    for i in range(n): 
        latitude_i = test_latitudes[i]
        longitude_i = test_latitudes[i]
        best, distance = closest_distance_cython(latitude_i, longitude_i, m, track_latitudes,track_longitudes)
        test_closest_point[i] = best
        test_closest_distance[i] = distance
    
    test_points['Closest_point'] = test_closest_point
    test_points['Closest_distance'] = test_closest_distance
    return test_points

Cython isn't nearly as fast as Numba here because the math functions aren't inlined.

In [20]:
%timeit test_cython(track_points,test_points)
%timeit test_cython(track_points2,test_points2)
1 loops, best of 3: 32.3 s per loop
1 loops, best of 3: 34 s per loop

Using vectorized operations

The best function I could find 8 months ago was using Numpy boradcast operations to avoid explicitly looping over the test points arrays. It is documented in Python Is Not C. It is reproduced below for comparison purpose.

In [23]:
#does not help: @jit
def closest_distance_pandas (lat1, long1, points):
    degrees_to_radians = math.pi/180.0
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - points.Latitude)*degrees_to_radians
    theta1 = long1*degrees_to_radians
    theta2 = points.Longitude*degrees_to_radians
    cos = (np.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
           np.cos(phi1)*np.cos(phi2))
    arc = np.arccos( cos )
    return arc.idxmin(), arc.min()

#does not help: @jit 
def test_pandas(track_points,test_points):
    test_points['Closest_point'] = 0
    test_points['Closest_distance'] = 0.0
    for i in range(len(test_points)): 
        latitude_i = test_points.ix[i,'Latitude']
        longitude_i = test_points.ix[i,'Longitude']
        best, distance = closest_distance_pandas(latitude_i, longitude_i, track_points)
        test_points.ix[i,'Closest_point'] = best
        test_points.ix[i,'Closest_distance'] = distance
    return test_points  
In [24]:
%timeit test_pandas(track_points,test_points)
%timeit test_pandas(track_points2,test_points2)
1 loops, best of 3: 2.17 s per loop
1 loops, best of 3: 2.3 s per loop

Not bad at all.

Let's profile it.

In [25]:
%lprun -f test_pandas test_pandas(track_points,test_points)

The bulk of the time is spent in closest_distance_pandas but ix also shows. As before, let's get rid of ix calls by switching to Numpy series.

Using Numpy

In [26]:
def closest_distance_broadcast (lat1, long1, latitudes, longitudes):
    degrees_to_radians = math.pi/180.0
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - latitudes)*degrees_to_radians
    theta1 = long1*degrees_to_radians
    theta2 = longitudes*degrees_to_radians
    cos = (np.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
           np.cos(phi1)*np.cos(phi2))
    arc = np.arccos( cos )
    return arc.argmin(), arc.min()

def test_broadcast(track_points,test_points):
    track_latitudes = track_points['Latitude'].values
    track_longitudes = track_points['Longitude'].values

    test_latitudes = test_points['Latitude'].values
    test_longitudes = test_points['Longitude'].values
    
    n = len(test_points)
    test_closest_point = np.ndarray(n,dtype=int)
    test_closest_distance = np.ndarray(n,dtype=float)
    
    for i in range(n): 
        latitude_i = test_latitudes[i]
        longitude_i = test_longitudes[i]
        best, distance = closest_distance_broadcast(latitude_i, longitude_i, track_latitudes, track_longitudes)
        test_closest_point[i] = best
        test_closest_distance[i] = distance
     
    test_points['Closest_point'] = test_closest_point
    test_points['Closest_distance'] = test_closest_distance
    return test_points  
In [27]:
%timeit test_broadcast(track_points,test_points)
%timeit test_broadcast(track_points,test_points)
1 loops, best of 3: 797 ms per loop
1 loops, best of 3: 806 ms per loop

Almost 3 times faster!

Let's profile it.

In [32]:
%lprun -s -f test_broadcast -f closest_distance_broadcast test_broadcast(track_points,test_points)

All the time is spent in broadcast math functions.

Adding Numba

In [33]:
@jit
def closest_distance_broadcast_numba (lat1, long1, latitudes, longitudes):
    # Convert latitude and longitude to
    # spherical coordinates in radians.
    degrees_to_radians = math.pi/180.0
         
    # phi = 90 - latitude
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - latitudes)*degrees_to_radians
         
    # theta = longitude
    theta1 = long1*degrees_to_radians
    theta2 = longitudes*degrees_to_radians
       
    # Compute spherical distance from spherical coordinates.
         
    # For two locations in spherical coordinates
    # (1, theta, phi) and (1, theta, phi)
    # cosine( arc length ) =
    #    sin phi sin phi' cos(theta-theta') + cos phi cos phi'
    # distance = rho * arc length
     
    cos = (math.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
           math.cos(phi1)*np.cos(phi2))
    arc = np.arccos( cos )
 
    # Return the index of the smallest value in arc
    return arc.argmin(), arc.min()

def test_broadcast_numba(track_points,test_points):
    track_latitudes = track_points['Latitude'].values
    track_longitudes = track_points['Longitude'].values

    test_latitudes = test_points['Latitude'].values
    test_longitudes = test_points['Longitude'].values
    
    n = len(test_points)
    test_closest_point = np.ndarray(n,dtype=int)
    test_closest_distance = np.ndarray(n,dtype=float)
    
    for i in range(n): 
        latitude_i = test_latitudes[i]
        longitude_i = test_longitudes[i]
        best, distance = closest_distance_broadcast_numba(latitude_i, longitude_i, track_latitudes, track_longitudes)
        test_closest_point[i] = best
        test_closest_distance[i] = distance
    
    test_points['Closest_point'] = test_closest_point
    test_points['Closest_distance'] = test_closest_distance
    return test_points  
In [34]:
%timeit test_broadcast_numba(track_points,test_points)
%timeit test_broadcast_numba(track_points2,test_points2)
1 loops, best of 3: 614 ms per loop
1 loops, best of 3: 678 ms per loop

An additional 25% improvement!

Cython

Compiling with cython instead of Numba.

In [48]:
%%cython
import pandas as pd
import numpy as np
import math

cimport cython
cimport numpy as np

def closest_distance_broadcast_cython (double lat1, double long1, \
                             np.ndarray[double,ndim=1] latitudes, \
                             np.ndarray[double,ndim=1] longitudes):
    # Convert latitude and longitude to
    # spherical coordinates in radians.
    
    cdef:
        double degrees_to_radians, phi1, theta1
        np.ndarray[double, ndim=1] phi2, theta2, cos, arc
    degrees_to_radians = math.pi/180.0
         
    # phi = 90 - latitude
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - latitudes)*degrees_to_radians
         
    # theta = longitude
    theta1 = long1*degrees_to_radians
    theta2 = longitudes*degrees_to_radians
       
    # Compute spherical distance from spherical coordinates.
         
    # For two locations in spherical coordinates
    # (1, theta, phi) and (1, theta, phi)
    # cosine( arc length ) =
    #    sin phi sin phi' cos(theta-theta') + cos phi cos phi'
    # distance = rho * arc length
     
    cos = (np.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
           np.cos(phi1)*np.cos(phi2))
    arc = np.arccos( cos )
 
    # Return the index of the smallest value in arc
    return arc.argmin(), arc.min()

def test_broadcast_cython(track_points,test_points):
    track_latitudes = track_points['Latitude'].values
    track_longitudes = track_points['Longitude'].values

    test_latitudes = test_points['Latitude'].values
    test_longitudes = test_points['Longitude'].values
    
    n = len(test_points)
    test_closest_point = np.ndarray(n,dtype=int)
    test_closest_distance = np.ndarray(n,dtype=float)
    
    for i in range(n): 
        latitude_i = test_latitudes[i]
        longitude_i = test_longitudes[i]
        best, distance = closest_distance_broadcast_cython(latitude_i, longitude_i, track_latitudes, track_longitudes)
        test_closest_point[i] = best
        test_closest_distance[i] = distance
    
    test_points['Closest_point'] = test_closest_point
    test_points['Closest_distance'] = test_closest_distance
    return test_points

Cython does not speed the code.

In [49]:
%timeit test_broadcast_cython(track_points, test_points)
%timeit test_broadcast_cython(track_points2, test_points2)
1 loops, best of 3: 832 ms per loop
1 loops, best of 3: 822 ms per loop

Manhattan Distance

In Python Is Not C, I decided to use amuch simpler distance, namely the Manhattan distance. The code becomes

In [28]:
def closest_manhattan(lat,lon,trkpts):
    cl = np.abs(trkpts.Latitude - lat) + np.abs(trkpts.Longitude - lon)
    return cl.idxmin(), cl.min()

def test_manhattan(track_points,test_points):
    test_points['Closest_point'] = 0
    test_points['Closest_distance'] = 0.0
    for i in range(len(test_points)): 
        latitude_i = test_points.ix[i,'Latitude']
        longitude_i = test_points.ix[i,'Longitude']
        best, distance = closest_manhattan(latitude_i, longitude_i, track_points)
        test_points.ix[i,'Closest_point'] = best
        test_points.ix[i,'Closest_distance'] = distance
    return test_points  
In [29]:
%timeit test_manhattan(track_points,test_points)
%timeit test_manhattan(track_points2,test_points2)
1 loops, best of 3: 809 ms per loop
1 loops, best of 3: 884 ms per loop

This is alsmost as fast as the Numba compiled code above. Let's profile it.

In [52]:
%lprun -f test_manhattan test_manhattan(track_points,test_points)

About 40% of the time is spent in ix. As before, let's move to Numpy arrays.

In [53]:
def closest_manhattan_numpy(lat,lon,latitudes, longitudes):
    cl = np.abs(latitudes - lat) + np.abs(longitudes - lon)
    return cl.argmin(), cl.min()

def test_manhattan_numpy(track_points,test_points):
    track_latitudes = track_points['Latitude'].values
    track_longitudes = track_points['Longitude'].values

    test_latitudes = test_points['Latitude'].values
    test_longitudes = test_points['Longitude'].values
    
    n = len(test_points)
    test_closest_point = np.ndarray(n,dtype=int)
    test_closest_distance = np.ndarray(n,dtype=float)
    
    for i in range(n): 
        latitude_i = test_latitudes[i]
        longitude_i = test_longitudes[i]
        best, distance = closest_manhattan_numpy(latitude_i, longitude_i, track_latitudes, track_longitudes)
        test_closest_point[i] = best
        test_closest_distance[i] = distance
    
    test_points['Closest_point'] = test_closest_point
    test_points['Closest_distance'] = test_closest_distance
    return test_points  

This is way faster.

In [54]:
%timeit test_manhattan_numpy(track_points,test_points)
%timeit test_manhattan_numpy(track_points2,test_points2)
10 loops, best of 3: 52.2 ms per loop
10 loops, best of 3: 76.7 ms per loop

Profiling it confirms that most of the time is now spent in closest_manhattan_numpy

In [57]:
%lprun -f test_manhattan_numpy test_manhattan_numpy(track_points,test_points)

Let's compile it with Numba.

In [58]:
@jit
def closest_manhattan_numba(latitude,longitude,m,points_latitude,points_longitude):
    d = 100000.0
    best = -1
    for i in range(m):
        latitude_i = points_latitude[i]
        longitude_i = points_longitude[i]
        md = abs(latitude - latitude_i) + abs(longitude - longitude_i)
        if d > md:
            best = i
            d = md
    return best,d

def test_manhattan_numba(track_points,test_points):
    track_latitudes = track_points['Latitude'].values
    track_longitudes = track_points['Longitude'].values

    test_latitudes = test_points['Latitude'].values
    test_longitudes = test_points['Longitude'].values
    
    n = len(test_points)
    test_closest_point = np.ndarray(n,dtype=int)
    test_closest_distance = np.ndarray(n,dtype=float)
    
    m = len(track_points)
    for i in range(n): 
        latitude_i = test_latitudes[i]
        longitude_i = test_longitudes[i]
        best, distance = closest_manhattan_numba(latitude_i, longitude_i, m, track_latitudes, track_longitudes)
        test_closest_point[i] = best
        test_closest_distance[i] = distance
    
    test_points['Closest_point'] = test_closest_point
    test_points['Closest_distance'] = test_closest_distance
    return test_points  

This yields another 2x speedup.

In [60]:
%timeit test_manhattan_numba(track_points,test_points)
%timeit test_manhattan_numba(track_points2,test_points2)
10 loops, best of 3: 30.4 ms per loop
10 loops, best of 3: 35.7 ms per loop

Broadcasting

We can try Numpy broadcasting with this simpler distance function.

In [61]:
def test_manhattan_broadcast(track_points,test_points):
    track_latitudes = track_points['Latitude'].values
    track_longitudes = track_points['Longitude'].values

    test_latitudes = test_points['Latitude'].values
    test_longitudes = test_points['Longitude'].values
    
    n = len(test_points)
    test_closest_point = np.ndarray(n,dtype=int)
    test_closest_distance = np.ndarray(n,dtype=float)    
    
    m = len(track_points)
    
    t1 = test_latitudes[:,None]-track_latitudes
    t2 = test_longitudes[:,None]-track_longitudes
    distances = np.abs(t1) + np.abs(t2)
        
    test_points['Closest_point'] = distances.argmin(axis=1)
    test_points['Closest_distance'] = distances.min(axis=1)
    return test_points  
In [62]:
%timeit test_manhattan_broadcast(track_points,test_points)
%timeit test_manhattan_broadcast(track_points2,test_points2)
1 loops, best of 3: 196 ms per loop
1 loops, best of 3: 229 ms per loop

Using Scipy and SKlearn

Let's now use built-in packages for finding nearest neighbors.

In [5]:
import scipy.spatial
import sklearn.neighbors

We need to copy track points coordinates in a 2D array.

In [6]:
points = np.zeros((len(track_points),2))
points[:,0] = track_points['Latitude']
points[:,1] = track_points['Longitude']

We can now benchmark four built-in functions. These functions start by building a tree with the track points. Let's measure the time it takes as this step is not required with the code above.

In [7]:
%timeit scipy.spatial.cKDTree(points)
%timeit scipy.spatial.KDTree(points)
%timeit sklearn.neighbors.KDTree(points,metric='minkowski',p=1)
%timeit sklearn.neighbors.BallTree(points,metric='minkowski',p=1)
100 loops, best of 3: 10.9 ms per loop
10 loops, best of 3: 134 ms per loop
100 loops, best of 3: 11 ms per loop
100 loops, best of 3: 10.4 ms per loop

Let's create the trees for the rest of the benchmark.

In [8]:
t1 = scipy.spatial.cKDTree(points)
t2 = scipy.spatial.KDTree(points)
t3 = sklearn.neighbors.KDTree(points,metric='minkowski',p=1)
t4 = sklearn.neighbors.BallTree(points,metric='minkowski',p=1)

A utility function for the benchmark. It is good that all these trees have a similar api.

In [20]:
def test_tree(tree, test_points):

    tests = np.zeros((len(test_points),2))
    tests[:,0] = test_points['Latitude']
    tests[:,1] = test_points['Longitude']

    distances, closests = tree.query(tests)
    
    test_points['Closest_point'] = closests
    test_points['Closest_distance'] = distances
    return test_points  
In [21]:
tests = np.zeros((len(test_points),2))
tests[:,0] = test_points['Latitude']
tests[:,1] = test_points['Longitude']
In [23]:
%timeit test_tree(t1, test_points)
%timeit test_tree(t2, test_points)
%timeit test_tree(t3, test_points)
%timeit test_tree(t4, test_points)
1000 loops, best of 3: 800 µs per loop
10 loops, best of 3: 83.7 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.91 ms per loop

Let's test with the Raam data set

In [24]:
t1 = scipy.spatial.cKDTree(points2)
t2 = scipy.spatial.KDTree(points2)
t3 = sklearn.neighbors.KDTree(points2,metric='minkowski',p=1)
t4 = sklearn.neighbors.BallTree(points2,metric='minkowski',p=1)

Timing tree creation

In [25]:
%timeit scipy.spatial.cKDTree(points2)
%timeit scipy.spatial.KDTree(points2)
%timeit sklearn.neighbors.KDTree(points2,metric='minkowski',p=1)
%timeit sklearn.neighbors.BallTree(points2,metric='minkowski',p=1)
1 loops, best of 3: 802 ms per loop
10 loops, best of 3: 205 ms per loop
1 loops, best of 3: 967 ms per loop
1 loops, best of 3: 971 ms per loop

Timing closest distance

In [26]:
%timeit test_tree(t1, test_points2)
%timeit test_tree(t2, test_points2)
%timeit test_tree(t3, test_points2)
%timeit test_tree(t4, test_points2)
The slowest run took 91.35 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 759 µs per loop
10 loops, best of 3: 111 ms per loop
1000 loops, best of 3: 1.02 ms per loop
1000 loops, best of 3: 1.29 ms per loop
In [ ]: