An exercise in optimising Python code.
This notebook is commented in Python Is Not C: Take Two
First, let's import useful tools and packages
%load_ext Cython
%load_ext line_profiler
from numba import jit,autojit
import random
import numpy as np
import math
import pandas as pd
import sys
if sys.version_info < (3,):
range = xrange
from matplotlib import pyplot as plt
%matplotlib inline
We will use two data sets, a random one, and one taken from RAAM. Each data set has a set of test points and a set of track points. The problem is to compute the closest point among the track points for each point in the test points.
Track and waypoints are uniformly drawn in the North East part of the world: 0<= Latitude <= 90 and 0 <= Longitude <= 180. It has 25,000 track points and 400 waypoints.
def create_points(n):
return pd.DataFrame({'Latitude': np.random.uniform(0.0,90.0,n), \
'Longitude': np.random.uniform(0.0,180.0,n)})
track_points = create_points(25000)
test_points = create_points(400)
Track and waypoints from the 2015 RAAM competition. It has 25,092 track points and 428 waypoints.
def read_points(file):
points = pd.read_csv(file, sep='\t')
points = points[['Lat','Lon']]
points.columns = ['Latitude','Longitude']
return points
track_points2= read_points('trkpts.tsv')
points2 = np.zeros((len(track_points2),2))
points2[:,0] = track_points2['Latitude']
points2[:,1] = track_points2['Longitude']
test_points2 = read_points('wpts3.tsv')
Let's display these data sets. We display the test_points
datasets = [test_points, test_points2]
titles = ['random', 'raam']
fig, ax = plt.subplots(1, 2, figsize=(8, 3.5))
for axi, title, dataset in zip(ax, titles, datasets):
axi.plot(dataset['Longitude'], dataset['Latitude'], '.k')
axi.set_title(title, size=14)
axi.set_xlabel('Longitude')
axi.set_ylabel('Latitude')
plt.savefig('nneighbors.png')
This is the code I wrote almost a year ago, and that is described in Python Is Not C
We use the following function is from John D Cook. It implements the great circle distance
def distance_on_unit_sphere(lat1, long1, lat2, long2):
# Convert latitude and longitude to
# spherical coordinates in radians.
degrees_to_radians = math.pi/180.0
# phi = 90 - latitude
phi1 = (90.0 - lat1)*degrees_to_radians
phi2 = (90.0 - lat2)*degrees_to_radians
# theta = longitude
theta1 = long1*degrees_to_radians
theta2 = long2*degrees_to_radians
# Compute spherical distance from spherical coordinates.
# For two locations in spherical coordinates
# (1, theta, phi) and (1, theta', phi')
# cosine( arc length ) =
# sin phi sin phi' cos(theta-theta') + cos phi cos phi'
# distance = rho * arc length
cos = (math.sin(phi1)*math.sin(phi2)*math.cos(theta1 - theta2) +
math.cos(phi1)*math.cos(phi2))
arc = math.acos( cos )
# Remember to multiply arc by the radius of the earth
# in your favorite set of units to get length.
return arc
We use it to find the closest point in a list of points.
def closest_distance(latitude,longitude,points):
d = 100000.0
best = -1
r = points.index
for i in r:
latitude_i = points.ix[i,'Latitude']
longitude_i = points.ix[i,'Longitude']
md = distance_on_unit_sphere(latitude, longitude, latitude_i, longitude_i)
if d > md:
best = i
d = md
return best,d
Let's run the above code for each test point
def test(track_points,test_points):
test_points['Closest_point'] = 0
test_points['Closest_distance'] = 0.0
for i in range(len(test_points)):
latitude_i = test_points.ix[i,'Latitude']
longitude_i = test_points.ix[i,'Longitude']
best, distance = closest_distance(latitude_i, longitude_i, track_points)
test_points.ix[i,'Closest_point'] = best
test_points.ix[i,'Closest_distance'] = distance
return test_points
%timeit test(track_points,test_points)
%timeit test(track_points2,test_points2)
1 loops, best of 3: 3min 44s per loop 1 loops, best of 3: 3min 42s per loop
Where is this pending time? Let's see with a smaller data set.
%lprun -s -f closest_distance -f test test(create_points(2500),create_points(40))
We see most time is spent on .ix and distance_on_unit_sphere
It is an advice found on pandas documentation to move to Numpy arrays to speed up random access to elements. Let's do it here.
def closest_distance_numpy(latitude,longitude,m,points_latitude,points_longitude):
d = 100000.0
best = -1
for i in range(m):
latitude_i = points_latitude[i]
longitude_i = points_longitude[i]
md = distance_on_unit_sphere(latitude, longitude, latitude_i, longitude_i)
if d > md:
best = i
d = md
return best,d
def test_numpy(track_points,test_points):
track_latitudes = track_points['Latitude'].values
track_longitudes = track_points['Longitude'].values
test_latitudes = test_points['Latitude'].values
test_longitudes = test_points['Longitude'].values
n = len(test_points)
test_closest_point = np.ndarray(n,dtype=int)
test_closest_distance = np.ndarray(n,dtype=float)
m = len(track_points)
for i in range(n):
latitude_i = test_latitudes[i]
longitude_i = test_latitudes[i]
best, distance = closest_distance_numpy(latitude_i, longitude_i, \
m, track_latitudes,track_longitudes)
test_closest_point[i] = best
test_closest_distance[i] = distance
test_points['Closest_point'] = test_closest_point
test_points['Closest_distance'] = test_closest_distance
return test_points
How does it performs?
%timeit test_numpy(track_points,test_points)
%timeit test_numpy(track_points2,test_points2)
1 loops, best of 3: 33.9 s per loop 1 loops, best of 3: 34.7 s per loop
Much better.
Profiling it shows that the bulk of the time is in the distance_on_unit_sphere
function.
%lprun -f closest_distance_numpy -f test_numpy test_numpy(track_points,test_points)
Let's see if compiling the distance function speeds up the code.
@jit
def distance_on_unit_sphere_numba(lat1, long1, lat2, long2):
# Convert latitude and longitude to
# spherical coordinates in radians.
degrees_to_radians = math.pi/180.0
# phi = 90 - latitude
phi1 = (90.0 - lat1)*degrees_to_radians
phi2 = (90.0 - lat2)*degrees_to_radians
# theta = longitude
theta1 = long1*degrees_to_radians
theta2 = long2*degrees_to_radians
# Compute spherical distance from spherical coordinates.
# For two locations in spherical coordinates
# (1, theta, phi) and (1, theta', phi')
# cosine( arc length ) =
# sin phi sin phi' cos(theta-theta') + cos phi cos phi'
# distance = rho * arc length
cos = (math.sin(phi1)*math.sin(phi2)*math.cos(theta1 - theta2) +
math.cos(phi1)*math.cos(phi2))
arc = math.acos( cos )
# Remember to multiply arc by the radius of the earth
# in your favorite set of units to get length.
return arc
@jit
def closest_distance_numba(latitude,longitude,m,points_latitude,points_longitude):
d = 100000.0
best = -1
for i in range(m):
latitude_i = points_latitude[i]
longitude_i = points_longitude[i]
md = distance_on_unit_sphere_numba(latitude, longitude, latitude_i, longitude_i)
if d > md:
best = i
d = md
return best,d
def test_numba(track_points,test_points):
track_latitudes = track_points['Latitude'].values
track_longitudes = track_points['Longitude'].values
test_latitudes = test_points['Latitude'].values
test_longitudes = test_points['Longitude'].values
n = len(test_points)
test_closest_point = np.ndarray(n,dtype=int)
test_closest_distance = np.ndarray(n,dtype=float)
m = len(track_points)
for i in range(n):
latitude_i = test_latitudes[i]
longitude_i = test_latitudes[i]
best, distance = closest_distance_numba(latitude_i, longitude_i, m, track_latitudes,track_longitudes)
test_closest_point[i] = best
test_closest_distance[i] = distance
test_points['Closest_point'] = test_closest_point
test_points['Closest_distance'] = test_closest_distance
return test_points
%timeit test_numba(track_points,test_points)
%timeit test_numba(track_points2,test_points2)
1 loops, best of 3: 943 ms per loop 1 loops, best of 3: 1.23 s per loop
Wow, about 30 times faster!
Let's see if moving to Numpy functions is better
@jit
def distance_on_unit_sphere_numba(lat1, long1, lat2, long2):
# Convert latitude and longitude to
# spherical coordinates in radians.
degrees_to_radians = math.pi/180.0
# phi = 90 - latitude
phi1 = (90.0 - lat1)*degrees_to_radians
phi2 = (90.0 - lat2)*degrees_to_radians
# theta = longitude
theta1 = long1*degrees_to_radians
theta2 = long2*degrees_to_radians
# Compute spherical distance from spherical coordinates.
# For two locations in spherical coordinates
# (1, theta, phi) and (1, theta', phi')
# cosine( arc length ) =
# sin phi sin phi' cos(theta-theta') + cos phi cos phi'
# distance = rho * arc length
cos = (np.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
np.cos(phi1)*np.cos(phi2))
arc = np.acos( cos )
# Remember to multiply arc by the radius of the earth
# in your favorite set of units to get length.
return arc
%timeit test_numba(track_points,test_points)
%timeit test_numba(track_points2,test_points2)
1 loops, best of 3: 937 ms per loop 1 loops, best of 3: 1.23 s per loop
No change in timing. Let's profile our top function.
%lprun -s -f test_numba test_numba(track_points,test_points)
99.7 % of the time is spent in the closest_distance_numba function
What about cython performance?
%%cython
import pandas as pd
import numpy as np
import math
cimport cython
cdef double distance_on_unit_sphere_cython(double lat1, double long1, double lat2, double long2):
cdef double degrees_to_radians, phi1, phi2, theta1, theta2, cos_value, arc_value, pi
# Convert latitude and longitude to
# spherical coordinates in radians.
pi = math.pi
degrees_to_radians = pi/180.0
# phi = 90 - latitude
phi1 = (90.0 - lat1)*degrees_to_radians
phi2 = (90.0 - lat2)*degrees_to_radians
# theta = longitude
theta1 = long1*degrees_to_radians
theta2 = long2*degrees_to_radians
# Compute spherical distance from spherical coordinates.
# For two locations in spherical coordinates
# (1, theta, phi) and (1, theta', phi')
# cosine( arc length ) =
# sin phi sin phi' cos(theta-theta') + cos phi cos phi'
# distance = rho * arc length
cos_value = (np.sin(phi1)*math.sin(phi2)*math.cos(theta1 - theta2) +
math.cos(phi1)*math.cos(phi2))
arc_value = math.acos( cos_value )
# Remember to multiply arc by the radius of the earth
# in your favorite set of units to get length.
return arc_value
@cython.boundscheck(False)
@cython.wraparound(False)
cdef closest_distance_cython( double latitude, double longitude, long m,\
double[:] points_latitude,\
double[:] points_longitude):
cdef:
double d, md, latitude_i, longitude_i
long bext, i
d = 100000.0
best = -1
for i in range(m):
latitude_i = points_latitude[i]
longitude_i = points_longitude[i]
md = distance_on_unit_sphere_cython(latitude, longitude, latitude_i, longitude_i)
if d > md:
best = i
d = md
return best,d
def test_cython(track_points,test_points):
track_latitudes = track_points['Latitude'].values
track_longitudes = track_points['Longitude'].values
test_latitudes = test_points['Latitude'].values
test_longitudes = test_points['Longitude'].values
n = len(test_points)
test_closest_point = np.ndarray(n,dtype=int)
test_closest_distance = np.ndarray(n,dtype=float)
m = len(track_points)
for i in range(n):
latitude_i = test_latitudes[i]
longitude_i = test_latitudes[i]
best, distance = closest_distance_cython(latitude_i, longitude_i, m, track_latitudes,track_longitudes)
test_closest_point[i] = best
test_closest_distance[i] = distance
test_points['Closest_point'] = test_closest_point
test_points['Closest_distance'] = test_closest_distance
return test_points
Cython isn't nearly as fast as Numba here because the math functions aren't inlined.
%timeit test_cython(track_points,test_points)
%timeit test_cython(track_points2,test_points2)
1 loops, best of 3: 32.3 s per loop 1 loops, best of 3: 34 s per loop
The best function I could find 8 months ago was using Numpy boradcast operations to avoid explicitly looping over the test points arrays. It is documented in Python Is Not C. It is reproduced below for comparison purpose.
#does not help: @jit
def closest_distance_pandas (lat1, long1, points):
degrees_to_radians = math.pi/180.0
phi1 = (90.0 - lat1)*degrees_to_radians
phi2 = (90.0 - points.Latitude)*degrees_to_radians
theta1 = long1*degrees_to_radians
theta2 = points.Longitude*degrees_to_radians
cos = (np.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
np.cos(phi1)*np.cos(phi2))
arc = np.arccos( cos )
return arc.idxmin(), arc.min()
#does not help: @jit
def test_pandas(track_points,test_points):
test_points['Closest_point'] = 0
test_points['Closest_distance'] = 0.0
for i in range(len(test_points)):
latitude_i = test_points.ix[i,'Latitude']
longitude_i = test_points.ix[i,'Longitude']
best, distance = closest_distance_pandas(latitude_i, longitude_i, track_points)
test_points.ix[i,'Closest_point'] = best
test_points.ix[i,'Closest_distance'] = distance
return test_points
%timeit test_pandas(track_points,test_points)
%timeit test_pandas(track_points2,test_points2)
1 loops, best of 3: 2.17 s per loop 1 loops, best of 3: 2.3 s per loop
Not bad at all.
Let's profile it.
%lprun -f test_pandas test_pandas(track_points,test_points)
The bulk of the time is spent in closest_distance_pandas but ix also shows. As before, let's get rid of ix calls by switching to Numpy series.
def closest_distance_broadcast (lat1, long1, latitudes, longitudes):
degrees_to_radians = math.pi/180.0
phi1 = (90.0 - lat1)*degrees_to_radians
phi2 = (90.0 - latitudes)*degrees_to_radians
theta1 = long1*degrees_to_radians
theta2 = longitudes*degrees_to_radians
cos = (np.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
np.cos(phi1)*np.cos(phi2))
arc = np.arccos( cos )
return arc.argmin(), arc.min()
def test_broadcast(track_points,test_points):
track_latitudes = track_points['Latitude'].values
track_longitudes = track_points['Longitude'].values
test_latitudes = test_points['Latitude'].values
test_longitudes = test_points['Longitude'].values
n = len(test_points)
test_closest_point = np.ndarray(n,dtype=int)
test_closest_distance = np.ndarray(n,dtype=float)
for i in range(n):
latitude_i = test_latitudes[i]
longitude_i = test_longitudes[i]
best, distance = closest_distance_broadcast(latitude_i, longitude_i, track_latitudes, track_longitudes)
test_closest_point[i] = best
test_closest_distance[i] = distance
test_points['Closest_point'] = test_closest_point
test_points['Closest_distance'] = test_closest_distance
return test_points
%timeit test_broadcast(track_points,test_points)
%timeit test_broadcast(track_points,test_points)
1 loops, best of 3: 797 ms per loop 1 loops, best of 3: 806 ms per loop
Almost 3 times faster!
Let's profile it.
%lprun -s -f test_broadcast -f closest_distance_broadcast test_broadcast(track_points,test_points)
All the time is spent in broadcast math functions.
@jit
def closest_distance_broadcast_numba (lat1, long1, latitudes, longitudes):
# Convert latitude and longitude to
# spherical coordinates in radians.
degrees_to_radians = math.pi/180.0
# phi = 90 - latitude
phi1 = (90.0 - lat1)*degrees_to_radians
phi2 = (90.0 - latitudes)*degrees_to_radians
# theta = longitude
theta1 = long1*degrees_to_radians
theta2 = longitudes*degrees_to_radians
# Compute spherical distance from spherical coordinates.
# For two locations in spherical coordinates
# (1, theta, phi) and (1, theta, phi)
# cosine( arc length ) =
# sin phi sin phi' cos(theta-theta') + cos phi cos phi'
# distance = rho * arc length
cos = (math.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
math.cos(phi1)*np.cos(phi2))
arc = np.arccos( cos )
# Return the index of the smallest value in arc
return arc.argmin(), arc.min()
def test_broadcast_numba(track_points,test_points):
track_latitudes = track_points['Latitude'].values
track_longitudes = track_points['Longitude'].values
test_latitudes = test_points['Latitude'].values
test_longitudes = test_points['Longitude'].values
n = len(test_points)
test_closest_point = np.ndarray(n,dtype=int)
test_closest_distance = np.ndarray(n,dtype=float)
for i in range(n):
latitude_i = test_latitudes[i]
longitude_i = test_longitudes[i]
best, distance = closest_distance_broadcast_numba(latitude_i, longitude_i, track_latitudes, track_longitudes)
test_closest_point[i] = best
test_closest_distance[i] = distance
test_points['Closest_point'] = test_closest_point
test_points['Closest_distance'] = test_closest_distance
return test_points
%timeit test_broadcast_numba(track_points,test_points)
%timeit test_broadcast_numba(track_points2,test_points2)
1 loops, best of 3: 614 ms per loop 1 loops, best of 3: 678 ms per loop
An additional 25% improvement!
Compiling with cython instead of Numba.
%%cython
import pandas as pd
import numpy as np
import math
cimport cython
cimport numpy as np
def closest_distance_broadcast_cython (double lat1, double long1, \
np.ndarray[double,ndim=1] latitudes, \
np.ndarray[double,ndim=1] longitudes):
# Convert latitude and longitude to
# spherical coordinates in radians.
cdef:
double degrees_to_radians, phi1, theta1
np.ndarray[double, ndim=1] phi2, theta2, cos, arc
degrees_to_radians = math.pi/180.0
# phi = 90 - latitude
phi1 = (90.0 - lat1)*degrees_to_radians
phi2 = (90.0 - latitudes)*degrees_to_radians
# theta = longitude
theta1 = long1*degrees_to_radians
theta2 = longitudes*degrees_to_radians
# Compute spherical distance from spherical coordinates.
# For two locations in spherical coordinates
# (1, theta, phi) and (1, theta, phi)
# cosine( arc length ) =
# sin phi sin phi' cos(theta-theta') + cos phi cos phi'
# distance = rho * arc length
cos = (np.sin(phi1)*np.sin(phi2)*np.cos(theta1 - theta2) +
np.cos(phi1)*np.cos(phi2))
arc = np.arccos( cos )
# Return the index of the smallest value in arc
return arc.argmin(), arc.min()
def test_broadcast_cython(track_points,test_points):
track_latitudes = track_points['Latitude'].values
track_longitudes = track_points['Longitude'].values
test_latitudes = test_points['Latitude'].values
test_longitudes = test_points['Longitude'].values
n = len(test_points)
test_closest_point = np.ndarray(n,dtype=int)
test_closest_distance = np.ndarray(n,dtype=float)
for i in range(n):
latitude_i = test_latitudes[i]
longitude_i = test_longitudes[i]
best, distance = closest_distance_broadcast_cython(latitude_i, longitude_i, track_latitudes, track_longitudes)
test_closest_point[i] = best
test_closest_distance[i] = distance
test_points['Closest_point'] = test_closest_point
test_points['Closest_distance'] = test_closest_distance
return test_points
Cython does not speed the code.
%timeit test_broadcast_cython(track_points, test_points)
%timeit test_broadcast_cython(track_points2, test_points2)
1 loops, best of 3: 832 ms per loop 1 loops, best of 3: 822 ms per loop
In Python Is Not C, I decided to use amuch simpler distance, namely the Manhattan distance. The code becomes
def closest_manhattan(lat,lon,trkpts):
cl = np.abs(trkpts.Latitude - lat) + np.abs(trkpts.Longitude - lon)
return cl.idxmin(), cl.min()
def test_manhattan(track_points,test_points):
test_points['Closest_point'] = 0
test_points['Closest_distance'] = 0.0
for i in range(len(test_points)):
latitude_i = test_points.ix[i,'Latitude']
longitude_i = test_points.ix[i,'Longitude']
best, distance = closest_manhattan(latitude_i, longitude_i, track_points)
test_points.ix[i,'Closest_point'] = best
test_points.ix[i,'Closest_distance'] = distance
return test_points
%timeit test_manhattan(track_points,test_points)
%timeit test_manhattan(track_points2,test_points2)
1 loops, best of 3: 809 ms per loop 1 loops, best of 3: 884 ms per loop
This is alsmost as fast as the Numba compiled code above. Let's profile it.
%lprun -f test_manhattan test_manhattan(track_points,test_points)
About 40% of the time is spent in ix. As before, let's move to Numpy arrays.
def closest_manhattan_numpy(lat,lon,latitudes, longitudes):
cl = np.abs(latitudes - lat) + np.abs(longitudes - lon)
return cl.argmin(), cl.min()
def test_manhattan_numpy(track_points,test_points):
track_latitudes = track_points['Latitude'].values
track_longitudes = track_points['Longitude'].values
test_latitudes = test_points['Latitude'].values
test_longitudes = test_points['Longitude'].values
n = len(test_points)
test_closest_point = np.ndarray(n,dtype=int)
test_closest_distance = np.ndarray(n,dtype=float)
for i in range(n):
latitude_i = test_latitudes[i]
longitude_i = test_longitudes[i]
best, distance = closest_manhattan_numpy(latitude_i, longitude_i, track_latitudes, track_longitudes)
test_closest_point[i] = best
test_closest_distance[i] = distance
test_points['Closest_point'] = test_closest_point
test_points['Closest_distance'] = test_closest_distance
return test_points
This is way faster.
%timeit test_manhattan_numpy(track_points,test_points)
%timeit test_manhattan_numpy(track_points2,test_points2)
10 loops, best of 3: 52.2 ms per loop 10 loops, best of 3: 76.7 ms per loop
Profiling it confirms that most of the time is now spent in closest_manhattan_numpy
%lprun -f test_manhattan_numpy test_manhattan_numpy(track_points,test_points)
Let's compile it with Numba.
@jit
def closest_manhattan_numba(latitude,longitude,m,points_latitude,points_longitude):
d = 100000.0
best = -1
for i in range(m):
latitude_i = points_latitude[i]
longitude_i = points_longitude[i]
md = abs(latitude - latitude_i) + abs(longitude - longitude_i)
if d > md:
best = i
d = md
return best,d
def test_manhattan_numba(track_points,test_points):
track_latitudes = track_points['Latitude'].values
track_longitudes = track_points['Longitude'].values
test_latitudes = test_points['Latitude'].values
test_longitudes = test_points['Longitude'].values
n = len(test_points)
test_closest_point = np.ndarray(n,dtype=int)
test_closest_distance = np.ndarray(n,dtype=float)
m = len(track_points)
for i in range(n):
latitude_i = test_latitudes[i]
longitude_i = test_longitudes[i]
best, distance = closest_manhattan_numba(latitude_i, longitude_i, m, track_latitudes, track_longitudes)
test_closest_point[i] = best
test_closest_distance[i] = distance
test_points['Closest_point'] = test_closest_point
test_points['Closest_distance'] = test_closest_distance
return test_points
This yields another 2x speedup.
%timeit test_manhattan_numba(track_points,test_points)
%timeit test_manhattan_numba(track_points2,test_points2)
10 loops, best of 3: 30.4 ms per loop 10 loops, best of 3: 35.7 ms per loop
We can try Numpy broadcasting with this simpler distance function.
def test_manhattan_broadcast(track_points,test_points):
track_latitudes = track_points['Latitude'].values
track_longitudes = track_points['Longitude'].values
test_latitudes = test_points['Latitude'].values
test_longitudes = test_points['Longitude'].values
n = len(test_points)
test_closest_point = np.ndarray(n,dtype=int)
test_closest_distance = np.ndarray(n,dtype=float)
m = len(track_points)
t1 = test_latitudes[:,None]-track_latitudes
t2 = test_longitudes[:,None]-track_longitudes
distances = np.abs(t1) + np.abs(t2)
test_points['Closest_point'] = distances.argmin(axis=1)
test_points['Closest_distance'] = distances.min(axis=1)
return test_points
%timeit test_manhattan_broadcast(track_points,test_points)
%timeit test_manhattan_broadcast(track_points2,test_points2)
1 loops, best of 3: 196 ms per loop 1 loops, best of 3: 229 ms per loop
Let's now use built-in packages for finding nearest neighbors.
import scipy.spatial
import sklearn.neighbors
We need to copy track points coordinates in a 2D array.
points = np.zeros((len(track_points),2))
points[:,0] = track_points['Latitude']
points[:,1] = track_points['Longitude']
We can now benchmark four built-in functions. These functions start by building a tree with the track points. Let's measure the time it takes as this step is not required with the code above.
%timeit scipy.spatial.cKDTree(points)
%timeit scipy.spatial.KDTree(points)
%timeit sklearn.neighbors.KDTree(points,metric='minkowski',p=1)
%timeit sklearn.neighbors.BallTree(points,metric='minkowski',p=1)
100 loops, best of 3: 10.9 ms per loop 10 loops, best of 3: 134 ms per loop 100 loops, best of 3: 11 ms per loop 100 loops, best of 3: 10.4 ms per loop
Let's create the trees for the rest of the benchmark.
t1 = scipy.spatial.cKDTree(points)
t2 = scipy.spatial.KDTree(points)
t3 = sklearn.neighbors.KDTree(points,metric='minkowski',p=1)
t4 = sklearn.neighbors.BallTree(points,metric='minkowski',p=1)
A utility function for the benchmark. It is good that all these trees have a similar api.
def test_tree(tree, test_points):
tests = np.zeros((len(test_points),2))
tests[:,0] = test_points['Latitude']
tests[:,1] = test_points['Longitude']
distances, closests = tree.query(tests)
test_points['Closest_point'] = closests
test_points['Closest_distance'] = distances
return test_points
tests = np.zeros((len(test_points),2))
tests[:,0] = test_points['Latitude']
tests[:,1] = test_points['Longitude']
%timeit test_tree(t1, test_points)
%timeit test_tree(t2, test_points)
%timeit test_tree(t3, test_points)
%timeit test_tree(t4, test_points)
1000 loops, best of 3: 800 µs per loop 10 loops, best of 3: 83.7 ms per loop 1000 loops, best of 3: 1.11 ms per loop 1000 loops, best of 3: 1.91 ms per loop
Let's test with the Raam data set
t1 = scipy.spatial.cKDTree(points2)
t2 = scipy.spatial.KDTree(points2)
t3 = sklearn.neighbors.KDTree(points2,metric='minkowski',p=1)
t4 = sklearn.neighbors.BallTree(points2,metric='minkowski',p=1)
Timing tree creation
%timeit scipy.spatial.cKDTree(points2)
%timeit scipy.spatial.KDTree(points2)
%timeit sklearn.neighbors.KDTree(points2,metric='minkowski',p=1)
%timeit sklearn.neighbors.BallTree(points2,metric='minkowski',p=1)
1 loops, best of 3: 802 ms per loop 10 loops, best of 3: 205 ms per loop 1 loops, best of 3: 967 ms per loop 1 loops, best of 3: 971 ms per loop
Timing closest distance
%timeit test_tree(t1, test_points2)
%timeit test_tree(t2, test_points2)
%timeit test_tree(t3, test_points2)
%timeit test_tree(t4, test_points2)
The slowest run took 91.35 times longer than the fastest. This could mean that an intermediate result is being cached 1000 loops, best of 3: 759 µs per loop 10 loops, best of 3: 111 ms per loop 1000 loops, best of 3: 1.02 ms per loop 1000 loops, best of 3: 1.29 ms per loop