We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
Optimized code usually does not satisfy the above criteria. Although it might be correct in the end, it takes a very long time to write. A practical workflow is therefore:

Develop in a high-level language (e.g. Python, Matlab) -> profile -> optimize only the bottlenecks or the "embarrassingly parallel/low-level" parts
This part of the tutorial is intended as a practical approach to the development of efficient code using Python.

There are a few benchmarks out there, but they tend to be rather old. An interesting write-up (because it includes Numba, a relatively new approach) is:
http://nbviewer.ipython.org/4271861/
We load an example image and extract a patch from it. Then we show both.
For this tutorial we are going to compute the first part of the histogram of oriented gradients descriptor. (see http://en.wikipedia.org/wiki/Histogram_of_oriented_gradients)
%pylab inline
import numpy as np
import skimage.data as skiD
import skimage.color as clr
from time import time
img = skiD.lena()
img = clr.rgb2grey(img)
###################
p = img[100:200, 400:]
figure()
imshow(img, cmap='gray')
figure()
imshow(p, cmap='gray')
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline]. For more information, type 'help(pylab)'.
<matplotlib.image.AxesImage at 0x3948e50>
For this naive histogram of oriented gradients (HOG) implementation we compute the gradient, bin the gradient orientations, and accumulate the gradient magnitudes into the corresponding histogram bins:
def calcHOGNaive(img, binSize):
    # calculate gradient
    dx, dy = np.gradient(img)

    # calculate magnitude of gradient
    mag = np.sqrt(dx ** 2 + dy ** 2).flatten()

    # calculate angles and round them into bins
    ## plain angle 0 .. 2pi
    # >>>> angle = np.arctan2(dy, dx) + np.pi
    ## normalization to 0 .. binSize-0.000000000001
    # >>>> normAngle = angle / (2 * np.pi + np.spacing(10000)) * binSize
    ## binning and flattening
    # >>>> np.floor(normAngle).flatten()
    ang = (np.floor((np.arctan2(dy, dx) + np.pi) \
                    / (2 * np.pi + np.spacing(10000)) * binSize)).flatten()

    # initialize histogram
    hog = np.zeros((binSize, ))

    # fill histogram
    for i in range(ang.shape[0]):
        hog[ang[i]] += mag[i]

    # normalize
    return hog / ang.shape[0]
hog = calcHOGNaive(p, 9)
%timeit calcHOGNaive(p, 9)
10 loops, best of 3: 28.3 ms per loop
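A quick sanity check on the binning arithmetic may be useful: `arctan2` returns values in `[-pi, pi]`, so after shifting by `pi` and dividing by `2*pi` plus a tiny epsilon the scaled value lies strictly below `binSize`, and `floor` yields valid bin indices `0 .. binSize-1`. A standalone sketch with random gradients (not part of the original benchmark):

```python
import numpy as np

binSize = 9
rng = np.random.RandomState(0)
dx = rng.randn(100, 112)  # synthetic gradients, same shape as the patch
dy = rng.randn(100, 112)

# same expression as in calcHOGNaive: shift arctan2 output from
# [-pi, pi] to [0, 2*pi], scale to [0, binSize), floor to bin index
ang = np.floor((np.arctan2(dy, dx) + np.pi)
               / (2 * np.pi + np.spacing(10000)) * binSize)

# every pixel falls into a valid bin
print(ang.min() >= 0 and ang.max() <= binSize - 1)  # True
```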
http://pynash.org/2013/03/06/timing-and-profiling.html
%prun -s cumulative calcHOGNaive(p, 9)
https://github.com/certik/line_profiler
Since the function above does not make many function calls and spends most of its time within itself, we need to analyze how long each line takes to compute.

This profiler does not give sensible output when used on notebook cells directly. Therefore I just put the function in a module and import it back.
This is an example of using the line_profiler magic.
import pyTools.sandbox.lineProfExample as lPE
%load_ext line_profiler_ext
%lprun -f lPE.calcHOGNaive lPE.calcHOGNaive(p, 9)
Alternatively, one can also use the API to profile a function
import line_profiler
lP = line_profiler.LineProfiler(lPE.calcHOGNaive)
res = lP.runcall(lPE.calcHOGNaive, p, 9)
lP.print_stats()
Timer unit: 1e-06 s

File: /home/peter/code/pyTools/sandbox/lineProfExample.py
Function: calcHOGNaive at line 3
Total time: 0.153876 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def calcHOGNaive(img, binSize):
     4                                               # calculate gradient
     5         1         1771   1771.0      1.2      dx, dy = np.gradient(img)
     6
     7                                               # calculate magnitude of gradient
     8         1          535    535.0      0.3      mag = np.sqrt(dx ** 2 + dy ** 2).flatten()
     9
    10                                               # calculate angles and round them into bins
    11                                               ## plain angle 0 .. 2pi
    12                                               # >>>> angle = np.arctan2(dy, dx) + np.pi)
    13                                               ## normalization to 0 .. binSize-0.000000000001
    14                                               # >>>> normAngle = angle / (2 * np.pi + np.spacing(10000)) * binSize)
    15                                               ## binning and flatting
    16                                               # >>>> np.floor(normAngle).flatten()
    17
    18         1            6      6.0      0.0      ang = (np.floor((np.arctan2(dy, dx) + np.pi) \
    19         1         3129   3129.0      2.0          / (2 * np.pi + np.spacing(10000)) * binSize)).flatten()
    20
    21                                               # initialize histogram
    22         1           15     15.0      0.0      hog = np.zeros((binSize, ))
    23
    24                                               # fill histogram
    25     11201        36855      3.3     24.0      for i in range(ang.shape[0]):
    26     11200       111471     10.0     72.4          hog[ang[i]] += mag[i]
    27
    28                                               # normalize
    29         1           94     94.0      0.1      return hog / ang.shape[0]
Loops are the obvious performance bottlenecks. In this case it is mostly per-element type checking that slows the loop down; if functions were called inside the loop, there would likely be some range-checking overhead, too.
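Before dropping to a compiled language, it is worth noting that this particular loop can also be vectorized in numpy itself: `np.bincount` with the `weights` argument accumulates the magnitudes per bin in a single call. A sketch (not part of the original benchmark):

```python
import numpy as np

def calcHOGBincount(img, binSize):
    # same pipeline as calcHOGNaive, but the fill loop is replaced
    # by a single np.bincount call
    dx, dy = np.gradient(img)
    mag = np.sqrt(dx ** 2 + dy ** 2).flatten()
    ang = np.floor((np.arctan2(dy, dx) + np.pi)
                   / (2 * np.pi + np.spacing(10000))
                   * binSize).astype(int).flatten()

    # accumulate magnitudes into bins; minlength pads empty trailing bins
    hog = np.bincount(ang, weights=mag, minlength=binSize)

    # normalize
    return hog / ang.shape[0]
```

The result matches the explicit loop exactly, since only the accumulation strategy changes.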
Let's just embed the loop in C++ and compute the rest in numpy. Converting the rest into C++ could quickly get messy, because each line of numpy code would expand into two nested loops. Sometimes, however, it might pay off to move even vectorized numpy expressions into C. That is usually the case when several operations are performed sequentially and temporary results are reused; the fact that the data stays in the L1 or L2 cache can then speed up the program enormously.
#import scipy.weave as weave

def calcHOGWeaveCpp(img, binSize):
    import scipy.weave as weave
    import numpy as np

    dx, dy = np.gradient(img)
    mag = np.sqrt(dx ** 2 + dy ** 2).flatten()
    ang = (np.floor((np.arctan2(dy, dx) + np.pi) \
                    / (2 * np.pi + np.spacing(10000)) * binSize)).flatten()

    hog = np.zeros((binSize, ))

    # for i in range(ang.shape[0]):
    #     hog[ang[i]] += mag[i]
    code = """
        int pos;
        for (int i = 0; i < Nang[0]; ++i){
            pos = ang[i];
            hog[pos] += mag[i];
        }
        """
    weave.inline(code, ['ang', 'hog', 'mag'],
                 extra_compile_args=['-O3'], compiler='gcc')

    return hog / ang.shape[0]
hogCpp = calcHOGWeaveCpp(p, 9)
%timeit calcHOGWeaveCpp(p, 9)
print "numpy and cpp version are the same: ", np.allclose(hog, hogCpp)
100 loops, best of 3: 4.83 ms per loop
numpy and cpp version are the same:  True
The advantage of Fortran is not so much speed (optimized C code can be slightly faster) but the simplicity of its syntax, which is very close to Matlab/Numpy expressions. The variable declarations, however, easily make it look cumbersome; in plain Fortran much of the structural syntax is quite verbose and debugging is limited. In combination with Python the logic is the biggest part, so Fortran can be a fast way of converting already vectorized numpy expressions into a compiled language, gaining impressive speed-ups.

For simplicity this code implements just the easy part, namely the part that relies on Fortran's standard library and can almost be copy-pasted.
.. code-block:: fortran

    !file hog.f90
    module hog
    contains

    subroutine angHist(dx, dy, binSize, hist)
        ! compute simple hog from gradients
        implicit none

        ! I/O
        real(kind=8), dimension(:,:), intent(in) :: dx, dy
        integer, intent(in) :: binSize
        real(kind=8), dimension(binSize), intent(out) :: hist

        ! dummy variables
        real(kind=8), dimension(size(dx,1), size(dx,2)) :: mag
        integer, dimension(size(dx,1), size(dx,2)) :: ang
        real(kind=8) :: pi, eps
        integer :: i, m, p

        ! constants
        pi = 3.14159265358979323846
        eps = 0.0000000000000000001

        ! magnitude and angles
        mag = sqrt(dx ** 2 + dy ** 2)
        ang = floor((atan2(dy, dx) + pi) / (2 * pi + eps) * binSize)

        ! initialize hist as zero
        hist = 0.0_8

        do i=1,size(ang,1)
            do m=1,size(ang,2)
                p = ang(i,m) + 1 ! fortran counts from 1 to end (incl)
                hist(p) = hist(p) + mag(i,m)
            enddo
        enddo

        ! normalize hist
        hist = hist / (size(ang,1) * size(ang,2))
    end subroutine angHist

    end module hog
$f2py -c -m hog hog.f90
import pyTools.imgProc.hog as fHog
dx, dy = np.gradient(p)
hogF = fHog.hog.anghist(dx, dy, 9)
%timeit dx, dy = np.gradient(p); fHog.hog.anghist(dx, dy, 9)
print "Naive and Fortran (1) equal: ", np.allclose(hog, hogF)
100 loops, best of 3: 1.42 ms per loop
Naive and Fortran (1) equal:  True
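One practical detail when passing numpy arrays to f2py-wrapped routines: Fortran expects column-major (Fortran-ordered) arrays, while numpy defaults to row-major (C) order, and f2py silently copies mismatched input on every call. Whether a copy would be needed can be checked from Python:

```python
import numpy as np

a = np.zeros((100, 112))           # numpy default: C (row-major) order
print(a.flags['F_CONTIGUOUS'])     # False -> f2py would copy this array

af = np.asfortranarray(a)          # explicit one-time conversion
print(af.flags['F_CONTIGUOUS'])    # True -> can be passed without a copy
```

For small patches the copy is negligible, but in a tight loop the one-time conversion can be worth it.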
We can also go a step further and replicate numpy's gradient function in Fortran. This is not much work, as the syntax is almost the same.

You can also find the code at https://gist.github.com/groakat/5775099
https://github.com/numpy/numpy/blob/v1.7.0/numpy/lib/function_base.py#L829
for axis in range(N):
    # select out appropriate parts for this dimension
    out = np.empty_like(f, dtype=otype)
    slice1[axis] = slice(1, -1)
    slice2[axis] = slice(2, None)
    slice3[axis] = slice(None, -2)
    # 1D equivalent -- out[1:-1] = (f[2:] - f[:-2])/2.0
    out[slice1] = (f[slice2] - f[slice3])/2.0

    slice1[axis] = 0
    slice2[axis] = 1
    slice3[axis] = 0
    # 1D equivalent -- out[0] = (f[1] - f[0])
    out[slice1] = (f[slice2] - f[slice3])

    slice1[axis] = -1
    slice2[axis] = -1
    slice3[axis] = -2
    # 1D equivalent -- out[-1] = (f[-1] - f[-2])
    out[slice1] = (f[slice2] - f[slice3])

    # divide by step size
    outvals.append(out / dx[axis])

    # reset the slice object in this dimension to ":"
    slice1[axis] = slice(None)
    slice2[axis] = slice(None)
    slice3[axis] = slice(None)
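The scheme above (central differences in the interior, one-sided differences at the borders) is easy to restate explicitly for one axis of a 2-D array. A minimal numpy sketch that reproduces `np.gradient` along the first axis, and which maps line by line onto the Fortran version below:

```python
import numpy as np

def gradient_axis0(f):
    # replicate np.gradient along axis 0:
    # one-sided differences at the borders, central differences inside
    out = np.empty_like(f, dtype=float)
    out[0, :] = f[1, :] - f[0, :]                 # forward difference at top
    out[1:-1, :] = (f[2:, :] - f[:-2, :]) / 2.0   # central differences
    out[-1, :] = f[-1, :] - f[-2, :]              # backward difference at bottom
    return out

f = np.random.RandomState(0).rand(10, 12)
print(np.allclose(gradient_axis0(f), np.gradient(f)[0]))  # True
```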
.. code-block:: fortran

    subroutine angHist2(img, binSize, hist)
        ! compute simple hog from image
        implicit none

        ! I/O variables
        real(kind=8), dimension(:,:), intent(in) :: img
        integer, intent(in) :: binSize
        real(kind=8), dimension(binSize), intent(out) :: hist

        ! dummy variables
        real(kind=8), dimension(size(img,1),size(img,2)) :: dx, dy, mag
        integer, dimension(size(img,1),size(img,2)) :: ang
        real(kind=8) :: pi, eps
        integer :: i, m, p, l

        ! constants
        pi = 3.14159265358979323846
        eps = 0.0000000000000000001

        ! initialize hist
        hist = 0.0_8

        ! compute dx as numpy.gradient does
        dx(1,:) = img(2,:) - img(1,:)
        do i=3,size(img,1)
            dx(i-1,:) = (img(i,:) - img(i-2,:)) / 2.0
        enddo
        l = size(img,1)
        dx(l,:) = img(l,:) - img(l-1,:)

        ! compute dy as numpy.gradient does
        dy(:,1) = img(:,2) - img(:,1)
        do i=3,size(img,2)
            dy(:, i-1) = (img(:,i) - img(:,i-2)) / 2.0
        enddo
        l = size(img,2)
        dy(:,l) = img(:,l) - img(:,l-1)

        ! compute magnitude and angle of gradients
        mag = sqrt(dx ** 2 + dy ** 2)
        ang = floor((atan2(dy, dx) + pi) / (2 * pi + eps) * binSize)

        ! bin them
        do i=1,size(ang,1)
            do m=1,size(ang,2)
                p = ang(i,m) + 1 ! fortran counts from 1 to end (incl)
                hist(p) = hist(p) + mag(i,m)
            enddo
        enddo

        ! normalize hist
        hist = hist / (size(ang,1) * size(ang,2))
    end subroutine angHist2
$f2py -c -m hog hog.f90
import pyTools.imgProc.hog as fHog
hogF2 = fHog.hog.anghist2(p, 9)
%timeit fHog.hog.anghist2(p, 9)
print "Naive and Fortran (2) equal: ", np.allclose(hog, hogF2)
1000 loops, best of 3: 1.19 ms per loop
Naive and Fortran (2) equal:  True
Numba uses LLVM to compile the code at runtime. Here the overhead of the compilation is relatively large compared to the length of the executed code. This compiler seems to be intended for completely "unrolled", non-vectorized functions, i.e. for code that looks more like a C program.
from numba.decorators import jit
from numba import float64, int32

@jit(argtypes=[float64[:,:], int32])
def calcHOGNumba(img, binSize):
    # calculate gradient
    d = np.gradient(img)
    dx = d[0]
    dy = d[1]

    # calculate magnitude of gradient
    mag = np.sqrt(dx ** 2 + dy ** 2).flatten()

    # calculate angles and round them into bins
    ang = (np.floor((np.arctan2(dy, dx) + np.pi) \
                    / (2 * np.pi + np.spacing(10000)) * binSize)).flatten()

    # initialize histogram
    hog = np.zeros((binSize, ))

    # fill histogram
    for i in range(ang.shape[0]):
        hog[ang[i]] += mag[i]

    # normalize
    return hog / ang.shape[0]
hogN = calcHOGNumba(p, 9)
%timeit calcHOGNumba(p, 9)
print "Naive and Numba equal: ", np.allclose(hog, hogN)
10 loops, best of 3: 70.2 ms per loop
Naive and Numba equal:  True
The LLVM JIT struggles to compile the numpy functions quickly, and since the compiled output is not cached, this is real overhead on every call. The following code shows that Numba can be competitive with weave/C++ if only the loop is compiled.
from numba.decorators import jit
from numba import float64, int32

@jit(argtypes=[float64[:], int32[:], float64[:]])
def fillHist(hog, ang, mag):
    # fill histogram
    for i in range(ang.shape[0]):
        hog[ang[i]] += mag[i]

    return hog

def calcHOGNumba2(img, binSize):
    # calculate gradient
    d = np.gradient(img)
    dx = d[0]
    dy = d[1]

    # calculate magnitude of gradient
    mag = np.sqrt(dx ** 2 + dy ** 2).flatten()

    # calculate angles and round them into bins
    ang = np.int32(np.floor((np.arctan2(dy, dx) + np.pi) \
                            / (2 * np.pi + np.spacing(10000)) * binSize)).flatten()

    # initialize histogram
    hog = np.zeros((binSize, ))
    hog = fillHist(hog, ang, mag)

    # normalize
    return hog / ang.shape[0]
hogN2 = calcHOGNumba2(p, 9)
%timeit calcHOGNumba2(p, 9)
print "Naive and Numba equal: ", np.allclose(hog, hogN2)
100 loops, best of 3: 4.88 ms per loop
Naive and Numba equal:  True
p = img[100:200, 400:]
dx, dy = np.gradient(p)
hog = calcHOGNaive(p, 9)
hogCpp = calcHOGWeaveCpp(p, 9)
hogF = fHog.hog.anghist(dx, dy, 9)
hogF2 = fHog.hog.anghist2(p, 9)
hogN = calcHOGNumba(p, 9)
hogN2 = calcHOGNumba2(p, 9)
print "all results are equal:", all([np.allclose(hog, hogCpp),
                                     np.allclose(hog, hogF),
                                     np.allclose(hog, hogF2),
                                     np.allclose(hog, hogN),
                                     np.allclose(hog, hogN2)])
all results are equal: True
# patch 100x112
p = img[100:200, 400:]
dx, dy = np.gradient(p)
print "naive"
%timeit -n 100 calcHOGNaive(p, 9)
print "\nweave cpp"
%timeit -n 100 calcHOGWeaveCpp(p, 9)
print "\nfortran (1)"
%timeit -n 100 dx, dy = np.gradient(p); fHog.hog.anghist(dx, dy, 9)
print "\nfortran (2)"
%timeit -n 100 fHog.hog.anghist2(p, 9)
print "\nnumba (1)"
%timeit -n 100 calcHOGNumba(p, 9)
print "\nnumba (2)"
%timeit -n 100 calcHOGNumba2(p, 9)
naive
100 loops, best of 3: 40.4 ms per loop

weave cpp
100 loops, best of 3: 4.79 ms per loop

fortran (1)
100 loops, best of 3: 4.1 ms per loop

fortran (2)
100 loops, best of 3: 2.88 ms per loop

numba (1)
100 loops, best of 3: 62.1 ms per loop

numba (2)
100 loops, best of 3: 4.79 ms per loop
# over entire image
print "naive"
%timeit -n 100 calcHOGNaive(img, 9)
print "\nweave cpp"
%timeit -n 100 calcHOGWeaveCpp(img, 9)
print "\nfortran (1)"
%timeit -n 100 dx, dy = np.gradient(img); fHog.hog.anghist(dx, dy, 9)
print "\nfortran (2)"
%timeit -n 100 fHog.hog.anghist2(img, 9)
print "\nnumba (1)"
%timeit -n 100 calcHOGNumba(img, 9)
print "\nnumba (2)"
%timeit -n 100 calcHOGNumba2(img, 9)
naive
100 loops, best of 3: 406 ms per loop

weave cpp
100 loops, best of 3: 31.4 ms per loop

fortran (1)
100 loops, best of 3: 39.7 ms per loop

fortran (2)
100 loops, best of 3: 36.3 ms per loop

numba (1)
100 loops, best of 3: 470 ms per loop

numba (2)
100 loops, best of 3: 31.4 ms per loop
from IPython.parallel import Client
# open clients
rc = Client()
print rc.ids
[0, 1, 2, 3]
Get a direct view of the remote clients (used later for clean-up) and a load-balanced view (used to asynchronously add jobs to the cluster):
dview = rc[:]
lbview = rc.load_balanced_view()
Now we are ready to put stuff on the cluster:

lbview.apply_async(func, *args)
But we need to have some data to feed it. In our example we split the image into parts and process them individually.
splits = np.split(img, 16)

results = []
for split in splits:
    results += [lbview.apply_async(calcHOGWeaveCpp, split, 9)]
We could add further jobs for other functions to the cluster. Now, however, we have to wait until all results have come in before we continue processing.
from time import sleep

allReady = False
while not allReady:
    allReady = True
    for ar in results:
        allReady = allReady and ar.ready()

    sleep(0.01)
Now we copy the data back and clean up on the cluster (data on the cluster is not deleted or garbage collected automatically and would otherwise cause a memory leak).
hogs = []
for ar in results:
    # copy data
    data = ar.get()
    hogs += [data]

    # delete data from cluster
    msgId = ar.msg_id
    #~ del lbview.results[msgId]
    del rc.results[msgId]
    del rc.metadata[msgId]

# close client to close socket connections
rc.close()
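The collected per-chunk histograms still have to be combined. Since each call normalizes by its own pixel count, the mean of the per-chunk histograms equals the histogram over all pixels when the chunks are equally sized (this is exact for precomputed magnitudes/angles; with per-chunk gradients, as above, the chunk borders introduce tiny differences). A standalone sketch of this combination rule on synthetic data:

```python
import numpy as np

binSize = 9
rng = np.random.RandomState(0)
mag = rng.rand(512 * 512)                     # per-pixel gradient magnitudes
ang = rng.randint(0, binSize, mag.shape[0])   # per-pixel bin indices

def hist(mag, ang):
    # normalized histogram, as each worker computes it for its chunk
    return np.bincount(ang, weights=mag, minlength=binSize) / ang.shape[0]

full = hist(mag, ang)
# 16 equally sized chunks, processed "independently"
chunks = [hist(m, a) for m, a in zip(np.split(mag, 16), np.split(ang, 16))]
print(np.allclose(full, np.mean(chunks, axis=0)))  # True
```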
Normally, one should do this by subclassing QObject and using moveToThread(), as described in http://labs.qt.nokia.com/2007/07/05/qthreads-no-longer-abstract/ . That did not work for me, though, because (if I remember correctly) it requires a running Qt event loop. Anyway, the old-style subclassing of QThread still works.
The BaseThread idea comes from http://pythonchem.blogspot.co.uk/2012/06/threaded-pyqt-gui-and-particles.html .
This QThread example replicates the IPython Parallel interface, but it creates a thread for each object, which might not be desired. This is an untested toy example; I cannot guarantee that it will not break under some circumstances.
from PyQt4.QtCore import *

class BaseThread(QThread):
    """ Provides infrastructure for safe deletion of the thread.

    Check self.exiting frequently in heavy operations and
    abort execution of run() to allow the thread to get
    deleted.
    """
    def __init__(self):
        QThread.__init__(self)
        self.exiting = False

    def __del__(self):
        self.exiting = True
        self.wait()


class Worker(BaseThread):
    def __init__(self, func, *args):
        """
        This object will execute `func` with `args` in a
        separate thread. You can query ready() to check
        whether processing has finished and get() to fetch
        the result.

        Args:
            func (function pointer)
                function that will be called asynchronously
            args (arguments)
                arguments for func
        """
        BaseThread.__init__(self)
        self.func = func
        self.args = args
        self.result = None
        self.isReady = False
        self.start()

    def run(self):
        self.result = self.func(*self.args)
        self.isReady = True

    def ready(self):
        return self.isReady

    def get(self):
        if self.isReady:
            return self.result
        else:
            return None
w = Worker(calcHOGWeaveCpp, p, 9)
w.ready()
False
working = True
while working:
    isReady = w.ready()

    if isReady:
        res = w.get()
        working = False
        # clean up, otherwise the object will
        # stay in memory for ever (similar to
        # clients in IPython parallel)
        del w

print res
[ 0.00134604 0.00474881 0.00305021 0.00114426 0.00127909 0.00407113 0.00395126 0.00130679 0.00104403]
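The same ready()/get() pattern can also be had without Qt, using only the standard library's threading module. A sketch (named `ThreadWorker` here to avoid clashing with the Qt class; note that because of the GIL this only gives a speed-up when the work releases the GIL, e.g. inside numpy or compiled code):

```python
import threading

class ThreadWorker(object):
    """Run func(*args) in a background thread; poll ready(), fetch get()."""
    def __init__(self, func, *args):
        self.func = func
        self.args = args
        self.result = None
        self.isReady = False
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        self.result = self.func(*self.args)
        self.isReady = True

    def ready(self):
        return self.isReady

    def get(self):
        # mirror the Qt Worker: None until the result is available
        return self.result if self.isReady else None

# usage: the same polling loop as with the QThread version works;
# here we simply block until done
w = ThreadWorker(sum, range(10))
w.thread.join()
print(w.get())  # 45
```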
def test(a, *args):
    print a
    print args

    test2(*args)

def test2(a,b):
    print a
    print b

test(4,5,6)
4
(5, 6)
5
6