This is an IJulia notebook (the Jupyter-based front-end for Julia) for 18.S096 at MIT in IAP 2017, designed to accompany the first lecture.
The basic goal of this lecture is to understand why some code (in some languages and/or styles of coding) is slow while other code is fast, based on whether it can be compiled to efficiently use the CPU registers and low-level arithmetic instructions, or whether it relies on "boxed" types that force "dynamic" computations.
To illustrate this, we will implement a sum function sum(a)
, which computes
for an array a
with n
elements. We will use the built-in sum
functions in Julia and Python along with hand-coded implementations in C, Python, and Julia.
We will use some tricks so that we can write and benchmark C, Python, and Julia code all in the same notebook. In the case of Python, this will rely on the PyCall package to call Python from Julia. We will use the BenchmarkTools Julia package to collect benchmarking statistics for us.
To start with, we will write a baseline implementation in the low-level C programming language. Our C function c_sum
will only work for a single data type: an array X
of double-precision floating-point values (double
in C, or Float64
in Julia).
(In contrast, our Julia code, and some of our Python code, will work for any numeric type; we'll see whether we pay a price for this.)
Julia can easily call C functions in shared libraries via its ccall
syntax. So, we'll take our C routine (in a string) and pipe it through the C compiler gcc
to produce a shared library file that we can load and call.
C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
double s = 0.0;
for (size_t i = 0; i < n; ++i) {
s += X[i];
}
return s;
}
"""
# compile to a shared library by piping C_code to gcc:
# (only works if you have gcc installed)
const Clib = tempname()
using Libdl
open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
print(f, C_code)
end
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)
c_sum (generic function with 1 method)
Of course, we should first check whether our function is correct, by comparing it to Julia's built-in sum
function on an array of $10^7$ random numbers. Different floating-point algorithms for the sum
function give slightly different results (Julia's sum
algorithm is actually much more accurate than the one here, but that's a story for another day), so we'll compute their "fractional difference" or "relative error" and make sure that this is small.
(Double-precision floating-point arithmetic keeps about 15 decimal digits, so any relative error close to $10^{-15}$ is a reasonable amount of roundoff error.)
# define a function to compute the relative (fractional) error |x-y| / mean(|x|,|y|)
relerr(x,y) = abs(x - y) * 2 / (abs(x) + abs(y))
a = rand(10^7) # array of random numbers in [0,1)
relerr(c_sum(a), sum(a))
5.96023069383953e-14
Collecting accurate benchmarking statistics can be a tricky business, so we'll use the Julia BenchmarkTools
package to do most of the work. If you don't have it installed, you may need to type Pkg.add("BenchmarkTools")
to tell Julia to download and install it.
It defines a macro @btime
that takes some Julia code and transforms it into a benchmark measuring the speed of that code. We pass the argument a
of the c_sum
function to be benchmarked by the special syntax $a
for technical reasons, basically to make sure that Julia's analysis of the variable a
happens before the benchmark starts. Macro syntax and metaprogramming will be a topic for another lecture.
using BenchmarkTools
c_bench = @btime c_sum($a)
11.286 ms (0 allocations: 0 bytes)
5.000196119673649e6
Here, the time is around 10 ms for summing $10^7$ numbers, or about 1 billion additions per second. That sounds like a lot, but on my 2.5GHz laptop it is well below the peak rate at which the computer can perform arithmetic. It doesn't reach the peak arithmetic rate because for each floating-point addition, the processor needs to perform several additional calculations to load the next element of the array from memory, not to mention the time for the memory access itself. Of course, you may get a slightly different number if you run this benchmark on a different computer.
This 10 ms number for type-specific compiled C code is a baseline against which we will compare our other implementations of summation, below.
Now, we'll call the Python functions and benchmark them. The PyCall package allows to load Python as a library and to call it directly from Julia, sharing memory with Python and passing data and functions back and forth. There is very little overhead to this, and in any case we will be summing $10^7$ numbers so the overhead of the Julia/Python interface is negligible compared to the cost of the summation itself.
using PyCall
PyCall.pyversion
v"3.7.0"
sum
of a Python list
¶To start with, I will convert our array a
into a Python list
(the built-in Python array-like data structure), and sum it with the built-in Python sum
function.
# call a low-level PyCall function to get a Python list, because
# by default PyCall will convert to a NumPy array instead (we benchmark NumPy below):
apy_list = PyCall.array2py(a, 1, 1)
# get the Python built-in "sum" function:
pysum = pybuiltin("sum")
PyObject <built-in function sum>
Let's check that we can call it, and that it computes the same answer as the Julia sum
:
relerr(pysum(apy_list), sum(a))
5.96023069383953e-14
Now, we'll benchmark it:
py_list_bench = @btime $pysum($apy_list)
43.166 ms (3 allocations: 48 bytes)
5.000196119673649e6
It takes around 50 ms, or almost 5x slower than the C routine above. This is true even though the Python sum function is written in C. The problem is that the Python code pays a price for being generic: it handles arbitrary iterable data structures of arbitrary numeric "boxes" (PyObject
pointers), and has to perform lots of computations both to fetch each item and also to perform each addition.
sum
of a NumPy array
¶You can do much better if you can take advantage of the fact that all of the elements are the same type. Then, you can store the array as the actual floating-point data stored consecutively in memory (not an array of pointers to boxes), and your inner loop can be fast because the type checks can occur outside the loop. In Python, this kind of homogeneous array is exactly what is provided by NumPy. Internally, a NumPy array is essentially just a wrapper around a C-like double*
array. NumPy
also provides numpy.sum
function that can sum a NumPy array quickly.
There is a catch, though: NumPy itself is written mostly in C, not Python. And because C code is not type-generic, in order to handle a wide variety of NumPy array types (integers, double precision, single precision, etcetera), NumPy uses rather tricky auto-generated C code. And even then it can only handle a small set of commonly used types; you can't define your own types and sum them quickly.
Anyway, we can easily convert a Julia array to a NumPy array with PyCall (in fact, PyCall does this by
by default), and benchmark numpy.sum
:
numpy_sum = pyimport("numpy")["sum"]
apy_numpy = PyObject(a) # converts to a numpy array by default
py_numpy_bench = @btime $numpy_sum($apy_numpy)
3.778 ms (3 allocations: 48 bytes)
5.000196119673346e6
WOW, it is actually roughly twice as fast as our C function!
The reason for this extra boost is that the NumPy functions exploit SIMD instructions: special CPU instructions that can perform multiple additions at once, which we didn't use in our C code above.
sum
function¶To complete the story, let's write our own mysum
function in Python that sums an arbitrary Python list (or array, or any iterable Python container).
Of course, you would never do this for summation — you would always use one of the built-in functions in practice. But someday, you will inevitably run into a problem where the performance-critical code has not already been written for you, and you will need to write your own. So it is a good exercise to see how easy it is to get performance that is comparable to the library routines.
py"""
def mysum(a):
s = 0.0
for x in a:
s = s + x
return s
"""
mysum_py = py"mysum"
PyObject <function mysum at 0x12955a950>
As usual, let's check that it works first:
relerr(mysum_py(apy_list), sum(a))
5.96023069383953e-14
Now, let's time it on our Python list:
@btime $mysum_py($apy_list)
235.801 ms (3 allocations: 48 bytes)
5.000196119673649e6
Yikes, about 200ms. That's more than 4× slower than the Python sum
and about 20× slower than our C code.
Using our mysum
function with the NumPy array is no better, and in fact is 4× worse:
@btime $mysum_py($apy_numpy)
869.044 ms (3 allocations: 48 bytes)
5.000196119673649e6
You can't take advantage of the NumPy array format in Python itself — you still have to write the performance-critical code in C (or hope someone else has written it for you).
sum
function¶Now, let's try the same thing in Julia, starting with the built-in Julia sum
function operating on our array a
:
j_bench = @btime sum($a)
4.378 ms (0 allocations: 0 bytes)
5.000196119673351e6
Hooray, about 2× the speed of the C code, comparable to numpy.sum
.
(In January 2018, they were almost exactly the same speed, but Julia is currently about 30% slower. It may have something to do with LLVM's ability to optimise AVX instructions.)
Again, you can guess that it must be using SIMD to beat the C code. And, again, it must be taking advantage of the fact that the array is homogeneous.
The type of a
is:
typeof(a)
Array{Float64,1}
This is the Julia type for a 1-dimensional array of Float64
values (64-bit "double" precision floating-point numbers, equivalent to C double
). Because the type of the elements is "attached" to the type of the array (a "parameterized" type, more on this later), Julia is able to store it as a "flat" array of consecutive Float64
values in memory.
In contrast, the Julia equivalent of a Python list
is a Vector{Any}
(a synonym for Array{Any,1}
): internally, this is an array of pointers to "boxes" that can hold any type (Any
). This makes things much slower: each +
computation on an Any
value must dynamically look up the type of object, figure out what +
function to call, and allocate a new "box" to store the result:
a_any = Vector{Any}(a)
j_bench_any = @btime sum($a_any)
536.614 ms (19999999 allocations: 305.18 MiB)
5.000196119673351e6
This is about a 100× slowdown. It is more than 5× slower than the Python sum(list)
code, and comparable to the hand-code Python sum
, in fact. The Python sum
function is better optimized than the Julia sum
function for dealing with untyped (Any
) values, in part because in Julia it is expected that you will use "concretely" typed arrays in all performance-critical cases.
Unlike NumPy, however, Julia allows you to make efficient homogeneous arrays for any data type, even data types you define yourself, and you can operate on them efficiently with code written in Julia itself.
sum
functions¶Let's try to write our own sum
function in Julia, just as we wrote our own Python function. We'll implement four different versions and see how they compare. We'll start simple:
function mysum1(A)
s = zero(eltype(A)) # the correct type of zero for A
for a in A
s += a
end
return s
end
relerr(mysum1(a), sum(a))
5.96023069383953e-14
j2_bench = @btime mysum1($a)
12.449 ms (0 allocations: 0 bytes)
5.000196119673649e6
Now that's more like it! It runs in about 10 ms, essentially the same speed as the hand-written C code.
Unlike the C code, however, it works for any type of array, and in fact just about any type of "iterable container" (as long as it provides an eltype
method), and can sum any type of value (as long as zero
and +
are defined), including user-defined types. (We'll give an example below).
The performance does not quite match the Julia built-in sum
function or the numpy.sum
function, however. Our guess above was that they were exploiting SIMD optimizations. However, we can do that too, in our own Julia code, by using the @simd
decorator to tell Julia's compiler to turn on SIMD optimizations for that loop.
(SIMD optimizations are not turned on by default, because they only speed up very particular kinds of code, and turning them on everywhere would slow down the compiler too much.)
function mysum(A)
s = zero(eltype(A))
@simd for a in A
s += a
end
return s
end
relerr(mysum(a), sum(a))
8.567831622394544e-15
j3_bench = @btime mysum($a)
4.406 ms (0 allocations: 0 bytes)
5.000196119673394e6
Hooray! Basically the same speed as Julia's built-in sum
function and numpy.sum
! And it only required 7 lines of code, some care with types, and a very minor bit of wizardry with the @simd
tag to get the last factor of two.
Moreover, the code is still type generic: it can sum any container of any type that works with addition. For example, it works for complex numbers, which are about two times slower as you might expect (since each complex addition requires two real additions):
z = rand(Complex{Float64}, length(a));
@btime mysum($z)
11.597 ms (0 allocations: 0 bytes)
5.001104893086427e6 + 5.001047177942184e6im
And we didn't have to declare any types of any arguments or variables; the compiler figured everything out. How?