NumPy ('Numerical Python') is the de facto standard module for doing numerical work in Python. Its main feature is its array data type, which allows very compact and efficient storage of homogeneous (all of the same type) data.
A lot of the material in this section is based on SciPy Lecture Notes (CC-by 4.0).
As you go through this material, you'll likely find it useful to refer to the NumPy documentation, particularly the array objects section.
As with pandas, there is a standard convention for importing numpy, and that is as np:
import numpy as np
Now that we have access to the numpy
package we can start using its features.
In many ways a NumPy array can be treated like a standard Python list
and much of the way you interact with it is identical. Given a list, you can create an array as follows:
python_list = [1, 2, 3, 4, 5, 6, 7, 8]
numpy_array = np.array(python_list)
print(numpy_array)
[1 2 3 4 5 6 7 8]
# ndim gives the number of dimensions
numpy_array.ndim
1
# the shape of an array is a tuple of its length in each dimension. In this case it is only 1-dimensional
numpy_array.shape
(8,)
# as in standard Python, len() gives a sensible answer
len(numpy_array)
8
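The compact storage mentioned above can be inspected directly: every element of an array occupies the same fixed number of bytes, reported by the `itemsize` and `nbytes` attributes. A short sketch (the exact integer dtype NumPy picks, e.g. `int64`, depends on your platform):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
print(arr.dtype)     # the integer dtype NumPy chose (often int64)
print(arr.itemsize)  # bytes used per element
print(arr.nbytes)    # total bytes: itemsize * number of elements
```

A Python list of the same numbers stores a full Python object per element, which is one reason arrays are so much more compact.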
nested_list = [[1, 2, 3], [4, 5, 6]]
two_dim_array = np.array(nested_list)
print(two_dim_array)
[[1 2 3]
 [4 5 6]]
two_dim_array.ndim
2
two_dim_array.shape
(2, 3)
It's very common when working with data not to have it already in a Python list, but rather to want to create some data from scratch. numpy comes with a whole suite of functions for creating arrays. We will now run through some of the most commonly used.
The first is np.arange (meaning "array range") which works in a very similar fashion to the standard Python range() function, including how it defaults to starting from zero, doesn't include the number at the top of the range, and how it allows you to specify a step:
np.arange(10) #0 .. n-1 (!)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(1, 9, 2) # start, end (exclusive), step
array([1, 3, 5, 7])
Next up is np.linspace (meaning "linear space") which generates a given number of evenly spaced floating point numbers starting from the first argument up to the second argument. The third argument defines how many numbers to create:
np.linspace(0, 1, 6) # start, end, num-points
array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])
Note how it included the end point unlike arange()
. You can change this feature by using the endpoint
argument:
np.linspace(0, 1, 5, endpoint=False)
array([0. , 0.2, 0.4, 0.6, 0.8])
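If you also want to know the spacing that linspace chose between the points, the retstep argument makes it return the step alongside the array:

```python
import numpy as np

# retstep=True makes linspace return (points, step) as a pair
points, step = np.linspace(0, 1, 6, retstep=True)
print(points)  # array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])
print(step)    # 0.2
```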
np.ones
creates an n-dimensional array filled with the value 1.0
. The argument you give to the function defines the shape of the array:
np.ones((3, 3)) # reminder: (3, 3) is a tuple
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
Likewise, you can create an array of any size filled with zeros:
np.zeros((2, 2))
array([[0., 0.],
       [0., 0.]])
The np.eye
(referring to the mathematical identity matrix, commonly labelled as I
) creates a square matrix of a given size with 1.0
on the diagonal and 0.0
elsewhere:
np.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
The np.diag
creates a square matrix with the given values on the diagonal and 0.0
elsewhere:
np.diag([1, 2, 3, 4])
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])
Finally, you can fill an array with random numbers, specifying the seed if you want reproducibility:
np.random.seed(42)
np.random.rand(4) # uniform in [0, 1]
array([0.37454012, 0.95071431, 0.73199394, 0.59865848])
np.random.randn(4) # Gaussian
array([-0.23415337, -0.23413696, 1.57921282, 0.76743473])
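As an aside, newer NumPy code often uses the Generator interface from np.random.default_rng instead of the module-level functions shown above; seeding the generator gives the same reproducibility. A sketch of the equivalent calls:

```python
import numpy as np

rng = np.random.default_rng(42)    # a seeded Generator object
print(rng.random(4))               # uniform in [0, 1)
print(rng.standard_normal(4))      # Gaussian (mean 0, std 1)
```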
As an exercise, experiment with arange, linspace, ones, zeros, eye and diag. Also look up the documentation for np.empty. What does it do? When might this be useful?

Behind the scenes, a multi-dimensional NumPy array is just stored as a linear segment of memory. The fact that it is presented as having more than one dimension is simply a layer on top of that (sometimes called a view). This means that we can simply change that interpretive layer and change the shape of an array very quickly (i.e. without NumPy having to copy any data around).
This is mostly done with the reshape()
method on the array object:
my_array = np.arange(16)
my_array
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
my_array.shape
(16,)
my_array.reshape((2, 8))
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15]])
my_array.reshape((4, 4))
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
Note that if you check, my_array.shape will still return (16,): reshape() returned a new view on the original data, it hasn't actually changed my_array itself. If you want to change the shape of the original object in-place then you can use the resize() method.
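A quick sketch of the difference: reshape() leaves the original array untouched, while resize() changes it in place (note that resize() will refuse to run if other live references to the array's data exist):

```python
import numpy as np

my_array = np.arange(16)
my_array.reshape((4, 4))   # returns a new view; my_array itself unchanged
print(my_array.shape)      # (16,)

my_array.resize((4, 4))    # modifies my_array in place, returns None
print(my_array.shape)      # (4, 4)
```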
You can also transpose an array using the transpose()
method which mirrors the array along its diagonal:
my_array.reshape((2, 8)).transpose()
array([[ 0,  8],
       [ 1,  9],
       [ 2, 10],
       [ 3, 11],
       [ 4, 12],
       [ 5, 13],
       [ 6, 14],
       [ 7, 15]])
my_array.reshape((4,4)).transpose()
array([[ 0,  4,  8, 12],
       [ 1,  5,  9, 13],
       [ 2,  6, 10, 14],
       [ 3,  7, 11, 15]])
Using the NumPy documentation at https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html, create, in one line, a NumPy array which looks like:
[10, 60, 20, 70, 30, 80, 40, 90, 50, 100]
Hint: you will need to use transpose(), reshape() and arange() as well as one new function from the "Shape manipulation" section of the documentation. Can you find a method which uses fewer than 4 function calls?
You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2.
vs 2
). This is due to a difference in the data-type used:
a = np.array([1, 2, 3])
a.dtype
dtype('int64')
b = np.array([1., 2., 3.])
b.dtype
dtype('float64')
Different data-types allow us to store data more compactly in memory, but most of the time we simply work with floating point numbers. Note that, in the example above, NumPy auto-detects the data-type from the input.
c = np.array([1, 2, 3], dtype=float)
c.dtype
dtype('float64')
The default data type for most arrays is 64 bit floating point.
d = np.ones((3, 3))
d.dtype
dtype('float64')
There are other data types as well:
e = np.array([1+2j, 3+4j, 5+6j])
e.dtype
dtype('complex128')
f = np.array([True, False, False, True])
f.dtype
dtype('bool')
g = np.array(['Bonjour', 'Hello', 'Hallo',])
g.dtype # <--- strings containing max. 7 letters
dtype('<U7')
We previously came across dtypes when learning about pandas
. This is because pandas
uses NumPy as its underlying library. A pandas.Series
is essentially a np.array
with some extra features wrapped around it.
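To illustrate this relationship (a minimal sketch, assuming pandas is installed alongside NumPy), the to_numpy() method exposes the NumPy array underlying a Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30])
arr = s.to_numpy()     # the NumPy array underlying the Series
print(type(arr))       # <class 'numpy.ndarray'>
print(arr.dtype)       # an integer dtype, e.g. int64
```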
To show some of the advantages of NumPy over a standard Python list, let's do some benchmarking. It's an important habit in programming that whenever you think one method may be faster than another, you check to see whether your assumption is true.
Python provides some tools to make this easier, particularly via the timeit
module. Using this functionality, IPython provides a %timeit
magic function to make our life easier. To use the %timeit
magic, simply put it at the beginning of a line and it will give you information about how long it took to run. It doesn't always work as you would expect, so to make your life easier, put whatever code you want to benchmark inside a function and time that function call.
We start by making a list and an array of 100,000 items each, with values counting from 0 to 99,999:
python_list = list(range(100000))
numpy_array = np.arange(100000)
We are going to go through each item in the list and double its value in-place, such that the list is changed after the operation. To do this with a Python list
we need a for
loop:
# NBVAL_IGNORE_OUTPUT
def python_double(a):
    for i, val in enumerate(a):
        a[i] = val * 2
%timeit python_double(python_list)
8.12 ms ± 467 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To do the same operation in NumPy we can use the fact that multiplying a NumPy array
by a value will apply that operation to each of its elements:
# NBVAL_IGNORE_OUTPUT
def numpy_double(a):
    a *= 2
%timeit numpy_double(numpy_array)
31.4 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As you can see, the NumPy version is dramatically faster; in the timings above it beats the pure Python loop by a factor of a few hundred.
Have a think about why this might be, what is NumPy doing to make this so much faster? There are two main parts to the answer.
A slicing operation (like reshaping before) creates a view on the original array, which is just a way of accessing array data. Thus the original array is not copied in memory. You can use np.may_share_memory()
to check if two arrays share the same memory block. Note however, that this uses heuristics and may give you false positives.
When modifying the view, the original array is modified as well:
a = np.arange(10)
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
b = a[::2]
np.may_share_memory(a, b)
True
b[0] = 12
b
array([12, 2, 4, 6, 8])
a # (!)
array([12, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a = np.arange(10)
c = a[::2].copy() # force a copy
c[0] = 12
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.may_share_memory(a, c) # we made a copy so there is no shared memory
False
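Another way to check whether an array is a view is its base attribute, which points at the array that actually owns the data (a small sketch; base is None when the array owns its own memory):

```python
import numpy as np

a = np.arange(10)
view = a[::2]           # basic slicing returns a view on a
copied = a[::2].copy()  # an independent copy

print(view.base is a)   # True: view's data belongs to a
print(copied.base)      # None: copied owns its own data
```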
Continue to the [next section](numpy indexing.ipynb).