Intro to Numpy¶

Attribution note: This notebook was created by David Dotson and Oliver Beckstein and is made available under a CC-BY 4.0 license. Some of the material was inspired by lessons developed by Software Carpentry (Programming with Python), as well as material previously created by Oliver Beckstein (Python and Numpy for SimBioNano PHY 598).

When it comes to doing numerical work, Python by itself is rather slow. By slow we mean compared to languages like C and Fortran, which benefit from being compiled languages in which a program is preprocessed into machine code by a compiler. Python by contrast is an interpreted language, in which each line in a program is fed to the Python interpreter in sequence, then executed. The flexiblity and ease of use that come with Python come at the cost of pure performance.

However, though Python code itself may be slow, Python can be used to run code that is written in a compiled language and already compiled. We will use a library (a.k.a., a Python module) that does exactly this underneath the hood to get fast performance for numerical operations on arrays.

In [1]:

import numpy

Importing a module is like taking a piece of equipment out of a storage locker and setting it up on a lab bench. Importing the name numpy makes all the functions and classes (object types) available to us. The core data structure that numpy provides is known as the numpy array:

In [2]:

somenums = numpy.array([1, 2, 3, 4])
print(somenums)

[1 2 3 4]

A numpy array looks superfically similar to a list, which is a builtin to Python. They are fundamentally different, however, in how they both work and how they exist in memory. numpy arrays don't store references to other objects, but instead point to contiguous blocks of memory in which each element is of exactly the same data type. For instance, we just made an array of 64 bit integers:

In [3]:

somenums.dtype

Out[3]:

dtype('int64')

In [4]:

# this will give an array with a string dtype
numpy.array(['Han Solo', "Kylo Ren", 7])

Out[4]:

array(['Han Solo', 'Kylo Ren', '7'], dtype='<U8')

In [5]:

# this will give an array of 64 bit floats
floats = numpy.array([63.3, -5.0, 1])
print(floats)

[63.3 -5.   1. ]

In [6]:

floats.dtype

Out[6]:

dtype('float64')

Also, because arrays are not a collection of objects but are a single object of identically sized pieces of data, they cannot be resized. To add elements to an array, one must create a new array.

In [7]:

# this will create a new array with repeated elements
numpy.hstack([somenums, somenums])

Out[7]:

array([1, 2, 3, 4, 1, 2, 3, 4])

Array methods (or, arrays are objects)¶

numpy arrays are built for numerical operations, and doing them quickly. Since like everything in Python these are objects, they include built-in methods such as:

In [8]:

somenums.mean()

Out[8]:

2.5

In [9]:

somenums.std()

Out[9]:

1.118033988749895

and a whole plethora of others. You can get a view of what methods and attributes are part of an array's namespace with:

In [10]:

dir(somenums)

Out[10]:

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmatmul__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__xor__',
 'all',
 'any',
 'argmax',
 'argmin',
 'argpartition',
 'argsort',
 'astype',
 'base',
 'byteswap',
 'choose',
 'clip',
 'compress',
 'conj',
 'conjugate',
 'copy',
 'ctypes',
 'cumprod',
 'cumsum',
 'data',
 'diagonal',
 'dot',
 'dtype',
 'dump',
 'dumps',
 'fill',
 'flags',
 'flat',
 'flatten',
 'getfield',
 'imag',
 'item',
 'itemset',
 'itemsize',
 'max',
 'mean',
 'min',
 'nbytes',
 'ndim',
 'newbyteorder',
 'nonzero',
 'partition',
 'prod',
 'ptp',
 'put',
 'ravel',
 'real',
 'repeat',
 'reshape',
 'resize',
 'round',
 'searchsorted',
 'setfield',
 'setflags',
 'shape',
 'size',
 'sort',
 'squeeze',
 'std',
 'strides',
 'sum',
 'swapaxes',
 'take',
 'tobytes',
 'tofile',
 'tolist',
 'tostring',
 'trace',
 'transpose',
 'var',
 'view']

or in the notebook, by typing the name of the array followed by a . and the tab key:

somenums.

Recall that you can also get the documentation for any function or method with a question mark at the end of the name:

somenums.mean?

Multidimensionality, indexing, and slicing¶

numpy arrays can be of any dimensionality, not just 1-D. It's common to encounter 2-D arrays.

In [11]:

import numpy as np

ndarrays¶

Make a 2D array:

In [12]:

a = np.array([ [1, 2], [3, 4]])
a

Out[12]:

array([[1, 2],
       [3, 4]])

In [13]:

a.shape

Out[13]:

(2, 2)

Arrays are made by rows.

In [14]:

a[0]

Out[14]:

array([1, 2])

In [15]:

a[1]

Out[15]:

array([3, 4])

Indexing notation:

separate axes (dimensions) by commas
gives sub-arrays or elements

In [16]:

a[0, 0]

Out[16]:

In [17]:

a[0, 1]

Out[17]:

In [18]:

a[1, 1]

Out[18]:

Slicing: like lists, with numpy indexing:

For instance, second column:

In [19]:

a[:, 1]

Out[19]:

array([2, 4])

In [20]:

a[0, :]

Out[20]:

array([1, 2])

First row: same as

In [21]:

a[0]

Out[21]:

array([1, 2])

Let's make a 3D array:

In [22]:

A = np.arange(24)  # 1D
B = A.reshape((2, 4, 3))  # 3D

In [23]:

Out[23]:

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])

The 3D array contains 2 2D arrays of shape 4x3:

In [24]:

Out[24]:

array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17],
        [18, 19, 20],
        [21, 22, 23]]])

Get the second sub array:

In [25]:

B[1]

Out[25]:

array([[12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23]])

Take every second element along axis 1:

In [26]:

B[:, ::2]

Out[26]:

array([[[ 0,  1,  2],
        [ 6,  7,  8]],

       [[12, 13, 14],
        [18, 19, 20]]])

In [27]:

B[:, ::2].shape

Out[27]:

(2, 2, 3)

Example: a particle trajectory¶

For illustration we'll look at the position of a particle in three dimensions with time (we will later also plot the trajectory).

The create_position() function is just a helper function that we will use throughout the notebook in order to quickly generate interesting data.

You can get it from the helpers.py module with

from helpers import create_position

or just execute the cell below. (In general, it is good practice to put reusable code into a module and keep your notebook simple.)

In [28]:

def create_position(nframes=10**6):
    """Get array of x, y, and z position of a particle with time.
    
    Parameters
    ----------
    nframes: int
        number of frames; more frames increases the resolution
        of the trajectory, but not its length
        
    Returns
    -------
    position : `nframes` x 3 array
        (x, y, z) position of the particle with time
        
    """
    # generate x, y, z positions
    x = np.cos(np.linspace(0, 20, nframes))
    y = 3 * np.sin(np.linspace(0, 10, nframes))
    z = -2 * np.sin(np.pi * np.linspace(0, 5, nframes))

    # put them all in a single array; this gives
    # an array with 3 rows and nframes columns
    position = np.array([x, y, z])

    # transposing puts the array into the [[x, y, z], [x, y, z], ...] shape
    return position.transpose()

Let's generate an array position to work with:

In [29]:

position = create_position()

In [30]:

position.shape

Out[30]:

(1000000, 3)

Now say we wanted to examine the position of the particle in the very first frame (row), we could do:

In [31]:

position[0]

Out[31]:

array([ 1.,  0., -0.])

to extract it. Notice that indexing starts at 0, as is the convention in Python.

What about the third frame?

In [32]:

position[2]

Out[32]:

array([ 9.99999999e-01,  6.00000600e-05, -6.28319159e-05])

In zero-based terms, this we would call the "first frame" the zeroth frame, and so on. To avoid confusion we'll assume this from now on.

What if we wanted a bunch of frames, but only the 5th through the 72nd? It should have 68 rows:

In [33]:

position[5:73].shape

Out[33]:

(68, 3)

Notice the slicing notation. Remember, this should be read as

"Get each row in the array starting from the row at index 5 up to and not including the row at index 73."

We could even coarse-grain by slicing out every fifth row in this range:

In [34]:

position[5:73:5].shape

Out[34]:

(14, 3)

Now what if we wanted a specific element of the array? Indexing works for this too:

In [35]:

position[42, 1]

Out[35]:

0.0012600012229571492

This is the y-position of the 42nd frame.

NOTE: The numpy indexing notation differs from Python indexing of nested lists, which would look like positions[42][1].

Incidentally, this also works for numpy arrays but is less readable and slower, as we can demonstrate with the %timeit magic function of IPython/Jupyter:

In [36]:

%timeit position[42, 1]

113 ns ± 0.771 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [37]:

%timeit position[42][1]

233 ns ± 3.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The first index/slice corresponds to the first axis of the array, which for a 2-D array corresponds to the rows. The second index/slice would then be the columns. If we had a 3-D array, indexing the first axis would yield 2-D arrays. If we had a 4-D array, indexing the first axis would yield 3-D arrays, and so on.

Example: Numpy slicing¶

Obtain an array giving the mean of the x, y positions (separately) from the frame at index 10 to the frame at index 43 as a 1-D array.

We can do this by slicing both the first axis (rows) and the second axis (columns), then using the mean method of the resulting array. To only take a mean across the rows (a mean for each column), we must specify the axis=0 keyword.

In [38]:

position[10:42, :2].mean(axis=0)

Out[38]:

array([9.99999853e-01, 7.65000753e-04])

What if we wanted the smaller of the two numbers only?

In [39]:

position[10:43, :2].mean(axis=0).min()

Out[39]:

0.000780000767676743

Since slicing and methods of arrays often yield arrays, you can chain operations in this way. This is what qualifies as a pythonic way to work with these objects.

Fancy and boolean indexing¶

It's also possible to index arrays with lists of indices to select out; these can be repeated and in any order.

In [40]:

Out[40]:

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])

Fancy indexing with a list of indices:

In [41]:

A[[2, -1, 2, 3]]

Out[41]:

array([ 2, 23,  2,  3])

Fancy indexing with booleans:

In [42]:

big_values = A > 10
big_values

Out[42]:

array([False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True])

In [43]:

A[big_values]

Out[43]:

array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])

Boolean indexing is a convenient way to select parts of an array. Test it out interactively.

For our trajectory: We can select arbitrary time steps:

In [44]:

position[[2, 4, 7, -1, 2]]

Out[44]:

array([[ 9.99999999e-01,  6.00000600e-05, -6.28319159e-05],
       [ 9.99999997e-01,  1.20000120e-04, -1.25663832e-04],
       [ 9.99999990e-01,  2.10000210e-04, -2.19911705e-04],
       [ 4.08082062e-01, -1.63206333e+00, -1.22464680e-15],
       [ 9.99999999e-01,  6.00000600e-05, -6.28319159e-05]])

We can also use arrays of booleans to get back arrays with items for which True was the value in the boolean array used:

In [45]:

(position[:, :2] > 2).any(axis=1)

Out[45]:

array([False, False, False, ..., False, False, False])

We can use this array to get only the rows for which either the x or y position is greater than 2:

In [46]:

position[(position[:,:2] > 2).any(axis=1)].shape

Out[46]:

(336428, 3)

Boolean arrays are useful for filtering data for rows of interest.

Technical note: fancy and boolean indexing like this generally gives back a new array instead of a view to the existing one. Slicing, by contrast, always gives a view. This matters when using indexing or slicing to alter the values in an array.

Array arithmetic¶

In [47]:

a = np.array([1, 2, 3, 4])
b = np.array([-1, 8, 7, -3])

Array operations are element-wise:

In [48]:

a + b

Out[48]:

array([ 0, 10, 10,  1])

In [49]:

a - b

Out[49]:

array([ 2, -6, -4,  7])

In [50]:

a * b

Out[50]:

array([ -1,  16,  21, -12])

In [51]:

a / b

Out[51]:

array([-1.        ,  0.25      ,  0.42857143, -1.33333333])

In [52]:

np.asarray(a, dtype=np.float) ** b

Out[52]:

array([1.0000e+00, 2.5600e+02, 2.1870e+03, 1.5625e-02])

For n-D arrays the same applies:

In [53]:

U = np.array([[1, 2], [3, 4]])
V = np.array([[10, 20], [30, 40]])

In [54]:

U + V

Out[54]:

array([[11, 22],
       [33, 44]])

In [55]:

U - V

Out[55]:

array([[ -9, -18],
       [-27, -36]])

In [56]:

U * V

Out[56]:

array([[ 10,  40],
       [ 90, 160]])

In [57]:

U / V

Out[57]:

array([[0.1, 0.1],
       [0.1, 0.1]])

Trajectory example continued¶

In [58]:

position = create_position()

Calculate the distance from a reference value, e.g., the end point of the trajectory:

In [59]:

ref = position[-1]
ref

Out[59]:

array([ 4.08082062e-01, -1.63206333e+00, -1.22464680e-15])

$$ d_i = |\mathbf{r}_i - \mathbf{r}_\text{ref}| = \sqrt{\sum_{k=0}^2 (r_{i,k} - r_{\text{ref},k})^2} $$

Let's rewrite with the difference vector $$ \Delta\mathbf{r}_i = \mathbf{r}_i - \mathbf{r}_\text{ref} $$ as $$ d_i = \sqrt{\Delta\mathbf{r}_i \cdot \Delta\mathbf{r}_i} = \sqrt{\sum_{k=0}^2 \Delta r_{i,k}^2} $$

This will give us an instruction for how to write it in numpy:

All the difference vectors:

In [60]:

Delta_r = position - ref

In [61]:

Delta_r.shape

Out[61]:

(1000000, 3)

In [62]:

Delta_r[[0, 1, -1]]

Out[62]:

array([[ 5.91917938e-01,  1.63206333e+00,  1.22464680e-15],
       [ 5.91917938e-01,  1.63209333e+00, -3.14159579e-05],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00]])

Square each element:

In [63]:

Delta_r ** 2

Out[63]:

array([[3.50366846e-01, 2.66363072e+00, 1.49975978e-30],
       [3.50366845e-01, 2.66372865e+00, 9.86962414e-10],
       [3.50366845e-01, 2.66382657e+00, 3.94784965e-09],
       ...,
       [1.33352927e-09, 2.53458565e-09, 3.94784966e-09],
       [3.33385299e-10, 6.33642303e-10, 9.86962414e-10],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

Sum along the second axis:

In [64]:

s = np.sum(Delta_r ** 2, axis=1)
s

Out[64]:

array([3.01399757e+00, 3.01409549e+00, 3.01419342e+00, ...,
       7.81596457e-09, 1.95399002e-09, 0.00000000e+00])

In [65]:

s.shape

Out[65]:

(1000000,)

Take the root for each individual $\Delta\mathbf{r}_i \cdot \Delta\mathbf{r}_i$:

In [66]:

np.sqrt(s)

Out[66]:

array([1.73608685e+00, 1.73611506e+00, 1.73614326e+00, ...,
       8.84079441e-05, 4.42039593e-05, 0.00000000e+00])

Altogether:

In [67]:

np.sqrt(np.sum((position - ref)**2, axis=1))

Out[67]:

array([1.73608685e+00, 1.73611506e+00, 1.73614326e+00, ...,
       8.84079441e-05, 4.42039593e-05, 0.00000000e+00])

looks very similar to

$$ d_i = \sqrt{\sum_{k=0}^2 (r_{i,k} - r_{\text{ref},k})^2} $$

Note on matrices¶

Note that multiplication between two arrays is not the same as matrix multiplcation. Arithmetic operations are element-wise.

In [83]:

v = np.array([0, -1, 10])
w = np.array([3, 5, -1])
v * w

Out[83]:

array([  0,  -5, -10])

In [94]:

A = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 2]])
B = np.array([[2, 0, 0], [-1, 1, 0], [0, 0, 0.5]])
print(A)
print(B)

[[1 1 0]
 [0 1 0]
 [0 0 2]]
[[ 2.   0.   0. ]
 [-1.   1.   0. ]
 [ 0.   0.   0.5]]

The multiplication is element-wise:

In [95]:

A * B

Out[95]:

array([[ 2.,  0.,  0.],
       [-0.,  1.,  0.],
       [ 0.,  0.,  1.]])

But there is a method for doing matrix multiplication/inner products: vector/vector $$ \mathbf{v}\cdot\mathbf{w} = \sum_{i=1}^3 v_i w_i, $$ matrix/vector $$ \mathbf{A}\cdot\mathbf{v} = \sum_{i=j}^3 A_{ij} v_j, $$ matrix/matrix $$ [\mathsf{A}\cdot\mathsf{B}]_{ik} = \sum_{j=1}^3 A_{ij} B_{jk} $$

In [85]:

np.dot(v, w)

Out[85]:

-15

or with @ (matrix multiplication operator)

In [87]:

v @ w

Out[87]:

-15

Numpy arrays also have the dot() method:

In [92]:

A.dot(v)

Out[92]:

array([-1, -1, 20])

In [93]:

A.dot(B)

Out[93]:

array([[ 1.,  1.,  0.],
       [-1.,  1.,  0.],
       [ 0.,  0.,  1.]])

And more linear algebra functions can be found in the numpy.linalg module:

In [72]:

dir(np.linalg)

Out[72]:

['LinAlgError',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_umath_linalg',
 'cholesky',
 'cond',
 'det',
 'eig',
 'eigh',
 'eigvals',
 'eigvalsh',
 'inv',
 'lapack_lite',
 'linalg',
 'lstsq',
 'matrix_power',
 'matrix_rank',
 'multi_dot',
 'norm',
 'pinv',
 'qr',
 'slogdet',
 'solve',
 'svd',
 'tensorinv',
 'tensorsolve',
 'test']

Creating new arrays¶

In [96]:

np.zeros((2,3))

Out[96]:

array([[0., 0., 0.],
       [0., 0., 0.]])

In [97]:

np.ones((2,3))

Out[97]:

array([[1., 1., 1.],
       [1., 1., 1.]])

In [98]:

np.ones((2,3), dtype=np.bool)

Out[98]:

array([[ True,  True,  True],
       [ True,  True,  True]])

Number ranges:

np.arange is the equivalent to range:

In [101]:

np.arange(-6, 6, 2)

Out[101]:

array([-6, -4, -2,  0,  2,  4])

It can use float steps:

In [102]:

np.arange(10, step=0.5)

Out[102]:

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ,
       6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])

More useful is np.linspace(start, stop, num) to make exactly num equally distant numbers between start and stop inclusive:

In [104]:

np.linspace(-6, 6, 13)

Out[104]:

array([-6., -5., -4., -3., -2., -1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.])

Thinking in arrays¶

Say we wanted to displace our particle a full 5 units in each of the x, y, and z directions. If you have experience with a language like C, you might be used to writing nested loops like this one to achieve this:

In [73]:

position = create_position()

(We use the %%time magic to get the time for a whole code block.)

In [74]:

%%time
for i in range(position.shape[0]):
    for j in range(position.shape[1]):
        position[i, j] += 5

CPU times: user 1.84 s, sys: 10.7 ms, total: 1.85 s
Wall time: 1.86 s

Even slower when you try to do it in a "Pythonic" fashion with the (otherwise very good!) enumerate() function:

In [75]:

position = create_position()

In [76]:

%%time
for i, row in enumerate(position):
    for j, element in enumerate(row):
        position[i, j] += 5     

CPU times: user 2.74 s, sys: 12.7 ms, total: 2.76 s
Wall time: 2.77 s

In [77]:

position

Out[77]:

array([[6.        , 5.        , 5.        ],
       [6.        , 5.00003   , 4.99996858],
       [6.        , 5.00006   , 4.99993717],
       ...,
       [5.40811858, 3.36798701, 4.99993717],
       [5.40810032, 3.36796184, 4.99996858],
       [5.40808206, 3.36793667, 5.        ]])

But one of the main points of numpy is performance, so we'd do better to spend as little time in an operation running through the Python interpreter, which is the case in the above loop. Instead we can do:

In [78]:

position = create_position()

In [79]:

%%time
position += 5

CPU times: user 2.66 ms, sys: 1.18 ms, total: 3.83 ms
Wall time: 2.13 ms

Speed-up for using array operations instead of for loops:

In [80]:

2.61 / 2.6e-3

Out[80]:

1003.8461538461538

On my laptop the difference in speed is about a factor of 400–1000 (you might see speed-ups on the order of 100 to 1000). The larger the array the more pronounced the difference in speed will be, too. The general rule when using numpy is to try and put what you're trying to do in terms of operations on whole arrays (or slices of them). Avoid Python loops unless absolutely necessary.

Example: array arithmetic with broadcasting¶

Rescale (multiply) the y-positions by 2 and displace the x and z positions by 3 and -100, respectively.

There are a lot of ways to do this, but the most succinct way is to take advantage of broadcasting. That is, doing:

In [81]:

position = create_position()

In [82]:

position = position * np.array([1, 2, 1]) + np.array([3, 0, -100])

numpy will take the 3-element, 1-D arrays here and apply them to whole columns in position. Note that we already took advantage of broadcasting rules in a way, since multiplying an array by a scalar is the same as multiplying the array by an array of equal shape with all elements equal to the scalar.