CRÜCIAL PŸTHON Week 3: Numpy Stride Tricks

In [1]:
from IPython.core.display import Image 
Image(url='http://labrosa.ee.columbia.edu/crucialpython/logo.png', width=600) 
Out[1]:
In [2]:
# Get our pylab mise en place going
import numpy as np
import matplotlib.pyplot as plt
In [3]:
# I'll make up a time series of sequential data
# 
x = np.arange(8 * 16)

# And print it
print x
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127]

Let's say x holds sequential data, and we want to do some sliding window analysis. This is common in time series modeling and signal processing.

If the windows are disjoint (i.e., no overlap) then we can get away with simply reshaping x into a matrix like so:

In [4]:
frame_length = 8

x_framed = np.reshape(x, (-1, frame_length))

print x_framed
[[  0   1   2   3   4   5   6   7]
 [  8   9  10  11  12  13  14  15]
 [ 16  17  18  19  20  21  22  23]
 [ 24  25  26  27  28  29  30  31]
 [ 32  33  34  35  36  37  38  39]
 [ 40  41  42  43  44  45  46  47]
 [ 48  49  50  51  52  53  54  55]
 [ 56  57  58  59  60  61  62  63]
 [ 64  65  66  67  68  69  70  71]
 [ 72  73  74  75  76  77  78  79]
 [ 80  81  82  83  84  85  86  87]
 [ 88  89  90  91  92  93  94  95]
 [ 96  97  98  99 100 101 102 103]
 [104 105 106 107 108 109 110 111]
 [112 113 114 115 116 117 118 119]
 [120 121 122 123 124 125 126 127]]

In the above example, x_framed[i] is a single window's worth of data.

But what if we want the windows to overlap? Reshaping will not work, since we need multiple copies of most entries in the matrix.

The simple thing to do is allocate a new array, and iteratively populate each row with a properly advanced window, like so:

In [5]:
hop_length = frame_length / 2

num_frames = 1 + (len(x) - frame_length) / hop_length

x_framed_overlap = np.zeros( (num_frames, frame_length), dtype=int )

for t in range(num_frames):
    x_framed_overlap[t, :] = x[t*hop_length:t*hop_length + frame_length]
    
print x_framed_overlap
[[  0   1   2   3   4   5   6   7]
 [  4   5   6   7   8   9  10  11]
 [  8   9  10  11  12  13  14  15]
 [ 12  13  14  15  16  17  18  19]
 [ 16  17  18  19  20  21  22  23]
 [ 20  21  22  23  24  25  26  27]
 [ 24  25  26  27  28  29  30  31]
 [ 28  29  30  31  32  33  34  35]
 [ 32  33  34  35  36  37  38  39]
 [ 36  37  38  39  40  41  42  43]
 [ 40  41  42  43  44  45  46  47]
 [ 44  45  46  47  48  49  50  51]
 [ 48  49  50  51  52  53  54  55]
 [ 52  53  54  55  56  57  58  59]
 [ 56  57  58  59  60  61  62  63]
 [ 60  61  62  63  64  65  66  67]
 [ 64  65  66  67  68  69  70  71]
 [ 68  69  70  71  72  73  74  75]
 [ 72  73  74  75  76  77  78  79]
 [ 76  77  78  79  80  81  82  83]
 [ 80  81  82  83  84  85  86  87]
 [ 84  85  86  87  88  89  90  91]
 [ 88  89  90  91  92  93  94  95]
 [ 92  93  94  95  96  97  98  99]
 [ 96  97  98  99 100 101 102 103]
 [100 101 102 103 104 105 106 107]
 [104 105 106 107 108 109 110 111]
 [108 109 110 111 112 113 114 115]
 [112 113 114 115 116 117 118 119]
 [116 117 118 119 120 121 122 123]
 [120 121 122 123 124 125 126 127]]

What's wrong with this? A few of things:

  • It's slow, and requires lots of memory copies
  • The new array is much larger than the original time series
  • The smaller hop_length gets, the bigger the memory usage!

If we only need read-only access to the data, then this is a lot of overhead.

Of course, we could just use direct indexing, e.g., x[t * hop_length : t * hop_length + frame_length], but that gets ugly quickly.

Stride tricks

All we really want is a clear way to index windowed samples from x without incurring a lot of computational overhead.

NumPy's stride tricks submodule provides exactly this. We'll need to know a little about how numpy arrays are organized though.

An ndarray contains numerical data of some type, say int8 or float64, and this dtype determines how many bytes in ram each entry occupies.

Although ndarrays expose multi-dimensional indexing, their contents are actually stored in a linear index. The default ordering of the dimensions is row-major (order='C', as in The C Programming Language) or column-major (order='F', as in Fortran). Yes, it's confusing.

A two-dimensional row-major array a will be organized as follows:

  • a[0, 0], a[0, 1], a[0, 2], ... a[0, n], a[1, 0], a[1, 1], a[1, 2], ...

while a column-major array will look like

  • a[0, 0], a[1, 0], a[2, 0], ... a[m, 0], a[0, 1], a[1, 1], a[2, 1], ...

Why does this matter?

When you index an ndarray in python, say M[i, j], it needs to translate the two-dimensional index (i, j) into a linear index:

  • i * row_stride + j * column_stride

So column- or row-major array indexing is merely an issue of setting up the strides appropriately. Of course, this idea generalizes to n dimensional arrays by tacking on more indices and strides.

We're going to use stride tricks to fool numpy into thinking it has much more data than it actually does, while keeping the same underlying data in place. How? By manually setting the row and column strides!

The key idea here is that a row stride need not span the full shape of the array.

WARNING

Stride tricks can be dangerous. Because we're digging into the guts of ndarrays, it's possible to violate memory protection if we're not careful.

Ok, you've been warned.

In [6]:
from numpy.lib import stride_tricks
In [7]:
# Ok, it's a linear index, so it's both F- and C-contiguous
# Next, let's see how big each item in x is.  This is provided by the `itemsize` field
print x.itemsize, x.dtype
8 int64
In [8]:
# Now, we're ready to set up our fake frmaed data

# Each row advances by `hop_length` entries, times the number of bytes in each entry
row_stride = x.itemsize * hop_length

# Columns are still contiguous, and only advance by one entry
col_stride = x.itemsize

# as_strided will construct a new view of the data in x, using the shape and stride parameters we supply
x_framed_strided = stride_tricks.as_strided(x, 
                                            shape=(num_frames, frame_length), 
                                            strides=(row_stride, col_stride))

print x_framed_strided
[[  0   1   2   3   4   5   6   7]
 [  4   5   6   7   8   9  10  11]
 [  8   9  10  11  12  13  14  15]
 [ 12  13  14  15  16  17  18  19]
 [ 16  17  18  19  20  21  22  23]
 [ 20  21  22  23  24  25  26  27]
 [ 24  25  26  27  28  29  30  31]
 [ 28  29  30  31  32  33  34  35]
 [ 32  33  34  35  36  37  38  39]
 [ 36  37  38  39  40  41  42  43]
 [ 40  41  42  43  44  45  46  47]
 [ 44  45  46  47  48  49  50  51]
 [ 48  49  50  51  52  53  54  55]
 [ 52  53  54  55  56  57  58  59]
 [ 56  57  58  59  60  61  62  63]
 [ 60  61  62  63  64  65  66  67]
 [ 64  65  66  67  68  69  70  71]
 [ 68  69  70  71  72  73  74  75]
 [ 72  73  74  75  76  77  78  79]
 [ 76  77  78  79  80  81  82  83]
 [ 80  81  82  83  84  85  86  87]
 [ 84  85  86  87  88  89  90  91]
 [ 88  89  90  91  92  93  94  95]
 [ 92  93  94  95  96  97  98  99]
 [ 96  97  98  99 100 101 102 103]
 [100 101 102 103 104 105 106 107]
 [104 105 106 107 108 109 110 111]
 [108 109 110 111 112 113 114 115]
 [112 113 114 115 116 117 118 119]
 [116 117 118 119 120 121 122 123]
 [120 121 122 123 124 125 126 127]]
In [9]:
# What if we want our frames vertical instead?  Just swap the row and column strides

x_framed_strided_column = stride_tricks.as_strided(x, 
                                            shape=(num_frames, frame_length), 
                                            strides=(col_stride, row_stride))

print x_framed_strided_column
[[ 0  4  8 12 16 20 24 28]
 [ 1  5  9 13 17 21 25 29]
 [ 2  6 10 14 18 22 26 30]
 [ 3  7 11 15 19 23 27 31]
 [ 4  8 12 16 20 24 28 32]
 [ 5  9 13 17 21 25 29 33]
 [ 6 10 14 18 22 26 30 34]
 [ 7 11 15 19 23 27 31 35]
 [ 8 12 16 20 24 28 32 36]
 [ 9 13 17 21 25 29 33 37]
 [10 14 18 22 26 30 34 38]
 [11 15 19 23 27 31 35 39]
 [12 16 20 24 28 32 36 40]
 [13 17 21 25 29 33 37 41]
 [14 18 22 26 30 34 38 42]
 [15 19 23 27 31 35 39 43]
 [16 20 24 28 32 36 40 44]
 [17 21 25 29 33 37 41 45]
 [18 22 26 30 34 38 42 46]
 [19 23 27 31 35 39 43 47]
 [20 24 28 32 36 40 44 48]
 [21 25 29 33 37 41 45 49]
 [22 26 30 34 38 42 46 50]
 [23 27 31 35 39 43 47 51]
 [24 28 32 36 40 44 48 52]
 [25 29 33 37 41 45 49 53]
 [26 30 34 38 42 46 50 54]
 [27 31 35 39 43 47 51 55]
 [28 32 36 40 44 48 52 56]
 [29 33 37 41 45 49 53 57]
 [30 34 38 42 46 50 54 58]]