To work with DyND at a low level, particularly to be able to develop in the library, one needs a good understanding of how DyND views memory. We're going to work our way from Python objects to DyND arrays, motivating its design as we go.
At the end of these slides, you will hopefully grasp why DyND is structured as it is, and be prepared to dig deeper into its code.
We begin with a look at a Python object. In Python, objects are stored in memory blocks independently allocated on the heap. A single floating point value looks something like this:
This doesn’t look too bad: in a dynamically typed system, we need to store some metadata about the type, along with the data containing the value.
Often, however, we don’t want just one floating point value, we want a whole array of them. In Python, this is done using a list object, which looks like this:
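The per-object overhead is easy to observe from Python itself. This small sketch uses only `sys.getsizeof`; the exact byte counts vary by CPython version and platform, so treat the numbers as illustrative:

```python
import sys

# Each Python float is a separately heap-allocated object carrying type
# metadata (a type pointer and a reference count) alongside its 8-byte value.
x = 3.14
print(sys.getsizeof(x))  # typically 24 bytes on 64-bit CPython

# A Python list of floats stores pointers to those boxed objects, not the
# values themselves, so the data ends up scattered across the heap.
values = [0.1 * i for i in range(7)]
overhead = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(overhead)  # far more than the 7 * 8 = 56 bytes of raw float data
```

Seven floats of raw data need 56 bytes, yet the list plus its boxed elements take several times that, before even counting allocator bookkeeping.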
We can start to see some problems from this picture.
We will call this the “Smalltalk road.” Program data is made of many little objects, connected by pointers. Another option is the “Fortran road,” where program data is made of arrays of data, stored contiguously.
There's a long but worthwhile set of lectures by Alexander Stepanov on Amazon A9's YouTube channel that goes into some depth on many programming topics.
from IPython.display import YouTubeVideo
YouTubeVideo("Y4qdNddSWAg", start=190)
Jake VanderPlas has an excellent blog post that dives into more detail on Python's object structure, that of the CPython interpreter in particular.
Another great place to look for performance inspiration is the game programming industry, as they work under constraints far more harsh than most developers. Bob Nystrom's book "Game Programming Patterns" has a chapter about data locality which goes into some depth on the topic and has some nice diagrams to illustrate things. Here's a quote:
"When I got some stuff working, I was surprised. I knew it was a big deal, but there’s nothing quite like seeing it with your own eyes. I wrote two programs that did the exact same computation. The only difference was how many cache misses they caused. The slow one was fifty times slower than the other."
Let's take the Python list we looked at, factor out the redundant repetition of the float type metadata, and split all the type information away from the value storage. This gives us an abstract idea of how a dynamically typed array should look:
The dynamically typed array we made is just a small step away from the NumPy array. In NumPy, the “Array of 7” part is represented as (ndim, shape, strides), and the “Float type” part is represented as the NumPy dtype.
The strided array approach NumPy takes turns out to be a very good data structure for many kinds of numeric data. Allowing the strides to be set arbitrarily means both C-order (row-major) and Fortran-order (column-major) data can be represented naturally. There is good material out there going into more depth showing the consequences and benefits of it.
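To see concretely why arbitrary strides cover both layouts, here is a minimal pure-Python sketch; the flat buffer, `get` helper, and names are illustrative, not NumPy or DyND API:

```python
# A strided array locates element (i, j) at offset
# i*strides[0] + j*strides[1] from the data pointer. This sketch uses a
# flat Python list as the "memory" and element counts as the strides.
buf = list(range(12))  # flat storage for a 3 x 4 array

def get(strides, offset, i, j):
    return buf[offset + i * strides[0] + j * strides[1]]

# C order (row-major): rows are contiguous, strides = (4, 1).
c_val = get((4, 1), 0, 1, 2)
# Fortran order (column-major) over the same buffer: strides = (1, 3).
f_val = get((1, 3), 0, 1, 2)
print(c_val, f_val)  # -> 6 7
```

The same buffer serves both orders; only the stride values change, which is exactly what lets NumPy view externally produced C or Fortran data without copying.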
One good resource is this section of open SciPy lecture notes.
NumPy, as fantastic as it is, has proved to be lacking in many areas. It provides something good enough and stable enough that a huge ecosystem has grown on top of it, but there are many directions people try to stretch it that require incredible contortions or performance sacrifices to achieve.
A few of these weaknesses include
DyND begins one step back from NumPy in the data representation narrative we have followed, generalizing the “Dynamic Typed Array” we saw in a different direction.
Instead of focusing on the “Array of 7” part and turning it into a “general strided multi-dimensional array,” DyND treats it as one dimension of a hierarchically represented type. DyND takes the regularities in the types of stored data, bundles them together into a type, and takes the data as a separate bundle. This gets us a picture like this:
The types DyND is using are from datashape, a type system created for the Blaze project. What datashape does is combine the shape and dtype notions we saw in the NumPy array.
The shape is generalized slightly, allowing both NumPy-style strided dimensions, called “fixed” as seen in the DyND Array diagram, as well as variable-sized dimensions, called “var.” This variable-sized dimension allows for storage of ragged data.
The dtype is also slightly generalized over NumPy, but more importantly a language is defined for writing out types in the datashape system. This is spelled out in the datashape documentation.
Slicing is a pretty common operation to perform on numerical arrays, so we're going to use it to examine how the structure of the DyND array we defined fares.
In NumPy, slicing is enabled by the strided nature of the array object and the ability to point at the data of another array object. DyND works much the same way. Each “fixed” dimension has an associated stride, and the DyND array contains a data reference in addition to the raw data pointer.
Our diagram didn't include the stride, though, so we have to determine where it will go.
Based on what was described so far, there are two places: in the type or in the value. Let’s first imagine we put the stride in the type, and see what happens when we apply two different slices to an array.
To take a look at what will happen, let's consider an array A with three elements, [1, 2, 3], and type “fixed[size=3, stride=8] * float64”.
Our first slice grabs the first two elements; in Python it is A[:2]. This produces an array with a new type “fixed[size=2, stride=8] * float64”, and points to the same data as A.
Our second slice grabs every second element; in Python it is A[::2]. This produces an array with a new type “fixed[size=2, stride=16] * float64”, and points to the same data as A.
By slicing different ways, we’ve produced two arrays which are both “arrays with two floats,” but have different types. In NumPy, this distinction is hidden away in the strides value attached to the object, and the transparent handling of this detail is considered a feature, not a problem. It would be nicer to just be able to say we want a “fixed dimension of size 2” or just a “fixed dimension.”
DyND's usual strided arrays do not include the stride in the type. The types are spelled "2 * float64" and "fixed * float64".
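Slicing a strided dimension is pure arithmetic on the (size, stride, offset) triple, with the element type untouched. The following sketch uses a hypothetical helper, not DyND's API, to reproduce the two slices of A from above:

```python
# Compute the (size, stride, offset) of A[start:stop:step] for a single
# strided dimension. Positive step only; indices assumed in [0, size].
def slice_dim(size, stride, offset, start, stop, step):
    new_size = max(0, (stop - start + step - 1) // step)
    return new_size, stride * step, offset + start * stride

itemsize = 8  # float64
print(slice_dim(3, itemsize, 0, 0, 2, 1))  # A[:2]  -> (2, 8, 0)
print(slice_dim(3, itemsize, 0, 0, 3, 2))  # A[::2] -> (2, 16, 0)
```

Both results have size 2, so both views share the type "2 * float64"; only the stride stored in the arrmeta differs.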
The other place we had available for the stride information is with the data. There are multiple ways we could do this, but let’s just consider the most obvious one, storing the stride immediately ahead of the data. This means our original array looks like this:
Our first slice, A[:2], still looks good. The only things that change in the diagram are the “fixed[3]” dimension becomes “fixed[2]”, and the third value in the array is ignored.
Our second slice, however, leads to trouble. We want to view the same data, but now with a different stride. The stride must remain 8 for the original A, but needs to be 16 for the sliced A[::2], leading to the conclusion that the set of arrays viewed by this representation is not closed under slicing.
Another trivial yet problematic case is slicing the last two elements, where the place for the stride would overlap with a value.
Since neither the type nor the data of the array seem like appropriate places to put the stride, we need a new location for it. Recall that in NumPy, the array object contains a dtype, a data pointer, and a few additional values including ndim, shape, and strides. We’ve put ndim and shape into the type, but we’re going to leave strides in the array object, creating some new space called “array metadata” or “arrmeta” for short.
While the exercise we just went through may feel like a triviality, it’s partly an attempt to get you thinking about how various data types get laid out in memory. In this example, just placing the stride ahead of the data leads to something absurd, but there are other cases like the “var” dimension type, which are structured pretty close to this.
Every dynd type has the opportunity to store an arbitrary but fixed amount of arrmeta in the array object. This is a place to store information that doesn't really belong in the type, but is also necessary to fully describe the data the array object is pointing at.
In the case of the fixed dimension type, we have chosen to store both the size and the stride of the dimension. (The size is redundant, but included for simple low-level access.) For other types, other arrmeta is needed. We’re going to examine what arrmeta we should choose for a struct/record type, by thinking about some operations we want to perform on data of that type.
Let’s start with a simple 2D point type, with an X and a Y coordinate. In datashape, we can spell it as “{X: float64, Y: float64}”. If we treat it as a C/C++ struct, here’s how an array A of this type looks:
Just like with the slicing operation we had before, we have simple operations we want to support, such as field selection. If we access field A.Y, we get
A struct with only two fields isn't enough to illustrate our next operation, of selecting multiple fields. In addition to the coordinates, our data might include differentials as well, so let's add them.
Suppose we have a function which takes values of type “{X: float64, Y: float64}” as a parameter, and we want to pass the X and Y coordinates from our complicated struct to that function. A natural thing to try is to select these two fields to get a new struct.
Something went wrong here, just like when we tried to put the stride in the fixed dimension type. The values no longer match a C/C++ struct: the second offset can't be 8. And if the offset is 16 instead, the type is different from the one our function takes.
Just like when we first looked at where to put the stride for the fixed dimension, we have made a choice of where to put the offsets for the struct dtype. In this case, it was quite intuitive to simply do what C/C++ does, and put those offsets in the type. This is in fact what NumPy does.
This led to trouble, however, which was the exact same trouble that made us put the stride into the arrmeta. Therefore, we put these offsets in the arrmeta.
Here's how it looks after we move the struct’s field offsets into the arrmeta. Our data A of type “{X: float64, DX: float64, Y: float64, DY: float64}” looks like:
Now if we do a field selection, extracting the X and Y fields, we’ll get a DyND struct whose type is “{X: float64, Y: float64}”, exactly as we wanted. The layout is not the same as a C/C++ struct, but views of the data are closed under field selection, just as views of strided arrays are closed under slicing. Here’s the result of that field selection:
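This field selection can be sketched in pure Python using the `struct` module; the `select_fields` helper and the dict-of-offsets representation are illustrative stand-ins for DyND's arrmeta, not its actual API:

```python
import struct

# Pack {X, DX, Y, DY} the way a C struct would lay it out.
data = struct.pack("4d", 1.0, 0.1, 2.0, 0.2)
offsets = {"X": 0, "DX": 8, "Y": 16, "DY": 24}

def select_fields(data, offsets, fields):
    # Return a "view": the same bytes, plus arrmeta-style offsets for
    # only the requested fields. No data is copied or moved.
    return data, {name: offsets[name] for name in fields}

view, view_offsets = select_fields(data, offsets, ["X", "Y"])
x = struct.unpack_from("d", view, view_offsets["X"])[0]
y = struct.unpack_from("d", view, view_offsets["Y"])[0]
print(x, y)  # -> 1.0 2.0
```

Because the offsets travel beside the data reference instead of inside the type, the selected view has the plain type “{X: float64, Y: float64}” even though its fields are 16 bytes apart in memory.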
So far we’ve seen two forms of arrmeta, for fixed dimensions and for structs. A natural question arises, what happens when we have arrays of structs or structs of arrays? Let's start with a simple case, an array of XY coordinates with type "10 * {X: float64, Y: float64}". The dimension type, "10" (short for "fixed[10]"), needs two arrmeta values, size and stride. The struct dtype also needs two arrmeta values, offsetX and offsetY. These need to be systematically organized, so that the arrmeta can scale up to arbitrary complexity.
If we think of the type as a tree, we can use its hierarchy to organize the arrmeta structure. Here's how that looks:
So far, we've only made fields of type float64 in structs, so there has been no arrmeta to track. Struct fields are not limited this way, however, so let's do something a bit more involved. Let's make a struct of an array and a struct, with type "{X: 10 * float64, Y: {Z: float64, W: float64}}".
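A tree-structured arrmeta for that type can be sketched as nested Python dicts; this layout is hypothetical, chosen only to mirror the type tree, and is not DyND's binary arrmeta format:

```python
# Arrmeta sketch for "{X: 10 * float64, Y: {Z: float64, W: float64}}".
arrmeta = {
    "struct": {
        "field offsets": [0, 80],  # X's 10 * 8 bytes, then Y's sub-struct
        "X": {"fixed_dim": {"size": 10, "stride": 8}},
        "Y": {"struct": {"field offsets": [0, 8]}},  # Z, W need no more
    }
}

# Walking the arrmeta follows exactly the same path as walking the type.
stride_of_X = arrmeta["struct"]["X"]["fixed_dim"]["stride"]
print(stride_of_X)  # -> 8
```

Each node of the type tree contributes its own chunk of arrmeta, and float64 leaves contribute nothing, which is why only X and Y appear.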
This nesting can get arbitrarily deep, and walking through the arrmeta always matches the tree structure of its corresponding type. DyND provides a way to print a representation of the arrmeta, which in Python is provided by the nd.debug_repr function. Here's how that looks:
>>> a = nd.empty("10 * {A: 3 * float64, B: 3 * {X: float64, Y: 4 * float64}, C: 2 * float64}")
>>> nd.debug_repr(a)
------ array
 address: 00000000003A6060
 refcount: 1
 type:
  pointer: 00000000003974C0
  type: 10 * {A : 3 * float64, B : 3 * {X : float64, Y : 4 * float64}, C : 2 * float64}
  type refcount: 1
 arrmeta:
  flags: 3 (read_access write_access )
  type-specific arrmeta:
   fixed_dim arrmeta
    size: 10
    stride: 160
    struct arrmeta
     field offsets: 0, 24, 144
     field 0 (name A) arrmeta:
      fixed_dim arrmeta
       size: 3
       stride: 8
     field 1 (name B) arrmeta:
      fixed_dim arrmeta
       size: 3
       stride: 40
       struct arrmeta
        field offsets: 0, 8
        field 1 (name Y) arrmeta:
         fixed_dim arrmeta
          size: 4
          stride: 8
     field 2 (name C) arrmeta:
      fixed_dim arrmeta
       size: 2
       stride: 8
 data:
  pointer: 00000000003A6108
  reference: 0000000000000000 (embedded in array memory)
------
Memory leaks can be a big problem, leading to the enormous popularity of systems using garbage collection as the main strategy to deal with it. Garbage collection requires a tightly controlled environment, however, and a library designed to interoperate with the memory of multiple systems cannot make strong assumptions about them.
In modern C++, the standard approach is to reason about memory ownership, using deterministic scoping to manage lifetimes. An important acronym to know is RAII (Resource Acquisition Is Initialization), which summarizes the basic idea of using memory ownership to systematically avoid memory leaks. See the Wikipedia article for a decent summary.
Being able to point at memory owned by other objects is necessary for the "view" model DyND is using. To manage this, DyND uses data references independent from the pointers, based on a notion of memory block objects.
To track ownership and lifetime of DyND arrays and the data they point to, DyND uses reference-counted objects called memory blocks. These have semantics similar to the standard C++ shared_ptr, utilizing an atomic reference count to keep track of how many references to the object are being held.
This is similar to Python objects in the CPython interpreter, which use reference counting as well. An important difference is that Python does not use atomic reference counting; as a consequence it is faster, but requires that the GIL (Global Interpreter Lock) be held any time objects are manipulated.
DyND uses the same ideas as C++ shared_ptr to make the reference count mostly handled automatically, but it's useful to understand reference counting, especially when working with DyND at a lower level. The CPython Reference Counting documentation is a good read for this. Because the CPython API is C, there is no way to handle reference counts automatically, and one must manage them explicitly when using that API.
Putting the memory block data reference together with the other parts of the DyND array, this leads us to the following picture:
Memory allocations are fairly expensive operations, so creating a single DyND array can be costly. Additionally, when the memory block for the data is allocated separately from the main array object, it might be arbitrarily far away from that data in memory, reducing data locality.
DyND supports an optimization for this case, where the memory for the data and the array object are allocated in one chunk. This is signaled by a NULL data reference, because holding a reference to itself would cause a reference cycle.
An important use case for DyND is the ability to point into memory owned by another system. For example, Python's PEP 3118, known simply as the buffer protocol in Python 3, provides an interface through which to publish strided array memory. Another example might be the result of an algorithm in a library being wrapped, returned as an object containing a large heap-allocated array.
DyND has something called the external memory block for this case, a memory block which does not itself contain the memory, but owns a reference to the memory that it knows how to free. This is similar to the Capsule object in Python.
Variable-sized data is a commonly desired feature missing from NumPy, outside of its support for PyObject arrays. Proper support for this requires the ability to allocate and point to memory outside of the fixed-size data for a single array element. This is one of the purposes of DyND's support for pointers.
DyND data elements can contain pointers into separately allocated memory. We're going to introduce the topic via the "pointer" type, which represents a single pointer into some other memory. Some more types, like the "var" dimension, "string", and "bytes" also use pointers in their representation.
The pointers in an array's data will be pointing at some other data. It wouldn't be ok for this other data to disappear, so we need some way to hang on to a reference to it. Just like the stride and offsets before, this gets stored in the array's arrmeta.
One major difference with previously seen arrmeta is that the reference to that data will need to be freed when the array object is freed. DyND provides a mechanism for this via an arrmeta destructor function.
Let's start with something really simple, a single pointer to an array of floats. The type is spelled "pointer[7 * float64]", and the array object looks like:
Observe that the DataRef in the array object manages ownership of all the data the array object's data pointer refers to. Similarly, the DataRef in the pointer type's arrmeta manages ownership of all the memory that the pointers stored in the corresponding data point to.
Recall the technique we used before to evaluate our type and data storage choices, by checking that the representation is closed over slicing. Let's try that here by slicing the last two elements of the array. If the array is A, we're producing A[-2:].
We can't modify the pointer, as the original A is still using it. Recall that when we sliced before, we got a new array object whose data pointer we could adjust. Here the pointer lives in the shared data rather than in the new array object, so we must do something different.
Because of the uniform way slicing applies, every pointer would be offset by the same amount. Thus, we can store a single offset value in the pointer's arrmeta, and then slicing works.
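The effect of that single arrmeta offset can be sketched with some made-up addresses; the `element_address` helper is illustrative, not DyND API:

```python
# With one offset stored in the pointer type's arrmeta, slicing never
# rewrites the pointer stored in the shared data.
itemsize = 8       # float64
target = 1000      # hypothetical address the stored pointer refers to

def element_address(arrmeta_offset, i):
    # element i of the viewed array, relative to the unchanged pointer
    return target + arrmeta_offset + i * itemsize

# Original A of 7 elements: offset 0. A[-2:]: offset 5 * 8 = 40.
print(element_address(0, 0))    # -> 1000, A[0]
print(element_address(40, 0))   # -> 1040, A[-2:][0]
print(element_address(0, 5))    # -> 1040, the same memory as A[5]
```

The pointer in the data stays 1000 for both views; only the per-view arrmeta differs, so the representation stays closed under slicing.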
The "var" dimension allows for the storage of ragged arrays, where the size of the dimension marked "var" does not remain constant. A single element of this type includes a pointer and a dimension size. The arrmeta contains a stride, just like the fixed dimension, along with a data reference and an offset, just like the pointer type.
We'll make a simple ragged array with type "2 * var * float64" to illustrate this structure.
An array with the structure we just saw won't normally "poof" into existence fully formed. It would instead start in some initial empty state, and be filled in element by element. DyND handles this by initializing all the var dim elements to NULL, then having the memblock for the var dim expose a pooled memory allocator interface. When we call nd::empty("2 * var * float64"), we get back an array object looking like this:
After assigning to the first element, we get this:
After assigning to the second element, we get the fully populated array:
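The three states above (empty, partially assigned, fully populated) can be sketched together in pure Python; the `pool` list and `assign_var_element` helper stand in for the pooled memblock and are not DyND's actual API:

```python
# Sketch of "2 * var * float64": each outer element is a (pointer, size)
# pair, filled in one at a time from a pooled allocator.
pool = []                # stand-in for the pooled memblock's allocations

def assign_var_element(slots, i, values):
    pool.append(list(values))              # allocate from the pool
    slots[i] = (pool[-1], len(values))     # store (pointer, size)

slots = [None, None]     # nd.empty state: all var elements start NULL
assign_var_element(slots, 0, [1.0, 2.0, 3.0])
assign_var_element(slots, 1, [4.0, 5.0])
data, size = slots[1]
print(size, data)        # -> 2 [4.0, 5.0]
```

Note that nothing in the pool is ever freed individually; allocations live until the whole memblock goes away, which is the behavior behind the resize restriction discussed next.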
This pooled memory approach has some advantages and some disadvantages. A big advantage is that the data allocated in the pooled memblock can be viewed in the same way as other data in dynd memblocks. For example, if we access via a slice A[1, :], DyND will collapse the singular var dimension into a fixed dimension.
A big disadvantage of the pooled approach is that resizing an element of a var dimension after it has been created isn't allowed. If it were allowed, very simple code that repeatedly resized an element would cause many allocations without any frees, a highly unexpected behavior.
DyND needs support for non-pooled allocations as well, for the string, bytes, and var dimension types that currently support only pooled allocations. The details of this haven't been fully worked out, but it may be that this gets done by having the memblock of the dataref expose different kinds of memory allocators for the pooled and non-pooled cases.