NumPy Data Types

More than meets the eye

Austin Godber @godber

DesertPy - 8/26/201

What are NumPy Data Types?

We've seen them before as the simple data type of a NumPy array

In [1]:

import numpy as np
np.ones((3,4), dtype=np.int32)

Out[1]:

array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int32)

So, what's there to talk about?

What simple data types are there and how can we specify them?

Well ... there are about half a dozen ways.
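As a rough illustration of a few of those ways, the sketch below builds the same 32-bit signed integer dtype from a NumPy scalar type, a type name, and a short type string, then checks that they compare equal. The variable names are only for illustration.

```python
import numpy as np

# Three spellings of the same 32-bit signed integer dtype.
from_scalar_type = np.dtype(np.int32)   # NumPy scalar type
from_name = np.dtype('int32')           # type name as a string
from_typestr = np.dtype('i4')           # kind character plus size in bytes

print(from_scalar_type == from_name == from_typestr)

# Anything np.dtype() accepts can also be passed directly as dtype=...
same_array = np.ones((3, 4), dtype='i4')
print(same_array.dtype == np.int32)
```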

The one that clicks for me is the "array interface" notation. Call np.dtype() with a string whose first character is the type and whose remaining characters give the size in bytes. For example, np.dtype('i4') is a 32-bit signed integer. Here are the valid characters:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)
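A dtype object exposes the same information back as attributes. Here is a small sketch that uses a few strings built from the table above (the list `codes` and the dict `info` are made up for illustration):

```python
import numpy as np

codes = ['b1', 'i2', 'u4', 'f8', 'c16', 'S5', 'U5']

# Map each type string to its kind character and size in bytes.
info = {code: (np.dtype(code).kind, np.dtype(code).itemsize) for code in codes}

for code, (kind, itemsize) in info.items():
    print(f"{code!r:6} kind={kind!r} itemsize={itemsize:2d} name={np.dtype(code).name}")

# For 'U' the number counts characters, not bytes: each character takes 4 bytes.
print(np.dtype('U5').itemsize)
```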

In [2]:

np.ones((2,3), dtype=np.dtype('i2'))

Out[2]:

array([[1, 1, 1],
       [1, 1, 1]], dtype=int16)

In [3]:

np.ones((2,3), dtype=np.dtype('i4'))

Out[3]:

array([[1, 1, 1],
       [1, 1, 1]], dtype=int32)

So, these look the same, but they are stored differently in memory, right ...

i2 uses two bytes to store the integers in the array below:

In [4]:

np.ones((2,3), dtype=np.dtype('i2')).tobytes()

Out[4]:

b'\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00'

i4 uses four bytes to store the integers in the array below:

In [5]:

np.ones((2,3), dtype=np.dtype('i4')).tobytes()

Out[5]:

b'\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00'
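The byte counts follow directly from the item size. A quick check, as a sketch (the names `small` and `large` are illustrative):

```python
import numpy as np

small = np.ones((2, 3), dtype='i2')
large = np.ones((2, 3), dtype='i4')

# nbytes is the number of elements times the size of one element.
print(small.itemsize, small.nbytes)   # 2 bytes per item, 12 bytes total
print(large.itemsize, large.nbytes)   # 4 bytes per item, 24 bytes total

# The values compare equal even though the memory layout differs.
print(np.array_equal(small, large))
```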

One can even specify byte order by starting the string with > (big-endian) or < (little-endian).

In [6]:

np.ones((2,3), dtype=np.dtype('>i2'))

Out[6]:

array([[1, 1, 1],
       [1, 1, 1]], dtype=int16)

In [7]:

np.ones((2,3), dtype=np.dtype('>i2')).tobytes()

Out[7]:

b'\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01'
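Byte order only matters when raw bytes are interpreted. The sketch below reads the same two bytes under both byte orders with np.frombuffer, then converts an array from one order to the other. The variable names are illustrative.

```python
import numpy as np

raw = b'\x00\x01'

as_big = np.frombuffer(raw, dtype='>i2')     # most significant byte first
as_little = np.frombuffer(raw, dtype='<i2')  # least significant byte first
print(as_big[0], as_little[0])               # 1 and 256

# Converting between byte orders keeps the values and swaps the stored bytes.
big = np.ones(3, dtype='>i2')
little = big.astype('<i2')
print(np.array_equal(big, little))
print(big.tobytes())
print(little.tobytes())
```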

Great, now what?

Let Us Dig Deeper

... take a look at the NumPy docs on data types ...

A data type object (an instance of numpy.dtype class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted.

It describes the following aspects of the data:

1. Type of the data
2. Size of the data
3. Byte order of the data
4. If the data type is a record, an aggregate of other data types,
   - what are the names of the “fields” of the record, by which they can be accessed,
   - what is the data-type of each field, and
   - which part of the memory block each field takes.
5. If the data type is a sub-array, what is its shape and data type.
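Each of these aspects can be read back from a dtype object. A minimal sketch, using a made-up record dtype (`point_dt` and its field names are hypothetical, chosen only for illustration):

```python
import numpy as np

# A record with a 5-byte string and a sub-array of two little-endian float16 values.
point_dt = np.dtype([('label', 'S5'), ('xy', '<f2', (2,))])

print(point_dt.names)      # field names
print(point_dt.itemsize)   # total bytes per record: 5 + 2 * 2 = 9
for name in point_dt.names:
    # fields maps each name to (dtype, byte offset[, title]).
    field_dtype, offset = point_dt.fields[name][:2]
    print(name, field_dtype, 'starts at byte', offset)

# The sub-array field carries its own shape and base type.
xy_dt = point_dt['xy']
print(xy_dt.shape, xy_dt.base)
```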

Whoa, did you see number 4 and 5!?!?

What, pray tell, is a RECORD?

A record is a composite data type, much like a C struct. A record array (also called a structured array) is an array of such records, and you can use Python dictionary-style notation to access its fields.

So, what's that mean?

Let's look.

In [8]:

simple_record_dt = np.dtype('S5,i2')  # 'S5' rather than the older 'a5' alias, which newer NumPy deprecates
simple_record = np.zeros((2,3), dtype=simple_record_dt)
simple_record

Out[8]:

array([[(b'', 0), (b'', 0), (b'', 0)],
       [(b'', 0), (b'', 0), (b'', 0)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

What do we get? A 2x3 array in which each element is a 5-byte string followed by a two-byte integer. But what are the 'f0' and 'f1' values?

Implicitly assigned field names.
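The implicit names are available on the dtype itself. As a quick sketch, the comma-separated string form is equivalent to spelling out the default names by hand (the names `implicit` and `explicit` are illustrative):

```python
import numpy as np

implicit = np.dtype('S5,i2')
explicit = np.dtype([('f0', 'S5'), ('f1', 'i2')])

print(implicit.names)        # ('f0', 'f1')
print(implicit == explicit)  # True
```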

Using field names

In [9]:

simple_record['f0'] = (('a', 'b', 'c'), ('d', 'e', 'f'))
simple_record

Out[9]:

array([[(b'a', 0), (b'b', 0), (b'c', 0)],
       [(b'd', 0), (b'e', 0), (b'f', 0)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

In [10]:

simple_record.tobytes()

Out[10]:

b'a\x00\x00\x00\x00\x00\x00b\x00\x00\x00\x00\x00\x00c\x00\x00\x00\x00\x00\x00d\x00\x00\x00\x00\x00\x00e\x00\x00\x00\x00\x00\x00f\x00\x00\x00\x00\x00\x00'

Broadcasting to a field

In [11]:

simple_record['f1'] = 21
simple_record

Out[11]:

array([[(b'a', 21), (b'b', 21), (b'c', 21)],
       [(b'd', 21), (b'e', 21), (b'f', 21)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

In [12]:

simple_record.tobytes()

Out[12]:

b'a\x00\x00\x00\x00\x15\x00b\x00\x00\x00\x00\x15\x00c\x00\x00\x00\x00\x15\x00d\x00\x00\x00\x00\x15\x00e\x00\x00\x00\x00\x15\x00f\x00\x00\x00\x00\x15\x00'

Manipulating Record Fields

In [13]:

simple_record['f1'] = ((1, 2, 3), (4, 5, 6))
simple_record['f1'] = simple_record['f1'] * 10
simple_record

Out[13]:

array([[(b'a', 10), (b'b', 20), (b'c', 30)],
       [(b'd', 40), (b'e', 50), (b'f', 60)]], 
      dtype=[('f0', 'S5'), ('f1', '<i2')])

Accessing records

Indexing with a single field name returns a view:

simple_record['f1']

Indexing with a list of field names selects several fields at once. Since NumPy 1.16 this also returns a view (older versions returned a copy):

simple_record[['f0', 'f1']]
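To see what "view" means in practice, here is a small sketch (the array `rec` is rebuilt here so the example runs on its own). Writing through a single-field view changes the parent array, while an explicit .copy() does not.

```python
import numpy as np

rec = np.zeros(3, dtype=[('f0', 'S5'), ('f1', 'i2')])

view = rec['f1']              # shares memory with rec
view[:] = (10, 20, 30)
print(rec['f1'])              # the parent array sees the change
print(np.shares_memory(rec, view))

independent = rec['f1'].copy()
independent[0] = 99
print(rec['f1'][0])           # still 10
```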

Naming Fields

Provide a list of tuples where the first element of the tuple is the name and the following value(s) define the type.

In [14]:

name_grade_dt = np.dtype([('name', 'S5'), ('grades', 'f2', (2,))])
name_grade_dt

Out[14]:

dtype([('name', 'S5'), ('grades', '<f2', (2,))])

In [15]:

a = np.zeros(3, dtype=name_grade_dt)
a

Out[15]:

array([(b'', [0.0, 0.0]), (b'', [0.0, 0.0]), (b'', [0.0, 0.0])], 
      dtype=[('name', 'S5'), ('grades', '<f2', (2,))])

Filling out the array

In [16]:

a['name'] = ('Steve', 'Bob', 'Sally')
a['grades'] = [np.random.rand(2) for x in range(3)]
a

Out[16]:

array([(b'Steve', [0.91796875, 0.68359375]),
       (b'Bob', [0.310302734375, 0.33349609375]),
       (b'Sally', [0.71630859375, 0.041839599609375])], 
      dtype=[('name', 'S5'), ('grades', '<f2', (2,))])

What does it look like in memory?

In [17]:

a.tobytes()

Out[17]:

b'SteveX;x9Bob\x00\x00\xf74V5Sally\xbb9[)'

A good reminder that a floating-point number's bytes in memory look nothing like its text representation.

Compute the averages of grades...

In [18]:

np.vstack((a['name'], a['grades'].mean(axis=1)))

Out[18]:

array([[b'Steve', b'Bob', b'Sally'],
       [b'0.80078125', b'0.32177734375', b'0.379150390625']], 
      dtype='|S32')
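Note that np.vstack has to find one common type for both rows, so the averages end up as byte strings. One possible alternative, sketched here with made-up grades and a hypothetical `summary` array, is to store the result in another structured array so the averages stay numeric:

```python
import numpy as np

name_grade_dt = np.dtype([('name', 'S5'), ('grades', 'f2', (2,))])
a = np.zeros(3, dtype=name_grade_dt)
a['name'] = (b'Steve', b'Bob', b'Sally')
a['grades'] = [[0.5, 1.0], [0.25, 0.75], [0.0, 0.5]]  # made-up sample grades

# One record per student: the name plus a numeric mean of the two grades.
summary = np.zeros(a.shape, dtype=[('name', 'S5'), ('mean', 'f4')])
summary['name'] = a['name']
summary['mean'] = a['grades'].mean(axis=1)

print(summary)
print(summary['mean'].max())  # still a number, so numeric operations work
```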

That's all well and good, but why don't I just use Pandas?

You can, but this is super useful for reading arbitrary packed binary data files.
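As a sketch of what "packed binary" means here, the example below packs two records with Python's struct module and reads them back with np.frombuffer. The format string '<5sh' (a 5-byte string, then a little-endian 16-bit integer, with no padding) is an assumption chosen to match the 'S5,i2' layout used earlier, and the names `record_dt`, `payload`, and `records` are illustrative.

```python
import struct
import numpy as np

record_dt = np.dtype([('f0', 'S5'), ('f1', '<i2')])

# Pack two records back to back, the way a C program might write them.
payload = struct.pack('<5sh', b'a', 21) + struct.pack('<5sh', b'b', 42)
print(payload)

# Interpret the raw bytes as an array of 7-byte records.
records = np.frombuffer(payload, dtype=record_dt)
print(records['f0'], records['f1'])
```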

Writing out a sample file:

In [19]:

a.tofile('grades.bin')
%cat grades.bin

SteveX;x9Bob�4V5Sally�9[)

Reading in the sample file, specifying a single record type using dtype:

In [20]:

np.fromfile('grades.bin', dtype=[('name', 'S5'), ('grades', 'f2', (2,))])

Out[20]:

array([(b'Steve', [0.91796875, 0.68359375]),
       (b'Bob', [0.310302734375, 0.33349609375]),
       (b'Sally', [0.71630859375, 0.041839599609375])], 
      dtype=[('name', 'S5'), ('grades', '<f2', (2,))])
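np.fromfile trusts the dtype you give it, because the file itself stores no type information. Here is a round-trip sketch in a temporary directory (the file name and sample values are made up) that pins the byte order explicitly so the file reads the same on any machine:

```python
import os
import tempfile
import numpy as np

grade_dt = np.dtype([('name', 'S5'), ('grades', '<f2', (2,))])

original = np.zeros(2, dtype=grade_dt)
original['name'] = (b'Steve', b'Bob')
original['grades'] = [[0.5, 1.0], [0.25, 0.75]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'grades.bin')
    original.tofile(path)                 # raw records, no header
    file_size = os.path.getsize(path)     # 2 records * 9 bytes each
    restored = np.fromfile(path, dtype=grade_dt)

print(file_size)
print(restored.tobytes() == original.tobytes())
```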