We've seen data types before, as the simple data type of a NumPy array:
import numpy as np
np.ones((3,4), dtype=np.int32)
array([[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]], dtype=int32)
So, what's there to talk about?
What simple data types are there and how can we specify them?
Well ... there are about half a dozen ways.
The one that clicks for me is the "array interface": call np.dtype() with a string whose first character is the type and whose remaining characters are the size in bytes. For example, np.dtype('i4') is a 32-bit signed integer. Here are the valid characters:
'b' boolean (written as 'b1'; a bare 'b' means a signed byte, and '?' is the usual boolean code)
'i' (signed) integer
'u' unsigned integer
'f' floating-point
'c' complex-floating point
'O' (Python) objects
'S', 'a' (byte-)string ('a' is a legacy alias for 'S', deprecated in NumPy 2.0)
'U' Unicode
'V' raw data (void)
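As a quick check of the list above, here is a small sketch that builds a few dtypes from type strings and asks NumPy what it made of them. The particular codes and the variable names are my own picks for illustration.

```python
import numpy as np

# Each string is a type character followed by a size.
codes = ['i2', 'u4', 'f8', 'c16', 'U3', 'V6']

# Map each code to (kind, itemsize) as NumPy reports them.
summary = {code: (np.dtype(code).kind, np.dtype(code).itemsize) for code in codes}

for code, (kind, size) in summary.items():
    print(f"{code!r}: kind={kind!r}, itemsize={size} bytes")
```

One wrinkle worth noticing: for 'U' the number counts characters rather than bytes, so 'U3' reports an itemsize of 12.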
np.ones((2,3), dtype=np.dtype('i2'))
array([[1, 1, 1], [1, 1, 1]], dtype=int16)
np.ones((2,3), dtype=np.dtype('i4'))
array([[1, 1, 1], [1, 1, 1]], dtype=int32)
So, these look the same, but they are stored differently in memory.
i2 uses two bytes to store each integer in the array below:
np.ones((2,3), dtype=np.dtype('i2')).tobytes()
b'\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00'
i4 uses four bytes to store each integer in the array below:
np.ones((2,3), dtype=np.dtype('i4')).tobytes()
b'\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00'
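If counting bytes in a dump gets tedious, the array can report the sizes itself. This is a minimal sketch using the itemsize and nbytes attributes; the names small and large are mine.

```python
import numpy as np

small = np.ones((2, 3), dtype='i2')
large = np.ones((2, 3), dtype='i4')

# Same shape and values, different storage per element.
print(small.itemsize, small.nbytes)   # bytes per element, bytes in total
print(large.itemsize, large.nbytes)

# The length of the raw buffer matches nbytes.
small_len = len(small.tobytes())
large_len = len(large.tobytes())
print(small_len, large_len)
```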
One can even specify byte order by starting the string with > (big-endian) or < (little-endian).
np.ones((2,3), dtype=np.dtype('>i2'))
array([[1, 1, 1], [1, 1, 1]], dtype=int16)
np.ones((2,3), dtype=np.dtype('>i2')).tobytes()
b'\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01\x00\x01'
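To see why byte order matters, it helps to compare converting an array with merely reinterpreting its bytes. The sketch below is my own illustration: astype() rewrites the bytes and keeps the values, while view() keeps the bytes and so changes the values.

```python
import numpy as np

big = np.array([1, 2, 3], dtype='>i2')

# Convert to little-endian: values are kept, bytes are rewritten.
little = big.astype('<i2')

# Reinterpret the big-endian bytes as little-endian without converting:
# the bytes are kept, so the values change.
misread = big.view('<i2')

print(big.tobytes())
print(little.tobytes())
print(misread)
```

Reading a file written on a machine with the other byte order, without saying so in the dtype, gives you the misread case.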
... take a look at the NumPy docs on data types ...
A data type object (an instance of numpy.dtype class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted.
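It describes the following aspects of the data: the type of the data (integer, float, Python object, etc.), its size in bytes, its byte order, and, for structured data types, the names, data types, and memory offsets of the fields.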
A record array is essentially an array of C structures: an array of composite data types where one can use Python dictionary-style notation to interact with the fields of the array elements.
So, what's that mean?
Let's look.
simple_record_dt = np.dtype('S5,i2')
simple_record = np.zeros((2,3), dtype=simple_record_dt)
simple_record
array([[(b'', 0), (b'', 0), (b'', 0)], [(b'', 0), (b'', 0), (b'', 0)]], dtype=[('f0', 'S5'), ('f1', '<i2')])
What do we get? A 2x3 array in which each element is a 5-byte string followed by a two-byte integer. But what are the 'f0' and 'f1' values?
Implicitly assigned field names.
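The dtype object itself can tell you the field names and where each field sits inside a record. Here is a minimal sketch; record_dt is my own name, and I spell the string type as 'S5'.

```python
import numpy as np

record_dt = np.dtype('S5,i2')

print(record_dt.names)      # field names in order
print(record_dt.itemsize)   # bytes per record

# dtype.fields maps each name to (dtype, byte offset within the record).
offsets = {}
for name in record_dt.names:
    field_dtype, offset = record_dt.fields[name][:2]
    offsets[name] = offset
    print(name, field_dtype, offset)
```

The integer starts at byte 5, right after the string, and each record takes 7 bytes because no padding is added by default.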
simple_record['f0'] = (('a', 'b', 'c'), ('d', 'e', 'f'))
simple_record
array([[(b'a', 0), (b'b', 0), (b'c', 0)], [(b'd', 0), (b'e', 0), (b'f', 0)]], dtype=[('f0', 'S5'), ('f1', '<i2')])
simple_record.tobytes()
b'a\x00\x00\x00\x00\x00\x00b\x00\x00\x00\x00\x00\x00c\x00\x00\x00\x00\x00\x00d\x00\x00\x00\x00\x00\x00e\x00\x00\x00\x00\x00\x00f\x00\x00\x00\x00\x00\x00'
simple_record['f1'] = 21
simple_record
array([[(b'a', 21), (b'b', 21), (b'c', 21)], [(b'd', 21), (b'e', 21), (b'f', 21)]], dtype=[('f0', 'S5'), ('f1', '<i2')])
simple_record.tobytes()
b'a\x00\x00\x00\x00\x15\x00b\x00\x00\x00\x00\x15\x00c\x00\x00\x00\x00\x15\x00d\x00\x00\x00\x00\x15\x00e\x00\x00\x00\x00\x15\x00f\x00\x00\x00\x00\x15\x00'
simple_record['f1'] = ((1, 2, 3), (4, 5, 6))
simple_record['f1'] = simple_record['f1'] * 10
simple_record
array([[(b'a', 10), (b'b', 20), (b'c', 30)], [(b'd', 40), (b'e', 50), (b'f', 60)]], dtype=[('f0', 'S5'), ('f1', '<i2')])
simple_record['f1']
simple_record[['f0', 'f1']]
To name the fields yourself, provide a list of tuples where the first element of each tuple is the field name and the following value(s) define its type and, optionally, its shape.
name_grade_dt = np.dtype([('name', 'S5'), ('grades', 'f2', (2,))])
name_grade_dt
dtype([('name', 'S5'), ('grades', '<f2', (2,))])
a = np.zeros((3), dtype=name_grade_dt)
a
array([(b'', [0.0, 0.0]), (b'', [0.0, 0.0]), (b'', [0.0, 0.0])], dtype=[('name', 'S5'), ('grades', '<f2', (2,))])
a['name'] = ('Steve', 'Bob', 'Sally')
a['grades'] = [np.random.rand(2) for x in range(3)]
a
array([(b'Steve', [0.91796875, 0.68359375]), (b'Bob', [0.310302734375, 0.33349609375]), (b'Sally', [0.71630859375, 0.041839599609375])], dtype=[('name', 'S5'), ('grades', '<f2', (2,))])
What does it look like in memory?
a.tobytes()
b'SteveX;x9Bob\x00\x00\xf74V5Sally\xbb9[)'
A good reminder of how different a floating-point number looks in memory from its text representation.
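To convince yourself that those odd-looking characters really are the grades, you can decode them by hand. This sketch takes the four bytes that follow b'Steve' in the dump above and reads them back as two little-endian half-precision floats; the variable names are mine.

```python
import numpy as np

# The four bytes that follow b'Steve' in the dump above.
raw = b'X;x9'

# Interpret them as two little-endian half-precision floats.
grades = np.frombuffer(raw, dtype='<f2')
print(grades)

# The same bytes as unsigned 16-bit integers, shown in hex.
as_hex = [hex(int(v)) for v in np.frombuffer(raw, dtype='<u2')]
print(as_hex)
```

The characters 'X' and ';' are just bytes 0x58 and 0x3b, which together form the 16-bit pattern for Steve's first grade.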
Compute the averages of grades...
np.vstack((a['name'], a['grades'].mean(axis=1)))
array([[b'Steve', b'Bob', b'Sally'], [b'0.80078125', b'0.32177734375', b'0.379150390625']], dtype='|S32')
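Notice that np.vstack coerced the averages to byte strings so that both rows could share one dtype. One possible alternative, sketched below with made-up grades and my own names (students, summary_dt), is to put the names and the averages into a second record array so each column keeps its type.

```python
import numpy as np

grade_dt = np.dtype([('name', 'S5'), ('grades', 'f2', (2,))])
students = np.zeros(3, dtype=grade_dt)
students['name'] = ('Steve', 'Bob', 'Sally')
students['grades'] = [[0.9, 0.7], [0.3, 0.35], [0.7, 0.05]]  # made-up values

# Keep names as bytes and averages as floats in one record array.
summary_dt = np.dtype([('name', 'S5'), ('mean', 'f4')])
summary = np.zeros(len(students), dtype=summary_dt)
summary['name'] = students['name']
summary['mean'] = students['grades'].mean(axis=1)
print(summary)
```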
Where record arrays are super useful is reading arbitrary packed binary data files.
Writing out a sample file:
a.tofile('grades.bin')
%cat grades.bin
SteveX;x9Bob �4V5Sally�9[)
Reading in the sample file, specifying a single record type using dtype:
np.fromfile('grades.bin', dtype=[('name', 'S5'), ('grades', 'f2', (2,))])
array([(b'Steve', [0.91796875, 0.68359375]), (b'Bob', [0.310302734375, 0.33349609375]), (b'Sally', [0.71630859375, 0.041839599609375])], dtype=[('name', 'S5'), ('grades', '<f2', (2,))])
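Here is the whole round trip as one self-contained sketch, written to a temporary directory so it leaves nothing behind. The grades are made up and the file name is only a scratch name. The thing to notice is that the file holds nothing but raw records, so its size is exactly the record size times the number of records, and the reader has to supply the dtype.

```python
import os
import tempfile

import numpy as np

grade_dt = np.dtype([('name', 'S5'), ('grades', '<f2', (2,))])
original = np.zeros(3, dtype=grade_dt)
original['name'] = ('Steve', 'Bob', 'Sally')
original['grades'] = [[0.5, 0.25], [0.75, 1.0], [0.125, 0.0]]  # made-up values

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'grades.bin')  # scratch file for this sketch
    original.tofile(path)

    # The file holds only raw records: no header, no shape, no dtype.
    file_size = os.path.getsize(path)
    restored = np.fromfile(path, dtype=grade_dt)

print(file_size, grade_dt.itemsize * len(original))
print(restored)
```

Because nothing about the layout is stored in the file, reading it back with a different dtype would silently produce different values.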