%autosave 10
Autosaving every 10 seconds
import numpy as np
import biggus

# Wrap an ordinary in-memory NumPy array in a lazy biggus adapter.
# np.int32: the original bare `int32` is undefined (NameError).
np_array = np.empty((700, 200), dtype=np.int32)
arr = biggus.NumpyArrayAdapter(np_array)
print(arr)
<NumpyArrayAdapter shape=(700, 200) dtype=dtype('int32')>
# np.concatenate equivalent: join two lazy arrays along axis 0,
# still without touching the underlying data.
bigger_arr = biggus.LinearMosaic([arr, arr], axis=0)
print(bigger_arr)
<LinearMosaic shape=(1400, 200) dtype=dtype('int32')>
# no memory copying: a mosaic of 40 views stays lazy,
# so building a (28000, 200) array costs nothing up front.
print(biggus.LinearMosaic([arr, arr] * 20, axis=0))
<LinearMosaic shape=(28000, 200) dtype=dtype('int32')>
# new dimension
# ArrayStack joins arrays along a *new* leading axis (cf. np.stack);
# note it takes an object ndarray of biggus arrays, not a plain list.
biggus.ArrayStack(np.array([arr, arr]))
<ArrayStack shape=(2, 700, 200) dtype=dtype('int32')>
import h5py

# mode='r': we only read; h5py's default mode may create/append.
hdf_dataset = h5py.File('data.hdf5', mode='r')['arange']
# this is lazy; no data is loaded
arr_hdf = biggus.NumpyArrayAdapter(hdf_dataset)
# NOTE(review): the original re-printed `arr`; printing the newly
# created HDF5-backed adapter matches the narrative — confirm.
print(arr_hdf)
# LinearMosaic can combine HDF5-backed and regular in-memory arrays.
bigger_arr = biggus.LinearMosaic([bigger_arr, arr_hdf], axis=0)
print bigger_arr
The ndarray() method realizes arrays and brings them into memory.
type(bigger_arr.ndarray()), bigger_arr.ndarray().shape
(numpy.ndarray, (1400, 200))
You can do basic processing on massive arrays in chunks.
# These operations don't run when you do this —
# aggregations are built lazily; nothing is computed yet.
mean = biggus.mean(bigger_arr, axis=0)
std = biggus.std(bigger_arr, axis=0)
print(mean)
<_Aggregation shape=(200,) dtype=dtype('float64')>
# _now_ we realize it: mean.ndarray() triggers the calculation.
# Done in chunks, so the data is never all in memory at once.
print(np.all(mean.ndarray() == bigger_arr.ndarray().mean(axis=0)))
True
Really, though, since you go chunk-by-chunk, you want to perform many operations in the same pass.
# This realizes both results in a single pass: biggus chunks the
# array into sub-arrays and aggregates mean and std together.
mean_np, std_np = biggus.ndarrays([mean, std])
print(type(mean_np))
<type 'numpy.ndarray'>
import h5py

# Stream the lazy result straight into an HDF5 dataset, chunk by
# chunk. The statements below must be indented inside the with-block
# (the original had them at top level — a syntax error).
with h5py.File('result.hdf5', mode='w') as f_out:
    df = f_out.create_dataset('my_result', mean.shape, mean.dtype)
    biggus.save([mean], [df])
!!AI: see the accompanying video for the rest of the demo.