Dask Arrays

Dask arrays are blocked NumPy arrays

Dask arrays coordinate many NumPy arrays, arranged as chunks within a grid. They support a large subset of the NumPy API.
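To make the idea of a chunked grid concrete, here is a minimal conceptual sketch in plain NumPy. It is not how Dask is implemented; it only illustrates that a reduction over a large array can be assembled from reductions over smaller blocks. The helper name `blocked_sum` and the block size are hypothetical choices made for this illustration.

```python
import numpy as np

def blocked_sum(arr, block):
    """Sum a 2-D array by reducing fixed-size blocks one at a time."""
    rows, cols = arr.shape
    total = 0.0
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            # Each slice is an ordinary NumPy array; edge blocks may be smaller.
            total += arr[i:i + block, j:j + block].sum()
    return total

rng = np.random.default_rng(0)
a = rng.random((10, 7))

# The blockwise result should agree with the direct NumPy reduction.
print(np.isclose(blocked_sum(a, block=4), a.sum()))
```

Because each block is reduced independently, the per-block work could in principle run in parallel, which is the property Dask exploits.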

Start Dask Client for Dashboard

Starting the Dask Client is optional. It provides a dashboard that is useful for gaining insight into the computation.

The link to the dashboard becomes visible when you create the client below. We recommend keeping it open on one side of your screen while you work in the notebook on the other. Arranging your windows this way can take some effort, but seeing both at the same time is very useful when learning.

In [1]:

from dask.distributed import Client, progress
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client

Out[1]:

Client

Client-d7513320-0ddf-11ed-9808-000d3a8f7959

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://10.1.1.64:8787/status

Cluster Info

LocalCluster

eced48ba

Dashboard: http://10.1.1.64:8787/status	Workers: 1
Total threads: 4	Total memory: 1.86 GiB
Status: running	Using processes: False

Scheduler Info

Scheduler

Scheduler-245cbcab-5c52-43bc-bcad-524a2981a5bf

Comm: inproc://10.1.1.64/6152/1	Workers: 1
Dashboard: http://10.1.1.64:8787/status	Total threads: 4
Started: Just now	Total memory: 1.86 GiB

Workers

Worker: 0

Comm: inproc://10.1.1.64/6152/4	Total threads: 4
Dashboard: http://10.1.1.64:36121/status	Memory: 1.86 GiB
Nanny: None
Local directory: /home/runner/work/dask-examples/dask-examples/dask-worker-space/worker-94bm6jfp

Create Random array

This creates a 10000x10000 array of random numbers, represented as many NumPy arrays of size 1000x1000 (or smaller where the array does not divide evenly). Here the array divides evenly, so there are 100 (10x10) NumPy arrays of size 1000x1000.

In [2]:

import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x

Out[2]:

	Array	Chunk
Bytes	762.94 MiB	7.63 MiB
Shape	(10000, 10000)	(1000, 1000)
Count	100 Tasks	100 Chunks
Type	float64	numpy.ndarray
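The chunk count and byte sizes reported above can be reproduced with a little arithmetic. The following sketch uses only the standard library; the helper `chunk_sizes` is a hypothetical name for this illustration, and the 8 bytes per element is the size of a float64.

```python
import math

def chunk_sizes(length, chunk):
    """Return the chunk lengths along one axis; the last chunk may be shorter."""
    full, rest = divmod(length, chunk)
    return (chunk,) * full + ((rest,) if rest else ())

shape, chunks = (10000, 10000), (1000, 1000)
itemsize = 8  # bytes per float64 element

per_axis = [chunk_sizes(n, c) for n, c in zip(shape, chunks)]
n_chunks = math.prod(len(sizes) for sizes in per_axis)
array_mib = math.prod(shape) * itemsize / 2**20
chunk_mib = math.prod(chunks) * itemsize / 2**20

print(n_chunks)             # 100
print(round(array_mib, 2))  # 762.94
print(round(chunk_mib, 2))  # 7.63
print(chunk_sizes(10, 4))   # (4, 4, 2): an axis that does not divide evenly
```

The last line shows the "or smaller" case: an axis of length 10 split into chunks of 4 ends with a shorter chunk of 2.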

Use NumPy syntax as usual

In [3]:

y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z

Out[3]:

	Array	Chunk
Bytes	39.06 kiB	3.91 kiB
Shape	(5000,)	(500,)
Count	430 Tasks	10 Chunks
Type	float64	numpy.ndarray

Call .compute() when you want your result as a NumPy array.

If you started Client() above then you may want to watch the status page during computation.

In [4]:

z.compute()

Out[4]:

array([1.00226063, 1.01066798, 1.00353892, ..., 1.00020978, 1.00972641,
       0.99609573])
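A quick way to convince yourself that the lazy expression and `.compute()` behave as described is to repeat the same steps on a tiny array and compare against plain NumPy. This is a minimal sketch, assuming `dask` and `numpy` are installed; the names `a`, `x_small`, and `z_small` and the 8x8 shape are illustrative choices, not part of the example above.

```python
import numpy as np
import dask.array as da

a = np.arange(64, dtype="float64").reshape(8, 8)
x_small = da.from_array(a, chunks=(4, 4))

# Same expression as above, scaled down: this only builds a task graph.
z_small = (x_small + x_small.T)[::2, 4:].mean(axis=1)

result = z_small.compute()  # runs the graph and returns a NumPy array
expected = (a + a.T)[::2, 4:].mean(axis=1)

print(type(result).__name__)
print(np.allclose(result, expected))
```

No work is done on the data until `.compute()` is called; before that, `z_small` is only a description of the computation.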

Persist data in memory

If your dataset fits in available RAM, you can persist it in memory.

Later computations on the persisted data can then be much faster, because the stored chunks are reused instead of being recomputed.

In [5]:

y = y.persist()

In [6]:

%time y[0, 0].compute()

CPU times: user 1.53 s, sys: 338 ms, total: 1.86 s
Wall time: 1.04 s

Out[6]:

0.6048766839597692

In [7]:

%time y.sum().compute()

CPU times: user 399 ms, sys: 53.2 ms, total: 452 ms
Wall time: 298 ms

Out[7]:

99992368.08411336
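The same pattern can be tried on a smaller, deterministic array where the results are easy to check by hand. This is a minimal sketch, assuming `dask` and `numpy` are installed and no distributed client is required; the array `a` and the names `x_small` and `y_small` are illustrative. Since the input is known, the sum of `a + a.T` should equal twice the sum of `a`.

```python
import numpy as np
import dask.array as da

a = np.random.default_rng(0).random((2000, 2000))
x_small = da.from_array(a, chunks=(500, 500))

# persist() computes the chunks of y_small and keeps them in memory.
y_small = (x_small + x_small.T).persist()

# Both follow-up computations start from the stored chunks.
corner = y_small[0, 0].compute()
total = y_small.sum().compute()

print(np.isclose(corner, 2 * a[0, 0]))
print(np.isclose(total, 2 * a.sum()))
```

With a distributed `Client`, `persist()` returns immediately while the chunks are computed in the background, so the first computation after it may still include some of that waiting time.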