Welcome to the Dask Tutorial.
Dask is a parallel computing library that scales the existing Python ecosystem. This tutorial will introduce Dask and parallel data analysis more generally.
Dask can scale down to your laptop and up to a cluster. Here, we'll use an environment you set up on your laptop to analyze medium-sized datasets in parallel locally.
Dask provides multi-core and distributed parallel execution on larger-than-memory datasets.
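As a rough illustration of chunked, parallel execution, the sketch below builds a Dask array out of many small chunks and reduces it. The array shape and chunk size are arbitrary choices for the example, and the data here easily fits in memory; the point is only that work is described per chunk and nothing runs until compute() is called.

```python
import dask.array as da

# A 4,000 x 4,000 array of ones, split into sixteen 1,000 x 1,000 chunks.
# Nothing is allocated yet: Dask only records how to build each chunk.
x = da.ones((4_000, 4_000), chunks=(1_000, 1_000))

# Build a lazy expression; still no computation has happened.
total = (x + 1).sum()

# compute() runs the per-chunk tasks (in parallel where possible)
# and combines the partial sums into a single number.
result = total.compute()
print(result)
```

Because each chunk is processed independently, only a few chunks need to be in memory at once, which is what lets the same pattern work on arrays larger than RAM.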
We can think of Dask at a high and a low level. At the low level, Dask can stand in for direct use of the threading or multiprocessing libraries in complex cases, or for other task scheduling systems like Luigi or IPython parallel.
Different users operate at different levels but it is useful to understand both.
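To give a feel for the low level, here is a minimal sketch using dask.delayed to build a small task graph by hand. The functions inc and add are hypothetical helpers made up for this example, not part of Dask.

```python
from dask import delayed


@delayed
def inc(x):
    # Hypothetical helper: add one to its input.
    return x + 1


@delayed
def add(a, b):
    # Hypothetical helper: add two inputs.
    return a + b


# Calling delayed functions records tasks instead of running them.
a = inc(1)
b = inc(2)
c = add(a, b)

# The two inc tasks do not depend on each other, so the scheduler
# is free to run them in parallel before running add.
result = c.compute()
print(result)
```

The high-level collections build graphs like this one for you; writing them by hand is mainly useful for custom workloads that do not fit an array or dataframe shape.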
The Dask use cases documentation provides a number of sample workflows where Dask should be a good fit.
You should clone this repository:
git clone http://github.com/dask/dask-tutorial
The included file environment.yml in the binder subdirectory contains a list of all of the packages needed to run this tutorial. To install them using conda, you can do:
conda env create -f binder/environment.yml
conda activate dask-tutorial
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
Do this before running this notebook.
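If you want a quick sanity check that the environment is active, something like the following sketch can help. The package list is an assumption made for illustration; binder/environment.yml remains the authoritative list.

```python
import importlib.util

# Hypothetical list for illustration only; see binder/environment.yml
# for the packages the tutorial actually needs.
EXPECTED_PACKAGES = ["dask", "distributed", "numpy", "pandas"]


def missing_packages(names):
    """Return the top-level package names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(EXPECTED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All expected packages are importable.")
```

If anything is reported missing, check that the dask-tutorial environment is the one that is currently activated.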
dask tag on Stack Overflow, for usage questions
Each section is a Jupyter notebook. There's a mixture of text, code, and exercises.
If you haven't used JupyterLab, it's similar to the Jupyter Notebook. If you haven't used the Notebook, the quick intro is:
Enter to edit a cell (like this markdown cell)
Esc to change to command mode
Shift+Enter to execute a cell and move to the next cell
The toolbar has commands for executing, converting, and creating cells.
The layout of the tutorial will be as follows:
While there is a wealth of information in the documentation, linked above, here we aim to give practical advice to aid your understanding and application of Dask in everyday situations. This means that you should not expect every feature of Dask to be covered, but the examples should be similar to the kinds of workflows that you have in mind.
Hello, world!
Each notebook will have exercises for you to solve. You'll be given a blank or partially completed cell, followed by a hidden cell with a solution. For example:
Print the text "Hello, world!".
# Your code here
The next cell has the solution. Click the ellipsis to expand the solution, and always run the solution cell, in case later sections of the notebook depend on its output.
print("Hello, world!")