Docker for Improved Reproducibility of Scientific Python Analyses

Matt McCormick, Kitware, Inc

SciPy 2015

2015-07-10

The Problem

Even though computers are often considered deterministic, computational software is a rapidly changing landscape. Libraries are constantly adding new features and fixing issues.

Image source: http://www.michaelogawa.com/research/storylines/

Even libraries with the strictest backwards-compatibility policies can change in significant ways.

Image source: http://www.bonkersworld.net/backwards-compatibility/

A reproducible computational environment has a sufficiently consistent state for the computational task at hand.

For example, this can consist of

  • a CPython interpreter with a specific version built with
    • a specific version of a given compiler
    • a specific version of a libc implementation
  • package dependencies with specific versions
  • C extension modules built against specific libraries
  • other libraries and executables available with a specific version and configuration options
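
As a minimal sketch of what recording such a state could look like, the snippet below gathers the interpreter version, the compiler it was built with, the libc version, and the installed package versions using only the standard library. The function name environment_fingerprint and the dictionary keys are illustrative choices, not part of the talk, and the result covers only part of the list above (for example, it says nothing about how C extension modules were built).

```python
import json
import platform
import sys
from importlib import metadata


def environment_fingerprint():
    """Collect interpreter, compiler, libc, and package versions."""
    # libc_ver() returns empty strings on platforms where it cannot tell.
    libc_name, libc_version = platform.libc_ver()
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with incomplete metadata
            packages[name] = dist.version
    return {
        "implementation": platform.python_implementation(),
        "python_version": platform.python_version(),
        "compiler": platform.python_compiler(),
        "libc": f"{libc_name} {libc_version}".strip(),
        "platform": sys.platform,
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    print(json.dumps(environment_fingerprint(), indent=2))
```

Comparing two such reports from different machines is a quick way to see where their environments diverge.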

Close But Not Good Enough

Source code

Does not include:

  • Python interpreter
  • How it is configured
  • Package dependencies
  • Run-time environment
  • How to run it

Package managers and distributions

  • There is not a consensus on the package manager
  • Packages become unsupported over time
  • What to do if a required library is not packaged?

Virtual machines (VMs)

  • Inefficient utilization of computational resources

Image source: http://time-az.com/images/2014/02/20140203carjam.jpg

Enter Linux Containers

Linux container systems, like Docker, are a new type of tool to easily build, ship, and run reproducible, binary applications.

It is "good enough" for a reproducible computational environment.

In this talk, we will introduce Docker from the perspective of a scientific research software engineer. We will

  • Generate an understanding of what Docker is by comparing it to existing technologies.
  • Give an introduction to basic Docker concepts.
  • Describe how Docker fits into the scientific Python analysis workflow.

Understanding Docker

Not just this cute whale thing

Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere.

Docker is a combination of a:

  1. Sandboxed chroot
  2. Copy on write filesystem
  3. Distributed VCS for binaries

Sandboxed chroot

Docker works with images that consume minimal disk space and are versioned, archivable, and shareable. Executing applications in these images does not require dedicated resources and is high performance.

It works with containers as opposed to virtual machines (VMs).

In [2]:
%time !docker run --rm busybox sh -c 'echo "Hello Docker World!"'
Hello Docker World!
CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 1.23 s
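
For a point of comparison, the same kind of wall-clock measurement can be made from Python. This is a rough sketch: the helper timed_run is a hypothetical name, and the baseline it measures is a plain interpreter start-up rather than a container. The Docker call is shown only as a comment because it needs a running Docker daemon.

```python
import subprocess
import sys
import time


def timed_run(argv):
    """Run a command and return (stdout text, wall-clock seconds)."""
    start = time.perf_counter()
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout, time.perf_counter() - start


# Baseline: start a plain Python process, no container involved.
output, seconds = timed_run([sys.executable, "-c", "print('Hello World!')"])
print(output.strip(), f"({seconds:.2f} s)")

# Container equivalent of the cell above (requires a running Docker daemon):
# timed_run(["docker", "run", "--rm", "busybox",
#            "sh", "-c", 'echo "Hello Docker World!"'])
```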

Copy on Write Filesystem

Union file systems, or UnionFS, are file systems that operate by creating layers, making them very lightweight and fast while saving disk space.

Docker can make use of several storage drivers, not all of them true union file systems, including:

  • AUFS
  • btrfs
  • vfs
  • DeviceMapper
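
The layering idea can be illustrated with a toy model built on collections.ChainMap from the Python standard library. This is an analogy only, not how any of the drivers above is implemented: the paths and contents are made up, and a real driver copies a file up into the writable layer before modifying it, whereas this sketch simply shadows the lower entry.

```python
from collections import ChainMap

# Read-only image layers, lowest first. Paths and contents are made up.
base_layer = {"/bin/sh": "busybox shell", "/etc/hostname": "base"}
python_layer = {"/usr/bin/python": "interpreter"}


def start_container(*image_layers):
    """Stack a fresh, empty writable layer on top of shared read-only layers."""
    # ChainMap searches its maps left to right, so the topmost layer goes first.
    return ChainMap({}, *reversed(image_layers))


container_a = start_container(base_layer, python_layer)
container_b = start_container(base_layer, python_layer)

# Writes go only to the first map, the container's own layer.
container_a["/etc/hostname"] = "container-a"

print(container_a["/etc/hostname"])  # container-a
print(container_b["/etc/hostname"])  # base
print(base_layer["/etc/hostname"])   # base
print(container_a.maps[0])           # {'/etc/hostname': 'container-a'}
```

Both containers share the same lower layers, so only the changed entry takes up extra space.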

Distributed VCS for binaries

Docker is like Git for binaries

  • Docker images are identified with a hex string or tags
  • Interface is docker <subcommand>
  • docker push, docker pull, docker tag
  • docker export will create an archivable tarball of a container's filesystem; docker save archives an image.
  • DockerHub is like GitHub

Installing

Here's what you need:

  • Linux kernel with control groups and namespaces
  • Support for a layered filesystem (like AUFS)
  • Docker Daemon / Server (written in Go)

Docker Concepts

Image

A read-only file system layer

In [3]:
!docker images
REPOSITORY             TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
busybox                buildroot-2014.02   8c2e06607696        11 weeks ago        2.433 MB
busybox                latest              8c2e06607696        11 weeks ago        2.433 MB
odise/busybox-python   latest              649988b8bf0e        4 months ago        20.26 MB
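
In the listing above, two tags share one image ID, much like two Git branches pointing at the same commit. A small sketch that makes this explicit by grouping the rows by ID follows. The function name tags_by_image_id is hypothetical, and splitting columns on runs of two or more spaces is a heuristic that happens to fit this output, not a documented format.

```python
import re
from collections import defaultdict

LISTING = """\
REPOSITORY             TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
busybox                buildroot-2014.02   8c2e06607696        11 weeks ago        2.433 MB
busybox                latest              8c2e06607696        11 weeks ago        2.433 MB
odise/busybox-python   latest              649988b8bf0e        4 months ago        20.26 MB
"""


def tags_by_image_id(listing):
    """Group repository:tag names by the image ID they point to."""
    header, *rows = listing.strip().splitlines()
    # Heuristic: columns are separated by at least two spaces.
    columns = re.split(r"\s{2,}", header.strip())
    id_index = columns.index("IMAGE ID")
    groups = defaultdict(list)
    for row in rows:
        fields = re.split(r"\s{2,}", row.strip())
        repository, tag = fields[0], fields[1]
        groups[fields[id_index]].append(f"{repository}:{tag}")
    return dict(groups)


for image_id, names in tags_by_image_id(LISTING).items():
    print(image_id, names)
```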

Container

A modifiable image with processes running in memory, or an exited container with a modified filesystem

In [4]:
!docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
In [5]:
!docker run -d busybox sh -c 'sleep 3'
3a6bf9d61548ae36bdc0bdb5a87aec17a8056517709c61e2df989aa0a37b7f32
In [7]:
!docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
In [8]:
!docker ps -a
CONTAINER ID        IMAGE                       COMMAND             CREATED             STATUS                      PORTS               NAMES
3a6bf9d61548        busybox:buildroot-2014.02   "sh -c 'sleep 3'"   14 seconds ago      Exited (0) 11 seconds ago                       goofy_almeida       
a3761241bf97        busybox:buildroot-2014.02   "sh -c 'sleep 3'"   53 minutes ago      Exited (0) 53 minutes ago                       reverent_stallman   

Volume

A mounted directory that is not tracked as a filesystem layer

  • Data volumes are initialized when a container is created
  • Volumes can be shared and reused between containers
  • Changes to a data volume are made directly
  • Changes to a data volume will not be included when you update an image
  • Volumes persist until no containers use them
  • Host directories can also be mounted as data volumes
In [10]:
!ls $PWD/images/
BackwardsCompatibility.png    DockerVM.jpg
BuildInstructions1.png	      Eww.jpg
BuildInstructions2.png	      FilesystemsGeneric.png
BuildInstructions3.png	      itkka.png
BuildInstructions4.png	      Jenkins.png
CarJam.jpg		      Jupyter.png
Chroot.png		      LayerCake.jpg
ConfusedCat.jpg		      Liar.png
Debian.png		      MakerwareScreenshot.png
DockerDeploy.jpg	      MakerwareVTK.png
DockerFilesystemsBusybox.png  MakerwareWebsite.png
DockerFilesystems.svg	      MasonJar.jpg
DockerHub.png		      ModulesModulesModules.png
DockerLogo.png		      PythonStoryline.svg
In [11]:
!docker run --rm --volume $PWD/images:/images busybox \
    sh -c 'ls /images'
BackwardsCompatibility.png
BuildInstructions1.png
BuildInstructions2.png
BuildInstructions3.png
BuildInstructions4.png
CarJam.jpg
Chroot.png
ConfusedCat.jpg
Debian.png
DockerDeploy.jpg
DockerFilesystems.svg
DockerFilesystemsBusybox.png
DockerHub.png
DockerLogo.png
DockerVM.jpg
Eww.jpg
FilesystemsGeneric.png
Jenkins.png
Jupyter.png
LayerCake.jpg
Liar.png
MakerwareScreenshot.png
MakerwareVTK.png
MakerwareWebsite.png
MasonJar.jpg
ModulesModulesModules.png
PythonStoryline.svg
itkka.png
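
When a mount like the one above is driven from a Python script, the command line can be assembled programmatically. The sketch below only builds and prints the argument list. The helper docker_run_argv is a hypothetical name, it uses only the run, --rm, and --volume options already shown, and it resolves the host directory to an absolute path on the assumption that a host mount should be given as one.

```python
import shlex
from pathlib import Path


def docker_run_argv(image, command, volumes=None):
    """Build the argument list for a throwaway container with host mounts.

    volumes maps host directories to mount points inside the container.
    """
    argv = ["docker", "run", "--rm"]
    for host_dir, container_dir in (volumes or {}).items():
        # Resolve to an absolute host path before handing it to Docker.
        argv += ["--volume", f"{Path(host_dir).resolve()}:{container_dir}"]
    return argv + [image, "sh", "-c", command]


argv = docker_run_argv("busybox", "ls /images", {"images": "/images"})
print(shlex.join(argv))
# Passing argv to subprocess.run(argv) would start the container;
# that step needs a running Docker daemon and is left out here.
```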

Scientific Python with Docker

Graphical Applications and Docker

A portable Docker image will only assume standard CPU/memory/disk/network resources are available. If local USB devices or video card devices are used, the image will not be runnable everywhere.

Choosing a base image

Recap and Next Steps

Docker is

  • Sandboxed chroot +
  • Incremental, copy on write filesystem +
  • Distributed VCS for binaries +

Concepts

  • Image: A read-only file system layer
  • Container: A writable image with processes running in memory, or an exited container with a modified filesystem
  • Volume: A mounted directory that is not tracked as a filesystem layer

Scientific Python and Docker

  • Not for graphical applications, especially OpenGL
  • Reproducible computational environment for IPython notebook
  • Use with Linux-based packaging system of your choice

Learn more!

Docker vs. LXC

  • LXC is a set of tools and an API for interacting with Linux kernel namespaces, cgroups, etc.
  • LXC used to be the default execution environment for Docker
  • Docker provides LXC functionality, plus:
    • Portable deployment across machines
    • Application-centric
    • Automatic builds
    • Versioning
    • Component re-use
    • Sharing
    • Tool ecosystem

Docker vs. Rocket

  • Rocket is a container system like Docker developed by CoreOS
  • Rocket is not yet fully operational
  • Rocket does not use a daemon/client system