Material for a University of Illinois course offered by the Physics Department. This content is maintained on GitHub and is distributed under a BSD3 license.
This course assumes a basic familiarity with the core python language. If you are rusty or still learning, I recommend the free ebook A Whirlwind Tour of Python, which is "a fast-paced introduction to essential components of the Python language for researchers and developers who are already familiar with programming in another language".
If you are currently using python 2.x and reluctant to move to python 3, read this and this.
No previous experience with git or github is necessary for this course (but they are useful research tools so worth learning - here is a good starting point). If you are finding the git learning curve to be steep, you are not alone.
Clone the course material from github with the following command, which will create a subdirectory called syllabus:
git clone --recurse-submodules https://github.com/illinois-mla/syllabus.git
This should ask you for your github username and password (but you can streamline future github access using ssh).
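One way to streamline that access is to switch your clone from https to ssh. This is a sketch, assuming you have already added an ssh public key to your GitHub account:

```shell
# Switch the existing clone from https to ssh so github stops asking
# for a password (assumes an ssh key is registered with your account):
cd syllabus
git remote set-url origin git@github.com:illinois-mla/syllabus.git
git remote -v   # both fetch and push should now show the ssh URL
```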
We will use the conda command to create a standard python environment for this course. These instructions assume that you have already satisfied the prerequisites.
Create a new environment by entering (or pasting) the following command at a shell prompt in the top level directory of the course syllabus repo.
conda env create -f DAMLA-env/environment.yml
Activate the new environment using (this should add "(DAMLA)" to your command prompt, as a reminder of your current environment):
source activate DAMLA
Add some additional packages from other sources (details here and here):
conda install -c conda-forge keras libiconv jupyter_contrib_nbextensions
conda install -c astropy emcee astroml
conda install pytorch-cpu -c pytorch
You might see something like
You are using pip version 10.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
at which point you should follow its advice and run
pip install --upgrade pip
Enable a jupyter notebook extension we will use for in-class exercises:
jupyter nbextension enable exercise2/main
Activate the course environment, if necessary (check your command prompt, but it doesn't do any harm to reactivate the current environment):
source activate DAMLA
Install the course code and data using:
cd syllabus
pip install .
To launch the notebook server at any time, you can now use:
[[syllabus]]
source activate DAMLA
cd notebooks
jupyter notebook
Note that [[syllabus]] is a reminder that you must be in your syllabus directory before typing the following commands. If you are unsure about this, refer to the pwd and cd commands.
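For example (the path in the second command is a placeholder; substitute wherever you actually cloned the repo):

```shell
pwd                    # prints your current working directory;
                       # it should end in /syllabus
cd /path/to/syllabus   # placeholder path: replace with your clone location
```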
Windows users: Wherever you see source activate DAMLA, use activate DAMLA instead. Details here.
This should have opened a jupyter notebook tab or window in your browser. If this is your first time, validate that you can open and view a notebook: do File->Open and click on Contents.ipynb. Jupyter notebooks (formerly called IPython notebooks) have the file extension .ipynb.
(For git experts: you will normally be working on the master branch to simplify the workflow. This means that your local work must be discarded or saved to another branch each time you update, using the instructions below).
In case something goes wrong with your installation and you want to start again, use:
conda remove --name DAMLA --all
You will need to shut down any jupyter sessions using the old environment first.
You can skip this section if you are installing for the first time, but remember these instructions for later.
The first step is to "factory reset" your installation before getting the updates. The simplest method is to throw away any changes you have made using:
[[syllabus]]
git checkout master
git reset --hard
Alternatively, you can keep a permanent record of your changes in a git branch with a name of your choice, for example "08-Jan-2018":
[[syllabus]]
git checkout -b "08-Jan-2018"
git commit -a -m "Save work in progress"
git checkout master
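If you later want to revisit work saved on such a branch, a minimal sketch (using the example branch name above):

```shell
git branch                   # list local branches: master plus any saved branches
git checkout "08-Jan-2018"   # switch to the saved branch to inspect your old work
git checkout master          # return to the course material
```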
The second step is to download the changes from github:
[[syllabus]]
git pull
If this command reports Already up-to-date.
then there are no updates to download.
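To confirm what state you are in after pulling, two quick checks (these are standard git commands, not specific to this course):

```shell
git log -1 --oneline   # shows the commit you are now on
git status --short     # no output means no local modifications remain
```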
The final step is to update your local python environment:
[[syllabus]]
source activate DAMLA
pip install . --upgrade
If a problem keeps you from easily installing the Conda environment on your local machine, you can use the environment on the provided Docker image instead. The Docker image provides the compute environment, but is not meant to be used as an area to store your work, so you should still clone the repo to your local machine.
To install Docker Community Edition on your Linux, Mac, or Windows machine follow the instructions in the Docker docs.
To use the Docker image first pull it down from Docker Hub
docker pull illinoismla/damla-env
If you want anything you do in the container to safely persist, you should bind-mount your local machine's file system into the container as a volume. So run the image in a container while exposing the container's internal port 8888 with the -p flag (this is necessary for Jupyter to be able to talk to localhost) and bind-mounting the directory of the course Git repo on your local machine:
docker run --rm -it -v <path to the repo goes here>:/home/physicist/data -p 8888:8888 illinoismla/damla-env:latest
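As a concrete sketch of filling in the placeholder: if you launch the container from inside your local syllabus directory, the shell variable $PWD expands to that directory's absolute path:

```shell
# Run from inside your local syllabus clone; "$PWD" expands to its
# absolute path, which fills the <path to the repo goes here> placeholder:
cd syllabus
docker run --rm -it -v "$PWD":/home/physicist/data -p 8888:8888 illinoismla/damla-env:latest
```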
Once inside the container, note that the DAMLA Conda environment is already activated and should be shown in the terminal prompt
(DAMLA) root@<hostname>:~/data#
though you can also verify this by listing the conda environments
conda env list
# conda environments:
#
base                     /root/miniconda
DAMLA                 *  /root/miniconda/envs/DAMLA
To verify that things are working as expected, launch a Jupyter notebook server and test basic imports (the "Hello World" of data analysis).
Inside the running Docker container with the DAMLA
environment activated navigate to /home/physicist/data
(this is an arbitrary path that we chose when running the Docker image — you can make this whatever you want). If you ls
you will see that you are actually inside your Git repo on your local machine.
Then launch the Jupyter server
jupyter notebook
which will cause a login URL with a token to be printed to your terminal
http://localhost:8888/?token=<token>
Click on the URL, or copy and paste it into the web browser on your local machine. This should then display the Jupyter server in your browser.
Create a new Jupyter notebook (select from the "New" drop down menu on the upper right) and then when the notebook opens import NumPy and run a simple test
import numpy as np
np.arange(0, 10, 0.5)
array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])
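Beyond eyeballing the printed array, a slightly stronger check is to assert its properties in the same notebook cell:

```python
import numpy as np

# np.arange(0, 10, 0.5) yields 20 values: 0, 0.5, ..., 9.5
x = np.arange(0, 10, 0.5)
assert x.shape == (20,)
assert x[0] == 0.0 and x[-1] == 9.5
print("basic numpy check passed")
```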
If you now save the notebook and, in a different terminal window on your local machine, navigate to the Git repo directory, you will see the notebook there on your local machine. If you shut down the Jupyter server and exit the Docker container, you will see that the notebook has persisted even though the environment has exited and been cleaned up.
There should be no need to reuse the same container, as the things you care about, the data and files you are writing, should exist on your local machine (nicely versioned with Git). However, if you do want the container to persist between uses, you can remove the --rm flag from the docker run command to keep the container from being removed. In that case it is a good idea to also name the container with the --name flag
docker run --name DAMLA-env-container -it -v <path to the repo goes here>:/home/physicist/data -p 8888:8888 illinoismla/damla-env:latest
After you exit the container if you list the Docker containers on your local machine
docker ps -a
you will see your exited container. To resume using that specific container, start it again
docker start DAMLA-env-container
and then attach your shell to it
docker attach DAMLA-env-container
Sometimes seemingly small changes to a Jupyter notebook can be hard to distinguish from large changes using git diff. You can add a tool called nbdime to your conda environment to help with this. First activate your DAMLA conda environment and then do:
pip install nbdime
nbdime config-git --enable --global
You can then diff notebooks with a web-based tool:
nbdiff-web <notebook1.ipynb> <notebook2.ipynb>
For a complete set of nbdime commands, see here.