IPython Tutorial

ADSC, Singapore May 23rd by Jonas Arnfred

Abstract

For scientific computing, fast iterations and rapid prototyping are essential for designing experiments and working with data. When we can see the result of what we are doing right away, it is a lot simpler to adjust algorithm parameters and fine-tune plots to suit our purpose. IPython is an interactive Python console with a browser-based notebook that comes with support for code, text, mathematical expressions and inline plots. It is made with scientific computing in mind and allows for a MATLAB-like iterative approach to working with data. In this presentation I will give an introduction to IPython and showcase how it can be used for working with data, plotting and fast prototyping.

Introduction

This tutorial will cover basic principles in using IPython for development and scientific research. I will assume a basic familiarity with Python coding in general, although I think the code examples should be easy enough to understand for those who aren't familiar with it. If you have questions, you are of course welcome to interrupt me and ask me to clarify something.

I will start with a few notes on installing IPython before I introduce the basic features and showcase inline plotting. Then I will go through a small example computing the Mandelbrot set to showcase my normal workflow in IPython. Finally we will have a look at plotting data from the Mandelbrot set using the Pandas and ggplot libraries.

Installation

On Debian/Ubuntu you can install IPython by first installing pip:

sudo apt-get install python-pip

Then using pip you can easily install ipython and dependencies:

sudo pip install ipython[notebook]

For pylab (plotting) you will need numpy and matplotlib:

sudo pip install numpy matplotlib

On other platforms you can find instructions on the IPython installation page.

Using IPython

IPython consists of a few different shells that make it easy to enter and evaluate Python expressions. In this tutorial we will focus on the IPython Notebook, which is a browser-based shell with inline graphics. To open IPython Notebook, open a terminal and navigate to your project directory, then type:

ipython notebook --pylab inline

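The '--pylab inline' option means that any command creating a pylab plot or showing an image will render its output inline in the notebook instead of opening a separate window. It also makes the numpy and matplotlib functions, such as plot and imshow, available without an explicit import.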

Evaluating Expressions

After entering the command, IPython opens in a browser window showing the directory page, from where you can load saved IPython notebooks or create a new one. When you create a new notebook you are met with a page featuring a text input (a cell) in which you can enter any Python code and have it evaluated by pressing the run button or by pressing Ctrl + Enter:

In [1]:
n = 4+8
"I would like %i shrubberies" % n
Out[1]:
'I would like 12 shrubberies'

To insert a new cell below, go to insert and click 'Insert Cell Below' or press Ctrl + m and then b (as in below). You can change any cell and reevaluate it. Any variable assignment that you evaluate will be available in all other cells:

In [2]:
"'n' declared above has a value of %i" % n
Out[2]:
"'n' declared above has a value of 12"

Markdown Cells

Besides Python code, a cell can also contain Markdown-formatted text (like this one). Markdown is a simple syntax for formatting text documents which makes it easy to include links, lists, code examples etc. You can find a brief overview of the Markdown syntax here. In Markdown cells you can also include $\LaTeX$ expressions by wrapping the text in dollar signs (\$\LaTeX\$ -> $\LaTeX$), as well as HTML. The input <span style="color:red">Some text</span> is rendered as: Some text

IPython commands

While IPython evaluates Python code, the shell also comes with a set of built-in commands. A nice overview of the commands is featured here, but in day-to-day coding the one I find the most useful is %timeit:

In [3]:
import numpy
%timeit numpy.sqrt(numpy.ones((400, 400)))
100 loops, best of 3: 2.67 ms per loop
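Outside of IPython, a similar measurement can be sketched with the timeit module from the Python standard library. The helper name sqrt_of_ones and the loop counts below are my own choices for illustration, and the timing you get will of course depend on your machine:

```python
import timeit

import numpy


def sqrt_of_ones():
    # The same expression that is timed with %timeit in the cell above
    return numpy.sqrt(numpy.ones((400, 400)))


# Run 100 loops three times and keep the best (lowest) total time
timings = timeit.repeat(sqrt_of_ones, number=100, repeat=3)
best_ms = min(timings) / 100 * 1000  # milliseconds per loop
print("100 loops, best of 3: %.2f ms per loop" % best_ms)
```

Unlike this sketch, %timeit picks the number of loops for you, which is a large part of its convenience.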

Plotting

Since we've loaded IPython Notebook with pylab inline, plotting a figure is as simple as calling 'plot', which uses Matplotlib:

In [4]:
values = numpy.arange(30)**2
plot(values)
Out[4]:
[<matplotlib.lines.Line2D at 0x7ff867114390>]

An Example: The Mandelbrot Set

As an example I will work step by step through coding up the algorithm that computes the Mandelbrot fractal. This example probably makes more sense in the live tutorial than in the notes, but in case anyone is interested I've kept it here.

For the definition of the Mandelbrot set I take the liberty of quoting Wikipedia:

The Mandelbrot set is a mathematical set of points whose boundary is a distinctive and easily recognizable two-dimensional fractal shape. The set is closely related to Julia sets (which include similarly complex shapes) and is named after the mathematician Benoit Mandelbrot, who studied and popularized it.

Mandelbrot set images are made by sampling complex numbers and determining for each whether the result tends towards infinity when a particular mathematical operation is iterated on it. Treating the real and imaginary parts of each number as image coordinates, pixels are colored according to how rapidly the sequence diverges, if at all.

More precisely, the Mandelbrot set is the set of values of c in the complex plane for which the orbit of 0 under iteration of the complex quadratic polynomial $z_{n+1}=z_n^2+c$ remains bounded.

This means that for an $m \times n$ grid of complex numbers $c$ going from, say, $-x - iy$ to $x + iy$, we apply the function $z_{n+1}=z_n^2+c$ repeatedly and observe when each value diverges. We then create an $m \times n$ image where each pixel shows how many iterations it took before the corresponding complex value in the grid diverged. (Note that the $n$ in the iteration subscript is unrelated to the grid width $n$.)
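Before working with a whole grid, the iteration can be sketched for a single point in plain Python. The function name escape_time is my own. The sketch starts from $z_0 = 0$ as in the quoted definition, so its counts may be shifted by one compared to the notebook cells in this tutorial, which start from $z = c$:

```python
def escape_time(c, max_iterations=30):
    """Return the iteration at which |z| first exceeds 2, or max_iterations."""
    z = 0j
    for i in range(max_iterations):
        z = z * z + c  # one step of z_{n+1} = z_n^2 + c
        if abs(z) > 2:
            return i
    return max_iterations


print(escape_time(0))   # 0 stays at 0 forever, so this prints 30
print(escape_time(-1))  # cycles between -1 and 0, so this prints 30
print(escape_time(1))   # 0 -> 1 -> 2 -> 5, so this prints 2
```

Points that return max_iterations are treated as belonging to the set, at least up to the number of iterations we were willing to try.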

Getting a grid

To do this, we need a grid of evenly spaced complex numbers to start with. The most straightforward way (that I know of) in numpy is to create two arrays, one row and one column, and add them as follows:

In [5]:
import numpy
t1 = numpy.linspace(0,1,5).reshape((1, 5))
t2 = numpy.linspace(1,3,3).reshape((3, 1))
print(t1)
print(t2)
t1 + t2*1j
[[ 0.    0.25  0.5   0.75  1.  ]]
[[ 1.]
 [ 2.]
 [ 3.]]
Out[5]:
array([[ 0.00+1.j,  0.25+1.j,  0.50+1.j,  0.75+1.j,  1.00+1.j],
       [ 0.00+2.j,  0.25+2.j,  0.50+2.j,  0.75+2.j,  1.00+2.j],
       [ 0.00+3.j,  0.25+3.j,  0.50+3.j,  0.75+3.j,  1.00+3.j]])
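The addition works because numpy broadcasts the (1, 5) row against the (3, 1) column to produce a (3, 5) result. As a small sketch using the same values as the cell above, the grid can also be built with numpy.meshgrid, which spells out the two coordinate arrays explicitly:

```python
import numpy

real = numpy.linspace(0, 1, 5)
imag = numpy.linspace(1, 3, 3)

# Broadcasting: a (1, 5) row plus a (3, 1) column gives a (3, 5) grid
grid_broadcast = real.reshape((1, 5)) + imag.reshape((3, 1)) * 1j

# meshgrid builds the full (3, 5) real and imaginary parts first
real_grid, imag_grid = numpy.meshgrid(real, imag)
grid_meshgrid = real_grid + imag_grid * 1j

print(grid_broadcast.shape)
print(numpy.array_equal(grid_broadcast, grid_meshgrid))
```

Both approaches give the same array. The broadcasting version avoids creating the two intermediate full-size arrays.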

We can create a grid of size $m \times n = 600 \times 600$ with values ranging from $-2.3$ to $1$ for the real part and $-1.4$ to $1.4$ for the imaginary part:

In [6]:
m = 600 # Height of plot
n = 600 # Width of plot
values_real = numpy.linspace(-2.3, 1, n).reshape((1,n))
values_imag = numpy.linspace(-1.4, 1.4, m).reshape((m,1))
initial_values = values_real + values_imag*1j
initial_values
Out[6]:
array([[-2.30000000-1.4j       , -2.29449082-1.4j       ,
        -2.28898164-1.4j       , ...,  0.98898164-1.4j       ,
         0.99449082-1.4j       ,  1.00000000-1.4j       ],
       [-2.30000000-1.39532554j, -2.29449082-1.39532554j,
        -2.28898164-1.39532554j, ...,  0.98898164-1.39532554j,
         0.99449082-1.39532554j,  1.00000000-1.39532554j],
       [-2.30000000-1.39065109j, -2.29449082-1.39065109j,
        -2.28898164-1.39065109j, ...,  0.98898164-1.39065109j,
         0.99449082-1.39065109j,  1.00000000-1.39065109j],
       ..., 
       [-2.30000000+1.39065109j, -2.29449082+1.39065109j,
        -2.28898164+1.39065109j, ...,  0.98898164+1.39065109j,
         0.99449082+1.39065109j,  1.00000000+1.39065109j],
       [-2.30000000+1.39532554j, -2.29449082+1.39532554j,
        -2.28898164+1.39532554j, ...,  0.98898164+1.39532554j,
         0.99449082+1.39532554j,  1.00000000+1.39532554j],
       [-2.30000000+1.4j       , -2.29449082+1.4j       ,
        -2.28898164+1.4j       , ...,  0.98898164+1.4j       ,
         0.99449082+1.4j       ,  1.00000000+1.4j       ]])

Let's now apply the function $z_{n+1}=z_n^2+c$ to this grid. In each round we want to know which values have started to diverge. Because I don't have the patience to iterate an infinite number of rounds, we simply test whether each value has grown beyond a certain threshold. Conveniently enough, Wikipedia argues that we only need to test whether the norm of the value is above 2. For a complex number $z = x + iy$ the norm is $\sqrt{x^2 + y^2} = \sqrt{z \cdot \mathrm{conj}(z)}$, so the code below compares $z \cdot \mathrm{conj}(z)$ against $4$ and skips the square root. For each iteration we log in the iterations matrix which numbers have just diverged. We can then show this as a heatmap using matplotlib's imshow function:

In [7]:
values = initial_values
max_iterations = 30
iterations = numpy.ones(initial_values.shape) * max_iterations
for i in range(max_iterations) :
    values = values**2 + initial_values
    divergent = values * conj(values) > 4
    divergent = divergent & (iterations == max_iterations) # Test that we haven't already found this number
    iterations[divergent] = i
imshow(iterations)
-c:6: RuntimeWarning: overflow encountered in multiply
-c:6: RuntimeWarning: invalid value encountered in multiply
-c:5: RuntimeWarning: overflow encountered in square
-c:5: RuntimeWarning: invalid value encountered in square
Out[7]:
<matplotlib.image.AxesImage at 0x7ff866ec1e90>
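The RuntimeWarnings above appear because values that have already diverged keep being squared until they overflow. They are harmless for the picture, but one way to avoid them is to update only the points that have not escaped yet. The following is a sketch of that idea, not the code used in the rest of this tutorial. The function name escape_counts and the mask name active are my own:

```python
import numpy


def escape_counts(initial_values, max_iterations=30):
    """Count iterations until divergence, updating only points still in play."""
    values = initial_values.copy()
    iterations = numpy.full(initial_values.shape, max_iterations)
    active = numpy.ones(initial_values.shape, dtype=bool)
    for i in range(max_iterations):
        # Only iterate points that have not diverged yet
        values[active] = values[active] ** 2 + initial_values[active]
        # Squared norm x^2 + y^2 > 4 is the same test as norm > 2
        escaped = active & (values.real ** 2 + values.imag ** 2 > 4)
        iterations[escaped] = i
        active &= ~escaped
    return iterations


small_grid = numpy.array([[0, 1, 2, -1]], dtype=complex)
print(escape_counts(small_grid))
```

Since diverged points are frozen, no value grows large enough to overflow, and later iterations have less work to do as more points drop out.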

Let's set matplotlib defaults to a bigger plot size:

In [8]:
rcParams['figure.figsize'] = 10, 10
imshow(iterations)
Out[8]:
<matplotlib.image.AxesImage at 0x7ff866dfea10>

This is a bit cumbersome for experimentation though, so let's define a function so we can more easily play around with the values:

In [9]:
def mandelbrot(width, height, x_lim = (-2.3, 1), y_lim = (-1.4, 1.4), max_iterations = 30) :
    m = height # Height of plot
    n = width # Width of plot
    values_real = numpy.linspace(x_lim[0], x_lim[1], n).reshape((1,n))
    values_imag = numpy.linspace(y_lim[0], y_lim[1], m).reshape((m,1))
    initial_values = values_real + values_imag*1j
    initial_values
    values = initial_values
    iterations = numpy.ones(initial_values.shape) * max_iterations
    for i in range(max_iterations) :
        values = values**2 + initial_values
        divergent = values * conj(values) > 4
        divergent = divergent & (iterations == max_iterations) # Test that we haven't already found this number
        iterations[divergent] = i
    return iterations

Nice, so let's try to zoom in a bit:

In [10]:
mandelbrot_data = mandelbrot(600, 600, (-0.56, -0.55), (-0.56,-0.55), 90)
imshow(mandelbrot_data)
-c:12: RuntimeWarning: overflow encountered in multiply
-c:12: RuntimeWarning: invalid value encountered in multiply
-c:11: RuntimeWarning: overflow encountered in square
-c:11: RuntimeWarning: invalid value encountered in square
Out[10]:
<matplotlib.image.AxesImage at 0x7ff8655f0250>
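When zooming, I find it easier to think in terms of a center point and a window size than in terms of limits. As a small sketch, a hypothetical helper (zoom_limits is my own name, not part of any library) can translate between the two:

```python
def zoom_limits(center, width, height):
    """Return (x_lim, y_lim) for a window of the given size around a complex center."""
    x_lim = (center.real - width / 2.0, center.real + width / 2.0)
    y_lim = (center.imag - height / 2.0, center.imag + height / 2.0)
    return x_lim, y_lim


# Roughly the window used in the cell above: 0.01 wide around -0.555 - 0.555i
x_lim, y_lim = zoom_limits(complex(-0.555, -0.555), 0.01, 0.01)
print(x_lim, y_lim)
```

The resulting tuples can be passed as the x_lim and y_lim arguments of the mandelbrot function defined earlier.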

Plotting with ggPlot and Pandas

To finish, I would like to add a brief overview of data exploration using the library Pandas together with ggplot for Python. Pandas is a library providing easy-to-use data structures and data analysis tools for Python, inspired by R data frames. Similarly, ggplot is a plotting library originally developed for R and later partially ported to Python. Both can be installed with pip:

pip install pandas ggplot

To keep with the theme of the Mandelbrot set, we'll take a look at the data produced by our mandelbrot function and plot it in various ways. First though, we'll need to construct a Pandas dataframe with the relevant information:

In [11]:
import pandas, ggplot
# Each row of the image becomes one column of the dataframe
mandelbrot_df = pandas.DataFrame({ i : f for i,f in enumerate(mandelbrot_data) })
mandelbrot_df
Out[11]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 90 75 70 69 68 68 68 70 67 66 66 66 64 63 63 62 62 62 62 63 ...
1 73 85 68 68 67 67 67 66 65 65 65 64 63 63 62 62 62 62 62 62 ...
2 70 69 68 67 67 66 66 65 65 64 64 63 63 62 62 62 62 62 61 61 ...
3 69 69 68 67 67 66 66 65 65 64 64 63 63 62 62 62 62 61 61 61 ...
4 69 69 68 67 66 66 65 65 64 64 64 63 63 62 62 62 61 61 61 60 ...
5 69 69 68 67 67 66 66 65 64 64 64 63 63 62 62 61 61 61 61 60 ...
6 69 69 69 68 67 66 66 65 64 64 64 63 63 62 62 61 61 61 60 60 ...
7 71 70 70 73 71 67 66 65 64 64 64 63 63 63 62 61 61 61 60 60 ...
8 74 71 71 75 90 69 67 66 65 64 64 64 63 63 62 61 61 61 60 60 ...
9 79 73 74 85 81 90 80 87 66 65 65 65 65 64 63 62 61 61 60 60 ...
10 82 76 76 90 90 90 77 69 67 66 66 66 70 69 90 62 61 61 60 60 ...
11 90 90 80 81 87 90 90 86 68 67 67 68 70 90 78 63 62 61 60 60 ...
12 90 90 86 90 90 90 74 71 69 68 69 90 82 90 67 64 63 62 61 60 ...
13 90 90 90 90 90 79 76 72 70 70 71 75 90 76 90 90 64 63 61 60 ...
14 90 90 90 90 90 83 90 87 86 71 72 87 90 84 84 90 72 90 64 61 ...
15 90 88 90 90 90 88 90 90 84 74 74 76 90 90 90 78 90 67 63 61 ...
16 90 83 90 90 90 90 90 90 81 77 77 78 81 90 89 90 70 64 62 62 ...
17 78 80 90 90 90 90 90 90 90 86 90 89 90 90 76 90 77 66 62 62 ...
18 76 80 84 90 90 90 90 90 90 89 90 90 90 89 74 70 69 66 63 62 ...
19 79 90 90 90 90 90 90 90 90 90 90 90 90 77 70 68 66 65 64 63 ...
20 74 77 90 90 90 90 86 90 90 90 90 84 79 74 72 69 66 65 64 63 ...
21 72 90 90 90 90 90 82 90 90 90 90 83 79 90 90 72 67 65 64 64 ...
22 70 72 72 75 90 84 79 90 90 90 90 90 81 90 88 86 72 70 66 65 ...
23 69 70 71 72 76 76 79 85 90 90 90 86 84 89 90 90 77 69 67 66 ...
24 69 69 70 71 73 74 84 90 90 90 90 90 90 90 90 86 90 90 68 67 ...
25 68 69 70 71 73 75 79 85 90 90 90 90 90 90 90 80 78 73 69 69 ...
26 68 68 70 85 86 81 79 90 90 90 90 90 90 90 90 90 75 73 71 70 ...
27 67 67 68 90 80 90 84 90 90 90 90 90 90 90 90 90 77 78 76 90 ...
28 66 67 67 69 73 86 90 90 90 88 88 87 90 90 90 90 80 87 90 90 ...
29 66 66 67 70 90 83 90 90 90 88 84 84 90 90 90 90 86 85 90 90 ...
30 65 66 66 83 80 90 83 90 90 90 81 82 90 90 90 90 90 90 90 90 ...
31 65 65 65 66 68 69 76 80 90 90 80 79 82 90 90 90 90 90 90 90 ...
32 64 64 65 65 66 68 72 90 90 78 76 78 82 90 90 90 90 90 90 90 ...
33 64 64 64 65 66 67 72 76 76 74 75 79 90 90 90 90 90 85 79 90 ...
34 64 64 64 65 66 66 68 70 71 72 74 78 90 90 90 90 90 90 76 74 ...
35 63 64 64 65 66 66 68 69 70 73 79 81 90 90 90 90 90 90 77 74 ...
36 63 63 64 65 66 67 68 69 71 90 90 90 90 90 90 90 87 90 77 76 ...
37 62 63 64 66 72 70 71 70 71 90 90 90 90 90 90 89 84 81 78 78 ...
38 62 62 64 90 87 86 76 72 73 76 81 89 90 90 90 90 90 81 80 81 ...
39 62 62 64 67 90 90 90 75 75 81 90 90 90 90 90 90 90 90 84 83 ...
40 62 62 64 65 69 90 90 78 79 90 90 90 90 90 90 90 90 90 90 87 ...
41 61 62 64 66 69 75 90 82 90 90 90 90 90 90 90 90 90 90 90 90 ...
42 61 63 90 76 87 75 90 88 90 90 90 88 90 90 90 90 90 90 90 90 ...
43 61 63 71 76 87 90 85 90 90 89 85 84 86 90 90 90 90 90 90 90 ...
44 60 61 63 66 68 90 90 90 90 90 85 81 81 82 88 90 90 90 90 90 ...
45 60 61 62 63 66 90 90 90 87 90 84 80 79 80 82 86 90 90 90 90 ...
46 60 61 61 63 64 68 73 79 90 90 90 78 78 80 90 88 90 90 90 87 ...
47 60 61 61 62 65 81 90 90 90 86 78 76 76 79 90 90 90 90 90 90 ...
48 60 61 61 62 90 90 90 90 77 90 75 74 75 90 90 90 90 90 90 90 ...
49 60 61 61 62 65 71 69 84 74 71 72 73 74 77 86 90 90 90 90 90 ...
50 61 61 62 63 63 65 66 68 69 70 71 73 75 77 82 86 90 90 90 90 ...
51 61 62 62 63 63 65 66 67 68 69 71 74 90 80 84 88 90 90 90 90 ...
52 62 62 63 63 64 64 66 67 69 70 73 87 90 86 90 90 90 90 90 90 ...
53 66 67 64 64 64 65 70 71 74 72 78 90 90 90 90 90 90 90 90 90 ...
54 76 90 66 65 66 67 90 90 77 75 90 90 90 90 90 90 90 88 84 83 ...
55 90 77 67 67 67 78 90 90 90 86 82 85 89 90 90 90 90 88 90 80 ...
56 90 90 70 68 69 73 88 83 89 90 90 90 90 90 90 90 90 90 90 90 ...
57 90 75 71 69 70 72 75 79 90 90 90 90 90 90 90 90 90 90 88 74 ...
58 90 86 72 73 90 78 78 90 90 90 90 90 90 90 90 88 85 90 90 72 ...
59 90 76 75 78 89 89 90 90 90 90 90 90 90 86 85 80 88 88 74 71 ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

600 rows × 600 columns
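Note that the dictionary comprehension turns row i of the image into column i of the dataframe, so the dataframe is the transpose of the image. A small sketch with a made-up 2 x 3 array (the values are for illustration only) shows this:

```python
import numpy
import pandas

# Toy "image" with 2 rows and 3 columns
image = numpy.array([[1, 2, 3],
                     [4, 5, 6]])

# Same construction as in the cell above: row i becomes column i
by_dict = pandas.DataFrame({i: row for i, row in enumerate(image)})

# The equivalent transpose
by_transpose = pandas.DataFrame(image).T

print(by_dict.shape)
print(numpy.array_equal(by_dict.values, by_transpose.values))
```

For the square 600 x 600 Mandelbrot image the shape does not change, which makes the transposition easy to overlook.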

This gives us a data structure that is easy to work with using Pandas' functionality. For the moment though, we are mostly interested in creating a narrow table with x and y values to plot the data. For this purpose we pick a few columns of the dataframe (each of which holds one row of the image) and use Pandas' melt function to stack the data into one long table:

In [12]:
select_rows_df = pandas.melt(mandelbrot_df[[100,200,300,400]], var_name="line", value_name="iteration") # picking out row 100, 200, 300 and 400 for inspection
select_rows_df
Out[12]:
line iteration
0 100 90
1 100 62
2 100 61
3 100 61
4 100 60
5 100 60
6 100 59
7 100 59
8 100 59
9 100 59
10 100 59
11 100 59
12 100 59
13 100 59
14 100 59
15 100 60
16 100 60
17 100 60
18 100 61
19 100 62
20 100 65
21 100 78
22 100 85
23 100 90
24 100 77
25 100 90
26 100 90
27 100 90
28 100 90
29 100 78
30 100 79
31 100 74
32 100 86
33 100 61
34 100 60
35 100 59
36 100 58
37 100 57
38 100 56
39 100 55
40 100 55
41 100 55
42 100 55
43 100 54
44 100 54
45 100 54
46 100 54
47 100 54
48 100 54
49 100 55
50 100 55
51 100 56
52 100 90
53 100 90
54 100 90
55 100 74
56 100 60
57 100 90
58 100 90
59 100 58
... ...

2400 rows × 2 columns
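To see what melt does, here is a sketch on a tiny dataframe with made-up values. The column labels 100 and 200 stand in for the selected image rows:

```python
import pandas

# Wide format: one column per selected line (toy values)
wide = pandas.DataFrame({100: [90, 62], 200: [55, 54]})

# Narrow format: one row per (line, iteration) pair
narrow = pandas.melt(wide, var_name="line", value_name="iteration")
print(narrow)
```

Each column of the wide table is stacked on top of the next, with the former column label repeated in the "line" column. This is the layout ggplot expects when grouping by a variable.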

Now we can take a look at the histogram of iteration values for these 4 rows by using ggplot and the geom_histogram method:

In [13]:
from ggplot import *
ggplot(select_rows_df, aes(x="iteration")) + geom_histogram(binwidth = 5)
Out[13]:
<ggplot: (8794026754077)>
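The binning behind such a histogram can be sketched with numpy.histogram. The sample values below are the first eight iteration values from the table above. The bin edges are an assumption of mine, and ggplot may place its bins differently:

```python
import numpy

# First eight iteration values from the melted table above
sample = numpy.array([90, 62, 61, 61, 60, 60, 59, 59])

# Bins that are 5 wide, from 55 up to 95
bin_edges = numpy.arange(55, 96, 5)
counts, edges = numpy.histogram(sample, bins=bin_edges)

for left, count in zip(edges[:-1], counts):
    print("%d-%d: %d" % (left, left + 5, count))
```

Each bar in the plot corresponds to one of these counts.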

How about looking at the densities for different lines? We can group the values by the 'line' variable and display the histogram as a density plot:

In [14]:
ggplot(select_rows_df, aes(x="iteration", color="line", fill="line")) + geom_density(alpha=0.3)
Out[14]:
<ggplot: (8794026732405)>

Finally how about we show a line plot illustrating the mean of the iteration values as we move down the image line by line?

In [15]:
accumulated_rows_df = mandelbrot_df.mean().reset_index()
accumulated_rows_df
Out[15]:
index 0
0 0 72.740000
1 1 72.440000
2 2 72.031667
3 3 72.081667
4 4 72.193333
5 5 72.008333
6 6 71.663333
7 7 71.410000
8 8 71.325000
9 9 70.953333
10 10 70.566667
11 11 70.483333
12 12 70.086667
13 13 70.195000
14 14 69.980000
15 15 69.848333
16 16 69.741667
17 17 69.766667
18 18 69.010000
19 19 68.685000
20 20 68.690000
21 21 69.113333
22 22 68.986667
23 23 68.983333
24 24 68.738333
25 25 69.246667
26 26 69.381667
27 27 69.458333
28 28 69.316667
29 29 69.453333
30 30 69.826667
31 31 70.451667
32 32 70.235000
33 33 70.575000
34 34 70.766667
35 35 71.000000
36 36 71.133333
37 37 71.493333
38 38 71.570000
39 39 71.090000
40 40 70.578333
41 41 70.438333
42 42 70.356667
43 43 70.035000
44 44 69.601667
45 45 69.190000
46 46 69.500000
47 47 69.375000
48 48 69.088333
49 49 68.985000
50 50 68.501667
51 51 68.260000
52 52 68.485000
53 53 68.431667
54 54 68.435000
55 55 68.381667
56 56 68.756667
57 57 68.773333
58 58 68.458333
59 59 67.991667
... ...

600 rows × 2 columns
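Because each dataframe column holds one row of the image, the column means computed by mean() are the row means of the image. A sketch with a made-up 2 x 3 array (toy values, for illustration only):

```python
import numpy
import pandas

image = numpy.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 9.0]])
frame = pandas.DataFrame({i: row for i, row in enumerate(image)})

column_means = frame.mean()        # one mean per dataframe column
row_means = image.mean(axis=1)     # one mean per image row
print(numpy.allclose(column_means.values, row_means))

# reset_index turns the Series into a two-column dataframe
means_df = frame.mean().reset_index()
print(list(means_df.columns))
```

The columns of the resulting dataframe are named "index" and 0, which is why the plotting call refers to them by those names.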

In [16]:
ggplot(accumulated_rows_df, aes(x="index", y=0)) + geom_line(position="jitter")
Out[16]:
<ggplot: (8794053609001)>

For ggplot you can find more plotting options here. Pandas is a very powerful data analysis tool and we've barely scratched the surface. To get a glimpse of what you can do with Pandas I suggest taking a look at the 10 minute tour of Pandas.