#!/usr/bin/env python # coding: utf-8 # # Jupyter Notebooks # ## for Collaborative and Reproducible Research # ## Reproducible Research # # > reproducing conclusions from a single experiment based on the measurements from that experiment # # The most basic form of reproducibility is a complete description of the data and associated analyses (including code!) so the results can be *exactly* reproduced by others. # # Reproducing calculations can be onerous, even with one's own work! # # Scientific data are becoming larger and more complex, making simple descriptions inadequate for reproducibility. As a result, most modern research is irreproducible without tremendous effort. # # ***Reproducible research is not yet part of the culture of science in general, or of scientific computing in particular.*** # ## Scientific Computing Workflow # # There are a number of steps to scientific endeavors that involve computing: # # ![workflow](images/workflow.png) # # # Many of the standard tools impose barriers between one or more of these steps. This can make it difficult to iterate on and reproduce work. # # The Jupyter notebook [eliminates or reduces these barriers to reproducibility](http://www.nature.com/news/interactive-notebooks-sharing-the-code-1.16261). # # Jupyter/IPython notebooks have already motivated the generation of [reproducible publications](https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks#reproducible-academic-publications) and an [open source statistics textbook](http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/). # ## Jupyter Notebook # # The Jupyter Notebook is an **interactive computing environment** that enables users to author notebook documents that include: # - Live code # - Interactive widgets # - Plots # - Narrative text # - Equations # - Images # - Video # # These documents provide a **complete and self-contained record of a computation** that can be converted to various formats and shared with others using email, [Dropbox](http://dropbox.com), version control systems (like git/[GitHub](http://github.com)) or [nbviewer.ipython.org](http://nbviewer.ipython.org). # ### Components # # The Jupyter Notebook combines three components: # # * **The notebook web application**: An interactive web application for writing and running code and for authoring notebook documents. # * **Kernels**: Separate processes started by the notebook web application that run users' code in a given language and return output back to the notebook web application. The kernel also handles things like computations for interactive widgets, tab completion, and introspection. # * **Notebook documents**: Self-contained documents that contain a representation of all content visible in the notebook web application, including inputs and outputs of the computations, narrative # text, equations, images, and rich media representations of objects. Each notebook document has its own kernel. # ## Kernels # # Through IPython's kernel and messaging architecture, the Notebook allows code to be run in a range of different programming languages. For each notebook document that a user opens, the web application starts a kernel that runs the code for that notebook.
Each kernel is capable of running code in a single programming language, and there are kernels available in the following languages: # # * [Python](https://github.com/ipython/ipython) # * [Julia](https://github.com/JuliaLang/IJulia.jl) # * [R](https://github.com/takluyver/IRkernel) # * [Ruby](https://github.com/minrk/iruby) # * [Haskell](https://github.com/gibiansky/IHaskell) # * [Scala](https://github.com/Bridgewater/scala-notebook) # * [node.js](https://github.com/n-riesco/ijavascript) # * [Go](https://github.com/takluyver/igo) # # The default kernel runs Python code. IPython 3.0 provides a simple way for users to pick which of these kernels is used for a given notebook. # # Each of these kernels communicates with the notebook web application and web browser using a JSON over ZeroMQ/WebSockets message protocol that is described [here](http://ipython.org/ipython-doc/dev/development/messaging.html). Most users don't need to know about these details, but it helps to understand that "kernels run code." # ## Notebook Documents # # Notebook documents contain the **inputs and outputs** of an interactive session as well as **narrative text** that accompanies the code but is not meant for execution. **Rich output** generated by running code, including HTML, images, video, and plots, is embedded in the notebook, which makes it a complete and self-contained record of a computation. # # When you run the notebook web application on your computer, notebook documents are just **files on your local filesystem with a `.ipynb` extension**. This allows you to use familiar workflows for organizing your notebooks into folders and sharing them with others. # # Notebooks consist of a **linear sequence of cells**. There are three basic cell types: # # * **Code cells:** Input and output of live code that is run in the kernel # * **Markdown cells:** Narrative text with embedded LaTeX equations # * **Raw cells:** Unformatted text that is included, without modification, when notebooks are converted to different formats using nbconvert # # Internally, notebook documents are **[JSON](http://en.wikipedia.org/wiki/JSON) data** with **binary values [base64](http://en.wikipedia.org/wiki/Base64)** encoded. This allows them to be **read and manipulated programmatically** by any programming language. Because JSON is a text format, notebook documents are version-control friendly. # # **Notebooks can be exported** to different static formats including HTML, reStructuredText, LaTeX, PDF, and slide shows ([reveal.js](http://lab.hakim.se/reveal-js/#/)) using IPython's `nbconvert` utility. # # Furthermore, any notebook document available from a **public URL** can be shared via [nbviewer](http://nbviewer.ipython.org). This service loads the notebook document from the URL and renders it as a static web page. The resulting web page may thus be shared with others **without their needing to install IPython**. # ## Installation and Configuration # # # While Jupyter runs code in many different programming languages, Python is a prerequisite for installing the Jupyter Notebook. # # Perhaps the easiest way to get a feature-complete version of Python on your system is to install the [Anaconda](http://continuum.io/downloads.html) distribution by Continuum Analytics. Anaconda is a completely free Python environment that includes almost 200 of the best Python packages for science and data analysis. It's simply a matter of downloading the installer (either graphical or command line) and running it on your system.
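# If you already have a Python installation and are not sure which version it is, a quick check from within Python itself is shown below (a minimal sketch; the exact version string and interpreter path will of course differ from system to system):
#
# ```python
# import sys
#
# print(sys.version)     # full version string of the running interpreter
# print(sys.executable)  # path to the interpreter, useful for checking which install is active
# ```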
# # Be sure to download the Python 3.5 installer by following the **Python 3.5 link** for your computing platform (Mac OS X example shown below). # # ![get Python 3](http://fonnesbeck-dropshare.s3.amazonaws.com/687474703a2f2f666f6e6e65736265636b2d64726f7073686172652e73332e616d617a6f6e6177732e636f6d2f53637265656e2d53686f742d323031362d30332d31382d61742d332e32342e32362d504d2e706e67.png) # # Once Python is installed, installing Jupyter is a matter of running a single command: # # conda install jupyter # # If you prefer to install Jupyter from source, or you did not use Anaconda to install Python, you can also use `pip`: # # pip install jupyter # ## Installing Kernels # # Individual language kernels must be installed from within each respective language. We will show the R kernel installation as an example. # # Setting up the R kernel involves two commands from within the R shell. The first installs the packages: # # ```r # install.packages(c('repr', 'IRkernel', 'IRdisplay'), # repos = c('http://irkernel.github.io/', getOption('repos'))) # ``` # # and the second links the kernel to Jupyter: # # ```r # IRkernel::installspec() # ``` # ## Running Jupyter Notebooks # # Once installed, a notebook session can be initiated from the command line via: # # jupyter notebook # # If you installed Jupyter via Anaconda, you will also have a graphical launcher available. # ## IPython # # **IPython** (Interactive Python) is an enhanced Python shell that provides a more robust and productive development environment for users. There are several key features that set it apart from the standard Python shell. # # * Interactive data analysis and visualization # * Python kernel for Jupyter notebooks # * Easy parallel computation # # Over time, the IPython project grew to include several components: # # * an interactive shell # * a REPL protocol # * a notebook document format # * a notebook document conversion tool # * a web-based notebook authoring tool # * tools for building interactive UI (widgets) # * interactive parallel Python # # As each component has evolved, several have grown to the point that they warrant projects of their own. For example, pieces like the notebook and protocol are not even specific to Python. As a result, the IPython team created Project Jupyter, which is the new home of language-agnostic projects that began as part of IPython, such as the notebook in which you are reading this text. # # The HTML notebook that is part of the Jupyter project supports **interactive data visualization** and easy high-performance **parallel computing**. # # In[1]: get_ipython().run_line_magic('matplotlib', 'inline') import matplotlib.pyplot as plt plt.style.use('fivethirtyeight') def f(x): return (x-3)*(x-5)*(x-7)+85 import numpy as np x = np.linspace(0, 10, 200) y = f(x) plt.plot(x,y) # The Notebook gives you everything that a browser gives you. For example, you can embed images, videos, or entire websites. # In[2]: from IPython.display import IFrame IFrame('http://biostat.mc.vanderbilt.edu/wiki', width='100%', height=350) # In[3]: from IPython.display import YouTubeVideo YouTubeVideo("rl5DaFbLc60") # # Running Code # First and foremost, the IPython Notebook is an interactive environment for writing and running code. IPython is capable of running code in a wide range of languages. However, this notebook, and the default kernel in IPython 3, runs Python code.
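# If you are curious which kernels are registered on your own machine, you can query the kernel spec manager from Python (a minimal sketch; the output depends entirely on what you have installed, and `jupyter kernelspec list` from the command line gives the same information):
#
# ```python
# from jupyter_client.kernelspec import KernelSpecManager
#
# # Map of kernel names to the directories holding their kernel.json definitions
# specs = KernelSpecManager().find_kernel_specs()
# for name, path in specs.items():
#     print(name, '->', path)
# ```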
# ## Code cells allow you to enter and run Python code # Run a code cell using `Shift-Enter` or by pressing the Run button in the toolbar above: # In[4]: a = 10 # In[5]: print(a) # There are three keyboard shortcuts for running code: # # * `Shift-Enter` runs the current cell, enters command mode, and selects the next cell. # * `Ctrl-Enter` runs the current cell and enters command mode. # * `Alt-Enter` runs the current cell, inserts a new one below, and enters edit mode. # # These keyboard shortcuts work in both command and edit mode. # ## Managing the IPython Kernel # Code is run in a separate process called the IPython Kernel. The Kernel can be interrupted or restarted. Try running the following cell and then hit the interrupt (stop) button in the toolbar above. # In[6]: import time time.sleep(10) # If the Kernel dies, it will be restarted automatically up to 3 times. # If it cannot be restarted automatically, you will be prompted to try again or abort. # Here we call the low-level system libc.time routine with the wrong argument via # ctypes to segfault the Python interpreter: # In[ ]: import sys from ctypes import CDLL # This will crash a Linux or Mac system # equivalent calls can be made on Windows dll = 'dylib' if sys.platform == 'darwin' else 'so.6' libc = CDLL("libc.%s" % dll) libc.time(-1) # BOOM!! # ## Cell menu # The "Cell" menu has a number of menu items for running code in different ways. These include: # # * Run # * Run and Select Below # * Run and Insert Below # * Run All # * Run All Above # * Run All Below # ## Restarting the kernel # The kernel maintains the state of a notebook's computations. You can reset this state by restarting the kernel. This is done by clicking on the restart button in the toolbar above, or by using the `00` (press 0 twice) shortcut in command mode. # ## Output is asynchronous # All output is displayed asynchronously as it is generated in the Kernel. If you execute the next cell, you will see the output one piece at a time, not all at the end. # In[7]: import time, sys for i in range(8): print(i) time.sleep(0.5) # ## Large outputs # To better handle large outputs, the output area can be collapsed. Run the following cell and then single- or double-click on the active area to the left of the output: # In[8]: for i in range(50): print(i) # Beyond a certain point, output will scroll automatically: # In[9]: for i in range(500): print(2**i - 1) # ## Markdown cells # # Markdown is a simple *markup* language that allows plain text to be converted into HTML. # # The advantages of using Markdown over HTML (and LaTeX): # # - it is a **human-readable** format # - it allows writers to focus on content rather than formatting and layout # - it is easier to learn and use # # For example, instead of writing: # # ```html #

<p>In order to create valid # <a href="http://en.wikipedia.org/wiki/HTML">HTML</a>, you # need properly coded syntax that can be cumbersome for # “non-programmers” to write. Sometimes, you # just want to easily make certain words <strong>bold</strong>, # and certain words <em>italicized</em> without # having to remember the syntax. Additionally, for example, # creating lists:</p> # <ul> # <li>should be easy</li> # <li>should not involve programming</li> # </ul>

# # ``` # # we can write the following in Markdown: # # ```markdown # In order to create valid [HTML], you need properly # coded syntax that can be cumbersome for # "non-programmers" to write. Sometimes, you just want # to easily make certain words **bold**, and certain # words *italicized* without having to remember the # syntax. Additionally, for example, creating lists: # # * should be easy # * should not involve programming # ``` # # ### Emphasis # # Markdown uses `*` (asterisk) and `_` (underscore) characters as # indicators of emphasis. # # *italic*, _italic_ # **bold**, __bold__ # ***bold-italic***, ___bold-italic___ # # *italic*, _italic_ # **bold**, __bold__ # ***bold-italic***, ___bold-italic___ # # ### Lists # # Markdown supports both unordered and ordered lists. Unordered lists can use `*`, `-`, or # `+` to define a list. This is an unordered list: # # * Apples # * Bananas # * Oranges # # * Apples # * Bananas # * Oranges # # Ordered lists are numbered lists in plain text: # # 1. Bryan Ferry # 2. Brian Eno # 3. Andy Mackay # 4. Paul Thompson # 5. Phil Manzanera # # 1. Bryan Ferry # 2. Brian Eno # 3. Andy Mackay # 4. Paul Thompson # 5. Phil Manzanera # # ### Links # # Markdown inline links are equivalent to HTML `<a href="...">` # links; they just have a different syntax. # # [Biostatistics home page](http://biostat.mc.vanderbilt.edu "Visit Biostat!") # # [Biostatistics home page](http://biostat.mc.vanderbilt.edu "Visit Biostat!") # # ### Block quotes # # Block quotes are denoted by a `>` (greater than) character # before each line of the block quote. # # > Sometimes a simple model will outperform a more complex model . . . # > Nevertheless, I believe that deliberately limiting the complexity # > of the model is not fruitful when the problem is evidently complex. # # > Sometimes a simple model will outperform a more complex model . . . # > Nevertheless, I believe that deliberately limiting the complexity # > of the model is not fruitful when the problem is evidently complex. # # ### Images # # Images look an awful lot like Markdown links; they just have an extra # `!` (exclamation mark) in front of them. # # ![Python logo](images/python-logo-master-v3-TM.png) # # ![Python logo](images/python-logo-master-v3-TM.png) # ### MathJax Support # # MathJax is a JavaScript implementation of LaTeX that allows equations to be embedded into HTML. For example, this markup: # # """$$ \int_{a}^{b} f(x)\, dx \approx \frac{1}{2} \sum_{k=1}^{N} \left( x_{k} - x_{k-1} \right) \left( f(x_{k}) + f(x_{k-1}) \right). $$""" # # becomes this: # # $$ # \int_{a}^{b} f(x)\, dx \approx \frac{1}{2} \sum_{k=1}^{N} \left( x_{k} - x_{k-1} \right) \left( f(x_{k}) + f(x_{k-1}) \right). # $$ # ## Running other Kernels # # The kernel of a Jupyter session can be switched from the menu. [Here is an example of a notebook running R code](rtutorial.ipynb). # ## IPython in Jupyter Notebooks # # Running IPython within a Jupyter Notebook provides an enhanced interactive scientific computing environment. # # ### SymPy # # SymPy is a Python library for symbolic mathematics. It supports: # # * polynomials # * calculus # * solving equations # * discrete math # * matrices # In[10]: from sympy import * init_printing() x, y = symbols("x y") # In[11]: eq = ((x+y)**2 * (x+1)) eq # In[12]: expand(eq) # In[13]: (1/cos(x)).series(x, 0, 6) # In[14]: limit((sin(x)-x)/x**3, x, 0) # In[15]: diff(cos(x**2)**2 / (1+x), x) # ### Magic functions # # IPython has a set of predefined ‘magic functions’ that you can call with a command-line-style syntax.
These include: # # * `%run` # * `%edit` # * `%debug` # * `%timeit` # * `%paste` # * `%load_ext` # # # In[16]: get_ipython().run_line_magic('lsmagic', '') # IPython also creates aliases for a few common interpreters, such as bash, ruby, perl, etc. # # These are all equivalent to `%%script <name>` # In[17]: get_ipython().run_cell_magic('ruby', '', 'puts "Hello from Ruby #{RUBY_VERSION}"\n') # In[18]: get_ipython().run_cell_magic('bash', '', 'echo "hello from $BASH"\n') # IPython has an `rmagic` extension that contains some magic functions for working with R via rpy2. This extension can be loaded using the `%load_ext` magic as follows: # In[19]: get_ipython().run_line_magic('load_ext', 'rpy2.ipython') # If the above generates an error, it is likely that you do not have the `rpy2` module installed. You can install this now via: # In[20]: get_ipython().system('pip install rpy2') # or, if you are running Anaconda, via `conda`: # In[21]: get_ipython().system('conda install rpy2') # In[22]: get_ipython().run_line_magic('R', 'print(lm(rnorm(10)~rnorm(10)))') print('i am python') # In[23]: import numpy as np x,y = np.arange(10), np.random.normal(size=10) # In[24]: get_ipython().run_cell_magic('R', '-i x,y -o XYcoef', 'lm.fit <- lm(y~x)\npar(mfrow=c(2,2))\nprint(summary(lm.fit))\nplot(lm.fit)\nXYcoef <- coef(lm.fit)\n') # In[25]: XYcoef # ### Remote Code # # Use `%load` to add remote code: # In[26]: # %load http://matplotlib.org/mpl_examples/shapes_and_collections/scatter_demo.py """ Simple demo of a scatter plot. """ import numpy as np import matplotlib.pyplot as plt N = 50 x = np.random.rand(N) y = np.random.rand(N) colors = np.random.rand(N) area = np.pi * (15 * np.random.rand(N))**2 # 0 to 15 point radiuses plt.scatter(x, y, s=area, c=colors, alpha=0.5) plt.show() # ### Debugging and Profiling # # The `%debug` magic can be used to trigger the IPython debugger (`ipdb`) for a cell that raises an exception. The debugger allows you to step through code line by line, inspect variables, and execute code. The `abc` function below contains a bug (the first comparison is made against the list `epsilon` rather than against `epsilon[0]`), so calling it raises an exception that we can then inspect with `%debug`. # In[27]: import numpy def abc(y, N, epsilon=[0.2, 0.8]): trace = [] while len(trace) < N: # Simulate from priors mu = numpy.random.normal(0, 10) sigma = numpy.random.uniform(0, 20) x = numpy.random.normal(mu, sigma, 50) #if (np.linalg.norm(y - x) < epsilon): if ((abs(x.mean() - y.mean()) < epsilon) & (abs(x.std() - y.std()) < epsilon[1])): trace.append([mu, sigma]) return trace # In[28]: y = numpy.random.normal(4, 2, 50) abc(y, 10) # In[29]: get_ipython().run_line_magic('debug', '') # Timing the execution of code is easy with the `timeit` magic: # In[30]: get_ipython().run_line_magic('timeit', '[i**2 for i in range(1000)]') # In[31]: get_ipython().run_line_magic('timeit', 'numpy.arange(1000)**2') # ## Exporting and Converting Notebooks # # In Jupyter, one can convert an `.ipynb` notebook document file into various static formats via the `nbconvert` tool. Currently, nbconvert is a command-line tool, run as a script using Jupyter. # In[33]: get_ipython().system('jupyter nbconvert --to html "Introduction to Jupyter Notebooks.ipynb"') # Currently, `nbconvert` supports HTML (default), LaTeX, Markdown, reStructuredText, Python, and HTML5 slides for presentations. Some types can be post-processed, such as LaTeX to PDF (this requires [Pandoc](http://johnmacfarlane.net/pandoc/) to be installed, however).
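# `nbconvert` can also be driven from Python rather than the command line, which is handy when conversion is part of a larger script. Here is a minimal sketch using the `HTMLExporter` class (the filename is simply this notebook's; adjust as needed):
#
# ```python
# from nbconvert import HTMLExporter
#
# exporter = HTMLExporter()
# # from_filename returns the converted document plus a dict of extracted resources
# body, resources = exporter.from_filename("Introduction to Jupyter Notebooks.ipynb")
#
# with open("Introduction to Jupyter Notebooks.html", "w") as f:
#     f.write(body)
# ```
#
# For one-off conversions, though, the command-line interface used below is usually simpler.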
# In[35]: get_ipython().system('jupyter nbconvert --to pdf "Introduction to Jupyter Notebooks.ipynb"') # A very useful online service is the [IPython Notebook Viewer](http://nbviewer.ipython.org), which displays your notebook as a static HTML page that can easily be shared with others: # In[36]: IFrame("http://nbviewer.ipython.org/2352771", width='100%', height=350) # GitHub supports the [rendering of Jupyter Notebooks](https://gist.github.com/fonnesbeck/670e777406a2f2bfb67e) stored in its repositories. # ## Parallel IPython # # The IPython architecture consists of four components, which reside in the `ipyparallel` package: # # 1. **Engine** The IPython engine is a Python instance that accepts Python commands over a network connection. When multiple engines are started, parallel and distributed computing becomes possible. An important property of an IPython engine is that it blocks while user code is being executed. # # 2. **Hub** The hub keeps track of engine connections, schedulers, and clients, and persists all task requests and results in a database for later use. # # 3. **Schedulers** All actions that can be performed on the engine go through a Scheduler. While the engines themselves block when user code is run, the schedulers hide that from the user to provide a fully asynchronous interface to a set of engines. # # 4. **Client** The primary object for connecting to a cluster. # # ![IPython architecture](images/ipython_architecture.png) # (courtesy Min Ragan-Kelley) # # This architecture is implemented using the ØMQ messaging library and the associated Python bindings in `pyzmq`. # # ### Running parallel IPython # # To enable the IPython Clusters tab in Jupyter Notebook: # # ipcluster nbextension enable # # When you then start a Jupyter session, you should see the following in your **IPython Clusters** tab: # # ![parallel tab](images/parallel_tab.png) # Before running the next cell, make sure you have first started your cluster; you can use the [clusters tab in the dashboard](/#tab2) to do so. # # Select the number of IPython engines (nodes) that you want to use, then click **Start**. # In[37]: from ipyparallel import Client client = Client() dv = client.direct_view() # In[38]: len(dv) # In[39]: def where_am_i(): import os import socket return "In process with pid {0} on host: '{1}'".format( os.getpid(), socket.gethostname()) # In[40]: where_am_i_direct_results = dv.apply(where_am_i) where_am_i_direct_results.get() # Let's now consider a useful function that we might want to run in parallel. Here is a version of the approximate Bayesian computation (ABC) algorithm. # In[41]: import numpy def abc(y, N, epsilon=[0.2, 0.8]): trace = [] while len(trace) < N: # Simulate from priors mu = numpy.random.normal(0, 10) sigma = numpy.random.uniform(0, 20) x = numpy.random.normal(mu, sigma, 50) #if (np.linalg.norm(y - x) < epsilon): if ((abs(x.mean() - y.mean()) < epsilon[0]) & (abs(x.std() - y.std()) < epsilon[1])): trace.append([mu, sigma]) return trace # In[42]: y = numpy.random.normal(4, 2, 50) # Let's try running this on one of the cluster engines: # In[43]: dv0 = client[0] dv0.block = True dv0.apply(abc, y, 10) # This fails with a NameError because NumPy has not been imported on the engine to which we sent the task.
Each engine has its own namespace, so we need to import whatever modules we will need prior to running our code: # In[44]: dv0.execute("import numpy") # In[45]: dv0.apply(abc, y, 10) # An easier approach is to use the `%%px` parallel cell magic to run the import on every engine at once: # In[46]: get_ipython().run_cell_magic('px', '', 'import numpy\n') # This magic can be used to execute the same code on all nodes. # In[47]: get_ipython().run_cell_magic('px', '', 'import os\nprint(os.getpid())\n') # In[48]: get_ipython().run_cell_magic('px', '', "%matplotlib inline\nimport matplotlib.pyplot as plt\nimport os\ntsamples = numpy.random.randn(100)\nplt.hist(tsamples)\n_ = plt.title('PID %i' % os.getpid())\n") # ## JupyterHub # # [JupyterHub](https://github.com/jupyterhub/jupyterhub) is a server that gives multiple users access to Jupyter notebooks, running an independent Jupyter notebook server for each user. # # To use JupyterHub, you need a Unix server (typically Linux) running somewhere that is accessible to your team on the network. The JupyterHub server can be on an internal network at your organisation, or it can run on the public internet (in which case, take care with security). Users access JupyterHub in a web browser by going to the IP address or domain name of the server. # # Three actors: # # - multi-user Hub (tornado process) # - configurable HTTP proxy (node-http-proxy) # - multiple single-user IPython notebook servers (Python/IPython/tornado) # # Basic principles: # # - Hub spawns proxy # - Proxy forwards requests to Hub by default # - Hub handles login, and spawns single-user servers on demand # - Hub configures proxy to forward URL prefixes to single-user servers # # To start the server, run the command: # # jupyterhub # # and then visit http://localhost:8000 and sign in with your Unix credentials. # # To allow multiple users to sign into the server, you will need to run the jupyterhub command as a privileged user, such as root. The wiki describes how to run the server as a less privileged user, which requires more configuration of the system. # # ![jupyterhub animation](https://657cea1304d5d92ee105-33ee89321dddef28209b83f19f06774f.ssl.cf1.rackcdn.com/jupyterhub-00f39fede1ec8780cdc3163052f632fb6980b72d43a72b3c3650257a6b9ed02d.gif) # # *(animation courtesy of Jessica Hamrick)* # ## Links and References # * [IPython Notebook Viewer](http://nbviewer.ipython.org) Displays static HTML versions of notebooks, and includes a gallery of notebook examples. # # * [A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data](http://ged.msu.edu/papers/2012-diginorm/) A landmark example of reproducible research in genomics: Git repo, IPython notebook, data, and scripts. # # * Jacques Ravel and K. Eric Wommack. 2014. [All Hail Reproducibility in Microbiome Research](http://www.microbiomejournal.com/content/pdf/2049-2618-2-8.pdf). Microbiome, 2:8. # # * Benjamin Ragan-Kelley et al. 2013. [Collaborative cloud-enabled tools allow rapid, reproducible biological insights](http://www.nature.com/ismej/journal/v7/n3/full/ismej2012123a.html). The ISME Journal, 7, 461–464. doi:10.1038/ismej.2012.123.