Procedural Notebooks¶

Typically a notebook's author will begin an idea from a blank documents in an editable state. Through cycles of interactive computing an author will transform the notebook's data by adding narrative, code, and metadata. The notebook's cells are parts of a whole computable document described by the notebook format.

The interactive in-memory editing mode is a critical, but fleeting stage in the life of a computable document. Notebooks spend most of their existence as whole & static files on disk. The static state of notebooks are reusable; and for notebooks to be reusable they must be reused.

Procedural notebooks are readable and reusable literate documents that can be executed successfully in other contexts like documention, module development, or external jobs. This notebook explores the reusability of procedural notebooks that successfully Restart and Run All.

This literate document can be viewed as a notebook, presentation, or View on GitHub ¶

Motivation¶

Procedural notebooks are inspired by Paco Nathan's Oriole: a new learning medium based on Jupyter + Docker given at Jupyter Day Atlanta 2016. In Paco's unofficial styleguide for authoring Jupyter notebooks he suggests:

clear all output then "Run All" -- or it didn't happen

Procedural notebooks¶

... restart and run all, or they don't. Their reusability can be tested in different contexts.
... change over time
... encapsulate cycles of non-structured,

structured, and literate programming actions.

... can be executed in other contexts like testing, document conversion, or compute...
... can be tested as parts in interactive mode
... can be tested as a whole in a procedural mode
... may be used to create sophisticated software projects.

This notebook is procedural notebook¶

Its cells Restart and Run All to create a module and python package called particles:

Create, describe, and test source code for a project we call particles.
Copy the source code to a notebook called particles.ipynb
Convert particles.ipynb to particles.py and a Python package called particles.

particles is inspired by the New York Times R&D The Future of News is Not an Article. particles treat elements of computable documents as data and modular components.

Procedurally create the particles module¶

readme.ipynb generates the particles module either in interactive mode, or procedurally from a converted Python script.

attach is a callable used by readme to append the recent Input as cell source to particles.ipynb; it is erroneous to the particles module. _If the readme.ipynb cells are run out of order then particles.ipynb could be created incorrectly.

In [1]:

from nbformat import v4, NotebookNode
nb, particles = 'particles.ipynb', v4.new_notebook();
def attach(nb:NotebookNode=particles)->None:
    """attach an input to another notebook removing attach statements.
    >>> nb = v4.new_notebook();
    >>> assert attach(nb) or ('doctest' in nb.cells[-1].source)"""
    'In' in globals() and nb.cells.append(v4.new_code_cell('\n'.join(
        str for str in In[-1].splitlines() if not str.startswith('attach'))))

In [2]:

    %%file requirements.txt
    pandas
    matplotlib

Overwriting requirements.txt

build `particles.ipynb`¶

Many cells in readme.ipynb have lived and died before you read this line.

The code cell below will be appended to particles.ipynb. It import__s tools into __readme.ipynb's interactive mode. It now becomes quite easy to iteratively develop and test parts of the procedural document.

In [3]:

attach(particles)
"""particles treat notebooks as data"""

Out[3]:

'particles treat notebooks as data'

In [4]:

attach(particles)
from nbformat import reads, v4 
from pandas import concat, DataFrame, to_datetime 
from pathlib import Path 

callables in `particles`¶

Create two main functions for particles to export

In [5]:

attach()
def read_notebooks(dir:str='.')->DataFrame:
    """Read a directory of notebooks into a pandas.DataFrame
    >>> df = read_notebooks('.')
    >>> assert len(df) and isinstance(df, DataFrame)"""
    return concat({
        file: DataFrame(reads(file.read_text(), 4)['cells'])
        for file in Path(dir).glob('*.ipynb')
    }).reset_index(-1, drop=True)

The read_notebooks index is a pathlib object containing extra metadata. files_to_data extracts the stat properties for each file.

In [6]:

attach()
def files_to_data(df:DataFrame)->DataFrame:
    """Transform an index of Path's to a dataframe of os_stat.
    >>> df = files_to_data(read_notebooks())
    """
    stats, index = [], df.index.unique()
    for file in index:
        stat = file.stat() 
        stats.append({
            key: to_datetime(
                getattr(stat, key), unit=key.endswith('s') and key.rsplit('_')[-1] or 's'
            ) if 'time' in key else getattr(stat, key)
            for key in dir(stat) if not key.startswith('_') and not callable(getattr(stat, key))})
    # Append the change in time to the dataframe.
    return DataFrame(stats, index).pipe(lambda df: df.join((df.st_mtime - df.st_birthtime).rename('dt')))

Control Flow in Procedural Notebooks¶

A procedural notebooks will use clues from a namespace to decide what statements to execute in different contexts.

In [7]:

if __name__ != '__main__': assert __name__+'.py' == __file__

In Jupyter mode¶

__name__ == '__main__', but nothing is known about the python object __file__.

In Setup mode¶

__name__ == '__main__' and assert __file__ .

As a python package mode¶

__name__ + '.py' == __file__.

`get_ipython`¶

The get_ipython context must be manually imported to use magics in converted notebooks.

In [8]:

    from IPython import get_ipython

Controlling value assignment¶

Introspect the interactive Jupyter namespace to control expressions in procedural notebooks.

thing = get_ipython().user_ns.get('thing', 42):

`readme` procedures to make `particles`¶

Below are the procedures to test and create the particles package.

doctests were declared in each of our functions. doctest can be run in an interactive notebook session, unittest cannot.

doctest catches a lot of errors when it is in the Restart and Run All pipeline. It is a great place to stash repeatedly typed statements.
When the tests pass write the particles.ipynb notebook.

In [9]:

if __name__ == '__main__':
    print(__import__('doctest').testmod())
    Path(nb).write_text(__import__('nbformat').writes(particles))

TestResults(failed=0, attempted=5)

Transform both readme.ipynb and the newly minted particles.ipynb to python scripts.
Autopep it because we can.
Rerun the same tests on particles.py

In [10]:

if __name__ == '__main__' and '__file__' not in globals():
    !jupyter nbconvert --to python --TemplateExporter.exclude_input_prompt=True particles.ipynb readme.ipynb
    !autopep8 --in-place --aggressive readme.py particles.ipynb
    !python -m doctest particles.py & echo "success"
    !jupyter nbconvert --to markdown --TemplateExporter.exclude_input_prompt=True readme.ipynb

[NbConvertApp] Converting notebook particles.ipynb to python
[NbConvertApp] Writing 1234 bytes to particles.py
[NbConvertApp] Converting notebook readme.ipynb to python
[NbConvertApp] Writing 10488 bytes to readme.py
success
[NbConvertApp] Converting notebook readme.ipynb to markdown
[NbConvertApp] Writing 10407 bytes to readme.md

setuptools will install the particles package using the conditions for setup mode.

Install the particles package

python readme.py develop

In [11]:

    if __name__ == '__main__' and '__file__' in globals():
        __import__('setuptools').setup(
            name="particles", 
            py_modules=['particles'], 
            install_requires=['notebook', 'pandas'])

Reusing `particles`¶

A notebook that can be imported is reusable.

Particles can now be imported into the current scope. particles allow the user to explore notebooks and their cells as data.

In [12]:

import particles
assert particles.__file__.endswith('.py')
%matplotlib inline
from matplotlib import pyplot as plt
df = particles.read_notebooks()
df.sample(5)

Out[12]:

	cell_type	execution_count	metadata	outputs	source
readme.ipynb	code	NaN	{}	[]	if __name__ == '__main__':\n print(__import...
readme.ipynb	markdown	NaN	{'slideshow': {'slide_type': '-'}}	NaN	### In Jupyter mode\n\n> `__name__` == **`...
readme.ipynb	code	NaN	{'collapsed': True}	[]	from IPython import get_ipython
particles.ipynb	code	NaN	{}	[]	def files_to_data(df:DataFrame)->DataFrame:\n ...
readme.ipynb	markdown	NaN	{}	NaN	> __particles__ is inspired by the New York T...

Quantifying lines of code¶

In [13]:

df.source.str.split('\n').apply(len).groupby([df.index, df.cell_type]).sum().to_frame('lines of ...').unstack(-1)

Out[13]:

	lines of ...
cell_type	code	markdown
particles.ipynb	26.0	NaN
readme.ipynb	66.0	108.0

The distribution of markdown and code cells in the particles project.¶

In [14]:

    
    df.cell_type.groupby(df.index).value_counts().unstack('cell_type').apply(lambda df: df.plot.pie() and plt.show());

Summary¶

This document must Restart and Run All to acheive the goals of creating the particles module.

Procedural notebooks Restart and Run All or they don't; they can be tested.
Not all notebooks survive, the lucky ones become procedural notebooks.
Literate procedural notebooks reinforce readability and reusability to reproducible work.
Procedural tend to maintain a longer shelf life than an exploratory notebook.
Interactive programming is complex and an author will rely on multiple styles of programming to acheive a procedural document.

In [ ]: