Literate / reproducible exploratory data analysis using Jupyter Notebooks¶

Tony Hirst

Computing and Communications

https://blog.ouseful.info https://github.com/ouseful-demos/

Jupyter notebooks are a browser-based, interactive environment widely used in computational research and commercial data science. Notebooks provide a blended research environment within which rich text (HTML), code and outputs generated from code, such as charts and tables, or even static or interactive maps, can be created and displayed.

Using simple analysis scripts, typically written using Python or R, researchers can create narrated, readable, reproducible, shareable research workflows as well as all-in-one “text+source code” versions of research papers, where the code used to produce analyses reported in the paper is self-contained within the notebook itself.

In this workshop you will have an opportunity to see how notebooks can be used and even try them out yourself. All you need is a desktop computer, laptop, tablet or even phone running a Chrome or Firefox browser.

Play Along¶

Reproducibility...¶

Can you:

find that data source again?
clean the data the same way as last time?
generate the same diagram?
recreate the analysis?
get the code to work the same way?
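One way to make a couple of these questions easier to answer is sketched below. It is a minimal illustration, not a prescribed workflow: the sample data, the `sha256_of` helper and the `clean` function are all made up for this example. The idea is to record a fingerprint of the raw data and keep the cleaning step in a function that always gives the same result for the same input.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return a hex fingerprint that changes if the data changes."""
    return hashlib.sha256(data).hexdigest()


def clean(rows):
    """Strip surrounding whitespace and drop empty rows."""
    return [row.strip() for row in rows if row.strip()]


# Hypothetical raw data standing in for a downloaded file.
raw = b"name,score\n alice,3 \n\nbob,5\n"

# Record the fingerprint alongside the analysis so the source can be checked later.
fingerprint = sha256_of(raw)

# The cleaning step is a plain function, so it can be rerun the same way each time.
cleaned = clean(raw.decode("utf-8").splitlines())

print(fingerprint)
print(cleaned)
```

If the fingerprint recorded in the notebook no longer matches the file you have, you are not working with the same data as last time.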

Literate Computing¶

Is there:

a clear linear narrative;
comments, qualifications and context to help you make sense, a year down the line, of:
- your data?
- your code?
- your analyses?
- your results?

Jupyter...¶

is an ecosystem of tools and protocols for helping you:
- access self-contained computational environments
- create rich, interactive documents that blend:
  - text;
  - code;
  - code outputs.
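To make the "blend" concrete, the sketch below builds a tiny notebook document by hand using only the standard library. It assumes the version 4 notebook layout (a list of cells, each with a `cell_type` and `source`); the cell contents are illustrative, and in practice the notebook interface writes this file for you.

```python
import json

# A hand-built, minimal notebook document (nbformat 4 layout).
# The cell contents here are illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # Text: a markdown cell holding narrative.
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Using Code\n", "Some narrative text."],
        },
        {
            # Code, plus the output captured when the cell was run.
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ['print("hello world")'],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hello world\n"]}
            ],
        },
    ],
}

# List the cell types to show text and code sitting side by side.
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)

# The whole document serialises to JSON, which is what an .ipynb file contains.
ipynb_text = json.dumps(notebook, indent=1)
```

Because text, code and outputs live in one file, sharing the file shares all three.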

This is a presentation powered by a Jupyter Notebook...¶

...one possible interface you can use to access Jupyter computing environments.

Using Code¶

Code is entered and executed via code cells. The execution environment is determined by the notebook kernel attached to the notebook.

This notebook has been associated with a Python kernel, which means we can write, and execute, Python code in the cells:

In [1]:

print("hello world")

hello world

`Jupyter_py_and_R.ipynb` Demo ¶

`Food Standards Agency.ipynb` Demo ¶

Text Analysis¶

A wide range of packages support analysis of English, as well as classical, texts.

spaCy - Visualise Sentence Dependency Structure¶

In [4]:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
displacy.render(doc, style="dep")

spaCy - Named Entity Tagging¶

In [7]:

text = "The current Chancellor of the Open University is Martha Lane Fox."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent")

The current Chancellor of the Open University ORG is Martha Lane Fox PERSON .

NLTK - Text Analysis Package¶

(See also: CLTK, the Classical Language Toolkit, for classical text analysis)

In [15]:

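import nltk

# Uncomment on first run to download the corpus and tokenizer data
#nltk.download('gutenberg')
#nltk.download('punkt')

# List the texts available in the Gutenberg corpus
nltk.corpus.gutenberg.fileids()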

[nltk_data] Downloading package punkt to /Users/tonyhirst/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

Out[15]:

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Load Text and View sentences¶

In [16]:

from nltk.corpus import gutenberg

macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
macbeth_sentences[:2]

Out[16]:

[['[',
  'The',
  'Tragedie',
  'of',
  'Macbeth',
  'by',
  'William',
  'Shakespeare',
  '1603',
  ']'],
 ['Actus', 'Primus', '.']]

Text Concordance¶

In [26]:

# Load full text
macbeth = nltk.Text(nltk.corpus.gutenberg.words('shakespeare-macbeth.txt'))

# Text concordances
macbeth.concordance("spot")

Displaying 2 of 2 matches:
er of an houre Lad . Yet heere ' s a spot Doct . Heark , she speaks , I will s
ce the more strongly La . Out damned spot : out I say . One : Two : Why then '
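To see what a concordance is doing, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: `keyword_in_context` and its `width` parameter are hypothetical names for this example, and the token list is a short hand-typed sample based on the output above.

```python
def keyword_in_context(tokens, keyword, width=5):
    """Return each occurrence of keyword with up to `width` tokens either side."""
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively so "Spot" and "spot" both match.
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hand-typed sample tokens, not the full play.
tokens = "Yet heere ' s a spot . Out damned spot : out I say .".split()

kwic = keyword_in_context(tokens, "spot", width=3)
for line in kwic:
    print(line)
```

Each line shows the keyword with a fixed window of neighbouring tokens, which is the same idea as the aligned display above.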

Dispersion Plots¶

In [24]:

macbeth.dispersion_plot(['Macbeth', 'Macduff', 'Banquo'])
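A dispersion plot marks where in the text each word occurs. The sketch below computes those positions without drawing anything; it is an illustration only, `word_offsets` is a hypothetical helper, and the sample sentence is made up rather than taken from the play.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Made-up sample tokens standing in for the full text.
tokens = "Macbeth and Banquo meet ; later Macduff seeks Macbeth".split()

offsets = word_offsets(tokens, ["Macbeth", "Macduff", "Banquo"])
print(offsets)
```

Plotting each list of positions along a horizontal axis, one row per word, gives the kind of chart the cell above produces.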

Other examples¶

https://github.com/ouseful-demos/getting-started-with-notebooks

Reproducible Computing Environments¶

Computing environments where you guarantee that certain software packages:

are installed;
are available.
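A quick way to check what an environment actually provides is sketched below, using the standard library's `importlib.metadata` (Python 3.8 or later). The `installed_versions` helper and the list of package names are assumptions for this example; `no-such-package-xyz` is a deliberately fake name to show the "missing" case.

```python
from importlib import metadata


def installed_versions(package_names):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Package names are illustrative; swap in whatever your analysis needs.
report = installed_versions(["spacy", "nltk", "no-such-package-xyz"])
for name, version in report.items():
    print(name, version or "NOT INSTALLED")
```

Recording this report in the notebook gives a reader something to compare against when they try to rerun the analysis.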

Reproducible Computing Environments Using Jupyter Tools¶

- MyBinder
- repo2docker
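As a small illustration, a MyBinder launch link for a public GitHub repository can be assembled from the organisation, repository name and a git reference. The sketch below assumes the `https://mybinder.org/v2/gh/...` URL pattern and uses `HEAD` as the default reference; `binder_url` is a hypothetical helper. It also assumes the repository describes its environment in a file that repo2docker can read (for example a `requirements.txt`), which has not been checked here.

```python
def binder_url(org, repo, ref="HEAD"):
    """Build a MyBinder launch URL for a public GitHub repository."""
    return f"https://mybinder.org/v2/gh/{org}/{repo}/{ref}"


# Repository taken from the "Other examples" link above.
url = binder_url("ouseful-demos", "getting-started-with-notebooks")
print(url)
```

Opening such a link asks MyBinder to build the environment and start a notebook server in the browser.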