In [2]:
from IPython.display import Image

Full-stack science workflow += the IPython notebook

My life trajectory (physically) so far:

In [3]:
%run talktools

Plan B Career highlights

  • 2005 Graduated from Vanderbilt University
    • Majored in Physics, Mathematics, and Philosophy
  • 2005--2011 University of California San Diego
    • PhD in Physics
  • 2011--2014 Postdoc at Swinburne University (working with Michael Murphy)
  • 2014--? -> "Silicon Valley Tech Industry"

Where I started computer coding-wise

  • 2005 I started grad school (< 10 years ago)
    • I knew essentially nothing about computers besides how to send an email.
    • programmed about 100 lines in Mathematica
    • Didn't know cd or ls; had never used Linux
  • 2006
    • Did 1 formal programming course "Intro for engineers, C"
    • wrote first IDL program
      • Couldn't figure out how to do a for-loop in IDL, so I wrote a compiled C program to write out the full IDL code

Where I ended up in less than 10 years

  • 2014
    • Created several Python modules (installable via pip on the command line).
    • Submitted pull requests to an open source astronomy library (astropy).
    • Proponent of open science.
    • Technical reviewer for programming books (Mastering Scipy).
    • Accepted into the Insight Data Science Fellowship.

IPython notebook

The IPython notebook runs a web server locally on your computer, which you connect to via your browser. The server starts a Python kernel, so the code you type in the browser is executed as actual Python code and the results are sent back to the page.

This sounds like a gimmick. It is not.

This presentation includes examples from many places, including a few from a talk that I gave with David Lagattuta at Swinburne last year.

Glimpse of the future

ipython notebook + + nbviewer == easily shareable and reproducible work

External Example of the future of science (hopefully)

Email that link to anyone in the world who has a browser. No python, no ipython, nothing needs to be installed. The barrier to sharing the analysis here is about as close to zero as we can get.

Beginning notes

First of all -- don't try to take notes, just let the amazing wash over you.

Several Ideas and Themes

  1. Challenge the assumption that the way we do things is 'the obviously best way to do things'.
    • There is a knee-jerk reaction, which is human as far as I can tell, that "different" feels bad and is resisted for emotional reasons.
    • Why do software companies do pair coding? Jeffrey Dean and Sanjay Ghemawat?
    • Why do they do Code Review?
    • Why isn't "how we do science" actually tested in a scientific way? Seriously?
  2. Don't reinvent the wheel (too much).
    • As Andrew Hopkins says, get a real sense of diminishing returns.
  3. Technical skills to learn that can help you massively.
    • Workflow matters more than you suspect.

Beginning notes (2)

  1. Think about what you think science should be -- it's ultimately a social field and not static.
    • Open code, open data, open notebooks.
    • You help choose where it goes, so think carefully about what you're contributing.
    • Not just data, or papers, or results, but the system that produces knowledge for the future.
    • Glad to hear at this HWWS:
      • Brian Schmidt say that the quality of papers matters, not the total quantity
      • Steven Tingay ask whether citations are the way to go
      • Amanda Bauer quoting Einstein: "Not everything that can be counted counts and not everything that counts can be counted."
  2. Examples of code, techniques and otherwise that I feel you will benefit from being exposed to.
  3. Example of my interview with Insight (how I got a Data Science 'job').
    • Ask me back in a year how to be a successful Data Scientist!

Starting Suggestions

Know your instrument

As Steven said, it's important to know your instrument.

  • Learn a real editor (really learn it, don't just use it)
    • Unless you are already amazing at vim/emacs, go with something like Sublime Text 2.
  • Use syntax highlighting.
  • Learn bash
  • Learn to watch yourself
    • Make efficient those tasks that you repeatedly do.

Example -- bash and latex

Plan B Jobs

  • get an online presence (github, stackoverflow, linkedin, etc.)
  • Learn open/free technology that is used in industry
    • python (or r) > idl
    • git, github
    • SQL
    • Data visualization; Mike Bostock's D3.js, public notebooks
  • Actively work on your skills
In [50]:
# Create a [list]
days_of_the_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                    'Friday', 'Saturday', 'Sunday']
days_of_the_week

['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
In [51]:
# Simple for-loop
for day in days_of_the_week:
    print day
In [52]:
# Double for-loop
for day in days_of_the_week:
    for letter in day:
        print letter,
M o n d a y T u e s d a y W e d n e s d a y T h u r s d a y F r i d a y S a t u r d a y S u n d a y
In [9]:
for day in days_of_the_week:
    for letter in day:
        print letter.lower(),
m o n d a y t u e s d a y w e d n e s d a y t h u r s d a y f r i d a y s a t u r d a y s u n d a y
In [53]:
letters = [letter for day in days_of_the_week 
                       for letter in day]

letters = [letter for day in days_of_the_week for letter in day]
print letters
['M', 'o', 'n', 'd', 'a', 'y', 'T', 'u', 'e', 's', 'd', 'a', 'y', 'W', 'e', 'd', 'n', 'e', 's', 'd', 'a', 'y', 'T', 'h', 'u', 'r', 's', 'd', 'a', 'y', 'F', 'r', 'i', 'd', 'a', 'y', 'S', 'a', 't', 'u', 'r', 'd', 'a', 'y', 'S', 'u', 'n', 'd', 'a', 'y']
In [54]:
sorted_letters = sorted([x.lower() for x in letters])
print sorted_letters
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'h', 'i', 'm', 'n', 'n', 'n', 'o', 'r', 'r', 'r', 's', 's', 's', 's', 's', 't', 't', 't', 'u', 'u', 'u', 'u', 'w', 'y', 'y', 'y', 'y', 'y', 'y', 'y']
In [55]:
unique_sorted_letters = sorted(set(sorted_letters))
print "There are", len(unique_sorted_letters), "unique letters in the days of the week."
print "They are:", ''.join(unique_sorted_letters)
There are 15 unique letters in the days of the week.
They are: adefhimnorstuwy
In [56]:
def first_three(input_string):
    """Takes an input string and returns the first 3 characters."""
    return input_string[:3]
In [57]:
[first_three(day) for day in days_of_the_week]
['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
In [58]:
def last_N(input_string, number=2):
    """Takes an input string and returns the last N characters."""
    return input_string[-number:]
In [61]:
[last_N(day, 153) for day in days_of_the_week if len(day) > 6]
['Tuesday', 'Wednesday', 'Thursday', 'Saturday']
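
The cell above asks for the last 153 characters of strings that are at most 9 characters long, and it still works. A minimal sketch of why, using nothing beyond built-in strings (the variable names are made up for illustration): slices quietly clamp to the ends of the string, while a single out-of-range index does not.

```python
day = "Tuesday"

# A slice that reaches past the start of the string is clamped to the string.
whole_day = day[-153:]
first_three = day[:3]

# Indexing a single out-of-range position raises instead of clamping.
try:
    day[153]
    index_failed = False
except IndexError:
    index_failed = True

print(whole_day, first_three, index_failed)
```

This is why last_N never needs to check the length of its input first.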

But seriously, list comprehensions are awesome

In [2]:
from math import pi

print [str(round(pi, i)) for i in xrange(2, 9)]
['3.14', '3.142', '3.1416', '3.14159', '3.141593', '3.1415927', '3.14159265']
In [3]:
list_of_lists = [[i, round(pi, i)] for i in xrange(2, 9)]
print list_of_lists
[[2, 3.14], [3, 3.142], [4, 3.1416], [5, 3.14159], [6, 3.141593], [7, 3.1415927], [8, 3.14159265]]
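
If the syntax is new to you, it may help to see that a list comprehension is shorthand for an append loop. A small sketch, written with range so it also runs under Python 3; the even-decimals filter and the variable names are made up for illustration:

```python
from math import pi

# Explicit loop: build the list one append at a time.
rounded_loop = []
for i in range(2, 9):
    if i % 2 == 0:  # keep only even numbers of decimals
        rounded_loop.append(round(pi, i))

# The same result as a single list comprehension with a filter clause.
rounded_comprehension = [round(pi, i) for i in range(2, 9) if i % 2 == 0]

print(rounded_comprehension)
```

The two lists are equal; the comprehension just says it in one line.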

Do use python like python

You know you're doing it wrong when it starts to look like C.

Best practices

There are generally good reasons that things become "best practices", and just because you don't see the reasons for them doesn't mean you shouldn't adopt them.

Style Guide for Python Code: PEP 8

In [4]:
# Let this be a warning to you!

# If you see python code like the following in your work:

for x in range(len(list_of_lists)):
    print "Decimals:", list_of_lists[x][0], 
    print "expression:", list_of_lists[x][1]
Decimals: 2 expression: 3.14
Decimals: 3 expression: 3.142
Decimals: 4 expression: 3.1416
Decimals: 5 expression: 3.14159
Decimals: 6 expression: 3.141593
Decimals: 7 expression: 3.1415927
Decimals: 8 expression: 3.14159265
In [6]:
# Change it to look more like this: 

for decimal, rounded_pi in list_of_lists:
    print "Decimals:", decimal, "expression:", rounded_pi
print list_of_lists
Decimals: 2 expression: 3.14
Decimals: 3 expression: 3.142
Decimals: 4 expression: 3.1416
Decimals: 5 expression: 3.14159
Decimals: 6 expression: 3.141593
Decimals: 7 expression: 3.1415927
Decimals: 8 expression: 3.14159265
[[2, 3.14], [3, 3.142], [4, 3.1416], [5, 3.14159], [6, 3.141593], [7, 3.1415927], [8, 3.14159265]]
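
Two more built-ins cover most of the remaining reasons people reach for range(len(...)): enumerate when you need the position, and zip when you are walking two lists in step. A short sketch with made-up lists:

```python
decimals = [2, 3, 4]
rounded_values = [3.14, 3.142, 3.1416]

# enumerate yields (position, item) pairs, so no manual index bookkeeping.
numbered = ["{0}: {1}".format(position, value)
            for position, value in enumerate(rounded_values)]

# zip pairs up items from two lists that share an ordering.
paired = [(decimal, value) for decimal, value in zip(decimals, rounded_values)]

print(numbered)
print(paired)
```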

Learn about dictionaries

You won't regret it.
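
A minimal sketch of what a dictionary buys you, reusing the days of the week from earlier. The names letter_counts and ranked are made up for illustration:

```python
days_of_the_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                    'Friday', 'Saturday', 'Sunday']

# Map each lowercase letter to the number of times it appears.
letter_counts = {}
for day in days_of_the_week:
    for letter in day.lower():
        letter_counts[letter] = letter_counts.get(letter, 0) + 1

# Look things up by key instead of by position.
print("d appears {0} times".format(letter_counts['d']))

# Most frequent letters first; ties broken alphabetically.
ranked = sorted(letter_counts.items(), key=lambda pair: (-pair[1], pair[0]))
print(ranked[:3])
```

The point is the lookup: letter_counts['d'] reads like the question you are asking, with no index arithmetic.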

Let's do something a bit more useful

As scientists, we're often going to need to plot results and use numbers to do analysis.

In [22]:
import numpy as np
import matplotlib.pyplot as plt

# The following line is an ipython notebook trick
%matplotlib inline 

plt.rcParams['figure.figsize'] = 12, 8  # plotsize
In [23]:
x = np.arange(10000)

print "x      -> ", x  # notice the smart printing
print "x[:]   -> ", x[:]
print "x[0]   -> ", x[0] # first element 
print "x[0:5] -> ", x[0:5] # first 5 elements
print "x[-1]  -> ", x[-1] # last element
x      ->  [   0    1    2 ..., 9997 9998 9999]
x[:]   ->  [   0    1    2 ..., 9997 9998 9999]
x[0]   ->  0
x[0:5] ->  [0 1 2 3 4]
x[-1]  ->  9999
In [24]:
# A bit more complicated slicing
print x[-5:] # last five elements
print x[-5:-2] # fifth-from-last up to (not including) second-from-last
print x[-5:-1] # the 4 elements just before the final value
[9995 9996 9997 9998 9999]
[9995 9996 9997]
[9995 9996 9997 9998]
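
Slices also take an optional step. The same start:stop:step rules apply to plain Python lists, so this sketch uses a short list rather than a numpy array to stay dependency-free (the variable names are made up for illustration):

```python
x = list(range(10))

every_other = x[::2]   # start to end, taking every second element
reversed_x = x[::-1]   # a negative step walks backwards
middle = x[2:-2]       # drop two elements from each end
last_three = x[-3:]    # a negative start counts from the end

print(every_other, reversed_x, middle, last_three)
```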
In [25]:
# Single physical cloud with following physical parameters
def GaussFunc(x, amplitude, centroid, sigma):
    """Takes an array, and calculates a Gaussian with the following parameters. """
    return amplitude * np.exp(-0.5 * ((x - centroid) / sigma)**2)

feature_centroid = 5315.3
feature_amplitude = 2.3
feature_sigma = 1.5

wavelength = np.linspace(5305., 5330., 120)
tau = GaussFunc(wavelength, feature_amplitude, feature_centroid, feature_sigma)

flux = np.exp(-tau)
sigma = 0.05
noise = np.random.randn(len(flux)) * sigma
observed = flux + noise
error = np.ones_like(wavelength) * sigma
In [26]:
plt.plot(wavelength, flux) 
plt.plot(wavelength, observed) 
[<matplotlib.lines.Line2D at 0x104701b90>]
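
As a sanity check on the model above: the noiseless flux should bottom out at exp(-amplitude) at the centroid and recover to essentially 1 far from it. A sketch using only the math module and the same parameter values; the scalar helper gauss is a stand-in for GaussFunc and is not part of the original notebook.

```python
import math

def gauss(x, amplitude, centroid, sigma):
    """Scalar Gaussian with the same form as GaussFunc."""
    return amplitude * math.exp(-0.5 * ((x - centroid) / sigma) ** 2)

feature_centroid = 5315.3
feature_amplitude = 2.3
feature_sigma = 1.5

# At the centroid tau equals the amplitude, so the flux is exp(-2.3).
tau_centre = gauss(feature_centroid, feature_amplitude,
                   feature_centroid, feature_sigma)
flux_centre = math.exp(-tau_centre)

# The first grid point (5305.0) is almost 7 sigma away, so tau is negligible.
tau_edge = gauss(5305.0, feature_amplitude, feature_centroid, feature_sigma)
flux_edge = math.exp(-tau_edge)

# Full width at half maximum of the tau profile: 2 * sqrt(2 ln 2) * sigma.
fwhm = 2.0 * math.sqrt(2.0 * math.log(2.0)) * feature_sigma

print("flux at centre: {0:.4f}".format(flux_centre))
print("flux at edge:   {0:.6f}".format(flux_edge))
print("tau FWHM:       {0:.3f}".format(fwhm))
```

If the plotted curve does not dip to roughly a tenth of the continuum at the centroid, something upstream is off.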
In [27]:
np.savetxt("example.txt", np.transpose((wavelength, observed, flux, error, noise)))
In [28]:
# np.random.randn?
In [29]:
%less example.txt
In [30]:
wave = []
observed_flux = []
error = []
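for line in open("example.txt", 'r'):
    # line is still one long string here, so line[0] is its first character
    wave.append(line[0])
    observed_flux.append(line[1])
    error.append(line[3])
wave[:10] # Whoops! 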
['5', '5', '5', '5', '5', '5', '5', '5', '5', '5']
In [31]:
wave = []
observed_flux = []
error = []
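for line in open("example.txt", 'r'):
    line = line.split()
    # split() gives a list of column strings; nothing is converted to numbers
    wave.append(line[0])
    observed_flux.append(line[1])
    error.append(line[3])
wave[:10] # Dang! still strings!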
In [32]:
# If you see yourself doing this kind of thing... 

wave = []
observed_flux = []
error = []

for line in open("example.txt", 'r'):
    line = line.split()
    wave.append(float(line[0]))
    observed_flux.append(float(line[1]))
    error.append(float(line[3]))
wave = np.array(wave)
observed_flux = np.array(observed_flux)
error = np.array(error)
In [33]:
# Do this instead
wave, observed_flux, error = np.loadtxt("example.txt", usecols=(0, 1, 3), unpack=True)
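
To see what usecols and unpack are doing without touching the disk, here is a small sketch that feeds np.loadtxt an in-memory file. The three rows of numbers are made up to stand in for example.txt, and the io.StringIO usage assumes Python 3.

```python
import io
import numpy as np

# Three rows, five whitespace-separated columns, standing in for example.txt.
fake_file = io.StringIO(
    "5305.0 0.98 1.0 0.05 -0.02\n"
    "5305.2 1.03 1.0 0.05 0.03\n"
    "5305.4 0.99 1.0 0.05 -0.01\n"
)

# usecols picks columns 0, 1 and 3; unpack=True returns one array per column.
wave, observed_flux, error = np.loadtxt(fake_file, usecols=(0, 1, 3), unpack=True)

print(wave.dtype, wave)
```

The arrays come back as floats already, which is the whole point: no split(), no float(), no np.array() afterwards.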

Special Plotting

Look at the matplotlib gallery and find plots that are similar to what you want to do.

In [7]:
# Run this cell
In [ ]:
# And it loads the code here, ready to run (obviously without this comment). 

#!/usr/bin/env python

import numpy as np
from matplotlib.pyplot import figure, show, rc

# radar green, solid grid lines
rc('grid', color='#316931', linewidth=1, linestyle='-')
rc('xtick', labelsize=15)
rc('ytick', labelsize=15)

# force square figure and square axes looks better for polar, IMO
fig = figure(figsize=(8,8))
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8], polar=True, axisbg='#d5de9c')

r = np.arange(0, 3.0, 0.01)
theta = 2*np.pi*r
ax.plot(theta, r, color='#ee8d18', lw=3, label='a line')
ax.plot(0.5*theta, r, color='blue', ls='--', lw=3, label='another line')
ax.legend()  # draw the legend from the label= arguments above

show()
In [42]:
Image("../../../../Screenshots/Screenshot 2014-07-18 14.51.30.png")

Final Thoughts

  • Think about what you do (especially what you repeatedly do)
  • Reproducible science
    • ipython notebook or otherwise