In [10]:
%run talktools.py

A Portrait of One Scientist as a Graduate Student

Paul Ivanov

@ivanov

http://github.com/ivanov/

http://pirsquared.org/

TL;DR version: "What's easy, won't last. What lasts, won't be easy."

My background: (since this will be autobiographical)

  • Born in Moscow (Soviet Russia) (communism + socialism)
  • Started running GNU/Linux in 2000 (free software + open source)
  • BS in Computer Science from UC Davis
  • finishing up graduate program in Vision Science at UC Berkeley

My rotation in a primate electrophysiology lab:

Data is hard to get: 1-2 years training animal on task.

"Minor brain surgery" on every day of data collection: every day for 4-6 months, 6-10 hours per day.

Data is very rich. It is hoarded. With a very tight lid.

My naive conclusion: Data is precious. Free the DATA!

If data were just more accessible...

But the reality is that having accessible data is not enough...

http://crcns.org

You need the code, and I don't mean a tarball. And even that's not enough...

Step 0 -- version control everything

(including this presentation)

Yes, all scientists.

Git specifically?

It's not rocket science! There are sane GUIs for novice users.

I explained the benefits of version control to my biologist friend Sara, and put SmartGit on her machine. No more _v1, _v3_works, etc. "I didn't think it would be this easy".

Back in my home lab, we do computational experiments.

Unsupervised learning of natural signals. "How should the brain encode images given their properties?"

van Hateren images

Very popular dataset (camera calibrated, linearized, uncompressed, etc.). The paper to cite for it came out in 1998.

As of 2007, it had 336 citations according to Google Scholar (then the 99th most cited paper in the vision literature).

Today that number is up to 802.

Then in 2010, an email was sent to a vision community mailing list saying:

Does anyone have a copy of van Hateren database? I have been looking  
for the 4000 still image database. The links to images

http://hlab.phys.rug.nl/imlib/l1_200/index.html

are broken! And it looks like there is no mirror of the full database  
anywhere. I would appreciate your help and suggestions.

So I put up a mirror.

Shortly thereafter, another grad student in a lab in Germany (one of my academic nephews) did the same.

This happened again a year later with another dataset. Luckily, I had downloaded that one as well, and now host the canonical version.

Lesson learned: don't take today's data sources for granted.

Metalinks: multiple resources for the same data (http, ftp, bittorrent) in one container file.

Let's start using them!
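
To make the "one container file" idea concrete, here is a rough sketch that writes a Metalink document in the RFC 5854 XML layout using only the Python standard library. The function name, the file name and the example.org URLs are all placeholders made up for illustration, and the hash is computed from stand-in bytes rather than from any real dataset.

```python
import hashlib
import xml.etree.ElementTree as ET

METALINK_NS = "urn:ietf:params:xml:ns:metalink"  # namespace from RFC 5854


def _tag(name):
    """Qualify an element name with the Metalink namespace."""
    return "{%s}%s" % (METALINK_NS, name)


def build_metalink(name, size, sha256_hex, urls, torrent_url=None):
    """Return a Metalink document (as a string) describing one file.

    Hypothetical helper: it lists several mirrors plus an optional
    torrent for the same content, together with its size and SHA-256.
    """
    ET.register_namespace("", METALINK_NS)
    root = ET.Element(_tag("metalink"))
    file_el = ET.SubElement(root, _tag("file"), name=name)
    ET.SubElement(file_el, _tag("size")).text = str(size)
    ET.SubElement(file_el, _tag("hash"), type="sha-256").text = sha256_hex
    for url in urls:
        ET.SubElement(file_el, _tag("url")).text = url
    if torrent_url is not None:
        meta = ET.SubElement(file_el, _tag("metaurl"), mediatype="torrent")
        meta.text = torrent_url
    return ET.tostring(root, encoding="unicode")


# Placeholder content and locations, not a real dataset.
payload = b"stand-in for the real archive bytes"
doc = build_metalink(
    name="images.tar",
    size=len(payload),
    sha256_hex=hashlib.sha256(payload).hexdigest(),
    urls=["http://example.org/images.tar", "ftp://example.org/images.tar"],
    torrent_url="http://example.org/images.tar.torrent",
)
print(doc)
```

A download client that understands Metalink can then try each listed source in turn and check the result against the recorded hash.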

data integrity

Use a Python decorator attached to your data loading routines to verify hashes on load.

Let's see an example of what that looks like.
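
A minimal sketch of such a decorator, assuming the loader takes the file path as its first argument and that you recorded the expected SHA-256 when you first obtained the data. The names `verify_sha256` and `load_text` are invented for this example; the digest shown is the well-known SHA-256 of empty input, used here only so the example is self-checking.

```python
import functools
import hashlib


def verify_sha256(expected_hexdigest):
    """Decorator factory: check a file's SHA-256 before the loader runs."""
    def decorator(loader):
        @functools.wraps(loader)
        def wrapper(path, *args, **kwargs):
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in chunks so large data files need not fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            actual = digest.hexdigest()
            if actual != expected_hexdigest:
                raise ValueError(
                    "hash mismatch for %s: expected %s, got %s"
                    % (path, expected_hexdigest, actual))
            return loader(path, *args, **kwargs)
        return wrapper
    return decorator


# SHA-256 of zero bytes of input; a real loader would pin its dataset's hash.
EMPTY_SHA256 = (
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")


@verify_sha256(EMPTY_SHA256)
def load_text(path):
    """Hypothetical loader: return the file's contents as a string."""
    with open(path) as f:
        return f.read()
```

The loader itself stays unaware of hashing. If the file on disk has changed, the call fails loudly before any analysis runs on the wrong bytes.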

data preservation + data conservation + data integrity

I have some data, and a problem: I don't want to lose my data.

So I make a copy -- and now I have two problems.

Git Annex

distributed digital hoarding.

Keep track of file signatures (hashes)

In [4]:
%%bash 
cd ~/cur/siam; chmod +rwx -R tmp fake_usb;
rm -fr tmp fake_usb; # start with a clean slate
mkdir -p tmp
cd tmp
git init .
git annex init "local laptop"
Initialized empty Git repository in /home/pi/code/workspace/everest/siam/tmp/.git/
init local laptop ok
(Recording state in git...)
chmod: cannot access ‘fake_usb’: No such file or directory
In [5]:
%cd ~/cur/siam/tmp
/home/pi/code/workspace/everest/siam/tmp
In [6]:
%%bash
# let's just make a file to see how annex works
echo "pretend this is a large file" > original.dus
for x in {1..10000}; do echo GATTACA >> original.dus; done
ls -lh
total 80K
-rw-r--r-- 1 pi pi 79K Feb 28 17:37 original.dus

This is a ~80K file; let's check it into git annex.

In [7]:
!git annex add original.dus
add original.dus (checksum...) ok
(Recording state in git...)

Let's see what happened to it:

In [8]:
!ls -lh
total 4.0K
lrwxrwxrwx 1 pi pi 184 Feb 28 17:37 original.dus -> .git/annex/objects/vP/0j/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a

So by annexing the file, we've hashed its contents, and renamed the file to that hash, making a symbolic link to the file. (content-based addressing)
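
Content-based addressing is easy to sketch in Python. The function below is a made-up illustration, not git-annex's code: it only imitates the `SHA256-s<size>--<digest>` naming visible in the listing above, and leaves out the hashed directory layout (the `vP/0j` part).

```python
import hashlib


def sha256_annex_style_key(content):
    """Name content by its size and SHA-256: 'SHA256-s<size>--<hex>'."""
    return "SHA256-s%d--%s" % (len(content),
                               hashlib.sha256(content).hexdigest())


# Rebuild the demo file's bytes: one header line plus 10000 GATTACA lines.
content = b"pretend this is a large file\n" + b"GATTACA\n" * 10000
key = sha256_annex_style_key(content)
print(key)
```

Because the name is derived purely from the bytes, two copies of the same data get the same name wherever they live, and any change to the data produces a different name.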

It turns out git annex also staged this symbolic link for us in git.

In [9]:
!git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#	new file:   original.dus
#

Let's check that into git.

In [10]:
!git commit -m"original data of unusual size checked in"
[master (root-commit) fb0751e] original data of unusual size checked in
 1 file changed, 1 insertion(+)
 create mode 120000 original.dus
In [11]:
!git log
commit fb0751ef2b2044e623d855ba0b881d3744115cac
Author: Paul Ivanov <pi@berkeley.edu>
Date:   Thu Feb 28 17:39:29 2013 -0500

    original data of unusual size checked in

What did we actually check in? Just one line: a symbolic link pointing to the contents of original.dus.

In [12]:
!git log -p
commit fb0751ef2b2044e623d855ba0b881d3744115cac
Author: Paul Ivanov <pi@berkeley.edu>
Date:   Thu Feb 28 17:39:29 2013 -0500

    original data of unusual size checked in

diff --git a/original.dus b/original.dus
new file mode 120000
index 0000000..b30551c
--- /dev/null
+++ b/original.dus
@@ -0,0 +1 @@
+.git/annex/objects/vP/0j/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a
\ No newline at end of file
In [13]:
!git annex whereis ./original.dus
whereis original.dus (1 copy) 
  	74eda928-81f7-11e2-bfa9-e3a22b8f91cd -- here (local laptop)
ok

Let's copy this repository to an external hard drive:

In [14]:
!git clone ./ ../fake_usb
Cloning into '../fake_usb'...
done.
In [15]:
cd ../fake_usb/
/home/pi/code/workspace/everest/siam/fake_usb
In [16]:
!git annex init "pi's external harddrive"
init pi's external harddrive ok
(Recording state in git...)
In [17]:
ls -al
total 16
drwxr--r-- 3 pi pi 4096 Feb 28 17:40 ./
drwxr--r-- 6 pi pi 4096 Feb 28 17:40 ../
drwxr--r-- 9 pi pi 4096 Feb 28 17:40 .git/
lrwxrwxrwx 1 pi pi  184 Feb 28 17:40 original.dus -> .git/annex/objects/vP/0j/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a
In [18]:
!head original.dus
head: cannot open ‘original.dus’ for reading: No such file or directory

On the external hard drive, we only have a catalogue of the annexed files. We can grab them explicitly:

In [19]:
!git annex whereis original.dus
(merging origin/git-annex into git-annex...)
whereis original.dus (1 copy) 
  	74eda928-81f7-11e2-bfa9-e3a22b8f91cd -- origin (local laptop)
ok
In [20]:
!git annex get original.dus
get original.dus (from origin...) ok
(Recording state in git...)
In [21]:
ls -al
total 16
drwxr--r-- 3 pi pi 4096 Feb 28 17:40 ./
drwxr--r-- 6 pi pi 4096 Feb 28 17:40 ../
drwxr--r-- 9 pi pi 4096 Feb 28 17:41 .git/
lrwxrwxrwx 1 pi pi  184 Feb 28 17:40 original.dus -> .git/annex/objects/vP/0j/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a/SHA256-s80029--1a8fcd77de8926f8dd5d913d1f212e1c2e49509de1096e295952b556219f1e2a
In [22]:
!head original.dus
pretend this is a large file
GATTACA
GATTACA
GATTACA
GATTACA
GATTACA
GATTACA
GATTACA
GATTACA
GATTACA
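
In the clone, original.dus was a dangling symlink until `git annex get` fetched its content. A small Python check along those lines is sketched here; `annex_state` is a hypothetical helper, and it makes the simplifying assumption that every symlink it is handed is an annexed file.

```python
import os


def annex_state(path):
    """Classify a path as 'present', 'missing', or 'not annexed'.

    Assumption for this sketch: any symlink counts as an annexed file.
    """
    if not os.path.islink(path):
        return "not annexed"
    # os.path.exists follows the link, so it is False for a dangling symlink.
    return "present" if os.path.exists(path) else "missing"
```

A data loading routine could call this first and report "run git annex get" instead of failing with a bare "No such file or directory".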

Real world example

Here's an example of one of my annexes: total known annex size is 557 GB, but this laptop only has 6 GB of it (and it only has a 100 GB SSD).

The key point is that the catalogue is available in a very lightweight manner. Everything in the catalogue is just a `git annex get` away.

In [23]:
%%bash
# this cell will only run on pi's computer
cd ~/annex
git annex status
supported backends: SHA256 SHA1 SHA512 SHA224 SHA384 SHA256E SHA1E SHA512E SHA224E SHA384E WORM URL
supported remote types: git S3 bup directory rsync web hook
trusted repositories: 0
semitrusted repositories: 13
	00000000-0000-0000-0000-000000000001 -- web
 	094dfddc-df61-11e1-9750-37e06b8a9271 -- passport
 	370e23eb-e8a6-4e4e-a541-126e86707d24 -- pirr (pirsquared.org: ~/data)
 	3a3f810a-bcba-11e1-8c79-abd1241ca1ec -- mybook-baregit (My Book bare git repo)
 	3ee45d74-bcbb-11e1-9248-134861891c80 -- mybook
 	82f51036-bb84-11e1-97d3-4b1ce8380653 -- apxrsync (ApxuMed rsync)
 	9a3d31ac-bb83-11e1-aa4a-07e10779860b -- here (HbIOTOH)
 	ApxuMed: -- ~/data
 	a113d11c-0cdc-45c4-95ec-896f051f58b9 -- apxumed
 	cb0ec9da-bcb9-11e1-ba58-aba4ddc57a0f -- mybook
 	d85bce23-e501-489c-aa3f-af86fac17b14 -- ApxuMed ~/data
 	e8a75f48-d852-11e1-b3b8-771474033e82 -- g2usb1 (16GB DT101 G2)
 	eb874b2a-bbf0-11e1-b174-afbb2a5f838f -- HbIOTOH /home/pi/data
untrusted repositories: 0
dead repositories: 1
	ce41e790-bcb9-11e1-bcf6-0b61b49dc5c3 -- mybook
transfers in progress: none
available local disk space: 4 gigabytes (+1 megabyte reserved)
temporary directory size: 89 megabytes (clean up with git-annex unused)
local annex keys: 699
local annex size: 6 gigabytes
known annex keys: 5621
known annex size: 557 gigabytes
bloom filter size: 16 mebibytes (0.1% full)
backend usage: 
	SHA256: 6300
	URL: 20

Ok, now that we have data under control, let's move on to doing something with it... (code)

Доверяй, но проверяй!

"Trust, but verify!"

How do you know that a tool is any good?

In [24]:
import numpy as np
np.test()
Running unit tests for numpy
NumPy version 1.6.2
.............................................................................................................................................................................................................................................................................................S............................................................................................................................................................................................................................................................................KK...................................................................................................................................SSS.........................................................................................................................................................................................................................................................................................K.....................................................................................................K......................K.......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................  
NumPy is installed in /usr/lib/pymodules/python2.7/numpy
Python version 2.7.3 (default, Jan  2 2013, 13:56:14) [GCC 4.7.2]
nose version 1.1.2
----------------------------------------------------------------------
Ran 3196 tests in 19.538s

OK (KNOWNFAIL=5, SKIP=4)
Out[24]:
<nose.result.TextTestResult run=3196 errors=0 failures=0>

Running software without a test suite is like running experiments without calibrating your instruments

ttyrec

Lightweight capture tool (I use this daily, it helps me account for how I spend my time). Just writes everything you see in the shell to a file, with timing information, which you can later play back.

demo in the shell (ttyplay ~/2012-08-01_2.tty)
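
Since the recordings carry timing information, they can also be read from Python. This is a sketch under one assumption: the commonly documented ttyrec layout, where each record is a 12-byte header of three little-endian 32-bit integers (seconds, microseconds, payload length) followed by that many bytes of terminal output. The function names and the 60-second idle cutoff are my own choices for illustration.

```python
import struct

HEADER = struct.Struct("<III")  # seconds, microseconds, payload length


def read_ttyrec(stream):
    """Yield (timestamp_in_seconds, payload_bytes) for each record."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # end of file (or a truncated trailing record)
        sec, usec, length = HEADER.unpack(header)
        yield sec + usec / 1e6, stream.read(length)


def active_seconds(stream, idle_cutoff=60.0):
    """Sum the gaps between records, ignoring gaps above idle_cutoff."""
    total, previous = 0.0, None
    for timestamp, _ in read_ttyrec(stream):
        if previous is not None and 0 <= timestamp - previous <= idle_cutoff:
            total += timestamp - previous
        previous = timestamp
    return total
```

Opening a recording with `open(path, "rb")` and passing it to `active_seconds` gives a rough estimate of time spent actively typing in that session.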

IPython notebook

This stuff is obvious here, but in another context I would mention these features of the IPython notebook:

  • clear all output, and re-run this notebook
  • inline plotting
  • tab-completion, documentation tooltips, etc
  • extensible - R magic, octave magic
  • Notebook format - converters to PDF, LaTeX, HTML, Markdown, reStructuredText, Python.
  • communication protocol: multiple clients can talk to the computational kernel (vim-ipython, for example)
  • collaboration: expose a relevant port publicly, point collaborator to a website.

Summary

Data:

- simple availability is not enough
- let's start using metalinks, bittorrent, etc.

Data integrity:

- hashes (python decorators: verify on-load)
- git annex


Testing:

- "Running software without a test suite is like running experiments without calibrating your instruments"


Communication of process, results:

- ttyrec 
- IPython Notebook

Thank you

Paul Ivanov

@ivanov

http://github.com/ivanov/

http://pirsquared.org/

"The task must be made difficult, for only the difficult inspires the noble-hearted." -- Kierkegaard