We've now successfully created a command line program - plot_precipitation_climatology.py - that calculates and plots the precipitation climatology for a given month. The last step is to capture the provenance of that plot. In other words, we need a record of all the data processing steps that were taken, from the initial download of the data files to the end result (i.e. the .png image).
The simplest way to do this is to follow the lead of the NCO and CDO command line tools, which insert a record of what was executed at the command line into the history attribute of the output netCDF file.
import xarray as xr
csiro_pr_file = '../data/pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc'
dset = xr.open_dataset(csiro_pr_file)
print(dset.attrs['history'])
Fri Dec 8 10:05:56 2017: ncatted -O -a history,pr,d,, pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc Fri Dec 01 08:01:43 2017: cdo seldate,2001-01-01,2005-12-31 /g/data/ua6/DRSv2/CMIP5/CSIRO-Mk3-6-0/historical/mon/atmos/r1i1p1/pr/latest/pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_185001-200512.nc pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc 2011-07-27T02:26:04Z CMOR rewrote data to comply with CF standards and CMIP5 requirements.
Fortunately, there is a Python package called cmdline-provenance that creates NCO/CDO-style records of what was executed at the command line. We can use it to generate a new command line record:
# Note: the record printed below is the command that was run to launch the Jupyter notebook we're using.
import cmdline_provenance as cmdprov
new_record = cmdprov.new_log()
print(new_record)
Tue Nov 03 07:23:59 2020: C:\Users\yorksea\anaconda3\python.exe C:\Users\yorksea\anaconda3\lib\site-packages\ipykernel_launcher.py -f C:\Users\yorksea\AppData\Roaming\jupyter\runtime\kernel-4a6f7305-ff2a-46f4-b58a-c2e3ed534d8f.json
If we want to create our own entry for the history attribute, we'll need to be able to create a time stamp and a record of what was entered at the command line to execute plot_precipitation_climatology.py.
A library called datetime can be used to find out the time and date right now:
import datetime
time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y")
print(time_stamp)
Fri Dec 08 14:05:17 2017
The strftime function can be used to customise the appearance of a datetime object; in this case we've made it look just like the other time stamps in our data file.
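To see how the format codes work, here is a quick sketch using a fixed datetime (8 December 2017 was indeed a Friday, matching the time stamps above), rendered two different ways:

```python
import datetime

# A fixed datetime so the output is reproducible
moment = datetime.datetime(2017, 12, 8, 14, 5, 17)

# NCO/CDO-style time stamp, matching the entries in the history attribute
print(moment.strftime("%a %b %d %H:%M:%S %Y"))  # Fri Dec 08 14:05:17 2017

# ISO 8601 style, often preferred because it sorts chronologically
print(moment.strftime("%Y-%m-%dT%H:%M:%S"))  # 2017-12-08T14:05:17
```

(The %a and %b codes produce English abbreviations under the default locale.)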
sys.argv, which is what the argparse library is built on top of, is a list containing all the arguments entered by the user at the command line:
import sys
print(sys.argv)
['/Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py', '-f', '/Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json']
In launching this Jupyter notebook, you can see that a command line program called ipykernel_launcher.py was run.
To join all these list elements up, we can use the join method that belongs to Python strings:
args = " ".join(sys.argv)
print(args)
/Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json
While this list of arguments is very useful, it doesn't tell us which Python installation was used to run the program. The sys library can help us out here too:
exe = sys.executable
print(exe)
/Applications/anaconda/envs/pyaos-lesson/bin/python
In the lesson on version control using git we learned that each commit is associated with a unique 40-character identifier known as a hash. We can use the GitPython library (imported as git) to get the hash of the latest commit in the repository containing the script:
import git
import os
repo_dir = '/Users/irv033/Documents/volunteer/teaching'
#repo_dir = os.getcwd()
git_hash = git.Repo(repo_dir).heads[0].commit
print(git_hash)
588f96dcab5c78d10b4c994eb3ca67955c882697
We can now put all this together into a function that generates our history record:
def get_history_record(repo_dir):
    """Create a new history record."""
    time_stamp = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    exe = sys.executable
    args = " ".join(sys.argv)
    git_hash = git.Repo(repo_dir).heads[0].commit
    entry = """%s: %s %s (Git hash: %s)""" % (time_stamp, exe, args, str(git_hash)[0:7])
    return entry
new_history = get_history_record('/Users/irv033/Documents/volunteer/teaching')
print(new_history)
2017-12-08T14:05:34: /Applications/anaconda/envs/pyaos-lesson/bin/python /Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json (Git hash: 588f96d)
which can be combined with the previous history to compile a record that goes all the way back to when we obtained the original data file:
# previous_history would be read from the data file, e.g. dset.attrs['history']
complete_history = '%s \n %s' % (new_history, previous_history)
print(complete_history)
2017-12-08T14:05:34: /Applications/anaconda/envs/pyaos-lesson/bin/python /Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json (Git hash: 588f96d) Fri Dec 8 10:05:47 2017: ncatted -O -a history,pr,d,, pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc Fri Dec 01 07:59:16 2017: cdo seldate,2001-01-01,2005-12-31 /g/data/ua6/DRSv2/CMIP5/ACCESS1-3/historical/mon/atmos/r1i1p1/pr/latest/pr_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc CMIP5 compliant file produced from raw ACCESS model output using the ACCESS Post-Processor and CMOR2. 2012-02-08T06:45:54Z CMOR rewrote data to comply with CF standards and CMIP5 requirements. Fri Apr 13 09:55:30 2012: forcing attribute modified to correct value Fri Apr 13 12:13:10 2012: updated version number to v20120413. Fri Apr 13 12:29:34 2012: corrected model_id from ACCESS1-3 to ACCESS1.3
(Note that in a real example of this process in action, the new history would refer to what was entered at the command line to run plot_precipitation_climatology.py, as opposed to running ipykernel_launcher.py to launch a notebook.)
We could place this new get_history_record() function directly into the plot_precipitation_climatology.py script, but there's a good chance we'll want to use it in many scripts that we write in the future. In the functions lesson we discussed why code duplication is a bad thing, and the same principle applies here. The solution is to place the get_history_record() function in a separate file full of functions (which is called a module) that we use regularly across many scripts.
(A slight modification has been made to get_history_record() so that repo_dir isn't hard-wired into the code. Instead, the script defines repo_dir as the current working directory, which is assumed to be the top of the directory tree in the git repository, as that's the input information required by git.Repo.)
!cat provenance.py
""" A collection of commonly used functions for data provenance """ import sys import datetime import git import os def get_history_record(): """Create a new history record.""" time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y") exe = sys.executable args = " ".join(sys.argv) repo_dir = os.getcwd() try: git_hash = git.Repo(repo_dir).heads[0].commit except git.exc.InvalidGitRepositoryError: print('To record the git hash, must run script from top of directory tree in git repo') git_hash = 'unknown' entry = """%s: %s %s (Git hash: %s)""" %(time_stamp, exe, args, str(git_hash)[0:7]) return entry
We can then import that module and use it in all of our scripts.
import provenance
The first line of a module file is similar to the first line of a function: if it's a string (known as a docstring), it will be picked up by the help generator.
help(provenance)
Help on module provenance:

NAME
    provenance - A collection of commonly used functions for data provenance

FUNCTIONS
    get_history_record()
        Create a new history record.

FILE
    /Users/irv033/Documents/volunteer/teaching/amos-icshmo/provenance.py
help(provenance.get_history_record)
Help on function get_history_record in module provenance:

get_history_record()
    Create a new history record.
Import the new provenance module into your plot_precipitation_climatology.py script and use it to record the complete history of the output figure.
Things to consider: unlike a netCDF file, a .png file has no history attribute to store metadata in, so it will be necessary to have plot_precipitation_climatology.py output a .txt file that contains the history information (it's usually easiest for this metadata file to have exactly the same name as the figure file, just with a .txt instead of .png file extension). Don't forget to commit your changes to git and push to GitHub.