We've now successfully created a command line program - plot_precipitation_climatology.py - that calculates and plots the precipitation climatology for a given month. The last step is to capture the provenance of that plot. In other words, we need a record of all the data processing steps that were taken, from the initial download of the data files to the end result (i.e. the .png image).
The simplest way to do this is to follow the lead of the NCO and CDO command line tools, which insert a record of what was executed at the command line into the history attribute of the output netCDF file.
import xarray as xr
csiro_pr_file = '../data/pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc'
dset = xr.open_dataset(csiro_pr_file)
print(dset.attrs['history'])
Fri Dec 8 10:05:56 2017: ncatted -O -a history,pr,d,, pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc Fri Dec 01 08:01:43 2017: cdo seldate,2001-01-01,2005-12-31 /g/data/ua6/DRSv2/CMIP5/CSIRO-Mk3-6-0/historical/mon/atmos/r1i1p1/pr/latest/pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_185001-200512.nc pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc 2011-07-27T02:26:04Z CMOR rewrote data to comply with CF standards and CMIP5 requirements.
Fortunately, there is a Python package called cmdline-provenance that creates NCO/CDO-style records of what was executed at the command line. We can use it to generate a new command line record:
# Note: the record printed below is the command that was run to launch the Jupyter notebook we're using.
import cmdline_provenance as cmdprov
new_record = cmdprov.new_log()
print(new_record)
Tue Nov 03 07:23:59 2020: C:\Users\yorksea\anaconda3\python.exe C:\Users\yorksea\anaconda3\lib\site-packages\ipykernel_launcher.py -f C:\Users\yorksea\AppData\Roaming\jupyter\runtime\kernel-4a6f7305-ff2a-46f4-b58a-c2e3ed534d8f.json
If we want to create our own entry for the history attribute, we'll need to be able to create a time stamp and a record of what was entered at the command line to execute plot_precipitation_climatology.py.
A library called datetime can be used to find out the time and date right now:
import datetime
time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y")
print(time_stamp)
Fri Dec 08 14:05:17 2017
The strftime function can be used to customise the appearance of a datetime object; in this case we've made it look just like the other time stamps in our data file.
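To see how the format codes work, here is a quick sketch using a fixed datetime (8 December 2017 was indeed a Friday, matching the time stamps above), rendered two different ways:

```python
import datetime

# A fixed datetime so the output is reproducible
moment = datetime.datetime(2017, 12, 8, 14, 5, 17)

# NCO/CDO-style time stamp, matching the entries in the history attribute
print(moment.strftime("%a %b %d %H:%M:%S %Y"))  # Fri Dec 08 14:05:17 2017

# ISO 8601 style, often preferred because it sorts chronologically
print(moment.strftime("%Y-%m-%dT%H:%M:%S"))  # 2017-12-08T14:05:17
```

(The %a and %b codes produce English abbreviations under the default locale.)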
sys.argv, which is what the argparse library is built on top of, is a list containing all the arguments entered by the user at the command line:
import sys
print(sys.argv)
['/Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py', '-f', '/Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json']
In launching this Jupyter notebook, you can see that a command line program called ipykernel_launcher.py was run.
To join all these list elements up, we can use the join method that belongs to Python strings:
args = " ".join(sys.argv)
print(args)
/Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json
While this list of arguments is very useful, it doesn't tell us which Python installation was used to run the program. The sys library can help us out here too:
exe = sys.executable
print(exe)
/Applications/anaconda/envs/pyaos-lesson/bin/python
In the lesson on version control using git we learned that each commit is associated with a unique 40-character identifier known as a hash. We can use the GitPython library (imported as git) to get the hash of the latest commit in the repository containing the script:
import git
import os
repo_dir = '/Users/irv033/Documents/volunteer/teaching'
#repo_dir = os.getcwd()
git_hash = git.Repo(repo_dir).heads[0].commit
print(git_hash)
588f96dcab5c78d10b4c994eb3ca67955c882697
We can now put all this together into a function that generates our history record:
def get_history_record(repo_dir):
    """Create a new history record."""
    time_stamp = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    exe = sys.executable
    args = " ".join(sys.argv)
    git_hash = git.Repo(repo_dir).heads[0].commit
    entry = """%s: %s %s (Git hash: %s)""" % (time_stamp, exe, args, str(git_hash)[0:7])
    return entry
new_history = get_history_record('/Users/irv033/Documents/volunteer/teaching')
print(new_history)
2017-12-08T14:05:34: /Applications/anaconda/envs/pyaos-lesson/bin/python /Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json (Git hash: 588f96d)
which can be combined with the previous history to compile a record that goes all the way back to when we obtained the original data file:
# previous_history would be read from the data file, e.g. dset.attrs['history']
complete_history = '%s \n %s' % (new_history, previous_history)
print(complete_history)
2017-12-08T14:05:34: /Applications/anaconda/envs/pyaos-lesson/bin/python /Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json (Git hash: 588f96d) Fri Dec 8 10:05:47 2017: ncatted -O -a history,pr,d,, pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc Fri Dec 01 07:59:16 2017: cdo seldate,2001-01-01,2005-12-31 /g/data/ua6/DRSv2/CMIP5/ACCESS1-3/historical/mon/atmos/r1i1p1/pr/latest/pr_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc CMIP5 compliant file produced from raw ACCESS model output using the ACCESS Post-Processor and CMOR2. 2012-02-08T06:45:54Z CMOR rewrote data to comply with CF standards and CMIP5 requirements. Fri Apr 13 09:55:30 2012: forcing attribute modified to correct value Fri Apr 13 12:13:10 2012: updated version number to v20120413. Fri Apr 13 12:29:34 2012: corrected model_id from ACCESS1-3 to ACCESS1.3
(Note that in a real example of this process in action, the new history would refer to what was entered at the command line to run plot_precipitation_climatology.py, as opposed to running ipykernel_launcher.py to launch a notebook.)
We could place this new get_history_record() function directly into the plot_precipitation_climatology.py script, but there's a good chance we'll want to use it in many scripts that we write in the future. In the functions lesson we discussed why code duplication is a bad thing, and the same principle applies here. The solution is to place the get_history_record() function in a separate file full of functions (which is called a module) that we use regularly across many scripts.
(A slight modification has been made to get_history_record() so that repo_dir isn't hard-wired into the code. Instead, the script defines repo_dir as the current working directory, which is assumed to be the top of the directory tree in the git repository, as that's the input information required by git.Repo.)
!cat provenance.py
""" A collection of commonly used functions for data provenance """ import sys import datetime import git import os def get_history_record(): """Create a new history record.""" time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y") exe = sys.executable args = " ".join(sys.argv) repo_dir = os.getcwd() try: git_hash = git.Repo(repo_dir).heads[0].commit except git.exc.InvalidGitRepositoryError: print('To record the git hash, must run script from top of directory tree in git repo') git_hash = 'unknown' entry = """%s: %s %s (Git hash: %s)""" %(time_stamp, exe, args, str(git_hash)[0:7]) return entry
We can then import that module and use it in all of our scripts.
import provenance
The first line of a module file is similar to the first line of a function: if it's a string (known as a docstring), it will be picked up by the help generator.
help(provenance)
Help on module provenance:

NAME
    provenance - A collection of commonly used functions for data provenance

FUNCTIONS
    get_history_record()
        Create a new history record.

FILE
    /Users/irv033/Documents/volunteer/teaching/amos-icshmo/provenance.py
help(provenance.get_history_record)
Help on function get_history_record in module provenance:

get_history_record()
    Create a new history record.
Import the new provenance module into your plot_precipitation_climatology.py script and use it to record the complete history of the output figure.
Things to consider: unlike a netCDF file, a .png file has no history attribute to store metadata in, so it will be necessary to have plot_precipitation_climatology.py output a .txt file that contains the history information (it's usually easiest for this metadata file to have exactly the same name as the figure file, just with a .txt instead of .png file extension). Don't forget to commit your changes to git and push to GitHub.