In this notebook, we will go over how to create Data Records, the fundamental unit in DataFed, add metadata and other contextual information to them, edit them, and establish relationships between them.
Just as in the previous notebook, we start by instantiating the datafed.CommandLib.API class to communicate with DataFed:
from datafed.CommandLib import API
df_api = API()
Since we want to work within the context of the Training Project, we set the context accordingly:
df_api.setContext('p/trn001')
To begin with, you will be working within your own private collection, whose alias is the same as your DataFed username.
Enter your username into the parent_collection variable:
parent_collection = ?
Data Records can hold a whole lot of contextual information about the raw data in them. One key component is the scientific metadata. Ideally, we would get this metadata from the headers of the raw data file or some other log file that was generated along with the raw data.
DataFed expects scientific metadata to be specified like a Python dictionary.
For now, let's set up some dummy metadata:
parameters = {
    'a': 4,
    'b': [1, 2, -4, 7.123],
    'c': 'Something important',
    'd': {
        'x': 14,  # Nested dictionaries are allowed
        'y': -19
    }
}
DataFed currently encodes metadata in JSON strings. The next version will accept Python dictionaries as is.
For now, we will need to convert the metadata to a JSON string using the dumps() function in the json package:
import json
json.dumps(parameters)
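As a quick illustration of what this conversion does, the sketch below serializes the same dummy dictionary and then parses it back with json.loads(). It uses only the standard library; the variable names json_str and round_trip are our own and not part of DataFed.

```python
import json

parameters = {
    'a': 4,
    'b': [1, 2, -4, 7.123],
    'c': 'Something important',
    'd': {'x': 14, 'y': -19},
}

# dumps() turns the dictionary into a plain string ...
json_str = json.dumps(parameters)
print(type(json_str).__name__)

# ... and loads() parses that string back into an equal dictionary.
round_trip = json.loads(json_str)
print(round_trip == parameters)
```

The same json.loads() call is what lets us read metadata back out of a record later in this notebook.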
We use the dataCreate() function to make our new record, and the json.dumps() function to format the Python dictionary as JSON:
dc_resp = df_api.dataCreate('my important data',
metadata=json.dumps(parameters),
parent_id=parent_collection, # The parent collection, whose alias is your username
)
dc_resp
In the future, the dataCreate() function would, by default, return only the ID of the record instead of such a verbose response if it successfully created the Data Record. We expect to be able to continue to get this verbose response through an optional argument.
Extract the ID of the data record from the message returned from dataCreate() for future use:
record_id = ?
print(record_id)
Data Records and the information in them are not static and can be modified at any time.
Let's add some additional metadata and change the title of our record:
du_resp = df_api.dataUpdate(record_id,
title='Some new title for the data',
metadata=json.dumps({'appended_metadata': True})
)
print(du_resp)
In the future, the dataUpdate() command would return only an acknowledgement of the successful execution of the data update.
We can get full information about a data record, including the complete metadata, via the dataView() function. Let us use this function to verify that the changes have been incorporated:
dv_resp = df_api.dataView(record_id)
print(dv_resp)
In the first update, we merged new metadata with the existing metadata within the record. However, dataUpdate() is also capable of replacing the metadata.
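The difference between merging and replacing can be pictured with plain Python dictionaries. This is only an analogy: it assumes a simple top-level merge, and DataFed's actual merge rules (for example, for nested keys) may differ. The names existing, update, merged, and replaced are hypothetical.

```python
existing = {'a': 4, 'c': 'Something important'}
update = {'appended_metadata': True}

# Merge: keys from the update are added to (or overwrite) the existing keys.
merged = {**existing, **update}

# Replace: the existing metadata is discarded in favor of the update.
replaced = dict(update)

print(merged)
print(replaced)
```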
Now try to replace the metadata.
Hint: look at the metadata_set keyword argument in the docstrings.
With the cursor just past the opening parenthesis of dataUpdate(, simultaneously press the Shift and Tab keys once, twice, or four times to view more of the documentation about the function.
new_metadata = ?
du_resp = df_api.dataUpdate(record_id,
...
)
dv_resp = df_api.dataView(record_id)
print(json.loads(dv_resp[0].data[0].metadata))
Along with in-depth, detailed scientific metadata describing each data record, DataFed also provides a very handy tool for tracking data provenance, i.e. recording the relationships between Data Records which can be used to track the history, lineage, and origins of a data object.
Create a new record meant to hold some processed version of the first data record.
Caution: Make sure to create it in the correct Collection.
new_params = {}
dc2_resp = df_api.dataCreate( ...
)
clean_rec_id = dc2_resp[0].data[0].id
print(clean_rec_id)
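The indexing in dc2_resp[0].data[0].id can look opaque at first. The sketch below mimics that shape with stand-in objects so the access pattern can be run without a DataFed connection. Everything here is an assumption made for illustration: the SimpleNamespace objects, the ID 'd/00000000', the title, and the string 'SomeReplyType' are placeholders, not real DataFed output.

```python
from types import SimpleNamespace

# Stand-in for one record entry; the ID and title are made-up placeholders.
fake_record = SimpleNamespace(id='d/00000000', title='my processed data')

# Stand-in for the response: element 0 carries a 'data' list of records.
fake_resp = (SimpleNamespace(data=[fake_record]), 'SomeReplyType')

# Same access pattern as the one used on the real response above.
fake_rec_id = fake_resp[0].data[0].id
print(fake_rec_id)
```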
Now that we have two records, we can specify the second record's relationship to the first by adding a dependency via the deps_add keyword argument of the dataUpdate() function.
As the documentation for dataUpdate() will reveal, dependencies must be specified as a list of relationships. Each relationship is expressed as a list where the first item is a dependency type (a string) and the second is the ID of the related data record (also a string).
DataFed currently supports three relationship types:
- der: Is derived from
- comp: Is comprised of
- ver: Is new version of
dep_resp = df_api.dataUpdate(clean_rec_id,
deps_add=[["der", record_id]])
print(dep_resp)
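Since a malformed dependency list is easy to write by hand, a small helper can build and check it before the call to dataUpdate(). This is a sketch only: make_dependency and ALLOWED_TYPES are hypothetical names that are not part of the DataFed API, the record ID is a placeholder, and the set of types is simply the three listed in this notebook.

```python
ALLOWED_TYPES = {'der', 'comp', 'ver'}


def make_dependency(dep_type, record_id):
    """Return one [type, record ID] pair after basic validation."""
    if dep_type not in ALLOWED_TYPES:
        raise ValueError(f"Unknown dependency type: {dep_type!r}")
    if not isinstance(record_id, str):
        raise TypeError("record_id must be a string")
    return [dep_type, record_id]


# A list of relationships, each itself a two-item list.
deps = [make_dependency('der', 'd/00000000')]
print(deps)
```

The resulting deps list has the same list-of-lists shape as the value passed to deps_add.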
Take a look at the records on the DataFed Web Portal in order to see a graphical representation of the data provenance.
1. Create a new data record to hold a figure in your journal article.
2. Extract the record ID.
3. Now establish a provenance link between this figure record and the processed data record we just created. You may try out a different dependency type if you like.
4. Take a look at the DataFed web portal to see the update to the provenance of the records.