
2. Data Records¶

In this notebook, we will go over how to create Data Records (the fundamental unit in DataFed), add metadata and other contextual information to them, edit them, and establish relationships between them.

Setting up:¶

Just as in the previous notebook, we start by instantiating the datafed.CommandLib.API class to communicate with DataFed:

In [ ]:

from datafed.CommandLib import API

In [ ]:

df_api = API()

Specifying context:¶

Since we want to work within the context of the Training Project:

In [ ]:

df_api.setContext('p/trn001')

To begin with, you will be working within your own private collection whose alias is the same as your DataFed username.

Exercise: ¶

Enter your username into the parent_collection variable

In [ ]:

parent_collection = ?

Creating Data Records:¶

Data Records can hold a whole lot of contextual information about the raw data in them. One key component is the scientific metadata. Ideally, we would get this metadata from the headers of the raw data file or some other log file that was generated along with the raw data.

Note ¶

DataFed expects scientific metadata to be specified like a python dictionary.

For now, let's set up some dummy metadata:

In [ ]:

parameters = {
              'a': 4,
              'b': [1, 2, -4, 7.123],
              'c': 'Something important',
              'd': {'x': 14, # Can use nested dictionaries
                    'y': -19
                   } 
              }

Note ¶

DataFed currently encodes metadata in JSON strings. The next version will accept Python dictionaries as is.

For now, we will need to convert the metadata to a JSON string using the dumps() function in the json package:

In [ ]:

import json
json.dumps(parameters)
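As a quick check that nothing is lost in the conversion, the sketch below round-trips the dummy metadata through json.dumps() and json.loads(). It uses only the standard library and does not contact DataFed; the names metadata_json and restored are illustrative and not part of the DataFed API.

```python
import json

parameters = {
    'a': 4,
    'b': [1, 2, -4, 7.123],
    'c': 'Something important',
    'd': {'x': 14, 'y': -19},
}

# Serialize the dictionary to a JSON string
metadata_json = json.dumps(parameters)
print(type(metadata_json).__name__)

# Parse the string back into a dictionary and compare with the original
restored = json.loads(metadata_json)
print(restored == parameters)
```

Keep in mind that JSON has fewer types than Python: tuples come back as lists, and non-string dictionary keys come back as strings, so metadata using those will not compare equal after a round trip.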

We use the dataCreate() function to make our new record, and the json.dumps() function to format the python dictionary to JSON:

In [ ]:

dc_resp = df_api.dataCreate('my important data',
                            metadata=json.dumps(parameters),
                            parent_id=parent_collection, # The parent collection, whose alias is your username
                            )
dc_resp

Note: ¶

In a future release, the dataCreate() function is expected to return only the ID of the record by default when the Data Record is created successfully, instead of this verbose response. We expect the verbose response to remain available through an optional argument.

Exercise: ¶

Extract the ID of the data record from the message returned from dataCreate() for future use:

In [ ]:

record_id = ?
print(record_id)

Data Records and the information in them are not static; they can be modified at any time.

Updating Data Records¶

Let's add some additional metadata and change the title of our record:

In [ ]:

du_resp = df_api.dataUpdate(record_id,
                            title='Some new title for the data',
                            metadata=json.dumps({'appended_metadata': True})
                            )
print(du_resp)

Note: ¶

In the future, the dataUpdate() command is expected to return only an acknowledgement that the update succeeded.

Viewing Data Records¶

We can get full information about a data record, including the complete metadata, via the dataView() function. Let us use this function to verify that the changes have been incorporated:

In [ ]:

dv_resp = df_api.dataView(record_id)
print(dv_resp)

Note: ¶

Record metadata is always stored as a JSON string.

Exercise: ¶

Try isolating the updated metadata and converting it to a python dictionary.

Hint - json.loads() is the opposite of json.dumps()

In [ ]:

In the first update, we merged new metadata with the metadata already in the record. However, dataUpdate() can also replace the existing metadata entirely.
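The difference between merging and replacing can be pictured with plain Python dictionaries. This is only a local illustration and makes no call to DataFed. How DataFed resolves clashing or nested keys during a merge is not covered in this notebook, so the overwrite-on-clash behavior shown here is an assumption.

```python
existing = {'a': 4, 'c': 'Something important'}
update = {'appended_metadata': True, 'a': 5}

# Merge: keys from the update are added to the existing metadata.
# Assumption: when a key clashes, the value from the update wins.
merged = {**existing, **update}

# Replace: the existing metadata is discarded and only the update remains
replaced = dict(update)

print(merged)
print(replaced)
```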

Exercise: ¶

Now try to replace the metadata.

Hint: look at the metadata_set keyword argument in the docstrings.

Tip: ¶

With the cursor just past the starting parenthesis of dataUpdate(, simultaneously press the Shift and Tab keys once, twice, or four times to view more of the documentation about the function.

In [ ]:

new_metadata = ?
du_resp = df_api.dataUpdate(record_id,
                            ...
                            )
dv_resp = df_api.dataView(record_id)
print(json.loads(dv_resp[0].data[0].metadata))
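The access pattern in the last line of the cell above can be tried without a server. The sketch below builds a stand-in reply with types.SimpleNamespace; its shape (a tuple whose first item has a data list of records) is inferred only from the dv_resp[0].data[0].metadata expression used in this notebook, and the record ID 'd/00000000' and the second tuple item are made-up placeholders.

```python
import json
from types import SimpleNamespace

# Hypothetical stand-in for a dataView() reply; not a real DataFed object
record = SimpleNamespace(
    id='d/00000000',  # placeholder ID
    metadata=json.dumps({'appended_metadata': True}),
)
fake_reply = (SimpleNamespace(data=[record]), 'placeholder')

# The metadata is a JSON string until it is parsed
metadata_str = fake_reply[0].data[0].metadata
metadata = json.loads(metadata_str)

print(type(metadata_str).__name__)
print(metadata['appended_metadata'])
```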

Provenance¶

Along with in-depth, detailed scientific metadata describing each data record, DataFed also provides a very handy tool for tracking data provenance, i.e. recording the relationships between Data Records which can be used to track the history, lineage, and origins of a data object.

Exercise: ¶

Create a new record meant to hold some processed version of the first data record.
Caution: Make sure to create it in the correct Collection.

In [ ]:

new_params = {}

dc2_resp = df_api.dataCreate( ...
                             )

clean_rec_id = dc2_resp[0].data[0].id
print(clean_rec_id)

Specifying Relationships¶

Now that we have two records, we can specify the second record's relationship to the first by adding a dependency via the deps_add keyword argument of the dataUpdate() function.

Note: ¶

As the documentation for dataUpdate() will reveal, dependencies must be specified as a list of relationships. Each relationship is itself a list whose first item is the dependency type (a string) and whose second item is the ID of the related data record (also a string).

DataFed currently supports three relationship types:

der - Is derived from
comp - Is comprised of
ver - Is new version of

In [ ]:

dep_resp = df_api.dataUpdate(clean_rec_id, 
                             deps_add=[["der", record_id]])
print(dep_resp)
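Because a mistyped dependency type is easy to miss, it can help to build the list passed to deps_add with a small checker. The helper make_dependency() below is hypothetical and not part of DataFed; it only validates against the three types listed above, and the record ID is a made-up placeholder.

```python
# The three relationship types listed in this notebook
VALID_DEP_TYPES = {'der', 'comp', 'ver'}

def make_dependency(dep_type, record_id):
    """Return one [type, record ID] pair after checking the type."""
    if dep_type not in VALID_DEP_TYPES:
        raise ValueError(f"Unknown dependency type: {dep_type!r}")
    return [dep_type, record_id]

# 'd/00000000' is a placeholder, not a real record ID
deps = [make_dependency('der', 'd/00000000')]
print(deps)
```

The resulting deps list has the same list-of-lists shape as the deps_add argument in the cell above.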

Exercise: ¶

Take a look at the records on the DataFed Web Portal in order to see a graphical representation of the data provenance.

Exercise: ¶

1. Create a new data record to hold a figure in your journal article.
2. Extract the record ID.
3. Now establish a provenance link between this figure record and the processed data record we just created. You may try out a different dependency type if you like.
4. Take a look at the DataFed Web Portal to see the update to the provenance of the records.

In [ ]: