The Modern Research Data Portal: A Design Pattern for Networked, Data-Intensive Science

In this notebook we demonstrate the core logic for developing a Modern Research Data Portal (MRDP). This code leverages the Globus platform to manage identities and data access. We first demonstrate how to use the Globus SDK before stepping through the MRDP logic.

The following notebook contains a brief introduction to the Globus SDK. More complete documentation and example notebooks are available in the following locations:

Setup

To use the following notebook you must first install the Globus Python SDK. This can be done by downloading the SDK and installing it manually (https://github.com/globus/globus-sdk-python) or via Python pip as follows.

pip install globus-sdk

To access the SDK you must authenticate using your Globus identity. In this notebook we use the NativeAppAuthClient to acquire tokens. If the MRDP code is deployed as a service, web-based authentication flows should be used instead.

In [ ]:
from globus_sdk import AuthClient, TransferClient, AccessTokenAuthorizer, NativeAppAuthClient, TransferData


CLIENT_ID = '2f9482c4-67b3-4783-bac7-12b37d6f8966'

client = NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow()

authorize_url = client.oauth2_get_authorize_url()
print('Please go to this URL and login: {0}'.format(authorize_url))

# this is to work on Python2 and Python3 -- you can just use raw_input() or
# input() for your specific version
get_input = getattr(__builtins__, 'raw_input', input)
auth_code = get_input(
    'Please enter the code you get after login here: ').strip()
token_response = client.oauth2_exchange_code_for_tokens(auth_code)

AUTH_TOKEN = token_response.by_resource_server['auth.globus.org']['access_token']
TRANSFER_TOKEN = token_response.by_resource_server['transfer.api.globus.org']['access_token']

tc = TransferClient(AccessTokenAuthorizer(TRANSFER_TOKEN))
ac = AuthClient(authorizer=AccessTokenAuthorizer(AUTH_TOKEN))

Using the Globus SDK

We first show how the Globus SDK can be used to discover endpoint IDs.

In [ ]:
# discover an Endpoint ID
search_str = "Globus Tutorial Endpoint"
endpoints = tc.endpoint_search(search_str)
print("==== Displaying endpoint matches for search: '{}' ===".format(search_str))
for ep in endpoints:
    print("{} ({})".format(ep["display_name"] or ep["canonical_name"], ep["id"]))
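A search can return several matches, so a portal usually needs to pick one endpoint by name. The following is a minimal sketch of that step. The helper name find_endpoint_id and the sample records (including their IDs) are made up for illustration; the only assumption about real search results is that each one carries the display_name, canonical_name, and id fields printed above.

```python
def find_endpoint_id(endpoints, display_name):
    """Return the ID of the first endpoint whose name matches exactly, else None."""
    for ep in endpoints:
        # Same fallback as the search cell: prefer display_name, else canonical_name
        name = ep["display_name"] or ep["canonical_name"]
        if name == display_name:
            return ep["id"]
    return None


# Stand-in records shaped like search results (hypothetical names and IDs)
sample = [
    {"display_name": "Example Endpoint 1", "canonical_name": "example#ep1", "id": "id-1"},
    {"display_name": None, "canonical_name": "example#ep2", "id": "id-2"},
]
print(find_endpoint_id(sample, "example#ep2"))
```

With a live client, the same helper could be called as find_endpoint_id(tc.endpoint_search(search_str), search_str).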

The Research Data Portal function

The following code uses the Globus SDK to create, manage access to, and delete shared endpoints, as follows.

It first sets up variables for the host endpoint on which the shared endpoint will be created (in this case the "Globus Tutorial Endpoint"), the source path for the data to be copied and shared, and the email address of the user to be shared with.

It then creates a TransferClient and an AuthClient object and uses the Globus SDK function endpoint_autoactivate to ensure that the portal admin has a credential that permits access to the endpoint identified by host_id. Activation of the endpoint assumes that the endpoint is configured to trust the Globus IdP (as is the case with Globus Connect Personal).

In [ ]:
import sys, random, uuid
from globus_sdk import AuthClient, TransferClient, AccessTokenAuthorizer, TransferData

host_id = 'ddb59aef-6d04-11e5-ba46-22000b92c6ec' # Endpoint for shared endpoint
source_path = '/share/godata/' # Directory to copy data from
email ='[email protected]' # Email address to share with

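tc = TransferClient(AccessTokenAuthorizer(TRANSFER_TOKEN))
ac = AuthClient(authorizer=AccessTokenAuthorizer(AUTH_TOKEN))

# Make sure we hold a credential that permits access to the host endpoint
tc.endpoint_autoactivate(host_id)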

We use the Globus SDK function operation_mkdir to create a directory (in our example call, a UUID) on the existing endpoint with identifier host_id.

In [ ]:
share_path = '/~/' + str(uuid.uuid4()) + '/'
r = tc.operation_mkdir(host_id, path=share_path)
print (r['message'])

Then we use the Globus SDK function create_shared_endpoint to create a shared endpoint for the new directory. The new shared endpoint now exists and is associated with the new directory, but only the creating user has access to it.

In [ ]:
shared_ep_data = {
    'DATA_TYPE': 'shared_endpoint',
    'host_endpoint': host_id,
    'host_path': share_path,
    'display_name': 'RDP shared endpoint',
    'description': 'RDP shared endpoint'
}

r = tc.create_shared_endpoint(shared_ep_data)
share_id = r['id']
print(share_id)

To provide access to the requested data we copy data to the shared endpoint. We use sample data contained on the Globus Tutorial Endpoint under path "/share/godata".

In [ ]:
tc.endpoint_autoactivate(share_id)
tdata = TransferData(tc, host_id, share_id, label='RDP copy data', sync_level='checksum')
tdata.add_item(source_path, '/', recursive=True)
r = tc.submit_transfer(tdata)
# Block until the transfer completes or the timeout (in seconds) expires
tc.task_wait(r['task_id'], timeout=1000, polling_interval=10)
print(r['task_id'])
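The cell above waits for the transfer but does not check whether it actually succeeded. task_wait returns False if the task is still running when the timeout expires, and a finished task can also have failed. The following sketch wraps both checks. The helper name wait_for_transfer is our own, and it assumes the client exposes task_wait and get_task as the Globus SDK TransferClient does, with a status field in the task document.

```python
def wait_for_transfer(tc, task_id, timeout=1000, polling_interval=10):
    """Block until the task finishes; raise if it times out or does not succeed."""
    finished = tc.task_wait(task_id, timeout=timeout,
                            polling_interval=polling_interval)
    if not finished:
        raise RuntimeError(
            "Task {} still running after {}s".format(task_id, timeout))
    # A finished task may still have failed, so inspect its final status
    status = tc.get_task(task_id)["status"]
    if status != "SUCCEEDED":
        raise RuntimeError(
            "Task {} ended with status {}".format(task_id, status))
    return status
```

In the notebook this would be called as wait_for_transfer(tc, r['task_id']) in place of the bare task_wait call.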

To confirm all data is in place for sharing we check the contents of the shared endpoint.

In [ ]:
for f in tc.operation_ls(share_id):
    print (f['name'])
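Printing names is enough for a quick look, but a portal may want a summary to compare against what it expected to copy. The following is a small sketch that counts files and directories and totals the file sizes. The helper name summarize_listing and the sample entries are hypothetical; it assumes each listing entry has the type ('file' or 'dir') and size fields that the Transfer API returns alongside name.

```python
def summarize_listing(entries):
    """Return (number of files, number of directories, total file bytes)."""
    n_files = n_dirs = total_bytes = 0
    for entry in entries:
        if entry["type"] == "dir":
            n_dirs += 1
        else:
            n_files += 1
            total_bytes += entry["size"]
    return n_files, n_dirs, total_bytes


# Stand-in entries shaped like operation_ls results (made-up names and sizes)
sample_listing = [
    {"name": "a.txt", "type": "file", "size": 4},
    {"name": "b.txt", "type": "file", "size": 6},
    {"name": "sub", "type": "dir", "size": 4096},
]
print(summarize_listing(sample_listing))
```

With a live client the call would be summarize_listing(tc.operation_ls(share_id)).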

We now share the endpoint with the appropriate user. We first use the Globus SDK function get_identities to retrieve the user identifier associated with the supplied email address; this is the user for whom sharing is to be enabled. (If this user is not known to Globus, an identity is created.) We then use the function add_endpoint_acl_rule to add an access control rule to the new shared endpoint that grants the specified user read-only access. The various elements in the rule_data structure specify, among other things:

  • principal_type: the type of principal to which the rule applies: in this case, 'identity' (other options are 'group', 'all_authenticated_users', or 'anonymous');
  • principal: as the principal_type is 'identity', this is the user id with whom sharing is to be enabled;
  • permissions: the type of access being granted: in this case read-only ('r'), but could also be read and write ('rw');
  • notify_email: an email address to which an invitation to access the shared endpoint should be sent; and
  • notify_message: a message to include in the invitation email.

As our add_endpoint_acl_rule request specifies an email address, an invitation email is sent to the user. At this point, the user is authorized to download data from the new shared endpoint.

In [ ]:
r = ac.get_identities(usernames=email)
user_id = r['identities'][0]['id']
rule_data = {
    'DATA_TYPE': 'access',
    'principal_type': 'identity', # Grantee is
    'principal': user_id, # a user.
    'path': '/', # Path is /
    'permissions': 'r', # Read-only
    'notify_email': email, # Email invite
    'notify_message': # Invite msg
    'Requested data is available.'
}
r = tc.add_endpoint_acl_rule(share_id, rule_data)
print (r['message'])
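The cell above indexes the first identity directly and builds the rule inline. As a sketch of a slightly more defensive version, the two helpers below fail with a clear message when the lookup returns no identities and reject permission strings other than the two described above. The names first_identity_id and build_acl_rule are our own; the dictionary keys are the ones used in rule_data.

```python
def first_identity_id(identities_response, email):
    """Return the ID of the first identity in a get_identities-style response."""
    identities = identities_response["identities"]
    if not identities:
        raise LookupError("No Globus identity found for {}".format(email))
    return identities[0]["id"]


def build_acl_rule(user_id, permissions="r", path="/",
                   notify_email=None, notify_message=None):
    """Build an access rule document for a single user identity."""
    if permissions not in ("r", "rw"):
        raise ValueError("permissions must be 'r' or 'rw'")
    rule = {
        "DATA_TYPE": "access",
        "principal_type": "identity",  # Grantee is a user
        "principal": user_id,
        "path": path,
        "permissions": permissions,
    }
    # Only request an invitation email when an address is supplied
    if notify_email:
        rule["notify_email"] = notify_email
        if notify_message:
            rule["notify_message"] = notify_message
    return rule
```

In the notebook these would be combined as user_id = first_identity_id(ac.get_identities(usernames=email), email), followed by tc.add_endpoint_acl_rule(share_id, build_acl_rule(user_id, notify_email=email, notify_message='Requested data is available.')).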

The shared endpoint will typically be left operational for some period, after which it can be deleted. Note that deleting a shared endpoint does not delete the data that it contains. The portal admin may want to retain the data for other purposes. If not, we can use the Globus SDK function submit_delete to delete the folder.

In [ ]:
r = tc.delete_endpoint(share_id)
print (r['message'])
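The text above mentions submit_delete for removing the data itself. The following is a minimal sketch of that step, not part of the original portal code. Once the shared endpoint has been deleted, the directory can only be addressed through the host endpoint, so the sketch deletes share_path on host_id. The helper name delete_shared_data and the delete_data_factory parameter are our own; by default the helper assumes the SDK's DeleteData class accepts the same (client, endpoint) arguments as the TransferData calls in this notebook, which may differ in other SDK versions.

```python
def delete_shared_data(tc, host_id, share_path, delete_data_factory=None):
    """Submit a recursive delete of share_path on the host endpoint.

    Returns the task ID of the submitted delete task.
    """
    if delete_data_factory is None:
        # Imported lazily so the helper can be exercised with a stub client
        from globus_sdk import DeleteData
        delete_data_factory = DeleteData
    ddata = delete_data_factory(tc, host_id, recursive=True,
                                label='RDP delete data')
    ddata.add_item(share_path)
    return tc.submit_delete(ddata)['task_id']
```

In the notebook this would be called as delete_shared_data(tc, host_id, share_path). Like a transfer, the delete runs asynchronously, so the returned task ID can be passed to task_wait.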

Putting it all together

The following code integrates the code above into a single callable function.

In [ ]:
from globus_sdk import TransferClient, TransferData, AccessTokenAuthorizer
from globus_sdk import AuthClient
import sys, random, uuid

def rdp(host_id, # Endpoint for shared endpoint
    source_path, # Directory to copy data from
    email):      # Email address to share with
    
    # Instantiate transfer and auth clients
    tc = TransferClient(AccessTokenAuthorizer(TRANSFER_TOKEN))
    ac = AuthClient(authorizer=AccessTokenAuthorizer(AUTH_TOKEN))
    tc.endpoint_autoactivate(host_id)

    # (1) Create shared endpoint:
    # (a) Create directory to be shared
    share_path = '/~/' + str(uuid.uuid4()) + '/'
    tc.operation_mkdir(host_id, path=share_path)
    
    # (b) Create shared endpoint on directory 
    shared_ep_data = {
        'DATA_TYPE': 'shared_endpoint',
        'host_endpoint': host_id,
        'host_path': share_path,
        'display_name': 'RDP shared endpoint',
        'description': 'RDP shared endpoint'
    }

    r = tc.create_shared_endpoint(shared_ep_data)
    share_id = r['id']

    # (2) Copy data into the shared endpoint
    tc.endpoint_autoactivate(share_id)
    tdata = TransferData(tc, host_id, share_id, label='RDP copy data', sync_level='checksum')
    tdata.add_item(source_path, '/', recursive=True)
    r = tc.submit_transfer(tdata)
    tc.task_wait(r['task_id'], timeout=1000, polling_interval=10)

    # (3) Enable access by user
    r = ac.get_identities(usernames=email)
    user_id = r['identities'][0]['id']
    rule_data = {
        'DATA_TYPE': 'access',
        'principal_type': 'identity', # Grantee is
        'principal': user_id, # a user.
        'path': '/', # Path is /
        'permissions': 'r', # Read-only
        'notify_email': email, # Email invite
        'notify_message': # Invite msg
        'Requested data is available.'
    }
    tc.add_endpoint_acl_rule(share_id, rule_data)

    # (4) Ultimately, delete the shared endpoint
    #tc.delete_endpoint(share_id)
    
rdp('ddb59aef-6d04-11e5-ba46-22000b92c6ec', '/share/godata/' , '[email protected]')