# Prerequisites: python 3.6 or later
import requests
import json
import uuid
import pprint
import datetime
pp = pprint.PrettyPrinter(indent=2)
# This is a convenience method to handle api responses. The main portion of the notebook starts in the next cell
def handle_api_response(response, print_response=False):
    """Parse a data-catalog API response, returning the payload or raising.

    Parameters
    ----------
    response : requests.Response
        Response object returned by a data-catalog API call.
    print_response : bool
        When True, pretty-print the parsed JSON payload via the module-level
        ``pp`` printer before checking the status code.

    Returns
    -------
    dict
        The parsed JSON body when the request succeeded (HTTP 200).

    Raises
    ------
    Exception
        On any non-200 status: a short message for 400/403, or a full
        bug-report template (including request/response details) otherwise.
    """
    parsed_response = response.json()
    if print_response:
        pp.pprint({"API Response": parsed_response})

    if response.status_code == 200:
        return parsed_response
    elif response.status_code == 400:
        raise Exception("Bad request ^")
    elif response.status_code == 403:
        msg = "Please make sure your request headers include X-Api-Key and that you are using correct url"
        raise Exception(msg)
    else:
        # Timestamp for the auto-generated summary (UTC, second precision).
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC
        # time and drop the tzinfo so the isoformat() output is unchanged.
        now = datetime.datetime.now(datetime.timezone.utc).replace(microsecond=0, tzinfo=None).isoformat()
        msg = f"""\n\n
------------------------------------- BEGIN ERROR MESSAGE -----------------------------------------
It seems our server encountered an error which it doesn't know how to handle yet.
This sometimes happens with unexpected input(s). In order to help us diagnose and resolve the issue,
could you please fill out the following information and email the entire message between ----- to
danf@usc.edu:
1) URL of notebook (if using the one from https://hub.mybinder.org/...): [*****PLEASE INSERT ONE HERE*****]
2) Snapshot/picture of the cell that resulted in this error: [*****PLEASE INSERT ONE HERE*****]
Thank you and we apologize for any inconvenience. We'll get back to you as soon as possible!
Sincerely,
Dan Feldman
Automatically generated summary:
- Time of occurrence: {now}
- Request method + url: {response.request.method} - {response.request.url}
- Request headers: {response.request.headers}
- Request body: {response.request.body}
- Response: {parsed_response}
--------------------------------------- END ERROR MESSAGE ------------------------------------------
\n\n
"""
        raise Exception(msg)
# For real interactions with the data catalog, use api.mint-data-catalog.org
url = "https://sandbox.mint-data-catalog.org"

# When you register datasets or resources, we require you to pass a "provenance_id". This is a unique id associated
# with your account so that we can keep track of who is adding things to the data catalog. For sandboxed interactions
# with the data catalog api, please use this provenance_id:
provenance_id = "e8287ea4-e6f2-47aa-8bfc-0c22852735c8"

# Step 1: Get session token to use the API
# Route the response through handle_api_response so that a failed token
# request raises a descriptive error instead of a KeyError on 'X-Api-Key'.
resp = handle_api_response(requests.get(f"{url}/get_session_token"))
print(resp)

api_key = resp['X-Api-Key']
# Headers required by every subsequent API call
request_headers = {
    'Content-Type': "application/json",
    'X-Api-Key': api_key
}
Recall from the data catalog primer that a dataset is a logical grouping of data about specific variables contained in one or more resources.
To make the above statement more concrete, we will interactively go through the process of registering a toy dataset in the data catalog in order to make it available for others.
Let's say I have a dataset called "Temperature recorded outside my house" in which every day I note the temperature outside my apartment in the morning, afternoon, and evening. I then record those data points in a csv file temp_records_YYYY_MM_DD.csv that looks like (prettified):
Time | Temperature |
---|---|
2018-01-01T07:34:40 | 23 |
2018-01-01T12:15:28 | 32 |
2018-01-01T20:56:15 | 26 |
Note that each file contains data for a single day only.
In this example, my dataset would be "Temperature recorded outside my house", variables would be "Time" and "Temperature", and each csv file would be a resource associated with the dataset. In addition, since each file contains both of the variables in our dataset, each resource will be associated with both variables.
Now, I know that what I refer to as "Temperature" is actually the air temperature recorded in F, but my CSV files have no mention of the fact. If you just look at the file without any context, it's unclear what it is that is being recorded. Temperature of what? In what units? C, F, K?
In order to disambiguate those variable names, we require that each variable in your dataset to be associated with one or more standard variables. What makes a variable name "standard" is that it is a part of some ontology, so that anyone can examine that ontology and see for themselves semantic meaning of the variable. Most of our current datasets are mapped to standard names defined by the GSN ontology.
But you are not forced to map your variables to GSN names. Data catalog allows you to register your own set of standard_variable_names. The only requirement for now is that those standard names are associated with an ontology whose schema is publicly available.
There is no such thing as a "correct" standard variable or ontology - you can think of it as a formalized naming convention. As such, there isn't such a thing as a right or wrong standard name, but useful or not. If you are the only person who is using a convention (like naming all variables with single letters), it might not be useful to anyone but you. Sure, it's possible to teach somebody else your convention, but the less semantic structure your convention uses, the harder it will be for another person to learn it. The easier it is for someone to pick it up, the faster other people will adopt it.
So, what makes a good standard variable name? A good starting point is to take a look at the variable from a perspective of someone who is seeing just that piece of information for the first time. How many clarifying questions do you expect the other person to ask you before they understand the meaning of data in front of them? Those answers can then become parts of standard variable name.
Now that we know why we need standard variables, this is how we can register new ones, if needed.
# Each standard variable definition carries:
#   name     - standard variable name (aka label)
#   ontology - name of the ontology where standard variables are defined
#   uri      - full uri of the standard variable name (includes the ontology)
standard_variable_defs = {
    "standard_variables": [
        {
            "name": "Time_Standard_Variable",
            "ontology": "MyOntology",
            "uri": "http://my_ontology_uri.org/standard_names/time_standard_variable"
        },
        {
            "name": "Temperature_Standard_Variable",
            "ontology": "MyOntology",
            "uri": "http://my_ontology_uri.org/standard_names/temperature_standard_variable"
        }
    ]
}

resp = requests.post(f"{url}/knowledge_graph/register_standard_variables",
                     headers=request_headers,
                     json=standard_variable_defs)

# A successful request returns 'result': 'success' plus the registered standard
# variables with their record_ids. Those record_ids are unique identifiers
# (UUIDs) needed down the road to register variables.
parsed_response = handle_api_response(resp, print_response=True)
records = parsed_response['standard_variables']


def _record_named(recs, name):
    # First returned record whose "name" matches; raises StopIteration when
    # absent, exactly like the inline generator expression would.
    return next(r for r in recs if r["name"] == name)


# Keep the records we care about in python variables for later reference
time_standard_variable = _record_named(records, "Time_Standard_Variable")
temperature_standard_variable = _record_named(records, "Temperature_Standard_Variable")

## Uncomment below to see the structure of a specific variable:
# pp.pprint({"Time Standard Variable": time_standard_variable})
# pp.pprint({"Temperature Standard Variable": temperature_standard_variable})
# To check whether specific standard variables already exist in the data
# catalog, search by name; the catalog returns any existing records.
nonexistent_name = str(uuid.uuid4())
print(f"This name does not exist: {nonexistent_name}")

search_query = {
    "name__in": [
        "Time_Standard_Variable",
        "Temperature_Standard_Variable",
        nonexistent_name
    ]
}

resp = requests.post(f"{url}/knowledge_graph/find_standard_variables",
                     headers=request_headers,
                     json=search_query)
parsed_response = handle_api_response(resp, print_response=True)

# Below is how you'd extract standard_variables from the response if you need
# to reference them (their record_ids) later:
#
# existing_standard_variables = parsed_response["standard_variables"]
# print(existing_standard_variables)
There isn't a cut-and-dry answer and it will ultimately depend on the way you organize and think about your data. We define a dataset as "logical grouping of variables in a collection of resources". As long as your way fits this extremely broad definition, you should be ok. To illustrate this, let's go back to our toy example of "Temperature recorded outside my house", for which I record data for my time and temperature variables in multiple files. Originally, each file was a resource under my dataset, which makes sense because all of these data files describe (semantically) the same concept - temperature recorded outside my house. On the other hand, I could've made a similarly strong argument that each file is actually a separate logical entity that provides temperature data recorded outside my house on a specific date and should therefore be semantically differentiated from the temperature recorded on another date. But in the vast majority of cases, this distinction doesn't really matter because in the end, those that care about knowing the temperature outside my house on Jan 1st 2018 will be able to find the link to that file and download it. From the perspective of the end user, they care about the actual raw data, and not the (arguably somewhat arbitrary) distinction between a resource and a dataset. There will be an example later on how to "tag" your data with relevant temporal/spatial information so that it becomes searchable.
# Fixed record_id so re-running the registration cell below updates the same
# dataset record instead of creating a new one each time.
dataset_id = "4e8ade31-7729-4891-a462-2dac66158512" # This is optional; if not given, it will be auto-generated
## An example of how to generate a random uuid yourself (will be different every time method is run)
# print(str(uuid.uuid4()))
# print(str(uuid.uuid4()))
#
## This will generate the same record_id as long as the input string remains the same
## (uuid5 derives a deterministic, name-based UUID from the namespace + string)
#
# input_string = "some string 34_"
# print(str(uuid.uuid5(uuid.NAMESPACE_URL, str(input_string))))
# print(str(uuid.uuid5(uuid.NAMESPACE_URL, str(input_string))))
Every entity in data catalog (variables, standard_variables, datasets, resources) will have a unique id/record_id associated with it. This is what disambiguates e.g., two datasets that are named "MyDataset". These record_ids are either generated automatically, on our end if no "record_id" is provided, or you can generate them yourself using Python's uuid library (or any other library that generates uuids according to the international standard).
What this means in practice is that if you remove "record_id" from the dictionary below and rerun this cell 3 times, you will end up registering 3 datasets with identical name, description, metadata, and provenance_id. This is why if you register a new dataset (or variable, or resource, etc), it's important to note the returned object's record_id if/when you need to reference it later, rerun the same script in an idempotent manner, or update the record's attributes.
# Payload for dataset registration; "record_id" makes the call idempotent
dataset_defs = {
    "datasets": [
        {
            "record_id": dataset_id,  # Remove this line if you want to create a new dataset
            "provenance_id": provenance_id,
            "metadata": {"any_additional_metadata": "content"},
            "description": "Temperature recorded outside my house; collected over last month",
            "name": "Temperature recorded outside my house"
        }
    ]
}

resp = requests.post(f"{url}/datasets/register_datasets",
                     headers=request_headers,
                     json=dataset_defs)
parsed_response = handle_api_response(resp, print_response=True)
datasets = parsed_response["datasets"]

# Locate the returned dataset object whose name matches ours and keep
# its record_id for the variable/resource registrations below
dataset_record = next(rec for rec in datasets
                      if rec["name"] == "Temperature recorded outside my house")
dataset_record_id = dataset_record["record_id"]
# Again, these ids are optional and will be auto-generated if not given. They
# are included here to make requests idempotent (so that new records aren't
# being generated every time this code block is run)
time_variable_record_id = '9358af57-192f-4cc3-9bee-837e76819674'
temperature_variable_record_id = 'c22deb3b-ebda-48cb-950a-2f4f00498197'

# "standard_variable_ids" is an array: a "local" variable can be linked to any
# number of standard variables (and not necessarily all at once), which is how
# multiple standard names and ontologies can be semantically linked later on.
# We reference the record_ids of the standard variables registered earlier.
variable_defs = {
    "variables": [
        {
            "record_id": time_variable_record_id,  # auto-generated when omitted
            "dataset_id": dataset_record_id,       # from register_datasets() call
            "name": "Time",
            "metadata": {
                "units": "ISO8601_datetime"
                # Any other metadata to associate with the variable can go here
            },
            "standard_variable_ids": [time_standard_variable["record_id"]]
        },
        {
            "record_id": temperature_variable_record_id,  # auto-generated when omitted
            "dataset_id": dataset_record_id,
            "name": "Temperature",
            "metadata": {"units": "F"},
            "standard_variable_ids": [temperature_standard_variable["record_id"]]
        }
    ]
}

resp = requests.post(f"{url}/datasets/register_variables",
                     headers=request_headers,
                     json=variable_defs)
parsed_response = handle_api_response(resp, print_response=True)

variables = parsed_response["variables"]
time_variable = next(rec for rec in variables if rec["name"] == "Time")
temperature_variable = next(rec for rec in variables if rec["name"] == "Temperature")

## Uncomment below to print individual records
# print(f"Time Variable: {time_variable}")
# print(f"Temperature Variable: {temperature_variable}")
Assume that I host my dataset's files on www.my_domain.com/storage
# Base url of the remote storage location hosting the csv files below
data_storage_url = "www.my_domain.com/storage"
Also, assume that I've collected 2 days worth of data in temp_records_2018_01_01.csv and temp_records_2018_01_02.csv ...
# One csv file per day of recorded temperatures
file_1_name = "temp_records_2018_01_01.csv"
file_2_name = "temp_records_2018_01_02.csv"
...and uploaded them to my remote storage location
# Full download urls for each file; used as the "data_url" of each resource
file_1_data_url = f"{data_storage_url}/{file_1_name}"
file_2_data_url = f"{data_storage_url}/{file_2_name}"
# Similar to dataset and variable registrations, we are going to generate unique resource record_ids to
# make these requests repeatable without creating new records. But remember, these will be auto-generated
# if not given
file_1_record_id = "dd52e66b-3149-4d46-8f8e-a18e46136e55"
file_2_record_id = "25916ccf-d108-4187-b243-2b257ce67fa5"
If I want my resources to be searchable by time range, I can "annotate" each resource with corresponding temporal coverage. That way, when someone searches for any datasets that contain "Temperature_Standard_Variable" for January 01 2018, my file_1_name will be returned, along with the data url, and the users will be able to download it easily. Note that temporal coverage must have "start_time" and "end_time" and must follow ISO 8601 datetime format YYYY-MM-DDTHH:mm:ss
# Temporal coverage "tags" make a resource searchable by time range.
# "start_time" and "end_time" are required and must follow the ISO 8601
# datetime format YYYY-MM-DDTHH:mm:ss. Each file covers one full day.
file_1_temporal_coverage = {
    "start_time": "2018-01-01T00:00:00",
    "end_time": "2018-01-01T23:59:59"
}
file_2_temporal_coverage = {
    "start_time": "2018-01-02T00:00:00",
    "end_time": "2018-01-02T23:59:59"
}
Let's say that my house is somewhere in LA, defined by the following bounding box
(where x refers to longitude and y refers to latitude; for LA, longitudes are ~-118.4 and latitudes ~33.9)
x_min: -118.4253354
y_min: 33.9605286
x_max: -118.4093589
y_max: 33.9895077
We can annotate our resources with spatial coverage. Since all of our resources come from the same location, we can reuse the same values. If you have multiple resources with different locations, you can follow the temporal annotation example above.
Things to note here are the required "type" and "value" parameters.
# Spatial coverage shared by all resources, given as a WGS84 bounding box.
# "type" and "value" are both required. Per the convention stated above,
# x is longitude and y is latitude — so for LA the longitudes (~-118.43 to
# -118.41) belong in xmin/xmax and the latitudes (~33.96 to 33.99) in
# ymin/ymax. The original values had the two axes swapped.
spatial_coverage = {
    "type": "BoundingBox",
    "value": {
        "xmin": -118.4253354,
        "ymin": 33.9605286,
        "xmax": -118.4093589,
        "ymax": 33.9895077
    }
}
# Build one resource definition per csv file. Both resources share the same
# dataset, provenance, variables and spatial coverage; only the record_id,
# name, data_url and temporal coverage differ per file.
resource_defs = {"resources": []}
for rec_id, res_name, res_url, temporal in [
    (file_1_record_id, file_1_name, file_1_data_url, file_1_temporal_coverage),
    (file_2_record_id, file_2_name, file_2_data_url, file_2_temporal_coverage),
]:
    resource_defs["resources"].append({
        "record_id": rec_id,
        "dataset_id": dataset_record_id,
        "provenance_id": provenance_id,
        "variable_ids": [
            time_variable["record_id"],
            temperature_variable["record_id"]
        ],
        "name": res_name,
        "resource_type": "csv",
        "data_url": res_url,
        "metadata": {
            "spatial_coverage": spatial_coverage,
            "temporal_coverage": temporal
        },
        "layout": {}
    })

# ... and register them in bulk
resp = requests.post(f"{url}/datasets/register_resources",
                     headers=request_headers,
                     json=resource_defs)
parsed_response = handle_api_response(resp, print_response=True)

resources = parsed_response["resources"]
resource_1 = next(rec for rec in resources if rec["name"] == file_1_name)
resource_2 = next(rec for rec in resources if rec["name"] == file_2_name)

## Uncomment below to print individual records
# print(f"{file_1_name}: {resource_1}")
# print(f"{file_2_name}: {resource_2}")
After registering datasets/variables/resources, we can now programmatically search for relevant information. Below, you'll see examples of searching for data using standard variable names, temporal, and spatial coverages. Currently, these are the only search filters we support, but we'll be adding more as we get more feature requests. If you would like to search the data catalog by other keywords, please let me know at danf@usc.edu
# 1) Searching by standard_names
# Find every resource tagged with our temperature standard variable
search_query_1 = {
    "standard_variable_names__in": [temperature_standard_variable["name"]]
}
response = requests.post(f"{url}/datasets/find",
                         headers=request_headers,
                         json=search_query_1)
resp = response.json()
if resp['result'] == 'success':
    found_resources = resp['resources']
    print(f"Found {len(found_resources)} resources")
    print(found_resources)
# 2) Searching by spatial_coverage
# Bounding box search parameter is a 4-element numeric array (in WGS84
# coordinate system): [xmin, ymin, xmax, ymax]. As a reminder, x is
# longitude and y is latitude.
_box = spatial_coverage["value"]
bounding_box = [_box["xmin"], _box["ymin"], _box["xmax"], _box["ymax"]]

search_query_2 = {"spatial_coverage__within": bounding_box}

resp = requests.post(f"{url}/datasets/find",
                     headers=request_headers,
                     json=search_query_2).json()
if resp['result'] == 'success':
    found_resources = resp['resources']
    print(f"Found {len(found_resources)} resources")
    print(found_resources)
# 3) Searching by temporal_coverage and standard_names
# (The bounding-box comment that used to be here was a copy-paste mistake;
# this query filters on time, not space.) Times follow the ISO 8601 format
# YYYY-MM-DDTHH:mm:ss. NOTE(review): __gte/__lte presumably keep resources
# whose coverage starts at/after start_time and ends at/before end_time —
# confirm against the API documentation.
start_time = "2018-01-01T00:00:00"
end_time = "2018-01-21T23:59:59"
search_query_3 = {
    "standard_variable_names__in": [temperature_standard_variable["name"]],
    "start_time__gte": start_time,
    "end_time__lte": end_time
}
resp = requests.post(f"{url}/datasets/find",
                     headers=request_headers,
                     json=search_query_3).json()
if resp['result'] == 'success':
    found_resources = resp['resources']
    print(f"Found {len(found_resources)} resources")
    pp.pprint(found_resources)
# 4) Searching by dataset_names
search_query_4 = {"dataset_names__in": ["Temperature recorded outside my house"]}

resp = requests.post(
    f"{url}/datasets/find", headers=request_headers, json=search_query_4
).json()
if resp['result'] == 'success':
    found_resources = resp['resources']
    print(f"Found {len(found_resources)} resources")
    pp.pprint(found_resources)
# 5) Searching by dataset ids
search_query_5 = {"dataset_ids__in": [dataset_id]}

resp = requests.post(
    f"{url}/datasets/find", headers=request_headers, json=search_query_5
).json()
if resp['result'] == 'success':
    found_resources = resp['resources']
    print(f"Found {len(found_resources)} resources")
    pp.pprint(found_resources)