THREDDS Catalogs: The Basics

Unidata AMS 2021 Student Conference

Focuses

Become familiar with THREDDS Catalogs and the THREDDS Data Server (TDS)
Browse THREDDS Catalogs using Siphon
Show metadata and available datasets contained within a THREDDS Catalog
List the data access methods associated with a dataset

Objectives

Read a THREDDS Catalog
Move from one THREDDS Catalog to another
Work with a TDS Catalog Dataset

---

Imports

The main Python package we will use to work with THREDDS Catalogs is Siphon. Siphon can read THREDDS Catalogs, which are XML documents that do one or more of the following:

reference other THREDDS Catalogs
expose metadata about a dataset
describe how to access a dataset

The XML documents themselves can be written by hand, but often they are generated by a server, such as the THREDDS Data Server. They may be read locally from an XML file, or remotely over HTTP. Siphon greatly simplifies the process of reading and using the information contained in the XML, allowing users to "siphon off data" from a variety of sources.

In [ ]:

from siphon.catalog import TDSCatalog

Read a THREDDS Catalog from a TDS

For this notebook, we will use the Unidata demonstration TDS. If you visit the server at <https://thredds.ucar.edu/thredds/catalog/catalog.html> in your browser, you will see something like the image at the top of this notebook. The page you see is generated by the TDS itself; we call it an HTML view of the catalog. If you change the last part of the URL from .html to .xml (that is, <https://thredds.ucar.edu/thredds/catalog/catalog.xml>), you will see the actual THREDDS Catalog in your browser, which looks similar to this:

<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" xmlns:xlink="http://www.w3.org/1999/xlink"   name="Unidata THREDDS Data Server" version="1.0.1">
  <dataset name="Realtime data from IDD">
    <catalogRef xlink:href="idd/forecastModels.xml" xlink:title="Forecast Model Data" name=""/>
    <catalogRef xlink:href="idd/forecastProdsAndAna.xml" xlink:title="Forecast Products and Analyses" name=""/>
    <catalogRef xlink:href="idd/obsData.xml" xlink:title="Observation Data" name=""/>
    <catalogRef xlink:href="idd/radars.xml" xlink:title="Radar Data" name=""/>
    <catalogRef xlink:href="idd/satellite.xml" xlink:title="Satellite Data" name=""/>
  </dataset>
  <dataset name="Other Unidata Data">
    <catalogRef xlink:href="casestudies/catalog.xml" xlink:title="Unidata case studies" name=""/>
  </dataset>
</catalog>

This catalog tells us that there are other catalogs containing data from forecast models, radar data, satellite data, etc. In this sense, a catalog can point to other catalogs, creating a tree-like structure in which the datasets are organized. This will vary from server to server, as needs vary across organizations and groups.
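To make the tree structure concrete, here is a minimal sketch that parses an abridged copy of the catalog above with Python's standard-library `xml.etree.ElementTree` and lists the title and href of each `catalogRef`. The helper name `list_catalog_refs` is our own, and this is only an illustration of what the XML contains, not a description of how Siphon is implemented.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the catalog XML shown above.
CATALOG_XML = """\
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         name="Unidata THREDDS Data Server" version="1.0.1">
  <dataset name="Realtime data from IDD">
    <catalogRef xlink:href="idd/radars.xml" xlink:title="Radar Data" name=""/>
    <catalogRef xlink:href="idd/satellite.xml" xlink:title="Satellite Data" name=""/>
  </dataset>
  <dataset name="Other Unidata Data">
    <catalogRef xlink:href="casestudies/catalog.xml" xlink:title="Unidata case studies" name=""/>
  </dataset>
</catalog>
"""

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def list_catalog_refs(xml_text):
    """Return {title: href} for every catalogRef element in the document."""
    root = ET.fromstring(xml_text)
    refs = {}
    # Element and attribute names are namespace-qualified: {namespace}name
    for ref in root.iter(f"{{{THREDDS_NS}}}catalogRef"):
        title = ref.get(f"{{{XLINK_NS}}}title")
        refs[title] = ref.get(f"{{{XLINK_NS}}}href")
    return refs


refs = list_catalog_refs(CATALOG_XML)
for title, href in refs.items():
    print(f"{title}: {href}")
```

Note that the hrefs are relative paths, and that the human-readable name of each referenced catalog lives in the `xlink:title` attribute rather than in `name`.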

We can use Siphon to read in this remote catalog programmatically, without the need for a web browser:

In [ ]:

catalog = TDSCatalog('https://thredds.ucar.edu/thredds/catalog/catalog.xml')

Now that we have read in the THREDDS Catalog from the THREDDS Data Server, we can investigate what information it holds. The names of the other catalogs it points to are held in the catalog_refs instance attribute, which can be accessed as follows:

In [ ]:

catalog.catalog_refs
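The hrefs in the catalog XML (for example, idd/satellite.xml) are relative to the URL of the catalog that contains them. As a rough sketch of how such a reference turns into a full URL, the standard-library `urljoin` applies the usual relative-URL rules; we are assuming here that ordinary URL resolution is all that is needed, and Siphon's internals may differ in detail.

```python
from urllib.parse import urljoin

# URL of the catalog we read, and one relative href taken from its XML.
catalog_url = "https://thredds.ucar.edu/thredds/catalog/catalog.xml"
relative_href = "idd/satellite.xml"

# The last path segment of the base URL (catalog.xml) is replaced by the href.
resolved = urljoin(catalog_url, relative_href)
print(resolved)
```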

There are more things you can do once you have read a THREDDS Catalog using Siphon, but for now we'll leave it at this.

Note: Not all TDS catalogs are intended to be browsed directly. Occasionally, the TDS is used purely as "middleware", and the catalogs are not set up for users to browse easily. An example is the set of catalogs produced by the TDS serving data for the North America component of the Coordinated Regional Downscaling Experiment ([NA-CORDEX](https://na-cordex.org/index.html)). The data providers intend for users to find datasets through the NA-CORDEX search page on the NCAR Climate Data Gateway, which lets users search by variable type, experiment, driver, model, etc. Although the NA-CORDEX datasets are hosted on a TDS, they are all contained in one catalog, and their names are built from a combination of the parameters used in the NA-CORDEX search page. For example, one of the over 62,000 NA-CORDEX datasets is named cordex.raw.NAM-44.seas.RCA4.EC-EARTH.rcp26.prec.v20180914. When you see a THREDDS Catalog in which the datasets have opaque names like this, that's your clue that the catalogs are probably not intended to be browsed directly by users, but rather accessed through another service (such as the NA-CORDEX search interface on the Climate Data Gateway).

Reading a referenced catalog

If we'd like to see what is available in the 'Satellite Data' catalog, we can use the .follow() method to read in the new catalog, and look at the .catalog_refs instance attribute of the new catalog:

In [ ]:

satellite_catalog = catalog.catalog_refs['Satellite Data'].follow()
satellite_catalog.catalog_refs

The URL of the new catalog is in the catalog_url instance attribute, and can be accessed as follows:

In [ ]:

satellite_catalog.catalog_url

Any datasets described by the catalog are contained in the datasets instance attribute:

In [ ]:

satellite_catalog.datasets

The [] indicates there are no datasets contained within the catalog. We can continue to work our way down through the catalog structure until we reach a catalog that contains a dataset.

In [ ]:

goes_east_grb_catalog = satellite_catalog.catalog_refs['GOES East GOES Rebroadcast (GRB)'].follow()
print(goes_east_grb_catalog.catalog_url)
print('  catalogs: {}'.format(goes_east_grb_catalog.catalog_refs))
print('  datasets: {}\n'.format(goes_east_grb_catalog.datasets))

abi_catalog = goes_east_grb_catalog.catalog_refs['ABI'].follow()
print(abi_catalog.catalog_url)
print('  catalogs: {}'.format(abi_catalog.catalog_refs))
print('  datasets: {}\n'.format(abi_catalog.datasets))

conus_catalog = abi_catalog.catalog_refs['CONUS'].follow()
print(conus_catalog.catalog_url)
print('  catalogs: {}'.format(conus_catalog.catalog_refs))
print('  datasets: {}\n'.format(conus_catalog.datasets))

channel01_catalog = conus_catalog.catalog_refs['Channel01'].follow()
print(channel01_catalog.catalog_url)
print('  catalogs: {}'.format(channel01_catalog.catalog_refs))
print('  datasets: {}\n'.format(channel01_catalog.datasets))

date_catalog = channel01_catalog.catalog_refs['20210110'].follow()
print(date_catalog.catalog_url)
print('  catalogs: {}'.format(date_catalog.catalog_refs))
print('  datasets: {}\n'.format(date_catalog.datasets))
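The repeated follow-and-print steps in the cell above can also be written as a loop. The sketch below uses a hypothetical helper, `follow_path` (our own name, not part of Siphon), which only assumes that each catalog object exposes `catalog_refs`, `datasets`, and `catalog_url`, as the catalogs above do.

```python
def follow_path(catalog, ref_names):
    """Follow a sequence of catalog references, printing a summary at each level.

    `catalog` is any object with `catalog_refs`, `datasets`, and `catalog_url`
    attributes (for example, a Siphon TDSCatalog). This helper is hypothetical
    and is not part of Siphon.
    """
    for name in ref_names:
        # Look up the reference by name, then read the catalog it points to.
        catalog = catalog.catalog_refs[name].follow()
        print(catalog.catalog_url)
        print(f"  catalogs: {list(catalog.catalog_refs)}")
        print(f"  datasets: {list(catalog.datasets)}\n")
    return catalog


# Usage with the catalogs from this notebook (requires network access):
# date_catalog = follow_path(
#     satellite_catalog,
#     ["GOES East GOES Rebroadcast (GRB)", "ABI", "CONUS", "Channel01", "20210110"],
# )
```

Keeping the path as a list of names makes it easy to change one level (for example, the date) without rewriting the whole cell.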

We used the follow() method several times before finally reaching a catalog with datasets. Normally, it is easiest to browse the catalogs of a TDS in a web browser to find a dataset collection you might be interested in using. Once you have found one, you can use the URL from your browser to begin working in Python using Siphon. For this collection of data (channel 1 of the Advanced Baseline Imager instrument on the GOES East satellite, CONUS domain), the catalog https://thredds.ucar.edu/thredds/catalog/satellite/goes/east/grb/ABI/CONUS/Channel01/catalog.xml looks like a good place to start, as it points to catalogs named by date (yyyyMMdd).
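Since a browser shows the HTML view of a catalog, the URL you copy will usually end in .html rather than .xml. A small sketch of the swap described earlier follows; `to_catalog_xml_url` is a hypothetical helper of our own, and the browser URL is simply the catalog URL above with the .html ending.

```python
from urllib.parse import urlsplit, urlunsplit


def to_catalog_xml_url(browser_url):
    """Swap a trailing .html for .xml in the path of a TDS catalog URL."""
    parts = urlsplit(browser_url)
    path = parts.path
    if path.endswith(".html"):
        path = path[: -len(".html")] + ".xml"
    # Rebuild the URL, leaving the scheme, host, and query untouched.
    return urlunsplit(parts._replace(path=path))


browser_url = (
    "https://thredds.ucar.edu/thredds/catalog/satellite/goes/east/grb/"
    "ABI/CONUS/Channel01/catalog.html"
)
xml_url = to_catalog_xml_url(browser_url)
print(xml_url)
```

The resulting URL can then be passed to TDSCatalog in the same way as the top-level catalog URL.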

Real-time data availability: In general, the datasets available on the demonstration TDS managed by Unidata are updated in real time. Data are removed from the server after a certain period of time, typically between three days and one month (depending on the size of the data files). This collection contains, roughly, the most recent 14 days of data.
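Because the date catalogs are named yyyyMMdd and only a rolling window of days is kept, it can help to generate candidate names rather than hard-code one. The sketch below builds the names for a 14-day window ending on a given day; `recent_date_names` is a hypothetical helper, and the 14-day figure is only the rough retention period mentioned above, so a generated name is not guaranteed to exist on the server.

```python
from datetime import date, timedelta


def recent_date_names(latest, days=14):
    """Return yyyyMMdd strings for `days` consecutive days ending at `latest`, newest first."""
    return [
        (latest - timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range(days)
    ]


# A fixed date is used here so the example is reproducible;
# in practice you might start from date.today().
names = recent_date_names(date(2021, 1, 10))
print(names[0], names[-1])
```

Before following a generated name, it is worth checking that it is actually present, for example with `name in channel01_catalog.catalog_refs`.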

As mentioned at the beginning of this notebook, catalogs can expose metadata about a dataset. The metadata instance variable holds any metadata defined by the catalog, such as dataFormat, documentation, etc. For example, the metadata associated with date_catalog looks like:

In [ ]:

date_catalog.metadata

The amount of metadata contained within a catalog depends on how much effort has been put into curating the collection.

Working with a TDS Catalog Dataset

Once we have found a catalog with datasets, we can access one of the datasets using its name:

In [ ]:

dataset = date_catalog.datasets['OR_ABI-L1b-RadC-M6C01_G16_s20210100156163_e20210100158536_c20210100158591.nc']
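The dataset name looks opaque, but it encodes the scan times. As a hedged sketch, the code below pulls the start time out of the `_s...` field, assuming the GOES-R style layout of year, day of year, hour, minute, second, and tenth of a second; that layout is an assumption on our part and is not stated in this notebook. `scan_start_time` is a hypothetical helper.

```python
import re
from datetime import datetime

name = "OR_ABI-L1b-RadC-M6C01_G16_s20210100156163_e20210100158536_c20210100158591.nc"


def scan_start_time(filename):
    """Parse the s<YYYY><DDD><HH><MM><SS><tenth> field of a GOES-R style filename.

    The tenth-of-a-second digit is matched but discarded.
    """
    match = re.search(r"_s(\d{4})(\d{3})(\d{2})(\d{2})(\d{2})\d_", filename)
    if match is None:
        raise ValueError(f"no start-time field found in {filename!r}")
    # %j is the day of the year (001-366).
    return datetime.strptime("".join(match.groups()), "%Y%j%H%M%S")


start = scan_start_time(name)
print(start)
```

Day 010 of 2021 is January 10, which matches the 20210110 catalog this dataset came from.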

Now that we have a dataset, we can see how it can be accessed using the access_urls instance attribute:

In [ ]:

dataset.access_urls

Each service provides a unique way of accessing the metadata or actual data contained within the dataset. Other Siphon notebooks explore ways in which the services can be used, but at this point, you are ready to begin your data analysis journey!
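Because access_urls maps service names to URLs, choosing a service comes down to a dictionary lookup. The sketch below picks the first available service from a preference list; `pick_access_url` is a hypothetical helper, and the sample dictionary (including its keys and example.org URLs) is an illustrative stand-in, since the services actually offered depend on how the server is configured.

```python
def pick_access_url(access_urls, preferred=("OPENDAP", "HTTPServer")):
    """Return (service, url) for the first preferred service the dataset offers."""
    for service in preferred:
        if service in access_urls:
            return service, access_urls[service]
    raise KeyError(f"none of {preferred} available; found {sorted(access_urls)}")


# Illustrative stand-in for dataset.access_urls; the real keys and URLs
# depend on the server configuration.
example_access_urls = {
    "HTTPServer": "https://example.org/thredds/fileServer/sample.nc",
    "OPENDAP": "https://example.org/thredds/dodsC/sample.nc",
}

service, url = pick_access_url(example_access_urls)
print(service, url)
```

With a real dataset, you would pass `dataset.access_urls` in place of the sample dictionary.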
