Data Engineering with Intake

Intake's philosophy draws a clear separation of concerns between the provider of data and the consumer of data. This tutorial concerns the former: someone who cares about where a particular dataset is stored and about the right format and options for best retrieval. It is their task to make these choices and then expose the data to end-users (such as data scientists), so that they have a clear path to finding and accessing their data. There is no need to train users in how to investigate or load a particular dataset, because those details are encoded in the catalog.

Intake catalogs act as a single source of truth about the data in question. The principal job of a data engineer, while interacting with Intake, is to find the best representation of data-sets (as they would have to do in any case) and to author catalogs as a means of both codifying the data-sets in versionable files and exposing them to users with a clear contract.

In this tutorial we will show the workflow for writing a catalog and thereby providing data to your users.

In [ ]:
import intake
import hvplot.pandas
intake.output_notebook()

Intake datasets are loaded by data drivers, which are class definitions. Some of these are included with the main Intake package, but many more are available as extra installs (see the plugins directory). The currently installed set of drivers is listed by name in the registry:

In [ ]:
list(intake.registry)

Each of these is associated with an intake.open_* function. The registry, and these functions, are built at import time by scanning all installed packages with names starting with "intake_" and containing DataSource subclasses at the top level. It is also possible to refer to drivers by the fully-qualified name (e.g., "package.submodule.DriverClass"), but in this example we will concentrate on the "csv" driver, which is included in the default Intake install.
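
To make the two lookup routes above more concrete, here is a minimal sketch using only the standard library. It is not Intake's actual discovery code: the helper names `find_prefixed_packages` and `resolve_qualified_name` are hypothetical, and the example resolves a standard-library class because driver packages may not be installed.

```python
import importlib
import pkgutil


def find_prefixed_packages(prefix="intake_"):
    """Return the names of importable top-level modules starting with `prefix`."""
    return sorted(
        module.name
        for module in pkgutil.iter_modules()
        if module.name.startswith(prefix)
    )


def resolve_qualified_name(name):
    """Import 'package.submodule.ClassName' and return the named attribute."""
    module_name, _, attribute = name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Candidate driver packages found on this machine (possibly an empty list).
print(find_prefixed_packages())

# Resolve a class from its fully-qualified name; a standard-library class
# stands in for a driver class here.
resolved = resolve_qualified_name("collections.OrderedDict")
print(resolved)
```

A real registry would additionally import each candidate package and look for DataSource subclasses at the top level, which this sketch leaves out.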

Commonly, the first step to writing a catalog entry might be to use the relevant open_ function to create a DataSource object. The functions are documented with text from the driver class. Getting the right set of parameters can take some iterations, and most of the drivers can take many optional parameters - some domain knowledge of the format in question may be required at this point. The CSV format is fairly simple, and in this particular case, it loads without extra arguments:

In [ ]:
source = intake.open_csv('https://timeseries.weebly.com/uploads/'
                         '2/1/0/8/21086414/sea_ice.csv')

That line just created the DataSource object, which may have verified that the arguments were reasonable, but did not actually access the data (i.e., it is a lazy process). To test if the loading is successful, we need to interrogate the source, such as looking at the details, or some (or all) of the loaded data:

In [ ]:
# access the basic details of the data
source.discover()
In [ ]:
# load and use data
df = source.read()
df.head()
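
The lazy behaviour described above can be pictured with a toy class. This is only an illustration of the idea, not how Intake's drivers are implemented: `LazyCSVSource` is a hypothetical name, and the in-memory sample uses made-up values rather than the real sea ice data.

```python
import csv
import io


class LazyCSVSource:
    """Toy stand-in for a lazy data source; not Intake's implementation."""

    def __init__(self, opener):
        # Only remember how to open the data; nothing is read here.
        self._opener = opener
        self.read_count = 0

    def read(self):
        # Every call re-opens and re-parses the data.
        self.read_count += 1
        with self._opener() as handle:
            return list(csv.DictReader(handle))


# Made-up values standing in for a remote CSV file.
SAMPLE = "Time,Arctic,Antarctica\nt0,1.0,2.0\nt1,3.0,4.0\n"

toy_source = LazyCSVSource(lambda: io.StringIO(SAMPLE))
reads_before = toy_source.read_count  # still zero: creation did not load anything
rows = toy_source.read()              # the data is parsed only now
print(reads_before, toy_source.read_count, rows[0]["Arctic"])
```

Creating the object costs nothing, while every call to `read()` repeats the full load.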

We find that the data loaded fine (although we may have wished to parse those Time values into timestamps). Having got here, we might investigate some plots that best show the characteristics of the data. The purpose of this is to come up with plot specs that might be stored along with the arguments required to load the data, so that users can get a quick overview of the contents. This is normally an iterative process, but we just show one particular invocation that happens to produce an informative figure:

In [ ]:
# try some plotting
df.hvplot(kind='line', x='Time', y=['Arctic', 'Antarctica'],
          width=700, height=500)

Notice that we plotted on the (in-memory) Pandas data-frame here, but the identical output can be had from the DataSource instance too. Plotting from the source loads the data every time, which may or may not be desired. It is possible, but not included in this tutorial, to specify a dataset as cache-on-first-access, so that the speed of such output is greatly improved.

In [ ]:
source.hvplot(kind='line', x='Time', y=['Arctic', 'Antarctica'],
              width=700, height=500)

At this point, we have a source which correctly loads the data, and we have graphical output too. We can get the YAML-syntax prescription of the source by directly calling a method:

In [ ]:
# prescription for the source we made
print(source.yaml())

Finally, we can create a YAML-syntax catalog file containing that prescription, with some extra description, and the addition of the plot we trialled above.

In [ ]:
%%writefile sea.yaml
sources:
    sea_ice:
      args:
        urlpath: "https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv"
      description: "Polar sea ice cover"
      driver: csv
      metadata:
        plots:
          basic:
            kind: line
            x: Time
            y: [Arctic, Antarctica]
            width: 700
            height: 500
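
Loader options can travel in the catalog alongside the URL. As a hedged sketch, the Time column noted earlier might be parsed by a variant entry like the one written out below. It assumes that the csv driver accepts a `csv_kwargs` argument and forwards it to the underlying CSV reader. Whether the reader can infer the date format of this particular column has not been checked here. The entry name `sea_ice_parsed` and the file name `sea_parsed.yaml` are made up for illustration, and the file is written to a temporary directory from plain Python rather than with `%%writefile`.

```python
import tempfile
from pathlib import Path

# Variant catalog text; `sea_ice_parsed` is a hypothetical entry name.
CATALOG_TEXT = """\
sources:
  sea_ice_parsed:
    description: "Polar sea ice cover, Time parsed as timestamps"
    driver: csv
    args:
      urlpath: "https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv"
      # Assumption: csv_kwargs is passed through to the CSV reader.
      csv_kwargs:
        parse_dates: [Time]
"""

# Write the catalog to a scratch location so nothing in the repo is touched.
catalog_path = Path(tempfile.mkdtemp()) / "sea_parsed.yaml"
catalog_path.write_text(CATALOG_TEXT)
print(catalog_path.read_text())
```

Such a variant should be checked with `intake.open_catalog` and a call to `read()` before being shared, just like the original entry.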

To test that this catalog does the right thing, we can load it in again and try to work with it:

In [ ]:
# load that prescription
cat = intake.open_catalog('sea.yaml')
In [ ]:
# plot is automatic
cat.sea_ice.plot.basic()

OK! The catalog is functional and ready to be shared. The easiest way to share an Intake catalog is simply to put it in a location where it can be read by your target audience. For this tutorial, which is stored in a GitHub repo, that can be the URL of the file in the repo.

So all that you need to share with your users is the URL of the catalog. You can try it for yourself, but the following line is what you would expect the data users to execute. Of course, it is good practice to verify that this works too.

In [ ]:
# put catalog in public place for sharing
cat = intake.open_catalog('https://raw.githubusercontent.com/intake/'
                          'intake-examples/master/tutorial/sea.yaml')
In [ ]:
# load data as before
cat.sea_ice.read().head()

One interesting note is that the catalog is also a DataSource instance. This means that you can refer to it from other catalogs, and thus build up a hierarchy of data sources. For instance, you might have a root or master catalog, which points to several other catalogs, each of which contains entries of a particular type; the whole thing is searchable with either code or the Intake GUI. This way, the data collection as a whole has a structure that makes navigating to the right data-set simpler. You can even have separate hierarchies pointing to the same data, for alternative ways to split up the set of data-sets.

In [ ]:
# the cat as a data source
print(cat.yaml())
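
A root catalog of the kind described above might look like the following sketch. It assumes that nested YAML catalogs are referenced with the `yaml_file_cat` driver and a `path` argument. The entry name `sea` and the file name `root.yaml` are hypothetical, and the file is written to a temporary directory purely for illustration.

```python
import tempfile
from pathlib import Path

# Root catalog with a single entry that is itself a catalog.
ROOT_TEXT = """\
sources:
  sea:
    description: "Catalog of sea ice datasets"
    # Assumption: yaml_file_cat is the driver name for YAML catalog files.
    driver: yaml_file_cat
    args:
      path: "https://raw.githubusercontent.com/intake/intake-examples/master/tutorial/sea.yaml"
"""

# Write the root catalog to a scratch location.
root_path = Path(tempfile.mkdtemp()) / "root.yaml"
root_path.write_text(ROOT_TEXT)
print(root_path.read_text())
```

With Intake installed, opening this file with `intake.open_catalog(str(root_path))` would be expected to expose the nested entry as `.sea.sea_ice`, although that step is not exercised here.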