Intake for a Data Science workflow

A data scientist wishes to find data for a given task and to inspect its attributes, to make sure it is the right data for the job. Next, the data should be loaded, ready for analysis; the less the scientist needs to know about the specifics of the loading package, the file format and the storage backend, the better.

Intake provides an easy way to find your data, locally, in a cloud service, or on an Intake server. In this tutorial, we show an overview of working with Intake from the point of view of the data scientist, or data end-user: someone who wants a package that can fetch the data and then "get out of the way".

The common starting point for finding and inspecting data-sets is with a catalog, which is a collection of entries, each of which corresponds to a specific data-set. The entries have names, descriptions and metadata, to allow for searching and filtering of the entries, to find the specific data which solves a particular problem.
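For reference, a catalog is typically specified in a YAML file. A minimal sketch might look like the following (the description and URL here are illustrative placeholders, not a real catalog):

```yaml
sources:
  us_crime:
    description: US crime statistics by year
    driver: csv
    args:
      urlpath: 's3://example-bucket/us_crime.csv'  # hypothetical location
```

Each entry under sources names a dataset, states which driver loads it, and passes the driver its arguments, such as the path to the data.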

There are two common starting points for a data scientist:

  • built-in data, which may have been conda-installed, or otherwise appear on a predetermined search-path
  • specific URLs containing catalog specifications.

Both of these can be accessed programmatically or via the GUI, and we'll demonstrate each.

In [ ]:
import intake
intake.output_notebook()  # enables source.plot API

In this environment, we have "conda installed data" which could look as follows:

conda install -c intake us_crime

This action puts a special catalog spec file, in the YAML format, in a location which Intake searches at import time. The effect is to make some datasets automatically available in the special catalog intake.cat. In the following cell, the entry "us_crime" corresponds to the dataset installed by the conda command above.

In [ ]:
intake.cat
In [ ]:
# find the entries in the catalog
list(intake.cat)

Any entry of any catalog is easy to investigate. The standard repr already gives a lot of information about the data. Note that you can use attribute access to grab a specific entry if its name is a valid Python identifier, or getitem syntax ["name"], which always works.

In [ ]:
# select an entry
# same as intake.cat["us_crime"]
intake.cat.us_crime

We found that this represents a dataframe, to be read by the driver "csv".

A catalog entry is just a definition of how Intake should go about grabbing the data, together with useful descriptions. To go deeper, we can instantiate the data source, which is a concrete representation of the data; it can be used to load metadata bound up in the storage itself, and to actually read all of the data. The following cell touches the first part of the data, to infer the types of the columns of the dataframe.

In [ ]:
# detailed info - touches data
s = intake.cat.us_crime()
s.discover()

So what should we do with this data? Most obvious, and hinted at in the output above, would be to view the pre-specified plot that was included in the spec. There is one named plot, called "example".

In [ ]:
s.plots

We can view it with the following one-liner. Note that this plot is interactive.

In [ ]:
# the included quick-plot
intake.cat.us_crime.plot.example()

Even more commonly, we may just load the data into memory. In this case, the in-memory representation of the data is a Pandas dataframe, which will be familiar to most data scientists. For every data-set, there is one particular container type, like the Pandas dataframe, which is the right way to represent that data. Often, though, there are multiple ways to load it: for example, the .to_dask() method would have created a dask.dataframe.DataFrame instead, which can be important for data-sets that are very large.

At this point, Intake is essentially done with this dataset: it has been delivered to the scientist, so that they can get on with whatever analysis they needed this data for.

In [ ]:
# load all data into memory
df = intake.cat.us_crime.read()
In [ ]:
# now we have a pandas dataframe
df.head()
In [ ]:
# Analysis:
# fraction of all theft that was vehicle theft across all years
df['Motor vehicle theft'].sum() / df['Larceny-theft'].sum()

This workflow could have been achieved via the Intake GUI also. Initially, the GUI will show the built-in datasets:

In [ ]:
intake.gui

Notice how only one catalog is loaded, and that if you select the source "us_crime", the same information is displayed about it as at the top of this notebook. Also, the "plot" button opens a panel below the main interface, containing the one prescribed plot for this data-set.

The (+) button in the interface allows you to use the file-browser or any URL to add catalogs to the interface; but you can also do this in code:

In [ ]:
intake.gui.add('sea.yaml')

Now a new catalog "sea" has appeared, and its one item is selected - which happens to be another csv data-set, but this time loaded from a remote location. Note that the catalog file is local, but the data is remote. It is perfectly possible for the catalog file to also be remote.

In [ ]:
s = intake.gui.item()
s.discover()  # access information on the selected item.
In [ ]:
s.plot.basic()