Developing for Intake

Intake has been designed with a pluggable structure, so that new data sources and other features can be added as easily as possible. Here we demonstrate making a new plugin to read data from the GitHub API, using a small and handy package that already exists, pygithub. As a simple example, we will access only one particular part of the API.

The directory intake-github contains a full package to do the job. Let's investigate its contents by changing directory - the exact line you need next will vary depending on how you are running this notebook.

In [ ]:
cd intake-github/

The package layout

The contents are typical Python package files:

In [ ]:
import os
cwd = os.getcwd()
for path, dirs, files in os.walk(cwd):
    for f in files:
        print(os.path.join(path, f)[len(cwd):])

The setup.py and meta.yaml are standard distutils and conda package information boilerplate, specifying dependencies and a little metadata. The LICENSE file is empty, but in general, all open-source projects should have an appropriate license. With this structure, one can install the package:

pip install .

or add the modifier -e to link rather than copy. In the following section, we will execute this command. It is also possible to use pip to install directly from a remote GitHub repository.

You can create a distribution (in the dist/ folder), which could be uploaded to PyPI:

python setup.py sdist

And finally, you can create a conda package, which could be uploaded to a channel on anaconda.org:

conda build conda

(All of these commands are to be run in the intake-github directory.)

Note that once the package is installed via pip or conda, the package github_issues will become importable in your environment, and the driver defined below, "github_issues", will automatically be loaded into the Intake registry, so that the function intake.open_github_issues gets created at import time.

The code

This example contains just one DataSource class, although it could have had several related sources.

The source file contains very little code: a subclass of DataSource with the following class-level attributes:

    container = 'dataframe'
    version = '0.0.1'
    partition_access = False
    name = 'github_issues'

These name the new plugin, give it a version and an output data type (Pandas dataframe), and specify that the data will always be loaded in a single shot, without partitioning. Whether or not a dataframe is the right representation for this data is left for another discussion.
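As a rough sketch of how these attributes sit inside a class, the following uses a stand-in base class so that it runs without Intake installed; in the real package the base class would be Intake's DataSource. The class name GitHubIssuesSource and the constructor signature are assumptions made for illustration, inferred from the attributes self.organization, self.repo and self.ghkwargs that the _get_partition method relies on.

```python
class DataSource:
    """Stand-in for Intake's DataSource base class (assumption: present
    only so that this sketch runs on its own)."""

    def __init__(self, metadata=None):
        self.metadata = metadata or {}


class GitHubIssuesSource(DataSource):  # hypothetical class name
    # Class-level attributes, as listed above
    container = 'dataframe'
    version = '0.0.1'
    partition_access = False
    name = 'github_issues'

    def __init__(self, organization, repo, metadata=None, **kwargs):
        # Hypothetical signature: extra keyword arguments are kept so they
        # can later be passed on to pygithub unchanged
        self.organization = organization
        self.repo = repo
        self.ghkwargs = kwargs
        super().__init__(metadata=metadata)


source = GitHubIssuesSource('intake', 'intake', state='closed')
print(source.name, source.container, source.ghkwargs)
```

Under these assumptions, any keyword argument beyond the organization and repository names (here state='closed') simply travels along in ghkwargs until the data is actually requested.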

The only real logic is in the _get_partition method, which is what read() ends up calling:

    def _get_partition(self, _):
        from github import Github
        import pandas as pd
        gh = Github()
        repo = gh.get_repo('%s/%s' % (self.organization, self.repo))

        raw_data = repo.get_issues(**self.ghkwargs)
        data = {}
        for column in ['number', 'title', 'user', 'state', 'comments',
                       'created_at', 'updated_at', 'body']:
            # user is special case
            if column == 'user':
                data['user'] = [issue.user.login for issue in raw_data]
            else:
                data[column] = [getattr(issue, column) for issue in raw_data]

        return pd.DataFrame(data)

This code is typical of the kind of data coercion that might be required for different sources. Since we do not attempt partitioning here, the read() method is guaranteed to do the right thing. The external module pygithub takes care of actually formatting the HTTP REST calls to the GitHub API. We simply pass on keyword arguments, so a catalog author who would like finer control over what is downloaded for any specific instance of this data source would have to read the pygithub and/or GitHub API documentation.
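The coercion step can be tried out without network access. The sketch below swaps the pygithub issue objects for simple stand-ins with made-up values and applies the same column-by-column logic; the helper name issues_to_columns and the sample data are hypothetical and not part of the package.

```python
from types import SimpleNamespace

# Stand-ins for the issue objects returned by pygithub (hypothetical values)
raw_data = [
    SimpleNamespace(number=1, title='First issue', state='open',
                    user=SimpleNamespace(login='alice')),
    SimpleNamespace(number=2, title='Second issue', state='closed',
                    user=SimpleNamespace(login='bob')),
]


def issues_to_columns(issues, column_names):
    """Turn a sequence of issue-like objects into a dict of column lists."""
    issues = list(issues)  # materialise once so every column sees the same rows
    data = {}
    for column in column_names:
        if column == 'user':
            # user is a nested object; keep only the login name
            data['user'] = [issue.user.login for issue in issues]
        else:
            data[column] = [getattr(issue, column) for issue in issues]
    return data


columns = issues_to_columns(raw_data, ['number', 'title', 'user', 'state'])
print(columns)
```

A dictionary of equal-length lists like this is what the plugin hands to pd.DataFrame to produce the final output.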

We can install the package with pip, or we could build the conda package and install that instead. To distribute the package, it could be uploaded to PyPI, to a user conda channel on anaconda.org, or added as a feedstock to conda-forge.

In [ ]:
# install with dependencies. This produces a lot of text output and can take a little while.
!pip install -e .

Since the class is at the top level of the package, and the package name starts with intake_, it will be scanned when Intake is imported. Now the plugin automatically appears in the set of known plugins in the Intake registry, and an associated intake.open_github_issues function is created at import time.

In [ ]:
import intake
In [ ]:
'github_issues' in intake.registry

If this automatic behaviour is not desired, then either catalogs would need to reference the plugin module/location explicitly in the plugins: section, or the package would need to insert its class into the Intake registry when it is imported.
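To make the idea of a registry more concrete, here is a toy model of explicit registration. It is not Intake's actual implementation: a plain dictionary maps driver names to classes, and an open_<name> helper is derived from each entry. All names here (registry, register, make_open_function, ToySource) are hypothetical.

```python
registry = {}  # toy registry: driver name -> driver class


def register(cls):
    """Insert a driver class into the toy registry under its `name` attribute."""
    registry[cls.name] = cls
    return cls


def make_open_function(name):
    """Build an open_<name>-style callable that instantiates the registered class."""
    def open_source(*args, **kwargs):
        return registry[name](*args, **kwargs)
    open_source.__name__ = 'open_' + name
    return open_source


@register
class ToySource:
    name = 'github_issues'

    def __init__(self, organization, repo):
        self.organization = organization
        self.repo = repo


open_github_issues = make_open_function('github_issues')
source = open_github_issues('intake', 'intake')
print('github_issues' in registry, type(source).__name__)
```

In this picture, a package that registers its own class at import time only needs to run the equivalent of register once, instead of relying on being discovered by name.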

Now we can use our new plugin to load data.

In [ ]:
source = intake.open_github_issues('intake', 'intake')
In [ ]:
source.discover()
In [ ]:
out = source.read()
In [ ]:
# currently open issues and PRs in the Intake repo
# would it make sense here to regard the issue number as the index?
out[['number', 'title', 'state']]

The point of doing this is to enable data engineers to specify the driver in their catalogs or when distributing data packages with conda. Notice that in YAML form, our data source references the driver class in the source. If the package were not found on the system when such a catalog entry is loaded, an error would be raised. The second form also includes the optional plugins section. The plugins directory is a convenient place in which to maintain links to the various drivers.

In [ ]:
print(source.yaml())
In [ ]:
print(source.yaml(True))

Now there is a clear path for this particular type of data to be accessed by end-users, and authors of catalogs can use a spec like the one above to list particular data-sets in catalogs. If included in a conda data package, then the dependency (this intake-github package) can be automatically installed along with the catalog. If not using conda, users will find from the catalog spec that they need the github_issues python package and will have to install it themselves.
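As an illustration of what a catalog author might write, the sketch below saves a small catalog file that refers to the driver by name. The entry name intake_issues, the description, and the argument names organization and repo are assumptions for this example; the YAML printed by source.yaml() above is the authoritative form for your installation.

```python
import os
import tempfile

# A minimal catalog with a single entry using the github_issues driver.
# The entry name and argument names are assumptions for illustration.
catalog_text = """\
sources:
  intake_issues:
    description: Issues and PRs in the Intake repository
    driver: github_issues
    args:
      organization: intake
      repo: intake
"""

# Write the catalog to a temporary location
catalog_path = os.path.join(tempfile.mkdtemp(), 'catalog.yml')
with open(catalog_path, 'w') as f:
    f.write(catalog_text)

with open(catalog_path) as f:
    print(f.read())
```

With the plugin installed, a file like this could then be opened with intake.open_catalog(catalog_path), and the entry would appear as an attribute of the resulting catalog.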

Summary

In a small amount of code, we have shown how to wrap a third-party data-loading package to make it available to Intake as a plugin. Creating the plugin makes it possible to include data-sets made from this kind of data in data catalogs. In this way, we build a uniform interface to access a wide variety of arbitrary data sources.