Intake Caching

This notebook demonstrates how to use and manage caching with Intake to avoid repeated downloads of large data files.

Let's start with a simple example. First, import intake as normal.

In [ ]:
from intake.config import conf
conf['cache_download_progress'] = False # <-- turn off download progress to display download times

import intake
cat = intake.open_catalog('catalog.yml')
list(cat)
In [ ]:
sales = cat.sales()
cache = sales.cache[0]
cache.clear_all() # <-- clearing cache to make sure we start from scratch.
sales._urlpath

Here the urlpath points to a file on a remote HTTP server. When the data source is read for the first time, a download is triggered.

In [ ]:
%time df = sales.read()

Now let's read the data again. Notice that the read is faster this time, thanks to local caching.

In [ ]:
%time df = sales.read()

See that we do indeed have the data.

In [ ]:
df.head()

Looking under the hood at the default cache directory, notice the files now exist locally in a hashed subdirectory.

In [ ]:
%ls -la ~/.intake/cache/975358c19433bc3c5eae68abbde7f2ca

These subdirectories are named by hashing the data source driver, urlpath, and cache regex, which avoids collisions among data sources and cache specifications. We can call the _hash method directly to find the subdirectory name for a given urlpath.

In [ ]:
cache._hash(sales._urlpath)
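
To make the naming scheme concrete, here is a minimal sketch of how a subdirectory name could be derived from the three inputs. It is not Intake's actual `_hash` implementation: the function name `cache_subdir_name`, the use of MD5, the way the fields are joined, and the example values are all assumptions, chosen only because they produce a 32-character hexadecimal name like the one listed above.

```python
import hashlib


def cache_subdir_name(driver: str, urlpath: str, regex: str) -> str:
    """Return a hex digest identifying one (driver, urlpath, regex) combination."""
    # Join with a NUL separator so that ("ab", "c", ...) and ("a", "bc", ...)
    # cannot collapse into the same key. (Assumed scheme, for illustration.)
    key = "\0".join([driver, urlpath, regex])
    return hashlib.md5(key.encode("utf-8")).hexdigest()


# Hypothetical inputs; the real values come from the catalog entry.
name = cache_subdir_name("csv", "https://example.com/data/sales.csv", "example.com/data")
other = cache_subdir_name("csv", "https://example.com/data/sales.csv", "example.com")
print(name)
print(name != other)
```

Because every input feeds the digest, changing only the cache regex for the same urlpath already yields a different subdirectory.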

Inspecting the metadata shows the created timestamp, original path, and cached path.

In [ ]:
cache.get_metadata(sales._urlpath)

The data source will provide the cache directory if you are not sure where it is located.

In [ ]:
sales.cache_dirs

The cache can be cleared for an individual source.

In [ ]:
cache.clear_cache(sales._urlpath)
cache.get_metadata(sales._urlpath)

After clearing the cache, the files are removed from the cache directory.

In [ ]:
%ls -la ~/.intake/cache

If the data source is read again, the file is downloaded again.

In [ ]:
%time df = sales.read()
In [ ]:
%ls -la ~/.intake/cache/975358c19433bc3c5eae68abbde7f2ca

Cache object

Let's take a quick look at the cache object. This object provides utilities for managing cached data files. When a request for data is made, this object checks to see if data for the urlpath specified in the source exists on local disk in the cache directory. If so, it returns a reference to the local file path rather than the remote path. If the file(s) do not exist, it will download them, update the metadata, and return a local reference.
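
The lookup described above can be sketched with a small stand-in class. This is an illustration of the check, download, record, return sequence only, not Intake's cache class: `SketchCache`, its `load` method, the `fake_fetch` callable, and the example URL are hypothetical, and the real object stores its metadata on disk rather than in a dictionary.

```python
import os
import tempfile
import time


class SketchCache:
    """Toy file cache: maps a urlpath to a local file, fetching on a miss."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch      # hypothetical callable: urlpath -> bytes
        self.metadata = {}      # urlpath -> created time, original path, cached path

    def load(self, urlpath):
        entry = self.metadata.get(urlpath)
        if entry and os.path.exists(entry["cache_path"]):
            return entry["cache_path"]      # cache hit: no download
        # Cache miss: download, store locally, then record the metadata.
        os.makedirs(self.cache_dir, exist_ok=True)
        cache_path = os.path.join(self.cache_dir, os.path.basename(urlpath))
        with open(cache_path, "wb") as f:
            f.write(self.fetch(urlpath))
        self.metadata[urlpath] = {
            "created": time.time(),
            "original_path": urlpath,
            "cache_path": cache_path,
        }
        return cache_path

    def clear_cache(self, urlpath):
        # Forget one urlpath and delete its local copy, if any.
        entry = self.metadata.pop(urlpath, None)
        if entry and os.path.exists(entry["cache_path"]):
            os.remove(entry["cache_path"])


downloads = []


def fake_fetch(urlpath):
    # Stand-in for an HTTP download; records each call so misses can be counted.
    downloads.append(urlpath)
    return b"id,amount\n1,10\n"


demo = SketchCache(tempfile.mkdtemp(), fake_fetch)
url = "https://example.com/data/sales.csv"
first = demo.load(url)      # miss: triggers fake_fetch
second = demo.load(url)     # hit: returns the same local path
print(first == second, len(downloads))
```

Calling `demo.clear_cache(url)` and then `demo.load(url)` triggers the fetch again, which mirrors the clear-and-reread steps shown earlier in the notebook.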

Below are a few methods that Intake users should be familiar with.

In [ ]:
cache.get_metadata?
In [ ]:
cache.clear_cache?
In [ ]:
cache.clear_all?

Cache directory is configurable

The config and cache metadata are stored in ~/.intake. By default, the cache directory is located at ~/.intake/cache. It can be moved to a different location through the config file, an environment variable, or a call at runtime. Here it is set at runtime.

In [ ]:
from intake.config import conf
conf['cache_download_progress'] = True # <-- turn progress bars back on (default)

cache.clear_all()

import os.path

cat = intake.open_catalog('catalog.yml')
sales = cat.sales()
sales.set_cache_dir(os.path.join(os.getcwd(), 'test_cache_dir'))
cache = sales.cache[0]
sales.cache_dirs  # last expression in the cell, so the notebook displays it
In [ ]:
df = sales.read()
In [ ]:
cache.get_metadata(sales._urlpath)
In [ ]:
cache.clear_all()

The cache directory can also be set in the Intake config. This is equivalent to setting it in the INTAKE_CACHE_DIR environment variable.

In [ ]:
from intake.config import conf, defaults
import os.path

conf['cache_dir'] = defaults['cache_dir']
cat = intake.open_catalog('catalog.yml')
sales = cat.sales()
sales.cache_dirs
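
The notebook names three places where the cache directory can be set: at runtime, in the `INTAKE_CACHE_DIR` environment variable, and in the config. The sketch below shows one way such settings could be combined with a fallback to the default. The function `resolve_cache_dir` is hypothetical, and the precedence order (runtime first, then environment, then config, then default) is an assumption made for illustration; check Intake's documentation for the order it actually applies.

```python
import os

DEFAULT_CACHE_DIR = os.path.join(os.path.expanduser("~"), ".intake", "cache")


def resolve_cache_dir(runtime_dir=None, environ=None, config=None):
    """Pick a cache directory; the precedence order here is an assumption."""
    environ = os.environ if environ is None else environ
    config = {} if config is None else config
    if runtime_dir:                           # e.g. a path passed at runtime
        return runtime_dir
    if environ.get("INTAKE_CACHE_DIR"):       # environment variable
        return environ["INTAKE_CACHE_DIR"]
    if config.get("cache_dir"):               # value from the config
        return config["cache_dir"]
    return DEFAULT_CACHE_DIR                  # nothing set: use the default


# Explicit dictionaries are passed so the example does not depend on the real environment.
print(resolve_cache_dir("/tmp/a", {"INTAKE_CACHE_DIR": "/tmp/b"}, {"cache_dir": "/tmp/c"}))
print(resolve_cache_dir(None, {"INTAKE_CACHE_DIR": "/tmp/b"}, {"cache_dir": "/tmp/c"}))
print(resolve_cache_dir(None, {}, {}))
```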

Disable Caching

Caching can be disabled globally through intake.config.

In [ ]:
from intake.config import conf
conf['cache_disabled'] = True

cat = intake.open_catalog('catalog.yml')
sales = cat.sales()
cache = sales.cache[0]

Notice that the read times are now consistently longer, because every read downloads the data again.

In [ ]:
%time df = sales.read()
In [ ]:
%time df = sales.read()

Also, the cache directory and metadata are empty.

In [ ]:
sales.cache_dirs
In [ ]:
%ls -la ~/.intake/cache
In [ ]:
cache.get_metadata(sales._urlpath)
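
The effect of a global disable switch can be sketched in plain Python. This is a simplified model, not Intake's code: `make_reader`, `fake_fetch`, and the in-memory `store` are hypothetical stand-ins for the data source, the download, and the cache directory.

```python
def make_reader(fetch, cache_disabled):
    """Return (read, store): a read function plus the dict it uses as its cache."""
    store = {}

    def read(urlpath):
        if cache_disabled:
            return fetch(urlpath)           # bypass: fetch every time, store nothing
        if urlpath not in store:
            store[urlpath] = fetch(urlpath)  # first read fills the cache
        return store[urlpath]

    return read, store


calls = []


def fake_fetch(urlpath):
    # Stand-in for a download; records each call so fetches can be counted.
    calls.append(urlpath)
    return b"id,amount\n1,10\n"


url = "https://example.com/data/sales.csv"
read, store = make_reader(fake_fetch, cache_disabled=True)
read(url)
read(url)
print(len(calls), store)
```

With the flag set, each read reaches the fetch function and the store stays empty, which matches the repeated slow reads and the empty cache directory observed above.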