# Intake Caching

This notebook is a simple demonstration of how to use and manage caching with Intake to avoid repeated downloads of large data files.

In [ ]:
from intake.config import conf

import intake
cat = intake.open_catalog('catalog.yml')
list(cat)

In [ ]:
sales = cat.sales()
cache = sales.cache[0]
cache.clear_all() # <-- clearing cache to make sure we start from scratch.
sales._urlpath


Here the urlpath is a remote HTTP server. When the data source is read for the first time a download will be triggered.

In [ ]:
%time df = sales.read()


Now let's read the data again. Notice that the read is much faster this time, because the data is served from the local cache instead of being downloaded again.

In [ ]:
%time df = sales.read()
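
Outside IPython, where the `%time` magic is not available, the same first-read versus second-read comparison can be sketched with `time.perf_counter`. The `read_with_cache` function below is a hypothetical stand-in for a cached source, not part of Intake; it only simulates a slow first read followed by fast cached reads.

```python
import time

def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Hypothetical stand-in for a cached data source (not Intake code).
_store = {}
download_count = 0

def read_with_cache(key):
    """Simulate a slow download on the first read and reuse the result afterwards."""
    global download_count
    if key not in _store:
        download_count += 1
        time.sleep(0.05)  # pretend this is the remote download
        _store[key] = [1, 2, 3]
    return _store[key]

first, t_first = timed(read_with_cache, "sales")
second, t_second = timed(read_with_cache, "sales")
print(f"first read: {t_first:.3f}s, second read: {t_second:.3f}s")
```

With a real source, `timed(sales.read)` would play the role of `%time df = sales.read()`.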


See that we do indeed have the data.

In [ ]:
df.head()


Looking under the hood at the default cache directory, notice the files now exist locally in a hashed subdirectory.

In [ ]:
%ls -la ~/.intake/cache/975358c19433bc3c5eae68abbde7f2ca


These subdirectories are named by hashing the data source driver, urlpath, and cache regex, which avoids collisions among data sources and cache specifications. We can call the `_hash` method directly to find out the subdirectory name for a given urlpath.

In [ ]:
cache._hash(sales._urlpath)
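
To make the naming scheme concrete, here is a minimal sketch of how a directory name could be derived from the driver, urlpath, and regex. It is not Intake's actual `_hash` implementation: the helper name `cache_subdir_name`, the way the three values are combined, and the choice of MD5 are all assumptions (MD5 is picked only because the example directory name above is 32 hexadecimal characters long). The URL is a placeholder.

```python
from hashlib import md5

def cache_subdir_name(driver, urlpath, regex=""):
    """Hypothetical helper: derive a stable directory name from the three inputs."""
    key = str((driver, urlpath, regex))  # assumed way of combining the inputs
    return md5(key.encode("utf-8")).hexdigest()

a = cache_subdir_name("csv", "https://example.com/data/sales.csv", "example.com/data")
b = cache_subdir_name("csv", "https://example.com/data/sales.csv", "example.com/data")
c = cache_subdir_name("csv", "https://example.com/data/other.csv", "example.com/data")

print(a)
# The same inputs always map to the same name; a different urlpath maps elsewhere.
print(a == b, a == c)
```

Because the name depends only on its inputs, the same source definition resolves to the same subdirectory across sessions.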


Inspecting the metadata shows the created timestamp, original path, and cached path.

In [ ]:
cache.get_metadata(sales._urlpath)


The data source will provide the cache directory if you are not sure where it is located.

In [ ]:
sales.cache_dirs


The cache can be cleared for an individual source.

In [ ]:
cache.clear_cache(sales._urlpath)


After clearing the cache, the files are removed from the cache directory.

In [ ]:
%ls -la ~/.intake/cache
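
The `%ls` magic depends on IPython and a Unix-like shell. As a portable alternative, the directory contents can be checked with `pathlib`. The helper `list_cache_entries` below is a hypothetical convenience function, not an Intake API, and the demonstration runs against a temporary directory rather than the real cache.

```python
import tempfile
from pathlib import Path

def list_cache_entries(cache_root):
    """Return the sorted names of the entries directly under cache_root."""
    root = Path(cache_root).expanduser()
    if not root.is_dir():
        return []  # the directory may not exist yet or may have been removed
    return sorted(entry.name for entry in root.iterdir())

with tempfile.TemporaryDirectory() as tmp:
    # Create one hashed-style subdirectory to stand in for a cached source.
    (Path(tmp) / "975358c19433bc3c5eae68abbde7f2ca").mkdir()
    entries = list_cache_entries(tmp)
    missing = list_cache_entries(Path(tmp) / "does-not-exist")

print(entries)
print(missing)
```

Calling `list_cache_entries("~/.intake/cache")` would inspect the default location used in this notebook.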


In [ ]:
%time df = sales.read()

In [ ]:
%ls -la ~/.intake/cache/975358c19433bc3c5eae68abbde7f2ca


## Cache object

Let's take a quick look at the cache object. This object provides utilities for managing cached data files. When a request for data is made, this object checks to see if data for the urlpath specified in the source exists on local disk in the cache directory. If so, it returns a reference to the local file path rather than the remote path. If the file(s) do not exist, it will download them, update the metadata, and return a local reference.
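
The lookup behaviour described above can be sketched with a toy class. This is an illustration of the check-then-download pattern only, not Intake's cache class: `SketchCache`, its method names, and the metadata keys are hypothetical, and a local file copy stands in for the remote download.

```python
import os
import shutil
import tempfile
import time
from hashlib import md5

class SketchCache:
    """Toy file cache illustrating the pattern; NOT Intake's implementation."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.metadata = {}  # urlpath -> created time, original path, cached path

    def load(self, urlpath):
        """Return a local path for urlpath, copying the file in on a cache miss."""
        entry = self.metadata.get(urlpath)
        if entry and os.path.exists(entry["cache_path"]):
            return entry["cache_path"]  # cache hit: reuse the local copy
        subdir = os.path.join(self.cache_dir, md5(urlpath.encode()).hexdigest())
        os.makedirs(subdir, exist_ok=True)
        cache_path = os.path.join(subdir, os.path.basename(urlpath))
        shutil.copy(urlpath, cache_path)  # stands in for the download
        self.metadata[urlpath] = {
            "created": time.time(),
            "original_path": urlpath,
            "cache_path": cache_path,
        }
        return cache_path

    def clear_cache(self, urlpath):
        """Forget urlpath and delete its cached files."""
        entry = self.metadata.pop(urlpath, None)
        if entry:
            shutil.rmtree(os.path.dirname(entry["cache_path"]), ignore_errors=True)

# Usage: a file in a temporary directory plays the role of the remote source.
workdir = tempfile.mkdtemp()
remote = os.path.join(workdir, "sales.csv")
with open(remote, "w") as f:
    f.write("id,amount\n1,10\n")

cache_sketch = SketchCache(os.path.join(workdir, "cache"))
first_path = cache_sketch.load(remote)   # miss: copies the file into the cache
second_path = cache_sketch.load(remote)  # hit: returns the same local path
existed_before_clear = os.path.exists(first_path)
cache_sketch.clear_cache(remote)
exists_after_clear = os.path.exists(first_path)
print(first_path == second_path, existed_before_clear, exists_after_clear)
```

The second `load` call never touches the original file, which is why repeated reads of a cached source are fast.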

Below are a few methods that Intake users should be familiar with.

In [ ]:
cache.get_metadata?

In [ ]:
cache.clear_cache?

In [ ]:
cache.clear_all?


## Cache directory is configurable

The config and cache metadata are stored in ~/.intake. By default, the cache directory is located at ~/.intake/cache; however, it can be set to a different location through the config file, an environment variable, or at runtime. Here it is set at runtime.

In [ ]:
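import os.path

import intake

# Empty the existing cache before switching to a new directory.
cache.clear_all()

cat = intake.open_catalog('catalog.yml')
sales = cat.sales()
# Point this source's cache at a directory under the current working directory.
sales.set_cache_dir(os.path.join(os.getcwd(), 'test_cache_dir'))
cache = sales.cache[0]
sales.cache_dirs  # last expression in the cell, so the notebook displays it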

In [ ]:
df = sales.read()

In [ ]:
cache.get_metadata(sales._urlpath)

In [ ]:
cache.clear_all()


The cache directory can also be set in the Intake config. This is equivalent to setting it in the INTAKE_CACHE_DIR environment variable.

In [ ]:
from intake.config import conf, defaults

# Reset the configured cache directory to Intake's default location.
conf['cache_dir'] = defaults['cache_dir']
cat = intake.open_catalog('catalog.yml')
sales = cat.sales()
sales.cache_dirs
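
With three places to set the cache directory (runtime, environment variable, config), it helps to picture how a single effective value might be chosen. The resolver below is a hypothetical sketch and not Intake code: the function name `resolve_cache_dir` and, in particular, the precedence order (runtime argument, then `INTAKE_CACHE_DIR`, then the `cache_dir` config key, then the default) are assumptions made for illustration.

```python
import os

DEFAULT_CACHE_DIR = os.path.join("~", ".intake", "cache")

def resolve_cache_dir(runtime_dir=None, config=None, environ=None):
    """Hypothetical resolver; the precedence order used here is an assumption."""
    config = config or {}
    environ = os.environ if environ is None else environ
    chosen = (
        runtime_dir
        or environ.get("INTAKE_CACHE_DIR")
        or config.get("cache_dir")
        or DEFAULT_CACHE_DIR
    )
    # Expand "~" and return an absolute path so later comparisons are unambiguous.
    return os.path.abspath(os.path.expanduser(chosen))

# Passing an explicit environ keeps the demonstration independent of the real environment.
print(resolve_cache_dir(environ={}))
print(resolve_cache_dir(config={"cache_dir": "/tmp/intake_cache"}, environ={}))
print(resolve_cache_dir(runtime_dir="test_cache_dir", environ={}))
```

Checking `sales.cache_dirs`, as in the cells above, remains the reliable way to see which directory Intake actually uses.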


## Disable Caching

Caching can be disabled globally through the Intake config (`intake.config.conf`).

In [ ]:
from intake.config import conf
conf['cache_disabled'] = True

cat = intake.open_catalog('catalog.yml')
sales = cat.sales()
cache = sales.cache[0]


Notice that both reads are now consistently slow, because the data is downloaded again each time.

In [ ]:
%time df = sales.read()

In [ ]:
%time df = sales.read()


Also, the cache directory and metadata are empty.

In [ ]:
sales.cache_dirs

In [ ]:
%ls -la ~/.intake/cache

In [ ]:
cache.get_metadata(sales._urlpath)
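
Because `conf['cache_disabled'] = True` is a global setting, it stays in effect for the rest of the session unless it is reset. One way to limit its scope is a small context manager that restores the previous value on exit. The helper `temporary_setting` below is hypothetical and not part of Intake; it assumes only that the config object behaves like a dictionary, and the example uses a plain dict as a stand-in for `intake.config.conf`.

```python
from contextlib import contextmanager

@contextmanager
def temporary_setting(config, key, value):
    """Set config[key] = value inside the with-block, then restore the old state."""
    missing = object()  # sentinel distinguishing "key absent" from "key set to None"
    previous = config.get(key, missing)
    config[key] = value
    try:
        yield config
    finally:
        if previous is missing:
            config.pop(key, None)
        else:
            config[key] = previous

settings = {"cache_disabled": False}  # plain dict standing in for the Intake config
with temporary_setting(settings, "cache_disabled", True):
    inside = settings["cache_disabled"]  # caching would be off here
after = settings["cache_disabled"]       # the original value is back
print(inside, after)
```

The `try`/`finally` ensures the previous value is restored even if a read inside the block raises an exception.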
