Caching Demo

To run this demo, make sure tqdm is installed (conda install -c conda-forge tqdm, or pip install tqdm) so that the progress bar is displayed in the notebook.

In [ ]:
import intake
cat = intake.open_catalog('cache_demo.yml')
list(cat)

Each entry in the catalog has a cache associated with it. Accessing the catalog metadata does not download the file.

In [ ]:
stats = cat.demographic_stats()
stats.cache[0].clear_all()
stats._urlpath

The download occurs when the data source is read for the first time.

In [ ]:
df = stats.read()

Second read doesn't download

When the source is read again, the locally cached copy is used, so the read is much faster.

In [ ]:
df = stats.read()
df.head()
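
The download-once behaviour shown above can be illustrated without intake. The following is a minimal sketch that uses only the standard library; DownloadOnceCache, fake_fetch and the example URL are hypothetical names made up for this illustration, and intake's own cache implementation may differ in its details.

```python
import hashlib
import os
import tempfile


class DownloadOnceCache:
    """Toy cache: fetch a URL's bytes once, then serve them from disk."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch  # callable url -> bytes (hypothetical downloader)
        os.makedirs(cache_dir, exist_ok=True)

    def _local_path(self, url):
        # Hash the URL so that any URL maps to a safe, unique file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def read(self, url):
        path = self._local_path(url)
        if not os.path.exists(path):
            # First read: download and store a local copy.
            with open(path, "wb") as f:
                f.write(self.fetch(url))
        # Every read, including the first, is served from the local copy.
        with open(path, "rb") as f:
            return f.read()

    def clear(self, url):
        # Remove the local copy so that the next read downloads again.
        path = self._local_path(url)
        if os.path.exists(path):
            os.remove(path)


calls = []


def fake_fetch(url):
    # Stand-in for a real download; records each call.
    calls.append(url)
    return b"zip,count\n10001,44\n"


cache = DownloadOnceCache(tempfile.mkdtemp(), fake_fetch)
first = cache.read("https://example.com/data.csv")
second = cache.read("https://example.com/data.csv")
download_count = len(calls)  # only the first read triggered a download
```

Clearing the entry with cache.clear(url) makes the next read call the downloader again, which mirrors the clear-and-redownload step later in this notebook.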

We can inspect the cache from the command line or using the Python API.

In [ ]:
!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv
In [ ]:
stats.cache[0].get_metadata(stats._urlpath)
In [ ]:
stats.cache_dirs

We can also use the os module to inspect the cache directory more directly.

In [ ]:
import os  # needed here; os is not imported earlier in the notebook
os.listdir(os.path.join(os.path.expanduser('~'), '.intake', 'cache'))
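
os.listdir only shows the top level of the cache directory. To see the files inside each subdirectory as well, a recursive listing can help. This is a small sketch using only the standard library; list_cache_files is a hypothetical helper, not part of intake, and the demo directory layout is invented for illustration.

```python
import os
import tempfile


def list_cache_files(cache_dir):
    """Return sorted (relative_path, size_in_bytes) pairs for all files under cache_dir."""
    entries = []
    # os.walk yields nothing for a missing directory, so the result is then [].
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            full = os.path.join(root, name)
            entries.append((os.path.relpath(full, cache_dir), os.path.getsize(full)))
    return sorted(entries)


# Demo on a throwaway directory that imitates a cache with one subdirectory.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "abc123"))
with open(os.path.join(demo_dir, "abc123", "data.csv"), "wb") as f:
    f.write(b"a,b\n1,2\n")

listing = list_cache_files(demo_dir)
```

Pointing the helper at os.path.join(os.path.expanduser('~'), '.intake', 'cache') would list the files in the default cache location used in this notebook.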

Now let's clear the cache and then redownload the file.

In [ ]:
stats.cache[0].clear_cache(stats._urlpath)
stats.cache[0].get_metadata(stats._urlpath)

Or equivalently, from the command line:

In [ ]:
!intake cache clear
In [ ]:
!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv
In [ ]:
df = stats.read()
In [ ]:
!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv

Cache directory is configurable

In [ ]:
stats.cache[0].clear_cache(stats._urlpath)

import os.path

cat = intake.open_catalog('cache_demo.yml')
stats = cat.demographic_stats()
stats.set_cache_dir(os.path.join(os.getcwd(), 'test_cache_dir'))
stats.cache_dirs
In [ ]:
df = stats.read()

List the files in the default intake cache directory to confirm that nothing is in there. Then inspect the directory defined above to see that it contains a subdirectory with a unique ID. Alternatively, use the CLI to access the cache info.

In [ ]:
os.listdir(os.path.join(os.path.expanduser('~'), '.intake', 'cache'))
In [ ]:
os.listdir('./test_cache_dir')
In [ ]:
!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv
In [ ]:
stats.cache[0].get_metadata(stats._urlpath)
In [ ]:
stats.cache[0].clear_all()

Disable Caching

Caching can be globally disabled from the config, either through the Python API or by editing the config file directly.

In [ ]:
from intake.config import conf
conf['cache_disabled'] = True

cat = intake.open_catalog('cache_demo.yml')
stats = cat.demographic_stats()
df = stats.read()
df.head()
In [ ]:
!intake config info
In [ ]:
stats.cache_dirs
In [ ]:
os.listdir(os.path.join(os.path.expanduser('~'), '.intake', 'cache'))
In [ ]:
stats.cache[0].get_metadata(stats._urlpath)
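
Setting conf['cache_disabled'] = True as above changes global state for the rest of the session. One way to limit the change to a block of code is a small context manager that restores the previous value on exit. This is a sketch, not an intake feature: temporarily_set is a hypothetical helper, and it assumes the config object behaves like a dict (supports get, item assignment and pop). A plain dict stands in for intake.config.conf in the demo.

```python
from contextlib import contextmanager

_MISSING = object()  # sentinel for "key was not set before"


@contextmanager
def temporarily_set(conf, key, value):
    """Set conf[key] = value inside the with-block, then restore the previous state."""
    previous = conf.get(key, _MISSING)
    conf[key] = value
    try:
        yield conf
    finally:
        if previous is _MISSING:
            # The key did not exist before, so remove it again.
            conf.pop(key, None)
        else:
            conf[key] = previous


# Demo with a plain dict standing in for the real config object.
demo_conf = {"cache_disabled": False}
with temporarily_set(demo_conf, "cache_disabled", True):
    inside = demo_conf["cache_disabled"]  # caching is disabled only in here
after = demo_conf["cache_disabled"]  # previous value is restored
```

Note that a data source may capture its cache settings when it is created, so any catalog or source that should respect the temporary setting would need to be opened inside the with-block.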