Caching Demo¶

To run this demo, make sure tqdm is installed (conda install -c conda-forge tqdm, or pip install tqdm) so that the download progress bar is displayed in the notebook.

In [ ]:

import intake
cat = intake.open_catalog('cache_demo.yml')
list(cat)

Each entry in the catalog has a cache associated with it. When accessing the catalog metadata, the file does not get downloaded.

In [ ]:

stats = cat.demographic_stats()
stats.cache[0].clear_all()
stats._urlpath

The download occurs when the data source is read for the first time.

In [ ]:

df = stats.read()

Second read doesn't download¶

When the source is read again, the locally cached copy is used instead of downloading the file, so the read is much faster.

In [ ]:

df = stats.read()
df.head()
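The read-through behaviour described above (download on the first read, local copy afterwards) can be sketched with the standard library alone. This is a minimal illustration and not intake's actual implementation: the `ReadThroughCache` class, its hashing scheme, the `fetch` callable, and the example URL are all hypothetical names introduced here.

```python
import hashlib
import os
import tempfile


class ReadThroughCache:
    """Toy file cache: fetch on first access, reuse the local copy afterwards.

    Hypothetical sketch; intake's real cache classes differ in detail.
    """

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch  # callable: url -> bytes (stands in for a download)

    def _local_path(self, url):
        # One subdirectory per URL, named by a hash of the URL, so that
        # different sources with the same file name cannot collide.
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, key, os.path.basename(url))

    def load(self, url):
        """Return a local path for `url`, fetching only if not yet cached."""
        path = self._local_path(url)
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(self.fetch(url))
        return path

    def clear(self, url):
        """Forget the cached copy of `url`, if any."""
        path = self._local_path(url)
        if os.path.exists(path):
            os.remove(path)


def demo():
    calls = []

    def fake_fetch(url):
        calls.append(url)  # record each simulated download
        return b"zip,count\n10001,44\n"

    with tempfile.TemporaryDirectory() as tmp:
        cache = ReadThroughCache(tmp, fake_fetch)
        url = "https://example.com/data.csv"  # placeholder URL, never contacted
        first = cache.load(url)   # triggers the simulated download
        second = cache.load(url)  # served from the local copy
        cache.clear(url)
        third = cache.load(url)   # fetched again after clearing
        return first == second == third, len(calls)
```

Calling `demo()` shows the pattern the notebook relies on: the second `load` does not call `fetch`, while a `load` after `clear` does.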

We can inspect the cache from the command line or through the Python API.

In [ ]:

!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv

In [ ]:

stats.cache[0].get_metadata(stats._urlpath)

In [ ]:

stats.cache_dirs

We can also use the os module to inspect the cache directory directly.

In [ ]:

import os
os.listdir(os.path.join(os.path.expanduser('~'), '.intake', 'cache'))
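Because `os.listdir` only shows the top level of the cache directory, a recursive listing can be more informative. The helper below is a small sketch using `os.walk`; the name `summarize_cache` and the `abc123` subdirectory in the demo are hypothetical and only stand in for whatever layout the cache actually uses.

```python
import os
import tempfile


def summarize_cache(cache_dir):
    """Return {relative file path: size in bytes} for every file under cache_dir."""
    summary = {}
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            full = os.path.join(root, name)
            summary[os.path.relpath(full, cache_dir)] = os.path.getsize(full)
    return summary


def demo():
    # Build a throwaway directory that mimics a cache holding one entry.
    with tempfile.TemporaryDirectory() as tmp:
        entry = os.path.join(tmp, "abc123")  # hypothetical unique-id subdirectory
        os.makedirs(entry)
        with open(os.path.join(entry, "data.csv"), "wb") as f:
            f.write(b"12345")
        return summarize_cache(tmp)
```

In the notebook, `summarize_cache(os.path.join(os.path.expanduser('~'), '.intake', 'cache'))` would apply the same listing to the directory inspected above. A directory that does not exist yields an empty dictionary, since `os.walk` produces no entries for it.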

Now let's clear the cache and then redownload the file.

In [ ]:

stats.cache[0].clear_cache(stats._urlpath)
stats.cache[0].get_metadata(stats._urlpath)

Or equivalently, from the command line:

In [ ]:

!intake cache clear

In [ ]:

!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv

In [ ]:

df = stats.read()

In [ ]:

!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv

Cache directory is configurable¶

In [ ]:

stats.cache[0].clear_cache(stats._urlpath)  # clear default cache

import os.path

cat = intake.open_catalog('cache_demo.yml')
stats = cat.demographic_stats()
stats.set_cache_dir(os.path.join(os.getcwd(), 'test_cache_dir'))
stats.cache_dirs

In [ ]:

df = stats.read()

List the files in the default intake cache directory to confirm that it is empty. Then inspect the directory set above to see that it contains a subdirectory named with a unique ID. Alternatively, use the CLI to access the cache information.

In [ ]:

os.listdir(os.path.join(os.path.expanduser('~'), '.intake', 'cache'))

In [ ]:

os.listdir('./test_cache_dir')

In [ ]:

!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv

In [ ]:

stats.cache[0].get_metadata(stats._urlpath)

In [ ]:

stats.cache[0].clear_all()

Disable Caching¶

Caching can be disabled globally, either by setting the config value through the Python API or by editing the config file directly.

In [ ]:

from intake.config import conf
conf['cache_disabled'] = True

cat = intake.open_catalog('cache_demo.yml')
stats = cat.demographic_stats()
df = stats.read()
df.head()
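Setting `conf['cache_disabled']` changes global state for the rest of the session. If the setting should apply only to a few reads, one option is a small context manager that restores the previous value afterwards. The sketch below is generic Python rather than part of intake: `temporarily_set` is a hypothetical helper, and it assumes only that the config object supports dict-style `get`, item assignment, and `pop`, as the item assignment above suggests.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_set(mapping, key, value):
    """Set mapping[key] = value inside the `with` block, then restore it."""
    missing = object()  # sentinel: distinguishes "absent" from "set to None"
    previous = mapping.get(key, missing)
    mapping[key] = value
    try:
        yield mapping
    finally:
        if previous is missing:
            mapping.pop(key, None)  # key did not exist before: remove it again
        else:
            mapping[key] = previous


# Stand-in for a config mapping; a plain dict keeps the sketch runnable anywhere.
demo_conf = {"cache_disabled": False}
with temporarily_set(demo_conf, "cache_disabled", True):
    inside = demo_conf["cache_disabled"]  # value seen while the block is active
after = demo_conf["cache_disabled"]       # value after the block has exited
```

Under the stated assumption, the same pattern would read `with temporarily_set(conf, 'cache_disabled', True): df = stats.read()`, leaving the original setting in place once the block exits, even if the read raises an exception.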

In [ ]:

!intake config info

In [ ]:

stats.cache_dirs

In [ ]:

os.listdir(os.path.join(os.path.expanduser('~'), '.intake', 'cache'))

In [ ]:

stats.cache[0].get_metadata(stats._urlpath)
