To run this demo, make sure you have installed tqdm, using either
conda install -c conda-forge tqdm
or
pip install tqdm
so that you will see the download progress bar in the notebook.
import intake
cat = intake.open_catalog('cache_demo.yml')
list(cat)
Each entry in the catalog has a cache associated with it. Accessing the catalog metadata does not download the file.
stats = cat.demographic_stats()
stats.cache[0].clear_all()
stats._urlpath
The download occurs when the data source is read for the first time.
df = stats.read()
When the source is read again, the locally cached copy is used, so the read is much faster.
df = stats.read()
df.head()
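The download-once behaviour described above can be illustrated with a minimal, standard-library sketch. This is not intake's implementation: `fetch_remote`, `cached_read`, and `download_count` are hypothetical names, the URL is a placeholder, and the "download" is simulated so the snippet runs offline.

```python
import os
import tempfile

download_count = 0  # how many times the simulated download ran


def fetch_remote(url):
    """Hypothetical stand-in for a real download; returns fake CSV bytes."""
    global download_count
    download_count += 1
    return b"zipcode,count\n10001,44\n"


def cached_read(url, cache_dir):
    """Return the data for url, downloading only if no local copy exists."""
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        # First read: fetch the data and store a local copy.
        with open(local_path, "wb") as f:
            f.write(fetch_remote(url))
    # Every read is served from the local copy.
    with open(local_path, "rb") as f:
        return f.read()


cache_dir = tempfile.mkdtemp()
url = "https://example.com/data/stats.csv"  # placeholder URL
first = cached_read(url, cache_dir)   # triggers the simulated download
second = cached_read(url, cache_dir)  # served from the local file
```

Both calls return the same bytes, but only the first one reaches `fetch_remote`, which is why the second read is faster.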
We can inspect the cache from the command line or using the Python API.
!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv
stats.cache[0].get_metadata(stats._urlpath)
stats.cache_dirs
We can also use the os module to inspect the cache dir more directly.
import os
os.listdir(os.path.join(os.path.expanduser('~'), '.intake', 'cache'))
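Since os.listdir only shows the top level of the cache dir, a small helper that walks the whole tree can be handy. The following is a sketch using only the standard library; `list_cache_files` is a hypothetical helper, not part of intake, and it is demonstrated on a throwaway directory rather than the real cache.

```python
import os
import tempfile


def list_cache_files(cache_dir):
    """Return sorted (relative_path, size_in_bytes) pairs for files under cache_dir."""
    entries = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            full_path = os.path.join(root, name)
            rel_path = os.path.relpath(full_path, cache_dir)
            entries.append((rel_path, os.path.getsize(full_path)))
    return sorted(entries)


# Build a small demo tree: one subdirectory holding one 8-byte file.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "abc123"))
with open(os.path.join(demo_dir, "abc123", "data.csv"), "wb") as f:
    f.write(b"a,b\n1,2\n")

listing = list_cache_files(demo_dir)
```

To inspect the real cache, the same helper could be pointed at os.path.join(os.path.expanduser('~'), '.intake', 'cache'), the path used above.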
Now let's clear the cache and then redownload the file.
stats.cache[0].clear_cache(stats._urlpath)
stats.cache[0].get_metadata(stats._urlpath)
Or equivalently:
!intake cache clear
!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv
df = stats.read()
!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv
stats.cache[0].clear_cache(stats._urlpath) # clear default cache
import os.path
cat = intake.open_catalog('cache_demo.yml')
stats = cat.demographic_stats()
stats.set_cache_dir(os.path.join(os.getcwd(), 'test_cache_dir'))
stats.cache_dirs
df = stats.read()
List the files in the default intake cache dir to see that nothing is in there. Then inspect the dir defined above to see that it contains a dir with a unique id. Alternatively, use the CLI to access the cache info.
os.listdir(os.path.join(os.path.expanduser('~'), '.intake', 'cache'))
os.listdir('./test_cache_dir')
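One common way to obtain a stable, unique directory name per source is to hash the URL. The sketch below illustrates that idea only; the naming scheme intake actually uses may differ, and `cache_subdir` is a hypothetical helper.

```python
import hashlib
import os


def cache_subdir(cache_root, urlpath):
    """Hypothetical scheme: one subdirectory per URL, named by its SHA-256 hex digest."""
    unique_id = hashlib.sha256(urlpath.encode("utf-8")).hexdigest()
    return os.path.join(cache_root, unique_id)


url = "https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv"
path_a = cache_subdir("test_cache_dir", url)
path_b = cache_subdir("test_cache_dir", url)          # same URL, same directory
other = cache_subdir("test_cache_dir", url + "?v=2")  # different URL, different directory
```

Because the id depends only on the URL, repeated reads of the same source resolve to the same directory, while different sources never collide in practice.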
!intake cache list-files https://s3.amazonaws.com/earth-data/Demographic_Statistics_By_Zip_Code.csv
stats.cache[0].get_metadata(stats._urlpath)
stats.cache[0].clear_all()
Caching can be globally disabled from the config, either through the Python API or by editing the config file directly.
from intake.config import conf
conf['cache_disabled'] = True
cat = intake.open_catalog('cache_demo.yml')
stats = cat.demographic_stats()
df = stats.read()
df.head()
!intake config info
stats.cache_dirs
os.listdir(os.path.join(os.path.expanduser('~'), '.intake', 'cache'))
stats.cache[0].get_metadata(stats._urlpath)
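Because the setting above is global, it stays in effect for the rest of the session. One way to limit its scope is a context manager that restores the previous value afterwards. This is a sketch under assumptions: `temporary_setting` is a hypothetical helper, and a plain dict (`demo_conf`) stands in for intake.config.conf, which is assumed here to behave like a dict.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(config, key, value):
    """Set config[key] = value inside the with-block, then restore the old state."""
    missing = object()  # sentinel meaning "the key was absent beforehand"
    previous = config.get(key, missing)
    config[key] = value
    try:
        yield config
    finally:
        if previous is missing:
            del config[key]
        else:
            config[key] = previous


# A plain dict stands in for the real config object in this sketch.
demo_conf = {"cache_disabled": False}
with temporary_setting(demo_conf, "cache_disabled", True):
    inside = demo_conf["cache_disabled"]  # True while inside the block
after = demo_conf["cache_disabled"]       # restored to False on exit
```

The try/finally ensures the original value is restored even if reading the source raises an exception inside the block.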