This notebook demonstrates a new feature in Intake: parsing the information stored in the filenames of a dataset and making it available alongside the data. It is written from the point of view of the catalog author or data engineer, who can create a catalog entry for a data source that lets users get all the data they need in one or two lines. Link to the blog post.
import intake
First we'll open the catalog and take a look at the data sources defined within:
cat = intake.Catalog('catalog.yml')
list(cat)
For this example we will be using southern_rockies. We can learn more about the data source by reading the description.
southern_rockies = cat.southern_rockies()
print(southern_rockies.description)
We can also inspect the pattern property of the data source:
southern_rockies.pattern
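To make the idea of a filename pattern concrete, here is a minimal sketch of how a pattern with {field} placeholders can be matched against filenames. It uses only the standard library and is not Intake's own implementation; the helper names pattern_to_regex and parse_filename are hypothetical. The pattern and filenames are the ones passed to intake.open_csv later in this notebook.

```python
import re


def pattern_to_regex(pattern):
    """Compile a '{field}'-style filename pattern into a regex with named groups."""
    # re.split with a capturing group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))      # literal text, matched exactly
        else:
            pieces.append(f"(?P<{part}>.+?)")   # field: shortest run that fits
    return re.compile("".join(pieces))


def parse_filename(pattern, filename):
    """Return a dict of field values, or None if the name does not fit the pattern."""
    match = pattern_to_regex(pattern).fullmatch(filename)
    return match.groupdict() if match else None


pattern = "SRLCC_{emissions}_Precip_{model}.csv"
parsed = [
    parse_filename(pattern, name)
    for name in (
        "SRLCC_b1_Precip_ECHAM5-MPI.csv",
        "SRLCC_b1_Precip_MIROC3.2(medres).csv",
    )
]
print(parsed)
```

Each filename yields one value per field, which is the information that ends up as extra columns when the files are read.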
We will use the read method to load the data into memory, as a pandas.DataFrame, in one shot.
df = southern_rockies.read()
df.sample(5)
The values for emissions and model are parsed from the filenames and added to the data. By inspecting more closely we can see that these new columns have categorical datatypes:
df.dtypes
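This is a compact representation: each distinct value is stored once, and every row holds only a small integer code that points to it, so the repeated labels take up very little memory. Categorical columns also tend to be faster for selection and groupby operations.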
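The memory difference is easy to check on a small synthetic column. The sketch below does not use the notebook's data; the labels are hypothetical stand-ins for values parsed from filenames, and the exact byte counts will depend on your pandas version.

```python
import pandas as pd

# Hypothetical labels, repeated many times as they would be across rows.
labels = ["a1b", "a2", "b1"] * 10_000

as_object = pd.Series(labels, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
print(list(as_category.cat.categories))  # each distinct label is stored once
print(as_category.cat.codes.dtype)       # rows hold small integer codes
```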
Intake provides a plotting API which uses hvplot to declare plots in the catalog itself. Hvplot lets you easily create interactive plots that are actually holoviews objects, which makes it a powerful tool for rapid data visualization. This API can be used to set default values for a particular data source and to declare specific plots.
import hvplot.intake
intake.output_notebook()
In this case the catalog specifies some defaults to make it easy to produce plots quickly. You'll find these lines in the catalog file:
metadata:
  plot:
    x: time
    y: precip
southern_rockies.plot(groupby='emissions', by='model')
In addition to declaring defaults, the catalog author can specify complete plots:
metadata:
  plots:
    model_emissions_grid:
      col: model
      row: emissions
      width: 300
      height: 200
southern_rockies.plot.model_emissions_grid()
Note that these plots are one-liners. Providing default plots for users is a great way to give them a quick sense of the data, and the interactivity makes it straightforward to zoom in on an area of interest. We can also use the pandas.DataFrame directly to do computations and make more visualizations.
import hvplot.pandas
unit = southern_rockies.metadata['fields']['precip']['unit']
thresh = 3
years = 20
label = f'Months per {years} years with precip ({unit}) greater than {thresh}'
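# Keep only the months above the threshold, then count them per emissions
# scenario in each multi-year window (count, not sum, to match the label).
(df[df['precip'] > thresh]
    .groupby('emissions').resample(f'{years}y', on='time').count()
    .rename(columns={'precip': 'count'})
    .hvplot.bar(by='emissions', x='time', y='count')
    .relabel(label))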
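The filter-then-count step behind a label like "months with precip greater than a threshold" can be seen on a tiny hand-made table. This is a sketch on synthetic values, not the notebook's data; the names toy and toy_thresh are hypothetical, and the time-window resampling is left out to keep the example short.

```python
import pandas as pd

# Synthetic monthly records for two emissions scenarios.
toy = pd.DataFrame({
    "time": pd.to_datetime(
        ["2000-01-01", "2000-02-01", "2001-01-01", "2000-01-01", "2001-06-01"]
    ),
    "precip": [1.0, 4.0, 5.0, 3.5, 2.0],
    "emissions": ["a2", "a2", "a2", "b1", "b1"],
})
toy_thresh = 3

# Boolean indexing keeps only the rows (months) above the threshold ...
wet_months = toy[toy["precip"] > toy_thresh]

# ... and counting per group gives the number of such months per scenario.
counts = wet_months.groupby("emissions")["precip"].count()
print(counts.to_dict())
```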
When you are starting to build a catalog, it is sometimes helpful to use intake.open_csv to figure out the best way to load your data.
paths = ['./data/SRLCC_b1_Precip_ECHAM5-MPI.csv', './data/SRLCC_b1_Precip_MIROC3.2(medres).csv']
southern_rockies_list = intake.open_csv(
    urlpath=paths,
    path_as_pattern='SRLCC_{emissions}_Precip_{model}.csv',
    csv_kwargs=dict(
        skiprows=3,
        names=['time', 'precip'],
        parse_dates=['time']))
df = southern_rockies_list.read()
df.sample(5)
In this case, since we are loading the data inline rather than through the catalog, we need to declare every feature of our plot inline.
southern_rockies_list.hvplot(x='time', y='precip', col='model', row='emissions', width=300, height=200)
Once you are ready to save your catalog, use the .yaml method to generate an approximate version of the catalog entry.
print(southern_rockies_list.yaml())
Once you have a catalog, you have a single source of truth for the data source: there is no need for copy/paste, and the end user can get on with their work. You can find an example from the data user's perspective in the landsat example.