Pattern as Path - CSV

Here we demonstrate new functionality in Intake that can parse and make use of information stored in the filenames of a dataset. This notebook presents the functionality from the point of view of the catalog author/data engineer: you can create a catalog entry for a data source that lets users get all the data they need in one or two lines. Link to the blog post.

In [ ]:
import intake

First we'll open the catalog and take a look at the data sources defined within:

In [ ]:
cat = intake.Catalog('catalog.yml')
list(cat)

For this example we will be using southern_rockies. We can learn more about the data source by reading the description.

In [ ]:
southern_rockies = cat.southern_rockies()
print(southern_rockies.description)

We can also inspect the pattern property of the data source:

In [ ]:
southern_rockies.pattern
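Under the hood, Intake turns the pattern into a parser that extracts field values from each path. As a rough illustration (this is not Intake's actual implementation, which is more general), a pattern like the one above can be translated into a regular expression with named capture groups:

```python
import re

# Simplified sketch of path-as-pattern parsing: turn each
# '{field}' placeholder into a named regex group, escaping the
# literal '.' in the extension.
pattern = 'SRLCC_{emissions}_Precip_{model}.csv'
regex = re.compile(
    '^' + re.sub(r'\{(\w+)\}', r'(?P<\1>.+?)', pattern.replace('.', r'\.')) + '$'
)

print(regex.match('SRLCC_b1_Precip_ECHAM5-MPI.csv').groupdict())
# {'emissions': 'b1', 'model': 'ECHAM5-MPI'}
```

Every filename that matches the pattern yields a dict of field values, which Intake attaches to the rows read from that file.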

We will use the read method to load the data into memory, as a pandas.DataFrame, in one shot.

In [ ]:
df = southern_rockies.read()
df.sample(5)

The values for emissions and model are parsed from the filenames and added to the data. By inspecting more closely we can see that these new columns have categorical datatypes:

In [ ]:
df.dtypes

This is a highly memory-efficient representation of the data. It also speeds up selection and groupby operations.
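To see why categoricals are compact, compare a column of repeated strings with its categorical equivalent. The data below is made up for illustration; the object column stores one Python string per row, while the categorical column stores small integer codes plus a single copy of each category:

```python
import pandas as pd

# Illustrative column of repeated model names.
models = pd.Series(['ECHAM5-MPI', 'MIROC3.2(medres)'] * 10_000)
as_category = models.astype('category')

print(models.memory_usage(deep=True))       # object dtype: large
print(as_category.memory_usage(deep=True))  # category dtype: much smaller
```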

Visualization

Intake provides a plotting API that uses hvPlot to declare plots in the catalog itself. hvPlot makes it easy to create interactive plots that are actually HoloViews objects, making it a powerful tool for rapid data visualization. This API can be used to set default values for a particular data source and to declare specific plots.

In [ ]:
import hvplot.intake

intake.output_notebook()

In this case we specify some defaults to make it easy to produce plots quickly. You'll find these lines in the catalog file:

metadata:
  plot:
    x: time
    y: precip
In [ ]:
southern_rockies.plot(groupby='emissions', by='model')

In addition to declaring defaults, the catalog author can specify complete plots:

metadata:
  plots:
    model_emissions_grid:
      col: model
      row: emissions
      width: 300
      height: 200
In [ ]:
southern_rockies.plot.model_emissions_grid()

Note that these plots are one-liners. Providing default plots for users is a great way to give them a quick sense of the data, and the interactivity makes it straightforward to zoom into the area of interest and derive real meaning. We can use the pandas.DataFrame directly to do computations and make more visualizations.

In [ ]:
import hvplot.pandas
In [ ]:
unit = southern_rockies.metadata['fields']['precip']['unit']
In [ ]:
thresh = 3
years = 20
label = f'Months per {years} years with precip ({unit}) greater than {thresh}'

(df[df['precip'] > thresh]
    .groupby('emissions').resample(f'{years}Y', on='time').count()
    .rename(columns={'precip': 'count'})
    .hvplot.bar(by='emissions', x='time')
    .relabel(label))
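The groupby/resample pattern used above can be seen in isolation on synthetic data. Here we fabricate four years of monthly precipitation for two emissions scenarios (the column names mirror the real dataset, but the values are made up), then count the months per two-year bin that exceed the threshold:

```python
import pandas as pd

# Made-up data: 48 monthly values per scenario, where every other
# month exceeds the threshold of 3.
idx = pd.date_range('2000-01-01', periods=48, freq='MS')
demo = pd.DataFrame({
    'time': list(idx) * 2,
    'emissions': ['a1b'] * 48 + ['b1'] * 48,
    'precip': [1.0, 4.0] * 48,
})

# Filter, then count qualifying months in two-year bins per scenario.
counts = (demo[demo['precip'] > 3]
          .groupby('emissions')
          .resample('2YS', on='time')
          .size())
print(counts)
```

Each scenario has 24 qualifying months spread over two bins, so every bin count comes out to 12.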

Using a list of paths

When you are starting to build a catalog it is sometimes helpful to use intake.open_csv to figure out the best way to load your data.

In [ ]:
paths = ['./data/SRLCC_b1_Precip_ECHAM5-MPI.csv', './data/SRLCC_b1_Precip_MIROC3.2(medres).csv']

southern_rockies_list = intake.open_csv(urlpath=paths,
                path_as_pattern='SRLCC_{emissions}_Precip_{model}.csv',
                csv_kwargs=dict(
                    skiprows=3,
                    names=['time', 'precip'],
                    parse_dates=['time']))
In [ ]:
df = southern_rockies_list.read()
df.sample(5)

In this case, since we are using inline loading rather than the catalog, we need to declare every feature of our plot inline.

In [ ]:
southern_rockies_list.hvplot(x='time', y='precip', col='model', row='emissions', width=300, height=200)

Once you are ready to save your catalog, use the .yaml method to generate an approximate version.

In [ ]:
print(southern_rockies_list.yaml())
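The generated YAML can then be written to a file to become a catalog of its own. The YAML string below is a hand-written stand-in for the real output of the .yaml call (the actual output also records the csv arguments shown earlier), and the temporary path is just for illustration:

```python
import pathlib
import tempfile

# Stand-in for the generated source definition.
yaml_text = """\
sources:
  southern_rockies:
    driver: csv
    args:
      urlpath: './data/SRLCC_{emissions}_Precip_{model}.csv'
"""

# Write it out as a catalog file that intake.open_catalog could load.
catalog_path = pathlib.Path(tempfile.mkdtemp()) / 'catalog.yml'
catalog_path.write_text(yaml_text)
print(catalog_path.read_text().splitlines()[0])  # sources:
```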

Once you have a catalog, you have a single source of truth for the data source, there is no need for copy/paste, and the end user can get on with their work. You can find an example from the data user's perspective in the landsat example.