Histogrammar exercises¶

Histogrammar is a Python package for making histograms from NumPy arrays and from pandas and Spark dataframes.

(There is also a Scala backend for Histogrammar, which is used by Spark.)

You can do the exercises below after the basic tutorial.

Enjoy!

In [ ]:

%%capture
# install histogrammar (if not installed yet)
import sys

!"{sys.executable}" -m pip install histogrammar

In [ ]:

import histogrammar as hg

In [ ]:

import pandas as pd
import numpy as np
import matplotlib

Dataset¶

Let's first load some data!

In [ ]:

# open a pandas dataframe for use below
from histogrammar import resources
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])

In [ ]:

df.head(2)

Comparing histogram types¶

Histogrammar treats histograms as objects. You will see this has various advantages.

Let's fill a simple histogram with a numpy array.

In [ ]:

# this creates a histogram with 10 equal-width bins between 0 and 100
hist1 = hg.Bin(num=10, low=0, high=100)

In [ ]:

hist1.fill.numpy(df['age'].values)

In [ ]:

hist1.plot.matplotlib();

In [ ]:

hist2 = hg.SparselyBin(binWidth=10, origin=0)

In [ ]:

hist2.fill.numpy(df['age'].values)

In [ ]:

hist2.plot.matplotlib();

Q: Have a look at the .values and .bins attributes of hist1 and hist2. What types are these? (hist1.values is a ...?) Does that make sense?

In [ ]:

hist1

In [ ]:

hist2

Q: In each bin, what type of object is keeping track of the bin count?

Try filling hist1 with small (negative) values, very large values (> 100), or NaNs. Find out whether, and how, hist1 keeps track of these.
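As a hint for this exercise, the bookkeeping of a fixed-range histogram can be pictured in a few lines of plain Python. This is only a conceptual sketch, not Histogrammar's implementation: the helper `fill_dense` and the counter names are made up for illustration, and it assumes the lower edge is inclusive and the upper edge exclusive. Compare its behaviour with what you actually find on hist1.

```python
import math


def fill_dense(values, num=10, low=0.0, high=100.0):
    """Count values into `num` equal-width bins between low and high.

    Hypothetical helper for illustration only. Out-of-range values and
    NaNs are tallied in separate counters instead of being dropped.
    """
    width = (high - low) / num
    counts = [0] * num  # one slot per bin, even if it stays empty
    flows = {"underflow": 0, "overflow": 0, "nanflow": 0}
    for x in values:
        if math.isnan(x):
            flows["nanflow"] += 1
        elif x < low:
            flows["underflow"] += 1
        elif x >= high:
            flows["overflow"] += 1
        else:
            # clamp guards against floating-point rounding at the top edge
            index = min(int((x - low) // width), num - 1)
            counts[index] += 1
    return counts, flows


counts, flows = fill_dense([5, 15, 15, 99.9, -3, 250, float("nan")])
print(counts)
print(flows)
```

The point of the sketch is that nothing is silently lost: every value lands either in a bin or in one of the three side counters.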

Now fill hist2 with small (negative) values, very large values (> 100), or NaNs. How does hist2 keep track of these?
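For contrast, here is a plain-Python picture of a sparse histogram, where bins are created only when a value falls into them. Again this is a sketch under assumptions rather than Histogrammar's code: `fill_sparse` is a hypothetical helper, and the bin index is simply the floor of `(x - origin) / bin_width`.

```python
import math


def fill_sparse(values, bin_width=10.0, origin=0.0):
    """Count values into equal-width bins that are created on demand.

    Hypothetical helper for illustration only.
    """
    bins = {}  # bin index -> count; only non-empty bins are stored
    nanflow = 0
    for x in values:
        if math.isnan(x):
            nanflow += 1  # NaN has no bin index, so count it separately
            continue
        index = math.floor((x - origin) / bin_width)
        bins[index] = bins.get(index, 0) + 1
    return bins, nanflow


bins, nanflow = fill_sparse([5, 15, 15, -3, 250, float("nan")])
for index in sorted(bins):
    low_edge = index * 10.0
    print(f"bin {index}: from {low_edge} to {low_edge + 10.0}, count {bins[index]}")
print("nanflow:", nanflow)
```

Because there is no fixed range, a sparse histogram in this picture needs no underflow or overflow counter: negative and very large values simply get their own bin index.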

Categorical variables¶

For categorical variables, such as strings and booleans, use the Categorize histogram.

In [ ]:

histx = hg.Categorize('eyeColor')

In [ ]:

histx.fill.numpy(df)

Q: What is a Categorize histogram fundamentally: a dictionary or a list?

Q: What else can it keep track of, e.g. numbers, booleans, NaNs? Give it a try: fill it with more entries!
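If it helps to have a mental model while experimenting, counting by category can be sketched with a plain `collections.Counter`. This is an illustration only, not how Histogrammar stores its bins; in particular, folding all NaNs into a single "NaN" label is an assumption of this sketch, and `fill_categories` is a made-up helper name.

```python
import math
from collections import Counter


def fill_categories(values):
    """Count occurrences per category label (hypothetical helper)."""
    counts = Counter()
    for v in values:
        # NaN never equals itself, so map every NaN to one shared label;
        # otherwise each NaN would end up as its own key.
        is_nan = isinstance(v, float) and math.isnan(v)
        counts["NaN" if is_nan else v] += 1
    return counts


counts = fill_categories(
    ["blue", "brown", "blue", True, 42, float("nan"), float("nan")]
)
print(counts)
```

Check against histx whether Histogrammar accepts the same mix of types and how it labels them.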

Fill a histogram with a boolean array (isActive), directly from the dataframe.

Q: What type of histogram do you get?

In [ ]:

hists = df.hg_make_histograms(features=['isActive'])

In [ ]:

Multi-dimensional histograms¶

Let's make a 3-dimensional histogram, with axes: x=favoriteFruit, y=gender, z=isActive. (In Histogrammar, a multi-dimensional histogram is composed recursively: each bin of one histogram holds the histogram of the next axis, so you build it starting from the last axis.) Then fill it with the dataframe.

In [ ]:

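# hint: build from the innermost axis (z) outwards, nesting via `value`
# hist1 = hg.Categorize(quantity='isActive')
# hist2 = hg.Categorize(quantity='gender', value=hist1)
# hist3 = hg.Categorize(quantity='favoriteFruit', value=hist2)
# hist3.fill.numpy(df)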

Q: How many data points end up in the bin: banana, male, True ?
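The recursive structure can be pictured as nested dictionaries, one level per axis. The rows below are hypothetical and are not taken from test.csv.gz, and the nested `defaultdict` only stands in for the idea of "a histogram whose bins are histograms"; it is not Histogrammar's data structure.

```python
from collections import defaultdict

# Hypothetical rows of (favoriteFruit, gender, isActive), for illustration only
rows = [
    ("banana", "male", True),
    ("banana", "male", True),
    ("banana", "female", False),
    ("apple", "male", False),
]

# Outer dict = first axis; each value is itself a "histogram" of the next axis.
nested = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
for fruit, gender, is_active in rows:
    nested[fruit][gender][is_active] += 1

# Looking up one cell means walking down one level per axis.
print(nested["banana"]["male"][True])
```

Finding the count for banana, male, True in your real histogram follows the same pattern: step into the fruit bin, then the gender bin, then the isActive bin.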

Q: Store this histogram as a json file. What is the size of the json file?

Q: Read back the histogram and then plot it.

Q: Make a histogram of the feature 'favoriteFruit' that measures the average value of 'latitude' in each fruit bin.

In [ ]:

hist1 = hg.Average(quantity='latitude')

Q: What is the mean value of latitude for the bin 'strawberry'?
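Conceptually, this kind of histogram replaces the plain count in each category bin with a running average. A minimal plain-Python sketch of that idea follows, using hypothetical (fruit, latitude) pairs instead of the real dataset; in Histogrammar the same nesting pattern as in the multi-dimensional exercise presumably applies, with the Average object as the per-bin value.

```python
from collections import defaultdict

# Hypothetical (favoriteFruit, latitude) pairs, for illustration only
rows = [("strawberry", 10.0), ("strawberry", 20.0), ("banana", -5.0)]

sums = defaultdict(float)   # per-category sum of latitude
counts = defaultdict(int)   # per-category number of entries
for fruit, latitude in rows:
    sums[fruit] += latitude
    counts[fruit] += 1

# Mean per category = accumulated sum divided by number of entries
means = {fruit: sums[fruit] / counts[fruit] for fruit in sums}
print(means["strawberry"])
```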