Histogrammar is a Python package for making histograms from numpy arrays and from pandas and Spark dataframes.
(There is also a Scala backend for Histogrammar, which is used by Spark.)
The exercises below are meant to be done after completing the basic tutorial.
Enjoy!
%%capture
# install histogrammar (if not installed yet)
import sys
!"{sys.executable}" -m pip install histogrammar
import histogrammar as hg
import pandas as pd
import numpy as np
import matplotlib
Let's first load some data!
# open a pandas dataframe for use below
from histogrammar import resources
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
df.head(2)
Histogrammar treats histograms as objects. You will see this has various advantages.
Let's fill a simple histogram with a numpy array.
# this creates a histogram with 10 equal-width bins covering the range [0, 100)
hist1 = hg.Bin(num=10, low=0, high=100)
hist1.fill.numpy(df['age'].values)
hist1.plot.matplotlib();
# this creates a sparsely binned histogram: bins of width 10, aligned to 0, with no fixed range
hist2 = hg.SparselyBin(binWidth=10, origin=0)
hist2.fill.numpy(df['age'].values)
hist2.plot.matplotlib();
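The difference between the two binning schemes can be sketched in plain Python. This is a conceptual model only, not Histogrammar's implementation: the function names `dense_counts` and `sparse_counts` and the toy ages are made up for illustration, and the dense version simply skips out-of-range values.

```python
import math

def dense_counts(values, num, low, high):
    """Fixed binning: a list with one counter per bin, allocated up front."""
    width = (high - low) / num
    counts = [0] * num
    for x in values:
        if low <= x < high:  # this sketch simply skips out-of-range values
            counts[int((x - low) // width)] += 1
    return counts

def sparse_counts(values, bin_width, origin):
    """Sparse binning: a dict that only holds bins that received data."""
    counts = {}
    for x in values:
        index = math.floor((x - origin) / bin_width)
        counts[index] = counts.get(index, 0) + 1
    return counts

ages = [23, 27, 35, 61, 61, 140]  # toy values, not taken from test.csv.gz
dense = dense_counts(ages, num=10, low=0, high=100)
sparse = sparse_counts(ages, bin_width=10, origin=0)
print(dense)
print(sparse)
```

In this model the fixed scheme needs a range chosen in advance, while the sparse scheme grows with the data and so has a place for the value 140.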
Q: Have a look at the .values and .bins attributes of hist1 and hist2. What types are these? (hist1.values is a ...?) Does that make sense?
hist1
hist2
Q: In each bin, what type of object is keeping track of the bin count?
Try filling hist1 with small (negative) values, very large values (> 100), or NaNs. Find out whether and how hist1 keeps track of these.
Now fill hist2 with small (negative) values, very large values (> 100), or NaNs. How does hist2 keep track of these?
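As background for these two exercises, here is a minimal plain-Python sketch of one way a fixed-range histogram can avoid silently dropping values. The class `RangeCounter` and its attribute names are chosen for this sketch; inspect hist1 and hist2 yourself to see what Histogrammar actually exposes.

```python
import math

class RangeCounter:
    """Toy fixed-range histogram that never silently drops a value."""

    def __init__(self, num, low, high):
        self.num, self.low, self.high = num, low, high
        self.counts = [0] * num
        self.underflow = 0  # values below low
        self.overflow = 0   # values at or above high
        self.nanflow = 0    # values that are NaN

    def fill(self, x):
        if math.isnan(x):
            self.nanflow += 1
        elif x < self.low:
            self.underflow += 1
        elif x >= self.high:
            self.overflow += 1
        else:
            width = (self.high - self.low) / self.num
            self.counts[int((x - self.low) // width)] += 1

h = RangeCounter(num=10, low=0, high=100)
for x in [-3.0, 42.0, 99.9, 100.0, 250.0, float("nan")]:
    h.fill(x)
print(h.counts, h.underflow, h.overflow, h.nanflow)
```

With separate counters for each case, the total number of fills can always be recovered by summing the bins and the three extra counters.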
For categorical variables, use the Categorize histogram:
histx = hg.Categorize('eyeColor')
histx.fill.numpy(df)
Q: What is a Categorize histogram fundamentally: a dictionary or a list?
Q: What else can it keep track of, e.g. numbers, booleans, NaNs? Give it a try and fill it with more entries!
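For comparison, counting per category can be modelled in plain Python with `collections.Counter`. This is only an analogy with toy data, not a statement about how Histogrammar stores its bins; the helper name `categorize` is hypothetical.

```python
from collections import Counter

def categorize(values):
    """Count occurrences per distinct value; any hashable value can be a key."""
    return Counter(values)

eye_colors = ["blue", "brown", "blue", "green", "brown", "blue"]  # toy data
counts = categorize(eye_colors)
print(counts["blue"], counts["brown"], counts["green"])

# Mixed types work too, but Python treats True and 1 as the same dict key.
mixed = categorize(["blue", 1, True, 2.5])
print(mixed[1], len(mixed))
```

The second call shows a pitfall of the plain-Python model: `True` and `1` compare equal and hash alike, so they share one counter. Whether Histogrammar behaves the same way is worth testing in the exercise.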
Fill a histogram with a boolean array (isActive), directly from the dataframe.
Q: What type of histogram do you get?
hists = df.hg_make_histograms(features=['isActive'])
Let's make a 3-dimensional histogram, with axes: x=favoriteFruit, y=gender, z=isActive. (In Histogrammar, a multi-dimensional histogram is composed as recursive histograms, starting with the last one.) Then fill it with the dataframe.
# hist1 = hg.Categorize(quantity='isActive')
# hist2 = hg.Categorize(quantity='gender', value=hist1)
# hist3 = hg.Categorize(quantity='favoriteFruit', value=hist2)
Q: How many data points end up in the bin: banana, male, True ?
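The idea of a multi-dimensional histogram built from recursive one-dimensional ones can be sketched with nested dictionaries. This is a plain-Python model under that assumption, with a hypothetical helper `nested_counts` and four made-up records; it does not reproduce the counts in the real dataset.

```python
def nested_counts(records, keys):
    """Count records in nested dicts, one level per key, innermost key last."""
    root = {}
    for rec in records:
        node = root
        for key in keys[:-1]:
            # descend one level, creating the sub-histogram on first use
            node = node.setdefault(rec[key], {})
        last = rec[keys[-1]]
        node[last] = node.get(last, 0) + 1
    return root

records = [  # toy records, not taken from test.csv.gz
    {"favoriteFruit": "banana", "gender": "male", "isActive": True},
    {"favoriteFruit": "banana", "gender": "male", "isActive": True},
    {"favoriteFruit": "banana", "gender": "female", "isActive": False},
    {"favoriteFruit": "apple", "gender": "male", "isActive": True},
]
nested = nested_counts(records, ["favoriteFruit", "gender", "isActive"])
print(nested["banana"]["male"][True])
```

Looking up a single cell means walking the levels in the same order as the axes: first fruit, then gender, then the activity flag.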
Q: Store this histogram as a json file. What is the size of the json file?
Q: Read back the histogram and then plot it.
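As a warm-up for the storage exercise, the sketch below writes a toy nested-count dictionary to a JSON file with the standard library, measures the file size, and reads it back. It deliberately does not use Histogrammar's own serialization (see the basic tutorial for that); the file name and data are made up.

```python
import json
import os
import tempfile

# toy nested counts (fruit -> gender -> count), not real results
toy_hist = {"banana": {"male": 2, "female": 1}, "apple": {"male": 1}}

path = os.path.join(tempfile.mkdtemp(), "hist.json")
with open(path, "w") as f:
    json.dump(toy_hist, f)

size_bytes = os.path.getsize(path)  # file size on disk, in bytes
with open(path) as f:
    restored = json.load(f)

print(size_bytes, restored == toy_hist)
```

Note that JSON object keys are always strings, so a plain dictionary with boolean or numeric keys would not round-trip unchanged in this model.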
Q: Make a histogram of the feature 'favoriteFruit' that measures the average value of 'latitude' per fruit bin.
hist1 = hg.Average(quantity='latitude')
Q: What is the mean value of latitude for the bin 'strawberry'?
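The concept behind this exercise, a per-category aggregate instead of a per-category count, can be sketched in plain Python with a running mean. The helper `average_per_category` and the three records are hypothetical; the latitudes are not from the dataset.

```python
def average_per_category(records, category, quantity):
    """Keep a running entry count and mean of `quantity` for each category."""
    stats = {}
    for rec in records:
        entries, mean = stats.get(rec[category], (0, 0.0))
        entries += 1
        mean += (rec[quantity] - mean) / entries  # incremental mean update
        stats[rec[category]] = (entries, mean)
    return stats

records = [  # toy records, not taken from test.csv.gz
    {"favoriteFruit": "strawberry", "latitude": 10.0},
    {"favoriteFruit": "strawberry", "latitude": 20.0},
    {"favoriteFruit": "apple", "latitude": -5.0},
]
stats = average_per_category(records, "favoriteFruit", "latitude")
print(stats["strawberry"])
```

The incremental update avoids storing all values: each bin only needs its entry count and current mean.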