Histogrammar is a Python package that allows you to make histograms from NumPy arrays and from pandas and Spark dataframes. (There is also a Scala backend for Histogrammar.)
This basic tutorial shows how to:
Enjoy!
%%capture
# install histogrammar (if not installed yet)
import sys
!"{sys.executable}" -m pip install histogrammar
import histogrammar as hg
import pandas as pd
import numpy as np
import matplotlib
Let's first load some data!
# open a pandas dataframe for use below
from histogrammar import resources
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
df.head()
Histogrammar treats histograms as objects. You will see this has various advantages.
Let's fill a simple histogram with a numpy array.
# this creates a histogram with 100 equal-width bins in the half-open range [-5, 5)
hist1 = hg.Bin(num=100, low=-5, high=5)
# filling it with one data point:
hist1.fill(0.5)
hist1.entries
# filling the histogram with an array:
hist1.fill.numpy(np.random.normal(size=10000))
hist1.entries
# let's plot it
hist1.plot.matplotlib();
# Alternatively, you can call this to make the same histogram:
# hist1 = hg.Histogram(num=100, low=-5, high=5)
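To make the regular binning concrete, here is a minimal plain-Python sketch of how a value could be mapped to one of num equal-width bins. It is an illustration only, not Histogrammar's own code; the function name regular_bin_index is hypothetical, and the half-open range with underflow and overflow categories is an assumption based on the Histogrammar specification.

```python
import math


def regular_bin_index(x, num, low, high):
    """Map x to a bin index in [0, num), or to a named out-of-range category.

    Assumption: the range is half-open, so x == high counts as overflow.
    """
    if math.isnan(x):
        return "nanflow"
    if x < low:
        return "underflow"
    if x >= high:
        return "overflow"
    # Scale x into [0, num) and round down to get the bin index.
    return int(math.floor(num * (x - low) / (high - low)))


print(regular_bin_index(0.5, num=100, low=-5, high=5))   # 55
print(regular_bin_index(5.0, num=100, low=-5, high=5))   # overflow
```

Keeping out-of-range values in separate categories means no data point is silently dropped, which is why the total number of entries can exceed the sum of the in-range bins.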
Histogrammar also supports open-ended histograms, which are sparsely represented. Open-ended histograms are used when you have a distribution of known scale (bin width) but unknown domain (lowest and highest bin index). Bins in a sparse histogram only get created and filled if the corresponding data points are encountered.
A sparse histogram has a binWidth, and optionally an origin parameter. The origin is the left edge of the bin whose index is 0 and is set to 0.0 by default. Sparse histograms are nice if you don't want to restrict the range, for example for tracking data distributions over time, which may have large, sudden outliers.
hist2 = hg.SparselyBin(binWidth=10, origin=0)
hist2.fill.numpy(df['age'].values)
hist2.plot.matplotlib();
# Alternatively, you can call this to make the same histogram:
# hist2 = hg.SparselyHistogram(binWidth=10)
Let's make the same 1d (sparse) histogram directly from a (pandas) dataframe.
hist3 = hg.SparselyBin(binWidth=10, origin=0, quantity='age')
hist3.fill.numpy(df)
hist3.plot.matplotlib();
When importing histogrammar, pandas (and spark) dataframes get extra functions to create histograms that all start with "hg_". For example: hg_Bin or hg_SparselyBin. Note that the column "age" is picked by setting quantity="age", and also that the filling step is done automatically.
# Alternatively, do:
hist3 = df.hg_SparselyBin(binWidth=10, origin=0, quantity='age')
# ... where hist3 automatically picks up column age from the dataframe,
# ... and does not need to be filled by calling fill.numpy() explicitly.
For any 1-dimensional histogram, you can extract the bin entries, edges and centers as follows:
# full range of bin entries, and those in a specified range:
(hist3.bin_entries(), hist3.bin_entries(low=30, high=80))
# full range of bin edges, and those in a specified range:
(hist3.bin_edges(), hist3.bin_edges(low=31, high=71))
# full range of bin centers, and those in a specified range:
(hist3.bin_centers(), hist3.bin_centers(low=31, high=80))
# histograms can be added together and scaled
hsum = hist2 + hist3
hsum.entries
hsum *= 4
hsum.entries
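Adding and scaling histograms can be pictured as operations on per-bin counts. The sketch below uses collections.Counter as a stand-in for a sparse histogram; the bin contents are made up, and this is only an analogy for what the + and *= operators do, not Histogrammar's code.

```python
from collections import Counter

# Two sparse histograms, written as {bin index: count}.
h_a = Counter({2: 5, 3: 7})
h_b = Counter({3: 1, 9: 2})

# Adding merges bin by bin; bins present in only one histogram are kept.
h_sum = h_a + h_b
print(dict(h_sum))              # {2: 5, 3: 8, 9: 2}
print(sum(h_sum.values()))      # 15

# Scaling multiplies every bin count by the same factor.
h_scaled = Counter({index: 4 * count for index, count in h_sum.items()})
print(sum(h_scaled.values()))   # 60
```

Because merging is a plain bin-by-bin sum, histograms filled on separate chunks of data can be combined afterwards into the same result as one pass over all the data.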
There are two other open-ended histogram variants in addition to the SparselyBin we have seen before. Whereas SparselyBin is used when bins have equal width, the others offer similar alternatives to a single fixed bin width.
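There are two ways: CentrallyBin histograms are defined by their bin centers, and IrregularlyBin histograms by their bin edges.
They both partition a space into irregular subdomains with no gaps and no overlaps.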
hist4 = hg.CentrallyBin(centers=[15, 25, 35, 45, 55, 65, 75, 85, 95], quantity='age')
hist4.fill.numpy(df)
hist4.plot.matplotlib();
hist4.bin_edges()
Note the slightly different plotting style for CentrallyBin histograms (e.g. x-axis labels are central values instead of edges).
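One way to picture how a list of centers partitions the whole axis is a nearest-center lookup, where the implied bin edges are the midpoints between neighbouring centers. The sketch below is plain Python and only an illustration: the function name nearest_center is hypothetical, and sending a value that lies exactly on a midpoint to the lower center is an assumption, not documented Histogrammar behaviour.

```python
from bisect import bisect_left


def nearest_center(x, centers):
    """Return the center closest to x; centers must be sorted ascending."""
    # Midpoints between neighbouring centers act as the implied bin edges.
    midpoints = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    return centers[bisect_left(midpoints, x)]


centers = [15, 25, 35, 45, 55, 65, 75, 85, 95]
print(nearest_center(22, centers))    # 25
print(nearest_center(3, centers))     # 15
print(nearest_center(120, centers))   # 95
```

Values below the first midpoint or above the last one still land in the outermost bins, which is what makes this kind of binning open-ended.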
Let's make a multi-dimensional histogram. In Histogrammar, a multi-dimensional histogram is composed of nested histograms: the histogram for one axis is passed as the value argument of the histogram for the other axis.
We will use histograms with irregular binning in this example.
edges1 = [-100, -75, -50, -25, 0, 25, 50, 75, 100]
edges2 = [-200, -150, -100, -50, 0, 50, 100, 150, 200]
hist1 = hg.IrregularlyBin(edges=edges1, quantity='latitude')
hist2 = hg.IrregularlyBin(edges=edges2, quantity='longitude', value=hist1)
# for 3 dimensions or higher simply add the 2-dim histogram to the value argument
hist3 = hg.SparselyBin(binWidth=10, quantity='age', value=hist2)
hist1.bin_centers()
hist2.bin_centers()
hist2.fill.numpy(df)
hist2.plot.matplotlib();
# number of dimensions per histogram
(hist1.n_dim, hist2.n_dim, hist3.n_dim)
For most 2+ dimensional histograms, one can get the bin entries and centers as follows:
from histogrammar.plot.hist_numpy import get_2dgrid
x_labels, y_labels, grid = get_2dgrid(hist2)
y_labels, grid
Depending on the histogram type of the first axis (hg.Bin or another type), one can access the sub-histograms directly from hist.values or hist.bins.
# Access sub-histograms from IrregularlyBin from hist.bins
# The first item of the tuple is the lower bin-edge of the bin.
hist2.bins[1]
h = hist2.bins[1][1]
h.plot.matplotlib()
h.bin_entries()
So far we have covered the histogram types Bin, SparselyBin, CentrallyBin, and IrregularlyBin.
All of these process numeric variables only.
For categorical variables, use the Categorize histogram.
histy = hg.Categorize('eyeColor')
histx = hg.Categorize('favoriteFruit', value=histy)
histx.fill.numpy(df)
histx.plot.matplotlib();
# show the datatype(s) of the histogram
histx.datatype
Categorize histograms also accept booleans:
histy = hg.Categorize('isActive')
histy.fill.numpy(df)
histy.plot.matplotlib();
histy.bin_entries()
histy.bin_labels()
# histy.bin_centers() will work as well for Categorize histograms
There are several more histogram types, such as Minimize, Maximize, Average, Deviate, Sum, and Stack:
hmin = df.hg_Minimize('latitude')
hmax = df.hg_Maximize('longitude')
(hmin.min, hmax.max)
havg = df.hg_Average('latitude')
hdev = df.hg_Deviate('longitude')
(havg.mean, hdev.mean, hdev.variance)
hsum = df.hg_Sum('age')
hsum.sum
# let's illustrate the Stack histogram with longitude distribution
# first we plot the regular distribution
hl = df.hg_SparselyBin(25, 'longitude')
hl.plot.matplotlib();
# Stack counts how often data points are greater than or equal to the provided thresholds
thresholds = [-200, -150, -100, -50, 0, 50, 100, 150, 200]
hs = df.hg_Stack(thresholds=thresholds, quantity='longitude')
hs.thresholds
hs.bin_entries()
Stack histograms are useful to make efficiency curves.
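The counting rule behind a stack, and how it turns into an efficiency curve, can be sketched in plain Python. This is an illustration only: the function name stack_counts and the sample longitudes are made up, and the sketch does not reproduce any extra bins that Histogrammar itself may keep.

```python
def stack_counts(values, thresholds):
    """For each threshold, count the values greater than or equal to it."""
    return [sum(1 for x in values if x >= t) for t in thresholds]


longitudes = [-170.0, -60.0, -5.0, 0.0, 42.0, 130.0]  # made-up sample
thresholds = [-100, 0, 100]

counts = stack_counts(longitudes, thresholds)
print(counts)  # [5, 3, 1]

# Efficiency: the fraction of all data points that pass each threshold.
efficiency = [c / len(longitudes) for c in counts]
```

The counts can only decrease as the threshold rises, so the resulting curve falls monotonically from the full sample towards zero.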
With all these histograms you can make multi-dimensional histograms. For example, you can evaluate the mean and standard deviation of one feature as a function of bins of another feature. (A "profile" plot, similar to a box plot.)
hav = hg.Deviate('age')
hlo = hg.SparselyBin(25, 'longitude', value=hav)
hlo.fill.numpy(df)
hlo.bins
hlo.plot.matplotlib();
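What such a profile computes can be sketched in plain Python: group one feature by the bin index of another, then take the mean and spread within each group. The function name profile and the sample data are made up, and using the population standard deviation is an assumption about how the spread is defined.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def profile(xs, ys, bin_width, origin=0.0):
    """Return {bin index of x: (mean of y, population std dev of y)}."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[math.floor((x - origin) / bin_width)].append(y)
    return {index: (mean(v), pstdev(v)) for index, v in sorted(groups.items())}


longitude = [-30.0, -28.0, 10.0, 12.0, 20.0]  # made-up sample
age = [20, 30, 40, 50, 60]

result = profile(longitude, age, bin_width=25)
for index, (avg, std) in result.items():
    print(index, avg, round(std, 2))
```

With these numbers, the two points in bin -2 give a mean age of 25 and a spread of 5, while the three points in bin 0 give a mean age of 50.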
There are several convenience functions to make such composed histograms. These are:
# For example, call this convenience function to make the same histogram as above:
hlo = df.hg_SparselyProfileErr(25, 'longitude', 'age')
hlo.plot.matplotlib();
Here you can find the list of all available histograms and aggregators and how to use each one:
https://histogrammar.github.io/histogrammar-docs/specification/1.0/
The most useful aggregators are the following. Tinker with them to get familiar; building up an analysis is easier when you know "there's an app for that."
Simple counters:
- [Count](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#count-sum-of-weights): just counts. Every aggregator has an entries field, but Count only has this field.
- [Average](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#average-mean-of-a-quantity) and Deviate: add mean and variance, cumulatively.
- [Minimize](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#minimize-minimum-value) and Maximize: lowest and highest value seen.
Histogram-like objects:
- [Bin](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#bin-regular-binning-for-histograms) and SparselyBin: split a numerical domain into uniform bins and redirect aggregation into those bins.
- [Categorize](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#categorize-string-valued-bins-bar-charts): split a string-valued domain by unique values; good for making bar charts (which are histograms with a string-valued axis).
- [CentrallyBin](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#centrallybin-fully-partitioning-with-centers) and IrregularlyBin: split a numerical domain into arbitrary subintervals, usually for separate plots like particle pseudorapidity or collision centrality.
Collections:
- [Label](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#label-directory-with-string-based-keys), UntypedLabel, and Index: bundle objects with string-based keys (Label and UntypedLabel) or simply an ordered array (effectively, integer-based keys) consisting of a single type (Label and Index) or any types (UntypedLabel).
- [Branch](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#branch-tuple-of-different-types): for the fourth case, an ordered array of any types. A Branch is useful as a "cable splitter". For instance, to make a histogram that tracks minimum and maximum value, do this:
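The "cable splitter" idea can be sketched in plain Python: one container forwards every data point to several independent aggregators. The Toy* classes below are hypothetical stand-ins written for illustration, not Histogrammar's classes. With Histogrammar itself, the equivalent would presumably be built by passing the aggregators to hg.Branch, for example a Bin together with a Minimize and a Maximize.

```python
class ToyCount:
    """Counts how many values were seen."""
    def __init__(self):
        self.entries = 0

    def fill(self, x):
        self.entries += 1


class ToyMinimize:
    """Tracks the lowest value seen."""
    def __init__(self):
        self.min = float("inf")

    def fill(self, x):
        self.min = min(self.min, x)


class ToyMaximize:
    """Tracks the highest value seen."""
    def __init__(self):
        self.max = float("-inf")

    def fill(self, x):
        self.max = max(self.max, x)


class ToyBranch:
    """Forwards each value to every aggregator it holds, in order."""
    def __init__(self, *values):
        self.values = values

    def fill(self, x):
        for aggregator in self.values:
            aggregator.fill(x)


branch = ToyBranch(ToyCount(), ToyMinimize(), ToyMaximize())
for x in [3.2, -1.5, 7.0]:
    branch.fill(x)

count, lowest, highest = branch.values
print(count.entries, lowest.min, highest.max)  # 3 -1.5 7.0
```

A single pass over the data fills all three aggregators, which is the point of bundling them instead of looping over the data once per aggregator.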
There is a convenient method to make many histograms in one go: hg_make_histograms.
By default, automagical binning is applied to make the histograms.
More details on how to use this function are found in the advanced tutorial.
hists = df.hg_make_histograms()
hists.keys()
h = hists['transaction']
h.plot.matplotlib();
h = hists['date']
h.plot.matplotlib();
# you can also select which features to use, and make multi-dimensional histograms
hists = df.hg_make_histograms(features = ['longitude:age'])
hist = hists['longitude:age']
hist.plot.matplotlib();
Histograms can easily be stored in and retrieved from the JSON format.
# storage
hist.toJsonFile('long_age.json')
# retrieval
factory = hg.Factory()
hist2 = factory.fromJsonFile('long_age.json')
hist2.plot.matplotlib();
# we can store the histograms if we want to
import json
from histogrammar.util import dumper
# store
with open('histograms.json', 'w') as outfile:
    json.dump(hists, outfile, default=dumper)
# and load again
with open('histograms.json') as handle:
    hists2 = json.load(handle)
hists2.keys()
The advanced tutorial shows: