Histogrammar is a Python package that allows you to make histograms from numpy arrays, and from pandas and Spark dataframes. (There is also a Scala backend for Histogrammar.)
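This advanced tutorial shows how to fill histograms from Spark dataframes and how to make many histograms in one go, including how to configure their binning.
Enjoy!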
%%capture
# install histogrammar (if not installed yet)
import sys
!"{sys.executable}" -m pip install histogrammar
import histogrammar as hg
import pandas as pd
import numpy as np
import matplotlib
Let's first load some data!
# open a pandas dataframe for use below
from histogrammar import resources
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
df.head()
What about Spark DataFrames? No problem! We can easily perform the same steps on a Spark DataFrame. One important thing to note is that we need to include a jar file when we create our Spark session. Spark uses this jar file to create the histograms with Histogrammar. The jar file is downloaded automatically the first time you run this command.
# download histogrammar jar files if not already installed, used for histogramming of spark dataframe
try:
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark import __version__ as pyspark_version
    pyspark_installed = True
except ImportError:
    print("pyspark needs to be installed for this example")
    pyspark_installed = False
# this is the jar file for spark 3.0
# for spark 2.X, in the jars string, for both jar files change "_2.12" into "_2.11".
if pyspark_installed:
    # pick the scala version from the major pyspark version (also correct for two-digit majors)
    scala = '2.12' if int(pyspark_version.split('.')[0]) >= 3 else '2.11'
    hist_jar = f'io.github.histogrammar:histogrammar_{scala}:1.0.20'
    hist_spark_jar = f'io.github.histogrammar:histogrammar-sparksql_{scala}:1.0.20'

    spark = SparkSession.builder.config(
        "spark.jars.packages", f'{hist_spark_jar},{hist_jar}'
    ).getOrCreate()

    # convert the pandas dataframe loaded above into a spark dataframe
    sdf = spark.createDataFrame(df)
Filling histograms from Spark dataframes is just as simple as it is from pandas dataframes.
# example: filling from a pandas dataframe
hist = hg.SparselyHistogram(binWidth=100, quantity='transaction')
hist.fill.numpy(df)
hist.plot.matplotlib();
# for spark you will need this spark column function:
if pyspark_installed:
    from pyspark.sql.functions import col
Let's make the same histogram but from a Spark dataframe. There are just two differences:
- use col('columns_name') instead of 'columns_name' for the quantity;
- use the fill.sparksql() method instead of fill.numpy().
# example: filling from a spark dataframe
if pyspark_installed:
    hist = hg.SparselyHistogram(binWidth=100, quantity=col('transaction'))
    hist.fill.sparksql(sdf)
    hist.plot.matplotlib();
Apart from these two differences, all functionality is the same between pandas and Spark histograms!
As with pandas, we can also create and fill histograms directly from the dataframe:
if pyspark_installed:
    h2 = sdf.hg_SparselyProfileErr(25, col('longitude'), col('age'))
    h2.plot.matplotlib();
if pyspark_installed:
    h3 = sdf.hg_TwoDimensionallySparselyHistogram(25, col('longitude'), 10, col('latitude'))
    h3.plot.matplotlib();
All examples below also work with Spark dataframes.
Histogrammar has a convenient method for making many histograms in one go.
By default, automatic binning is applied to make the histograms.
hists = df.hg_make_histograms()
# histogrammar has made histograms of all features, using an automated binning.
hists.keys()
h = hists['transaction']
h.plot.matplotlib();
# you can select which features you want to histogram with features=:
hists = df.hg_make_histograms(features = ['longitude', 'age', 'eyeColor'])
# you can also make multi-dimensional histograms
# here longitude is the first axis of each histogram.
hists = df.hg_make_histograms(features = ['longitude:age', 'longitude:age:eyeColor'])
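The feature strings above follow a simple convention: axis names are joined with colons, and the first name is the first axis of the histogram. A minimal sketch of how such a string can be taken apart is given here. The helper name split_feature is hypothetical and is not part of Histogrammar's API; it only illustrates the naming convention.

```python
def split_feature(feature):
    """Split a colon-separated feature string into its axis names (hypothetical helper)."""
    axes = feature.split(":")
    return {"axes": axes, "first_axis": axes[0], "n_dim": len(axes)}


# show the dimensionality and first axis of a few feature strings
for feature in ["age", "longitude:age", "longitude:age:eyeColor"]:
    info = split_feature(feature)
    print(f"{feature!r}: {info['n_dim']}d, first axis = {info['first_axis']!r}")
```

The same convention applies to the keys of the returned dictionary, so a key such as 'longitude:age' refers to a two-dimensional histogram with longitude on its first axis.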
# Working with a dedicated time axis, make histograms of each feature over time.
hists = df.hg_make_histograms(time_axis="date")
hists.keys()
h2 = hists['date:age']
h2.plot.matplotlib();
Histogrammar does not support pandas' timestamps natively, but converts timestamps into nanoseconds since 1970-1-1.
h2.bin_edges()
The datatype shows the datetime though:
h2.datatype
# convert these back to timestamps with:
pd.Timestamp(h2.bin_edges()[0])
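To make the nanosecond representation concrete, here is a small sketch using only the Python standard library. It is not how Histogrammar or pandas perform the conversion internally; the helper names to_nanoseconds and from_nanoseconds are hypothetical, and Python's datetime only resolves microseconds, so the round trip truncates anything finer.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_nanoseconds(ts):
    """Return integer nanoseconds since 1970-01-01 UTC for a timezone-aware datetime."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000


def from_nanoseconds(ns):
    """Return the UTC datetime for nanoseconds since the epoch (truncated to microseconds)."""
    return EPOCH + timedelta(microseconds=ns // 1_000)


ns = to_nanoseconds(datetime(2015, 1, 1, tzinfo=timezone.utc))
print(ns)                    # integer nanoseconds since the epoch
print(from_nanoseconds(ns))  # back to a datetime
```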
# For the time axis, you can set the binning specifications with time_width and time_offset:
hists = df.hg_make_histograms(time_axis="date", time_width='28d', time_offset='2014-1-4', features=['date:isActive', 'date:age'])
hists['date:isActive'].plot.matplotlib();
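A time width and a time offset together define a regular grid of time bins. The sketch below shows one way to think about that grid, under the assumption that bins start at the offset and repeat every width in both directions. The function time_bin is hypothetical and is not taken from Histogrammar.

```python
from datetime import datetime, timedelta


def time_bin(ts, offset, width):
    """Return (bin index, bin start) for a timestamp, given an offset and a bin width."""
    index = (ts - offset) // width  # floor division of two timedeltas gives an int
    return index, offset + index * width


offset = datetime(2014, 1, 4)  # mirrors time_offset='2014-1-4'
width = timedelta(days=28)     # mirrors time_width='28d'

for ts in [datetime(2014, 1, 1), datetime(2014, 1, 4), datetime(2014, 2, 10)]:
    index, start = time_bin(ts, offset, width)
    print(f"{ts:%Y-%m-%d} falls in bin {index}, which starts on {start:%Y-%m-%d}")
```

Timestamps before the offset simply get negative bin indices in this sketch, which matches the open-ended nature of a sparse binning.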
# histogram selections. Each listed feature, including 'date', gets its own 1d histogram.
features=[
    'date', 'latitude', 'longitude', 'age', 'eyeColor', 'favoriteFruit', 'transaction'
]
# Specify your own binning specifications for individual features or combinations thereof.
# This bin specification uses open-ended ("sparse") histograms; unspecified features get
# auto-binned. The time-axis binning, when specified here, needs to be in nanoseconds.
bin_specs={
    'longitude': {'binWidth': 10.0, 'origin': 0.0},
    'latitude': {'edges': [-100, -75, -25, 0, 25, 75, 100]},
    'age': {'num': 100, 'low': 0, 'high': 100},
    'transaction': {'centers': [-1000, -500, 0, 500, 1000, 1500]},
    'date': {'binWidth': pd.Timedelta('4w').value, 'origin': pd.Timestamp('2015-1-1').value}
}
# this binning specification is making:
# - a sparse histogram for: longitude
# - an irregular binned histogram for: latitude
# - a closed-range evenly spaced histogram for: age
# - a histogram centered around bin centers for: transaction
hists = df.hg_make_histograms(features=features, bin_specs=bin_specs)
hists.keys()
hists['transaction'].plot.matplotlib();
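The four kinds of binning used above differ in how a value is mapped to a bin. The following is a conceptual sketch in plain Python, not Histogrammar's implementation: all function names are hypothetical, and out-of-range values return None here for simplicity rather than being tracked separately.

```python
import math
from bisect import bisect_right


def sparse_index(x, bin_width, origin=0.0):
    """Open-ended, evenly spaced bins: any value gets an (possibly negative) index."""
    return math.floor((x - origin) / bin_width)


def regular_index(x, num, low, high):
    """Closed-range, evenly spaced bins: None if x falls outside [low, high)."""
    if not low <= x < high:
        return None
    return math.floor(num * (x - low) / (high - low))


def irregular_index(x, edges):
    """Bins delimited by sorted edges: None if x falls outside [edges[0], edges[-1])."""
    if not edges[0] <= x < edges[-1]:
        return None
    return bisect_right(edges, x) - 1


def nearest_center(x, centers):
    """Bins defined by their centers: a value belongs to the closest center."""
    return min(centers, key=lambda c: abs(x - c))


print(sparse_index(-95.0, bin_width=10.0, origin=0.0))           # longitude-style binning
print(irregular_index(10, [-100, -75, -25, 0, 25, 75, 100]))     # latitude-style binning
print(regular_index(37.5, num=100, low=0, high=100))             # age-style binning
print(nearest_center(300, [-1000, -500, 0, 500, 1000, 1500]))    # transaction-style binning
```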
# all available bin specifications are (just examples):
bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0}, # SparselyBin histogram
'y': {'num': 10, 'low': 0.0, 'high': 2.0}, # Bin histogram
'x:y': [{}, {'num': 5, 'low': 0.0, 'high': 1.0}], # SparselyBin vs Bin histograms
'a': {'edges': [0, 2, 10, 11, 21, 101]}, # IrregularlyBin histogram
'b': {'centers': [1, 6, 10.5, 16, 20, 100]}, # CentrallyBin histogram
'c': {'max': True}, # Maximize histogram
'd': {'min': True}, # Minimize histogram
'e': {'sum': True}, # Sum histogram
'z': {'deviate': True}, # Deviate histogram
'f': {'average': True}, # Average histogram
'a:f': [{'edges': [0, 10, 101]}, {'average': True}], # IrregularlyBin vs Average histograms
'g': {'thresholds': [0, 2, 10, 11, 21, 101]}, # Stack histogram
'h': {'bag': True}, # Bag histogram
}
# to set binning specs for a specific 2d histogram, you can do this:
# if these are not provided, the 1d binning specifications are picked up for 'a:f'
bin_specs = {'a:f': [{'edges': [0, 10, 101]}, {'average': True}]}
# For example
features = ['latitude:age', 'longitude:age', 'age', 'longitude']
bin_specs = {
    'latitude': {'binWidth': 25},
    'longitude': {'edges': [-100, -75, -25, 0, 25, 75, 100]},
    'age': {'deviate': True},
    'longitude:age': [{'binWidth': 25}, {'average': True}],
}
hists = df.hg_make_histograms(features=features, bin_specs=bin_specs)
h = hists['latitude:age']
h.bins
hists['longitude:age'].plot.matplotlib();
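The fallback rule described above (a dedicated specification for a multi-dimensional feature wins, otherwise the 1d specifications of its axes are picked up) can be sketched as a small lookup. This is an illustration of the rule only: the function resolve_bin_specs is hypothetical, not Histogrammar's own code, and an empty dict stands for "no explicit specification", that is, automatic binning.

```python
def resolve_bin_specs(feature, bin_specs):
    """Return one bin specification per axis of a colon-separated feature string."""
    if feature in bin_specs:
        spec = bin_specs[feature]
        # a dedicated entry is a list for multi-dimensional features, a dict for 1d ones
        return spec if isinstance(spec, list) else [spec]
    # no dedicated entry: fall back to the 1d specification of each axis
    return [bin_specs.get(axis, {}) for axis in feature.split(":")]


example_specs = {
    "latitude": {"binWidth": 25},
    "age": {"deviate": True},
    "longitude:age": [{"binWidth": 25}, {"average": True}],
}

for feature in ["latitude:age", "longitude:age", "age", "eyeColor"]:
    print(feature, "->", resolve_bin_specs(feature, example_specs))
```

In this sketch, 'latitude:age' has no dedicated entry and therefore combines the 1d specifications of latitude and age, while 'longitude:age' uses its own two-element list.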