The main purpose of Stagg is to move statistical aggregations, such as histograms, from one framework to another. This requires converting high-level domain concepts.
Consider the following example: in Numpy, a histogram is simply a 2-tuple of arrays with special meaning—bin contents, then bin edges.
import numpy
numpy_hist = numpy.histogram(numpy.random.normal(0, 1, int(10e6)), bins=80, range=(-5, 5))
numpy_hist
(array([ 3, 6, 9, 17, 34, 53, 88, 145, 219, 362, 583, 890, 1414, 2082, 3073, 4567, 6650, 9493, 13497, 18696, 25706, 34175, 45639, 59338, 76917, 96418, 120250, 147785, 177579, 210677, 246305, 283236, 321129, 357500, 392646, 424978, 452731, 475951, 490446, 497232, 497458, 490322, 475074, 453326, 425909, 393028, 358993, 321558, 284107, 246317, 210293, 177366, 147453, 119625, 97069, 75632, 59476, 45713, 34588, 25589, 18934, 13608, 9658, 6656, 4692, 3177, 2137, 1388, 866, 570, 365, 207, 147, 71, 40, 29, 21, 9, 4, 4]), array([-5. , -4.875, -4.75 , -4.625, -4.5 , -4.375, -4.25 , -4.125, -4. , -3.875, -3.75 , -3.625, -3.5 , -3.375, -3.25 , -3.125, -3. , -2.875, -2.75 , -2.625, -2.5 , -2.375, -2.25 , -2.125, -2. , -1.875, -1.75 , -1.625, -1.5 , -1.375, -1.25 , -1.125, -1. , -0.875, -0.75 , -0.625, -0.5 , -0.375, -0.25 , -0.125, 0. , 0.125, 0.25 , 0.375, 0.5 , 0.625, 0.75 , 0.875, 1. , 1.125, 1.25 , 1.375, 1.5 , 1.625, 1.75 , 1.875, 2. , 2.125, 2.25 , 2.375, 2.5 , 2.625, 2.75 , 2.875, 3. , 3.125, 3.25 , 3.375, 3.5 , 3.625, 3.75 , 3.875, 4. , 4.125, 4.25 , 4.375, 4.5 , 4.625, 4.75 , 4.875, 5. ]))
We convert that into its Stagg equivalent with a connector (a two-function module: `tostagg` and `tonumpy`).
import stagg.connect.numpy
stagg_hist = stagg.connect.numpy.tostagg(numpy_hist)
stagg_hist
<Histogram at 0x711474a41588>
This object is instantiated from a class structure built from simple pieces.
stagg_hist.dump()
Histogram( axis=[ Axis(binning=RegularBinning(num=80, interval=RealInterval(low=-5.0, high=5.0))) ], counts= UnweightedCounts( counts= InterpretedInlineInt64Buffer( buffer= [ 3 6 9 17 34 53 88 145 219 362 583 890 1414 2082 3073 4567 6650 9493 13497 18696 25706 34175 45639 59338 76917 96418 120250 147785 177579 210677 246305 283236 321129 357500 392646 424978 452731 475951 490446 497232 497458 490322 475074 453326 425909 393028 358993 321558 284107 246317 210293 177366 147453 119625 97069 75632 59476 45713 34588 25589 18934 13608 9658 6656 4692 3177 2137 1388 866 570 365 207 147 71 40 29 21 9 4 4])))
Now it can be converted to a ROOT histogram with another connector.
import stagg.connect.root
root_hist = stagg.connect.root.toroot(stagg_hist, "root_hist")
root_hist
Welcome to JupyROOT 6.14/04
<ROOT.TH1D object ("root_hist") at 0x6510de522e70>
import ROOT
canvas = ROOT.TCanvas()
root_hist.Draw()
canvas.Draw()
And Pandas with yet another connector.
import stagg.connect.pandas
pandas_hist = stagg.connect.pandas.topandas(stagg_hist)
pandas_hist
| | unweighted |
|---|---|
[-5.0, -4.875) | 3 |
[-4.875, -4.75) | 6 |
[-4.75, -4.625) | 9 |
[-4.625, -4.5) | 17 |
[-4.5, -4.375) | 34 |
[-4.375, -4.25) | 53 |
[-4.25, -4.125) | 88 |
[-4.125, -4.0) | 145 |
[-4.0, -3.875) | 219 |
[-3.875, -3.75) | 362 |
[-3.75, -3.625) | 583 |
[-3.625, -3.5) | 890 |
[-3.5, -3.375) | 1414 |
[-3.375, -3.25) | 2082 |
[-3.25, -3.125) | 3073 |
[-3.125, -3.0) | 4567 |
[-3.0, -2.875) | 6650 |
[-2.875, -2.75) | 9493 |
[-2.75, -2.625) | 13497 |
[-2.625, -2.5) | 18696 |
[-2.5, -2.375) | 25706 |
[-2.375, -2.25) | 34175 |
[-2.25, -2.125) | 45639 |
[-2.125, -2.0) | 59338 |
[-2.0, -1.875) | 76917 |
[-1.875, -1.75) | 96418 |
[-1.75, -1.625) | 120250 |
[-1.625, -1.5) | 147785 |
[-1.5, -1.375) | 177579 |
[-1.375, -1.25) | 210677 |
... | ... |
[1.25, 1.375) | 210293 |
[1.375, 1.5) | 177366 |
[1.5, 1.625) | 147453 |
[1.625, 1.75) | 119625 |
[1.75, 1.875) | 97069 |
[1.875, 2.0) | 75632 |
[2.0, 2.125) | 59476 |
[2.125, 2.25) | 45713 |
[2.25, 2.375) | 34588 |
[2.375, 2.5) | 25589 |
[2.5, 2.625) | 18934 |
[2.625, 2.75) | 13608 |
[2.75, 2.875) | 9658 |
[2.875, 3.0) | 6656 |
[3.0, 3.125) | 4692 |
[3.125, 3.25) | 3177 |
[3.25, 3.375) | 2137 |
[3.375, 3.5) | 1388 |
[3.5, 3.625) | 866 |
[3.625, 3.75) | 570 |
[3.75, 3.875) | 365 |
[3.875, 4.0) | 207 |
[4.0, 4.125) | 147 |
[4.125, 4.25) | 71 |
[4.25, 4.375) | 40 |
[4.375, 4.5) | 29 |
[4.5, 4.625) | 21 |
[4.625, 4.75) | 9 |
[4.75, 4.875) | 4 |
[4.875, 5.0) | 4 |
80 rows × 1 columns
The `stagg_hist` object is also a FlatBuffers object, which gives it a multi-lingual, random-access, small-footprint serialization:
stagg_hist.tobuffer()
bytearray(b'\x04\x00\x00\x00\x90\xff\xff\xff\x10\x00\x00\x00\x00\x01\n\x00\x10\x00\x0c\x00\x0b\x00\x04\x00\n\x00\x00\x00`\x00\x00\x00\x00\x00\x00\x01\x04\x00\x00\x00\x01\x00\x00\x00\x0c\x00\x00\x00\x08\x00\x0c\x00\x0b\x00\x04\x00\x08\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x02\x08\x00(\x00\x1c\x00\x04\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14\xc0\x00\x00\x00\x00\x00\x00\x14@\x01\x00\x00\x00\x00\x00\x00\x00P\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\n\x00\t\x00\x04\x00\x08\x00\x00\x00\x0c\x00\x00\x00\x00\x02\x06\x00\x08\x00\x04\x00\x06\x00\x00\x00\x04\x00\x00\x00\x80\x02\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\x11\x00\x00\x00\x00\x00\x00\x00"\x00\x00\x00\x00\x00\x00\x005\x00\x00\x00\x00\x00\x00\x00X\x00\x00\x00\x00\x00\x00\x00\x91\x00\x00\x00\x00\x00\x00\x00\xdb\x00\x00\x00\x00\x00\x00\x00j\x01\x00\x00\x00\x00\x00\x00G\x02\x00\x00\x00\x00\x00\x00z\x03\x00\x00\x00\x00\x00\x00\x86\x05\x00\x00\x00\x00\x00\x00"\x08\x00\x00\x00\x00\x00\x00\x01\x0c\x00\x00\x00\x00\x00\x00\xd7\x11\x00\x00\x00\x00\x00\x00\xfa\x19\x00\x00\x00\x00\x00\x00\x15%\x00\x00\x00\x00\x00\x00\xb94\x00\x00\x00\x00\x00\x00\x08I\x00\x00\x00\x00\x00\x00jd\x00\x00\x00\x00\x00\x00\x7f\x85\x00\x00\x00\x00\x00\x00G\xb2\x00\x00\x00\x00\x00\x00\xca\xe7\x00\x00\x00\x00\x00\x00u,\x01\x00\x00\x00\x00\x00\xa2x\x01\x00\x00\x00\x00\x00\xba\xd5\x01\x00\x00\x00\x00\x00IA\x02\x00\x00\x00\x00\x00\xab\xb5\x02\x00\x00\x00\x00\x00\xf56\x03\x00\x00\x00\x00\x00!\xc2\x03\x00\x00\x00\x00\x00dR\x04\x00\x00\x00\x00\x00i\xe6\x04\x00\x00\x00\x00\x00|t\x05\x00\x00\x00\x00\x00\xc6\xfd\x05\x00\x00\x00\x00\x00\x12|\x06\x00\x00\x00\x00\x00{\xe8\x06\x00\x00\x00\x00\x00/C\x07\x00\x00\x00\x00\x00\xce{\x07\x00\x00\x00\x00\x00P\x96\x07\x00\x00\x00\x00\x002\x97\x07\x00\x00\x00\x00\x00R{\x07\x00\x00\x00\x00\x00\xc2?\x07\x00\x00\x00\x00\x00\xce\xea\x06\x00\x00\x00\x00\x00\xb5\x7f\x06\x00\x00\x00\x00\x00D\xff\x05\x00\x00\x00\x00\x00Qz\x05\x00\x00\x00\x00\x00\x16\xe8\x04
\x00\x00\x00\x00\x00\xcbU\x04\x00\x00\x00\x00\x00-\xc2\x03\x00\x00\x00\x00\x00u5\x03\x00\x00\x00\x00\x00\xd6\xb4\x02\x00\x00\x00\x00\x00\xfd?\x02\x00\x00\x00\x00\x00I\xd3\x01\x00\x00\x00\x00\x00-{\x01\x00\x00\x00\x00\x00p\'\x01\x00\x00\x00\x00\x00T\xe8\x00\x00\x00\x00\x00\x00\x91\xb2\x00\x00\x00\x00\x00\x00\x1c\x87\x00\x00\x00\x00\x00\x00\xf5c\x00\x00\x00\x00\x00\x00\xf6I\x00\x00\x00\x00\x00\x00(5\x00\x00\x00\x00\x00\x00\xba%\x00\x00\x00\x00\x00\x00\x00\x1a\x00\x00\x00\x00\x00\x00T\x12\x00\x00\x00\x00\x00\x00i\x0c\x00\x00\x00\x00\x00\x00Y\x08\x00\x00\x00\x00\x00\x00l\x05\x00\x00\x00\x00\x00\x00b\x03\x00\x00\x00\x00\x00\x00:\x02\x00\x00\x00\x00\x00\x00m\x01\x00\x00\x00\x00\x00\x00\xcf\x00\x00\x00\x00\x00\x00\x00\x93\x00\x00\x00\x00\x00\x00\x00G\x00\x00\x00\x00\x00\x00\x00(\x00\x00\x00\x00\x00\x00\x00\x1d\x00\x00\x00\x00\x00\x00\x00\x15\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00')
print("Numpy size: ", numpy_hist[0].nbytes + numpy_hist[1].nbytes)
tmessage = ROOT.TMessage()
tmessage.WriteObject(root_hist)
print("ROOT size: ", tmessage.Length())
import pickle
print("Pandas size:", len(pickle.dumps(pandas_hist)))
print("Stagg size: ", len(stagg_hist.tobuffer()))
Numpy size:  1288
ROOT size:   1962
Pandas size: 2975
Stagg size:  792
Stagg is generally foreseen as a memory format, like Apache Arrow, but for statistical aggregations. Like Arrow, it reduces the need to implement $N(N - 1)/2$ conversion functions among $N$ statistical libraries to just $N$ conversion functions. (See the figure on Arrow's website.)
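The arithmetic behind that claim is easy to check (plain Python, not part of Stagg):

```python
# Without a common format, every unordered pair of libraries needs its own
# converter; with one, each library needs only a single connector.
def pairwise_converters(n):
    return n * (n - 1) // 2   # one converter per pair of libraries

def hub_converters(n):
    return n                  # one connector per library, via the common format

for n in (3, 5, 10):
    print(n, pairwise_converters(n), hub_converters(n))
```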
Stagg also intends to be as close to zero-copy as possible. This means that it must make graceful translations among conventions. Different histogramming libraries handle overflow bins in different ways:
fromroot = stagg.connect.root.tostagg(root_hist)
fromroot.axis[0].binning.dump()
print("Bin contents length:", len(fromroot.counts.array))
RegularBinning( num=80, interval=RealInterval(low=-5.0, high=5.0), overflow=RealOverflow(loc_underflow=BinLocation.below1, loc_overflow=BinLocation.above1))
Bin contents length: 82
stagg_hist.axis[0].binning.dump()
print("Bin contents length:", len(stagg_hist.counts.array))
RegularBinning(num=80, interval=RealInterval(low=-5.0, high=5.0))
Bin contents length: 80
And yet we want to be able to manipulate them as though these differences did not exist.
sum_hist = fromroot + stagg_hist
sum_hist.axis[0].binning.dump()
print("Bin contents length:", len(sum_hist.counts.array))
RegularBinning( num=80, interval=RealInterval(low=-5.0, high=5.0), overflow=RealOverflow(loc_underflow=BinLocation.above1, loc_overflow=BinLocation.above2))
Bin contents length: 82
The binning structure keeps track of the existence of underflow/overflow bins and where they are located:

- Some libraries (ROOT, for instance) put underflow before the normal bins (`below1`) and overflow after (`above1`), so that the normal bins are effectively 1-indexed.
- Others put overflow immediately after the normal bins (`above1`) and underflow after that (`above2`), so that underflow is accessed via `myhist[-1]` in Numpy.
- Still others have no overflow bins at all, or express them as `Intervals` that extend to infinity.

Stagg accepts all of these, so that it doesn't have to manipulate the bin-contents buffer it receives, but it knows how to deal with the differences if it has to combine histograms that follow different conventions.
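As a concrete illustration of these conventions (a pure-Numpy sketch, not the Stagg API): the same counts stored under two different overflow layouts differ only by an index permutation, so converting between them never touches the counts themselves.

```python
import numpy as np

normal = np.array([10, 20, 30])   # three normal bins
under, over = 1, 2                # underflow and overflow counts

# convention 1: underflow before the normal bins (below1), overflow after (above1)
layout_a = np.concatenate([[under], normal, [over]])

# convention 2: overflow right after the normal bins (above1), underflow
# after that (above2) -- so the underflow is reached as counts[-1]
layout_b = np.concatenate([normal, [over], [under]])

# converting between conventions is just an index permutation
assert (layout_b[[4, 0, 1, 2, 3]] == layout_a).all()
```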
All the different axis types have an equivalent in Stagg (and not all are single-dimensional).
import stagg
stagg.IntegerBinning(5, 10).dump()
stagg.RegularBinning(100, stagg.RealInterval(-5, 5)).dump()
stagg.HexagonalBinning(0, 100, 0, 100, stagg.HexagonalBinning.cube_xy).dump()
stagg.EdgesBinning([0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]).dump()
stagg.IrregularBinning([stagg.RealInterval(0, 5),
stagg.RealInterval(10, 100),
stagg.RealInterval(-10, 10)],
overlapping_fill=stagg.IrregularBinning.all).dump()
stagg.CategoryBinning(["one", "two", "three"]).dump()
stagg.SparseRegularBinning([5, 3, -2, 8, -100], 10).dump()
stagg.FractionBinning(error_method=stagg.FractionBinning.clopper_pearson).dump()
stagg.PredicateBinning(["signal region", "control region"]).dump()
stagg.VariationBinning([stagg.Variation([stagg.Assignment("x", "nominal")]),
stagg.Variation([stagg.Assignment("x", "nominal + sigma")]),
stagg.Variation([stagg.Assignment("x", "nominal - sigma")])]).dump()
IntegerBinning(min=5, max=10)
RegularBinning(num=100, interval=RealInterval(low=-5.0, high=5.0))
HexagonalBinning(qmin=0, qmax=100, rmin=0, rmax=100, coordinates=HexagonalBinning.cube_xy)
EdgesBinning(edges=[0.01 0.05 0.1 0.5 1 5 10 50 100])
IrregularBinning( intervals=[ RealInterval(low=0.0, high=5.0), RealInterval(low=10.0, high=100.0), RealInterval(low=-10.0, high=10.0) ], overlapping_fill=IrregularBinning.all)
CategoryBinning(categories=['one', 'two', 'three'])
SparseRegularBinning(bins=[5 3 -2 8 -100], bin_width=10.0)
FractionBinning(error_method=FractionBinning.clopper_pearson)
PredicateBinning(predicates=['signal region', 'control region'])
VariationBinning( variations=[ Variation(assignments=[ Assignment(identifier='x', expression='nominal') ]), Variation( assignments=[ Assignment(identifier='x', expression='nominal + sigma') ]), Variation( assignments=[ Assignment(identifier='x', expression='nominal - sigma') ]) ])
The meanings of these binning classes are given in the specification, but many of them can be converted into one another, and converting to `CategoryBinning` (strings) often makes the intent clear.
stagg.IntegerBinning(5, 10).toCategoryBinning().dump()
stagg.RegularBinning(10, stagg.RealInterval(-5, 5)).toCategoryBinning().dump()
stagg.EdgesBinning([0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]).toCategoryBinning().dump()
stagg.IrregularBinning([stagg.RealInterval(0, 5),
stagg.RealInterval(10, 100),
stagg.RealInterval(-10, 10)],
overlapping_fill=stagg.IrregularBinning.all).toCategoryBinning().dump()
stagg.SparseRegularBinning([5, 3, -2, 8, -100], 10).toCategoryBinning().dump()
stagg.FractionBinning(error_method=stagg.FractionBinning.clopper_pearson).toCategoryBinning().dump()
stagg.PredicateBinning(["signal region", "control region"]).toCategoryBinning().dump()
stagg.VariationBinning([stagg.Variation([stagg.Assignment("x", "nominal")]),
stagg.Variation([stagg.Assignment("x", "nominal + sigma")]),
stagg.Variation([stagg.Assignment("x", "nominal - sigma")])]).toCategoryBinning().dump()
CategoryBinning(categories=['5', '6', '7', '8', '9', '10'])
CategoryBinning( categories=['[-5, -4)', '[-4, -3)', '[-3, -2)', '[-2, -1)', '[-1, 0)', '[0, 1)', '[1, 2)', '[2, 3)', '[3, 4)', '[4, 5)'])
CategoryBinning( categories=['[0.01, 0.05)', '[0.05, 0.1)', '[0.1, 0.5)', '[0.5, 1)', '[1, 5)', '[5, 10)', '[10, 50)', '[50, 100)'])
CategoryBinning(categories=['[0, 5)', '[10, 100)', '[-10, 10)'])
CategoryBinning(categories=['[50, 60)', '[30, 40)', '[-20, -10)', '[80, 90)', '[-1000, -990)'])
CategoryBinning(categories=['pass', 'all'])
CategoryBinning(categories=['signal region', 'control region'])
CategoryBinning(categories=['x := nominal', 'x := nominal + sigma', 'x := nominal - sigma'])
This technique can also clear up confusion about overflow bins.
stagg.RegularBinning(5, stagg.RealInterval(-5, 5), stagg.RealOverflow(
loc_underflow=stagg.BinLocation.above2,
loc_overflow=stagg.BinLocation.above1,
loc_nanflow=stagg.BinLocation.below1
)).toCategoryBinning().dump()
CategoryBinning( categories=['{nan}', '[-5, -3)', '[-3, -1)', '[-1, 1)', '[1, 3)', '[3, 5)', '[5, +inf]', '[-inf, -5)'])
You might also be wondering about `FractionBinning`, `PredicateBinning`, and `VariationBinning`.

`FractionBinning` is an axis of two bins: #passing and #total, #failing and #total, or #passing and #failing. Adding it to another axis effectively makes an "efficiency plot."
h = stagg.Histogram([stagg.Axis(stagg.FractionBinning()),
stagg.Axis(stagg.RegularBinning(10, stagg.RealInterval(-5, 5)))],
stagg.UnweightedCounts(
stagg.InterpretedInlineBuffer.fromarray(
numpy.array([[ 9, 25, 29, 35, 54, 67, 60, 84, 80, 94],
[ 99, 119, 109, 109, 95, 104, 102, 106, 112, 122]]))))
df = stagg.connect.pandas.topandas(h)
df
| | | unweighted |
|---|---|---|
pass | [-5.0, -4.0) | 9 |
[-4.0, -3.0) | 25 | |
[-3.0, -2.0) | 29 | |
[-2.0, -1.0) | 35 | |
[-1.0, 0.0) | 54 | |
[0.0, 1.0) | 67 | |
[1.0, 2.0) | 60 | |
[2.0, 3.0) | 84 | |
[3.0, 4.0) | 80 | |
[4.0, 5.0) | 94 | |
all | [-5.0, -4.0) | 99 |
[-4.0, -3.0) | 119 | |
[-3.0, -2.0) | 109 | |
[-2.0, -1.0) | 109 | |
[-1.0, 0.0) | 95 | |
[0.0, 1.0) | 104 | |
[1.0, 2.0) | 102 | |
[2.0, 3.0) | 106 | |
[3.0, 4.0) | 112 | |
[4.0, 5.0) | 122 |
df = df.unstack(level=0)
df
| | unweighted (all) | unweighted (pass) |
|---|---|---|
[-5.0, -4.0) | 99 | 9 |
[-4.0, -3.0) | 119 | 25 |
[-3.0, -2.0) | 109 | 29 |
[-2.0, -1.0) | 109 | 35 |
[-1.0, 0.0) | 95 | 54 |
[0.0, 1.0) | 104 | 67 |
[1.0, 2.0) | 102 | 60 |
[2.0, 3.0) | 106 | 84 |
[3.0, 4.0) | 112 | 80 |
[4.0, 5.0) | 122 | 94 |
df["unweighted", "pass"] / df["unweighted", "all"]
[-5.0, -4.0)    0.090909
[-4.0, -3.0)    0.210084
[-3.0, -2.0)    0.266055
[-2.0, -1.0)    0.321101
[-1.0, 0.0)     0.568421
[0.0, 1.0)      0.644231
[1.0, 2.0)      0.588235
[2.0, 3.0)      0.792453
[3.0, 4.0)      0.714286
[4.0, 5.0)      0.770492
dtype: float64
`PredicateBinning` means that each bin represents a predicate (if-then rule) in the filling procedure. Stagg doesn't have a filling procedure, but filling libraries can use this to encode relationships among histograms that a fitting library can take advantage of, for combined signal/control-region fits, for instance. It's possible for those regions to overlap: an input datum might satisfy more than one predicate, and `overlapping_fill` determines which bin(s) were chosen: `first`, `last`, or `all`.
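The overlapping-fill semantics can be sketched in plain Python. This is only an illustration of what `first`, `last`, and `all` mean; Stagg itself does not fill, and this is not its code.

```python
# Which bin(s) a datum lands in when predicates may overlap.
def fill(datum, predicates, mode):
    matches = [i for i, predicate in enumerate(predicates) if predicate(datum)]
    if mode == "first":
        return matches[:1]
    if mode == "last":
        return matches[-1:]
    return matches  # "all": the datum is counted in every matching bin

predicates = [lambda x: x > 0,        # e.g. a "signal region"
              lambda x: abs(x) < 2]   # e.g. an overlapping "control region"

print(fill(1.0, predicates, "first"))  # [0]
print(fill(1.0, predicates, "last"))   # [1]
print(fill(1.0, predicates, "all"))    # [0, 1]
```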
`VariationBinning` means that each bin represents a variation of one of the parameters used to calculate the fill variables. This is used to determine sensitivity to systematic effects, by varying them and re-filling. In this kind of binning, the same input datum enters every bin.
xdata = numpy.random.normal(0, 1, int(1e6))
sigma = numpy.random.uniform(-0.1, 0.8, int(1e6))
h = stagg.Histogram([stagg.Axis(stagg.VariationBinning([
stagg.Variation([stagg.Assignment("x", "nominal")]),
stagg.Variation([stagg.Assignment("x", "nominal + sigma")])])),
stagg.Axis(stagg.RegularBinning(10, stagg.RealInterval(-5, 5)))],
stagg.UnweightedCounts(
stagg.InterpretedInlineBuffer.fromarray(
numpy.concatenate([
numpy.histogram(xdata, bins=10, range=(-5, 5))[0],
numpy.histogram(xdata + sigma, bins=10, range=(-5, 5))[0]]))))
df = stagg.connect.pandas.topandas(h)
df
| | | unweighted |
|---|---|---|
x := nominal | [-5.0, -4.0) | 35 |
[-4.0, -3.0) | 1348 | |
[-3.0, -2.0) | 21465 | |
[-2.0, -1.0) | 135923 | |
[-1.0, 0.0) | 341627 | |
[0.0, 1.0) | 340649 | |
[1.0, 2.0) | 135983 | |
[2.0, 3.0) | 21584 | |
[3.0, 4.0) | 1355 | |
[4.0, 5.0) | 30 | |
x := nominal + sigma | [-5.0, -4.0) | 14 |
[-4.0, -3.0) | 597 | |
[-3.0, -2.0) | 10968 | |
[-2.0, -1.0) | 84154 | |
[-1.0, 0.0) | 272295 | |
[0.0, 1.0) | 367137 | |
[1.0, 2.0) | 209741 | |
[2.0, 3.0) | 50026 | |
[3.0, 4.0) | 4854 | |
[4.0, 5.0) | 213 |
df.unstack(level=0)
| | unweighted (x := nominal) | unweighted (x := nominal + sigma) |
|---|---|---|
[-5.0, -4.0) | 35 | 14 |
[-4.0, -3.0) | 1348 | 597 |
[-3.0, -2.0) | 21465 | 10968 |
[-2.0, -1.0) | 135923 | 84154 |
[-1.0, 0.0) | 341627 | 272295 |
[0.0, 1.0) | 340649 | 367137 |
[1.0, 2.0) | 135983 | 209741 |
[2.0, 3.0) | 21584 | 50026 |
[3.0, 4.0) | 1355 | 4854 |
[4.0, 5.0) | 30 | 213 |
You can gather many objects (histograms, functions, ntuples) into a `Collection`, partly for the convenience of encapsulating all of them in one object.
stagg.Collection({"one": fromroot, "two": stagg_hist}).dump()
Collection( objects={ 'one': Histogram( axis=[ Axis( binning= RegularBinning( num=80, interval=RealInterval(low=-5.0, high=5.0), overflow=RealOverflow(loc_underflow=BinLocation.below1, loc_overflow=BinLocation.above1)), statistics=[ Statistics( moments=[ Moments(sumwxn=InterpretedInlineInt64Buffer(buffer=[1e+07]), n=0), Moments(sumwxn=InterpretedInlineFloat64Buffer(buffer=[1e+07]), n=0, weightpower=1), Moments(sumwxn=InterpretedInlineFloat64Buffer(buffer=[1e+07]), n=0, weightpower=2), Moments(sumwxn=InterpretedInlineFloat64Buffer(buffer=[2641.38]), n=1, weightpower=1), Moments( sumwxn=InterpretedInlineFloat64Buffer(buffer=[1.00103e+07]), n=2, weightpower=1) ]) ]) ], counts= UnweightedCounts( counts= InterpretedInlineFloat64Buffer( buffer= [0.00000e+00 3.00000e+00 6.00000e+00 9.00000e+00 1.70000e+01 3.40000e+01 5.30000e+01 8.80000e+01 1.45000e+02 2.19000e+02 3.62000e+02 5.83000e+02 8.90000e+02 1.41400e+03 2.08200e+03 3.07300e+03 4.56700e+03 6.65000e+03 9.49300e+03 1.34970e+04 1.86960e+04 2.57060e+04 3.41750e+04 4.56390e+04 5.93380e+04 7.69170e+04 9.64180e+04 1.20250e+05 1.47785e+05 1.77579e+05 2.10677e+05 2.46305e+05 2.83236e+05 3.21129e+05 3.57500e+05 3.92646e+05 4.24978e+05 4.52731e+05 4.75951e+05 4.90446e+05 4.97232e+05 4.97458e+05 4.90322e+05 4.75074e+05 4.53326e+05 4.25909e+05 3.93028e+05 3.58993e+05 3.21558e+05 2.84107e+05 2.46317e+05 2.10293e+05 1.77366e+05 1.47453e+05 1.19625e+05 9.70690e+04 7.56320e+04 5.94760e+04 4.57130e+04 3.45880e+04 2.55890e+04 1.89340e+04 1.36080e+04 9.65800e+03 6.65600e+03 4.69200e+03 3.17700e+03 2.13700e+03 1.38800e+03 8.66000e+02 5.70000e+02 3.65000e+02 2.07000e+02 1.47000e+02 7.10000e+01 4.00000e+01 2.90000e+01 2.10000e+01 9.00000e+00 4.00000e+00 4.00000e+00 0.00000e+00]))), 'two': Histogram( axis=[ Axis(binning=RegularBinning(num=80, interval=RealInterval(low=-5.0, high=5.0))) ], counts= UnweightedCounts( counts= InterpretedInlineInt64Buffer( buffer= [ 3 6 9 17 34 53 88 145 219 362 583 890 1414 2082 3073 4567 6650 9493 13497 
18696 25706 34175 45639 59338 76917 96418 120250 147785 177579 210677 246305 283236 321129 357500 392646 424978 452731 475951 490446 497232 497458 490322 475074 453326 425909 393028 358993 321558 284107 246317 210293 177366 147453 119625 97069 75632 59476 45713 34588 25589 18934 13608 9658 6656 4692 3177 2137 1388 866 570 365 207 147 71 40 29 21 9 4 4]))) })
Not only for convenience: you can also define an `Axis` in the `Collection` to subdivide all of its contents by that `Axis`. For instance, you can make a collection of qualitatively different histograms all have signal and control regions with `PredicateBinning`, or all have systematic variations with `VariationBinning`. It is not necessary to rely on naming conventions to communicate this information from filler to fitter.
I said in the introduction that Stagg does not fill histograms and does not plot histograms—the two things data analysts are expecting to do. These would be done by user-facing libraries.
Stagg does, however, transform histograms into other histograms, and not just among formats. You can combine histograms with `+`. In addition to adding histogram counts, it combines auxiliary statistics appropriately (when possible).
h1 = stagg.Histogram([
stagg.Axis(stagg.RegularBinning(10, stagg.RealInterval(-5, 5)),
statistics=[stagg.Statistics(
moments=[
stagg.Moments(stagg.InterpretedInlineBuffer.fromarray(numpy.array([10])), n=1),
stagg.Moments(stagg.InterpretedInlineBuffer.fromarray(numpy.array([20])), n=2)],
quantiles=[
stagg.Quantiles(stagg.InterpretedInlineBuffer.fromarray(numpy.array([30])), p=0.5)],
mode=stagg.Modes(stagg.InterpretedInlineBuffer.fromarray(numpy.array([40]))),
min=stagg.Extremes(stagg.InterpretedInlineBuffer.fromarray(numpy.array([50]))),
max=stagg.Extremes(stagg.InterpretedInlineBuffer.fromarray(numpy.array([60]))))])],
stagg.UnweightedCounts(stagg.InterpretedInlineBuffer.fromarray(numpy.arange(10))))
h2 = stagg.Histogram([
stagg.Axis(stagg.RegularBinning(10, stagg.RealInterval(-5, 5)),
statistics=[stagg.Statistics(
moments=[
stagg.Moments(stagg.InterpretedInlineBuffer.fromarray(numpy.array([100])), n=1),
stagg.Moments(stagg.InterpretedInlineBuffer.fromarray(numpy.array([200])), n=2)],
quantiles=[
stagg.Quantiles(stagg.InterpretedInlineBuffer.fromarray(numpy.array([300])), p=0.5)],
mode=stagg.Modes(stagg.InterpretedInlineBuffer.fromarray(numpy.array([400]))),
min=stagg.Extremes(stagg.InterpretedInlineBuffer.fromarray(numpy.array([500]))),
max=stagg.Extremes(stagg.InterpretedInlineBuffer.fromarray(numpy.array([600]))))])],
stagg.UnweightedCounts(stagg.InterpretedInlineBuffer.fromarray(numpy.arange(100, 200, 10))))
(h1 + h2).dump()
Histogram( axis=[ Axis( binning=RegularBinning(num=10, interval=RealInterval(low=-5.0, high=5.0)), statistics=[ Statistics( moments=[ Moments(sumwxn=InterpretedInlineInt64Buffer(buffer=[110]), n=1), Moments(sumwxn=InterpretedInlineInt64Buffer(buffer=[220]), n=2) ], min=Extremes(values=InterpretedInlineInt64Buffer(buffer=[50])), max=Extremes(values=InterpretedInlineInt64Buffer(buffer=[600]))) ]) ], counts= UnweightedCounts( counts=InterpretedInlineInt64Buffer(buffer=[100 111 122 133 144 155 166 177 188 199])))
The corresponding moments of `h1` and `h2` were matched and added, the quantiles and modes were dropped (there is no way to combine them), and the correct minimum and maximum were picked; the histogram contents were added as well.
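The reason moments can be combined while quantiles cannot is that a moment is stored as a sum over fills, and sums of independent fills simply add. A minimal sketch of the combination rules, using the same numbers as `h1` and `h2` above (plain Python, not the Stagg implementation):

```python
# Combination rules for auxiliary statistics when adding two histograms.
h1_stats = {"sumwx": 10, "sumwx2": 20, "min": 50, "max": 60, "median": 30}
h2_stats = {"sumwx": 100, "sumwx2": 200, "min": 500, "max": 600, "median": 300}

combined = {
    "sumwx":  h1_stats["sumwx"]  + h2_stats["sumwx"],    # moments are sums: they add
    "sumwx2": h1_stats["sumwx2"] + h2_stats["sumwx2"],
    "min": min(h1_stats["min"], h2_stats["min"]),        # extremes: take min/max
    "max": max(h1_stats["max"], h2_stats["max"]),
    # "median" is dropped: the combined median cannot be computed
    # from the two input medians alone
}
print(combined)
```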
Another important histogram → histogram conversion is axis reduction, which can take three forms: slicing out a range of bins, rebinning, and projecting. All of these operations use a Pandas-inspired `loc`/`iloc` syntax.
h = stagg.Histogram(
[stagg.Axis(stagg.RegularBinning(10, stagg.RealInterval(-5, 5)))],
stagg.UnweightedCounts(
stagg.InterpretedInlineBuffer.fromarray(numpy.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90]))))
`loc` slices in the data's coordinate system: `1.5` falls into bin index `6`, so the slice keeps bins `6` through `9`, and the bins below the cut are combined into a single underflow bin: `150 = 0 + 10 + 20 + 30 + 40 + 50`.
h.loc[1.5:].dump()
Histogram( axis=[ Axis( binning= RegularBinning( num=4, interval=RealInterval(low=1.0, high=5.0), overflow= RealOverflow( loc_underflow=BinLocation.above1, minf_mapping=RealOverflow.missing, pinf_mapping=RealOverflow.missing, nan_mapping=RealOverflow.missing))) ], counts=UnweightedCounts(counts=InterpretedInlineInt64Buffer(buffer=[60 70 80 90 150])))
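The coordinate-to-index mapping for a regular binning is a simple linear transformation. A sketch using the same axis as `h` (plain Python, not the Stagg implementation):

```python
# RegularBinning(10, RealInterval(-5, 5)): each bin is 1.0 wide
low, high, num = -5.0, 5.0, 10

def bin_index(x):
    # which bin a coordinate falls into; 1.5 lands in [1, 2), i.e. bin 6
    return int((x - low) / (high - low) * num)

print(bin_index(1.5))   # 6
print(bin_index(-5.0))  # 0
```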
`iloc` slices by bin index number.
h.iloc[6:].dump()
Histogram( axis=[ Axis( binning= RegularBinning( num=4, interval=RealInterval(low=1.0, high=5.0), overflow= RealOverflow( loc_underflow=BinLocation.above1, minf_mapping=RealOverflow.missing, pinf_mapping=RealOverflow.missing, nan_mapping=RealOverflow.missing))) ], counts=UnweightedCounts(counts=InterpretedInlineInt64Buffer(buffer=[60 70 80 90 150])))
Slices have a `start`, `stop`, and `step` (`start:stop:step`). The `step` parameter rebins:
h.iloc[::2].dump()
Histogram( axis=[ Axis(binning=RegularBinning(num=5, interval=RealInterval(low=-5.0, high=5.0))) ], counts=UnweightedCounts(counts=InterpretedInlineInt64Buffer(buffer=[10 50 90 130 170])))
Thus, you can slice and rebin as part of the same operation.
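What `iloc[::2]` does to the counts can be reproduced with plain Numpy (a sketch, not the Stagg implementation): each pair of adjacent bins is summed.

```python
import numpy as np

counts = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90])  # h's bin contents
rebinned = counts.reshape(-1, 2).sum(axis=1)  # merge each adjacent pair of bins
print(rebinned)  # [ 10  50  90 130 170]
```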
Projecting uses the same mechanism, except that `None` passed as an axis's slice projects it out.
h2 = stagg.Histogram(
[stagg.Axis(stagg.RegularBinning(10, stagg.RealInterval(-5, 5))),
stagg.Axis(stagg.RegularBinning(10, stagg.RealInterval(-5, 5)))],
stagg.UnweightedCounts(
stagg.InterpretedInlineBuffer.fromarray(numpy.arange(100))))
h2.iloc[:, None].dump()
Histogram( axis=[ Axis(binning=RegularBinning(num=10, interval=RealInterval(low=-5.0, high=5.0))) ], counts= UnweightedCounts( counts=InterpretedInlineInt64Buffer(buffer=[45 145 245 345 445 545 645 745 845 945])))
Thus, all three axis reduction operations can be performed in a single syntax.
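The projection in the previous example can likewise be checked with plain Numpy (a sketch, not the Stagg implementation): projecting an axis sums the counts over it.

```python
import numpy as np

counts2d = np.arange(100).reshape(10, 10)  # h2's bin contents
projected = counts2d.sum(axis=1)           # project out the second axis
print(projected)  # [ 45 145 245 345 445 545 645 745 845 945]
```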
In general, an n-dimensional Stagg histogram can be sliced like an n-dimensional Numpy array. This includes integer-array and boolean-array indexing (though that necessarily changes the binning to `IrregularBinning`).
h.iloc[[4, 3, 6, 7, 1]].dump()
Histogram( axis=[ Axis( binning= IrregularBinning( intervals=[ RealInterval(low=-1.0, high=0.0), RealInterval(low=-2.0, high=-1.0), RealInterval(low=1.0, high=2.0), RealInterval(low=2.0, high=3.0), RealInterval(low=-4.0, high=-3.0) ])) ], counts=UnweightedCounts(counts=InterpretedInlineInt64Buffer(buffer=[40 30 60 70 10])))
h.iloc[[True, False, True, False, True, False, True, False, True, False]].dump()
Histogram( axis=[ Axis( binning= IrregularBinning( intervals=[ RealInterval(low=-5.0, high=-4.0), RealInterval(low=-3.0, high=-2.0), RealInterval(low=-1.0, high=0.0), RealInterval(low=1.0, high=2.0), RealInterval(low=3.0, high=4.0) ])) ], counts=UnweightedCounts(counts=InterpretedInlineInt64Buffer(buffer=[0 20 40 60 80])))
- `loc` for numerical binnings accepts (…, `None` for projection, …)
- `loc` for categorical binnings accepts (…, `None` for projection, …)
- `iloc` accepts (…, `None` for projection, …)

Frequently, one wants to extract bin counts from a histogram. The `loc`/`iloc` syntax above creates histograms from histograms, not bin counts. A histogram's `counts` property has a slice syntax.
allcounts = numpy.arange(12) * numpy.arange(12)[:, None] # multiplication table
allcounts[10, :] = -999 # underflows
allcounts[11, :] = 999 # overflows
allcounts[:, 0] = -999 # underflows
allcounts[:, 1] = 999 # overflows
print(allcounts)
[[-999  999    0    0    0    0    0    0    0    0    0    0]
 [-999  999    2    3    4    5    6    7    8    9   10   11]
 [-999  999    4    6    8   10   12   14   16   18   20   22]
 [-999  999    6    9   12   15   18   21   24   27   30   33]
 [-999  999    8   12   16   20   24   28   32   36   40   44]
 [-999  999   10   15   20   25   30   35   40   45   50   55]
 [-999  999   12   18   24   30   36   42   48   54   60   66]
 [-999  999   14   21   28   35   42   49   56   63   70   77]
 [-999  999   16   24   32   40   48   56   64   72   80   88]
 [-999  999   18   27   36   45   54   63   72   81   90   99]
 [-999  999 -999 -999 -999 -999 -999 -999 -999 -999 -999 -999]
 [-999  999  999  999  999  999  999  999  999  999  999  999]]
h2 = stagg.Histogram(
[stagg.Axis(stagg.RegularBinning(10, stagg.RealInterval(-5, 5),
stagg.RealOverflow(loc_underflow=stagg.RealOverflow.above1,
loc_overflow=stagg.RealOverflow.above2))),
stagg.Axis(stagg.RegularBinning(10, stagg.RealInterval(-5, 5),
stagg.RealOverflow(loc_underflow=stagg.RealOverflow.below2,
loc_overflow=stagg.RealOverflow.below1)))],
stagg.UnweightedCounts(
stagg.InterpretedInlineBuffer.fromarray(allcounts)))
print(h2.counts[:, :])
[[ 0  0  0  0  0  0  0  0  0  0]
 [ 2  3  4  5  6  7  8  9 10 11]
 [ 4  6  8 10 12 14 16 18 20 22]
 [ 6  9 12 15 18 21 24 27 30 33]
 [ 8 12 16 20 24 28 32 36 40 44]
 [10 15 20 25 30 35 40 45 50 55]
 [12 18 24 30 36 42 48 54 60 66]
 [14 21 28 35 42 49 56 63 70 77]
 [16 24 32 40 48 56 64 72 80 88]
 [18 27 36 45 54 63 72 81 90 99]]
To get the underflows and overflows, set the slice extremes to `-inf` and `+inf`.
print(h2.counts[-numpy.inf:numpy.inf, :])
[[-999 -999 -999 -999 -999 -999 -999 -999 -999 -999]
 [   0    0    0    0    0    0    0    0    0    0]
 [   2    3    4    5    6    7    8    9   10   11]
 [   4    6    8   10   12   14   16   18   20   22]
 [   6    9   12   15   18   21   24   27   30   33]
 [   8   12   16   20   24   28   32   36   40   44]
 [  10   15   20   25   30   35   40   45   50   55]
 [  12   18   24   30   36   42   48   54   60   66]
 [  14   21   28   35   42   49   56   63   70   77]
 [  16   24   32   40   48   56   64   72   80   88]
 [  18   27   36   45   54   63   72   81   90   99]
 [ 999  999  999  999  999  999  999  999  999  999]]
print(h2.counts[:, -numpy.inf:numpy.inf])
[[-999    0    0    0    0    0    0    0    0    0    0  999]
 [-999    2    3    4    5    6    7    8    9   10   11  999]
 [-999    4    6    8   10   12   14   16   18   20   22  999]
 [-999    6    9   12   15   18   21   24   27   30   33  999]
 [-999    8   12   16   20   24   28   32   36   40   44  999]
 [-999   10   15   20   25   30   35   40   45   50   55  999]
 [-999   12   18   24   30   36   42   48   54   60   66  999]
 [-999   14   21   28   35   42   49   56   63   70   77  999]
 [-999   16   24   32   40   48   56   64   72   80   88  999]
 [-999   18   27   36   45   54   63   72   81   90   99  999]]
Also note that the underflows are now all below the normal bins and overflows are now all above the normal bins, regardless of how they were arranged in the Stagg object. This allows analysis code to be independent of histogram source.
Stagg can attach fit functions to histograms, can store standalone functions, such as lookup tables, and can store ntuples for unweighted fits or machine learning.