Set some useful LaTeX commands for later in the document: $ \newcommand{\Int}[2] {\displaystyle \int\limits_{#1}^{#2}} $ $ \newcommand{\Prod}[2] {\displaystyle \prod\limits_{#1}^{#2}} $ $ \newcommand{\Sum}[2] {\displaystyle \sum\limits_{#1}^{#2}} $

\Int{lower}{upper} : $ \Int{lower}{upper},\qquad $ \Prod{lower}{upper} : $ \Prod{lower}{upper},\qquad $ \Sum{lower}{upper} : $ \Sum{lower}{upper} $

$ \newcommand{\subNsubR}[3] {{}_{#1} #2_{#3}} $ $ \newcommand{\rust}[1] {{\require{color}\color{Bittersweet}#1}} $ $ \newcommand{\sky}[1] {{\require{color}\color{SkyBlue}#1}} $

\subNsubR{n}{K}{r} : $ \subNsubR{n}{K}{r},\qquad $ \rust{q} : $ \rust{q},\qquad $ \sky{q} : $ \sky{q} $

In [1]:
'''
debinningWithNewStatistics.ipynb
This file is an ipython notebook designed to be converted to reveal.js
slides via this command:
    ipython nbconvert debinningWithNewStatistics.ipynb --to=slides --post=serve [--config slides_config.py]
This file also contains the main body of debinning work in the code blocks below.

Copyright (C) 2015 Abram Krislock

Talks given from approximately this version of the slides:
    "Debinning: Data Smoothing with a New Probability Calculus"
      - Fysisk institutt, Universitetet i Oslo, February 25, 2015
      - Mitchell Institute for Fundamental Physics and Astronomy, Texas A&M University, March 13, 2015

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see http://www.gnu.org/licenses/gpl-3.0.txt.
'''


from __future__ import division

import sys
sys.path[0] = '..'
import numpy as np
from time import time
import bisect
import operator as op

from utilities import settings
import utilities.sampleFunctions as funcs
import utilities.probCalc as pc
import utilities.plotting as debinPlot
import utilities.minimizers as miniz
reload(funcs)
reload(pc)
reload(debinPlot)
reload(miniz)

%matplotlib inline
from matplotlib import pyplot as plt
from IPython.display import HTML

debinning Logo

Data Smoothing with a New Probability Calculus

Abram Krislock

Fysisk institutt, Universitetet i Oslo.

[[email protected]](mailto:[email protected] "Email Abram")

Once Upon A Time...

We knew all the stuff!

"At the end of the 19th century... it was generally accepted that all the important laws of physics had been discovered..." (Source)

Then this stuff happened:

$${\huge{\rm He}^{++},~{\rm e}^-,~\gamma \qquad\qquad \hbar \qquad\qquad g^{\mu\nu}}$$$${\huge E^2 = (mc^2)^2 + (pc)^2}$$
$${\Large \cdots {\rm SU}(3)_C \times {\rm SU}(2)_L \times {\rm U}(1)_Y}$$

Okay... But after that... Surely, we knew all the stuff!

Well, actually....

Then this stuff happened:

We know almost nothing.

Let's face it.

Large Hadron Collider: A "New Stuff" Experiment

New stuff experiment!

Dark Matter @ Large Hadron Collider

\begin{align} {\Huge \widetilde{q_L}} & {\Huge\rightarrow} & {\Huge \rust{q}} \\ & {\Huge\searrow} \\ & & {\Huge\widetilde{\chi}^0_2} & {\Huge\rightarrow} & {\Huge h^0} & {\Huge\rightarrow} & {\Huge \rust{b}}\\ & & & {\Huge\searrow} & & {\Huge \hookrightarrow} & {\Huge \rust{b}} \\ & & & & {\Huge \color{Gray}{\widetilde{\chi}^0_1}} \end{align}

Calculate as one particle, the mass:

$${\huge m_{\rust{q h^0}} \qquad {\rm using} \qquad E^2 = (mc^2)^2 + (pc)^2}$$
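For concreteness, a minimal numpy sketch of that invariant-mass combination, using hypothetical four-vectors (E, px, py, pz) that are not taken from any actual event:

import numpy as np

# Hypothetical four-momenta (E, px, py, pz) in GeV, with c = 1, purely for illustration.
p_q  = np.array([120.0,  80.0,  40.0,  60.0])   # the quark
p_h0 = np.array([150.0, -30.0,  70.0, 110.0])   # the h0

p_tot = p_q + p_h0
# E^2 = (m c^2)^2 + (p c)^2  =>  m = sqrt(E^2 - |p|^2)
m_qh0 = np.sqrt(p_tot[0]**2 - np.sum(p_tot[1:]**2))
print(m_qh0)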

The $m_{\rust{qh^0}}$ Maximum Measurement

You want me to try.... What?!

Change the... bins? Seriously?!

Change the bins! How bad could it be?

It could be very bad...

Okay, that's bad, but how often...?

In [2]:
funcs.randomGenerator.setSeed('seedOne')
settings['simpleSortedOutput'] = True

nPulls = 400
basePars = (0.00006, 10., 200., 110., 1.5, 50., 140.)
baseGuess = (140., -2., -0.1, 10.)

def fitTwoHistograms(percentile):
    data = funcs.functionToData(funcs.easyEndpoint, nPulls, 0.5, (0,200), basePars)
    percentieth = data[data > np.percentile(data, percentile)]

    # The histograms are filled with the same data, only different binning.
    binContentsOne, binEdgesOne = np.histogram(percentieth, np.arange(3., 204., 10.))
    binContentsTwo, binEdgesTwo = np.histogram(percentieth, np.arange(0., 201., 5.))
    
    # The fit ranges are defined blindly, from the maximum bin to the last non-zero bin.
    fitBinsOne = np.flatnonzero(binContentsOne[binContentsOne.argmax():]) + binContentsOne.argmax()
    fitBinsTwo = np.flatnonzero(binContentsTwo[binContentsTwo.argmax():]) + binContentsTwo.argmax()
    
    ### FITTING METHOD: Binned Maximum Likelihood ###
    fitGuessOne = np.array([value * funcs.randomGenerator.normal(1., 0.1) for value in baseGuess])
    fitGuessTwo = fitGuessOne.copy()

    def chiSqOne(pars):
#        return funcs.chiSqSimple(binContentsOne, binEdgesOne, fitBinsOne, funcs.theKink, pars)
        return -2. * funcs.theKink_BinnedLikelihood(binContentsOne, binEdgesOne, fitBinsOne, pars)
    def chiSqTwo(pars):
#        return funcs.chiSqSimple(binContentsTwo, binEdgesTwo, fitBinsTwo, funcs.theKink, pars)
        return -2. * funcs.theKink_BinnedLikelihood(binContentsTwo, binEdgesTwo, fitBinsTwo, pars)

    fitOne = miniz.nelderMead(chiSqOne, fitGuessOne, xtol=0.01, ftol=0.05, disp=0)
    fitTwo = miniz.nelderMead(chiSqTwo, fitGuessTwo, xtol=0.01, ftol=0.05, disp=0)
    return (data, fitOne, fitTwo)

xKinkOne = []
xKinkTwo = []
xKinkDiff = []
percentile = 60
nTrials = 1000

for i in xrange(nTrials):
    data, fitOne, fitTwo = fitTwoHistograms(percentile)
    xKinkOne.append(fitOne[0])
    xKinkTwo.append(fitTwo[0])
    xKinkDiff.append(baseGuess[0] + fitTwo[0] - fitOne[0])
In [45]:
def plotHistoFitCompare():
    fig, axes = plt.subplots(figsize=(11,7), facecolor='#ffffff')
#    plt.hist(xKinkOne, range=(80,180), color='#66a61e', bins=np.arange(80, 181, 2), alpha=0.6, zorder=2)
#    plt.hist(xKinkTwo, range=(80,180), color='#0088ff', bins=np.arange(80, 181, 2), alpha=0.6, zorder=2)
    plt.hist(xKinkDiff, range=(80,180), color='#0088ff', bins=np.arange(80, 181, 2), alpha=0.8, zorder=1)

    axes.set_xticks(np.arange(80, 181, 10))
    axes.set_xticks(np.arange(80, 181, 2), minor=True)
    axes.set_xticklabels([r'$%i$' % int(num) for num in np.arange(80., 181., 10.)], fontsize=20)

    axes.set_yticks(np.arange(0, 250, 50))
    axes.set_yticks(np.arange(0, 250, 10), minor=True)
    axes.set_yticklabels([r'$%i$' % num for num in np.arange(0, 250, 50)], fontsize=20)

    axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
    axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
    axes.minorticks_on()    
    axes.set_xlabel('True Kink (140) + Kink Fit Difference', labelpad=5, fontsize=22)
    axes.set_ylabel('Number of Counts (1000 total)', labelpad=5, fontsize=22)
    
    axes.annotate(r'larger $x_{\rm kink2}$', xy=(140.5,200), xycoords='data', xytext=(50,-5),
                  textcoords='offset points', size=22,
                  arrowprops=dict(arrowstyle='<-', color='#e7298a', linewidth=3))
    axes.annotate(r'larger $x_{\rm kink1}$', xy=(139.5,200), xycoords='data', xytext=(-180,-5),
                  textcoords='offset points', size=22,
                  arrowprops=dict(arrowstyle='<-', color='#66a61e', linewidth=3))
    axes.annotate('', xy=(140.15,250), xycoords='data', xytext=(0,-100),
                  textcoords='offset points', arrowprops=dict(arrowstyle='-'))

    print
In [46]:
plotHistoFitCompare()

The debinning Project: Data Smoothing without Bins

Histograms "smooth" the data in this way:

Let's try to find other shapes!

Is there a more appropriate smoothing shape?

Gaussian Smoothing: binfull Histogram

aka "Kernel Density Estimation":

Gaussian Smoothing

Select a point from our data...

Add a random deviate from a Gaussian distribution...

And repeat ad nauseam.

Put all results into a binfull Histogram ("full" of arbitrarily small bins).
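A minimal sketch of this resampling picture, assuming a fixed kernel width sigma (the "LinearExpanding" variant below lets the width grow with the observable):

import numpy as np

def binfull_smooth(data, sigma=5.0, n_resamples=100000, rng=np.random):
    # Pick data points at random (with replacement) and add Gaussian deviates;
    # histogramming the smeared points with arbitrarily small bins gives the
    # "binfull histogram", i.e. a kernel density estimate.
    picks = rng.choice(np.asarray(data), size=n_resamples)
    return picks + rng.normal(0.0, sigma, size=n_resamples)

Histogramming binfull_smooth(data) with very fine bins then approximates the Gaussian KDE of the sample.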

binfull Histogram Demo

Linearly Increasing Gaussian Smoothing

Algorithm issues: speed, choosing $\sigma$, over/underflow, and the "LinearExpanding" kernel's dependence on "time"...

1) New Probability Calculus:

$\qquad$ "Can't I just calculate that thing?"

2) The debinning Calculation:

$\qquad$ "Whoa, I can calculate something else!"

3) The Future, and Shameless Advertising:

$\qquad$ "Yikes.... There is so much more!"

New Probability Calculus:

"Can't I just calculate that thing?"

Datum Probability

What's the probability to get some datum, $x$?

$$ {\large P(\widetilde{x}\in[x,x+dx]) = f(x) dx \quad {\rm and} \quad \Int{-\infty}{\infty} f(x) dx = 1~.} $$

$f(x)$ is the Probability Distribution Function (PDF).

Its anti-derivative is the Cumulative Distribution Function (CDF):

$$ F(x) = \Int{-\infty}{x} f(x') dx' = P(\tilde{x} \le x)$$$$ 1 - F(x) = \Int{x}{\infty} f(x') dx' = P(\tilde{x} > x) $$
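On the discretized grid used throughout this notebook, $f$ and $F$ are just arrays. A small sketch of the relation, assuming an evenly spaced grid and a stand-in PDF shape (the `f /= F[-1]`, `F /= F[-1]` normalization in the code cells below follows the same pattern):

import numpy as np

x = np.arange(8.5, 200.2, 0.5)                  # same grid as quickData() below
f = np.exp(-(x - 100.0)**2 / (2. * 30.0**2))    # stand-in, unnormalized PDF shape
F = np.cumsum(f) * 0.5                          # running integral over the 0.5-wide steps
f /= F[-1]                                      # normalize so that F(x_max) = 1
F /= F[-1]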

Sample Probability

Probability to get a sample of data points, $\{x_i\}_{i=1}^n$?

\begin{align} dP(\{x_1\}) &= \phantom{3!} \phantom{\times!!} f(x_1) dx_1 \\ dP(\{x_1, x_2\}) &= \rust{2}\phantom{!} \rust{\times} f(x_1) dx_1 f(x_2) dx_2 \\ dP(\{x_1, x_2, x_3\}) &= \rust{3!} \rust{\times} f(x_1) dx_1 f(x_2) dx_2 f(x_3) dx_3 \end{align}

Order doesn't matter... WLOG, sort:

\begin{align} \{x_1, x_2\} &\equiv \{x_2, x_1\} \\ \{x_1, x_2, x_3\} &\equiv \{x_2, x_3, x_1\} \equiv \cdots \\ \phantom{dP(\{x_1, x_2, x_3\})} &\phantom{= 3! \times f(x_1) dx_1 f(x_2) dx_2 f(x_3) dx_3} \end{align}$$ \Rightarrow\ {\rm WLOG}\ x_j < x_k\ {\rm for}\ j < k\ {\rm for\ all}\ x_j, x_k \in \{x_i\}_{i=1}^n$$

\begin{align} \therefore dP(\{x_i\}_{i=1}^n) &= \rust{n!} \Prod{i = 1}{n} f(x_i) dx_i \\ \phantom{dP(\{x_1, x_2, x_3\})} &\phantom{= 3! \times f(x_1) dx_1 f(x_2) dx_2 f(x_3) dx_3} \end{align}

Probability Calculus: Integration

The data is all sorted... Careful with those integration limits!

$$ \Int{\{x_1, x_2\}}{} dP(\{x_1, x_2\}) = 2 \Int{-\infty}{\infty} f(x_2) \left[ \Int{-\infty}{x_2} f(x_1) dx_1 \right] dx_2, $$
$$\Int{\{x_1, x_2, x_3\}}{} dP(\{x_1, x_2, x_3\}) = 3!\Int{-\infty}{\infty} f(x_3) \left[ \Int{-\infty}{x_3} f(x_2) \left( \Int{-\infty}{x_2} f(x_1) dx_1 \right) dx_2 \right] dx_3$$
$$\vdots$$

Limits of inner integrals are also integration variables...

Thus, integrals are "chained" together... Chain Integrals.

Probability Calculus: Being $r$th of $n$

\begin{align} \subNsubR{n}{f}{r} (x_r)\ dx_r &\equiv\ \Int{\{x_{i \neq r}\}_n}{} dP(\{x_i\}_{i=1}^n) \\ \quad = n!\ f(x_r)\ dx_r &\phantom\times\ \Int{-\infty}{\infty} f(x_n) \Int{-\infty}{x_n} f(x_{n-1}) \cdots \Int{-\infty}{x_{r+2}} f(x_{r+1}) \Prod{i=r+1}{n} dx_i \\ &\times\ \Int{-\infty}{\infty} f(x_1) \Int{x_1}{\infty} f(x_2) \cdots \Int{x_{r-2}}{\infty} f(x_{r-1}) \Prod{i'=1}{r-1} dx_{i'} \end{align}

$\subNsubR{n}{f}{r}$ Theorem:

$$ {\Large \subNsubR{n}{f}{r}(x_r) = \subNsubR{n}{K}{r}\ f(x_r) F^{r-1}(x_r) \left(1 - F(x_r)\right)^{n-r} }. $$

Probability Calculus: Being $r$th of $n$

$n$ identical siblings: Race! $\quad$ ... Outcomes? $n!$

Only care about $r$th place. $\quad$ ... $n!$ is overcounting.

$\quad$ ... $(r-1)$ faster... who cares what order...

$\quad$ ... $(n-r)$ slower... who cares what order...

$$ \Rightarrow \subNsubR{n}{K}{r} \equiv \frac{n!}{(n-r)!(r-1)!} $$
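The notebook computes this factor with `pc.nKr`; a stand-alone sketch of the same definition, evaluated in log space so it stays finite for large $n$, could look like this:

from math import lgamma, exp

def nKr(n, r):
    # n! / ((n - r)! (r - 1)!), evaluated via log-gamma to avoid overflow
    return exp(lgamma(n + 1) - lgamma(n - r + 1) - lgamma(r))

print(nKr(3, 2))   # 3!/(1! 1!) = 6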

$$ {\Large \therefore \quad \subNsubR{n}{f}{r}(x_r) = \subNsubR{n}{K}{r}\ f(x_r)\ F^{r-1}(x_r)\ \left(1 - F(x_r)\right)^{n-r} } $$

$ {\Large ({\rm HW}-1)}$ $${\large {\rm Prove\ the}\ \subNsubR{n}{f}{r}\ {\rm Theorem\ by\ computing\ the\ chain\ integral.} }$$

Proof:

$$ \subNsubR{n}{f}{r}(x_r) dx_r = \Int{\{x_{i \neq r}\}_n}{} dP(\{x_i\}_n) = n! \Int{-\infty}{x_r} f(x_{r-1}) \Int{-\infty}{x_{r-1}} f(x_{r-2}) \cdots \Int{-\infty}{x_2} f(x_1) \Prod{i=1}{r-1} dx_i $$$$ \times\ f(x_r) dx_r \Int{x_r}{\infty} f(x_{r+1}) \Int{x_{r+1}}{\infty} f(x_{r+2}) \cdots \Int{x_{n-1}}{\infty} f(x_n) \Prod{j=r+1}{n} dx_j $$
$$ = n! \Int{-\infty}{x_r} f(x_{r-1}) \Int{-\infty}{x_{r-1}} f(x_{r-2}) \cdots \Int{-\infty}{x_3} f(x_2) F(x_2) \Prod{i=2}{r-1} dx_i $$$$ \times\ f(x_r) dx_r \Int{x_r}{\infty} f(x_{r+1}) \Int{x_{r+1}}{\infty} f(x_{r+2}) \cdots \Int{x_{n-2}}{\infty} f(x_{n-1}) \left(1 - F(x_{n-1})\right) \Prod{j=r+1}{n-1} dx_j $$

Integration by parts:

$$ \Int{-\infty}{x} f(\widetilde{x}) F^{a-1}(\widetilde{x}) d\widetilde{x} = \frac{1}{a} F^a(x) \qquad {\rm and} \qquad \Int{x}{\infty} f(\widetilde{x}) \left(1 - F(\widetilde{x})\right)^{a-1} d\widetilde{x} = \frac{1}{a} \left(1 - F(x)\right)^a $$
$$ \subNsubR{n}{f}{r}(x_r) dx_r = n! \Int{-\infty}{x_r} f(x_{r-1}) \Int{-\infty}{x_{r-1}} f(x_{r-2}) \cdots \frac{1}{2} \Int{-\infty}{x_4} f(x_3) F^2 (x_3) \Prod{i=3}{r-1} dx_i $$$$ \times\ f(x_r) dx_r \Int{x_r}{\infty} f(x_{r+1}) \Int{x_{r+1}}{\infty} f(x_{r+2}) \cdots \frac{1}{2} \Int{x_{n-2}}{\infty} f(x_{n-2}) \left(1 - F(x_{n-2})\right)^2 \Prod{j=r+1}{n-2} dx_j $$
$$ \cdots\ = n! \left(\frac{1}{(r-1)!} F^{r-1}(x_r)\right) f(x_r) dx_r \left(\frac{1}{(n-r)!} \left(1 - F(x_r)\right)^{n-r}\right) $$
$$ {\Large \therefore \quad \subNsubR{n}{f}{r}(x_r) = \subNsubR{n}{K}{r} f(x_r) F^{r-1}(x_r) \left(1 - F(x_r)\right)^{n-r} }. \quad {}_\blacksquare$$
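The theorem is also easy to check by brute force. A small Monte Carlo sketch for the uniform case on $[0, 1]$, where $f = 1$ and $F(x) = x$, so the $r$th-of-$n$ density should be $\subNsubR{n}{K}{r}\, x^{r-1} (1 - x)^{n-r}$:

import numpy as np
from math import lgamma, exp

def nKr(n, r):
    return exp(lgamma(n + 1) - lgamma(n - r + 1) - lgamma(r))

n, r, trials = 10, 3, 20000
# r-th smallest value of each of `trials` samples of size n
samples = np.sort(np.random.rand(trials, n), axis=1)[:, r - 1]

centers = np.linspace(0.025, 0.975, 20)
theory = nKr(n, r) * centers**(r - 1) * (1 - centers)**(n - r)
hist, edges = np.histogram(samples, bins=20, range=(0., 1.), density=True)
# hist and theory agree up to Monte Carlo fluctuations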
In [5]:
reload(funcs)
def quickData():
    ''' Shorthand for getting easyEndpoint data (384 x values in [8.5,200])
          - set nPulls before calling
          - see utilities.sampleFunctions.functionToData for more details
    '''
    return funcs.functionToData(funcs.easyEndpoint, nPulls, 0.5, np.arange(8.5, 200.2, 0.5))


def quickBG():
    ''' Shorthand for getting endpointBG data (384 x values in [8.5,200])
          - set nPulls before calling
          - see utilities.sampleFunctions.functionToData for more details
    '''
    return funcs.functionToData(funcs.endpointBG, nPulls, 0.5, np.arange(8.5, 200.2, 0.5))
In [6]:
nPulls = 10
pullData = {index:[] for index in xrange(nPulls)}
settings['simpleSortedOutput'] = True
for iteration in xrange(2000):
    for i,v in enumerate(quickData()):
        pullData[i].append(v)

settings['simpleSortedOutput'] = False
x, f, F, data = quickData()
bgx, bgf, bgF, bgdata = quickBG()
f /= F[-1]
bgf /= F[-1]
bgF /= bgF[-1]
F /= F[-1]

ithPDF = {}
for i in xrange(1, nPulls+1):
    ithPDF[i-1] = np.array([f[j] * F[j]**(i-1) * (1 - F[j])**(nPulls-i) * pc.nKr(nPulls, i) for j in xrange(len(x))])
In [7]:
def makeHist(i=0):
    fig, axes = plt.subplots(figsize=(9,5), facecolor='#ffffff')
    plt.plot(x, f, '--', color='#0088ff', alpha=0.6, linewidth=3)
    plt.plot(x, bgf, '-.', color='#0088ff', alpha=0.6, linewidth=3)
    plt.hist(pullData[i], bins=np.arange(0, 201, 5), alpha=0.4, color='#66a61e', normed=True)
    plt.plot(x, ithPDF[i], '#964a01', linewidth=3)
    
    axes.set_xticks(np.arange(0, 201, 50))
    axes.set_xticks(np.arange(0, 201, 10), minor=True)
    axes.set_xticklabels([r'$%i$' % int(num) for num in np.arange(0, 201, 50)], fontsize=20)
    
    axes.set_ylim(bottom=-0.0001, top=0.0401)
    axes.set_yticks(np.arange(0, 0.04, 0.01))
    axes.set_yticklabels([r'$%0.3f$' % num for num in np.arange(0, 0.04, 0.01)], fontsize=20)
    axes.set_yticks(np.arange(0, 0.04, 0.01/5), minor=True)

    axes.set_xlabel('Physical Observable', labelpad=5, fontsize=22)
    axes.set_ylabel('Probability', labelpad=5, fontsize=22)

    axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
    axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
    axes.minorticks_on()

Let's "measure" 10 $x$ values for some $f(x)$, and repeat that 2000 times.

Here's $\subNsubR{10}{f}{1}(x)$ compared with the histogram of the 1st smallest point of 2000 samples:

In [8]:
makeHist(0)

Let's "measure" 10 $x$ values for some $f(x)$, and repeat that 2000 times.

Here's $\subNsubR{10}{f}{2}(x)$ compared with the histogram of the 2nd smallest point of 2000 samples:

In [9]:
makeHist(1)

Let's "measure" 10 $x$ values for some $f(x)$, and repeat that 2000 times.

Here's $\subNsubR{10}{f}{3}(x)$ compared with the histogram of the 3rd smallest point of 2000 samples:

In [10]:
makeHist(2)

Let's "measure" 10 $x$ values for some $f(x)$, and repeat that 2000 times.

Here's $\subNsubR{10}{f}{4}(x)$ compared with the histogram of the 4th smallest point of 2000 samples:

In [11]:
makeHist(3)

Let's "measure" 10 $x$ values for some $f(x)$, and repeat that 2000 times.

Here's $\subNsubR{10}{f}{5}(x)$ compared with the histogram of the 5th smallest point of 2000 samples:

In [12]:
makeHist(4)

Let's "measure" 10 $x$ values for some $f(x)$, and repeat that 2000 times.

Here's $\subNsubR{10}{f}{6}(x)$ compared with the histogram of the 6th smallest point of 2000 samples:

In [13]:
makeHist(5)

Let's "measure" 10 $x$ values for some $f(x)$, and repeat that 2000 times.

Here's $\subNsubR{10}{f}{7}(x)$ compared with the histogram of the 7th smallest point of 2000 samples:

In [14]:
makeHist(6)

Let's "measure" 10 $x$ values for some $f(x)$, and repeat that 2000 times.

Here's $\subNsubR{10}{f}{8}(x)$ compared with the histogram of the 8th smallest point of 2000 samples:

In [15]:
makeHist(7)

Let's "measure" 10 $x$ values for some $f(x)$, and repeat that 2000 times.

Here's $\subNsubR{10}{f}{9}(x)$ compared with the histogram of the 9th smallest point of 2000 samples:

In [16]:
makeHist(8)

Let's "measure" 10 $x$ values for some $f(x)$, and repeat that 2000 times.

Here's $\subNsubR{10}{f}{10}(x)$ compared with the histogram of the 10th smallest point of 2000 samples:

In [17]:
makeHist(9)

NOTE: From now on, the data does not need to be sorted.

Probability Calculus: Are You $r$th of $n$?

$\subNsubR{n}{f}{r}(x)$ is the PDF of the $r$th of $n$ position:

$$ {\Large P(\tilde{x} \in [x, x + dx]\ |\ \tilde{x}\ {\rm is\ } r{\rm th\ of\ } n, f) \equiv \subNsubR{n}{f}{r}(x) dx } $$

Use Bayes' Theorem:

$$ {\large P(B\ |\ A) = \frac{P(A\ |\ B)\ P(B)}{P(A)} } $$$$ {\large \Rightarrow P(\tilde{x}\ {\rm is\ } r{\rm th\ of\ } n\ |\ \tilde{x} \in [x, x + dx], f) = \frac{\subNsubR{n}{f}{r}(x) dx\ \frac{1}{n}}{f(x) dx} } $$
$$ {\Large = \frac{\subNsubR{n}{K}{r}}{n} F^{r-1}(x) \left(1 - F(x)\right)^{n-r}} $$

${\Large ({\rm HW}-2) }$ $${\large {\rm Prove}: \qquad \Sum{r=1}{n} \frac{\subNsubR{n}{K}{r}}{n} F^{r-1}(x) \left(1 - F(x)\right)^{n-r} = 1, \quad \forall x. }$$

Proof (special thanks to Ola Liabøtrø, who solved this in his head during my talk):

$$\frac{\subNsubR{n}{K}{r}}{n} = \frac{(n-1)!}{(n-r)!(r-1)!} = \frac{(n-1)!}{(n-1-[r-1])!(r-1)!} = \subNsubR{n-1}{C}{r-1}$$

Thus,

$$ \Sum{r=1}{n} \frac{\subNsubR{n}{K}{r}}{n} F^{r-1}(x) \left(1 - F(x)\right)^{n-r} = \Sum{r=1}{n} \subNsubR{n-1}{C}{r-1} F^{r-1}(x) \left(1 - F(x)\right)^{n-r} $$$$ = \Sum{r'=0}{n-1} \subNsubR{n-1}{C}{r'} F^{r'}(x) \left(1 - F(x)\right)^{n-1-r'} = (F(x) + 1 - F(x))^{n-1} = 1. \quad {}_\blacksquare $$

Note: I proved this also. The hard way. Term-by-term. I could not see the relation to $\subNsubR{n-1}{C}{r-1}$.
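A quick numerical check of (HW-2), assuming nothing beyond the definition of $\subNsubR{n}{K}{r}$:

from math import lgamma, exp

def nKr(n, r):
    return exp(lgamma(n + 1) - lgamma(n - r + 1) - lgamma(r))

n, F = 10, 0.37   # any n, and any value of F(x) in [0, 1]
total = sum(nKr(n, r) / n * F**(r - 1) * (1 - F)**(n - r) for r in range(1, n + 1))
print(total)      # 1.0 up to floating point rounding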

The debinning Calculation:

"Whoa, I can calculate something else!"

The debinning Algorithm

nfr Smoothing

Thought experiment: The Flat PDF

$$ f_{\rm \color{gray}{flat}}(x) = \left\{ \begin{array}{rcl} 1 & {\rm if} & x \in [0, 1] \\ 0 & & {\rm otherwise} \end{array} \right. $$

Select a point, $x_j$, from the sample $\{x_1, x_2, x_3\}$

Randomly choose an appropriate $r$ value using

$${\large P(x_j\ {\rm is\ } r{\rm th\ of\ } 3\ |\ x_j, f_{\rm \color{Gray}{flat}}) = \frac{\subNsubR{3}{K}{r}}{3} F_{\rm \color{Gray}{flat}}^{r-1}(x_j) \left(1 - F_{\rm \color{Gray}{flat}}(x_j)\right)^{3-r} }$$

Generate a random point from

$${\large \subNsubR{3}{f}{{\rm(\color{Gray}{flat})}r}(x) = \subNsubR{3}{K}{r} f_{\rm \color{Gray}{flat}}(x) F_{\rm \color{Gray}{flat}}^{r-1}(x) \left(1 - F_{\rm \color{Gray}{flat}}(x)\right)^{3-r} = \mathcal{O}(x^2) }$$
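A minimal sketch of those three steps for the flat PDF, where $F(x) = x$ on $[0, 1]$ and the $r$th-of-3 density is exactly the Beta$(r, 4-r)$ distribution; `smooth_one_flat` is a hypothetical helper, not part of the notebook's utilities:

import numpy as np
from math import lgamma, exp

def nKr(n, r):
    return exp(lgamma(n + 1) - lgamma(n - r + 1) - lgamma(r))

def smooth_one_flat(xj, n=3, rng=np.random):
    # Step 2: choose r with probability (nKr/n) F^{r-1} (1-F)^{n-r}, with F(xj) = xj
    probs = np.array([nKr(n, r) / n * xj**(r - 1) * (1 - xj)**(n - r)
                      for r in range(1, n + 1)])
    r = rng.choice(np.arange(1, n + 1), p=probs / probs.sum())
    # Step 3: draw from nfr for the flat PDF, which is the Beta(r, n - r + 1) distribution
    return rng.beta(r, n - r + 1)

sample = np.array([0.21, 0.55, 0.83])   # a hypothetical {x1, x2, x3}
smeared = [smooth_one_flat(xj) for xj in sample]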

The debinning Algorithm... Expectation

debinning Thought Experiment: 3 points from a flat distribution.

The Experiment Inspired debinning Algorithm

Suppose our data comes from

$${\huge f(x) = w\ f_{\rm \sky{sig}}(x) + (1 - w)\ f_{\rm \rust{bg}}(x)}$$

BG nfr Smoothing

Select a point from our data...

Randomly choose an appropriate $r$ value using

$${\large P(\tilde{x}\ {\rm is\ } r{\rm th\ of\ } n\ |\ \tilde{x} \in [x, x + dx], f_{\rm \rust{bg}}) \cdots}$$

Generate a random point from $\subNsubR{n}{f}{{\rm(\rust{bg})}r}(x)$... need 3 random numbers just to get one back!!

The debinning Algorithm... Calculation

"Det er dumt!" Let's just calculate it!

$$ f_{\rm debin}(x) dx = P_{\rm debin}\left(\tilde{x} \in [x, x + dx] | \{x_i\}_{i=1}^{n}, f_{\rm\rust{bg}} \right) $$
$$ = \sum\limits_{j=1}^{n} P( x_j | \{x_i\}_{i=1}^{n} ) \sum\limits_{r=1}^{N} P(r{\rm th}\ {\rm of}\ N | x_j, f_{\rm\rust{bg}}) P(\tilde{x} \in [x, x + dx] | r{\rm th}\ {\rm of}\ N, f_{\rm\rust{bg}}) $$
$$ = \sum\limits_{j=1}^{n} \frac{1}{n} \sum\limits_{r=1}^{N} \left(\frac{{}_NK_r}{N} F_{\rm\rust{bg}}(x_j)^{(r-1)}(1 - F_{\rm\rust{bg}}(x_j))^{(N-r)} \right) $$$$ \quad\times \left( {}_NK_r f_{\rm\rust{bg}}(x) F_{\rm\rust{bg}}(x)^{(r-1)}(1 - F_{\rm\rust{bg}}(x))^{(N-r)} dx \right) $$
$$ {\Large \Rightarrow f_{\rm debin}(x) = \frac{f_{\rm\rust{bg}}(x)}{n N} \displaystyle \vec{G} \cdot \sum\limits_{j=1}^{n} \vec{G}_j } $$

${\large \left[\vec{G}_{(j)} \equiv {}_N K_r F^{r-1}_{\rm\rust{bg}}(x_{(j)}) \left( 1 - F_{\rm\rust{bg}}(x_{(j)}) \right)^{N-r} \right] }$.

In [19]:
settings['simpleSortedOutput'] = False
nPulls = 96
nFictitious=96

x, f, F, data = quickData()
x, bgf, bgF, bgData = quickBG()
f /= F[-1]
F /= F[-1]
bgf /= bgF[-1]
bgF /= bgF[-1]

    
def mapDataToX(dindex):
    # maps elements of data to elements of x
    # bisect_left(x, datum) locates the left-most entry in x which is >= datum
    xindex = bisect.bisect_left(x, data[dindex])
    if xindex > 0 and x[xindex] - data[dindex] >= data[dindex] - x[xindex-1]:
        return xindex - 1
    return xindex
dataToX = np.array([mapDataToX(j) for j in xrange(len(data))])


def vecG_rth(rindex, xindex):
    r = rindex+1
    return pc.nKr(nFictitious, r) * bgF[xindex]**(r-1) * (1 - bgF[xindex])**(nFictitious - r)


vecG_rx = np.array([[vecG_rth(rindex, xindex) for xindex in xrange(len(x))]
                    for rindex in xrange(nFictitious)])
sumG_j = vecG_rx.take(dataToX, axis=1).sum(axis=1)
debinningData = bgf * np.dot(sumG_j, vecG_rx) / (nFictitious*nPulls)

# One more pull for plotting purposes
x, f, F, data = quickData()
x, bgf, bgF, bgData = quickBG()
f /= F[-1]
bgf /= F[-1]
bgF /= F[-1]
F /= F[-1]

def makeDebinningPlot():
    fig, axes = plt.subplots(figsize=(11,7), facecolor='#ffffff')
    plt.plot(x, f, '--', color='#0088ff', alpha=0.6, linewidth=3)
    plt.plot(x, bgf, '-.', color='#0088ff', alpha=0.6, linewidth=3)
    plt.hist(data, bins=np.arange(0, 201, 5), alpha=0.4, color='#66a61e', normed=True)
    plt.plot(x, debinningData, '#964a01', linewidth=3)
    
    axes.set_xticks(np.arange(0, 201, 50))
    axes.set_xticks(np.arange(0, 201, 10), minor=True)
    axes.set_xticklabels([r'$%i$' % int(num) for num in np.arange(0, 201, 50)], fontsize=20)
    
    axes.set_ylim(bottom=-0.0001, top=0.0201)
    axes.set_yticks(np.arange(0, 0.02, 0.005))
    axes.set_yticklabels([r'$%0.3f$' % num for num in np.arange(0, 0.02, 0.005)], fontsize=20)
    axes.set_yticks(np.arange(0, 0.02, 0.005/5), minor=True)

    axes.set_xlabel('Physical Observable', labelpad=5, fontsize=22)
    axes.set_ylabel('Probability', labelpad=5, fontsize=22)

    axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
    axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
    axes.minorticks_on()

debinning Demonstration

In [20]:
makeDebinningPlot()
In [21]:
fwhmValues=None
In [23]:
# Let's try some GPU programming:
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
from string import Template

settings['simpleSortedOutput'] = False
nPulls = 96
nFictitious = 96  # Multiple of 32 for GPU programming ease.
lenXdomain = len(np.arange(8.5, 200.2, 0.5))

mod_template = Template("""
    __global__ void vecG(double *vecG_rx, double *bgF, double *lognKr) {
        const int idr = threadIdx.y + blockDim.y * blockIdx.y;
        const int idx = threadIdx.x + blockDim.x * blockIdx.x;
        const int idrx = idx + idr*${lenX};
        const int r = (idr + 1) < (${nFic} - idr) ? (idr + 1) : ${nFic} - idr;
        if (bgF[idx] == 0 || (1 - bgF[idx]) == 0)
          vecG_rx[idrx] = 0;
        else
          vecG_rx[idrx] = exp(lognKr[r-1] + idr * log(bgF[idx]) 
                              + (${nFic} - idr - 1) * log(1 - bgF[idx]));
    }
    """)

mod = SourceModule(mod_template.substitute(nFic=nFictitious, lenX=lenXdomain))

timer = time()

# initialize the loop
lognKr = pc.get_lognKrArray(nFictitious)
chisq = []
bgchisq = []
ks = []
bgks = []

for trial in xrange(100000):

    x, f, F, data = quickData()
    x, bgf, bgF, bgData = quickBG()
    f /= F[-1]
    F /= F[-1]
    bgf /= bgF[-1]
    bgF /= bgF[-1]

    def mapDataToX(dindex):
        # maps elements of data to elements of x
        # bisect_left(x, datum) locates the left-most entry in x which is >= datum
        xindex = bisect.bisect_left(x, data[dindex])
        if xindex > 0 and x[xindex] - data[dindex] >= data[dindex] - x[xindex-1]:
            return xindex - 1
        return xindex
    dataToX = np.array([mapDataToX(j) for j in xrange(len(data))])

    # Use the GPU for the most expensive / parallelizable part of the calculation:
    # =====
    
    # Empty numpy array to hold result of vecG (row, column = nFictitious, len(x))
    vecG_rx = np.empty([nFictitious, len(x)])
    
    # Allocate memory on the GPU for vecG calculation
    vecG_rx_gpu = cuda.mem_alloc(vecG_rx.nbytes)
    bgF_gpu = cuda.mem_alloc(bgF.nbytes)
    lognKr_gpu = cuda.mem_alloc(lognKr.nbytes)
    
    # Transfer data to the GPU
    cuda.memcpy_htod(bgF_gpu, bgF)
    cuda.memcpy_htod(lognKr_gpu, lognKr)
    
    # Get a reference to the GPU module function and do the calculation
    # NOTE: 16*24 = 384 = len(x), and 4*24 = 96 = nFictitious
    vecG_func = mod.get_function('vecG')
    vecG_func(vecG_rx_gpu, bgF_gpu, lognKr_gpu, block=(16, 4, 1), grid=(24, 24))

    # Get the result back from the GPU and use it!
    cuda.memcpy_dtoh(vecG_rx, vecG_rx_gpu)
    sumG_j = vecG_rx.take(dataToX, axis=1).sum(axis=1)
    debinningData = bgf * np.dot(sumG_j, vecG_rx) / (nFictitious*nPulls)
    
    # Chisq test....
    if type(fwhmValues) == np.ndarray:
        chisq.append(sum(4.*(debinningData - f)**2 / fwhmValues**2) / len(fwhmValues))
        bgchisq.append(sum(4.*(debinningData - bgf)**2 / fwhmValues**2) / len(fwhmValues))
    
    # KS test...
    debinningCDF = pc.A(debinningData, 0.5)
    ks.append(max(abs(debinningCDF - F)))
    bgks.append(max(abs(debinningCDF - bgF)))
    
    if trial == 0:
        # Infinity sigma upper and lower bands...
        debinningUpper = debinningData.copy()
        debinningLower = debinningData.copy()
        # Uncertainty PDFs for each x value...
        uncertaintyHistos = [debinPlot.HistoContainer(np.arange(0., 0.024, 0.0002), debinningData[i])
                             for i in xrange(len(x))]
    else:
        debinningUpper = np.clip(debinningData, debinningUpper, 20.)
        debinningLower = np.clip(debinningData, -1., debinningLower)
        for i in xrange(len(x)):
            try:
                uncertaintyHistos[i].addValue(debinningData[i])
            except ValueError as e:
                print e.message
                print 'xindex = ' + str(i) + ',  debinningData[xindex] = ' + str(debinningData[i])

if not type(fwhmValues) == np.ndarray:
    fwhmValues = np.array([np.diff(uncertaintyHistos[i].fwhm()) for i in xrange(len(x))]).flatten()
    print "Run this again for the Chisq test data..."
print time() - timer
290.936532021
In [24]:
# Try one more sample, but with a HUGE data set:
nPulls = 10000

timer = time()
x, f, F, data = quickData()
x, bgf, bgF, bgData = quickBG()
f /= F[-1]
F /= F[-1]
bgf /= bgF[-1]
bgF /= bgF[-1]

def mapDataToX(dindex):
    # maps elements of data to elements of x
    # bisect_left(x, datum) locates the left-most entry in x which is >= datum
    xindex = bisect.bisect_left(x, data[dindex])
    if xindex > 0 and x[xindex] - data[dindex] >= data[dindex] - x[xindex-1]:
        return xindex - 1
    return xindex
dataToX = np.array([mapDataToX(j) for j in xrange(len(data))])

# Use the GPU for the most expensive / parallelizable part of the calculation:
# =====

# Empty numpy array to hold result of vecG (row, column = nFictitious, len(x))
vecG_rx = np.empty([nFictitious, len(x)])

# Allocate memory on the GPU for vecG calculation
vecG_rx_gpu = cuda.mem_alloc(vecG_rx.nbytes)
bgF_gpu = cuda.mem_alloc(bgF.nbytes)
lognKr_gpu = cuda.mem_alloc(lognKr.nbytes)

# Transfer data to the GPU
cuda.memcpy_htod(bgF_gpu, bgF)
cuda.memcpy_htod(lognKr_gpu, lognKr)

# Get a reference to the GPU module function and do the calculation
# NOTE: 16*24 = 384 = len(x), and 4*24 = 96 = nFictitious
vecG_func = mod.get_function('vecG')
vecG_func(vecG_rx_gpu, bgF_gpu, lognKr_gpu, block=(16, 4, 1), grid=(24, 24))

# Get the result back from the GPU and use it!
cuda.memcpy_dtoh(vecG_rx, vecG_rx_gpu)
sumG_j = vecG_rx.take(dataToX, axis=1).sum(axis=1)
BIGdebinningData = bgf * np.dot(sumG_j, vecG_rx) / (nFictitious*nPulls)
print time() - timer
0.0585300922394

debinning Performance: Uncertainty

In [25]:
# One more pull for plotting purposes
x, f, F, data = quickData()
x, bgf, bgF, bgData = quickBG()
f /= F[-1]
bgf /= F[-1]
bgF /= F[-1]
F /= F[-1]

# Pretty plot
fig, axes = plt.subplots(figsize=(11,7), facecolor='#ffffff')
plt.fill_between(x, debinningUpper, debinningLower, color='#0088ff', alpha=0.3, zorder=1)
plt.plot(x, f, color='#0088ff', linewidth=3, zorder=3)
plt.plot(x, bgf, color='#0088ff', linestyle='-.', linewidth=3, zorder=3)
plt.plot(x, BIGdebinningData, color='black', linestyle='--', linewidth=3, zorder=4)
colors = np.array([(0.8,1.,0.4), (0.8,0.3,0.), (1.,1.,1.), (0.,0.,1.)])
for i, v in enumerate(x):
    debinPlot.verticalColorBand(uncertaintyHistos[i], v, 0.5, colors)

axes.set_xticks(np.arange(0, 201, 50))
axes.set_xticks(np.arange(0, 201, 10), minor=True)
axes.set_xticklabels([r'$%i$' % int(num) for num in np.arange(0, 201, 50)], fontsize=20)

axes.set_ylim(bottom=-0.0001, top=0.0201)
axes.set_yticks(np.arange(0, 0.02, 0.005))
axes.set_yticklabels([r'$%0.3f$' % num for num in np.arange(0, 0.02, 0.005)], fontsize=20)
axes.set_yticks(np.arange(0, 0.02, 0.005/5), minor=True)

axes.set_xlabel('Physical Observable', labelpad=5, fontsize=22)
axes.set_ylabel('Probability', labelpad=5, fontsize=22)

axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
axes.minorticks_on()

print

debinning Performance: Discrimination

In [26]:
fig, axes = plt.subplots(figsize=(11,7), facecolor='#ffffff')
plt.hist(chisq, bins=np.arange(0, 7.01, .05), color='#0088ff', edgecolor='#555555')
plt.hist(bgchisq, bins=np.arange(0, 7.01, .05), color='#66a61e', edgecolor='#555555', alpha=0.5)

axes.set_xticks(np.arange(0, 7.01, 1))
axes.set_xticks(np.arange(0, 7.01, .25), minor=True)
axes.set_xticklabels([r'$%1.1f$' % num for num in np.arange(0, 7.01, 1)], fontsize=20)

axes.set_ylim(bottom=-0.0001, top=6000)
axes.set_yticks(np.arange(0, 6001, 1000))
axes.set_yticks(np.arange(0, 6001, 250), minor=True)
axes.set_yticklabels([r'$%i$' % int(num) for num in np.arange(0, 6001, 1000)], fontsize=20)

axes.set_xlabel(r'$\chi^2$ per d.o.f.', labelpad=5, fontsize=22)
axes.set_ylabel('Number of Counts', labelpad=5, fontsize=22)

axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
axes.minorticks_on()
print

debinning Performance: Discrimination

In [27]:
fig, axes = plt.subplots(figsize=(11,7), facecolor='#ffffff')
plt.hist(ks, bins=np.arange(0, 0.3, .002), color='#0088ff', edgecolor='#555555')
plt.hist(bgks, bins=np.arange(0, 0.3, .002), color='#66a61e', edgecolor='#555555', alpha=0.5)

axes.set_xticks(np.arange(0, 0.26, .05))
axes.set_xticks(np.arange(0, 0.26, .01), minor=True)
axes.set_xticklabels([r'$%0.2f$' % num for num in np.arange(0, 0.26, .05)], fontsize=20)

axes.set_ylim(bottom=-0.0001, top=5000)
axes.set_yticks(np.arange(0, 5001, 1000))
axes.set_yticks(np.arange(0, 5001, 250), minor=True)
axes.set_yticklabels([r'$%i$' % int(num) for num in np.arange(0, 5001, 1000)], fontsize=20)

axes.set_xlabel('KS test value', labelpad=5, fontsize=22)
axes.set_ylabel('Number of Counts', labelpad=5, fontsize=22)

axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
axes.minorticks_on()
print

The Future, and Shameless Advertising:

"Yikes.... There is so much more!"

Probability Calculus: Future

Two-point correlation distribution, $\subNsubR{n}{f}{r,s}(x_r, x_s)$...

$\qquad$ Form is known already, but use...?

Re-derive or Re-interpret current statistical methods?

$\qquad$ Poisson and Multinomial changes?

$\qquad$ Likelihoods... Unbinned likelihoods... Re-derive?

debinning: Statistical impact

Full comparison: debinning vs histograms

$\qquad$ Discovery significance?

$\qquad$ Parameter estimation performance?

Full comparison: debinning vs unbinned maximum likelihood

$\qquad$ Discovery significance?

$\qquad$ Parameter estimation performance?

debinning: Generalize / Specialize

$\qquad$ data-driven backgrounds?

$\qquad$ detector resolution "unfolding"?

$\qquad$ "Max Gap Method"?

$\qquad$ 2-debinning? N-debinning?

Searching for "New Stuff" with Statistics

New global-fitting tool for any "New Stuff": Gambit

The Global And Modular BSM Inference Tool

In [28]:
from IPython.display import HTML
HTML('<iframe src=http://gambit.hepforge.org/ width=1000 height=350></iframe>')
Out[28]:

Shameless Gambit Advertising:

Our flagship scan: Supersymmetry!

$\qquad$ Our PMSSM-25 scan: Scanning 25 parameters!

About us:

GAMBIT is a global fitting code for generic Beyond the Standard Model theories, designed to allow fast and easy definition of new models, observables, likelihoods, scanners and backend physics codes.

GAMBIT is developed by a group of nearly 30 theorists and experimentalists. The Collaboration includes members of the AMS-02, ATLAS, CDMS, CTA, DARWIN, DM-ICE, Fermi-LAT, HESS, IceCube, LHCb and XENON experiments, as well as developers of the public theory codes DarkSUSY, FlexibleSUSY, IsaJet, SoftSUSY and SuperIso.

We are currently finalising the code in preparation for first public release in Summer 2015 — stay tuned!

Constructive Comments and Acknowledgements

Greg says that I should check out rank statistics.

Christoph mentioned "max gap method" for when you have no idea whatsoever what your background is.

Anders says that I should try Likelihood ratios: Compute Lsig/Lbg, debinning vs binned vs unbinned likelihood.

Marco De Mattia says to consider the time scaling of the problem. There may be a region in data sample size where unbinned likelihood is too slow, but there is still not enough signal statistics for regular histograms.

Thank yous to various people who have discussed with me over the years:

  • Teruki Kamon, Jan Conrad, Greg Martinez, Christoph Weniger, Anders Kvellestad, Ola Liabøtrø, Are Raklev, Marco De Mattia

Extra special thank yous for incredibly helpful constructive criticisms:

  • Bob Cousins, Alex Read

... and, of course, none of this would have happened without the help I needed for debinning-1.0.0. Ultra special thank you to my bro, Nathan Krislock.