Set some useful LaTeX commands for later in the document: $\newcommand{\Int}[2] {\displaystyle \int\limits_{#1}^{#2}}$ $\newcommand{\Prod}[2] {\displaystyle \prod\limits_{#1}^{#2}}$ $\newcommand{\Sum}[2] {\displaystyle \sum\limits_{#1}^{#2}}$

\Int{lower}{upper} : $\Int{lower}{upper},\qquad$ \Prod{lower}{upper} : $\Prod{lower}{upper},\qquad$ \Sum{lower}{upper} : $\Sum{lower}{upper}$¶

$\newcommand{\subNsubR}[3] {{}_{#1} #2_{#3}}$ $\newcommand{\rust}[1] {{\require{color}\color{Bittersweet}#1}}$ $\newcommand{\sky}[1] {{\require{color}\color{SkyBlue}#1}}$

\subNsubR{n}{K}{r} : $\subNsubR{n}{K}{r},\qquad$ \rust{q} : $\rust{q},\qquad$ \sky{q} : $\sky{q}$¶

In [1]:
'''
debinningWithNewStatistics.ipynb
This file is an ipython notebook designed to be converted to reveal.js
slides via this command:
ipython nbconvert debinningWithNewStatistics.ipynb --to=slides --post=serve [--config slides_config.py]
This file also contains the main body of debinning work in the code blocks below.

Copyright (C) 2015 Abram Krislock

Talks given from approximately this version of the slides:
"Debinning: Data Smoothing with a New Probability Calculus"
- Fysisk institutt, Universitetet i Oslo, February 25, 2015
- Mitchell Institute for Fundamental Physics and Astronomy, Texas A&M University, March 13, 2015

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see http://www.gnu.org/licenses/gpl-3.0.txt.
'''

from __future__ import division  # must precede other statements in this cell
import sys
sys.path[0] = '..'
import numpy as np
from time import time
import bisect
import operator as op

from utilities import settings
import utilities.sampleFunctions as funcs
import utilities.probCalc as pc
import utilities.plotting as debinPlot
import utilities.minimizers as miniz

%matplotlib inline
from matplotlib import pyplot as plt
from IPython.display import HTML


Data Smoothing with a New Probability Calculus ¶

We knew all the stuff!¶

Then this stuff happened:¶

$${\huge{\rm He}^{++},~{\rm e}^-,~\gamma \qquad\qquad \hbar \qquad\qquad g^{\mu\nu}}$$$${\huge E^2 = (mc^2)^2 + (pc)^2}$$
$${\Large \cdots {\rm SU}(3)_C \times {\rm SU}(2)_L \times {\rm U}(1)_Y}$$

Dark Matter @ Large Hadron Collider¶

\begin{align} {\Huge \widetilde{q_L}} & {\Huge\rightarrow} & {\Huge \rust{q}} \\ & {\Huge\searrow} \\ & & {\Huge\widetilde{\chi}^0_2} & {\Huge\rightarrow} & {\Huge h^0} & {\Huge\rightarrow} & {\Huge \rust{b}}\\ & & & {\Huge\searrow} & & {\Huge \hookrightarrow} & {\Huge \rust{b}} \\ & & & & {\Huge \color{Gray}{\widetilde{\chi}^0_1}} \end{align}

Calculate as one particle, the mass:¶

$${\huge m_{\rust{q h^0}} \qquad {\rm using} \qquad E^2 = (mc^2)^2 + (pc)^2}$$

Okay, that's bad, but how often...?¶

In [2]:
funcs.randomGenerator.setSeed('seedOne')
settings['simpleSortedOutput'] = True

nPulls = 400
basePars = (0.00006, 10., 200., 110., 1.5, 50., 140.)
baseGuess = (140., -2., -0.1, 10.)

def fitTwoHistograms(percentile):
    data = funcs.functionToData(funcs.easyEndpoint, nPulls, 0.5, (0,200), basePars)
    percentieth = data[data > np.percentile(data, percentile)]

    # The histograms are filled with the same data, only different binning.
    binContentsOne, binEdgesOne = np.histogram(percentieth, np.arange(3., 204., 10.))
    binContentsTwo, binEdgesTwo = np.histogram(percentieth, np.arange(0., 201., 5.))

    # The fit ranges are defined to run from the maximum bin (near the percentile cut)
    #   to "the last non-zero bin", blindly.
    fitBinsOne = np.flatnonzero(binContentsOne[binContentsOne.argmax():]) + binContentsOne.argmax()
    fitBinsTwo = np.flatnonzero(binContentsTwo[binContentsTwo.argmax():]) + binContentsTwo.argmax()

    ### FITTING METHOD: Binned Maximum Likelihood ###
    fitGuessOne = np.array([value * funcs.randomGenerator.normal(1., 0.1) for value in baseGuess])
    fitGuessTwo = fitGuessOne.copy()

    def chiSqOne(pars):
#        return funcs.chiSqSimple(binContentsOne, binEdgesOne, fitBinsOne, funcs.theKink, pars)
        return -2. * funcs.theKink_BinnedLikelihood(binContentsOne, binEdgesOne, fitBinsOne, pars)
    def chiSqTwo(pars):
#        return funcs.chiSqSimple(binContentsTwo, binEdgesTwo, fitBinsTwo, funcs.theKink, pars)
        return -2. * funcs.theKink_BinnedLikelihood(binContentsTwo, binEdgesTwo, fitBinsTwo, pars)

    fitOne = miniz.nelderMead(chiSqOne, fitGuessOne, xtol=0.01, ftol=0.05, disp=0)
    fitTwo = miniz.nelderMead(chiSqTwo, fitGuessTwo, xtol=0.01, ftol=0.05, disp=0)
    return (data, fitOne, fitTwo)

xKinkOne = []
xKinkTwo = []
xKinkDiff = []
percentile = 60
nTrials = 1000

for i in xrange(nTrials):
    data, fitOne, fitTwo = fitTwoHistograms(percentile)
    xKinkOne.append(fitOne[0])
    xKinkTwo.append(fitTwo[0])
    xKinkDiff.append(baseGuess[0] + fitTwo[0] - fitOne[0])

In [45]:
def plotHistoFitCompare():
    fig, axes = plt.subplots(figsize=(11,7), facecolor='#ffffff')
#    plt.hist(xKinkOne, range=(80,180), color='#66a61e', bins=np.arange(80, 181, 2), alpha=0.6, zorder=2)
#    plt.hist(xKinkTwo, range=(80,180), color='#0088ff', bins=np.arange(80, 181, 2), alpha=0.6, zorder=2)
    plt.hist(xKinkDiff, range=(80,180), color='#0088ff', bins=np.arange(80, 181, 2), alpha=0.8, zorder=1)

    axes.set_xticks(np.arange(80, 181, 10))
    axes.set_xticks(np.arange(80, 181, 2), minor=True)
    axes.set_xticklabels([r'$%i$' % int(num) for num in np.arange(80., 181., 10.)], fontsize=20)

    axes.set_yticks(np.arange(0, 250, 50))
    axes.set_yticks(np.arange(0, 250, 10), minor=True)
    axes.set_yticklabels([r'$%i$' % num for num in np.arange(0, 250, 50)], fontsize=20)

    axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
    axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
    axes.minorticks_on()
    axes.set_xlabel('True Kink (140) + Kink Fit Difference', labelpad=5, fontsize=22)
    axes.set_ylabel('Number of Counts (1000 total)', labelpad=5, fontsize=22)

    axes.annotate(r'larger $x_{\rm kink2}$', xy=(140.5,200), xycoords='data', xytext=(50,-5),
                  textcoords='offset points', size=22,
                  arrowprops=dict(arrowstyle='<-', color='#e7298a', linewidth=3))
    axes.annotate(r'larger $x_{\rm kink1}$', xy=(139.5,200), xycoords='data', xytext=(-180,-5),
                  textcoords='offset points', size=22,
                  arrowprops=dict(arrowstyle='<-', color='#66a61e', linewidth=3))
    axes.annotate('', xy=(140.15,250), xycoords='data', xytext=(0,-100),
                  textcoords='offset points', arrowprops=dict(arrowstyle='-'))

    print

In [46]:
plotHistoFitCompare()


New Probability Calculus:¶

Datum Probability¶

What's the probability to get some datum, $x$?¶

$${\large P(\widetilde{x}\in[x,x+dx]) = f(x) dx \quad {\rm and} \quad \Int{-\infty}{\infty} f(x) dx = 1~.}$$

Its anti-derivative is the Cumulative Distribution Function (CDF):¶

$$F(x) = \Int{-\infty}{x} f(x') dx' = P(\tilde{x} \le x)$$$$1 - F(x) = \Int{x}{\infty} f(x') dx' = P(\tilde{x} > x)$$

Sample Probability¶

Probability to get a sample of data points, $\{x_i\}_{i=1}^n$?¶

\begin{align} dP(\{x_1\}) &= \phantom{3!} \phantom{\times!!} f(x_1) dx_1 \\ dP(\{x_1, x_2\}) &= \rust{2}\phantom{!} \rust{\times} f(x_1) dx_1 f(x_2) dx_2 \\ dP(\{x_1, x_2, x_3\}) &= \rust{3!} \rust{\times} f(x_1) dx_1 f(x_2) dx_2 f(x_3) dx_3 \end{align}

Order doesn't matter... WLOG, sort:¶

\begin{align} \{x_1, x_2\} &\equiv \{x_2, x_1\} \\ \{x_1, x_2, x_3\} &\equiv \{x_2, x_3, x_1\} \equiv \cdots \\ \phantom{dP(\{x_1, x_2, x_3\})} &\phantom{= 3! \times f(x_1) dx_1 f(x_2) dx_2 f(x_3) dx_3} \end{align}$$\Rightarrow\ {\rm WLOG}\ x_j < x_k\ {\rm for}\ j < k\ {\rm for\ all}\ x_j, x_k \in \{x_i\}_{i=1}^n$$

\begin{align} \therefore dP(\{x_i\}_{i=1}^n) &= \rust{n!} \Prod{i = 1}{n} f(x_i) dx_i \\ \phantom{dP(\{x_1, x_2, x_3\})} &\phantom{= 3! \times f(x_1) dx_1 f(x_2) dx_2 f(x_3) dx_3} \end{align}

Probability Calculus: Integration¶

The data is all sorted... Careful with those integration limits!¶

$$\Int{\{x_1, x_2\}}{} dP(\{x_1, x_2\}) = 2 \Int{-\infty}{\infty} f(x_2) \left[ \Int{-\infty}{x_2} f(x_1) dx_1 \right] dx_2,$$
$$\Int{\{x_1, x_2, x_3\}}{} dP(\{x_1, x_2, x_3\}) = 3!\Int{-\infty}{\infty} f(x_3) \left[ \Int{-\infty}{x_3} f(x_2) \left( \Int{-\infty}{x_2} f(x_1) dx_1 \right) dx_2 \right] dx_3$$
$$\vdots$$

Probability Calculus: Being $r$th of $n$¶

\begin{align} \subNsubR{n}{f}{r} (x_r)\ dx_r &\equiv\ \Int{\{x_{i \neq r}\}_n}{} dP(\{x_i\}_{i=1}^n) \\ \quad = n!\ f(x_r)\ dx_r &\phantom\times\ \Int{x_r}{\infty} f(x_n) \Int{x_r}{x_n} f(x_{n-1}) \cdots \Int{x_r}{x_{r+2}} f(x_{r+1}) \Prod{i=r+1}{n} dx_i \\ &\times\ \Int{-\infty}{x_r} f(x_1) \Int{x_1}{x_r} f(x_2) \cdots \Int{x_{r-2}}{x_r} f(x_{r-1}) \Prod{i'=1}{r-1} dx_{i'} \end{align}

$\subNsubR{n}{f}{r}$ Theorem:¶

$${\Large \subNsubR{n}{f}{r}(x_r) = \subNsubR{n}{K}{r}\ f(x_r) F^{r-1}(x_r) \left(1 - F(x_r)\right)^{n-r} }.$$

Probability Calculus: Being $r$th of $n$¶

$$\Rightarrow \subNsubR{n}{K}{r} \equiv \frac{n!}{(n-r)!(r-1)!}$$¶

$${\Large \therefore \quad \subNsubR{n}{f}{r}(x_r) = \subNsubR{n}{K}{r}\ f(x_r)\ F^{r-1}(x_r)\ \left(1 - F(x_r)\right)^{n-r} }$$

${\Large ({\rm HW}-1)}$ $${\large {\rm Prove\ the}\ \subNsubR{n}{f}{r}\ {\rm Theorem\ by\ computing\ the\ chain\ integral.} }$$

Proof:

$$\subNsubR{n}{f}{r}(x_r) dx_r = \Int{\{x_{i \neq r}\}_n}{} dP(\{x_i\}_n) = n! \Int{-\infty}{x_r} f(x_{r-1}) \Int{-\infty}{x_{r-1}} f(x_{r-2}) \cdots \Int{-\infty}{x_2} f(x_1) \Prod{i=1}{r-1} dx_i$$$$\times\ f(x_r) dx_r \Int{x_r}{\infty} f(x_{r+1}) \Int{x_{r+1}}{\infty} f(x_{r+2}) \cdots \Int{x_{n-1}}{\infty} f(x_n) \Prod{j=r+1}{n} dx_j$$
$$= n! \Int{-\infty}{x_r} f(x_{r-1}) \Int{-\infty}{x_{r-1}} f(x_{r-2}) \cdots \Int{-\infty}{x_3} f(x_2) F(x_2) \Prod{i=2}{r-1} dx_i$$$$\times\ f(x_r) dx_r \Int{x_r}{\infty} f(x_{r+1}) \Int{x_{r+1}}{\infty} f(x_{r+2}) \cdots \Int{x_{n-2}}{\infty} f(x_{n-1}) \left(1 - F(x_{n-1})\right) \Prod{j=r+1}{n-1} dx_j$$

Integration by parts:

$$\Int{-\infty}{x} f(\widetilde{x}) F^{a-1}(\widetilde{x}) d\widetilde{x} = \frac{1}{a} F^a(x) \qquad {\rm and} \qquad \Int{x}{\infty} f(\widetilde{x}) \left(1 - F(\widetilde{x})\right)^{a-1} d\widetilde{x} = \frac{1}{a} \left(1 - F(x)\right)^a$$
$$\subNsubR{n}{f}{r}(x_r) dx_r = n! \Int{-\infty}{x_r} f(x_{r-1}) \Int{-\infty}{x_{r-1}} f(x_{r-2}) \cdots \frac{1}{2} \Int{-\infty}{x_4} f(x_3) F^2 (x_3) \Prod{i=3}{r-1} dx_i$$$$\times\ f(x_r) dx_r \Int{x_r}{\infty} f(x_{r+1}) \Int{x_{r+1}}{\infty} f(x_{r+2}) \cdots \frac{1}{2} \Int{x_{n-3}}{\infty} f(x_{n-2}) \left(1 - F(x_{n-2})\right)^2 \Prod{j=r+1}{n-2} dx_j$$
$$\cdots\ = n! \left(\frac{1}{(r-1)!} F^{r-1}(x_r)\right) f(x_r) dx_r \left(\frac{1}{(n-r)!} \left(1 - F(x_r)\right)^{n-r}\right)$$
$${\Large \therefore \quad \subNsubR{n}{f}{r}(x_r) = \subNsubR{n}{K}{r} f(x_r) F^{r-1}(x_r) \left(1 - F(x_r)\right)^{n-r} }. \quad {}_\blacksquare$$
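The theorem is also easy to sanity-check by Monte Carlo. Below is a minimal, self-contained sketch (plain NumPy, independent of the utilities modules used in this notebook) for the flat PDF on $[0,1]$, where $f(x) = 1$ and $F(x) = x$: the $r$th smallest of $n$ draws should follow $\subNsubR{n}{f}{r}(x) = \subNsubR{n}{K}{r}\ x^{r-1} (1-x)^{n-r}$, a Beta$(r,\,n-r+1)$ density with mean $r/(n+1)$.

```python
import numpy as np
from math import factorial

def nKr(n, r):
    # nKr = n! / ((n-r)! (r-1)!), as derived above
    return factorial(n) // (factorial(n - r) * factorial(r - 1))

rng = np.random.default_rng(0)
n, r, nSamples = 10, 3, 20000

# The rth smallest value from each of nSamples flat samples of size n
rthOfN = np.sort(rng.uniform(size=(nSamples, n)), axis=1)[:, r - 1]

# Theorem prediction for the flat PDF: nfr(x) = nKr * x^(r-1) * (1-x)^(n-r)
x = np.linspace(0.0, 1.0, 1001)
nfr = nKr(n, r) * x**(r - 1) * (1.0 - x)**(n - r)

# nfr should be normalized, and the sample mean should approach r/(n+1)
area = (0.5 * (nfr[1:] + nfr[:-1]) * np.diff(x)).sum()
print(area, rthOfN.mean(), r / (n + 1.0))
```

The histogram of `rthOfN` tracks `nfr` in the same way as the `makeHist` comparisons below.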
In [5]:
reload(funcs)
def quickData():
    ''' Shorthand for getting easyEndpoint data (384 x values in [8.5,200])
        - set nPulls before calling
        - see utilities.sampleFunctions.functionToData for more details
    '''
    return funcs.functionToData(funcs.easyEndpoint, nPulls, 0.5, np.arange(8.5, 200.2, 0.5))

def quickBG():
    ''' Shorthand for getting endpointBG data (384 x values in [8.5,200])
        - set nPulls before calling
        - see utilities.sampleFunctions.functionToData for more details
    '''
    return funcs.functionToData(funcs.endpointBG, nPulls, 0.5, np.arange(8.5, 200.2, 0.5))

In [6]:
nPulls = 10
pullData = {index:[] for index in xrange(nPulls)}
settings['simpleSortedOutput'] = True
for iteration in xrange(2000):
    for i,v in enumerate(quickData()):
        pullData[i].append(v)

settings['simpleSortedOutput'] = False
x, f, F, data = quickData()
bgx, bgf, bgF, bgdata = quickBG()
f /= F[-1]
bgf /= F[-1]
bgF /= bgF[-1]
F /= F[-1]

ithPDF = {}
for i in xrange(1, nPulls+1):
    ithPDF[i-1] = np.array([f[j] * F[j]**(i-1) * (1 - F[j])**(nPulls-i) * pc.nKr(nPulls, i)
                            for j in xrange(len(x))])

In [7]:
def makeHist(i=0):
    fig, axes = plt.subplots(figsize=(9,5), facecolor='#ffffff')
    plt.plot(x, f, '--', color='#0088ff', alpha=0.6, linewidth=3)
    plt.plot(x, bgf, '-.', color='#0088ff', alpha=0.6, linewidth=3)
    plt.hist(pullData[i], bins=np.arange(0, 201, 5), alpha=0.4, color='#66a61e', normed=True)
    plt.plot(x, ithPDF[i], '#964a01', linewidth=3)

    axes.set_xticks(np.arange(0, 201, 50))
    axes.set_xticks(np.arange(0, 201, 10), minor=True)
    axes.set_xticklabels([r'$%i$' % int(num) for num in np.arange(0, 201, 50)], fontsize=20)

    axes.set_ylim(bottom=-0.0001, top=0.0401)
    axes.set_yticks(np.arange(0, 0.04, 0.01))
    axes.set_yticklabels([r'$%0.3f$' % num for num in np.arange(0, 0.04, 0.01)], fontsize=20)
    axes.set_yticks(np.arange(0, 0.04, 0.01/5), minor=True)

    axes.set_xlabel('Physical Observable', labelpad=5, fontsize=22)

    axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
    axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
    axes.minorticks_on()


Here's $\subNsubR{10}{f}{1}(x)$ compared with the histogram of the 1st smallest point of 2000 samples:¶

In [8]:
makeHist(0)


Here's $\subNsubR{10}{f}{2}(x)$ compared with the histogram of the 2nd smallest point of 2000 samples:¶

In [9]:
makeHist(1)


Here's $\subNsubR{10}{f}{3}(x)$ compared with the histogram of the 3rd smallest point of 2000 samples:¶

In [10]:
makeHist(2)


Here's $\subNsubR{10}{f}{4}(x)$ compared with the histogram of the 4th smallest point of 2000 samples:¶

In [11]:
makeHist(3)


Here's $\subNsubR{10}{f}{5}(x)$ compared with the histogram of the 5th smallest point of 2000 samples:¶

In [12]:
makeHist(4)


Here's $\subNsubR{10}{f}{6}(x)$ compared with the histogram of the 6th smallest point of 2000 samples:¶

In [13]:
makeHist(5)


Here's $\subNsubR{10}{f}{7}(x)$ compared with the histogram of the 7th smallest point of 2000 samples:¶

In [14]:
makeHist(6)


Here's $\subNsubR{10}{f}{8}(x)$ compared with the histogram of the 8th smallest point of 2000 samples:¶

In [15]:
makeHist(7)


Here's $\subNsubR{10}{f}{9}(x)$ compared with the histogram of the 9th smallest point of 2000 samples:¶

In [16]:
makeHist(8)


Here's $\subNsubR{10}{f}{10}(x)$ compared with the histogram of the 10th smallest point of 2000 samples:¶

In [17]:
makeHist(9)


Probability Calculus: Are You $r$th of $n$?¶

$\subNsubR{n}{f}{r}(x)$ is the PDF of the $r$th of $n$ position:¶

$${\Large P(\tilde{x} \in [x, x + dx]\ |\ \tilde{x}\ {\rm is\ } r{\rm th\ of\ } n, f) \equiv \subNsubR{n}{f}{r}(x) dx }$$

Use Bayes' Theorem:¶

$${\large P(B\ |\ A) = \frac{P(A\ |\ B)\ P(B)}{P(A)} }$$$${\large \Rightarrow P(\tilde{x}\ {\rm is\ } r{\rm th\ of\ } n\ |\ \tilde{x} \in [x, x + dx], f) = \frac{\subNsubR{n}{f}{r}(x) dx\ \frac{1}{n}}{f(x) dx} }$$
$${\Large = \frac{\subNsubR{n}{K}{r}}{n} F^{r-1}(x) \left(1 - F(x)\right)^{n-r}}$$

${\Large ({\rm HW}-2) }$ $${\large {\rm Prove}: \qquad \Sum{r=1}{n} \frac{\subNsubR{n}{K}{r}}{n} F^{r-1}(x) \left(1 - F(x)\right)^{n-r} = 1, \quad \forall x. }$$

Proof (special thanks to Ola Liabøtrø, who solved this in his head during my talk):

$$\frac{\subNsubR{n}{K}{r}}{n} = \frac{(n-1)!}{(n-r)!(r-1)!} = \frac{(n-1)!}{(n-1-[r-1])!(r-1)!} = \subNsubR{n-1}{C}{r-1}$$

Thus,

$$\Sum{r=1}{n} \frac{\subNsubR{n}{K}{r}}{n} F^{r-1}(x) \left(1 - F(x)\right)^{n-r} = \Sum{r=1}{n} \subNsubR{n-1}{C}{r-1} F^{r-1}(x) \left(1 - F(x)\right)^{n-r}$$$$= \Sum{r'=0}{n-1} \subNsubR{n-1}{C}{r'} F^{r'}(x) \left(1 - F(x)\right)^{n-1-r'} = (F(x) + 1 - F(x))^{n-1} = 1. \quad {}_\blacksquare$$

Note: I proved this also. The hard way. Term-by-term. I could not see the relation to $\subNsubR{n-1}{C}{r-1}$.
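For a purely numerical cross-check of (HW-2), the observation $\subNsubR{n}{K}{r}/n = \subNsubR{n-1}{C}{r-1}$ makes the sum a one-liner. A standalone sketch using Python's `math.comb` (not part of this notebook's code):

```python
from math import comb

n = 10
totals = {}
for F in [0.0, 0.1, 0.5, 0.9, 1.0]:
    # nKr/n = C(n-1, r-1), so the sum is the binomial expansion of (F + (1-F))^(n-1)
    totals[F] = sum(comb(n - 1, r - 1) * F**(r - 1) * (1.0 - F)**(n - r)
                    for r in range(1, n + 1))
print(totals)
```

Every value of `totals` comes out as 1 to machine precision, for any choice of $F \in [0, 1]$ and $n$.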

The debinning Calculation:¶

The debinning Algorithm¶

Thought experiment: The Flat PDF¶

$$f_{\rm \color{gray}{flat}}(x) = \left\{ \begin{array}{rcl} 1 & {\rm if} & x \in [0, 1] \\ 0 & & {\rm otherwise} \end{array} \right.$$

Randomly choose an appropriate $r$ value using¶

$${\large P(x_j\ {\rm is\ } r{\rm th\ of\ } 3\ |\ x_j, f_{\rm \color{Gray}{flat}}) = \frac{\subNsubR{3}{K}{r}}{3} F_{\rm \color{Gray}{flat}}^{r-1}(x_j) \left(1 - F_{\rm \color{Gray}{flat}}(x_j)\right)^{3-r} }$$

Generate a random point from¶

$${\large \subNsubR{3}{f}{{\rm(\color{Gray}{flat})}r}(x) = \subNsubR{3}{K}{r} f_{\rm \color{Gray}{flat}}(x) F_{\rm \color{Gray}{flat}}^{r-1}(x) \left(1 - F_{\rm \color{Gray}{flat}}(x)\right)^{3-r} = \mathcal{O}(x^2) }$$
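In code, the two-step draw is tiny for the flat PDF. The sketch below is a standalone illustration (the sampler calls are NumPy's, not this notebook's utilities): step one is a binomial draw in $r - 1$, and step two, for the flat PDF, is a draw from the Beta$(r,\,n - r + 1)$ density. Marginalizing over data drawn from the flat PDF itself should return the flat PDF.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3  # fictitious sample size, as in the thought experiment

def debinDrawFlat(xj):
    # Step 1: P(r | xj, flat) = C(n-1, r-1) xj^(r-1) (1-xj)^(n-r),
    #   i.e. r - 1 ~ Binomial(n-1, xj)
    r = 1 + rng.binomial(n - 1, xj)
    # Step 2: for the flat PDF, nfr is the Beta(r, n-r+1) density
    return rng.beta(r, n - r + 1)

# Feeding in flat data should reproduce the flat PDF (mean 1/2, variance 1/12)
draws = np.array([debinDrawFlat(xj) for xj in rng.uniform(size=50000)])
print(draws.mean(), draws.var())
```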

The Experiment Inspired debinning Algorithm¶

Suppose our data comes from¶

$${\huge f(x) = w\ f_{\rm \sky{sig}}(x) + (1 - w)\ f_{\rm \rust{bg}}(x)}$$

Randomly choose an appropriate $r$ value using¶

$${\large P(\tilde{x}\ {\rm is\ } r{\rm th\ of\ } n\ |\ \tilde{x} \in [x, x + dx], f_{\rm \rust{bg}}) \cdots}$$

The debinning Algorithm... Calculation¶

"Det er dumt!" Let's just calculate it!¶

$$f_{\rm debin}(x) dx = P_{\rm debin}\left(\tilde{x} \in [x, x + dx] | \{x_i\}_{i=1}^{n}, f_{\rm\rust{bg}} \right)$$
$$= \sum\limits_{j=1}^{n} P( x_j | \{x_i\}_{i=1}^{n} ) \sum\limits_{r=1}^{N} P(r{\rm th}\ {\rm of}\ N | x_j, f_{\rm\rust{bg}}) P(\tilde{x} \in [x, x + dx] | r{\rm th}\ {\rm of}\ N, f_{\rm\rust{bg}})$$
$$= \sum\limits_{j=1}^{n} \frac{1}{n} \sum\limits_{r=1}^{N} \left(\frac{{}_NK_r}{N} F_{\rm\rust{bg}}(x_j)^{(r-1)}(1 - F_{\rm\rust{bg}}(x_j))^{(N-r)} \right)$$$$\quad\times \left( {}_NK_r f_{\rm\rust{bg}}(x) F_{\rm\rust{bg}}(x)^{(r-1)}(1 - F_{\rm\rust{bg}}(x))^{(N-r)} dx \right)$$
$${\Large \Rightarrow f_{\rm debin}(x) = \frac{f_{\rm\rust{bg}}(x)}{n N} \displaystyle \vec{G}(x) \cdot \sum\limits_{j=1}^{n} \vec{G}_j }$$

${\large \left[\vec{G}_{(j)} \equiv {}_N K_r F^{r-1}_{\rm\rust{bg}}(x_{(j)}) \left( 1 - F_{\rm\rust{bg}}(x_{(j)}) \right)^{N-r} \right] }$.
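As a sanity check of the debinning formula above, here is a minimal pure-NumPy version for a hypothetical flat background on $[0,1]$ (so $f_{\rm bg} = 1$ and $F_{\rm bg}(x) = x$; the variable names are illustrative, not those of the utilities modules). Whatever the data, $f_{\rm debin}$ must integrate to one, since each $\subNsubR{N}{f}{r}$ is normalized and the $r$-probabilities sum to one (HW-2).

```python
import numpy as np
from math import factorial

def nKr(N, r):
    return factorial(N) / (factorial(N - r) * factorial(r - 1))

rng = np.random.default_rng(2)
N = 8                          # fictitious sample size (the notebook's nFictitious)
data = rng.uniform(size=25)    # toy measured sample, n = 25

def vecG(y):
    # Components G_r(y) = nKr * Fbg(y)^(r-1) * (1 - Fbg(y))^(N-r), with Fbg(y) = y
    y = np.asarray(y, dtype=float)
    return np.array([nKr(N, r) * y**(r - 1) * (1.0 - y)**(N - r) for r in range(1, N + 1)])

x = np.linspace(0.0, 1.0, 1001)
fbg = np.ones_like(x)                      # flat background PDF
sumG_j = vecG(data).sum(axis=1)            # sum of G-vectors over the data points j
fDebin = fbg * np.dot(sumG_j, vecG(x)) / (N * len(data))

# Trapezoid-rule normalization check
area = (0.5 * (fDebin[1:] + fDebin[:-1]) * np.diff(x)).sum()
print(area)
```

The GPU cell below computes exactly this dot product, with $F_{\rm bg}$ tabulated on the notebook's x grid instead of being flat.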

In [19]:
settings['simpleSortedOutput'] = False
nPulls = 96
nFictitious=96

x, f, F, data = quickData()
x, bgf, bgF, bgData = quickBG()
f /= F[-1]
F /= F[-1]
bgf /= bgF[-1]
bgF /= bgF[-1]

def mapDataToX(dindex):
    # maps elements of data to elements of x
    # bisect_left(x, datum) locates the left-most entry in x which is >= datum
    xindex = bisect.bisect_left(x, data[dindex])
    if xindex > 0 and x[xindex] - data[dindex] >= data[dindex] - x[xindex-1]:
        return xindex - 1
    return xindex
dataToX = np.array([mapDataToX(j) for j in xrange(len(data))])

def vecG_rth(rindex, xindex):
    r = rindex+1
    return pc.nKr(nFictitious, r) * bgF[xindex]**(r-1) * (1 - bgF[xindex])**(nFictitious - r)

vecG_rx = np.array([[vecG_rth(rindex, xindex) for xindex in xrange(len(x))]
                    for rindex in xrange(nFictitious)])
sumG_j = vecG_rx.take(dataToX, axis=1).sum(axis=1)
debinningData = bgf * np.dot(sumG_j, vecG_rx) / (nFictitious*nPulls)

# One more pull for plotting purposes
x, f, F, data = quickData()
x, bgf, bgF, bgData = quickBG()
f /= F[-1]
bgf /= F[-1]
bgF /= F[-1]
F /= F[-1]

def makeDebinningPlot():
    fig, axes = plt.subplots(figsize=(11,7), facecolor='#ffffff')
    plt.plot(x, f, '--', color='#0088ff', alpha=0.6, linewidth=3)
    plt.plot(x, bgf, '-.', color='#0088ff', alpha=0.6, linewidth=3)
    plt.hist(data, bins=np.arange(0, 201, 5), alpha=0.4, color='#66a61e', normed=True)
    plt.plot(x, debinningData, '#964a01', linewidth=3)

    axes.set_xticks(np.arange(0, 201, 50))
    axes.set_xticks(np.arange(0, 201, 10), minor=True)
    axes.set_xticklabels([r'$%i$' % int(num) for num in np.arange(0, 201, 50)], fontsize=20)

    axes.set_ylim(bottom=-0.0001, top=0.0201)
    axes.set_yticks(np.arange(0, 0.02, 0.005))
    axes.set_yticklabels([r'$%0.3f$' % num for num in np.arange(0, 0.02, 0.005)], fontsize=20)
    axes.set_yticks(np.arange(0, 0.02, 0.005/5), minor=True)

    axes.set_xlabel('Physical Observable', labelpad=5, fontsize=22)

    axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
    axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
    axes.minorticks_on()


debinning Demonstration¶

In [20]:
makeDebinningPlot()

In [21]:
fwhmValues = None  # filled by the first pass of the next cell, which asks to be run twice

In [23]:
# Let's try some GPU programming:
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
from string import Template

settings['simpleSortedOutput'] = False
nPulls = 96
nFictitious = 96  # Multiple of 32 for GPU programming ease.
lenXdomain = len(np.arange(8.5, 200.2, 0.5))

mod_template = Template("""
__global__ void vecG(double *vecG_rx, double *bgF, double *lognKr) {
const int idr = threadIdx.y + blockDim.y * blockIdx.y;
const int idx = threadIdx.x + blockDim.x * blockIdx.x;
const int idrx = idx + idr*${lenX}; const int r = (idr + 1) < (${nFic} - idr) ? (idr + 1) : ${nFic} - idr; if (bgF[idx] == 0 || (1 - bgF[idx]) == 0) vecG_rx[idrx] = 0; else vecG_rx[idrx] = exp(lognKr[r-1] + idr * log(bgF[idx]) + (${nFic} - idr - 1) * log(1 - bgF[idx]));
}
""")

mod = SourceModule(mod_template.substitute(nFic=nFictitious, lenX=lenXdomain))

timer = time()

# initialize the loop
lognKr = pc.get_lognKrArray(nFictitious)
chisq = []
bgchisq = []
ks = []
bgks = []

for trial in xrange(100000):

    x, f, F, data = quickData()
    x, bgf, bgF, bgData = quickBG()
    f /= F[-1]
    F /= F[-1]
    bgf /= bgF[-1]
    bgF /= bgF[-1]

    def mapDataToX(dindex):
        # maps elements of data to elements of x
        # bisect_left(x, datum) locates the left-most entry in x which is >= datum
        xindex = bisect.bisect_left(x, data[dindex])
        if xindex > 0 and x[xindex] - data[dindex] >= data[dindex] - x[xindex-1]:
            return xindex - 1
        return xindex
    dataToX = np.array([mapDataToX(j) for j in xrange(len(data))])

    # Use the GPU for the most expensive / parallelizable part of the calculation:
    # =====

    # Empty numpy array to hold result of vecG (row, column = nFictitious, len(x))
    vecG_rx = np.empty([nFictitious, len(x)])

    # Allocate memory on the GPU for the vecG calculation
    vecG_rx_gpu = cuda.mem_alloc(vecG_rx.nbytes)
    bgF_gpu = cuda.mem_alloc(bgF.nbytes)
    lognKr_gpu = cuda.mem_alloc(lognKr.nbytes)

    # Transfer data to the GPU
    cuda.memcpy_htod(bgF_gpu, bgF)
    cuda.memcpy_htod(lognKr_gpu, lognKr)

    # Get a reference to the GPU module function and do the calculation
    # NOTE: 16*24 = 384 = len(x), and 4*24 = 96 = nFictitious
    vecG_func = mod.get_function('vecG')
    vecG_func(vecG_rx_gpu, bgF_gpu, lognKr_gpu, block=(16, 4, 1), grid=(24, 24))

    # Get the result back from the GPU and use it!
    cuda.memcpy_dtoh(vecG_rx, vecG_rx_gpu)
    sumG_j = vecG_rx.take(dataToX, axis=1).sum(axis=1)
    debinningData = bgf * np.dot(sumG_j, vecG_rx) / (nFictitious*nPulls)

    # Chisq test....
    if type(fwhmValues) == np.ndarray:
        chisq.append(sum(4.*(debinningData - f)**2 / fwhmValues**2) / len(fwhmValues))
        bgchisq.append(sum(4.*(debinningData - bgf)**2 / fwhmValues**2) / len(fwhmValues))

    # KS test...
    debinningCDF = pc.A(debinningData, 0.5)
    ks.append(max(abs(debinningCDF - F)))
    bgks.append(max(abs(debinningCDF - bgF)))

    if trial == 0:
        # Infinity sigma upper and lower bands...
        debinningUpper = debinningData.copy()
        debinningLower = debinningData.copy()
        # Uncertainty PDFs for each x value...
        uncertaintyHistos = [debinPlot.HistoContainer(np.arange(0., 0.024, 0.0002), debinningData[i])
                             for i in xrange(len(x))]
    else:
        debinningUpper = np.clip(debinningData, debinningUpper, 20.)
        debinningLower = np.clip(debinningData, -1., debinningLower)
        for i in xrange(len(x)):
            try:
                # NOTE: the filling call was lost in conversion; presumably it adds
                #   debinningData[i] to the corresponding uncertainty histogram:
                uncertaintyHistos[i].addValue(debinningData[i])
            except ValueError as e:
                print e.message
                print 'xindex = ' + str(i) + ',  debinningData[xindex] = ' + str(debinningData[i])

if not type(fwhmValues) == np.ndarray:
    fwhmValues = np.array([np.diff(uncertaintyHistos[i].fwhm()) for i in xrange(len(x))]).flatten()
    print "Run this again for the Chisq test data..."
print time() - timer

290.936532021
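The kernel's log-space trick (evaluating $\exp[\log \subNsubR{N}{K}{r} + (r-1)\log F + (N-r)\log(1-F)]$ rather than the direct product) can be sanity-checked on the CPU for small $N$. A standalone sketch, with `lgamma` standing in for `pc.get_lognKrArray`:

```python
import numpy as np
from math import lgamma, factorial

N, F = 12, 0.3

# Direct product: nKr * F^(r-1) * (1-F)^(N-r)
direct = np.array([factorial(N) / (factorial(N - r) * factorial(r - 1))
                   * F**(r - 1) * (1 - F)**(N - r) for r in range(1, N + 1)])

# Log-space version, mirroring the kernel arithmetic
logspace = np.array([np.exp(lgamma(N + 1) - lgamma(N - r + 1) - lgamma(r)
                            + (r - 1)*np.log(F) + (N - r)*np.log(1 - F))
                     for r in range(1, N + 1)])

print(np.max(np.abs(direct/logspace - 1.0)))
```

For the notebook's $N = 96$, the direct product is fragile near $F \to 0$ or $1$, which is why the kernel zeroes those entries explicitly.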

In [24]:
# Try one more sample, but with a HUGE data set:
nPulls = 10000

timer = time()
x, f, F, data = quickData()
x, bgf, bgF, bgData = quickBG()
f /= F[-1]
F /= F[-1]
bgf /= bgF[-1]
bgF /= bgF[-1]

def mapDataToX(dindex):
    # maps elements of data to elements of x
    # bisect_left(x, datum) locates the left-most entry in x which is >= datum
    xindex = bisect.bisect_left(x, data[dindex])
    if xindex > 0 and x[xindex] - data[dindex] >= data[dindex] - x[xindex-1]:
        return xindex - 1
    return xindex
dataToX = np.array([mapDataToX(j) for j in xrange(len(data))])

# Use the GPU for the most expensive / parallelizable part of the calculation:
# =====

# Empty numpy array to hold result of vecG (row, column = nFictitious, len(x))
vecG_rx = np.empty([nFictitious, len(x)])

# Allocate memory on the GPU for vecG calculation
vecG_rx_gpu = cuda.mem_alloc(vecG_rx.nbytes)
bgF_gpu = cuda.mem_alloc(bgF.nbytes)
lognKr_gpu = cuda.mem_alloc(lognKr.nbytes)

# Transfer data to the GPU
cuda.memcpy_htod(bgF_gpu, bgF)
cuda.memcpy_htod(lognKr_gpu, lognKr)

# Get a reference to the GPU module function and do the calculation
# NOTE: 16*24 = 384 = len(x), and 4*24 = 96 = nFictitious
vecG_func = mod.get_function('vecG')
vecG_func(vecG_rx_gpu, bgF_gpu, lognKr_gpu, block=(16, 4, 1), grid=(24, 24))

# Get the result back from the GPU and use it!
cuda.memcpy_dtoh(vecG_rx, vecG_rx_gpu)
sumG_j = vecG_rx.take(dataToX, axis=1).sum(axis=1)
BIGdebinningData = bgf * np.dot(sumG_j, vecG_rx) / (nFictitious*nPulls)
print time() - timer

0.0585300922394


debinning Performance: Uncertainty¶

In [25]:
# One more pull for plotting purposes
x, f, F, data = quickData()
x, bgf, bgF, bgData = quickBG()
f /= F[-1]
bgf /= F[-1]
bgF /= F[-1]
F /= F[-1]

# Pretty plot
fig, axes = plt.subplots(figsize=(11,7), facecolor='#ffffff')
plt.fill_between(x, debinningUpper, debinningLower, color='#0088ff', alpha=0.3, zorder=1)
plt.plot(x, f, color='#0088ff', linewidth=3, zorder=3)
plt.plot(x, bgf, color='#0088ff', linestyle='-.', linewidth=3, zorder=3)
plt.plot(x, BIGdebinningData, color='black', linestyle='--', linewidth=3, zorder=4)
colors = np.array([(0.8,1.,0.4), (0.8,0.3,0.), (1.,1.,1.), (0.,0.,1.)])
for i, v in enumerate(x):
    debinPlot.verticalColorBand(uncertaintyHistos[i], v, 0.5, colors)

axes.set_xticks(np.arange(0, 201, 50))
axes.set_xticks(np.arange(0, 201, 10), minor=True)
axes.set_xticklabels([r'$%i$' % int(num) for num in np.arange(0, 201, 50)], fontsize=20)

axes.set_ylim(bottom=-0.0001, top=0.0201)
axes.set_yticks(np.arange(0, 0.02, 0.005))
axes.set_yticklabels([r'$%0.3f$' % num for num in np.arange(0, 0.02, 0.005)], fontsize=20)
axes.set_yticks(np.arange(0, 0.02, 0.005/5), minor=True)

axes.set_xlabel('Physical Observable', labelpad=5, fontsize=22)

axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
axes.minorticks_on()

print


debinning Performance: Discrimination¶

In [26]:
fig, axes = plt.subplots(figsize=(11,7), facecolor='#ffffff')
plt.hist(chisq, bins=np.arange(0, 7.01, .05), color='#0088ff', edgecolor='#555555')
plt.hist(bgchisq, bins=np.arange(0, 7.01, .05), color='#66a61e', edgecolor='#555555', alpha=0.5)

axes.set_xticks(np.arange(0, 7.01, 1))
axes.set_xticks(np.arange(0, 7.01, .25), minor=True)
axes.set_xticklabels([r'$%1.1f$' % num for num in np.arange(0, 7.01, 1)], fontsize=20)

axes.set_ylim(bottom=-0.0001, top=6000)
axes.set_yticks(np.arange(0, 6001, 1000))
axes.set_yticks(np.arange(0, 6001, 250), minor=True)
axes.set_yticklabels([r'$%i$' % int(num) for num in np.arange(0, 6001, 1000)], fontsize=20)

axes.set_xlabel(r'$\chi^2$ per d.o.f.', labelpad=5, fontsize=22)
axes.set_ylabel('Number of Counts', labelpad=5, fontsize=22)

axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
axes.minorticks_on()
print


debinning Performance: Discrimination¶

In [27]:
fig, axes = plt.subplots(figsize=(11,7), facecolor='#ffffff')
plt.hist(ks, bins=np.arange(0, 0.3, .002), color='#0088ff', edgecolor='#555555')
plt.hist(bgks, bins=np.arange(0, 0.3, .002), color='#66a61e', edgecolor='#555555', alpha=0.5)

axes.set_xticks(np.arange(0, 0.26, .05))
axes.set_xticks(np.arange(0, 0.26, .01), minor=True)
axes.set_xticklabels([r'$%0.2f$' % num for num in np.arange(0, 0.26, .05)], fontsize=20)

axes.set_ylim(bottom=-0.0001, top=5000)
axes.set_yticks(np.arange(0, 5001, 1000))
axes.set_yticks(np.arange(0, 5001, 250), minor=True)
axes.set_yticklabels([r'$%i$' % int(num) for num in np.arange(0, 5001, 1000)], fontsize=20)

axes.set_xlabel('KS test value', labelpad=5, fontsize=22)
axes.set_ylabel('Number of Counts', labelpad=5, fontsize=22)

axes.tick_params(length=10, width=1.2, labelsize=22, zorder=10, pad=10)
axes.tick_params(which='minor', length=5, width=1.2, zorder=11)
axes.minorticks_on()
print


The Future, and Shameless Advertising:¶

Searching for "New Stuff" with Statistics¶

The Global And Modular BSM Inference Tool¶

In [28]:
from IPython.display import HTML
HTML('<iframe src=http://gambit.hepforge.org/ width=1000 height=350></iframe>')

Out[28]:

$\qquad$ Our PMSSM-25 scan: Scanning 25 parameters!¶

GAMBIT is a global fitting code for generic Beyond the Standard Model theories, designed to allow fast and easy definition of new models, observables, likelihoods, scanners and backend physics codes.

GAMBIT is developed by a group of nearly 30 theorists and experimentalists. The Collaboration includes members of the AMS-02, ATLAS, CDMS, CTA, DARWIN, DM-ICE, Fermi-LAT, HESS, IceCube, LHCb and XENON experiments, as well as developers of the public theory codes DarkSUSY, FlexibleSUSY, IsaJet, SoftSUSY and SuperIso.

We are currently finalising the code in preparation for first public release in Summer 2015 — stay tuned!

Constructive Comments and Acknowledgements¶

Greg says that I should check out rank statistics.

Christoph mentioned "max gap method" for when you have no idea whatsoever what your background is.

Anders says that I should try likelihood ratios: compute $L_{\rm sig}/L_{\rm bg}$ for debinning vs. binned vs. unbinned likelihood.

Marco De Mattia says to consider the time scaling of the problem. There may be a region in data sample size where unbinned likelihood is too slow, but there is still not enough signal statistics for regular histograms.

Thank yous to various people who have discussed with me over the years:

• Teruki Kamon, Jan Conrad, Greg Martinez, Christoph Weniger, Anders Kvellestad, Ola Liabøtrø, Are Raklev, Marco De Mattia

Extra special thank yous for incredibly helpful constructive criticisms:

• Bob Cousins, Alex Read

... and, of course, none of this would have happened without the help I needed for debinning-1.0.0. Ultra special thank you to my bro, Nathan Krislock.