Testing look-elsewhere effect from combining two searches for the same particle using Gaussian Processes

by Kyle Cranmer, Dec 17, 2015

The correction for 1d look-elsewhere effect (LEE) presented in Trial factors or the look elsewhere effect in high energy physics by Ofer Vitells and Eilam Gross http://arxiv.org/abs/arXiv:1005.1891

It is often claimed that when one has two statistically independent searches for the same particle, then a peak in one tells you where to look for the other, thus eliminating the look-elsewhere effect. These searches might be from two different experiments (eg. ATLAS and CMS), two different decay modes, or two different time perieods (eg. run1 and run2). There are various flaws in this logic, as stressed by Bob Cousins in these slides. This issue quickly becomes subtle as the intuitive procedure of using the location of the excess in one search to inform where to look in the other is not fully specified. A few things can be said about the intuitive procedure

  • there is a symmetry of switching search1 and search2, so any approach that breaks this symmetry is going to have some weird properties. Our inference should not depend on the order!
  • you can decide to break that symmetry by picking one of the searches (without seeing the results) in order to have the correct Type-I error rate (coverage), but that will lead to sub-optimal inference. (And Cousins has pointed out that there may be something insightful to say by connecting to "Buehler's betting game")
  • if you use search 1 to approximately specify the location of the bump for search 2, then there is still a residual LEE for search 2 (though it will be considerably smaller)
  • The combined result doesn't depend on the order, but clearly has a look-elsewhere effect that needs to be corrected

In what follows I will explore the behavior of the LEE correction for the combination of two searches (which can be trivially extended to more than two searches).

Formalism

The starting point is to consider a search for a new particle with signal strength $\mu$ and unknown mass $\nu$ on top of a background distribution described by some nuisance parameters $\theta$. We perform the search by scanning over the mass $\nu$ and calculating the test statistic

\begin{equation} q(\nu) = -2 \log \frac{ \max_{\theta} L(\mu=0, \nu, \theta)}{ \max_{\mu, \theta} L(\mu, \nu, \theta)} \end{equation}

Assuming the background-only is true, $q(\nu)$ is a chi-square random field (with 1 degree of freedom). That means that, for any point $\nu$, the quantity $q(\nu)$ would have a chi-square distribution if you repeated the experiment many times.

The maximum local significance is based on $Z_{local} = \sqrt{q_{max}}$, where $q_{max} = \max_\nu q(\nu)$. The correction from local to global significance is given by:

\begin{equation} p_{global} = p_{local} + N \exp(-(q_{max}-u_0)/2) \end{equation}

where $N$ is the average number of upcrossings above level $u_0$ (i.e. times that $q(\nu) > u_0$). This $N$ characterizes the search -- searches with good mass resolution over a large mass range will have large values of $N$, while searches with poor mass resolution will have small values of $N$.

Shortcut: Gaussian Processes

Creating many likelihood scans from pseudo-experiments (toy Monte Carlo) is somewhat time consuming, so here we make realizations of a chi-square random field by using a Gaussian Process. The main trick we will use is that a chi-square distribution for one degree of freedom is the same as the distribution of $x^2$ if $x$ is normally distributed. As you might have guessed, a Gaussian Process (GP) is like a chi-square random field, but it is Gaussian-distributed at each point.

Note, the distributions are not independent at each point, there is some covariance. So if the $q(\nu)$ is high at one point, you can expect it to be high near by. We can control this behavior via the GP's kernel. In particular, $K(\nu, \nu') = Cov[q(\nu)^2, q(\nu')^2]$. We can essentially specify what the mass resolution of our virtual search is via the length scale used in ther kernel.

For more on the theory of Gaussian Processes, the best resource is available for free online: Rasmussen & Williams (2006). We will george -- a nice python package for Gaussian Processes (GP).

Connection to the asymptotic approximation

The next major conceptual pilar for this work is the asymptotic approximations for the likelihood ratio. Wilks's theorem states that assuming background-only ($\mu=0$) the distribution of the best fit signal strength $\hat{\mu}$ follows a Gaussian distribution $G(\hat{\mu} | \mu=0, \sigma)$, where $\sigma^2 = \textrm{Var}[\mu]$. With that assumption $q(\mu) = (\hat{\mu}/\sigma)^2$, hence $q(\mu)$ is chi-square distributed. In this way of thinking, the GP is generating results for $(\hat{\mu}/\sigma)$ as a function of the mass parameter $\nu$.

This allows us to quickly do combinations on these toy results. Fundamentally, we wish to perform a likelihood combination at every mass point. This is additive in the log-likelihood ratio: \begin{equation} q_{12}(\nu) = q_1(\nu) + q_2(\nu) + const \end{equation}

This is also a chi-square random field. In the asymptotic limit, the likelihood combination is equivalent to:

\begin{equation} \hat{\mu}_{12}(\nu) = \frac{\hat{\mu}_1(\nu) \sigma_1^{-2}(\nu) +\hat{\mu}_2(\nu)\sigma_2^{-2}(\nu)}{\sigma_1^{-2}(\nu)+\sigma_2^{-2}(\nu)} \end{equation}

together with variance

\begin{equation} Var[\hat{\mu}_{12}(\nu)] \equiv\sigma_{12}^2(\nu) = \frac{1}{\sigma_1^{-2}(\nu)+\sigma_2^{-2}(\nu)} \end{equation}

The important part here is that we can also work out the kernel for the Gaussian process that describes $(\hat{\mu}_{12}/\sigma_{12})^2(\nu)$. In particular, the covariance between $\nu_1$ and $\nu_2$ of the GP for combination can be derived from the covariance of the GP for searches 1 and 2.

Consider the simple case where the two searches have the same constant sensitivity: $\sigma_1(\nu) = \sigma_2(\nu) = \sigma$. Then $\hat{\mu}_{12}(\nu) = [\hat{\mu}_1(\nu)+\hat{\mu}_2(\nu)]/2$ and $\sigma_{12}^2 = \sigma^2/2$. So the kernel for the GP that describes the combination is given by:

\begin{equation} K_{12}(\nu, \nu') = Cov[(\hat{\mu}_{12}(\nu)/\sigma_{12})^2, (\hat{\mu}_{12}(\nu')/\sigma_{12})^2 = \frac{1}{2} K_{1}(\nu, \nu') + \frac{1}{2} K_{2}(\nu, \nu') \end{equation}

Correlarry If two searches have the same mass resolution and statistical power, the effective $N$ needed to calculate the LEE for the combination is the same as the individual searches. (This can be demonstraited with the code below by setting ratio_of_correlations=1.

Note In what follows, I'll demonstrate this for the simple case, but this can be extended to have separate $\sigma(\nu)$ curves for searches 1 and 2.

Basic outline of what will be shown below

We will create three different Gaussian Processes:

  • gp1 will generate results from experiment 1
  • gp2 will generate results from experiment 2
  • we will explicitly combine the results from experiments 1 and 2
  • gp12 will be shown to have the same behavior as the combination of gp1 and gp2
In [4]:
%pylab inline --no-import-all
#plt.rc('text', usetex=True)
plt.rcParams['figure.figsize'] = (6.0, 6.0)
#plt.rcParams['savefig.dpi'] = 60
Populating the interactive namespace from numpy and matplotlib
In [5]:
import george
from george.kernels import ExpSquaredKernel, My2ExpLEEKernel, MySignificanceKernel
from scipy.stats import chi2, norm
In [6]:
length_scale_of_correaltion=3.
ratio_of_length_scales=10.
kernel1 = ExpSquaredKernel(length_scale_of_correaltion, ndim=1)
kernel2 = ExpSquaredKernel(ratio_of_length_scales**2*length_scale_of_correaltion, ndim=1)
kernel12 = 0.5*kernel1+0.5*kernel2
In [7]:
# Create the Gaussian process
# gp = george.GP(kernel)
gp1 = george.GP(kernel1, solver=george.HODLRSolver) #faster
gp2 = george.GP(kernel2, solver=george.HODLRSolver) #faster
gp12 = george.GP(kernel12, solver=george.HODLRSolver) #faster
In [8]:
a1,b1,a2,b2, dummy = 1.e-2,0.1,-1.e-2,1.1, 1e3
newkernel1 = My2ExpLEEKernel(a1=a1,b1=b1,a2=0.,b2=dummy,\
                             l1=length_scale_of_correaltion,\
                             l2=ratio_of_length_scales*length_scale_of_correaltion)

newkernel2 = My2ExpLEEKernel(a1=0.,b1=dummy,a2=a2,b2=b2,\
                             l1=length_scale_of_correaltion,\
                             l2=ratio_of_length_scales*length_scale_of_correaltion)

newkernel12 = My2ExpLEEKernel(a1=a1,b1=b1,a2=a2,b2=b2,\
                             l1=length_scale_of_correaltion,\
                             l2=ratio_of_length_scales*length_scale_of_correaltion)
In [30]:
#build equivalent combined kernel from basic parts and operations
# this is almost it, but still need to divide by ((1./sig11/sig11 + 1./sig21/sig21))
# and current implementation of MySignificanceKernel should be 1/sig not sig
#sig1kernel = MySignificanceKernel(a=a1,b=b1)
#sig2kernel = MySignificanceKernel(a=a2,b=b2)
#newkernel12 = sig1kernel*kernel1+sig2kernel*kernel2
In [10]:
# made up sensitivity curves
def sigma1(x):
    return a1*x+b1
def sigma2(x):
    return a2*x+b2
In [11]:
gp1 = george.GP(newkernel1) #,solver=george.HODLRSolver)
gp2 = george.GP(newkernel2) #,solver=george.HODLRSolver)
gp12 = george.GP(newkernel12) #,solver=george.HODLRSolver)
In [12]:
n_scan_points=250
x = np.linspace(0,100,n_scan_points)
In [13]:
# slow part: pre-compute internal stuff for the GP
gp1.compute(x)
gp2.compute(x)
gp12.compute(x)
In [14]:
# evaluate one realization of the GP
z1 = gp1.sample(x)
z2 = gp2.sample(x)
z12 = gp12.sample(x)
# plot the chi-square random field
plt.plot(x,z1, color='red')
plt.plot(x,z2, color='red')
plt.plot(x,z12)
plt.ylabel(r'$z(\nu)$')
plt.xlabel(r'$\nu$')
Out[14]:
<matplotlib.text.Text at 0x1127d3450>

Now lets histogram the values of the random field. Don't get confused here... if you pick a single point and histogram the value of over many instances, you expect a Gaussian. However, for a single instance, you don't expect the histogram for the value of the field to be Gaussian (because of the correlations). Thought experiments: if you make length_scale_of_correaltion very small, then each point is essentially independent and you do expect to see a Gaussian; however, if length_scale_of_correaltion is very large then you expect the field to be nearly constant and the histogram below would be a delta function.

Define some quick helper functions

In [15]:
def q_to_pvalue(q):
    return (1.-chi2.cdf(q, 1))/2 #divide by 2 for 1-sided test

def pvalue_to_significance(p):
    return -norm.ppf(p)

def significance_to_pvalue(Z):
    return 1.-norm.cdf(Z)
In [16]:
def num_upcrossings(z):
    """count number of times adjacent bins change between 0,1"""
    return np.sum((z-np.roll(z,1))**2)/2
In [17]:
def global_pvalue(u,u0, n):
    #return (1.-chi2.cdf(u, 1))/2. + np.exp(-(u-u0)/2)*n #1-sided p-value
    return (1.-chi2.cdf(u, 1)) + np.exp(-(u-u0)/2)*n # 2-sided p-value

Define the threshold for counting upcrossings

In [18]:
u1 = 0.1

Check the code to count upcrossings and the LEE correction is working

In [19]:
n_samples = 1000
n_plots = 3
plt.figure(figsize=(9,n_plots*3))

z_array = gp1.sample(x,n_samples)
n_up = np.zeros(n_samples)
sig1 = sigma1(x)
sig2 = sigma2(x)

for scan_no, z in enumerate(z_array):
    scan = (z/sig1)**2    
    exc1 = (scan>u1) + 0. #add 0. to convert from bool to double    
    n_up[scan_no] = num_upcrossings(exc1)
    if scan_no < n_plots: 
        plt.subplot(n_plots,2,2*scan_no+1)
        plt.plot(x,scan)
        plt.plot([0,100],[u1,u1], c='r')
        plt.subplot(n_plots,2,2*scan_no+2)
        plt.plot(x,exc1)
        plt.ylim(-.1,1.1)
        print('experiment %d has %d upcrossings' %(scan_no, n_up[scan_no]))
n_av = np.mean(n_up)
print("average number of upcrossings in %d experiments is %f" %(n_samples, n_av))
experiment 0 has 13 upcrossings
experiment 1 has 15 upcrossings
experiment 2 has 14 upcrossings
average number of upcrossings in 1000 experiments is 14.448000

Make prediction for global p-value for q_max distribution

In [20]:
u = np.linspace(5,25,100)
global_p = global_pvalue(u,u1,n_av)

Generate many toy experiments (via the Gaussian Process), find maximum local significance for each, and check the prediction for the LEE-corrected global p-value

In [21]:
n_samples = 10000
z_array = gp1.sample(x,n_samples)

q_max = np.zeros(n_samples)

for scan_no, z in enumerate(z_array):
    scan = (z/sig1)**2
    q_max[scan_no] = np.max(scan)
In [22]:
bins, edges, patches = plt.hist(q_max, bins=30)
icdf = 1.-np.cumsum(bins/n_samples)
icdf = np.hstack((1.,icdf))
icdf_error = np.sqrt(np.cumsum(bins))/n_samples
icdf_error = np.hstack((0.,icdf_error))
plt.xlabel('$q_{max}$')
plt.ylabel('counts / bin')
Out[22]:
<matplotlib.text.Text at 0x1136e8fd0>
In [23]:
# plot the p-value 
plt.plot(edges,icdf, c='r', label='toys')
plt.errorbar(edges,icdf,yerr=icdf_error)
plt.plot(u, global_p, label='prediction')
plt.xlabel('$u$')
plt.ylabel('$P(q_{max} >u)$')
plt.legend(('prediction','toys'))
#plt.ylabel('P(q>u)')
plt.ylim(1E-3,10)
plt.xlim(0,25)
plt.semilogy()
Out[23]:
[]

Wow! that was awesome! Go math!

Part 2

In [24]:
plt.plot(x,sig1)
plt.plot(x,sig2)
plt.ylim(0,4)
Out[24]:
(0, 4)

Now let's do some experiments combining two searches

In [25]:
n_samples = 10000
z_array1 = gp1.sample(x,n_samples)
z_array2 = gp2.sample(x,n_samples)

n_av1, n_av2, n_av12 = 0., 0., 0.
q_max = np.zeros((n_samples,3))

q_10 = np.zeros((n_samples,3))

n_plots = 3
plt.figure(figsize=(9,n_plots*3))
scan_no=0
for z1, z2 in zip(z_array1,z_array2):
    
    scan1  = (z1/sig1)**2
    scan2  = (z2/sig2)**2
    scan12 = ((z1/sig1**2 + z2/sig2**2 )/(sig1**-2+sig2**-2))**2  # This is where the combination happens
    scan12 = scan12*(sig1**-2 + sig2**-2)
    
    if scan_no==0:
        print sig1[5], sig2[5], scan1[5], scan2[5], scan12[5]
              

    exc1 = (scan1>u1) + 0. #add 0. to convert from bool to double
    exc2 = (scan2>u1) + 0. #add 0. to convert from bool to double
    exc12 = (scan12>u1) + 0. #add 0. to convert from bool to double
    
    if scan_no < n_plots:
        aspect = 1.
        #plt.subplot(n_plots,3,3*scan_no+1)
        plt.subplot(n_plots,1,1*scan_no+1)
        plt.plot(x,scan1, c='r', label='search 1')
        #plt.subplot(n_plots,3,3*scan_no+2)
        plt.subplot(n_plots,1,1*scan_no+1)
        plt.plot(x,scan2, c='g', label='search 2')
        #plt.subplot(n_plots,3,3*scan_no+3)
        plt.subplot(n_plots,1,1*scan_no+1)        
        plt.plot(x,scan12, c='b', label='combined')
        plt.legend(('search 1', 'search 2', 'combined'))

    q_max[scan_no,:] = [np.max(scan1), np.max(scan2), np.max(scan12)]
    q_10[scan_no,:] = [scan1[10],scan2[10], scan12[10]]

        
    #print num_upcrossings(exc1)
    n_av1 += 1.*num_upcrossings(exc1)/n_samples
    n_av2 += 1.*num_upcrossings(exc2)/n_samples
    n_av12 += 1.*num_upcrossings(exc12)/n_samples
    scan_no +=1

print "n_av search 1, search 2, combined = ", n_av1, n_av2, n_av12
0.120080321285 1.07991967871 0.0611510142466 2.96467034774 0.00307908455561
n_av search 1, search 2, combined =  14.3972 1.6124 9.4427
In [26]:
#Simple scaling: 
print "check simple scailing rule: prediction=%f, observed=%f" %(np.sqrt((n_av1**2+n_av2**2)/2), n_av12)
check simple scailing rule: prediction=10.244003, observed=9.442700

Now let's test the prediction that gp12 has the same behavior as the explicit combination of search 1 and search 2.

In [27]:
z_array12 = gp12.sample(x,n_samples)

q12_max = np.zeros((n_samples))
n_up = np.zeros(n_samples)

for scan_no, z12 in enumerate(z_array12):
    scan12 = (z12)**2 * (sig1**-2 + sig2**-2)  #divide out the variance of the combined
    q12_max[scan_no] = np.max(scan12)
    n_up[scan_no] = num_upcrossings((scan12 > u1)+0.)
print("average number of upcrossings for combined GP = %f" %(np.mean(n_up)))
average number of upcrossings for combined GP = 9.568000

Compare $q_{max}$ distribution from direct combination with the prediction from gp12

In [28]:
bins, edges, patches = plt.hist(q_max[:,2], bins=50, alpha=0.1, color='r', label='explicit combination')
bins, edges, patches = plt.hist(q12_max, bins=edges, alpha=0.1, color='b', label='predicted')
plt.ylabel('counts/bin')
plt.xlabel('$q_{max}$')
plt.legend(('explicit combination', 'predicted'))
Out[28]:
<matplotlib.legend.Legend at 0x113610a10>
In [29]:
u = np.linspace(5,25,100)
global_p = global_pvalue(u,u1,np.mean(n_up))

icdf = 1.-np.cumsum(bins/n_samples)
icdf = np.hstack((1.,icdf))
icdf_error = np.sqrt(np.cumsum(bins))/n_samples
icdf_error = np.hstack((0.,icdf_error))

plt.plot(edges,icdf, c='r', label='toys')
plt.errorbar(edges,icdf,yerr=icdf_error)
plt.plot(u, global_p, label='prediction')
plt.xlabel('$u$')
plt.ylabel('$P(q_{max} >u)$')
plt.legend(('prediction','toys'))
#plt.ylabel('P(q>u)')
plt.ylim(1E-3,10)
plt.xlim(0,25)
plt.semilogy()
Out[29]:
[]

Bingo!