This notebook uses ATLAS Open Data http://opendata.atlas.cern to show you the steps to rediscover the Higgs boson yourself!
ATLAS Open Data provides open access to proton-proton collision data at the LHC for educational purposes. ATLAS Open Data resources are ideal for high-school, undergraduate and postgraduate students.
Notebooks are web applications that allow you to create and share documents that can contain, for example:
The idea is that cuts increase the ratio of signal ($H \rightarrow ZZ \rightarrow \ell\ell\ell\ell$) to background ($Z, t\bar{t}, ZZ \rightarrow \ell\ell\ell\ell$)
First, the amount of $Z$ and $t\bar{t}$ background is reduced, since these are quite different to the signal.
Then, the amount of $ZZ \rightarrow \ell\ell\ell\ell$ is reduced, whilst keeping as much $H \rightarrow ZZ \rightarrow \ell\ell\ell\ell$ signal as possible.
The datasets used in this notebook have already been filtered to include at least 4 leptons per event, so that processing is quicker.
This analysis loosely follows the discovery of the Higgs boson by ATLAS (mostly Sections 4 and 4.1)
By the end of this notebook you will be able to:
Feynman diagram pictures are borrowed from our friends at https://www.particlezoo.net
To run the whole Jupyter notebook, in the top menu click Cell -> Run All.
To propagate a change you've made to a piece of code, click Cell -> Run All Below.
You can also run a single code cell, by clicking Cell -> Run Cells, or using the keyboard shortcut Shift+Enter.
This first cell only needs to be run the first time you open this notebook on your computer.
If you close Jupyter and re-open on the same computer, you won't need to run this first cell again.
If you open on mybinder, you don't need to run this cell.
import sys
# update the pip package installer
!{sys.executable} -m pip install --upgrade --user pip
# install required packages
!{sys.executable} -m pip install --upgrade --user uproot awkward vector numpy matplotlib
Cell -> Run All Below needs to be done every time you re-open this notebook.
We're going to be using a number of tools to help us:
import uproot # for reading .root files
import awkward as ak # to represent nested data in columnar format
import vector # for 4-momentum calculations
import time # to measure time to analyse
import math # for mathematical functions such as square root
import numpy as np # for numerical calculations such as histogramming
import matplotlib.pyplot as plt # for plotting
from matplotlib.ticker import AutoMinorLocator # for minor ticks
import infofile # local file containing cross-sections, sums of weights, dataset IDs
General definitions of the fraction of data used and where to access the input files
#lumi = 0.5 # fb-1 # data_A only
#lumi = 1.9 # fb-1 # data_B only
#lumi = 2.9 # fb-1 # data_C only
#lumi = 4.7 # fb-1 # data_D only
lumi = 10 # fb-1 # data_A,data_B,data_C,data_D
fraction = 1.0 # reduce this if you want the code to run quicker
#tuple_path = "Input/4lep/" # local
tuple_path = "https://atlas-opendata.web.cern.ch/atlas-opendata/samples/2020/4lep/" # web address
Samples to process
samples = {
'data': {
'list' : ['data_A','data_B','data_C','data_D'],
},
r'Background $Z,t\bar{t}$' : { # Z + ttbar
'list' : ['Zee','Zmumu','ttbar_lep'],
'color' : "#6b59d3" # purple
},
r'Background $ZZ^*$' : { # ZZ
'list' : ['llll'],
'color' : "#ff0000" # red
},
r'Signal ($m_H$ = 125 GeV)' : { # H -> ZZ -> llll
'list' : ['ggH125_ZZ4lep','VBFH125_ZZ4lep','WH125_ZZ4lep','ZH125_ZZ4lep'],
'color' : "#00cdff" # light blue
},
}
Units, as stored in the data files
MeV = 0.001
GeV = 1.0
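As a quick sanity check of these conversion factors (a minimal sketch; the momentum value below is made up, not taken from the data files):

```python
# Kinematic quantities in the input files are stored in MeV,
# so multiplying by MeV (= 0.001) expresses them in GeV.
MeV = 0.001
GeV = 1.0

lep_pt_stored = 25000          # hypothetical transverse momentum, in MeV
lep_pt_in_GeV = lep_pt_stored * MeV
print(lep_pt_in_GeV)           # 25.0
```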
Define function to get data from files.
def get_data_from_files():
data = {} # define empty dictionary to hold awkward arrays
for s in samples: # loop over samples
print('Processing '+s+' samples') # print which sample
frames = [] # define empty list to hold data
for val in samples[s]['list']: # loop over each file
if s == 'data': prefix = "Data/" # Data prefix
else: # MC prefix
prefix = "MC/mc_"+str(infofile.infos[val]["DSID"])+"."
fileString = tuple_path+prefix+val+".4lep.root" # file name to open
temp = read_file(fileString,val) # call the function read_file defined below
frames.append(temp) # append array returned from read_file to list of awkward arrays
data[s] = ak.concatenate(frames) # dictionary entry is concatenated awkward arrays
return data # return dictionary of awkward arrays
Define function to calculate the weight of a Monte Carlo event
def calc_weight(xsec_weight, events):
return (
xsec_weight
* events.mcWeight
* events.scaleFactor_PILEUP
* events.scaleFactor_ELE
* events.scaleFactor_MUON
* events.scaleFactor_LepTRIGGER
)
Define function to get the cross-section weight
def get_xsec_weight(sample):
info = infofile.infos[sample] # open infofile
xsec_weight = (lumi*1000*info["xsec"])/(info["sumw"]*info["red_eff"]) #*1000 to go from fb-1 to pb-1
return xsec_weight # return cross-section weight
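A worked example of the formula above, with all input numbers hypothetical (real values come from infofile), purely to illustrate the unit conversion:

```python
lumi = 10        # integrated luminosity in fb^-1
xsec = 0.006     # hypothetical cross-section in pb
sumw = 1.0e6     # hypothetical sum of Monte Carlo weights
red_eff = 1.0    # hypothetical reduction efficiency

# *1000 converts the luminosity from fb^-1 to pb^-1,
# matching the units of the cross-section
xsec_weight = (lumi * 1000 * xsec) / (sumw * red_eff)
print(xsec_weight)  # 6e-05
```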
Define function to calculate the 4-lepton invariant mass.
Note: lep_(pt|eta|phi|E) are variable-length lists of lepton momentum components for each event, represented by awkward arrays.
def calc_mllll(lep_pt, lep_eta, lep_phi, lep_E):
# construct awkward 4-vector array
p4 = vector.zip({"pt": lep_pt, "eta": lep_eta, "phi": lep_phi, "E": lep_E})
# calculate invariant mass of first 4 leptons
# [:, i] selects the i-th lepton in each event
# .M calculates the invariant mass
    return (p4[:, 0] + p4[:, 1] + p4[:, 2] + p4[:, 3]).M * MeV # .M gives mass in MeV; multiply by MeV (=0.001) to convert to GeV
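The same calculation can be cross-checked by hand with numpy, without the vector library (a toy event with four made-up massless leptons, arranged so the expected invariant mass is 100 GeV):

```python
import numpy as np

# four hypothetical leptons: pt = 25 GeV (25000 MeV), eta = 0,
# phi spaced by 90 degrees, E = 25 GeV each
pt  = np.array([25000., 25000., 25000., 25000.])  # MeV
eta = np.zeros(4)
phi = np.array([0., np.pi/2, np.pi, 3*np.pi/2])
E   = np.array([25000., 25000., 25000., 25000.])  # MeV

# convert to Cartesian momentum components
px, py, pz = pt*np.cos(phi), pt*np.sin(phi), pt*np.sinh(eta)

# invariant mass: m^2 = E^2 - |p|^2, summed over the four leptons
m4l = np.sqrt(E.sum()**2 - px.sum()**2 - py.sum()**2 - pz.sum()**2) * 0.001  # MeV -> GeV
print(m4l)  # 100.0
```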
If you change a cut: Cell -> Run All Below
If you change a cut here, you also need to make sure the cut is applied in the "Applying a cut" cell.
# cut on lepton charge
# paper: "selecting two pairs of isolated leptons, each of which is comprised of two leptons with the same flavour and opposite charge"
def cut_lep_charge(lep_charge):
# throw away when sum of lepton charges is not equal to 0
# first lepton in each event is [:, 0], 2nd lepton is [:, 1] etc
return lep_charge[:, 0] + lep_charge[:, 1] + lep_charge[:, 2] + lep_charge[:, 3] != 0
# cut on lepton type
# paper: "selecting two pairs of isolated leptons, each of which is comprised of two leptons with the same flavour and opposite charge"
def cut_lep_type(lep_type):
# for an electron lep_type is 11
# for a muon lep_type is 13
# throw away when none of eeee, mumumumu, eemumu
sum_lep_type = lep_type[:, 0] + lep_type[:, 1] + lep_type[:, 2] + lep_type[:, 3]
return (sum_lep_type != 44) & (sum_lep_type != 48) & (sum_lep_type != 52)
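Note that both cut functions return a mask of events to *throw away*, which is why they are applied with ~ (logical NOT) later on. A minimal sketch of this behaviour using plain numpy arrays (toy values, not real events):

```python
import numpy as np

# two toy events with four leptons each
lep_charge = np.array([[ 1, -1,  1, -1],    # sums to 0  -> keep
                       [ 1,  1,  1, -1]])   # sums to 2  -> throw away
lep_type   = np.array([[11, 11, 13, 13],    # sums to 48 (eemumu) -> keep
                       [11, 11, 11, 13]])   # sums to 46          -> throw away

def cut_lep_charge(lep_charge):
    # True where the event should be thrown away
    return lep_charge[:, 0] + lep_charge[:, 1] + lep_charge[:, 2] + lep_charge[:, 3] != 0

def cut_lep_type(lep_type):
    # True where the event is none of eeee (44), eemumu (48), mumumumu (52)
    sum_lep_type = lep_type[:, 0] + lep_type[:, 1] + lep_type[:, 2] + lep_type[:, 3]
    return (sum_lep_type != 44) & (sum_lep_type != 48) & (sum_lep_type != 52)

keep = ~cut_lep_charge(lep_charge) & ~cut_lep_type(lep_type)
print(keep)  # [ True False]
```

The real analysis does the same thing, but on awkward arrays read from the ROOT files rather than numpy arrays.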
If you add a cut: Cell -> Run All Below
def read_file(path,sample):
start = time.time() # start the clock
print("\tProcessing: "+sample) # print which sample is being processed
data_all = [] # define empty list to hold all data for this sample
# open the tree called mini using a context manager (will automatically close files/resources)
with uproot.open(path + ":mini") as tree:
numevents = tree.num_entries # number of events
if 'data' not in sample: xsec_weight = get_xsec_weight(sample) # get cross-section weight
for data in tree.iterate(['lep_pt','lep_eta','lep_phi',
'lep_E','lep_charge','lep_type',
# add more variables here if you make cuts on them
'mcWeight','scaleFactor_PILEUP',
'scaleFactor_ELE','scaleFactor_MUON',
'scaleFactor_LepTRIGGER'], # variables to calculate Monte Carlo weight
library="ak", # choose output type as awkward array
                             entry_stop=int(numevents*fraction)): # process up to numevents*fraction (entry_stop must be an integer)
nIn = len(data) # number of events in this batch
if 'data' not in sample: # only do this for Monte Carlo simulation files
# multiply all Monte Carlo weights and scale factors together to give total weight
data['totalWeight'] = calc_weight(xsec_weight, data)
# cut on lepton charge using the function cut_lep_charge defined above
data = data[~cut_lep_charge(data.lep_charge)]
# cut on lepton type using the function cut_lep_type defined above
data = data[~cut_lep_type(data.lep_type)]
# calculation of 4-lepton invariant mass using the function calc_mllll defined above
data['mllll'] = calc_mllll(data.lep_pt, data.lep_eta, data.lep_phi, data.lep_E)
# array contents can be printed at any stage like this
#print(data)
# array column can be printed at any stage like this
#print(data['lep_pt'])
# multiple array columns can be printed at any stage like this
#print(data[['lep_pt','lep_eta']])
nOut = len(data) # number of events passing cuts in this batch
data_all.append(data) # append array from this batch
elapsed = time.time() - start # time taken to process
print("\t\t nIn: "+str(nIn)+",\t nOut: \t"+str(nOut)+"\t in "+str(round(elapsed,1))+"s") # events before and after
return ak.concatenate(data_all) # return array containing events passing all cuts
This is where the processing happens
start = time.time() # time at start of whole processing
data = get_data_from_files() # process all files
elapsed = time.time() - start # time after whole processing
print("Time taken: "+str(round(elapsed,1))+"s") # print total time taken to process every file
Processing data samples
	Processing: data_A
		 nIn: 39,	 nOut: 	27	 in 1.7s
	Processing: data_B
		 nIn: 156,	 nOut: 	86	 in 1.3s
	Processing: data_C
		 nIn: 237,	 nOut: 	146	 in 1.3s
	Processing: data_D
		 nIn: 400,	 nOut: 	248	 in 1.6s
Processing Background $Z,t\bar{t}$ samples
	Processing: Zee
		 nIn: 898,	 nOut: 	243	 in 1.2s
	Processing: Zmumu
		 nIn: 684,	 nOut: 	257	 in 1.0s
	Processing: ttbar_lep
		 nIn: 1031,	 nOut: 	334	 in 1.1s
Processing Background $ZZ^*$ samples
	Processing: llll
		 nIn: 554279,	 nOut: 	523957	 in 6.9s
Processing Signal ($m_H$ = 125 GeV) samples
	Processing: ggH125_ZZ4lep
		 nIn: 164716,	 nOut: 	161451	 in 4.9s
	Processing: VBFH125_ZZ4lep
		 nIn: 191126,	 nOut: 	186870	 in 6.1s
	Processing: WH125_ZZ4lep
		 nIn: 15379,	 nOut: 	9772	 in 2.7s
	Processing: ZH125_ZZ4lep
		 nIn: 14485,	 nOut: 	11947	 in 1.4s
Time taken: 31.3s
If you only want to make a change in plotting: Cell -> Run All Below
Define function to plot the data
def plot_data(data):
xmin = 80 * GeV
xmax = 250 * GeV
step_size = 5 * GeV
bin_edges = np.arange(start=xmin, # The interval includes this value
stop=xmax+step_size, # The interval doesn't include this value
step=step_size ) # Spacing between values
bin_centres = np.arange(start=xmin+step_size/2, # The interval includes this value
stop=xmax+step_size/2, # The interval doesn't include this value
step=step_size ) # Spacing between values
data_x,_ = np.histogram(ak.to_numpy(data['data']['mllll']),
bins=bin_edges ) # histogram the data
data_x_errors = np.sqrt( data_x ) # statistical error on the data
signal_x = ak.to_numpy(data[r'Signal ($m_H$ = 125 GeV)']['mllll']) # histogram the signal
signal_weights = ak.to_numpy(data[r'Signal ($m_H$ = 125 GeV)'].totalWeight) # get the weights of the signal events
signal_color = samples[r'Signal ($m_H$ = 125 GeV)']['color'] # get the colour for the signal bar
mc_x = [] # define list to hold the Monte Carlo histogram entries
mc_weights = [] # define list to hold the Monte Carlo weights
mc_colors = [] # define list to hold the colors of the Monte Carlo bars
mc_labels = [] # define list to hold the legend labels of the Monte Carlo bars
for s in samples: # loop over samples
if s not in ['data', r'Signal ($m_H$ = 125 GeV)']: # if not data nor signal
mc_x.append( ak.to_numpy(data[s]['mllll']) ) # append to the list of Monte Carlo histogram entries
mc_weights.append( ak.to_numpy(data[s].totalWeight) ) # append to the list of Monte Carlo weights
mc_colors.append( samples[s]['color'] ) # append to the list of Monte Carlo bar colors
mc_labels.append( s ) # append to the list of Monte Carlo legend labels
# *************
# Main plot
# *************
main_axes = plt.gca() # get current axes
# plot the data points
main_axes.errorbar(x=bin_centres, y=data_x, yerr=data_x_errors,
fmt='ko', # 'k' means black and 'o' is for circles
label='Data')
# plot the Monte Carlo bars
mc_heights = main_axes.hist(mc_x, bins=bin_edges,
weights=mc_weights, stacked=True,
color=mc_colors, label=mc_labels )
mc_x_tot = mc_heights[0][-1] # stacked background MC y-axis value
# calculate MC statistical uncertainty: sqrt(sum w^2)
mc_x_err = np.sqrt(np.histogram(np.hstack(mc_x), bins=bin_edges, weights=np.hstack(mc_weights)**2)[0])
# plot the signal bar
main_axes.hist(signal_x, bins=bin_edges, bottom=mc_x_tot,
weights=signal_weights, color=signal_color,
label=r'Signal ($m_H$ = 125 GeV)')
# plot the statistical uncertainty
main_axes.bar(bin_centres, # x
2*mc_x_err, # heights
alpha=0.5, # half transparency
bottom=mc_x_tot-mc_x_err, color='none',
hatch="////", width=step_size, label='Stat. Unc.' )
# set the x-limit of the main axes
main_axes.set_xlim( left=xmin, right=xmax )
# separation of x axis minor ticks
main_axes.xaxis.set_minor_locator( AutoMinorLocator() )
# set the axis tick parameters for the main axes
main_axes.tick_params(which='both', # ticks on both x and y axes
direction='in', # Put ticks inside and outside the axes
top=True, # draw ticks on the top axis
right=True ) # draw ticks on right axis
# x-axis label
main_axes.set_xlabel(r'4-lepton invariant mass $\mathrm{m_{4l}}$ [GeV]',
fontsize=13, x=1, horizontalalignment='right' )
# write y-axis label for main axes
main_axes.set_ylabel('Events / '+str(step_size)+' GeV',
y=1, horizontalalignment='right')
# set y-axis limits for main axes
main_axes.set_ylim( bottom=0, top=np.amax(data_x)*1.6 )
# add minor ticks on y-axis for main axes
main_axes.yaxis.set_minor_locator( AutoMinorLocator() )
# Add text 'ATLAS Open Data' on plot
plt.text(0.05, # x
0.93, # y
'ATLAS Open Data', # text
transform=main_axes.transAxes, # coordinate system used is that of main_axes
fontsize=13 )
# Add text 'for education' on plot
plt.text(0.05, # x
0.88, # y
'for education', # text
transform=main_axes.transAxes, # coordinate system used is that of main_axes
style='italic',
fontsize=8 )
# Add energy and luminosity
lumi_used = str(lumi*fraction) # luminosity to write on the plot
plt.text(0.05, # x
0.82, # y
             r'$\sqrt{s}$=13 TeV,$\int$L dt = '+lumi_used+' fb$^{-1}$', # text (raw string so \s and \i are not treated as escape sequences)
transform=main_axes.transAxes ) # coordinate system used is that of main_axes
# Add a label for the analysis carried out
plt.text(0.05, # x
0.76, # y
r'$H \rightarrow ZZ^* \rightarrow 4\ell$', # text
transform=main_axes.transAxes ) # coordinate system used is that of main_axes
# draw the legend
main_axes.legend( frameon=False ) # no box around the legend
return
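The statistical uncertainty drawn above, sqrt(sum of w^2) per bin, can be checked on a toy example (hypothetical entries and weights; for unweighted entries, where every w = 1, it reduces to the familiar sqrt(N)):

```python
import numpy as np

values  = np.array([1.0, 1.5, 2.5, 2.7])   # toy histogram entries
weights = np.array([0.5, 0.5, 2.0, 1.0])   # hypothetical event weights
bin_edges = np.array([0., 2., 4.])

# per-bin uncertainty = sqrt(sum of squared weights in that bin)
w2_per_bin, _ = np.histogram(values, bins=bin_edges, weights=weights**2)
errors = np.sqrt(w2_per_bin)
print(errors)  # first bin: sqrt(0.5^2 + 0.5^2), second bin: sqrt(2^2 + 1^2)
```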
Call the function to plot the data
plot_data(data)