Machine Learning for $t\bar{t}Z$ Opposite-sign dilepton analysis¶

This notebook uses ATLAS Open Data http://opendata.atlas.cern to show you the steps to implement Machine Learning in the $t\bar{t}Z$ Opposite-sign dilepton analysis, following the ATLAS published paper Measurement of the $t\bar{t}Z$ and $t\bar{t}W$ cross sections in proton-proton collisions at $\sqrt{s}$ = 13 TeV with the ATLAS detector.

The whole notebook takes a few hours to follow through.

Notebooks are web applications that allow you to create and share documents that can contain, for example:

live code
visualisations
narrative text

Notebooks are a perfect platform to develop Machine Learning for your work, since you'll need exactly those 3 things: code, visualisations and narrative text!

We're interested in Machine Learning because we can design an algorithm to figure out for itself how to do various analyses, potentially saving us countless human-hours of design and analysis work.

Machine Learning use within ATLAS includes:

particle tracking
particle identification
signal/background classification
and more!

This notebook will focus on signal/background classification.

By the end of this notebook you will be able to:

run Boosted Decision Trees to classify signal and background
know some things you can change to improve your Boosted Decision Tree
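
To make the idea of a Boosted Decision Tree concrete before the physics begins, here is a minimal sketch using only the Python standard library. It is an assumption-laden toy, not the training code used later in this notebook: it boosts one-cut "decision stumps" with the AdaBoost algorithm on two made-up Gaussian variables, and every name in it (`best_stump`, `train_bdt`, `bdt_score`) is hypothetical. In practice you would normally use a library implementation rather than write your own.

```python
import math
import random

def stump_predict(row, j, cut, sign):
    """A depth-1 decision tree: predict `sign` above the cut on variable j, `-sign` below."""
    return sign if row[j] > cut else -sign

def best_stump(X, y, w):
    """Scan every variable and every cut value; keep the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for cut in sorted({row[j] for row in X}):
            for sign in (+1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, cut, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, cut, sign)
    return best

def train_bdt(X, y, n_trees=10):
    """AdaBoost: each new stump is trained with larger weights on previously misclassified events."""
    w = [1.0 / len(X)] * len(X)
    trees = []
    for _ in range(n_trees):
        err, j, cut, sign = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # weight of this stump in the final vote
        trees.append((alpha, j, cut, sign))
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, cut, sign))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return trees

def bdt_score(trees, row):
    """Weighted vote of all stumps: positive means signal-like, negative background-like."""
    return sum(alpha * stump_predict(row, j, cut, sign) for alpha, j, cut, sign in trees)

# Toy data (made up): signal is labelled +1, background -1, each with two variables.
random.seed(0)
signal = [[random.gauss(1.5, 1.0), random.gauss(1.5, 1.0)] for _ in range(100)]
background = [[random.gauss(-1.5, 1.0), random.gauss(-1.5, 1.0)] for _ in range(100)]
events = [(row, +1) for row in signal] + [(row, -1) for row in background]
random.shuffle(events)

# Training and testing split: train on 70% of the events, evaluate on the rest.
n_train = int(0.7 * len(events))
train, test = events[:n_train], events[n_train:]
trees = train_bdt([row for row, _ in train], [label for _, label in train], n_trees=10)

n_correct = sum(1 for row, label in test
                if (1 if bdt_score(trees, row) > 0 else -1) == label)
accuracy = n_correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

The number of trees and the depth of each tree are two of the things you can change to improve a Boosted Decision Tree; here the depth is fixed at one cut per tree to keep the sketch short.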

Feynman diagram pictures are borrowed from our friends at https://www.particlezoo.net

Introduction (from Section 1)¶

Properties of the top quark have been explored by the Large Hadron Collider (LHC) and previous collider experiments in great detail.

Other properties of the top quark are now becoming accessible, owing to the large center-of-mass energy and luminosity at the LHC.

Measurements of top-quark pairs in association with a Z boson ($t\bar{t}Z$) provide a direct probe of the weak couplings of the top quark. These couplings may be modified in the presence of physics beyond the Standard Model (BSM). Measurements of the $t\bar{t}Z$ production cross sections, $\sigma_{t\bar{t}Z}$, can be used to set constraints on the weak couplings of the top quark.

The production of $t\bar{t}Z$ is often an important background in searches involving final states with multiple leptons and b-quarks. These processes also constitute an important background in measurements of the associated production of the Higgs boson with top quarks.

This paper presents measurements of the $t\bar{t}Z$ cross section using proton–proton (pp) collision data at a center-of-mass energy of $\sqrt{s}$ = 13 TeV.

The final states of top-quark pairs produced in association with a Z boson contain up to four isolated, prompt leptons. In this analysis, events with two opposite-sign (OS) leptons are considered. The dominant backgrounds in this channel are Z+jets and $t\bar{t}$.

(In this paper, lepton is used to denote electron or muon, and prompt lepton is used to denote a lepton produced in a Z or W boson decay, or in the decay of a τ-lepton which arises from a Z or W boson decay.)

Data and simulated samples (from Section 3)¶

The data were collected with the ATLAS detector at a proton–proton (pp) collision energy of 13 TeV.

Monte Carlo (MC) simulation samples are used to model the expected signal and background distributions in the different control, validation and signal regions described below. All samples were processed through the same reconstruction software as used for the data.

Opposite-sign dilepton analysis (from Section 5A)¶

The OS dilepton analysis targets the $t\bar{t}Z$ process, where both top quarks decay hadronically and the Z boson decays to a pair of leptons (electrons or muons). Events are required to have exactly two opposite-sign, same-flavour (OSSF) leptons. Events with additional isolated leptons are rejected. The invariant mass of the lepton pair is required to be in the Z boson mass window, |$m_{ll} − m_Z$| < 10 GeV. The leading (subleading) lepton is required to have a transverse momentum of at least 30 (15) GeV.
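
The lepton requirements above can be sketched as a single function. This is an illustration under stated assumptions, not the notebook's own selection code: each lepton is a hypothetical dictionary with keys `pt` (GeV), `eta`, `phi`, `charge` and `flavour`, the leptons are treated as massless when computing $m_{ll}$, and the value of `M_Z` is an assumed reference value for the Z boson mass.

```python
import math

M_Z = 91.1876  # GeV, assumed reference value for the Z boson mass

def dilepton_mass(l1, l2):
    """Invariant mass of two leptons in the massless approximation:
    m^2 = 2 pT1 pT2 (cosh(delta eta) - cos(delta phi))."""
    m2 = 2.0 * l1["pt"] * l2["pt"] * (
        math.cosh(l1["eta"] - l2["eta"]) - math.cos(l1["phi"] - l2["phi"])
    )
    return math.sqrt(max(m2, 0.0))

def passes_os_dilepton(leptons):
    """Apply the OS dilepton lepton requirements to a list of isolated leptons."""
    # Exactly two leptons: events with additional isolated leptons are rejected.
    if len(leptons) != 2:
        return False
    lead, sub = sorted(leptons, key=lambda lep: lep["pt"], reverse=True)
    # Same flavour (both electrons or both muons) ...
    if lead["flavour"] != sub["flavour"]:
        return False
    # ... and opposite sign.
    if lead["charge"] * sub["charge"] >= 0:
        return False
    # Leading (subleading) lepton pT of at least 30 (15) GeV.
    if lead["pt"] < 30.0 or sub["pt"] < 15.0:
        return False
    # Z boson mass window: |m_ll - m_Z| < 10 GeV.
    return abs(dilepton_mass(lead, sub) - M_Z) < 10.0

# Two back-to-back electrons of 45.6 GeV each give m_ll = 91.2 GeV.
electron = {"pt": 45.6, "eta": 0.0, "phi": 0.0, "charge": -1, "flavour": "e"}
positron = {"pt": 45.6, "eta": 0.0, "phi": math.pi, "charge": +1, "flavour": "e"}
print(passes_os_dilepton([electron, positron]))
```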

The OS dilepton analysis is affected by large backgrounds from Z+jets or $t\bar{t}$ production, both characterized by the presence of two leptons. In order to improve the signal-to-background ratio and constrain these backgrounds from data, three separate analysis regions are considered, depending on the number of jets ($n_{jets}$) and number of b-tagged jets ($n_{b-tags}$): 2l-Z-5j2b, 2l-Z-6j1b and 2l-Z-6j2b. The signal region requirements are summarized in Table 1 below.

Variable	2l-Z-6j1b	2l-Z-5j2b	2l-Z-6j2b
Leptons	= 2, same flavour and opposite sign	= 2, same flavour and opposite sign	= 2, same flavour and opposite sign
$m_{ll}$	$\lvert m_{ll} - m_Z \rvert$ < 10 GeV	$\lvert m_{ll} - m_Z \rvert$ < 10 GeV	$\lvert m_{ll} - m_Z \rvert$ < 10 GeV
$p_T$ (leading lepton)	> 30 GeV	> 30 GeV	> 30 GeV
$p_T$ (subleading lepton)	> 15 GeV	> 15 GeV	> 15 GeV
$n_{b-tags}$	1	$\geq$2	$\geq$2
$n_{jets}$	$\geq$6	5	$\geq$6

Table 1: Summary of the event selection requirements in the OS dilepton signal regions.

This is Table 2 of the ATLAS published paper Measurement of the $t\bar{t}Z$ and $t\bar{t}W$ cross sections in proton-proton collisions at $\sqrt{s}$ = 13 TeV with the ATLAS detector.

In signal region 2l-Z-5j2b, exactly five jets are required, of which at least two must be b-tagged. In 2l-Z-6j1b (2l-Z-6j2b), at least six jets are required with exactly one (at least two) being b-tagged jets.
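
The jet and b-tag requirements of the three signal regions can be written as a small lookup. This is a sketch only; the function name `signal_region` and its arguments are hypothetical, and it assumes the lepton requirements of Table 1 have already been applied.

```python
def signal_region(n_jets, n_btags):
    """Return the OS dilepton signal region for an event, or None if it falls in none."""
    if n_jets == 5 and n_btags >= 2:
        return "2l-Z-5j2b"  # exactly five jets, at least two b-tagged
    if n_jets >= 6 and n_btags == 1:
        return "2l-Z-6j1b"  # at least six jets, exactly one b-tagged
    if n_jets >= 6 and n_btags >= 2:
        return "2l-Z-6j2b"  # at least six jets, at least two b-tagged
    return None

for n_jets, n_btags in [(5, 2), (6, 1), (7, 3), (5, 1), (4, 2)]:
    print(n_jets, n_btags, signal_region(n_jets, n_btags))
```

The three regions are mutually exclusive, so the order of the checks does not matter.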

Contents:

Running a Jupyter notebook
To set up the first time
To set up every time
Lumi, fraction, file path
Samples to process
Get data from files
Find the good jets!
Find the good leptons!
Let's calculate some variables
Load data
Samples to plot
Function to plot Data/MC

Boosted Decision Tree (BDT) in 6j2b Region
Training and Testing split
Training
Signal Region plot

Boosted Decision Tree (BDT) in 5j2b Region
Training and Testing split
Training
Signal Region plot

Boosted Decision Tree (BDT) in 6j1b Region
Training and Testing split
Training
Signal Region plot

Control Region plots
6j2b Control Region plot
5j2b Control Region plot
6j1b Control Region plot

Data-driven ttbar estimate
Function to plot data from histograms
6j2b Signal Region plot
5j2b Signal Region plot
6j1b Signal Region plot

BDT feature importances

Going further

Running a Jupyter notebook¶

To run the whole Jupyter notebook, in the top menu click Cell -> Run All.

To propagate a change you've made to a piece of code, click Cell -> Run All Below.

You can also run a single code cell, by using the keyboard shortcut Shift+Enter.

Definition	6j1b	5j2b	6j2b
$p_T$ of the lepton pair	15	14	15
$p_T$ of the 4th jet	5	1	8
$p_T$ of the 5th jet	-	8	-
$p_T$ of the 6th jet	2	-	2
$\Delta R_{\eta}$ between the two leptons	6	4	7
Number of jet pairs with mass within a window of 30 GeV around 85 GeV	1	2	3
Number of top-quark candidates	-	-	1
Invariant mass of the two jets with the smallest $\Delta R_{\eta}$	13	10	17
Invariant mass of the two untagged jets with the highest $p_T$	9	11	-
Invariant mass of the two jets with the highest value of the b-tagging discriminant	-	5	4
Scalar sum of $p_T$ divided by the sum of energy of all jets	14	13	16
Average $\Delta R_{\eta}$ of all jet pairs	11	3	10
Maximum invariant mass of a lepton and the b-tagged jet with the smallest $\Delta R_{\eta}$	10	-	13
First Fox–Wolfram moment built from jets and leptons	12	12	14
Sum of jet $p_T$, using up to six jets	4	6	5
$\eta$ of dilepton system	3	9	9
Sum of the two closest two-jet invariant masses from jjj1 and jjj2 divided by two	7	-	11
$\Delta R_{\eta}$ between two jets with the highest value of the b-tagging discriminant in the event	-	7	6
$p_T$ of the b-tagged jet with the highest $p_T$	8	-	12
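
Several of the variables defined in the table above are built from the same few ingredients. As a rough sketch, assuming each jet is a hypothetical dictionary with keys `pt` (GeV), `eta` and `phi`, the angular separation $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$, the average $\Delta R$ of all jet pairs and the sum of jet $p_T$ using up to six jets could be computed as follows. The function names are made up for illustration and are not taken from the notebook's code.

```python
import math
from itertools import combinations

def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into the range [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi

def delta_r(obj1, obj2):
    """Angular separation between two objects in the (eta, phi) plane."""
    d_eta = obj1["eta"] - obj2["eta"]
    d_phi = delta_phi(obj1["phi"], obj2["phi"])
    return math.sqrt(d_eta ** 2 + d_phi ** 2)

def average_delta_r(jets):
    """Average delta R over all distinct jet pairs (needs at least two jets)."""
    pairs = list(combinations(jets, 2))
    return sum(delta_r(a, b) for a, b in pairs) / len(pairs)

def sum_jet_pt(jets, max_jets=6):
    """Scalar sum of jet pT, using only the `max_jets` highest-pT jets."""
    leading = sorted(jets, key=lambda jet: jet["pt"], reverse=True)[:max_jets]
    return sum(jet["pt"] for jet in leading)

# Made-up jets for illustration only.
jets = [
    {"pt": 70.0, "eta": 0.0, "phi": 0.0},
    {"pt": 60.0, "eta": 1.0, "phi": 0.0},
    {"pt": 50.0, "eta": 0.0, "phi": 1.0},
]
print(average_delta_r(jets), sum_jet_pt(jets))
```

Wrapping $\Delta\phi$ matters because $\phi$ is periodic: two objects at $\phi = 3.0$ and $\phi = -3.0$ are close together in the detector, not almost $2\pi$ apart.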

First time setup on your computer (no need on mybinder)¶

To set up every time¶

Lumi, fraction, file path¶

Samples to process¶

Get data from files¶

Find the good jets!¶

Find the good leptons!¶

Let's calculate some variables!¶

Can we process the data yet?!¶

Samples to plot¶

Function to plot Data and MC¶

BDT in 6j2b region¶

The Training and Testing split (6j2b)¶

Training Decision Trees (6j2b)¶

6j2b Signal Region plot¶

BDT in 5j2b region¶

The Training and Testing split (5j2b)¶

Training Decision Trees (5j2b)¶

5j2b Signal Region plot¶

BDT in 6j1b region¶

The Training and Testing split (6j1b)¶

Training Decision Trees (6j1b)¶

6j1b Signal Region plot¶

Control Region plots¶

6j2b Control Region plot¶

5j2b Control Region plot¶

6j1b Control Region plot¶

Data-driven ttbar estimation¶

Function to plot Data and MC from histograms¶

6j2b Signal Region plot with data-driven $t\bar{t}$¶

5j2b Signal Region plot with data-driven $t\bar{t}$¶

6j1b Signal Region plot with data-driven $t\bar{t}$¶

BDT feature importances¶

Going further¶