This macro provides an example of how to use TMVA for k-folds cross evaluation.
The input data is a toy-MC sample consisting of two Gaussian distributions.
The output file "TMVACV.root" can be analysed with the use of dedicated macros (simply say: root -l <macro.C>), which can be conveniently invoked through a GUI that will appear at the end of the run of this macro. Launch the GUI via the command:
root -l -e 'TMVA::TMVAGui("TMVACV.root")'
Cross evaluation is a special case of k-folds cross validation where the splitting into k folds is computed deterministically. This ensures that a given event will always end up in the same fold.
In addition, all resulting classifiers are saved and can be applied to new data using MethodCrossValidation. One requirement for this to work is a splitting function that is evaluated for each event to determine into what fold it goes (for training/evaluation) or to what classifier (for application).
Cross evaluation partitions the data into folds using a deterministic split, defined by the split expression. The expression can be any valid TFormula as long as all parts used are defined.
For each event the split expression is evaluated to a number and the event is put in the fold corresponding to that number.
It is recommended to always use %int([NumFolds]) at the end of the expression.
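The per-event evaluation described above can be mimicked in plain C++. This is a minimal sketch rather than TMVA code: `foldOf` is a hypothetical helper that reproduces the arithmetic of an expression of the recommended form, `int(fabs([eventID]))%int([NumFolds])`, for a single event.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of TMVA): mirrors the split expression
// "int(fabs([eventID]))%int([NumFolds])" for one event.
// Both inputs are floating-point values, as they would be inside a formula.
int foldOf(double eventID, double numFolds)
{
   assert(numFolds >= 1.0); // at least one fold is required
   return static_cast<int>(std::fabs(eventID)) % static_cast<int>(numFolds);
}
```

With two folds, `foldOf(7, 2)` gives 1 and `foldOf(-4, 2)` gives 0. The result depends only on the event ID and the number of folds, so an event lands in the same fold every time the expression is evaluated.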
The split expression has access to all spectators and variables defined in the dataloader. Additionally, the number of folds in the split can be accessed with NumFolds (or numFolds).
Example:
"int(fabs([eventID]))%int([NumFolds])"
Author: Kim Albertsson (adapted from code originally by Andreas Hoecker)
This notebook tutorial was automatically generated with ROOTBOOK-izer from the macro found in the ROOT repository on Wednesday, April 17, 2024 at 11:22 AM.
%%cpp -d
#include <cstdlib>
#include <iostream>
#include <map>
#include <string>
#include "TChain.h"
#include "TFile.h"
#include "TTree.h"
#include "TString.h"
#include "TObjString.h"
#include "TSystem.h"
#include "TROOT.h"
#include "TRandom3.h"
#include "TMVA/CrossValidation.h"
#include "TMVA/DataLoader.h"
#include "TMVA/Factory.h"
#include "TMVA/Tools.h"
#include "TMVA/TMVAGui.h"
Helper function that generates a toy dataset and stores it in a TTree.
%%cpp -d
TTree *genTree(Int_t nPoints, Double_t offset, Double_t scale, UInt_t seed = 100)
{
   TRandom3 rng(seed);
   Float_t x = 0;
   Float_t y = 0;
   // Signed integer, to match the "eventID/I" leaf type used below.
   Int_t eventID = 0;
   TTree *data = new TTree();
   data->Branch("x", &x, "x/F");
   data->Branch("y", &y, "y/F");
   data->Branch("eventID", &eventID, "eventID/I");
   for (Int_t n = 0; n < nPoints; ++n) {
      x = rng.Gaus(offset, scale);
      y = rng.Gaus(offset, scale);
      // For our simple example it is enough that the IDs are uniformly
      // distributed and independent of the data.
      ++eventID;
      data->Fill();
   }
   // Important: disconnects the tree from the memory locations of x, y and eventID.
   data->ResetBranchAddresses();
   return data;
}
Arguments are defined.
bool useRandomSplitting = false;
This loads the library
TMVA::Tools::Instance();
Load the data into TTrees. If you load data from a file, you can use a variant of:
TString filename = "/path/to/file";
TFile * input = TFile::Open( filename );
TTree * signalTree = (TTree*)input->Get("TreeName");
TTree *sigTree = genTree(1000, 1.0, 1.0, 100);
TTree *bkgTree = genTree(1000, -1.0, 1.0, 101);
Create a ROOT output file where TMVA will store ntuples, histograms, etc.
TString outfileName("TMVACV.root");
TFile *outputFile = TFile::Open(outfileName, "RECREATE");
DataLoader definitions: we declare the variables in the tree so that TMVA can find them. For more information, see the TMVAClassification tutorial.
TMVA::DataLoader *dataloader = new TMVA::DataLoader("datasetcv");
Data variables
dataloader->AddVariable("x", 'F');
dataloader->AddVariable("y", 'F');
Spectator used for split
dataloader->AddSpectator("eventID", 'I');
NOTE: Currently TMVA treats all input variables, spectators etc as
floats. Thus, if the absolute value of the input is too large
there can be precision loss. This can especially be a problem for
cross validation with large event numbers.
A workaround is to define your splitting variable as:
dataloader->AddSpectator("eventID := eventID % 4096", 'I');
where 4096 should be a number much larger than the number of folds
you intend to run with.
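The precision loss mentioned in the note can be reproduced outside TMVA. The sketch below assumes the ID ends up in a 32-bit `float`, which represents integers exactly only up to 2^24 = 16777216. Both helper names are hypothetical and the code assumes non-negative IDs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not TMVA code) illustrating the note above.

// Fold computed after the raw ID has been stored as a 32-bit float.
int foldFromFloat(std::int64_t eventID, int numFolds)
{
   assert(eventID >= 0 && numFolds > 0);
   float stored = static_cast<float>(eventID); // may round once eventID exceeds 2^24
   return static_cast<int>(static_cast<std::int64_t>(stored) % numFolds);
}

// Fold computed with the workaround: reduce the ID modulo 4096 in integer
// arithmetic first, so the stored value is small and exactly representable.
int foldWithWorkaround(std::int64_t eventID, int numFolds)
{
   assert(eventID >= 0 && numFolds > 0);
   float stored = static_cast<float>(eventID % 4096);
   return static_cast<int>(stored) % numFolds;
}
```

For the odd ID 16777217 and two folds, `foldFromFloat` returns 0 because the float rounds to 16777216, whereas `foldWithWorkaround` returns the expected 1. When the number of folds does not divide 4096, the workaround gives a different assignment than the raw ID would, but it remains deterministic.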
Attaches the trees so they can be read from
dataloader->AddSignalTree(sigTree, 1.0);
dataloader->AddBackgroundTree(bkgTree, 1.0);
DataSetInfo : [datasetcv] : Added class "Signal"
 : Add Tree of type Signal with 1000 events
DataSetInfo : [datasetcv] : Added class "Background"
 : Add Tree of type Background with 1000 events
The CV mechanism of TMVA splits up the training set into several folds. The test set is currently left unused. The option nTest_ClassName=1 assigns one event to the test set for each class and puts the rest in the training set. A value of 0 is a special value and would split the datasets 50 / 50.
dataloader->PrepareTrainingAndTestTree("", "",
"nTest_Signal=1"
":nTest_Background=1"
":SplitMode=Random"
":NormMode=NumEvents"
":!V");
This sets up a CrossValidation class (which wraps a TMVA::Factory internally) for 2-fold cross validation.
The split type can be "Random", "RandomStratified" or "Deterministic". For the last option, check the comment below. Random splitting randomises the order of events and distributes events as evenly as possible. RandomStratified applies the same logic but distributes events within a class as evenly as possible over the folds.
UInt_t numFolds = 2;
TString analysisType = "Classification";
TString splitType = (useRandomSplitting) ? "Random" : "Deterministic";
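The difference between the two random split types described above can be sketched without TMVA. The code below illustrates the idea only, assuming that events are shuffled and then dealt out round-robin. It isn't TMVA's actual implementation, and `shuffledOrder`, `assignRandom` and `assignStratified` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Illustration only (not TMVA's implementation).
// labels[i] is the class of event i (e.g. 0 = background, 1 = signal).
// Each function returns fold[i], the fold index assigned to event i.

// Event indices 0..n-1 in a seeded random order.
std::vector<std::size_t> shuffledOrder(std::size_t n, unsigned seed)
{
   std::vector<std::size_t> order(n);
   std::iota(order.begin(), order.end(), 0);
   std::mt19937 rng(seed);
   std::shuffle(order.begin(), order.end(), rng);
   return order;
}

// "Random"-like: shuffle all events together, then deal them round-robin.
std::vector<int> assignRandom(const std::vector<int> &labels, int numFolds, unsigned seed)
{
   assert(numFolds > 0);
   const std::vector<std::size_t> order = shuffledOrder(labels.size(), seed);
   std::vector<int> fold(labels.size());
   for (std::size_t k = 0; k < order.size(); ++k)
      fold[order[k]] = static_cast<int>(k % numFolds);
   return fold;
}

// "RandomStratified"-like: the same dealing, restarted for every class.
std::vector<int> assignStratified(const std::vector<int> &labels, int numFolds, unsigned seed)
{
   assert(numFolds > 0);
   std::vector<std::size_t> order = shuffledOrder(labels.size(), seed);
   // Group by class while keeping the shuffled order inside each class.
   std::stable_sort(order.begin(), order.end(),
                    [&labels](std::size_t a, std::size_t b) { return labels[a] < labels[b]; });
   std::vector<int> fold(labels.size());
   std::size_t posInClass = 0;
   for (std::size_t k = 0; k < order.size(); ++k) {
      if (k > 0 && labels[order[k]] != labels[order[k - 1]])
         posInClass = 0; // restart the round-robin for the next class
      fold[order[k]] = static_cast<int>(posInClass % numFolds);
      ++posInClass;
   }
   return fold;
}
```

With 6 signal and 4 background events and two folds, `assignStratified` always puts 3 signal and 2 background events in each fold. `assignRandom` only guarantees 5 events per fold, with a class mix that depends on the shuffle.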
One can also use a custom splitting function for producing the folds. The example uses a dataset spectator eventID.
The idea here is that eventID should be an event number that is integral, random and independent of the data, generated only once. This last property ensures that if a calibration is changed the same event will still be assigned the same fold.
This makes it possible to use the cross-validated classifiers in application, a technique that can simplify statistical analysis.
If you want to run TMVACrossValidationApplication, make sure you have run this tutorial with the Deterministic splitting type, i.e. with the option useRandomSplitting = false.
TString splitExpr = (!useRandomSplitting) ? "int(fabs([eventID]))%int([NumFolds])" : "";
TString cvOptions = Form("!V"
":!Silent"
":ModelPersistence"
":AnalysisType=%s"
":SplitType=%s"
":NumFolds=%i"
":SplitExpr=%s",
analysisType.Data(), splitType.Data(), numFolds,
splitExpr.Data());
TMVA::CrossValidation cv{"TMVACrossValidation", dataloader, outputFile, cvOptions};
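To make the assembled option string easier to inspect, here is a plain C++ stand-in for the `Form(...)` call above. It is only a sketch: `makeCvOptions` is a hypothetical helper and relies on nothing from ROOT.

```cpp
#include <string>

// Hypothetical helper (not part of ROOT/TMVA): builds the same
// colon-separated option string as the Form(...) call above.
std::string makeCvOptions(const std::string &analysisType, const std::string &splitType,
                          unsigned numFolds, const std::string &splitExpr)
{
   std::string options = "!V:!Silent:ModelPersistence";
   options += ":AnalysisType=" + analysisType;
   options += ":SplitType=" + splitType;
   options += ":NumFolds=" + std::to_string(numFolds);
   options += ":SplitExpr=" + splitExpr;
   return options;
}
```

For the settings used in this tutorial (Classification, Deterministic, 2 folds), the result is `!V:!Silent:ModelPersistence:AnalysisType=Classification:SplitType=Deterministic:NumFolds=2:SplitExpr=int(fabs([eventID]))%int([NumFolds])`.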
<HEADER> Factory : You are running ROOT Version: 6.33/01, Oct 10, 2023
 : TMVA Version 4.2.1, Feb 5, 2015
Books a method to use for evaluation
cv.BookMethod(TMVA::Types::kBDT, "BDTG",
"!H:!V:NTrees=100:MinNodeSize=2.5%:BoostType=Grad"
":NegWeightTreatment=Pray:Shrinkage=0.10:nCuts=20"
":MaxDepth=2");
cv.BookMethod(TMVA::Types::kFisher, "Fisher",
"!H:!V:Fisher:VarTransform=None");
Train, test and evaluate the booked methods. Evaluates the booked methods once for each fold and aggregates the result in the specified output file.
cv.Evaluate();
: Rebuilding Dataset datasetcv : Building event vectors for type 2 Signal : Dataset[datasetcv] : create input formulas for tree : Building event vectors for type 2 Background : Dataset[datasetcv] : create input formulas for tree <HEADER> DataSetFactory : [datasetcv] : Number of events in input trees : : : Number of training and testing events : --------------------------------------------------------------------------- : Signal -- training events : 999 : Signal -- testing events : 1 : Signal -- training and testing events: 1000 : Background -- training events : 999 : Background -- testing events : 1 : Background -- training and testing events: 1000 : <HEADER> DataSetInfo : Correlation matrix (Signal): : ------------------------ : x y : x: +1.000 +0.075 : y: +0.075 +1.000 : ------------------------ <HEADER> DataSetInfo : Correlation matrix (Background): : ------------------------ : x y : x: +1.000 +0.020 : y: +0.020 +1.000 : ------------------------ <HEADER> DataSetFactory : [datasetcv] : : : : : ======================================== : Processing folds for method BDTG : ======================================== : <HEADER> Factory : Booking method: BDTG_fold1 : <HEADER> BDTG_fold1 : #events: (reweighted) sig: 500 bkg: 500 : #events: (unweighted) sig: 500 bkg: 500 : Training 100 Decision Trees ... 
patience please : Elapsed time for training with 1000 events: 0.047 sec <HEADER> BDTG_fold1 : [datasetcv] : Evaluation of BDTG_fold1 on training sample (1000 events) : Elapsed time for evaluation of 1000 events: 0.00346 sec : Creating xml weight file: datasetcv/weights/TMVACrossValidation_BDTG_fold1.weights.xml : Creating standalone class: datasetcv/weights/TMVACrossValidation_BDTG_fold1.class.C <HEADER> Factory : Test all methods <HEADER> Factory : Test method: BDTG_fold1 for Classification performance : <HEADER> BDTG_fold1 : [datasetcv] : Evaluation of BDTG_fold1 on testing sample (998 events) : Elapsed time for evaluation of 998 events: 0.0034 sec <HEADER> Factory : Evaluate all methods <HEADER> Factory : Evaluate classifier: BDTG_fold1 : <HEADER> BDTG_fold1 : [datasetcv] : Loop over test events and fill histograms with classifier response... : : : Evaluation results ranked by best signal efficiency and purity (area) : ------------------------------------------------------------------------------------------------------------------- : DataSet MVA : Name: Method: ROC-integ : datasetcv BDTG_fold1 : 0.973 : ------------------------------------------------------------------------------------------------------------------- : : Testing efficiency compared to training efficiency (overtraining check) : ------------------------------------------------------------------------------------------------------------------- : DataSet MVA Signal efficiency: from test sample (from training sample) : Name: Method: @B=0.01 @B=0.10 @B=0.30 : ------------------------------------------------------------------------------------------------------------------- : datasetcv BDTG_fold1 : 0.575 (0.725) 0.947 (0.933) 0.981 (0.980) : ------------------------------------------------------------------------------------------------------------------- : <HEADER> Factory : Thank you for using TMVA! 
: For citation information, please visit: http://tmva.sf.net/citeTMVA.html <HEADER> Factory : Booking method: BDTG_fold2 : <HEADER> BDTG_fold2 : #events: (reweighted) sig: 499 bkg: 499 : #events: (unweighted) sig: 499 bkg: 499 : Training 100 Decision Trees ... patience please : Elapsed time for training with 998 events: 0.0457 sec <HEADER> BDTG_fold2 : [datasetcv] : Evaluation of BDTG_fold2 on training sample (998 events) : Elapsed time for evaluation of 998 events: 0.00348 sec : Creating xml weight file: datasetcv/weights/TMVACrossValidation_BDTG_fold2.weights.xml : Creating standalone class: datasetcv/weights/TMVACrossValidation_BDTG_fold2.class.C <HEADER> Factory : Test all methods <HEADER> Factory : Test method: BDTG_fold2 for Classification performance : <HEADER> BDTG_fold2 : [datasetcv] : Evaluation of BDTG_fold2 on testing sample (1000 events) : Elapsed time for evaluation of 1000 events: 0.00413 sec <HEADER> Factory : Evaluate all methods <HEADER> Factory : Evaluate classifier: BDTG_fold2 : <HEADER> BDTG_fold2 : [datasetcv] : Loop over test events and fill histograms with classifier response... 
: : : Evaluation results ranked by best signal efficiency and purity (area) : ------------------------------------------------------------------------------------------------------------------- : DataSet MVA : Name: Method: ROC-integ : datasetcv BDTG_fold2 : 0.961 : ------------------------------------------------------------------------------------------------------------------- : : Testing efficiency compared to training efficiency (overtraining check) : ------------------------------------------------------------------------------------------------------------------- : DataSet MVA Signal efficiency: from test sample (from training sample) : Name: Method: @B=0.01 @B=0.10 @B=0.30 : ------------------------------------------------------------------------------------------------------------------- : datasetcv BDTG_fold2 : 0.646 (0.696) 0.868 (0.930) 0.975 (0.976) : ------------------------------------------------------------------------------------------------------------------- : <HEADER> Factory : Thank you for using TMVA! 
: For citation information, please visit: http://tmva.sf.net/citeTMVA.html <HEADER> Factory : Booking method: BDTG : : Reading weightfile: datasetcv/weights/TMVACrossValidation_BDTG_fold1.weights.xml : Reading weight file: datasetcv/weights/TMVACrossValidation_BDTG_fold1.weights.xml : Reading weightfile: datasetcv/weights/TMVACrossValidation_BDTG_fold2.weights.xml : Reading weight file: datasetcv/weights/TMVACrossValidation_BDTG_fold2.weights.xml : : : ======================================== : Processing folds for method Fisher : ======================================== : <HEADER> Factory : Booking method: Fisher_fold1 : <HEADER> Fisher_fold1 : Results for Fisher coefficients: : ----------------------- : Variable: Coefficient: : ----------------------- : x: +0.449 : y: +0.436 : (offset): +0.019 : ----------------------- : Elapsed time for training with 1000 events: 0.000469 sec <HEADER> Fisher_fold1 : [datasetcv] : Evaluation of Fisher_fold1 on training sample (1000 events) : Elapsed time for evaluation of 1000 events: 7.2e-05 sec : Creating xml weight file: datasetcv/weights/TMVACrossValidation_Fisher_fold1.weights.xml : Creating standalone class: datasetcv/weights/TMVACrossValidation_Fisher_fold1.class.C <HEADER> Factory : Test all methods <HEADER> Factory : Test method: Fisher_fold1 for Classification performance : <HEADER> Fisher_fold1 : [datasetcv] : Evaluation of Fisher_fold1 on testing sample (998 events) : Elapsed time for evaluation of 998 events: 0.000205 sec <HEADER> Factory : Evaluate all methods <HEADER> Factory : Evaluate classifier: Fisher_fold1 : <HEADER> Fisher_fold1 : [datasetcv] : Loop over test events and fill histograms with classifier response... 
: : : Evaluation results ranked by best signal efficiency and purity (area) : ------------------------------------------------------------------------------------------------------------------- : DataSet MVA : Name: Method: ROC-integ : datasetcv Fisher_fold1 : 0.976 : ------------------------------------------------------------------------------------------------------------------- : : Testing efficiency compared to training efficiency (overtraining check) : ------------------------------------------------------------------------------------------------------------------- : DataSet MVA Signal efficiency: from test sample (from training sample) : Name: Method: @B=0.01 @B=0.10 @B=0.30 : ------------------------------------------------------------------------------------------------------------------- : datasetcv Fisher_fold1 : 0.660 (0.665) 0.952 (0.923) 0.986 (0.985) : ------------------------------------------------------------------------------------------------------------------- : <HEADER> Factory : Thank you for using TMVA! 
: For citation information, please visit: http://tmva.sf.net/citeTMVA.html <HEADER> Factory : Booking method: Fisher_fold2 : <HEADER> Fisher_fold2 : Results for Fisher coefficients: : ----------------------- : Variable: Coefficient: : ----------------------- : x: +0.501 : y: +0.467 : (offset): -0.000 : ----------------------- : Elapsed time for training with 998 events: 0.000294 sec <HEADER> Fisher_fold2 : [datasetcv] : Evaluation of Fisher_fold2 on training sample (998 events) : Elapsed time for evaluation of 998 events: 8.42e-05 sec : Creating xml weight file: datasetcv/weights/TMVACrossValidation_Fisher_fold2.weights.xml : Creating standalone class: datasetcv/weights/TMVACrossValidation_Fisher_fold2.class.C <HEADER> Factory : Test all methods <HEADER> Factory : Test method: Fisher_fold2 for Classification performance : <HEADER> Fisher_fold2 : [datasetcv] : Evaluation of Fisher_fold2 on testing sample (1000 events) : Elapsed time for evaluation of 1000 events: 0.000149 sec <HEADER> Factory : Evaluate all methods <HEADER> Factory : Evaluate classifier: Fisher_fold2 : <HEADER> Fisher_fold2 : [datasetcv] : Loop over test events and fill histograms with classifier response... 
: : : Evaluation results ranked by best signal efficiency and purity (area) : ------------------------------------------------------------------------------------------------------------------- : DataSet MVA : Name: Method: ROC-integ : datasetcv Fisher_fold2 : 0.966 : ------------------------------------------------------------------------------------------------------------------- : : Testing efficiency compared to training efficiency (overtraining check) : ------------------------------------------------------------------------------------------------------------------- : DataSet MVA Signal efficiency: from test sample (from training sample) : Name: Method: @B=0.01 @B=0.10 @B=0.30 : ------------------------------------------------------------------------------------------------------------------- : datasetcv Fisher_fold2 : 0.655 (0.645) 0.900 (0.928) 0.975 (0.977) : ------------------------------------------------------------------------------------------------------------------- : <HEADER> Factory : Thank you for using TMVA! : For citation information, please visit: http://tmva.sf.net/citeTMVA.html <HEADER> Factory : Booking method: Fisher : : Reading weightfile: datasetcv/weights/TMVACrossValidation_Fisher_fold1.weights.xml : Reading weight file: datasetcv/weights/TMVACrossValidation_Fisher_fold1.weights.xml : Reading weightfile: datasetcv/weights/TMVACrossValidation_Fisher_fold2.weights.xml : Reading weight file: datasetcv/weights/TMVACrossValidation_Fisher_fold2.weights.xml : : : ======================================== : Folds processed for all methods, evaluating. : ======================================== : <HEADER> Factory : [datasetcv] : Create Transformation "I" with events from all classes. 
: <HEADER> : Transformation, Variable selection : : Input : variable 'x' <---> Output : variable 'x' : Input : variable 'y' <---> Output : variable 'y' <HEADER> TFHandler_Factory : Variable Mean RMS [ Min Max ] : ----------------------------------------------------------- : x: -0.014284 1.4061 [ -4.1075 4.0969 ] : y: -0.0066370 1.4204 [ -4.8520 4.0761 ] : ----------------------------------------------------------- : Ranking input variables (method unspecific)... <HEADER> IdTransformation : Ranking result (top variable is best ranked) : -------------------------- : Rank : Variable : Separation : -------------------------- : 1 : x : 5.429e-01 : 2 : y : 5.230e-01 : -------------------------- : Elapsed time for training with 1998 events: 6.2e-06 sec <HEADER> BDTG : [datasetcv] : Evaluation of BDTG on training sample (1998 events) : Elapsed time for evaluation of 1998 events: 0.00656 sec : Creating xml weight file: datasetcv/weights/TMVACrossValidation_BDTG.weights.xml : Creating standalone class: datasetcv/weights/TMVACrossValidation_BDTG.class.C <WARNING> <WARNING> : MakeClassSpecificHeader not implemented for CrossValidation <WARNING> <WARNING> : MakeClassSpecific not implemented for CrossValidation : Elapsed time for training with 1998 events: 4.05e-06 sec <HEADER> Fisher : [datasetcv] : Evaluation of Fisher on training sample (1998 events) : Elapsed time for evaluation of 1998 events: 0.000371 sec : Creating xml weight file: datasetcv/weights/TMVACrossValidation_Fisher.weights.xml : Creating standalone class: datasetcv/weights/TMVACrossValidation_Fisher.class.C <WARNING> <WARNING> : MakeClassSpecificHeader not implemented for CrossValidation <WARNING> <WARNING> : MakeClassSpecific not implemented for CrossValidation <HEADER> Factory : Test all methods <HEADER> Factory : Test method: BDTG for Classification performance : <HEADER> BDTG : [datasetcv] : Evaluation of BDTG on testing sample (1998 events) : Elapsed time for evaluation of 1998 events: 0.00627 sec <HEADER> 
Factory : Test method: Fisher for Classification performance : <HEADER> Fisher : [datasetcv] : Evaluation of Fisher on testing sample (1998 events) : Elapsed time for evaluation of 1998 events: 0.000345 sec <HEADER> Factory : Evaluate all methods <HEADER> Factory : Evaluate classifier: BDTG : <HEADER> BDTG : [datasetcv] : Loop over test events and fill histograms with classifier response... : <HEADER> TFHandler_BDTG : Variable Mean RMS [ Min Max ] : ----------------------------------------------------------- : x: -0.014284 1.4061 [ -4.1075 4.0969 ] : y: -0.0066370 1.4204 [ -4.8520 4.0761 ] : ----------------------------------------------------------- <HEADER> Factory : Evaluate classifier: Fisher : <HEADER> Fisher : [datasetcv] : Loop over test events and fill histograms with classifier response... : <HEADER> TFHandler_Fisher : Variable Mean RMS [ Min Max ] : ----------------------------------------------------------- : x: -0.014284 1.4061 [ -4.1075 4.0969 ] : y: -0.0066370 1.4204 [ -4.8520 4.0761 ] : ----------------------------------------------------------- : : Evaluation results ranked by best signal efficiency and purity (area) : ------------------------------------------------------------------------------------------------------------------- : DataSet MVA : Name: Method: ROC-integ : datasetcv Fisher : 0.971 : datasetcv BDTG : 0.965 : ------------------------------------------------------------------------------------------------------------------- : : Testing efficiency compared to training efficiency (overtraining check) : ------------------------------------------------------------------------------------------------------------------- : DataSet MVA Signal efficiency: from test sample (from training sample) : Name: Method: @B=0.01 @B=0.10 @B=0.30 : ------------------------------------------------------------------------------------------------------------------- : datasetcv Fisher : 0.665 (0.665) 0.922 (0.922) 0.980 (0.980) : datasetcv BDTG : 0.617 (0.617) 
0.914 (0.914) 0.974 (0.974) : ------------------------------------------------------------------------------------------------------------------- : <HEADER> Dataset:datasetcv : Created tree 'TestTree' with 1998 events : <HEADER> Dataset:datasetcv : Created tree 'TrainTree' with 1998 events : <HEADER> Factory : Thank you for using TMVA! : For citation information, please visit: http://tmva.sf.net/citeTMVA.html : Evaluation done.
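The per-fold pattern visible in the log (each fold's classifier is trained on the events outside the fold and tested on the events inside it) and the routing used later in application can be sketched as follows. This illustrates the idea under simplifying assumptions and isn't TMVA code. `FoldSplit`, `splitForFold` and `classifierFor` are hypothetical names, and the fold indices are assumed to come from a split expression.

```cpp
#include <cstddef>
#include <vector>

// Illustration only (not TMVA code). fold[i] is the fold index of event i.
struct FoldSplit {
   std::vector<std::size_t> train; // events used to train the fold's classifier
   std::vector<std::size_t> test;  // events the fold's classifier is evaluated on
};

// For fold k: test on the events assigned to k, train on all the others.
FoldSplit splitForFold(const std::vector<int> &fold, int k)
{
   FoldSplit s;
   for (std::size_t i = 0; i < fold.size(); ++i) {
      if (fold[i] == k)
         s.test.push_back(i);
      else
         s.train.push_back(i);
   }
   return s;
}

// In application, an event is scored by the classifier of its own fold,
// i.e. the one classifier that did not see it during training.
int classifierFor(const std::vector<int> &fold, std::size_t event)
{
   return fold[event];
}
```

Because the fold index is recomputed from the event itself, an event that was part of the training data is never scored by a classifier that was trained on it.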
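Process some output programmatically, printing the ROC integral and the 30% efficiency value of each fold for every booked method.
size_t iMethod = 0;
for (auto && result : cv.GetResults()) {
   std::cout << "Summary for method " << cv.GetMethods()[iMethod++].GetValue<TString>("MethodName")
             << std::endl;
   for (UInt_t iFold = 0; iFold<cv.GetNumFolds(); ++iFold) {
      // Note: the values from GetEff30Values() match the test-sample signal
      // efficiencies in the "@B=0.30" column of the log above, i.e. they are
      // signal efficiencies at 30% background efficiency, despite the label.
      std::cout << "\tFold " << iFold << ": "
                << "ROC int: " << result.GetROCValues()[iFold]
                << ", "
                << "BkgEff@SigEff=0.3: " << result.GetEff30Values()[iFold]
                << std::endl;
   }
}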
Summary for method BDT
	Fold 0: ROC int: 0.972504, BkgEff@SigEff=0.3: 0.981
	Fold 1: ROC int: 0.96115, BkgEff@SigEff=0.3: 0.975
Summary for method Fisher
	Fold 0: ROC int: 0.976137, BkgEff@SigEff=0.3: 0.986
	Fold 1: ROC int: 0.96584, BkgEff@SigEff=0.3: 0.975
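Per-fold numbers like these are often condensed into a mean and a spread. A minimal sketch in plain C++ follows. `mean` and `stddev` are hypothetical helpers written for this illustration, and the population form of the standard deviation is an arbitrary choice.

```cpp
#include <cmath>
#include <vector>

// Mean of the per-fold values (assumes a non-empty vector).
double mean(const std::vector<double> &values)
{
   double sum = 0.0;
   for (double v : values)
      sum += v;
   return sum / values.size();
}

// Population standard deviation of the per-fold values (assumes non-empty).
double stddev(const std::vector<double> &values)
{
   const double m = mean(values);
   double squares = 0.0;
   for (double v : values)
      squares += (v - m) * (v - m);
   return std::sqrt(squares / values.size());
}
```

For the two BDT folds printed above (0.972504 and 0.96115), this gives a mean ROC integral of about 0.9668 and a spread of about 0.0057.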
Save the output
outputFile->Close();
std::cout << "==> Wrote root file: " << outputFile->GetName() << std::endl;
std::cout << "==> TMVACrossValidation is done!" << std::endl;
==> Wrote root file: TMVACV.root
==> TMVACrossValidation is done!
Launch the GUI for the root macros
if (!gROOT->IsBatch()) {
   // Draw cv-specific graphs. Results are stored in booking order:
   // index 0 is BDTG, index 1 is Fisher.
   cv.GetResults()[0].DrawAvgROCCurve(kTRUE, "Avg ROC for BDTG");
   cv.GetResults()[1].DrawAvgROCCurve(kTRUE, "Avg ROC for Fisher");
   // You can also use the classical gui
   TMVA::TMVAGui(outfileName);
}
return 0;