87944 - STATISTICAL DATA ANALYSIS FOR NUCLEAR AND SUBNUCLEAR PHYSICS [Module 3]
Gabriele Sirri Istituto Nazionale di Fisica Nucleare
1
87944 - STATISTICAL DATA ANALYSIS FOR NUCLEAR AND SUBNUCLEAR PHYSICS - - PowerPoint PPT Presentation
87944 - STATISTICAL DATA ANALYSIS FOR NUCLEAR AND SUBNUCLEAR PHYSICS [Module 3] Gabriele Sirri Istituto Nazionale di Fisica Nucleare 1 RECAP We talk about Data Modelling (a scientific narrative) Conceptual building blocks for
1
2
– Data Modelling (a scientific narrative…) – Conceptual building blocks for modeling in High En. Phys.
– Terminology: Ensamble, Experiment, Channel, Observable, Parameters of Interest, Nuisance, Distribution, ... – The Marked Poisson Model – Extended Likelihood, ..
– Modeling with RooFit – RooFit Workspace
– Quick start guide (20 pages) (Includes Workspace & Factory – Users Manual – RooFit macros
– difference between variables, observables, parameters – difference between binned and unbinned datasets, … – sometimes we asked something like which Roofit class do you use for variables, observables, parameters, datasets
3
4
http://www.isocpp.org
5
Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 6
type of dataset ROOT fit after booking and filling an histogram (binned data). ROOFIT fit using the whole dataset (unbinned data) . Fit method (default) ROOT Least Squares, 𝑦2 ROOFIT maximum likelihood, − log ℒ can be done also with a binned dataset Error bars ROOT by default gaussian approximation √𝑜 ROOFIT exact limits based on Poisson distribution (asymmetric bars when 𝑜 is small !!) Axis Label ROOT left to user …. ROOFIT X taken from the variable definition Y is automatically: ex Events / BIN SIZE !!!!
Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 8
void rooofit_ex1() { // Setup model // ============ // Declare variables x,mean,sigma with associated name, title, initial value and allowed range RooRealVar x("x","x",-10,10) ; RooRealVar mean("mean","mean of gaussian",1,-10,10) ; RooRealVar sigma("sigma","width of gaussian",1,0.1,10) ; // Build gaussian p.d.f in terms of x,mean and sigma RooGaussian gauss("gauss","gaussian PDF",x,mean,sigma) ; // Construct plot frame in 'x' RooPlot* xframe = x.frame(Title("Gaussian p.d.f.")) ; // Plot model and change parameter values // ======================================= // Plot gauss in frame (i.e. in x) gauss.plotOn(xframe) ; // Change the value of sigma to 3 sigma.setVal(3) ; // Plot gauss in frame (i.e. in x) and draw frame on canvas gauss.plotOn(xframe,LineColor(kRed)) ; // Generate events // =============== // Generate a dataset of 1000 events in x from gauss RooDataSet* data = gauss.generate(x,1000) ; // Make a second plot frame in x and draw both the // data and the p.d.f in the frame RooPlot* xframe2 = x.frame(Title("Gaussian p.d.f. with data")) ; data->plotOn(xframe2) ; gauss.plotOn(xframe2) ; // Fit model to data // ================== // Fit pdf to data gauss.fitTo(*data) ; // Print values of mean and sigma (that now reflect fitted values and errors) mean.Print() ; sigma.Print() ; // Draw all frames on a canvas TCanvas* c = new TCanvas("rf101_basics","rf101_basics",800,400) ; c->Divide(2) ; c->cd(1) ; gPad->SetLeftMargin(0.15) ; xframe->GetYaxis()->SetTitleOffset(1.6) ; xframe->Draw() ; c->cd(2) ; gPad->SetLeftMargin(0.15) ; xframe2->GetYaxis()->SetTitleOffset(1.6) ; xframe2->Draw() ; } void recap_root() { // define a gaussian p.d.f (mean 1 and sigma 1) // gaus(0) is a substitute for [0]*exp(-0.5*((x-[1])/[2])**2) // and (0) means start numbering parameters at 0. auto g1 = new TF1{"gaus_1", "gaus(0)", -10, 10 }; double mean = 1 ; double sigma = 1 ; g1->SetParameter( 0, 1./( sigma * sqrt(2* 3.1415926 ) ) ); g1->SetParameter( 1, mean ) ; // set mean g1->SetParameter( 2, sigma) ; // set sigma g1->SetLineColor( kBlue ) ; // change sigma to 3 TF1* g2 = new TF1{"gaus_2", "gaus(0)", -10, 10 }; mean = 1 ; sigma = 3 ; g2->SetParameter( 0, 1./( sigma * sqrt(2* 3.1415926 ) ) ); g2->SetParameter( 1, mean ) ; // set mean g2->SetParameter( 2, sigma) ; // set sigma g2->SetLineColor( kRed ) ; // Book histograms auto h_data = new TH1D{"h_data", "gaussian random numbers", 100, -10, 10}; // Create a TRandom3 object to generate random numbers auto seed = 12345; TRandom3 ran{seed}; // Generate some random numbers and fill histograms auto const nvalues = 1000; for (int i=0; i<nvalues; i++){ auto x = ran->Gaus(1,3); // gaussian in mean = 1 sigma = 3 h_data->Fill(x); } h_data->SetXTitle("x"); h_data->SetYTitle("f(x)"); h_data->SetMarkerStyle(20); h_data->Fit("gaus") ; // Draw the canvas c = new TCanvas("c1","c1",800,400) ; c->Divide(2,1) ; c->cd(1); g1->Draw() ; g2->Draw("SAME") ; c->cd(2); h_data->Draw("E1"); }
ROOT ROOFIT
– Composite Models – Common Fitting Errors:
– Composite Models
– Common Fitting Errors
http://lmu.web.psi.ch/docu/manuals/software_manuals/minuit2/mnerror.pdf
(just examples and not exhaustive): – The p.d.f. of a generic extended (signal+background) composite model – The 4 MINUIT algorithms – The difference MINOS vs MIGRAD/HESSE – The theory behind the error computation by HESSE – MINOS and Profile Likelihood Interval – How to recognize/Which strategy to mitigate fit stabilities problems
10
11
ROOT has a modern tool to manipulate and analyze datasets http://root.cern
RDataFrame offers a high level interface for analyses of data stored in TTrees, CSV files and other data formats. In addition, multi-threading and other low-level optimisations allow users to exploit all the resources available on their machines transparently. In a nutshell:
ROOT::EnableImplicitMT(); // Tell ROOT you want to go parallel ROOT::RDataFrame d("myTree", "file_*.root"); // Interface to TTree and CSV auto d2 = d.Filter(“x>0") // only accept events for which x>0 .Define(“r2”,”x*x+y*y”); // define a custom variable r2 auto myHisto = d2.Histo1D(“r2"); // This happens in parallel! myHisto->Draw();
Lazy execution guarantees that all operations are performed in one event loop less typing, better expressivity, … in place of explicity for-loops on TTree events (SetBranchAddress(..), GetEntry(..), …) Good for histograms, profiles, data reductions, user-defined operations on events.
– Working with Likelihood and RooMinuit – TMVA
– TMVA web site: http://tmva.sourceforge.net – TMVA manual: TMVA - Toolkit for Multivariate Data Analysis , arXiv:physics/0703039v5 – TMVA workshop, look at “My tips and tricks” indico.cern.ch/contributionDisplay.py?contribId=5&confId=112879 – TMVA macros
– Definition of (signal|background) efficiency. – Definition of background rejection power. – Definition of Signal purity – The ROC curve. How to build a ROC curve. – We trained some classificator for the search of a signal excess over some background. We have the ROC curve. Is it sufficient to define the best cut?
12
– Incorporating systematics . The “on/off” problem – RooStats
– slides shown during the lecture, then: – ROOSTATS : https://twiki.cern.ch/twiki/bin/view/RooStats – short tutorial : https://twiki.cern.ch/twiki/bin/view/RooStats/RooStatsTutorialsAugust2012 – RooStatsTutorial_120323.pdf https://indico.desy.de/getFile.py/access?contribId=15&resId=3&materialId=sli des&confId=5065
– The ON/OFF problem. Make an example (i.e. a counting model). – Profile Likelihood Ratio as test statistics. – How is distributed the test statistics if H0 is true/false.
13
14
What about binomial ? http://root.cern
15
Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 16
Likelihood Contour at 2 Δlog 𝑀 = 2.3, 6.2,11.8 ( 𝜏 = 1,2,3 )
Parameter Parameter
Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 17
“If we want to deal with models with more than two parameters, or if we want to analyze a single parameter at a time, we have to find a way to isolate the effects of one or more parameters while still accounting for the rest. A simple, but usually wrong, way of doing this is to calculate a likelihood slice, fixing the values of all but one parameter (usually at their maximum likelihood estimates) and then calculating the likelihood for a range of values of the focal parameter.”
[ms.mcmaster.ca/~bolker/emdbook/chap6A.pdf]
Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 18
“Slices can be useful for visualizing the geometry of a many-parameter likelihood surface near its minimum, but they are statistically misleading because they don’t allow the other parameters to vary and thus they don’t show the minimum negative log- likelihood achievable for a particular value of the focal parameter.”
[ms.mcmaster.ca/~bolker/emdbook/chap6A.pdf]
Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 19
Profile Likelihood Ratio 𝜇 𝜈 = 𝑀(𝜈, 𝜄) 𝑀( Ƹ 𝜈, 𝜄)
massimizza 𝑀 per uno specifico 𝜈 massimizzano 𝑀 è il punto nello spazio dei parametri che corrisponde al MLE
𝜇 𝜏
2 =
𝑀 𝜏
2, መ
መ 𝑔 𝑀( ෞ 𝜏
2 ,
መ 𝑔)
Negative Log Profile Likelihood 𝑄𝑀𝑀 𝜈 = − log 𝑀 𝜈, 𝜄 + log 𝑀( Ƹ 𝜈, 𝜄)
𝜇 𝑔 = 𝑀 𝑔, ෞ 𝜏
2
𝑀( ෞ 𝜏
2 ,
መ 𝑔)
𝜇 𝜈 = 𝑀(𝜈, 𝜄) 𝑀( Ƹ 𝜈, 𝜄)
È una costante e corrisponde al massimo di 𝑀 su tutto lo spazio dei parametri pll ->plotOn(frame,ShiftToZero()) equivale a spostare in basso la Negative Log Likelihood fino a mettere il minimo a zero. pll = nll.createProfile(frac);
Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 20
Profile Likelihood Ratio Minut Contour at 2 Δlog 𝑀 = 1.0 (𝜏 = 1)
Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 21
Slices are always steeper than profiles, because they don’t allow the other parameters to adjust to changes in the parameter of interest.
Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 22
“If we want to deal with models with more than two parameters, or if we want to analyze a single parameter at a time, we have to find a way to isolate the effects of one or more parameters while still accounting for the rest. A simple, but usually wrong, way of doing this is to calculate a likelihood slice, fixing the values of all but one parameter (usually at their maximum likelihood estimates) and then calculating the likelihood for a range of values of the focal
Slices can be useful for visualizing the geometry of a many-parameter likelihood surfacenear its minimum, but they are statistically misleading because they don’t allow the other parameters to vary and thus they don’t show the minimum negative log-likelihood achievable for a particular value of the focal parameter. […] ... Slices are always steeper than profiles, because they don’t allow the other parameters to adjust to changes in the focal parameter.” [ms.mcmaster.ca/~bolker/emdbook/chap6A.pdf]