87944 - STATISTICAL DATA ANALYSIS FOR NUCLEAR AND SUBNUCLEAR PHYSICS - - PowerPoint PPT Presentation

87944 statistical data analysis for nuclear and
SMART_READER_LITE
LIVE PREVIEW

87944 - STATISTICAL DATA ANALYSIS FOR NUCLEAR AND SUBNUCLEAR PHYSICS - - PowerPoint PPT Presentation

87944 - STATISTICAL DATA ANALYSIS FOR NUCLEAR AND SUBNUCLEAR PHYSICS [Module 3] Gabriele Sirri Istituto Nazionale di Fisica Nucleare 1 RECAP We talk about Data Modelling (a scientific narrative) Conceptual building blocks for


slide-1
SLIDE 1

87944 - STATISTICAL DATA ANALYSIS FOR NUCLEAR AND SUBNUCLEAR PHYSICS [Module 3]

Gabriele Sirri Istituto Nazionale di Fisica Nucleare

1

slide-2
SLIDE 2

2

RECAP

  • We talk about

– Data Modelling (a scientific narrative…) – Conceptual building blocks for modeling in High En. Phys.

  •  Documentation:

– K. Cranmer, «Practical Statistic for the LHC», https://arxiv.org/abs/1503.07622

  • “Introduction”
  • “Conceptual blocks for modeling”
  • Topics discussed during past examinations

(just examples and not exhaustive):

– Terminology: Ensamble, Experiment, Channel, Observable, Parameters of Interest, Nuisance, Distribution, ... – The Marked Poisson Model – Extended Likelihood, ..

slide-3
SLIDE 3
  • We talk about

– Modeling with RooFit – RooFit Workspace

  •  Documentation: http://root.cern.ch/drupal/content/roofit

– Quick start guide (20 pages) (Includes Workspace & Factory – Users Manual – RooFit macros

  • root.cern.ch → documentation → tutorials → roofit
  • There are over 80 macros illustrating many aspects of RooFit functionality
  • Topics discussed during past examinations

(just examples and not exhaustive):

– difference between variables, observables, parameters – difference between binned and unbinned datasets, … – sometimes we asked something like which Roofit class do you use for variables, observables, parameters, datasets

3

RECAP

slide-4
SLIDE 4

Teach yourself Modern C++

  • Stack-based scope instead of heap or static global scope.
  • Auto type inference instead of explicit type names.
  • Smart pointers instead of raw pointers.
  • std::string instead of raw char[] arrays.
  • Standard template library (STL) containers like vector, list, and map instead of

raw arrays or custom containers. See <vector>, <list>, and <map>.

  • STL algorithms instead of manually coded ones.
  • Inline lambda functions instead of small functions implemented separately.
  • Range-based for loops to write more robust loops that work with arrays, STL

containers in the form for ( for-range-declaration : expression ).

  • Exceptions, to report and handle error conditions.
  • Lock-free inter-thread communication using STL std::atomic<> (see <atomic>)

instead of other inter-thread communication mechanisms.

4

Tip of the week

http://www.isocpp.org

slide-5
SLIDE 5
  • Force yourself coding with a style:

ex: Google C++ Style Guide

https://google.github.io/styleguide/cppguide.html

– install clang-format or astyle

  • Set up a repository. Always. Also for small project.

– – Simple, powerful Git GUI: SourceTree

  • boost C++ libraries: http://www.boost.org/

– Lot of work already done for you !!!

5

Further about C++ Programming

slide-6
SLIDE 6

 RECAP… Tutorial

Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 6

[fit gaus model to unbinned dataset] Edit the macro roofit_empty.cpp and, following the comments inside, create a Gaussian p.d.f. with mean = 0, and sigma = 1. Change the sigma to 3. Visualize the p.d.f. . Generate an unbinned dataset of 10000 events. Make a Fit with Maximum

  • Likelihood. Visualize the results.

Tips: Use information from the slides shown during the lecture or from RooFit Manual at par 2 (Submit ex11.cxx and the image of the canvas in png format)

slide-7
SLIDE 7

ROOT RooFit

slide-8
SLIDE 8

type of dataset ROOT fit after booking and filling an histogram (binned data). ROOFIT fit using the whole dataset (unbinned data) . Fit method (default) ROOT Least Squares, 𝑦2 ROOFIT maximum likelihood, − log ℒ can be done also with a binned dataset Error bars ROOT by default gaussian approximation √𝑜 ROOFIT exact limits based on Poisson distribution (asymmetric bars when 𝑜 is small !!) Axis Label ROOT left to user …. ROOFIT X taken from the variable definition Y is automatically: ex Events / BIN SIZE !!!!

ROOFIT ROOT

 ROOT vs RooFit (2)

Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 8

slide-9
SLIDE 9

void rooofit_ex1() { // Setup model // ============ // Declare variables x,mean,sigma with associated name, title, initial value and allowed range RooRealVar x("x","x",-10,10) ; RooRealVar mean("mean","mean of gaussian",1,-10,10) ; RooRealVar sigma("sigma","width of gaussian",1,0.1,10) ; // Build gaussian p.d.f in terms of x,mean and sigma RooGaussian gauss("gauss","gaussian PDF",x,mean,sigma) ; // Construct plot frame in 'x' RooPlot* xframe = x.frame(Title("Gaussian p.d.f.")) ; // Plot model and change parameter values // ======================================= // Plot gauss in frame (i.e. in x) gauss.plotOn(xframe) ; // Change the value of sigma to 3 sigma.setVal(3) ; // Plot gauss in frame (i.e. in x) and draw frame on canvas gauss.plotOn(xframe,LineColor(kRed)) ; // Generate events // =============== // Generate a dataset of 1000 events in x from gauss RooDataSet* data = gauss.generate(x,1000) ; // Make a second plot frame in x and draw both the // data and the p.d.f in the frame RooPlot* xframe2 = x.frame(Title("Gaussian p.d.f. with data")) ; data->plotOn(xframe2) ; gauss.plotOn(xframe2) ; // Fit model to data // ================== // Fit pdf to data gauss.fitTo(*data) ; // Print values of mean and sigma (that now reflect fitted values and errors) mean.Print() ; sigma.Print() ; // Draw all frames on a canvas TCanvas* c = new TCanvas("rf101_basics","rf101_basics",800,400) ; c->Divide(2) ; c->cd(1) ; gPad->SetLeftMargin(0.15) ; xframe->GetYaxis()->SetTitleOffset(1.6) ; xframe->Draw() ; c->cd(2) ; gPad->SetLeftMargin(0.15) ; xframe2->GetYaxis()->SetTitleOffset(1.6) ; xframe2->Draw() ; } void recap_root() { // define a gaussian p.d.f (mean 1 and sigma 1) // gaus(0) is a substitute for [0]*exp(-0.5*((x-[1])/[2])**2) // and (0) means start numbering parameters at 0. auto g1 = new TF1{"gaus_1", "gaus(0)", -10, 10 }; double mean = 1 ; double sigma = 1 ; g1->SetParameter( 0, 1./( sigma * sqrt(2* 3.1415926 ) ) ); g1->SetParameter( 1, mean ) ; // set mean g1->SetParameter( 2, sigma) ; // set sigma g1->SetLineColor( kBlue ) ; // change sigma to 3 TF1* g2 = new TF1{"gaus_2", "gaus(0)", -10, 10 }; mean = 1 ; sigma = 3 ; g2->SetParameter( 0, 1./( sigma * sqrt(2* 3.1415926 ) ) ); g2->SetParameter( 1, mean ) ; // set mean g2->SetParameter( 2, sigma) ; // set sigma g2->SetLineColor( kRed ) ; // Book histograms auto h_data = new TH1D{"h_data", "gaussian random numbers", 100, -10, 10}; // Create a TRandom3 object to generate random numbers auto seed = 12345; TRandom3 ran{seed}; // Generate some random numbers and fill histograms auto const nvalues = 1000; for (int i=0; i<nvalues; i++){ auto x = ran->Gaus(1,3); // gaussian in mean = 1 sigma = 3 h_data->Fill(x); } h_data->SetXTitle("x"); h_data->SetYTitle("f(x)"); h_data->SetMarkerStyle(20); h_data->Fit("gaus") ; // Draw the canvas c = new TCanvas("c1","c1",800,400) ; c->Divide(2,1) ; c->cd(1); g1->Draw() ; g2->Draw("SAME") ; c->cd(2); h_data->Draw("E1"); }

ROOT ROOFIT

slide-10
SLIDE 10
  • We talk about

– Composite Models – Common Fitting Errors:

  • Understanding MINUIT output
  • Instabilities and correlation coefficients
  •  Documentation: http://root.cern.ch/drupal/content/roofit

– Composite Models

  • RooFit User Manual pp17-28

– Common Fitting Errors

  • Slides
  • F. James, Interpretation of the errors on parameters as given by MINUIT

http://lmu.web.psi.ch/docu/manuals/software_manuals/minuit2/mnerror.pdf

  • Topics discussed during past examinations

(just examples and not exhaustive): – The p.d.f. of a generic extended (signal+background) composite model – The 4 MINUIT algorithms – The difference MINOS vs MIGRAD/HESSE – The theory behind the error computation by HESSE – MINOS and Profile Likelihood Interval – How to recognize/Which strategy to mitigate fit stabilities problems

10

RECAP

slide-11
SLIDE 11

11

Tip of the week

ROOT has a modern tool to manipulate and analyze datasets http://root.cern

RDataFrame offers a high level interface for analyses of data stored in TTrees, CSV files and other data formats. In addition, multi-threading and other low-level optimisations allow users to exploit all the resources available on their machines transparently. In a nutshell:

ROOT::EnableImplicitMT(); // Tell ROOT you want to go parallel ROOT::RDataFrame d("myTree", "file_*.root"); // Interface to TTree and CSV auto d2 = d.Filter(“x>0") // only accept events for which x>0 .Define(“r2”,”x*x+y*y”); // define a custom variable r2 auto myHisto = d2.Histo1D(“r2"); // This happens in parallel! myHisto->Draw();

Lazy execution guarantees that all operations are performed in one event loop less typing, better expressivity, … in place of explicity for-loops on TTree events (SetBranchAddress(..), GetEntry(..), …) Good for histograms, profiles, data reductions, user-defined operations on events.

slide-12
SLIDE 12
  • We talk about

– Working with Likelihood and RooMinuit – TMVA

  •  Documentation: http://root.cern.ch/drupal/content/roofit

– TMVA web site: http://tmva.sourceforge.net – TMVA manual: TMVA - Toolkit for Multivariate Data Analysis , arXiv:physics/0703039v5 – TMVA workshop, look at “My tips and tricks” indico.cern.ch/contributionDisplay.py?contribId=5&confId=112879 – TMVA macros

  • root.cern.ch → documentation → tutorials → tmva
  • Topics discussed during past examinations

(just examples and not exhaustive):

– Definition of (signal|background) efficiency. – Definition of background rejection power. – Definition of Signal purity – The ROC curve. How to build a ROC curve. – We trained some classificator for the search of a signal excess over some background. We have the ROC curve. Is it sufficient to define the best cut?

12

RECAP

slide-13
SLIDE 13
  • We talk about

– Incorporating systematics . The “on/off” problem – RooStats

  • Documentation: http://root.cern.ch/drupal/content/roofit

– slides shown during the lecture, then: – ROOSTATS : https://twiki.cern.ch/twiki/bin/view/RooStats – short tutorial : https://twiki.cern.ch/twiki/bin/view/RooStats/RooStatsTutorialsAugust2012 – RooStatsTutorial_120323.pdf https://indico.desy.de/getFile.py/access?contribId=15&resId=3&materialId=sli des&confId=5065

  • Topics discussed during past examinations

(just examples and not exhaustive):

– The ON/OFF problem. Make an example (i.e. a counting model). – Profile Likelihood Ratio as test statistics. – How is distributed the test statistics if H0 is true/false.

13

RECAP (today)

slide-14
SLIDE 14

14

Tip of the week

What about binomial ? http://root.cern

TEfficiency https://root.cern.ch/doc/master/classTEfficiency.html

slide-15
SLIDE 15

Slice Likelihood vs Profile Likelihood

15

slide-16
SLIDE 16

Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 16

Likelihood Contour at 2 Δlog 𝑀 = 2.3, 6.2,11.8 ( 𝜏 = 1,2,3 )

Parameter Parameter

slide-17
SLIDE 17

Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 17

“If we want to deal with models with more than two parameters, or if we want to analyze a single parameter at a time, we have to find a way to isolate the effects of one or more parameters while still accounting for the rest. A simple, but usually wrong, way of doing this is to calculate a likelihood slice, fixing the values of all but one parameter (usually at their maximum likelihood estimates) and then calculating the likelihood for a range of values of the focal parameter.”

[ms.mcmaster.ca/~bolker/emdbook/chap6A.pdf]

likelihood slice

slide-18
SLIDE 18

Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 18

“Slices can be useful for visualizing the geometry of a many-parameter likelihood surface near its minimum, but they are statistically misleading because they don’t allow the other parameters to vary and thus they don’t show the minimum negative log- likelihood achievable for a particular value of the focal parameter.”

[ms.mcmaster.ca/~bolker/emdbook/chap6A.pdf]

likelihood slice

slide-19
SLIDE 19

Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 19

Profile Likelihood Ratio 𝜇 𝜈 = 𝑀(𝜈, ෠ ෠ 𝜄) 𝑀( Ƹ 𝜈, ෠ 𝜄)

massimizza 𝑀 per uno specifico 𝜈 massimizzano 𝑀 è il punto nello spazio dei parametri che corrisponde al MLE

𝜇 𝜏

𝑕2 =

𝑀 𝜏

𝑕2, መ

መ 𝑔 𝑀( ෞ 𝜏

𝑕2 ,

መ 𝑔)

Negative Log Profile Likelihood 𝑄𝑀𝑀 𝜈 = − log 𝑀 𝜈, ෠ ෠ 𝜄 + log 𝑀( Ƹ 𝜈, ෠ 𝜄)

𝜇 𝑔 = 𝑀 𝑔, ෢ ෞ 𝜏

𝑕2

𝑀( ෞ 𝜏

𝑕2 ,

መ 𝑔)

𝜇 𝜈 = 𝑀(𝜈, ෠ ෠ 𝜄) 𝑀( Ƹ 𝜈, ෠ 𝜄)

È una costante e corrisponde al massimo di 𝑀 su tutto lo spazio dei parametri pll ->plotOn(frame,ShiftToZero()) equivale a spostare in basso la Negative Log Likelihood fino a mettere il minimo a zero. pll = nll.createProfile(frac);

slide-20
SLIDE 20

Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 20

Profile Likelihood Ratio Minut Contour at 2 Δlog 𝑀 = 1.0 (𝜏 = 1)

slide-21
SLIDE 21

Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 21

Slices are always steeper than profiles, because they don’t allow the other parameters to adjust to changes in the parameter of interest.

slide-22
SLIDE 22

Laboratorio Analisi Statistica dei Dati per HEP - G.Sirri 22

“If we want to deal with models with more than two parameters, or if we want to analyze a single parameter at a time, we have to find a way to isolate the effects of one or more parameters while still accounting for the rest. A simple, but usually wrong, way of doing this is to calculate a likelihood slice, fixing the values of all but one parameter (usually at their maximum likelihood estimates) and then calculating the likelihood for a range of values of the focal

  • parameter. [...]

Slices can be useful for visualizing the geometry of a many-parameter likelihood surfacenear its minimum, but they are statistically misleading because they don’t allow the other parameters to vary and thus they don’t show the minimum negative log-likelihood achievable for a particular value of the focal parameter. […] ... Slices are always steeper than profiles, because they don’t allow the other parameters to adjust to changes in the focal parameter.” [ms.mcmaster.ca/~bolker/emdbook/chap6A.pdf]