Multivariate Data Analysis with TMVA
Andreas Hoecker (*) (CERN)
Statistical Tools Workshop, DESY, Germany, June 19, 2008
(*) On behalf of the present core team: A. Hoecker, P. Speckmayer, J. Stelzer, H. Voss
And the contributors: A. Christov, Or Cohen, Kamil Kraszewski, Krzysztof Danielowski, S. Henrot-Versillé, M. Jachowski, A. Krasznahorkay Jr., Maciej Kruk, Y. Mahalalel, R. Ospanov, X. Prudent, A. Robert, F. Tegenfeldt, K. Voss, M. Wolter, A. Zemla
See acknowledgments on page 43
On the web: http://tmva.sf.net/ (home), https://twiki.cern.ch/twiki/bin/view/TMVA/WebHome (tutorial)

Event Classification
Suppose a data sample contains two types of events: H_0 and H_1. We have found discriminating input variables x_1, x_2, … What decision boundary should we use to select events of type H_1? Rectangular cuts? A linear boundary? A nonlinear one?
[Figure: three panels in the (x_1, x_2) plane showing H_1 and H_0 events separated by rectangular cuts, by a linear boundary, and by a nonlinear boundary]
How can we decide this in an optimal way? Let the machine learn it!
Top Workshop, LPSC, Oct 18–20, 2007; DESY, June 19, 2008. A. Hoecker: Multivariate Data Analysis with TMVA

Multivariate Event Classification
All multivariate classifiers have in common that they condense (correlated) multi-variable input information into a single scalar output variable.
It is an R^n → R regression problem; classification is in fact a discretised regression: y(H_0) → 0, y(H_1) → 1.
Multivariate regression is also interesting! It is in the works for TMVA.
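To make the R^n → R picture concrete, here is a minimal sketch in Python. It is not TMVA code: the logistic form, the weights and the cut value y_cut are all hypothetical choices made for illustration. The function maps an n-dimensional event to a scalar y between 0 and 1, and a cut on y turns the regression output into a class decision.

```python
import math

def classifier_output(x, weights, bias):
    """Map an n-dimensional input x to a scalar y in (0, 1).

    The logistic form and the parameters are illustrative assumptions.
    """
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(y, y_cut=0.5):
    """Discretise the scalar output: 1 for H_1-like events, 0 for H_0-like."""
    return 1 if y > y_cut else 0

# Hypothetical parameters and two example events in (x_1, x_2).
weights, bias = [2.0, -1.0], 0.0
y_h1 = classifier_output([1.5, 0.2], weights, bias)   # close to 1
y_h0 = classifier_output([-1.0, 0.8], weights, bias)  # close to 0
```

Moving y_cut trades signal efficiency against background rejection, which is why the choice of cut is left to the analysis.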

TMVA

What is TMVA
ROOT is the analysis framework used by most (HEP) physicists.
Idea: rather than just implementing new MVA techniques and making them available in ROOT (i.e., as TMultiLayerPerceptron does):
- Have one common platform / interface for all MVA classifiers
- Have common data pre-processing capabilities
- Train and test all classifiers on the same data sample and evaluate them consistently
- Provide a common analysis (ROOT scripts) and application framework
- Provide access with and without ROOT, through macros, C++ executables or Python
Outline of this talk:
- The TMVA project
- Quick survey of available classifiers and processing steps
- Evaluation tools

TMVA Development and Distribution
TMVA is a SourceForge (SF) package for world-wide access:
- Home page: http://tmva.sf.net/
- SF project page: http://sf.net/projects/tmva
- View CVS: http://tmva.cvs.sf.net/tmva/TMVA/
- Mailing list: http://sf.net/mail/?group_id=152074
- Tutorial TWiki: https://twiki.cern.ch/twiki/bin/view/TMVA/WebHome
Active project → fast response time on feature requests
Currently 4 core developers and 16 active contributors
>2400 downloads since March 2006 (not counting CVS checkouts and ROOT users)
Written in C++, relying on core ROOT functionality
Integrated and distributed with ROOT since ROOT v5.11/03

The TMVA Classifiers
Currently implemented classifiers:
- Rectangular cut optimisation
- Projective and multidimensional likelihood estimator
- k-Nearest Neighbor algorithm
- Fisher and H-Matrix discriminants
- Function discriminant
- Artificial neural networks (3 multilayer perceptron implementations)
- Boosted/bagged decision trees with automatic node pruning
- RuleFit
- Support Vector Machine
Currently implemented data preprocessing stages:
- Decorrelation
- Principal component decomposition
- Transformation to uniform and Gaussian distributions (coming soon)

Data Preprocessing: Decorrelation
Commonly realised for all methods in TMVA (centrally in the DataSet class).
Removal of linear correlations by rotating the input variables, either
- using the “square-root” of the correlation matrix, or
- using Principal Component Analysis.
Note that decorrelation is only complete if
- the correlations are linear, and
- the input variables are Gaussian distributed.
In general, these assumptions hold only approximately.
[Figure: scatter plots of the input variables in three versions: original, after square-root decorrelation, and after PCA decorrelation]
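The square-root decorrelation can be sketched for two variables in plain Python. This is an illustrative sketch and not the TMVA implementation. It assumes the sample covariance matrix C is used and that C is positive definite, and it relies on the closed form for the square root of a symmetric 2x2 matrix, sqrt(C) = (C + s·I)/t with s = sqrt(det C) and t = sqrt(tr C + 2s). Each event is multiplied by the inverse of sqrt(C), after which the sample covariance becomes the unit matrix. All function names are hypothetical.

```python
import math
import random

def covariance_2d(points):
    """Return (c_xx, c_xy, c_yy) of the sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return cxx, cxy, cyy

def inverse_sqrt_2d(cxx, cxy, cyy):
    """Inverse of the symmetric square root of [[cxx, cxy], [cxy, cyy]]."""
    s = math.sqrt(cxx * cyy - cxy * cxy)   # sqrt(det C), needs det C > 0
    t = math.sqrt(cxx + cyy + 2.0 * s)     # sqrt(tr C + 2 s)
    a, b, d = (cxx + s) / t, cxy / t, (cyy + s) / t  # elements of sqrt(C)
    det = a * d - b * b
    return d / det, -b / det, a / det      # elements of sqrt(C)^-1

def decorrelate(points):
    """Apply x' = sqrt(C)^-1 x to every event."""
    ia, ib, id_ = inverse_sqrt_2d(*covariance_2d(points))
    return [(ia * x + ib * y, ib * x + id_ * y) for x, y in points]

# Toy sample with a linear correlation between the two variables.
rng = random.Random(1)
raw = []
for _ in range(2000):
    u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    raw.append((u, 0.8 * u + 0.6 * v))
decorrelated = decorrelate(raw)
```

The transformation removes only the linear part of the correlation, in line with the caveat on the slide: for non-Gaussian or nonlinearly correlated inputs the variables remain dependent even though their covariance vanishes.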

Rectangular Cut Optimisation
Simplest method: cut in a rectangular variable volume
$$ x_{\mathrm{cut}}(i_{\mathrm{event}}) \in \{0,1\} = \bigcap_{v \in \{\mathrm{variables}\}} \left( x_v(i_{\mathrm{event}}) \subset \left[ x_{v,\min},\, x_{v,\max} \right] \right) $$
Technical challenge: how to find the optimal cuts?
- MINUIT fails due to the non-unique solution space
- TMVA uses: Monte Carlo sampling, Genetic Algorithm, Simulated Annealing
- Huge speed improvement of the volume search by sorting events in a binary tree
Cuts usually benefit from prior decorrelation of the cut variables.
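The Monte Carlo sampling approach to cut optimisation can be illustrated with a short random search. This is a sketch under stated assumptions rather than TMVA's algorithm: the figure of merit (lowest background efficiency at a required minimum signal efficiency), the toy Gaussian samples, the sampling ranges and all function names are hypothetical, and events are scanned linearly instead of being sorted in a binary tree.

```python
import random

def passes(event, cuts):
    """x_cut = 1 if every variable lies inside its [min, max] window."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(event, cuts))

def efficiency(events, cuts):
    """Fraction of events that pass the rectangular cuts."""
    return sum(passes(e, cuts) for e in events) / len(events)

def sample_cuts(rng, ranges):
    """Draw one random [min, max] window per variable."""
    cuts = []
    for lo, hi in ranges:
        a, b = sorted((rng.uniform(lo, hi), rng.uniform(lo, hi)))
        cuts.append((a, b))
    return cuts

def optimise_cuts(signal, background, ranges, min_sig_eff, n_trials, rng):
    """Keep the sampled cuts with the lowest background efficiency
    among those reaching the required signal efficiency."""
    best = None
    for _ in range(n_trials):
        cuts = sample_cuts(rng, ranges)
        if efficiency(signal, cuts) < min_sig_eff:
            continue
        bkg_eff = efficiency(background, cuts)
        if best is None or bkg_eff < best[0]:
            best = (bkg_eff, cuts)
    return best

# Toy samples: signal around (+1, +1), background around (-1, -1).
rng = random.Random(42)
signal = [(rng.gauss(1.0, 0.5), rng.gauss(1.0, 0.5)) for _ in range(500)]
background = [(rng.gauss(-1.0, 0.5), rng.gauss(-1.0, 0.5)) for _ in range(500)]
best = optimise_cuts(signal, background, [(-3.0, 3.0)] * 2,
                     min_sig_eff=0.5, n_trials=2000, rng=rng)
```

Repeating the search for a range of min_sig_eff values would trace out a background-rejection versus signal-efficiency curve for the cut method.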

Projective Likelihood Estimator (PDE Approach)
Much liked in HEP: probability density estimators for each input variable are combined in a likelihood estimator.
Likelihood ratio for event i_event:
$$ y_{\mathcal{L}}(i_{\mathrm{event}}) = \frac{\prod_{k \in \{\mathrm{variables}\}} p_k^{\mathrm{signal}}\left(x_k(i_{\mathrm{event}})\right)}{\sum_{U \in \{\mathrm{species}\}} \left( \prod_{k \in \{\mathrm{variables}\}} p_k^{U}\left(x_k(i_{\mathrm{event}})\right) \right)} $$
Here the p_k are the PDFs, the x_k are the discriminating variables, and the species U are signal and the background types. The PDE introduces fuzzy logic.
- Ignores correlations between input variables
- Optimal approach if correlations are zero (or linear, in which case decorrelation removes them)
- Otherwise: significant performance loss
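A minimal sketch of the likelihood ratio with two species (signal and one background type) follows. It is not TMVA code. The one-dimensional PDFs are plain normalised histograms with one pseudo-count per bin to avoid empty bins, which is an illustrative simplification; the binning, the ranges, the toy samples and all names are hypothetical.

```python
import random

def bin_index(x, lo, hi, n_bins):
    """Bin number of x; values outside [lo, hi) go to the edge bins."""
    idx = int((x - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def histogram_pdf(values, lo, hi, n_bins):
    """Return p(x) estimated from a normalised histogram of the values."""
    counts = [1.0] * n_bins                 # one pseudo-count per bin
    for v in values:
        counts[bin_index(v, lo, hi, n_bins)] += 1.0
    norm = sum(counts) * (hi - lo) / n_bins
    return lambda x: counts[bin_index(x, lo, hi, n_bins)] / norm

def likelihood_ratio(event, signal_pdfs, background_pdfs):
    """y_L = L_signal / (L_signal + L_background), with each L a product
    of one-dimensional PDFs (correlations are ignored)."""
    l_sig, l_bkg = 1.0, 1.0
    for x, p_sig, p_bkg in zip(event, signal_pdfs, background_pdfs):
        l_sig *= p_sig(x)
        l_bkg *= p_bkg(x)
    return l_sig / (l_sig + l_bkg)

# Toy training samples with two input variables.
rng = random.Random(3)
signal = [(rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)) for _ in range(2000)]
background = [(rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)) for _ in range(2000)]

# One PDF per variable and per species.
signal_pdfs = [histogram_pdf([e[k] for e in signal], -5.0, 5.0, 20) for k in range(2)]
background_pdfs = [histogram_pdf([e[k] for e in background], -5.0, 5.0, 20) for k in range(2)]

y_signal_like = likelihood_ratio((1.0, 1.0), signal_pdfs, background_pdfs)
y_background_like = likelihood_ratio((-1.0, -1.0), signal_pdfs, background_pdfs)
```

Because each species likelihood is a product over variables, any correlation between x_1 and x_2 is invisible to this estimator, which is the performance limitation noted on the slide.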

PDE Approach: Estimating PDF Kernels
Technical challenge: how to estimate the PDF shapes. 3 ways:
- parametric fitting (function): difficult to automate for arbitrary PDFs
- nonparametric fitting: easy to automate, can create artefacts/suppress information
- event counting: automatic, unbiased, but suboptimal
We have chosen to implement nonparametric fitting in TMVA:
- Binned shape interpolation using spline functions and adaptive smoothing
- Unbinned adaptive kernel density estimation (KDE) with Gaussian smearing
TMVA performs automatic validation of goodness-of-fit.
[Figure: fitted PDF shapes; the original distribution is Gaussian]
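The unbinned kernel density estimate can be sketched as follows. Note the simplification: the slide describes an adaptive KDE, whereas this sketch uses a single fixed bandwidth for all samples. The bandwidth value, the toy sample and the function name are hypothetical.

```python
import math
import random

def gaussian_kde(samples, bandwidth):
    """Return an unbinned density estimate: one Gaussian kernel of
    width `bandwidth` is placed on every sample (fixed, not adaptive)."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return pdf

# Toy sample drawn from a unit Gaussian, as in the slide's illustration.
rng = random.Random(7)
samples = [rng.gauss(0.0, 1.0) for _ in range(500)]
pdf = gaussian_kde(samples, bandwidth=0.3)  # hypothetical bandwidth
```

A larger bandwidth gives a smoother estimate at the cost of washing out structure; an adaptive scheme would instead widen the kernels where samples are sparse.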

Multidimensional PDE Approach
Use a single PDF per event class (sig, bkg), which spans N_var dimensions.
PDE Range-Search: count the number of signal and background events in the “vicinity” of the test event → a preset or adaptive volume V defines the “vicinity” (Carli-Koblitz, NIM A501, 576 (2003)).
[Figure: H_1 and H_0 events in the (x_1, x_2) plane with a volume V around the test event, for which y_PDERS(i_event, V) ≈ 0.86]
- Improve the y_PDERS estimate within V by using various N_var-D kernel estimators
- Enhance the speed of event counting in the volume by binary tree search
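The range-search idea can be sketched with a fixed box volume and a linear scan. This is an illustration, not the TMVA implementation: the output definition (normalised signal count divided by the sum of normalised signal and background counts), the value returned for an empty volume, the tiny reference samples and the function names are all assumptions, and no binary tree or kernel weighting is used.

```python
def count_in_box(events, centre, half_width):
    """Number of events inside the box of the given half-width around centre."""
    return sum(all(abs(x - c) <= half_width for x, c in zip(e, centre))
               for e in events)

def pders_output(test_event, signal, background, half_width):
    """Signal fraction among reference events in the vicinity of the test event."""
    n_sig = count_in_box(signal, test_event, half_width) / len(signal)
    n_bkg = count_in_box(background, test_event, half_width) / len(background)
    if n_sig + n_bkg == 0.0:
        return 0.5  # empty volume: undecided (illustrative choice)
    return n_sig / (n_sig + n_bkg)

# Tiny hypothetical reference samples in (x_1, x_2).
signal = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (3.0, 3.0)]
background = [(-1.0, -1.0), (0.9, 1.0), (-1.2, -0.8), (-3.0, -3.0)]

y_near_signal = pders_output((1.0, 1.0), signal, background, half_width=0.5)
y_near_background = pders_output((-1.0, -1.0), signal, background, half_width=0.5)
y_empty = pders_output((10.0, 10.0), signal, background, half_width=0.5)
```

With large reference samples the linear scan in count_in_box dominates the cost, which is the motivation for the binary tree search mentioned on the slide.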

Multidimensional PDE Approach: k-Nearest Neighbor
Better than searching within a volume (fixed or floating): count adjacent reference events until a statistically significant number is reached.
- Method is intrinsically adaptive
- Very fast search with kd-tree event sorting
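A brute-force sketch of the k-nearest-neighbour output is given below. It is illustrative only: the Euclidean metric, the output definition (signal fraction among the k nearest reference events), the tiny reference samples and the function name are assumptions, and a full sort replaces the kd-tree search used for speed.

```python
import math

def knn_output(test_event, signal, background, k):
    """Fraction of signal events among the k nearest reference events."""
    labelled = ([(math.dist(test_event, e), 1) for e in signal] +
                [(math.dist(test_event, e), 0) for e in background])
    labelled.sort(key=lambda pair: pair[0])   # nearest first
    nearest = labelled[:k]
    return sum(label for _, label in nearest) / len(nearest)

# Tiny hypothetical reference samples in (x_1, x_2).
signal = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (3.0, 3.0)]
background = [(-1.0, -1.0), (0.9, 1.0), (-1.2, -0.8), (-3.0, -3.0)]

y_near_signal = knn_output((1.0, 1.0), signal, background, k=3)
y_near_background = knn_output((-1.0, -1.0), signal, background, k=2)
```

The effective search volume shrinks automatically where reference events are dense and grows where they are sparse, which is what makes the method intrinsically adaptive.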
