

  1. Pitfalls of evaluating a classifier's performance in high energy physics applications. Gilles Louppe, NYU (@glouppe); Tim Head, EPFL (@betatim). February 18, 2016, Heavy Flavour Data Mining workshop.

  2. Disclaimer
The following applies only to the learning protocol of the Flavours of Physics Kaggle challenge (Blake et al., 2015). See Louppe and Head (2015) for the accompanying notebook with full explanations.
2 / 16

  3. Flavours of Physics: Finding τ → µµµ challenge
Given a learning set L of
• simulated signal events (x, s),
• real data background events (x, b),
build a classifier ϕ : X → {s, b} for distinguishing τ → µµµ signal events from background events.
3 / 16
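A minimal sketch of this learning protocol, with purely illustrative placeholder arrays and an arbitrary choice of model (the variable names, sizes, and GradientBoostingClassifier below are assumptions, not the challenge's actual setup):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative placeholder data; in the challenge these would be the
# provided simulated signal and real-data background events.
rng = np.random.RandomState(0)
X_sim_signal = rng.normal(loc=1.0, size=(1000, 3))
X_data_background = rng.normal(loc=0.0, size=(1000, 3))

# Learning set L: simulated signal labelled 1 (s), real-data background labelled 0 (b).
X_train = np.vstack([X_sim_signal, X_data_background])
y_train = np.concatenate([np.ones(len(X_sim_signal)),
                          np.zeros(len(X_data_background))])

phi = GradientBoostingClassifier().fit(X_train, y_train)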

  4. Control channel test
The simulation is not perfect: simulated and real data events can often be distinguished. To avoid exploiting simulation versus real data artefacts to classify signal from background events, we evaluate whether ϕ behaves differently on simulated signal and real data signal from a control channel C.
Control channel test: requires the Kolmogorov-Smirnov test statistic between {ϕ(x) | x ∈ C_sim} and {ϕ(x) | x ∈ C_data} to be strictly smaller than some pre-defined threshold t.
4 / 16
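A minimal sketch of how such a test could be computed, assuming the classifier scores on the two control samples are available as 1-D arrays (the function name is hypothetical; the 0.09 threshold is the value quoted later on slide 10 and is used here only for illustration):

import numpy as np
from scipy.stats import ks_2samp

def control_channel_test(scores_control_sim, scores_control_data, threshold=0.09):
    # Two-sample Kolmogorov-Smirnov statistic between the classifier output
    # distributions on simulated and real control-channel events.
    ks, _ = ks_2samp(scores_control_sim, scores_control_data)
    return ks, ks < threshold

# Illustrative usage with placeholder scores.
rng = np.random.RandomState(0)
ks, passed = control_channel_test(rng.uniform(size=500), rng.uniform(size=500))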

  5. Loophole
If control data can be distinguished from training data, then there exist classifiers ϕ that exploit simulation artefacts to classify signal from background events and for which the control channel test nevertheless succeeds. Therefore,
• the true performance of ϕ on real data may differ significantly from (and is typically lower than) the performance estimated on simulated signal events versus real data background events;
• passing the KS test should not be interpreted as evidence that ϕ does not exploit simulation versus real data artefacts.
5 / 16

  6. Toy example
Let us consider an artificial classification problem between signal and background events, along with some close control channel data C_sim and C_data. Let us assume an input space defined by three input variables X1, X2, X3, as follows.
6 / 16

  7. X1 is irrelevant for real data signal versus real data background, but relevant for simulated versus real data events. 7 / 16

  8. X2 is relevant for discriminating between signal and background events. 8 / 16

  9. X3 is relevant for training versus control events, but otherwise has no discriminative power between signal and background events. 9 / 16
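One possible way to generate such a toy dataset; the exact distributions of the original toy example are not reproduced here, and the Gaussian shifts below are purely illustrative assumptions chosen to match the qualitative description of X1, X2 and X3:

import numpy as np

rng = np.random.RandomState(42)
n = 5000

def make_events(signal, simulated, control):
    # X1: shifted for simulated events only (simulation artefact),
    #     carries no signal/background information in real data.
    x1 = rng.normal(loc=1.0 if simulated else 0.0, size=n)
    # X2: the only genuinely discriminative variable between signal and background.
    x2 = rng.normal(loc=1.0 if signal else 0.0, size=n)
    # X3: shifted for control-channel events only, otherwise uninformative.
    x3 = rng.normal(loc=1.0 if control else 0.0, size=n)
    return np.column_stack([x1, x2, x3])

X_train_signal = make_events(signal=True, simulated=True, control=False)        # simulated signal
X_train_background = make_events(signal=False, simulated=False, control=False)  # real-data background
X_control_sim = make_events(signal=True, simulated=True, control=True)
X_control_data = make_events(signal=True, simulated=False, control=True)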

  10. Random exploration

from scipy.stats import ks_2samp
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score

def ks_statistic(a, b):
    # Two-sample Kolmogorov-Smirnov statistic between two score distributions
    # (assumed definition; the original notebook provides its own helper).
    return ks_2samp(a, b)[0]

def find_best_tree(X_train, y_train, X_test, y_test,
                   X_data, y_data, X_control_sim, X_control_data):
    best_auc_test, best_auc_data = 0, 0
    best_ks = 0
    best_tree = None
    for seed in range(2000):
        # Tiny, randomly seeded trees: we are fishing for a lucky classifier.
        clf = ExtraTreesClassifier(n_estimators=1, max_features=1,
                                   max_leaf_nodes=5, random_state=seed)
        clf.fit(X_train, y_train)
        auc_test = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        auc_data = roc_auc_score(y_data, clf.predict_proba(X_data)[:, 1])
        ks = ks_statistic(clf.predict_proba(X_control_sim)[:, 1],
                          clf.predict_proba(X_control_data)[:, 1])
        # Keep the tree with the best test AUC among those passing the control channel test.
        if auc_test > best_auc_test and ks < 0.09:
            best_auc_test = auc_test
            best_auc_data = auc_data
            best_ks = ks
            best_tree = clf
    return best_auc_test, best_auc_data, best_ks, best_tree

10 / 16

  11. Random exploration

auc_test, auc_data, ks, tree = find_best_tree(...)
>>> auc_test = 0.9863  # Estimated AUC (simulated signal vs. data background)
>>> ks = 0.0578        # KS statistic < 0.09

What just happened? By chance, we have found a classifier that
• has seemingly good test performance;
• passes the control channel test that we have defined.
This classifier appears to be exactly the one we were seeking. Wrong. The expected ROC AUC on real data signal and real data background is significantly lower than our first estimate, suggesting that there is still something wrong.
>>> auc_data = 0.9097  # True AUC (data signal vs. data background)

11 / 16

  12. ϕ exploits X1, i.e., simulation versus real data artefacts, to indirectly classify signal from background events, while still passing the control channel test because of its use of X3! 12 / 16

  13. Winning the challenge
1. Learn to distinguish between training and control data.
2. Build a classifier on training data, with all the freedom to exploit simulation artefacts.
3. Assign random predictions to samples predicted as control data; otherwise predict using the classifier from step 2.
The reconstructed mass makes it possible to distinguish signal from background and training from control! A sketch of this strategy follows below.
13 / 16
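A minimal sketch of this exploit, assuming pre-assembled training, control, and evaluation arrays; all names, the choice of GradientBoostingClassifier, and the 0.5 cut are illustrative assumptions, not the winning solution's actual code:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def cheating_predictions(X_train, y_train, X_control, X_eval, rng=None):
    if rng is None:
        rng = np.random.RandomState(0)
    # Step 1: learn to tell training events from control-channel events.
    is_control = GradientBoostingClassifier()
    is_control.fit(np.vstack([X_train, X_control]),
                   np.concatenate([np.zeros(len(X_train)), np.ones(len(X_control))]))
    # Step 2: signal/background classifier, free to exploit simulation artefacts.
    phi = GradientBoostingClassifier()
    phi.fit(X_train, y_train)
    # Step 3: random scores wherever an event looks like control data, phi elsewhere.
    preds = phi.predict_proba(X_eval)[:, 1]
    looks_like_control = is_control.predict_proba(X_eval)[:, 1] > 0.5
    preds[looks_like_control] = rng.uniform(size=looks_like_control.sum())
    return preds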

  14. A machine learning response
As the amount of simulated training data increases (i.e., as N → ∞),
$\frac{1}{N} \sum_{x_i} L(\varphi(x_i)) \to \int L(\varphi(x))\, p_{\mathrm{sim}}(x)\, dx.$
We want to be good on real data, i.e., minimize
$\int L(\varphi(x))\, p_{\mathrm{data}}(x)\, dx.$
Solution: importance weighting.
$\varphi^* = \arg\min_{\varphi} \frac{1}{N} \sum_{x_i} \frac{p_{\mathrm{data}}(x_i)}{p_{\mathrm{sim}}(x_i)}\, L(\varphi(x_i))$
14 / 16
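In scikit-learn terms, this reweighting amounts to passing the density ratio as a per-event sample weight. A minimal sketch, assuming the ratios p_data(x_i)/p_sim(x_i) have already been estimated (the function name and the choice of model are illustrative):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_reweighted(X_train, y_train, density_ratio):
    # density_ratio[i] estimates p_data(x_i) / p_sim(x_i); weighting the loss
    # by it makes the empirical risk approximate the real-data risk rather
    # than the simulated one.
    clf = GradientBoostingClassifier()
    clf.fit(X_train, y_train, sample_weight=density_ratio)
    return clf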

  15. Density ratio estimation
But for signal events, we don't even have real data observations!
Assumption:
$\frac{p_{\mathrm{data}}(x)}{p_{\mathrm{sim}}(x)} \approx \frac{p_{\mathrm{data\text{-}control}}(x)}{p_{\mathrm{sim\text{-}control}}(x)} = r(x)$
In the likelihood-free setting, estimating r(x) is known as the density-ratio estimation problem. Same as:
• learning under covariate shift,
• probabilistic classification,
• likelihood-ratio test,
• outlier detection,
• mutual information estimation, ...
See Sugiyama et al. (2012) for a review.
15 / 16
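One standard recipe (the "probabilistic classification" route listed above) is to train a classifier to separate data-control from sim-control events and convert its calibrated output into a ratio. A minimal sketch under that assumption; the function name, model choice, and the balanced-class simplification are assumptions:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def estimate_density_ratio(X_control_sim, X_control_data):
    # Probabilistic classifier separating real-data control events (label 1)
    # from simulated control events (label 0).
    X = np.vstack([X_control_sim, X_control_data])
    y = np.concatenate([np.zeros(len(X_control_sim)), np.ones(len(X_control_data))])
    clf = GradientBoostingClassifier()
    clf.fit(X, y)

    def r(X_new):
        # With equally sized samples, p(data | x) / p(sim | x) estimates
        # p_data-control(x) / p_sim-control(x); otherwise rescale by the class prior ratio.
        p = clf.predict_proba(X_new)[:, 1]
        return p / np.clip(1.0 - p, 1e-6, None)

    return r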

  16. Conclusions
• Formulating appropriate machine learning tasks is difficult.
• On purpose or unwillingly, simulation versus real data artefacts can be exploited to maximize a classifier's accuracy.
• Physically more correct classifiers can be obtained, e.g., with density-ratio reweighting.
16 / 16

  17. References
Blake, T., Bettler, M.-O., Chrzaszcz, M., Dettori, F., Ustyuzhanin, A., and Likhomanenko, T. (2015). Flavours of physics: the machine learning challenge for the search of τ → 3µ decays at LHCb.
Louppe, G. and Head, T. (2015). Pitfalls of evaluating a classifier's performance in high energy physics applications. http://dx.doi.org/10.5281/zenodo.34934.
Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Density Ratio Estimation in Machine Learning. Cambridge University Press.

