Higgs Machine Learning Challenge experience. A HEP pattern recognition challenge ?
David Rousseau LAL-Orsay 10th February 2015
David Rousseau HiggsML and tracking challenges CTD 2015 Berkeley
- Neural nets were used somewhat in the 1990s (e.g. at LEP)
- BDT (AdaBoost) invented in 1997
- MVA techniques (= machine learning) were used extensively at D0/CDF (mostly BDT, but not only) in the 2000s
- ATLAS/CMS were less eager to adopt MVA at the LHC start, for some good reasons:
- But a lot of work has been done recently with MVA techniques
- Meanwhile neural nets are reappearing in their “deep” incarnation (see Peter Sadowski’s talk this afternoon)
technique within HEP
easy, it takes time to really become an expert with e.g. BDT
evaluation of systematics (which of course are excellent things to do)
- Over the last 10 years, challenges have become a common way of working for the machine learning community
- Machine learning scientists are eager to test their algorithms on real-life problems → more valuable (= publishable) than artificial problems
- Companies or academics want to outsource a problem to machine learning scientists, but also to geeks etc. The company sets up a challenge like:
- Some companies make a business of organising challenges: datascience.net, kaggle
- A few recent examples now…
Olga Kokshagina 2015
20 Courtesy : Lakhani 2014
OI is suitable for a variety of non-conventional, surprising ideas that are « far » from traditional expertise -> high volatility
Experts are highly skilled and trained -> more focused, better-performing solutions, low variety
ATLAS-CONF-2013-108
- First idea in September 2012
- Challenge ran from May to September 2014
- People register on the Kaggle web site, which hosts the challenge at https://www.kaggle.com/c/higgs-boson (additional info on https://higgsml.lal.in2p3.fr)
- Open to almost anyone
- …download the training dataset (with labels) with 250k events
- …train their own algorithm to optimise the significance (à la s/sqrt(b))
- …download the test dataset (without labels) with 550k events
- …upload their own classification
- The site automatically calculates the significance. Public (100k events) and private (450k events) leaderboards update instantly.
- Competition closes mid-September 2014. People are asked to provide their code and
- The most interesting one gets the “HEP meets ML award”
ASCII csv file, with a mixture of Higgs to tautau signal and corresponding background, from official GEANT4 ATLAS simulation
Weight and signal/background (for training dataset
- weight (fully normalised)
- label: « s » or « b »
Conf note variables used for categorization or BDT:
- DER_mass_MMC, DER_mass_transverse_met_lep, DER_mass_vis, DER_pt_h
- DER_deltaeta_jet_jet, DER_mass_jet_jet, DER_prodeta_jet_jet
- DER_deltar_tau_lep, DER_pt_tot, DER_sum_pt, DER_pt_ratio_lep_tau
- DER_met_phi_centrality, DER_lep_eta_centrality
Primitive 3-vectors allowing to compute the conf note variables (mass neglected), 16 independent variables:
- PRI_tau_pt, PRI_tau_eta, PRI_tau_phi
- PRI_lep_pt, PRI_lep_eta, PRI_lep_phi
- PRI_met, PRI_met_phi, PRI_met_sumet
- PRI_jet_num (0, 1, 2, 3, capped at 3)
- PRI_jet_leading_pt, PRI_jet_leading_eta, PRI_jet_leading_phi
- PRI_jet_subleading_pt, PRI_jet_subleading_eta, PRI_jet_subleading_phi
- PRI_jet_all_pt
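As an illustration of the file layout described above, here is a minimal sketch that reads such a csv and separates the DER_ (conf note) columns from the PRI_ (primitive) columns. The column names `EventId`, `Weight` and `Label`, the helper `read_events`, and the three sample rows are assumptions made for this example; the values are made up and are not real events.

```python
import csv
import io

# Tiny synthetic stand-in for the challenge csv (made-up values, reduced column set).
# "EventId", "Weight" and "Label" are assumed column names for this sketch.
SAMPLE = """EventId,DER_mass_MMC,DER_mass_vis,PRI_tau_pt,PRI_jet_num,Weight,Label
100000,138.47,97.83,32.64,2,0.0027,s
100001,160.94,103.24,42.01,1,2.2336,b
100002,121.30,125.95,32.15,1,2.3474,b
"""

def read_events(text):
    """Return (derived column names, primitive column names, list of events)."""
    reader = csv.DictReader(io.StringIO(text))
    derived = [c for c in reader.fieldnames if c.startswith("DER_")]
    primitive = [c for c in reader.fieldnames if c.startswith("PRI_")]
    events = []
    for row in reader:
        features = [float(row[c]) for c in derived + primitive]
        events.append((features, float(row["Weight"]), row["Label"]))
    return derived, primitive, events

derived, primitive, events = read_events(SAMPLE)

# Weights are already normalised, so summing them gives expected event counts.
signal_weight = sum(w for _, w, label in events if label == "s")
background_weight = sum(w for _, w, label in events if label == "b")
print(derived, primitive, signal_weight, background_weight)
```

The same function could be pointed at the real training file by passing its contents as text; only the prefix convention (DER_ / PRI_) is taken from the list above.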
1. Systematics
2. 2 categories x n BDT score bins
3. Background estimated from data (embedded, anti tau, control region) and some MC
4. Weights include all corrections. Some negative weights (tt)
5. Potentially use any information from all 2012 data and MC events
6. Few variables fed into two BDTs
7. Significance from complete fit with NP etc…
8. MVA with TMVA BDT
1. No systematics
2. No categories, one signal region
3. Straight use of ATLAS G4 MC
4. Weights only include normalisation and pythia
rejected.
5. Only use variables and events preselected by the real analysis
6. All BDT variables + categorisation variables + primitive 3-vectors
7. Significance from “regularised Asimov”
8. MVA “no-limit”
7000$ 4000$ 2000$
HEP meets ML award XGBoost authors Free trip to CERN
TMVA expert, with TMVA improvements Best physicist
Clearly at the top!
https://indico.lal.in2p3.fr/event/2632/
- In short:
marginal gain
- Re-importing into HEP all the ML developments
- Dataset being released imminently on CERN Open Data Portal http://
(citable with a DOI)
- Better understand what was done by the best participants
- NIPS proceedings write-up (with detailed description of “how they did it?”)
- Organisation of a visit to CERN by the winners of the HEP meets ML award (Tianqi Chen and Tong He, authors of XGBoost, and Gabor Melis
- Discussion ongoing with TMVA experts
- Tracking dominates reconstruction CPU time at LHC
- HL-LHC (phase 2) perspective: increased pileup:
- CPU time: quadratic/exponential extrapolation (difficult to quote any number)
Graeme Stewart ECFA HL-LHC workshop 2014
http://papers.nips.cc/paper/5572-a-complete-variational-tracker.pdf
(the following is the result of brain storming with a few HEP and ML
- Focus on pattern recognition for an ATLAS/CMS-like experiment at HL-LHC
- Give full G4 simulation of a possible HL-LHC Si tracker (which one does not matter)
- Mixture of meaningful events, e.g. W, Z, top
- Give list of 3D points
realistic but would increase complexity.
- Give millions of events
- Participants develop their algorithm on these millions of events, then the algorithm is evaluated (figure of merit, f.o.m., see next slide) on a withheld, smallish test sample
- F.o.m. appears online on the leaderboard
- Best f.o.m. wins at the end of the challenge
- We’re more interested in CPU gain than in efficiency or fake rate reduction (the latter two probably also require more specific expertise), provided they are “good enough”
- Efficiency and fake rate measured per track, with respect to the fraction of 3D points belonging to the same true track (we’re not really interested in track parameter estimation)
- So something like: f.o.m. = 1/CPU * sigmoid(efficiency, 95%) * sigmoid(1/fake, 1000)
- Why a sigmoid rather than a hard threshold? To avoid a luck factor, with a participant being close to the limit and losing everything on the test sample (top participants do not like luck factors)
- Might want to measure efficiency per pT/eta bin (would not want to give a good score to a submission with e.g. 0 efficiency above 10 GeV). Also, may need to deal separately with electrons and tracks from detached vertices.
- Many, many details to sort out in practice
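The figure of merit proposed above can be sketched as follows. This is only an illustration under stated assumptions: the slide gives the thresholds (95% efficiency, 1/fake of 1000) but not the sigmoid shape, so the logistic form and the two width parameters are hypothetical choices, as are the function names.

```python
import math

def sigmoid(x, threshold, width):
    """Logistic turn-on: 0.5 at x == threshold, approaching 1 well above it.
    The logistic shape and the width are assumptions of this sketch."""
    return 1.0 / (1.0 + math.exp(-(x - threshold) / width))

def figure_of_merit(cpu_seconds, efficiency, fake_rate,
                    eff_threshold=0.95, inv_fake_threshold=1000.0,
                    eff_width=0.01, inv_fake_width=100.0):
    """f.o.m. = 1/CPU * sigmoid(efficiency, 95%) * sigmoid(1/fake, 1000)."""
    if cpu_seconds <= 0.0 or fake_rate <= 0.0:
        raise ValueError("cpu_seconds and fake_rate must be positive")
    return (1.0 / cpu_seconds
            * sigmoid(efficiency, eff_threshold, eff_width)
            * sigmoid(1.0 / fake_rate, inv_fake_threshold, inv_fake_width))

# A submission sitting exactly on both thresholds keeps a quarter of its 1/CPU score.
on_threshold = figure_of_merit(1.0, 0.95, 0.001)
# Slightly below the efficiency threshold the score degrades smoothly, not to zero.
just_below = figure_of_merit(1.0, 0.949, 0.001)
# Halving the CPU time doubles the score at fixed efficiency and fake rate.
twice_as_fast = figure_of_merit(0.5, 0.95, 0.001)
print(on_threshold, just_below, twice_as_fast)
```

With a hard threshold, `just_below` would be zero; the smooth turn-on is what removes the luck factor for participants sitting near the limit.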
Olga Kokshagina 2015 Courtesy : Lakhani 2013
Harvard Medical School Contest for Biology Big Data Problem in Genomics Two week long competition - $2000 prize pot x 3 on TopCoder.com
Best in-house solutions Challenge submissions
- Some participants will want to compete without releasing their software (at least not at the beginning) → can make separate leaderboards and prizes
- Incentive: prize for the winners, but also a function of the score reached
- Need a starter kit with real HEP software
- Strive to promote “coopetition” so that participants collaborate (tricky)
- Plan to release the sample publicly (e.g. CERN Open Data Portal) just after the challenge
- Plan for a publication outlet (e.g. satellite NIPS workshop proceedings, as for HiggsML)
- Anticipate from the very beginning the final re-import stage
take long)
- Machine learning is the part of computer science which deals, among other things, with automatic classification:
- Developing rapidly
  - Google advertisements based on your searches or your Gmail messages
  - Amazon: “we recommend for you”, based on what you already bought
Isabelle Guyon http://chalearn.org/
- From ATLAS full sim Geant4 MC12 production
- 30 variables (see later)
- Signal is H → tautau; background is a mixture of Z, top, W
- Based on the November 2013 ATLAS Htautau conf note ATLAS-CONF-2013-108
- Preselection for lep-had topology: single lepton trigger, one lepton identified, one hadronic tau identified
- → 800,000 events:
- Reproduces reasonably well (~20%) the content of the 3 highest sensitivity bins (x 2 categories) in the conf note
- (some backgrounds and many correction factors deliberately omitted so that the sample cannot be used for physics, only for machine learning studies)
- We thought of smearing the parameters to prevent the matching
- Back-of-the-envelope calculation: by how much should D variables be smeared so that the original can be matched with only 50% probability among N entries?
- → smearing should be at least 15%
- → events become meaningless
- → bad idea!
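The reasoning behind this back-of-the-envelope estimate can be illustrated with a toy simulation. This is not the calculation made for the challenge: the uniform toy variables, the Gaussian relative smearing, the nearest-neighbour matching and the small N and D are all assumptions chosen so the sketch runs quickly, so the numbers it prints are not comparable to the 15% quoted above.

```python
import random

def match_fraction(n_entries, n_vars, rel_smear, seed=0):
    """Fraction of smeared entries whose nearest original entry is their true origin.
    Toy model: uniform variables in [1, 2], Gaussian relative smearing."""
    rng = random.Random(seed)
    originals = [[rng.uniform(1.0, 2.0) for _ in range(n_vars)]
                 for _ in range(n_entries)]
    matched = 0
    for idx, row in enumerate(originals):
        smeared = [x * (1.0 + rng.gauss(0.0, rel_smear)) for x in row]
        # Nearest neighbour by squared Euclidean distance over all originals.
        best = min(range(n_entries),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(smeared, originals[j])))
        if best == idx:
            matched += 1
    return matched / n_entries

# Scan a few relative smearing values for a small toy dataset (N=200, D=3).
for rel_smear in (0.0001, 0.01, 0.1, 0.5):
    print(rel_smear, match_fraction(200, 3, rel_smear))
```

The qualitative point carries over: matching only starts to fail once the smearing is comparable to the typical distance between neighbouring entries, which in a high-dimensional space means a smearing large enough to distort the events themselves.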
- Hiding the origin of an entry in a DB is called “sanitizing”; this is a notoriously difficult and very hot topic, e.g.:
(gender, zip code, and birthdate uniquely identify 83% of US people)
prize, they have to release it under an OS license (so that we can 1) verify it 2) use it)
to release the data (Thorsten W handled this negotiation)
and only for the challenge (“Can I use the data for a master thesis?” Sorry, no.) → finally, a late agreement to release it on CERN ODP
- Need to have one robust estimator of the quality of the classification algorithm
- Decided to use the well-known (in ATLAS) “Asimov” formula (G. Cowan, K. Cranmer, E. Gross, and O. Vitells, “Asymptotic formulae for likelihood-based tests of new physics”, EPJC, vol. 71, pp. 1–19, 2011), with regularisation on top
- Why b’ = b + 10 (“regularisation”): a practical way to avoid large significance fluctuations when a small phase space region with very few background events is chosen. Do not want to pick winners on their luck.
- Note that normalisation is already done in the weights: no need to explain integrated luminosity and cross-section
- Glen Cowan has derived a new version of the Asimov formula including a sigma_b from systematics or statistics (to be shown at the coming Statistics Forum)
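For reference, the regularised estimator can be sketched in a few lines. The Asimov significance from the cited paper is sqrt(2((s+b) ln(1+s/b) − s)); the sketch below substitutes b’ = b + 10 as described above. The function name `ams` and the example counts are illustrative assumptions, not taken from the challenge code.

```python
import math

def ams(s, b, b_reg=10.0):
    """Regularised Asimov significance.
    s, b: weighted signal and background counts in the selected region.
    b_reg: regularisation term, so that b' = b + b_reg replaces b."""
    b_prime = b + b_reg
    return math.sqrt(2.0 * ((s + b_prime) * math.log(1.0 + s / b_prime) - s))

# Example with made-up counts: compare with the naive s/sqrt(b') estimate.
s, b = 10.0, 100.0
print(ams(s, b), s / math.sqrt(b + 10.0))

# Effect of the regularisation on a tiny region with almost no background:
# without it (b_reg=0) the significance is inflated by the near-empty region.
print(ams(5.0, 0.01, b_reg=0.0), ams(5.0, 0.01))
```

Because the weights are already normalised, s and b are simply the sums of the weights of selected signal and background events.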
- BDT (Boosted Decision Tree), which is by far the most used technique in ATLAS/CMS, is actually an old technique (AdaBoost, 1997)
- More recent techniques, just an example:
technique running on 16K cores for three days, watching 10M random YouTube video stills [Le et al., ICML’12]
- See http://atlas.ch/news/2014/machine-learning-wins-the-higgs-challenge.html
- 1: Gabor Melis (Hungary), Lisp developer and consultant: wins 7000$
- 2: Tim Salimans (Netherlands), data science consultant: wins 4000$
- 3: Pierre Courtiol (France) ?: wins 2000$
- HEP meets ML award: team crowwork, Tianqi Chen and Tong He, PhD students in data science at Seattle and Vancouver. Provided XGBoost, used by many
CERN in 2015
[Figure: plot with axis ticks from 3 to 4 and from 10 to 80; annotations: “Simple TMVA BDT”, “Tuned and improved TMVA BDT”]