Introduction to TMVA and Primary Electron Track Determination




  1. Introduction to TMVA and Primary Electron Track Determination
     Erin Conley
     SNB/LE Working Group Meeting
     June 20, 2018

  2. Introduction
     [Figure: 30.25 MeV event display, time (ticks) vs. wire]
     • Goal: determine primary electron (reconstructed) track in ν_e CC interactions
       – Not always obvious; having a concrete, general method would be useful!
       – Using MARLEY events made by J. Stock in May 2017
     • TMVA provides methods based on machine learning to help reach this goal.

  3. TMVA: Introduction
     • TMVA: Toolkit for Multivariate Data Analysis
       – Framework in ROOT to be used for classification and regression problems
       – Various multivariate analysis (MVA) methods available
     • Two independent phases:
       – Training phase: MVA methods trained, tested, evaluated
       – Application phase: chosen MVA methods applied to classification problem
     • Need to worry about overtraining: too few degrees of freedom (too many model parameters adjusted to too few data points) leads to an unrealistic increase in classification performance on the training sample
     • Data pre-processing available to, e.g., decorrelate or “Gaussianize” variables

  4. TMVA Output
     Characteristics about input variables:
     • Distributions for signal, background input variables
     • Distributions for transformed variables (e.g., decorrelated variables)
     • Correlation plots + matrix to understand linear correlations between variables
     Use these plots to choose optimal combinations of variables, data pre-processing strategy, etc.
     MVA method performance plots:
     • MVA method classifier outputs
       – Kolmogorov-Smirnov test statistic to determine whether overtraining occurred (rule of thumb: want ρ_KS ≳ 0.01)
     • Optimal cut for MVA method classifiers
     • Classification probabilities + PDFs
     • Probability integral transformation (rarity)
     • Receiver operating characteristic (ROC) curves
     Use these plots to compare MVA method performances, choose optimal cuts on data, etc.
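The Kolmogorov-Smirnov check compares the classifier output on the training sample with the output on the test sample. As a minimal sketch of the underlying idea, the Python below computes the two-sample KS distance, i.e. the largest gap between the two empirical cumulative distributions. It works on unbinned values, which is an assumption made for simplicity; the function name and the score values are invented for illustration. It computes only the raw distance, not the probability that is compared against the 0.01 rule of thumb.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two unbinned samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    distance = 0.0
    # The gap can only change at an observed value, so checking those is enough.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of sample_a <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of sample_b <= x
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


# Invented classifier outputs, for illustration only.
train_scores = [-0.30, -0.10, 0.05, 0.20, 0.40]
test_scores = [-0.25, -0.05, 0.10, 0.25, 0.35]
print(ks_distance(train_scores, test_scores))
```

A small distance means the training and test outputs look alike, which is what one hopes to see when no overtraining occurred.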

  5. MARLEY Simulations: Preparing for TMVA
     • Used BackTracker to determine which tracks were made by primary electron
       – Used 2D hits associated with tracks
       – Multiple tracks in the event can be made by the primary electron
       – Some tracks are partially made by primary electrons (e.g., track with 10 hits → primary electron produced 6 hits)
     • For the purposes of preliminary TMVA tests:
       – Signal: tracks that had 75% or more of their hits produced by the primary electron
       – Background: all other tracks
     • Used full 30.25 MeV MARLEY simulation (10,000 events)
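The signal/background definition above amounts to a threshold on the fraction of a track's hits attributed to the primary electron. A minimal Python sketch follows. The `TrackHits` class and its field names are hypothetical stand-ins, not LArSoft or BackTracker types, and the hit counts are invented apart from the 10-hit/6-hit case quoted on the slide.

```python
from dataclasses import dataclass

SIGNAL_THRESHOLD = 0.75  # from the slide: 75% or more of the hits


@dataclass
class TrackHits:
    """Hypothetical per-track summary (not a LArSoft type)."""
    n_hits: int                   # 2D hits associated with the track
    n_primary_electron_hits: int  # hits truth-matched to the primary electron


def primary_fraction(track: TrackHits) -> float:
    """Fraction of the track's hits produced by the primary electron."""
    if track.n_hits == 0:
        return 0.0
    return track.n_primary_electron_hits / track.n_hits


def is_signal(track: TrackHits, threshold: float = SIGNAL_THRESHOLD) -> bool:
    """Signal if the primary-electron hit fraction reaches the threshold."""
    return primary_fraction(track) >= threshold


# First entry is the slide's example (10 hits, 6 from the primary electron).
tracks = [TrackHits(10, 6), TrackHits(8, 8), TrackHits(4, 3), TrackHits(12, 0)]
labels = ["signal" if is_signal(t) else "background" for t in tracks]
print(labels)
```

With this definition the slide's 6-of-10 example falls below the 75% threshold and is treated as background.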

  6. Determining Primary Track “By Eye”
     • Scanned event displays of 100 events in 30.25 MeV MARLEY data
       – These events had 2.61 reconstructed tracks on average (pmtracktc)
     • Out of the 100 events…
       – 2 events had no reconstructed tracks
       – In 2 events, I failed to identify the primary track
       – In 10 events, I correctly identified at least one primary track but…
         • Misidentified another track
         • Failed to identify all primary tracks
       – In 86 events, I correctly identified all primary tracks
     • Out of the 86 events where I was 100% correct…
       – 14 events contained one track

  7. Variables Used from MARLEY Simulations
     1. Track length: as given in the recob::Track object
     2. “Charge deposition”: sum of integral values of all hits in a track
        – recob::Hit::Integral(): integral under calibrated signal waveform
     3. “Path time”: difference between max/min peak times in the track
        – recob::Hit::PeakTime(): time of signal peak (ticks)
     4. “Summed RMS”: sum of RMS of all hits in a track
        – recob::Hit::RMS(): RMS of hit shape (ticks)
     • Also used calorimetry information from tracks:
     5. “Summed dQdx”: sum of dQdx values on collection plane
     6. “Calo KE”: kinetic energy of track on collection plane
        – Potential issue: not all tracks have calorimetry information (bug?)
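The three hit-based variables can be sketched in Python as simple reductions over a track's hits. The `Hit` class below is a hypothetical stand-in holding only the three quantities named on the slide, not the real recob::Hit. The dictionary keys reuse the variable names that appear on the distribution plots, and the mapping of “path time” to `timeofint` is an assumption. The hit values are invented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hit:
    """Hypothetical stand-in for the recob::Hit quantities used here."""
    integral: float   # integral under the calibrated signal waveform
    peak_time: float  # time of the signal peak (ticks)
    rms: float        # RMS of the hit shape (ticks)


def hit_based_variables(hits: List[Hit]) -> Dict[str, float]:
    """Charge deposition, path time and summed RMS for one track."""
    if not hits:
        raise ValueError("track has no hits")
    peak_times = [h.peak_time for h in hits]
    return {
        "chargedepo": sum(h.integral for h in hits),      # sum of hit integrals
        "timeofint": max(peak_times) - min(peak_times),   # max minus min peak time
        "summedrms": sum(h.rms for h in hits),            # sum of hit RMS values
    }


# Invented hits, for illustration only.
example_hits = [
    Hit(integral=100.0, peak_time=1200.0, rms=2.0),
    Hit(integral=250.5, peak_time=1215.5, rms=3.5),
    Hit(integral=80.0, peak_time=1190.0, rms=2.5),
]
variables = hit_based_variables(example_hits)
print(variables)
```

Track length and the two calorimetry variables are left out because they come from other objects than the hits.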

  8. Input Variable Distributions
     Signal: tracks with 75-100% of their hits made by primary electron
     Background: all other tracks
     [Figure: normalized (1/N) dN distributions of signal and background for the input variables length, timeofint, chargedepo, summedrms, summeddqdx, and caloke]

  9. Decorrelated Variable Distributions
     [Figure: normalized (1/N) dN distributions of signal and background for the 'Deco'-transformed input variables length, timeofint, chargedepo, summedrms, summeddqdx, and caloke]

  10. Track Determination + TMVA
     • Number of “signal” events: 10942
     • Number of “background” events: 8736
     • Trained on 8442 signal, 6236 background events; tested on 2500 signal, 2500 background events
       – Tried to minimize testing sample; the more training, the better!
     • Tested ~5 different MVA methods so far, including cut-based analysis, likelihood estimator, boosted decision trees
       – TMVA ranks MVA methods by best signal efficiency
       – Use ROC curve to determine MVA performance
       – Will only show BDT results (cut-based, likelihood results in backup)
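The training and testing counts above follow from holding out 2500 events of each class. The Python sketch below reproduces that bookkeeping with a random split. How the events were actually assigned to the two samples is not stated on the slide, so the shuffling, the seed, and the function name are assumptions. The integers only stand in for events.

```python
import random


def split_train_test(events, n_test, seed=0):
    """Shuffle a copy of the events and hold out n_test of them for testing."""
    if n_test > len(events):
        raise ValueError("n_test exceeds the number of events")
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (training, testing)


# Placeholder event indices; the totals are the ones quoted on the slide.
signal = list(range(10942))
background = list(range(8736))

sig_train, sig_test = split_train_test(signal, n_test=2500)
bkg_train, bkg_test = split_train_test(background, n_test=2500)

print(len(sig_train), len(bkg_train))  # training sizes
print(len(sig_test), len(bkg_test))    # testing sizes
```

Keeping the test sample fixed at 2500 per class leaves 8442 signal and 6236 background events for training, matching the numbers quoted above.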

  11. Boosted Decision Tree (BDT) Method
     • Structured like binary tree; “yes/no” decisions taken on one variable at a time until stop criterion reached
       – Splits phase space into many regions → eventually classified as signal or background
       – Boosted: extends to several trees → “forest”
     • Purposes of track determination: BDT with decorrelated variables
     [Figure: schematic of decision tree. Leaf nodes at bottom are labeled “signal” and “background” after binary splits are made; these labels depend on the majority of events that end up in nodes.]
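To make the “yes/no decision on one variable” and the majority labelling of leaves concrete, here is a minimal Python sketch of a single split (a one-level tree). It is not a BDT: there is no boosting, no search for the best cut, and no forest. The cut value, the toy track lengths, and all function names are invented for illustration.

```python
def majority_label(labels):
    """Label a leaf by the majority of its events; ties and empty leaves go to background."""
    n_signal = sum(1 for label in labels if label == "signal")
    return "signal" if 2 * n_signal > len(labels) else "background"


def train_single_split(values, labels, cut):
    """One yes/no decision on one variable, with majority-labelled leaves."""
    below = [lab for val, lab in zip(values, labels) if val < cut]
    above = [lab for val, lab in zip(values, labels) if val >= cut]
    return {"cut": cut, "below": majority_label(below), "above": majority_label(above)}


def classify(split, value):
    """Send a new value down the split and return the leaf label."""
    return split["below"] if value < split["cut"] else split["above"]


# Toy training data: an invented track-length-like variable.
lengths = [2.0, 3.0, 4.0, 12.0, 15.0, 18.0]
truth = ["background", "background", "signal", "signal", "signal", "signal"]

split = train_single_split(lengths, truth, cut=5.0)
print(split)
print(classify(split, 10.0), classify(split, 1.0))
```

A full decision tree repeats such splits inside each leaf until a stop criterion is reached, and boosting combines many trees.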

  12. ROC Curve for TMVA Methods
     [Figure: “Background rejection versus Signal efficiency” for the MVA methods BDT, Likelihood, and Cuts]
     • Shows true positive rate versus false positive rate for different possible cutoff points, plotted here as background rejection (1 − false positive rate) versus signal efficiency (true positive rate)
     • Use ROC curve to compare MVA performances
     • The larger the area/integral, the better the performance
     • ROC integral values:
       – BDT: 0.945
       – Likelihood: 0.942
       – Cuts: 0.934
     • From the integral values, we see that MVA methods are comparable, BDT is performing well!
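The ROC integral can be illustrated with a short Python sketch that scans cut values over classifier scores, records (signal efficiency, background rejection) at each cut, and integrates with the trapezoid rule. This is an unbinned toy calculation under stated assumptions, not how the quoted integrals were produced; the function names and score values are invented.

```python
def roc_points(sig_scores, bkg_scores):
    """(signal efficiency, background rejection) pairs; an event passes if score >= cut."""
    cuts = sorted(set(sig_scores) | set(bkg_scores), reverse=True)
    cuts.insert(0, float("inf"))  # start with a cut that nothing passes
    points = []
    for cut in cuts:
        efficiency = sum(s >= cut for s in sig_scores) / len(sig_scores)
        rejection = sum(b < cut for b in bkg_scores) / len(bkg_scores)
        points.append((efficiency, rejection))
    return points  # efficiency rises from 0 to 1 as the cut is lowered


def roc_integral(sig_scores, bkg_scores):
    """Area under background rejection vs. signal efficiency (trapezoid rule)."""
    points = roc_points(sig_scores, bkg_scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Invented classifier scores, for illustration only.
toy_signal = [0.9, 0.8, 0.6, 0.3]
toy_background = [0.7, 0.4, 0.2, 0.1]
print(roc_integral(toy_signal, toy_background))
```

An integral of 1 corresponds to perfect separation and 0.5 to random guessing, which is why larger values indicate better performance.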

  13. BDT Classifier Output + Cuts
     [Figure, left: “TMVA overtraining check for classifier: BDT”. (1/N) dN/dx versus BDT response for signal and background, test and training samples. Kolmogorov-Smirnov test: signal (background) probability = 0.055 (0.032).]
     • TMVA convention: signal events at larger classifier values, background at smaller
     • Note the KS test probabilities are above 0.01; indicates no overtraining occurred
     [Figure, right: “Cut efficiencies and optimal cut value”. Signal efficiency, background efficiency, signal purity, signal efficiency*purity, and significance S/√(S+B) versus cut value applied on BDT output. For 1000 signal and 1000 background events the maximum S/√(S+B) is 27.77 when cutting at -0.06.]
     • Gives us an idea of our performance when we apply BDT to other datasets (e.g., future MARLEY simulations)
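The optimal-cut plot scans the cut value and evaluates the significance S/√(S+B) for an assumed number of signal and background events. A minimal Python sketch of that scan follows. The scores are invented and the function names are hypothetical; only the 1000 signal and 1000 background normalization is taken from the slide, so the result is not expected to reproduce the quoted 27.77.

```python
import math


def significance(n_sig, n_bkg, eff_sig, eff_bkg):
    """S / sqrt(S + B) for the events that pass the cut."""
    s = n_sig * eff_sig
    b = n_bkg * eff_bkg
    return s / math.sqrt(s + b) if s + b > 0 else 0.0


def best_cut(sig_scores, bkg_scores, n_sig=1000, n_bkg=1000):
    """Scan every observed score as a cut (pass if score >= cut) and keep the best."""
    best = (None, -1.0)
    for cut in sorted(set(sig_scores) | set(bkg_scores)):
        eff_sig = sum(s >= cut for s in sig_scores) / len(sig_scores)
        eff_bkg = sum(b >= cut for b in bkg_scores) / len(bkg_scores)
        value = significance(n_sig, n_bkg, eff_sig, eff_bkg)
        if value > best[1]:
            best = (cut, value)
    return best  # (cut value, significance)


# Invented BDT-like responses, signal at larger values by convention.
toy_signal = [0.5, 0.3, 0.1, -0.2]
toy_background = [0.0, -0.1, -0.3, -0.5]
cut, value = best_cut(toy_signal, toy_background)
print(cut, value)
```

Raising the cut removes background faster than signal up to a point, after which the loss of signal dominates and the significance falls again.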
