Automated Parallel Calculation of Collaborative Statistical Models in RooFit
Patrick Bos, IEEE eScience, Amsterdam, 31 October 2018

Physics: Wouter Verkerke (PI), Vince Croft, Carsten Burgard
eScience: Patrick Bos (yours truly), Inti Pelupessy, Jisk Attema
Particle physics
High energy proton collisions
Images from http://atlas.physicsmasterclasses.org
LHC @ CERN (Large Hadron Collider)
- ATLAS, CMS
- LHCb
10 PB/yr of p-p collision data, reduced to kB-MB of binned & unbinned events
Research questions
Higgs properties Physics beyond the Standard Model
- Supersymmetry?
- Dark matter?
- …
RooFit: Collaborative Statistical Modeling
- RooFit: build models together
- Teams 10-100 physicists
- Collaborations of ~3000 → ~100 teams
- Exascale collaboration:
  10^15 synaptic connections × 10^3 brains = 10^18 (exa)
- 1 goal
- Pretty impressive to an outsider
Collaborative Statistical Modeling with RooFit
Making RooFit faster (~30x; hours → minutes)
- More efficient collaboration
- Faster iteration/debugging
- Faster feedback between teams
- Next-level physics modeling ambitions, while retaining an interactive workflow
- 1. Complex likelihood models, e.g.
  a) Higgs fit to all channels: ~200 datasets, O(1000) parameters, currently O(few) hours
  b) EFT framework: again 10-100x more expensive
- 2. Unbinned ML fits with very large data samples
- 3. Unbinned ML fits with MC-style numerical integrals (see the likelihood sketch below)
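For context, here is the standard unbinned negative log-likelihood (a textbook definition, not taken from the slides) for a PDF $f(x; \theta)$ and $N$ observed events:

$$ -\ln L(\theta) = -\sum_{i=1}^{N} \ln \frac{f(x_i; \theta)}{\int f(x; \theta)\,\mathrm{d}x} $$

Each evaluation costs one PDF call per event plus the normalization integral; when the integral has no analytical form it must be computed numerically (e.g. Monte Carlo), which is why integrals appear as a separate level of parallelism later on.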
Higgs @ ATLAS: 20k+ nodes, 125k hours
Expression tree of C++ objects for mathematical components (variables, operators, functions, integrals, datasets, etc.), coupled with data via event "observables"
exascale complexity!
Goals and Design: Make fitting in RooFit faster using automated parallel calculation
Making fitting in RooFit faster: how?
- Serial: benchmarks show no obvious bottlenecks; RooFit is already highly optimized (pre-calculation/memoization, MPFE)
- Parallel: the remaining opportunity (next slides)
Fitting method: Quasi-Newton minimization (+ handle multiple minima, physics allowed ranges, confidence intervals)
Minuit: minimize the PDF $f(x; \theta)$:
- Quasi-Newton MIGRAD method
- Gradient + line search:
  - gradient for N parameters $\theta$:
    $\frac{\partial f}{\partial \theta_i} \approx \frac{f(\theta + d\theta_i) - f(\theta - d\theta_i)}{2\, d\theta_i}$
  - 2N $f$ calls → parallelize
  - line search: descend along the gradient direction
  - 2-3 $f$ calls → parallelize!
[Figure: left, Newton's method; right, gradient descent]
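To make the "parallelize" claim concrete, here is a minimal sketch of a parallel central-difference gradient. This is illustrative only: the function and type names are made up, and it uses std::async threads, whereas RooFit's implementation forks worker processes.

#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Central-difference gradient: each partial derivative needs two
// independent evaluations of f, so all 2N calls can run in parallel.
std::vector<double> numerical_gradient(
    const std::function<double(const std::vector<double>&)>& f,
    const std::vector<double>& theta, double h = 1e-6)
{
  std::vector<std::future<double>> partials;
  for (std::size_t i = 0; i < theta.size(); ++i) {
    partials.push_back(std::async(std::launch::async, [&f, &theta, h, i] {
      std::vector<double> up = theta, down = theta;
      up[i] += h;
      down[i] -= h;
      return (f(up) - f(down)) / (2 * h);  // two f calls per parameter
    }));
  }
  std::vector<double> grad;
  for (auto& p : partials) grad.push_back(p.get());
  return grad;
}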
Faster fitting: (how) can we do it?
Levels of parallelism
- 1. Gradient (parameter partial derivatives) in the minimizer
- 2. Likelihood ($f$)
- 3. Integrals (normalization) & other expensive shared components
[Figure: the three levels — likelihood over events; likelihood over (unequal) components; integrals etc. ("Vector")]
Faster fitting: (how) can we do it?
Heterogeneous: sizes, types
- Multiple strategies
- How to split up?
  - Small components → need low latency/overhead
  - Large components as well…
  - Run time depends on optimizations, differs per parameter, hard to predict
- How to divide over cores?
  - Load balancing → task-based approach: work stealing
  - … both for likelihood-level and gradient-level parallelization
[Figure: likelihood over events; likelihood over (unequal) components; integrals etc.]
Design: MultiProcess task-stealing framework
Task-stealing worker pool executes Job tasks. A Job = a likelihood component, a partial derivative $\partial f / \partial \theta_i$, …
No threads, process-based: BidirMMapPipe handles fork, mmap, pipes
[Diagram: Master ↔ Queue ↔ Worker 1, Worker 2, … connected via IPC pipes]
Master: the main RooFit process; submits Jobs to the queue and waits for results (or does other things in between)
Worker loop: the worker requests a task, the Queue pops a task, the worker executes it and sends the result back to the Queue; repeat until the Job is done.
Queue loop: acts on input from Master or Workers (mainly to avoid a loop in the Master / user code); collects and distributes Jobs and results; once a Job is done, the Queue sends the results back to the Master on request. (A single-process sketch of this loop follows below.)
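As an illustration of the loop just described, here is a minimal single-process, thread-based sketch (hypothetical names; the actual framework forks worker processes and communicates over pipes, see below):

#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative stand-in for the Queue: workers steal integer task ids
// until the Job is drained, then results are available to the master.
struct TaskQueue {
  std::queue<int> tasks;        // task ids of the current Job
  std::vector<double> results;  // one slot per task
  std::mutex m;

  bool pop(int& task) {  // "worker requests task, queue pops task"
    std::lock_guard<std::mutex> lock(m);
    if (tasks.empty()) return false;
    task = tasks.front();
    tasks.pop();
    return true;
  }
  void push_result(int task, double r) {  // "worker sends result"
    std::lock_guard<std::mutex> lock(m);
    results[task] = r;
  }
};

void worker_loop(TaskQueue& q) {
  int task;
  while (q.pop(task)) {      // steal tasks until the Job is done
    double r = task * task;  // stand-in for evaluate_task(task)
    q.push_result(task, r);
  }
}

int main() {
  TaskQueue q;
  q.results.resize(100);
  for (int t = 0; t < 100; ++t) q.tasks.push(t);
  std::vector<std::thread> workers;
  for (int w = 0; w < 4; ++w) workers.emplace_back(worker_loop, std::ref(q));
  for (auto& w : workers) w.join();  // master waits for all results
}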
Parallel performance (MPFE & MP) Likelihood fits (unbinned, binned) Gradients
Parallel likelihood fits: unbinned, MPFE
Before: max ~2x speedup. Now (with CPU pinning fixed): max ~20x (more for larger fits).
Run-time vs N(cores): simple N-dim Gaussian, many events
[Plot: minimization wall time [s] (0.5-3.0) vs number of workers/CPUs (1-8); series: measured (not pinned), measured (CPUs pinned), expected (ideal)]
Parallel likelihood fits: certain classes of models, e.g. binned fits with Beeston-Barlow modelling of template uncertainties
[Plot: run time vs N(cores) for certain types of binned fits; actual vs expected (ideal parallelization) performance relative to single-core CPU time; the gap is under investigation]
Fitting method recap (see above): quasi-Newton MIGRAD minimization in Minuit; the gradient costs 2N $f$ calls and the line search 2-3 $f$ calls, and both are parallelized.
- Important: serial & parallel results must be the same
  - non-trivial, due to Minuit-internal parameter transformations
[Figure: left, Newton's method; right, gradient descent]
Gradient parallelization
First benchmarks: “ggF model” (gluon-gluon fusion → Higgs boson), MIGRAD fit
- realistic, non-trivial (265 parameters)
- scaling not perfect and erratic (±5 s); likely caused by the communication protocol, under investigation
RooMinimizer (serial): 28 s
MultiProcess GradMinimizer: 1 worker: 33 s | 2 workers: 20 s | 3 workers: 15 s | 4 workers: 14 s | 6 workers: 17 s | (…) | 8 workers: 11 s
Conclusions
- Interactive study of complex LHC physics fits (e.g. Higgs) requires parallelization
- We improved the scaling performance of likelihood-level parallelization
- Bottlenecks still exist for certain classes of models
- New flexible framework: multi-level parallelization (likelihood, gradient)
- First working version; now analyzing and tuning performance
Let’s stay in touch
+31 (0)6 10 79 58 74 p.bos@esciencecenter.nl www.esciencecenter.nl egpbos linkedin.com/in/egpbos blog.esciencecenter.nl
Encore
Future work
- Load balancing: PDF timings change dynamically due to RooFit pre-calculation strategies (… not a problem for numerical integrals)
- Analytical derivatives (automated? CLAD)
- Minuit confidence intervals
Numerical integrals
- “Analytical” integrals vs forced numerical (Monte Carlo) integrals
- (the Higgs fits didn’t have the latter)
Numerical integrals
[Plot: individual numerical-integral timings, maxima and minima (variation over runs and iterations), and the sum of the slowest integrals/cores per iteration over the entire run (single-core total runtime: 3.2 s)]
Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
- Serial class: likelihood (e.g. RooNLLVar) or gradient (Minuit)
- Interface: subclass + MP
- Define "vector elements"
- Group elements into tasks (to be executed in parallel)
(Other components: RooFit::MultiProcess::SharedArg<T>, RooFit::MultiProcess::TaskManager)
Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
RooFit::MultiProcess::SharedArg<T>
- Normalization integrals or other shared expensive objects
- Parallel task definition specific to the type of object
- … design in progress
(Other component: RooFit::MultiProcess::TaskManager)
Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass> RooFit::MultiProcess::SharedArg<T>
RooFit::MultiProcess::TaskManager
- Queue gathers tasks and communicates with the worker pool
- Workers steal tasks from the queue
Worker pool: forked processes (BidirMMapPipe)
- performant and already used in RooFit
- no thread-safety concerns
- instead: communication concerns
- … flexible design, implementation can be replaced (e.g. TBB)
MultiProcess for users
// (slide code assumes using namespace std)
vector<double> x {1, 4, 5, 6.48074};
xSquaredSerial xsq_serial(x);

size_t N_workers = 4;
xSquaredParallel xsq_parallel(N_workers, x);

// get the same results, but now faster:
xsq_serial.get_result();
xsq_parallel.get_result();

// use the parallelized version in your existing functions:
void some_function(xSquaredSerial* xsq);
some_function(&xsq_parallel);  // no problem!
MultiProcess usage for devs

template <class T> class MP::Vector : public T, public MP::Job
class Parallel : public MP::Vector<Serial>

[Diagram: the Parallelized class derives from both the Serial class and MP::Vector (itself a MP::Job), and talks to the MP::TaskManager]
MultiProcess usage for devs
class xSquaredSerial {
public:
  xSquaredSerial(vector<double> x_init)
      : x(move(x_init)), x_squared(x.size()) {}  // note: initialize x_squared, not "result"

  virtual void evaluate() {
    for (size_t ix = 0; ix < x.size(); ++ix) {
      x_squared[ix] = x[ix] * x[ix];
    }
  }

  vector<double> get_result() {
    evaluate();
    return x_squared;
  }

protected:
  vector<double> x;
  vector<double> x_squared;
};

class xSquaredParallel : public RooFit::MultiProcess::Vector<xSquaredSerial> {
public:
  xSquaredParallel(size_t N_workers, vector<double> x_init)
      : RooFit::MultiProcess::Vector<xSquaredSerial>(N_workers, x_init) {}

private:
  void evaluate_task(size_t task) override {
    // runs on a worker; results is the Job's result buffer
    results[task] = x[task] * x[task];
  }

public:
  void evaluate() override {
    if (get_manager()->is_master()) {
      // do necessary synchronization before work_mode
      // enable work mode: workers will start stealing work from the queue
      get_manager()->set_work_mode(true);
      // master fills the queue with tasks (id is this Job's id)
      for (size_t task_id = 0; task_id < x.size(); ++task_id) {
        get_manager()->to_queue(JobTask(id, task_id));
      }
      // wait for task results back from workers to master
      gather_worker_results();
      // end work mode
      get_manager()->set_work_mode(false);
      // put gathered results in the container used by the serial class
      for (size_t task_id = 0; task_id < x.size(); ++task_id) {
        x_squared[task_id] = results[task_id];
      }
    }
  }
};
Single core profiling and improvements
Faster fitting: single core profiling with Callgrind, Cachegrind, Instruments !
Higgs ggF & 9-channel fits (workspaces by Lydia Brenner). Most time spent on:
- 1. Memory access → RooVectorDataStore::get() (4% / 32%), 0.3% LL cache misses (expensive!)
  - Row-wise access pattern on a column-wise data store (and std::vector<std::vector>); see the sketch after this list
- 2. Logarithms: 12%
- 3. Interpolation → RooStats::HistFactory::FlexibleInterpVar (10%)
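A minimal sketch of the access-pattern problem in point 1 (generic C++, not RooFit code): with a column-wise store, per-event (row-wise) iteration strides across separate allocations, while per-column iteration streams contiguous memory.

#include <cstddef>
#include <vector>

// Column-wise store: one contiguous vector per observable,
// as in a RooVectorDataStore-style layout.
using Columns = std::vector<std::vector<double>>;

double sum_row_wise(const Columns& cols, std::size_t n_events) {
  double s = 0;
  for (std::size_t event = 0; event < n_events; ++event)
    for (const auto& col : cols)  // cache-unfriendly: jump between columns
      s += col[event];
  return s;
}

double sum_column_wise(const Columns& cols) {
  double s = 0;
  for (const auto& col : cols)    // cache-friendly: contiguous scan
    for (double v : col)
      s += v;
  return s;
}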
Faster fitting: single core improvements
- RooLinkedList::findArg: ~5% of memory access instructions
- RooLinkedList::At took considerable time in a Gaussian test fit (Vince)
- std::vector lookup → 1.6x speedup! (WIP; see the sketch below)
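A sketch of the kind of change implied (hypothetical types, not the actual RooFit patch): replacing a linear by-name scan through list nodes with a pre-resolved std::vector index.

#include <cstddef>
#include <string>
#include <vector>

struct Arg { std::string name; double value; };

// O(n) per lookup, like a linked-list findArg-by-name inside the fit loop:
const Arg* find_by_name(const std::vector<Arg>& args, const std::string& n) {
  for (const auto& a : args)
    if (a.name == n) return &a;
  return nullptr;
}

// O(1) per lookup, once the index is resolved before the fit loop:
const Arg& find_by_index(const std::vector<Arg>& args, std::size_t i) {
  return args[i];
}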
Faster fitting: future work
- Reorder tree evaluation → better CPU cache use, vectorization
- Smarter fitting (stochastic minimizer, analytical gradient, CLAD)
- Front-end / back-end separation (e.g. TensorFlow back-end)
Faster fitting: single core profiling meta-conclusions
- Profiling functions & classes: Valgrind, gprof, Instruments, … etc.
- Profiling objects (e.g. call-trees, e.g. RooFit…): … DIY?
More Multi-Core
Parallel likelihood fits: existing RooFit implementation details
RooRealMPFE / BidirMMapPipe: a custom multi-process message-passing protocol
- POSIX fork, pipe, mmap (see the sketch after this list)
- Communication "overhead" (delay between sending and receiving messages): ~1e-4 seconds
- serverLoop waits for a message & runs server-side code
- messages are used sparingly
- data transfer goes over memory-mapped pipes
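For illustration, a minimal fork-plus-pipe round trip: the generic POSIX pattern underneath BidirMMapPipe, which adds mmap'ed buffers and a message protocol on top (this is not BidirMMapPipe's actual code).

#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  int to_child[2], to_parent[2];
  pipe(to_child);
  pipe(to_parent);
  if (fork() == 0) {                    // child: the "server" side
    double x;
    read(to_child[0], &x, sizeof x);    // serverLoop: wait for a message
    double y = x * x;                   // run server-side code
    write(to_parent[1], &y, sizeof y);  // send the result back
    _exit(0);
  }
  double x = 3.0, y;                    // parent: the client side
  write(to_child[1], &x, sizeof x);
  read(to_parent[0], &y, sizeof y);
  std::printf("3 squared = %g\n", y);
  wait(nullptr);
}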
TensorFlow experiments
Fits on identical model & data (single i7 machine)
TensorFlow: No pre-calculation / caching!
Major advantage of RooFit for binned fits (e.g. morphing histograms)
(feature request for memoization https://github.com/tensorflow/tensorflow/issues/5323)