Trying to run EvtGen in parallel provided some useful information

Since we wish to reuse part of the code written by the BaBar collaboration, two main questions need an answer:

- Is it possible to run BaBar code (legacy code) in parallel? If so, what kind of performance can be expected?
- What type of parallelization can be applied to this code with minimum impact?

S. Longo – 2nd SuperB Collaboration Meeting – LNF

EvtGen is «…an event generator designed for the simulation of the physics of B decays.» (http://www.slac.stanford.edu/~lange/EvtGen)

Some characteristics:

- It can run in «standalone» mode (without the BaBar Framework)
- It is written in C++ and interfaced with Fortran event generators
- It depends on legacy code written in Fortran (Pythia, Photos) and C++ (CERNLib, CLHEP)

[…]

    // Initialize the random number generator
    EvtRandomEngine* MyRandomEngine;
    MyRandomEngine = new EvtCLHEPRandomEngine();

    // Set the initial conditions
    double xyzt = 0.0;
    HepLorentzVector t_init(xyzt, xyzt, xyzt, xyzt);

    // Create the generator from the decay table and the particle list
    EvtGen* myGenerator = new EvtGen("DECAY.DEC", "evt.pdl", MyRandomEngine);

    // Generate one event
    EvtVector4R p_init(EvtPDL::getMass(EvtPDL::getId("Upsilon(4S)")), 0.0, 0.0, 0.0);
    EvtParticle* Event = EvtParticleFactory::particleFactory(EvtPDL::getId("Upsilon(4S)"), p_init);
    Event->setVectorSpinDensity();
    myGenerator->generateEvent(Event, t_init);

[…]

With TBB, a for loop is parallelized by defining a BodyObject as follows:

    class BodyObject {
    private:
        <Thread Pool Private Data>
    public:
        BodyObject(…);
        void operator()(const blocked_range<size_t>& Range) const {
            <Thread Private Data>
            for (size_t i = Range.begin(); i != Range.end(); ++i) {
                <Something to be executed in parallel>
            }
        }
    };
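As a concrete instance of the body-object shape, the sketch below fills a vector with squares. It is only an illustration under stated assumptions: `BlockedRange` and `run_in_chunks` are hypothetical stand-ins for `tbb::blocked_range` and `tbb::parallel_for`, written so that the example compiles without TBB. The stand-in applies the body to one chunk after another, whereas TBB would hand the chunks to worker threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<size_t>: a half-open range [begin, end).
class BlockedRange {
public:
    BlockedRange(std::size_t b, std::size_t e) : begin_(b), end_(e) { assert(b <= e); }
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
private:
    std::size_t begin_, end_;
};

// Body object: writes i*i into slot i of an output vector shared by all chunks.
class SquareBody {
public:
    explicit SquareBody(std::vector<double>* out) : out_(out) {}
    void operator()(const BlockedRange& range) const {
        for (std::size_t i = range.begin(); i != range.end(); ++i) {
            const double x = static_cast<double>(i);  // data private to this call
            (*out_)[i] = x * x;                       // each index is written once
        }
    }
private:
    std::vector<double>* out_;  // data shared by every chunk
};

// Serial stand-in for tbb::parallel_for: splits [0, n) into chunks of at most
// `grain` elements and applies the body to each chunk in turn.
template <typename Body>
void run_in_chunks(std::size_t n, std::size_t grain, const Body& body) {
    assert(grain > 0);
    for (std::size_t b = 0; b < n; b += grain) {
        const std::size_t e = (b + grain < n) ? b + grain : n;
        body(BlockedRange(b, e));
    }
}

// Convenience wrapper: returns {0, 1, 4, 9, ...} with n entries.
std::vector<double> squares(std::size_t n, std::size_t grain) {
    std::vector<double> out(n, 0.0);
    run_in_chunks(n, grain, SquareBody(&out));
    return out;
}
```

Calling `squares(10, 3)` processes the chunks [0,3), [3,6), [6,9) and [9,10). Because each chunk touches a disjoint set of indices, the same body could be run concurrently without locking, which is the property the pattern relies on.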

A first attempt parallelized the decay phase of the generation process as follows:

- The body object is initialized with the initial conditions (x, y, z, t) and the generator to use
- A vector of events is generated by the functor

    void operator()(const blocked_range<unsigned long>& Range) const {
        for (unsigned long i = Range.begin(); i < Range.end(); ++i) {
            EvtVector4R p_init(EvtPDL::getMass(EvtPDL::getId("Upsilon(4S)")), 0.0, 0.0, 0.0);
            (*EventVector)[i] = EvtParticleFactory::particleFactory(EvtPDL::getId("Upsilon(4S)"), p_init);
            (*EventVector)[i]->setVectorSpinDensity();
            EventGenerator->generateEvent((*EventVector)[i], *t_init);
        }
    }

A single generator per thread pool is a bottleneck.

EvtGen itself must be «parallelized» in some way, but:

- There is extensive use of static classes and properties
- Data produced in the Fortran part of the code is passed through «common blocks» (memory shared among the program units of the same program)

A Body Object with a local Event Generator cannot work!

A different parallelization pattern has to be used.
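The static-state problem can be illustrated with a toy class. This is a hypothetical sketch, not EvtGen code: `ToyGenerator` keeps its state in a static member, which behaves like a static class property or a Fortran common block in that there is one copy per process rather than one per object.

```cpp
#include <cassert>

// Toy generator (hypothetical) whose state lives in a static member.
class ToyGenerator {
public:
    int next() { return ++s_counter; }  // every instance advances the same counter
    static int counter() { return s_counter; }
private:
    static int s_counter;  // one copy per process, not per object
};
int ToyGenerator::s_counter = 0;

// Advances instance `a` a number of times, then asks instance `b` for one value.
// Returns how far b's value is from the counter at entry: it would always be 1
// if b were independent of a.
int second_instance_sees(int calls_on_first) {
    assert(calls_on_first >= 0);
    ToyGenerator a;  // imagine this one local to thread A
    ToyGenerator b;  // imagine this one local to thread B
    const int start = ToyGenerator::counter();
    for (int i = 0; i < calls_on_first; ++i) a.next();
    return b.next() - start;
}
```

Here `second_instance_sees(3)` returns 4 rather than 1: creating a separate object per thread does not create separate state. With real threads the shared counter would also be updated without synchronization, so giving each thread its own generator object is not enough.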

Profiling a serial execution of the code to produce some thousands of events, we get:

         %  cumulative    self       calls    self   total  name
     24.75        4.02    4.02    91182679    0.00    0.00  EvtBtoXsgammaFermiUtil::FermiExpFunc (…)
     12.52        6.05    2.03    67624537    0.00    0.00  EvtBtoXsgammaKagan::DeltaFermiFunc (…)
     10.85        7.81    1.76  1217901394    0.00    0.00  std::vector<double, std::allocator<double> >::operator [](…) const
      7.24        8.98    1.18    67624537    0.00    0.00  EvtBtoXsgammaKagan::Delta (…)
      6.04        9.96    0.98    91182550    0.00    0.00  EvtBtoXsgammaKagan::FermiFunc (…)
      5.24       10.81    0.85    88725714    0.00    0.00  EvtItgThreeCoeffFcn::myFunction (…) const
     […]

More than 2/3 of the time is spent doing math (integrating Fermi and Delta functions).
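A rough upper bound on what parallelizing this fraction could buy follows from Amdahl's law, S(n) = 1 / ((1 - p) + p / n). The sketch below assumes that the whole profiled fraction p is perfectly parallelizable. That is an assumption, not a measurement, and the thread counts in the usage note are illustrative.

```cpp
#include <cassert>

// Amdahl's law: upper bound on the speedup when a fraction p of the serial run
// time is spread perfectly over n threads and the remaining 1 - p stays serial.
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 2/3, `amdahl_speedup(2.0 / 3.0, 4)` gives 2.0 and `amdahl_speedup(2.0 / 3.0, 8)` gives 2.4. However many threads are used, the bound stays below 3, since the serial third of the run time is untouched.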

- Profiling showed that the computation of the hadronic mass spectra is the most time-consuming procedure
- There is room for a performance increase if we parallelize function integration

How to parallelize legacy code?

- TBB allows fine-grained management of threads and tasks, but requires a complete rewrite of the code to make it «TBB compliant»
- OpenMP gives less freedom to the programmer, but can be easily injected into existing code

EvtBtoXsgammaKagan::computeHadronicMass is the EvtGen procedure that calculates the hadronic mass spectra.

It contains a fairly long setup followed by a for loop in which the branching fraction is calculated: this is the section that has to be parallelized.

How to proceed with the OpenMP parallelization?

- Identify object dependencies
- Create separate objects local to each thread
- Reduce the result of the loop
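The three steps can be sketched on a toy loop. This is not the code of `computeHadronicMass`: `ToyIntegrator`, `total_over_bins` and the integrand exp(-x) are hypothetical, chosen only so that the loop body owns mutable state, as an integrator object in the real loop presumably would.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical integrator with internal scratch storage, standing in for the
// objects that the real loop would have to duplicate per thread (step 1: the
// scratch buffer is the dependency that must not be shared).
class ToyIntegrator {
public:
    explicit ToyIntegrator(int steps) : steps_(steps), scratch_(steps + 1) {
        assert(steps > 0);
    }
    // Trapezoidal integral of exp(-x) over [0, upper].
    double integrate(double upper) {
        const double h = upper / steps_;
        for (int k = 0; k <= steps_; ++k) scratch_[k] = std::exp(-k * h);
        double sum = 0.5 * (scratch_[0] + scratch_[steps_]);
        for (int k = 1; k < steps_; ++k) sum += scratch_[k];
        return sum * h;
    }
private:
    int steps_;
    std::vector<double> scratch_;  // mutable state, unsafe to share between threads
};

// Sums the integrals for upper limits 0.1, 0.2, ..., 0.1 * bins.
// When built without OpenMP support the pragmas are ignored and the loop runs serially.
double total_over_bins(int bins, int steps) {
    double total = 0.0;
    #pragma omp parallel reduction(+ : total)
    {
        ToyIntegrator integrator(steps);  // step 2: one object per thread
        #pragma omp for
        for (int i = 1; i <= bins; ++i) {
            total += integrator.integrate(0.1 * i);  // step 3: partial sums are reduced
        }
    }
    return total;
}
```

Declaring the integrator inside the parallel region, rather than before it, is what makes it private to each thread. The `reduction` clause gives every thread its own partial `total` and adds the partial sums together when the region ends, so the result can differ from the serial one only by floating-point rounding.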

Comparison between serial execution and the OpenMP implementation

Measurements were taken on a single Intel Xeon E5630 system (4 cores, 2 hardware threads per core) with 12 GB of RAM.

The parallelization pattern improves the performance of the legacy code.

Note: the plotted measurements are correlated!
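For reference, a comparison of this kind is usually expressed as speedup (serial time divided by parallel time) and efficiency (speedup per thread). The helpers below are a minimal sketch of how such figures could be obtained. The slides do not say how the timings were taken, so the function names and the use of `std::chrono` are assumptions.

```cpp
#include <cassert>
#include <chrono>

// Wall-clock time, in seconds, taken by one call of f.
template <typename F>
double seconds_for(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speedup of a parallel run relative to the serial one.
double speedup(double t_serial, double t_parallel) {
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

// Parallel efficiency: speedup divided by the number of threads used.
double efficiency(double t_serial, double t_parallel, int threads) {
    assert(threads >= 1);
    return speedup(t_serial, t_parallel) / threads;
}
```

For example, a serial run of 8 s against a parallel run of 2 s on 8 threads gives a speedup of 4 and an efficiency of 0.5.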

Comparison between serial execution and the OpenMP implementation

Moving beyond the setup used to profile the application, the speedup quickly drops: other parts of the code become dominant in CPU usage.

Note: the plotted measurements are correlated!

EvtGen parallelization provided some useful information:

- Legacy Fortran code can be executed in a parallel environment such as OpenMP or TBB
- TBB cannot be used profitably in modules like EvtGen (static classes/properties) unless we are willing to rewrite a major part of the code
- OpenMP can be employed to parallelize sections of legacy code with minor modifications
- The OpenMP solution can provide a fairly good performance gain

The EvtGen example also suggests a pattern that can be adopted to parallelize legacy modules:

- Identify a set of typical use cases
- Profile the module on those cases
- Identify the most time-consuming part of the code
- Parallelize it via OpenMP

Unfortunately, this adds a new line of work to the Framework R&D activities.
