Back from School
IN2P3 2016 Computing School « Heterogeneous Hardware Parallelism »
Vincent Lafage lafage@ipno.in2p3.fr
S2I, Institut de Physique Nucléaire d’Orsay Université Paris-Sud
9 October 2016
Thema
« Discover the major forms of parallelism offered by modern hardware architectures. Explore how to take advantage of these by combining multiple software technologies, looking for a good balance of performance, code portability and durability. Most [sessions are built around an] example of particle collision simulation. »
https://indico.in2p3.fr/event/13126/
e+e− → 3γ
https://bitbucket.org/bixente/3photons/src
Numerical integration
⇒ Monte-Carlo method (no adaptive refinement)
· pseudo-random number generator
· translate to a random event: RAMBO, « A new Monte Carlo treatment of multiparticle phase space at high energies »
· … and start again
But you have two types of iterations:
1 the ones that depend upon previous ones…
2 … and the others
⇒ Monte-Carlo is embarrassingly parallel¹ … but rather delightfully parallel

¹ … as in embarrassment of riches
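For reference, the estimator behind this loop structure (the standard Monte-Carlo form, not spelled out on the slide):

\[
  I \;=\; \int_V f(x)\,\mathrm{d}x \;\approx\; \frac{V}{N}\sum_{i=1}^{N} f(x_i),
  \qquad \Delta I \;\propto\; \frac{\sigma_f}{\sqrt{N}}
\]

Each sample f(x_i) is independent of all the others, so the N evaluations can be spread over threads, vector lanes or nodes at will; only the final sum needs a (cheap) reduction — hence “delightfully parallel”.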
Pseudo Random Number
1 Linear congruential, Lehmer's
  Knuth, « Seminumerical Algorithms »
  « Random numbers fall mainly in the planes » (see the RANDU problem)
2 RANLUX, Lüscher
  « A portable high-quality random number generator for lattice field theory simulations »
3 xorshift, Marsaglia
4 Mersenne Twister, Matsumoto–Nishimura
5 Counter-based: Random123
  https://www.deshawresearch.com/resources_random123.html
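These generators can be tiny. As an illustration, a minimal sketch of Marsaglia's 64-bit xorshift with the (13, 7, 17) shift triple from his paper (seeding and statistical-quality caveats left aside):

#include <cstdint>

// Marsaglia xorshift64 — the state must be initialized to a nonzero value.
std::uint64_t xorshift64 (std::uint64_t *state) {
   std::uint64_t x = *state;
   x ^= x << 13;
   x ^= x >> 7;
   x ^= x << 17;
   *state = x;
   return x;
}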
Main loop

//!> Start of integration loop
startTime = getticks ();
int ev, evSelected = 0;
double evWeight;
for (ev = 0; ev < run->nbrOfEvents; ev++) {
   // Reset matrix elements
   resetME2 (&ee3p);
   // Event generator
   evWeight = generateRambo (&rambo, outParticles, 3, run->ETotal);
   evWeight = evWeight * run->cstVolume;
   // Sort outgoing photons by energy
   sortPhotons (outParticles);
   // Spinor inner product, scalar product and
   // center-of-mass frame angles computation
   computeSpinorProducts (&ee3p.spinor, ee3p.momenta);
   // Intermediate computation
   computeScalarProducts (&ee3p);
   if (selectEvent (&pParameters, &ee3p)) {
      computeME2 (&ee3p, &pParameters, run->ETotal);
      updateStatistics (&statistics, &pParameters, &ee3p, evWeight);
      evSelected++;
   }
}
OpenMP'ified loop

//!> Start of integration loop
startTime = getticks ();
int ev, evSelected = 0;
double evWeight;
// Assumes an enclosing `#pragma omp parallel` region. The reduction
// clause was truncated on the slide; `evSelected` is the accumulator
// being summed. Note that `evWeight` must be private, and the RNG
// state in `rambo`, the shared `statistics` and the per-event work
// buffers (`ee3p`, `outParticles`) need per-thread handling as well.
#pragma omp for reduction (+:evSelected) private (evWeight)
for (ev = 0; ev < run->nbrOfEvents; ev++) {
   // Reset matrix elements
   resetME2 (&ee3p);
   // Event generator
   evWeight = generateRambo (&rambo, outParticles, 3, run->ETotal);
   evWeight = evWeight * run->cstVolume;
   // Sort outgoing photons by energy
   sortPhotons (outParticles);
   // Spinor inner product, scalar product and
   // center-of-mass frame angles computation
   computeSpinorProducts (&ee3p.spinor, ee3p.momenta);
   // Intermediate computation
   computeScalarProducts (&ee3p);
   if (selectEvent (&pParameters, &ee3p)) {
      computeME2 (&ee3p, &pParameters, run->ETotal);
      updateStatistics (&statistics, &pParameters, &ee3p, evWeight);
      evSelected++;
   }
}
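The same reduction pattern in a self-contained, compilable toy (a Monte-Carlo estimate of π — my illustration, not the 3photons code), with one RNG state per thread so that nothing is shared:

#include <cstdio>
#include <omp.h>

// Per-thread Marsaglia xorshift64; the state must be nonzero.
static double uniform01 (unsigned long long *s) {
   *s ^= *s << 13;  *s ^= *s >> 7;  *s ^= *s << 17;
   return (*s >> 11) * (1.0 / 9007199254740992.0);   // top 53 bits -> [0, 1)
}

int main (void) {
   const long long n = 100000000LL;
   long long hits = 0;
   #pragma omp parallel
   {
      // One RNG state per thread: no sharing, no race
      unsigned long long seed = 0x9E3779B97F4A7C15ULL * (omp_get_thread_num () + 1);
      #pragma omp for reduction (+:hits)
      for (long long i = 0; i < n; i++) {
         double x = uniform01 (&seed);
         double y = uniform01 (&seed);
         if (x * x + y * y < 1.0)
            hits++;
      }
   }
   printf ("pi ~= %f\n", 4.0 * hits / n);
   return 0;
}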
http://wccftech.com/moores-law-will-be-dead-2020-claim-experts-fab-process-cpug-pus-7-nm-5-nm/
There is no escaping parallel computing any more, even on a laptop.
It has been ~10 years since serial code just “got faster”: the gains now come from parallelism.
1998–2014: The Memory Wall

Year    Proc         GFlops   GHz     Cores    SIMD    GB/s
1998    DEC α        0.750    0.375   1        —       0.6
2014    Intel Xeon   500      2.6     2 × 14   AVX2    68
ratio                ×1333    ×7      ×28      ×4/8    ×100
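A quick machine-balance calculation from this table (my arithmetic, not on the slide):

\[
  \frac{0.750\ \text{GFlop/s}}{0.6\ \text{GB/s}} \approx 1.25\ \text{flop/byte}
  \qquad\longrightarrow\qquad
  \frac{500\ \text{GFlop/s}}{68\ \text{GB/s}} \approx 7.4\ \text{flop/byte}
\]

In 1998 a kernel needed ~10 flops per double loaded to keep the ALUs busy; in 2014 it needs ~59. The required arithmetic intensity grew about six-fold because compute outpaced bandwidth by nearly an order of magnitude — the memory wall.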
Summary
Over the last 15 years:
· supercomputer speedup: ×1000
· node speedup: ×1000
· speedup from clock frequency: ×10
· speedup from SMP parallelism: ×10
· speedup from vectorization: ×10
· memory bandwidth increase: ×100
Moving data
Data is moved through wires. Wires behave like an RC circuit.
Trade-off:
· longer response time (“latency”)
· higher current (more power)
Physics says: communication is slow, power-hungry, or both.
— Andreas Klöckner
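The RC claim can be made quantitative with the standard first-order wire-delay model (my addition, a textbook result): both the resistance and the capacitance of a wire grow linearly with its length L, so the charging time grows quadratically:

\[
  R \propto L, \qquad C \propto L \qquad\Longrightarrow\qquad \tau = RC \propto L^2
\]

Sending a signal twice as far costs four times the delay, or extra power for repeaters and stronger drivers — which is why data movement, not arithmetic, dominates the cost.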
If all you have is a hammer…
… everything looks like a nail
Prefer moving work to the data over moving data to the work.
Not only do we need to parallelize, the arithmetic intensity of the problem must also be high enough that we are not memory bound: do not be blinded by the number of cores in a GPU.
Matrix products, inversions and diagonalizations, on the other hand, are uniquely suited to exploiting vector architectures. To look at problems through the eyes of a GPU coder, you need to identify matrix products. Theory simulations are likely to benefit from GPUs, as they usually rely on less data to transfer.
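This is exactly what the roofline model expresses (not named on the slide, but it captures the point): for a kernel with arithmetic intensity I in flop/byte,

\[
  P_{\text{attainable}} \;=\; \min\bigl(P_{\text{peak}},\; I \cdot B_{\text{mem}}\bigr)
\]

Below the ridge point I = P_peak / B_mem the kernel is memory bound and extra cores do not help; dense matrix-matrix products have an intensity that grows with matrix size, which is why they map so well onto GPUs.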
To vectorize efficiently, you need good space and time locality of data
— e.g. with OpenMP 4+ and its portable simd directives.
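A minimal sketch of what this looks like in practice (illustrative code, assuming a compiler run with OpenMP enabled, e.g. -fopenmp):

#include <cstddef>

// A unit-stride loop with good spatial locality; the OpenMP 4+ simd
// directive asks the compiler to vectorize it.
void saxpy (std::size_t n, float a, const float *x, float *y) {
   #pragma omp simd
   for (std::size_t i = 0; i < n; i++)
      y[i] += a * x[i];   // contiguous accesses map directly to vector lanes
}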
[Figure: repeated Feynman diagrams with W, τ and H lines]
« École informatique IN2P3 2016 », Palaiseau, 23–27 May 2016
Key points
· … (changing technologies)
· … task)
· … performance analysis)
· efficiency ↔ portability
http://libocca.org/talks/riceOG15.pdf
The C++ Standard: under consideration (P0058R1)
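For flavor, here is what the interface under discussion looks like in the form eventually standardized in C++17 (a sketch with the final names; in 2016 it still lived under std::experimental::parallel):

#include <algorithm>
#include <execution>
#include <vector>

int main () {
   std::vector<double> v (1u << 20, 1.0);
   // Parallel and vectorized execution requested via the policy argument
   std::for_each (std::execution::par_unseq, v.begin (), v.end (),
                  [] (double &x) { x = 2.0 * x + 1.0; });
   return 0;
}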
« Massively Parallel Task-Based Programming with HPX », Thomas Heller, Computer Architecture – Department of Computer Science, 23.05.2016
HPX 101 – API Overview
(The synchronous column is plain C++ / standard library; the asynchronous and fire & forget columns are HPX.)

                         Synchronous               Asynchronous                      Fire & Forget
                         (returns R)               (returns future<R>)               (returns void)
Functions (direct)       f(p...)                   async(f, p...)                    apply(f, p...)
Functions (lazy)         bind(f, p...)(...)        async(bind(f, p...), ...)         apply(bind(f, p...), ...)
Actions (direct)         a()(id, p...)             async(a(), id, p...)              apply(a(), id, p...)
Actions (lazy)           bind(a(), id, p...)(...)  async(bind(a(), id, p...), ...)   apply(bind(a(), id, p...), ...)

Actions are declared once with HPX_ACTION(f, a).
In addition: dataflow(func, f1, f2);
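A minimal local sketch of the function-based column (illustrative; uses the 2016-era names, where fire-and-forget is hpx::apply, and header layout varies across HPX versions):

#include <hpx/hpx_main.hpp>      // routes main() through the HPX runtime
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>  // futures, dataflow
#include <iostream>

int square (int x) { return x * x; }

int main () {
   // Asynchronous: returns a future<int> immediately
   hpx::future<int> f = hpx::async (square, 6);
   // Fire & forget: returns void, the result is discarded
   hpx::apply (square, 7);
   // dataflow: runs as soon as its future arguments are ready
   hpx::future<int> g = hpx::dataflow (
      [] (hpx::future<int> a) { return a.get () + 1; }, std::move (f));
   std::cout << g.get () << std::endl;   // prints 37
   return 0;
}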
Setting the Stage
Idea:
· start with a math-y statement of the operation
· “push a few buttons” (transformations) to optimize for the target device
· strongly separate these two parts
Philosophy:
· avoid “intelligence”
· the user can assume partial responsibility for correctness
· embedding in Python provides generation/transformation flexibility
— Andreas Klöckner

Conclusion
+ Performance
− Obfuscation:
    OpenMP:        1   kSLOC
    C99:           1.2 + 1.2 kSLOC
    OpenCL + CPU:  2.5 kSLOC
    HPX + CPU:     2.5 kSLOC
    HPX + OpenCL:  6.7 kSLOC
HPX: quite a big library (still a heavy prototype)
± need for added abstraction layers
! rely on libraries as much as possible