

slide-1
SLIDE 1

Back from School

IN2P3 2016 Computing School « Heterogeneous Hardware Parallelism »

Vincent Lafage lafage@ipno.in2p3.fr

S2I, Institut de Physique Nucléaire d’Orsay Université Paris-Sud

9 October 2016

1 / 18
slide-2
SLIDE 2

Theme

« Discover the major forms of parallelism offered by modern hardware architectures. Explore how to take advantage of these by combining multiple software technologies, looking for a good balance of performance, code portability and durability. Most of the technologies presented here will be applied to a simplified example of particle collision simulation. »

https://indico.in2p3.fr/event/13126/

2 / 18
slide-3
SLIDE 3
  • the code (à la Rosetta Code : F77, F90, Ada, C++, C & Go)

e+e− → 3γ https://bitbucket.org/bixente/3photons/src

  • different technologies implemented or simply mentioned:
  • OpenMP,
  • C++'11 & HPX,
  • OpenCL (+ MPI),
  • Intel Threading Building Blocks TBB (C++)
  • DSL-Python
  • OpenCL for FPGA
3 / 18
slide-4
SLIDE 4

Numerical integration

⇒ Monte-Carlo method (no adaptive refinement)

  • generate an event with a weight,

  · pseudo-random number generator
  · translate to a random event: RAMBO, « A new Monte Carlo treatment of multiparticle phase space at high energies »

  • check if it passes cuts,
  • compute matrix element,
  • weight it,
  • sum it up…

…and start again.

But you have two types of iterations:

1 the ones that depend upon previous ones…
2 … and the others

⇒ Monte-Carlo is embarrassingly parallel 1

… but rather delightfully parallel

1 … as in an embarrassment of riches

4 / 18
slide-5
SLIDE 5

Pseudo Random Number

1 Linear congruential, Lehmer's

Knuth, « Seminumerical Algorithms »

« Random numbers fall mainly in the planes » (see RANDU problem)

2 RANLUX, Lüscher

« A portable high-quality random number generator for lattice field theory simulations »

3 xorshift, Marsaglia
4 Mersenne Twister, Matsumoto-Nishimura
5 Counter-based, Random123

https://www.deshawresearch.com/resources_random123.html
  • satisfy rigorous statistical testing (BigCrush in TestU01),
  • vectorize and parallelize well (… > 2^64 independent streams),
  • have long periods (… > 2^128),
  • require little or no memory or state,
  • have excellent performance (a few cycles per random byte)
http://dx.doi.org/10.1145/2063384.2063405 5 / 18
slide-6
SLIDE 6

Main loop

//!> Start of integration loop
startTime = getticks ();
int ev, evSelected = 0;
double evWeight;
for (ev = 0; ev < run->nbrOfEvents; ev++) {
  // Reset matrix elements
  resetME2 (&ee3p);
  // Event generator
  evWeight = generateRambo (&rambo, outParticles, 3, run->ETotal);
  evWeight = evWeight * run->cstVolume;
  // Sort outgoing photons by energy
  sortPhotons (outParticles);
  // Spinor inner product, scalar product and
  // center-of-mass frame angles computation
  computeSpinorProducts (&ee3p.spinor, ee3p.momenta);
  // Intermediate computation
  computeScalarProducts (&ee3p);
  if (selectEvent (&pParameters, &ee3p)) {
    computeME2 (&ee3p, &pParameters, run->ETotal);
    updateStatistics (&statistics, &pParameters, &ee3p, evWeight);
    evSelected++;
  }
}

6 / 18
slide-7
SLIDE 7

OpenMP'ified loop

//!> Start of integration loop
startTime = getticks ();
int ev, evSelected = 0;
double evWeight;
#pragma omp for reduction(+:evSelected)
for (ev = 0; ev < run->nbrOfEvents; ev++) {
  // Reset matrix elements
  resetME2 (&ee3p);
  // Event generator
  evWeight = generateRambo (&rambo, outParticles, 3, run->ETotal);
  evWeight = evWeight * run->cstVolume;
  // Sort outgoing photons by energy
  sortPhotons (outParticles);
  // Spinor inner product, scalar product and
  // center-of-mass frame angles computation
  computeSpinorProducts (&ee3p.spinor, ee3p.momenta);
  // Intermediate computation
  computeScalarProducts (&ee3p);
  if (selectEvent (&pParameters, &ee3p)) {
    computeME2 (&ee3p, &pParameters, run->ETotal);
    updateStatistics (&statistics, &pParameters, &ee3p, evWeight);
    evSelected++;
  }
}

7 / 18
slide-8
SLIDE 8


Moore’s Law: transition to many-core

http://wccftech.com/moores-law-will-be-dead-2020-claim-experts-fab-process-cpug-pus-7-nm-5-nm/

There is no escaping parallel computing any more, even on a laptop. It has been ~10 years since serial code just “got faster”: gains now come from on-chip parallelism.

slide-9
SLIDE 9

1998-2014: The Memory Wall

Year    Proc         GFlops   GHz     Cores    SIMD   GB/s
1998    DEC α        0.750    0.375   1        -      0.6
2014    Intel Xeon   500      2.6     2 × 14   AVX2   68
ratio                ×1333    ×7      ×28      ×4/8   ×100
slide-10
SLIDE 10

Summary

Over the last 15 years:
  • Supercomputer speedup: ×1000
  • Node speedup: ×1000
  • Speedup from frequency: ×10
  • Speedup from SMP parallelism: ×10
  • Speedup from vectorization: ×10
  • Memory bandwidth increase: ×100
slide-11
SLIDE 11

Moving data

Data is moved through wires. Wires behave like an RC circuit.
Trade-off: longer response time (“latency”) vs. higher current (more power).
Physics says: communication is slow, power-hungry, or both.

Andreas Klöckner, DSL to Manycore
slide-12
SLIDE 12

If all you have is a hammer,….

… everything looks like a nail

Prefer moving work to the data over moving data to the work.

Not only do we need to parallelize, we also need the arithmetic intensity of the problem to be high enough that we are not memory bound: do not be blinded by the number of cores in a GPU, nor by the benchmarks. Scalar products and convolutions are limited by memory, but matrix products, inversion and diagonalization are ideally suited to vector architectures. To look at problems through the eyes of a GPU coder, you need to identify matrix products. Theory simulations are likely to benefit from GPUs, as they usually rely on less data to transfer

  • … as initial input to GPU memory
  • … as final output (aggregate) back from GPU memory
12 / 18
slide-13
SLIDE 13

To vectorize efficiently,….

  • find SIMD parallelism
  • keep data aligned close to cores

(good space and time locality of data)

  • choose your weapon
  • assembly language (+ antidepressant)
  • compiler vector intrinsics (+ aspirin)
  • good compiler (… for trivial loops)

+ OpenMP 4+

  • proper libraries (e.g. eigen, boost::simd,…)
13 / 18
slide-14
SLIDE 14

[Figure: repeated Feynman diagrams with W, τ and H lines]

Palaiseau, 23-27 May 2016 “École informatique IN2P3 2016”


Introduction Choosing a SW technology

Key points

  • Efficient & portable application (changing technologies)
  • Standard, free (cheap) technology
  • Easy development (//ism is not an easy task)
  • Development tools (debugger, performance analysis)

[Diagram: efficiency vs. portability trade-off]

http://libocca.org/talks/riceOG15.pdf

slide-15
SLIDE 15

The C++ Standard

  • C++11 introduced lower level abstractions
  • std::thread, std::mutex, std::future, etc.
  • Fairly limited (low level), more is needed
  • C++ needs stronger support for higher-level parallelism
  • New standard, C++17:
  • Parallel versions of STL algorithms (P0024R2)
  • Several proposals to the Standardization Committee are accepted or under consideration
  • Technical Specification: Concurrency (N4577)
  • Other proposals: Coroutines (P0057R2), task blocks (N4411), executors (P0058R1)

Massively Parallel Task-Based Programming with HPX 23.05.2016 | Thomas Heller | Computer Architecture – Department of Computer Science 7/ 77

slide-16
SLIDE 16

HPX 101 – API Overview

HPX API vs. the C++ Standard Library, for a function R f(p...):

                     Synchronous         Asynchronous                 Fire & Forget
                     (returns R)         (returns future<R>)          (returns void)
Functions (direct)   f(p...)             async(f, p...)               apply(f, p...)
Functions (lazy)     bind(f, p...)(...)  async(bind(f, p...), ...)    apply(bind(f, p...), ...)
Actions (direct)     a()(id, p...)       async(a(), id, p...)         apply(a(), id, p...)
Actions (lazy)       bind(a(), id, p...)(...)   async(bind(a(), id, p...), ...)   apply(bind(a(), id, p...), ...)

Actions are declared with HPX_ACTION(f, a). In addition: dataflow(func, f1, f2);


slide-17
SLIDE 17

Setting the Stage

Idea:
  • Start with a math-y statement of the operation
  • “Push a few buttons” (transformations) to optimize for the target device
  • Strongly separate these two parts

Philosophy:
  • Avoid “intelligence”
  • User can assume partial responsibility for correctness
  • Embedding in Python provides generation/transform flexibility

Andreas Klöckner, DSL to Manycore
slide-18
SLIDE 18

Conclusion

+ Performance

  • OpenMP: ×27 on 32 cores (15.6 % overhead)
  • OpenCL + MPI: ×148 on 3 GPU nodes

− Obfuscation:
  • OpenMP: 1 kSLOC
  • C99: 1.2 + 1.2 kSLOC
  • OpenCL + CPU: 2.5 kSLOC
  • HPX + CPU: 2.5 kSLOC
  • HPX + OpenCL: 6.7 kSLOC
  • HPX: quite a big library (still heavily a prototype)

± need for added abstraction layers

  • massaging the initial code:
  • interest of functional programming (at least, purify functions)
  • change random number generator ⇒ (Random123)
  • reformulate some problems ⇒ matrix products

! rely on libraries as much as possible

18 / 18