

slide-1
SLIDE 1

Back from School

IN2P3 2016 Computing School « Heterogeneous Hardware Parallelism »

Vincent Lafage lafage@ipno.in2p3.fr

S2I, Institut de Physique Nucléaire d’Orsay Université Paris-Sud

9 October 2016

1 / 18
slide-2
SLIDE 2

Theme

« Discover the major forms of parallelism offered by modern hardware architectures. Explore how to take advantage of these by combining multiple software technologies, looking for a good balance of performance, code portability and durability. Most of the technologies presented here will be applied to a simplified example of particle collision simulation. »

https://indico.in2p3.fr/event/13126/

2 / 18
slide-3
SLIDE 3
  • the code (à la Rosetta Code : F77, F90, Ada, C++, C & Go)

e+e− → 3γ https://bitbucket.org/bixente/3photons/src

  • different technologies implemented or simply mentioned:
  • OpenMP,
  • C++'11 & HPX,
  • OpenCL (+ MPI),
  • Intel Threading Building Blocks TBB (C++)
  • DSL-Python
  • OpenCL for FPGA
3 / 18
slide-4
SLIDE 4

Numerical integration

⇒ Monte-Carlo method (no adaptive refinement)

  • generate an event with a weight,

  · pseudo-random number generator
  · translate to a random event: RAMBO, « A new Monte Carlo treatment of multiparticle phase space at high energies »

  • check if it passes cuts,
  • compute matrix element,
  • weight it,
  • sum it up…

…and start again.

But you have two types of iterations:

1 the ones that depend upon previous ones…
2 … and the others

⇒ Monte-Carlo is embarrassingly parallel 1

… but rather delightfully parallel

1 … as in an embarrassment of riches

4 / 18
slide-5
SLIDE 5

Pseudo Random Number

1 Linear congruential, Lehmer's

Knuth, « Seminumerical Algorithms »

« Random numbers fall mainly in the planes » (see RANDU problem)

2 RANLUX, Lüscher

« A portable high-quality random number generator for lattice field theory simulations »

3 xorshift, Marsaglia
4 Mersenne Twister, Matsumoto-Nishimura
5 Counter-based, Random123

https://www.deshawresearch.com/resources_random123.html
  • satisfy rigorous statistical testing (BigCrush in TestU01),
  • vectorize and parallelize well (… > 2^64 independent streams),
  • have long periods (… > 2^128),
  • require little or no memory or state,
  • have excellent performance (a few cycles per random byte)
http://dx.doi.org/10.1145/2063384.2063405 5 / 18
slide-6
SLIDE 6

Main loop

//!> Start of integration loop
startTime = getticks ();
int ev, evSelected = 0;
double evWeight;
for (ev = 0; ev < run->nbrOfEvents; ev++) {
  // Reset matrix elements
  resetME2 (&ee3p);
  // Event generator
  evWeight = generateRambo (&rambo, outParticles, 3, run->ETotal);
  evWeight = evWeight * run->cstVolume;
  // Sort outgoing photons by energy
  sortPhotons (outParticles);
  // Spinor inner product, scalar product and
  // center-of-mass frame angles computation
  computeSpinorProducts (&ee3p.spinor, ee3p.momenta);
  // Intermediate computation
  computeScalarProducts (&ee3p);
  if (selectEvent (&pParameters, &ee3p)) {
    computeME2 (&ee3p, &pParameters, run->ETotal);
    updateStatistics (&statistics, &pParameters, &ee3p, evWeight);
    evSelected++;
  }
}

6 / 18
slide-7
SLIDE 7

OpenMP'ified loop

//!> Start of integration loop
startTime = getticks ();
int ev, evSelected = 0;
double evWeight;
#pragma omp for reduction(+:evSelected)
for (ev = 0; ev < run->nbrOfEvents; ev++) {
  // Reset matrix elements
  resetME2 (&ee3p);
  // Event generator
  evWeight = generateRambo (&rambo, outParticles, 3, run->ETotal);
  evWeight = evWeight * run->cstVolume;
  // Sort outgoing photons by energy
  sortPhotons (outParticles);
  // Spinor inner product, scalar product and
  // center-of-mass frame angles computation
  computeSpinorProducts (&ee3p.spinor, ee3p.momenta);
  // Intermediate computation
  computeScalarProducts (&ee3p);
  if (selectEvent (&pParameters, &ee3p)) {
    computeME2 (&ee3p, &pParameters, run->ETotal);
    updateStatistics (&statistics, &pParameters, &ee3p, evWeight);
    evSelected++;
  }
}

7 / 18
slide-8
SLIDE 8


Moore’s Law: transition to many-core

http://wccftech.com/moores-law-will-be-dead-2020-claim-experts-fab-process-cpug-pus-7-nm-5-nm/

There is no escaping parallel computing any more, even on a laptop. It has been ~10 years since serial code just “got faster”: gains now come from on-chip parallelism.

slide-9
SLIDE 9

1998-2014: The Memory Wall

Year    Proc         GFlops   GHz     Cores    SIMD   GB/s
1998    DEC α        0.750    0.375   1        -      0.6
2014    Intel Xeon   500      2.6     2 × 14   AVX2   68
ratio                ×1333    ×7      ×28      ×4/8   ×100
slide-10
SLIDE 10

Summary

Over the last 15 years:
  • Supercomputer speedup: ×1000
  • Node speedup: ×1000
  • Speedup from frequency: ×10
  • Speedup from SMP parallelism: ×10
  • Speedup from vectorization: ×10
  • Memory bandwidth increase: ×100
slide-11
SLIDE 11

Moving data

Data is moved through wires. Wires behave like an RC circuit.
Trade-off: longer response time (“latency”) vs. higher current (more power).
Physics says: communication is slow, power-hungry, or both.

Andreas Klöckner, DSL to Manycore
slide-12
SLIDE 12

If all you have is a hammer,….

… everything looks like a nail

Prefer moving work to the data over moving data to the work.

Not only do we need to parallelize, we also need the arithmetic intensity of the problem to be high enough that we are not memory bound: do not be blinded by the number of cores in a GPU, nor by the benchmarks. Scalar products and convolutions are limited by memory, but matrix products, inversion and diagonalization are ideally suited to vector architectures. To look at problems through the eyes of a GPU coder, you need to identify matrix products. Theory simulations are likely to benefit from GPUs, as they usually rely on less data to transfer

  • … as initial input to GPU memory
  • … as final output (aggregate) back from GPU memory
12 / 18
slide-13
SLIDE 13

To vectorize efficiently,….

  • find SIMD parallelism
  • keep data aligned close to cores

(good space and time locality of data)

  • choose your weapon
  • assembly language (+ antidepressant)
  • compiler vector intrinsics (+ aspirin)
  • good compiler (… for trivial loops)

+ OpenMP 4+

  • proper libraries (e.g. eigen, boost::simd,…)
13 / 18
slide-14
SLIDE 14

[Figure: repeated Feynman diagrams with W, τ and H lines]

Palaiseau, 23-27 May 2016 “École informatique IN2P3 2016”


Introduction Choosing a SW technology

Key points

  • Efficient & portable application (changing technologies)
  • Standard, free (cheap) technology
  • Easy development (//ism is not an easy task)
  • Development tools (debugger, performance analysis)

[Diagram: efficiency vs. portability trade-off]

http://libocca.org/talks/riceOG15.pdf

slide-15
SLIDE 15

The C++ Standard

  • C++11 introduced lower level abstractions
  • std::thread, std::mutex, std::future, etc.
  • Fairly limited (low level), more is needed
  • C++ needs stronger support for higher-level parallelism
  • New standard, C++17:
  • Parallel versions of STL algorithms (P0024R2)
  • Several proposals to the Standardization Committee are accepted or under consideration
  • Technical Specification: Concurrency (N4577)
  • Other proposals: Coroutines (P0057R2), task blocks (N4411), executors (P0058R1)

Massively Parallel Task-Based Programming with HPX 23.05.2016 | Thomas Heller | Computer Architecture – Department of Computer Science 7/ 77

slide-16
SLIDE 16

HPX 101 – API Overview

HPX API vs. the C++ Standard Library, for a function R f(p...):

                     Synchronous         Asynchronous                 Fire & Forget
                     (returns R)         (returns future<R>)          (returns void)
Functions (direct)   f(p...)             async(f, p...)               apply(f, p...)
Functions (lazy)     bind(f, p...)(...)  async(bind(f, p...), ...)    apply(bind(f, p...), ...)
Actions (direct)     a()(id, p...)       async(a(), id, p...)         apply(a(), id, p...)
Actions (lazy)       bind(a(), id, p...)(...)   async(bind(a(), id, p...), ...)   apply(bind(a(), id, p...), ...)

Actions are declared with HPX_ACTION(f, a). In addition: dataflow(func, f1, f2);


slide-17
SLIDE 17

Setting the Stage

Idea:
  • Start with a math-y statement of the operation
  • “Push a few buttons” (transformations) to optimize for the target device
  • Strongly separate these two parts

Philosophy:
  • Avoid “intelligence”
  • User can assume partial responsibility for correctness
  • Embedding in Python provides generation/transform flexibility

Andreas Klöckner, DSL to Manycore
slide-18
SLIDE 18

Conclusion

+ Performance

  • OpenMP: ×27 on 32 cores (15.6 % overhead)
  • OpenCL + MPI: ×148 on 3 GPU nodes

− Obfuscation:
  • OpenMP: 1 kSLOC
  • C99: 1.2 + 1.2 kSLOC
  • OpenCL + CPU: 2.5 kSLOC
  • HPX + CPU: 2.5 kSLOC
  • HPX + OpenCL: 6.7 kSLOC
  • HPX: quite a big library (still heavily a prototype)

± need for added abstraction layers

  • massaging the initial code:
  • interest of functional programming (at least, purify functions)
  • change random number generator ⇒ (Random123)
  • reformulate some problems ⇒ matrix products

! rely on libraries as much as possible

18 / 18