CHEP 2010
How to harness the performance potential of current Multi-Core CPUs and GPUs
Sverre Jarp, CERN openlab
IT Dept., CERN
Taipei, Monday 18 October 2010
Contents
The hardware situation
Current software
Software prototypes
Some recommendations
Conclusions
In the days of the Pentium
Life was really simple:
– Pipeline
– Superscalar
– Increased frequency with each generation
– Sockets
– Nodes: (single-socket) boxes
Today: Seven dimensions of multiplicative performance
In-core dimensions:
– Pipelining
– Superscalar
– Vector width (SIMD = Single Instruction Multiple Data)
Chip- and system-level dimensions:
– Multithreading
– Multicore
– Sockets
– Nodes
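A rough way to see why these dimensions are multiplicative (this formula is an editorial illustration, not from the slide; the symbols are generic):

\[
\text{Throughput} \;\approx\; \underbrace{f_{\text{clock}} \times \text{IPC} \times W_{\text{SIMD}}}_{\text{per core}} \;\times\; N_{\text{cores}} \times N_{\text{sockets}} \times N_{\text{nodes}}
\]

Pipelining and hardware multithreading mainly help keep the IPC term close to its peak; losing any one factor (for instance using only a single SIMD lane) divides the whole product.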
Moore's law
The number of transistors doubles every other year
Single-core → Multi-core → Many-core
Imagine a chip with 1'000'000'000 transistors!
Adapted from Wikipedia
Real consequence of Moore's law
Large number of cores
Four floating-point data flavours (256 bits)
Longer vectors:
– Scalar single precision: one element (E0) out of eight
– Scalar double precision: one element (E0) out of four
– Packed single precision: eight elements (E0 … E7)
– Packed double precision: four elements (E0 … E3)
Without vectors in our software, we will use 1/4 or 1/8 of the available execution width
The move to many-core systems
Sockets × cores × hardware threads per node keeps growing:
– Dual socket, six cores, no SMT: 2 * 6 * 1 = 12
– Dual socket, six cores, 2-way SMT: 2 * 6 * 2 = 24
– Quad socket, twelve cores, no SMT: 4 * 12 * 1 = 48
– Quad socket, eight cores, 2-way SMT: 4 * 8 * 2 = 64
– Quad socket Sun Niagara (T3) processors w/16 cores and 8 threads (each): 4 * 16 * 8 = 512
Accelerators (1): Intel MIC
Based on the x86 architecture, 22 nm (in 2012?)
Each core: in order, 4 hardware threads, SIMD-16 vector unit, own I$ and D$
[Block diagram: many such cores sharing an L2 cache, memory and system interface, display interface, fixed-function and texture logic]
Accelerators (2): Nvidia Fermi GPU
Streaming Multiprocessor (SM) architecture:
– Register file, dual thread scheduler with dispatch units
– 32 cores per SM
– L1 cache (configurable)
[Block diagram adapted from Nvidia; peak performance quoted at 1.15 GHz]
Lots of interest in the HEP on-line community
SW performance: A complicated story!
We write high-level code, for instance to transport particles through matter
A compiler (or an interpreter) transforms the high-level code to machine-level code
A sophisticated processor with a complex architecture and even more complex micro-architecture executes the code
In most cases we have little clue as to the efficiency of this transformation process
We need forward scalability
Software that extracts maximum performance from today's hardware, automatically
More cores, more threads, and longer vectors would automatically be put to good use
Ideally without rewriting the software at every hardware change!
Concurrency in HEP
Events
I/O streams (ROOT trees, branches)
These are the natural units of concurrency in today's software frameworks
HEP programming paradigm
Each job is responsible for processing M events
Typically one job per core (or more); today SMT is often switched off in the BIOS (!)
The aim is to keep every core busy with useful compute!
What are the multi-core options?
Possible way(s) forward:
1) Stay with event-level parallelism (and entirely independent processes)
2) Rely on forking
3) Move to a fully multi-threaded paradigm
– But, watch out for increased complexity
Achieving an efficient memory footprint
[Diagram: Core 0 – Core 3 each hold their own event-specific data; global data (physics processes, physics data, magnetic field) is shared across cores]
Slide shown in my talk at CHEP2007
Today: Multithreaded Geant4 prototype developed at Northeastern University
– Reentrant code
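As a toy illustration of the layout in the diagram (not actual Geant4 code; Geometry, FieldMap, Event and processEvent are invented names, and a C++11 compiler is assumed), the shared data is set up once and only read by the workers, while each worker thread keeps its own event state:

```cpp
// Sketch only: the memory layout of the diagram above, not real Geant4 code.
#include <thread>
#include <vector>

struct Geometry { /* detector description, read-only after setup */ };
struct FieldMap { /* magnetic field, read-only after setup */ };
struct Event    { std::vector<double> hits; /* per-event state */ };

// Shared: written once during initialisation, then only read by all workers.
static const Geometry* gGeometry = 0;
static const FieldMap* gField    = 0;

// One instance per worker thread: the "event-specific data" of the diagram.
thread_local Event currentEvent;

void processEvent(long id)
{
    currentEvent.hits.clear();
    // ... transport particles through *gGeometry using *gField,
    //     filling currentEvent.hits ...
    (void)id;
}

void worker(long first, long count)
{
    for (long i = 0; i < count; ++i)
        processEvent(first + i);
}

int main()
{
    static Geometry geo;  static FieldMap field;
    gGeometry = &geo;     gField = &field;      // set up shared data once

    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)                 // Core 0 .. Core 3
        pool.emplace_back(worker, t * 1000L, 1000L);
    for (std::size_t t = 0; t < pool.size(); ++t) pool[t].join();
}
```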
Examples of parallelism: CBM/ALICE track fitting
ALICE High Level Trigger (HLT) code
I.Kisel/GSI: "Fast SIMDized Kalman filter based track fit" http://www-linux.gsi.de/~ikisel/17_CPC_178_2008.pdf
Originally ported to the Cell processor
Uses a simplified parameterisation of the magnetic field
Vectorized code operates on packed single-precision data rather than scalars, on several SIMD-capable systems
CBM = "Compressed Baryonic Matter" (experiment at FAIR)
CBM/ALICE track fitting
Operator overloading hides the SIMD details from the user code
The vector classes map directly onto SSE instructions:
– _mm_add_ps corresponds directly to ADDPS, the instruction that operates on four packed, single-precision FP numbers
– P4_F32vec4 – packed single-precision class with overloaded operators
– The overloaded addition is essentially { return _mm_add_ps(a, b); } (single precision) – see the sketch below
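A minimal sketch in the same spirit (not the actual P4_F32vec4 / Kalman-filter code; the class name and main() are invented): the overloaded operators wrap SSE intrinsics, so code written once in a scalar style operates on four values per instruction:

```cpp
// Sketch only: a 4-wide single-precision vector class whose overloaded
// operators map 1:1 onto SSE instructions.
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_add_ps, ...
#include <cstdio>

class F32vec4Sketch {
public:
    __m128 v;
    F32vec4Sketch() : v(_mm_setzero_ps()) {}
    F32vec4Sketch(__m128 x) : v(x) {}
    explicit F32vec4Sketch(float s) : v(_mm_set1_ps(s)) {}   // broadcast

    friend F32vec4Sketch operator+(F32vec4Sketch a, F32vec4Sketch b)
    { return _mm_add_ps(a.v, b.v); }                          // ADDPS
    friend F32vec4Sketch operator*(F32vec4Sketch a, F32vec4Sketch b)
    { return _mm_mul_ps(a.v, b.v); }                          // MULPS
};

int main()
{
    float xa[4] = {1, 2, 3, 4}, xb[4] = {10, 20, 30, 40}, out[4];
    F32vec4Sketch a(_mm_loadu_ps(xa)), b(_mm_loadu_ps(xb));

    // One statement, four values updated at once:
    F32vec4Sketch r = a * F32vec4Sketch(2.0f) + b;

    _mm_storeu_ps(out, r.v);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
}
```

In such a fitter the SIMD lanes can be filled with different tracks, so one pass of the filter processes four tracks at once.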
Examples of parallelism: CBM track fitting
From H.Bjerke/CERN openlab, I.Kisel/GSI
Examples of parallelism: GEANT4
Earlier work parallelized Geant4 by distributing events across remote nodes.
The multithreaded prototype exploits event-level parallelism inside a multi-core node
Done by NEU PhD student Xin Dong, using the FullCMS and TestEM examples
– Thread-unsafe state is made thread-private: especially global, "extern", and static declarations
– Preprocessor used for automating the work.
– Large read-only data is shared between threads: physics tables, geometry, stepping, etc.
Dong, Cooperman, Apostolakis: "Multithreaded Geant4: Semi-Automatic Transformation into Scalable Thread-Parallel Software", Europar 2010
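A toy example of that kind of transformation (not the actual Geant4MT patch; the macro name is invented, and __thread is a GCC/ICC extension): a formerly global counter becomes per-thread, so the worker threads no longer race on it:

```cpp
// Sketch only: a global/static declaration rewritten so that each worker
// thread gets its own copy; a macro keeps the change mechanical enough for a
// preprocessor/script to apply it everywhere.
#include <pthread.h>
#include <cstdio>

// Hypothetical macro: expands to GCC/ICC thread-local storage.
#define G4_THREAD_LOCAL __thread

// Before:  static long nSecondaries = 0;        (shared, racy)
// After:   one independent counter per thread   (no locking needed)
static G4_THREAD_LOCAL long nSecondaries = 0;

void* trackEvents(void* arg)
{
    long nEvents = *static_cast<long*>(arg);
    for (long i = 0; i < nEvents; ++i)
        nSecondaries += 3;                 // toy stand-in for real tracking
    std::printf("thread-local count: %ld\n", nSecondaries);
    return 0;
}

int main()
{
    long nEvents = 1000;
    pthread_t t[4];
    for (int i = 0; i < 4; ++i) pthread_create(&t[i], 0, trackEvents, &nEvents);
    for (int i = 0; i < 4; ++i) pthread_join(t[i], 0);
}
```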
Multithreaded GEANT4 benchmark
[Scaling plot, measured on a machine with 4 sockets]
From A.Nowak/CERN openlab
Example: ROOT minimization and fitting
Minuit parallelization is independent of user code
A complex BaBar fitting example was provided and parallelized using MPI
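A generic sketch of the MPI pattern involved (not the actual ROOT/Minuit code; the dataset, pdf() and parameter handling are placeholders): each rank evaluates the negative log-likelihood on its share of the events and the partial sums are combined, so the minimizer only ever sees the total:

```cpp
// Sketch only: the general MPI pattern for parallelizing a likelihood fit.
#include <mpi.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Toy probability density; in a real fit this is the user's model.
static double pdf(double x, double mean) { return std::exp(-0.5*(x-mean)*(x-mean)); }

// Each rank sums -log(pdf) over its slice of the events; the partial sums
// are then combined across all ranks.
double negLogLikelihood(const std::vector<double>& events, double mean,
                        int rank, int nranks)
{
    double local = 0.0;
    for (std::size_t i = rank; i < events.size(); i += nranks)
        local -= std::log(pdf(events[i], mean) + 1e-300);

    double total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    std::vector<double> events(100000, 1.2);        // toy dataset
    double nll = negLogLikelihood(events, 1.0, rank, nranks);
    if (rank == 0) std::printf("NLL = %f\n", nll);

    MPI_Finalize();
}
```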
AthenaMP: event level parallelism
$> Athena.py --nprocs=4 -c EvtMax=100 Jobo.py
[Diagram: the parent process runs init serially on the input files, OS-forks four workers (core-0 … core-3) after the first events; each worker processes its own subset of events in random order (WORKER 0: [0, 4, 5, …], WORKER 1: [1, 6, 9, …], WORKER 2: [2, 8, 10, …], WORKER 3: [3, 7, 11, …]) and writes output tmp files, which the parent merges into the output files at the end]
SERIAL: parent init and fork
PARALLEL: workers event loop
SERIAL: parent merge and finalize
Maximize the shared memory!
From: Mous TATARKHANOV/May 2010
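The same fork/copy-on-write pattern, reduced to a toy standalone C++ program (AthenaMP itself is Python/Athena, not this code; names and numbers are illustrative):

```cpp
// Sketch only: initialisation happens once in the parent; fork() then gives
// each worker a copy-on-write view of that memory, so read-only data stays
// physically shared between the workers.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main()
{
    const int nWorkers = 4, nEvents = 100;

    // SERIAL: parent init (geometry, conditions, ...) - shared via copy-on-write
    std::vector<double> conditions(1 << 20, 1.0);

    for (int w = 0; w < nWorkers; ++w) {
        if (fork() == 0) {                        // child = worker w
            // PARALLEL: worker event loop over its own subset of events
            for (int evt = w; evt < nEvents; evt += nWorkers) {
                double c = conditions[evt];       // read-only use of shared data
                (void)c;                          // ... process event 'evt' ...
            }
            std::printf("worker %d (pid %d) done, would write tmp file\n",
                        w, (int)getpid());
            _exit(0);
        }
    }

    // SERIAL: parent waits for all workers, then merges their tmp outputs
    while (wait(0) > 0) { /* reap workers */ }
    std::printf("parent: merge tmp files and finalize\n");
}
```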
Memory footprint of AthenaMP
From ~1.5 GB down to ~1.0 GB per process
AthenaMP: ~0.5 GB of physical memory saved per process
From: Mous TATARKHANOV/May 2010
Scalability plots for AthenaMP
[Scaling plots for AthenaMP on a machine with 8 physical cores (16 logical)]
From: Mous TATARKHANOV/May 2010
Some recommendations (based on observations in openlab)
Shortlist
1) Broad Programming Talent
2) Holistic View with a clear split: Prepare to compute – Compute
3) Controlled Memory Usage
4) C++ for Performance
5) Best-of-breed Tools
Broad Programming Talent
Levels of transformation, from problem to electrons:
Problem → Algorithms, abstraction → Source program → Compiled code, libraries → System architecture, instruction-set architecture → Circuits → Electrons
We need both "solution" specialists (top of the stack) and "technology" specialists (bottom of the stack)
Adapted from Y.Patt, U-Austin
Performance guidance (cont'd)
Take the "prepare to compute – compute" split into account
– Via early prototypes
– Structure the job as Pre (prepare) – Heavy compute – Post (finalize)
– The heavy-compute phase is where most of the available parallelism lies
– Use affinity scheduling to pin threads/processes to cores (see the sketch below)
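One possible way to apply affinity scheduling on Linux (a sketch, assuming the slide indeed refers to CPU affinity; pthread_setaffinity_np is a GNU extension and other OSes need different calls):

```cpp
// Sketch only: pin each worker thread to its own core so it keeps its
// caches warm during the heavy-compute phase.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void* worker(void* arg)
{
    long core = (long)arg;
    std::printf("worker pinned to core %ld\n", core);
    // ... heavy compute phase runs here ...
    return 0;
}

int main()
{
    const int nThreads = 4;
    pthread_t t[nThreads];

    for (long c = 0; c < nThreads; ++c) {
        pthread_create(&t[c], 0, worker, (void*)c);

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)c, &set);                      // allow only core c
        pthread_setaffinity_np(t[c], sizeof(set), &set);
    }
    for (int c = 0; c < nThreads; ++c) pthread_join(t[c], 0);
}
```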
Performance guidance (cont'd)
Minimize data movement (especially in an accelerator environment)
Optimize the cache hierarchy
Minimize the use of expensive math operations:
– SQRT, DIV; LOG, EXP, POW; ATAN2, SIN, COS
C++ parallelization support
Several options (either in the compiler or as additions):
– Native: pthreads/Windows threads
– OpenMP
– Threading Building Blocks (TBB) and Array Building Blocks (ArBB) from Intel (ArBB integrating RapidMind)
We must also keep a close eye on OpenCL (www.khronos.org/opencl)
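For example, with OpenMP (one of the options above) an independent event loop parallelizes with a single pragma; the work per event below is only a placeholder:

```cpp
// Sketch only: OpenMP applied to a toy event loop. Compile with e.g. -fopenmp.
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    const int nEvents = 100000;
    std::vector<double> result(nEvents);

    // Each iteration (event) is independent, so the loop can be split
    // statically across all available cores/hardware threads.
    #pragma omp parallel for schedule(static)
    for (int evt = 0; evt < nEvents; ++evt) {
        double x = 0.0;
        for (int step = 0; step < 1000; ++step)   // stand-in for real work
            x += evt * 1e-6 + step * 1e-9;
        result[evt] = x;
    }

    std::printf("threads available: %d, result[0]=%g\n",
                omp_get_max_threads(), result[0]);
}
```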
Organization of data: AoS vs SoA
In general, compilers and hardware prefer the latter!
AoS (Array of Structures): spacepoints SP1 … SP6, each holding its own X, Y, Z
SoA (Structure of Arrays): one Spacepoints structure holding X1…X6, Y1…Y6, Z1…Z6 as contiguous arrays
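The two layouts in (illustrative) C++; in the SoA version the x values are contiguous, which is what lets a compiler or hand-written SIMD code load several of them at once:

```cpp
// Sketch only: the two layouts from the slide, with invented names.
#include <cstddef>
#include <vector>

// AoS: one structure per spacepoint (SP1, SP2, ...)
struct SpacePoint { float x, y, z; };
typedef std::vector<SpacePoint> SpacePointsAoS;

// SoA: one structure for all spacepoints, one array per coordinate
struct SpacePointsSoA {
    std::vector<float> x, y, z;
};

// The same loop on both layouts: shift every point along x.
void shiftAoS(SpacePointsAoS& sp, float dx)
{
    for (std::size_t i = 0; i < sp.size(); ++i)
        sp[i].x += dx;               // 12-byte stride between x values
}

void shiftSoA(SpacePointsSoA& sp, float dx)
{
    for (std::size_t i = 0; i < sp.x.size(); ++i)
        sp.x[i] += dx;               // unit stride: easy to auto-vectorize
}

int main()
{
    SpacePointsAoS aos(6);           // SP1 ... SP6
    SpacePointsSoA soa;
    soa.x.resize(6); soa.y.resize(6); soa.z.resize(6);
    shiftAoS(aos, 1.0f);
    shiftSoA(soa, 1.0f);
}
```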
Performance guidance (cont'd)
Libraries
Lots of related presentations during this CHEP conference (Sorry if I missed some!)
– CBM Experiment at FAIR [164]
Concluding remarks
In most HEP programming domains, event-level processing will and should continue to dominate
But it should be able to profit from ALL the available hardware:
– Accelerators with limited memory, as well as …
"Intel platform 2015" (and beyond)
Today's silicon processes:
– 32 nm (we are here)
– 22 nm (2011/12)
[Timeline of process generations set against the LHC data-taking period]
S. Borkar et al. (Intel), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.
– Source: Bill Camp/Intel HPC
HEP and vectors
For exploiting vectorization, dedicated vector classes can help, e.g.:
http://wiki.physik.uni-heidelberg.de/…/Vc/ (Vc vector classes)