HPC Architectures evolution: the case of Marconi, the new CINECA flagship system
Piero Lanucara


SLIDE 1

HPC Architectures evolution: the case of Marconi, the new CINECA flagship system Piero Lanucara

SLIDE 2

BG/Q (Fermi) as a Tier-0 Resource

• Many advantages as a supercomputing resource:
  – Low energy consumption
  – Limited floor space requirements
  – Fast internal network
  – Homogeneous architecture -> simple usage model
• But:
  – Low single-core performance and the I/O structure meant that very high parallelism was necessary (at least 1024 cores).
  – For some applications, low memory per core (1 GB) and I/O performance were also a problem, as were the limited capabilities of the OS on compute nodes (e.g. no interactive access).
  – Cross-compilation, because the login nodes differ from the compute nodes, can complicate some build procedures.

FERMI scheduled to be decommissioned mid/end 2016

SLIDE 3

Replacing Fermi at Cineca - considerations

• A new procurement is a complicated process that considers many factors, but (together with the price) these must include:
  – minimum peak compute power
  – power consumption
  – floor space required
  – availability
  – disk space, internal network, etc.
• IBM no longer offers the BlueGene range for supercomputers, so it cannot be a solution.
• Many computer centres are instead adopting a heterogeneous model for computer clusters.

SLIDE 4

Replacing Fermi - the Marconi solution

• Fermi: 2 PFlops, 0.8 MWatt (until 2016)
• Galileo & PICO: 1.2 PFlops, 0.4 MWatt
• Marconi A1 (2016): 2.1 PFlops, 0.7 MWatt
• Marconi A2 (2016): 11 PFlops, 1.3 MWatt
• Marconi A3 (2017): 7 PFlops, 1.0 MWatt
• System A4 (?) (2019-2020): 50 PFlops, 3.2 MWatt

Data centre totals per year:

Year          2016   2017   2018   2019   2020
Power         1.2    2.4    2.3    2.3    3.2    MWatt
Racks         50     120    120    120    150
Floor space   100    240    240    240    300    m²

SLIDE 5

Marconi High level system Characteristics

Partition                          Installation     CPU          # nodes   # racks   Power
A1 - Broadwell (2.1 PFlops)        April 2016       E5-2697 v4   1512      25        700 KW
A2 - Knights Landing (11 PFlops)   September 2016   KNL          3600      50        1300 KW
A3 - Skylake (7 PFlops expected)   June 2017        E5-2680 v5   >1500     >25       1000 KW

Tender proposal network: Intel Omni-Path

SLIDE 6

A1: BDW (Broadwell), 2 x 18 cores @ 2.3 GHz; 1512 nodes, 2 PFlops
A2: KNL (Knights Landing), 68 cores @ 1.4 GHz; 3600 nodes, 11 PFlops
A3: SKL (Skylake), 2 x 20 cores @ 2.3 GHz; >1500 nodes, 7 PFlops

Phases: 1 PFlops conventional, 1 PFlops non-conventional (KNL), 5 PFlops conventional

SLIDE 7

Storage:
• GSS*: 5 PB scratch area (50 GB/s)
• by Jan 2017: 10 PB long-term storage (40 GB/s)
• 20 PB tape library

SLIDE 8

Marconi - Compute

A1 (half reserved to EUROfusion): 1512 Lenovo NeXtScale servers -> 2 PFlops
    Intel E5-2697 v4 Broadwell, 18 cores @ 2.3 GHz; dual-socket node: 36 cores and 128 GByte/node
A2 (1 PFlops to EUROfusion): 3600 Intel Adams Pass servers -> 11 PFlops
    Intel Phi, code name Knights Landing (KNL), 68 cores @ 1.4 GHz; single-socket node: 96 GByte DDR4 + 16 GByte MCDRAM
A3 (large part reserved to EUROfusion): >1500 Lenovo Stark servers -> 7 PFlops
    Intel E5-2680 v5 Skylake, 20 cores @ 2.3 GHz; dual-socket node: 40 cores and 196 GByte/node

SLIDE 11

Marconi - Network

Network type: new Intel Omni-Path; the largest Omni-Path cluster in the world
Network topology: fat tree, 2:1 oversubscription, tapering at the level of the core switches only
Core switches: 5 x OPA core switches "Sawtooth Forest", 768 ports each
Edge switches: 216 x OPA edge switches "Eldorado Forest", 48 ports each
Maximum system configuration: 5 (OPA) x 768 (ports) x 2 (tapering) -> 7680 servers

SLIDE 12

• 6624 compute nodes
• 216 x 48-port edge switches
• 5 x 768-port core switches
• 3 x 32 downlinks; 32-node fully interconnected islands

SLIDE 13

A1 HPL

Full-system Linpack:

• 1 MPI task per node
• perf range: 1.6 - 1.7 PFlops
• max perf: 1.72389 PFlops with Turbo OFF
• Turbo ON -> throttling

June 2016 Top500 list: #46

SLIDE 14

A2 HPL

Full-system Linpack: 3556 nodes

• 1 MPI task per node
• max perf with HyperThreading OFF

November 2016 Top500 list: #12

[0] ================================================================================
[0] T/V                N      NB     P      Q          Time           Gflops
[0] --------------------------------------------------------------------------------
[0] WR00L2L4     6287568     336    28    127      26628.96      6.22304e+06
[0] HPL_pdgesv() start time Fri Nov  4 23:10:08 2016
[0] HPL_pdgesv() end time   Sat Nov  5 06:33:57 2016
[0] HPL Efficiency by CPU Cycle  2505.196%
[0] HPL Efficiency by BUS Cycle  2439.395%
[0] --------------------------------------------------------------------------------
[0] ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0006293 ...... PASSED
[0] ================================================================================
[0] Finished 1 tests with the following results:
[0]   1 tests completed and passed residual checks,
[0]   0 tests completed and failed residual checks,
[0]   0 tests skipped because of illegal input values.
[0] --------------------------------------------------------------------------------
[0] End of Tests.
[0] ================================================================================

SLIDE 15

Intel Xeon Phi KNC (Galileo@CINECA)

• An accelerator (like GPUs) but more similar to a conventional multicore CPU.
• The current version, Knights Corner (KNC), has 57-61 cores at 1.0-1.2 GHz, 8-16 GB RAM, and a 512-bit vector unit.
• Cores are connected in a ring topology, and MPI is possible.
• No need to write CUDA or OpenCL, as the Intel compilers will compile Fortran or C code for the MIC.
• 1-2 TFlops, according to the model.

SLIDE 16

• A2: Knights Landing (KNL)
  – A big unknown, because very few people currently have access to KNL.
  – But we know the architecture of KNL and its differences and similarities with respect to KNC.
  – The main differences:
    • KNL will be a standalone processor, not an accelerator (unlike KNC).
    • KNL has more powerful cores and a faster internal network.
    • On-package high-performance memory (16 GB MCDRAM).

SLIDE 17

Intel Xeon Phi KNC-KNL comparison

                       KNC (Galileo)            KNL (Marconi)
# cores                61 (Pentium)             68 (Atom)
Core frequency         1.238 GHz                1.4 GHz
Memory                 16 GB GDDR5              96 GB DDR4 + 16 GB MCDRAM
Internal network       bi-directional ring      mesh
Vectorisation          512 bit / core           2 x AVX-512 / core
Usage                  co-processor             standalone
Performance (GFlops)   1208 (DP) / 2416 (SP)    ~3000 (DP)
Power                  ~300 W                   ~200 W

A KNC core can be 10x slower than a Haswell core; a KNL core is expected to be 2-3x slower. Big differences also in memory bandwidth.

SLIDE 18

SLIDE 19

SLIDE 20

Coming next: A3

• A3: Intel Skylake processors (mid-2017)
  – Successor to Broadwell, launched in 2015.
  – Expect increases in performance and power efficiency.

SLIDE 21

Coming next: A3

SLIDE 22

Marconi A1 and A2 exploitation at its best


SLIDE 23

Exploiting the parallel universe

Three levels of parallelism supported by Intel hardware:

Thread Level Parallelism
• Multi-thread/task performance
• Exposed by programming models
• Execute tens/hundreds/thousands of tasks concurrently

Vector Level Parallelism
• Single-thread performance
• Exposed by tools and programming models
• Operate on 4/8/16 elements at a time

Instruction Level Parallelism
• Single-thread performance
• Automatically exposed by HW/tools
• Effectively limited to a few instructions

SLIDE 24

A1 exploitation

• A1: Broadwell nodes
  – Similar to the Haswell cores present on Galileo.
  – Expect only a small difference in single-core performance wrt Galileo, but a big difference compared to Fermi.
  – More cores per node (36) should mean better OpenMP performance, but MPI performance will also improve (faster network).
  – Life much easier for SPMD programming models.

cores/node: 36; memory/node: 128 GB

+ Use SIMD vectorization

SLIDE 25

Single Instruction Multiple Data (SIMD) vectorization

• Technique for exploiting VLP on a single thread
• Operate on more than one element at a time
• Might decrease instruction counts significantly
• Elements are stored in SIMD registers (vectors)
• Code needs to be vectorized
• Vectorization usually on inner loops
• Main and remainder loops are generated

Scalar loop:

    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

SIMD loop (4 elements at a time, array notation):

    for (int i = 0; i < N; i += 4)
        c[i:4] = a[i:4] + b[i:4];
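
For concreteness, a minimal sketch (not from the original deck) of the same loop with an explicit OpenMP SIMD hint, one portable way to request vectorization; the function and array names are illustrative, and it assumes a compiler with OpenMP SIMD support (e.g. icc -qopenmp-simd or gcc -fopenmp-simd):

    #include <stddef.h>

    /* Scalar version: the compiler may or may not auto-vectorize. */
    void add_scalar(size_t n, const float *a, const float *b, float *c)
    {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Explicit SIMD request: the compiler emits vector instructions
     * (e.g. AVX2 on Broadwell, AVX-512 on KNL) plus a remainder loop;
     * restrict tells it the arrays do not overlap. */
    void add_simd(size_t n, const float *restrict a,
                  const float *restrict b, float *restrict c)
    {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }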

SLIDE 26

A2 exploitation

• A2: KNL
  – (More) similar to the KNC coprocessor present on Galileo, but with remarkable differences:
    • high-bandwidth MCDRAM available
    • AVX-512 ISA
    • binary compatible with 'standard' Xeon
  – Mixing MPI and OpenMP is still the best choice (less dependent on OpenMP performance); see the sketch below.

cores/node: 68; memory/node: 96 GByte DDR4 + 16 GByte MCDRAM

++ Use SIMD vectorization
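
As an illustration (not from the original deck), a minimal hybrid MPI+OpenMP skeleton of the kind that maps well onto KNL nodes: a few MPI ranks per node, many OpenMP threads per rank. It assumes an MPI library and an OpenMP-capable compiler (e.g. mpicc -qopenmp):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* FUNNELED: only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Work is split across ranks; each rank fans out into threads. */
        #pragma omp parallel
        {
            #pragma omp master
            printf("rank %d of %d: %d OpenMP threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }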

SLIDE 27

A2: MCDRAM

• Memory bandwidth is one of the most common bottlenecks for performance in HPC.
• To meet this demand for memory bandwidth, KNL has an on-package high-bandwidth memory (HBM) based on multi-channel dynamic random access memory (MCDRAM).
• This memory can deliver up to 5x the performance (≥ 400 GB/s) of the DDR4 memory on the same platform (≥ 90 GB/s); a bandwidth-bound kernel like the triad below makes the difference visible.
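
As a hedged illustration (not from the deck), a minimal STREAM-triad-style kernel that can be used to compare DDR4 and MCDRAM bandwidth: run the same binary bound to the DDR node and then to the MCDRAM node (e.g. with numactl, covered a few slides ahead) and compare the reported numbers. The array size is an arbitrary choice, just large enough to defeat the caches:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1L << 26)   /* 64M doubles = 512 MiB per array */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;

        /* First-touch init in parallel so pages land near the threads. */
        #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            c[i] = a[i] + 3.0 * b[i];        /* triad */
        double t1 = omp_get_wtime();

        /* Traffic: read a and b, write c -> 3 arrays of N doubles. */
        double gb = 3.0 * N * sizeof(double) / 1e9;
        printf("triad: %.1f GB/s\n", gb / (t1 - t0));
        free(a); free(b); free(c);
        return 0;
    }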

SLIDE 28

A2: MCDRAM

• HBM on KNL can be used as:
  1. a last-level cache (LLC), or
  2. addressable memory.
• The configuration is determined at boot time, by choosing one of three MCDRAM modes in the BIOS settings:
  1. Flat mode
  2. Cache mode
  3. Hybrid mode

SLIDE 29

A2: MCDRAM

  • The best mode to use will depend on the application.

SLIDE 30

A2: Using HBM as addressable memory

Two methods for this:

• the numactl tool
  – works best if the whole app can fit in MCDRAM
• the memkind library
  – using library calls or compiler directives
  – needs source modification

SLIDE 31

A2: Using numactl to access MCDRAM

• Run "numactl --hardware" to see the NUMA configuration of your system.
• Look for the memory-only node (MCDRAM appears as a NUMA node with memory but no CPUs).

If the total memory footprint of your app is smaller than the size of MCDRAM:
  • check it with "ps -C myapp u" (see the RSS value)
  • use numactl to allocate all of its memory from MCDRAM:
        numactl --membind=mcdram_id myapp
    where mcdram_id is the ID of the MCDRAM "node"

If the total memory footprint of your app is larger than the size of MCDRAM, you can still use numactl to place part of your app in MCDRAM:
  • numactl --preferred=mcdram_id myapp
    (allocations that don't fit into MCDRAM spill over to DDR)
  • numactl --interleave=nodes myapp
    (allocations are interleaved across all nodes)

An allocation can also be pinned to the MCDRAM node from inside the code; see the sketch below.
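
As a programmatic alternative to the numactl command (an illustration, not from the deck), libnuma can bind individual allocations to the MCDRAM node from inside the application. The node id below is an assumption for a flat-mode KNL where node 0 is DDR and node 1 is MCDRAM; check numactl --hardware and link with -lnuma:

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma not available\n");
            return 1;
        }

        /* Assumed id of the CPU-less MCDRAM node; adjust to what
         * "numactl --hardware" reports on your system. */
        int mcdram_node = 1;

        size_t n = 1000;
        double *v = numa_alloc_onnode(n * sizeof(double), mcdram_node);
        if (!v) return 1;

        for (size_t i = 0; i < n; i++) v[i] = (double)i;
        printf("v[10] = %g\n", v[10]);

        numa_free(v, n * sizeof(double));
        return 0;
    }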

SLIDE 32

A2: Using Memkind to access MCDRAM

• The memkind library is a user-extensible heap manager built on top of jemalloc, a C library for general-purpose memory allocation.
• The library generalizes to any NUMA architecture, but for Knights Landing processors it is used primarily for manual allocation to HBM, using special allocators for C/C++.
• It has limited support for Fortran.

SLIDE 33

A2: Using Memkind - C case

• Allocate 1000 floats from DDR:

    float *fv;
    fv = (float *)malloc(sizeof(float) * 1000);

• Allocate 1000 floats from MCDRAM:

    float *fv;
    fv = (float *)hbw_malloc(sizeof(float) * 1000);
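
Putting the two together, a complete sketch (illustrative, not from the deck) using memkind's hbwmalloc interface, falling back to DDR when MCDRAM is not available; compile with something like icc prog.c -lmemkind:

    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>   /* memkind's high-bandwidth memory API */

    int main(void)
    {
        size_t n = 1000;
        float *fv;

        /* hbw_check_available() returns 0 when MCDRAM is usable. */
        int have_hbm = (hbw_check_available() == 0);

        fv = have_hbm ? hbw_malloc(n * sizeof(float))   /* MCDRAM */
                      : malloc(n * sizeof(float));      /* DDR fallback */
        if (!fv) return 1;

        for (size_t i = 0; i < n; i++) fv[i] = 2.0f * i;
        printf("fv[10] = %f\n", fv[10]);

        /* Free with the allocator that produced the pointer. */
        if (have_hbm) hbw_free(fv); else free(fv);
        return 0;
    }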

SLIDE 34

A2: Using Memkind - Fortran case

C Declare arrays to be dynamic
      REAL, ALLOCATABLE :: A(:), B(:), C(:)
!DEC$ ATTRIBUTES FASTMEM :: A
      NSIZE = 1024
c
c allocate array 'A' from MCDRAM
c
      ALLOCATE (A(1:NSIZE))
c
c Allocate arrays that will come from DDR
c
      ALLOCATE (B(NSIZE), C(NSIZE))
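
Note that FASTMEM is an Intel-specific directive; to my knowledge the resulting allocation is routed through the memkind library, so the program has to be linked against it (e.g. ifort prog.f -lmemkind). Treat the exact flags as an assumption to check against the compiler documentation.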

SLIDE 35

SLIDE 36

SLIDE 37

Porting Applications onto Marconi A1 and A2

• Application developers must use vectorisation (SIMD) and/or MPI processes (OpenMP threads), and possibly the fast memory of KNL (MCDRAM).
• From the user perspective, success will depend on how well software tools and applications are able to exploit the KNL architecture. Key idea: use a proxy (KNC, …):
• SPECFEM3D_GLOBE: already reasonable results with KNC in the framework of the IPCC@CINECA activity. A good amount of vectorisation (FORCE_VECTORIZATION preprocessing and SIMD optimization) suitable for KNC and the future KNL. Scales to a high number of OpenMP threads (more than 60 on KNC). Worth noting that, up to now, KNC has not been widely supported by geophysical-application developers and users; a remarkable exception is the SPECFEM3D_GLOBE CIG repository, where the "native" version is maintained and tested. Again, this should be fine for its KNL startup.

SLIDE 38

Good practice: vectorization (SPECFEM3D_GLOBE KNC investigation)

SPECFEM3D_GLOBE Regional_MiddleEast test case: forward simulation

Computer system     e.t. (sec.)   Speedup wrt Haswell
Haswell (Galileo)   570.20        1.00
KNC (Galileo)       430.35        1.32

SPECFEM3D_GLOBE Regional_MiddleEast test case: the impact of (disabling) vectorisation

Computer system     e.t. (sec.)   Slowdown factor wrt vectorised
Haswell (Galileo)   687.14        1.20
KNC (Galileo)       848.12        1.97   <- ~2x slowdown

Based on a 4-node Galileo partition (16 MPI processes; 4 and 60 OpenMP threads on Haswell and KNC respectively).

SLIDE 39

Good practice: choosing MCDRAM memory modes

• MCDRAM cache mode should be a good choice for most applications, but...
• ...applications with 'cache-unfriendly' data are candidates for the other memory modes. In that case, try to identify what to put in MCDRAM:
  – timings with selected data items allocated fast/slow (intrusive)
  – memory profiling (VTune Amplifier tool)

SLIDE 40

Conclusions

• Marconi A1: moderate improvements over recent years, but a big improvement compared to Fermi.
• High expectations for the performance of Marconi A2 (KNL).
• KNC paves the way for increasing performance...
• ...try to manage domain parallelism, increase threading, exploit data parallelism (vectorisation), and improve data locality (a new opportunity: use the on-package memory).