HPC Architectures evolution: the case of Marconi, the new CINECA flagship system
Piero Lanucara


SLIDE 1

HPC Architectures evolution: the case of Marconi, the new CINECA flagship system Piero Lanucara

SLIDE 2

BG/Q (Fermi) as a Tier-0 Resource

• Many advantages as a supercomputing resource:
  – Low energy consumption
  – Limited floor space requirements
  – Fast internal network
  – Homogeneous architecture -> simple usage model
• But:
  – Low single-core performance and the I/O structure meant that very high parallelism was necessary (at least 1024 cores).
  – For some applications, low memory per core (1 GB) and I/O performance were also a problem, as were the limited capabilities of the OS on compute nodes (e.g. no interactive access).
  – Cross-compilation, because the login nodes differ from the compute nodes, can complicate some build procedures.

FERMI scheduled to be decommissioned mid/end 2016

SLIDE 3

Replacing Fermi at Cineca - considerations

• A new procurement is a complicated process that considers many factors, but (together with the price) these must include:
  – minimum peak compute power
  – power consumption
  – floor space required
  – availability
  – disk space, internal network, etc.
• IBM no longer offers the BlueGene range for supercomputers, so it cannot be a solution.
• Many computer centres are instead adopting a heterogeneous model for computer clusters.

SLIDE 4

Replacing Fermi - the Marconi solution

• Fermi: 2 PFlops, 0.8 MWatt (until 2016)
• Galileo & PICO: 1.2 PFlops, 0.4 MWatt
• Marconi A1 (2016): 2.1 PFlops, 0.7 MWatt
• Marconi A2 (2016): 11 PFlops, 1.3 MWatt
• Marconi A3 (2017): 7 PFlops, 1.0 MWatt
• System A4 (?) (2019-2020): 50 PFlops, 3.2 MWatt

Data centre totals per year:

Year          2016   2017   2018   2019   2020
Power         1.2    2.4    2.3    2.3    3.2    MWatt
Racks         50     120    120    120    150
Floor space   100    240    240    240    300    m²

SLIDE 5

Marconi High level system Characteristics

Partition                          Installation     CPU          # nodes   # racks   Power
A1 - Broadwell (2.1 PFlops)        April 2016       E5-2697 v4   1512      25        700 KW
A2 - Knights Landing (11 PFlops)   September 2016   KNL          3600      50        1300 KW
A3 - Skylake (7 PFlops expected)   June 2017        E5-2680 v5   >1500     >25       1000 KW

Tender proposal network: Intel Omni-Path

SLIDE 6

A1: BDW (Broadwell), 2 x 18 cores @ 2.3 GHz; 1512 nodes, 2 PFlops
A2: KNL (Knights Landing), 68 cores @ 1.4 GHz; 3600 nodes, 11 PFlops
A3: SKL (Skylake), 2 x 20 cores @ 2.3 GHz; >1500 nodes, 7 PFlops

Phases: 1 PFlops conventional, 1 PFlops non-conventional (KNL), 5 PFlops conventional

SLIDE 7

Storage:
• GSS*: 5 PB scratch area (50 GB/s)
• by Jan 2017: 10 PB long-term storage (40 GB/s)
• 20 PB tape library

SLIDE 8

Marconi - Compute

A1 (half reserved to EUROfusion): 1512 Lenovo NeXtScale servers -> 2 PFlops
    Intel E5-2697 v4 Broadwell, 18 cores @ 2.3 GHz; dual-socket node: 36 cores and 128 GByte/node
A2 (1 PFlops to EUROfusion): 3600 Intel Adams Pass servers -> 11 PFlops
    Intel Phi, code name Knights Landing (KNL), 68 cores @ 1.4 GHz; single-socket node: 96 GByte DDR4 + 16 GByte MCDRAM
A3 (large part reserved to EUROfusion): >1500 Lenovo Stark servers -> 7 PFlops
    Intel E5-2680 v5 Skylake, 20 cores @ 2.3 GHz; dual-socket node: 40 cores and 196 GByte/node

SLIDE 11

Marconi - Network

Network type: new Intel Omni-Path; the largest Omni-Path cluster in the world
Network topology: fat tree, 2:1 oversubscription, tapering at the level of the core switches only
Core switches: 5 x OPA core switches "Sawtooth Forest", 768 ports each
Edge switches: 216 x OPA edge switches "Eldorado Forest", 48 ports each
Maximum system configuration: 5 (OPA) x 768 (ports) x 2 (tapering) -> 7680 servers

SLIDE 12

• 6624 compute nodes
• 216 x 48-port edge switches
• 5 x 768-port core switches
• 3 x 32 downlinks; 32-node fully interconnected islands

SLIDE 13

A1 HPL

Full-system Linpack:

• 1 MPI task per node
• perf range: 1.6 - 1.7 PFlops
• max perf: 1.72389 PFlops with Turbo OFF
• Turbo ON -> throttling

June 2016 Top500 list: #46

SLIDE 14

A2 HPL

Full-system Linpack: 3556 nodes

• 1 MPI task per node
• max perf with HyperThreading OFF

November 2016 Top500 list: #12

[0] ================================================================================
[0] T/V                N      NB     P      Q          Time           Gflops
[0] --------------------------------------------------------------------------------
[0] WR00L2L4     6287568     336    28    127      26628.96      6.22304e+06
[0] HPL_pdgesv() start time Fri Nov  4 23:10:08 2016
[0] HPL_pdgesv() end time   Sat Nov  5 06:33:57 2016
[0] HPL Efficiency by CPU Cycle  2505.196%
[0] HPL Efficiency by BUS Cycle  2439.395%
[0] --------------------------------------------------------------------------------
[0] ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0006293 ...... PASSED
[0] ================================================================================
[0] Finished 1 tests with the following results:
[0]   1 tests completed and passed residual checks,
[0]   0 tests completed and failed residual checks,
[0]   0 tests skipped because of illegal input values.
[0] --------------------------------------------------------------------------------
[0] End of Tests.
[0] ================================================================================

SLIDE 15

Intel Xeon Phi KNC (Galileo@CINECA)

• An accelerator (like GPUs) but more similar to a conventional multicore CPU.
• The current version, Knights Corner (KNC), has 57-61 cores at 1.0-1.2 GHz, 8-16 GB RAM, and a 512-bit vector unit.
• Cores are connected in a ring topology, and MPI is possible.
• No need to write CUDA or OpenCL, as the Intel compilers will compile Fortran or C code for the MIC.
• 1-2 TFlops, according to the model.

SLIDE 16

• A2: Knights Landing (KNL)
  – A big unknown, because very few people currently have access to KNL.
  – But we know the architecture of KNL and its differences and similarities with respect to KNC.
  – The main differences:
    • KNL will be a standalone processor, not an accelerator (unlike KNC).
    • KNL has more powerful cores and a faster internal network.
    • On-package high-performance memory (16 GB MCDRAM).

SLIDE 17

Intel Xeon Phi KNC-KNL comparison

                       KNC (Galileo)            KNL (Marconi)
# cores                61 (Pentium)             68 (Atom)
Core frequency         1.238 GHz                1.4 GHz
Memory                 16 GB GDDR5              96 GB DDR4 + 16 GB MCDRAM
Internal network       bi-directional ring      mesh
Vectorisation          512 bit / core           2 x AVX-512 / core
Usage                  co-processor             standalone
Performance (GFlops)   1208 (DP) / 2416 (SP)    ~3000 (DP)
Power                  ~300 W                   ~200 W

A KNC core can be 10x slower than a Haswell core; a KNL core is expected to be 2-3x slower. Big differences also in memory bandwidth.

SLIDE 18

SLIDE 19

SLIDE 20

Coming next: A3

• A3: Intel Skylake processors (mid-2017)
  – Successor to Broadwell, launched in 2015.
  – Expect increases in performance and power efficiency.

SLIDE 21

Coming next: A3

SLIDE 22

Marconi A1 and A2 exploitation at its best


SLIDE 23

Exploiting the parallel universe

Three levels of parallelism supported by Intel hardware:

Thread Level Parallelism
• Multi-thread/task performance
• Exposed by programming models
• Execute tens/hundreds/thousands of tasks concurrently

Vector Level Parallelism
• Single-thread performance
• Exposed by tools and programming models
• Operate on 4/8/16 elements at a time

Instruction Level Parallelism
• Single-thread performance
• Automatically exposed by HW/tools
• Effectively limited to a few instructions

SLIDE 24

A1 exploitation

• A1: Broadwell nodes
  – Similar to the Haswell cores present on Galileo.
  – Expect only a small difference in single-core performance wrt Galileo, but a big difference compared to Fermi.
  – More cores per node (36) should mean better OpenMP performance, but MPI performance will also improve (faster network).
  – Life much easier for SPMD programming models.

cores/node: 36; memory/node: 128 GB

+ Use SIMD vectorization

SLIDE 25

Single Instruction Multiple Data (SIMD) vectorization

• Technique for exploiting VLP on a single thread
• Operate on more than one element at a time
• Might decrease instruction counts significantly
• Elements are stored in SIMD registers (vectors)
• Code needs to be vectorized
• Vectorization usually on inner loops
• Main and remainder loops are generated

Scalar loop:

    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

SIMD loop (4 elements at a time, array notation):

    for (int i = 0; i < N; i += 4)
        c[i:4] = a[i:4] + b[i:4];
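
For concreteness, a minimal sketch (not from the original deck) of the same loop with an explicit OpenMP SIMD hint, one portable way to request vectorization; the function and array names are illustrative, and it assumes a compiler with OpenMP SIMD support (e.g. icc -qopenmp-simd or gcc -fopenmp-simd):

    #include <stddef.h>

    /* Scalar version: the compiler may or may not auto-vectorize. */
    void add_scalar(size_t n, const float *a, const float *b, float *c)
    {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Explicit SIMD request: the compiler emits vector instructions
     * (e.g. AVX2 on Broadwell, AVX-512 on KNL) plus a remainder loop;
     * restrict tells it the arrays do not overlap. */
    void add_simd(size_t n, const float *restrict a,
                  const float *restrict b, float *restrict c)
    {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }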

SLIDE 26

A2 exploitation

• A2: KNL
  – (More) similar to the KNC coprocessor present on Galileo, but with remarkable differences:
    • high-bandwidth MCDRAM available
    • AVX-512 ISA
    • binary compatible with 'standard' Xeon
  – Mixing MPI and OpenMP is still the best choice (less dependent on OpenMP performance); see the sketch below.

cores/node: 68; memory/node: 96 GByte DDR4 + 16 GByte MCDRAM

++ Use SIMD vectorization
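
As an illustration (not from the original deck), a minimal hybrid MPI+OpenMP skeleton of the kind that maps well onto KNL nodes: a few MPI ranks per node, many OpenMP threads per rank. It assumes an MPI library and an OpenMP-capable compiler (e.g. mpicc -qopenmp):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* FUNNELED: only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Work is split across ranks; each rank fans out into threads. */
        #pragma omp parallel
        {
            #pragma omp master
            printf("rank %d of %d: %d OpenMP threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }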

SLIDE 27

A2: MCDRAM

• Memory bandwidth is one of the most common bottlenecks for performance in HPC.
• To meet this demand for memory bandwidth, KNL has an on-package high-bandwidth memory (HBM) based on multi-channel dynamic random access memory (MCDRAM).
• This memory can deliver up to 5x the performance (≥ 400 GB/s) of the DDR4 memory on the same platform (≥ 90 GB/s); a bandwidth-bound kernel like the triad below makes the difference visible.
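
As a hedged illustration (not from the deck), a minimal STREAM-triad-style kernel that can be used to compare DDR4 and MCDRAM bandwidth: run the same binary bound to the DDR node and then to the MCDRAM node (e.g. with numactl, covered a few slides ahead) and compare the reported numbers. The array size is an arbitrary choice, just large enough to defeat the caches:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1L << 26)   /* 64M doubles = 512 MiB per array */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;

        /* First-touch init in parallel so pages land near the threads. */
        #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            c[i] = a[i] + 3.0 * b[i];        /* triad */
        double t1 = omp_get_wtime();

        /* Traffic: read a and b, write c -> 3 arrays of N doubles. */
        double gb = 3.0 * N * sizeof(double) / 1e9;
        printf("triad: %.1f GB/s\n", gb / (t1 - t0));
        free(a); free(b); free(c);
        return 0;
    }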

SLIDE 28

A2: MCDRAM

• HBM on KNL can be used as:
  1. a last-level cache (LLC), or
  2. addressable memory.
• The configuration is determined at boot time, by choosing one of three MCDRAM modes in the BIOS settings:
  1. Flat mode
  2. Cache mode
  3. Hybrid mode

SLIDE 29

A2: MCDRAM

  • The best mode to use will depend on the application.

SLIDE 30

A2: Using HBM as addressable memory

Two methods for this:

• the numactl tool
  – works best if the whole app can fit in MCDRAM
• the memkind library
  – using library calls or compiler directives
  – needs source modification

SLIDE 31

A2: Using numactl to access MCDRAM

• Run "numactl --hardware" to see the NUMA configuration of your system.
• Look for the memory-only node (MCDRAM appears as a NUMA node with memory but no CPUs).

If the total memory footprint of your app is smaller than the size of MCDRAM:
  • check it with "ps -C myapp u" (see the RSS value)
  • use numactl to allocate all of its memory from MCDRAM:
        numactl --membind=mcdram_id myapp
    where mcdram_id is the ID of the MCDRAM "node"

If the total memory footprint of your app is larger than the size of MCDRAM, you can still use numactl to place part of your app in MCDRAM:
  • numactl --preferred=mcdram_id myapp
    (allocations that don't fit into MCDRAM spill over to DDR)
  • numactl --interleave=nodes myapp
    (allocations are interleaved across all nodes)

An allocation can also be pinned to the MCDRAM node from inside the code; see the sketch below.
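
As a programmatic alternative to the numactl command (an illustration, not from the deck), libnuma can bind individual allocations to the MCDRAM node from inside the application. The node id below is an assumption for a flat-mode KNL where node 0 is DDR and node 1 is MCDRAM; check numactl --hardware and link with -lnuma:

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma not available\n");
            return 1;
        }

        /* Assumed id of the CPU-less MCDRAM node; adjust to what
         * "numactl --hardware" reports on your system. */
        int mcdram_node = 1;

        size_t n = 1000;
        double *v = numa_alloc_onnode(n * sizeof(double), mcdram_node);
        if (!v) return 1;

        for (size_t i = 0; i < n; i++) v[i] = (double)i;
        printf("v[10] = %g\n", v[10]);

        numa_free(v, n * sizeof(double));
        return 0;
    }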

SLIDE 32

A2: Using Memkind to access MCDRAM

• The memkind library is a user-extensible heap manager built on top of jemalloc, a C library for general-purpose memory allocation.
• The library generalizes to any NUMA architecture, but for Knights Landing processors it is used primarily for manual allocation to HBM, using special allocators for C/C++.
• It has limited support for Fortran.

SLIDE 33

A2: Using Memkind - C case

• Allocate 1000 floats from DDR:

    float *fv;
    fv = (float *)malloc(sizeof(float) * 1000);

• Allocate 1000 floats from MCDRAM:

    float *fv;
    fv = (float *)hbw_malloc(sizeof(float) * 1000);
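
Putting the two together, a complete sketch (illustrative, not from the deck) using memkind's hbwmalloc interface, falling back to DDR when MCDRAM is not available; compile with something like icc prog.c -lmemkind:

    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>   /* memkind's high-bandwidth memory API */

    int main(void)
    {
        size_t n = 1000;
        float *fv;

        /* hbw_check_available() returns 0 when MCDRAM is usable. */
        int have_hbm = (hbw_check_available() == 0);

        fv = have_hbm ? hbw_malloc(n * sizeof(float))   /* MCDRAM */
                      : malloc(n * sizeof(float));      /* DDR fallback */
        if (!fv) return 1;

        for (size_t i = 0; i < n; i++) fv[i] = 2.0f * i;
        printf("fv[10] = %f\n", fv[10]);

        /* Free with the allocator that produced the pointer. */
        if (have_hbm) hbw_free(fv); else free(fv);
        return 0;
    }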

SLIDE 34

A2: Using Memkind - Fortran case

C Declare arrays to be dynamic
      REAL, ALLOCATABLE :: A(:), B(:), C(:)
!DEC$ ATTRIBUTES FASTMEM :: A
      NSIZE = 1024
c
c allocate array 'A' from MCDRAM
c
      ALLOCATE (A(1:NSIZE))
c
c Allocate arrays that will come from DDR
c
      ALLOCATE (B(NSIZE), C(NSIZE))
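
Note that FASTMEM is an Intel-specific directive; to my knowledge the resulting allocation is routed through the memkind library, so the program has to be linked against it (e.g. ifort prog.f -lmemkind). Treat the exact flags as an assumption to check against the compiler documentation.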

SLIDE 35

SLIDE 36

SLIDE 37

Porting Applications onto Marconi A1 and A2

• Application developers must use vectorisation (SIMD) and/or MPI processes (OpenMP threads), and possibly the fast memory of KNL (MCDRAM).
• From the user perspective, success will depend on how well software tools and applications are able to exploit the KNL architecture. Key idea: use a proxy (KNC, …):
• SPECFEM3D_GLOBE: already reasonable results with KNC in the framework of the IPCC@CINECA activity. A good amount of vectorisation (FORCE_VECTORIZATION preprocessing and SIMD optimization) suitable for KNC and the future KNL. Scales to a high number of OpenMP threads (more than 60 on KNC). Worth noting that, up to now, KNC has not been widely supported by geophysical-application developers and users; a remarkable exception is the SPECFEM3D_GLOBE CIG repository, where the "native" version is maintained and tested. Again, this should be fine for its KNL startup.

SLIDE 38

Good practice: vectorization (SPECFEM3D_GLOBE KNC investigation)

SPECFEM3D_GLOBE Regional_MiddleEast test case: forward simulation

Computer system     e.t. (sec.)   Speedup wrt Haswell
Haswell (Galileo)   570.20        1.00
KNC (Galileo)       430.35        1.32

SPECFEM3D_GLOBE Regional_MiddleEast test case: the impact of (disabling) vectorisation

Computer system     e.t. (sec.)   Slowdown factor wrt vectorised
Haswell (Galileo)   687.14        1.20
KNC (Galileo)       848.12        1.97   <- ~2x slowdown

Based on a 4-node Galileo partition (16 MPI processes; 4 and 60 OpenMP threads on Haswell and KNC respectively).

SLIDE 39

Good practice: choosing MCDRAM memory modes

• MCDRAM cache mode should be a good choice for most applications, but...
• ...applications with 'cache-unfriendly' data are candidates for the other memory modes. In that case, try to identify what to put in MCDRAM:
  – timings with selected data items allocated fast/slow (intrusive)
  – memory profiling (VTune Amplifier tool)

SLIDE 40

Conclusions

• Marconi A1: moderate improvements over recent years, but a big improvement compared to Fermi.
• High expectations for the performance of Marconi A2 (KNL).
• KNC paves the way for increasing performance...
• ...try to manage domain parallelism, increase threading, exploit data parallelism (vectorisation), and improve data locality (a new opportunity: use the on-package memory).