HPC Architectures evolution: the case of Marconi, the new CINECA flagship system
Piero Lanucara
BG/Q (Fermi) as a Tier0 Resource
- Many advantages as a supercomputing resource:
– Low energy consumption
– Limited floor space requirements
– Fast internal network
– Homogeneous architecture → simple usage model
- But:
– Low single-core performance and the I/O structure meant very high parallelism was necessary (at least 1024 cores).
– For some applications the low memory per core (1 GB) and I/O performance were also a problem, as were the limited capabilities of the O.S. on compute cores (e.g. no interactive access).
– Cross compilation, needed because login nodes differ from compute nodes, can complicate some build procedures.
Fermi is scheduled to be decommissioned between mid and end 2016.
Replacing Fermi at Cineca - considerations
- A new procurement is a complicated process that considers many factors, but it must include (together with the price):
– Minimum peak compute power
– Power consumption
– Floor space required
– Availability
– Disk space, internal network, etc.
- IBM no longer offers the BlueGene range of supercomputers, so it cannot be a solution.
- Many computer centres are instead adopting a heterogeneous model for computer clusters.
[Figure: CINECA systems roadmap, 2016-2020]
- Fermi: 2 PFlops, 0.8 MW
- Galileo & PICO: 1.2 PFlops, 0.4 MW
- Marconi A1: 2.1 PFlops, 0.7 MW
- Marconi A2: 11 PFlops, 1.3 MW
- Marconi A3: 7 PFlops, 1.0 MW
- System A4(?): 50 PFlops, 3.2 MW
- Totals listed in the figure: power 2.4, 2.3, 2.3, 3.2 and 1.2 MW; racks 50, 120, 120, 120 and 150; floor space 100, 240, 240, 240 and 300 m²
Replacing Fermi – the Marconi solution
Marconi High level system Characteristics
| Partition | Installation | CPU | # nodes | # of racks | Power |
|---|---|---|---|---|---|
| A1 – Broadwell (2.1 PFlops) | April 2016 | E5-2697 v4 | 1512 | 25 | 700 kW |
| A2 – Knights Landing (11 PFlops) | September 2016 | KNL | 3600 | 50 | 1300 kW |
| A3 – Skylake (7 PFlops expected) | June 2017 | E5-2680 v5 | >1500 | >25 | 1000 kW |
Tender proposal
Network: Intel OmniPath
- A1: BRD 2x18 cores, 2.3 GHz; 1500 nodes, 2 PFs
- A2: KNL 68 cores, 1.4 GHz; 3600 nodes, 11 PFs
- A3: SKL 2x20 cores, 2.3 GHz; >1500 nodes, 7 PFs
- Figure labels: 1 PFs conventional; 1 PFs non-conventional (KNL); 5 PFs conventional
Storage
- GSS: 5 PB scratch area (50 GB/s)
- By Jan 2017: 10 PB long-term storage (40 GB/s)
- 20 PB tape library
Marconi - Compute
- A1 (half reserved to EUROfusion)
– 1512 Lenovo NeXtScale servers -> 2 PFlops
– Intel E5-2697 v4 Broadwell, 18 cores @ 2.3 GHz
– Dual socket node: 36 cores and 128 GByte per node
- A2 (1 PFlops to EUROfusion)
– 3600 Intel Adams Pass servers -> 11 PFlops
– Intel Xeon Phi, code name Knights Landing (KNL), 68 cores @ 1.4 GHz
– Single socket node: 96 GByte DDR4 + 16 GByte MCDRAM
- A3 (largely reserved to EUROfusion)
– >1500 Lenovo Stark servers -> 7 PFlops
– Intel E5-2680 v5 Skylake, 20 cores @ 2.3 GHz
– Dual socket node: 40 cores and 196 GByte per node
Marconi - Network
- Network type: new Intel Omni-Path (OPA); the largest Omni-Path cluster in the world
- Network topology: fat-tree with 2:1 oversubscription, tapering at the level of the core switches only
- Core switches: 5 x OPA core switch “Sawtooth Forest”, 768 ports each
- Edge switches: 216 x OPA edge switch “Eldorado Forest”, 48 ports each
- Maximum system configuration: 5 (OPA core switches) x 768 (ports) x 2 (tapering) -> 7680 servers
[Figure: network layout. 6624 compute nodes; 216 edge switches with 48 ports each; 5 core switches with 768 ports each; 3 x 32 downlinks; islands of 32 fully interconnected nodes]
A1 HPL
Full system Linpack:
- 1 MPI task per node
- Performance range: 1.6 – 1.7 PFs
- Max performance: 1.72389 PFs with Turbo-OFF
- Turbo-ON -> throttling
June 2016: number 46 in the Top500 list
A2 HPL
Full system Linpack: 3556 nodes
- 1 MPI task per node
- Max Perf with HyperThreading-OFF.
November 2016: number 12 in the Top500 list
[0] ================================================================================
[0] T/V N NB P Q Time Gflops
[0] --------------------------------------------------------------------------------
[0] W[0] R00L2L4 6287568 336 28 127 26628.96 6.22304e+06
[0] HPL_pdgesv() start time Fri Nov 4 23:10:08 2016
[0]
[0] HPL_pdgesv() end time Sat Nov 5 06:33:57 2016
[0]
[0] HPL Efficiency by CPU Cycle 2505.196%
[0] HPL Efficiency by BUS Cycle 2439.395%
[0] --------------------------------------------------------------------------------
[0] ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0006293 ...... PASSED
[0] ================================================================================
[0]
[0] Finished 1 tests with the following results:
[0] 1 tests completed and passed residual checks,
[0] 0 tests completed and failed residual checks,
[0] 0 tests skipped because of illegal input values.
[0] --------------------------------------------------------------------------------
[0]
[0] End of Tests.
[0] ================================================================================
- An accelerator (like GPUs) but more similar to a conventional multicore CPU.
- Current version, Knights Corner (KNC), has 57-61 cores at 1.0-1.2 GHz, 8-16 GB RAM and a 512-bit vector unit.
- Cores are connected in a ring topology, and MPI is possible.
- No need to write CUDA or OpenCL, as Intel compilers will compile Fortran or C code for the MIC.
- 1-2 Tflops, according to model.
Intel Xeon PHI KNC (Galileo@CINECA)
- A2: Knights Landing (KNL)
– A big unknown, because very few people currently have access to KNL.
– But we know the architecture of KNL and its differences and similarities with respect to KNC.
– The main differences are:
- KNL will be a standalone processor, not an accelerator (unlike KNC).
- KNL has more powerful cores and a faster internal network.
- On-package high-performance memory (16 GB MCDRAM).
Intel Xeon PHI KNC-KNL comparison

| | KNC (Galileo) | KNL (Marconi) |
|---|---|---|
| #cores | 61 (Pentium) | 68 (Atom) |
| Core frequency | 1.238 GHz | 1.4 GHz |
| Memory | 16 GB GDDR5 | 96 GB DDR4 + 16 GB MCDRAM |
| Internal network | Bi-directional ring | Mesh |
| Vectorisation | 512 bit/core | 2 x AVX-512/core |
| Usage | Co-processor | Standalone |
| Performance (Gflops) | 1208 (dp) / 2416 (sp) | ~3000 (dp) |
| Power | ~300 W | ~200 W |
A KNC core can be 10x slower than a Haswell core. A KNL core is expected to be 2-3x slower. There are also big differences in memory bandwidth.
Coming next: A3
- A3: Intel Skylake processors (mid-2017)
– Microarchitecture that succeeds Haswell, first launched in 2015.
– Expect an increase in performance and power efficiency.
Marconi A1 and A2 exploitation at its best
Exploiting the parallel universe
Three levels of parallelism supported by Intel hardware:
- Thread Level Parallelism
– Multi thread/task performance
– Exposed by programming models
– Executes tens/hundreds/thousands of tasks concurrently
- Vector Level Parallelism
– Single thread performance
– Exposed by tools and programming models
– Operates on 4/8/16 elements at a time
- Instruction Level Parallelism
– Single thread performance
– Automatically exposed by HW/tools
– Effectively limited to a few instructions
- A1: Broadwell nodes
– Similar to the Haswell cores present on Galileo.
– Expect only a small difference in single-core performance with respect to Galileo, but a big difference compared to Fermi.
– More cores per node (36) should mean better OpenMP performance, and MPI performance will also improve (faster network).
– Life is much easier for SPMD programming models.
– Cores/node: 36; memory/node: 128 GB
A1 exploitation
+ Use SIMD vectorization
Single Instruction Multiple Data (SIMD) vectorization
- Technique for exploiting VLP on a single thread
- Operate on more than one element at a time
- Might decrease instruction counts significantly
- Elements are stored on SIMD registers or vectors
- Code needs to be vectorized
- Vectorization usually on inner loops
- Main and remainder loops are generated
Scalar loop:
for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
SIMD loop (4 elements), in array-section notation:
for (int i = 0; i < N; i += 4)
    c[i:4] = a[i:4] + b[i:4];
- A2: KNL
– (More) similar to the KNC coprocessor present on Galileo, but with remarkable differences:
- High-bandwidth MCDRAM available
- AVX-512 ISA
- Binary compatible with ‘standard’ Xeon
– Mixing MPI and OpenMP is still the best choice (with performance less dependent on OpenMP).
– Cores/node: 68; memory/node: 96 GByte DDR4 + 16 GByte MCDRAM
A2 exploitation
++ Use SIMD vectorization
A2: MCDRAM
- Memory bandwidth is one of the common performance bottlenecks in HPC.
- To meet the demand for memory bandwidth, KNL has an on-package high-bandwidth memory (HBM) based on multi-channel dynamic random access memory (MCDRAM).
- This memory is capable of delivering up to 5x the performance (≥ 400 GB/s) of DDR4 memory on the same platform (≥ 90 GB/s).
A2: MCDRAM
- HBM on KNL can be used as:
– 1. a last-level cache (LLC)
– 2. an addressable memory
- The configuration is determined at boot time, by choosing in the BIOS settings between three MCDRAM modes:
– 1. Flat mode (MCDRAM as addressable memory)
– 2. Cache mode (MCDRAM as last-level cache)
– 3. Hybrid mode (part cache, part addressable memory)
A2: MCDRAM
- The best mode to use will depend on the application.
A2: Using HBM as addressable memory
Two methods for this:
- the numactl tool
– Works best if the whole app can fit in MCDRAM
- the memkind library
– Uses library calls or compiler directives
– Needs source modification
A2: Using numactl to access MCDRAM
- Run "numactl --hardware" to see the NUMA configuration of your system.
- Look for the node with memory only (no cores): that is the MCDRAM node.
If the total memory footprint of your app is smaller than the size of MCDRAM:
- Check the footprint: ps -C myapp u
– see the RSS value
- Use numactl to allocate all of its memory from MCDRAM:
– numactl --membind=mcdram_id myapp
– where mcdram_id is the ID of the MCDRAM "node"
If the total memory footprint of your app is larger than the size of MCDRAM, you can still use numactl to allocate part of your app in MCDRAM:
- numactl --preferred=mcdram_id myapp
– Allocations that don't fit into MCDRAM spill over to DDR
- numactl --interleave=nodes myapp
– Allocations are interleaved across all nodes
A2: Using Memkind to access MCDRAM
- The memkind library is a user-extensible heap manager built on top of jemalloc, a C library of general-purpose memory allocation functions.
- The library is generalizable to any NUMA architecture, but for Knights Landing processors it is used primarily for manual allocation to HBM, using special allocators for C/C++.
- It has limited support for Fortran.
A2: Using Memkind - C case
- Allocate 1000 floats from DDR
#include <stdlib.h>
float *fv;
fv = (float *)malloc(sizeof(float) * 1000);
- Allocate 1000 floats from MCDRAM
#include <hbwmalloc.h>
float *fv;
fv = (float *)hbw_malloc(sizeof(float) * 1000);
A2: Using Memkind - Fortran case
C     Declare arrays to be dynamic
      REAL, ALLOCATABLE :: A(:), B(:), C(:)
!DEC$ ATTRIBUTES FASTMEM :: A
      NSIZE=1024
c
c     allocate array 'A' from MCDRAM
c
      ALLOCATE (A(1:NSIZE))
c
c     Allocate arrays that will come from DDR
c
      ALLOCATE (B(NSIZE), C(NSIZE))
Porting Applications onto Marconi A1 and A2
- Application developers must utilise vectorisation (SIMD) and/or MPI processes (OpenMP threads), and possibly the fast memory of KNL (MCDRAM).
- From the user perspective, the success of KNL will depend on how well software tools and applications are able to exploit its architecture. Key idea: use a proxy (KNC, …):
- SPECFEM3D_GLOBE:
– Already reasonable results with KNC in the framework of the IPCC@CINECA activity.
– Good amount of vectorisation (enabled by the FORCE_VECTORIZATION preprocessing flag and SIMD optimization), suitable for KNC and the future KNL.
– Scales to a high number of OpenMP threads (more than 60 on KNC).
– Worth noting that up to now KNC has not been widely supported by developers and users of geophysical application software. A remarkable exception is the SPECFEM3D_GLOBE CIG repository, where the “native” version is maintained and tested. Again, this should help its start on KNL.
Good practice: vectorization (SPECFEM3D_GLOBE KNC investigation)
SPECFEM3D_GLOBE Regional_MiddleEast test case: forward simulation

| Computer system | e.t. (sec.) | Speedup wrt Haswell |
|---|---|---|
| Haswell (Galileo) | 570.20 | 1.00 |
| KNC (Galileo) | 430.35 | 1.32 |

The impact of vectorisation. SPECFEM3D_GLOBE Regional_MiddleEast test case: no vectorisation comparison

| Computer system | e.t. (sec.) | Slowdown factor wrt vectorised |
|---|---|---|
| Haswell (Galileo) | 687.14 | 1.20 |
| KNC (Galileo) | 848.12 | 1.97 (<- 2x slowdown factor) |

Based on a 4-node Galileo partition (16 MPI processes; 4 and 60 OpenMP threads on Haswell and KNC respectively).
Good practice: choosing MCDRAM memory modes
- MCDRAM cache mode should be a good choice for most applications, but…
- …applications with ‘cache unfriendly’ data are candidates for using other memory modes. In this case, try to identify what to put in MCDRAM:
– timings with selected data items allocated in fast/slow memory (intrusive)
– memory profiling (VTune Amplifier tool)
Conclusions
- Marconi A1: moderate improvements over the years… but a big improvement compared to Fermi.
- High expectations for Marconi A2 KNL performance.
- KNC paves the way for increasing performance…
- ….try to manage domain parallelism, increase threading,