

SLIDE 1

HPC Performance and Energy Efficiency Overview and Trends

  • Dr. Sébastien Varrette

Parallel Computing and Optimization Group (PCOG)

SMAI 2015 Congress, Les Karellis (Savoie)
June 9th, 2015
http://hpc.uni.lu

SLIDE 2

Outline

■ Introduction & Context
■ HPC Data-Center Trends: Time for DLC
■ HPC [Co-]Processor Trends: Go Mobile
■ Middleware Trends: Virtualization, RJMS
■ Software Trends: Rethinking Parallel Computing
■ Conclusion

SLIDE 3

Introduction and Context

SLIDE 4

HPC at the Heart of our Daily Life

■ Today: R&D, Academia, Industry, Local Authorities
■ Tomorrow: digital health, nano/bio technologies…

SLIDE 5

Performance Evaluation of HPC Systems

■ Commonly used metrics

✓ FLOPs: raw compute capability
✓ GUPS: memory performance
✓ IOPS: storage performance
✓ bandwidth & latency: memory operations or network transfers

■ Energy Efficiency

✓ Power Usage Effectiveness (PUE) in HPC data-centers

  • Total Facility Energy / Total IT Energy

✓ Average system power consumption during execution (W)
✓ Performance-per-Watt (PpW)
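To make these two efficiency metrics concrete, here is a minimal sketch in Python (function names and sample values are illustrative, not from the deck):

```python
# Minimal sketch of the two energy-efficiency metrics defined above.
# Function names and the sample values are illustrative only.

def pue(total_facility_kw: float, total_it_kw: float) -> float:
    """Power Usage Effectiveness = Total Facility Energy / Total IT Energy."""
    return total_facility_kw / total_it_kw

def ppw_gflops_per_watt(sustained_gflops: float, avg_power_w: float) -> float:
    """Performance-per-Watt, the Green500 metric."""
    return sustained_gflops / avg_power_w

print(pue(total_facility_kw=210.0, total_it_kw=200.0))                   # 1.05
print(ppw_gflops_per_watt(sustained_gflops=1054.0, avg_power_w=200.0))   # 5.27
```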


SLIDE 6

Ex (in Academia): The UL HPC Platform

http://hpc.uni.lu

■ 2 geographical sites, 3 server rooms
■ 4 clusters, ~281 users

✓ 404 nodes, 4316 cores (49.92 TFlops)
✓ Cumulative shared raw storage: 3.13 PB
✓ Around 197 kW

■ > 6.21 M€ HW investment so far
■ Mainly Intel-based architecture
■ Mainly Open-Source software stack

✓ Debian, SSH, OpenLDAP, Puppet, FAI...

SLIDE 7


Ex (in Academia): The UL HPC Platform

http://hpc.uni.lu

SLIDE 8

General HPC Trends

■ Top500: world’s 500 most powerful computers (since 1993)

✓ Based on the High-Performance LINPACK (HPL) benchmark
✓ Last list [Nov. 2014]:

  • #1: Tianhe-2 (China): 3,120,000 cores
    33.863 PFlops… and 17.8 MW
  • Total combined performance: 309 PFlops
    215.744 MW over the 258 systems which provided power information

■ Green500: Derive PpW metric from Top500 (MFlops/W)

✓ #1: L-CSC GPU Cluster (#168): 5.27 GFlops/W

■ Other Benchmarks: HPC{C,G}, Graph500…
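As a back-of-the-envelope check of the PpW metric on the Top500 figures above (plain arithmetic, not an additional measurement):

```python
# PpW of the Nov. 2014 Top500 #1, from the numbers quoted above.
tianhe2_rmax_flops = 33.863e15   # 33.863 PFlops (HPL)
tianhe2_power_w    = 17.8e6      # 17.8 MW

ppw = tianhe2_rmax_flops / tianhe2_power_w
print(f"Tianhe-2: {ppw / 1e9:.2f} GFlops/W")   # ~1.90 GFlops/W

# The Green500 #1 (L-CSC) reaches 5.27 GFlops/W: the raw-performance
# leader is almost 3x less energy-efficient than the efficiency leader.
print(f"ratio: {5.27e9 / ppw:.1f}x")           # ~2.8x
```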


SLIDE 9

Computing Needs Evolution

[Chart: projected application computing needs from 1 GFlops to 1 ZFlops over 1993–2029: Manufacturing, Computational Chemistry, Molecular Dynamics, Genomics, Human Brain Project, Multi-Scale Weather Prediction]

SLIDE 10

Computing Power Needs Evolution

[Same chart, annotated with the corresponding power envelopes: 100 kW, 1 MW, 10 MW, 100 MW, 1 GW]

SLIDE 11

Computing Less Power Needs Evolution

[Same chart, with the exascale power envelope capped: 10 MW target, < 20 MW limit]

SLIDE 12

The Budgetary Wall

[Same chart, annotated with energy cost tiers: < 1 M€/MW/year, 1.5 M€/MW/year, > 3 M€/MW/year]

SLIDE 13

Energy Optimization Paths toward Exascale

■ H2020 Exascale Challenge: 1 EFlops within 20 MW

✓ With today's most energy-efficient Top500 system, 1 EFlops would draw 189 MW

Reduced Power Consumption:
✓ Hardware: new [co-]processors, interconnect…
✓ Data-center: PUE optimization, DLC…
✓ Middleware: virtualization, RJMS…
✓ Software: new programming/execution models
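Both headline numbers follow from simple arithmetic (a sanity check, not new data):

```python
# Why "189 MW": extrapolate the best 2014 efficiency (L-CSC, 5.27 GFlops/W)
# to 1 EFlops, then derive the efficiency the 20 MW envelope implies.
exaflop        = 1e18     # Flops
best_ppw_2014  = 5.27e9   # Flops/W (5.27 GFlops/W)
power_budget_w = 20e6     # 20 MW

print(exaflop / best_ppw_2014 / 1e6)   # ~189.8 -> "189 MW" with 2014 technology
print(exaflop / power_budget_w / 1e9)  # 50.0   -> 50 GFlops/W required
```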

SLIDE 14

HPC Data-Center Trends: Time for DLC

Reduced Power Consumption: Hardware / Data-center / Middleware / Software

SLIDE 15

Cooling and PUE


Courtesy of Bull SA

SLIDE 16

Cooling and PUE

■ Direct immersion: the CarnotJet example (PUE: 1.05)


SLIDE 17

HPC [Co-]Processor Trends: Go Mobile

Reduced Power Consumption: Hardware / Data-center / Middleware / Software

SLIDE 18

Back to 1995: vector vs. micro-processor

■ Microprocessors ~10x slower than one vector CPU

✓ … thus not faster… But cheaper!



SLIDE 20

How about now?

■ Mobile SoCs ~10x slower than one microprocessor

✓ … thus not faster… But cheaper!
✓ the “already seen” pattern?

■ Mont-Blanc project: build an HPC system from embedded and mobile devices

SLIDE 21

Mont-Blanc (Phase 1) project outcomes

■ (2013) Tibidabo: the first ARM HPC multicore system

Courtesy of BSC

0.15 GFlops/W

SLIDE 22

The UL HPC viridis cluster (2013)

■ 2 enclosures (96 nodes, 4U), 12 Calxeda boards per enclosure

✓ 4x ARM Cortex A9 @ 1.1 GHz [4C] per Calxeda board

  • 2x 300 W, “10” GbE interconnect

0.513 GFlops/W

[Chart (log scale): PpW across OSU latency/bandwidth, HPL and HPL Full, CoreMark, Fhourstones, Whetstones and Linpack, for Intel Core i7, AMD G-T40N, Atom N2600, Intel Xeon E7 and ARM Cortex A9]

[EE-LSDS’13] M. Jarus, S. Varrette, A. Oleksiak, and P. Bouvry. Performance Evaluation and Energy Efficiency of High-Density HPC Platforms Based on Intel, AMD and ARM Processors. In Proc. of the Intl. Conf. on Energy Efficiency in Large Scale Distributed Systems (EE-LSDS’13), volume 8046 of LNCS, Vienna, Austria, Apr. 2013.

SLIDE 23

Commodity vs. GPGPUs: L-CSC (2014)

■ The German L-CSC cluster (Frankfurt, 2014)
■ Nov 2014: 56 (out of 160) nodes, each with:

✓ 4 GPUs, 2 CPUs, 256 GB RAM
✓ #168 on the Top500 (1.7 PFlops)
✓ #1 on the Green500

5.27 GFlops/W

SLIDE 24

Mobile SoCs and GPGPUs in HPC

■ Very fast development of Mobile SoCs and GPGPUs
■ Convergence between the two is foreseen

✓ CPUs inherit from GPUs: many cores with vector instructions
✓ GPUs inherit from CPUs: a cache hierarchy

■ In parallel: large innovation in other embedded devices

✓ Intel Xeon Phi co-processor
✓ FPGAs, etc.

Objective: 50 GFlops/W

SLIDE 25

Middleware Trends: Virtualization, RJMS

Reduced Power Consumption: Hardware / Data-center / Middleware / Software

SLIDE 26

Virtualization in an HPC Environment

■ Hypervisor: core virtualization engine / environment
  (Xen, VMware ESXi, KVM, VirtualBox)

✓ Type 1 adapted to HPC workloads
✓ Performance loss: > 20%

SLIDE 27

Virtualization in an HPC Environment

[CCPE’14] M. Guzek, S. Varrette, V. Plugaru, J. E. Pecero, and P. Bouvry. A Holistic Model of the Performance and the Energy-Efficiency of Hypervisors in an HPC Environment. Intl. J. on Concurrency and Computation: Practice and Experience (CCPE), 26(15):2569–2590, Oct. 2014.

[Plots: observed vs. refined power model over time; relative PpW (normalized by the baseline score) for the HPCC phases RandomAccess, DGEMM, STREAM, FFT, PTRANS and HPL under KVM, Xen and ESXi on the Taurus cluster]
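The normalization behind that last plot is a simple ratio; a minimal sketch with placeholder numbers (not the CCPE’14 measurements):

```python
# Relative PpW: each hypervisor's Performance-per-Watt divided by the
# bare-metal baseline, per HPCC phase. Values below 1.0 mean a loss.
# All scores below are illustrative placeholders.
baseline_ppw = {"HPL": 410.0, "STREAM": 30.0, "FFT": 95.0}   # e.g. MFlops/W
kvm_ppw      = {"HPL": 320.0, "STREAM": 27.5, "FFT": 71.0}

for phase in baseline_ppw:
    rel = kvm_ppw[phase] / baseline_ppw[phase]
    print(f"{phase}: {100 * rel:.0f}% of baseline PpW")
```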

SLIDE 28

Cloud Computing vs. HPC

■ Advertised worldwide as THE solution to all problems
■ Classical taxonomy:

✓ {Infrastructure,Platform,Software}-as-a-Service
✓ Grid’5000: Hardware-as-a-Service


SLIDE 30

Cloud Middleware for HPC Workload

■ vCloud: proprietary license; hypervisor: VMware/ESX; last version 5.5.0; host OS: VMX server/ESX; contributors: VMware
■ Eucalyptus: BSD license; hypervisors: Xen, KVM, VMware; last version 3.4; Java/C; host OS: RHEL 5, CentOS 5, openSUSE-11; contributors: Eucalyptus Systems, community
■ OpenNebula: Apache 2.0; hypervisors: Xen, KVM, VMware; last version 4.4; Ruby; host OS: RHEL 5, Ubuntu, Debian, Fedora, CentOS 5, openSUSE-11; contributors: C12G Labs, community
■ OpenStack: Apache 2.0; hypervisors: Xen, KVM, Linux Containers, VMware/ESX, Hyper-V, QEMU, UML; last version 8 (Havana); Python; host OS: Ubuntu, Debian, Fedora, RHEL, SUSE; contributors: Rackspace, IBM, HP, Red Hat, SUSE, Intel, AT&T, Canonical, Nebula, others
■ Nimbus: Apache 2.0; hypervisors: Xen, KVM; last version 2.10.1; Java/Python; host OS: Debian, Fedora, RHEL, SUSE; contributors: community
■ Guest OS (all five): Windows (S2008, 7), openSUSE, Debian, Solaris

SLIDE 31

Cloud Middleware for HPC Workload

[ICPP’14] S. Varrette, V. Plugaru, M. Guzek, X. Besseron, and P. Bouvry. HPC Performance and Energy-Efficiency of the OpenStack Cloud Middleware. In Proc. of the 43rd IEEE Intl. Conf. on Parallel Processing (ICPP-2014), Heterogeneous and Unconventional Cluster Architectures and Applications Workshop (HUCAA’14), Sept. 2014. IEEE.

Avg. performance drop (HPL, STREAM, RandomAccess, Graph500) and avg. energy-efficiency drop (Green500, GreenGraph500):

                 HPL     STREAM   RandomAccess   Graph500   Green500   GreenGraph500
OpenStack+Xen    41.5%   19%      89.7%          21.6%      56.5%      42%
OpenStack+KVM    58.6%   7.2%     67.5%          23.7%      38.5%      40%
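The “drop” figures are relative losses against bare metal; a one-liner sketch (the scores below are hypothetical placeholders chosen to reproduce the 58.6% figure above):

```python
# Relative drop = (baseline - virtualized) / baseline, in percent.
def drop_pct(baseline: float, virtualized: float) -> float:
    return 100.0 * (baseline - virtualized) / baseline

# Placeholder HPL scores, e.g. in GFlops:
print(f"{drop_pct(baseline=120.0, virtualized=49.7):.1f}%")   # 58.6%
```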

[Plots: Green Graph500 PpW (MTEPS/W) on Intel nodes (Lyon), baseline vs. OpenStack+Xen vs. OpenStack+KVM; total power over time for the taurus nodes t-3, t-4, t-5, t-6, t-13 and t-16]


SLIDE 32

Cloud IaaS (OpenStack) on Mobile SoCs

[CloudCom’14] V. Plugaru, S. Varrette, and P. Bouvry. Performance Analysis of Cloud Environments on Top of Energy-Efficient Platforms Featuring Low Power Processors. In Proc. of the 6th IEEE Intl. Conf. on Cloud Computing Technology and Science (CloudCom’14), Singapore, Dec. 15–18, 2014.


Avg. performance drop (HPL, PTRANS, FFT, RandomAccess) and avg. energy-efficiency drop (Green500):

                      HPL     PTRANS   FFT    RandomAccess   Green500
OpenStack 1VM/host    20.5%   56%      47%    25.2%          17.7%
OpenStack 2VM/host    24%     65.6%    56%    38.2%          23.5%


Configuration                      PpW                G500 rank
Viridis baseline                   513.53 MFlops/W    204
Viridis OpenStack/LXC 1VM/host     371.76 MFlops/W    234
Viridis OpenStack/LXC 2VM/host     333.94 MFlops/W    239

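Applying the same drop formula directly to the PpW column of the table above (plain arithmetic on the quoted values):

```python
# Energy-efficiency loss of viridis under OpenStack/LXC, from the PpW column.
baseline_mflops_w = 513.53
for label, ppw in [("1VM/host", 371.76), ("2VM/host", 333.94)]:
    loss = 100.0 * (baseline_mflops_w - ppw) / baseline_mflops_w
    print(f"{label}: {loss:.1f}% PpW loss")   # 27.6% and 35.0%
```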

SLIDE 35

Virtualization, RJMS and HPC

[JSSPP’15] J. Emeras, S. Varrette, M. Guzek, and P. Bouvry. Evalix: Classification and Prediction of Job Resource Consumption on HPC Platforms. In Proc. of the 19th Intl. Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP’15), part of IPDPS 2015, Hyderabad, India, May 25–29, 2015. IEEE Computer Society.

[Bar chart: Evalix prediction quality (Accuracy, AUC, Kappa) per resource indicator: CPU, memory (avg. and max.), reads, writes]

[Diagram: Evalix workflow: workload analysis, performance evaluation and user/job characterization feed the RJMS (OAR, PBS, etc.) for on-demand optimization of the computing platform: virtual resource configuration, scheduling, monitoring and energy-saving configuration, over local computing resources (sleeping / powered off, ready / busy, virtualized running VM instances) and remote cloud resources]

■ Virtualization not suitable for pure HPC performance

✓ YET not all workloads running on HPC are pure-parallel
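Evalix casts the problem as supervised classification of jobs per resource indicator; a minimal sketch of the three quality indicators from the bar chart, on fabricated labels (scikit-learn, not the paper's code):

```python
# Toy computation of the Evalix quality indicators: Accuracy, AUC, Kappa.
# The labels and scores below are fabricated for illustration.
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score

y_true  = [0, 1, 1, 0, 1, 0, 1, 1]                    # actual job class
y_pred  = [0, 1, 1, 0, 0, 0, 1, 1]                    # predicted class
y_score = [0.1, 0.9, 0.8, 0.3, 0.4, 0.2, 0.7, 0.95]   # predicted probability

print("Accuracy:", accuracy_score(y_true, y_pred))     # 0.875
print("Kappa:   ", cohen_kappa_score(y_true, y_pred))  # 0.75
print("AUC:     ", roc_auc_score(y_true, y_score))     # 1.0
```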

SLIDE 36

Other Middleware approaches

■ Multi-Agent System (MAS) for energy aware executions

[ISSPIT’14] M. Guzek, X. Besseron, S. Varrette, G. Danoy, and P. Bouvry. ParaMASK: a Multi-Agent System for the Efficient and Dynamic Adaptation of HPC Workloads. In Proc. of the 14th IEEE Intl. Symp. on Signal Processing and Information Technology (ISSPIT’14), Noida, India, Dec. 2014. IEEE Computer Society.

[Diagram: ParaMASK architecture: an OrgManager (O) coordinating per-node LocalManagers (L) and Workers (W) with work stealing under a Coordination Authority, in a Management Layer on top of the KAAPI layer across Nodes 1…n]

[Plots: total power over time on Grid’5000 nodes (sagittaire-6/9/24/74, stremi-24/25/26/28), with and without ParaMASK]

Time between global coordinations:   None    20 s    15 s    10 s    8 s     5 s     2 s     1 s
Overhead on the execution time:      <0.1%   1.29%   1.41%   2.20%   2.29%   3.63%   9.94%   22.99%

SLIDE 37

Software Trends: Rethinking Parallel Computing

Reduced Power Consumption: Hardware / Data-center / Middleware / Software

SLIDE 38

Why is Exascale different for Software?

■ Extreme power constraints, leading to:

✓ clock rates similar to today’s systems
✓ heterogeneous computing elements (ex: IBM Power Cell)
✓ memory per {core | Flops} will be smaller
✓ moving data will be expensive (in time and power)

■ HW↦SW Fault detection/correction

✓ becomes the programmer’s job

■ Extreme Scalability

✓ 10^8–10^9 concurrent threads
✓ performance is likely to be variable

  • static decomposition will not scale


[Plot: probability F(t) that a run fails, vs. number of processors (up to 5000), for execution times of 1, 5, 10, 20 and 30 days]
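Curves like these come from a standard reliability model; a sketch assuming independent, exponentially distributed processor failures (the MTBF value is an illustrative assumption):

```python
# P(at least one failure) for an N-processor run of duration t, assuming
# independent exponential failures with per-processor MTBF theta:
#   F(t) = 1 - exp(-N * t / theta)
import math

theta_hours = 100_000.0   # illustrative per-processor MTBF (~11.4 years)

def failure_prob(n_proc: int, t_hours: float) -> float:
    return 1.0 - math.exp(-n_proc * t_hours / theta_hours)

for days in (1, 5, 10, 30):
    print(f"{days:2d} days, 5000 procs: {failure_prob(5000, 24 * days):.3f}")
# Even a 1-day run on 5000 processors fails with probability ~0.70 here,
# which is why fault handling can no longer be left to the hardware alone.
```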

SLIDE 39

HPC Applications Compatibility Roadmap

Application              Traditional   Traditional             Energy-efficient   CC   (C)ompute/(D)ata
                         (x86_64)      +GPU                    (ARMv7)                 intensive
Synthetic benchmarks:
  HPCC                   X             TBI                     X                  X    C+D
  HPCG                   X             TBI                     X                  X    C+D
  Graph500               X             TBI                     X                  X    C+D
Finite Element Analysis / Computational Fluid Dynamics:
  LS-DYNA                X             TBI                     TBI                X    C+D
  OpenFOAM               X             TBI                     TBI                X    C+D
Molecular dynamics:
  AMBER                  X             X                       TBI                X    C+D
  NAMD                   X             X                       TBI                X    C+D
Bio-informatics:
  GROMACS                X             X                       X                  X    C+D
  ABySS                  X             ×                       X                  X    C+D
  mpiBLAST               X             × (alt.: GPU-BLAST)     X                  X    D
  MrBayes                X             × (alt.: GPU MrBayes)   X                  X    C
Materials science:
  ABINIT                 X             X                       X                  X    C+D
  QuantumESPRESSO        X             X (QE-GPU)              X                  X    C+D
Data analytics / machine learning:
  HiBench/Hadoop         X             TBI                     X                  X    D

SLIDE 40

Rethinking Parallel Computing

■ Today’s execution model might be obsolete

✓ Von Neumann machine

  • Program Counter, Arithmetic Logic Unit (ALU), addressable memory

✓ classic vector machines, GPUs with collections of threads (warps)

■ Plan for a change in the execution model:

✓ no assumption of performance regularity

  • not unpredictable, but imprecise

✓ synchronization is costly: don’t make it necessary
✓ memory operations are costly: move operations to the data?
✓ represent key HW operations beyond the simple ALU (a sketch follows this list)

  • remote update (RDMA), remote atomic ops (compare & swap)
  • execute short code sequences (active messages, parcels…)
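As a concrete instance of exposing such a hardware operation to software, here is a minimal sketch of a remote atomic compare-and-swap with MPI-3 one-sided communication via mpi4py (an illustration of the idea, not an API the slides prescribe):

```python
# Rank 1 atomically compare-and-swaps a value in rank 0's memory using
# MPI-3 RMA (mpi4py). Run with: mpirun -n 2 python cas_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

slot = np.zeros(1, dtype='i8')          # 64-bit slot each rank exposes
win = MPI.Win.Create(slot, comm=comm)

if rank == 1:
    origin  = np.array([42], dtype='i8')   # value to install
    compare = np.array([0],  dtype='i8')   # expected current value
    result  = np.zeros(1,    dtype='i8')   # receives the previous value
    win.Lock(0)                             # passive-target epoch on rank 0
    win.Compare_and_swap(origin, compare, result, target_rank=0)
    win.Unlock(0)

comm.Barrier()
if rank == 0:
    print("value after remote CAS:", slot[0])   # 42 if the swap succeeded
win.Free()
```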


SLIDE 41

Challenges for Programming Models

■ Probably successful: MPI, Map-Reduce
■ Still-pending challenges for exascale:

✓ provide a way to coordinate resource allocation
✓ a clean way to share data, with consistent memory models
✓ mathematical model guidance

  • continuous representation, possibly adaptive
  • lossy (within accuracy limits) yet preserving essential properties

✓ manage code through an Abstract Data Structure Language (ADSL)
✓ adaptive, with a multi-level approach

  • lightweight locally optimized vs. intra-node vs. regional
  • may rely on different programming models

SLIDE 42

Conclusion

■ Still a long way to go ;)

■ Questions?


Reduced Power Consumption:
✓ Hardware: new [co-]processors, interconnect…
✓ Data-center: PUE optimization, DLC…
✓ Middleware: virtualization, RJMS…
✓ Software: new programming/execution models