Scaling Performance in Power-Limited HPC Systems
Prof. Dr. Luca Benini


SLIDE 1

Scaling performance In Power‐Limited HPC Systems

  • Prof. Dr. Luca Benini

ERC Multitherman Lab, University of Bologna – Italy; D-ITET, Chair of Digital Circuits & Systems, ETH Zürich – Switzerland

SLIDE 2

Outline

 Power and Thermal Walls in HPC
 Power and Thermal Management
 Energy-efficient Hardware
 Conclusion

SLIDE 3

Power Wall

Top500 ranks the new supercomputers by FLOPS

  • on the Linpack benchmark

Sunway TaihuLight (#1): 93 PF, 15.3 MW

Exascale computing in 2020 at today's efficiency would require ~170 MW, about 30% of the energy budget of a present-day nuclear reactor. A feasible exascale power budget is ≤ 20 MW, i.e. 50 GFLOPS/W: we need almost 10x more energy efficiency than today's ~6 GFLOPS/W.

The second system, Tianhe-2 (former #1), consumes 17.8 MW for "only" 33.2 PFLOPS, but…
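As a quick check of these figures (using the Top500 numbers quoted on this slide):

    93 PFLOPS / 15.3 MW ≈ 6.1 GFLOPS/W   (TaihuLight today)
    1 EFLOPS  / 20 MW   = 50  GFLOPS/W   (exascale target)
    50 / 6.1  ≈ 8.2x                     (almost an order of magnitude more efficiency needed)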

Cooling system matters!!!

Dynamic Power management (DPM)

SLIDE 4

Thermal Wall

Intel Haswell – E5-2699 v3 (18 cores)

Up to 24°C temperature difference on the die; more than 7°C thermal heterogeneity under the same workload.

Dynamic thermal management (DTM)

Power consumption varies by 40%–66% with a per-core DVFS approach.

[Chart: power (W) vs. core voltage (V) across the 1.2–2.4 GHz operating points]

Thermal range: 69 °C – 101 °C

HPC System

SLIDE 5

HPC Architecture ‐ Hardware

[Diagram: cooling loop from the CRAC (hot/cold air or water) down through the HPC cluster, rack, compute node, and CPU]

A multi‐scale parallel system

DPM, DTM are Multi‐scale Problems!

SLIDE 6

HPC Architecture ‐ Software

Scheduling model: users submit jobs to a batch system + scheduler, which partitions the HPC resources among jobs (Job 1 … Job 4).

Programming model: threads are statically mapped to cores; communications are one-to-one, one-to-many, many-to-one and many-to-many.

Programming & scheduling model is essential!

SLIDE 7

Outline

 Power and Thermal Walls in HPC
 Power and Thermal Management
 Energy-efficient Hardware
 Conclusion

SLIDE 8

HW Support for DPM, DTM

ACTIVE STATES – DVFS (P-states): P0 … Pn, from the highest frequency (P0) to the lowest frequency (Pn); this is the control range. A P-state selects both a voltage and a frequency level.

IDLE STATES – low power (C-states).

[Chart: power (W) vs. core voltage (V) across the 1.2–2.4 GHz operating points]

Intel provides a HW power controller called Running Average Power Limit (RAPL).
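To illustrate how RAPL is exposed to software on Linux, the sketch below derives average package power from the powercap sysfs interface; this is a sketch only, and the sysfs path (intel-rapl:0) and counter availability vary by platform, kernel, and permissions.

/* rapl_power.c - minimal sketch: average package power from powercap/intel-rapl.
 * The sysfs path is an assumption; run with sufficient permissions. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ENERGY_PATH "/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj"

static long long read_energy_uj(void)
{
    FILE *f = fopen(ENERGY_PATH, "r");
    long long uj = -1;
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
    }
    return uj; /* microjoules since the last counter wrap */
}

int main(void)
{
    long long e0 = read_energy_uj();
    sleep(1);                         /* 1 s measurement window */
    long long e1 = read_energy_uj();

    if (e0 < 0 || e1 < 0 || e1 < e0) {
        fprintf(stderr, "RAPL energy counter not readable (or wrapped)\n");
        return EXIT_FAILURE;
    }
    /* energy delta [uJ] over 1 s equals average power [uW] */
    printf("Package power: %.2f W\n", (e1 - e0) / 1e6);
    return EXIT_SUCCESS;
}

A power cap would be applied analogously by writing to the constraint_*_power_limit_uw files in the same directory (again, subject to the platform exposing them).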

SLIDE 9

Power Management  Reactive

A significant exploration of the RAPL control system:

 Zhang, H., & Hoffmann, H. (2015). "A Quantitative Evaluation of the RAPL Power Control System". Feedback Computing.

It quantifies the behavior of the control system in terms of:

  • Stability: freedom from oscillation
  • Accuracy: convergence to the limit
  • Settling time: duration until the limit is reached
  • Maximum overshoot: the maximum difference between the power limit and the measured power
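These metrics can be computed directly from a measured power trace; the sketch below does so for settling time and maximum overshoot (the 5% settling band and the sample trace are illustrative assumptions, not values from the paper):

/* Control-quality metrics for a power trace sampled at fixed intervals. */
#include <math.h>
#include <stdio.h>

/* Settling index: first sample after which power stays within +/-band of the
 * limit. Max overshoot: largest excess of measured power over the limit. */
static void rapl_metrics(const double *p, int n, double limit, double band,
                         int *settle_idx, double *max_overshoot)
{
    *max_overshoot = 0.0;
    *settle_idx = n;                  /* n means "never settled" */
    for (int i = 0; i < n; i++) {
        double over = p[i] - limit;
        if (over > *max_overshoot)
            *max_overshoot = over;
    }
    for (int i = n - 1; i >= 0; i--) {
        if (fabs(p[i] - limit) > band * limit)
            break;                    /* last sample outside the band */
        *settle_idx = i;
    }
}

int main(void)
{
    double trace[] = {120, 110, 104, 99, 101, 100, 100, 100}; /* Watts */
    int settle; double overshoot;
    rapl_metrics(trace, 8, 100.0, 0.05, &settle, &overshoot);
    printf("settling sample: %d, max overshoot: %.1f W\n", settle, overshoot);
    return 0;
}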

SLIDE 10

Power Management  HW Predictive

 on‐line optimization policies

  • A. Bartolini et al. "Thermal and Energy Management of High-Performance Multicores: Distributed and Self-Calibrating Model-Predictive Controller." TPDS'13

Online techniques are capable of sensing changes in the workload distribution and setting the processor controls accordingly.

Thermal model based on an RC approach

Temperature prediction via an AutoRegressive Moving Average (ARMA) model; a scheduler based on convex optimization chooses DVFS settings and thread migrations, implementing both proactive and reactive policies.
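To make the "RC approach" concrete, the sketch below steps a first-order RC thermal model forward in time. The R, C, and power values are made-up illustrative numbers, and the cited work uses a distributed, self-calibrating MPC formulation rather than this toy predictor.

/* First-order RC thermal model: C * dT/dt = P - (T - T_amb) / R.
 * Discretized with explicit Euler to predict die temperature. */
#include <stdio.h>

int main(void)
{
    const double R = 0.5;      /* thermal resistance  [K/W]  (illustrative) */
    const double C = 2.0;      /* thermal capacitance [J/K]  (illustrative) */
    const double T_amb = 45.0; /* ambient/heatsink temperature [degC] */
    const double dt = 0.01;    /* time step [s] */

    double T = 60.0;           /* initial core temperature [degC] */
    double P = 80.0;           /* constant core power [W] */

    for (int k = 0; k < 1000; k++) {
        /* Euler step of the RC equation */
        T += dt * (P - (T - T_amb) / R) / C;
    }
    /* Steady state approaches T_amb + P*R = 45 + 40 = 85 degC */
    printf("Predicted temperature after 10 s: %.1f C\n", T);
    return 0;
}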

SLIDE 11

Power Management  SW predictive

 Predictive models to estimate the power consumption

  • Borghesi, A., Conficoni, C., Lombardi, M., & Bartolini, A. "MS3: a Mediterranean-Style Job Scheduler for Supercomputers – do less when it's too hot!". HPCS 2015
  • Sîrbu, A., & Babaoglu, O. "Predicting system-level power for a hybrid supercomputer". HPCS 2016

[Chart: system power (W) vs. time, with jobs (Job 1 … Job 4) scheduled and re-allocated under a power cap]

No interaction with the compute nodes: only scheduling and allocation!
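A sketch of the underlying idea (not the MS3 algorithm or the prediction models themselves): the scheduler dispatches a queued job only if its predicted power keeps the system under the cap. Job names, power estimates, and the cap are invented for illustration.

/* Greedy power-capped dispatch: start a queued job only if the running
 * predicted power plus its own prediction stays under the cap. */
#include <stdio.h>

struct job { const char *name; double pred_power_w; int running; };

int main(void)
{
    double power_cap = 300.0;        /* system power cap [W], illustrative */
    double running_power = 0.0;
    struct job queue[] = {
        {"job1", 120.0, 0}, {"job2", 150.0, 0},
        {"job3",  90.0, 0}, {"job4",  60.0, 0},
    };
    int n = sizeof(queue) / sizeof(queue[0]);

    for (int i = 0; i < n; i++) {
        if (running_power + queue[i].pred_power_w <= power_cap) {
            queue[i].running = 1;    /* dispatch: fits under the cap */
            running_power += queue[i].pred_power_w;
            printf("start %s (total %.0f W)\n", queue[i].name, running_power);
        } else {
            printf("hold  %s (would exceed %.0f W cap)\n",
                   queue[i].name, power_cap);
        }
    }
    return 0;
}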

SLIDE 12

Challenges

1) Low-overhead, accurate monitoring
2) Scalable data collection, analytics, decisions
3) Application awareness

SW policies: application aware, but high overhead and coarse granularity (seconds).

HW mechanisms: low overhead and fine granularity (milliseconds), but no application awareness.

SLIDE 13

High-resolution monitoring → more information available


[Plots: coarse-grain view via IPMI (max. Ts = 1 s), 1 node over 20 min, vs. DIG @ 1 ms, 45 nodes over 20 min]

How to analyze in real time with higher sampling rates?

Low Overhead, accurate Monitoring

SLIDE 14


[Plots: power-trace frequency analysis for Application 1 and Application 2]

How to do this in real time for a football-field-sized cluster of computing nodes?

Real-time Frequency analysis on power supply and more…

Low Overhead, accurate Monitoring

SLIDE 15

Huge amount of data

Goal: a monitoring engine capable of fine-grained monitoring and spectral analysis, distributed on a large-scale cluster

Solution – Dwarf In a Giant (DIG)

SLIDE 16

Developing hardware extensions for fine-grained power monitoring: DIG deployed in production machines


DIG in Real Life

  • "Galileo": Intel Xeon E5 based, used for prototyping
  • D.A.V.I.D.E.: IBM Power8 based, commercial system with E4 – PCP III, 18th in Green500
  • ARM64: ARM64 Cavium based, commercial system with E4 – PCP II
SLIDE 17


High Resolution Out-of-band Power Monitoring

  • Overall node power consumption
  • Can support edge computing/learning
  • Platform independent (Intel, IBM, ARM)
  • Sub‐Watt precision
  • Sampling rate @ 50 kS/s (Ts = 20 µs)

State-of-the-art systems (Bull HDEEM and PowerInsight):

  • Max. 1 ms sampling period
  • Use data only offline

Hackenberg et al. "HDEEM: High Definition Energy Efficiency Monitoring."
Laros et al. "PowerInsight – a commodity power measurement capability."

DIG Architecture

slide-18
SLIDE 18

Problems:
  • ARM not real-time (losing ADC samples)
  • ARM busy with flushing the ADC

Goal: offload the processing to the PRUSS

Real‐time Capabilities

SLIDE 19

Possible tasks of the PRUs: averaging @ 1 ms and 1 s → offline computing; FFT → edge analysis

Framework                  Fs_max [kHz]   CPU overhead
DIG                        50             ~40%
DIG + PRU, edge analysis   400            <5%
DIG + PRU, offline         800            <5%
Bull-HDEEM                 1              ?
PowerInsight               1              ?

µs-resolved timestamps
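The PRU offload is essentially block averaging of the 50 kS/s ADC stream; below is a host-side sketch of the same decimation (the real firmware runs on the PRUs and reads from the ADC FIFO, the synthetic samples here are placeholders):

/* Block-average decimation: reduce a 50 kS/s power stream to 1 kS/s
 * (1 ms averages), as the PRUs do before handing data to the ARM core. */
#include <stdio.h>

#define FS_IN   50000      /* input sample rate [S/s] */
#define FS_OUT   1000      /* output rate: one average per millisecond */
#define DECIM   (FS_IN / FS_OUT)

int main(void)
{
    double acc = 0.0;
    int cnt = 0;

    for (long i = 0; i < FS_IN; i++) {           /* one second of samples */
        double sample = 42.0 + (i % 50) * 0.01;  /* placeholder for ADC read */
        acc += sample;
        if (++cnt == DECIM) {                    /* emit one 1 ms average */
            printf("%.3f\n", acc / DECIM);
            acc = 0.0;
            cnt = 0;
        }
    }
    return 0;
}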

DIG in production: E4’s D.A.V.I.D.E.

SLIDE 20

Scalable Data Collection, Analytics

[Diagram: sensor publishers (Sens_pub) feed M MQTT brokers (Broker1 … BrokerM), each backed by a Cassandra node, with Grafana on top for visualization]

Back-end
  • MQTT-enabled sensor collectors

Front‐end

  • MQTT Brokers
  • Data Visualization
  • NoSQL Storage
  • Big Data Analytics

[Diagram: target facility → MQTT brokers → applications → NoSQL storage, with admin access; the tool stack includes Apache Spark, MQTT2Kairos, KairosDB, Grafana, and Python/Matlab clients]
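For reference, a Sens_pub-style publisher could look roughly like the sketch below, using the Eclipse Paho MQTT C client; the broker address, topic, and payload format are illustrative assumptions, not the actual collector code.

/* Minimal MQTT sensor publisher sketch (Eclipse Paho C, synchronous API).
 * Build: gcc sens_pub.c -lpaho-mqtt3c */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include "MQTTClient.h"

int main(void)
{
    MQTTClient client;
    MQTTClient_connectOptions opts = MQTTClient_connectOptions_initializer;
    char payload[64];

    MQTTClient_create(&client, "tcp://broker.example:1883", "sens_pub_A",
                      MQTTCLIENT_PERSISTENCE_NONE, NULL);
    opts.keepAliveInterval = 20;
    opts.cleansession = 1;
    if (MQTTClient_connect(client, &opts) != MQTTCLIENT_SUCCESS) {
        fprintf(stderr, "connect failed\n");
        return 1;
    }

    /* payload: "value;timestamp", matching the {Value;Timestamp} convention */
    snprintf(payload, sizeof(payload), "%.2f;%ld", 231.4, (long)time(NULL));

    MQTTClient_message msg = MQTTClient_message_initializer;
    MQTTClient_deliveryToken tok;
    msg.payload = payload;
    msg.payloadlen = (int)strlen(payload);
    msg.qos = 0;
    msg.retained = 0;
    MQTTClient_publishMessage(client, "facility/sensors/A", &msg, &tok);
    MQTTClient_waitForCompletion(client, tok, 1000L);

    MQTTClient_disconnect(client, 1000);
    MQTTClient_destroy(&client);
    return 0;
}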

SLIDE 21

MQTT to NoSQL Storage: MQTT2Kairosdb

MQTT publishers (Sens_pub_A, Sens_pub_B, Sens_pub_C) publish to topics such as facility/sensors/B, with a payload of {Value; Timestamp}. MQTT2Kairosdb subscribes to facility/sensors/# on the MQTT broker and maps each topic to a metric in the Cassandra column family (Metric: A, B, C), using the topic levels (facility, sensors) as tags.
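The topic-to-metric mapping can be illustrated by how a single MQTT message would become a KairosDB-style datapoint; this is only a sketch, and the exact JSON accepted by the KairosDB REST API and by MQTT2Kairosdb should be checked against their documentation.

/* Sketch: map an MQTT topic + "value;timestamp" payload to a KairosDB-style
 * JSON datapoint (metric name = last topic level, other levels as tags). */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *topic   = "facility/sensors/B";
    const char *payload = "231.40;1517216400";     /* {Value;Timestamp} */

    char buf[128], json[256];
    strncpy(buf, topic, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    char *org    = strtok(buf,  "/");              /* "facility" */
    char *kind   = strtok(NULL, "/");              /* "sensors"  */
    char *metric = strtok(NULL, "/");              /* "B"        */

    double value; long ts;
    if (!metric || sscanf(payload, "%lf;%ld", &value, &ts) != 2)
        return 1;

    /* KairosDB-style datapoint body (timestamps in ms in the real API) */
    snprintf(json, sizeof(json),
             "[{\"name\":\"%s\",\"datapoints\":[[%ld000,%.2f]],"
             "\"tags\":{\"org\":\"%s\",\"type\":\"%s\"}}]",
             metric, ts, value, org, kind);
    printf("%s\n", json);
    return 0;
}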

SLIDE 22

Examon Analytics: Batch & Streaming

  • Batch: examon-client (REST) → Pandas dataframe
  • Streaming: Bahir-mqtt (Spark connector)

SLIDE 23

MQTT Real-Time Stream Processing

[Diagram: MQTT publishers (Sens_pub_A, Sens_pub_B, Sens_pub_C) → MQTT broker → MQTT stream processor subscribed to facility/sensors/#, with sync, buffer, and calc stages]

Streaming Analytics: virtual sensors!
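A virtual sensor is simply a value computed from several synchronized input streams. The sketch below keeps the latest sample per input and emits a derived total once all inputs are aligned; in the real pipeline the samples arrive via MQTT callbacks and the result is re-published on a new topic, and the aggregation here (a plain sum) is just an example.

/* Virtual-sensor sketch: aggregate the latest sample of N input sensors
 * into one derived value (e.g. total power) once all are up to date. */
#include <stdio.h>

#define NSENS 3

struct sample { double value; long ts; int valid; };
static struct sample last[NSENS];

/* Called for every incoming sample; returns 1 when a virtual value is ready. */
static int on_sample(int sensor, double value, long ts, double *virt)
{
    last[sensor].value = value;
    last[sensor].ts = ts;
    last[sensor].valid = 1;

    double sum = 0.0;
    for (int i = 0; i < NSENS; i++) {
        if (!last[i].valid || last[i].ts != ts)  /* wait for a full, aligned set */
            return 0;
        sum += last[i].value;
    }
    *virt = sum;                                 /* e.g. total power */
    return 1;
}

int main(void)
{
    double virt;
    on_sample(0, 100.0, 1, &virt);
    on_sample(1,  80.0, 1, &virt);
    if (on_sample(2,  60.0, 1, &virt))
        printf("virtual sensor (sum): %.1f\n", virt);   /* 240.0 */
    return 0;
}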

SLIDE 24

Examon in production: CINECA's GALILEO (528 nodes)

[Deployment diagram: Galileo compute nodes run Pmu_pub publishers and the Galileo management node runs Ipmi_pub, while facility BeagleBone Black boards publish Sensortag data; everything flows over MQTT to a back-end hosted on OpenStack (CloudUnibo @ Pico-CINECA) with a broker, proxy, Grafana, KairosDB, Spark/TensorFlow/Jupyter, and a five-node Cassandra cluster (Cass00–Cass04, each with a 256 GB volume)]

Data ingestion rate    ~67K metrics/s
DB bandwidth           ~98 Mbit/s
DB size                ~1000 GB/week
DB write latency       20 µs
DB read latency        4800 µs

Tier-1 system: 0.5–1 TB every week; Tier-0 estimated at 10 TB per 3.5 days.

Stream analytics & distributed processing are a necessity

SLIDE 25

Application-Aware Energy-to-Solution Minimization

  • Cluster: 516 nodes (14 racks)
  • Node: dual-socket Intel Haswell E5-2630 v3 CPUs with 8 cores at 2.4 GHz (85 W TDP), 128 GB DDR3 RAM
  • Power consumption: 360 kW
  • OS: SMP CentOS Linux version 7.0
  • Top500: ranked 281st

Compute node

Galileo: Tier‐1 HPC system based on an IBM NeXtScale cluster

Car‐Parrinello Kernels


Quantum ESPRESSO is an integrated suite of HPC codes for electronic‐structure calculations and materials modelling at the nanoscale.

SLIDE 26

PMPI

hello.c:

#include <mpi.h>
#include <string.h>

int main(void)
{
    int world_size, world_rank;
    char message[] = "Hello world to everyone from MPI root!";

    // Initialize the MPI environment
    MPI_Init(NULL, NULL);
    // Get the number of processes
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    // Get the rank of the process
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    // Send a broadcast message from the MPI root to everyone
    MPI_Bcast(message, (int)strlen(message) + 1, MPI_CHAR, 0, MPI_COMM_WORLD);
    // Finalize the MPI environment
    MPI_Finalize();
    return 0;
}

pmpi_wrapper.c:

#include <mpi.h>
#include <stdio.h>

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root,
              MPI_Comm comm)
{
    /* prologue profiling code */
    double start_time = MPI_Wtime();
    int err = PMPI_Bcast(buffer, count, datatype, root, comm);
    /* epilogue profiling code */
    double end_time = MPI_Wtime();
    double duration = end_time - start_time;
    printf("MPI_Bcast duration: %f sec\n", duration);
    return err;
}

[Diagram: timeline of processes P0 … Pn alternating APP and MPI synchronization time, with calls intercepted by the MPI library]

MPI profiling interface: augment each standard MPI function with profiling collection functionality.

SLIDE 27

PMPI Runtime

Our PMPI implementation has the following features:

  • Number of MPI calls: 50 MPI functions wrapped (all of QE's MPI calls)
  • Timing: record the TSC (time stamp counter) for timing accuracy
  • Network data: record all data sent and received by the process
  • Fixed perf counters: monitor the 3 fixed performance counters using the low-overhead rdpmc instruction
      • Fixed 1: number of instructions retired
      • Fixed 2: clock at the nominal frequency at every active cycle
      • Fixed 3: clock at the actual frequency of the core at every active cycle
  • PMC perf counters: monitor 8 configurable performance counters using the low-overhead rdpmc instruction

Time overhead: 0.59%. Memory overhead?

The memory overhead is related to:
  • the number of MPI processes
  • the application time
  • the number of MPI calls

Example: 16 MPI processes, 7.40 min of application time and 3.5 million MPI calls → memory overhead ≈ 250 MB

Average timing error wrt Intel Trace Analyzer: 0.45%
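As an illustration of the low-overhead counter access mentioned above, the sketch below reads the TSC and one fixed counter with the rdpmc instruction. This is a sketch under the assumption that user-space rdpmc is enabled (e.g. via /sys/devices/cpu/rdpmc on Linux); it is x86_64-specific and not the actual PMPI runtime code.

/* Sketch: low-overhead timing/counter reads on x86_64.
 * rdtsc -> time-stamp counter; rdpmc with bit 30 set -> fixed counters
 * (index 0 = INST_RETIRED.ANY). Requires user-space rdpmc enabled. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t read_pmc(uint32_t idx)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = __rdtsc();
    uint64_t i0 = read_pmc(1u << 30);      /* fixed counter 0: instructions retired */

    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++)      /* region of interest */
        x += i * 0.5;

    uint64_t i1 = read_pmc(1u << 30);
    uint64_t t1 = __rdtsc();

    printf("cycles (TSC): %llu, instructions: %llu\n",
           (unsigned long long)(t1 - t0), (unsigned long long)(i1 - i0));
    return 0;
}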

SLIDE 28

APP time vs. MPI time

[Charts: application time [%] vs. MPI time [%] broken down by MPI phase duration (all, 1 µs, 10 µs, 100 µs, 1 ms, 10 ms, 100 ms, 1 s), for ndiag = 1 and ndiag = 16]

ndiag = 1: linear algebra is computed only by the root MPI process → unbalanced workload; MPI time is dominated by long phases (workload of the MPI root: 10.25%, average workload of the other ranks: 5.98%).

ndiag = 16: linear algebra is computed by all MPI processes → balanced workload; MPI time is dominated by short phases (workload of the MPI root: 6.59%, average workload of the other ranks: 6.23%).

[Diagram: per-process timeline alternating APP and MPI phases, running at max frequency during APP and min frequency during MPI]

Idea: use DVFS to slow down the cores during MPI phases. Challenge: account for DVFS transition inertia and for application slowdown.
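A sketch of the mechanism (not the actual runtime used in this work): a PMPI wrapper drops the core frequency before a long collective and restores it afterwards, assuming the cpufreq "userspace" governor is active, the sysfs path is writable, and the rank is pinned to a known core.

/* pmpi_dvfs.c - sketch: lower the core frequency during an MPI collective.
 * Assumes cpufreq "userspace" governor and a writable scaling_setspeed. */
#include <mpi.h>
#include <stdio.h>

static void set_freq_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (f) {
        fprintf(f, "%ld", khz);
        fclose(f);
    }
}

int MPI_Bcast(void *buf, int count, MPI_Datatype dt, int root, MPI_Comm comm)
{
    int cpu = 0;                     /* core the rank is pinned to (assumed) */
    set_freq_khz(cpu, 1200000);      /* min frequency during the MPI phase */
    int err = PMPI_Bcast(buf, count, dt, root, comm);
    set_freq_khz(cpu, 2400000);      /* back to max for the compute phase */
    return err;
}

In practice the wrapper would only do this for phases predicted to last longer than the DVFS transition latency (the 500 µs threshold discussed on the next slide).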

SLIDE 29

PMPI-based Energy-to-Solution minimization

If QE spends a significant share of its MPI time in phases longer than 500 µs, the runtime can slow the cores down during those phases. On an unbalanced benchmark on a single node (negligible MPI communication time), up to 11% of energy and 12% of power are saved with no impact on performance. PMPI is needed to gauge and exploit (PMPI + power management) this power-saving opportunity.

SLIDE 30

Outline

 Power and Thermal Walls in HPC
 Power and Thermal Management
 Energy-efficient Hardware
 Conclusion

SLIDE 31

The Era of Heterogeneous Architectures

Massive presence of accelerators in TOP500 Absolute dominance in GREEN500

SLIDE 32

Recipe for Energy‐efficient Acceleration

  • Many (thousands of) "simple" cores, managing FP units and special-function units for key workload patterns (stencil, tensor units) → maximize FP/mm²
  • Non-coherent caches and lots of "non-cache" memory (registers for multithreading, scratchpads) → maximize "useful" bits/mm² on-chip
  • Large memory bandwidth based on tightly coupled memory (HBM) → maximize GB/s/mm² off-chip
  • Low operating voltage and moderate operating frequency → keep W/mm² under control
  • From 2D to 3D (now 2.5D)

Is there room for differentiation, or are GP‐GPUs the only answer?

SLIDE 33

PEZY-SC2 (top 1-2-3 in the GREEN500, Nov '17)

Pezy‐SC highlights:

  • Technology (16 nm TSMC) – 54% power reduction
  • Advanced and integrated power delivery – 30% power reduction
  • Low-voltage operation (0.7 V) – 16% power reduction
  • Low-performance host processor – 15% power reduction

Combines low‐power design, simple (no legacy!) instruction set, advanced power management

SLIDE 34

Opportunity for (EU) HPC: open ISA

  • Reasonable, streamlined ISA → distills many years of research, conceived for efficiency, not for legacy support
  • Safe-to-use free ISA → freedom to operate (see the RISC-V genealogy project), freedom to change/evolve/specialize, no licensing costs
  • Wide community effort already ongoing on tools, verification, … → leverage this to jumpstart and compensate for our initial inertia
  • Rapidly gaining traction in many application domains (IoT, big data) → large "dual-use" market opportunity
  • Spec covers 64-bit, vector ISA (ongoing), 128-bit (planned)
  • HPC-profile RISC-V startups already active (esperanto.ai)
  • Open RISC-V ISA developed by UC Berkeley and now supported by the RISC-V Foundation (riscv.org), with 70+ members (including NVIDIA, IBM, Qualcomm, Micron, Samsung, Google, …)

SLIDE 35

PULP: An Open Source Parallel Computing Platform

PULP Hardware and Software released under Solderpad License

PULP stack: compiler infrastructure, processor & hardware IPs, virtualization layer, programming model, low-power silicon technology. Started in 2013 (UNIBO, ETHZ).

Used by tens of companies and universities; taped out in 14 nm FinFET, 22FDX, … The 64-bit core "Ariane" + platform is to be launched in Q1 2018 (taped out in 22FDX).

SLIDE 36


Chips: QUENTIN, KERBIN, HYPERDRIVE

SLIDE 37


Thanks for your attention!

www.pulp‐platform.org

The fun is just beginning...

[Chip gallery: PULP tape-outs in 180 nm, 130 nm, 65 nm, 40 nm, and 28 nm technologies]

http://asic.ethz.ch