Scaling Performance in Power-Limited HPC Systems


  1. Scaling Performance in Power-Limited HPC Systems – Prof. Dr. Luca Benini, ERC Multitherman Lab, University of Bologna, Italy & D-ITET, Chair of Digital Circuits and Systems, Switzerland

  2. Outline: Power and Thermal Walls in HPC • Power and Thermal Management • Energy-efficient Hardware • Conclusion

  3. Power Wall
  • Exascale computing would draw ~170 MW in 2020 – about 30% of the energy budget of today's nuclear reactor. A feasible Exascale power budget is ≤ 20 MW.
  • Top500 ranks supercomputers by FLOPS on the Linpack benchmark. Sunway TaihuLight: 93 PF at 15.3 MW ≈ 6 GF/W. The second, Tianhe-2 (ex 1st), consumes 17.8 MW for "only" 33.2 PetaFLOPS, but… the cooling system matters!!!
  • An Exaflop within 20 MW means 50 GFLOP/W: we need almost 10x more energy efficiency → Dynamic Power Management (DPM). (A quick sanity check of these numbers is sketched below.)
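To make the "almost 10x" claim concrete, here is a minimal back-of-the-envelope check using only the figures quoted on the slide; it is purely illustrative arithmetic, not part of the presented material.

```python
# Efficiency check for the Power Wall numbers quoted on the slide.
systems = {
    "Sunway TaihuLight": (93e15, 15.3e6),   # (FLOP/s on Linpack, power in W)
    "Tianhe-2":          (33.2e15, 17.8e6),
    "Exascale target":   (1e18, 20e6),      # 1 EFLOP/s within a 20 MW budget
}

for name, (flops, watts) in systems.items():
    gflops_per_watt = flops / watts / 1e9
    print(f"{name:18s}: {gflops_per_watt:5.1f} GFLOP/W")

# Approximate output:
#   Sunway TaihuLight :   6.1 GFLOP/W
#   Tianhe-2          :   1.9 GFLOP/W
#   Exascale target   :  50.0 GFLOP/W   -> ~8-10x beyond TaihuLight
```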

  4. Thermal Wall
  • Intel Haswell E5-2699 v3 (18 cores): up to 24°C temperature difference on the die, and more than 7°C thermal heterogeneity under the same workload. Thermal range in an HPC system: 69°C – 101°C.
  • Per-core DVFS approach (1.2 – 2.4 GHz) – power consumption: 40% – 66%. [Figure: power (W) vs. core voltage (V) for frequencies from 1.2 to 2.4 GHz.]
  • → Dynamic Thermal Management (DTM)

  5. HPC Architecture – Hardware: a multi-scale parallel system, from CPU to compute node to rack to HPC cluster, with CRAC units circulating cold and hot air/water. DPM and DTM are multi-scale problems!

  6. HPC Architecture – Software: users submit jobs (Job 1 … Job 4) to a batch system + scheduler; the HPC resource-scheduling model statically partitions jobs onto cores and threads. The programming model defines the communications: one-to-one, one-to-many, many-to-one and many-to-many. The programming & scheduling model is essential!

  7. Outline: Power and Thermal Walls in HPC • Power and Thermal Management • Energy-efficient Hardware • Conclusion

  8. HW Support for DPM, DTM
  • Active states: DVFS (P-states), from P0 (highest frequency) down to Pn (lowest frequency) – the control range. A P-state is both a voltage and a frequency level (here 1.2 – 2.4 GHz). [Figure: power (W) vs. core voltage (V) across P0 … Pn.]
  • Idle states: low-power C-states.
  • Intel provides a HW power controller called Running Average Power Limit (RAPL); a minimal example of reading it on Linux is sketched below.
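As a concrete illustration of the RAPL interface mentioned above, here is a minimal sketch that samples per-package energy through the standard Linux powercap sysfs tree (/sys/class/powercap/intel-rapl). It is an assumption-laden example, not part of the presented tooling.

```python
import time
from pathlib import Path

RAPL_ROOT = Path("/sys/class/powercap/intel-rapl")

def read_energy_uj(domain: Path) -> int:
    """Read the cumulative energy counter (microjoules) of one RAPL domain."""
    return int((domain / "energy_uj").read_text())

def sample_package_power(interval_s: float = 1.0) -> dict:
    """Estimate the average power of each RAPL package domain over one interval."""
    domains = sorted(RAPL_ROOT.glob("intel-rapl:*"))   # package-level domains
    before = {d: read_energy_uj(d) for d in domains}
    time.sleep(interval_s)
    after = {d: read_energy_uj(d) for d in domains}
    power = {}
    for d in domains:
        name = (d / "name").read_text().strip()        # e.g. "package-0"
        delta_j = (after[d] - before[d]) / 1e6          # counter wrap-around ignored here
        power[name] = delta_j / interval_s              # average Watts
    return power

if __name__ == "__main__":
    # Requires read permission on the powercap sysfs files (often root).
    # The power cap itself is exposed as constraint_0_power_limit_uw in the same directory.
    print(sample_package_power())
```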

  9. Power Management – Reactive
  • A significant exploration work on RAPL control: Zhang, H., & Hoffmann, H. (2015), "A Quantitative Evaluation of the RAPL Power Control System", Feedback Computing.
  • It quantifies the behavior of the control system in terms of:
  • Stability: freedom from oscillation
  • Accuracy: convergence to the limit
  • Settling time: duration until the limit is reached
  • Maximum overshoot: the maximum difference between the power limit and the measured power
  (These metrics can be computed from a measured power trace, as sketched below.)
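As an illustration of how these metrics could be evaluated (my own sketch, not code from the cited paper), the following computes settling time, maximum overshoot and steady-state error from a sampled power trace against a given cap:

```python
import numpy as np

def control_metrics(power_w, cap_w, dt_s, tolerance=0.05):
    """Evaluate a power-capping controller from a measured power trace.

    power_w   : sequence of power samples (W), taken every dt_s seconds
    cap_w     : the power limit handed to the controller (W)
    tolerance : band around the cap considered 'settled' (fraction of cap)
    """
    power_w = np.asarray(power_w, dtype=float)
    band = tolerance * cap_w

    # Maximum overshoot: largest excursion of the measured power above the limit.
    max_overshoot = max(0.0, float(np.max(power_w - cap_w)))

    # Settling time: first sample index after which power stays inside the band.
    settled = np.abs(power_w - cap_w) <= band
    settle_idx = next((i for i in range(len(power_w)) if settled[i:].all()), None)

    if settle_idx is None:                       # never converges to the limit
        return {"max_overshoot_w": max_overshoot,
                "settling_time_s": None,
                "steady_state_error_w": float("nan")}
    return {"max_overshoot_w": max_overshoot,
            "settling_time_s": settle_idx * dt_s,
            "steady_state_error_w": float(np.mean(power_w[settle_idx:] - cap_w))}

# Example: a trace that overshoots a 100 W cap and then converges.
trace = [140, 125, 112, 104, 101, 100, 99, 100, 101, 100]
print(control_metrics(trace, cap_w=100, dt_s=0.1))
```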

  10. Power Management – HW Predictive
  • On-line optimization policies: A. Bartolini et al., "Thermal and Energy Management of High-Performance Multicores: Distributed and Self-Calibrating Model-Predictive Controller", TPDS'13.
  • Implements proactive and reactive policies using DVFS selections and thread migrations: temperature prediction with an AutoRegressive Moving Average (ARMA) thermal model based on an RC approach, plus a scheduler based on convex optimization for DVFS selections and thread migrations.
  • Online techniques are capable of sensing changes in the workload distribution and setting the processor controls accordingly. (A simplified prediction sketch follows below.)
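To give a flavor of this kind of predictor, here is a deliberately simplified sketch: a least-squares autoregressive (AR) temperature model, i.e. only the AR part of the ARMA approach used in the cited work, fitted on past samples and used for one-step-ahead prediction. The data are synthetic.

```python
import numpy as np

def fit_ar(temps, order=3):
    """Fit T[t] = a1*T[t-1] + ... + a_p*T[t-p] + c by least squares."""
    temps = np.asarray(temps, dtype=float)
    rows = [temps[t - order:t][::-1] for t in range(order, len(temps))]
    X = np.column_stack([np.array(rows), np.ones(len(rows))])   # lagged samples + intercept
    y = temps[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs                                                # [a1..ap, c]

def predict_next(temps, coeffs):
    """One-step-ahead temperature prediction from the most recent samples."""
    order = len(coeffs) - 1
    lags = np.asarray(temps[-order:], dtype=float)[::-1]
    return float(np.dot(coeffs[:-1], lags) + coeffs[-1])

# Example: a synthetic core-temperature trace (°C) sampled at a fixed rate.
history = [62, 63, 65, 68, 72, 75, 77, 78, 79, 80, 80, 81]
coeffs = fit_ar(history, order=3)
print(f"predicted next sample: {predict_next(history, coeffs):.1f} °C")
# A DTM policy would compare this prediction against a thermal threshold and
# lower the P-state or migrate threads before the limit is actually crossed.
```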

  11. Power Management – SW Predictive
  • Predictive models to estimate the power consumption:
  • Borghesi, A., Conficoni, C., Lombardi, M., & Bartolini, A., "MS3: a Mediterranean-Stile Job Scheduler for Supercomputers – do less when it's too hot!", HPCS 2015.
  • Sîrbu, A., & Babaoglu, O., "Predicting system-level power for a hybrid supercomputer", HPCS 2016.
  • Jobs are scheduled and allocated so that the predicted system power stays under the power cap – only scheduling and allocation, no interactions with the compute nodes! [Figure: system power (W) of Jobs 1–4 over time, kept below the power cap.] (A toy scheduler of this kind is sketched below.)
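The following is a toy illustration of this idea, not the MS3 algorithm itself: jobs with predicted power draws are started greedily only while the predicted total system power stays under the cap; otherwise they wait until a running job finishes. Job names and numbers are made up.

```python
import heapq

def power_capped_schedule(jobs, power_cap_w):
    """Greedy power-aware scheduling sketch.

    jobs: list of (name, predicted_power_w, duration_s), in arrival order.
    Assumes every job fits under the cap on its own.
    Returns a list of (name, start_time_s).
    """
    schedule = []
    running = []                       # min-heap of (finish_time, power)
    now, used = 0.0, 0.0
    pending = list(jobs)

    while pending:
        name, power, duration = pending[0]
        if used + power <= power_cap_w:
            pending.pop(0)
            schedule.append((name, now))
            heapq.heappush(running, (now + duration, power))
            used += power
        else:
            # Not enough power headroom: advance time to the next job completion.
            finish, freed = heapq.heappop(running)
            now, used = finish, used - freed
    return schedule

jobs = [("job1", 150e3, 600), ("job2", 120e3, 300),
        ("job3", 200e3, 900), ("job4", 80e3, 300)]
print(power_capped_schedule(jobs, power_cap_w=360e3))
# job1 and job2 start at t=0; job3 must wait until job2 finishes (t=300 s);
# job4 must wait until job1 finishes (t=600 s).
```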

  12. Challenges
  • HW mechanisms: low overhead, fine granularity (milliseconds), no application awareness.
  • SW policies: high overhead, coarse granularity (seconds), application aware.
  This leads to three challenges: 1) low-overhead, accurate monitoring; 2) scalable data collection, analytics, decisions; 3) application awareness.

  13. Low-Overhead, Accurate Monitoring
  • High-resolution monitoring → more information available. But how to analyze it in real time at higher sampling rates?
  • [Figure: coarse-grain view (IPMI, max Ts = 1 s, 1 node, 20 min) vs. DIG @ 1 ms (45 nodes, 4 s).]

  14. Low-Overhead, Accurate Monitoring
  • Real-time frequency analysis on the power supply, and more… [Figure: power spectra of two applications.]
  • How to do it in real time for a football-field-sized cluster of computing nodes?

  15. Solution – Dwarf In a Giant (DIG). A huge amount of data; the goal is a monitoring engine capable of fine-grained monitoring and spectral analysis, distributed on a large-scale cluster.

  16. DIG in Real Life. Developing hardware extensions for fine-grained power monitoring: DIG deployed in production machines.
  • "Galileo": Intel Xeon E5 based, commercial system.
  • D.A.V.I.D.E.: IBM Power8 based, commercial system with E4 – PCP III, 18th in the Green500.
  • ARM64: Cavium ARM64 based, used for prototyping with E4 – PCP II.

  17. DIG Architecture – High-Resolution Out-of-band Power Monitoring
  • Overall node power consumption
  • Can support edge computing/learning
  • Platform independent (Intel, IBM, ARM)
  • Sub-Watt precision
  • Sampling rate @ 50 kS/s (T = 20 µs)
  State-of-the-art systems (Bull-HDEEM and PowerInsight) offer at best a 1 ms sampling period and use the data only offline. Hackenberg et al., "HDEEM: High Definition Energy Efficiency Monitoring"; Laros et al., "PowerInsight – a commodity power measurement capability."

  18. Real-Time Capabilities. Problems: the ARM core is not real-time (it loses ADC samples) and is kept busy flushing the ADC. Goal: offload the processing to the PRUSS (Programmable Real-time Unit SubSystem).

  19. DIG in Production: E4's D.A.V.I.D.E.
  Possible tasks of the PRUs: averaging @ 1 ms / 1 s → offline; computing an FFT → edge analysis (sketched below). Timestamps are µs-resolved.

  Framework                 Fs max [kHz]   CPU overhead
  DIG                       50             ~40%
  DIG+PRU, edge analysis    400            <5%
  DIG+PRU, offline          800            <5%
  Bull-HDEEM                1              ?
  PowerInsight              1              ?
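For illustration only (the real edge analysis runs on the PRUs, not in Python), this is what the spectral part of the edge analysis amounts to: computing the power spectrum of a block of power samples and reporting the dominant AC component. Sampling rate and signal content are made up for the example.

```python
import numpy as np

def power_spectrum(samples, fs_hz):
    """Return (frequencies, magnitude spectrum) of a block of power samples."""
    samples = np.asarray(samples, dtype=float)
    window = np.hanning(len(samples))                 # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(samples * window))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs_hz)
    return freqs, spectrum

# Example: 20 ms of power samples at 50 kS/s with a 50 Hz and a 1 kHz component.
fs = 50_000
t = np.arange(0, 0.02, 1.0 / fs)
trace = 300 + 20 * np.sin(2 * np.pi * 50 * t) + 5 * np.sin(2 * np.pi * 1000 * t)

freqs, mag = power_spectrum(trace, fs)
peak = freqs[1:][np.argmax(mag[1:])]                  # strongest non-DC component
print(f"dominant AC component: {peak:.0f} Hz")        # ~50 Hz for this example
```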

  20. Scalable Data Collection, Analytics
  • Back-end (target facility): MQTT-enabled sensor collectors (Sens_pub) publish to MQTT brokers (Broker 1 … Broker M).
  • MQTT2Kairos bridges feed the front-end NoSQL storage: KairosDB on Cassandra (node 1 … node M).
  • Applications: Apache Spark, Grafana, Python, Matlab – data visualization and big-data analytics.

  21. MQTT to NoSQL Storage: MQTT2Kairosdb. MQTT publishers (Sens_pub_A, Sens_pub_B, Sens_pub_C) publish {value; timestamp} payloads to topics such as facility/sensors/A, /B, /C; MQTT2Kairosdb subscribes to facility/sensors/# on the broker and maps each topic to a KairosDB metric (A, B, C), with the topic levels (facility, sensors, …) stored as tags in the Cassandra column family. A minimal bridge of this kind is sketched below.
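The sketch below is not the actual MQTT2Kairosdb implementation: it assumes a paho-mqtt client, a "value;timestamp" payload format, hypothetical host names, and KairosDB's standard REST ingestion endpoint (/api/v1/datapoints).

```python
import requests                      # used for the KairosDB REST API
import paho.mqtt.client as mqtt      # paho-mqtt 1.x style API assumed

KAIROSDB_URL = "http://kairosdb-host:8080/api/v1/datapoints"   # hypothetical host

def on_message(client, userdata, msg):
    """Map one MQTT message onto one KairosDB datapoint."""
    value, timestamp_ms = msg.payload.decode().split(";")       # payload = "value;timestamp"
    levels = msg.topic.split("/")                                # e.g. facility/sensors/A
    datapoint = {
        "name": levels[-1],                                      # metric name = last topic level
        "datapoints": [[int(timestamp_ms), float(value)]],
        "tags": {"facility": levels[0], "group": levels[1]},     # topic levels become tags
    }
    requests.post(KAIROSDB_URL, json=[datapoint], timeout=5)

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt-broker-host")                               # hypothetical broker
client.subscribe("facility/sensors/#")
client.loop_forever()
```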

  22. Examon Analytics: Batch & Streaming. Batch analytics use Pandas dataframes via the examon-client (REST); streaming analytics use Bahir-mqtt (the Spark MQTT connector). A batch-style query is sketched below.
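As a hedged example of the batch path, the snippet below bypasses the examon-client and queries KairosDB's standard REST endpoint directly, loading the result into a Pandas dataframe. Host and metric names are hypothetical.

```python
import requests
import pandas as pd

KAIROSDB_QUERY_URL = "http://kairosdb-host:8080/api/v1/datapoints/query"  # hypothetical

query = {
    "start_relative": {"value": 1, "unit": "hours"},                       # last hour of data
    "metrics": [{"name": "node_power", "tags": {"node": ["node042"]}}],    # hypothetical metric
}

resp = requests.post(KAIROSDB_QUERY_URL, json=query, timeout=30)
resp.raise_for_status()

# KairosDB returns datapoints as [timestamp_ms, value] pairs.
values = resp.json()["queries"][0]["results"][0]["values"]
df = pd.DataFrame(values, columns=["timestamp_ms", "power_w"])
df["timestamp"] = pd.to_datetime(df["timestamp_ms"], unit="ms")

print(df["power_w"].describe())        # quick batch statistics on the power trace
```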

  23. Streaming Analytics: Virtual Sensors! A real-time stream processor subscribes to facility/sensors/# on the MQTT broker, synchronizes the incoming streams from the publishers (Sens_pub_A, Sens_pub_B, Sens_pub_C) in a buffer, computes derived quantities, and publishes the results back over MQTT as virtual sensors. A minimal sketch follows below.
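Below is a minimal, assumption-laden sketch of such a virtual sensor: it buffers the latest value of each physical sensor under facility/sensors/# and republishes their sum (e.g. total power) on a new MQTT topic. Topic names, payload format and the aggregation function are illustrative only.

```python
import time
import paho.mqtt.client as mqtt      # paho-mqtt 1.x style API, as in the bridge sketch

BROKER = "mqtt-broker-host"          # hypothetical broker
IN_TOPIC = "facility/sensors/#"
OUT_TOPIC = "facility/virtual/total_power"

latest = {}                           # topic -> most recent value (the sync buffer)

def on_message(client, userdata, msg):
    value, _timestamp = msg.payload.decode().split(";")
    latest[msg.topic] = float(value)
    # Derived quantity: sum of all buffered physical sensors.
    total = sum(latest.values())
    client.publish(OUT_TOPIC, f"{total};{int(time.time() * 1000)}")

client = mqtt.Client()
client.on_message = on_message
client.connect(BROKER)
client.subscribe(IN_TOPIC)
client.loop_forever()
```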

  24. Examon in Production: CINECA's GALILEO
  • Galileo (528 nodes, 258 monitored nodes): per-node publishers (Pmu_pub on the compute nodes, Ipmi_pub on the management node, Sensortag on a BBB node, plus the facility node) send metrics through an MQTT proxy/broker to an OpenStack deployment (CloudUnibo@Pico-CINECA) running KairosDB on five Cassandra nodes (Cass00–Cass04, 256 GB volumes each), with Spark, Grafana, Jupyter and Tensorflow as front-ends.

  Data ingestion rate   ~67K metrics/s
  DB bandwidth          ~98 Mbit/s
  DB size               ~1000 GB/week
  DB write latency      20 µs
  DB read latency       4800 µs

  • A Tier-1 system produces 0.5–1 TB every week; a Tier-0 system is estimated at 10 TB per 3.5 days: stream analytics & distributed processing are a necessity.

  25. Application-Aware En2Sol (Energy-to-Solution) Minimization
  HARDWARE – Galileo: a Tier-1 HPC system based on an IBM NeXtScale cluster.
  • Cluster: 516 nodes (14 racks)
  • Node: dual-socket Intel Haswell E5-2630 v3 CPUs, 8 cores at 2.4 GHz (85 W TDP), 128 GB DDR3 RAM
  • Power consumption: 360 kW
  • OS: SMP CentOS Linux 7.0
  • Top500: ranked 281st
  SOFTWARE – Quantum ESPRESSO: an integrated suite of HPC codes for electronic-structure calculations and materials modelling at the nanoscale; the Car-Parrinello kernels run on the compute nodes. (A toy energy-to-solution calculation is sketched below.)
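To make "energy-to-solution minimization" concrete, here is a small illustrative calculation; the numbers are made up, not measurements from Galileo or Quantum ESPRESSO. For each candidate frequency, energy-to-solution is average node power times time-to-solution, and the best operating point is not necessarily the fastest one.

```python
# Hypothetical per-frequency measurements for one application run on one node:
# (frequency_GHz, avg_node_power_W, time_to_solution_s)
operating_points = [
    (2.4, 340.0, 1000.0),
    (2.1, 300.0, 1080.0),
    (1.8, 260.0, 1200.0),
    (1.5, 230.0, 1400.0),
]

def energy_to_solution(power_w, time_s):
    """Energy-to-solution in joules: average power integrated over the run."""
    return power_w * time_s

best = min(operating_points, key=lambda p: energy_to_solution(p[1], p[2]))
for f, p, t in operating_points:
    print(f"{f:.1f} GHz: {energy_to_solution(p, t) / 1e3:8.1f} kJ  ({t:.0f} s)")
print(f"minimum energy-to-solution at {best[0]:.1f} GHz")
# For these made-up numbers the minimum is at 1.8 GHz, below the top frequency.
```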
