Energy-performance tradeoffs for HPC applications on low power processors

E. Calore, S. F. Schifano, R. Tripiccione (INFN Ferrara and Università degli Studi di Ferrara, Italy). UnConventional High Performance Computing 2015, Euro-Par Workshop.


  1. Energy-performance tradeoffs for HPC applications on low power processors. E. Calore, S. F. Schifano, R. Tripiccione (INFN Ferrara and Università degli Studi di Ferrara, Italy). UnConventional High Performance Computing 2015, Euro-Par Workshop, Vienna, August 25, 2015. Slide footer: E. Calore (INFN and Univ. Ferrara), Energy vs Performance on Jetson TK1, Vienna, August 25, 2015, 1 / 42.

  2. Outline: (1) Introduction; (2) Measuring the energy consumption (how to measure; managing the acquisition); (3) Lattice Boltzmann Model (D2Q37) (code implementations; managing the frequency scaling; C with NEON intrinsics, on the Cortex A15; CUDA on the GK20A); (4) Conclusions.

  4. Exploiting the NVIDIA Tegra K1 for HPC applications. Why? Embedded hardware may cost less and may consume less energy. However, the execution time on a single processor will be higher than on HPC hardware, so new metrics need to be identified. Cost per GFLOP (in dollars or euros) requires knowing the hardware cost. Energy to solution (in joules) requires measuring the energy consumption.

  9. Choosing what to measure. There are several energy metrics: instantaneous power consumption (watt), average power consumption during execution (watt), and energy to solution (joule). By sampling the instantaneous current absorption $i[n]$, all the other metrics can be derived, knowing the execution time $T$, the sampling frequency $F_{samp}$ and the supply voltage $V$:

     $$p[n] = v[n] \times i[n], \qquad N = T \times F_{samp}, \qquad v[n] = V \;\; \forall n$$

     $$P_{avg} = \frac{1}{N} \sum_{n=0}^{N-1} p[n], \qquad E_{tosol} = \frac{1}{F_{samp}} \sum_{n=0}^{N-1} p[n]$$
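
The formulas on this slide can be sketched in C as follows. This is a minimal illustration, not the authors' code: the function name `compute_energy_metrics`, the struct `energy_metrics` and the choice of amperes as the input unit are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Derived energy metrics for one acquisition (hypothetical type). */
typedef struct {
    double p_avg_watt;    /* P_avg: average power over the run */
    double e_tosol_joule; /* E_tosol: energy to solution */
} energy_metrics;

/* i_amp:     N current samples i[n], in ampere (assumed unit)
 * v_volt:    constant supply voltage V, so v[n] = V for all n
 * f_samp_hz: sampling frequency F_samp */
energy_metrics compute_energy_metrics(const double *i_amp, size_t n,
                                      double v_volt, double f_samp_hz)
{
    assert(n > 0 && f_samp_hz > 0.0);

    double sum_p = 0.0;
    for (size_t k = 0; k < n; k++)
        sum_p += v_volt * i_amp[k]; /* p[n] = v[n] * i[n] */

    energy_metrics m;
    m.p_avg_watt = sum_p / (double)n;    /* (1/N) * sum of p[n] */
    m.e_tosol_joule = sum_p / f_samp_hz; /* (1/F_samp) * sum of p[n] */
    return m;
}
```

Since $N = T \times F_{samp}$, the two results are related by $E_{tosol} = P_{avg} \times T$, which gives a quick sanity check on any acquired trace.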

  10. Setup to sample instantaneous current absorption: one current-to-voltage converter, plus an Arduino UNO (microcontroller + 10-bit ADC + serial over USB).

  11. Signal chain: current-to-voltage conversion, then digitization with the Arduino, then USB serial.
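
To turn the digitized readings back into a current, the ADC counts have to be scaled by the ADC reference voltage and by the gain of the current-to-voltage converter. The slides give neither value, so in the sketch below `vref_volt` and `volt_per_amp` are parameters the reader must supply, and the function name `adc_sum_to_milliamp` is hypothetical. It assumes each stored value is the sum of `avg_samples` raw 10-bit readings.

```c
#include <assert.h>

/* counts_sum:   sum of avg_samples raw 10-bit ADC readings
 * avg_samples:  number of readings accumulated into counts_sum
 * vref_volt:    ADC reference voltage (assumption, not from the slides)
 * volt_per_amp: gain of the current-to-voltage converter (assumption) */
double adc_sum_to_milliamp(unsigned int counts_sum, unsigned int avg_samples,
                           double vref_volt, double volt_per_amp)
{
    assert(avg_samples > 0 && volt_per_amp > 0.0);

    /* Mean raw reading over the accumulated samples. */
    double mean_counts = (double)counts_sum / (double)avg_samples;

    /* A 10-bit ADC divides the reference voltage into 1024 steps. */
    double volt = mean_counts * vref_volt / 1024.0;

    /* Undo the current-to-voltage conversion and express in mA. */
    return 1000.0 * volt / volt_per_amp;
}
```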

  13. Arduino Sketch code. The Arduino board waits for a char on the serial connection over the USB port. 'A': starts to digitize at 1 kHz, storing samples in its memory until it runs out of it. 'S': sends the acquired samples over the serial connection, emptying its buffer.

          // SerialEvent occurs whenever new
          // data comes in the hardware serial RX.
          void serialEvent() {
            while (Serial.available()) {
              // get the new byte:
              char inChar = (char) Serial.read();
              // Send buffer via Serial
              if (inChar == 'S') {
                acquireData = false;
                sendData = true;
              // Start data acquisition
              } else if (inChar == 'A') {
                acquireData = true;
                sendData = false;
                idx = 0;
              }
            }
          }

          // This is called every ms
          ISR(TIMER0_COMPA_vect) {
            if (acquireData) {
              byte i;
              unsigned int sensorValue = 0;
              // Average over avgSamples readings
              // one read costs about 0.11ms
              for (i = 0; i < avgSamples; i++) {
                // read the input on analog pin0
                sensorValue += analogRead(A0);
              }
              isendBuffer[idx] = sensorValue;
              idx++;
            }
          }

  14. From the host code. To measure the energy consumption of a specific function, we can start the acquisition just before starting the computation:

          int fd;
          struct termios newtio, oldtio;
          char filename[256]; // filename where to store acquired data

          // Initialize Arduino serial connection
          fd = init_serial(&oldtio, &newtio);

          // Start Arduino data acquisition
          start_arduino_acq(fd);

          usleep(10000); // Wait a bit to have some baseline points in the plot

          run_my_function();

          // Start Arduino data read-out
          start_arduino_readout(fd, filename, 900);

          close_serial(fd, &oldtio);

      Storing the acquired data in the Arduino memory keeps interference with code execution on the Jetson board to a minimum.
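
The slides do not show the bodies of the host-side helpers. Given the single-character protocol of the Arduino sketch ('A' to start acquiring, 'S' to send the buffer), the command side could look roughly like the sketch below. The names `send_command`, `sketch_start_acq` and `sketch_request_readout` are hypothetical, and `fd` is assumed to be an already opened and configured serial descriptor.

```c
#include <assert.h>
#include <unistd.h>

/* Write one command character to the board.
 * Returns 0 on success, -1 if the byte could not be written. */
static int send_command(int fd, char cmd)
{
    assert(fd >= 0);
    return write(fd, &cmd, 1) == 1 ? 0 : -1;
}

/* 'A' asks the board to start sampling into its own memory. */
int sketch_start_acq(int fd)
{
    return send_command(fd, 'A');
}

/* 'S' asks the board to stop sampling and send back its buffer. */
int sketch_request_readout(int fd)
{
    return send_command(fd, 'S');
}
```

After `sketch_request_readout`, the host would still have to read the samples back from `fd` and store them; that part depends on the wire format, which the slides do not specify.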

  16. Acquired data example with default frequency scaling. [Figure: "Propagate on Jetson - 128x4096", 20 Propagate iterations; x-axis: Time [ms], 0 to 900; y-axis: Current [mA], 200 to 900. Arrow annotations on the plot: "Iterations can be counted" and "This is a D2H transfer".]
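
The remark that iterations can be counted from the trace suggests a simple automatic check. One possible approach, which is not taken from the slides, is to count rising edges through a threshold with hysteresis. The function name `count_bursts` and both threshold parameters are hypothetical and would have to be tuned on real data.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many times the trace rises above high_ma after having
 * been below low_ma (hysteresis avoids counting noise as extra bursts).
 * current_ma: n current samples in mA. */
size_t count_bursts(const double *current_ma, size_t n,
                    double high_ma, double low_ma)
{
    assert(high_ma >= low_ma);

    size_t bursts = 0;
    int above = 0; /* 1 while we are inside a burst */

    for (size_t k = 0; k < n; k++) {
        if (!above && current_ma[k] > high_ma) {
            above = 1;
            bursts++; /* rising edge: a new burst starts */
        } else if (above && current_ma[k] < low_ma) {
            above = 0; /* burst ended, wait for the next rising edge */
        }
    }
    return bursts;
}
```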

  18. The D2Q37 Lattice Boltzmann Model. The Lattice Boltzmann method (LBM) is a class of computational fluid dynamics (CFD) methods. LBM methods simulate a discrete Boltzmann equation which, under certain conditions, reduces to the Navier-Stokes equation. Virtual particles called populations, arranged at the edges of a discrete and regular grid, are used to simulate a synthetic and simplified dynamics. The interaction is implemented by two main functions applied to the virtual particles: propagation and collision. D2Q37 is a two-dimensional (D2) model with 37 velocity components (populations), suitable for studying the behaviour of compressible gases and fluids, optionally in the presence of combustion effects (chemical reactions turning a cold mixture of reactants into a hot mixture of burnt product). It gives a correct treatment of the Navier-Stokes, heat transport and perfect-gas ($P = \rho T$) equations.
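
As an illustration of the propagation step only, the sketch below moves every population by its own lattice velocity. It is a generic example rather than the authors' D2Q37 kernel: the velocity set is passed in through the arrays `cx` and `cy` because the 37 D2Q37 velocities are not listed on the slide, and the periodic boundaries and the memory layout (population, then x, then y) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Move each population p from site (x, y) to (x + cx[p], y + cy[p]).
 * src, dst: arrays of npop * nx * ny values, index (p * nx + x) * ny + y.
 * Boundaries are periodic (assumption made for this sketch). */
void propagate(double *dst, const double *src, int nx, int ny, int npop,
               const int *cx, const int *cy)
{
    assert(nx > 0 && ny > 0 && npop > 0);

    for (int p = 0; p < npop; p++) {
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                /* Wrap destination coordinates into [0, nx) and [0, ny). */
                int xd = ((x + cx[p]) % nx + nx) % nx;
                int yd = ((y + cy[p]) % ny + ny) % ny;
                dst[((size_t)p * nx + xd) * ny + yd] =
                    src[((size_t)p * nx + x) * ny + y];
            }
        }
    }
}
```

Propagation only moves data and performs no floating-point arithmetic; the collision step, which is not sketched here, is where the populations at each site are combined.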
