Energy-performance tradeoffs for HPC applications on low power - - PowerPoint PPT Presentation


Energy-performance tradeoffs for HPC applications on low power processors. E. Calore, S. F. Schifano, R. Tripiccione (INFN Ferrara and Università degli Studi di Ferrara, Italy). UnConventional High Performance Computing 2015, Euro-Par.


SLIDE 1

Energy-performance tradeoffs for HPC applications on low power processors

E. Calore¹, S. F. Schifano¹, R. Tripiccione¹

¹ INFN Ferrara and Università degli Studi di Ferrara, Italy

UnConventional High Performance Computing 2015, Euro-Par Workshop

Vienna, August 25, 2015

  • E. Calore (INFN and Univ. Ferrara)

Energy vs Performance on Jetson TK1 Vienna, August 25, 2015 1 / 42

SLIDE 2

Outline

1. Introduction
2. Measuring the energy consumption
  • How to measure
  • Managing the acquisition
3. Lattice Boltzmann Model (D2Q37)
  • Code Implementations
  • Managing the Frequency Scaling
  • C with NEON intrinsics, on the Cortex A15
  • CUDA on the GK20A
4. Conclusions

SLIDE 4

Exploiting the NVIDIA Tegra K1 for HPC applications

Why?

  • Using embedded hardware may cost less.
  • Using embedded hardware may consume less energy.
  • Execution time on a single processor will be higher than on HPC hardware.

We need to identify new metrics:

  • Cost per GFLOP (in dollars/euros): we need to know the hardware cost.
  • Energy to solution (in joules): we need to measure the energy consumption.

SLIDE 9

Choosing what to measure

There are actually several energy metrics:

  • Instantaneous power consumption (W)
  • Average power consumption during execution (W)
  • Energy to solution (J)

By sampling the instantaneous current absorption, all the other metrics can be derived, knowing the execution time T, the supply voltage V and the sampling frequency F_samp:

$$p[n] = v[n] \times i[n], \qquad v[n] = V \quad \forall n$$

$$P_{avg} = \frac{1}{N} \sum_{n=0}^{N-1} p[n], \qquad N = T \times F_{samp}$$

$$E_{tosol} = \frac{1}{F_{samp}} \sum_{n=0}^{N-1} p[n]$$

SLIDE 10

Setup to sample instantaneous current absorption

One current-to-voltage converter, plus an Arduino UNO (microcontroller + 10-bit ADC + serial over USB).

SLIDE 11

Current to Voltage + Digitization with Arduino + USB Serial

SLIDE 13

Arduino Sketch code

The Arduino board waits for a character on the serial connection over the USB port:

  • 'A': starts digitizing at 1 kHz, storing samples in its memory until it runs out of space;
  • 'S': sends the acquired samples over the serial connection, emptying its buffer.

    // serialEvent occurs whenever new data comes in the hardware serial RX.
    void serialEvent() {
      while (Serial.available()) {
        // get the new byte:
        char inChar = (char)Serial.read();
        if (inChar == 'S') {
          // Send buffer via Serial
          acquireData = false;
          sendData = true;
        } else if (inChar == 'A') {
          // Start data acquisition
          acquireData = true;
          sendData = false;
          idx = 0;
        }
      }
    }

    // This is called every ms
    ISR(TIMER0_COMPA_vect) {
      // BUFFER_LEN: assumed capacity of isendBuffer (declaration not shown
      // on the slide); stop storing samples once memory is full.
      if (acquireData && idx < BUFFER_LEN) {
        byte i;
        unsigned int sensorValue = 0;
        // Accumulate avgSamples readings; one read costs about 0.11 ms
        for (i = 0; i < avgSamples; i++) {
          // read the input on analog pin 0
          sensorValue += analogRead(A0);
        }
        isendBuffer[idx] = sensorValue;
        idx++;
      }
    }

SLIDE 14

From the host code

To measure the energy consumption of a specific function, we can start the acquisition just before starting the computations:

    #include <termios.h>  // struct termios
    #include <unistd.h>   // usleep

    // init_serial, start_arduino_acq, start_arduino_readout and close_serial
    // are helper functions not shown on the slide.
    int fd;
    struct termios newtio, oldtio;
    char filename[256]; // filename where to store acquired data (set before the read-out)

    // Initialize Arduino serial connection
    fd = init_serial(&oldtio, &newtio);
    // Start Arduino data acquisition
    start_arduino_acq(fd);
    usleep(10000); // Wait 10 ms to have some baseline points in the plot
    run_my_function();
    // Start Arduino data read-out
    start_arduino_readout(fd, filename, 900);
    close_serial(fd, &oldtio);

Storing the acquired data in the Arduino memory keeps interference with the code running on the Jetson board to a minimum.

SLIDE 16

Acquired data example with default frequency scaling

[Figure: current (mA) vs. time (ms) while running 20 Propagate iterations on Jetson, 128x4096 lattice.]

The individual iterations can be counted in the trace; one marked feature is a device-to-host (D2H) transfer.

SLIDE 18

The D2Q37 Lattice Boltzmann Model

  • The Lattice Boltzmann method (LBM) is a class of computational fluid dynamics (CFD) methods.
  • LBM methods simulate a discrete Boltzmann equation, which under certain conditions reduces to the Navier-Stokes equations.
  • Virtual particles called populations, arranged at the edges of a discrete and regular grid, are used to simulate a synthetic and simplified dynamics.
  • The interaction is implemented by two main functions applied to the virtual particles: propagation and collision.
  • D2Q37 is a 2D model with 37 velocity components (populations):
      • suitable to study the behaviour of compressible gases and fluids, optionally in the presence of combustion¹ effects;
      • correct treatment of the Navier-Stokes, heat transport and perfect-gas (P = ρT) equations.

¹ Chemical reactions turning a cold mixture of reactants into a hot mixture of burnt products.

SLIDE 19

Simulation of the Rayleigh-Taylor (RT) Instability

Instability at the interface of two fluids of different densities, triggered by gravity. A cold, dense fluid over a less dense, warmer fluid triggers an instability that mixes the two fluid regions (until equilibrium is reached).

SLIDE 20

Computational Scheme of LBM

    foreach time-step
        foreach lattice-point
            propagate();
        endfor
        foreach lattice-point
            collide();
        endfor
    endfor

Embarrassing parallelism

All sites can be processed in parallel, applying propagate and collide in sequence.

Challenge

Design an efficient implementation able to exploit a large fraction of the available peak performance.

SLIDE 21

D2Q37: propagation scheme

  • performs accesses to neighbour cells at distance 1, 2 and 3;
  • generates memory accesses with sparse addressing patterns.

SLIDE 22

D2Q37 collision

  • collision is computed at each lattice cell after the computation of the boundary conditions;
  • computationally intensive: for the D2Q37 model it requires ≈ 7500 DP floating-point operations per site;
  • completely local: arithmetic operations require only the populations associated with the site;
  • the propagate and collide kernels are kept separate: after propagate but before collide we may need to perform collective operations (e.g. divergence of the velocity field) if we include combustion effects in the computation.

SLIDE 24

Initial Code implementations

SLIDE 25

Code implementations

SLIDE 27

Manually setting the Jetson clocks

Manually setting CPU frequency and online cores (LP vs G)

    # Select the userspace governor, then set a fixed CPU frequency (in kHz)
    echo "userspace" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo <frequency> > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
    # Disable cpuquiet (automatic core management)
    echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable
    # Take cores 1 to 3 offline
    echo 0 > /sys/devices/system/cpu/cpu1/online
    echo 0 > /sys/devices/system/cpu/cpu2/online
    echo 0 > /sys/devices/system/cpu/cpu3/online
    # Switch to the low-power (LP) cluster
    echo LP > /sys/kernel/cluster/active

Manually setting GPU (and MEM) frequency

    # List the available GPU clock rates (in kHz)
    cat /sys/kernel/debug/clock/gbus/possible_rates
    72000 108000 180000 252000 324000 396000 468000 540000 612000 648000 684000 708000 756000 804000 852000 (kHz)
    # Override the GPU clock (rate given in Hz) and enable the override
    echo 852000000 > /sys/kernel/debug/clock/override.gbus/rate
    echo 1 > /sys/kernel/debug/clock/override.gbus/state

SLIDE 28

Optimizing for the new metrics

Using the regular time-to-solution metric (i.e. faster is better):

  • the highest CPU frequency is the best;
  • the highest GPU frequency is the best;
  • the highest MEM frequency is the best;
  • the best software parameters have to be identified (e.g. CUDA block size).

Using the energy-to-solution metric:

  • Which CPU frequency is the best?
  • Which GPU frequency is the best?
  • Which MEM frequency is the best?
  • The best software parameters still have to be identified (e.g. CUDA block size).

SLIDE 31

Propagate changing the G cluster clock

[Figure: current (mA) vs. time (ms) for Propagate on Jetson, 128x1024sp lattice, changing the CPU clock from 204 MHz to 2320.5 MHz with the memory clock at 924 MHz; one trace per data file 4-<CPU clock in kHz>-0-924000.dat.]

SLIDE 32

Propagate changing the MEM clock

[Figure: current (mA) vs. time (ms) for Propagate on Jetson, 128x1024sp lattice, changing the memory clock from 924 MHz down to 68 MHz with the CPU clock at 2320.5 MHz; one trace per data file 4-2320500-0-<memory clock in kHz>.dat.]

SLIDE 33

Time and Energy to solution (Propagate)

[Figure: time to solution (ms, log scale) for Propagate as a function of CPU clock (204 to 2320.5 MHz) and memory clock (12.7 to 924 MHz), together with the corresponding energy to solution (mJ).]

SLIDE 34

Collide changing the G cluster clock

[Figure: current (mA) vs. time (ms) for Collide on Jetson, 128x1024sp lattice, changing the CPU clock from 204 MHz to 2320.5 MHz with the memory clock at 924 MHz; one trace per data file 4-<CPU clock in kHz>-0-924000.dat.]

SLIDE 35

Collide changing the MEM clock

[Figure: current (mA) vs. time (ms) for Collide on Jetson, 128x1024sp lattice, changing the memory clock from 924 MHz down to 68 MHz with the CPU clock at 2320.5 MHz; one trace per data file 4-2320500-0-<memory clock in kHz>.dat.]

SLIDE 36

Time and Energy to solution (Collide)

[Figure: time to solution (ms, log scale) for Collide as a function of CPU clock (204 to 2320.5 MHz) and memory clock (12.7 to 924 MHz), together with the corresponding energy to solution (mJ).]

SLIDE 38

Time and Energy to solution (Propagate)

[Figure: time to solution (ms, log scale) for Propagate on the GPU as a function of GPU clock (72 to 852 MHz) and memory clock (12.7 to 924 MHz), together with the corresponding energy to solution (mJ).]

SLIDE 39

Time and Energy to solution (Collide)

[Figure: time to solution (ms, log scale) for Collide on the GPU as a function of GPU clock (72 to 852 MHz) and memory clock (12.7 to 924 MHz), together with the corresponding energy to solution (mJ).]

SLIDE 40

Energy to Sol. vs Time to Sol. CPU(top), GPU(bottom)

[Figure: energy to solution (J) vs. time to solution (ms) for Propagate and Collide, with a fit line; CPU (top panel) and GPU (bottom panel).]

SLIDE 41

Energy to Solution vs Time to Solution (CPU)

[Figure: energy to solution (J) vs. time to solution (ms) for Propagate and Collide on the CPU.]

SLIDE 42

Energy to Solution vs Time to Solution (GPU)

[Figure: energy to solution (J) vs. time to solution (ms) for Propagate and Collide on the GPU; points labelled with the GPU clock (72 to 852 MHz) and a second value (12 or 20).]

SLIDE 43

Energy to Solution vs Time to Solution (GPU) zoom

[Figure: zoom of the GPU energy to solution (mJ) vs. time to solution plot; points labelled (GPU clock, memory clock) in MHz, from (612, 792) to (852, 924).]

SLIDE 45

Conclusions

  • Baseline power consumption (leakage current + ancillary electronics) is a relevant part of the whole energy budget.
  • Limited but not negligible power optimization (≈ 20 to 30%) is possible by adjusting clocks on a kernel-by-kernel basis.
  • The best region is close to the highest system frequencies.
  • Options to run the processor at very low frequencies seem almost useless; if possible, it would be interesting to be able to remove power from the (sub-)system while idle.
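A simplified way to see why very low clock frequencies do not pay off is to split the sampled power into a constant baseline P_base and the remainder, with the notation introduced for the energy metrics (N = T × F_samp). This decomposition is an illustrative assumption, not a model fitted in the slides:

$$E_{tosol} = \frac{1}{F_{samp}} \sum_{n=0}^{N-1} p[n] = P_{base}\,T + \frac{1}{F_{samp}} \sum_{n=0}^{N-1} \left(p[n] - P_{base}\right)$$

Lowering the clock stretches T, so the baseline term grows in proportion to the run time; unless the remaining, frequency-dependent term shrinks faster, the energy to solution increases.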

Future work

  • Perform similar measurements on a high-end node and compare results.
  • Test on newer low-power processors, such as the Tegra X1.
  • Consider not only hardware-based tuning, but also software tuning.

SLIDE 47

Thanks for your attention
