SLIDE 1 Modelling the Energy Consumption of Soft Real-Time Tasks on Heterogeneous Computing Architectures
H.E. Zahaf (1), R. Olejnik (1), G. Lipari (1), A.E. Benyamina (2)
(1) Université de Lille, (2) University of Oran
January 19, 2016
SLIDE 2
Outline
◮ Introduction
◮ Experimental setting
◮ Time vs. energy
◮ Conclusions and Current work
SLIDE 4
Context and motivation
Computing at the edge
SLIDE 5 Fog Computing
◮ Fog Computing characteristics
◮ Computing at the edge means that data are pre-processed
before being stored in the cloud
◮ thus reducing network load
◮ Fog Computing requirements
◮ Multicore, heterogeneous
◮ different kinds of computation are needed
◮ Low power consumption
◮ (Soft real-time)
SLIDE 6 Minimise power consumption
◮ Modern processors have many ways of reducing power
consumption
◮ Dynamic Voltage and Frequency Scaling (DVFS)
◮ dynamically adjust processor frequency to minimise energy . . .
◮ . . . without reducing performance too much
◮ Dynamic Power Management (DPM)
◮ Turn off processors that are not used/needed
◮ Pack all computation in a small number of processors . . .
◮ . . . without reducing performance too much
◮ In any case, performance is the key here
SLIDE 7 Soft real-time tasks
◮ A soft real-time task consists of a sequence of processing to
be executed periodically
◮ e.g.: every 20 msec, encode one video frame
◮ Period = 20 msec
◮ Usually associated with a deadline
◮ every video frame must be encoded within 20 msec
◮ Deadline = 20 msec
◮ Goal:
◮ find the minimum frequency such that the task completes
within its deadline
Figure: timeline of task τ1 (time axis from 0 to 40 msec)
SLIDE 11 The problem
◮ Model:
◮ a set of soft real-time tasks
◮ with different periods, deadlines, execution time profiles
◮ scheduled by an operating system using a real-time scheduling algorithm
◮ on a set of heterogeneous processors
◮ Problem:
◮ allocate tasks to processors
◮ set frequency
◮ set scheduling parameters
◮ Objective
◮ minimise total energy without missing deadlines
SLIDE 12 Energy model
◮ In order to solve the problem, we need to have a model of the
energy consumption
◮ Problems:
◮ Processor model:
◮ Energy saving mechanisms, internal to the chip and transparent to the programmer, try to minimise energy of micro-operations
◮ Complexity of the hardware/software interaction
◮ Influence of pipeline and cache on execution time
◮ Tasks share resources (caches, memory bus, peripherals)
◮ It is impossible to derive an exact model
◮ we resort to measurement
SLIDE 14 ARM Big/Little
Figure: ARM Big/Little. Little cluster: four ARM Cortex A7 cores, each with its own L1 cache, sharing one L2 cache. Big cluster: four ARM Cortex A15 cores, each with its own L1 cache, sharing one L2 cache. Both clusters access the RAM memory; the diagram also shows a Mali GPU.
◮ It is possible to set the frequency of each processor group, but
not of individual cores
◮ Each group (Little or Big) has its own characteristics in terms
of execution time speed-up and energy consumption
SLIDE 15 Energy Sensors
◮ Sensors:
◮ current for Little’s group
◮ current for Big’s group
◮ current for RAM
◮ A global ammeter for the board
◮ used to check consistency of measurements
SLIDE 16 Benchmarks
◮ Three periodic tasks
◮ MATMUL (L)
◮ Multiplying two square matrices of size LxL a given number of times
◮ FFT
◮ Fast Fourier transform of a random input signal
◮ FFMPEG
◮ the decoding of a specific video input
◮ Tasks were executed periodically every T units of time
◮ we measured execution time, and energy consumption of
processors and memory
◮ Linux OS
◮ Frequency governor disabled
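A minimal sketch of how such a periodic benchmark could be timed, assuming a pure-Python MATMUL and a small hypothetical size L = 20 so that it runs quickly. The names `matmul` and `run_periodic` are illustrative and are not taken from the authors' setup; energy readings would come from the board's sensors and are not covered here.

```python
import time

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    bt = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def run_periodic(job, period_s, n_jobs):
    """Release `job` every `period_s` seconds and record each execution time."""
    exec_times = []
    next_release = time.perf_counter()
    for _ in range(n_jobs):
        start = time.perf_counter()
        job()
        exec_times.append(time.perf_counter() - start)
        next_release += period_s
        # Sleep until the next release; do not sleep if the job overran.
        delay = next_release - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
    return exec_times

L = 20  # hypothetical size, far smaller than the 150..300 used on the slides
A = [[float(i + j) for j in range(L)] for i in range(L)]
B = [[float(i - j) for j in range(L)] for i in range(L)]
samples = run_periodic(lambda: matmul(A, B), period_s=0.01, n_jobs=5)
print(f"jobs: {len(samples)}, worst observed: {max(samples) * 1e3:.3f} ms")
```

Using absolute release times (`next_release += period_s`) rather than sleeping a fixed amount after each job keeps the period from drifting with the execution time.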
SLIDE 18 Execution time
◮ The execution time of a MATMUL(200x200) thread allocated
on one little/big core, with no interference
Figure: execution time (s) vs. frequency (MHz); series B-avg and L-avg
SLIDE 19 Model of computation time
◮ Computation time varies with frequency according to the
following rule: $C_i(f) = \frac{f_{\max}}{f}\, ct_i + mt_i$
◮ Two components:
◮ $ct_i$ accounts for the instruction cycles executed on the processor (the computation time at $f_{\max}$)
◮ $mt_i$ accounts for the main memory accesses
◮ The second component does not vary with frequency, but
depends on the number of cache misses
◮ hence on the interference of other tasks on the cache and on
the bus
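The rule above (the processor component scaled by f_max/f, plus a frequency-independent memory component) can be turned into a small calculation of the lowest frequency that still meets a deadline. This is a sketch under assumptions: the ct/mt values are those reported for MATMUL(200) on the Little core, while the 100 MHz frequency steps and the 500 ms deadline are hypothetical.

```python
def computation_time(f_mhz, ct_ms, mt_ms, f_max_mhz):
    """C(f) = ct * f_max / f + mt, with times in milliseconds."""
    return ct_ms * f_max_mhz / f_mhz + mt_ms

def min_frequency(deadline_ms, ct_ms, mt_ms, f_max_mhz, available_mhz):
    """Lowest available frequency with C(f) <= deadline, or None if there is none."""
    if deadline_ms <= mt_ms:
        return None  # the memory component alone already exceeds the deadline
    f_needed = ct_ms * f_max_mhz / (deadline_ms - mt_ms)
    feasible = [f for f in sorted(available_mhz) if f >= f_needed]
    return feasible[0] if feasible else None

LITTLE_FREQS = range(200, 1500, 100)  # assumed DVFS steps, in MHz
f = min_frequency(500, ct_ms=254, mt_ms=17, f_max_mhz=1400, available_mhz=LITTLE_FREQS)
print(f, computation_time(f, 254, 17, 1400))
```

Because mt does not shrink with frequency, a deadline shorter than mt can never be met, whatever the frequency.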
SLIDE 20 Computing task’s parameters
◮ We can compute both components for each task in a typical
setting, with a simple regression
◮ Example: MATMUL(size)
◮ (times are expressed in milliseconds)
Size   RSS (KB)   ct (L)   mt (L)   ct (B)   mt (B)
150    1272       98       15       23       7
200    1452       254      17       66       8
250    1651       526      19       146      9
300    1840       978      21       278      10
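A sketch of the kind of simple regression that could produce ct and mt: since C(f) is linear in x = f_max/f, an ordinary least-squares line through the points (x, C) gives ct as the slope and mt as the intercept. The measurements below are synthetic, generated from the model itself with the MATMUL(200) Little values, and `fit_ct_mt` is an illustrative name rather than the authors' tool.

```python
def fit_ct_mt(freqs_mhz, times_ms, f_max_mhz):
    """Least-squares fit of C = ct * (f_max / f) + mt; returns (ct, mt)."""
    xs = [f_max_mhz / f for f in freqs_mhz]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(times_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, times_ms))
    ct = sxy / sxx               # slope: frequency-dependent component
    mt = mean_y - ct * mean_x    # intercept: frequency-independent component
    return ct, mt

F_MAX = 1400
freqs = [200, 400, 700, 1000, 1400]
# Synthetic, noise-free "measurements" (ct = 254 ms, mt = 17 ms).
times = [254 * F_MAX / f + 17 for f in freqs]
ct, mt = fit_ct_mt(freqs, times, F_MAX)
print(f"ct = {ct:.1f} ms, mt = {mt:.1f} ms")
```

With real measurements the fit would not be exact, and the residuals would give a first indication of how well the two-component model describes the task.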
SLIDE 21 Impact of interference
◮ Co-execution of an interfering task (MATMUL(200x200))
◮ the interference increases with the size of the matrix
Figure: execution time (s) vs. frequency (MHz), two panels. Little core: L-With-P and L-Without. Big core: B-With-P and B-Without.
SLIDE 22
Dynamic power
◮ Power consumption of MATMUL(150) on Big and Little cores
Figure: power (W) vs. frequency (MHz); series B-avg and L-avg
◮ The Little core at fmax = 1400 MHz consumes less than the Big core at fmin = 200 MHz
◮ Power can be modelled as a third-degree polynomial:
$P(f) = a f^3 + b f^2 + c f + d$
SLIDE 23 Impact of idle processors
◮ We can only measure the energy consumed by all little cores
◮ one single sensor per group of cores
Figure: power (W) vs. frequency (MHz); series One-L, Two-L, Three-L, One-B, Two-B, Three-B
◮ Not easy to understand what is going on:
◮ the OS puts a core in low-power mode when it is not executing, which also reduces static energy
◮ however, there is a shared "base" for all processors
SLIDE 24 Power consumption of RAM
◮ Big core consumes slightly less
◮ probably due to the larger L2 cache (fewer cache misses)
Figure: RAM power vs. frequency (MHz); series Little core and Big core
SLIDE 29 Model of energy
◮ We used a third-degree polynomial of frequency to model the
power consumption: $P(f) = a f^3 + b f^2 + c f + d$
Figure: power (W) vs. frequency (MHz), 200 to 1,400 MHz; measured curves (Real-FFT-1, Real-MM-1) and regression curves (Rg-FFT-1, Rg-MM-1)
◮ Regression:
     FFT          MatMul
a    4.6·10^-11   5.2·10^-11
b    2.2·10^-8    4.1·10^-9
c    3.4·10^-8    7.8·10^-5
d    4.4·10^-2    1.7·10^-2
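As a quick check of the regression coefficients, the cubic can be evaluated at a few frequencies. This sketch assumes that f is expressed in MHz and P in watts, which matches the axes of the plot but is not stated explicitly with the coefficients; `power_w` is an illustrative name.

```python
# Regression coefficients (a, b, c, d) as reported on the slide.
COEFFS = {
    "FFT": (4.6e-11, 2.2e-8, 3.4e-8, 4.4e-2),
    "MatMul": (5.2e-11, 4.1e-9, 7.8e-5, 1.7e-2),
}

def power_w(f_mhz, a, b, c, d):
    """P(f) = a f^3 + b f^2 + c f + d, evaluated in Horner form."""
    return ((a * f_mhz + b) * f_mhz + c) * f_mhz + d

for name, coeffs in COEFFS.items():
    row = ", ".join(f"{f} MHz: {power_w(f, *coeffs):.3f} W" for f in (200, 800, 1400))
    print(f"{name}: {row}")
```

Under that unit assumption the values stay within the 0.1 to 0.4 W range of the plot at the upper frequencies, and the constant term d plays the role of the frequency-independent part of the consumption.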
SLIDE 30 Summary
◮ Fact 1: each task has its own coefficients of power
consumption (static and dynamic)
◮ due to internal power optimization by the hardware
◮ Fact 2: scalability of computation time varies with the task
◮ it depends on size of data vs. size of L1/L2 cache, and
interference
◮ We need to put things together
SLIDE 31 Energy consumption
◮ The energy consumed by a task: $E_i(f) = P_i(f)\, C_i(f)$
◮ (we are not considering deadline constraints in this graph)
Figure: energy (Wh) vs. frequency (MHz); series FFT-L, FFT-B, MM-L, MM-B
◮ Fact 3: The optimal frequency is different for each task
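A sketch of how an energy-optimal frequency can be located by combining the two models, ignoring deadlines as in the graph. It pairs the MatMul power coefficients with the MATMUL(200) Little timing parameters; that pairing is an assumption, because the matrix size behind the power regression is not stated. The 100 MHz steps are also assumed, and energy is computed in joules (W times s) rather than Wh.

```python
F_MAX_MHZ = 1400                 # Little cluster maximum frequency
CT_MS, MT_MS = 254.0, 17.0       # MATMUL(200) on Little, from the profiling table
A, B, C, D = 5.2e-11, 4.1e-9, 7.8e-5, 1.7e-2   # MatMul power coefficients

def power_w(f_mhz):
    """P(f) in watts, f in MHz (unit assumption)."""
    return ((A * f_mhz + B) * f_mhz + C) * f_mhz + D

def time_s(f_mhz):
    """C(f) in seconds."""
    return (CT_MS * F_MAX_MHZ / f_mhz + MT_MS) / 1000.0

def energy_j(f_mhz):
    """E(f) = P(f) * C(f), in joules per job."""
    return power_w(f_mhz) * time_s(f_mhz)

freqs = list(range(200, 1401, 100))   # assumed DVFS steps
best_f = min(freqs, key=energy_j)
print(f"energy-optimal frequency: {best_f} MHz, {energy_j(best_f) * 1e3:.1f} mJ per job")
```

Running slower lowers power but stretches the execution time, so the product has a minimum somewhere between the extremes; where it falls depends on the task's coefficients.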
SLIDE 33 Summary
◮ To minimise energy
◮ Profiling of applications to find coefficients
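$P_i(f) = a_i f^3 + b_i f^2 + c_i f + d_i$
$C_i(f) = ct_i \frac{f_{\max}}{f} + mt_i$
$E_i(f) = C_i(f)\, P_i(f)$
$E_{tot}(f, \Delta) = \sum_{i \in \text{Little}} E_i(f_L)\, n_i(\Delta) + \sum_{i \in \text{Big}} E_i(f_B)\, n_i(\Delta) + E_{mem}$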
◮ Allocating tasks on processors and setting processor frequencies
(fL and fB)
◮ Constraints:
◮ deadlines must be respected
◮ group frequency
◮ A non-linear mixed-integer programming optimisation problem
◮ Difficult to solve with exact optimisation tools
◮ We are developing heuristics
SLIDE 34 Profiling
◮ There is no free lunch!
◮ Complex model
◮ Energy consumption depends on task code
◮ Profiling seems to be the only concrete option available
◮ However, the complexity of profiling grows with the complexity
◮ need to explore input space!
◮ Further research is needed to reduce the profiling effort
SLIDE 35 Parallelization
◮ Can we parallelize tasks?
◮ Pros:
◮ may simplify task allocation
◮ may help to respect deadlines
◮ Cons:
◮ Adds one more dimension of complexity to an already difficult problem
SLIDE 36
Thank you for listening! Questions?