A High-Precision GPU, CPU and Memory Power Model for the Tegra K1 SoC

SLIDE 1

Kristoffer Robin Stokke krisrst@ifi.uio.no

A High-Precision GPU, CPU and Memory Power Model for the Tegra K1 SoC

SLIDE 2

Learning Outcome

3/24/2016 2

  • Deep, low-level knowledge of the Tegra K1
    – GK20A GPU, ARM Cortex-A15 CPU, DDR3 RAM
  • Accurate, generic power modelling for the Tegra K1
    – Method, model training and evaluation
  • Hardware-software codesign for power-aware computing
    – Analysing power usage of joint GPU-CPU execution
    – Optimising kernels for power

SLIDE 3

Motivating Example: Detailed Power Breakdown

SLIDE 4

Tegra K1: Heterogeneous Multicore 28 nm SoC

  • Tegra family of mobile Systems-on-Chip (SoC), < 12 W power usage
    – Tegra 2, 3, 4, ..
    – Tegra K1 & Tegra X1
  • Programmable GPU (CUDA)
  • Power management capabilities


                          Tegra K1             Tegra X1
CPU (High Performance)    4 x ARM Cortex-A15   4 x ARM Cortex-A57
CPU (Low Power)           1 x ARM Cortex-A15   4 x ARM Cortex-A53
GPU                       192-Core Kepler      256-Core Maxwell
Memory                    2 GB (Jetson-TK1)    4 GB (Jetson-TX1)

SLIDE 5

GPU-Accelerated Mobile Systems

  • Drones, cars, smart phones, space exploration
  • Video processing, vehicular applications, neural networks, object tracking
  • Energy
    – Battery limitation
    – Environmental aspect
    – Device failure

SLIDE 6

Energy-Efficient Video Processing

  • Consider an HD video processing pipeline
    – E.g. a Tegra-enabled drone live-streaming a football stadium
  • Raw video is lens-distorted and shaky
  • We implement several video filters to compensate for these effects
  • «Goal»: reach 60 FPS using as little energy as possible, using hardware capabilities
  • How can we understand the relationship between software activity, power management capabilities and power usage?

[Figure: «shaky video» frame stream → debarrel filter → rotation filter → 60 FPS output]

SLIDE 7

Measuring Power


[1] Peres, M. Reverse engineering power management on NVIDIA GPUs - A detailed overview [2] Stokke, K.R. et. al., 2015. Why Race-to-Finish is Energy-Inefficient for Continuous Multimedia Workloads. [3] http://mlab.no/blog/2015/08/a-peek-in-the-lab-tegra-k1-power-and-voltage-measurements/

  • Surprisingly hard
    – Few tools to measure power
  • We use an external power source and measurement unit
    – Keithley K2280-S
    – 100 nA precision, high sampling rate
    – For details and code, check our paper[2] and blog[3]
  • VGA BIOS dumps[1] reveal rail measurement sensors (I2C) on most NVIDIA GPUs
    – Reading them breaks GPUs and hangs Linux

SLIDE 8

Tegra K1 SoC Architecture: Rails and Clocks


  • Power on a rail can be described using the standard CMOS equations[1][2]:

      P_rail = V_rail · I_leak + C_load · V_rail² · f

    – I_leak is the transistor leakage current, C_load the capacitance load switched per cycle, and f the clock frequency (cycles per second)
  • Rail voltage increases with clock frequency
  • Total power is the sum of power over all rails

[1] Nam Sung et. al., 2003. Leakage Current: Moore’s Law Meets Static Power. [2] Castagnetti et. al., 2010. Power Consumption Modeling for DVFS Exploitation.
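As an illustrative sketch of the equation above (the numeric values are invented, not measured Tegra K1 values):

```python
def rail_power(v_rail, i_leak, c_load, f_clk):
    """Power on one rail: P = V * I_leak + C * V^2 * f.

    v_rail: rail voltage [V], i_leak: transistor leakage current [A],
    c_load: capacitance load switched per cycle [F], f_clk: clock [Hz].
    """
    static = v_rail * i_leak                # transistor leakage term
    dynamic = c_load * v_rail ** 2 * f_clk  # switching term
    return static + dynamic

# Total power is the sum over all rails (illustrative values):
rails = [(0.9, 0.05, 1e-9, 500e6), (1.1, 0.02, 2e-9, 200e6)]
p_total = sum(rail_power(*r) for r in rails)
```

The quadratic voltage term is why frequency scaling matters so much: raising the clock also raises the rail voltage, so dynamic power grows faster than linearly.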

SLIDE 9

Tegra K1 SoC Architecture: Rails and Clocks


  • Clock frequency, rail voltage and power usage are deeply coupled
    – Increasing clock frequency increases voltage, and vice versa
    – From the previous slide: power ∝ V² · f

[Figure: measured (idle) GPU power; GPU rail voltage vs. GPU frequency]

Important clocks for software power optimisation:

Clock    Rail        Description   Frequency steps   Range [MHz]
cpu_g    HP Rail     HP Cluster    20                204 -> 2320
cpu_lp   Core Rail   LP Core       9                 51 -> 1092
emc      Core Rail   Memory        10                40 -> 924
gpu      GPU Rail    GPU           15                72 -> 852

SLIDE 10

Tegra K1 SoC Architecture: Rails and Clocks


  • Core rail voltage depends on two clocks
  • Memory and LP core frequency
  • HP rail voltage depends on HP core frequency
SLIDE 11

Related Work: Rate-Based Power Models

  • Have achieved extremely widespread use since 1997[1]

– Advanced uses: On-line power models for smart phones[2][3]

  • Main advantage: concept is simple

– Power is correlated with utilisation levels (events per second)

  • E.g. rate at which instructions are executed, or rate of cache misses
  • Cost of events per second estimated with multivariable, linear regression

– A typical model for total power


[1] Feeney L.M., 1997. An Energy Consumption Model for Performance Analysis of Routing Protocols for Mobile Ad Hoc Networks. [2] Xiao, Y. et. al., 2010. A System-Level Model for Runtime Power Estimation on Mobile Devices. [3] Dong, M. and Zhong, L., 2011. Self-Constructive High-Rate System Energy Modeling for Battery-Powered Mobile Systems.

      P_total = P_base + Σ_i c_i · u_i

  where u_i is an event rate (events per second), c_i the power cost per event per second, and P_base a constant base power
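A minimal sketch of fitting such a rate-based model with multivariable linear regression, on synthetic data (the event rates, costs and base power are invented for illustration; rates are expressed in units of 10⁹ events per second so the regression is well conditioned):

```python
import numpy as np

# Synthetic predictors: two event rates (e.g. instructions/s and cache
# misses/s), expressed in Geps (1e9 events per second).
rng = np.random.default_rng(0)
rates = rng.uniform(0.0, 1.0, size=(100, 2))

true_cost = np.array([2.0, 5.0])   # c_i in W per Geps (i.e. nW per eps)
base = 0.5                         # constant base power [W]
power = rates @ true_cost + base   # "measured" total power

# Multivariable linear regression for [c_1, c_2, P_base]:
X = np.hstack([rates, np.ones((len(rates), 1))])
coeff, *_ = np.linalg.lstsq(X, power, rcond=None)
```

On real measurements the recovered coefficients can come out negative, which is exactly the problem discussed on the next slide.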

SLIDE 12

A Rate-Based Power Model for the Tegra K1

  • Disadvantages

– Ignores important factors

  • Clock-gating
  • Power-gating
  • Voltage variations
  • Frequency scaling
  • Hardware contention

– Tends to yield negative coefficients (we «gain» power per event per second)

  • Illogical and confusing


Device  Predictor (CUPTI and PERF)             Coefficient
GPU     L2 32B read transactions per second    18.6 nW per eps
GPU     L1 4B read transactions per second     0.0 nW per eps
GPU     L1 4B write transactions per second    3.7 nW per eps
GPU     Integer instructions per second        6.2 pW per eps
GPU     Float 32 instructions per second       6.6 pW per eps
GPU     Float 64 instructions per second       279 pW per eps
GPU     Misc. instructions per second          300 pW per eps
GPU     Conversion instructions per second     236 pW per eps
CPU     Active CPU cycles per second           887 pW per eps
CPU     CPU instructions per second            1.47 nW per eps

SLIDE 13

A Rate-Based Power Model for the Tegra K1

  • Estimating power of a motion estimation GPU kernel
    – Model performs poorly at different memory and GPU frequency levels
    – Estimation error can be as high as 80 %, while in some areas (green) it is near perfect at 0 %


Point: rate-based models should be used with care over frequency ranges

Estimation error for a motion estimation CUDA kernel

SLIDE 14

Related Work: CMOS-Based Power Models

  • Some authors[1][2][3] attempt to model switching capacitance directly for rails using the CMOS equations
    – Slightly more complicated
    – Run a workload at several CPU-GPU-memory frequencies, log rail voltages and power
    – Estimate the switching capacitance C and leakage current I_leak using multivariable, linear regression
  • Advantages
    – Voltages and leakage currents considered


[1] Castagnetti, A. et. al., 2010. Power Consumption Modeling for DVFS Exploitation. [2] Pathania, A. et. al., 2015. Power-Performance Modelling of Mobile Gaming Workloads on Heterogeneous MPSoCs. [3] Stokke, K.R. et. al., 2015. Why Race-to-Finish is Energy-Inefficient for Continuous Multimedia Workloads

SLIDE 15

Modelling Switching Capacitance

  • So how does such a model perform on the Tegra K1?
    – Better than the rate-based one
    – Accuracy generally > 85 %, but only about 50 % at high frequencies
  • Disadvantages / reasons
    – The switching capacitance C varies depending on workload
    – Switching activity in one domain (memory) varies depending on frequency in another (CPU)
    – ..but the model assumes an independent relationship between C and frequency in other domains


Estimation error for a motion estimation CUDA kernel

SLIDE 16

Building High-Precision Power Models

  • Rate- and CMOS-based models are complementary

– They «solve each other’s problems»

  • We need the physical insight from CMOS based models, and the

statistical insight into hardware utilisation from rate-based models


               Rate-based                          CMOS-based
Advantages     Considers detailed utilisation      Considers rail voltages and
               through HPCs                        leakage currents
Disadvantages  Does not consider rail voltages     Does not consider detailed
               and leakage currents                hardware utilisation

SLIDE 17

Building High-Precision Power Models


  • The problem is in the dynamic part of the CMOS equation, C · V² · f
    – ..which doesn't consider that C on a rail actually depends on frequencies in other domains (e.g. the memory rail depends on CPU and GPU frequency)
  • We now want to express switching activity in terms of measurable hardware activity, similarly to rate-based models:

      P_dyn,R = V_R² · Σ_{i=1..N_R} α_i · u_i

  where N_R is the number of utilisation predictors on rail R, α_i the capacitive load per event per second, and u_i a hardware utilisation predictor (events per second)
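The predictor-based dynamic term can be sketched as follows (the function and parameter names are mine, and the test values illustrative):

```python
def dynamic_rail_power(v_rail, alphas, rates):
    """Hybrid-model dynamic power for one rail:

        P_dyn,R = V_R^2 * sum(alpha_i * u_i)

    alphas: capacitive load per event [F]
    rates:  hardware utilisation predictors u_i [events/s]
    """
    return v_rail ** 2 * sum(a * u for a, u in zip(alphas, rates))
```

With α in farads per event and u in events per second, each product α_i · u_i plays the role of C · f in the plain CMOS equation, so the fixed per-rail capacitance is replaced by measured activity.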

SLIDE 18
Understanding Hardware Activity

  • We need to measure hardware activity in each of the four rails
    – Memory, HP cluster, Core and GPU rails
  • What constitutes good hardware activity predictors?
    – From rate-based models: u_i can be cache misses, cache writebacks, instructions, cycles..
  • Estimate the capacitance load per event
    – Should ideally cover all hardware activity in a rail
    – This is a major task in understanding and/or guessing what is going on in hardware
      • Possibly industrial secrets («what is really happening when I run this code?»)

SLIDE 19
Understanding Hardware Activity

  • In addition to switching activity, there are a number of power management mechanisms that must be taken into account
    – Clock gating effectively shuts off («gates») clock distribution into circuits or parts of circuits
    – Power gating shuts off supply to circuits within rails (clock gating often implied, but not always)
    – Rail gating shuts off power to entire rails, for example:
      • GPU rail is gated when inactive for more than 500 ms
      • CPU HP rail is gated when the kernel driver detects inactivity
  • «The» challenge: to understand when and how circuits are being gated
    – GPU is the hardest: no technical details about its internal workings
    – Hard to trace gating duration (we come back to this)

SLIDE 20


Understanding Hardware Activity: CPU Gating

  • Switching capacitance per CPU clock cycle is very important
    – Each core clock-gates itself either directly through power management (wfi and wfe instructions) or indirectly
    – Fortunately, clock gating is easily tracked with our kernel tracing framework
  • Individual cores are power-gated when idle
    – Effectively cuts leakage current from that core
    – No PERF HPCs to track this
    – We use our kernel tracing framework to measure the time spent in the power-gated state
  • HP rail is also gated if the CPU is idle
    – Effectively cuts leakage current on that rail

SLIDE 21


Understanding Hardware Activity: CPU Instructions

  • Switching activity predictors for the HP and Core rails
    – Almost identical processors on both rails
    – Four cores on the HP rail, one on the core rail
  • Software exercises various architectural units in the CPUs
    – Integer, floating point, NEON..
  • No HPCs for these on the Tegra K1
    – Instead, the Tegra K1 CPU has a single instruction counter
      • Counts everything
      • Loss of generality is unavoidable
    – Switching capacitance per «generic instruction» must be estimated on a per-process basis

SLIDE 22


Understanding Hardware Activity: CPU Cache

  • L1 and L2 cache writebacks and refills
    – Can be traced with PERF
    – These are popular in rate-based models. However..
  • Problem! Cache writebacks are usually accompanied by cache refills
    – Event rates (cache writebacks and refills per second) are not diverse enough for switching capacitance to be estimated
    – We define two new «events» to trace local cache activity (refills and writebacks between L1 and L2) and external cache activity (refills and writebacks between L2 and RAM)

SLIDE 23


CPU Summary

  • Dynamic power (hardware activity predictors)
    – u_ins: instructions per second (workload specific)
    – u_cache,loc, u_cache,ext: local and external cache traffic per second
    – u_cyc: active cycles per second (subject to clock gating)
  • Static power
    – I_leak,core: individual core leakage current (when not power gated)
    – I_leak,HP: HP rail leakage current (when not rail gated)
    – I_leak,Core: core rail leakage current (always present)
  • Total power for the HP and core rails: rail and core leakage plus predictor-based dynamic power, e.g. for the HP rail with up to 4 cores ungated:

      P_HP = V_HP · (I_leak,HP + n_cores · I_leak,core) + V_HP² · Σ_i α_i · u_i
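A sketch of the HP-rail expression above, with gating folded in as plain arguments (names and test numbers are mine, not the paper's trained coefficients):

```python
def hp_rail_power(v, i_leak_core, cores_ungated, i_leak_rail, rail_on,
                  alphas, rates):
    """HP-rail power: per-core leakage (only for cores not power gated),
    rail leakage (only when the rail is not gated), plus predictor-based
    dynamic power V^2 * sum(alpha_i * u_i)."""
    static = v * (i_leak_core * cores_ungated
                  + (i_leak_rail if rail_on else 0.0))
    dynamic = v ** 2 * sum(a * u for a, u in zip(alphas, rates))
    return static + dynamic
```

When the tracing framework reports a core as power-gated, it simply drops out of the static term; when the whole rail is gated, both terms vanish.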
SLIDE 24


Understanding Hardware Activity: GPU

SLIDE 25


Understanding Hardware Activity: GPU Cores

  • NVIDIA provides CUPTI
    – Essentially a much more fine-grained HPC implementation than PERF
      • Please, NVIDIA, continue to support it!
      • ..it would however be nice with better documentation of the events
    – Fine-grained instruction counting
  • We can therefore estimate switching capacitance per instruction type
    – Some are out of scope, such as Special Function Unit (SFU) instructions (sin, cos, tan, ..)

Core block dynamic power predictors:

HPC Name          Description
inst_integer      Integer instructions
inst_bit_convert  Conversion instructions
inst_control      Control flow instructions
inst_misc         Miscellaneous instructions
inst_fp_32/64     Floating point instructions

SLIDE 26


Understanding Hardware Activity: GPU Memory (1)

  • GPU has less L2 cache than the CPU (128 kB), but a larger cache line (64-bit)
    – Easily the most complex part of dynamic power, because memory is so flexible
    – ..and because the documentation is confusing (nvprof --query-events --query-metrics)
  • L2 cache serves read requests
    – 32 B read accesses counted through the CUPTI HPC l2_subp0_total_read_sector_queries
    – There is an HPC for writes (l2_subp0_total_write_sector_queries), but we cannot estimate a capacitance cost for it – this indicates that the L2 cache is write-back
      • Which is surprising!

SLIDE 27


Understanding Hardware Activity: GPU Memory (2)

  • L1 GPU cache has many uses:
    – Caching global (RAM) reads – not writes (--ptxas-options="--dlcm=ca", or in PTX code)
    – Caching local data (function parameters) and register spills
    – Shared memory (in which case it can be read and written by thread blocks)
  • There is no CUPTI HPC which counts raw L1 reads and writes
    – Must combine the HPCs for all types of L1 accesses to make our own counter:

HPC Name                             Description
l1_global_load_hit                   L1 cache hit for global (RAM) data
l1_local_{store/load}_hit            L1 register spill / local cache
l1_shared_{store/load}_transactions  Shared memory
shared_efficiency                    Shared memory efficiency

SLIDE 28


Understanding Hardware Activity: GPU Memory (3)

  • Shared memory complicates the picture..
    – Memory is often broadcast to all threads of a warp
    – In this case, the l1_shared_load_transactions HPC counts all of the accesses, but in hardware there was only a single access
      • Same for writes
    – Impossible to fix, but it is possible to approximate the actual accesses:

      l1_shr_{load/store} = l1_shared_{load/store}_transactions * shared_efficiency

    – ..although it is not a really good solution

SLIDE 29


Understanding Hardware Activity: GPU Memory (4)

  • So, in summary, we count the number of L1 reads and writes (4 B accesses; 8 B also possible) as:

      l1_reads  = l1_local_load_hit + l1_shr_load + l1_global_load_hit
      l1_writes = l1_local_store_hit + l1_shr_store
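Assuming the CUPTI counters above have already been collected into a dict, the combination can be sketched as follows (the helper name and dict layout are mine; the counter names are the CUPTI events from the previous slides):

```python
def l1_access_counts(counters, shared_efficiency):
    """Combine CUPTI HPCs into raw L1 read/write counts.

    Shared-memory transactions are scaled by shared_efficiency to
    approximate the actual hardware accesses (broadcasts counted once).
    """
    l1_shr_load = counters["l1_shared_load_transactions"] * shared_efficiency
    l1_shr_store = counters["l1_shared_store_transactions"] * shared_efficiency
    reads = (counters["l1_local_load_hit"] + l1_shr_load
             + counters["l1_global_load_hit"])
    writes = counters["l1_local_store_hit"] + l1_shr_store
    return reads, writes
```

The resulting read and write counts, divided by the sampling interval, give the u_l1rd and u_l1wr predictor rates used on the GPU summary slide.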

SLIDE 30


Understanding Hardware Activity: GPU Gating

  • GPU is both rail- and clock-gated
    – The details are however unknown to us (what parts of the circuits are gated, and when)
    – For the duration of a kernel, we assume the clock is always on
  • However, there is a power management bug on the Tegra K1..
    – If you use any CUPTI function call, or run anything through nvprof, there are two interesting effects which are very hard to see:
      1) The rail is not power gated anymore, however long the GPU is idle
      2) GPU clock switching capacitance is twice as high

[Figure: GPU rail gating timeline – processing, inactive, rail gate]

SLIDE 31


GPU Summary

  • Dynamic power (hardware activity predictors)
    – u_int, u_fp32, u_fp64, u_conv, u_misc: integer, float32, float64, conversion and misc. instructions per second
    – u_l2rd, u_l1rd, u_l1wr: L2 reads, L1 reads and L1 writes per second
    – u_cyc: active cycles per second (not subject to clock gating)
  • Static power
    – I_leak,GPU: GPU leakage current when the rail is on
  • Total power for the GPU rail:

      P_GPU = V_GPU · I_leak,GPU + V_GPU² · Σ_i α_i · u_i

SLIDE 32


Understanding Hardware Activity: Memory

  • Monitoring RAM activity is very challenging
    – There are no HPCs, built-in monitoring tools..
  • The Tegra K1 however has an activity monitor
    – emc_cpu: total RAM cycles spent serving CPU requests
    – emc_gpu: total RAM cycles spent serving GPU requests
    – These should reflect direct RAM utilisation from the CPU and GPU
  • In addition, the RAM continuously spends cycles (even when inactive) to maintain its own consistency

SLIDE 33


Memory Summary

  • Dynamic power (hardware activity predictors)
    – u_emc,cpu, u_emc,gpu: active memory cycles per second from CPU and GPU workloads
    – u_cyc: active cycles per second (not subject to clock gating)
  • Static power
    – Memory is driven by LDO regulators and the rail voltage is always 1.35 V
    – Therefore it is not possible to isolate the leakage current
  • Total power for the memory rail (leakage absorbed into the base power):

      P_mem = 1.35² · Σ_i α_i · u_i
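The memory-rail term can be sketched directly from the expression above (function name and test values are mine; the 1.35 V figure is from the slide):

```python
V_MEM = 1.35  # fixed LDO-regulated memory rail voltage [V]

def memory_rail_power(alphas, rates):
    """Memory-rail dynamic power: P_mem = 1.35^2 * sum(alpha_i * u_i).

    rates would typically be [u_emc_cpu, u_emc_gpu, f_emc]; the leakage
    current cannot be isolated at a fixed voltage, so it ends up in the
    model's constant base power instead.
    """
    return V_MEM ** 2 * sum(a * u for a, u in zip(alphas, rates))
```

Because the voltage never varies, this rail contributes no V·I_leak term that regression could separate, which is why its static draw hides in P_base.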
SLIDE 34


Finding the Right Answer (1)

  • The unknown variables are
    – The switching capacitances α_i
    – The leakage currents I_leak,R
    – And the base power P_base
  • We haven't talked a lot about base power; just consider it the constant power draw of all other rails and electrical components which are not being used (idle)
  • The resulting expression is linear, where all voltages and predictors are known
    – Which means we can find the coefficients using multivariable linear regression
    – ..if we are careful enough..
  • Summing over the GPU, HP, Core and memory rails:

      P_total = P_base + Σ_R ( V_R · I_leak,R + V_R² · Σ_i α_i,R · u_i,R )
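Putting the rails together, the full model evaluates as a simple sum (a sketch; the data layout and numbers are mine, not fitted coefficients):

```python
def total_power(p_base, rails):
    """Total model power: constant base power plus, for each rail,
    static leakage and predictor-based dynamic power.

    rails: iterable of (v_rail, i_leak, [(alpha_i, u_i), ...]) tuples
    for the GPU, HP, Core and memory rails.
    """
    total = p_base
    for v, i_leak, loads in rails:
        total += v * i_leak + v ** 2 * sum(a * u for a, u in loads)
    return total
```

Since every term is linear in the unknowns (α_i, I_leak,R, P_base) once voltages and predictor rates are logged, a single least-squares fit over the training set recovers all coefficients at once.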

SLIDE 35


Finding the Right Answer (2)

  • For regression to work, a training data set must be generated
    – ..and the training software must be carefully designed to ensure that the predictors vary enough compared to one another
  • The following is the benchmark suite for the GPU
    – We start by stressing a few architectural units and then add more on top

SLIDE 36


Finding the Right Answer (3)

  • Likewise, the CPU training benchmarks:
SLIDE 37


Finding the Right Answer (4)

  • All benchmarks are now run over all possible frequency combinations
    – GPU benchmarks: LP core at 1 GHz, vary GPU and memory frequencies
    – CPU benchmarks: for all CPU configurations (LP core or any number of HP cores on), vary all CPU and memory frequencies
    – All predictors are logged
  • This is necessary to force variation in rail voltages, which has several advantages:
    – It makes it possible to predict leakage currents
    – It helps create diversity in predictors
  • Resulting datasets are quite large (about 2,000-3,000 samples)
    – But they can be reduced (it is not necessary to run over absolutely all frequencies)
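The frequency sweep itself can be sketched as a product over the clock settings (the `run_benchmark` callable is hypothetical: something that sets the clocks, runs the workload, and returns the logged predictor rates and measured power):

```python
import itertools

def build_training_set(gpu_freqs, emc_freqs, run_benchmark):
    """Sweep all GPU x memory frequency combinations and log the model
    predictors at each operating point.

    run_benchmark(f_gpu, f_emc): user-supplied callable returning a
    dict of predictor rates and measured power for that point.
    """
    rows = []
    for f_gpu, f_emc in itertools.product(gpu_freqs, emc_freqs):
        sample = run_benchmark(f_gpu, f_emc)
        sample.update({"f_gpu": f_gpu, "f_emc": f_emc})
        rows.append(sample)
    return rows
```

With 15 GPU and 10 memory steps per benchmark, a handful of benchmarks quickly yields datasets of the size mentioned above; thinning the frequency grid shrinks them with little loss of voltage diversity.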

SLIDE 38


GPU Model Coefficients

  • Positive estimates ✓ (leakage)
  • The «memory offsets» compensate for variation in power across memory frequencies (ref. slide 9)
    – They are supposed to be negative!
SLIDE 39


CPU Model Coefficients

Positive estimates ✓

SLIDE 40


Human-Readable Coefficient Comparison

SLIDE 41


CPU Instruction Switching Capacitance

  • Estimated switching capacitance per instruction per second depends on the workload
    – Shown for each CPU model training benchmark above
    – We want individual integer, floating point, register access etc. PERF counters..

[Figure: per-workload instruction switching capacitance, generic power removed]

SLIDE 42


Model Precision

[Figure: estimation error for CPU (DCT) and GPU (MVS) workloads – rate-based vs. CMOS-based vs. our (hybrid) model]

SLIDE 43


Power Prediction Over Time

  • Our model is able to predict the power usage of both CPU and GPU execution with very high accuracy

[Figure: DCT kernel power breakdown over time]

SLIDE 44


Power Optimisation = Performance Optimisation

  • Caching in L1 over L2 saves power due to reduced external memory accesses (EMC GPU)
    – Because L1 is not cache coherent
  • Using shorter datatypes (float32 over float64) also conserves energy
    – Less direct computation and fewer conversion instructions in our example
    – Pascal and mixed precision (16-bit float)?
  • In our experience, optimising for power is equivalent to optimising for performance
    – Which is good news!

[Figure: DCT kernel power breakdown]

SLIDE 45


Optimising System Services

  • Estimating instruction power per system process and application
  • Removing redundant services and optimising drivers reduces instruction power
    – 20 % saving!

[Figure: system-level instruction power, «generic power» removed]

SLIDE 46


Saving Power by Exploiting Cache Line Width

[Figure: memory clock power, EMC activity (CPU) and EMC activity (GPU)]

SLIDE 47


Conclusion

  • In this presentation, we have shown how we can understand the power usage of complex, heterogeneous multicore architectures
  • Evaluating a system for power efficiency requires deep insight into architectures and their internal workings
    – In this context, our method provides good pointers for modelling power on other SoCs
  • We have demonstrated how we can analyse the energy consumption of software workloads
    – Optimised both CPU and GPU workloads

SLIDE 48

Future Work: Clustered Tegra K1
