A High-Precision GPU, CPU and Memory Power Model for the Tegra K1 SoC
Kristoffer Robin Stokke, krisrst@ifi.uio.no
Learning Outcome
- Deep, low-level knowledge of the Tegra K1
– GK20A GPU, ARM Cortex-A15 CPU, DDR3 RAM
- Accurate, generic power modelling for the Tegra K1
– Method, model training and evaluation
- Hardware-software codesign for power-aware computing
– Analysing power usage of joint GPU-CPU execution
– Optimising kernels for power
Motivating Example: Detailed Power Breakdown
Tegra K1: Heterogeneous Multicore 28 nm SoC
- Tegra family of mobile Systems-on-Chip (SoC), < 12 W power usage
– (Tegra 2, 3, 4..)
– Tegra K1 & Tegra X1
- Programmable GPU (CUDA)
- Power management capabilities
                        Tegra K1              Tegra X1
CPU (High Performance)  4 x ARM Cortex-A15    4 x ARM Cortex-A57
CPU (Low Power)         1 x ARM Cortex-A15    4 x ARM Cortex-A53
GPU                     192-core Kepler       256-core Maxwell
Memory                  2 GB (Jetson-TK1)     4 GB (Jetson-TX1)
GPU-Accelerated Mobile Systems
- Drones, cars, smart phones, space exploration
- Video processing, vehicular applications, neural networks, object tracking
- Energy
– Battery limitation
– Environmental aspect
– Device failure
Energy-Efficient Video Processing
- Consider an HD video processing pipeline
- E.g. a Tegra-enabled drone live-streaming from a football stadium
- Raw video is lens-distorted and shaky
- We implement several video filters to compensate for these effects
- «Goal»: reach 60 FPS using as little energy as possible by exploiting hardware capabilities
- How can we understand the relationship between software activity, power management capabilities and power usage?
[Figure: video pipeline – «shaky video» frame stream → debarrel filter → rotation filter → 60 FPS output]
Measuring Power
[1] Peres, M. Reverse Engineering Power Management on NVIDIA GPUs – A Detailed Overview.
[2] Stokke, K.R. et al., 2015. Why Race-to-Finish is Energy-Inefficient for Continuous Multimedia Workloads.
[3] http://mlab.no/blog/2015/08/a-peek-in-the-lab-tegra-k1-power-and-voltage-measurements/
- Surprisingly hard
- Few tools to measure power
- We use an external power source and measurement unit
– Keithley K2280-S
– 100 nA precision, high sampling rate
– For details and code, check our paper[2] and blog[3]
- VGA BIOS dumps[1] reveal rail measurement sensors (I2C) on most NVIDIA GPUs
– Reading them breaks GPUs and hangs Linux
Tegra K1 SoC Architecture: Rails and Clocks
- Power on a rail can be described using the standard CMOS equation[1][2]:
  P_{rail} = V_{rail} I_{leak} + C V_{rail}^2 f
  where V_{rail} I_{leak} is the transistor leakage power, C is the capacitive load switched per cycle and f is the clock frequency (cycles per second)
- Rail voltage
– Increases with clock frequency
- Total power
– ..is the sum of the power of all rails
[1] Kim, N.S. et al., 2003. Leakage Current: Moore's Law Meets Static Power.
[2] Castagnetti, A. et al., 2010. Power Consumption Modeling for DVFS Exploitation.
Tegra K1 SoC Architecture: Rails and Clocks
- Clock frequency, rail voltage and power usage are deeply coupled
– Increasing clock frequency increases voltage, and vice versa
– From the previous slide: power ∝ C V^2 f
- Important clocks for software power optimisation:

Clock    Rail       Description  Frequency steps  Range [MHz]
cpu_g    HP rail    HP cluster   20               204 -> 2320
cpu_lp   Core rail  LP core      9                51 -> 1092
emc                 Memory       10               40 -> 924
gpu      GPU rail   GPU          15               72 -> 852

[Figure: GPU rail voltage vs. GPU frequency; measured (idle) GPU power]
Tegra K1 SoC Architecture: Rails and Clocks
- Core rail voltage depends on two clocks
- Memory and LP core frequency
- HP rail voltage depends on HP core frequency
Related Work: Rate-Based Power Models
- Have achieved extremely widespread use since 1997[1]
– Advanced uses: on-line power models for smart phones[2][3]
- Main advantage: the concept is simple
– Power is correlated with utilisation levels (events per second)
- E.g. the rate at which instructions are executed, or the rate of cache misses
- The cost per event per second is estimated with multivariable linear regression
– A typical model for total power:
  P_{total} = P_{base} + \sum_i c_i u_i
  where u_i is a utilisation predictor (events per second), c_i is its estimated cost and P_{base} is the constant base power

[1] Feeney, L.M., 1997. An Energy Consumption Model for Performance Analysis of Routing Protocols for Mobile Ad Hoc Networks.
[2] Xiao, Y. et al., 2010. A System-Level Model for Runtime Power Estimation on Mobile Devices.
[3] Dong, M. and Zhong, L., 2011. Self-Constructive High-Rate System Energy Modeling for Battery-Powered Mobile Systems.
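To make the fit concrete, here is a minimal sketch (not the tooling used in this work) of estimating the costs c_i and the base power P_{base} with ordinary least squares; the predictor names and all numbers are invented for illustration:

```python
import numpy as np

# One row per sample window: event rates (events/s) for each predictor,
# e.g. [instructions/s, cache misses/s]. All numbers here are invented.
rates = np.array([
    [2.1e9, 3.0e6],
    [1.4e9, 8.5e6],
    [0.9e9, 1.2e6],
    [2.8e9, 5.1e6],
])
power = np.array([3.10, 2.95, 2.20, 3.60])  # measured total power [W]

# Append a constant column so the fit also yields the base power P_base.
X = np.hstack([rates, np.ones((len(rates), 1))])
coeffs, *_ = np.linalg.lstsq(X, power, rcond=None)
*costs, p_base = coeffs
print("cost per event:", costs, "| base power [W]:", p_base)
```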
A Rate-Based Power Model for the Tegra K1
- Disadvantages
– Ignores important factors
- Clock-gating
- Power-gating
- Voltage variations
- Frequency scaling
- Hardware contention
– Tends to yield negative coefficients (we «gain» power per event per second)
- Illogical and confusing
Device  Predictor (CUPTI and PERF)            Coefficient
GPU     L2 32 B read transactions per second  -18.6 nW per eps
GPU     L1 4 B read transactions per second   0.0 nW per eps
GPU     L1 4 B write transactions per second  -3.7 nW per eps
GPU     Integer instructions per second       6.2 pW per eps
GPU     Float 32 instructions per second      6.6 pW per eps
GPU     Float 64 instructions per second      279 pW per eps
GPU     Misc. instructions per second         -300 pW per eps
GPU     Conversion instructions per second    236 pW per eps
CPU     Active CPU cycles per second          887 pW per eps
CPU     CPU instructions per second           1.47 nW per eps
A Rate-Based Power Model for the Tegra K1
- Estimating the power of a motion estimation GPU kernel
– The model performs poorly at different memory and GPU frequency levels
– The estimation error can be as high as 80 %, while in some areas (green) it is near perfect at 0 %
- Point: rate-based models should be used with care across frequency ranges
[Figure: estimation error for a motion estimation CUDA kernel]
Related Work: CMOS-Based Power Models
- Some authors[1][2][3] attempt to model the switching capacitance C directly for each rail using the CMOS equations
– Slightly more complicated
- Run a workload at several CPU-GPU-memory frequency combinations, log rail voltages and power
- Estimate C and I_{leak} using multivariable linear regression
- Advantages
– Voltages and leakage currents are considered
[1] Castagnetti, A. et al., 2010. Power Consumption Modeling for DVFS Exploitation.
[2] Pathania, A. et al., 2015. Power-Performance Modelling of Mobile Gaming Workloads on Heterogeneous MPSoCs.
[3] Stokke, K.R. et al., 2015. Why Race-to-Finish is Energy-Inefficient for Continuous Multimedia Workloads.
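Under the same caveat, a sketch of the CMOS-based fit for a single rail: since P = V I_{leak} + C V^2 f is linear in the unknowns I_{leak} and C, both fall out of a least-squares fit against the features V and V^2 f (the sample values below are synthetic):

```python
import numpy as np

# Operating points of a single rail: voltage [V], frequency [Hz], power [W].
# Synthetic numbers for illustration only.
V = np.array([0.82, 0.90, 1.00, 1.10, 1.20])
f = np.array([204e6, 396e6, 564e6, 708e6, 852e6])
P = np.array([0.35, 0.62, 1.05, 1.60, 2.40])

# P = V*I_leak + C*V^2*f is linear in the unknowns (I_leak, C).
X = np.column_stack([V, V**2 * f])
(i_leak, cap), *_ = np.linalg.lstsq(X, P, rcond=None)
print(f"I_leak ~ {i_leak:.3f} A, C ~ {cap:.3e} F per cycle")
```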
Modelling Switching Capacitance
- So how does such a model perform on the Tegra K1?
– Better than the rate-based one
– Accuracy is generally > 85 %, but only about 50 % at high frequencies
- Disadvantages / reasons
– C varies depending on the workload
– Switching activity in one domain (memory) varies depending on the frequency in another (CPU)
– ..but the model assumes an independent relationship between C and the frequencies in other domains
[Figure: estimation error for a motion estimation CUDA kernel]
Building High-Precision Power Models
- Rate- and CMOS-based models are complementary
– They «solve each other's problems»
- We need the physical insight from CMOS-based models, and the statistical insight into hardware utilisation from rate-based models
               Rate-based                          CMOS-based
Advantages     Considers detailed utilisation      Considers rail voltages and
               through HPCs                        leakage currents
Disadvantages  Does not consider rail voltages     Does not consider detailed
               and leakage currents                hardware utilisation
Building High-Precision Power Models
- The problem is in the dynamic part of the CMOS equation, P_{dyn} = C V^2 f:
– ..which doesn't consider that C on a rail actually depends on frequencies in other domains (e.g. the memory rail depends on CPU and GPU frequency)
- We now want to express switching activity in terms of measurable hardware activity, similarly to rate-based models:
  P_{R,dyn} = V_R^2 \sum_{i=1}^{n_R} C_{R,i} u_{R,i}
  where n_R is the number of utilisation predictors on rail R, C_{R,i} is the capacitive load per event per second and u_{R,i} is a hardware utilisation predictor (events per second)
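As a toy illustration of the shape of this model (the capacitances and event rates below are invented values, not trained coefficients):

```python
# Toy evaluation of the dynamic term P_{R,dyn} = V_R^2 * sum_i(C_{R,i} * u_{R,i}).
def rail_dynamic_power(v_rail, caps, rates):
    assert len(caps) == len(rates)
    return v_rail**2 * sum(c * u for c, u in zip(caps, rates))

caps = [3e-12, 100e-12, 2e-9]   # capacitive load per event [F] (illustrative)
rates = [1.5e9, 2.0e7, 4.0e6]   # matching event rates [events/s]
print(rail_dynamic_power(1.0, caps, rates), "W")  # -> 0.0145 W
```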
- We need to measure hardware activity on each of the four rails
– Memory, HP cluster, Core and GPU rails
- What constitutes good hardware activity predictors?
– From rate-based models: u_{R,i} can be cache misses, cache writebacks, instructions, cycles..
- Estimate the capacitive load per event
– Should ideally cover all hardware activity in a rail
– This is a major task in understanding and/or guessing what is going on in hardware
- Possibly industrial secrets («what is really happening when I run this code?»)
Understanding Hardware Activity
- In addition to switching activity, there are a number of power management mechanisms that must be taken into account
– Clock gating effectively shuts off («gates») clock distribution into circuits or parts of circuits
– Power gating shuts off supply to circuits within rails (clock gating is often implied, but not always)
– Rail gating shuts off power to entire rails, for example:
- The GPU rail is gated when inactive for more than 500 ms
- The CPU HP rail is gated when the kernel driver detects inactivity
- «THE» challenge: to understand when and how circuits are being gated
– The GPU is the hardest: no technical details about its internal workings
– Hard to trace gating duration (we come back to this)
Understanding Hardware Activity: CPU Gating
- Switching capacitance per CPU clock cycle is very important
– Each core clock-gates itself either directly through power management (wfi and wfe instructions) or indirectly
– Fortunately, clock gating is easily tracked with our kernel tracing framework
- Individual cores are power-gated when idle
– Effectively cuts leakage current from that core
– There are no PERF HPCs to track this
– We use our kernel tracing framework to measure the time spent in the power-gated state
- The HP rail is also gated if the CPU is idle
– Effectively cuts leakage current on that rail
Understanding Hardware Activity: CPU Instructions
- Switching activity predictors u_{R,i} for the HP and Core rails
– Almost identical processors on both rails
– Four on the HP rail, one on the core rail
- Software exercises various architectural units in the CPUs
– Integer, floating point, NEON..
- There are no HPCs for these on the Tegra K1
– Instead, the Tegra K1 CPU has a single instruction counter
- Counts everything
- Loss of generality is unavoidable
– Switching capacitance per «generic instruction» must be estimated on a per-process basis
Understanding Hardware Activity: CPU Cache
- L1 and L2 cache writebacks and refills
– Can be traced with PERF
– These are popular in rate-based models. However..
- Problem! Cache writebacks are usually accompanied by cache refills
– Event rates (cache writebacks and refills per second) are not diverse enough for the switching capacitance to be estimated
– We define two new «events» to trace local cache activity (refills and writebacks between L1 and L2) and external cache activity (refills and writebacks between L2 and RAM), as sketched below
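A sketch of how such composite predictors could be computed from raw counter deltas; the PERF event names here are simplified placeholders, not the exact counters we use:

```python
def cache_predictors(ctr):
    """Combine raw PERF counter deltas (one sample window) into the two
    composite cache-activity events. Event names are placeholders."""
    # Local cache activity: refills and writebacks between L1 and L2.
    local = ctr["l1d_refill"] + ctr["l1d_writeback"]
    # External cache activity: refills and writebacks between L2 and RAM.
    external = ctr["l2d_refill"] + ctr["l2d_writeback"]
    return local, external

sample = {"l1d_refill": 1.2e6, "l1d_writeback": 0.4e6,
          "l2d_refill": 0.3e6, "l2d_writeback": 0.1e6}
print(cache_predictors(sample))  # -> (1600000.0, 400000.0)
```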
CPU Summary
- Dynamic power (hardware activity predictors)
– u_{instr}: instructions per second (workload specific)
– u_{cache,local}, u_{cache,ext}: local and external cache traffic per second
– u_{cycles}: active cycles per second (subject to clock gating)
- Static power
– I_{core,leak}: individual core leakage current (when not power-gated)
– I_{HP,leak}: HP rail leakage current (when not rail-gated)
– I_{Core,leak}: Core rail leakage current (always present)
- Total power for the HP and Core rails:
  P_{HP} = V_{HP} I_{HP,leak} + \sum_{k=1}^{4} ( V_{HP} I_{core,leak} + V_{HP}^2 \sum_i C_{HP,i} u_{HP,i,k} )
  P_{Core} = V_{Core} I_{Core,leak} + V_{Core}^2 \sum_i C_{Core,i} u_{Core,i}
  where the HP sum runs over the four HP cores
Understanding Hardware Activity: GPU
Understanding Hardware Activity: GPU Cores
- NVIDIA provides CUPTI
– Essentially a much more fine-grained HPC implementation than PERF
- Please, NVIDIA, continue to support it!
– ..it would however be nice with better documentation of the events
– Fine-grained instruction counting
- We can therefore estimate the switching capacitance per instruction type
– Some are out of scope, such as Special Function Unit (SFU) instructions (sin, cos, tan, ..)
HPC name           Description
inst_integer       Integer instructions
inst_bit_convert   Conversion instructions
inst_control       Control flow instructions
inst_misc          Miscellaneous instructions
inst_fp_32/64      Floating point instructions

Core block dynamic power predictors
Understanding Hardware Activity: GPU Memory (1)
- The GPU has less L2 cache than the CPU (128 kB), but a larger cache line (64-bit)
– Easily the most complex part of dynamic power, because memory is so flexible
– ..and because the documentation is confusing (nvprof --query-events --query-metrics)
- The L2 cache serves read requests
– 32 B read accesses are counted by the CUPTI HPC l2_subp0_total_read_sector_queries
– There is an HPC for writes (l2_subp0_total_write_sector_queries), but we cannot estimate a capacitance cost for it – this indicates that the L2 cache is write-back
- Which is surprising!
Understanding Hardware Activity: GPU Memory (2)
- The L1 GPU cache has many uses:
– Caching global (RAM) reads, not writes (--ptxas-options="--dlcm=ca")
– Caching local data (function parameters) and register spills
– Shared memory (in which case it can be read and written by thread blocks)
- There is no CUPTI HPC which counts raw L1 reads and writes
– We must combine the HPCs for all types of L1 accesses (below) to make our own counter, or analyse the PTX code

HPC name                               Description
l1_global_load_hit                     L1 cache hit for global (RAM) data
l1_local_{store/load}_hit              L1 register spill / local cache
l1_shared_{store/load}_transactions    Shared memory transactions
shared_efficiency                      Shared memory access efficiency metric
Understanding Hardware Activity: GPU Memory (3)
- Shared memory complicates the picture..
– Memory is often broadcast to all threads of a warp
– In this case, the l1_shared_load_transactions HPC counts all of the accesses, but in hardware there was only a single access
- The same holds for writes
– Impossible to fix, but it is possible to approximate the actual accesses:
  l1_shr_{load/store} = l1_shared_{load/store}_transactions * shared_efficiency
– Although it is not a really good solution
Understanding Hardware Activity: GPU Memory (4)
- So, in summary, we count the number of L1 4 B reads and writes as
– (4 B reads; 8 B is also possible)
  l1_reads  = l1_local_load_hit + l1_shr_load + l1_global_load_hit
  l1_writes = l1_local_store_hit + l1_shr_store
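Putting the last three slides together, a small helper that forms the two L1 predictors from the CUPTI counters (assuming shared_efficiency is expressed as a fraction in [0, 1]):

```python
def l1_accesses(hpc):
    # Approximate true shared-memory accesses: the transaction counters
    # over-count warp broadcasts, so scale them by shared_efficiency.
    shr_load = hpc["l1_shared_load_transactions"] * hpc["shared_efficiency"]
    shr_store = hpc["l1_shared_store_transactions"] * hpc["shared_efficiency"]

    reads = hpc["l1_local_load_hit"] + shr_load + hpc["l1_global_load_hit"]
    writes = hpc["l1_local_store_hit"] + shr_store
    return reads, writes
```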
Understanding Hardware Activity: GPU Gating
- The GPU is both rail- and clock-gated
– The details are, however, unknown to us (what parts of the circuits are gated, and when)
– For the duration of a kernel, we assume the clock is always on
- However, there is a power management bug on the Tegra K1..
– If you use any CUPTI function call, or run anything through nvprof, there are two interesting effects which are very hard to see:
– 1) The rail is not power-gated anymore, for however long the GPU is idle
– 2) The GPU clock switching capacitance is twice as high
[Figure: GPU rail gating timeline – processing, inactive, rail gate]
GPU Summary
- Dynamic power (hardware activity predictors)
– u_{int}, u_{fp32}, u_{fp64}, u_{conv}, u_{misc}: integer, float32, float64, conversion and misc. instructions per second
– u_{L2,rd}, u_{L1,rd}, u_{L1,wr}: L2 reads, L1 reads and L1 writes per second
– u_{cycles}: active cycles per second (not subject to clock gating)
- Static power
– I_{GPU,leak}: GPU leakage current when the rail is on
- Total power for the GPU rail:
  P_{GPU} = V_{GPU} I_{GPU,leak} + V_{GPU}^2 \sum_i C_{GPU,i} u_{GPU,i}
Understanding Hardware Activity: Memory
- Monitoring RAM activity is very challenging
– There are no HPCs or built-in monitoring tools..
- The Tegra K1, however, has an activity monitor
– emc_cpu: total RAM cycles spent serving CPU requests
– emc_gpu: total RAM cycles spent serving GPU requests
– These should reflect direct RAM utilisation from the CPU and GPU (see the sketch below)
- In addition, the RAM continuously spends cycles (whether active or not) to maintain its own consistency
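For illustration only, reading these counters might look like the sketch below; the debugfs paths are hypothetical stand-ins, since the real interface depends on the kernel and on our tracing patches:

```python
# Hypothetical debugfs nodes -- the actual interface is platform-specific.
ACTMON_PATH = "/sys/kernel/debug/tegra_actmon"

def read_actmon():
    counters = {}
    for dev in ("emc_cpu", "emc_gpu"):
        with open(f"{ACTMON_PATH}/{dev}/count") as fh:
            counters[dev] = int(fh.read())
    return counters

# Sampling the counters twice and dividing the delta by the interval
# yields the active-cycle predictors u_{emc,cpu} and u_{emc,gpu}.
```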
Memory Summary
- Dynamic power (hardware activity predictors)
– u_{emc,cpu}, u_{emc,gpu}: active memory cycles from CPU and GPU workloads
– u_{cycles}: active cycles per second (not subject to clock gating)
- Static power
– Memory is driven by LDO regulators, and the rail voltage is always 1.35 V
– Therefore it is not possible to isolate the leakage current (it is absorbed into the base power)
- Total power for the memory rail:
  P_{Mem} = 1.35^2 \sum_i C_{Mem,i} u_{Mem,i}
Finding the Right Answer (1)
- The unknown variables are
– The switching capacitances C_{R,i}
– The leakage currents I_{R,leak}
– And the base power P_{base}
- We haven't talked a lot about base power; just consider it the constant power draw of all other rails and electrical components which are not being used (idle)
- The resulting expression is linear, where all voltages and predictors are known
– Which means we can find the coefficients using multivariable linear regression
– ..if we are careful enough..
  P_{total} = P_{base} + \sum_{R} ( V_R I_{R,leak} + V_R^2 \sum_i C_{R,i} u_{R,i} ),  R ∈ {GPU, HP, Core, Mem}
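A condensed sketch of how the design matrix for this regression could be assembled; the sample layout is illustrative, and note that the memory rail's constant 1.35 V makes its leakage column collinear with P_{base}, which is why that leakage cannot be isolated:

```python
import numpy as np

RAILS = ("GPU", "HP", "Core", "Mem")

def design_row(sample):
    # sample["V"][rail]: measured rail voltage; sample["u"][rail]: list of
    # predictor rates u_{R,i} for that rail (layout is illustrative).
    row = []
    for rail in RAILS:
        v = sample["V"][rail]
        row.append(v)                                    # column for I_{R,leak}
        row.extend(v**2 * u for u in sample["u"][rail])  # columns for C_{R,i}
    row.append(1.0)                                      # column for P_base
    return row

# With a list of logged samples and the measured total power:
# X = np.array([design_row(s) for s in samples])
# coeffs, *_ = np.linalg.lstsq(X, measured_power, rcond=None)
```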
Finding the Right Answer (2)
- For regression to work, a training data set must be generated
– ..and the training software must be carefully designed to ensure that the predictors vary enough relative to one another
- The following is the benchmark suite for the GPU
– We start by stressing a small number of architectural units, and then add more units on top
Finding the Right Answer (3)
- Likewise, the CPU training benchmarks:
Finding the Right Answer (4)
- All benchmarks are now run over all possible frequency combinations (see the sketch below)
– GPU benchmarks: LP core at 1 GHz, vary the GPU and memory frequencies
– CPU benchmarks: for all CPU configurations (LP core or any number of HP cores on), vary all CPU and memory frequencies
– All predictors are logged
- This is necessary to force variation in rail voltages, which has several advantages:
– It makes it possible to predict leakage currents
– It helps create diversity in the predictors
- The resulting datasets are quite large (about 2000-3000 samples)
– But they can be reduced (it is not necessary to run over absolutely all frequencies)
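Conceptually, the sweep is a set of nested loops over the frequency steps of each clock; the clock setter below is a placeholder for whatever platform interface is used (on the Jetson-TK1, debugfs clock overrides), and the benchmark hook is hypothetical:

```python
from itertools import product

# Subsets of the documented frequency steps [Hz]; a full run uses all steps.
GPU_FREQS = [72e6, 180e6, 396e6, 612e6, 852e6]
EMC_FREQS = [40e6, 204e6, 396e6, 600e6, 924e6]

def set_clock(name, hz):
    # Placeholder: platform-specific, e.g. writing debugfs clock overrides.
    raise NotImplementedError

def sweep(benchmark):
    for gpu_hz, emc_hz in product(GPU_FREQS, EMC_FREQS):
        set_clock("gpu", gpu_hz)
        set_clock("emc", emc_hz)
        benchmark.run_and_log()  # hypothetical: logs predictors + power
```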
GPU Model Coefficients
[Figure: GPU model coefficients – positive estimates, leakage currents]
- The «memory offsets» compensate for variation in power across memory frequencies (ref. slide 9)
– They are supposed to be negative!
CPU Model Coefficients
[Figure: CPU model coefficients – positive estimates]
Human-Readable Coefficient Comparison
CPU Instruction Switching Capacitance
- The estimated switching capacitance per instruction per second depends on the workload
– Shown for each CPU model training benchmark above
– We want individual integer, floating point, register access etc. PERF counters..
[Figure: per-workload instruction switching capacitance, «generic power» removed]
Model Precision
[Figure: model precision for a CPU workload (DCT) and a GPU workload (MVS) – rate-based vs. CMOS-based vs. our (hybrid) model]
Power Prediction Over Time
- Our model is able to predict the power usage of both CPU and GPU execution with very high accuracy
[Figure: DCT kernel power breakdown]
Power Optimisation = Performance Optimisation
- Caching in L1 instead of L2 saves power due to reduced external memory accesses (EMC GPU)
– Because L1 is not cache coherent
- Using shorter datatypes (float32 over float64) also conserves energy
– Less direct computation and fewer conversion instructions in our example
– Pascal and mixed precision (16-bit float)?
- In our experience, optimising for power is equivalent to optimising for performance
– Which is good news
[Figure: DCT kernel power breakdown]
Optimising System Services
- Estimating instruction power per system process and application
- Removing redundant services and optimising drivers reduces instruction power
– 20 % saving!
[Figure: system-level instruction power, «generic power» removed]
Saving Power by Exploiting Cache Line Width
[Figure: memory clock power; EMC activity (CPU); EMC activity (GPU)]
Conclusion
- In this presentation, we have shown how we can understand the power usage of complex, heterogeneous multicore architectures
- Evaluating a system for power efficiency requires deep insight into architectures and their internal workings
– In this context, our method provides good pointers for modelling power on other SoCs
- We have demonstrated how we can analyse the energy consumption of software workloads
– We optimised both CPU and GPU workloads
Future Work: Clustered Tegra K1