A High-Precision GPU, CPU and Memory Power Model for the Tegra K1

  1. A High-Precision GPU, CPU and Memory Power Model for the Tegra K1 SoC
     Kristoffer Robin Stokke, krisrst@ifi.uio.no

  2. Learning Outcome
     • Deep, low-level knowledge of the Tegra K1
       – GK20A GPU, ARM Cortex-A15 CPU, DDR3 RAM
     • Accurate, generic power modelling for the Tegra K1
       – Method, model training and evaluation
     • Hardware-software codesign for power-aware computing
       – Analysing power usage of joint GPU-CPU execution
       – Optimising kernels for power
     3/24/2016

  3. Motivating Example: Detailed Power Breakdown

  4. Tegra K1: Heterogeneous Multicore 28 nm SoC
     • Tegra family of mobile Systems-on-Chip (SoC), < 12 W power usage
     • (Tegra 2, 3, 4, ...)
     • Tegra K1 & Tegra X1
     • Programmable GPU (CUDA)
     • Power management capabilities

                              Tegra K1              Tegra X1
     High-performance CPU     4 x ARM Cortex-A15    4 x ARM Cortex-A57
     Low-power CPU            1 x ARM Cortex-A15    4 x ARM Cortex-A53
     GPU                      192-core Kepler       256-core Maxwell
     Memory                   2 GB (Jetson-TK1)     4 GB (Jetson-TX1)

  5. GPU-Accelerated Mobile Systems
     • Drones, cars, smart phones, space exploration
     • Video processing, vehicular applications, neural networks, object tracking
     • Energy
       – Battery limitation
       – Environmental aspect
       – Device failure

  6. Energy-Efficient Video Processing
     • Consider an HD video processing pipeline
     • E.g. a Tegra-enabled drone live-streaming a football stadium
     • Raw video is lens-distorted and shaky
     • We implement several video filters to compensate for these effects
     • «Goal»: Reach 60 FPS with as little energy as possible by exploiting hardware capabilities
     • How can we understand the relationship between software activity, power management capabilities and power usage?
     [Figure: «shaky video» frame stream at 60 FPS passing through a debarrel filter and a rotation filter]

  7. Measuring Power
     • Surprisingly hard
     • Few tools to measure power
     • We use an external power source and measurement unit
       – Keithley K2280-S
       – 100 nA precision, high sampling rate
     • For details and code, check our paper [2] and blog [3]
     • VGA BIOS dumps [1] reveal rail measurement sensors (I²C) on most NVIDIA GPUs
       – Reading them breaks GPUs and hangs Linux
     [1] Peres, M. Reverse engineering power management on NVIDIA GPUs - A detailed overview.
     [2] Stokke, K.R. et al., 2015. Why Race-to-Finish is Energy-Inefficient for Continuous Multimedia Workloads.
     [3] http://mlab.no/blog/2015/08/a-peek-in-the-lab-tegra-k1-power-and-voltage-measurements/

  8. Tegra K1 SoC Architecture: Rails and Clocks
     • Power on a rail can be described using the standard CMOS equations [1][2]:
       $P_{rail} = P_{stat} + P_{dyn} = V_{rail} I_{leak} + \alpha C V_{rail}^2 f$
       – $I_{leak}$: transistor leakage
       – $f$: cycles per second
       – $\alpha C$: capacitance load per cycle
     • Rail voltage $V_{rail}$
       – Increases with clock frequency
     • Total power
       – ..is the sum of power of all rails
     [1] Kim, N.S. et al., 2003. Leakage Current: Moore's Law Meets Static Power.
     [2] Castagnetti et al., 2010. Power Consumption Modeling for DVFS Exploitation.
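To make the static and dynamic split concrete, here is a minimal Python sketch. It assumes the standard CMOS form, rail power = V·I_leak + αC·V²·f. The function name `rail_power` and all numeric operating points are hypothetical illustrations, not measurements from the Tegra K1.

```python
def rail_power(v_rail, i_leak, alpha_c, f_clk):
    """Return (static, dynamic, total) power in watts for one rail.

    v_rail  : rail voltage [V]
    i_leak  : leakage current [A]
    alpha_c : switched capacitance per cycle [F]
    f_clk   : clock frequency [Hz]
    """
    p_stat = v_rail * i_leak                 # leakage (static) power
    p_dyn = alpha_c * v_rail ** 2 * f_clk    # switching (dynamic) power
    return p_stat, p_dyn, p_stat + p_dyn


# Hypothetical operating points: voltage rises together with frequency.
low = rail_power(v_rail=0.8, i_leak=0.1, alpha_c=1e-9, f_clk=200e6)
high = rail_power(v_rail=1.0, i_leak=0.1, alpha_c=1e-9, f_clk=800e6)

# Dynamic power grows faster than frequency because voltage rises too.
dyn_ratio = high[1] / low[1]
```

With these made-up numbers the clock is 4 times faster, but dynamic power grows by more than 4 times because the voltage term is squared.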

  9. Tegra K1 SoC Architecture: Rails and Clocks
     • Clock frequency, rail voltage and power usage are deeply coupled
     • Increasing clock frequency increases voltage, and vice versa
     • From previous slide: power ∝ $V_{rail} I_{leak} + \alpha C V_{rail}^2 f$
     [Figures: GPU Rail Voltage vs. GPU Frequency; Measured (Idle) GPU Power]

     Important clocks for software power optimisation:

     Rail         Clock     Description    Frequency steps    Range [MHz]
     HP rail      cpu_g     HP cluster     20                 204 -> 2320
     Core rail    cpu_lp    LP core        9                  51 -> 1092
     Core rail    emc       Memory         10                 40 -> 924
     GPU rail     gpu       GPU            15                 72 -> 852

  10. Tegra K1 SoC Architecture: Rails and Clocks
     • Core rail voltage depends on two clocks
       – Memory and LP core frequency
     • HP rail voltage depends on HP core frequency

  11. Related Work: Rate-Based Power Models
     • Have achieved extremely widespread use since 1997 [1]
       – Advanced uses: on-line power models for smart phones [2][3]
     • Main advantage: concept is simple
       – Power is correlated with utilisation levels (events per second)
       – E.g. rate at which instructions are executed, or rate of cache misses
     • Cost of events per second estimated with multivariable linear regression
       – A typical model for total power:
         $P_{tot} = \beta_0 + \sum_{i=1}^{N} \beta_i \rho_i$
         where $\beta_0$ is the constant base power, $\rho_i$ is the rate of predictor $i$ (events per second) and $\beta_i$ is its cost (watts per event per second)
     [1] Feeney, L.M., 1997. An Energy Consumption Model for Performance Analysis of Routing Protocols for Mobile Ad Hoc Networks.
     [2] Xiao, Y. et al., 2010. A System-Level Model for Runtime Power Estimation on Mobile Devices.
     [3] Dong, M. and Zhong, L., 2011. Self-Constructive High-Rate System Energy Modeling for Battery-Powered Mobile Systems.
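As a rough illustration of how such a rate-based model might be trained, the sketch below fits a constant base power plus one cost per predictor with ordinary least squares. The data is synthetic and noise-free; the two predictors, their units (1e9 events per second) and the "true" coefficients are assumptions chosen for the example, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100

# Synthetic utilisation rates in units of 1e9 events per second (Geps).
rates_geps = rng.uniform(0.01, 1.0, size=(n_samples, 2))

# Hypothetical ground truth: base power [W], then cost per predictor [W per Geps].
true_beta = np.array([1.5, 2.0, 0.5])
power_w = true_beta[0] + rates_geps @ true_beta[1:]

# Design matrix: a column of ones for the base power, then one column per predictor.
design = np.column_stack([np.ones(n_samples), rates_geps])
beta, *_ = np.linalg.lstsq(design, power_w, rcond=None)


def predict_power(rates):
    """Predict total power [W] from a vector of rates [Geps]."""
    return beta[0] + np.asarray(rates, dtype=float) @ beta[1:]
```

Expressing rates in Geps keeps the columns of the design matrix on similar scales, which helps the numerical conditioning of the fit.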

  12. A Rate-Based Power Model for the Tegra K1
     Device    Predictor (CUPTI and PERF)                 Coefficient
     GPU       L2 32B read transactions per second        -18.6 nW per eps
     GPU       L1 4B read transactions per second         0.0 nW per eps
     GPU       L1 4B write transactions per second        -3.7 nW per eps
     GPU       Integer instructions per second            6.2 pW per eps
     GPU       Float 32 instructions per second           6.6 pW per eps
     GPU       Float 64 instructions per second           279 pW per eps
     GPU       Misc. instructions per second              -300 pW per eps
     GPU       Conversion instructions per second         236 pW per eps
     CPU       Active CPU cycles per second               887 pW per eps
     CPU       CPU instructions per second                1.47 nW per eps

     • Disadvantages
       – Ignores important factors
         • Clock-gating
         • Power-gating
         • Voltage variations
         • Frequency scaling
         • Hardware contention
       – Tends to yield negative coefficients (we «gain» power per event per second)
         • Illogical and confusing
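The effect of a negative coefficient can be seen by plugging event rates into the table. The sketch below copies the coefficients from the slide, converted to watts per event per second. The dictionary keys, the function name and the example rates are hypothetical, and the constant base power is omitted because the slide does not state it.

```python
# Coefficients from the slide, converted to watts per event per second (eps).
COEFF_W_PER_EPS = {
    "l2_read_32b": -18.6e-9,
    "l1_read_4b": 0.0,
    "l1_write_4b": -3.7e-9,
    "int_instr": 6.2e-12,
    "float32_instr": 6.6e-12,
    "float64_instr": 279e-12,
    "misc_instr": -300e-12,
    "conv_instr": 236e-12,
    "cpu_active_cycles": 887e-12,
    "cpu_instr": 1.47e-9,
}


def predicted_contributions(rates_eps):
    """Map each predictor's rate [eps] to its modelled power contribution [W]."""
    return {name: COEFF_W_PER_EPS[name] * rate for name, rate in rates_eps.items()}


# Hypothetical rates, chosen only to illustrate the sign problem.
example_rates = {"l2_read_32b": 1e8, "float32_instr": 1e10}
contrib = predicted_contributions(example_rates)
total_w = sum(contrib.values())
```

With these made-up rates, the L2 read term subtracts more power than the Float 32 term adds, which is the «gain» in power that the slide calls illogical.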

  13. A Rate-Based Power Model for the Tegra K1
     • Estimating power of a motion estimation GPU kernel
       – Model performs poorly at different memory and GPU frequency levels
       – Estimation error can be as high as 80 %, while in some regions (green in the figure) it is near 0 %
     [Figure: Estimation error for a motion estimation CUDA kernel]
     Point: Rate-based models should be used with care over frequency ranges

  14. Related Work: CMOS-Based Power Models
     • Some authors [1][2][3] attempt to model switching capacitance $\alpha C$ directly for rails using the CMOS equations
       – Slightly more complicated:
         $P_{rail} = V_{rail} I_{leak} + \alpha C V_{rail}^2 f$
     • Run a workload on several CPU-GPU-memory frequencies, log rail voltages and power
       – Estimate $I_{leak}$ and $\alpha C$ using multivariable linear regression
     • Advantages
       – Voltages and leakage currents considered
     [1] Castagnetti, A. et al., 2010. Power Consumption Modeling for DVFS Exploitation.
     [2] Pathania, A. et al., 2015. Power-Performance Modelling of Mobile Gaming Workloads on Heterogeneous MPSoCs.
     [3] Stokke, K.R. et al., 2015. Why Race-to-Finish is Energy-Inefficient for Continuous Multimedia Workloads.
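The regression step could look roughly like the following sketch for a single rail. It assumes the model power = V·I_leak + αC·V²·f and uses the two regressors V and V²·f. The frequency range matches the GPU clock range listed earlier (72 to 852 MHz, 15 steps), but the voltage curve, the "true" parameters and the resulting power values are synthetic assumptions, not logged data.

```python
import numpy as np

# GPU clock steps in GHz (range from the clock table; even spacing is assumed).
freq_ghz = np.linspace(0.072, 0.852, 15)

# Hypothetical voltage curve: rail voltage rises linearly with frequency.
volt = 0.8 + 0.3 * (freq_ghz - freq_ghz[0]) / (freq_ghz[-1] - freq_ghz[0])

# Hypothetical ground truth used to synthesise "measured" rail power.
true_i_leak = 0.2    # leakage current [A]
true_alpha_c = 1.2   # switched capacitance per cycle [nF]
power_w = volt * true_i_leak + true_alpha_c * volt**2 * freq_ghz

# Linear regression with regressors V (for I_leak) and V^2 * f (for alpha*C).
design = np.column_stack([volt, volt**2 * freq_ghz])
(i_leak, alpha_c), *_ = np.linalg.lstsq(design, power_w, rcond=None)
```

Using GHz and nF keeps the product in watts (1e9 Hz times 1e-9 F) and avoids very large or very small numbers in the design matrix.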

  15. Modelling Switching Capacitance
     • So how does such a model perform on the Tegra K1?
       – Better than the rate-based one
       – Accuracy is generally above 85 %, but drops to about 50 % at high frequencies
     • Disadvantages / reasons
       – $\alpha C$ varies depending on workload
       – Switching activity in one domain (memory) varies depending on frequency in another (CPU)
       – ..but the model assumes an independent relationship between $\alpha C$ and frequency in other domains
     [Figure: Estimation error for a motion estimation CUDA kernel]

  16. Building High-Precision Power Models
     • Rate- and CMOS-based models are complementary
       – Rate-based
         • Advantage: considers detailed utilisation through hardware performance counters (HPCs)
         • Disadvantage: does not consider rail voltages and leakage currents
       – CMOS-based
         • Advantage: considers rail voltages and leakage currents
         • Disadvantage: does not consider detailed hardware utilisation
       – They «solve each other's problems»
     • We need the physical insight from CMOS-based models, and the statistical insight into hardware utilisation from rate-based models

  17. Building High-Precision Power Models
     • The problem is in the dynamic part of the CMOS equation:
       $P_{dyn} = \alpha C V_{rail}^2 f$
       – ..which doesn't consider that $\alpha C$ on a rail actually depends on frequencies in other domains (e.g. memory rail $\alpha C$ depends on CPU and GPU frequency)
     • We now want to express switching activity in terms of measurable hardware activity, similarly to rate-based models:
       $P_{R,dyn} = V_R^2 \sum_{i=1}^{N_R} C_{R,i} \rho_{R,i}$
       – $N_R$: number of utilisation predictors on rail $R$
       – $\rho_{R,i}$: hardware utilisation predictor (events per second)
       – $C_{R,i}$: capacitive load per event per second
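One way such a hybrid model might be fitted for a single rail is sketched below. It assumes rail power = V·I_leak + V²·Σ C_i·ρ_i, so that the leakage term from the CMOS equation is kept and each utilisation rate is weighted by V². The function name `fit_hybrid_rail_model`, the two predictors and all data are hypothetical, and rates are expressed in units of 1e9 events per second.

```python
import numpy as np


def fit_hybrid_rail_model(volt, rates, power):
    """Fit I_leak and per-predictor capacitive loads for one rail.

    volt  : rail voltage per sample [V], shape (n,)
    rates : utilisation rates per sample [Geps], shape (n, k)
    power : measured rail power per sample [W], shape (n,)
    Returns (i_leak, c) with c of shape (k,).
    """
    volt = np.asarray(volt, dtype=float)
    rates = np.asarray(rates, dtype=float)
    # Regressors: V for leakage, and V^2 * rate_i for each predictor.
    design = np.column_stack([volt, (volt**2)[:, None] * rates])
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(power, dtype=float), rcond=None)
    return coeffs[0], coeffs[1:]


# Synthetic, noise-free samples with hypothetical ground-truth parameters.
rng = np.random.default_rng(1)
n = 60
volt = rng.uniform(0.8, 1.1, size=n)
rates = rng.uniform(0.01, 1.0, size=(n, 2))
true_i_leak = 0.15
true_c = np.array([0.9, 0.3])
power = volt * true_i_leak + volt**2 * (rates @ true_c)

i_leak, c_hat = fit_hybrid_rail_model(volt, rates, power)
```

Because the voltage enters each regressor, the same event rate costs more power at a higher voltage, which a plain rate-based model cannot express.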
