Get Out of the Valley: Power-Efficient Address Mapping for GPUs

Get Out of the Valley: Power-Efficient Address Mapping for GPUs. The 45th International Symposium on Computer Architecture (ISCA), Monday, June 4th, 2018. Yuxi Liu (Ghent & Peking), Xia Zhao (Ghent), Magnus Jahre (NTNU), Zhenlin Wang (MTU), Xiaolin Wang (Peking), Yingwei Luo (Peking), and Lieven Eeckhout (Ghent)

GPU Memory Systems. GPUs require high-bandwidth memory systems to support efficient execution of 100s to 1000s of concurrent threads. [Figure: Streaming Multiprocessors (SMs) connect through a Network on Chip (NoC) to four LLC slices, each in front of one DRAM channel (Channels 0 to 3) with its DRAM banks.] Achieving high bandwidth requires effectively utilizing the parallel units in the memory system.

Entropy Valley. Bank and channel bits must be highly variable to ensure even distribution of memory requests across LLC slices, channels and banks. Entropy is a measure of the information content of each address bit. [Figure: memory address fields from most significant bit to least significant bit: Row, Channel, Bank, Column, Block; entropy versus memory address bit for CPUs and GPUs, with the entropy valley marked.] Entropy valleys create significant resource imbalance in GPU memory systems, leading to poor performance and low power-efficiency.

Why Do Entropy Valleys Exist? [Figure: column-major 1D Thread Block (TB) allocation over an 8 x 8 grid (X-dimension 0 to 7, Y-dimension 0 to 7), the resulting memory addresses and requests, and DRAM Channels 0 to 3. The last two bits shown in each address are the channel bits.]
[y,x]   Memory address
[0,0]   … 0000 00 …
[1,0]   … 0010 00 …
[2,0]   … 0100 00 …
[3,0]   … 0110 00 …
[4,0]   … 1000 00 …
[5,0]   … 1010 00 …
[6,0]   … 1100 00 …
[7,0]   … 1110 00 …

Why Do Entropy Valleys Exist? [Figure: the same column-major 1D Thread Block (TB) allocation; Requests [0,0] through [7,0], with addresses … 0000 00 … through … 1110 00 …, all queue at Channel 0 while Channels 1 to 3 stay idle.] All requests end up in Channel 0. Entropy valleys are caused by dimension-related array indexing. Our solution: BIM-based address mapping.
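The channel imbalance on this slide can be reproduced with a toy model. The sketch below assumes a 6-bit address window in which bits [1:0] select the channel and bits [5:3] hold the y index; these bit positions are read off the bit patterns on the slide and are an assumption, not a layout stated by the authors.

```python
# Toy 6-bit address window modelled on the slide's bit patterns.
# Assumed layout (not stated in the slides): bits [1:0] pick the channel,
# bits [5:3] hold the y index of the accessed element.
CHANNEL_MASK = 0b11
Y_SHIFT = 3


def baseline_channel(addr):
    """Baseline mapping: the channel bits are used exactly as they appear."""
    return addr & CHANNEL_MASK


def column_walk(num_rows=8):
    """Addresses of elements [y, 0] for y = 0 .. num_rows - 1."""
    return [y << Y_SHIFT for y in range(num_rows)]


addresses = column_walk()
channels = [baseline_channel(a) for a in addresses]

for y, (a, c) in enumerate(zip(addresses, channels)):
    print(f"[{y},0]  addr={a:06b}  channel={c}")
```

Because only the y bits change between requests, the channel bits never vary and every request selects channel 0.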

Getting Out of the Entropy Valley. BIM-based address mapping: Output Addr. = Binary Invertible Matrix (BIM) x Input Addr. [Figure: column-major 1D Thread Block (TB) allocation over an 8 x 8 grid, the input and output addresses, and DRAM Channels 0 to 3. The last two bits shown in each address are the channel bits.]
[y,x]   Input addr.     Output addr.
[0,0]   … 0000 00 …     … 0000 00 …
[1,0]   … 0010 00 …     … 0010 11 …
[2,0]   … 0100 00 …     … 0100 01 …
[3,0]   … 0110 00 …     … 0110 10 …
[4,0]   … 1000 00 …     … 1000 11 …
[5,0]   … 1010 00 …     … 1010 00 …
[6,0]   … 1100 00 …     … 1100 10 …
[7,0]   … 1110 00 …     … 1110 01 …

Getting Out of the Entropy Valley. BIM-based address mapping: Output Addr. = Binary Invertible Matrix (BIM) x Input Addr. [Figure: the remapped requests of the column-major 1D Thread Block (TB) allocation arriving at the DRAM channels.]
Channel 0: Request [0,0], Request [5,0]
Channel 1: Request [2,0], Request [7,0]
Channel 2: Request [3,0], Request [6,0]
Channel 3: Request [1,0], Request [4,0]
Perfect channel utilization!
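The output addresses on this slide are consistent with a purely XOR-based mapping. The sketch below uses a 6 x 6 binary matrix inferred from the slide's input and output bit patterns; the matrix rows and the 6-bit layout (channel in bits [1:0], y index in bits [5:3]) are assumptions made for illustration, not the matrix the authors used.

```python
def parity(x):
    """XOR of all bits of x (what a tree of XOR gates computes)."""
    return bin(x).count("1") & 1


# Row i is a bit mask: output bit i is the XOR of the selected input bits.
# Inferred from the slide (an assumption): the two channel rows mix in the
# y bits, all other rows are identity.
BIM_ROWS = [
    0b111001,  # out bit 0 (channel, low)  = b0 ^ b3 ^ b4 ^ b5
    0b101010,  # out bit 1 (channel, high) = b1 ^ b3 ^ b5
    0b000100,  # out bit 2 = b2
    0b001000,  # out bit 3 = b3
    0b010000,  # out bit 4 = b4
    0b100000,  # out bit 5 = b5
]


def bim_map(addr):
    """Multiply the matrix by the address bit vector over GF(2)."""
    out = 0
    for i, row in enumerate(BIM_ROWS):
        out |= parity(addr & row) << i
    return out


addresses = [y << 3 for y in range(8)]           # elements [y, 0]
mapped = [bim_map(a) for a in addresses]
channels = [m & 0b11 for m in mapped]

for y, m in enumerate(mapped):
    print(f"[{y},0]  out={m:06b}  channel={m & 0b11}")
```

With this matrix each of the four channels receives exactly two of the eight requests, matching the channel assignment shown on the slide.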

Outline
1. Introduction
2. Window-based memory address entropy
3. Binary Invertible Matrix (BIM) address mapping
4. Results
5. Conclusion

Window-based Entropy. We need an entropy metric without memory request ordering assumptions.
• Window: the TBs that are likely to issue requests that coexist in the memory system
• Compute Shannon's entropy function over the BVR probabilities within each window
• Overall entropy = mean of window entropies
With Greedy-Then-Oldest (GTO) warp scheduling, we heuristically set the window size to the number of Streaming Multiprocessors (SMs).
[Figure: Intra-TB Entropy and Inter-TB Entropy. Thread Block (TB) 1 issues … 1 0 0 …, … 0 0 1 …, … 1 0 1 …, … 0 0 0 … and has Bit Value Ratio (BVR) 0; TB 2 issues … 1 1 0 …, … 0 1 1 …, … 1 1 1 …, … 0 1 0 … and has BVR 1; across TB1 to TB4 the BVR is 0, 1, 0, 1.]
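The metric can be sketched in a few lines. The slide does not spell out the exact formula, so the code below is one plausible reading under stated assumptions: the BVR of a bit is the fraction of a TB's requests with that bit set (consistent with the slide's example), the probability for a window is the mean BVR of its TBs, windows slide by one TB, and the entropy function is the binary Shannon entropy.

```python
import math


def bvr(addresses, bit):
    """Bit Value Ratio: fraction of a TB's requests with `bit` set to 1."""
    return sum((a >> bit) & 1 for a in addresses) / len(addresses)


def binary_entropy(p):
    """Shannon entropy (in bits) of a two-outcome distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def window_entropy(tb_addresses, bit, window):
    """Mean entropy of `bit` over all sliding windows of `window` TBs.

    Assumption: the per-window probability is the mean BVR of its TBs.
    """
    bvrs = [bvr(tb, bit) for tb in tb_addresses]
    num_windows = len(bvrs) - window + 1
    if num_windows < 1:
        raise ValueError("window is larger than the number of TBs")
    entropies = [
        binary_entropy(sum(bvrs[s:s + window]) / window)
        for s in range(num_windows)
    ]
    return sum(entropies) / len(entropies)


# Four hypothetical TBs with two requests each (3-bit addresses).
tbs = [[0b000, 0b010], [0b001, 0b011], [0b100, 0b110], [0b101, 0b111]]
for bit in range(3):
    print(bit, window_entropy(tbs, bit, window=2))
```

In this example bit 0 alternates between TBs (inter-TB variation) and bit 1 varies within every TB (intra-TB variation); both score as high entropy, while bit 2 only changes once across the TB sequence and scores lower.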

Entropy Profile Examples. [Figure: per-bit entropy (0.0 to 1.0) over address bits 29 down to 6 for MT, LU, GS, NW, LPS, and NN (no valley); annotations mark "Two channel bits and one bank bit" and "Three bank bits".] All workloads have low entropy bits, and their location is highly application-dependent. GPU address mapping schemes must harvest entropy across broad address bit ranges.

The Binary Invertible Matrix (BIM). Output Addr. = Binary Invertible Matrix (BIM) x Input Addr.
The BIM can represent all possible address mapping schemes that consist of AND and XOR operations:
• Matrix covers all possible transformations
• Invertibility criterion ensures that all possible one-to-one relations are considered
The BIM has low hardware overhead:
• Can be implemented with a tree of XOR-gates
• Mapping can be performed in a single clock cycle
[Figure: Example Memory Map with two example matrices.]
• Remap (RMP): single 1 per row
• Permutation-based mapping (PM), Zhang et al. [MICRO'00]: two 1s in bank and channel rows
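The two properties on this slide, XOR-only evaluation and invertibility, can be checked in software. The sketch below stores each matrix row as a bit mask and tests invertibility by computing the rank over GF(2); the 4-bit example matrices are hypothetical and only mirror the "single 1 per row" and "two 1s per row" patterns named above.

```python
def parity(x):
    """XOR of all bits of x."""
    return bin(x).count("1") & 1


def apply_bim(rows, addr):
    """Output bit i is the XOR of the input bits selected by rows[i]."""
    out = 0
    for i, row in enumerate(rows):
        out |= parity(addr & row) << i
    return out


def gf2_rank(rows):
    """Rank of a binary matrix (rows as bit masks) via XOR elimination."""
    basis = []  # kept in descending order, one distinct leading bit each
    for row in rows:
        for b in basis:
            row = min(row, row ^ b)  # clears b's leading bit if it is set
        if row:
            basis.append(row)
            basis.sort(reverse=True)
    return len(basis)


def is_invertible(rows):
    """A square binary matrix is invertible iff it has full rank."""
    return gf2_rank(rows) == len(rows)


# Hypothetical 4-bit examples (index = output bit, value = input-bit mask).
identity = [0b0001, 0b0010, 0b0100, 0b1000]  # baseline: bits pass through
remap = [0b0010, 0b0001, 0b0100, 0b1000]     # single 1 per row: swap bits 0, 1
xor_two = [0b0101, 0b1010, 0b0100, 0b1000]   # two 1s in the two low rows
singular = [0b0011, 0b0011, 0b0100, 0b1000]  # duplicate rows: not one-to-one

for name, m in [("identity", identity), ("remap", remap),
                ("xor_two", xor_two), ("singular", singular)]:
    print(name, is_invertible(m))
```

An invertible matrix maps the 16 possible 4-bit addresses to 16 distinct outputs, whereas the singular matrix makes some addresses collide, which is why the invertibility criterion is required for a usable address mapping.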

Our Mapping Schemes.
Broad mapping strategy: entropy analysis shows that a GPU address mapping policy needs to harvest entropy across broad address bit ranges (multiple 1s for each bank and channel row).
• We call this the broad mapping strategy
• Covers many possible mapping schemes
Broad sub-strategies: we define three sub-strategies that differ in which memory address fields can be used as input and output in the BIM.
• Page Address Entropy (PAE)
• Full Address Entropy (FAE)
• All
We randomly generate BIMs that match the input and output restrictions of each sub-strategy.
[Figure: the Row, Channel, Bank, Column, and Block address fields on the input and output side of the Binary Invertible Matrix (BIM), annotated with the fields available to PAE, FAE, and All.]
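Random generation under input and output restrictions can be sketched with rejection sampling. The code below is an illustration under assumptions: `output_bits` are the rows the mapping may rewrite, `input_bits` are the address bits those rows may draw from, all other rows stay identity, and the 8-bit layout in the usage example is hypothetical. The exact field restrictions of PAE, FAE, and All are not given in the slide text and are not encoded here.

```python
import random


def gf2_rank(rows):
    """Rank of a binary matrix (rows as bit masks) via XOR elimination."""
    basis = []
    for row in rows:
        for b in basis:
            row = min(row, row ^ b)
        if row:
            basis.append(row)
            basis.sort(reverse=True)
    return len(basis)


def random_bim(n_bits, output_bits, input_bits, rng, max_tries=1000):
    """Draw a random invertible matrix that only rewrites `output_bits`.

    Rows in `output_bits` get random 1s restricted to `input_bits`;
    every other row is left as identity. Non-invertible draws are rejected.
    """
    in_mask = sum(1 << b for b in input_bits)
    for _ in range(max_tries):
        rows = [1 << i for i in range(n_bits)]
        for out in output_bits:
            rows[out] = rng.getrandbits(n_bits) & in_mask
        if gf2_rank(rows) == n_bits:
            return rows
    raise RuntimeError("no invertible matrix found; check the restrictions")


# Hypothetical 8-bit layout: bits 2-3 are rewritten, bits 2-7 feed them.
rng = random.Random(0)
N_BITS = 8
OUTPUT_BITS = (2, 3)
INPUT_BITS = range(2, 8)
bim = random_bim(N_BITS, OUTPUT_BITS, INPUT_BITS, rng)

for i, row in enumerate(bim):
    print(f"out bit {i}: {row:08b}")
```

Note that the input set has to include the rewritten bits themselves; otherwise the rewritten rows are combinations of the identity rows and the matrix can never be invertible.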

Entropy Impact of Address Mapping Schemes for the MT Benchmark. [Figure: per-bit entropy (0.0 to 1.0) over address bits 29 down to 6 for Baseline, Remap, PM, PAE, FAE, and All.] PAE, FAE, and All remove the entropy valleys; the other mapping schemes do not.

Execution Time vs. DRAM Power. [Figure: average execution time normalized to BASE (y-axis, 0 to 1.2) versus average DRAM power consumption normalized to BASE (x-axis, 0.8 to 1.5) for BASE, PM, RMP, PAE, FAE, and ALL; annotations on the plot: -1.51X and +1.30X.]

Performance. [Figure: speed-up relative to BASE (y-axis, 0 to 8) for BASE, PM, RMP, PAE, FAE, and ALL on MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS, SP, and HMEAN; bar labels range from +1.0X to +7.5X.] PAE improves performance by 1.31X on average compared to PM.

Performance per Watt. [Figure: performance per Watt (y-axis, 0 to 4.5) for BASE, PM, RMP, PAE, FAE, and ALL on MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS, SP, and HMEAN; bar labels include +3.9X and +1.4X.] PAE improves performance per Watt by 1.25X on average compared to PM.

Why is PAE Most Power-Efficient? [Figure: DRAM power breakdown (W, y-axis 0 to 60) into background, activate, read, and write for BASE, PM, RMP, PAE, FAE, and ALL on MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS, SP, and AVG.] FAE and ALL tend to distribute requests with good DRAM page locality to different banks, which increases the number of DRAM page activations. PAE saves power by keeping these requests in the same bank.

Conclusion
Window-Based Entropy
• A novel entropy metric tailored for the highly concurrent memory behavior of GPU compute workloads
Binary Invertible Matrix (BIM) address mapping
• A unified representation of address mapping schemes that use AND and XOR operations
Page Address Entropy (PAE) address mapping
• PAE improves performance by 1.31X and performance per Watt by 1.25X compared to the state-of-the-art permutation-based address mapping scheme

Thank You!
