

SLIDE 1

Luis Garrido, 2012

IEE5008 –Autumn 2012 Memory Systems Survey on the Off-Chip Scheduling of Memory Accesses in the Memory Interface of GPUs

Garrido Platero, Luis Angel EECS Graduate Program National Chiao Tung University luis.garrido.platero@gmail.com

SLIDE 2

NCTU IEE5008 Memory Systems 2012 Luis Garrido

Outline

Introduction Overview of GPU Architectures

The SIMD Execution Model Memory Requests in GPUs

Differences between GDDRx and DDRx State-of-the-art Memory Scheduling Techniques

Effect of instruction fetch and memory scheduling in GPU Performance An alternative Memory Access Scheduling in Manycore Accelerators


SLIDE 3

Outline

DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function
Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
Conclusion
References


SLIDE 4

Introduction

Memory controllers are a critical performance bottleneck
Scheduling policies must comply with the SIMD execution model
Characteristics of the memory requests in GPU architectures
Integration of GPU+CPU systems


SLIDE 5

Overview: SIMD execution model


[Figure: block diagram of a GPU streaming multiprocessor (cores, DP units, LD/ST units, a 65536 x 32-bit register file, warp schedulers, instruction cache, 64 KB shared memory / L1 cache, 48 KB read-only data cache, texture units) connected through an interconnect network to the L2 unified cache, memory controllers and the GigaThread engine, alongside the CUDA hierarchy: a grid of thread blocks, e.g. Block(0,2), each holding a 2-D array of threads.]

SLIDE 6

Overview: Memory requests in GPUs

A varying number of accesses with different characteristics is generated simultaneously


[Figure: example of the memory requests (A-F) issued by warps in consecutive cycles i through i+6.]

Load/Store units handle the data fetch
Concept of memory coalescing: accesses can be merged
  Intra-core merging
  Inter-core merging
Merging reduces the number of transactions
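As a rough illustration of intra-warp merging, the following Python sketch (an assumption of this summary, not code from any of the surveyed papers) maps per-thread byte addresses to aligned 128-byte segments and counts the resulting transactions:

```python
def coalesce(addresses, seg_size=128):
    """Merge the byte addresses touched by one warp into the minimal set
    of aligned memory segments (transactions).

    Illustrative sketch only: real coalescing rules vary by architecture;
    seg_size = 128 mirrors a 128-byte memory segment.
    """
    # Each address falls into exactly one aligned segment.
    segments = {addr // seg_size for addr in addresses}
    return sorted(seg * seg_size for seg in segments)

# 32 threads reading consecutive 4-byte words: one 128-byte transaction.
print(len(coalesce([tid * 4 for tid in range(32)])))        # 1
# The same words shifted by 4 bytes (misaligned): two transactions.
print(len(coalesce([4 + tid * 4 for tid in range(32)])))    # 2
# Scattered accesses: one transaction per thread in the worst case.
print(len(coalesce([tid * 4096 for tid in range(32)])))     # 32
```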

SLIDE 7

Overview: Memory requests in GPUs


The number of transactions depends on:
  Parameters of the memory subsystem: number of cache levels, cache line size, etc.
  The application's behavior
  The scheduling policy and the GDDRx controller's capabilities

[Figure: warp access patterns over byte addresses 0-384 and the resulting transaction counts. On newer devices: a) coalesced accesses = 1 transaction; b) same-word accesses = 1 transaction; c) scattered accesses = k transactions; d) misaligned accesses = 2 transactions; e) permuted accesses = 1 transaction. On older devices: coalesced accesses = 4 transactions, same-word accesses = 1 transaction, scattered accesses = k transactions, misaligned accesses <= 6 transactions, permuted accesses = 4 transactions.]

SLIDE 8

Differences in GDDRx and DDRx

GDDRx can receive and send data in the same clock cycle
More channels than DDRx
Higher clock speeds in GDDRx
Designed for higher bandwidth demand
Different power-consumption characteristics: GDDRx requires better heat dissipation


SLIDE 9

Memory Request Scheduling Problem

Bandwidth limitation. It is desirable to increase:
  Row-buffer locality
  Bank-level parallelism
and to reduce the negative interference among memory accesses.
Root cause of the problem: the massive number of simultaneous memory requests.
Memory scheduling mechanisms can partially alleviate the issue.
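Row-buffer locality is exactly what the baseline FR-FCFS policy exploits. A minimal Python sketch of first-ready, first-come-first-served scheduling (the class and its single-rank open-row model are illustrative assumptions, not code from any surveyed paper):

```python
from collections import deque

class FRFCFS:
    """Minimal first-ready, first-come-first-served scheduler sketch.

    Requests are (bank, row) pairs; row hits (requests to the currently
    open row of a bank) are served before older row misses.
    """
    def __init__(self, num_banks):
        self.open_row = [None] * num_banks   # open row per bank
        self.queue = deque()                 # requests in arrival order

    def add(self, bank, row):
        self.queue.append((bank, row))

    def next(self):
        if not self.queue:
            return None
        # First ready: the oldest request that hits an open row...
        for req in self.queue:
            bank, row = req
            if self.open_row[bank] == row:
                self.queue.remove(req)
                return req
        # ...otherwise plain FCFS: serve the oldest request, opening its row.
        bank, row = req = self.queue.popleft()
        self.open_row[bank] = row
        return req

sched = FRFCFS(num_banks=2)
for bank, row in [(0, 5), (1, 9), (0, 5), (0, 7)]:
    sched.add(bank, row)
print(sched.next())  # (0, 5): oldest request, opens row 5 on bank 0
print(sched.next())  # (0, 5): row hit served before the older (1, 9)
```

Note how the younger row hit jumps ahead of the older miss; that reordering is the source of both the bandwidth gain and the buffering complexity criticized later in this survey.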


SLIDE 10

Effect of IF and memory scheduling

Paper: N.B. Lakshminarayana, H. Kim, "Effect of Instruction Fetch and Memory Scheduling on GPU Performance", Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.
Analyzes the effect of different instruction-fetch (IF) and memory scheduling policies:
  Seven IF policies: RR, ICOUNT, LRF, FAIR, ALL, BAR, MEM-BAR
  Four memory scheduling policies: FCFS, FR-FCFS, FR-FAIR, REMINST


SLIDE 11

Effect of IF and memory scheduling

GPUOcelot was used to evaluate performance
Instruction count of warps: the Max-Warp to Min-Warp ratio


SLIDE 12

Effect of IF and memory scheduling

Most IF policies provide an improvement of about 3-6% on average, except for ICOUNT.
The same IF policies do not bring FCFS up to FR-FCFS performance.
Memory-intensive applications improve with FR-FAIR and REMINST.
Adequate IF policies can increase memory-access merging.

SLIDE 13

Effect of IF and memory scheduling

For asymmetric applications, the best performance comes from FR-FCFS + RR.
No policy provides a good benefit across all asymmetric applications: the application's characteristics govern a policy's effectiveness.
The regularity of the warps gives clues about application behavior.


SLIDE 14

An Alternative Memory Access Scheduling

Paper: Y. Kim, H. Lee, J. Kim, “An Alternative Memory Access Scheduling in Manycore Accelerators”, in Conf. on Parallel Architectures and Compilation Techniques, 2011.

Algorithm 1: mLCA
  At each core c:
    if all requests in a warp satisfy rc(m,b) = 0 and O(c) < t then
      for each request within the warp do
        inject request; rc(m,b)++; O(c)++
      end for
    else
      throttle
    end if

FR-FCFS requires a complex hardware structure
Idea: make scheduling decisions near the cores
Avoid network congestion: the mLCA algorithm
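The mLCA listing can be turned into runnable form. The following Python sketch is one illustrative reading of the throttling test: only rc(m,b), O(c) and the threshold t come from the slide; the class structure and method names are assumptions.

```python
from collections import defaultdict

class Core:
    """Per-core injection throttling in the spirit of the mLCA listing.

    rc[(m, b)] counts this core's in-flight requests to memory
    controller m, bank b; `outstanding` plays the role of O(c) and
    `threshold` the role of t in the pseudocode.
    """
    def __init__(self, threshold):
        self.rc = defaultdict(int)
        self.outstanding = 0
        self.threshold = threshold

    def try_inject(self, warp_requests):
        """warp_requests: (controller, bank) targets of one warp.
        Returns True if the warp was injected, False if throttled."""
        if self.outstanding >= self.threshold:
            return False                      # O(c) limit reached
        if any(self.rc[target] != 0 for target in warp_requests):
            return False                      # some rc(m,b) != 0: throttle
        for target in warp_requests:
            self.rc[target] += 1              # inject req; rc(m,b)++
            self.outstanding += 1             # O(c)++
        return True

    def complete(self, target):
        """Called when a request to (m, b) has been serviced."""
        self.rc[target] -= 1
        self.outstanding -= 1

core = Core(threshold=4)
print(core.try_inject([(0, 1), (0, 2)]))  # True: banks free, under threshold
print(core.try_inject([(0, 1)]))          # False: rc(0,1) != 0 -> throttle
```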

SLIDE 15

An Alternative Memory Access Scheduling

Observation: locality information is affected by the NoC traffic.
Multiple requests are grouped into superpackets, considering the order of the requests and source/destination criteria.
Two grouping schemes: ICR and OCR.
GPGPU-Sim was used for the experiments.
95% speedup with BIFO + mLCA.

SLIDE 16

An Alternative Memory Access Scheduling

The authors did not explain the source of the improvement obtained.
They did not characterize the applications evaluated, which makes it difficult to assess robustness and effectiveness.
Key idea to retain: distributing the scheduling tasks across different points of the memory subsystem.
SLIDE 17

Scheduling Based on a Potential Function

Paper: N.B. Lakshminarayana, J. Lee, H. Kim, J. Shin, "DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function", IEEE Computer Architecture Letters, vol. 11, no. 2, pp. 33-36, July-Dec. 2012.
Models DRAM behavior and the scheduling policy, called α-SJF (α-Shortest Job First), through a potential function.
The memory requests of the warps are treated as jobs.
Two GPU features are captured: large-scale multithreading and SIMD execution.

SLIDE 18

Scheduling Based on a Potential Function

α-SJF alone does not consider row-buffer locality
The policy chooses between SJF and FR-FCFS, bringing in:
  A tolerance metric
  The SIMD execution model
  Row-buffer locality at the DRAM
Baseline potential function (for GDDR5 memories, m = 3):

r(t) = Σ_{i=1..N} q_i(t) + Σ_{i=1..N} p_i(t)    (Eq. 1)

r(t+1) = r(t) - 1      (hit in the DRAM row buffer)
r(t+1) = r(t) - 1/m    (miss in the DRAM row buffer)

m = (service time of a request with a DRAM row-buffer miss) / (service time of a request with a DRAM row-buffer hit)
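The hit/miss asymmetry behind the update rule can be made concrete. In this illustrative Python sketch (using the slide's m = 3 for GDDR5), a row-buffer miss consumes m DRAM service slots while a hit consumes one, which is equivalent to the potential dropping by 1 per slot on a hit and by 1/m per slot on a miss:

```python
def service_steps(requests, m=3):
    """DRAM service slots needed to serve a sequence of requests under
    the potential model above: a row-buffer hit consumes one slot, a
    miss consumes m slots (m = miss-to-hit service-time ratio; the
    slide gives m = 3 for GDDR5).

    requests: list of booleans, True = row-buffer hit. Illustrative only.
    """
    return sum(1 if hit else m for hit in requests)

# Four row-buffer hits drain the job's potential in 4 slots...
print(service_steps([True] * 4))     # 4
# ...while four misses take 3x as long on GDDR5.
print(service_steps([False] * 4))    # 12
```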

SLIDE 19

Scheduling Based on a Potential Function

To reflect SIMD execution, a new parameter α is introduced. α serves two purposes:
  Select a request from the warp with the shortest queue
  Schedule requests from the same warp together
How is the selection done? By changing the value of α depending on the benchmark.

Introducing the tolerance metric:

r(t) = Σ_{i=1..N} q_i(t)^α + Σ_{i=1..N} p_i(t)^α,  0 ≤ α ≤ 1    (Eq. 2)

f_tol(c_k, t) = r_k(t) · w_c    (Eq. 3)
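To see how the exponent α in Eq. 2 produces SJF-like behavior, the following Python sketch (an illustrative reading, not the paper's implementation) serves whichever warp queue minimizes the potential after the step; with a sublinear α the shortest queue wins:

```python
def potential(queues, alpha):
    """Eq. 2-style potential: the sum of per-warp queue lengths raised
    to alpha (0 <= alpha <= 1). A sublinear alpha rewards finishing one
    warp's requests over spreading service evenly across warps."""
    return sum(q ** alpha for q in queues)

def best_service(queues, alpha):
    """Pick which warp's queue to decrement so that the potential after
    the step is smallest: an illustrative reading of alpha-SJF."""
    candidates = []
    for i, q in enumerate(queues):
        if q > 0:
            after = queues[:i] + [q - 1] + queues[i + 1:]
            candidates.append((potential(after, alpha), i))
    return min(candidates)[1]

# With alpha < 1 the shortest queue is served first (SJF-like):
print(best_service([4, 1, 3], alpha=0.5))  # 1
# With alpha = 1 the potential drop is identical for every choice,
# so the exponent is what encodes the shortest-job preference.
```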
SLIDE 20

Scheduling Based on a Potential Function

The complete scheduling policy is formulated with two particular features:
  Tolerance-aware core selection
  Queue selection
MacSim was used to perform the experiments, with two variations of the policy:
  One services writes only after all reads have been served
  One gives no preference between reads and writes

r(t) = Σ_{i=1..N} q_i(t)^α · 1/f_tol(c_q, t) + Σ_{i=1..N} p_i(t)^α · 1/f_tol(c_p, t)    (Eq. 4)
SLIDE 21

Scheduling Based on a Potential Function

Backprop, NearestNeighbor, OceanFFT and Streamcluster achieve 3%, 9%, 11% and 7% speedups, respectively.
When α-SJF or α-SJFW boosts performance, larger values of α often show more benefit.
This shows the adaptability of the potential function for the α-SJF policy.

[Figure: relative IPC (0.6-1.15) of α-SJF and α-SJFW with α = 0.25, 0.50, 0.75, 1.00 on BackProp, BlackScholes, CFD, NearestNeighbor, OceanFFT and StreamCluster.]

SLIDE 22

Scheduling Based on a Potential Function

The complexity of the access patterns determines the performance of the scheduling policy; a higher degree of coordination is needed.
As it stands, the approach is impractical:
  The potential function needs to account for the cost that serving a row-miss request imposes on all subsequent requests
  It is necessary to consider not only the SIMD execution model but also the various behavioral cases (adaptability)

SLIDE 23

Staged Memory Scheduling

Paper: R. Ausavarungnirun, K.K. Chang, L. Subramanian, G. Loh, O. Mutlu, "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems", Proc. of the Intl. Symposium on Computer Architecture, 2012.
Addresses the scheduling problem for a CPU+GPU system: the huge number of accesses generated by the GPU can harm CPU performance.

[Figure: CPU memory requests and GPU memory requests in the request buffers, panels a)-c).]

SLIDE 24

Staged Memory Scheduling

Centralized out-of-order scheduling (FR-FCFS) is not adequate:
  The size of the buffers increases
  Combinational complexity
  Latency, area and power
  Complex logic is needed to exploit row-buffer locality and meet GDDR timing constraints
Proposed mechanism: the Staged Memory Scheduler (SMS)
  Decouples the scheduling tasks from the memory controller
  Distributes the tasks across simpler hardware structures

SLIDE 25

Staged Memory Scheduling

SMS has three stages:
  Detect basic application memory-access characteristics
  Prioritize across applications on the CPU and GPU cores
  DRAM command scheduling: meet timing constraints and resolve resource conflicts
Different behavior for different applications: robustness
SMS has three corresponding components:
  Batch builder
  Batch scheduler
  DRAM command scheduler

SLIDE 26

Staged Memory Scheduling

Batch: a group of memory requests from the same application with row-buffer locality
Batch builder: one FIFO per memory-access source (core)
The batch scheduler chooses a batch and sends its accesses to the DRAM command scheduler:
  It considers ready batches only
  It chooses based on SJF or round-robin (RR)
Applications are classified based on memory intensity
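The batch-builder stage can be sketched as follows (Python, illustrative; the real batch builder also bounds batch sizes and ages out partial batches, details not shown on the slide):

```python
from collections import defaultdict

def build_batches(requests):
    """Group each source's in-order requests into batches whose members
    share a (bank, row), in the spirit of the SMS batch builder.

    requests: list of (source, bank, row) tuples in arrival order.
    Returns a dict mapping each source to its list of batches.
    """
    batches = defaultdict(list)
    for source, bank, row in requests:
        per_src = batches[source]
        # Extend the open batch if the new request hits the same row...
        if per_src and per_src[-1][-1][1:] == (bank, row):
            per_src[-1].append((source, bank, row))
        else:
            # ...otherwise close it and start a new batch.
            per_src.append([(source, bank, row)])
    return dict(batches)

reqs = [("cpu0", 0, 5), ("cpu0", 0, 5), ("gpu", 1, 9), ("cpu0", 0, 7)]
bb = build_batches(reqs)
print([len(batch) for batch in bb["cpu0"]])   # [2, 1]
```

Because every request inside a batch targets the same row, the downstream DRAM command scheduler can serve a whole batch as row hits after a single activation.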

SLIDE 27

Staged Memory Scheduling

A configurable parameter p is introduced:
  The batch scheduler chooses SJF or RR based on p
  When p is high, SJF is used, prioritizing CPU applications
  When p is low, RR is used, prioritizing GPU applications
Architecture of the DRAM command scheduler:
  One FIFO coupled with each DRAM bank
  The choice of FIFO depends on the opened row buffers
SMS (which adds a two-cycle overhead) is bypassed when:
  The CPU+GPU are executing latency-sensitive applications
  The system is lightly loaded
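The p-driven choice between SJF and RR can be sketched as below (Python; treating p as a probability and using a uniform random pick as a stand-in for round-robin are assumptions of this sketch):

```python
import random

def pick_batch(ready, p, rng=random):
    """Stage-two batch scheduler sketch: with probability p pick the
    source with the fewest outstanding requests (SJF, which tends to
    favor latency-sensitive CPU applications); otherwise pick uniformly
    at random as a stand-in for round-robin (which tends to favor the
    bandwidth-hungry GPU).

    ready: dict mapping source name to its outstanding request count.
    """
    if rng.random() < p:
        return min(ready, key=ready.get)   # SJF: shortest queue first
    return rng.choice(sorted(ready))       # RR stand-in (assumption)

ready = {"cpu0": 2, "cpu1": 3, "gpu": 64}
# p = 1.0: always SJF, so the lightest CPU source wins.
print(pick_batch(ready, p=1.0))   # cpu0
```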

SLIDE 28

Staged Memory Scheduling

For non-SMS memory controllers, the system reserves half of the request-buffer entries for CPU requests.
The authors used an in-house cycle-accurate simulator for the experiments.
Performance is evaluated with the following metrics:

CPU_WS = Σ_{i=1..NumCores} IPC_i^shared / IPC_i^alone    (Eq. 5)

GPU_speedup = GPUFrameRate^shared / GPUFrameRate^alone    (Eq. 6)

CGWS = CPU_WS + GPUweight · GPU_speedup    (Eq. 7)

Unfairness = max( max_i IPC_i^alone / IPC_i^shared , GPUFrameRate^alone / GPUFrameRate^shared )
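Eqs. 5-7 and the unfairness metric translate directly into code; a short Python sketch with hypothetical per-core numbers:

```python
def cpu_ws(ipc_shared, ipc_alone):
    """Eq. 5: CPU weighted speedup, the sum over cores of
    IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def gpu_speedup(frame_shared, frame_alone):
    """Eq. 6: shared GPU frame rate over standalone frame rate."""
    return frame_shared / frame_alone

def cgws(ipc_shared, ipc_alone, frame_shared, frame_alone, gpu_weight):
    """Eq. 7: CPU-GPU weighted speedup."""
    return (cpu_ws(ipc_shared, ipc_alone)
            + gpu_weight * gpu_speedup(frame_shared, frame_alone))

def unfairness(ipc_shared, ipc_alone, frame_shared, frame_alone):
    """Maximum slowdown across the CPU cores and the GPU."""
    slowdowns = [a / s for s, a in zip(ipc_shared, ipc_alone)]
    slowdowns.append(frame_alone / frame_shared)
    return max(slowdowns)

# Two CPU cores halved and unchanged in IPC; GPU frame rate halved.
print(cgws([1.0, 2.0], [2.0, 2.0], 30.0, 60.0, gpu_weight=2.0))  # 2.5
print(unfairness([1.0, 2.0], [2.0, 2.0], 30.0, 60.0))            # 2.0
```

Note how GPUweight lets the same numbers favor either side: a large weight makes the GPU's frame-rate slowdown dominate CGWS, which is exactly the trade-off discussed on the next slides.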

SLIDE 29

Staged Memory Scheduling

SMS with p = 0.9 improves CPU performance by 22.1%/35.7% on average over ATLAS and TCM, by restricting the bandwidth allocated to the GPU.
SMS-0.9 reduces the GPU frame rate, but it stays higher than 30 frames per second.
For p = 0, the frame rate increases.

[Figure: CPU weighted speedup and GPU frame rate across workload categories (L, ML, M, HL, HML, HM, H, Avg) for FR-FCFS, ATLAS, TCM, SMS-0.9 and SMS-0.]

SLIDE 30

Staged Memory Scheduling

SMS-0.9 achieves higher CGWS than FR-FCFS, ATLAS and TCM for small GPUweight (< 7.5).
As GPUweight increases, SMS-0.9 has lower CGWS.
A p that maximizes CGWS favors the GPU, slowing down the CPU applications and thus leading to higher unfairness.
For smaller GPUweight, SMS provides the highest fairness.
SMS consumes 66.7% less leakage power and occupies 46.3% less area.

[Figure: normalized CGWS and unfairness as a function of GPUweight (0.001-1000) for FR-FCFS, ATLAS, TCM and the SMS variants (SMS-0.9, SMS-0.05, SMS-0.02, SMS-0, SMS-Max), panels a)-c).]

SLIDE 31

Staged Memory Scheduling

Comparing only against policies that are agnostic of the SIMD execution model seems unfair; it is necessary to compare against other scheduling mechanisms already tested on GPU systems.
The work provides a key metric for characterizing applications: memory intensity.
Only graphics applications were considered; an application-aware approach is necessary to extend the applicability of the proposed mechanism.

SLIDE 32

Conclusion

Widely used approach: FR-FCFS. Why strive for different solutions?
General approaches:
  A centralized mechanism (FR-FCFS), improving over the base policy
  Throttling the instruction-issue mechanism
  Distributing the scheduling tasks across different points of the GPU system
Implementing a general-case scheduling mechanism is a challenging task.


SLIDE 33

Conclusion

Any mechanism must consider the execution model together with the application's behavior.
A thorough comparison of the mechanisms presented is not straightforward: each mechanism is designed to exploit specific characteristics.


SLIDE 34

References

N.B. Lakshminarayana, H. Kim, "Effect of Instruction Fetch and Memory Scheduling on GPU Performance", Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.
Y. Kim, H. Lee, J. Kim, "An Alternative Memory Access Scheduling in Manycore Accelerators", Intl. Conf. on Parallel Architectures and Compilation Techniques, 2011.


SLIDE 35

References

N.B. Lakshminarayana, J. Lee, H. Kim, J. Shin, "DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function", IEEE Computer Architecture Letters, vol. 11, no. 2, pp. 33-36, July-Dec. 2012.
R. Ausavarungnirun, K.K. Chang, L. Subramanian, G. Loh, O. Mutlu, "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems", Proc. of the International Symposium on Computer Architecture, 2012.


SLIDE 36

References

H. Kim, J. Lee, N. Lakshminarayana, J. Sim, J. Lim, T. Pho, "MacSim: A CPU-GPU Heterogeneous Simulation Framework. User Guide", Georgia Institute of Technology, 2012.
NVIDIA Corporation, "NVIDIA CUDA C Programming Guide", v4.2, 2012.
G. Diamos, A. Kerr, S. Yalamanchili, N. Clark, "Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems", PACT '10.
