 
IEE5008 – Autumn 2012 Memory Systems
Survey on the Off-Chip Scheduling of Memory Accesses in the Memory Interface of GPUs
Garrido Platero, Luis Angel
EECS Graduate Program, National Chiao Tung University
luis.garrido.platero@gmail.com
Luis Garrido, 2012

Outline
• Introduction
• Overview of GPU Architectures
• The SIMD Execution Model
• Memory Requests in GPUs
• Differences between GDDRx and DDRx
• State-of-the-art Memory Scheduling Techniques
• Effect of Instruction Fetch and Memory Scheduling on GPU Performance
• An Alternative Memory Access Scheduling in Manycore Accelerators
• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function
• Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
• Conclusion
• References
Luis Garrido NCTU IEE5008 Memory Systems 2012

Introduction
• Memory controllers are a critical performance bottleneck
• Scheduling policies must comply with the SIMD execution model
• Characteristics of the memory requests in GPU architectures
• Integration of GPU+CPU systems

Overview: SIMD execution model
[Figure: GPU block diagram. Labels recoverable from the slide: a grid of thread blocks (Block (0,0) to Block (2,2)), each made of threads (Thread (0,0) to Thread (2,2)); GigaThread Engine; per-core instruction cache, warp schedulers, register file (65536 x 32-bit), Core, DP Unit, LD/ST and Tex units; 64 KB Shared Memory / L1 Cache; 48 KB Read-Only Data Cache; interconnect network; L2 unified cache; memory controllers; memory.]

Overview: Memory requests in GPUs
• A varying number of accesses with different characteristics is generated simultaneously
• Load/Store units handle the data fetch
• Concept of memory coalescing
  • Accesses can be merged
  • Intra-core merging
  • Inter-core merging
  • Reduces the number of transactions
[Figure: accesses labeled A to F issued at steps i to i+6.]

Overview: Memory requests in GPUs
[Figure: access patterns of a warp over an address axis from 0 to 384 in steps of 32. Panels: a) coalesced accesses, b) permuted accesses, c) same-word accesses, d) misaligned accesses, e) scattered accesses (k transactions). The extracted labels carry two sets of transaction counts (1, 2, 4 and 6 transactions appear), apparently from two overlaid versions of the figure; the assignment of counts to panels is not fully recoverable.]
• The number of transactions depends on:
  • Parameters of the memory subsystem: number of cache levels, cache line size
  • The application's behavior
  • Scheduling policy and GDDRx controller capabilities

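The dependence of the transaction count on the access pattern and on the segment size can be sketched as follows. This is a minimal illustration, not the behavior of any specific GPU: the warp size of 32, the 4-byte word, the segment sizes of 128 and 32 bytes, and the helper name `count_transactions` are all assumptions made for the example.

```python
def count_transactions(addresses, segment_bytes=128):
    """Count the distinct aligned segments touched by one warp's byte addresses."""
    return len({addr // segment_bytes for addr in addresses})

WARP_SIZE = 32  # threads per warp (assumed)
WORD = 4        # bytes accessed per thread (assumed)

# Hypothetical access patterns for one warp.
patterns = {
    "coalesced": [tid * WORD for tid in range(WARP_SIZE)],
    "permuted": [tid * WORD for tid in reversed(range(WARP_SIZE))],
    "same word": [64] * WARP_SIZE,
    "misaligned": [96 + tid * WORD for tid in range(WARP_SIZE)],
    "scattered": [tid * 512 for tid in range(WARP_SIZE)],
}

for name, addrs in patterns.items():
    # Report the count for 128-byte and 32-byte segments.
    print(name, count_transactions(addrs, 128), count_transactions(addrs, 32))
```

Under these assumptions the permuted pattern costs the same as the coalesced one, because only the set of touched segments matters, while the scattered pattern needs one transaction per thread.
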
Differences between GDDRx and DDRx
• GDDRx can receive and send data in the same clock cycle
• More channels than DDRx
• Higher clock speeds in GDDRx
• Higher bandwidth demand
• Different power consumption characteristics:
  • GDDRx requires better heat dissipation

Memory Request Scheduling Problem
• Bandwidth limitation
• Desirable to increase:
  • Row-buffer locality
  • Bank-level parallelism
• Reduce negative interference among memory accesses
• Root cause of the problem: the massive number of simultaneous memory requests
• Memory scheduling mechanisms can partially alleviate this issue

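Row-buffer locality and bank-level parallelism can be made concrete with a small measurement sketch. The address mapping in `decode` (2 KB rows interleaved over 8 banks) is a hypothetical choice for illustration, and requests are assumed to be served strictly in order.

```python
from collections import namedtuple

Request = namedtuple("Request", ["bank", "row"])

def decode(addr, row_bytes=2048, num_banks=8):
    """Hypothetical mapping: consecutive rows are interleaved across banks."""
    row_index = addr // row_bytes
    return Request(bank=row_index % num_banks, row=row_index // num_banks)

def row_hit_rate(requests):
    """Fraction of requests that find their row already open (in-order service)."""
    open_row = {}  # bank -> currently open row
    hits = 0
    for req in requests:
        if open_row.get(req.bank) == req.row:
            hits += 1
        open_row[req.bank] = req.row
    return hits / len(requests) if requests else 0.0

def banks_touched(requests):
    """Number of distinct banks used, a rough proxy for bank-level parallelism."""
    return len({req.bank for req in requests})

# A streaming pattern versus a large-stride pattern (both made up).
streaming = [decode(a) for a in range(0, 8192, 128)]
strided = [decode(a) for a in range(0, 64 * 16384, 16384)]

print(row_hit_rate(streaming), banks_touched(streaming))
print(row_hit_rate(strided), banks_touched(strided))
```

With this mapping the streaming pattern reuses open rows and spreads over several banks, whereas the strided pattern keeps opening new rows in a single bank.
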
Effect of IF and memory scheduling
• Paper: N. B. Lakshminarayana, H. Kim, “Effect of Instruction Fetch and Memory Scheduling on GPU Performance”, in Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.
• Analyzed the effect of different instruction fetch (IF) and memory scheduling policies
• Seven instruction fetch policies: RR, ICOUNT, LRF, FAIR, ALL, BAR, MEM-BAR
• Four memory scheduling policies: FCFS, FR-FCFS, FR-FAIR, REMINST

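The difference between the two baseline policies named above can be sketched for a single DRAM bank. FCFS serves the oldest request; FR-FCFS serves the oldest request that hits the currently open row and falls back to the oldest request otherwise. The sketch assumes all requests are already queued at time zero and represents each request only by its row; FR-FAIR and REMINST are not modeled.

```python
def fcfs(queue, open_row):
    """Index of the oldest request (queue is ordered oldest first)."""
    return 0

def fr_fcfs(queue, open_row):
    """Index of the oldest row-hit request, else of the oldest request."""
    for i, row in enumerate(queue):
        if row == open_row:
            return i
    return 0

def service(rows, policy):
    """Serve requests to one bank; return (service order, number of row hits)."""
    queue = list(rows)
    open_row = None
    order, hits = [], 0
    while queue:
        row = queue.pop(policy(queue, open_row))
        if row == open_row:
            hits += 1
        open_row = row
        order.append(row)
    return order, hits

arrivals = ["A", "B", "A", "B", "A", "B"]  # made-up row sequence
print(service(arrivals, fcfs))
print(service(arrivals, fr_fcfs))
```

On this alternating sequence FCFS never hits the open row, while FR-FCFS drains all requests to row A before switching to row B.
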
Effect of IF and memory scheduling
• GPUOcelot used to evaluate performance
• Instruction count of warps:
  • Max-Warp to Min-Warp ratio
[Figure: panels a) and b).]

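One plausible reading of the Max-Warp to Min-Warp ratio is the instruction count of the longest warp divided by that of the shortest warp. The sketch below follows that reading; the function name and the sample counts are made up for illustration.

```python
def max_min_warp_ratio(instruction_counts):
    """Ratio of the largest to the smallest per-warp instruction count."""
    return max(instruction_counts) / min(instruction_counts)

# Hypothetical per-warp instruction counts for two kernels.
regular_kernel = [1000, 1000, 1000, 1000]
irregular_kernel = [400, 1000, 2500, 800]

print(max_min_warp_ratio(regular_kernel))
print(max_min_warp_ratio(irregular_kernel))
```

A ratio close to 1 indicates warps that execute similar amounts of work; larger ratios indicate asymmetric behavior.
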
Effect of IF and memory scheduling
• Most IF policies provide an improvement of about 3-6% on average, except for ICOUNT
• The same IF policies show no improvement from FCFS to FR-FCFS
• Memory-intensive applications improve with FR-FAIR and REMINST
• Adequate IF policies can increase memory access merging

Effect of IF and memory scheduling
• For asymmetric applications, FR-FCFS + RR gives the best performance
• No policy provides a good benefit for all asymmetric applications
• The application's characteristics outweigh the policy's effectiveness
• The regularity of the warps gives clues about the application's behavior
[Figure: panels a), b) and c).]

An Alternative Memory Access Scheduling
• Paper: Y. Kim, H. Lee, J. Kim, “An Alternative Memory Access Scheduling in Manycore Accelerators”, in Conf. on Parallel Architectures and Compilation Techniques, 2011.
• FR-FCFS requires a complex structure
• Scheduling decisions near the cores
• Avoid network congestion: mLCA algorithm

Algorithm 1: mLCA algorithm
At each core (c):
  if (all req in a warp satisfy r_c(m,b) = 0 and O(c) < t) then
    for all req within a warp do
      inject req;
      r_c(m,b)++;
      O(c)++;
    end for
  else
    Throttle
  end if

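The injection test of Algorithm 1 can be sketched in Python. The slide does not define its symbols, so this sketch assumes that r_c(m,b) counts the outstanding requests from core c to bank b of memory controller m, that O(c) counts all outstanding requests of core c, and that t is a threshold. The class name `CoreInjector` and the `complete` method, which releases a counter when a reply returns, are additions for illustration and do not appear on the slide.

```python
from collections import defaultdict

class CoreInjector:
    """Per-core throttle following the structure of Algorithm 1 (assumed semantics)."""

    def __init__(self, threshold):
        self.threshold = threshold        # t
        self.outstanding = 0              # O(c)
        self.per_bank = defaultdict(int)  # r_c(m, b), keyed by (m, b)

    def try_inject(self, warp_requests):
        """warp_requests: list of (m, b) destinations of one warp.
        Returns True if the whole warp was injected, False if throttled."""
        banks_free = all(self.per_bank[dest] == 0 for dest in warp_requests)
        if banks_free and self.outstanding < self.threshold:
            for dest in warp_requests:
                self.per_bank[dest] += 1
                self.outstanding += 1
            return True
        return False  # throttle: the warp retries later

    def complete(self, dest):
        """Assumed bookkeeping when a reply for dest returns (not on the slide)."""
        self.per_bank[dest] -= 1
        self.outstanding -= 1

injector = CoreInjector(threshold=4)
print(injector.try_inject([(0, 1), (0, 2)]))  # both banks idle
print(injector.try_inject([(0, 2), (1, 0)]))  # bank (0, 2) still busy
```

As in the pseudocode, the check is made once per warp before any counter is incremented, so a warp is either injected as a whole or throttled as a whole.
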
An Alternative Memory Access Scheduling
• Observation: locality information is affected by the NoC traffic
• Grouping multiple requests into superpackets
• Considering: order of requests, source and destination criteria
• Two variants: ICR and OCR
• GPGPUSim used for experiments
• 95% speedup: BIFO+mLCA

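The general idea of grouping requests into larger packets can be illustrated with a generic sketch that merges consecutive requests sharing the same source and destination while preserving their order. This is not the ICR or OCR scheme, whose details are not given on the slide; the request layout `(source_core, destination, address)` and the function name are assumptions.

```python
from itertools import groupby

def build_superpackets(requests):
    """Group consecutive requests with the same (source, destination).

    requests: list of (source_core, destination, address) in issue order.
    Returns a list of groups; the order inside and across groups is kept.
    """
    same_route = lambda req: (req[0], req[1])
    return [list(group) for _, group in groupby(requests, key=same_route)]

# Made-up request stream from two cores to two memory controllers.
stream = [
    (0, "MC0", 0x100),
    (0, "MC0", 0x140),
    (0, "MC1", 0x200),
    (1, "MC0", 0x300),
    (1, "MC0", 0x340),
]

for packet in build_superpackets(stream):
    print(len(packet), packet)
```

Keeping requests from one source to one destination together is one way to stop NoC traffic from interleaving them with other requests on the way to the memory controller.
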
An Alternative Memory Access Scheduling
• The authors did not explain the improvement obtained
• They did not characterize the applications evaluated
• Robustness and effectiveness are therefore difficult to assess
• Distributing the scheduling tasks at different points of the memory subsystem

Scheduling Based on a Potential Function
• Paper: N. B. Lakshminarayana, J. Lee, H. Kim, J. Shin, “DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function”, in Computer Architecture Letters, vol. 11, no. 2, pp. 33-36, July-Dec. 2012.
• Uses a potential function to model DRAM behavior and the scheduling policy, called α-SJF (α-Shortest Job First)
• Memory requests of the warps are treated as jobs
• Two features: large-scale multithreading and SIMD execution

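The idea of treating each warp's memory requests as a job can be illustrated with plain shortest-job-first ordering. This is not the α-SJF policy itself: the role of α and the potential function are not described on the slide, so they are left out. The sketch assumes one time unit per request and jobs served back to back.

```python
def sjf_order(pending):
    """pending: dict warp_id -> number of outstanding memory requests.
    Returns warp ids, shortest job first, ties broken by warp id."""
    return sorted(pending, key=lambda warp: (pending[warp], warp))

def average_completion_time(order, pending):
    """Mean finish time when each request takes one time unit."""
    clock, total = 0, 0
    for warp in order:
        clock += pending[warp]
        total += clock
    return total / len(order)

# Made-up job sizes for three warps.
pending = {0: 8, 1: 2, 2: 4}

arrival_order = [0, 1, 2]
shortest_first = sjf_order(pending)

print(arrival_order, average_completion_time(arrival_order, pending))
print(shortest_first, average_completion_time(shortest_first, pending))
```

Serving the warp with the fewest outstanding requests first lowers the average time at which a warp has all of its data, which matters under SIMD execution because a warp cannot proceed until all of its requests have returned.
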