IEE5008 Autumn 2012 Memory Systems: Survey on Memory Access Scheduling for On-Chip Cache Memories of GPGPUs


  1. IEE5008 Autumn 2012 Memory Systems
     Survey on Memory Access Scheduling for On-Chip Cache Memories of GPGPUs
     Garrido Platero, Luis Angel
     Department of Electronics Engineering, National Chiao Tung University
     luis.garrido.platero@gmail.com

  2. Outline
     - Introduction
     - Background of GPGPUs
       - Hardware
       - Programming and execution model
     - GPU Memory Hierarchy
       - Issues and limitations
     - Locality on GPUs

  3. Outline
     - State-of-the-art solutions for memory access scheduling on GPUs
       - Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
       - A GPGPU Compiler for Memory Optimization and Parallelism Management
       - Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
     - Conclusion: comparison and analysis
     - References

  4. Introduction
     - GPGPUs as a major focus of attention in the heterogeneous computing field
     - Design philosophy of GPGPUs
       - Bulk-synchronous programming model
       - Characteristics of applications
     - Design space of GPUs
       - Diverse and multi-variable
     - Major bottleneck: the memory hierarchy
       - Focus of this work

  5. Background of GPGPUs
     - Hardware of GPUs: general view

  6. Background of GPGPUs
     - Hardware of GPUs: SMX

  7. GPU Memory Hierarchy
     - Five memory spaces: constant, texture, shared, private (local), global
     - Issues and limitations of the memory hierarchy
       - Caches are small
       - Bandwidth, not latency, is the concern
       - How to make good use of the memory resources? (the sketch below shows where each space appears in code)
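For concreteness, here is a minimal CUDA sketch of how four of these spaces appear in kernel code (texture memory is omitted for brevity; the kernel and variable names are illustrative, not from the slides). CUDA's term for the private space is "local" memory, which for scalars usually lives in registers:

```cuda
__constant__ float coeff[16];   // constant memory: off-chip, cached, read-only

__global__ void spaces_demo(const float *gdata, float *gout, int n) {
    __shared__ float tile[256]; // shared memory: on-chip, per block (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? gdata[i] : 0.0f;  // global memory: off-chip DRAM
    __syncthreads();                                // every thread reaches the barrier

    float acc = tile[threadIdx.x] * coeff[threadIdx.x % 16];  // acc: private/local, a register
    if (i < n) gout[i] = acc;
}
```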

  8. Locality on GPUs
     - Locality on GPUs is defined in close relation to the programming model
       - Why?
     - Locality on GPUs as:
       - Within-warp locality
       - Within-block locality (cross-warp)
       - Cross-instruction data reuse
     - A locality model suited to GPUs (illustrated below)
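A small illustrative kernel (hypothetical, not from the survey) makes the three locality types concrete:

```cuda
// Assumes blockDim.x == 256 and n a multiple of 256, to keep the barrier simple.
__global__ void locality_demo(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Within-warp locality: the 32 threads of a warp load 32 consecutive
    // floats, i.e. addresses within the same 128-byte region.
    float x = in[i];

    // Within-block (cross-warp) locality: after staging, each thread reads
    // an element written by a thread of a different warp in the same block.
    tile[threadIdx.x] = x;
    __syncthreads();
    float neighbor = tile[(threadIdx.x + 32) % blockDim.x];

    // Cross-instruction data reuse: the same thread touches in[i] again in a
    // later instruction; with a cache, the second access can hit.
    float y = in[i] * 0.5f;

    out[i] = x + neighbor + y;
}
```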

  9. State-of-the-Art Solutions
     - Many techniques: an exhaustive survey would be huge
       - This survey brings forward the most recent and most explanatory ones
     - Types of techniques
       - Static: rely on compiler techniques applied before execution
       - Dynamic: rely on architectural enhancements
       - Static + dynamic: combine the best of both approaches

  10. State-of-the-Art Solutions
      - Three papers in this survey:
        - Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
        - A GPGPU Compiler for Memory Optimization and Parallelism Management
        - Characterizing and Improving the Use of Demand-Fetched Caches in GPUs

  11. Shared Memory Bank Conflicts
      - Paper: Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
      - Objective of the mechanism: avoid on-chip memory bank conflicts
        - On-chip memories of GPUs are heavily banked
      - Impact on performance: varying latencies as a result of memory bank conflicts (see the example below)
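The textbook form of the problem (my illustration, not the paper's code): with 32 four-byte-wide banks, a warp walking a column of a 32x32 shared tile hits one bank 32 times and the access serializes; one word of padding per row restores parallel access:

```cuda
#define TILE 32

// Launch with a 32x32 thread block. bad[][] walks a column with a stride of
// 32 words, so all 32 threads of a warp hit the same bank; good[][] pads each
// row by one word, so the same walk touches 32 different banks.
__global__ void conflict_demo(float *out) {
    __shared__ float bad[TILE][TILE];
    __shared__ float good[TILE][TILE + 1];

    bad[threadIdx.y][threadIdx.x]  = (float)threadIdx.x;
    good[threadIdx.y][threadIdx.x] = (float)threadIdx.x;
    __syncthreads();

    float a = bad[threadIdx.x][threadIdx.y];   // 32-way bank conflict
    float b = good[threadIdx.x][threadIdx.y];  // conflict-free
    out[threadIdx.y * TILE + threadIdx.x] = a + b;
}
```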

  12. Shared Memory Bank Conflicts
      - This work makes the following contributions:
        - A careful analysis of the impact of GPU on-chip shared memory bank conflicts
        - A novel elastic pipeline design to alleviate on-chip shared memory bank conflicts
        - A co-designed bank-conflict-aware warp scheduling technique to assist the elastic pipeline
        - Pipeline stall reduction of up to 64%, leading to overall system performance improvement

  13. Shared Memory Bank Conflicts
      - The core keeps per-warp information
      - Warp instructions are issued in round-robin order
      - A bank conflict has two consequences:
        - It blocks the upstream pipeline
        - It introduces a bubble into the M1 stage

  14. Shared Memory Bank Conflicts
      - Elastic pipeline: two modifications
        - Buses to allow forwarding
        - Turning the two NONMEMx stages into a FIFO
      - Modify the warp scheduling policy

  15. Shared Memory Bank Conflicts
      - Avoid the issues due to out-of-order instruction commit
      - Necessary to know whether future instructions will create conflicts
      - Prevent the queues from saturating
      - A mechanism for modifying the scheduling policy

  16. Shared Memory Bank Conflicts
      - Experimentation platform: GPGPU-Sim, a cycle-accurate GPU simulator
      - Three categories of pipeline stalls:
        - Warp scheduling failures
        - Shared memory bank conflicts
        - Other

  17. Shared Memory Bank Conflicts
      - Core scheduling can fail when stalls are not hidden by parallel execution:
        - Barrier synchronization
        - Warp control flow
      - Performance comparison:
        - Baseline GPU
        - GPU + elastic pipeline
        - GPU + elastic pipeline + the proposed scheduler
        - Theoretical limit

  18. Memory Opt. and Parallelism Management
      - Paper: A GPGPU Compiler for Memory Optimization and Parallelism Management
      - Two major challenges:
        - Effective utilization of the GPU memory hierarchy
        - Judicious management of parallelism
      - Compiler-based approach:
        - Analyzes memory access patterns
        - Applies an optimization procedure

  19. Memory Opt. and Parallelism Management
      - Considers an appropriate way to parallelize an application
        - Why compiler techniques?
      - Purposes of the optimization procedure:
        - Increase memory coalescing
        - Improve usage of shared memory
        - Balance the amount of parallelism against the memory optimizations
        - Distribute memory traffic among the different off-chip memory partitions

  20. Memory Opt. and Parallelism Management
      - Input: naïve kernel code; output: optimized kernel code
      - Re-schedules threads depending on memory access behavior
      - Uncoalesced to coalesced: through shared memory (see the sketch below)
      - The compiler analyzes data reuse and assesses the benefit of using shared memory
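The classic instance of this transformation is a matrix transpose, sketched here by hand for illustration (assuming width is a multiple of the tile size): the tile is staged through shared memory so that both the global load and the global store stay coalesced:

```cuda
#define TILE 32

// Launch with 32x32 thread blocks over a width x width matrix.
__global__ void transpose_coalesced(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE + 1];  // padded to avoid bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}
```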

  21. Memory Opt. and Parallelism Management
      - When data sharing is detected:
        - Merge thread blocks, or
        - Merge threads (illustrated below)
      - When to merge thread blocks rather than just threads?
      - For the experiments:
        - Various GPU descriptions
        - Cetus: a source-to-source compiler infrastructure
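A hypothetical sketch of thread merging (names and the MERGE factor are mine): one thread takes over the work of MERGE original threads, so an operand they shared is fetched once instead of MERGE times, while strided indexing keeps every iteration coalesced across the warp:

```cuda
#define MERGE 4

__global__ void merged_threads(const float *in, float *out, int n) {
    // Each thread covers MERGE elements, spaced blockDim.x apart so that the
    // warp's accesses remain contiguous on every loop iteration.
    int base = blockIdx.x * blockDim.x * MERGE + threadIdx.x;
    float coef = in[0];                // shared operand: loaded once, reused MERGE times
    #pragma unroll
    for (int k = 0; k < MERGE; ++k) {  // the work of MERGE former threads
        int i = base + k * blockDim.x;
        if (i < n) out[i] = coef * in[i];
    }
}
```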

  22. Memory Opt. and Parallelism Management
      - Main limitation: the compiler cannot change the algorithm's structure
      - Biggest performance increases come from:
        - Merging of threads
        - Merging of thread blocks

  23. Use of Demand-Fetched Caches
      - Paper: Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
      - Caches are highly configurable
      - Main contributions:
        - A characterization of application performance, with a taxonomy
        - An algorithm to identify an application's memory access patterns
      - Does the presence of a cache help or hurt performance?

  24. Use of Demand-Fetched Caches
      - It is bandwidth, not latency
      - Estimated traffic: what does it indicate?
      - A block runs on a single SM
        - Remember: L1 caches are not coherent
      - Configurability of the L1 caches (see the example below):
        - Capacity
        - On/off
        - Cached or not?
      - Kernels are analyzed independently
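These knobs correspond to real CUDA controls of that (Fermi-era) generation: a runtime call selects the L1/shared-memory capacity split per kernel, and a compiler flag bypasses L1 for global loads altogether (the kernel name here is illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float *p) { p[threadIdx.x] += 1.0f; }

void configure() {
    // Capacity: prefer the 48 KB L1 / 16 KB shared split for this kernel
    // (cudaFuncCachePreferShared selects the reverse split).
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
}

// On/off for global loads is a compile-time choice:
//   nvcc -Xptxas -dlcm=cg file.cu   // cache global loads in L2 only, bypassing L1
```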

  25. Use of Demand-Fetched Caches
      - Classification of kernels into three groups:
        - Texture and constant
        - Shared memory
        - Global memory

  26. Use of Demand-Fetched Caches
      - No correlation between hit rates and performance
      - Analysis of the traffic generated from L2 to L1
      - The impact of cache line size:
        - Only a fraction of each line may be needed from DRAM, yet the whole line is brought into L1
        - E.g., a warp whose 32 threads each load 4 bytes with a 128-byte stride touches 32 distinct 128-byte lines, moving 4 KB into L1 for 128 useful bytes

  27. Use of Demand-Fetched Caches
      - Changes in L2-to-L1 memory traffic
      - The algorithm can reveal the access patterns:
        - Maps memory addresses to thread IDs
        - Estimates traffic
        - Determines which instructions to cache

  28. Use of Demand-Fetched Caches
      - Steps of the algorithm:
        - Analyze access patterns, based on the thread ID
        - Estimate memory traffic (the number of bytes moved per thread)
        - Determine which instructions will use the cache
      - How to decide whether or not to cache an instruction? (see the sketch below)
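A host-side sketch of the decision rule for a single load instruction. This is my reconstruction, not the paper's code: it assumes cache-on traffic is counted in 128-byte L1 lines and cache-off traffic in 32-byte segments; the "unknown addresses" case from the next slide would fall back to a default policy:

```cuda
#include <set>
#include <cstdint>

enum Decision { CACHE_ON, CACHE_OFF };

// warp_addresses: the byte addresses one load instruction touches across a warp.
Decision decide(const std::set<uint64_t> &warp_addresses) {
    std::set<uint64_t> lines, segments;
    for (uint64_t a : warp_addresses) {
        lines.insert(a / 128);     // cache-on granularity: whole 128-byte L1 line
        segments.insert(a / 32);   // cache-off granularity: 32-byte segment
    }
    uint64_t on  = lines.size() * 128;    // bytes moved with the cache on
    uint64_t off = segments.size() * 32;  // bytes moved with the cache off
    return (on <= off) ? CACHE_ON : CACHE_OFF;  // cache only when it does not inflate traffic
}
```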

  29. Use of Demand-Fetched Caches
      - Four possible cases:
        - Cache-on traffic = cache-off traffic
        - Cache-on traffic < cache-off traffic
        - Cache-on traffic > cache-off traffic
        - Unknown access addresses

  31. Use of Demand-Fetched Caches
      - Keeping caches on for all kernels: 5.8% improvement on average
      - Conservative strategy: 16.9%
      - Aggressive approach: 18.0%
