 
IEE5008 – Autumn 2012, Memory Systems
Survey on Memory Access Scheduling for On-Chip Cache Memories of GPGPUs
Garrido Platero, Luis Angel
Department of Electronics Engineering, National Chiao Tung University
luis.garrido.platero@gmail.com
Luis Garrido 2012

Outline
- Introduction
- Background of GPGPUs
  - Hardware
  - Programming and execution model
- GPU Memory Hierarchy
  - Issues and limitations
- Locality on GPUs
Luis Garrido NCTU IEE5008 Memory Systems 2012

Outline
- State-of-the-art solutions for memory access scheduling on GPUs
  - Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
  - A GPGPU Compiler for Memory Optimizations and Parallelism Management
  - Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
- Conclusion: comparing and analyzing
- References

Introduction
- GPGPUs are a major focus of attention in the heterogeneous computing field
- Design philosophy of GPGPUs
  - Bulk-synchronous programming model
  - Characteristics of applications
- Design space of GPUs
  - Diverse and multi-variable
  - Major bottleneck: the memory hierarchy
- Focus of this work

Background of GPGPUs
- Hardware of GPUs: general view

Background of GPGPUs
- Hardware of GPUs: SMX

GPU Memory Hierarchy
- Five different memory spaces
  - Constant
  - Texture
  - Shared
  - Private
  - Global
- Issues and limitations of the memory hierarchy
  - Caches are small
  - The main concern is bandwidth, not latency
- How to make good use of memory resources?

Locality on GPUs
- Locality on GPUs is defined in close relation to the programming model
  - Why?
- Locality on GPUs takes three forms:
  - Within-warp locality
  - Within-block locality (cross-warp)
  - Cross-instruction data reuse
- A locality model suited to GPUs

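The three forms of locality can be made concrete with a small trace analysis. This is a minimal sketch under one possible reading of the three terms; the trace format, the 128-byte line size, and the name `locality_profile` are all assumptions made for illustration, not definitions from the surveyed papers.

```python
from collections import defaultdict

LINE_BYTES = 128  # assumption: cache-line size in bytes


def locality_profile(trace):
    """trace: iterable of (block, warp, thread, pc, byte_address) tuples.

    Returns how many (context, cache line) groups show each kind of reuse.
    """
    threads = defaultdict(set)   # (block, warp, pc, line) -> threads touching it
    warps = defaultdict(set)     # (block, pc, line)       -> warps touching it
    pcs = defaultdict(set)       # (block, line)           -> instructions touching it
    for block, warp, thread, pc, addr in trace:
        line = addr // LINE_BYTES
        threads[(block, warp, pc, line)].add(thread)
        warps[(block, pc, line)].add(warp)
        pcs[(block, line)].add(pc)
    return {
        "within_warp": sum(len(t) > 1 for t in threads.values()),
        "within_block": sum(len(w) > 1 for w in warps.values()),
        "cross_instruction": sum(len(p) > 1 for p in pcs.values()),
    }


example_trace = [
    (0, 0, 0, 0, 0),      # warp 0, thread 0, instruction 0, line 0
    (0, 0, 1, 0, 4),      # same warp and instruction, same line: within-warp
    (0, 1, 32, 0, 64),    # another warp of the block, same line: within-block
    (0, 0, 0, 1, 8),      # a later instruction, same line: cross-instruction
    (0, 0, 0, 2, 4096),   # a line nobody else touches
]
print(locality_profile(example_trace))
```

Each counter only looks at reuse inside one thread block, which mirrors the point that locality is tied to the programming model: a block runs on one SM, so only reuse within a block can be captured by that SM's on-chip memories.
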
State of the Art Solutions
- Many techniques exist
  - An exhaustive survey would be huge
  - This survey presents the most recent ones and the ones that explain the most
- Types of techniques
  - Static: rely on compiler techniques, applied before execution
  - Dynamic: rely on architectural enhancements
  - Static + dynamic: combine the strengths of both approaches

State of the Art Solutions
- Three papers in this survey:
  - Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
  - A GPGPU Compiler for Memory Optimizations and Parallelism Management
  - Characterizing and Improving the Use of Demand-Fetched Caches in GPUs

Shared Memory Bank Conflicts
- Paper: Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
- Objective of the mechanism:
  - Avoid on-chip memory bank conflicts
  - On-chip memories of GPUs are heavily banked
- Impact on performance:
  - Latencies vary as a result of memory bank conflicts

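To see why banking makes latency vary, the sketch below counts how many serialized passes one warp-wide shared-memory access needs. It assumes 32 banks that are 4 bytes wide and that threads reading the same word are served by a broadcast; these parameters and the name `conflict_degree` are illustrative assumptions, not values taken from the paper.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption: 32 shared-memory banks
WORD_BYTES = 4   # assumption: each bank is 4 bytes wide


def conflict_degree(byte_addresses):
    """Return the number of serialized passes a warp's access needs.

    1 means conflict-free; k means some bank is asked for k distinct words.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        words_per_bank[word % NUM_BANKS].add(word)  # same word counts once (broadcast)
    return max((len(words) for words in words_per_bank.values()), default=0)


unit_stride = [4 * t for t in range(32)]     # every thread hits a different bank
stride_two = [8 * t for t in range(32)]      # threads t and t + 16 share a bank
stride_32 = [128 * t for t in range(32)]     # every thread hits bank 0

print(conflict_degree(unit_stride), conflict_degree(stride_two), conflict_degree(stride_32))
# prints: 1 2 32
```

The same instruction can therefore take anywhere from 1 to 32 passes depending only on the address pattern, which is the varying latency the slide refers to.
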
Shared Memory Bank Conflicts
- This work makes the following contributions:
  - Careful analysis of the impact of GPU on-chip shared memory bank conflicts
  - A novel Elastic Pipeline design to alleviate on-chip shared memory bank conflicts
  - A co-designed bank-conflict-aware warp scheduling technique to assist the Elastic Pipeline
  - Pipeline stalls reduced by up to 64%, improving overall system performance

Shared Memory Bank Conflicts
- The core keeps information about its warps
- Warp instructions are issued in round-robin order
- A conflict has two consequences:
  - It blocks the upstream pipeline
  - It introduces a bubble into the M1 stage

Shared Memory Bank Conflicts
- Elastic Pipeline
  - Two modifications
    - Buses to allow forwarding
    - Turning the two NONMEMx stages into a FIFO
  - Modify the warp scheduling policy

Shared Memory Bank Conflicts
- Avoid the issues caused by out-of-order instruction commit
- Necessary to know whether future instructions will create conflicts
- Prevent the queues from saturating
- Mechanism for modifying the scheduling policy

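The effect of decoupling the memory stage from the rest of the pipeline can be illustrated with a toy issue-stall model. This is not the paper's pipeline: it is a deliberately simplified sketch in which an instruction stream is a list of integers (0 for a non-memory instruction, d >= 1 for a shared-memory access with conflict degree d), queue capacity is ignored, and both function names are hypothetical.

```python
def baseline_stalls(stream):
    """Rigid pipeline: a degree-d conflict blocks issue for d - 1 cycles."""
    return sum(d - 1 for d in stream if d > 0)


def decoupled_stalls(stream):
    """Toy elastic model: issue stalls only when a memory instruction finds
    the memory stage still busy; non-memory instructions flow past it."""
    cycle = 0         # cycle at which the next instruction can issue
    mem_free_at = 0   # first cycle at which the memory stage is free again
    stalls = 0
    for d in stream:
        if d == 0:                         # non-memory instruction
            cycle += 1
            continue
        start = max(cycle, mem_free_at)    # wait if the memory stage is busy
        stalls += start - cycle
        mem_free_at = start + d            # a degree-d access occupies d cycles
        cycle = start + 1
    return stalls


interleaved = [2, 0, 1, 0, 0, 4, 0, 0, 0, 1]   # conflicts separated by other work
back_to_back = [4, 2]                          # two conflicting accesses in a row

print(baseline_stalls(interleaved), decoupled_stalls(interleaved))    # prints: 4 0
print(baseline_stalls(back_to_back), decoupled_stalls(back_to_back))  # prints: 4 3
```

In this toy model the decoupled pipeline hides every conflict when other instructions sit between the conflicting accesses, but gains little when conflicts arrive back to back. That gap is one way to read the need for a scheduler that knows whether upcoming instructions will conflict.
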
Shared Memory Bank Conflicts
- Experimentation platform: GPGPU-Sim
  - Cycle-accurate simulator for GPUs
- Three categories of pipeline stalls:
  - Warp scheduling fails
  - Shared memory bank conflicts
  - Other

Shared Memory Bank Conflicts
- Core scheduling can fail because of:
  - Latency not hidden by parallel execution
  - Barrier synchronization
  - Warp control flow
- Performance enhancement
  - GPU
  - GPU + Elastic Pipeline
  - GPU + Elastic Pipeline + novel scheduler
  - Theoretical

Memory Opt. and Parallelism Management
- Paper: A GPGPU Compiler for Memory Optimizations and Parallelism Management
- Two major challenges:
  - Effective utilization of the GPU memory hierarchy
  - Judicious management of parallelism
- Compiler-based approach:
  - Analyzes memory access patterns
  - Applies an optimization procedure

Memory Opt. and Parallelism Management
- Consider an appropriate way to parallelize an application
- Why compiler techniques?
- Purposes of the optimization procedure
  - Increase memory coalescing
  - Improve usage of shared memory
  - Balance the amount of parallelism against memory optimization
  - Distribute memory traffic among the different off-chip memory partitions

Memory Opt. and Parallelism Management
- Input: naïve kernel code
- Output: optimized kernel code
- Re-schedules threads
  - Depending on memory access behavior
- Uncoalesced to coalesced accesses:
  - Through shared memory
- Compiler analyzes data reuse
  - Assesses the benefit of using shared memory

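The gain from turning an uncoalesced access into a coalesced one can be estimated by counting the memory segments one warp touches. The sketch below assumes 128-byte transactions, 4-byte elements, and a row-major n x n matrix; the two address patterns and all function names are illustrative assumptions, not output of the compiler described in the paper.

```python
SEGMENT_BYTES = 128   # assumption: one global-memory transaction covers 128 bytes
WARP_SIZE = 32
ELEM_BYTES = 4        # assumption: 4-byte (float) elements


def transactions(byte_addresses):
    """Number of distinct memory segments touched by one warp-wide request."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


def naive_addresses(n, k):
    """Thread t reads A[t][k]: neighbouring threads are n elements apart."""
    return [(t * n + k) * ELEM_BYTES for t in range(WARP_SIZE)]


def staged_addresses(n, row):
    """Thread t reads A[row][t]: consecutive elements, as when a warp
    cooperatively copies one tile row into shared memory."""
    return [(row * n + t) * ELEM_BYTES for t in range(WARP_SIZE)]


n = 1024
print(transactions(naive_addresses(n, 0)))    # prints: 32
print(transactions(staged_addresses(n, 5)))   # prints: 1
```

After the tile has been staged with coalesced loads, each thread can read its column from shared memory, where strided access does not generate extra off-chip transactions.
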
Memory Opt. and Parallelism Management
- When data sharing is detected:
  - Merge thread blocks
  - Merge threads
- When to merge thread blocks rather than just threads?
- For experiments:
  - Various GPU descriptions
  - Cetus: source-to-source compiler

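A simple load count shows why merging threads that share data pays off. The example below uses matrix multiplication, C[i][j] = sum over k of A[i][k] * B[k][j], chosen here purely for illustration; the function names and the counting model (load instructions only, no caching effects) are assumptions.

```python
def loads_without_merge(k_len, outputs):
    """Each thread computes one C[i][j] and loads A[i][k] and B[k][j] for every k."""
    return outputs * 2 * k_len


def loads_with_thread_merge(k_len, outputs, merge):
    """One merged thread computes `merge` outputs of the same row i:
    A[i][k] is loaded once per k and kept in a register, while B[k][j]
    is still loaded once per output."""
    if outputs % merge != 0:
        raise ValueError("outputs must be a multiple of merge")
    merged_threads = outputs // merge
    return merged_threads * (1 + merge) * k_len


before = loads_without_merge(k_len=256, outputs=1024)
after = loads_with_thread_merge(k_len=256, outputs=1024, merge=4)
print(before, after, after / before)   # prints: 524288 327680 0.625
```

Thread merge keeps the shared operand in registers, whereas thread-block merge shares it through shared memory; in both cases the saving comes from loading the shared data once instead of once per thread. The price is fewer threads, which is the balance between parallelism and memory optimization that the compiler has to manage.
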
Memory Opt. and Parallelism Management
- Main limitation: cannot change the algorithm structure
- Biggest performance increases:
  - Merging of threads
  - Merging of thread blocks

Use of Demand-Fetched Caches
- Paper: Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
- Caches are highly configurable
- Main contributions:
  - Characterization of application performance
    - Provides a taxonomy
  - Algorithm to identify an application's memory access patterns
- Does the presence of a cache help or hurt performance?

Use of Demand-Fetched Caches
- The issue is bandwidth, not latency
- Estimated traffic: what does it indicate?
- A block runs on a single SM
  - Remember, L1 caches are not coherent
- Configurability of L1 caches
  - Capacity
  - On/off
  - Cached or not?
- Kernels are analyzed independently

Use of Demand-Fetched Caches
- Classification of kernels into three groups:
  - Texture and constant
  - Shared memory
  - Global memory

Use of Demand-Fetched Caches
- No correlation between hit rates and performance
- Analysis of the traffic generated from L2 to L1
- The impact of cache line size:
  - A fraction is fetched from DRAM
  - The whole line is fetched into L1

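The line-size effect can be estimated for a single warp-wide load by comparing the bytes moved at two granularities. The sketch assumes that a cached load fetches whole 128-byte lines into L1 while an uncached load is served in 32-byte segments; these sizes and the function names are assumptions for illustration, and reuse across loads is deliberately ignored.

```python
L1_LINE_BYTES = 128     # assumption: granularity when L1 caching is on
L2_SEGMENT_BYTES = 32   # assumption: granularity when L1 is bypassed


def warp_traffic(byte_addresses, granularity):
    """Bytes moved from L2 for one warp-wide load at the given granularity."""
    return len({addr // granularity for addr in byte_addresses}) * granularity


def on_off_traffic(byte_addresses):
    """Return (cache-on bytes, cache-off bytes) for one warp-wide load."""
    return (warp_traffic(byte_addresses, L1_LINE_BYTES),
            warp_traffic(byte_addresses, L2_SEGMENT_BYTES))


coalesced = [4 * t for t in range(32)]     # 32 consecutive 4-byte words
strided = [128 * t for t in range(32)]     # one word from each 128-byte line

print(on_off_traffic(coalesced))   # prints: (128, 128)
print(on_off_traffic(strided))     # prints: (4096, 1024)
```

For the strided pattern the cached configuration moves four times as many bytes, most of them unused, which is consistent with hit rate alone being a poor predictor of performance when bandwidth is the bottleneck.
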
Use of Demand-Fetched Caches
- Changes in L2-to-L1 memory traffic
- The algorithm can reveal the access patterns:
  - Memory address <-> thread ID
  - Estimates traffic
  - Determines which instructions to cache

Use of Demand-Fetched Caches
- Steps of the algorithm:
  - Analyze access patterns
  - Estimate memory traffic
  - Determine which instructions will use the cache
- Based on the thread ID
- Memory traffic <-> number of bytes per thread
- How to decide whether or not to cache an instruction?

Use of Demand-Fetched Caches
- Four possible cases:
  - Cache-on traffic = cache-off traffic
  - Cache-on traffic < cache-off traffic
  - Cache-on traffic > cache-off traffic
  - Unknown access addresses

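The four cases can be written down as a small classifier over two traffic estimates. The classification follows the slide directly; the `should_cache` policy that maps a case to a decision is a hypothetical example, and the `cache_when_unknown` knob is an illustrative assumption rather than a parameter from the paper.

```python
from enum import Enum


class Case(Enum):
    EQUAL = "cache-on traffic = cache-off traffic"
    LESS = "cache-on traffic < cache-off traffic"
    MORE = "cache-on traffic > cache-off traffic"
    UNKNOWN = "unknown access addresses"


def classify(cache_on_bytes, cache_off_bytes):
    """Map two traffic estimates (None when addresses are unknown) to a case."""
    if cache_on_bytes is None or cache_off_bytes is None:
        return Case.UNKNOWN
    if cache_on_bytes == cache_off_bytes:
        return Case.EQUAL
    return Case.LESS if cache_on_bytes < cache_off_bytes else Case.MORE


def should_cache(case, cache_when_unknown):
    """Hypothetical policy: cache unless caching is estimated to add traffic.

    `cache_when_unknown` is an illustrative knob, not taken from the paper.
    """
    if case is Case.UNKNOWN:
        return cache_when_unknown
    return case is not Case.MORE


print(classify(128, 128).name, classify(4096, 1024).name, classify(None, 1024).name)
# prints: EQUAL MORE UNKNOWN
```

How the unknown case is resolved is the natural place for a cautious and a more optimistic variant of such a policy to differ; the sketch leaves that choice to the caller.
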
Use of Demand-Fetched Caches
- Keeping caches on for all kernels: 5.8% improvement on average
- Conservative strategy: 16.9%
- Aggressive approach: 18.0%