Luis Garrido 2012
IEE5008 Autumn 2012 Memory Systems
Survey on Memory Access Scheduling for On-Chip Cache Memories of GPGPUs
Garrido Platero, Luis Angel
Department of Electronics Engineering, National Chiao Tung University
luis.garrido.platero@gmail.com
NCTU IEE5008 Memory Systems 2012 Luis Garrido
Outline
Introduction
Background of GPGPUs
Hardware
Programming and execution model
GPU Memory Hierarchy
Issues and limitations
Locality on GPUs
Outline
State-of-the-art solutions for memory access scheduling on GPUs
Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
A GPGPU Compiler for Memory Optimization and Parallelism Management
Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
Conclusion: comparing and analyzing
References
Introduction
GPGPUs as a major focus of attention in the heterogeneous computing field
Design philosophy of GPGPUs
Bulk-synchronous programming model
Characteristics of applications
Design space of GPUs
Diverse and multi-variable
Major bottleneck: the memory hierarchy
Focus of this work
Background of GPGPUs
Hardware of GPUs: general view
Background of GPGPUs
Hardware of GPUs: SMX
GPU’s Memory Hierarchy
Five different memory spaces:
Constant
Texture
Shared
Private
Global
Issues and limitations of the memory hierarchy:
Caches of small size
The main concern is bandwidth, not latency
How to make good use of the memory resources?
Locality on GPUs
Locality in GPUs is defined in close relation to the programming model
Why?
Locality in GPUs as:
Within-warp locality
Within-block locality (cross-warp)
Cross-instruction data reuse
Model suited for GPUs
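These locality classes can be made concrete with a small sketch. The 32-thread warp and the 128-byte cache line are common GPU parameters assumed here for illustration; they are not taken from the slides:

```python
# Classify within-warp locality of a memory access pattern for a
# GPU-like thread hierarchy: 32 threads per warp, 128-byte lines.
WARP_SIZE = 32
LINE_SIZE = 128

def lines_touched(addresses):
    """Set of distinct cache lines covered by a list of byte addresses."""
    return {addr // LINE_SIZE for addr in addresses}

def within_warp_locality(addr_of_tid, warp_id):
    """Number of distinct lines touched by one warp: fewer lines
    than threads means within-warp (intra-warp) locality."""
    tids = range(warp_id * WARP_SIZE, (warp_id + 1) * WARP_SIZE)
    return len(lines_touched([addr_of_tid(t) for t in tids]))

# Threads reading consecutive 4-byte words: warp 0 covers bytes
# 0..127, i.e. a single cache line (perfect within-warp locality).
print(within_warp_locality(lambda tid: tid * 4, warp_id=0))    # 1

# A 128-byte stride per thread touches 32 distinct lines: no
# within-warp locality at all.
print(within_warp_locality(lambda tid: tid * 128, warp_id=0))  # 32
```

Within-block (cross-warp) locality would be measured the same way, but over the line sets of different warps of one thread block.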
State of the Art Solutions
Many techniques
An exhaustive survey would be huge
This one brings forward the most recent techniques and those that explain the field best
Types of techniques
Static: rely on compiler techniques, applied before execution
Dynamic: rely on architectural enhancements
Static + dynamic: combine the best of both approaches
State of the Art Solutions
Three papers in this survey:
Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
A GPGPU Compiler for Memory Optimization and Parallelism Management
Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
Shared Memory Bank Conflicts
Paper: Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
Objective of the mechanism:
Avoid on-chip memory bank conflicts
On-chip memories of GPUs are heavily banked
Impact on performance:
Varying latencies as a result of memory bank conflicts
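How a bank conflict serializes a warp's access can be illustrated with a short sketch. The 32-bank, 4-byte-word organization below matches common NVIDIA shared memories but is an assumption for illustration, not a detail taken from the paper:

```python
from collections import Counter

NUM_BANKS = 32   # shared-memory banks (assumed, NVIDIA-like)
WORD_SIZE = 4    # bytes per bank word

def bank_conflict_degree(addresses):
    """Maximum number of threads in a warp that map to the same bank.
    Degree 1 is conflict-free (one cycle); degree k serializes the
    access into k cycles, producing the varying latencies above."""
    banks = Counter((addr // WORD_SIZE) % NUM_BANKS for addr in addresses)
    return max(banks.values())

# Unit-stride access: every thread hits a different bank -> degree 1.
print(bank_conflict_degree([tid * 4 for tid in range(32)]))    # 1

# A stride of 32 words sends all 32 threads to bank 0 -> degree 32,
# the worst case: the access is fully serialized.
print(bank_conflict_degree([tid * 128 for tid in range(32)]))  # 32
```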
Shared Memory Bank Conflicts
This work makes the following contributions:
A careful analysis of the impact of GPU on-chip shared memory bank conflicts
A novel elastic pipeline design that alleviates on-chip shared memory bank conflicts
A co-designed, bank-conflict-aware warp scheduling technique to assist the elastic pipeline
A reduction in pipeline stalls of up to 64%, leading to overall system performance improvement
Shared Memory Bank Conflicts
The core contains information about the warps
Instructions of warps are issued in round-robin order
A conflict has two consequences:
It blocks the upstream pipeline
It introduces a bubble into the M1 stage
Shared Memory Bank Conflicts
Elastic pipeline: two modifications
Buses to allow forwarding
Turning the two-stage NONMEMx into a FIFO
Modifying the scheduling policy of warps
Shared Memory Bank Conflicts
Avoid the issues due to out-of-order instruction commit
It is necessary to know whether future instructions will create conflicts
Prevent the queues from saturating
A mechanism for modification of the scheduling policy
Shared Memory Bank Conflicts
Experimentation platform: GPGPU-Sim
A cycle-accurate simulator for GPUs
Three categories of pipeline stalls:
Warp scheduling failures
Shared memory bank conflicts
Other
Shared Memory Bank Conflicts
Core scheduling can fail when:
Stalls are not hidden by parallel execution
Barrier synchronization
Warp control flow
Performance enhancement compared across:
GPU
GPU + elastic pipeline
GPU + elastic pipeline + novel scheduler
Theoretical
Memory Opt. and Parallelism Management
Paper: A GPGPU Compiler for Memory Optimization and Parallelism Management
Two major challenges:
Effective utilization of the GPU memory hierarchy
Judicious management of parallelism
Compiler-based:
Analyzes memory access patterns
Applies an optimization procedure
Memory Opt. and Parallelism Management
Consider an appropriate way to parallelize an application
Why compiler techniques?
Purposes of the optimization procedure:
Increase memory coalescing
Improve usage of shared memory
Balance the amount of parallelism against memory optimizations
Distribute memory traffic among different off-chip memory partitions
Memory Opt. and Parallelism Management
Input: naïve kernel code
Output: optimized kernel code
Re-schedules threads
Depending on memory access behavior
Uncoalesced to coalesced:
Through shared-memory
Compiler analyzes data reuse
Assesses the benefit of using shared-memory
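A rough model shows why staging uncoalesced accesses through shared memory pays off. The 128-byte transaction size and the column-access example below are assumptions for illustration, not figures from the paper:

```python
SEGMENT = 128  # bytes per global-memory transaction (assumed)

def transactions(addresses):
    """Number of 128-byte memory transactions a warp's accesses need."""
    return len({addr // SEGMENT for addr in addresses})

# Uncoalesced: each thread of a warp reads one 4-byte element of a
# matrix column (row stride 1024 bytes) -> 32 separate transactions.
column = [tid * 1024 for tid in range(32)]
print(transactions(column))  # 32

# After the transformation, consecutive threads load a contiguous
# row segment into shared memory instead: the same amount of data
# now moves in a single 128-byte transaction.
row = [tid * 4 for tid in range(32)]
print(transactions(row))     # 1
```

The threads then read the staged data from shared memory, where the access pattern no longer touches global memory at all.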
Memory Opt. and Parallelism Management
When data sharing is detected:
Merge thread blocks
Merge threads
When to merge thread blocks, or just threads?
For experiments:
Various GPU descriptions
Cetus: a source-to-source compiler
Memory Opt. and Parallelism Management
Main limitation: the compiler cannot change the algorithm structure
Biggest performance increases come from:
Merging of threads
Merging of thread blocks
Use of Demand-Fetched Caches
Paper: Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
Caches are highly configurable
Main contributions:
A characterization of application performance
A taxonomy of access patterns
An algorithm to identify an application's memory access patterns
Does the presence of a cache help or hurt performance?
Use of Demand-Fetched Caches
It is bandwidth, not latency, that matters
Estimated traffic: what does it indicate?
A block runs on a single SM
Remember, the L1 caches are not coherent
Configurability of the L1 caches:
Capacity
ON/OFF
Cached or not?
Kernels are analyzed independently
Use of Demand-Fetched Caches
Classification of kernels into three groups:
Texture and constant
Shared memory
Global memory
Use of Demand-Fetched Caches
No correlation between hit rates and performance
Analysis of the traffic generated from L2 to L1
The impact of cache line size:
Only a fraction of the line is fetched from DRAM
The whole line is brought into L1
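The line-size effect can be quantified with a simple traffic model. The 128-byte L1 line and the 4-byte access size are assumptions chosen for illustration:

```python
L1_LINE = 128  # bytes moved from L2 into L1 per miss (assumed)

def l1_traffic(addresses):
    """Bytes moved into L1: every distinct line touched is fetched whole."""
    return len({addr // L1_LINE for addr in addresses}) * L1_LINE

def useful_bytes(addresses, access_size=4):
    """Bytes the threads actually consume."""
    return len(set(addresses)) * access_size

# Strided 4-byte accesses, 128 bytes apart: each access pulls in a
# full line but uses only 4 of its 128 bytes.
addrs = [tid * 128 for tid in range(32)]
print(l1_traffic(addrs))                       # 4096 bytes fetched
print(useful_bytes(addrs))                     # 128 bytes used
print(l1_traffic(addrs) // useful_bytes(addrs))  # 32x amplification
```

This kind of traffic amplification is why a kernel can show reasonable hit rates while the cache still hurts its performance.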
Use of Demand-Fetched Caches
Changes in L2-to-L1 memory traffic
The algorithm can reveal the access patterns:
Memory address <-> thread ID
Estimate traffic
Determine which instructions to cache
Use of Demand-Fetched Caches
Steps of the algorithm:
Analyze access patterns
Estimate memory traffic
Determine which instructions will use the cache
Access patterns are expressed in terms of the thread ID
Memory traffic <-> number of bytes per thread
How to decide whether or not to cache an instruction?
Use of Demand-Fetched Caches
Four possible cases:
Cache-on traffic = cache-off traffic
Cache-on traffic < cache-off traffic
Cache-on traffic > cache-off traffic
Unknown access addresses
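The decision rule implied by these four cases can be sketched as follows. The function name, signature, and the exact handling of the unknown case are hypothetical, not lifted from the paper:

```python
def should_cache(cache_on_traffic, cache_off_traffic,
                 known=True, aggressive=False):
    """Decide whether a load instruction should use the L1 cache,
    given the estimated L2-to-L1 traffic with caching on vs. off.
    - equal traffic: caching adds no cost, keep it on
    - less traffic with caching on: enable the cache
    - more traffic with caching on: disable the cache
    - unknown addresses: a conservative strategy disables the cache,
      an aggressive one keeps it enabled (assumed interpretation of
      the paper's two strategies)."""
    if not known:
        return aggressive
    return cache_on_traffic <= cache_off_traffic

print(should_cache(4096, 4096))                          # True  (case 1)
print(should_cache(128, 4096))                           # True  (case 2)
print(should_cache(4096, 128))                           # False (case 3)
print(should_cache(0, 0, known=False))                   # False (case 4, conservative)
print(should_cache(0, 0, known=False, aggressive=True))  # True  (case 4, aggressive)
```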
Use of Demand-Fetched Caches
Keeping caches on for all kernels: 5.8% improvement on average
Conservative strategy: 16.9%
Aggressive approach: 18.0%
Conclusion
Things in common:
Extensive analyses of memory access patterns
The relationship between access patterns and the thread hierarchy
An on-chip scheduling property: rearranging memory access instructions
Off-chip accesses can also be improved by static approaches
The difference in latency between on-chip and off-chip memory accesses
Conclusion
Reduce individual accesses to DRAM
On-chip accesses are not as centralized
Scheduling threads = managing resources
Highlighting the relationship between the execution model and memory access patterns
The last two mechanisms are static-based
Advantages of the static-based approach:
It views the whole application: a detailed strategy
Cross-optimization: coalescing/shared memory
Conclusion
Limitations of static-based approaches:
Oblivious to the dynamics of execution: undetermined addresses
Limited by the capabilities of compilers
Deterministic vs. non-deterministic segments of code
Both approaches are limited in the variables they can observe
Dynamic mechanisms can capture changes during execution
Conclusion
Limitations of dynamic approaches:
They cannot store unlimited information
They are limited in how far they can look ahead
Advantages of dynamic approaches:
Respond better to execution dynamics
References
Diamos, Gregory; Kerr, Andrew; Yalamanchili, Sudhakar; Clark, Nathan. “Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems”. PACT 2010.
Lashgar, Ahmad; Baniasadi, Amirali. “Performance in GPU Architectures: Potentials and Distances”. WDDD 2011.
NVIDIA Corporation. “NVIDIA GeForce GTX 680”. Whitepaper, 2012.
References
NVIDIA Corporation. “NVIDIA CUDA C Programming Guide, Version 4.2”. 2012.
Jia, Wenhao; Shaw, Kelly A.; Martonosi, Margaret. “Characterizing and Improving the Use of Demand-Fetched Caches in GPUs”. ICS 2012.
Gou, Chunyang; Gaydadjiev, Georgi N. “Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline”. International Journal of Parallel Programming, June 2012.
References
Yuan, G.L.; Fung, W.W.L.; Wong, H.; Aamodt, T.M. “Analyzing CUDA Workloads Using a Detailed GPU Simulator”. International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
Yang, Yi; Xiang, Ping; Kong, Jingfei; Zhou, Huiyang. “A GPGPU Compiler for Memory Optimization and Parallelism Management”. PLDI 2010.