IEE5008 Autumn 2012 Memory Systems: Survey on Memory Access Scheduling - PowerPoint PPT Presentation



Slide 1

IEE5008 Autumn 2012 Memory Systems: Survey on Memory Access Scheduling for On-Chip Cache Memories of GPGPUs

Garrido Platero, Luis Angel Department of Electronics Engineering National Chiao Tung University luis.garrido.platero@gmail.com

Slide 2

Outline

Introduction Background of GPGPUs

Hardware Programming and execution model

GPU Memory Hierarchy

Issues and limitations

Locality on GPUs

Slide 3

Outline

- State-of-the-art solutions for memory access scheduling on GPUs
  - Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
  - A GPGPU Compiler for Memory Optimization and Parallelism Management
  - Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
- Conclusion: comparison and analysis
- References

Slide 4

Introduction

- GPGPUs as a major focus of attention in heterogeneous computing
- Design philosophy of GPGPUs
  - Bulk-synchronous programming model
  - Characteristics of applications
- Design space of GPUs
  - Diverse and multi-variable
- Major bottleneck: the memory hierarchy
- Focus of this work

Slide 5

Background of GPGPUs

Hardware of GPUs: general view

Slide 6

Background of GPGPUs

Hardware of GPUs: SMX

Slide 7

GPU’s Memory Hierarchy

Five different memory spaces:

- Constant
- Texture
- Shared
- Private (local)
- Global

Issues and limitations of the memory hierarchy:

- Caches are small
- Bandwidth, not latency, is the main concern
- How to make good use of the memory resources?

Slide 8

Locality on GPUs

Locality in GPUs is defined in close relation to the programming model. Why?

Locality in GPUs appears as:

- Within-warp locality
- Within-block locality (cross-warp)
- Cross-instruction data reuse

This model is well suited to GPUs.
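The three locality types above can be made concrete with a toy classifier. This is an illustrative sketch, not code from any of the surveyed papers; the tuple layout (block id, warp id, instruction id, address) is an assumed convention.

```python
# Illustrative sketch: classify the reuse between two accesses to the
# same address, given where each access sits in the thread hierarchy.
# Assumed tuple layout: (block_id, warp_id, instruction_id, address).

def classify_reuse(first, second):
    b1, w1, i1, a1 = first
    b2, w2, i2, a2 = second
    if a1 != a2:
        return "no-reuse"            # different data, nothing to classify
    if b1 == b2 and w1 == w2:
        # Same warp touches the address again: either within one memory
        # instruction or across different instructions of that warp.
        return "within-warp" if i1 == i2 else "cross-instruction"
    if b1 == b2:
        return "within-block"        # different warps of one thread block
    return "cross-block"             # blocks may run on different SMs

# Two lanes of warp 0 reading address 256 in the same load: within-warp.
classify_reuse((0, 0, 0, 256), (0, 0, 0, 256))   # "within-warp"
# Warp 1 of the same block touching that address: within-block (cross-warp).
classify_reuse((0, 0, 0, 256), (0, 1, 0, 256))   # "within-block"
```

Cross-block reuse is deliberately a separate bucket: blocks can be scheduled on different SMs, so their reuse is generally not capturable by per-SM on-chip memories.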

Slide 9

State of the Art Solutions

Many techniques exist:

- An exhaustive survey would be huge
- This survey brings forward the most recent and the most instructive ones

Types of techniques:

- Static: rely on compiler analysis, performed before execution
- Dynamic: rely on architectural enhancements
- Static + dynamic: combine the best of both approaches

Slide 10

State of the Art Solutions

Three papers in this survey:

- Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
- A GPGPU Compiler for Memory Optimization and Parallelism Management
- Characterizing and Improving the Use of Demand-Fetched Caches in GPUs

Slide 11

Shared Memory Bank Conflicts

Paper: Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline

Objective of the mechanism:

- Avoid on-chip memory bank conflicts

On-chip memories of GPUs are heavily banked.

Impact on performance:

- Varying latencies as a result of memory bank conflicts

Slide 12

Shared Memory Bank Conflicts

This work makes the following contributions:

- A careful analysis of the impact of GPU on-chip shared memory bank conflicts
- A novel elastic pipeline design that alleviates on-chip shared memory bank conflicts
- A co-designed, bank-conflict-aware warp scheduling technique to assist the elastic pipeline
- Pipeline stall reductions of up to 64%, improving overall system performance

Slide 13

Shared Memory Bank Conflicts

- The core holds state for the warps
- Warp instructions are issued in round-robin order
- A bank conflict has two consequences:
  - It blocks the upstream pipeline
  - It introduces a bubble into the M1 stage

Slide 14

Shared Memory Bank Conflicts

Elastic Pipeline Two modifications

Buses to allow forwarding Turning the two stage NONMEMx stage into a FIFO Modify the scheduling policy of warps
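The effect of turning a rigid stage into a FIFO can be sketched with a small cycle-level model. This is a loose illustration of the idea, not the paper's design: each memory instruction occupies the memory stage for as many cycles as its conflict degree, and the upstream pipeline only stalls when the FIFO in front of the memory stage is full.

```python
from collections import deque

def upstream_stalls(conflict_degrees, fifo_depth):
    """Count cycles the upstream pipeline stalls. Each entry in
    conflict_degrees is the number of cycles its instruction occupies
    the memory stage; fifo_depth = 1 roughly models a rigid pipeline
    with a single latch in front of the memory stage."""
    pending = deque(conflict_degrees)  # not yet issued by the front end
    fifo = deque()                     # buffered between issue and memory
    busy = 0                           # cycles left for the current access
    stalls = 0
    while pending or fifo or busy:
        if busy == 0 and fifo:
            busy = fifo.popleft()      # memory stage accepts the next access
        if pending:
            if len(fifo) < fifo_depth:
                fifo.append(pending.popleft())  # issue one instruction
            else:
                stalls += 1            # FIFO full: bubble upstream
        if busy:
            busy -= 1
    return stalls

# A bursty conflict pattern: two 4-way conflicts amid conflict-free accesses.
bursty = [4, 1, 1, 1, 4, 1, 1, 1]
upstream_stalls(bursty, fifo_depth=1)  # several upstream stalls
upstream_stalls(bursty, fifo_depth=4)  # the FIFO absorbs them: 0 stalls
```

The deeper FIFO lets conflict-free instructions keep issuing while a multi-cycle conflicted access drains, which is the "elastic" behavior the paper exploits.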

Slide 15

Shared Memory Bank Conflicts

- Avoid the issues due to out-of-order instruction commit
- It is necessary to know whether future instructions will create conflicts
- Prevent the queues from saturating
- A mechanism for modifying the scheduling policy

Slide 16

Shared Memory Bank Conflicts

Experimentation platform: GPGPU-Sim

- A cycle-accurate simulator for GPUs

Three categories of pipeline stalls:

- Warp scheduling fails
- Shared memory bank conflicts
- Other

Slide 17

Shared Memory Bank Conflicts

Warp scheduling in the core can fail when stalls are:

- Not hidden by parallel execution
- Due to barrier synchronization
- Due to warp control flow

Performance comparison:

- Baseline GPU
- GPU + elastic pipeline
- GPU + elastic pipeline + the novel scheduler
- Theoretical upper bound

Slide 18

Memory Opt. and Parallelism Management

Paper: A GPGPU Compiler for Memory Optimization and Parallelism Management

Two major challenges:

- Effective utilization of the GPU memory hierarchy
- Judicious management of parallelism

Compiler-based approach:

- Analyzes memory access patterns
- Applies an optimization procedure

Slide 19

Memory Opt. and Parallelism Management

- Considers an appropriate way to parallelize an application
- Why compiler techniques?
- Purposes of the optimization procedure:
  - Increase memory coalescing
  - Improve usage of shared memory
  - Balance the amount of parallelism against memory optimizations
  - Distribute memory traffic among the off-chip memory partitions
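Memory coalescing can be quantified by counting how many memory segments one warp-wide access touches. The sketch below uses assumed parameters (32-thread warps, 128-byte transaction segments, typical of CUDA-era GPUs); it is an illustration of the metric, not the paper's actual cost model.

```python
# Assumed parameters for illustration.
WARP_SIZE = 32
SEGMENT_BYTES = 128

def transactions_per_access(addresses):
    """Number of distinct memory segments a warp-wide access touches;
    a fully coalesced access needs only one transaction."""
    return len({a // SEGMENT_BYTES for a in addresses})

# Coalesced: 32 consecutive 4-byte words fit in one 128-byte segment.
coalesced = [4 * t for t in range(WARP_SIZE)]
transactions_per_access(coalesced)   # 1 transaction
# Uncoalesced: a 128-byte stride puts every thread in its own segment.
strided = [128 * t for t in range(WARP_SIZE)]
transactions_per_access(strided)     # 32 transactions
```

Staging strided data through shared memory, as the compiler does, converts the second pattern into the first before the data ever reaches DRAM, cutting off-chip transactions by up to the warp width.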

Slide 20

Memory Opt. and Parallelism Management

- Input: naïve kernel code; output: optimized kernel code
- Re-schedules threads depending on memory access behavior
- Converts uncoalesced accesses to coalesced ones through shared memory
- The compiler analyzes data reuse and assesses the benefit of using shared memory

Slide 21

Memory Opt. and Parallelism Management

When data sharing is detected:

- Merge thread blocks
- Merge threads

When to merge thread blocks, and when just threads?

For the experiments:

- Various GPU descriptions
- Cetus: a source-to-source compiler

Slide 22

Memory Opt. and Parallelism Management

Main limitation: it cannot change the structure of the algorithm.

Biggest performance increases:

- Merging of threads
- Merging of thread blocks

Slide 23

Use of Demand-Fetched Caches

Paper: Characterizing and Improving the Use of Demand-Fetched Caches in GPUs

Caches are highly configurable. Main contributions:

- A characterization of application performance
- A taxonomy of access patterns
- An algorithm to identify an application's memory access patterns

Does the presence of a cache help or hurt performance?

Slide 24

Use of Demand-Fetched Caches

- It is bandwidth, not latency, that matters
- Estimated traffic: what does it indicate?
- A thread block runs on a single SM
- Remember: L1 caches are not coherent
- Configurability of L1 caches:
  - Capacity
  - ON/OFF (cached or not?)
- Kernels are analyzed independently

Slide 25

Use of Demand-Fetched Caches

Classification of kernels into three groups:

- Texture and constant
- Shared memory
- Global memory

Slide 26

Use of Demand-Fetched Caches

- No correlation between hit rates and performance
- Analysis of the traffic generated from L2 to L1
- The impact of cache line size:
  - Only a fraction of the line is needed from DRAM
  - But the whole line is brought into L1

Slide 27

Use of Demand-Fetched Caches

- Changes in L2-to-L1 memory traffic
- The algorithm can reveal the access patterns:
  - Maps memory addresses to thread IDs
  - Estimates traffic
- Determines which instructions to cache

Slide 28

Use of Demand-Fetched Caches

Steps of the algorithm:

1. Analyze the access patterns
2. Estimate the memory traffic
3. Determine which instructions will use the cache

- The analysis is based on the thread ID
- Memory traffic is estimated as the number of bytes per thread
- How to decide whether or not to cache an instruction?
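The caching decision can be sketched by comparing estimated L2-to-L1 traffic with the cache on (every miss brings a whole line) against the cache off (only small segments are fetched). The line and segment sizes below are assumptions for illustration, not the paper's exact parameters, and the unknown-address case is left out.

```python
LINE_BYTES = 128     # assumed L1 line size
SEGMENT_BYTES = 32   # assumed transaction size with the L1 bypassed

def traffic_cache_on(addresses):
    """Bytes moved when every miss brings in a whole L1 line."""
    return LINE_BYTES * len({a // LINE_BYTES for a in addresses})

def traffic_cache_off(addresses):
    """Bytes moved when uncached accesses fetch small segments."""
    return SEGMENT_BYTES * len({a // SEGMENT_BYTES for a in addresses})

def should_cache(addresses):
    """Cache the instruction only when caching does not inflate
    traffic; unknown access addresses are not modelled here."""
    return traffic_cache_on(addresses) <= traffic_cache_off(addresses)

# Dense, coalesced accesses: one 128B line vs four 32B segments, a tie,
# so caching costs nothing and may help on reuse.
dense = [4 * t for t in range(32)]
# Sparse, 128B-strided accesses: caching drags in 32 mostly-unused lines.
sparse = [128 * t for t in range(32)]
```

For `dense`, both configurations move 128 bytes, so the instruction is cached; for `sparse`, caching would move 4096 bytes against 1024 uncached, so the instruction bypasses the L1.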

Slide 29

Use of Demand-Fetched Caches

Four possible cases:

1. Cache-on traffic = cache-off traffic
2. Cache-on traffic < cache-off traffic
3. Cache-on traffic > cache-off traffic
4. Unknown access addresses


Slide 31

Use of Demand-Fetched Caches

- Keeping caches on for all kernels: 5.8% improvement on average
- Conservative strategy: 16.9%
- Aggressive approach: 18.0%

Slide 32

Conclusion

What the three works have in common:

- Extensive analyses of memory access patterns
- A relationship between access patterns and the thread hierarchy
- An on-chip scheduling property: rearranging memory access instructions

- Off-chip accesses can also be improved by static approaches
- Latency differs for on-chip and off-chip memory accesses

Slide 33

Conclusion

- Reduce individual accesses to DRAM
- On-chip accesses are not as centralized
- Scheduling threads = managing resources
- Highlights the relationship between the execution model and memory access patterns
- The last two mechanisms are static
- Advantages of a static approach:
  - It sees the whole application, enabling a detailed strategy
  - Cross-optimization: coalescing / shared memory

Slide 34

Conclusion

Limitations of static approaches:

- Oblivious to the dynamics of execution: undetermined addresses
- Limited by the capabilities of compilers
- Deterministic vs. non-deterministic segments of code

- Both approaches are limited in the variables they can observe
- Dynamic mechanisms capture changes during execution

Slide 35

Conclusion

Limitations of dynamic approaches:

- They cannot store unlimited information
- They are limited in how far ahead they can look

Advantages of dynamic approaches:

- They respond better to execution dynamics

Slide 36

References

- Diamos, Gregory; Kerr, Andrew; Yalamanchili, Sudhakar; Clark, Nathan. "Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems". PACT 2010.
- Lashgar, Ahmad; Baniasadi, Amirali. "Performance in GPU Architectures: Potentials and Distances". WDDD 2011.
- NVIDIA Corporation. "NVIDIA GeForce GTX 680". Whitepaper, 2012.

Slide 37

References

- NVIDIA Corporation. "NVIDIA CUDA C Programming Guide, Version 4.2". 2012.
- Jia, Wenhao; Shaw, Kelly A.; Martonosi, Margaret. "Characterizing and Improving the Use of Demand-Fetched Caches in GPUs". ICS 2012.
- Gou, Chunyang; Gaydadjiev, Georgi N. "Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline". International Journal of Parallel Programming, June 2012.

Slide 38

References

- Yuan, G.L.; Fung, W.W.L.; Wong, H.; Aamodt, T.M. "Analyzing CUDA Workloads Using a Detailed GPU Simulator". ISPASS 2009.
- Yang, Yi; Xiang, Ping; Kong, Jingfei; Zhou, Huiyang. "A GPGPU Compiler for Memory Optimization and Parallelism Management". PLDI 2010.

Slide 39

References

- Jacob, Bruce; Ng, Spencer W.; Wang, David T. "Memory Systems: Cache, DRAM, Disk". Elsevier, 2008.
- Lakshminarayana, Nagesh B.; Kim, Hyesoon. "Effect of Instruction Fetch and Memory Scheduling on GPU Performance". Workshop held with HPCA/PPoPP, 2010.
