IEE5008 Autumn 2012 Memory Systems: Survey on Memory Access Scheduling - PowerPoint PPT Presentation



Slide 1

IEE5008 Autumn 2012 Memory Systems: Survey on Memory Access Scheduling for On-Chip Cache Memories of GPGPUs

Garrido Platero, Luis Angel Department of Electronics Engineering National Chiao Tung University luis.garrido.platero@gmail.com

Slide 2

Outline

Introduction Background of GPGPUs

Hardware Programming and execution model

GPU Memory Hierarchy

Issues and limitations

Locality on GPUs

Slide 3

Outline

- State-of-the-art solutions for memory access scheduling on GPUs
  - Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
  - A GPGPU Compiler for Memory Optimization and Parallelism Management
  - Characterizing and Improving the Use of Demand-Fetched Caches in GPUs
- Conclusion: comparison and analysis
- References

Slide 4

Introduction

- GPGPUs as a major focus of attention in heterogeneous computing
- Design philosophy of GPGPUs
  - Bulk-synchronous programming model
  - Characteristics of applications
- Design space of GPUs
  - Diverse and multi-variable
- Major bottleneck: the memory hierarchy
- Focus of this work

Slide 5

Background of GPGPUs

Hardware of GPUs: general view

Slide 6

Background of GPGPUs

Hardware of GPUs: SMX

Slide 7

GPU’s Memory Hierarchy

Five different memory spaces:

- Constant
- Texture
- Shared
- Private (local)
- Global

Issues and limitations of the memory hierarchy:

- Caches are small
- Bandwidth, not latency, is the main concern
- How to make good use of the memory resources?

Slide 8

Locality on GPUs

Locality in GPUs is defined in close relation to the programming model. Why?

Locality in GPUs appears as:

- Within-warp locality
- Within-block locality (cross-warp)
- Cross-instruction data reuse

This model is well suited to GPUs.
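The three locality types above can be made concrete with a toy classifier. This is an illustrative sketch, not code from any of the surveyed papers; the tuple layout (block id, warp id, instruction id, address) is an assumed convention.

```python
# Illustrative sketch: classify the reuse between two accesses to the
# same address, given where each access sits in the thread hierarchy.
# Assumed tuple layout: (block_id, warp_id, instruction_id, address).

def classify_reuse(first, second):
    b1, w1, i1, a1 = first
    b2, w2, i2, a2 = second
    if a1 != a2:
        return "no-reuse"            # different data, nothing to classify
    if b1 == b2 and w1 == w2:
        # Same warp touches the address again: either within one memory
        # instruction or across different instructions of that warp.
        return "within-warp" if i1 == i2 else "cross-instruction"
    if b1 == b2:
        return "within-block"        # different warps of one thread block
    return "cross-block"             # blocks may run on different SMs

# Two lanes of warp 0 reading address 256 in the same load: within-warp.
classify_reuse((0, 0, 0, 256), (0, 0, 0, 256))   # "within-warp"
# Warp 1 of the same block touching that address: within-block (cross-warp).
classify_reuse((0, 0, 0, 256), (0, 1, 0, 256))   # "within-block"
```

Cross-block reuse is deliberately a separate bucket: blocks can be scheduled on different SMs, so their reuse is generally not capturable by per-SM on-chip memories.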

Slide 9

State of the Art Solutions

Many techniques exist:

- An exhaustive survey would be huge
- This survey brings forward the most recent and the most instructive ones

Types of techniques:

- Static: rely on compiler analysis, performed before execution
- Dynamic: rely on architectural enhancements
- Static + dynamic: combine the best of both approaches

Slide 10

State of the Art Solutions

Three papers in this survey:

- Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline
- A GPGPU Compiler for Memory Optimization and Parallelism Management
- Characterizing and Improving the Use of Demand-Fetched Caches in GPUs

Slide 11

Shared Memory Bank Conflicts

Paper: Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline

Objective of the mechanism:

- Avoid on-chip memory bank conflicts

On-chip memories of GPUs are heavily banked.

Impact on performance:

- Varying latencies as a result of memory bank conflicts

Slide 12

Shared Memory Bank Conflicts

This work makes the following contributions:

- A careful analysis of the impact of GPU on-chip shared memory bank conflicts
- A novel elastic pipeline design that alleviates on-chip shared memory bank conflicts
- A co-designed, bank-conflict-aware warp scheduling technique to assist the elastic pipeline
- Pipeline stall reductions of up to 64%, improving overall system performance

Slide 13

Shared Memory Bank Conflicts

- The core holds state for the warps
- Warp instructions are issued in round-robin order
- A bank conflict has two consequences:
  - It blocks the upstream pipeline
  - It introduces a bubble into the M1 stage

Slide 14

Shared Memory Bank Conflicts

Elastic Pipeline Two modifications

Buses to allow forwarding Turning the two stage NONMEMx stage into a FIFO Modify the scheduling policy of warps
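The effect of turning a rigid stage into a FIFO can be sketched with a small cycle-level model. This is a loose illustration of the idea, not the paper's design: each memory instruction occupies the memory stage for as many cycles as its conflict degree, and the upstream pipeline only stalls when the FIFO in front of the memory stage is full.

```python
from collections import deque

def upstream_stalls(conflict_degrees, fifo_depth):
    """Count cycles the upstream pipeline stalls. Each entry in
    conflict_degrees is the number of cycles its instruction occupies
    the memory stage; fifo_depth = 1 roughly models a rigid pipeline
    with a single latch in front of the memory stage."""
    pending = deque(conflict_degrees)  # not yet issued by the front end
    fifo = deque()                     # buffered between issue and memory
    busy = 0                           # cycles left for the current access
    stalls = 0
    while pending or fifo or busy:
        if busy == 0 and fifo:
            busy = fifo.popleft()      # memory stage accepts the next access
        if pending:
            if len(fifo) < fifo_depth:
                fifo.append(pending.popleft())  # issue one instruction
            else:
                stalls += 1            # FIFO full: bubble upstream
        if busy:
            busy -= 1
    return stalls

# A bursty conflict pattern: two 4-way conflicts amid conflict-free accesses.
bursty = [4, 1, 1, 1, 4, 1, 1, 1]
upstream_stalls(bursty, fifo_depth=1)  # several upstream stalls
upstream_stalls(bursty, fifo_depth=4)  # the FIFO absorbs them: 0 stalls
```

The deeper FIFO lets conflict-free instructions keep issuing while a multi-cycle conflicted access drains, which is the "elastic" behavior the paper exploits.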

Slide 15

Shared Memory Bank Conflicts

- Avoid the issues due to out-of-order instruction commit
- It is necessary to know whether future instructions will create conflicts
- Prevent the queues from saturating
- A mechanism for modifying the scheduling policy

Slide 16

Shared Memory Bank Conflicts

Experimentation platform: GPGPU-Sim

- A cycle-accurate simulator for GPUs

Three categories of pipeline stalls:

- Warp scheduling fails
- Shared memory bank conflicts
- Other

Slide 17

Shared Memory Bank Conflicts

Warp scheduling in the core can fail when stalls are:

- Not hidden by parallel execution
- Due to barrier synchronization
- Due to warp control flow

Performance comparison:

- Baseline GPU
- GPU + elastic pipeline
- GPU + elastic pipeline + the novel scheduler
- Theoretical upper bound

Slide 18

Memory Opt. and Parallelism Management

Paper: A GPGPU Compiler for Memory Optimization and Parallelism Management

Two major challenges:

- Effective utilization of the GPU memory hierarchy
- Judicious management of parallelism

Compiler-based approach:

- Analyzes memory access patterns
- Applies an optimization procedure

Slide 19

Memory Opt. and Parallelism Management

- Considers an appropriate way to parallelize an application
- Why compiler techniques?
- Purposes of the optimization procedure:
  - Increase memory coalescing
  - Improve usage of shared memory
  - Balance the amount of parallelism against memory optimizations
  - Distribute memory traffic among the off-chip memory partitions
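Memory coalescing can be quantified by counting how many memory segments one warp-wide access touches. The sketch below uses assumed parameters (32-thread warps, 128-byte transaction segments, typical of CUDA-era GPUs); it is an illustration of the metric, not the paper's actual cost model.

```python
# Assumed parameters for illustration.
WARP_SIZE = 32
SEGMENT_BYTES = 128

def transactions_per_access(addresses):
    """Number of distinct memory segments a warp-wide access touches;
    a fully coalesced access needs only one transaction."""
    return len({a // SEGMENT_BYTES for a in addresses})

# Coalesced: 32 consecutive 4-byte words fit in one 128-byte segment.
coalesced = [4 * t for t in range(WARP_SIZE)]
transactions_per_access(coalesced)   # 1 transaction
# Uncoalesced: a 128-byte stride puts every thread in its own segment.
strided = [128 * t for t in range(WARP_SIZE)]
transactions_per_access(strided)     # 32 transactions
```

Staging strided data through shared memory, as the compiler does, converts the second pattern into the first before the data ever reaches DRAM, cutting off-chip transactions by up to the warp width.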

Slide 20

Memory Opt. and Parallelism Management

- Input: naïve kernel code; output: optimized kernel code
- Re-schedules threads depending on memory access behavior
- Converts uncoalesced accesses to coalesced ones through shared memory
- The compiler analyzes data reuse and assesses the benefit of using shared memory

Slide 21

Memory Opt. and Parallelism Management

When data sharing is detected:

- Merge thread blocks
- Merge threads

When to merge thread blocks, and when just threads?

For the experiments:

- Various GPU descriptions
- Cetus: a source-to-source compiler

Slide 22

Memory Opt. and Parallelism Management

Main limitation: it cannot change the structure of the algorithm.

Biggest performance increases:

- Merging of threads
- Merging of thread blocks

Slide 23

Use of Demand-Fetched Caches

Paper: Characterizing and Improving the Use of Demand-Fetched Caches in GPUs

Caches are highly configurable. Main contributions:

- A characterization of application performance
- A taxonomy of access patterns
- An algorithm to identify an application's memory access patterns

Does the presence of a cache help or hurt performance?

Slide 24

Use of Demand-Fetched Caches

- It is bandwidth, not latency, that matters
- Estimated traffic: what does it indicate?
- A thread block runs on a single SM
- Remember: L1 caches are not coherent
- Configurability of L1 caches:
  - Capacity
  - ON/OFF (cached or not?)
- Kernels are analyzed independently

Slide 25

Use of Demand-Fetched Caches

Classification of kernels into three groups:

- Texture and constant
- Shared memory
- Global memory

Slide 26

Use of Demand-Fetched Caches

- No correlation between hit rates and performance
- Analysis of the traffic generated from L2 to L1
- The impact of cache line size:
  - Only a fraction of the line is needed from DRAM
  - But the whole line is brought into L1

Slide 27

Use of Demand-Fetched Caches

- Changes in L2-to-L1 memory traffic
- The algorithm can reveal the access patterns:
  - Maps memory addresses to thread IDs
  - Estimates traffic
- Determines which instructions to cache

Slide 28

Use of Demand-Fetched Caches

Steps of the algorithm:

1. Analyze the access patterns
2. Estimate the memory traffic
3. Determine which instructions will use the cache

- The analysis is based on the thread ID
- Memory traffic is estimated as the number of bytes per thread
- How to decide whether or not to cache an instruction?
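The caching decision can be sketched by comparing estimated L2-to-L1 traffic with the cache on (every miss brings a whole line) against the cache off (only small segments are fetched). The line and segment sizes below are assumptions for illustration, not the paper's exact parameters, and the unknown-address case is left out.

```python
LINE_BYTES = 128     # assumed L1 line size
SEGMENT_BYTES = 32   # assumed transaction size with the L1 bypassed

def traffic_cache_on(addresses):
    """Bytes moved when every miss brings in a whole L1 line."""
    return LINE_BYTES * len({a // LINE_BYTES for a in addresses})

def traffic_cache_off(addresses):
    """Bytes moved when uncached accesses fetch small segments."""
    return SEGMENT_BYTES * len({a // SEGMENT_BYTES for a in addresses})

def should_cache(addresses):
    """Cache the instruction only when caching does not inflate
    traffic; unknown access addresses are not modelled here."""
    return traffic_cache_on(addresses) <= traffic_cache_off(addresses)

# Dense, coalesced accesses: one 128B line vs four 32B segments, a tie,
# so caching costs nothing and may help on reuse.
dense = [4 * t for t in range(32)]
# Sparse, 128B-strided accesses: caching drags in 32 mostly-unused lines.
sparse = [128 * t for t in range(32)]
```

For `dense`, both configurations move 128 bytes, so the instruction is cached; for `sparse`, caching would move 4096 bytes against 1024 uncached, so the instruction bypasses the L1.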

Slide 29

Use of Demand-Fetched Caches

Four possible cases:

1. Cache-on traffic = cache-off traffic
2. Cache-on traffic < cache-off traffic
3. Cache-on traffic > cache-off traffic
4. Unknown access addresses


Slide 31

Use of Demand-Fetched Caches

- Keeping caches on for all kernels: 5.8% improvement on average
- Conservative strategy: 16.9%
- Aggressive approach: 18.0%

Slide 32

Conclusion

What the three works have in common:

- Extensive analyses of memory access patterns
- A relationship between access patterns and the thread hierarchy
- An on-chip scheduling property: rearranging memory access instructions

- Off-chip accesses can also be improved by static approaches
- Latency differs for on-chip and off-chip memory accesses

Slide 33

Conclusion

- Reduce individual accesses to DRAM
- On-chip accesses are not as centralized
- Scheduling threads = managing resources
- Highlights the relationship between the execution model and memory access patterns
- The last two mechanisms are static
- Advantages of a static approach:
  - It sees the whole application, enabling a detailed strategy
  - Cross-optimization: coalescing / shared memory

Slide 34

Conclusion

Limitations of static approaches:

- Oblivious to the dynamics of execution: undetermined addresses
- Limited by the capabilities of compilers
- Deterministic vs. non-deterministic segments of code

- Both approaches are limited in the variables they can observe
- Dynamic mechanisms capture changes during execution

Slide 35

Conclusion

Limitations of dynamic approaches:

- They cannot store unlimited information
- They are limited in how far ahead they can look

Advantages of dynamic approaches:

- They respond better to execution dynamics

Slide 36

References

- Diamos, Gregory; Kerr, Andrew; Yalamanchili, Sudhakar; Clark, Nathan. "Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems". PACT 2010.
- Lashgar, Ahmad; Baniasadi, Amirali. "Performance in GPU Architectures: Potentials and Distances". WDDD 2011.
- NVIDIA Corporation. "NVIDIA GeForce GTX 680". Whitepaper, 2012.

Slide 37

References

- NVIDIA Corporation. "NVIDIA CUDA C Programming Guide, Version 4.2". 2012.
- Jia, Wenhao; Shaw, Kelly A.; Martonosi, Margaret. "Characterizing and Improving the Use of Demand-Fetched Caches in GPUs". ICS 2012.
- Gou, Chunyang; Gaydadjiev, Georgi N. "Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline". International Journal of Parallel Programming, June 2012.

Slide 38

References

- Yuan, G.L.; Fung, W.W.L.; Wong, H.; Aamodt, T.M. "Analyzing CUDA Workloads Using a Detailed GPU Simulator". ISPASS 2009.
- Yang, Yi; Xiang, Ping; Kong, Jingfei; Zhou, Huiyang. "A GPGPU Compiler for Memory Optimization and Parallelism Management". PLDI 2010.

Slide 39

References

- Jacob, Bruce; Ng, Spencer W.; Wang, David T. "Memory Systems: Cache, DRAM, Disk". Elsevier, 2008.
- Lakshminarayana, Nagesh B.; Kim, Hyesoon. "Effect of Instruction Fetch and Memory Scheduling on GPU Performance". Workshop held with HPCA/PPoPP, 2010.
