Module 4.3 - Memory Model and Locality: Tiled Matrix Multiplication

GPU Teaching Kit Accelerated Computing Module 4.3 - Memory Model and Locality Tiled Matrix Multiplication

Objective
– To understand the design of a tiled parallel algorithm for matrix multiplication
  – Loading a tile
  – Phased execution
  – Barrier synchronization

Matrix Multiplication: Data Access Pattern
– Each thread reads a row of M and a column of N
– Each thread block reads a strip of M and a strip of N
[Figure: matrices M, N, and P with dimension labels WIDTH and BLOCK_WIDTH and index labels Row and Col, showing the strips of M and N used by one thread block]
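The access pattern above can be sketched as a basic (untiled) kernel. This is a minimal illustration, not code from the slides: the names `matmul_naive` and `naive_matches_reference` are hypothetical, and the matrices are assumed to be square, WIDTH × WIDTH, and stored row-major.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per element of P; all matrices are width x width, row-major.
__global__ void matmul_naive(const float* M, const float* N, float* P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= width || col >= width) return;

    float sum = 0.0f;
    for (int k = 0; k < width; ++k) {
        // Row `row` of M and column `col` of N, both read from global memory.
        sum += M[row * width + k] * N[k * width + col];
    }
    P[row * width + col] = sum;
}

// Hypothetical host helper: runs the kernel and checks it against a CPU loop.
bool naive_matches_reference(int width, int block_width) {
    const size_t count = static_cast<size_t>(width) * width;
    const size_t bytes = count * sizeof(float);
    std::vector<float> hM(count), hN(count), hP(count, -1.0f);
    for (size_t i = 0; i < count; ++i) {
        hM[i] = static_cast<float>(i % 7);  // small integers keep float sums exact
        hN[i] = static_cast<float>(i % 5);
    }

    float *dM = nullptr, *dN = nullptr, *dP = nullptr;
    if (cudaMalloc(&dM, bytes) != cudaSuccess ||
        cudaMalloc(&dN, bytes) != cudaSuccess ||
        cudaMalloc(&dP, bytes) != cudaSuccess) {
        cudaFree(dM); cudaFree(dN); cudaFree(dP);
        return false;
    }
    cudaMemcpy(dM, hM.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, hN.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(block_width, block_width);
    int blocks_per_side = (width + block_width - 1) / block_width;
    dim3 grid(blocks_per_side, blocks_per_side);
    matmul_naive<<<grid, block>>>(dM, dN, dP, width);

    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t status = cudaMemcpy(hP.data(), dP, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    if (status != cudaSuccess) return false;

    for (int r = 0; r < width; ++r) {
        for (int c = 0; c < width; ++c) {
            float expected = 0.0f;
            for (int k = 0; k < width; ++k) {
                expected += hM[r * width + k] * hN[k * width + c];
            }
            if (hP[r * width + c] != expected) return false;
        }
    }
    return true;
}
```

Calling `naive_matches_reference(4, 2)` from a host program would reproduce the 4 × 4 toy setup with 2 × 2 thread blocks. Every thread in a block re-reads the same rows and columns from global memory, which is the redundancy that tiling targets.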

Tiled Matrix Multiplication
– Break up the execution of each thread into phases, so that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
– The tile is BLOCK_SIZE elements in each dimension
[Figure: matrices M, N, and P with dimension labels WIDTH and BLOCK_WIDTH and index labels Row and Col, showing one tile of M and one tile of N]

Loading a Tile
– All threads in a block participate
– Each thread loads one M element and one N element in tiled code
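The per-thread load can be sketched as follows. This is an illustrative sketch under stated assumptions: `TILE_WIDTH` (standing in for the tile dimension), `load_tile_for_phase`, and `tiles_for_phase` are hypothetical names, the matrices are 4 × 4 and row-major as in the toy example, and only block (0,0) is launched. The kernel copies the loaded tiles back out so the host can inspect which elements each phase touches.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_WIDTH 2  // assumed tile dimension, matching the 2x2 tiles of the toy example

// Loads the tiles of M and N that one block needs in the given phase, then
// copies them out for inspection. Launch with a TILE_WIDTH x TILE_WIDTH block.
__global__ void load_tile_for_phase(const float* M, const float* N,
                                    float* m_tile_out, float* n_tile_out,
                                    int width, int phase) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;  // row of P this thread works on
    int col = blockIdx.x * TILE_WIDTH + tx;  // column of P this thread works on

    // Each thread loads exactly one M element and one N element.
    Ms[ty][tx] = M[row * width + (phase * TILE_WIDTH + tx)];
    Ns[ty][tx] = N[(phase * TILE_WIDTH + ty) * width + col];
    __syncthreads();  // wait until the whole tile is in shared memory

    m_tile_out[ty * TILE_WIDTH + tx] = Ms[ty][tx];
    n_tile_out[ty * TILE_WIDTH + tx] = Ns[ty][tx];
}

// Hypothetical host helper: returns the M tile followed by the N tile
// loaded by block (0,0), using M[i] = i and N[i] = 100 + i as test data.
std::vector<float> tiles_for_phase(int phase) {
    const int width = 4;
    const int count = width * width;
    const int tile_count = TILE_WIDTH * TILE_WIDTH;
    std::vector<float> hM(count), hN(count), out(2 * tile_count, -1.0f);
    for (int i = 0; i < count; ++i) {
        hM[i] = static_cast<float>(i);
        hN[i] = static_cast<float>(100 + i);
    }

    float *dM = nullptr, *dN = nullptr, *dOut = nullptr;
    cudaMalloc(&dM, count * sizeof(float));
    cudaMalloc(&dN, count * sizeof(float));
    cudaMalloc(&dOut, 2 * tile_count * sizeof(float));
    cudaMemcpy(dM, hM.data(), count * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dN, hN.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    load_tile_for_phase<<<dim3(1, 1), block>>>(dM, dN, dOut, dOut + tile_count,
                                               width, phase);

    cudaMemcpy(out.data(), dOut, 2 * tile_count * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dM); cudaFree(dN); cudaFree(dOut);
    return out;
}
```

With this test data, `tiles_for_phase(1)` should return the M elements at (0,2), (0,3), (1,2), (1,3) followed by the N elements at (2,0), (2,1), (3,0), (3,1), which are the elements the Phase 1 slides place in shared memory for block (0,0).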

Phase 0 Load for Block (0,0)
[Figure: 4 × 4 matrices M, N, and P. Shared memory holds M 0,0, M 0,1, M 1,0, M 1,1 and N 0,0, N 0,1, N 1,0, N 1,1.]

Phase 0 Use for Block (0,0) (iteration 0)
[Figure: 4 × 4 matrices M, N, and P. Shared memory holds M 0,0, M 0,1, M 1,0, M 1,1 and N 0,0, N 0,1, N 1,0, N 1,1.]

Phase 0 Use for Block (0,0) (iteration 1)
[Figure: 4 × 4 matrices M, N, and P. Shared memory holds M 0,0, M 0,1, M 1,0, M 1,1 and N 0,0, N 0,1, N 1,0, N 1,1.]

Phase 1 Load for Block (0,0)
[Figure: 4 × 4 matrices M, N, and P. Shared memory holds M 0,2, M 0,3, M 1,2, M 1,3 and N 2,0, N 2,1, N 3,0, N 3,1.]

Phase 1 Use for Block (0,0) (iteration 0)
[Figure: 4 × 4 matrices M, N, and P. Shared memory holds M 0,2, M 0,3, M 1,2, M 1,3 and N 2,0, N 2,1, N 3,0, N 3,1.]

Phase 1 Use for Block (0,0) (iteration 1)
[Figure: 4 × 4 matrices M, N, and P. Shared memory holds M 0,2, M 0,3, M 1,2, M 1,3 and N 2,0, N 2,1, N 3,0, N 3,1.]

Execution Phases of Toy Example

Execution Phases of Toy Example (cont.)
– Shared memory allows each value to be accessed by multiple threads
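The benefit of that reuse can be made concrete with a short count of global-memory loads per thread. This is a worked sketch, assuming square WIDTH × WIDTH matrices and square tiles whose side is written here as TILE_WIDTH (a name not used in the slides); the toy example has WIDTH = 4 and TILE_WIDTH = 2.

```latex
\begin{align*}
\text{untiled loads per thread} &= 2 \cdot \mathrm{WIDTH} = 2 \cdot 4 = 8 \\
\text{tiled loads per thread}   &= 2 \cdot \frac{\mathrm{WIDTH}}{\mathrm{TILE\_WIDTH}}
                                  = 2 \cdot \frac{4}{2} = 4 \\
\frac{\text{untiled}}{\text{tiled}} &= \mathrm{TILE\_WIDTH} = 2
\end{align*}
```

In the tiled version each thread loads one M element and one N element per phase, and there are WIDTH / TILE_WIDTH phases. Each value placed in shared memory is then read by TILE_WIDTH threads of the block, so global-memory traffic drops by a factor of TILE_WIDTH.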

Barrier Synchronization
– Synchronize all threads in a block with __syncthreads()
– All threads in the same block must reach the __syncthreads() before any of them can move on
– Best used to coordinate the phased execution of tiled algorithms
  – To ensure that all elements of a tile are loaded at the beginning of a phase
  – To ensure that all elements of a tile are consumed at the end of a phase
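The two uses of the barrier can be seen together in a sketch of a complete tiled kernel. This is a minimal illustration rather than the deck's own code: `TILE_WIDTH`, `matmul_tiled`, and `tiled_matches_reference` are hypothetical names, and the sketch assumes square row-major matrices whose width is a multiple of `TILE_WIDTH`, with no boundary handling.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_WIDTH 2  // assumed tile dimension; the block is TILE_WIDTH x TILE_WIDTH

// Assumes width is a multiple of TILE_WIDTH (no boundary checks).
__global__ void matmul_tiled(const float* M, const float* N, float* P, int width) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    float sum = 0.0f;
    for (int phase = 0; phase < width / TILE_WIDTH; ++phase) {
        // Load: each thread brings in one M element and one N element.
        Ms[ty][tx] = M[row * width + (phase * TILE_WIDTH + tx)];
        Ns[ty][tx] = N[(phase * TILE_WIDTH + ty) * width + col];
        __syncthreads();  // all elements of the tile are loaded before use

        // Use: one iteration per element along the tile's inner dimension.
        for (int k = 0; k < TILE_WIDTH; ++k) {
            sum += Ms[ty][k] * Ns[k][tx];
        }
        __syncthreads();  // all elements are consumed before the next load overwrites them
    }
    P[row * width + col] = sum;
}

// Hypothetical host helper: runs the kernel and checks it against a CPU loop.
bool tiled_matches_reference(int width) {
    if (width <= 0 || width % TILE_WIDTH != 0) return false;
    const size_t count = static_cast<size_t>(width) * width;
    const size_t bytes = count * sizeof(float);
    std::vector<float> hM(count), hN(count), hP(count, -1.0f);
    for (size_t i = 0; i < count; ++i) {
        hM[i] = static_cast<float>(i % 7);  // small integers keep float sums exact
        hN[i] = static_cast<float>(i % 5);
    }

    float *dM = nullptr, *dN = nullptr, *dP = nullptr;
    if (cudaMalloc(&dM, bytes) != cudaSuccess ||
        cudaMalloc(&dN, bytes) != cudaSuccess ||
        cudaMalloc(&dP, bytes) != cudaSuccess) {
        cudaFree(dM); cudaFree(dN); cudaFree(dP);
        return false;
    }
    cudaMemcpy(dM, hM.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, hN.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH);
    matmul_tiled<<<grid, block>>>(dM, dN, dP, width);

    cudaError_t status = cudaMemcpy(hP.data(), dP, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    if (status != cudaSuccess) return false;

    for (int r = 0; r < width; ++r) {
        for (int c = 0; c < width; ++c) {
            float expected = 0.0f;
            for (int k = 0; k < width; ++k) {
                expected += hM[r * width + k] * hN[k * width + c];
            }
            if (hP[r * width + c] != expected) return false;
        }
    }
    return true;
}
```

Calling `tiled_matches_reference(4)` corresponds to the toy example: two phases, each with two use iterations. Removing either barrier would introduce a race, either reading a tile before it is fully loaded or overwriting it while slower threads are still reading.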

GPU Teaching Kit Accelerated Computing The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
