
Module 4.3 – Memory Model and Locality: Tiled Matrix Multiplication



  1. GPU Teaching Kit – Accelerated Computing. Module 4.3: Memory Model and Locality – Tiled Matrix Multiplication

  2. Objective – To understand the design of a tiled parallel algorithm for matrix multiplication – Loading a tile – Phased execution – Barrier synchronization

  3. Matrix Multiplication – Data access pattern – Each thread computes one element of P, reading a row of M and a column of N – Each thread block uses a strip of M and a strip of N [Figure: M, N, and P matrices annotated with Row, Col, WIDTH, and BLOCK_WIDTH]
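The access pattern above corresponds to an untiled kernel in which each thread reads a full row of M and a full column of N from global memory. A minimal sketch, assuming square row-major matrices; the names M, N, P, and Width follow the slides, while the kernel name and launch configuration are assumptions:

```cuda
// Untiled matrix multiplication: each thread computes one element of P
// by traversing one row of M and one column of N in global memory.
__global__ void MatrixMulKernel(float *M, float *N, float *P, int Width) {
  int Row = blockIdx.y * blockDim.y + threadIdx.y;  // row of P this thread owns
  int Col = blockIdx.x * blockDim.x + threadIdx.x;  // column of P this thread owns
  if (Row < Width && Col < Width) {
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
      Pvalue += M[Row * Width + k] * N[k * Width + Col];  // every operand from global memory
    P[Row * Width + Col] = Pvalue;
  }
}
```

Because every operand comes from global memory, each element of M and N is re-read by WIDTH different threads; this redundancy is what the tiled version below removes.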

  4. Tiled Matrix Multiplication – Break up the execution of each thread into phases – so that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N – The tile is BLOCK_WIDTH elements in each dimension [Figure: M, N, and P partitioned into BLOCK_WIDTH × BLOCK_WIDTH tiles]

  5. Loading a Tile – All threads in a block participate – Each thread loads one M element and one N element in the tiled code

  6. Phase 0 Load for Block (0,0) – [Figure: block (0,0) copies the top-left 2×2 tile of M (M0,0, M0,1, M1,0, M1,1) and the top-left 2×2 tile of N (N0,0, N0,1, N1,0, N1,1) from global memory into shared memory]

  7. Phase 0 Use for Block (0,0) (iteration 0) – [Figure: each thread multiplies an element from column 0 of the shared M tile with an element from row 0 of the shared N tile and accumulates into its P element]

  8. Phase 0 Use for Block (0,0) (iteration 1) – [Figure: each thread multiplies an element from column 1 of the shared M tile with an element from row 1 of the shared N tile and accumulates into its P element]

  9. Phase 1 Load for Block (0,0) – [Figure: block (0,0) copies the next tile of M (M0,2, M0,3, M1,2, M1,3) and the next tile of N (N2,0, N2,1, N3,0, N3,1) into shared memory, replacing the Phase 0 tiles]

  10. Phase 1 Use for Block (0,0) (iteration 0) – [Figure: each thread multiplies an element from column 0 of the new M tile with an element from row 0 of the new N tile and accumulates into its P element]

  11. Phase 1 Use for Block (0,0) (iteration 1) – [Figure: each thread multiplies an element from column 1 of the new M tile with an element from row 1 of the new N tile, completing its P element]

  12. Execution Phases of Toy Example

  13. Execution Phases of Toy Example (cont.) – Shared memory allows each value to be accessed by multiple threads

  14. Barrier Synchronization – Synchronize all threads in a block – __syncthreads() – All threads in the same block must reach the __syncthreads() before any of them can move on – Best used to coordinate the phased execution of tiled algorithms – To ensure that all elements of a tile are loaded at the beginning of a phase – To ensure that all elements of a tile are consumed at the end of a phase
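The phased loading, use, and barrier placement described above can be sketched as a complete tiled kernel. This is a sketch under stated assumptions, not the kit's reference code: square row-major matrices with Width divisible by TILE_WIDTH, and TILE_WIDTH = 2 chosen only to match the 2×2 toy example (16 or 32 is more typical in practice):

```cuda
#define TILE_WIDTH 2  // assumed; matches the 2x2 tiles of the toy example

// Tiled matrix multiplication: each phase loads one tile of M and one
// tile of N into shared memory, then all threads in the block reuse them.
__global__ void TiledMatrixMulKernel(float *M, float *N, float *P, int Width) {
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];  // shared tile of M
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];  // shared tile of N

  int tx = threadIdx.x, ty = threadIdx.y;
  int Row = blockIdx.y * TILE_WIDTH + ty;  // row of P this thread owns
  int Col = blockIdx.x * TILE_WIDTH + tx;  // column of P this thread owns

  float Pvalue = 0;
  // Phased execution: one loop iteration per pair of tiles.
  for (int ph = 0; ph < Width / TILE_WIDTH; ++ph) {
    // Loading a tile: each thread loads one M element and one N element.
    Mds[ty][tx] = M[Row * Width + ph * TILE_WIDTH + tx];
    Nds[ty][tx] = N[(ph * TILE_WIDTH + ty) * Width + Col];
    __syncthreads();  // all tile elements loaded before any thread uses them

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];  // the per-phase "use" iterations
    __syncthreads();  // all tile elements consumed before the next load
  }
  P[Row * Width + Col] = Pvalue;
}
```

The first __syncthreads() guarantees the tile is fully loaded before any thread reads it; the second guarantees it is fully consumed before any thread overwrites it in the next phase.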

  15. GPU Teaching Kit Accelerated Computing The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
