Seminar on GPGPU Programming: Optimising Matrix Multiplications with CUDA
Axel Eirola, 28.01.2010
Table of Contents
◮ Introduction
◮ Multiplication with CPU: Naive implementation, IT++
◮ Multiplication with CUDA: Naive implementation, Using Shared Memory, Optimising Block Size, CUBLAS
◮ Discussion
Introduction
◮ Matrix multiplication for square, uniformly random matrices
◮ C = AB where A, B, C ∈ R^(n×n)
◮ Synthetic benchmarking, since we do not know anything about the matrices
◮ In real-life problems we usually have information about the matrix; it can be symmetric or orthogonal, or have some other structure which can be exploited in the computations
About the benchmarks
◮ Executed on miranda
◮ CPU code in C++, GPU code in CUDA
◮ Measurements are the average of 5 runs after one warm-up run
◮ Calculations performed in single-precision floating point
◮ Only the actual calculation is timed; no allocation or copying between host and device
◮ Matrices of sizes 100 × 100 (40 kB) to 4000 × 4000 (64 MB) were used
Naive CPU implementation
◮ Simple ”by definition” implementation
◮ Loops through the elements of the output matrix C, and calculates each element separately
◮ No multithreading, no smart fetching of elements from A and B
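A minimal sketch of such a by-definition triple loop (the function name and the flat row-major layout are assumptions, not the seminar's actual code):

```cpp
#include <vector>

// Naive "by definition" multiply: C = A * B for n x n matrices stored
// flat in row-major order. Each element of C is computed independently,
// with no blocking, caching tricks, or multithreading.
void matmul_naive(const std::vector<float>& A,
                  const std::vector<float>& B,
                  std::vector<float>& C, int n) {
    for (int row = 0; row < n; ++row) {
        for (int col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
    }
}
```

The innermost loop walks a row of A and a column of B; for large n the column accesses to B stride across memory, which is part of why this version scales poorly.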
Benchmarks
[Figure: Naive CPU implementation. Time (ms) vs. matrix width, log-log; series: CPU Naive]
BLAS library IT++
◮ A general-purpose linear algebra and signal processing library for C++
◮ Utilizes underlying BLAS implementations
◮ Seems to do multithreading and smarter memory management
◮ Does not seem to use Strassen's (or any other sub-cubic) matrix multiplication algorithm
Benchmarks
[Figure: IT++ library. Time (ms) vs. matrix width, log-log; series: CPU Naive, IT++]
Naive GPU implementation
◮ Trivial reimplementation of the naive CPU code in CUDA
◮ Replaces the loops with threading: one thread is created for each element of the output matrix C
◮ All data is retrieved from the global memory of the GPU
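A sketch of what such a kernel looks like (kernel and variable names are assumptions, not the seminar's actual code):

```cuda
// One thread per output element: the thread at (col, row) computes
// C[row][col]. Every operand is read directly from global memory,
// so each element of A and B is fetched n times in total.
__global__ void MatMulNaive(const float* A, const float* B,
                            float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}
```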
Benchmarks
[Figure: Naive GPU implementation. Time (ms) vs. matrix width, log-log; series: CPU Naive, IT++, GPU Naive]
Speed it up with Shared Memory
◮ The naive GPU implementation only used global memory for accessing matrices A and B
◮ Since each element is accessed multiple times, it would be faster to store the elements somewhere close, such as the SM (streaming multiprocessor) shared memory
◮ Give each thread block the responsibility to calculate one block of the output matrix C
◮ Store the data needed to calculate that block in shared memory
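The tiling scheme above can be sketched as follows, in the style of the CUDA Programming Guide's shared-memory example (names and the divisibility assumption are mine, not the seminar's code):

```cuda
#define BLOCK_SIZE 16

// Tiled multiply: each thread block computes one BLOCK_SIZE x BLOCK_SIZE
// sub-matrix Csub of C. Tiles of A and B are staged through shared
// memory, so each global element is loaded once per tile instead of
// once per output element. Assumes n is a multiple of BLOCK_SIZE.
__global__ void MatMulShared(const float* A, const float* B,
                             float* C, int n) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        // Each thread loads one element of the current A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();   // wait until the tile is fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // finish with the tile before overwriting it
    }
    C[row * n + col] = sum;
}
```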
Benchmarks
[Figure: Naive matrix multiplication. Diagram of matrices A, B, C showing how element (row, col) of C is computed from one row of A and one column of B]
Benchmarks
[Figure: Matrix multiplication with shared memory. Diagram of A, B, C partitioned into BLOCK_SIZE × BLOCK_SIZE tiles; each thread block computes one sub-matrix Csub]
Benchmarks
[Figure: GPU using shared memory. Time (ms) vs. matrix width, log-log; series: CPU Naive, IT++, GPU Naive + Shared Memory]
What can we do with block size
◮ The block size determines the number of threads executed by one SM (streaming multiprocessor)
◮ The total number of threads stays constant
◮ But the amount of data kept in the shared memory of the SM is increased, decreasing the number of costly accesses to global memory
◮ The block size is limited to 22, since the maximum number of threads in one thread block is 512 (22² = 484 and 23² = 529)
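A sketch of the corresponding launch configuration (the kernel name MatMulShared and the buffer names dA, dB, dC are placeholders for whatever shared-memory kernel is being launched):

```cuda
// 22 x 22 = 484 threads per block fits under the 512-thread limit
// of the hardware used, while 23 x 23 = 529 would exceed it.
const int BLOCK = 22;
dim3 threads(BLOCK, BLOCK);
dim3 grid((n + BLOCK - 1) / BLOCK,
          (n + BLOCK - 1) / BLOCK);
MatMulShared<<<grid, threads>>>(dA, dB, dC, n);
```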
Benchmarks
[Figure: GPU with larger block size. Time (ms) vs. matrix width, log-log; series: CPU Naive, IT++, GPU Naive + Shared Memory + large block size]
CUBLAS library
◮ A C library provided by NVIDIA implementing the BLAS (Basic Linear Algebra Subprograms) specification
◮ Could not find documentation of what it actually does internally, but it clearly does something effective
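Calling into CUBLAS replaces the hand-written kernel with a single GEMM call. A sketch using the cublasSgemm v2 interface (the 2010-era library exposed a slightly different legacy API; helper name is mine):

```cuda
#include <cublas_v2.h>

// C = A * B for n x n single-precision matrices already resident on
// the device. CUBLAS follows the BLAS column-major convention, so a
// row-major caller must reinterpret operands (or swap A and B).
void matmul_cublas(const float* dA, const float* dB, float* dC, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, dA, n,
                        dB, n,
                &beta,  dC, n);
    cublasDestroy(handle);
}
```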
Benchmarks
[Figure: CUBLAS library implementation. Time (ms) vs. matrix width, log-log; series: CPU Naive, IT++, GPU Naive + Shared Memory + large block size, CUBLAS]
Benchmarks
[Figure: This is interesting. Same comparison plot, highlighting periodic spikes in the CUBLAS curve]
Benchmarks (Zoomed)
[Figure: Zoom on spikes. Time (ms) vs. matrix width from 1008 to 1200; series: CUBLAS]
◮ CUBLAS is twice as fast when the width of the matrix is divisible by 16
◮ Noticed by O. Schenk et al. in Algorithmic performance studies on graphics processing units, stating that ”When the matrix is