

SLIDE 1

Seminar on GPGPU Programming: Optimising Matrix Multiplications with CUDA

Axel Eirola 28.01.2010

SLIDE 2

Table of Contents

◮ Introduction
◮ Multiplication with CPU
    • Naive implementation
    • IT++
◮ Multiplication with CUDA
    • Naive implementation
    • Using Shared Memory
    • Optimising Block Size
    • CUBLAS
◮ Discussion

SLIDE 3

Introduction

◮ Matrix multiplication for square, uniformly random matrices
◮ C = AB where A, B, C ∈ R^(n×n)
◮ Synthetic benchmarking, since we do not know anything about the matrices
◮ In real-life problems we usually have information about the matrix: it can be symmetric or orthogonal, or have some other pattern which can be exploited in the computations

SLIDE 4

About the benchmarks

◮ Executed on miranda
◮ CPU code in C++, GPU code in CUDA
◮ Measurements are the average of 5 runs after one warm-up run
◮ Calculations performed in single precision floating point
◮ Only the actual calculation was timed, no allocation or copying between host and device (see the timing sketch below)
◮ Matrices of sizes 100 × 100 (40 kB) to 4000 × 4000 (64 MB) were used
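A minimal sketch of such kernel-only timing with CUDA events; the kernel name and launch parameters are placeholders, not the original benchmark harness:

    // Time only the kernel itself with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    matMul<<<grid, block>>>(dA, dB, dC, n);   // hypothetical kernel call
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);               // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);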

SLIDE 5

Naive CPU implementation

◮ Simple "by definition" implementation (sketched below)
◮ Loops through the elements of the output matrix C, and calculates each element separately
◮ No multithreading, no smart fetching of elements from A and B
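A minimal sketch of such a by-definition implementation; the function name and row-major layout are assumptions, not taken from the benchmark code:

    // Naive O(n^3) multiplication: C = A * B for row-major n x n matrices.
    void matMulNaive(const float* A, const float* B, float* C, int n) {
        for (int row = 0; row < n; ++row) {
            for (int col = 0; col < n; ++col) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[row * n + k] * B[k * n + col];
                C[row * n + col] = sum;   // each output element computed separately
            }
        }
    }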

SLIDE 6

Benchmarks

[Plot: Time (ms, log scale 0.1 to 10000) vs. Matrix Width (100 to 1000); series: CPU Naive]

Figure: Naive CPU implementation

SLIDE 7

BLAS library IT++

◮ A general-purpose linear algebra and signal processing library for C++ (usage sketched below)
◮ Utilizes underlying BLAS implementations
◮ Seems to do multithreading and smarter memory management
◮ Does not seem to use Strassen's (or any other) fast matrix multiplication algorithm
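A minimal sketch of what the product looks like with IT++; illustrative usage, not the benchmark code itself:

    // Multiply two uniformly random n x n matrices with IT++.
    #include <itpp/itbase.h>
    using namespace itpp;

    int main() {
        int n = 1000;
        mat A = randu(n, n);   // uniformly random matrix
        mat B = randu(n, n);
        mat C = A * B;         // dispatched to the underlying BLAS routine
        return 0;
    }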

SLIDE 8

Benchmarks

[Plot: Time (ms, log scale 0.1 to 10000) vs. Matrix Width (100 to 1000); series: CPU Naive, IT++]

Figure: IT++ library

SLIDE 9

Naive GPU implementation

◮ Trivial reimplementation of the naive CPU code in CUDA (see the sketch below)
◮ Replaces the loops with threading: one thread is created for each element of the output matrix C
◮ All data is retrieved from the global memory of the GPU
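A minimal sketch of such a naive kernel; names and indexing are illustrative, not the original code:

    // One thread per element of C; every operand read comes from global memory.
    __global__ void matMulNaive(const float* A, const float* B, float* C, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
    }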

SLIDE 10

Benchmarks

[Plot: Time (ms, log scale 0.1 to 10000) vs. Matrix Width (100 to 1000); series: CPU Naive, IT++, GPU Naive]

Figure: Naive GPU implementation

SLIDE 11

Speed it up with Shared Memory

◮ The naive GPU implementation only used global memory for accessing matrices A and B
◮ Since each element is accessed multiple times, it would be faster to store the elements somewhere close, such as the SM (streaming multiprocessor) shared memory
◮ Give each thread block the responsibility of calculating one block of the output matrix C (see the sketch below)
◮ Store the data needed to calculate the block in shared memory
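A sketch along the lines of the well-known shared-memory tiling scheme from the CUDA Programming Guide; for brevity it assumes n is a multiple of BLOCK_SIZE:

    #define BLOCK_SIZE 16

    __global__ void matMulShared(const float* A, const float* B, float* C, int n) {
        // Tiles of A and B staged in the SM's shared memory.
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        // Walk over all tile pairs contributing to this block of C.
        for (int t = 0; t < n / BLOCK_SIZE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
            __syncthreads();   // wait until the tile is fully loaded

            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();   // wait before overwriting the tile
        }
        C[row * n + col] = sum;
    }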

SLIDE 12

Benchmarks

[Diagram: naive scheme, one thread computing element (row, col) of C from row `row` of A and column `col` of B; dimensions A.width, A.height, B.width, B.height]

Figure: Naive matrix multiplication

SLIDE 13

Benchmarks

[Diagram: tiled scheme, each thread block computing one BLOCK_SIZE × BLOCK_SIZE sub-matrix Csub of C from BLOCK_SIZE-wide strips of A and B; indices blockRow, blockCol, row, col]

Figure: Matrix multiplication with shared memory

SLIDE 14

Benchmarks

[Plot: Time (ms, log scale 0.1 to 10000) vs. Matrix Width (100 to 1000); series: CPU Naive, IT++, GPU Naive, + Shared Memory]

Figure: GPU using shared memory

SLIDE 15

What can we do with the block size?

◮ The block size determines the number of threads executed by one SM (streaming multiprocessor)
◮ The total number of threads stays constant
◮ But the amount of data kept in the shared memory of the SM increases, decreasing the number of costly accesses to global memory
◮ Block size is limited to 22, since the maximum number of threads in one block is 512 (22² = 484 and 23² = 529); see the launch sketch below
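A sketch of the corresponding launch configuration, assuming a variant of the shared-memory kernel compiled with BLOCK_SIZE set to 22 (illustrative, not the original code):

    dim3 block(22, 22);                        // 22 * 22 = 484 threads per block,
                                               // just under the 512-thread limit
    dim3 grid((n + 21) / 22, (n + 21) / 22);   // enough blocks to cover all of C
    matMulShared<<<grid, block>>>(dA, dB, dC, n);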

SLIDE 16

Benchmarks

[Plot: Time (ms, log scale 0.1 to 10000) vs. Matrix Width (100 to 1000); series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large block size]

Figure: GPU with larger blocksize

SLIDE 17

CUBLAS library

◮ A C library provided by NVIDIA implementing the BLAS (Basic Linear Algebra Subprograms) specification (usage sketched below)
◮ Could not find out what it actually does internally, but it seems to do something.
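A minimal sketch of the single-precision product with the 2010-era CUBLAS API; dA, dB and dC are assumed to be device pointers already holding column-major n × n data:

    #include <cublas.h>

    cublasInit();
    // C = 1.0 * A * B + 0.0 * C, no transposition, column-major layout.
    cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
    cublasShutdown();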

SLIDE 18

Benchmarks

[Plot: Time (ms, log scale 0.1 to 10000) vs. Matrix Width (100 to 1000); series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large block size, CUBLAS]

Figure: CUBLAS library implementation

SLIDE 19

Benchmarks

[Plot: Time (ms, log scale 0.1 to 10000) vs. Matrix Width (100 to 1000); series: CPU Naive, IT++, GPU Naive, + Shared Memory, + large block size, CUBLAS]

Figure: This is interesting

SLIDE 20

Benchmarks (Zoomed)

[Plot: Time (ms, 2 to 20) vs. Matrix Width (1008 to 1200); series: CUBLAS]

Figure: Zoom on spikes

SLIDE 21

◮ CUBLAS is twice as fast when the width of the matrix is divisible by 16
◮ Noticed by O. Schenk et al. in "Algorithmic performance studies on graphics processing units", stating that: "When the matrix is not divisible by 16, there are conflicts in shared memory regarding multiple threads accessing the same bank at the same time. This forces one thread to be put in a queue while the other thread is accessing the memory, increasing the amount of time for all memory accesses to be completed."
◮ The question is: why aren't the smaller matrices padded to become divisible by 16? (A padding sketch follows.)
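A sketch of the padding idea; padTo16 is a hypothetical helper, not something CUBLAS provides:

    // Round n up to the next multiple of 16.
    int padTo16(int n) {
        return ((n + 15) / 16) * 16;
    }
    // A padded x padded device buffer would then be zero-filled and the
    // original n x n matrix copied into its top-left corner before the
    // cublasSgemm call, restoring the aligned fast path.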

SLIDE 22

Profit ratio

◮ A Tesla C1060 costs about $1200, and calculates a 2000 × 2000 matrix product in 50 ms
◮ A Core i7 920 costs about $300, and calculates a 2000 × 2000 matrix product in 2000 ms
◮ CUBLAS is about 40 times faster than IT++, while a Tesla costs only about 4 times more than a Core i7
◮ So the profit ratio becomes tenfold:

($300 × 2000 ms) / ($1200 × 50 ms) = 10

SLIDE 23

Summary

◮ GPGPU is fast :)
◮ But without proper memory management it isn't as fast as it could be
◮ Even the libraries aren't as fast as they could be