SLIDE 1

TSM2: Optimizing Tall-and-Skinny Matrix- Matrix Multiplication on GPUs

Jieyang Chen, Nan Xiong, Xin Liang, Dingwen Tao*, Sihuan Li, Kaiming Ouyang, Kai Zhao, Nathan DeBardeleben**, Qiang Guan***, Zizhong Chen

University of California, Riverside *University of Alabama **Los Alamos National Laboratory ***Kent State University

SLIDE 2

Linear algebra kernels are widely used

  • Linear algebra kernels have been widely used, e.g., in scientific simulation, big data analytics, machine learning, etc.
  • Matrix-matrix multiplication (GEMM)
  • One of the most fundamental computation kernels, used to build up other kernels
  • Core computation of many applications
  • Accounts for most of the computation time in many applications

(Source: Berkeley Dwarfs Report)

SLIDE 3

Input shape of GEMM varies from application to application

[Figure: example applications (Deep Neural Networks, Dense Matrix Decompositions, K-means, Algorithm-Based Fault Tolerance) whose GEMM input shapes range from relatively regular to tall-and-skinny]

SLIDE 4

Two Kinds of Computations

  • Computation bound: performance of the application is bounded by the computation power.
  • Memory bound: performance of the application is bounded by the memory bandwidth.

[Figure: regular matrix-matrix multiplication (n×n times n×n) is computation bound; tall-and-skinny matrix-matrix multiplication (n×n times n×k, with n > 10,000 and k < 100) and matrix-vector multiplication (n×n times n×1) are memory bound]

SLIDE 5

Why does tall-and-skinny input behave differently from regular-shaped input?

  • Regular matrix-matrix multiplication (n×n times n×n): input matrices size is O(n²), computing time complexity is O(n³), each element is used n times.
  • Matrix-matrix multiplication with tall-and-skinny input (n×n times n×k): input matrices size is O(n²), computing time complexity is O(n²k), each element is used k times on average.

  • So for tall-and-skinny matrix input, depending on k and the ratio between the target GPU's peak computation power and peak memory throughput, the computation is usually memory bound.
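A rough roofline-style sketch (an illustration; the concrete per-GPU crossover values appear on the last slide) of why small k implies memory bound:

    floating-point work    ∝ n²·k   (each of the n² elements of A takes part in k multiply-adds)
    bytes read from memory ∝ n²     (matrix A dominates; B and C hold only n·k elements each)
    operations per byte    ∝ k

When k is small, this ratio falls below the GPU's own compute-to-bandwidth ratio, so the kernel cannot reach peak compute and is limited by memory bandwidth instead.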

SLIDE 6

GPUs are widely used for accelerating applications

  • Good at parallelized computations.
  • Higher computation power and memory throughput than CPUs.
  • Commonly used for accelerating matrix-related computations.
SLIDE 7

cuBLAS library

  • One of the most commonly used standard linear algebra libraries optimized for GPUs, developed by Nvidia.
  • The core computing library of many big data and scientific computing applications.
  • With deep optimization by Nvidia, the cuBLAS library is able to provide state-of-the-art performance for regular-shaped input matrices.
  • But it is not fully optimized for tall-and-skinny matrix cases.
SLIDE 8

Poor Performance of the Current State-of-the-Art Design

[Figure: regular-sized matrix multiplication (large n and k of similar magnitude: computation bound) vs. tall-and-skinny matrix multiplication (n >> k: memory bound)]

  • The current state-of-the-art design is only optimized for the computation-bound case.

[Plots: cuBLAS performance (Gflop/s) and memory throughput (GB/s) vs. hardware peaks, for input matrix sizes n = 10,240 to 30,720 with k = 2 and k = 16; both show sudden performance drops]

Low GPU utilization:

  • K = 2: 49.9% of peak memory bandwidth, 37.9% of peak computation power
  • K = 16: 31.1% of peak memory bandwidth, 56.6% of peak computation power
  • Regular size: 80%-90% of the peak computation power

SLIDE 9

TSM2: redesigned matrix-matrix multiplication for tall-and-skinny input

  • Several factors are considered:
    1) Total number of global memory accesses.
    2) Efficiency of global memory throughput.
    3) Parallelism of the overall workload.
    4) On-chip memory utilization.
    5) Streaming Multiprocessor (SM) utilization.
SLIDE 10

Algorithm design: how to fit the workload into the programming model of CUDA (continued)

  • We divide the workload by assigning the n rows of matrix A to n different threads. Each vector-matrix multiplication is assigned to one thread.

[Figure: thread i computes row i of A (n×n) times B (n×k), producing row i of C (n×k)]

    i.   To ensure high parallelism and high Streaming Multiprocessor occupancy.
    ii.  To ensure a minimum number of memory accesses in favor of matrix A.
    iii. To enable high memory access efficiency.
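A minimal CUDA sketch of this thread mapping (an illustration of the idea under assumed column-major, double-precision storage; not the authors' actual kernel):

    #include <cuda_runtime.h>

    // Each thread computes one row of C = A * B.
    // A: n x n, B: n x k, C: n x k; column-major double precision is assumed.
    __global__ void tall_skinny_gemm_naive(const double *A, const double *B,
                                           double *C, int n, int k)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row of A
        if (row >= n) return;

        for (int j = 0; j < k; ++j) {                      // each of the k (skinny) columns of B
            double acc = 0.0;
            for (int l = 0; l < n; ++l)                    // row `row` of A dot column j of B
                acc += A[row + l * n] * B[l + j * n];
            C[row + j * n] = acc;
        }
    }

    // Launch example: tall_skinny_gemm_naive<<<(n + 255) / 256, 256>>>(dA, dB, dC, n, k);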

SLIDE 11

Redesigning matrix-matrix multiplication for tall-and-skinny input

  • Rethinking algorithm design – aiming to reduce the total number of memory accesses
  • Inner product vs. outer product

Version 0: Inner Product
  • Memory access to each element of A: k times
  • Memory access to each element of B: n times
  • Total number of accesses: 2kn²

Version 1: Outer Product
  • Memory access to each element of A: 1 time
  • Memory access to each element of B: n times
  • Total number of accesses: (k+1)n²
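A sketch of the outer-product idea under the same thread mapping (illustrative only; the real TSM2 versions add tiling, shared memory, and prefetching): each thread keeps its 1×k row of C in registers, so every element of A is read from global memory exactly once.

    // Sketch of Version 1: accumulate the whole 1 x k row of C in registers.
    // Assumes k <= MAX_K (hypothetical bound) and column-major storage as before.
    #define MAX_K 16

    __global__ void tall_skinny_gemm_outer(const double *A, const double *B,
                                           double *C, int n, int k)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;

        double acc[MAX_K];
        for (int j = 0; j < k; ++j) acc[j] = 0.0;

        for (int l = 0; l < n; ++l) {
            double a = A[row + l * n];                     // one global load of A, reused k times
            for (int j = 0; j < k; ++j)
                acc[j] += a * B[l + j * n];                // rank-1 style update of the row of C
        }
        for (int j = 0; j < k; ++j)
            C[row + j * n] = acc[j];
    }

This matches the access counts above: every element of A is read from global memory once in total, while each element of B is still read once per row of A, i.e., n times.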

[Plot: speedup over matrix sizes n = 10K to 30K for cuBLAS, BLASX, TSM2-V0, and TSM2-V1; tall-and-skinny GEMM with K = 8 on an Nvidia Tesla K40c]

SLIDE 12

Global memory access efficiency analysis

  • Global memory access efficiency per transaction = useful data / cache line size
  • Affects overall application memory access efficiency
  • Determined by the memory access pattern and the algorithm
  • Can be challenging to improve without modifying the algorithm design
  • For outer product GEMM:

8 bytes / 128 bytes = 6.25%   or   8 bytes / 32 bytes = 25%
128 bytes / 128 bytes = 100%   or   32 bytes / 32 bytes = 100%
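As a generic illustration of where these percentages come from (not the TSM2 access pattern itself): when the threads of a warp each read one 8-byte double that lies far apart in memory, every 128-byte (or 32-byte) memory transaction carries only 8 useful bytes; when they read consecutive doubles, the whole transaction is useful.

    // Generic illustration of per-transaction efficiency (not the TSM2 kernel).
    __global__ void strided_read(const double *A, double *out, int n, int stride)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int idx = t * stride;               // neighboring threads touch addresses 8*stride bytes apart:
        if (idx < n)                        // each 128-byte (or 32-byte) transaction carries 8 useful bytes,
            out[t] = A[idx];                // i.e., 6.25% (or 25%) efficiency
    }

    __global__ void coalesced_read(const double *A, double *out, int n)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n)
            out[t] = A[t];                  // consecutive doubles: 100% of each transaction is useful
    }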

SLIDE 13

Improving global memory access efficiency

Version 2: Outer Product + Shared Mem.

  • GPU shared memory: sharing data between threads within a thread block
  • Benefit: decoupling the data-load pattern from the data-use pattern
  • Load data into shared memory in a more efficient way: memory transaction efficiency = 100%
  • Keep the original data-use pattern of the outer product version
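A sketch of the shared-memory idea (illustrative only, with hypothetical tile-size constants; not the authors' exact kernel): the thread block cooperatively loads a tile of B with fully coalesced accesses, then every thread reads B from fast shared memory while keeping the outer-product register accumulation.

    // Sketch of Version 2: the thread block cooperatively loads a TILE_N x k tile of B
    // into shared memory with coalesced accesses; TILE_N and MAX_K are hypothetical.
    #define TILE_N 128
    #define MAX_K  16

    __global__ void tall_skinny_gemm_smem(const double *A, const double *B,
                                          double *C, int n, int k)
    {
        __shared__ double Bs[TILE_N * MAX_K];              // current tile of B
        int row = blockIdx.x * blockDim.x + threadIdx.x;

        double acc[MAX_K];
        for (int j = 0; j < k; ++j) acc[j] = 0.0;

        for (int l0 = 0; l0 < n; l0 += TILE_N) {
            // Coalesced cooperative load: consecutive threads read consecutive
            // elements of each column of B.
            for (int idx = threadIdx.x; idx < TILE_N * k; idx += blockDim.x) {
                int l = idx % TILE_N, j = idx / TILE_N;
                Bs[idx] = (l0 + l < n) ? B[(l0 + l) + j * n] : 0.0;
            }
            __syncthreads();

            if (row < n) {
                for (int l = 0; l < TILE_N && l0 + l < n; ++l) {
                    double a = A[row + (l0 + l) * n];      // A loads stay coalesced across threads
                    for (int j = 0; j < k; ++j)
                        acc[j] += a * Bs[l + j * TILE_N];  // B is now read from shared memory
                }
            }
            __syncthreads();
        }
        if (row < n)
            for (int j = 0; j < k; ++j) C[row + j * n] = acc[j];
    }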

[Plot: speedup over matrix sizes n = 10K to 30K for cuBLAS, BLASX, TSM2-V0, TSM2-V1, and TSM2-V2; tall-and-skinny GEMM with K = 8 on an Nvidia Tesla K40c]

SLIDE 14

Improving global memory access efficiency

Version 2: Outer Product + Shared Mem.

[Figure: data dependency between data-load and data-use instructions]

  • Even with an efficient global memory loading pattern, it still causes high GPU underutilization.
  • Main cause: the long memory access latency is hard to hide.

SLIDE 15

Data prefetch: Improving GPU utilization

[Figure: within each thread block, registers hold the current tile of A and shared memory holds the current tile of B; while computing on the current tile, the next tile of A and the next tile of B are prefetched into registers, and the next tile of B is written to shared memory before the next iteration. Per-iteration timeline: LD C, LD NextB, LD NextA, Compute, ST C, thread synchronization.]

Version 3: Outer Product + Shared Mem. + Data Prefetch

  • Data prefetch: prefetch the data needed for the next iteration.
  • Prefetching the data for the next iteration improves latency hiding and GPU utilization.
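A sketch of the prefetch idea on top of the previous version (illustrative only; the real design also prefetches the next tile of A into registers, omitted here for brevity): while computing on the current tile of B, each thread issues the global loads for the next tile of B into registers, and moves them into shared memory only after the block synchronizes.

    // Sketch of Version 3: compute on the current tile of B in shared memory while
    // prefetching the NEXT tile of B into registers; publish it after a sync.
    // Assumes blockDim.x == TILE_N and k <= MAX_K (hypothetical parameters).
    #define TILE_N 128
    #define MAX_K  16

    __global__ void tall_skinny_gemm_prefetch(const double *A, const double *B,
                                              double *C, int n, int k)
    {
        __shared__ double Bs[TILE_N * MAX_K];              // current tile of B
        int tid = threadIdx.x;
        int row = blockIdx.x * blockDim.x + tid;

        double acc[MAX_K], nextB[MAX_K];
        for (int j = 0; j < k; ++j) acc[j] = 0.0;

        for (int j = 0; j < k; ++j)                        // load the first tile of B
            Bs[tid + j * TILE_N] = (tid < n) ? B[tid + j * n] : 0.0;
        __syncthreads();

        for (int l0 = 0; l0 < n; l0 += TILE_N) {
            bool more = (l0 + TILE_N < n);
            int lnext = l0 + TILE_N + tid;

            if (more)                                      // 1) prefetch the NEXT tile of B into
                for (int j = 0; j < k; ++j)                //    registers (overlaps with compute)
                    nextB[j] = (lnext < n) ? B[lnext + j * n] : 0.0;

            if (row < n)                                   // 2) compute on the CURRENT tile
                for (int l = 0; l < TILE_N && l0 + l < n; ++l) {
                    double a = A[row + (l0 + l) * n];
                    for (int j = 0; j < k; ++j)
                        acc[j] += a * Bs[l + j * TILE_N];
                }
            __syncthreads();                               // everyone is done reading Bs

            if (more) {                                    // 3) publish the prefetched tile
                for (int j = 0; j < k; ++j)
                    Bs[tid + j * TILE_N] = nextB[j];
                __syncthreads();
            }
        }
        if (row < n)
            for (int j = 0; j < k; ++j) C[row + j * n] = acc[j];
    }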

[Plot: speedup over matrix sizes n = 10K to 30K for cuBLAS, BLASX, and the TSM2 versions; tall-and-skinny GEMM with K = 8 on an Nvidia Tesla K40c]

SLIDE 16

Experimental evaluation:

GPU Model  | Micro-architecture | Memory | Peak performance | Peak memory bandwidth
Tesla K40c | Kepler             | 12 GB  | 1430 GFLOPS      | 288 GB/s
Tesla M40  | Maxwell            | 24 GB  | 213 GFLOPS       | 288 GB/s
Tesla P100 | Pascal             | 16 GB  | 4600 GFLOPS      | 720 GB/s

SLIDE 17

Experimental evaluation: Speedup (on Nvidia Tesla K40c)

SLIDE 18

Experimental evaluation: Memory bandwidth (on Nvidia Tesla K40c)

SLIDE 19

Experimental evaluation on Nvidia Tesla M40 and P100

[Plots: results on Tesla M40 and Tesla P100]

SLIDE 20

Showcase 1: K-means

  • Execution time of the first 100 iterations of Lloyd's K-means algorithm on K40c (d = 4096, k = 16).
  • Using our TSM2, we speed up K-means by 1.06x - 1.89x (avg. 1.53x).
  • GPU version of K-means originally developed by NVIDIA: https://github.com/NVIDIA/kmeans

Core computation of Lloyd's K-means: distance calculation. A common choice is the Euclidean distance:
  ||x − y||² = ||x||² + ||y||² − 2x·y
When we have multiple x and y: group the x into matrix X, group the y into matrix Y; calculating all the x·y terms becomes XY (matrix-matrix multiplication).

Calculating the distance between:

  • Data points X (n points with d dimensions);
  • Centroids C (k centroids with d dimensions);
  • → matrix-matrix multiplication: (n×d) times (d×k).
  • Usually k << n, d → tall-and-skinny
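A sketch of how the cross term −2·X·Cᵀ maps onto a single tall-and-skinny GEMM call (illustrative host code using cuBLAS for concreteness; column-major storage and hypothetical names are assumed, and this is the call that TSM2 is designed to replace):

    #include <cublas_v2.h>

    // Cross term of the squared Euclidean distances: cross = -2 * X * CT
    //   X  : n_points x d  (data points, column-major, double)
    //   CT : d x k         (centroids, transposed)
    //   cross : n_points x k
    // With k << n_points, d this is exactly the tall-and-skinny GEMM shape TSM2 targets.
    void kmeans_cross_term(cublasHandle_t handle,
                           const double *X, const double *CT, double *cross,
                           int n_points, int d, int k)
    {
        const double alpha = -2.0, beta = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n_points, k, d,
                    &alpha, X, n_points,
                            CT, d,
                    &beta,  cross, n_points);
        // The ||x||² and ||c||² terms are added separately to complete the distances.
    }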
SLIDE 21

Showcase 2: ABFT Matrix Checksum Encoding

We compare the checksum encoding performance using cuBLAS and TSM2 on K40c. As we can see, our TSM2 significantly improves the checksum encoding calculation, with a 1.10x to 1.90x speedup (avg. 1.67x).

  • Core computation of ABFT: calculating checksums (encoding redundant information).
  • E.g., calculate the checksum of matrix A with checksum weight vector v: checksum(A) = A·v
  • Usually multiple different checksum weight vectors are used.
  • If we use c different checksum weight vectors → (m-by-n) times (n-by-c)
  • Common choice: c = 2 << m, n → tall-and-skinny
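For concreteness, a sketch of the encoding step (illustrative host code with hypothetical names; the weight vectors shown, all-ones and linear weights, are a common choice in the ABFT literature rather than something specified on this slide): the c weight vectors are packed into an n×c matrix V and the checksums are computed as the tall-and-skinny product A·V.

    #include <cublas_v2.h>

    // Checksum encoding as a tall-and-skinny GEMM: checksums = A * V,
    // where A is m x n (column-major) and V packs c weight vectors as an n x c matrix.
    void abft_encode_checksums(cublasHandle_t handle,
                               const double *A, const double *V, double *checksums,
                               int m, int n, int c)
    {
        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, c, n,
                    &alpha, A, m,
                            V, n,
                    &beta,  checksums, m);
    }

    // Example weight matrix for c = 2 (a common choice in the ABFT literature,
    // not taken from this slide): v1 = (1, 1, ..., 1), v2 = (1, 2, ..., n).
    void build_weights(double *V, int n)        // V is n x 2, column-major (host side)
    {
        for (int i = 0; i < n; ++i) { V[i] = 1.0; V[n + i] = (double)(i + 1); }
    }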

SLIDE 22

Conclusion:

  • We first analyzed the performance of the current GEMM in the latest cuBLAS library.
  • We discovered the potential challenges of optimizing tall-and-skinny GEMM, since its workload is memory bound.
  • We redesigned an optimized tall-and-skinny GEMM with several optimization techniques focusing on GPU resource utilization.
  • Experimental results show that our optimized implementation achieves better performance on three modern GPU micro-architectures.

SLIDE 23

We have an optimized design, but when do we use it?

How to determine when the computation is memory bound and when it is not?

[Figure: whether the computation is memory bound or computation bound depends on the tuning parameters and on the hardware parameters (GPU peak performance and GPU peak memory bandwidth). Crossover points: NVIDIA Tesla K40: K = 40; NVIDIA Tesla M40: K = 6; NVIDIA Tesla P100: K = 50.]
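One back-of-the-envelope reading of these crossover points (an observation derived from the hardware table earlier in the deck, not a formula stated on this slide): the crossover K scales with the GPU's peak-performance-to-peak-bandwidth ratio, roughly K* ≈ 8 bytes per double × peak performance / peak memory bandwidth.

    K40c: 8 × 1430 Gflop/s / 288 GB/s ≈ 40
    M40:  8 × 213 Gflop/s  / 288 GB/s ≈ 6
    P100: 8 × 4600 Gflop/s / 720 GB/s ≈ 51 ≈ 50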