Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology


SLIDE 1

Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology

iWAPT 2009, October 2, 2009

Jaewook Shin¹, Mary W. Hall², Jacqueline Chame³, Chun Chen², Paul D. Hovland¹

¹ANL/MCS  ²University of Utah  ³USC/ISI

SLIDE 2

SLIDE 3

Gap between H/W and S/W Performance

Moore’s Law

What do we do with the exponentially increasing number of transistors?

H/W performance increases through ...

  • More parallelism: longer vector lengths for SIMD, more cores
  • Heterogeneous architectures: STI CELL processors, NVIDIA graphics processors
  • More hard-to-use instructions: prefetch, cache manipulation, SIMD, ...

H/W is becoming too complex for software to exploit all of its features, so the fraction of H/W performance achieved by S/W decreases: the performance gap between H/W and S/W keeps growing.

SLIDE 4

SLIDE 5

Performance Tuning

original program → performance tuning → tuned program

SLIDE 6

Manual Performance Tuning

The tuning work is split between the programmer and the compiler:

  • programmer: original program → manually tuned program
  • compiler: manually tuned program → compiler-optimized program → execution

SLIDE 7

Application Developers

  • Significant human effort → expensive
  • Humans explore slowly → only a small set of code transformations and a small set of code variants
  • Works on one machine-application pair at a time
  • Relies on human experience → error prone
  • Mechanical and repetitive → should be performed automatically by tools

SLIDE 8

Compiler Optimizations

  • Tied to a H/W architecture → not portable
  • Conservative assumptions for unknown information: static analyses cannot benefit from dynamic information
  • Optimizations work well mostly on simple codes
  • Based on a static profitability model → only optimizations that are profitable for most applications
  • Fixed order of optimizations

SLIDE 9

Compiler-Based Empirical Performance Tuning (Autotuning)

Many applications (App #1, App #2, App #3, ..., App #n) feed into one compiler-based empirical performance tuning framework, which targets many compiler/hardware pairs (Compiler #1 / H/W #1, ..., Compiler #N / H/W #N).

  • Application-specific optimizations
  • Architecture-specific optimizations
  • Simple, fast, portable

SLIDE 10

We are ...          We are not ...
compiler-based      manual or library-based
collaborative       fully automatic

SLIDE 11

Nek5000

  • High-order spectral element CFD code: http://nek5000.mcs.anl.gov
  • Scales to #P > 100,000
  • 30,000,000 cpu-hour INCITE allocation (2009)
  • Early science application on BG/P
  • Applications: nuclear reactor modelling, astrophysics, climate modelling, combustion, biofluids, ...

SLIDE 12

Speedups of Nek5000: 1.36X

[Bar chart: run time in seconds (y-axis 50-200) of the baseline vs. tuned Nek5000; the tuned version is 1.36x faster.]

SLIDE 13

Compiler-Based Empirical Performance Tuning for Nek5000

original program → profile → kernel → variant #1 ... variant #N → tuned kernel → tuned program

  • 1. Profiling: gprof
  • 2. Outlining: manual (see the sketch after this list)
  • 3. Code transformation: CHiLL (USC/Utah)
  • 4. Pruning: heuristics
  • 5. Library: manual
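Step 2 (outlining) extracts the profiled hot loop nest into a standalone kernel behind a stable interface, so that transformed variants can be swapped in without touching the caller. A minimal C sketch of the idea, assuming a row-major flat-array layout; the name mxm_kernel is illustrative, not Nek5000's actual interface:

    /* The hot loop nest found by profiling, pulled out into a callable
       kernel. Tuned variants later replace this body (step 5) while the
       call sites in the application stay unchanged. */
    void mxm_kernel(int m, int k, int n,
                    const double *a, const double *b, double *c)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int kk = 0; kk < k; kk++)
                    c[i*n + j] += a[i*k + kk] * b[kk*n + j];
    }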

SLIDE 14

Tools and Environment

Tools

  • Profiling: gprof
  • Code transformation: CHiLL
  • Backend compiler: Intel compilers version 10.1
  • Performance counters: PAPI (Performance Application Programming Interface)

AMD Phenom

  • 2.5 GHz, quad core
  • 4 double-precision floating-point operations / cycle → 10 GFlops / core
  • Ubuntu Linux 8.04 x86_64; all 16 SIMD registers available
  • Kernel patched with perfctr

SLIDE 15

Profiling

~60% of run time is spent in mxm44_0:

  • Dense matrix multiplication on small rectangular matrices
  • Hand-optimized by 4x4 unrolling (434 lines); this version is the baseline
  • 8 input sizes comprise 74% of all computation
  • Matrix sizes depend only on the degree of the problem

size   m    k    n
1      8    10   8
2      10   8    10
3      10   10   10
4      10   8    64
5      8    10   100
6      100  8    10
7      10   10   100
8      100  10   10

SLIDE 16

BLAS Library Performance

SLIDE 17

BLAS Library Performance: Small Rectangular Matrices

[Bar chart: percentage of peak (y-axis 10-80%) achieved by Baseline, ATLAS, ACML, and GOTO BLAS on the eight small matrix sizes (m,k,n) = (8,10,8), (10,8,10), (10,10,10), (10,8,64), (8,10,100), (100,8,10), (10,10,100), (100,10,10).]

SLIDE 18

Contributions: Fast Search & High Performance

(The same five-step workflow as Slide 13: original program → profile → kernel → variant #1 ... variant #N → tuned kernel → tuned program, via 1. profiling (gprof), 2. outlining (manual), 3. code transformation (CHiLL, USC/Utah), 4. pruning (heuristics), 5. library (manual). The contributions are the fast search, via pruning heuristics, and the high performance, via the generated code variants.)

SLIDE 19

SLIDE 20

Dense Matrix Multiply Kernel

do i=1,M
  do j=1,N
    C(i,j) = 0.0
    do k=1,K
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo

The input sizes are written (M,K,N): A is MxK, B is KxN, and C is MxN. The loop order of this loop nest is 123, i.e., (i,j,k).

SLIDE 21

Two Code Transformations

  • Unrolling: increases instruction-level parallelism (ILP), ...
  • Loop permutation: affects the compiler's SIMD code generation, ...

Example: 10x10x10

Variant #   Loop order   Ui   Uk   Uj
1           123 (ijk)    1    1    1
2           123 (ijk)    1    1    2
...
11          123 (ijk)    1    2    1
...
1000        123 (ijk)    10   10   10
1001        132 (ikj)    1    1    1
1002        132 (ikj)    1    1    2
...
6000        321 (kji)    10   10   10
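To make one point of this space concrete, here is variant-style C code for loop order 132 (ikj) with unroll factors (Ui, Uk, Uj) = (2, 1, 1), i.e., unroll-and-jam of the i loop, written out by hand for a fixed 10x10x10 size. It is a sketch of the kind of code CHiLL generates, not CHiLL's actual output:

    /* Loop order ikj, Ui = 2 (unroll-and-jam), Uk = Uj = 1.
       Assumes c was zeroed by the caller, as in the kernel sketch above. */
    void mxm_ikj_ui2(const double *a, const double *b, double *c)
    {
        for (int i = 0; i < 10; i += 2) {       /* i unrolled by 2 */
            for (int k = 0; k < 10; k++) {
                double a0 = a[i*10 + k];
                double a1 = a[(i+1)*10 + k];
                for (int j = 0; j < 10; j++) {  /* left for the compiler's SIMD */
                    c[i*10 + j]     += a0 * b[k*10 + j];
                    c[(i+1)*10 + j] += a1 * b[k*10 + j];
                }
            }
        }
    }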

SLIDE 22

Parameter Space

The parameter space is formed by the set of all possible code variants:

  • Loop permutation: six loop orders for the three loops of mxm
  • Unrolling: N unroll factors (1 to N) for a loop with N iterations → M*K*N unroll-factor combinations for the three loops of M, K, and N iterations

Time budget for tuning: 1 day. Examples:

  • 10x10x10: 6 * (10*10*10) = 6,000 variants → OK (~7 hours)
  • 100x10x10: 60,000 variants → unacceptable with exhaustive search; the space is 10 times larger, and each point in it is a larger piece of code

We therefore need search and/or pruning of the space.
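As a quick check of these counts, the space size is just 6 x M x K x N; a few lines of C mirror the arithmetic in the text (a sketch, not part of any tool):

    #include <stdio.h>

    /* Full variant space: 6 loop orders times M*K*N unroll-factor choices. */
    long variant_space(long m, long k, long n) { return 6L * m * k * n; }

    int main(void)
    {
        printf("10x10x10:  %ld variants\n", variant_space(10, 10, 10));   /* 6000 */
        printf("100x10x10: %ld variants\n", variant_space(100, 10, 10));  /* 60000 */
        return 0;
    }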

SLIDE 23

Performance Distribution (10x10x10)

SLIDE 24

Heuristic #1/4: Loop Order

SLIDE 25

Heuristic #2/4: Instruction Cache

Ui × Uk × Uj ≤ 13³

SLIDE 26

Heuristic #3/4: Unroll Factor of 1 on One Loop (SIMD)

SLIDE 27

Heuristic #4/4: Unroll Factors Evenly Dividing Iteration Count

SLIDE 28

Reduction of the Parameter Space by Heuristics

[Bar chart: percentage of the full parameter space remaining (y-axis 10-100%) after each heuristic (Loop Order, I-Cache, SIMD, Even Unroll) and after all four combined, for each of the eight matrix sizes.]
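The four heuristics can be read as one pruning predicate over (loop order, Ui, Uk, Uj). The C sketch below is a reconstruction, not the authors' code: the slides do not spell out which loop orders heuristic #1 keeps (so that test is passed in as a flag), and the I-cache bound is the Ui × Uk × Uj ≤ 13³ reconstructed on Slide 25. Treat both as assumptions:

    #include <stdbool.h>

    /* Keep a variant only if it passes all four pruning heuristics.
       order_allowed encodes heuristic #1 (the allowed loop orders are
       not listed on the slides); the 13^3 bound is heuristic #2 as
       reconstructed above. */
    bool keep_variant(bool order_allowed, int ui, int uk, int uj,
                      int m, int k, int n)
    {
        if (!order_allowed)                 /* #1: loop order */
            return false;
        if (ui * uk * uj > 13 * 13 * 13)    /* #2: unrolled body fits I-cache */
            return false;
        if (ui != 1 && uk != 1 && uj != 1)  /* #3: leave one loop for SIMD */
            return false;
        if (m % ui || k % uk || n % uj)     /* #4: factors divide trip counts */
            return false;
        return true;
    }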

SLIDE 29

SLIDE 30

Specialization

mxm(a, M, b, K, c, N) {
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < K; k++)
        c[i][j] += a[i][k] * b[k][j];
}

Fix the input matrix sizes

mxm_101010(a, b, c) {
  for (i = 0; i < 10; i++)
    for (j = 0; j < 10; j++)
      for (k = 0; k < 10; k++)
        c[i][j] += a[i][k] * b[k][j];
}

mxm(a, M, b, K, c, N) {
  if (M == 10 && K == 10 && N == 10)
    mxm_101010(a, b, c);
  else
    mxm_original(a, M, b, K, c, N);
}

SLIDE 31

High Performance by Specialization

Simpler code and more information for the tools (CHiLL, ifort):

  • Makes a simple kernel even simpler
  • Concrete information for the compilers, e.g. via interprocedural analysis: the arrays are aligned to 16-byte boundaries in memory, and the arrays are not aliased with each other

Code optimization:

  • SIMD: no conditionals to check alignment and no instructions to align data (see the sketch after this list)
  • Custom code transformations: optimizations tailored to particular input matrix sizes
  • More efficient code: less checking
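A sketch of what this buys, in C for illustration: with the matrix sizes fixed, 16-byte alignment declared (the __attribute__((aligned(16))) mentioned on the lessons-learned slide), and aliasing ruled out with C99 restrict, the backend compiler can vectorize without emitting any runtime alignment or overlap checks. The declarations below are assumptions for the sketch, not the project's actual code:

    /* Operands with known size and 16-byte alignment. */
    static double A[10*10] __attribute__((aligned(16)));
    static double B[10*10] __attribute__((aligned(16)));
    static double C[10*10] __attribute__((aligned(16)));

    /* Specialized kernel: restrict asserts the arrays do not overlap,
       so no runtime aliasing or alignment checks are needed before
       SIMD code generation. Call as: mxm_101010(A, B, C); */
    void mxm_101010(const double * restrict a,
                    const double * restrict b,
                    double * restrict c)
    {
        for (int i = 0; i < 10; i++)
            for (int j = 0; j < 10; j++)
                for (int k = 0; k < 10; k++)
                    c[i*10 + j] += a[i*10 + k] * b[k*10 + j];
    }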

SLIDE 32

Matrix Multiply Performance for Small Matrices (in cache)

[Bar chart: percentage of peak (y-axis 10-80%) for Baseline, mxf8/10, vanilla, ATLAS, ACML, GOTO, TUNE, and TUNE13 on the eight matrix sizes, with data resident in cache.]

SLIDE 33

The Code Variants Selected by Applying the Heuristics

No.  m,k,n      Size   Loop order  Ui  Uk  Uj  %max
1    8,10,8     3840   ijk         8   10  4   98.7
2    10,8,10    4800   ijk         1   8   5   100
3    10,10,10   6000   jik         1   9   5   99.3
4    10,8,64    30720  ijk         1   8   4   -
5    8,10,100   48000  ijk         1   10  4   -
6    100,8,10   48000  jki         1   8   5   -
7    10,10,100  60000  jik         1   10  4   -
8    100,10,10  60000  jik         1   10  10  -

(%max is not reported for sizes 4-8.)

SLIDE 34

Custom Code-Transformations

No.  m,k,n      1   2   3   4   5   6   7   8
1    8,10,8     58  27  49  38  58  49  56  54
2    10,8,10    43  61  58  20  20  51  39  58
3    10,10,10   39  37  59  31  20  52  44  58
4    10,8,64    44  20  54  62  61  47  62  50
5    8,10,100   57  38  57  38  59  50  59  54
6    100,8,10   27  73  74  19  19  75  58  67
7    10,10,100  39  37  58  39  61  52  61  57
8    100,10,10  26  41  71  34  19  62  60  75

(% of peak. Entry (row i, column j) is the performance on input size i of the variant tuned for size j; the best entry in each row is almost always on the diagonal, i.e., the custom transformations are specific to the input size.)

SLIDE 35

What we’ve learned ...

  • Job partitioning — tools: simple and repetitive work; human: the rest
  • Pruning heuristics → small parameter space; no local searches → embarrassingly parallel
  • Specialization:
      • Fix the input matrix sizes: ifort, CHiLL
      • Have ifort generate aligned SIMD code: -ipo, __attribute__((aligned(16)))
      • Simpler input to the tools → more information → high performance; custom code transformations
  • Success in tuning Nek5000
  • Potential for a (wide) range of machine-application pairs: no dependences or commutative operations, small data that fits in the L1 cache
  • A stride in tuning matrix multiply for small, rectangular matrices

SLIDE 36

Summary

  • The performance gap between H/W and S/W is increasing.
  • Compiler-based empirical performance tuning is a viable solution.
  • Specialization → custom optimization (~74% of peak)
  • Pruning heuristics → embarrassingly parallel search (use supercomputers!)
  • Future work: other machines (BG/P, BG/Q), tuning at a higher level, other applications (UNIC, S3D, MADNESS, ...)

SLIDE 37

Questions?

SLIDE 38

Matrix Multiply Performance for Small Matrices (memory)

[Bar chart: percentage of peak (y-axis 5-35%) for Baseline, mxf8/10, vanilla, ATLAS, ACML, GOTO, TUNE, and TUNE13 on the eight matrix sizes, with data resident in memory.]

SLIDE 39

do e=1,240
  call mxm (dxm12,8,x(1,e),10,ta1,100)
  do iz=0,9
    call mxm (ta1(1+80*iz),8,iytm12,10,ta2(1+64*iz),8)
  enddo
  call mxm (ta2,64,iztm12,10,dx(1,e),8)
  do i=1,512
    dx(i,e)=dx(i,e)*rm2(i,e)
  enddo
  call mxm (ixm12,8,x(1,e),10,ta3,100)
  do iz=0,9
    call mxm (ta3(1+80*iz),8,dytm12,10,ta2(1+64*iz),8)
  enddo
  call mxm (ta2,64,iztm12,10,ta1,8)
  do i=1,512
    dx(i,e)=dx(i,e)+ta1(i)*sm2(i,e)
  enddo
  do iz=0,9
    call mxm (ta3(1+80*iz),8,iytm12,10,ta2(1+64*iz),8)
  enddo
  call mxm (ta2,64,dztm12,10,ta3,8)
  do i=1,512
    dx(i,e)=dx(i,e)+ta3(i)*tm2(i,e)
  enddo
  do i=1,512
    dx(i,e)=dx(i,e)*w3m2(i,e)
  enddo
enddo

Array      Size (bytes)
dxm12      640
dytm12     6400
dztm12     640
ixm12      640
iytm12     640
iztm12     640
ta1        6400
ta2        5120
ta3        6400
x(1,e)     8000
dx(1,e)    4096
rm2(1,e)   4096
sm2(1,e)   4096
tm2(1,e)   4096
w3m2(1,e)  4096
Total      56000

SLIDE 40

Different Optimization Parameters for Different Input Sizes

No.  m,k,n      1   2   3   4   5   6   7   8
1    8,10,8     58  27  49  38  58  49  56  54
2    10,8,10    43  61  58  20  20  51  39  58
3    10,10,10   39  37  59  31  20  52  44  58
4    10,8,64    44  20  54  62  61  47  62  50
5    8,10,100   57  38  57  38  59  50  59  54
6    100,8,10   27  73  74  19  19  75  58  67
7    10,10,100  39  37  58  39  61  52  61  57
8    100,10,10  26  41  71  34  19  62  60  75

(% of peak. Legend from the original slide's color coding: selected parameters; cases where the loop bound is smaller than the applied unroll amount; pruned parameters.)