TSLP Throttling Automatic Vectorization: When Less is More

Vasileios Porpodas and Timothy M. Jones

University of Cambridge

LLVM Developer’s Meeting 2015

slide 1 of 16 www.cl.cam.ac.uk/~vp331/

Why SIMD Vectorization?

  • Scalable parallelism
  • High performance
  • Energy efficiency
  • Supported since the mid-90s
  • Frequent updates of vector ISAs
  • Vector generation not done in hardware
  • Low-level programming or a capable compiler

[Figure: (a) ILP: a scalar register file feeding four scalar functional units (FU). (b) Vector parallelism: a vector register file feeding a vector unit. ISA labels: SSE4, AVX2.]

SLP Straight-Line Code Vectorizer

  • Superword Level Parallelism [Larsen PLDI’00]
  • State-of-the-art straight-line code vectorizer
  • Implemented in most compilers (including GCC and LLVM)
  • In theory it should be a superset of the loop vectorizer
  • Unroll the loop and vectorize with SLP
  • Even if the loop vectorizer fails, SLP could partly succeed
  • In practice it is missing features present in the loop vectorizer (interleaved loads, predication)

SLP Vectorization Algorithm

  • Input is scalar IR
  • Seed instructions are:
      1. Consecutive stores
      2. Reductions
  • Graph contains vectorizable isomorphic instructions
  • Cost: weighted instruction count
  • Check vectorization profitability
  • Emit vectors only if profitable

[Flowchart: Scalar code -> 1. Find vectorization seed instructions -> 2. Generate graph of isomorphic scalar groups -> 3. Calculate scalar cost and vector cost -> 4. If vector cost < scalar cost: YES -> 5. Vectorize groups & emit vectors -> DONE; NO -> DONE]
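
The profitability decision in steps 3 to 5 can be sketched in Python as follows. This is only an illustration, not LLVM's implementation: the `Group` type, its field names and the unit costs are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """One node of the SLP graph: a bundle of isomorphic scalar instructions."""
    name: str
    scalar_cost: int  # cost of keeping every lane as a scalar instruction
    vector_cost: int  # cost of the vector form, including any overhead


def slp_is_profitable(graph):
    """Vectorize only if the summed vector cost beats the summed scalar cost."""
    scalar = sum(g.scalar_cost for g in graph)
    vector = sum(g.vector_cost for g in graph)
    return vector < scalar


# Hypothetical two-lane graph: a store, an add and a load group.
graph = [Group("store", 2, 1), Group("add", 2, 1), Group("load", 2, 1)]
decision = slp_is_profitable(graph)  # compares 3 against 6
```

A group whose operands have to be gathered lane by lane would carry a higher `vector_cost`, which is how a graph ends up on the NO branch.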

When SLP is not profitable

  • Costs outweigh the benefits, e.g. too many gather/scatter instructions

[Figure: Original: four scalar instructions ADD1, ADD2, ADD3, ADD4. Vectorized: ADD1 to ADD4 packed into one vector instruction, surrounded by Insert1 to Insert4 and Extract1 to Extract4.]

SLP not profitable for whole graph

A[i]   = B[i]   + (C[2*i]*(D[2*i]+(E[2*i]*C[2*i])))
A[i+1] = B[i+1] + (C[3*i]*(D[3*i]+(E[3*i]*C[3*i])))

[Figure: SLP graph built from the store seed. Six vectorizable groups (S, +, L, *, +, *) at cost −1 each; six loads (L) that are not consecutive across the two statements, each fed through an insert (i) at cost +1. Total Cost: 0. Unprofitable!]
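
Why some loads in this example cannot be grouped can be checked with a few lines of Python. The helper name `lanes_are_consecutive` and the sample value of `i` are assumptions for illustration; the index expressions are the ones in the two statements above.

```python
def lanes_are_consecutive(index_lane0, index_lane1):
    """A two-lane group maps to one vector load/store only if lane 1
    touches the element right after lane 0."""
    return index_lane1 - index_lane0 == 1


i = 4  # any i other than 1 gives the same answer for the 2*i / 3*i pairs

# A[i], A[i+1] and B[i], B[i+1]: adjacent elements, one vector access each.
a_or_b = lanes_are_consecutive(i, i + 1)

# C, D and E are indexed by 2*i in the first statement and 3*i in the
# second, so the two lanes are i elements apart and need per-lane inserts.
c_d_e = lanes_are_consecutive(2 * i, 3 * i)
```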

TSLP removes unprofitable region

[Figure, SLP: the whole graph is vectorized. Groups S, +, L, *, +, * cost −1 each; the six inserts (i) for the non-consecutive loads cost +1 each. Total Cost: 0. Unprofitable!]

[Figure, TSLP: the graph is cut (TSLP CUT) so that only the S, +, L groups are vectorized at −1 each. The * + * chains and their loads stay scalar, and their two results are inserted (i) at +1 each. Total Cost: −1. Profitable!]
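
The two totals can be reproduced by summing the per-node costs annotated in the figures. The node names below are labels chosen for this sketch; the −1 and +1 values are the ones printed on the slide, and a profitability threshold of 0 is assumed.

```python
# Cost saved (-1) or added (+1) per node, as annotated on the slide.
SLP_WHOLE_GRAPH = {
    "store A": -1, "add (outer)": -1, "load B": -1,
    "mul (outer)": -1, "add (inner)": -1, "mul (inner)": -1,
    # C, D and E are not consecutive: one insert per lane, two lanes each.
    "insert C lane0": +1, "insert C lane1": +1,
    "insert D lane0": +1, "insert D lane1": +1,
    "insert E lane0": +1, "insert E lane1": +1,
}

# TSLP keeps only the part of the graph above the cut; the two scalar
# products below the cut are inserted into a vector.
TSLP_CUT = {
    "store A": -1, "add (outer)": -1, "load B": -1,
    "insert mul lane0": +1, "insert mul lane1": +1,
}


def total_cost(costs):
    """Sum the per-node cost annotations."""
    return sum(costs.values())


def is_profitable(costs, threshold=0):
    """A total below the threshold means the vector code is cheaper."""
    return total_cost(costs) < threshold
```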

TSLP Algorithm

  • Extension to SLP
  • Try out many cuts
  • Keep best cut
  • Vanilla SLP

[Flowchart: Scalar IR -> 1. Find seed instructions for vectorization -> 2. Generate the SLP graph -> 3. Calculate all valid cuts -> 4. Throttle (cut) the SLP graph -> 5. Calculate cost of vectorization -> 6. Save cut with best cost -> 7. Tried all cuts? NO: try the next cut; YES -> 8. cost < threshold? YES -> 9. Replace scalars with vectors -> DONE; NO -> DONE]
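
The loop formed by steps 4 to 8 can be sketched as below. The function name, the `cost_of` callback and the example costs are hypothetical; the sketch only mirrors the flowchart and is not the LLVM implementation.

```python
def tslp_select_cut(cuts, cost_of, threshold=0):
    """Evaluate every candidate cut, keep the cheapest one, and accept it
    only if its cost is below the threshold."""
    best_cut, best_cost = None, None
    for cut in cuts:                          # 7. repeat until all cuts are tried
        cost = cost_of(cut)                   # 4.-5. throttle the graph, cost it
        if best_cost is None or cost < best_cost:
            best_cut, best_cost = cut, cost   # 6. save cut with best cost
    if best_cost is not None and best_cost < threshold:   # 8.
        return best_cut                       # 9. caller emits vector code
    return None                               # keep the code scalar


# Hypothetical costs for three candidate cuts.
example_costs = {"cut_a": 2, "cut_b": -1, "cut_c": 0}
chosen = tslp_select_cut(example_costs, example_costs.get)
```

Returning `None` corresponds to the NO branch of step 8, where nothing is vectorized.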

Cost calculation example

TotalCost = V + S + G − Scalar

  Cut             V     S     G   Scalar   TotalCost
  cut0            0    18     0     18        0    (all scalar)
  cut1            1    16     2     18       +1
  cut2            5     8     8     18       +3
  cut3            2    14     4     18       +2
  cut4            3    12     2     18       −1    (TSLP)
  cut5            4    10     4     18        0
  cut6            5     8     6     18       +1
  no cut (SLP)    6     6     6     18        0

[Figure: the example SLP graph (S, +, L, *, +, * groups and their loads) with cut0 to cut6 drawn across it, separating the VECTOR part from the SCALAR part.]
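
The V + S + G − Scalar formula can be tried out on the running example with a small model. Everything in it is an assumption inferred from the numbers on the slide rather than the authors' cost model: each vector group costs 1, each remaining scalar instruction costs 1, and each operand that stays scalar costs one insert per lane. The group names are labels chosen for this sketch.

```python
SCALAR = 18  # 9 scalar instructions per statement, 2 statements

# Operands of each two-lane group in the example graph.
# loadC, loadD and loadE are the non-consecutive loads and are never vectorized.
OPERANDS = {
    "store": ["add1"],
    "add1": ["loadB", "mul1"],
    "mul1": ["loadC", "add2"],
    "add2": ["loadD", "mul2"],
    "mul2": ["loadE", "loadC"],
}


def total_cost(vectorized):
    """Return (V, S, G, TotalCost) for a set of vectorized groups."""
    v = len(vectorized)                  # one vector instruction per group
    s = SCALAR - 2 * v                   # each group replaces two scalars
    external = {op for grp in vectorized for op in OPERANDS.get(grp, [])
                if op not in vectorized}
    gather = 2 * len(external)           # one insert per lane per scalar operand
    return v, s, gather, v + s + gather - SCALAR


tslp_row = total_cost({"store", "add1", "loadB"})
slp_row = total_cost({"store", "add1", "loadB", "mul1", "add2", "mul2"})
```

Under these assumptions, vectorizing only the store, the outer add and the load of B gives V=3, S=12, G=2 and a total of −1, while vectorizing all six groups gives 6, 6, 6 and a total of 0.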

Subgraph (Cuts) Generation Algorithm

  • Only connected subgraphs that include the root
  • Worst-case time complexity O(2BxN) (N=Nodes, B=Neighbors)

[Figure: the subgraphs of the example SLP graph are enumerated one at a time, starting from the root store S alone and growing up to the full graph.]
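
A minimal sketch of such an enumeration is shown below. It is a plain worklist search written for illustration, not the algorithm from the talk; the graph encoding (a dict from node to children) and the node names are assumptions.

```python
def connected_subgraphs(children, root):
    """Enumerate every connected set of nodes that contains the root."""
    seen = set()
    result = []
    worklist = [frozenset([root])]
    while worklist:
        sub = worklist.pop()
        if sub in seen:
            continue
        seen.add(sub)
        result.append(sub)
        # Nodes adjacent to the current subgraph but not yet part of it.
        frontier = {c for n in sub for c in children.get(n, ()) if c not in sub}
        for node in sorted(frontier):
            worklist.append(sub | {node})   # grow by one neighbor at a time
    return result


# The six vectorizable groups of the running example.
GRAPH = {"S": ["add1"], "add1": ["loadB", "mul1"], "mul1": ["add2"], "add2": ["mul2"]}
cuts = connected_subgraphs(GRAPH, "S")
```

Growing by one neighbor at a time is what makes the number of subgraphs blow up on wide graphs, since every subset of the frontier leads to a different subgraph.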

Fast Subgraph (Cuts) Generation Algorithm

  • After T subgraphs, attach all neighbors
  • Complexity reduced to linear O(T + N)

[Flowchart: for each subgraph, test "subgraphs > T ?". NO: generate one new subgraph per neighbor. YES: generate a single new subgraph with all neighbors attached.]
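
One possible reading of this throttled enumeration is sketched below. The recursion structure, the `limit` parameter (standing in for T) and the graph encoding are assumptions for illustration; the sketch is not claimed to match the authors' implementation or its exact complexity bound.

```python
def fast_subgraphs(children, root, limit):
    """Enumerate subgraphs one neighbor at a time until more than `limit`
    have been produced, then grow by attaching all neighbors at once."""
    seen, result = set(), []

    def frontier(sub):
        return sorted({c for n in sub for c in children.get(n, ()) if c not in sub})

    def grow(sub):
        if sub in seen:
            return
        seen.add(sub)
        result.append(sub)
        neighbors = frontier(sub)
        if not neighbors:
            return
        if len(result) > limit:
            grow(sub | frozenset(neighbors))   # YES branch: attach all neighbors
        else:
            for node in neighbors:             # NO branch: one subgraph per neighbor
                grow(sub | {node})

    grow(frozenset([root]))
    return result


GRAPH = {"S": ["add1"], "add1": ["loadB", "mul1"], "mul1": ["add2"], "add2": ["mul2"]}
throttled = fast_subgraphs(GRAPH, "S", limit=0)
exhaustive = fast_subgraphs(GRAPH, "S", limit=100)
```

With `limit=0` the subgraph simply grows level by level until it covers the whole graph; with a large `limit` the function falls back to the exhaustive enumeration.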

Experimental Setup

  • Implemented TSLP in the trunk version of the LLVM 3.6 compiler
  • Target: Intel Core i5-4570 @ 3.2GHz
  • Compiler flags: -O3 -allow-partial-unroll -march=core-avx2 -mtune-core-i7
  • Kernels, SPEC 2006 and NPB2.3-C
  • We evaluated the following cases:
      1. All loop, SLP and TSLP vectorizers disabled (O3)
      2. O3 + SLP enabled (SLP)
      3. O3 + TSLP enabled (TSLP)

TSLP increases performance

[Figure: bar chart "Performance". Y axis: Normalized Time (0.60 to 1.10). Bars: O3, SLP, TSLP. Benchmarks: motivation, compute-rhs, mult-su3-mat-vec-sum-4dir, ewald-LRcorrection, compute-triangle-bbox, lbm-handleInOutFlow, shift-LRcorrection, GMean.]

TSLP static cost savings

  • Avg. static Scalar, SLP and TSLP cost normalized to Scalar

[Figure: bar chart. Y axis: Normalized cost (0.60 to 1.10). Bars: Scalar, SLP, TSLP. Benchmarks: motivation, compute-rhs, mult-su3-mat-vec-sum-4dir, ewald-LRcorrection, compute-triangle-bbox, lbm-handleInOutFlow, shift-LRcorrection, GMean.]

TSLP TotalCost exploration

  • SLP non-profitable, TSLP profitable
  • SLP profitable, but TSLP more profitable
  • TSLP exploration gradually improves cost

[Figure: three TotalCost plots with the SLP and TSLP results marked. Panels: mult-su3-mat-vec-sum, lbm-handleInOutFlow-3, lbm-handleInOutFlow-5.]

Conclusion

  • TSLP improves vectorization coverage compared to the state-of-the-art
  • Removes non-profitable code regions by:
      • Evaluating a number of possible cuts
      • Estimating their cost
      • Applying the cut with the minimal cost
  • TSLP performs better compared to SLP on commodity SIMD-capable hardware
  • PACT’15 paper: http://www.cl.cam.ac.uk/~vp331/