SLIDE 1

PSLP: Padded SLP Automatic Vectorization

Vasileios Porpodas†, Alberto Magni‡ and Timothy M. Jones†

University of Cambridge† University of Edinburgh‡

EuroLLVM APR 2015

slide 1 of 17 www.cl.cam.ac.uk/~vp331/

SLIDE 9

Why SIMD Vectorization?

  • Scalable parallelism
  • High performance
  • Energy efficiency
  • Supported since the mid-90s
  • Frequent updates of vector ISAs
  • Vector generation is not done in hardware
  • Needs low-level programming or a capable compiler

[Figure: (a) ILP: scalar functional units (FU) fed by a scalar register file. (b) Vector parallelism: a vector unit fed by a vector register file. ISA labels: SSE4, AVX2.]

SLIDE 15

SLP Straight-Line Code Vectorizer

  • Superword Level Parallelism [Larsen PLDI’00]
  • State-of-the-art straight-line code vectorizer
  • Implemented in most compilers (including GCC and LLVM)
  • In theory it should be a superset of the loop vectorizer
  • Unroll the loop and vectorize with SLP
  • Even if the loop vectorizer fails, SLP could partly succeed
  • In practice it is missing features present in the loop vectorizer (interleaved loads, predication)
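The "unroll loop and vectorize with SLP" point can be pictured with a small C sketch. The function names and the unroll factor of 4 are illustrative assumptions, not taken from the talk. After unrolling, the loop body is straight-line code made of four isomorphic statements that end in consecutive stores, which is the shape an SLP vectorizer looks for.

```c
#include <stddef.h>

/* Original loop: one scalar add per iteration. */
void add_one_loop(float *b, const float *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] + 1.0f;
}

/* Same computation unrolled by 4.
   Assumption for brevity: n is a multiple of 4 (no remainder loop). */
void add_one_unrolled(float *b, const float *a, size_t n) {
    for (size_t i = 0; i + 3 < n; i += 4) {
        b[i]     = a[i]     + 1.0f;   /* four isomorphic statements,  */
        b[i + 1] = a[i + 1] + 1.0f;   /* four consecutive loads and   */
        b[i + 2] = a[i + 2] + 1.0f;   /* four consecutive stores      */
        b[i + 3] = a[i + 3] + 1.0f;
    }
}
```

Whether a given compiler actually vectorizes the unrolled body depends on its cost model and on aliasing information for `a` and `b`.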

SLIDE 22

SLP Vectorization Algorithm

  • Input is scalar IR
  • Seed instructions are:
      1 Consecutive stores
      2 Reductions
  • Graph contains vectorizable isomorphic instructions
  • Cost: weighted instruction count
  • Check vectorization profitability
  • Emit vectors only if profitable

Flowchart (input: scalar code):

1. Find vectorization seed instructions
2. Generate graph of isomorphic scalar groups
3. Calculate scalar cost and vector cost
4. If vector cost < scalar cost:
5. YES: vectorize groups and emit vectors, then DONE
   NO: DONE
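The cost and profitability steps can be sketched in C as follows. This is a minimal illustration of a "weighted instruction count" comparison, not LLVM's actual cost model: the `slp_group` structure, its fields and the weights are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* One group of isomorphic scalars in the SLP graph (hypothetical layout). */
struct slp_group {
    int lanes;        /* number of scalar instructions in the group       */
    int scalar_cost;  /* assumed weight of one scalar instruction         */
    int vector_cost;  /* assumed weight of the single vector instruction  */
    int gather_cost;  /* assumed extra insert/extract cost for the group  */
};

/* Scalar cost: every lane pays for its own scalar instruction. */
int slp_scalar_cost(const struct slp_group *g, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += g[i].lanes * g[i].scalar_cost;
    return total;
}

/* Vector cost: one vector instruction per group plus any gather overhead. */
int slp_vector_cost(const struct slp_group *g, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += g[i].vector_cost + g[i].gather_cost;
    return total;
}

/* Profitability check: vectorize only if the vector form is cheaper. */
bool slp_profitable(const struct slp_group *g, size_t n) {
    return slp_vector_cost(g, n) < slp_scalar_cost(g, n);
}
```

With two 2-lane groups of unit weight and no gather overhead, the scalar cost is 4 and the vector cost is 2, so the check passes. A single 2-lane group that needs 2 units of inserts and extracts costs 3 as a vector against 2 as scalars, so it is rejected.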

SLIDE 25

When SLP Fails

1 Data dependencies
2 Too many gather/scatter instructions. Costs outweigh benefits.
3 Non-isomorphism

[Figure: (1) four adds, ADD1 to ADD4, linked by data dependencies. (2) Original ADD1 to ADD4 next to the vectorized ADD1 to ADD4, which are surrounded by Insert1 to Insert4 and Extract1 to Extract4. (3) A candidate group ADD1, ADD2, MUL, ADD4 whose opcodes do not all match.]
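The first two failure cases can be made concrete with a short C sketch. Both functions are hypothetical examples written for illustration and do not come from the talk.

```c
/* 1. Data dependencies: every add consumes the previous result,
      so the four adds cannot become four lanes of one vector add. */
float dependent_adds(float x, const float a[4]) {
    float t1 = x  + a[0];
    float t2 = t1 + a[1];
    float t3 = t2 + a[2];
    float t4 = t3 + a[3];
    return t4;
}

/* 2. Gather/scatter overhead: the adds are independent and isomorphic,
      but inputs and outputs live in unrelated scalars. A vector add would
      need four inserts to build its operand and four extracts to spread
      the result, which can cost more than the adds that were saved. */
void scattered_adds(float *r0, float *r1, float *r2, float *r3,
                    float x0, float x1, float x2, float x3, float k) {
    *r0 = x0 + k;
    *r1 = x1 + k;
    *r2 = x2 + k;
    *r3 = x3 + k;
}
```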

SLIDE 32

SLP Fails due to non-isomorphism

  • a. Input C code

B[i]   = A[i] * 7.0 + 1.0;
B[i+1] = A[i+1] + 5.0;

  • b. DFG
  • c. SLP internal graph
  • d. SLP vectorized groups

[Figure: legend: X = instruction node or constant, arrows = data-flow edges. The DFG has two stores (S), two loads (L), two adds with constants 1. and 5., and one multiply by 7. SLP forms group 1 from the two stores and group 2 from the two adds, then reaches a multiply paired with a load and stops: "STOP! NON-ISOMORPHIC". Comparing scalar cost with vector cost for the groups found so far gives "No Benefit".]

SLIDE 39

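PSLP fixes Non-Isomorphism

  • a. PSLP graphs
  • b. PSLP padded graphs
  • c. PSLP groups

[Figure: legend: X = instruction or constant, arrows = data-flow edges, Select instruction with Left and Right inputs. (a) The two original graphs: store, add (constant 1.), multiply (constant 7.), load; and store, add (constant 5.), load. (b) The padded graphs: the second graph gains a redundant multiply by 7. and both gain a Select, so the two graphs have the same shape. (c) The resulting vectorizable groups, numbered 1 to 5.]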
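The effect of padding on the running example can be written out at the C level. This is only a source-level sketch of a transformation that PSLP performs on compiler IR. The helper `sel` and the function names are hypothetical, and the constant select conditions stand in for the Left/Right choice shown in the figure.

```c
/* Original, non-isomorphic pair of statements from the talk. */
void original(double *B, const double *A, int i) {
    B[i]     = A[i] * 7.0 + 1.0;
    B[i + 1] = A[i + 1] + 5.0;
}

/* Stand-in for a select instruction: picks the left or the right input. */
static double sel(int pick_left, double left, double right) {
    return pick_left ? left : right;
}

/* Padded version: both lanes now run load, multiply, select, add, store.
   The multiply in lane 1 is redundant; its select discards the product. */
void padded(double *B, const double *A, int i) {
    double l0 = A[i],           l1 = A[i + 1];        /* loads      */
    double m0 = l0 * 7.0,       m1 = l1 * 7.0;        /* multiplies */
    double s0 = sel(1, m0, l0), s1 = sel(0, m1, l1);  /* selects    */
    double a0 = s0 + 1.0,       a1 = s1 + 5.0;        /* adds       */
    B[i]     = a0;                                    /* stores     */
    B[i + 1] = a1;
}
```

Every row of `padded` applies the same operation to both lanes, so each row could map to one two-lane vector instruction, while the stored results match those of `original`.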

SLIDE 49

PSLP Algorithm

  • Extension to SLP
  • Generate multiple graphs (unlike SLP)
  • Minimal padding
  • Cost estimation
  • Emit redundant code to create isomorphism
  • Code vectorized by original SLP

Flowchart:

1. Find vectorization seed instructions
2. Generate a graph for each seed
3. Perform minimal padding of graphs
4. Calculate scalar cost, vector cost and padded vector cost
5. If padded cost is best: YES, go to 6; NO, go to 7
6. Emit padded scalars
7. If vector cost < scalar cost: YES, continue; NO, DONE
8. Generate SLP graph containing groups of isomorphic scalars
9. Vectorize groups and emit vectors

Steps 8 and 9 are bracketed as "Vanilla SLP".
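The two decision points in the flowchart can be sketched as a single C function. The enum, the function name and the tie-breaking rule (padding must be strictly cheaper than both alternatives) are assumptions made for illustration, since the slide only says "if padded cost is best".

```c
enum pslp_choice { KEEP_SCALAR, VANILLA_SLP, PADDED_SLP };

/* Decide what to do with one seed, given the three estimated costs. */
enum pslp_choice pslp_decide(int scalar_cost, int vector_cost,
                             int padded_cost) {
    /* "If padded cost is best": emit padded scalars, then run SLP on them.
       Assumption: "best" means strictly lower than both other costs. */
    if (padded_cost < scalar_cost && padded_cost < vector_cost)
        return PADDED_SLP;

    /* Otherwise fall back to the plain SLP check. */
    if (vector_cost < scalar_cost)
        return VANILLA_SLP;

    /* Neither vector form pays off: leave the code scalar. */
    return KEEP_SCALAR;
}
```

For example, costs of (scalar 10, vector 12, padded 7) select padding, (10, 6, 8) select plain SLP, and (10, 12, 11) leave the code scalar.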

SLIDE 57

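Minimal Padding Algorithm

[Figure, built up in steps. Start: two non-isomorphic graphs, g1 (store, add with constant 1, multiply by 7, load) and g2 (store, add with constant 5, load). Step 1: find the common subgraphs MCS1 in g1 and MCS2 in g2 (store, add, load). Step 2: compute the differences diff1 and diff2, the parts of g1 and g2 outside the common subgraph; here diff1 is the multiply by 7. Step 3: build the common supergraphs MinCS1 and MinCS2 by adding diff1 and diff2 to both graphs; the results are marked "Isomorphic!". Step 4: insert SELECT instructions that choose between the Left and Right inputs.]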
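The padding idea can be tried out on a heavily simplified case. The sketch below assumes each graph is just a linear chain of opcodes written as a string (`S` store, `+` add, `*` multiply, `L` load), and it uses a longest-common-subsequence table as a stand-in for the common-subgraph search. Real PSLP works on graphs, so this is an analogy for the MCS, diff and MinCS steps and not the algorithm from the talk. All names are hypothetical.

```c
#include <string.h>

#define MAX_OPS 16  /* assumed upper bound on the chain length */

/* Pads two opcode chains into one common super-chain.
   out must hold strlen(g1) + strlen(g2) + 1 chars; pad1 and pad2 must hold
   strlen(g1) + strlen(g2) ints. pad1[k] (pad2[k]) is 1 when node k of the
   result is absent from g1 (g2), i.e. is redundant padding in that graph.
   Returns the length of the padded chain. */
int pad_chains(const char *g1, const char *g2,
               char *out, int *pad1, int *pad2) {
    int n = (int)strlen(g1), m = (int)strlen(g2);
    int lcs[MAX_OPS + 1][MAX_OPS + 1];

    /* lcs[i][j]: size of the largest common sub-chain of g1[i..], g2[j..]. */
    for (int i = n; i >= 0; i--) {
        for (int j = m; j >= 0; j--) {
            if (i == n || j == m)
                lcs[i][j] = 0;
            else if (g1[i] == g2[j])
                lcs[i][j] = 1 + lcs[i + 1][j + 1];
            else
                lcs[i][j] = lcs[i + 1][j] > lcs[i][j + 1]
                          ? lcs[i + 1][j] : lcs[i][j + 1];
        }
    }

    /* Merge: common nodes are emitted once, differences become padding. */
    int i = 0, j = 0, k = 0;
    while (i < n || j < m) {
        if (i < n && j < m && g1[i] == g2[j]) {
            out[k] = g1[i]; pad1[k] = 0; pad2[k] = 0;   /* common node */
            i++; j++;
        } else if (j == m || (i < n && lcs[i + 1][j] >= lcs[i][j + 1])) {
            out[k] = g1[i]; pad1[k] = 0; pad2[k] = 1;   /* diff1: pad g2 */
            i++;
        } else {
            out[k] = g2[j]; pad1[k] = 1; pad2[k] = 0;   /* diff2: pad g1 */
            j++;
        }
        k++;
    }
    out[k] = '\0';
    return k;
}
```

Calling it on the chains `"S+*L"` and `"S+L"` from the running example yields the padded chain `"S+*L"`, with the multiply flagged as padding for the second graph only.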

SLIDE 67

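We can do better: Remove redundant Selects

  • a. Instruction acting as Select
  • b. Select constants
  • c. Select same node

[Figure: EXAMPLE: instruction acting as Select, shown on the padded graphs of the running example (Left lane with constants 7 and 1, Right lane with constants 7 and 5), where changing a constant operand lets the instruction itself do the selecting. Small panels for (a), (b) and (c) show the three patterns on nodes A, B, C and constants such as 7 and 2.]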
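One reading of pattern (a), assumed here because only figure residue survives, is that a select between a value and the result of an operation on that value can be replaced by the operation with its neutral element in the lane that should pass the value through. The C sketch below shows this for the multiply in the running example. The function names and parameters are hypothetical.

```c
/* One padded lane with an explicit select (before the optimization).
   use_mul is 1 in the lane that really multiplies, 0 in the padded lane. */
double lane_with_select(double a, int use_mul, double c_add) {
    double m = a * 7.0;            /* multiply, redundant when use_mul == 0 */
    double s = use_mul ? m : a;    /* select between product and input     */
    return s + c_add;
}

/* The same lane without a select: the multiply constant does the choosing.
   Pass c_mul = 7.0 to multiply, or the neutral element 1.0 to pass a on. */
double lane_without_select(double a, double c_mul, double c_add) {
    return a * c_mul + c_add;
}
```

In vector form the per-lane constants become one constant vector such as <7.0, 1.0>, so the select disappears and the multiply is no longer wasted work in the padded lane.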

SLIDE 72

Opportunities for PSLP in real-life applications

1 Non-isomorphic source code (e.g. computing conjugates in 433.milc)

b[0].real = a[0].real
b[0].imag = - a[0].imag
b[1].real = a[1].real
b[1].imag = - a[1].imag

[Figure: memory layout a[0].real, a[0].imag, a[1].real, a[1].imag, ...]

2 Isomorphic source code but non-isomorphic IR due to high-level optimizations (jdct of cjpeg)

Source code:

tmp1 = quantval[0]*16384
tmp2 = quantval[1]*22725
tmp3 = quantval[2]*21407
tmp4 = quantval[3]*19266

After opt:

tmp1 = quantval[0]<<14
tmp2 = quantval[1]*22725
tmp3 = quantval[2]*21407
tmp4 = quantval[3]*19266
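The conjugate case can be written as a small C kernel. The struct and function names are hypothetical and are not the actual 433.milc source. The second function shows one possible isomorphic form, in which both lanes perform a multiply. It is an illustration of what "making the lanes look alike" means and not necessarily the code PSLP emits.

```c
struct cplx { float real, imag; };

/* Conjugate n complex numbers. The real lane is a plain copy and the
   imaginary lane is a negation, so adjacent statements are non-isomorphic. */
void conjugate(struct cplx *b, const struct cplx *a, int n) {
    for (int i = 0; i < n; i++) {
        b[i].real =  a[i].real;
        b[i].imag = -a[i].imag;
    }
}

/* Hypothetical isomorphic form: both lanes multiply, by +1 and by -1. */
void conjugate_isomorphic(struct cplx *b, const struct cplx *a, int n) {
    for (int i = 0; i < n; i++) {
        b[i].real = a[i].real *  1.0f;
        b[i].imag = a[i].imag * -1.0f;
    }
}
```

Because the real and imaginary parts sit next to each other in memory, the isomorphic form maps to one contiguous vector load, one vector multiply by a constant vector, and one contiguous vector store.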

SLIDE 79

Experimental Setup

  • Implemented PSLP in the trunk version of the LLVM 3.6 compiler.
  • Target: Intel Core i5-4570 @ 3.2GHz
  • Compiler flags: -O3 -allow-partial-unroll -march=core-avx2 -mtune=core-i7 -ffast-math
  • Kernels, SPEC 2006 and Mediabench II
  • We evaluated the following cases:
      1 All loop, SLP and PSLP vectorizers disabled (O3)
      2 O3 + SLP enabled (SLP)
      3 O3 + PSLP enabled (PSLP)

SLIDE 80

PSLP increases performance

[Figure: Performance of Kernels (Execution Time). Normalized time (axis 0.00 to 1.40) under O3, SLP and PSLP for conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall and GMean.]

[Figure: Whole Benchmarks (Execution Time). Axis 0.97 to 1.01, under O3, SLP and PSLP for cjpeg, mpeg2dec, 433.milc, 473.astar and GMean.]

SLIDE 83

PSLP enables or extends vectorization

[Figure: Vectorization Coverage Breakdown. Times each technique succeeds (axis 10 to 50, plus one value labeled 163) for conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, cjpeg, mpeg2dec, 433.milc and 473.astar, split into SLP-only, PSLP-extends-SLP and PSLP-only.]

  • SLP only: SLP is adequate.
  • PSLP extends SLP: SLP stops at non-isomorphic code. PSLP extends it.
  • PSLP only: SLP fails completely. PSLP succeeds.

SLIDE 84

Optimizing away redundant Selects

  • Select-removal optimizations remove about 21% of the Selects

[Figure: Percentage of Selects per region before and after Optimizations. Percentage of Selects (axis 0% to 35%), Original-Selects versus Optimized-Selects, for conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, cjpeg, mpeg2dec, 433.milc, 473.astar and GMean.]

SLIDE 90

Conclusion

  • PSLP improves vectorization coverage compared to the state-of-the-art
  • Converts non-isomorphic code into isomorphic by:
      • Relying on the Min Common Supergraph for minimal injection of redundant code
      • Emitting Select instructions to guarantee correctness
      • Optimizing away redundant Selects
  • PSLP performs better compared to SLP on commodity SIMD-capable hardware