PSLP: Padded SLP Automatic Vectorization
Vasileios Porpodas†, Alberto Magni‡ and Timothy M. Jones†
University of Cambridge† University of Edinburgh‡
EuroLLVM APR 2015
slide 1 of 17 www.cl.cam.ac.uk/∼vp331/
Why SIMD
[Figure: processor block diagram. A Scalar Reg. File feeds four Scalar Func. Units (FU); next to them, a Vector Reg. File feeds a Vector Unit.]
slide 2 of 17 www.cl.cam.ac.uk/∼vp331/
slide 3 of 17 www.cl.cam.ac.uk/∼vp331/
Scalar Code
1. Find vectorization seed instructions
   1 Consecutive Stores
   2 Reductions
2. Generate graph of isomorphic scalar groups
3. Calculate Scalar Cost, Calculate Vector Cost
4. If Vector Cost < Scalar Cost: YES, go to 5; NO, DONE
5. Vectorize groups & emit vectors
DONE
slide 4 of 17 www.cl.cam.ac.uk/∼vp331/
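The cost test in steps 3 and 4 of the SLP flow can be sketched as below. This is a minimal illustration, not LLVM's implementation: the `group_cost` record, its fields and the function name are hypothetical, and real costs come from the target's cost model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-group cost record; field names are illustrative only. */
struct group_cost {
    int scalar_cost;  /* cost of the group's instructions kept as scalars */
    int vector_cost;  /* cost of the vector code that would replace them */
};

/* Steps 3 and 4: sum both costs over all groups of the graph and compare.
 * Returns 1 when Vector Cost < Scalar Cost (continue with step 5), else 0. */
int slp_should_vectorize(const struct group_cost *groups, size_t n)
{
    int scalar = 0, vector = 0;

    assert(groups != NULL || n == 0);
    for (size_t k = 0; k < n; ++k) {
        scalar += groups[k].scalar_cost;
        vector += groups[k].vector_cost;
    }
    return vector < scalar;
}
```

A tie counts as "NO" here because the flowchart uses a strict comparison.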
1 Data Dependencies
2 Too many
3 Non-isomorphism
[Figure: Original vs Vectorized. Original: four scalar instructions ADD1, ADD2, ADD3, ADD4. Vectorized: ADD1 to ADD4 grouped together, surrounded by Insert1 to Insert4 and Extract1 to Extract4.]
slide 5 of 17 www.cl.cam.ac.uk/∼vp331/
B[i] = A[i] * 7.0 + 1.0;
B[i+1]= A[i+1] + 5.0;
[Figure: SLP graph for the two statements. Legend: X = Instruction Node or Constant, arrow = Data Flow Edge. The first statement forms the chain S, +, *, L with constants 7. and 1.; the second forms S, +, L with constant 5. SLP groups the two stores (S S), then the two additions (+ +) in step 1. In step 2 it meets * in one lane and L in the other: STOP! NON-ISOMORPHIC. Scalar Cost 7, Vector Cost 7: No Benefit.]
slide 6 of 17 www.cl.cam.ac.uk/∼vp331/
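The point at which SLP gives up on this example can be reproduced with a toy check. The sketch below assumes each lane is a simple chain of opcodes starting at the seed store. Real SLP graphs follow every operand, and the names `op` and `isomorphic_depth` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Opcodes from the slide's graph: S(tore), +, *, L(oad). */
enum op { OP_STORE, OP_ADD, OP_MUL, OP_LOAD };

/* Walk two lanes upward from their seed stores and count how many levels
 * hold the same opcode in both lanes. The walk stops at the first
 * non-isomorphic pair. */
size_t isomorphic_depth(const enum op *lane0, size_t n0,
                        const enum op *lane1, size_t n1)
{
    size_t depth = 0;

    assert(lane0 != NULL && lane1 != NULL);
    while (depth < n0 && depth < n1 && lane0[depth] == lane1[depth])
        ++depth;
    return depth;
}
```

For the lanes S, +, *, L and S, +, L the function returns 2: the stores and additions match, then * meets L and the walk stops.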
[Figure: graph legend: Instruction or Constant, Data Flow Edge, Select Instruction.]
slide 7 of 17 www.cl.cam.ac.uk/∼vp331/
1. Find vectorization seed instructions
2. Generate a graph for each seed
4. Calculate Scalar Cost, Calculate Vector Cost, Calculate Padded Vector Cost
5. If Padded Cost is best:
   YES: 6. Emit Padded Scalars
   NO: 7. If Vector Cost < Scalar Cost: YES, go to 8; NO, DONE
8. Generate SLP graph containing groups of isomorphic scalars
9. Vectorize groups & emit vectors
DONE
slide 8 of 17 www.cl.cam.ac.uk/∼vp331/
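The two decisions in steps 5 and 7 amount to a three-way cost comparison. The sketch below is one possible reading, under the assumption that "best" means strictly lower than both other costs; the enum and function names are hypothetical.

```c
/* Outcome of the PSLP cost comparison (names are illustrative). */
enum pslp_choice { KEEP_SCALAR, VECTORIZE_SLP, VECTORIZE_PADDED };

enum pslp_choice pslp_choose(int scalar_cost, int vector_cost, int padded_cost)
{
    /* Step 5: padding is chosen only if it beats both alternatives
     * (strict comparison is an assumption of this sketch). */
    if (padded_cost < scalar_cost && padded_cost < vector_cost)
        return VECTORIZE_PADDED;
    /* Step 7: otherwise fall back to the plain SLP test. */
    if (vector_cost < scalar_cost)
        return VECTORIZE_SLP;
    return KEEP_SCALAR;
}
```

With made-up costs, `pslp_choose(7, 7, 5)` selects the padded version, while `pslp_choose(7, 7, 7)` leaves the code scalar.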
[Figure: padding two Non-Isomorphic graphs. g1 is S, +, *, L with constants 7 and 1; g2 is S, +, L with constant 5. Build steps shown:
 - MCS1 and MCS2: the common part of g1 and g2 (maximum common subgraph): S, +, L and one constant in each graph.
 - diff1 and diff2: the nodes of each graph outside the common part (for g1, the * 7).
 - MinCS1 and MinCS2: each graph extended with diff1 and diff2 (minimum common supergraph); both now contain S, +, *, L, so they are Isomorphic!
 - SELECT nodes are inserted, choosing the Left or Right input so that each padded graph still computes its original value.]
slide 9 of 17 www.cl.cam.ac.uk/∼vp331/
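At source level, the effect of padding on the running example can be pictured as follows. This is a hand-written sketch, not compiler output: the SELECT node is modelled by a hypothetical helper `select2`, and the function names are invented for illustration.

```c
/* SELECT node modelled as a C conditional (illustrative helper). */
static double select2(int take_left, double left, double right)
{
    return take_left ? left : right;
}

/* The two non-isomorphic statements from the example. */
void store_pair_original(const double *A, double *B, int i)
{
    B[i]     = A[i] * 7.0 + 1.0;
    B[i + 1] = A[i + 1] + 5.0;
}

/* Padded version: both lanes now have the shape L, *, SELECT, +, S. */
void store_pair_padded(const double *A, double *B, int i)
{
    double l0 = A[i];
    double l1 = A[i + 1];
    double m0 = l0 * 7.0;
    double m1 = l1 * 7.0;            /* padding: result is discarded */
    double s0 = select2(1, m0, l0);  /* lane 0 keeps the multiply */
    double s1 = select2(0, m1, l1);  /* lane 1 bypasses it */

    B[i]     = s0 + 1.0;
    B[i + 1] = s1 + 5.0;
}
```

Both functions store the same values; the padded one simply exposes two lanes with identical instruction sequences, which is what the vectorizer needs.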
EXAMPLE: Instruction acting as Select
[Figure: the example graph (S, +, *, L with constants 7., 1. and 5.) next to its Left and Right padded graphs, each of the form S, +, *, L. Constants shown on the padded graphs: 7 and 1 (Left), 7 and 5 (Right). Smaller diagrams with operands A, B, C and the constants 1, 2 and 7 accompany the example.]
slide 10 of 17 www.cl.cam.ac.uk/∼vp331/
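One way an ordinary instruction can stand in for a select is through an identity operand: multiplying by 1.0 passes the value through unchanged, so the lane that has no multiply needs no SELECT. The sketch below illustrates that idea on the running example. It is an assumption about what the slide depicts, and the function name and constant tables are hypothetical.

```c
/* Both lanes execute "load, multiply, add, store"; only constants differ.
 * Lane 1 multiplies by 1.0, so the multiply itself acts as the select. */
void store_pair_identity(const double *A, double *B, int i)
{
    static const double mul_c[2] = {7.0, 1.0};  /* 1.0 = identity for * */
    static const double add_c[2] = {1.0, 5.0};

    for (int lane = 0; lane < 2; ++lane)
        B[i + lane] = A[i + lane] * mul_c[lane] + add_c[lane];
}
```

Because the two lanes differ only in their constant operands, the loop body maps directly onto one vector multiply and one vector add.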
1 Non-isomorphic source code (e.g. computing
Memory: a[0].real a[0].imag a[1].real a[1].imag ...
b[0].real = a[0].real
b[0].imag = -a[0].imag
b[1].real = a[1].real
b[1].imag = -a[1].imag
2 Isomorphic source code but non-isomorphic IR due
Source code:
tmp1 = quantval[0]*16384
tmp2 = quantval[1]*22725
tmp3 = quantval[2]*21407
tmp4 = quantval[3]*19266
IR:
tmp1 = quantval[0]<<14
tmp2 = quantval[1]*22725
tmp3 = quantval[2]*21407
tmp4 = quantval[3]*19266
slide 12 of 17 www.cl.cam.ac.uk/∼vp331/
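The first statement changes shape because 16384 is 2^14, so a multiply by it can be rewritten as a left shift by 14 while the other three statements stay multiplies. The short sketch below checks that equivalence; the function names are illustrative, and the shift is done on an unsigned value to avoid undefined behaviour for negative inputs.

```c
#include <stdint.h>

/* As written in the source: a multiply, isomorphic to its neighbours. */
int32_t scale_mul(int32_t q)
{
    return q * 16384;
}

/* As it appears after the rewrite: a shift, no longer isomorphic. */
int32_t scale_shl(int32_t q)
{
    return (int32_t)((uint32_t)q << 14);
}
```

The two functions agree for inputs whose product fits in 32 bits, yet the group of four statements now mixes one shift with three multiplies.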
1 All loop, SLP and PSLP vectorizers disabled (O3)
2 O3 + SLP enabled (SLP)
3 O3 + PSLP enabled (PSLP)
slide 13 of 17 www.cl.cam.ac.uk/∼vp331/
[Figure: Performance of Kernels (Execution Time). Y axis: Normalized Time, 0.00 to 1.40. Kernels: conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, GMean. Series: O3, SLP, PSLP.]
[Figure: Whole Benchmarks (Execution Time). Y axis: 0.97 to 1.01. Benchmarks: cjpeg, mpeg2dec, 433.milc, 473.astar, GMean. Series: O3, SLP, PSLP.]
slide 14 of 17 www.cl.cam.ac.uk/∼vp331/
[Figure: Vectorization Coverage Breakdown. Y axis: Times Technique Succeeds, 10 to 50, with one value labelled 163. Benchmarks: conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, cjpeg, mpeg2dec, 433.milc, 473.astar. Series: SLP-only, PSLP-extends-SLP, PSLP-only.]
slide 15 of 17 www.cl.cam.ac.uk/∼vp331/
[Figure: Percentage of Selects per region before and after Optimizations. Y axis: Percentage of Selects, 0% to 35%. Benchmarks: conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, cjpeg, mpeg2dec, 433.milc, 473.astar, GMean. Series: Original-Selects, Optimized-Selects.]
slide 16 of 17 www.cl.cam.ac.uk/∼vp331/
slide 17 of 17 www.cl.cam.ac.uk/∼vp331/