
PSLP: Padded SLP Automatic Vectorization

Vasileios Porpodas, Alberto Magni and Timothy M. Jones. University of Cambridge and University of Edinburgh. EuroLLVM, April 2015.


  1. PSLP: Padded SLP Automatic Vectorization. Vasileios Porpodas†, Alberto Magni‡ and Timothy M. Jones† (†University of Cambridge, ‡University of Edinburgh). EuroLLVM, April 2015. Slide 1 of 17, www.cl.cam.ac.uk/~vp331/

  2. Why SIMD Vectorization? (slide 2 of 17)
     • Scalable parallelism
     • High performance
     • Energy efficiency
     • Supported since the mid-90s
     • Frequent updates of vector ISAs (the slide names SSE4 and AVX2)
     • Vector generation is not done in hardware
     • Requires low-level programming or a capable compiler
     [Figure: (a) ILP: a scalar register file feeding four scalar functional units (FU). (b) Vector parallelism: a vector register file feeding one vector unit with lanes 0 to 3.]

  10. SLP Straight-Line Code Vectorizer (slide 3 of 17)
      • Superword Level Parallelism [Larsen PLDI'00]
      • State-of-the-art straight-line code vectorizer
      • Implemented in most compilers (including GCC and LLVM)
      • In theory it should be a superset of the loop vectorizer
        • Unroll the loop and vectorize with SLP
        • Even if the loop vectorizer fails, SLP could partly succeed
      • In practice it is missing features present in the loop vectorizer (interleaved loads, predication)
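
The "unroll the loop and vectorize with SLP" idea can be pictured with a small C sketch. The function names `scale_loop` and `scale_unrolled4` are hypothetical, and whether a given compiler actually vectorizes either form depends on its version, flags and cost model. The sketch only shows that unrolling turns loop iterations into straight-line code with four isomorphic statements and four consecutive stores, which is the kind of pattern an SLP vectorizer looks for.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one statement and one store per iteration. */
void scale_loop(double *b, const double *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * 7.0 + 1.0;
}

/* Unrolled by 4: the loop body is now straight-line code made of four
 * isomorphic statements that end in four consecutive stores.
 * Assumption of this sketch: n is a multiple of 4. */
void scale_unrolled4(double *b, const double *a, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        b[i]     = a[i]     * 7.0 + 1.0;
        b[i + 1] = a[i + 1] * 7.0 + 1.0;
        b[i + 2] = a[i + 2] * 7.0 + 1.0;
        b[i + 3] = a[i + 3] * 7.0 + 1.0;
    }
}
```

Both functions compute the same result; only the shape of the code handed to the vectorizer differs.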

  16. SLP Vectorization Algorithm (slide 4 of 17)
      Flowchart, starting from scalar code:
      1. Find vectorization seed instructions
      2. Generate graph of isomorphic scalar groups
      3. Calculate scalar cost and calculate vector cost
      4. If vector cost < scalar cost: YES, continue to step 5; NO, go to DONE
      5. Vectorize groups and emit vectors, then DONE
      Notes:
      • Input is scalar IR
      • Seed instructions are: (1) consecutive stores, (2) reductions
      • Graph contains vectorizable isomorphic instructions
      • Cost: weighted instruction count
      • Check vectorization profitability
      • Emit vectors only if profitable
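
Steps 3 and 4 of the flowchart (weighted instruction counts, then the "vector cost < scalar cost" check) can be sketched as a toy cost model in C. Everything here is an assumption made for illustration: the `Group` struct, its fields, the function names and the unit costs are hypothetical and do not come from the slides or from any compiler's real cost tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of one group of isomorphic scalar instructions.
 * All costs are made-up weights, not real target costs. */
typedef struct {
    int lanes;        /* number of scalar instructions in the group */
    int scalar_cost;  /* assumed cost of each scalar instruction */
    int vector_cost;  /* assumed cost of the single vector instruction */
    int extra_cost;   /* assumed cost of inserts/extracts for this group */
} Group;

/* Step 3, scalar side: every scalar instruction counts, weighted. */
int total_scalar_cost(const Group *g, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(g[i].lanes > 0);
        total += g[i].lanes * g[i].scalar_cost;
    }
    return total;
}

/* Step 3, vector side: one vector instruction per group plus overhead. */
int total_vector_cost(const Group *g, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += g[i].vector_cost + g[i].extra_cost;
    return total;
}

/* Step 4: emit vectors only if the vector form is strictly cheaper. */
bool is_profitable(const Group *g, size_t n) {
    return total_vector_cost(g, n) < total_scalar_cost(g, n);
}
```

With unit costs, three 4-wide groups that need no inserts or extracts cost 3 as vectors against 12 as scalars, so they pass the check. A lone 4-wide group that needs four inserts and four extracts costs 9 as a vector against 4 as scalars, so it is rejected.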

  23. When SLP Fails (slide 5 of 17)
      1. Data dependencies. [Figure: ADD1, ADD2, ADD3, ADD4.]
      2. Too many gather/scatter instructions; costs outweigh benefits. [Figure: the original code is ADD1 to ADD4; the vectorized code is Insert1 to Insert4, one vector instruction covering ADD1 to ADD4, then Extract1 to Extract4.]
      3. Non-isomorphism. [Figure: ADD1, ADD2, MUL, ADD4.]
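
The three failure cases can be pictured with small C functions. These are hypothetical illustrations of the patterns named on the slide, not code taken from it, and the function names are made up. Whether a particular compiler vectorizes any of them depends on its version, target and cost model.

```c
#include <assert.h>

/* 1. Data dependencies: every add consumes the previous add's result,
 *    so the adds cannot be lanes of one vector instruction. */
double dependent_adds(const double a[5]) {
    double t1 = a[0] + a[1];
    double t2 = t1 + a[2];
    double t3 = t2 + a[3];
    return t3 + a[4];
}

/* 2. Gather overhead: the adds are independent, but their operands come
 *    from arbitrary positions, so each operand would have to be inserted
 *    into the vector individually. */
void scattered_adds(double out[4], const double *a, const int idx[4]) {
    for (int i = 0; i < 4; i++)
        assert(idx[i] >= 0);
    out[0] = a[idx[0]] + 1.0;
    out[1] = a[idx[1]] + 1.0;
    out[2] = a[idx[2]] + 1.0;
    out[3] = a[idx[3]] + 1.0;
}

/* 3. Non-isomorphism: the third statement multiplies while the others add. */
void mixed_ops(double b[4], const double a[4]) {
    b[0] = a[0] + 1.0;
    b[1] = a[1] + 1.0;
    b[2] = a[2] * 2.0;
    b[3] = a[3] + 1.0;
}
```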

  26. SLP Fails due to non-isomorphism (slide 6 of 17)
      a. Input C code:
           ...
           B[i]   = A[i] * 7.0 + 1.0;
           B[i+1] = A[i+1] + 5.0;
           ...
      b. DFG: one lane is load (L), multiply by 7.0, add 1.0, store (S); the other lane is load (L), add 5.0, store (S).
      c. SLP internal graph and d. SLP vectorized groups: group 0 pairs the two stores and group 1 pairs the two additions. Group 2 would pair a multiply with a load: NON-ISOMORPHIC, STOP!
      [Panel "Scalar Cost": the scalar graph with nodes L, L, * 7, +, +, S, S.]
      Legend: X = instruction node or constant; arrow = data flow edge.
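
The bottom-up walk that stops at the non-isomorphic pair can be sketched in C. This is a toy model written for this example, not the SLP implementation of any compiler: the `Node` type, `count_groups` and `walk_example` are hypothetical names, load address operands are omitted, and constant pairs are treated as leaves rather than counted as instruction groups. The variant with a multiply in the second statement (`A[i+1] * 3.0 + 5.0`) is made up for comparison.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_LOAD, OP_STORE, OP_ADD, OP_MUL, OP_CONST } Opcode;

/* Hypothetical minimal DFG node: an opcode plus up to two operands. */
typedef struct Node {
    Opcode op;
    const struct Node *lhs;
    const struct Node *rhs;
} Node;

/* Walk two lanes together, starting from a pair of nodes and following
 * operands. Returns the number of instruction pairs whose opcodes match.
 * Each pair with different opcodes stops the walk on that path and is
 * counted in *mismatches. */
int count_groups(const Node *a, const Node *b, int *mismatches) {
    assert(mismatches != NULL);
    if (a == NULL || b == NULL) return 0;
    if (a->op != b->op) { (*mismatches)++; return 0; }
    if (a->op == OP_CONST) return 0;  /* constant pair: a leaf */
    return 1 + count_groups(a->lhs, b->lhs, mismatches)
             + count_groups(a->rhs, b->rhs, mismatches);
}

/* Build B[i] = A[i] * 7.0 + 1.0 and B[i+1] = A[i+1] + 5.0, then walk
 * from the two stores. If second_has_mul is nonzero, the second
 * statement becomes A[i+1] * 3.0 + 5.0 (made-up isomorphic variant). */
int walk_example(int second_has_mul, int *mismatches) {
    Node c7 = {OP_CONST, NULL, NULL}, c1 = c7, c5 = c7, c3 = c7;
    Node ld0 = {OP_LOAD, NULL, NULL}, ld1 = ld0;
    Node mul0 = {OP_MUL, &ld0, &c7};
    Node mul1 = {OP_MUL, &ld1, &c3};
    Node add0 = {OP_ADD, &mul0, &c1};
    Node add1 = {OP_ADD, second_has_mul ? &mul1 : &ld1, &c5};
    Node st0 = {OP_STORE, &add0, NULL};
    Node st1 = {OP_STORE, &add1, NULL};
    return count_groups(&st0, &st1, mismatches);
}
```

For the slide's code the walk finds two matching pairs (the stores and the additions) and one mismatch (multiply against load). For the made-up variant it finds four matching pairs (stores, additions, multiplies, loads) and no mismatch.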
