
TSLP Throttling Automatic Vectorization: When Less is More

Vasileios Porpodas and Timothy M. Jones
University of Cambridge
LLVM Developers Meeting 2015
www.cl.cam.ac.uk/~vp331/

Why SIMD Vectorization? (slide 1 of 16) [Figure: Scalar Reg. File]


  1-9. SLP not profitable for whole graph (slide 6 of 16)
       Example code:
         A[i]   = B[i]   + (C[2*i] * (D[2*i] + (E[2*i] * C[2*i])))
         A[i+1] = B[i+1] + (C[3*i] * (D[3*i] + (E[3*i] * C[3*i])))
       [Figure: SLP graph of the example (loads L, multiplies *, adds +, stores S), annotated one group at a time]
       Running total as each group is annotated:
       • Store group (S S): −1, total −1
       • Add group (+ +): −1, total −2
       • Load group (L L): −1, total −3
       • Multiply group (* *): −1, total −4
       • Add group (+ +): −1, total −5
       • Multiply group (* *): −1, total −6
       • Load pair (L L) annotated +1 and +1: total −4
       • Load pair (L L) annotated +1 and +1: total −2
       • Load pair (L L) annotated +1 and +1: total 0
       Total cost: 0. Unprofitable!

  10-13. TSLP removes unprofitable region (slide 7 of 16)
         [Figure: the SLP graph from slide 6 with a line marked "TSLP CUT" drawn through it]
         • SLP on the whole graph: total cost 0. Unprofitable!
         • TSLP: below the cut, the add (+ +), load (L L) and store (S S) groups remain vectorized at −1 each; the two values crossing the cut are annotated +1 each.
         • Total cost after the cut: −1. Profitable!
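
The totals on slides 6 and 7 are plain sums of per-group costs. The following is a minimal Python sketch of that arithmetic. The group labels are this sketch's reading of the diagram and do not come from the slides; only the numeric costs do.

```python
# Per-group cost deltas read off slides 6 and 7 (negative = vectorizing saves).
# The group labels are assumptions made for this sketch.
full_graph = {
    "store A": -1,
    "add (outer)": -1,
    "load B": -1,
    "mul (outer)": -1,
    "add (inner)": -1,
    "mul (inner)": -1,
    "load C": +2,  # shown as +1 and +1 on the slide
    "load D": +2,
    "load E": +2,
}

# After the TSLP cut only three groups stay vectorized, and the two scalar
# values that cross the cut cost +1 each.
throttled_graph = {
    "store A": -1,
    "add (outer)": -1,
    "load B": -1,
    "values crossing the cut": +2,
}


def total_cost(groups):
    """Sum the per-group cost deltas of a (sub)graph."""
    return sum(groups.values())


print(total_cost(full_graph))       # 0: not profitable
print(total_cost(throttled_graph))  # -1: profitable
```

Dropping the three expensive load pairs and the groups that depend on them is what turns a break-even graph into a net saving.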

  14-24. TSLP Algorithm (slide 8 of 16)
         [Figure: flowchart, built up one step at a time; input: scalar IR]
         1. Find seed instructions for vectorization
         2. Generate the SLP graph
         3. Calculate all valid cuts
         4. Throttle (cut) the SLP graph
         5. Calculate cost of vectorization
         6. Save cut with best cost
         7. Tried all cuts? NO: repeat for the next cut. YES: continue.
         8. cost < threshold? YES: continue. NO: done without vectorizing.
         9. Replace scalars with vectors
         DONE
         Side notes on the flowchart:
         • Extension to SLP
         • Try out many cuts
         • Keep best cut
         • Vanilla SLP
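
Steps 4 to 9 of the flowchart amount to a best-of-all-candidates loop followed by a threshold test. The sketch below illustrates that control flow in Python. The function name `choose_cut`, its parameters and the default threshold of 0 are assumptions for illustration, not part of the authors' LLVM implementation. The example costs are the totals from the cost calculation example on slide 9.

```python
from typing import Callable, Hashable, Iterable, Optional, Tuple


def choose_cut(
    cuts: Iterable[Hashable],
    cost_of: Callable[[Hashable], int],
    threshold: int = 0,
) -> Optional[Tuple[Hashable, int]]:
    """Return (cut, cost) for the cheapest cut, or None if it is not profitable."""
    best = None
    for cut in cuts:                      # step 7: loop until all cuts are tried
        cost = cost_of(cut)               # steps 4-5: throttle, then cost it
        if best is None or cost < best[1]:
            best = (cut, cost)            # step 6: save cut with best cost
    if best is not None and best[1] < threshold:  # step 8
        return best                       # step 9 would vectorize this cut
    return None                           # leave the code scalar


# Totals taken from the cost calculation example (slide 9).
totals = {"cut0": 0, "cut1": 1, "cut2": 3, "cut3": 2,
          "cut4": -1, "cut5": 0, "cut6": 1, "no cut": 0}

print(choose_cut(totals, totals.get))      # ('cut4', -1)
print(choose_cut(["no cut"], totals.get))  # None: vanilla SLP gives up here
```

Restricting the candidates to the single "no cut" entry reproduces the vanilla SLP decision, which rejects this graph because its cost is not below the threshold.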

  25-35. Cost calculation example (slide 9 of 16)
         [Figure: the example SLP graph with candidate cut lines; the part above a cut is labelled SCALAR, the part below it VECTOR]
         TotalCost = V + S + G − Scalar
         • cut0: 0 + 18 + 0 − 18 = 0
         • cut1: 1 + 16 + 2 − 18 = +1
         • cut2: 5 + 8 + 8 − 18 = +3
         • cut3: 2 + 14 + 4 − 18 = +2
         • cut4: 3 + 12 + 2 − 18 = −1 (TSLP)
         • cut5: 4 + 10 + 4 − 18 = 0
         • cut6: 5 + 8 + 6 − 18 = +1
         • no cut (SLP): 6 + 6 + 6 − 18 = 0
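
The table on slide 9 can be reproduced mechanically. The sketch below recomputes every row from its (V, S, G) triple and picks the cheapest cut. It reads V as the vector cost, S as the cost of what stays scalar and G as the gather cost; the slide text does not spell these names out, so that reading is an assumption.

```python
SCALAR_COST = 18  # cost of the fully scalar code, the "- 18" term on the slide

# (V, S, G) triples copied from slide 9.
cuts = {
    "cut0": (0, 18, 0),
    "cut1": (1, 16, 2),
    "cut2": (5, 8, 8),
    "cut3": (2, 14, 4),
    "cut4": (3, 12, 2),
    "cut5": (4, 10, 4),
    "cut6": (5, 8, 6),
    "no cut (SLP)": (6, 6, 6),
}


def total_cost(v, s, g, scalar=SCALAR_COST):
    """TotalCost = V + S + G - Scalar, as written on the slide."""
    return v + s + g - scalar


totals = {name: total_cost(*vsg) for name, vsg in cuts.items()}
best = min(totals, key=totals.get)

for name, cost in totals.items():
    print(f"{name:>13}: {cost:+d}")
print("best:", best, totals[best])
```

Only cut4 yields a negative total, which is why the slide marks it as the TSLP choice while unthrottled SLP breaks even at 0.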

  36-50. Subgraph (Cuts) Generation Algorithm (slide 10 of 16)
         [Figure: animation enumerating subgraphs of a small SLP graph (loads L, multiplies *, adds +, store S), starting from the store and growing node by node]
         • Only connected subgraphs that include the root
         • Worst time complexity O(2 B xN) (N = nodes, B = neighbors)
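
One straightforward way to enumerate connected subgraphs that include the root is to grow each known subgraph by one neighboring node at a time and discard repeats. The Python sketch below does this on a small hypothetical graph. The node names, the `operands` dictionary and the worklist strategy are assumptions for illustration; the slides do not show the authors' actual enumeration code.

```python
def connected_subgraphs(operands, root):
    """Enumerate every connected subgraph that contains `root`.

    `operands` maps a node to the nodes it uses as inputs. Each subgraph is
    grown by adding exactly one not-yet-included operand of a member node.
    """
    start = frozenset([root])
    seen = {start}
    worklist = [start]
    while worklist:
        sub = worklist.pop()
        frontier = {op for node in sub
                    for op in operands.get(node, ())
                    if op not in sub}
        for op in frontier:
            grown = sub | {op}
            if grown not in seen:   # the same subgraph is reachable many ways
                seen.add(grown)
                worklist.append(grown)
    return seen


# Hypothetical example: store <- add <- (load_b, mul <- (load_c, load_d))
example = {
    "store": ["add"],
    "add": ["load_b", "mul"],
    "mul": ["load_c", "load_d"],
}

subgraphs = connected_subgraphs(example, "store")
print(len(subgraphs))  # 11
```

Even this six-node graph has 11 candidate subgraphs, and the count grows quickly with graph size, which is the cost that motivates the faster variant on the next slide.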

  51-58. Fast Subgraph (Cuts) Generation Algorithm (slide 11 of 16)
         [Figure: flowchart; a subgraph with neighbors X and Y is tested against "subgraphs > T?", with one NO branch and one YES branch, each producing further subgraphs]
         • After T subgraphs, attach all neighbors
         • Complexity reduced to linear O(T + N)
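
The bullet "after T subgraphs, attach all neighbors" can be read as follows: branch one neighbor at a time until more than T subgraphs exist, then grow each remaining subgraph by its whole frontier in a single step. The Python sketch below implements that reading. It is an interpretation of the slide, not the authors' code, and the names `capped_subgraphs` and `limit` (standing in for T) are assumptions.

```python
def capped_subgraphs(operands, root, limit):
    """Enumerate root-connected subgraphs, branching only until `limit` is exceeded.

    While at most `limit` subgraphs are known, grow by one operand at a time.
    Afterwards, attach all frontier operands at once so no new branches appear.
    """
    start = frozenset([root])
    seen = {start}
    worklist = [start]
    while worklist:
        sub = worklist.pop()
        frontier = {op for node in sub
                    for op in operands.get(node, ())
                    if op not in sub}
        if not frontier:
            continue
        if len(seen) > limit:
            candidates = [sub | frontier]                 # attach all neighbors
        else:
            candidates = [sub | {op} for op in frontier]  # branch per neighbor
        for grown in candidates:
            if grown not in seen:
                seen.add(grown)
                worklist.append(grown)
    return seen


# Hypothetical example graph: store <- add <- (load_b, mul <- (load_c, load_d))
example = {
    "store": ["add"],
    "add": ["load_b", "mul"],
    "mul": ["load_c", "load_d"],
}
full = frozenset(["store", "add", "load_b", "mul", "load_c", "load_d"])

print(len(capped_subgraphs(example, "store", 0)))    # 4: no branching at all
print(len(capped_subgraphs(example, "store", 100)))  # 11: limit never reached
```

With a small limit the search skips some candidate cuts but still reaches the full graph, so the trade-off is fewer candidates in exchange for a bounded amount of work.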

  59-65. Experimental Setup (slide 12 of 16)
         • Implemented TSLP in the trunk version of the LLVM 3.6 compiler.
         • Target: Intel Core i5-4570 @ 3.2 GHz
         • Compiler flags: -O3 -allow-partial-unroll -march=core-avx2 -mtune-core-i7
         • Kernels, SPEC 2006 and NPB2.3-C
         • We evaluated the following cases:
           1. All loop, SLP and TSLP vectorizers disabled (O3)
           2. O3 + SLP enabled (SLP)
           3. O3 + TSLP enabled (TSLP)
