Is vectorization easy? Is vectorization enough? Sébastien Ponce - PowerPoint PPT Presentation

  1. Is vectorization easy? Is vectorization enough? Sébastien Ponce Florian Lemaitre

  2. Plan
     1. Introduction
        - What is SIMD?
        - How is vectorization done?
     2. Matrix-Vector product example
        - Impact of other optimizations on vectorization
        - Let’s vectorize
        - Performance
     3. Batch processing
        - Array of Structure
        - Structure of Array
     4. Hand-made Vectorization
     5. Check vectorization
        - Assembly
        - Callgrind
     6. Conclusion & Guidelines

     S. Ponce – F. Lemaitre, Vectorization: Easy? Enough?, 11/12/2017, 2 / 19

  3. What is SIMD?
     - Single Instruction Multiple Data
     - Available on Intel architectures since 2000
     - Processing 4, 8, ... floats takes the same time as processing 1 with regular (scalar) arithmetic

     [Figure: element-wise addition of X = [x0, x1, x2, x3] and Y = [y0, y1, y2, y3], giving X + Y = [x0+y0, x1+y1, x2+y2, x3+y3].]
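
The figure on this slide can be sketched in C++ as follows. This is only an illustration, not code from the talk: the function name `add4` is hypothetical, and the SSE path assumes an x86 compiler that defines `__SSE__`; on any other target the sketch falls back to a plain scalar loop with the same result.

```cpp
#include <array>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#endif

// Hypothetical helper: adds two groups of 4 floats lane by lane,
// i.e. out[k] = x[k] + y[k] for k = 0..3.
std::array<float, 4> add4(const std::array<float, 4>& x,
                          const std::array<float, 4>& y) {
    std::array<float, 4> out{};
#if defined(__SSE__)
    // One SIMD add handles all 4 lanes at once.
    const __m128 vx = _mm_loadu_ps(x.data());
    const __m128 vy = _mm_loadu_ps(y.data());
    _mm_storeu_ps(out.data(), _mm_add_ps(vx, vy));
#else
    // Scalar fallback: 4 separate additions.
    for (std::size_t k = 0; k < 4; ++k) {
        out[k] = x[k] + y[k];
    }
#endif
    return out;
}
```

Calling `add4` on `{1, 2, 3, 4}` and `{10, 20, 30, 40}` yields `{11, 22, 33, 44}` on either path; only the number of add instructions differs.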

  4. How is vectorization done?

     Algorithm (C = A + B, scalar)
         input:  A, B   // n vector
         output: C      // n vector
         for i = 0 : n do
             C[i] ← A[i] + B[i]

     Algorithm (C = A + B, vector)
         input:  A, B   // n vector
         output: C      // n vector
         C[:] ← A[:] + B[:]

     Algorithm (C = A + B, SIMD)
         input:  A, B   // n vector
         output: C      // n vector
         for i = 0 : 4 : n do
             C[i : i+4] ← A[i : i+4] + B[i : i+4]

     Vector code: what you can write using matlab or numpy without matrices.

     Vectorization is done in 3 steps:
       1. Detect the pattern (e.g. a simple loop)
       2. Convert the pattern into abstract vector code
       3. Convert the vector code into fixed-width vector code
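
As a rough C++ rendering of the scalar and fixed-width forms above: the names `add_scalar`, `add_blocked` and `kWidth` are hypothetical, and the width of 4 is taken from the slide's pseudocode. The trailing scalar loop for lengths that are not a multiple of 4 is an addition of this sketch; the slide does not show it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar form: one element per iteration. This is the simple loop
// pattern a compiler has to detect (step 1).
void add_scalar(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c) {
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
}

constexpr std::size_t kWidth = 4;  // assumed vector width, as on the slide

// Fixed-width form (step 3): whole blocks of kWidth elements, then a
// scalar tail for the leftover elements.
void add_blocked(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& c) {
    assert(a.size() == b.size() && a.size() == c.size());
    const std::size_t n = a.size();
    const std::size_t n_full = n - n % kWidth;  // largest multiple of kWidth <= n
    std::size_t i = 0;
    for (; i < n_full; i += kWidth) {
        // Each block maps onto one vector load/add/store in real SIMD code.
        for (std::size_t k = 0; k < kWidth; ++k) {
            c[i + k] = a[i + k] + b[i + k];
        }
    }
    for (; i < n; ++i) {  // tail: fewer than kWidth elements remain
        c[i] = a[i] + b[i];
    }
}
```

Both functions produce identical results. The blocked version only makes explicit the shape that the compiler generates when it vectorizes the scalar loop.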

  9. Matrix-Vector product

     Algorithm (Y = A · X)
         input:  A   // n × n matrix
         input:  X   // n vector
         output: Y   // n vector
         temp:   s   // scalar accumulator
         for i = 0 : n do
             s ← 0
             for j = 0 : n do
                 s ← s + A[i, j] · X[j]
             Y[i] ← s

     - Simple algorithm
     - Used a lot
       - change of basis in ROOT
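
A minimal C++ sketch of this algorithm, under assumptions the slide does not state: the matrix is stored as a flat row-major `std::vector<float>` of size n × n, and the function name `matvec` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Y = A * X, with A an n x n matrix stored row-major in a flat vector
// (element A[i, j] lives at index i * n + j). Storage layout is an assumption.
std::vector<float> matvec(const std::vector<float>& A,
                          const std::vector<float>& X) {
    const std::size_t n = X.size();
    assert(A.size() == n * n);
    std::vector<float> Y(n);
    for (std::size_t i = 0; i < n; ++i) {
        float s = 0.0f;  // scalar accumulator, as in the pseudocode
        for (std::size_t j = 0; j < n; ++j) {
            s += A[i * n + j] * X[j];
        }
        Y[i] = s;
    }
    return Y;
}
```

For the 2 × 2 matrix `{1, 2, 3, 4}` and the vector `{1, 1}`, this returns `{3, 7}`.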

  10. Impact of other optimizations: small loop unrolling

      Algorithm (Y = A · X)
          input:  A   // n × n matrix
          input:  X   // n vector
          output: Y   // n vector
          temp:   s   // scalar accumulator
          for i = 0 : n do
              s ← 0
              for j = 0 : n do
                  s ← s + A[i, j] · X[j]
              Y[i] ← s

      Algorithm (Y = A · X, unwound)
          input:  A   // 3 × 3 matrix
          input:  X   // 3 vector
          output: Y   // 3 vector
          Y[0] ← A[0, 0] · X[0] + A[0, 1] · X[1] + A[0, 2] · X[2]
          Y[1] ← A[1, 0] · X[0] + A[1, 1] · X[1] + A[1, 2] · X[2]
          Y[2] ← A[2, 0] · X[0] + A[2, 1] · X[1] + A[2, 2] · X[2]

      - Complete unrolling is called unwinding.
      - Compilers are able to unroll small loops if it is considered worth it.
      - The loop version is easier to understand:
        - for a human
        - for a compiler too
      - The unrolled version makes vectorization hard:
        - the pattern is not recognized
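
The two forms on this slide might look like this in C++ for the 3 × 3 case. The type aliases `Mat3` and `Vec3` and both function names are hypothetical. Whether a particular compiler vectorizes either version depends on the compiler and its flags, which this sketch does not attempt to show.

```cpp
#include <array>
#include <cstddef>

using Vec3 = std::array<float, 3>;
using Mat3 = std::array<Vec3, 3>;  // A[i][j]: row i, column j

// Loop version: the pattern (a nested loop with an accumulator) is explicit.
Vec3 matvec3_loop(const Mat3& A, const Vec3& X) {
    Vec3 Y{};
    for (std::size_t i = 0; i < 3; ++i) {
        float s = 0.0f;
        for (std::size_t j = 0; j < 3; ++j) {
            s += A[i][j] * X[j];
        }
        Y[i] = s;
    }
    return Y;
}

// Unwound version: same arithmetic, but the loop structure is gone, so
// the compiler no longer sees a loop pattern to vectorize.
Vec3 matvec3_unwound(const Mat3& A, const Vec3& X) {
    Vec3 Y{};
    Y[0] = A[0][0] * X[0] + A[0][1] * X[1] + A[0][2] * X[2];
    Y[1] = A[1][0] * X[0] + A[1][1] * X[1] + A[1][2] * X[2];
    Y[2] = A[2][0] * X[0] + A[2][1] * X[1] + A[2][2] * X[2];
    return Y;
}
```

Both functions compute the same product; they differ only in how much structure is left for the compiler to recognize.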

  12. Impact of other optimizations: loop order

      Algorithm (Y = A · X, scalar ij)
          temp: s   // scalar accumulator
          for i = 0 : n do
              s ← 0
              for j = 0 : n do
                  s ← s + A[i, j] · X[j]
              Y[i] ← s

      Algorithm (Y = A · X, scalar ji)
          temp: x   // scalar
          for i = 0 : n do
              Y[i] ← 0
          for j = 0 : n do
              x ← X[j]
              for i = 0 : n do
                  Y[i] ← Y[i] + A[i, j] · x

      - Loop order can be changed
      - Changes the way elements are accessed and processed
      - Vectorization will not be applied the same way
      - ij order: A elements are accessed in Row-Major order
      - ji order: A elements are accessed in Column-Major order
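
The two loop orders can be sketched in C++ as below. As an assumption of this example, A is stored as a flat row-major array (element A[i, j] at index `i * n + j`); the function names `matvec_ij` and `matvec_ji` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ij order: the inner loop walks one row of A (contiguous in row-major
// storage) and reduces it into a single scalar accumulator.
std::vector<float> matvec_ij(const std::vector<float>& A,
                             const std::vector<float>& X) {
    const std::size_t n = X.size();
    assert(A.size() == n * n);
    std::vector<float> Y(n);
    for (std::size_t i = 0; i < n; ++i) {
        float s = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            s += A[i * n + j] * X[j];
        }
        Y[i] = s;
    }
    return Y;
}

// ji order: the inner loop walks one column of A (stride n in row-major
// storage) and updates every Y[i] independently.
std::vector<float> matvec_ji(const std::vector<float>& A,
                             const std::vector<float>& X) {
    const std::size_t n = X.size();
    assert(A.size() == n * n);
    std::vector<float> Y(n, 0.0f);
    for (std::size_t j = 0; j < n; ++j) {
        const float x = X[j];
        for (std::size_t i = 0; i < n; ++i) {
            Y[i] += A[i * n + j] * x;
        }
    }
    return Y;
}
```

With this layout, the inner loop of the ij version is a reduction over contiguous memory, while the inner loop of the ji version performs independent updates over strided memory. That difference is why the two orders are not vectorized the same way.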
