Is vectorization easy? Is vectorization enough?
Sébastien Ponce Florian Lemaitre
Plan

1 Introduction
    What is SIMD? How is vectorization done?
2 Matrix-Vector product example
    Impact of other optimizations on vectorization, Let's vectorize, Performance
3 Batch processing
    Array of Structure, Structure of Array
4 Hand-made Vectorization
5 Check vectorization
    Assembly, Callgrind
6 Conclusion & Guidelines
Vectorization: Easy? Enough? 11/12/2017 2 / 19
What is SIMD?

Single Instruction Multiple Data

[Diagram: scalar add X + Y vs vector add X[] + Y[]: the four sums x0+y0, x1+y1, x2+y2, x3+y3 are computed by four + units in parallel]

Available on Intel architectures since 2000.
It takes the same time to process 4, 8, ... floats as it takes to process 1.
How is vectorization done?

Algorithm (C = A + B scalar)
  input  : A, B   // n vectors
  output : C      // n vector
  for i = 0 : n do
      C[i] ← A[i] + B[i]

Algorithm (C = A + B vector)
  input  : A, B   // n vectors
  output : C      // n vector
  C[ : ] ← A[ : ] + B[ : ]

Algorithm (C = A + B SIMD)
  input  : A, B   // n vectors
  output : C      // n vector
  for i = 0 : 4 : n do
      C[i : i+4] ← A[i : i+4] + B[i : i+4]

Vector code: what you can write using matlab or numpy without matrices.

Vectorization is done in 3 steps:
1 Detect the pattern (e.g. a simple loop)
2 Convert the pattern into abstract vector code
3 Convert the vector code into fixed-width vector code
Matrix-Vector product

Algorithm (Y = A · X)
  input  : A   // n×n matrix
  input  : X   // n vector
  output : Y   // n vector
  temp   : s   // scalar accumulator
  for i = 0 : n do
      s ← 0
      for j = 0 : n do
          s ← s + A[i, j] · X[j]
      Y[i] ← s

Simple algorithm, used a lot (e.g. for changes of basis in ROOT).
Small loop unrolling
Impact of other optimizations

Algorithm (Y = A · X)
  input  : A   // n×n matrix
  input  : X   // n vector
  output : Y   // n vector
  temp   : s   // scalar accumulator
  for i = 0 : n do
      s ← 0
      for j = 0 : n do
          s ← s + A[i, j] · X[j]
      Y[i] ← s

Algorithm (Y = A · X unwinded)
  input  : A   // 3×3 matrix
  input  : X   // 3 vector
  output : Y   // 3 vector
  Y[0] ← A[0, 0]·X[0] + A[0, 1]·X[1] + A[0, 2]·X[2]
  Y[1] ← A[1, 0]·X[0] + A[1, 1]·X[1] + A[1, 2]·X[2]
  Y[2] ← A[2, 0]·X[0] + A[2, 1]·X[1] + A[2, 2]·X[2]

Complete unrolling is called unwinding. Compilers are able to unroll small loops if it is considered worth it.
The loop version is easier to understand, for a human and for a compiler too.
The unrolled version makes vectorization hard: the pattern is not recognized.
Loop Order
Impact of other optimizations

Algorithm (Y = A · X scalar ij)
  temp : s   // scalar accumulator
  for i = 0 : n do
      s ← 0
      for j = 0 : n do
          s ← s + A[i, j] · X[j]
      Y[i] ← s

Algorithm (Y = A · X scalar ji)
  temp : x   // scalar
  for i = 0 : n do
      Y[i] ← 0
  for j = 0 : n do
      x ← X[j]
      for i = 0 : n do
          Y[i] ← Y[i] + A[i, j] · x

Loop order can be changed. It changes the way elements are accessed and processed, so vectorization will not be applied the same way:
  ij order: A elements are accessed in Row-Major order
  ji order: A elements are accessed in Column-Major order
Vectorize ij: vector reduce

Algorithm (Y = A · X vector reduce)
  for i = 0 : n do
      // Inner product
      Y[i] ← A[i, : ] · X[ : ]

Algorithm (Y = A · X SIMD reduce)
  temp : s   // SIMD register: 4 elements
  for i = 0 : 1 : n do
      s ← {0, 0, 0, 0}
      for j = 0 : 4 : n do
          s[0 : 4] ← s[0 : 4] + A[i, j : j+4] · X[j : j+4]
      // Reduction (expensive)
      Y[i] ← s[0] + s[1] + s[2] + s[3]

[Diagram: A · X computed row by row, each Y[i] being the inner product A[i, : ] · X]

The vector reduce version is the usual way to express the Matrix-Vector product (ij order).
Reduction is slow:
  it requires at least 4 operations
  shuffling operations are usually not parallelizable
The s accumulator implies a dependency chain.
Vectorize ji: vector splat

Algorithm (Y = A · X vector splat)
  Y[ : ] ← 0
  for j = 0 : n do
      // X[j] is a scalar value
      Y[ : ] ← Y[ : ] + A[ : , j] · X[j]

Algorithm (Y = A · X SIMD splat)
  temp : x   // SIMD register: 4 elements
  for i = 0 : 4 : n do
      Y[i : i+4] ← {0, 0, 0, 0}
  for j = 0 : 1 : n do
      // Broadcast (cheap)
      x ← {X[j], X[j], X[j], X[j]}
      for i = 0 : 4 : n do
          Y[i : i+4] ← Y[i : i+4] + A[i : i+4, j] · x[0 : 4]

[Diagram: A · X computed column by column, as A[ : , 0] .∗ {X0, X0, X0} + A[ : , 1] .∗ {X1, X1, X1} + A[ : , 2] .∗ {X2, X2, X2}]

Vector version using ji order, better with Column-Major storage.
Broadcast is fast: usually one single shuffling operation.
Performance

  version         time (µs)
  unwinded             5161
  scalar ij            5192
  scalar ji            5186
  vectorized ij       10033
  vectorized ji        5879

All scalar versions have the same performance: the loop versions are unwinded by the compiler.
WARNING: the vectorized versions are slower. There is not enough data to process, and Out-of-Order execution performs better than SIMD with such small data.
Vectorized ji is much faster than vectorized ij: broadcast is faster than reduction.
Batch processing

Need to process several matrices at a time in order to take advantage of the machine parallelism.

Algorithm (Y = A · X scalar ji batch)
  input  : As   // list of n×n matrices
  input  : Xs   // list of n vectors
  output : Ys   // list of n vectors
  temp   : x    // scalar
  for u = 0 : N do
      for i = 0 : n do
          Ys[u][i] ← 0
      for j = 0 : n do
          x ← Xs[u][j]
          for i = 0 : n do
              Ys[u][i] ← Ys[u][i] + As[u][i, j] · x

Batch: grouping similar objects in order to process them in one go.
Allows instruction reordering to increase computation parallelism.

  µs/matrix         ×1    ×2    ×4    ×8   ×16   ×32
  unwinded        5161  2944  1740  1619  1713  1677
  scalar ij       4799  2638  1501  1328  1434  1596
  scalar ji       4957  2789  1957  1899  1929  1888
  vectorized ij      –     –     –     –     –     –
  vectorized ji      –     –     –     –  1114   993
Structure of Array (SoA)

Algorithm (Y = A · X scalar ij AoS)
  input  : As   // list of n×n matrices
  input  : Xs   // list of n vectors
  output : Ys   // list of n vectors
  temp   : x    // scalar
  for u = 0 : N do
      for i = 0 : n do
          Ys[u][i] ← 0
      for j = 0 : n do
          x ← Xs[u][j]
          for i = 0 : n do
              Ys[u][i] ← Ys[u][i] + As[u][i, j] · x

Algorithm (Y = A · X scalar ij SoA)
  input  : As   // n×n matrix of lists
  input  : Xs   // n vector of lists
  output : Ys   // n vector of lists
  temp   : x    // scalar
  for u = 0 : N do
      for i = 0 : n do
          Ys[i][u] ← 0
      for j = 0 : n do
          x ← Xs[j][u]
          for i = 0 : n do
              Ys[i][u] ← Ys[i][u] + As[i, j][u] · x

Algorithm (Y = A · X vector ij SoA)
  input  : As   // n×n matrix of lists
  input  : Xs   // n vector of lists
  output : Ys   // n vector of lists
  temp   : x    // list
  for i = 0 : n do
      Ys[i][ : ] ← 0
  for j = 0 : n do
      x ← Xs[j][ : ]
      for i = 0 : n do
          Ys[i][ : ] ← Ys[i][ : ] + As[i, j][ : ] .∗ x[ : ]

A matrix of lists rather than a list of matrices helps vectorization:
  the vector code is identical to the scalar code
  no shuffling is required
Possible to go even further (Unroll&Jam).
Hand-made vectorization

Results with hand-written SIMD:

  µs/matrix          ×1    ×2    ×4    ×8   ×16
  unwinded         4541  2330  2270  2258  2549
  best             3699  1939  2153   885   586
  VectorClass ij   9189  4994  3697  3704  3710
  VectorClass ji   8079  4772  2418  1450  1478

WARNING: hand-written SIMD is much slower!
  SIMD code is less optimized by the compiler than scalar code
  it overrides compiler heuristics
  it requires a deep knowledge of the architecture
Vectorization libraries like Vc don't help here: they only provide a better and more abstract interface.
Hand-made vectorization

Do not vectorize yourself! Unless you have checked that:
  the code is critical for performance
  the compiler has not vectorized, and you understand why
  there is no way to make it vectorized by the compiler
  you know how to do it efficiently
  your SIMD code is actually faster than the scalar code
Check the assembly

Command

Non vectorized:

  vmovsd       p+0x8 , %xmm4
  vmovsd       p     , %xmm3
  vmulsd       t+0x48, %xmm4, %xmm0
  vmulsd       t+0x28, %xmm4, %xmm1
  vmulsd       t+0x8 , %xmm4, %xmm4
  vmovsd       p+0x10, %xmm2
  vfmadd231sd  t+0x40, %xmm3, %xmm0
  vfmadd231sd  t+0x50, %xmm2, %xmm0
  vaddsd       t+0x58, %xmm0, %xmm0
  vfmadd231sd  t+0x20, %xmm3, %xmm1
  vfmadd231sd  t+0x30, %xmm2, %xmm1
  vaddsd       t+0x38, %xmm1, %xmm1
  vfmadd132sd  t     , %xmm4, %xmm3
  vfmadd132sd  t+0x10, %xmm3, %xmm2
  vaddsd       t+0x18, %xmm2, %xmm2
  vmovsd       %xmm0, p+0x10
  vmovsd       %xmm1, p+0x8
  vmovsd       %xmm2, p
  retq

Vectorized:

  vbroadcastsd p+0x10, %ymm0
  vbroadcastsd p+0x8 , %ymm1
  vbroadcastsd p     , %ymm2
  vmovapd      t+0x60, %ymm3
  vfmadd132pd  t     , %ymm3, %ymm2
  vfmadd132pd  t+0x20, %ymm2, %ymm1
  vfmadd132pd  t+0x40, %ymm1, %ymm0
  vmovapd      %ymm0, p
  vzeroupper
  retq

%xmm and %ymm designate SIMD registers.
%ymm registers are twice as big (available only with AVX).
WARNING: floating point scalar code also uses %xmm registers.
Look for sd (scalar double) vs pd (packed double); ss and ps for single precision.
Using Callgrind

Custom patch to collect floating point instructions.

Generic usage:
  valgrind --tool=callgrind <original command line>

Useful flags

Location:
  /cvmfs/lhcbdev.cern.ch/tools/valgrind/3.12.0/x86_64-centos7/bin/valgrind
Using Callgrind with Gaudi

Use --instr-atstart=no to avoid spending hours in the geometry parsing.
Start instruction profiling in the code using the CallgrindProfile algorithm
  (in MiniBrunel, just set Profile to True).
Use a command line like this:
  Brunel/run valgrind ... python \
      $(Brunel/run which gaudirun.py) options.py
Command:
  kcachegrind callgrind.out.##pid##
Conclusion & Guidelines

Vectorization is hard, and vectorizing a code is usually not enough to get maximum performance.

Goal: help the compiler to vectorize.
  Understand how vectorization works.
  Think about your code as abstract vector code.
  Design the code for batch processing.
  Check vectorization using existing tools: Callgrind, VTune, ...
Going Further? Unroll&Jam!

Matrix-Vector product

Algorithm (Y = A · X SIMD ij SoA with unroll&jam ×4)
  input  : As   // n×n matrix of lists
  input  : Xs   // n vector of lists
  output : Ys   // n vector of lists
  temp   : x0, x1, x2, x3   // SIMD registers
  for u = 0 : 16 : N do
      for i = 0 : n do
          Ys[i][u+0  : u+4 ] ← {0, 0, 0, 0}
          Ys[i][u+4  : u+8 ] ← {0, 0, 0, 0}
          Ys[i][u+8  : u+12] ← {0, 0, 0, 0}
          Ys[i][u+12 : u+16] ← {0, 0, 0, 0}
      for j = 0 : n do
          x0[0 : 4] ← Xs[j][u+0  : u+4 ]
          x1[0 : 4] ← Xs[j][u+4  : u+8 ]
          x2[0 : 4] ← Xs[j][u+8  : u+12]
          x3[0 : 4] ← Xs[j][u+12 : u+16]
          for i = 0 : n do
              Ys[i][u+0  : u+4 ] ← Ys[i][u+0  : u+4 ] + As[i, j][u+0  : u+4 ] · x0[0 : 4]
              Ys[i][u+4  : u+8 ] ← Ys[i][u+4  : u+8 ] + As[i, j][u+4  : u+8 ] · x1[0 : 4]
              Ys[i][u+8  : u+12] ← Ys[i][u+8  : u+12] + As[i, j][u+8  : u+12] · x2[0 : 4]
              Ys[i][u+12 : u+16] ← Ys[i][u+12 : u+16] + As[i, j][u+12 : u+16] · x3[0 : 4]

Unroll&Jam: interleaving independent iterations.
  Hides the instruction latencies; performance becomes limited by instruction throughput.
  Needs more objects to process.