Is vectorization easy? Is vectorization enough?

Sébastien Ponce Florian Lemaitre

Plan

1. Introduction: What is SIMD? How is vectorization done?
2. Matrix-Vector product: example; impact of other optimizations on vectorization; let's vectorize; performance
3. Batch processing: Array of Structure; Structure of Array
4. Hand-made Vectorization
5. Check vectorization: Assembly; Callgrind
6. Conclusion & Guidelines

S. Ponce – F. Lemaitre, Vectorization: Easy? Enough?, 11/12/2017

What is SIMD? Single Instruction Multiple Data

[figure: a SIMD addition adds the four lanes of registers {x0, x1, x2, x3} and {y0, y1, y2, y3} in one instruction, producing {x0+y0, x1+y1, x2+y2, x3+y3}]

Available on Intel architectures since 2000. Processing 4, 8, ... floats takes the same time as processing 1 in regular arithmetic.

How is vectorization done?

Algorithm (C = A + B, scalar)
    input  : A, B    // n-vectors
    output : C       // n-vector
    for i = 0 : n do
        C[i] ← A[i] + B[i]

Algorithm (C = A + B, vector)
    input  : A, B    // n-vectors
    output : C       // n-vector
    C[:] ← A[:] + B[:]

Algorithm (C = A + B, SIMD)
    input  : A, B    // n-vectors
    output : C       // n-vector
    for i = 0 : 4 : n do
        C[i : i+4] ← A[i : i+4] + B[i : i+4]

Vector code: what you can write using matlab or numpy, without matrices.

Vectorization is done in 3 steps:
1. Detect the pattern (e.g. a simple loop)
2. Convert the pattern into abstract vector code
3. Convert the vector code into fixed-width vector code
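To make these steps concrete, here is a sketch in C++ (function names are illustrative, not from the slides): the scalar loop the compiler detects, and the same computation rewritten by hand the way the vectorizer conceptually rewrites it, as a stride-4 loop plus a scalar tail.

```cpp
#include <cstddef>

// Scalar loop: the pattern the compiler detects. With -O2/-O3 (and e.g.
// -mavx2) gcc and clang typically turn this into SIMD code on their own.
void add_scalar(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// The same computation written the way the vectorizer rewrites it:
// a stride-4 loop over packs of 4 floats, plus a remainder loop.
void add_simd_style(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)               // C[i:i+4] <- A[i:i+4] + B[i:i+4]
        for (std::size_t k = 0; k < 4; ++k)
            c[i + k] = a[i + k] + b[i + k];
    for (; i < n; ++i)                        // remainder (n not multiple of 4)
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result; only the shape of the loop differs.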



Matrix-Vector product

Algorithm (Y = A · X)
    input  : A    // n×n matrix
    input  : X    // n vector
    output : Y    // n vector
    temp   : s    // scalar accumulator
    for i = 0 : n do
        s ← 0
        for j = 0 : n do
            s ← s + A[i, j] · X[j]
        Y[i] ← s

A simple algorithm, used a lot (e.g. for changes of basis in ROOT).
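The algorithm above transcribes directly into C++ (a minimal sketch; the row-major layout and the function name are assumptions, not from the slides):

```cpp
#include <cstddef>

// Scalar ij Matrix-Vector product: Y = A * X.
// A is a row-major n x n matrix; X and Y are n-vectors.
void matvec_ij(const double* A, const double* X, double* Y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        double s = 0.0;                  // scalar accumulator
        for (std::size_t j = 0; j < n; ++j)
            s += A[i * n + j] * X[j];
        Y[i] = s;
    }
}
```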

Impact of other optimizations: small loop unrolling

Algorithm (Y = A · X)
    input  : A    // n×n matrix
    input  : X    // n vector
    output : Y    // n vector
    temp   : s    // scalar accumulator
    for i = 0 : n do
        s ← 0
        for j = 0 : n do
            s ← s + A[i, j] · X[j]
        Y[i] ← s

Algorithm (Y = A · X, unwinded)
    input  : A    // 3×3 matrix
    input  : X    // 3 vector
    output : Y    // 3 vector
    Y[0] ← A[0, 0]·X[0] + A[0, 1]·X[1] + A[0, 2]·X[2]
    Y[1] ← A[1, 0]·X[0] + A[1, 1]·X[1] + A[1, 2]·X[2]
    Y[2] ← A[2, 0]·X[0] + A[2, 1]·X[1] + A[2, 2]·X[2]

Complete unrolling is called unwinding. Compilers are able to unroll small loops if it is considered worth it. The loop version is easier to understand, for a human and for a compiler too. The unrolled version makes vectorization hard: the pattern is not recognized.


Impact of other optimizations: loop order

Algorithm (Y = A · X, scalar ij)
    temp : s    // scalar accumulator
    for i = 0 : n do
        s ← 0
        for j = 0 : n do
            s ← s + A[i, j] · X[j]
        Y[i] ← s

Algorithm (Y = A · X, scalar ji)
    temp : x    // scalar
    for i = 0 : n do
        Y[i] ← 0
    for j = 0 : n do
        x ← X[j]
        for i = 0 : n do
            Y[i] ← Y[i] + A[i, j] · x

The loop order can be changed. It changes the way elements are accessed and processed, so vectorization will not be applied the same way. In ij order, the elements of A are accessed in row-major order; in ji order, in column-major order.
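The ji order also transcribes directly (a sketch, with an assumed row-major layout so the stride change is visible; the function name is illustrative). It computes exactly the same Y as the ij version:

```cpp
#include <cstddef>

// Scalar ji variant: j is the outer loop, so A is walked column by column.
// A is still stored row-major here, which means consecutive accesses to
// A[i * n + j] stride by n; with column-major storage they would be
// contiguous, which is the point made on the next slides.
void matvec_ji(const double* A, const double* X, double* Y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        Y[i] = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
        const double x = X[j];
        for (std::size_t i = 0; i < n; ++i)
            Y[i] += A[i * n + j] * x;
    }
}
```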



Vectorize ij: vector reduce

Algorithm (Y = A · X, vector reduce)
    for i = 0 : n do
        // Inner product
        Y[i] ← A[i, :] · X[:]

Algorithm (Y = A · X, SIMD reduce)
    temp : s    // SIMD register: 4 elements
    for i = 0 : 1 : n do
        s ← {0, 0, 0, 0}
        for j = 0 : 4 : n do
            s[0 : 4] ← s[0 : 4] + A[i, j : j+4] · X[j : j+4]
        // Reduction (expensive)
        Y[i] ← s[0] + s[1] + s[2] + s[3]

[figure: A · X computed row by row, each Y[i] being the inner product of row i of A with X]

The vector-reduce version is the usual way to express the Matrix-Vector product (ij order). The reduction is slow: it requires at least 4 operations, and shuffling operations are usually not parallelizable. The s accumulator implies a dependency chain.
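The reduce pattern can be sketched in plain C++ without intrinsics (function name illustrative; n is assumed to be a multiple of 4 to keep the sketch short): 4 independent partial sums play the role of the lanes of s, followed by the explicit horizontal reduction.

```cpp
#include <cstddef>

// Inner product written the way the SIMD-reduce version computes it:
// 4 partial sums (the lanes of the register s), then an explicit
// horizontal reduction at the end -- the step that is expensive on
// real hardware. Assumes n is a multiple of 4.
double dot_reduce4(const double* a, const double* x, std::size_t n) {
    double s[4] = {0.0, 0.0, 0.0, 0.0};
    for (std::size_t j = 0; j < n; j += 4)
        for (std::size_t k = 0; k < 4; ++k)
            s[k] += a[j + k] * x[j + k];
    // Reduction (expensive)
    return s[0] + s[1] + s[2] + s[3];
}
```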



Vectorize ji: vector splat

Algorithm (Y = A · X, vector splat)
    Y[:] ← 0
    for j = 0 : n do
        // X[j] is a scalar value
        Y[:] ← Y[:] + A[:, j] · X[j]

Algorithm (Y = A · X, SIMD splat)
    temp : x    // SIMD register: 4 elements
    for i = 0 : 4 : n do
        Y[i : i+4] ← {0, 0, 0, 0}
    for j = 0 : 1 : n do
        // Broadcast (cheap)
        x ← {X[j], X[j], X[j], X[j]}
        for i = 0 : 4 : n do
            Y[i : i+4] ← Y[i : i+4] + A[i : i+4, j] · x[0 : 4]

[figure: A · X computed column by column, as the sum over j of column j of A multiplied element-wise with a broadcast of X[j]]

Vector version using ji order: better with column-major storage. The broadcast is fast, usually one single shuffling operation.
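A sketch of the splat version on a column-major matrix (layout and name are assumptions for illustration): the inner i loop now reads a contiguous column and writes the contiguous Y, which is the access pattern this version vectorizes well.

```cpp
#include <cstddef>

// Splat-style Matrix-Vector product on a column-major matrix:
// A_colmajor[j * n + i] holds A[i, j], so the inner loop walks a
// contiguous column of A and the contiguous Y -- no shuffles needed
// beyond broadcasting the scalar X[j].
void matvec_splat(const double* A_colmajor, const double* X,
                  double* Y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        Y[i] = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
        const double x = X[j];               // conceptually broadcast
        for (std::size_t i = 0; i < n; ++i)  // contiguous, vectorizable
            Y[i] += A_colmajor[j * n + i] * x;
    }
}
```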


Performance

version        time (µs)
unwinded       5161
scalar ij      5192
scalar ji      5186
vectorized ij  10033
vectorized ji  5879

All scalar versions have the same performance: the loop versions are unwinded by the compiler. WARNING: the vectorized versions are slower. There is not enough data to process, and out-of-order execution performs better than SIMD on such small data. Vectorized ji is much faster than vectorized ij: broadcast is faster than reduction.



Batch processing

We need to process several matrices at a time in order to take advantage of the machine parallelism.

Algorithm (Y = A · X, scalar ji, batch)
    input  : As    // list of n×n matrices
    input  : Xs    // list of n vectors
    output : Ys    // list of n vectors
    temp   : x     // scalar
    for u = 0 : N do
        for i = 0 : n do
            Ys[u][i] ← 0
        for j = 0 : n do
            x ← Xs[u][j]
            for i = 0 : n do
                Ys[u][i] ← Ys[u][i] + As[u][i, j] · x

Batch: grouping similar objects in order to process them in one go. This allows instruction reordering, which increases computation parallelism.

µs/matrix      ×1    ×2    ×4    ×8    ×16   ×32
unwinded       5161  2944  1740  1619  1713  1677
scalar ij      4799  2638  1501  1328  1434  1596
scalar ji      4957  2789  1957  1899  1929  1888
vectorized ij  –     –     –     –     –     –
vectorized ji  –     –     –     –     1114  993
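The batch algorithm can be sketched as follows (a minimal transcription; the flat back-to-back memory layout and the function name are assumptions):

```cpp
#include <cstddef>

// Batched scalar-ji product: N independent n x n systems in one call.
// As holds N row-major matrices back to back, Xs and Ys hold N
// n-vectors back to back; the u iterations are fully independent.
void matvec_batch(const double* As, const double* Xs, double* Ys,
                  std::size_t N, std::size_t n) {
    for (std::size_t u = 0; u < N; ++u) {
        const double* A = As + u * n * n;
        const double* X = Xs + u * n;
        double*       Y = Ys + u * n;
        for (std::size_t i = 0; i < n; ++i)
            Y[i] = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            const double x = X[j];
            for (std::size_t i = 0; i < n; ++i)
                Y[i] += A[i * n + j] * x;
        }
    }
}
```

Because the u iterations are independent, the hardware (and the compiler) can overlap them, which is where the speedups in the table come from.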


Structure of Array (SoA)

Algorithm (Y = A · X, scalar ij AoS)
    input  : As    // list of n×n matrices
    input  : Xs    // list of n vectors
    output : Ys    // list of n vectors
    temp   : x     // scalar
    for u = 0 : N do
        for i = 0 : n do
            Ys[u][i] ← 0
        for j = 0 : n do
            x ← Xs[u][j]
            for i = 0 : n do
                Ys[u][i] ← Ys[u][i] + As[u][i, j] · x

Algorithm (Y = A · X, vector ij SoA)
    input  : As    // n×n matrix of lists
    input  : Xs    // n vector of lists
    output : Ys    // n vector of lists
    temp   : x     // list
    for i = 0 : n do
        Ys[i][:] ← 0
    for j = 0 : n do
        x ← Xs[j][:]
        for i = 0 : n do
            Ys[i][:] ← Ys[i][:] + As[i, j][:] .∗ x[:]

A matrix of lists rather than a list of matrices. This helps vectorization: the vector code is identical to the scalar code, and no shuffling is required. It is possible to go even further (Unroll&Jam).

The same batch in SoA layout, scalar form:

Algorithm (Y = A · X, scalar ij SoA)
    input  : As    // n×n matrix of lists
    input  : Xs    // n vector of lists
    output : Ys    // n vector of lists
    temp   : x     // scalar
    for u = 0 : N do
        for i = 0 : n do
            Ys[i][u] ← 0
        for j = 0 : n do
            x ← Xs[j][u]
            for i = 0 : n do
                Ys[i][u] ← Ys[i][u] + As[i, j][u] · x

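The AoS-versus-SoA distinction can be sketched with a batch of 3-vectors (types and names are illustrative, not from the slides): in the SoA form each component is its own contiguous array, so per-component loops over the batch are plain stride-1 loops the compiler can vectorize directly.

```cpp
#include <cstddef>
#include <vector>

// AoS: a list of structures -- components of one object are adjacent,
// components of the batch are interleaved (bad for SIMD).
struct Vec3AoS { double x, y, z; };

// SoA: a structure of lists -- each component of the whole batch is
// contiguous (good for SIMD).
struct Vec3SoA {
    std::vector<double> x, y, z;
    explicit Vec3SoA(std::size_t N) : x(N), y(N), z(N) {}
};

// Example kernel on the SoA batch: each loop is stride-1 over N
// elements, exactly the simple pattern auto-vectorizers recognize.
void scale(Vec3SoA& v, double f) {
    const std::size_t N = v.x.size();
    for (std::size_t u = 0; u < N; ++u) v.x[u] *= f;
    for (std::size_t u = 0; u < N; ++u) v.y[u] *= f;
    for (std::size_t u = 0; u < N; ++u) v.z[u] *= f;
}
```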


Hand-made vectorization

Results with hand-written SIMD:

µs/matrix       ×1    ×2    ×4    ×8    ×16
unwinded        4541  2330  2270  2258  2549
best            3699  1939  2153  885   586
VectorClass ij  9189  4994  3697  3704  3710
VectorClass ji  8079  4772  2418  1450  1478

WARNING: hand-written SIMD is much slower! SIMD code is less optimized by the compiler than scalar code; it overrides compiler heuristics and requires a deep knowledge of the architecture. Vectorization libraries like VC don't help here: they only provide a better and more abstract interface.

Hand-made vectorization

Do not vectorize yourself! Unless you have checked that:
- the code is critical for performance
- the compiler has not vectorized it
- you understand why
- there is no way to make it vectorized by the compiler
- you know how to do it efficiently
- your SIMD code is actually faster than scalar code


Check the assembly

Command:
    objdump -dC --no-show-raw-insn binary

Non vectorized:
    vmovsd      p+0x8 , %xmm4
    vmovsd      p     , %xmm3
    vmulsd      t+0x48, %xmm4, %xmm0
    vmulsd      t+0x28, %xmm4, %xmm1
    vmulsd      t+0x8 , %xmm4, %xmm4
    vmovsd      p+0x10, %xmm2
    vfmadd231sd t+0x40, %xmm3, %xmm0
    vfmadd231sd t+0x50, %xmm2, %xmm0
    vaddsd      t+0x58, %xmm0, %xmm0
    vfmadd231sd t+0x20, %xmm3, %xmm1
    vfmadd231sd t+0x30, %xmm2, %xmm1
    vaddsd      t+0x38, %xmm1, %xmm1
    vfmadd132sd t     , %xmm4, %xmm3
    vfmadd132sd t+0x10, %xmm3, %xmm2
    vaddsd      t+0x18, %xmm2, %xmm2
    vmovsd      %xmm0, p+0x10
    vmovsd      %xmm1, p+0x8
    vmovsd      %xmm2, p
    retq

Vectorized:
    vbroadcastsd p+0x10, %ymm0
    vbroadcastsd p+0x8 , %ymm1
    vbroadcastsd p     , %ymm2
    vmovapd      t+0x60, %ymm3
    vfmadd132pd  t     , %ymm3, %ymm2
    vfmadd132pd  t+0x20, %ymm2, %ymm1
    vfmadd132pd  t+0x40, %ymm1, %ymm0
    vmovapd      %ymm0, p
    vzeroupper
    retq

%xmm and %ymm designate SIMD registers; %ymm registers are twice as big (available only with AVX). WARNING: floating-point scalar code also uses %xmm registers. Look for the sd (scalar double) and pd (packed double) suffixes (ss and ps for single precision); they appear only on memory and arithmetic operations.


Using Callgrind

Custom patch to collect floating point instructions.

Generic usage:
    valgrind --tool=callgrind original command line

Useful flags:
- --dump-instr=yes enables assembler instruction granularity
- --instr-atstart=no does not start profiling from the beginning
- --cache-sim=yes enables cache simulation (aka cachegrind)
- --branch-sim=yes enables misprediction counting
- --collect-fp=no disables floating point instruction collection

Location: /cvmfs/lhcbdev.cern.ch/tools/valgrind/3.12.0/x86_64-centos7/bin/valgrind

Using Callgrind with Gaudi

- use --instr-atstart=no to avoid spending hours in the geometry parsing
- start instruction profiling in the code using the CallgrindProfile algorithm (in MiniBrunel, just set Profile to True)
- use a command line like this:
    Brunel/run valgrind ... python \
        $(Brunel/run which gaudirun.py) options.py

Command:
    kcachegrind callgrind.out.##pid##


Conclusion & Guidelines

Vectorization is hard, and vectorizing a code is usually not enough to get maximum performance. The goal is to help the compiler vectorize:
- understand how vectorization works
- think about your code as abstract vector code
- design the code for batch processing
- check vectorization using existing tools: Callgrind, VTune, objdump for the brave ones, ...

Thank you for your attention!

Going Further? Unroll&Jam!

Matrix-Vector product:

Algorithm (Y = A · X, SIMD ij SoA)
    input  : As    // n×n matrix of lists
    input  : Xs    // n vector of lists
    output : Ys    // n vector of lists
    temp   : x0    // SIMD register
    for u = 0 : 4 : N do
        for i = 0 : n do
            Ys[i][u+0 : u+4] ← {0, 0, 0, 0}
        for j = 0 : n do
            x0[0 : 4] ← Xs[j][u+0 : u+4]
            for i = 0 : n do
                Ys[i][u+0 : u+4] ← Ys[i][u+0 : u+4] + As[i, j][u+0 : u+4] · x0[0 : 4]

Unroll&Jam interleaves independent iterations of a loop. It hides latencies, leaving performance limited by instruction throughput rather than by dependency chains, but it needs more objects to process.

The unroll&jam ×4 version interleaves four u-iterations:

Algorithm (Y = A · X, SIMD ij SoA with unroll&jam ×4)
    input  : As    // n×n matrix of lists
    input  : Xs    // n vector of lists
    output : Ys    // n vector of lists
    temp   : x0, x1, x2, x3    // SIMD registers
    for u = 0 : 16 : N do
        for i = 0 : n do
            Ys[i][u+0  : u+4 ] ← {0, 0, 0, 0}
            Ys[i][u+4  : u+8 ] ← {0, 0, 0, 0}
            Ys[i][u+8  : u+12] ← {0, 0, 0, 0}
            Ys[i][u+12 : u+16] ← {0, 0, 0, 0}
        for j = 0 : n do
            x0[0 : 4] ← Xs[j][u+0  : u+4 ]
            x1[0 : 4] ← Xs[j][u+4  : u+8 ]
            x2[0 : 4] ← Xs[j][u+8  : u+12]
            x3[0 : 4] ← Xs[j][u+12 : u+16]
            for i = 0 : n do
                Ys[i][u+0  : u+4 ] ← Ys[i][u+0  : u+4 ] + As[i, j][u+0  : u+4 ] · x0[0 : 4]
                Ys[i][u+4  : u+8 ] ← Ys[i][u+4  : u+8 ] + As[i, j][u+4  : u+8 ] · x1[0 : 4]
                Ys[i][u+8  : u+12] ← Ys[i][u+8  : u+12] + As[i, j][u+8  : u+12] · x2[0 : 4]
                Ys[i][u+12 : u+16] ← Ys[i][u+12 : u+16] + As[i, j][u+12 : u+16] · x3[0 : 4]
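The same idea can be shown on a simpler kernel than the matrix product (this is not the slides' algorithm, just an illustration of the multiple-accumulator principle behind unroll&jam): a reduction with four independent accumulators breaks the single dependency chain, so the four chains can run in parallel on an out-of-order core and map naturally onto SIMD lanes.

```cpp
#include <cstddef>

// Sum of an array with 4 independent accumulators. A single-accumulator
// loop forms one long add dependency chain; here the 4 chains are
// independent, so their latencies overlap. Assumes n is a multiple of 4.
double sum_unroll4(const double* a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    // Final combine, done once outside the loop.
    return (s0 + s1) + (s2 + s3);
}
```

Note that with floating point this changes the order of additions, so the compiler will not do it for you unless relaxed FP rules (e.g. -ffast-math) allow it.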