SLIDE 1

BLASFEO

Gianluca Frison

University of Freiburg

BLIS retreat September 19, 2017

Gianluca Frison BLASFEO

SLIDE 2

BLASFEO

◮ Basic Linear Algebra Subroutines For Embedded Optimization

SLIDE 3

BLASFEO performance

◮ Intel Core i7 4800MQ

◮ BLASFEO HP
◮ OpenBLAS 0.2.19
◮ MKL 2017.2.174
◮ ATLAS 3.10.3
◮ BLIS 0.1.6

◮ CAVEAT: BLASFEO is not API compatible with BLAS & LAPACK

[Plots: Gflops vs matrix size n for dgemm_nt and dpotrf_l; BLASFEO_HP, OpenBLAS, MKL, ATLAS, BLIS]

SLIDE 4

Framework: embedded optimization

◮ Framework: embedded optimization and control

SLIDE 5

Embedded control

Model Predictive Control (and Moving Horizon Estimation)

◮ solve an optimization problem on-line
◮ sampling times in the milli- or microsecond range

SLIDE 6

Karush-Kuhn-Tucker optimality conditions

KKT system (for N = 2)

$$
\begin{bmatrix}
Q_0 & S_0^T & A_0^T & & & & \\
S_0 & R_0 & B_0^T & & & & \\
A_0 & B_0 & & -I & & & \\
& & -I & Q_1 & S_1^T & A_1^T & \\
& & & S_1 & R_1 & B_1^T & \\
& & & A_1 & B_1 & & -I \\
& & & & & -I & Q_2
\end{bmatrix}
\begin{bmatrix} x_0 \\ u_0 \\ \lambda_0 \\ x_1 \\ u_1 \\ \lambda_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} -q_0 \\ -r_0 \\ -b_0 \\ -q_1 \\ -r_1 \\ -b_1 \\ -q_2 \end{bmatrix}
$$

◮ Large, structured system of linear equations
◮ Sub-matrices are assumed dense or diagonal

SLIDE 7

Framework: embedded optimization

Assumptions about embedded optimization:

◮ Computational speed is a key factor: solve optimization problems in real-time on resource-constrained hardware.

◮ Data matrices are reused several times (e.g. at each optimization algorithm iteration): look for a good data structure.

◮ Structure-exploiting algorithms can exploit the high-level sparsity pattern: data matrices are assumed dense.

◮ Size of matrices is relatively small (tens or a few hundreds): generally fitting in cache.

◮ Limitations of embedded optimization hardware and toolchains: no external libraries and no (static/dynamic) memory allocation.
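The last assumption (no dynamic memory allocation) can be made concrete with a minimal sketch in C, assuming a hypothetical matrix struct: the names `mat`, `mat_create` and `NX_MAX` are illustrative, not the BLASFEO API. All storage is reserved at compile time and matrix "views" are attached to it, so the solver never calls malloc.

```c
#include <stddef.h>

/* Hypothetical fixed maximum problem size, chosen at compile time. */
#define NX_MAX 32

/* Simple matrix view over preallocated storage
 * (illustrative names, not the BLASFEO API). */
struct mat {
    int m, n;   /* current matrix size */
    double *pA; /* pointer into static storage */
};

/* All memory is reserved at compile time: no malloc, no free. */
static double Q_mem[NX_MAX * NX_MAX];

/* Attach an m x n matrix view to preallocated storage. */
static void mat_create(struct mat *sA, int m, int n, double *mem)
{
    sA->m = m;
    sA->n = n;
    sA->pA = mem;
}
```

The same pattern (create a view into caller-provided memory) is what makes a library usable on embedded targets whose toolchains forbid heap allocation.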

SLIDE 8

The origin

◮ HPMPC: library for High-Performance implementation of solvers for Model Predictive Control

SLIDE 9

How to optimize syrk+potrf (for embedded optimization)

◮ How to optimize dsyrk + dpotrf (for embedded optimization)

◮ Test operation: $L = \left( Q + A \cdot A^T \right)^{1/2}$

◮ Netlib BLAS

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 10

How to optimize syrk+potrf (for embedded optimization)

Code Generation

◮ e.g. fix the size of the loops: the compiler can unroll loops and avoid branches

◮ need to generate the code for each problem size

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]
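A minimal sketch of the code-generation idea (illustrative, not actual generated output): the problem size is baked in as a compile-time constant, so every loop has a fixed trip count that the compiler can fully unroll, at the cost of regenerating the routine for every problem size.

```c
/* Code-generated flavor: N_FIXED is fixed at generation time, so all
 * loop bounds are compile-time constants (illustrative sketch). */
#define N_FIXED 4

/* D = C + A * B^T for N_FIXED x N_FIXED column-major matrices. */
static void dgemm_nt_fixed(const double *A, const double *B,
                           const double *C, double *D)
{
    for (int j = 0; j < N_FIXED; j++)
        for (int i = 0; i < N_FIXED; i++) {
            double d = C[i + j * N_FIXED];
            /* constant trip count: the compiler can unroll this
             * loop completely and avoid branches */
            for (int k = 0; k < N_FIXED; k++)
                d += A[i + k * N_FIXED] * B[j + k * N_FIXED];
            D[i + j * N_FIXED] = d;
        }
}
```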

SLIDE 11

How to optimize syrk+potrf (for embedded optimization)

OpenBLAS

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 12

How to optimize syrk+potrf (for embedded optimization)

HPMPC - register blocking

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 13

How to optimize syrk+potrf (for embedded optimization)

HPMPC - SIMD instructions

◮ performance drop for n multiple of 32: limited cache associativity

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 14

How to optimize syrk+potrf (for embedded optimization)

HPMPC - panel-major matrix format

◮ panel-major matrix format
◮ smooth performance

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 15

Access pattern in optimized BLAS

Figure: Access pattern of data in different cache levels for the dgemm routine in GotoBLAS/OpenBLAS/BLIS. Data is packed (on-line) into buffers following the access pattern.

SLIDE 16

Panel-major matrix format

[Diagram: panel-major gemm, C += A · Bᵀ, with panels of width ps]

◮ matrix format: can do any operation, also on sub-matrices

◮ matrix elements are stored in (almost) the same order as the gemm kernel accesses them

◮ optimal 'NT' gemm variant (A not transposed, B transposed)

◮ panel width ps is the same for the left and the right matrix operand, as well as for the result matrix
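The packing into this layout can be sketched as follows, under a simplified panel-major scheme (panel height `PS`, column count not padded); `pack_panel_major` is an illustrative helper, not a BLASFEO routine. Rows are grouped into panels of `PS`, and each panel is stored column-major with leading dimension `PS`.

```c
/* Pack a column-major m x n matrix A (lda >= m) into a simplified
 * panel-major format with panel height PS: element (i,j) lands at
 * pB[(i/PS)*PS*n + i%PS + j*PS]. Real packed formats also pad the
 * row and column counts; this sketch assumes pB holds
 * ceil(m/PS)*PS*n doubles. */
#define PS 4

static void pack_panel_major(int m, int n, const double *A, int lda,
                             double *pB)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            pB[(i / PS) * PS * n + i % PS + j * PS] = A[i + j * lda];
}
```

Packing is done off-line, once per data matrix, so its O(n²) cost is amortized over the many times the matrix is reused inside the optimization algorithm.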

SLIDE 17

Optimized BLAS vs HPMPC software stack

[Diagram: software stacks of an optimized-BLAS-based solver vs the HPMPC-based solver — linear algebra kernels → linear algebra routines (with packing of matrices / high-level wrapper) → Riccati solver for unconstrained MPC → IPM solver for linear MPC (Parts I, II, III)]

Figure: Structure of a Riccati-based IPM for linear MPC problems when implemented using linear algebra in either optimized BLAS or HPMPC. Routines in the orange boxes use matrices in column-major format, routines in the green boxes use matrices in panel-major format.

SLIDE 18

How to optimize syrk+potrf (for embedded optimization)

HPMPC - merging of linear algebra routines

◮ specialized kernels for complex operations

◮ improves small-scale performance

◮ worse large-scale performance

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 19

Merging of linear algebra routines: syrk + potrf

$$
L = \left( Q + A \cdot A^T \right)^{1/2}
= \begin{bmatrix} L_{00} & * & * \\ L_{10} & L_{11} & * \\ L_{20} & L_{21} & L_{22} \end{bmatrix}
= \left( \begin{bmatrix} Q_{00} & * & * \\ Q_{10} & Q_{11} & * \\ Q_{20} & Q_{21} & Q_{22} \end{bmatrix}
+ \begin{bmatrix} A_0 \\ A_1 \\ A_2 \end{bmatrix}
\begin{bmatrix} A_0^T & A_1^T & A_2^T \end{bmatrix} \right)^{1/2}
$$

$$
= \begin{bmatrix}
\left(Q_{00} + A_0 A_0^T\right)^{1/2} & * & * \\
\left(Q_{10} + A_1 A_0^T\right) L_{00}^{-T} & \left(Q_{11} + A_1 A_1^T - L_{10} L_{10}^T\right)^{1/2} & * \\
\left(Q_{20} + A_2 A_0^T\right) L_{00}^{-T} & \left(Q_{21} + A_2 A_1^T - L_{20} L_{10}^T\right) L_{11}^{-T} & \left(Q_{22} + A_2 A_2^T - L_{20} L_{20}^T - L_{21} L_{21}^T\right)^{1/2}
\end{bmatrix}
$$

◮ each sub-matrix computed using a single specialized kernel

◮ reduce the number of function calls

◮ reduce the number of loads and stores of the same data

SLIDE 20

The present

◮ BLASFEO

SLIDE 21

BLASFEO

◮ aim: satisfy the need for high-performance linear algebra routines for small dense matrices

◮ mean: adapt high-performance computing techniques to the embedded optimization framework

◮ keep:
  ◮ LA kernels (register blocking, SIMD)
  ◮ optimized data layout

◮ drop:
  ◮ cache blocking
  ◮ on-line data packing
  ◮ multi-threading (at least for now)

◮ add:
  ◮ optimized (panel-major) matrix format
  ◮ off-line data packing
  ◮ assembly subroutines: tailored LA kernels & code reuse

SLIDE 22

Linear algebra level

◮ LA level definition
  ◮ level 1: O(n) storage, O(n) flops
  ◮ level 2: O(n²) storage, O(n²) flops
  ◮ level 3: O(n²) storage, O(n³) flops

◮ in level 1 and level 2
  ◮ reuse factor O(1)
  ◮ memory-bound

◮ in level 3
  ◮ reuse factor O(n)
  ◮ compute-bound for large n: disregard O(n²) terms
  ◮ typically memory-bound for small n: minimize O(n²) terms ⇒ BLASFEO playground!
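The reuse factors can be made concrete with a back-of-envelope computation (a sketch: the flop and data counts assume gemv-like level-2 and gemm-like level-3 operations on n×n matrices):

```c
/* Flops per data element (reuse factor), double precision:
 * level 2 (gemv): 2*n^2 flops on n^2 + 2n elements -> O(1)
 * level 3 (gemm): 2*n^3 flops on 3*n^2 elements    -> O(n) */
static double reuse_gemv(int n)
{
    return 2.0 * n * n / ((double)n * n + 2.0 * n);
}

static double reuse_gemm(int n)
{
    return 2.0 * n * n * n / (3.0 * n * n);
}
```

For n = 300 the gemm reuse factor is already 200 flops per element, while gemv stays below 2 for any n, which is why only level 3 can approach peak throughput.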

SLIDE 23

BLASFEO RF

BLASFEO ReFerence

◮ 2x2 blocking for registers
◮ column-major matrix format
◮ small code size
◮ ANSI C code, no external dependencies
◮ good performance for tiny matrices
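A minimal sketch of 2x2 register blocking in plain C (illustrative, not the BLASFEO RF source): a 2x2 sub-block of the result is accumulated in four local variables that the compiler can keep in registers, so each loaded element of A and B is used twice. The 'NT' variant D = A·Bᵀ is assumed, column-major, with m and n even for brevity.

```c
/* D = A * B^T with 2x2 register blocking: A is m x k (lda = m),
 * B is n x k (ldb = n), D is m x n (ldd = m), column-major.
 * m and n are assumed even in this sketch. */
static void dgemm_nt_2x2(int m, int n, int k,
                         const double *A, const double *B, double *D)
{
    for (int j = 0; j < n; j += 2)
        for (int i = 0; i < m; i += 2) {
            /* 2x2 accumulator held in (candidate) registers */
            double d00 = 0, d10 = 0, d01 = 0, d11 = 0;
            for (int p = 0; p < k; p++) {
                double a0 = A[i + p * m], a1 = A[i + 1 + p * m];
                double b0 = B[j + p * n], b1 = B[j + 1 + p * n];
                d00 += a0 * b0; d10 += a1 * b0;
                d01 += a0 * b1; d11 += a1 * b1;
            }
            D[i + j * m]           = d00;
            D[i + 1 + j * m]       = d10;
            D[i + (j + 1) * m]     = d01;
            D[i + 1 + (j + 1) * m] = d11;
        }
}
```

The HP version applies the same idea with larger, architecture-specific blocks and SIMD registers, written in assembly.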

SLIDE 24

BLASFEO HP

BLASFEO High-Performance

◮ optimal blocking for registers + vectorization
◮ no blocking for cache
◮ panel-major matrix format
◮ hand-written assembly kernels
◮ optimized for highest performance for matrices fitting in cache

SLIDE 25

BLASFEO WR

BLASFEO WRapper to BLAS and LAPACK

◮ provides a performance baseline
◮ column-major matrix format
◮ optimized for many architectures
◮ possibly multi-threaded
◮ good performance for large matrices

SLIDE 26

Matrix structure

◮ problem: how to switch between panel-major and column-major formats?

◮ solution: use a C struct for the matrix "object":

  (..., struct d_strmat *sA, int ai, int aj, ...)

◮ if column-major, the first matrix element is at

  int lda = sA->m;
  double *ptr = sA->pA + ai + aj*lda;

◮ if panel-major, the first matrix element is at

  int ps = sA->ps;
  int sda = sA->cn;
  int air = ai & (ps-1);
  double *ptr = sA->pA + (ai-air)*sda + air + aj*ps;

◮ the custom matrix struct allows other tricks
  ◮ extra linear storage for the inverse of the diagonal in factorizations
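The two address computations can be wrapped in small helpers; the struct and its field names below are illustrative stand-ins (modeled on, but not identical to, the d_strmat fields shown on the slide).

```c
/* Illustrative matrix struct with the fields the two address
 * computations need (not the actual BLASFEO definition). */
struct strmat {
    int m;      /* number of rows */
    int cn;     /* panel-major: padded number of columns (sda) */
    int ps;     /* panel-major: panel height, a power of 2 */
    double *pA; /* data */
};

/* Column-major: plain lda-based addressing. */
static double *elem_colmaj(struct strmat *sA, int ai, int aj)
{
    int lda = sA->m;
    return sA->pA + ai + aj * lda;
}

/* Panel-major: skip whole panels, then index inside one panel. */
static double *elem_panmaj(struct strmat *sA, int ai, int aj)
{
    int ps  = sA->ps;
    int sda = sA->cn;
    int air = ai & (ps - 1); /* ai % ps, since ps is a power of 2 */
    return sA->pA + (ai - air) * sda + air + aj * ps;
}
```

Because both computations start from the same struct plus (ai, aj) offsets, the caller's code is identical for the WR (column-major) and HP (panel-major) back-ends.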

SLIDE 27

Some tests on Intel Core i7 4800MQ (Haswell)

[Plots: Gflops vs matrix size n for dgemm_nt, dgemm_nn, sgemm_nt, sgemm_nn; HP, RF, OpenBLAS, MKL, Eigen]

SLIDE 28

Some tests on Intel Core i7 4800MQ (Haswell)

[Plots: Gflops vs matrix size n for dpotrf_l, dgetrf, spotrf_l, dgelqf; HP, RF, OpenBLAS, MKL (Eigen where available)]

SLIDE 29

Some tests on ARM Cortex A15 and ARM Cortex A57

[Plots: Gflops vs matrix size n for dgemm_nt and sgemm_nt on Cortex A15 and Cortex A57; HP, RF, OpenBLAS (Eigen where available)]

SLIDE 30

BLASFEO + PLASMA on Intel Core i7 4800MQ

[Plots: Gflops vs matrix size n up to 4000 for dpotrf_l; MKL vs PLASMA (tile size 64 and 128) + MKL vs PLASMA + BLASFEO HP]

SLIDE 31

The end (for now)

◮ thank you for your attention!
◮ questions?
