SLIDE 1

BLASFEO

Gianluca Frison

University of Freiburg

BLIS retreat September 19, 2017

Gianluca Frison BLASFEO

SLIDE 2

BLASFEO

◮ Basic Linear Algebra Subroutines For Embedded Optimization

SLIDE 3

BLASFEO performance

◮ Intel Core i7 4800MQ

◮ BLASFEO HP
◮ OpenBLAS 0.2.19
◮ MKL 2017.2.174
◮ ATLAS 3.10.3
◮ BLIS 0.1.6

◮ CAVEAT: BLASFEO is not API compatible with BLAS & LAPACK

[Plots: Gflops vs matrix size n for dgemm_nt and dpotrf_l; BLASFEO_HP, OpenBLAS, MKL, ATLAS, BLIS]

SLIDE 4

Framework: embedded optimization

◮ Framework: embedded optimization and control

SLIDE 5

Embedded control

Model Predictive Control (and Moving Horizon Estimation)

◮ solve an optimization problem on-line
◮ sampling times in the milli- or microsecond range

SLIDE 6

Karush-Kuhn-Tucker optimality conditions

KKT system (for N = 2)

$$
\begin{bmatrix}
Q_0 & S_0^T & A_0^T & & & & \\
S_0 & R_0 & B_0^T & & & & \\
A_0 & B_0 & & -I & & & \\
& & -I & Q_1 & S_1^T & A_1^T & \\
& & & S_1 & R_1 & B_1^T & \\
& & & A_1 & B_1 & & -I \\
& & & & & -I & Q_2
\end{bmatrix}
\begin{bmatrix} x_0 \\ u_0 \\ \lambda_0 \\ x_1 \\ u_1 \\ \lambda_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} -q_0 \\ -r_0 \\ -b_0 \\ -q_1 \\ -r_1 \\ -b_1 \\ -q_2 \end{bmatrix}
$$

◮ Large, structured system of linear equations
◮ Sub-matrices are assumed dense or diagonal

SLIDE 7

Framework: embedded optimization

Assumptions about embedded optimization:

◮ Computational speed is a key factor: solve optimization problems in real-time on resource-constrained hardware.

◮ Data matrices are reused several times (e.g. at each optimization algorithm iteration): look for a good data structure.

◮ Structure-exploiting algorithms can exploit the high-level sparsity pattern: data matrices are assumed dense.

◮ Size of matrices is relatively small (tens or a few hundreds): generally fitting in cache.

◮ Limitations of embedded optimization hardware and toolchains: no external libraries and no (static/dynamic) memory allocation.
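The last assumption (no dynamic memory allocation) can be made concrete with a minimal sketch in C, assuming a hypothetical matrix struct: the names `mat`, `mat_create` and `NX_MAX` are illustrative, not the BLASFEO API. All storage is reserved at compile time and matrix "views" are attached to it, so the solver never calls malloc.

```c
#include <stddef.h>

/* Hypothetical fixed maximum problem size, chosen at compile time. */
#define NX_MAX 32

/* Simple matrix view over preallocated storage
 * (illustrative names, not the BLASFEO API). */
struct mat {
    int m, n;   /* current matrix size */
    double *pA; /* pointer into static storage */
};

/* All memory is reserved at compile time: no malloc, no free. */
static double Q_mem[NX_MAX * NX_MAX];

/* Attach an m x n matrix view to preallocated storage. */
static void mat_create(struct mat *sA, int m, int n, double *mem)
{
    sA->m = m;
    sA->n = n;
    sA->pA = mem;
}
```

The same pattern (create a view into caller-provided memory) is what makes a library usable on embedded targets whose toolchains forbid heap allocation.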

SLIDE 8

The origin

◮ HPMPC: library for High-Performance implementation of solvers for Model Predictive Control

SLIDE 9

How to optimize syrk+potrf (for embedded optimization)

◮ How to optimize dsyrk + dpotrf (for embedded optimization)

◮ Test operation: $L = \left( Q + A \cdot A^T \right)^{1/2}$

◮ Netlib BLAS

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 10

How to optimize syrk+potrf (for embedded optimization)

Code Generation

◮ e.g. fix the size of the loops: the compiler can unroll loops and avoid branches

◮ need to generate the code for each problem size

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]
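A minimal sketch of the code-generation idea (illustrative, not actual generated output): the problem size is baked in as a compile-time constant, so every loop has a fixed trip count that the compiler can fully unroll, at the cost of regenerating the routine for every problem size.

```c
/* Code-generated flavor: N_FIXED is fixed at generation time, so all
 * loop bounds are compile-time constants (illustrative sketch). */
#define N_FIXED 4

/* D = C + A * B^T for N_FIXED x N_FIXED column-major matrices. */
static void dgemm_nt_fixed(const double *A, const double *B,
                           const double *C, double *D)
{
    for (int j = 0; j < N_FIXED; j++)
        for (int i = 0; i < N_FIXED; i++) {
            double d = C[i + j * N_FIXED];
            /* constant trip count: the compiler can unroll this
             * loop completely and avoid branches */
            for (int k = 0; k < N_FIXED; k++)
                d += A[i + k * N_FIXED] * B[j + k * N_FIXED];
            D[i + j * N_FIXED] = d;
        }
}
```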

SLIDE 11

How to optimize syrk+potrf (for embedded optimization)

OpenBLAS

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 12

How to optimize syrk+potrf (for embedded optimization)

HPMPC - register blocking

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 13

How to optimize syrk+potrf (for embedded optimization)

HPMPC - SIMD instructions

◮ performance drop for n multiple of 32: limited cache associativity

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 14

How to optimize syrk+potrf (for embedded optimization)

HPMPC - panel-major matrix format

◮ panel-major matrix format
◮ smooth performance

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 15

Access pattern in optimized BLAS

Figure: Access pattern of data in different cache levels for the dgemm routine in GotoBLAS/OpenBLAS/BLIS. Data is packed (on-line) into buffers following the access pattern.

SLIDE 16

Panel-major matrix format

[Diagram: panel-major gemm, C += A · Bᵀ, with panels of width ps]

◮ matrix format: can do any operation, also on sub-matrices

◮ matrix elements are stored in (almost) the same order as the gemm kernel accesses them

◮ optimal 'NT' gemm variant (A not transposed, B transposed)

◮ panel width ps is the same for the left and the right matrix operand, as well as for the result matrix
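The packing into this layout can be sketched as follows, under a simplified panel-major scheme (panel height `PS`, column count not padded); `pack_panel_major` is an illustrative helper, not a BLASFEO routine. Rows are grouped into panels of `PS`, and each panel is stored column-major with leading dimension `PS`.

```c
/* Pack a column-major m x n matrix A (lda >= m) into a simplified
 * panel-major format with panel height PS: element (i,j) lands at
 * pB[(i/PS)*PS*n + i%PS + j*PS]. Real packed formats also pad the
 * row and column counts; this sketch assumes pB holds
 * ceil(m/PS)*PS*n doubles. */
#define PS 4

static void pack_panel_major(int m, int n, const double *A, int lda,
                             double *pB)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            pB[(i / PS) * PS * n + i % PS + j * PS] = A[i + j * lda];
}
```

Packing is done off-line, once per data matrix, so its O(n²) cost is amortized over the many times the matrix is reused inside the optimization algorithm.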

SLIDE 17

Optimized BLAS vs HPMPC software stack

[Diagram: software stacks of an optimized-BLAS-based solver vs the HPMPC-based solver — linear algebra kernels → linear algebra routines (with packing of matrices / high-level wrapper) → Riccati solver for unconstrained MPC → IPM solver for linear MPC (Parts I, II, III)]

Figure: Structure of a Riccati-based IPM for linear MPC problems when implemented using linear algebra in either optimized BLAS or HPMPC. Routines in the orange boxes use matrices in column-major format, routines in the green boxes use matrices in panel-major format.

SLIDE 18

How to optimize syrk+potrf (for embedded optimization)

HPMPC - merging of linear algebra routines

◮ specialized kernels for complex operations

◮ improves small-scale performance

◮ worse large-scale performance

[Plots: Gflops vs matrix size n, test dsyrk + dpotrf (large-scale and small-scale ranges)]

SLIDE 19

Merging of linear algebra routines: syrk + potrf

$$
L = \left( Q + A \cdot A^T \right)^{1/2}
= \begin{bmatrix} L_{00} & * & * \\ L_{10} & L_{11} & * \\ L_{20} & L_{21} & L_{22} \end{bmatrix}
= \left( \begin{bmatrix} Q_{00} & * & * \\ Q_{10} & Q_{11} & * \\ Q_{20} & Q_{21} & Q_{22} \end{bmatrix}
+ \begin{bmatrix} A_0 \\ A_1 \\ A_2 \end{bmatrix}
\begin{bmatrix} A_0^T & A_1^T & A_2^T \end{bmatrix} \right)^{1/2}
$$

$$
= \begin{bmatrix}
\left(Q_{00} + A_0 A_0^T\right)^{1/2} & * & * \\
\left(Q_{10} + A_1 A_0^T\right) L_{00}^{-T} & \left(Q_{11} + A_1 A_1^T - L_{10} L_{10}^T\right)^{1/2} & * \\
\left(Q_{20} + A_2 A_0^T\right) L_{00}^{-T} & \left(Q_{21} + A_2 A_1^T - L_{20} L_{10}^T\right) L_{11}^{-T} & \left(Q_{22} + A_2 A_2^T - L_{20} L_{20}^T - L_{21} L_{21}^T\right)^{1/2}
\end{bmatrix}
$$

◮ each sub-matrix computed using a single specialized kernel

◮ reduce the number of function calls

◮ reduce the number of loads and stores of the same data

SLIDE 20

The present

◮ BLASFEO

SLIDE 21

BLASFEO

◮ aim: satisfy the need for high-performance linear algebra routines for small dense matrices

◮ mean: adapt high-performance computing techniques to the embedded optimization framework

◮ keep:
  ◮ LA kernels (register blocking, SIMD)
  ◮ optimized data layout

◮ drop:
  ◮ cache blocking
  ◮ on-line data packing
  ◮ multi-threading (at least for now)

◮ add:
  ◮ optimized (panel-major) matrix format
  ◮ off-line data packing
  ◮ assembly subroutines: tailored LA kernels & code reuse

SLIDE 22

Linear algebra level

◮ LA level definition
  ◮ level 1: O(n) storage, O(n) flops
  ◮ level 2: O(n²) storage, O(n²) flops
  ◮ level 3: O(n²) storage, O(n³) flops

◮ in level 1 and level 2
  ◮ reuse factor O(1)
  ◮ memory-bound

◮ in level 3
  ◮ reuse factor O(n)
  ◮ compute-bound for large n: disregard O(n²) terms
  ◮ typically memory-bound for small n: minimize O(n²) terms ⇒ BLASFEO playground!
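The reuse factors can be made concrete with a back-of-envelope computation (a sketch: the flop and data counts assume gemv-like level-2 and gemm-like level-3 operations on n×n matrices):

```c
/* Flops per data element (reuse factor), double precision:
 * level 2 (gemv): 2*n^2 flops on n^2 + 2n elements -> O(1)
 * level 3 (gemm): 2*n^3 flops on 3*n^2 elements    -> O(n) */
static double reuse_gemv(int n)
{
    return 2.0 * n * n / ((double)n * n + 2.0 * n);
}

static double reuse_gemm(int n)
{
    return 2.0 * n * n * n / (3.0 * n * n);
}
```

For n = 300 the gemm reuse factor is already 200 flops per element, while gemv stays below 2 for any n, which is why only level 3 can approach peak throughput.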

SLIDE 23

BLASFEO RF

BLASFEO ReFerence

◮ 2x2 blocking for registers
◮ column-major matrix format
◮ small code size
◮ ANSI C code, no external dependencies
◮ good performance for tiny matrices
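A minimal sketch of 2x2 register blocking in plain C (illustrative, not the BLASFEO RF source): a 2x2 sub-block of the result is accumulated in four local variables that the compiler can keep in registers, so each loaded element of A and B is used twice. The 'NT' variant D = A·Bᵀ is assumed, column-major, with m and n even for brevity.

```c
/* D = A * B^T with 2x2 register blocking: A is m x k (lda = m),
 * B is n x k (ldb = n), D is m x n (ldd = m), column-major.
 * m and n are assumed even in this sketch. */
static void dgemm_nt_2x2(int m, int n, int k,
                         const double *A, const double *B, double *D)
{
    for (int j = 0; j < n; j += 2)
        for (int i = 0; i < m; i += 2) {
            /* 2x2 accumulator held in (candidate) registers */
            double d00 = 0, d10 = 0, d01 = 0, d11 = 0;
            for (int p = 0; p < k; p++) {
                double a0 = A[i + p * m], a1 = A[i + 1 + p * m];
                double b0 = B[j + p * n], b1 = B[j + 1 + p * n];
                d00 += a0 * b0; d10 += a1 * b0;
                d01 += a0 * b1; d11 += a1 * b1;
            }
            D[i + j * m]           = d00;
            D[i + 1 + j * m]       = d10;
            D[i + (j + 1) * m]     = d01;
            D[i + 1 + (j + 1) * m] = d11;
        }
}
```

The HP version applies the same idea with larger, architecture-specific blocks and SIMD registers, written in assembly.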

SLIDE 24

BLASFEO HP

BLASFEO High-Performance

◮ optimal blocking for registers + vectorization
◮ no blocking for cache
◮ panel-major matrix format
◮ hand-written assembly kernels
◮ optimized for highest performance for matrices fitting in cache

SLIDE 25

BLASFEO WR

BLASFEO WRapper to BLAS and LAPACK

◮ provides a performance baseline
◮ column-major matrix format
◮ optimized for many architectures
◮ possibly multi-threaded
◮ good performance for large matrices

SLIDE 26

Matrix structure

◮ problem: how to switch between panel-major and column-major formats?

◮ solution: use a C struct for the matrix "object":

  (..., struct d_strmat *sA, int ai, int aj, ...)

◮ if column-major, the first matrix element is at

  int lda = sA->m;
  double *ptr = sA->pA + ai + aj*lda;

◮ if panel-major, the first matrix element is at

  int ps = sA->ps;
  int sda = sA->cn;
  int air = ai & (ps-1);
  double *ptr = sA->pA + (ai-air)*sda + air + aj*ps;

◮ the custom matrix struct allows other tricks
  ◮ extra linear storage for the inverse of the diagonal in factorizations
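The two address computations can be wrapped in small helpers; the struct and its field names below are illustrative stand-ins (modeled on, but not identical to, the d_strmat fields shown on the slide).

```c
/* Illustrative matrix struct with the fields the two address
 * computations need (not the actual BLASFEO definition). */
struct strmat {
    int m;      /* number of rows */
    int cn;     /* panel-major: padded number of columns (sda) */
    int ps;     /* panel-major: panel height, a power of 2 */
    double *pA; /* data */
};

/* Column-major: plain lda-based addressing. */
static double *elem_colmaj(struct strmat *sA, int ai, int aj)
{
    int lda = sA->m;
    return sA->pA + ai + aj * lda;
}

/* Panel-major: skip whole panels, then index inside one panel. */
static double *elem_panmaj(struct strmat *sA, int ai, int aj)
{
    int ps  = sA->ps;
    int sda = sA->cn;
    int air = ai & (ps - 1); /* ai % ps, since ps is a power of 2 */
    return sA->pA + (ai - air) * sda + air + aj * ps;
}
```

Because both computations start from the same struct plus (ai, aj) offsets, the caller's code is identical for the WR (column-major) and HP (panel-major) back-ends.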

SLIDE 27

Some tests on Intel Core i7 4800MQ (Haswell)

[Plots: Gflops vs matrix size n for dgemm_nt, dgemm_nn, sgemm_nt, sgemm_nn; HP, RF, OpenBLAS, MKL, Eigen]

SLIDE 28

Some tests on Intel Core i7 4800MQ (Haswell)

[Plots: Gflops vs matrix size n for dpotrf_l, dgetrf, spotrf_l, dgelqf; HP, RF, OpenBLAS, MKL (Eigen where available)]

SLIDE 29

Some tests on ARM Cortex A15 and ARM Cortex A57

[Plots: Gflops vs matrix size n for dgemm_nt and sgemm_nt on Cortex A15 and Cortex A57; HP, RF, OpenBLAS (Eigen where available)]

SLIDE 30

BLASFEO + PLASMA on Intel Core i7 4800MQ

[Plots: Gflops vs matrix size n up to 4000 for dpotrf_l; MKL vs PLASMA (tile size 64 and 128) + MKL vs PLASMA + BLASFEO HP]

SLIDE 31

The end (for now)

◮ thank you for your attention!
◮ questions?
