Linear Algebra Libraries for High- Performance Computing: Scientific - PowerPoint PPT Presentation

Linear Algebra Libraries for High- Performance Computing: Scientific Computing with Multicore and Accelerators Presenters Prof. Jack Dongarra (8:30 – 10:00) University of Tennessee & Oak Ridge National Lab Dr. Jakub Kurzak (10:30 – 12:00) University of Tennessee Prof. James Demmel (1:30 – 3:00) University of California Berkeley Dr. Michael Heroux (3:30 – 5:00) Sandia National Laboratory 1

Overview ¡of ¡Dense ¡Numerical ¡Linear ¡ Algebra ¡Libraries ¡ • BLAS : ¡kernel ¡for ¡dense ¡linear ¡algebra ¡ • LAPACK : ¡sequen;al ¡dense ¡linear ¡algebra ¡ • ScaLAPACK : ¡parallel ¡distributed ¡dense ¡linear ¡ algebra ¡ L inear ¡ A lgebra ¡ PACK age ¡ 2

What do you mean by performance? ¨ What is a xflop/s? Ø xflop/s is a rate of execution, some number of floating point operations per second. Ø Whenever this term is used it will refer to 64 bit floating point operations and the operations will be either addition or multiplication. Ø Tflop/s refers to trillions (10 12 ) of floating point operations per second and Ø Pflop/s refers to 10 15 floating point operations per second. ¨ What is the theoretical peak performance? Ø The theoretical peak is based not on an actual performance from a benchmark run, but on a paper computation to determine the theoretical peak rate of execution of floating point operations for the machine. Ø The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. Ø For example, an Intel Xeon 5570 quad core at 2.93 GHz can complete 4 floating point operations per cycle or a theoretical peak performance of 11.72 GFlop/s per core or 3 46.88 Gflop/s for the socket.

What Is LINPACK? LINPACK is a package of mathematical software for solving ¨ problems in linear algebra, mainly dense linear systems of linear equations. LINPACK: “LINear algebra PACKage” ¨ Ø Written in Fortran 66 The project had its origins in 1974 ¨ The project had four primary contributors: myself when I was ¨ at Argonne National Lab, Jim Bunch from the University of California-San Diego, Cleve Moler who was at New Mexico at that time, and Pete Stewart from the University of Maryland. LINPACK as a software package has been largely superseded by ¨ LAPACK, which has been designed to run efficiently on shared- memory, vector supercomputers. 4

Computing in 1974 ¨ High Performance Computers: Ø IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030 ¨ Fortran 66 ¨ Trying to achieve software portability ¨ Run efficiently ¨ BLAS (Level 1) Ø Vector operations ¨ Software released in 1979 Ø About the time of the Cray 1 5

LINPACK Benchmark? ¨ The Linpack Benchmark is a measure of a computer’s floating-point rate of execution. Ø It is determined by running a computer program that solves a dense system of linear equations. ¨ Over the years the characteristics of the benchmark has changed a bit. Ø In fact, there are three benchmarks included in the Linpack Benchmark report. ¨ LINPACK Benchmark Ø Dense linear system solve with LU factorization using partial pivoting Ø Operation count is: 2/3 n 3 + O(n 2 ) Ø Benchmark Measure: MFlop/s Ø Original benchmark measures the execution rate for a 6 Fortran program on a matrix of size 100x100.

Accidental Benchmarker ¨ Appendix B of the Linpack Users’ Guide Ø Designed to help users extrapolate execution time for Linpack software package ¨ First benchmark report from 1977; Ø Cray 1 to DEC PDP-10 7

High Performance Linpack (HPL) Benchmark Matrix Optimizations Parallel Name dimension allowed Processing Linpack 100 100 compiler – a Linpack 1000 b 1000 – c hand, code replacement Linpack Parallel 1000 Yes hand, code replacement HPLinpack d arbitrary Yes hand, code replacement a Compiler parallelization possible. b Also known as TPP (Toward Peak Performance) or Best Effort c Multiprocessor implementations allowed. d Highly-Parallel LINPACK Benchmark is also known as NxN Linpack Benchmark or High Parallel Computing (HPC). 8

A brief history of (Dense) Linear Algebra software ¨ But the BLAS-1 weren’t enough Ø Consider AXPY ( y = α ·x + y ): 2n flops on 3n read/writes Ø Computational intensity = (2n)/(3n) = 2/3 Ø Too low to run near peak speed (read/write dominates) ¨ So the BLAS-2 were developed (1984-1986) Ø Standard library of 25 operations (mostly) on matrix/vector pairs Ø “GEMV”: y = α ·A·x + β ·x, “GER”: A = A + α ·x·y T , x = T -1 ·x Ø Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC Ø Why BLAS 2 ? They do O(n 2 ) ops on O(n 2 ) data Ø So computational intensity still just ~(2n 2 )/(n 2 ) = 2 Ø OK for vector machines, but not for machine with caches 9

A brief history of (Dense) Linear Algebra software ¨ The next step: BLAS-3 (1987-1988) Ø Standard library of 9 operations (mostly) on matrix/matrix pairs Ø “GEMM”: C = α ·A·B + β ·C, C = α ·A·A T + β ·C , B = T -1 ·B Ø Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC Ø Why BLAS 3 ? They do O(n 3 ) ops on O(n 2 ) data Ø So computational intensity (2n 3 )/(4n 2 ) = n/2 – big at last! Ø Good for machines with caches, other mem. hierarchy levels ¨ How much BLAS1/2/3 code so far (all at www.netlib.org/blas) Ø Source: 142 routines, 31K LOC, Testing: 28K LOC Ø Reference (unoptimized) implementation only Ø Ex: 3 nested loops for GEMM 10

Memory Hierarchy ¨ By taking advantage of the principle of locality: Ø Present the user with as much memory as is available in the cheapest technology. Ø Provide access at the speed offered by the fastest technology. Processor Tertiary Storage Secondary (Disk/Tape) Control Storage (Disk) Level Main Remote Cache On-Chip Registers 2 and 3 Memory Distributed Datapath Cluster Cache (DRAM) Memory Memory (SRAM) 10,000,000s Speed (ns): 1s 10s 100s 10,000,000,000s (10s ms) (10s sec) Size (bytes): 100s 100,000 s Ks Ms 10,000,000 s (.1s ms) (10s ms) Gs Ts

Why Higher Level BLAS? ¨ Can only do arithmetic on data at the top of the hierarchy ¨ Higher level BLAS lets us do this Registers L 1 Cache L 2 Cache Local Memory Remote Memory Secondary Memory 12

Level 1, 2 and 3 BLAS ¨ Level 1 BLAS Vector-Vector + * operations ¨ Level 2 BLAS Matrix-Vector * operations ¨ Level 3 BLAS Matrix-Matrix * + operations 13

Level ¡1, ¡2 ¡and ¡3 ¡BLAS ¡ Before ¡(2007) ¡ 3. 4 G H z EM 64T Xeon M KL8. 1 Peak: 6. 8 G f l op/ s gcc - f om i t - f r am e- poi nt er - f unr ol l - al l - l oops - O 3 7 6 5 4 s op/ l G f 3 2 1 0 1 19 37 55 73 91 109 127 145 163 181 199 217 235 253 271 289 307 325 343 361 379 397 415 433 451 469 487 505 O r der daxpy dgem v dgem m

Level ¡1, ¡2 ¡and ¡3 ¡BLAS ¡ Now ¡(2011) ¡ ¡ AMD Opteron 8439 SE Processor (6 cores total @ 2.8Ghz) Using 1 core 11.2 Gflop/s theoretical peak 12 ¡ 10 ¡ 8 ¡ GFLOPS ¡ Level ¡3 ¡BLAS: ¡DGEMM ¡ 6 ¡ Level ¡2 ¡BLAS: ¡DGEMV ¡ Level ¡1 ¡BLAS: ¡DAXPY ¡ 4 ¡ 2 ¡ 0 ¡ 200 ¡ 400 ¡ 600 ¡ 800 ¡ 1000 ¡ 2000 ¡ 3000 ¡ 4000 ¡ 5000 ¡ Matrix ¡Size ¡

02/25/2009 CS267 Lecture 8 16

A brief history of (Dense) Linear Algebra software ¨ LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now) Ø Ex: Obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows – BLAS-1 Ø How do we reorganize GE to use BLAS-3 ? (details later) Ø Contents of LAPACK (summary) Ø Algorithms we can turn into (nearly) 100% BLAS 3 Ø Linear Systems: solve Ax=b for x Ø Least Squares: choose x to minimize ||Ax - b|| 2 Ø Algorithms that are only 50% BLAS 3 (so far) Ø “Eigenproblems”: Find λ and x where Ax = λ x Ø Singular Value Decomposition (SVD): (A T A)x= σ 2 x Ø Generalized problems (eg Ax = λ Bx) Ø Error bounds for everything Ø Lots of variants depending on A’s structure (banded, A=A T , etc) Ø How much code? (Release 3.4, Nov 2011) (www.netlib.org/ lapack) Ø Source: 1674 routines, 490K LOC, Testing: 448K LOC 17

A brief history of (Dense) Linear Algebra software ¨ Is LAPACK parallel? Ø Only if the BLAS are parallel (possible in shared memory) ¨ ScaLAPACK – “Scalable LAPACK” (1995 – now) Ø For distributed memory – uses MPI Ø More complex data structures, algorithms than LAPACK Ø Only (small) subset of LAPACK’s functionality available Ø All at www.netlib.org/scalapack 18

Linear Algebra Libraries for High- Performance Computing: Scientific - PowerPoint PPT Presentation

Linear Algebra Libraries for High- Performance Computing: Scientific Computing with Multicore and Accelerators Presenters Prof. Jack Dongarra (8:30 10:00) University of Tennessee & Oak Ridge National Lab Dr. Jakub Kurzak (10:30

ManyCore ManyCore Computing: ManyCore ManyCore Computing: Computing: Computing: The Impact on

Matrices Basic Linear Algebra Overview Lecture will cover why matrices and linear algebra

Chapter 1 What is Linear Algebra? Chapter 1 What is Linear Algebra? The study of linear

Graphics 2014 Linear Algebra II Linear Maps & Matrices Linear Maps & Matrices CORE

Lecture 14: Dense Linear Algebra David Bindel 18 Oct 2010 Where we are This week: dense

Linear Algebra Linear algebra has become as basic and as applicable as calculus, and

CS 7616 Pattern Recognition Linear, Linear, Linear Aaron Bobick School of Interactive

PV Math Department MCL Vision Credit Options Credit General General/Post- College Honors

Linear algebra explained in four pages Excerpt from the N O BULLSHIT GUIDE TO LINEAR ALGEBRA by

MATRICES AND LINEAR ALGEBRA Linear Algebra Matrix manipulation is the original essence of

Expressive Linear Algebra in Haskell Henning Thielemann 2019-08-21 Expressive Linear Algebra in

Chapter 4: Vectors, Matrices, and Linear Algebra Scott Owen & Greg Corrado Linear Algebra is

Libraries Jonathan Platt Head of Libraries and Heritage 22 nd July 2014 Libraries 1.

Libraries In C++ its possible to create static libraries and shared libraries Static

Linear Algebra 1: Computing canonical forms in exact linear algebra Clment P ERNET ,

Trends in High Performance Trends in High Performance Computing and Using Numerical Computing

An Overview of Guava: Google Core Libraries for Java Kevin Bourrillion Java Core Libraries Team

Algorithms for Large, Sparse Network Alignment Mohsen Bayati , David Gleich, Margot Gerritsen, Amin

A library to manipulate Z-polyhedron in image representation Guillaume Iooss, Sanjay Rajopadhye

Western Library Online hello! WE ARE YOUR LIBRARIANS! Ellen Range Linda VanSistine-Yost

Starfish Efficient Concurrency Support for Computer Vision Applications Robert

Enterprise Community Partners Title Reports / Surveys This training is made possible by the

Making it Click E-Government in Public Libraries E-Government What is it? Everything you always

The National Institutes of Health The NLM Strategic Plan Leadership in Science Policy Institute

Linear Algebra Libraries for High- Performance Computing: Scientific - PowerPoint PPT Presentation

Linear Algebra Libraries for High- Performance Computing: Scientific Computing with Multicore and Accelerators Presenters Prof. Jack Dongarra (8:30 10:00) University of Tennessee & Oak Ridge National Lab Dr. Jakub Kurzak (10:30

ManyCore ManyCore Computing: ManyCore ManyCore Computing: Computing: Computing: The Impact on

Matrices Basic Linear Algebra Overview Lecture will cover why matrices and linear algebra

Chapter 1 What is Linear Algebra? Chapter 1 What is Linear Algebra? The study of linear

Graphics 2014 Linear Algebra II Linear Maps &amp; Matrices Linear Maps &amp; Matrices CORE

Lecture 14: Dense Linear Algebra David Bindel 18 Oct 2010 Where we are This week: dense

Linear Algebra Linear algebra has become as basic and as applicable as calculus, and

CS 7616 Pattern Recognition Linear, Linear, Linear Aaron Bobick School of Interactive

PV Math Department MCL Vision Credit Options Credit General General/Post- College Honors

Linear algebra explained in four pages Excerpt from the N O BULLSHIT GUIDE TO LINEAR ALGEBRA by

MATRICES AND LINEAR ALGEBRA Linear Algebra Matrix manipulation is the original essence of

Expressive Linear Algebra in Haskell Henning Thielemann 2019-08-21 Expressive Linear Algebra in

Chapter 4: Vectors, Matrices, and Linear Algebra Scott Owen &amp; Greg Corrado Linear Algebra is

Libraries Jonathan Platt Head of Libraries and Heritage 22 nd July 2014 Libraries 1.

Libraries In C++ its possible to create static libraries and shared libraries Static

Linear Algebra 1: Computing canonical forms in exact linear algebra Clment P ERNET ,

Trends in High Performance Trends in High Performance Computing and Using Numerical Computing

An Overview of Guava: Google Core Libraries for Java Kevin Bourrillion Java Core Libraries Team

Algorithms for Large, Sparse Network Alignment Mohsen Bayati , David Gleich, Margot Gerritsen, Amin

A library to manipulate Z-polyhedron in image representation Guillaume Iooss, Sanjay Rajopadhye

Western Library Online hello! WE ARE YOUR LIBRARIANS! Ellen Range Linda VanSistine-Yost

Starfish Efficient Concurrency Support for Computer Vision Applications Robert

Enterprise Community Partners Title Reports / Surveys This training is made possible by the

Making it Click E-Government in Public Libraries E-Government What is it? Everything you always

The National Institutes of Health The NLM Strategic Plan Leadership in Science Policy Institute

Graphics 2014 Linear Algebra II Linear Maps & Matrices Linear Maps & Matrices CORE

Chapter 4: Vectors, Matrices, and Linear Algebra Scott Owen & Greg Corrado Linear Algebra is