Slide 1: The Challenge of Multicore and Specialized Accelerators for Mathematical Software

Jack Dongarra
Alfredo Buttari, Jakub Kurzak, Julie Langou, Julien Langou, Piotr Luszczek, Stan Tomov
University of Tennessee and Oak Ridge National Laboratory

Slide 2: A Growth Factor of More Than a Trillion in Performance in the Past 65 Years

[Chart: performance growth from one OPS through KiloOPS, MegaOPS, GigaOPS, TeraOPS, and PetaOPS, marking machines such as the 1943 Harvard Mark 1, 1948 Manchester Baby, 1949 EDSAC, 1951 Pilot ACE, 1959 IBM 7094, 1964 CDC 6600, 1976 Cray 1, 1982 Cray X-MP, 1988 Cray Y-MP, 1991 Intel Delta, 1996 Cray T3E, 1997 ASCI Red, 2001 Earth Simulator, 2003 Cray X1, and 2005 IBM BG/L.]

Scalar to superscalar to vector to SMP to DMP to massively parallel to many-core designs.


Slide 3: Future Large Systems, Say in 5 Years

♦ 128 cores per socket (may be heterogeneous)
♦ 32 sockets per node
♦ 128 nodes per system
♦ System = 128 x 32 x 128 = 524,288 cores!
♦ And, by the way, it's 4-8 threads of execution per core
♦ That's about 4 million threads to manage

Slide 4: Major Changes to Math Software

♦ Scalar: Fortran code in EISPACK
♦ Vector: Level 1 BLAS use in LINPACK
♦ SMP: Level 3 BLAS use in LAPACK
♦ Distributed Memory: message passing with MPI in ScaLAPACK
♦ Many-Core: event-driven multithreading in PLASMA (Parallel Linear Algebra Software for Multicore Architectures)


Slide 5: Time to Rethink Software Again

♦ Must rethink the design of our software
  Another disruptive technology, similar to what happened with cluster computing and message passing
  Rethink and rewrite the applications, algorithms, and software
♦ Numerical libraries will change
  For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this

Slide 6: Parallelism in LAPACK / ScaLAPACK

Two well-known open source software efforts for dense matrix problems:
  Shared memory: LAPACK, built on ATLAS / specialized BLAS, parallelized with threads
  Distributed memory: ScaLAPACK, built on the PBLAS and BLACS, parallelized with MPI


Slide 7: Steps in the LAPACK LU

  DGETF2  (factor a panel)       LAPACK
  DLASWP  (backward swap)        LAPACK
  DLASWP  (forward swap)         LAPACK
  DTRSM   (triangular solve)     BLAS
  DGEMM   (matrix multiply)      BLAS   <- most of the work is done here

A sketch of these steps appears below.
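The following is a minimal MATLAB sketch of a blocked, right-looking LU, written here only to show where each of the routines above appears in the algorithm; the function name blocked_lu, the block size nb, and the eager row swaps are illustrative choices, not the LAPACK implementation.

    function [A, piv] = blocked_lu(A, nb)
    % Blocked, right-looking LU with partial pivoting, mirroring the
    % DGETF2 / DLASWP / DTRSM / DGEMM steps named on this slide.
    n = size(A, 1);
    piv = (1:n)';
    for j = 1:nb:n
        jb = min(nb, n - j + 1);
        % DGETF2-like step: unblocked LU of the panel A(j:n, j:j+jb-1)
        for k = j:j+jb-1
            [~, p] = max(abs(A(k:n, k)));           % pivot search in column k
            p = p + k - 1;
            if p ~= k                               % DLASWP-like step: swap rows
                A([k p], :) = A([p k], :);          % (applied eagerly to all columns)
                piv([k p]) = piv([p k]);
            end
            A(k+1:n, k) = A(k+1:n, k) / A(k, k);    % scale the column below the pivot
            A(k+1:n, k+1:j+jb-1) = A(k+1:n, k+1:j+jb-1) ...
                - A(k+1:n, k) * A(k, k+1:j+jb-1);   % update within the panel
        end
        % DTRSM-like step: triangular solve for the block row right of the panel
        L11 = tril(A(j:j+jb-1, j:j+jb-1), -1) + eye(jb);
        A(j:j+jb-1, j+jb:n) = L11 \ A(j:j+jb-1, j+jb:n);
        % DGEMM-like step: rank-jb update of the trailing submatrix
        % (this is where most of the work is done)
        A(j+jb:n, j+jb:n) = A(j+jb:n, j+jb:n) ...
            - A(j+jb:n, j:j+jb-1) * A(j:j+jb-1, j+jb:n);
    end
    end

Calling [F, p] = blocked_lu(A0, 64) on a random A0 and checking that (tril(F,-1) + eye(size(A0))) * triu(F) matches A0(p,:) is a quick sanity check; nearly all of the flops land in the DGEMM-like trailing update.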

Slide 8: LU Timing Profile (4 processor system)

[Timing chart: 1D decomposition on an SGI Origin, showing the time spent in each component (DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM) across 4 threads, with no lookahead.]

Bulk synchronous phases


Slide 9: Adaptive Lookahead - Dynamic

Event-driven multithreading: reorganizing the algorithms to use this approach.

Slide 10: Fork-Join vs. Dynamic Execution

[Execution trace over time: fork-join (parallel BLAS) schedule.]

Experiments on Intel's quad-core Clovertown with 2 sockets and 8 threads.


Slide 11: Fork-Join vs. Dynamic Execution

[Execution traces over time: fork-join (parallel BLAS) vs. DAG-based (dynamic scheduling), showing the time saved by dynamic execution.]

Experiments on Intel's quad-core Clovertown with 2 sockets and 8 threads.

Slide 12: Fork-Join vs. Dynamic Execution

[Three plots of Gflop/s vs. matrix size (2000-8000) comparing fork-join and dynamic scheduling for LU, Cholesky, and QR factorization.]

♦ Intel Clovertown
  Clock: 2.66 GHz
  2 sockets, quad-core (8 cores total)
  85 Gflop/s theoretical peak

Breaking the "hour-glass" pattern of parallel processing.

Slide 13: Intel's Clovertown Quad Core

Three implementations of LU factorization; quad core with 2 sockets per board, 8 threads (8-core experiments).

[Plot: Mflop/s vs. problem size (1000-15000) for:]
  1. LAPACK (BLAS fork-join parallelism)
  2. ScaLAPACK (message passing using memory copy)
  3. DAG-based (dynamic scheduling)

Slide 14: What about IBM's Cell Processor?

♦ PowerPC at 3.2 GHz
♦ 8 SPEs
  204.8 Gflop/s peak! The catch is that this is for 32-bit floating point (single precision, SP).
  64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
  Divide the SP peak by 14: a factor of 2 because of DP and 7 because of latency issues.
♦ The SPEs are fully IEEE-754 compliant in double precision. In single precision they only implement round-towards-zero. The PowerPC part is fully IEEE compliant.

(A back-of-the-envelope check of these numbers appears below.)
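A quick MATLAB-style check of the quoted peaks, assuming each SPE issues a 4-wide single-precision fused multiply-add (8 flops per cycle); that assumption is mine, not stated on the slide.

    % hypothetical back-of-the-envelope check of the quoted Cell peaks
    spes = 8; clock_ghz = 3.2; sp_flops_per_cycle = 8;
    sp_peak = spes * clock_ghz * sp_flops_per_cycle   % 204.8 Gflop/s, as quoted
    dp_peak = sp_peak / 14                            % about 14.6 Gflop/s, as quoted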


Slide 15: On the Way to Understanding How to Use the Cell, Something Else Happened…

♦ Realized we have a similar situation on our commodity processors.
  That is, SP is 2x as fast as DP on many systems.
♦ Standard Intel Pentium and AMD Opteron have SSE2
  2 flops/cycle DP, 4 flops/cycle SP
♦ IBM PowerPC has AltiVec
  8 flops/cycle SP, 4 flops/cycle DP (no DP on AltiVec)

Architecture             n     SGEMM/DGEMM speedup   n     SGEMV/DGEMV speedup
AMD Opteron 246          3000  2.00                  5000  1.70
Sun UltraSPARC-IIe       3000  1.64                  5000  1.66
Intel PIII Coppermine    3000  2.03                  5000  2.09
PowerPC 970              3000  2.04                  5000  1.44
Intel Woodcrest          3000  1.81                  5000  2.18
Intel XEON               3000  2.04                  5000  1.82
Intel Centrino Duo       3000  2.71                  5000  2.21

Two things going on: SP has a higher execution rate, and there is less data to move. The sketch below shows one way to measure such a ratio.
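A rough way to reproduce the SGEMM/DGEMM column on your own machine in MATLAB; this assumes MATLAB dispatches single- and double-precision matrix multiply to the optimized BLAS underneath, n = 3000 matches the table, and the measured ratio will vary with the machine and BLAS.

    n  = 3000;
    Ad = randn(n); Bd = randn(n);
    As = single(Ad); Bs = single(Bd);
    tic; Cd = Ad * Bd; t_dgemm = toc;   % double-precision matrix multiply
    tic; Cs = As * Bs; t_sgemm = toc;   % single-precision matrix multiply
    fprintf('n = %d: DGEMM %.2fs, SGEMM %.2fs, speedup %.2fx\n', ...
            n, t_dgemm, t_sgemm, t_dgemm / t_sgemm);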

Slide 16: Idea Something Like This…

♦ Exploit 32-bit floating point as much as possible
  Especially for the bulk of the computation
♦ Correct or update the solution with selective use of 64-bit floating point to provide a refined result
♦ Intuitively:
  Compute a 32-bit result,
  Calculate a correction to the 32-bit result using selected higher precision, and
  Perform the update of the 32-bit result with the correction using high precision.


Slide 17: Mixed-Precision Iterative Refinement

♦ Iterative refinement for dense systems, Ax = b, can work this way:

    L U = lu(A)                     SINGLE   O(n^3)
    x = L\(U\b)                     SINGLE   O(n^2)
    r = b - Ax                      DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = L\(U\r)                 SINGLE   O(n^2)
        x = x + z                   DOUBLE   O(n)
        r = b - Ax                  DOUBLE   O(n^2)
    END

  Wilkinson, Moler, Stewart, and Higham provide error bounds for SP floating-point results when using DP floating point. It can be shown that with this approach we can compute the solution to 64-bit floating-point accuracy.
  Requires extra storage; the total is 1.5 times normal.
  The O(n^3) work is done in lower precision; the O(n^2) work is done in higher precision.
  Problems arise if the matrix is ill-conditioned in SP (condition number of order 10^8).

Slide 18: In Matlab on My Laptop!

♦ Matlab has the ability to perform 32-bit floating point for some computations
  Matlab uses LAPACK and MKL BLAS underneath.

    res1 = 1e-10*norm(b);            % stopping tolerance (a choice; not given on the slide)
    sa = single(a); sb = single(b);
    [sl, su, sp] = lu(sa);           % most of the work: O(n^3), in single
    sx = su\(sl\(sp*sb));
    x = double(sx);
    r = b - a*x;                     % residual in double: O(n^2)
    i = 0;
    while (norm(r) > res1)
        i = i + 1;
        sr = single(r);
        sx1 = su\(sl\(sp*sr));       % correction, solved in single: O(n^2)
        x1 = double(sx1);
        x = x + x1;                  % update in double
        r = b - a*x;                 % new residual in double: O(n^2)
        if (i == 30), break; end
    end

♦ Bulk of the work, O(n^3), is in "single" precision
♦ Refinement, O(n^2), is in "double" precision
  Computing the correction to the SP result in DP and adding it to the SP result in DP.


Slide 19: Another Look at Iterative Refinement

[Plot: Gflop/s vs. problem size (500-3000) in Matlab, comparing 32-bit with iterative refinement against 64-bit computation for Ax = b; only the double-precision A\b curve is shown here.]

On a Pentium, using SSE2, single precision can perform 4 floating-point operations per cycle and double precision 2 floating-point operations per cycle. In addition, there is reduced memory traffic for SP data.

Intel Pentium M (T2500, 2 GHz), Ax = b: 1.4 Gflop/s for the double-precision A\b. Not bad for Matlab!

Slide 20: Another Look at Iterative Refinement

[Plot: Gflop/s vs. problem size (500-3000) in Matlab, comparing A\b in double precision against A\b in single precision with iterative refinement (same accuracy as DP) for Ax = b.]

On a Pentium, using SSE2, single precision can perform 4 floating-point operations per cycle and double precision 2 floating-point operations per cycle. In addition, there is reduced memory traffic for SP data.

Intel Pentium M (T2500, 2 GHz), Ax = b: 3 Gflop/s, a 2x speedup, in Matlab on my laptop!


Slide 21: Speedups for Ax = b (Ratio of Times)

Architecture (BLAS)                    n      DGEMM/SGEMM   DP Solve/SP Solve   DP Solve/Iter Ref   # iter
Intel Pentium III Coppermine (Goto)    3500   2.10          2.24                1.92                4
Intel Pentium IV Prescott (Goto)       4000   2.00          1.86                1.57                5
AMD Opteron (Goto)                     4000   1.98          1.93                1.53                5
Sun UltraSPARC IIe (Sunperf)           3000   1.45          1.79                1.58                4
IBM PowerPC G5, 2.7 GHz (VecLib)       5000   2.29          2.05                1.24                5
Compaq Alpha EV6 (CXML)                3000   0.99          1.08                1.01                4
IBM SP Power3 (ESSL)                   3000   1.03          1.13                1.00                3
SGI Octane (ATLAS)                     2000   1.08          1.13                0.91                4
Cray X1 (libsci)                       4000   1.68          1.57                1.32                7

Architecture (BLAS-MPI)                # procs   n       DP Solve/SP Solve   DP Solve/Iter Ref   # iter
AMD Opteron (Goto, OpenMPI MX)         32        22627   1.85                1.79                6
AMD Opteron (Goto, OpenMPI MX)         64        32000   1.90                1.83                6

Recent addition to LAPACK 3.1 as DSGESV. A function-level sketch along these lines appears below.
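The sketch below shows, as a MATLAB function, roughly what a DSGESV-style driver does. It is not the LAPACK interface, and the tolerance, iteration limit, and fallback convention are simplified stand-ins (LAPACK's DSGESV likewise reports a negative iteration count when it falls back to a full double-precision solve).

    function [x, iter] = dsgesv_like(A, b)
    % Solve A*x = b with LU in single precision plus refinement in double.
    maxit = 30;
    tol = sqrt(size(A, 1)) * eps('double');         % simplified convergence test
    [L, U, P] = lu(single(A));                      % O(n^3) factorization in single
    x = double(U \ (L \ (P * single(b))));          % initial solution
    for iter = 1:maxit
        r = b - A*x;                                % residual in double
        if norm(r, inf) <= tol * norm(A, inf) * norm(x, inf)
            return;                                 % converged to double-precision level
        end
        z = double(U \ (L \ (P * single(r))));      % correction from the single LU
        x = x + z;                                  % update in double
    end
    x = A \ b;                                      % fallback: full double-precision solve
    iter = -1;
    end

On a well-conditioned random matrix, [x, it] = dsgesv_like(A, b) typically converges in a handful of iterations, consistent with the single-digit iteration counts in the table above.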

Slide 22: Quadruple Precision

♦ Variable precision factorization (with, say, less than 32-bit precision) plus 64-bit refinement produces 64-bit accuracy

n      Quad precision Ax = b, time (s)   Iter. refine. DP to QP, time (s)   Speedup
100    0.29                              0.03                               9.5
200    2.27                              0.10                               20.9
300    7.61                              0.24                               30.5
400    17.8                              0.44                               40.4
500    34.7                              0.69                               49.7
600    60.1                              1.01                               59.0
700    94.9                              1.38                               68.7
800    141                               1.83                               77.3
900    201                               2.33                               86.3
1000   276                               2.92                               94.8

Intel Xeon 3.2 GHz, reference implementation of the quad-precision BLAS.
Accuracy: 10^-32. No more than 3 steps of iterative refinement are needed.


Slide 23: IBM Cell 3.2 GHz, Ax = b

[Plot: Gflop/s vs. matrix size (500-4500) showing SP peak (204 Gflop/s), SP Ax=b (IBM), DP peak (15 Gflop/s), and DP Ax=b (IBM); annotated with times of 0.30 s (SP) and 3.9 s (DP).]

8 SGEMMs (embarrassingly parallel)

Slide 24: IBM Cell 3.2 GHz, Ax = b

[Plot: Gflop/s vs. matrix size (500-4500) showing SP peak (204 Gflop/s), SP Ax=b (IBM), DSGESV, DP peak (15 Gflop/s), and DP Ax=b (IBM); annotated with times of 0.30 s (SP), 0.47 s (DSGESV), and 3.9 s (DP), an 8.3x speedup over the DP solve.]

8 SGEMMs (embarrassingly parallel)


Slide 25: Sony PlayStation 3 Cluster PS3-T

♦ From IBM or Mercury
  2 Cell chips, each with 8 SPEs
  512 MB per Cell
  ~$17K
  Some SW
♦ From WAL*MART: PS3
  1 Cell chip with 6 SPEs
  256 MB per PS3
  $600
  Download SW; dual boot

Slide 26: PlayStation 3 LU Codes

[Plot: Gflop/s vs. matrix size (500-2500) showing SP peak (153.6 Gflop/s), SP Ax=b (IBM), and DP peak (10.9 Gflop/s).]

6 SGEMMs (embarrassingly parallel)


Slide 27: PlayStation 3 LU Codes

[Plot: Gflop/s vs. matrix size (500-2500) showing SP peak (153.6 Gflop/s), SP Ax=b (IBM), DSGESV, and DP peak (10.9 Gflop/s).]

6 SGEMMs (embarrassingly parallel)

Slide 28: Refinement Technique Using Single/Double Precision

♦ Dense linear systems
  LU (dense; in the current release of LAPACK)
  Cholesky
  QR factorization
♦ Sparse direct methods
  When the kernel is matrix multiply: multifrontal approach (MUMPS)
♦ Iterative linear systems
  Relaxed GMRES
  Inner/outer iteration scheme

See the webpage for a tech report that discusses this.


Slide 29: Sparse Direct Solver and Iterative Refinement

[Bar chart: speedup over DP (0.2-2.0) of single precision and of iterative refinement, on an Opteron with the Intel compiler, for matrices from Tim Davis's collection (n = 100K-3M): G64, Si10H16, airfoil_2d, bcsstk39, blockqp1, c-71, cavity26, dawson5, epb3, finan512, heart1, kivap004, kivap006, mult_dcop_01, nasasrb, nemeth26, qa8fk, rma10, torso2, venkat01, wathen120.]

MUMPS package, based on a multifrontal approach that generates small dense matrix multiplies.

Slide 30: Sparse Iterative Methods (PCG)

♦ Outer/inner iteration
♦ Outer iteration in 64-bit floating point and a fixed number of inner iterations in 32-bit floating point
  Inner iterations: in 32-bit floating point
  Outer iterations: in 64-bit floating point

A minimal sketch of the inner/outer idea appears below.
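A toy MATLAB sketch of this inner/outer scheme. A dense symmetric positive definite test matrix is used because Matlab's sparse arrays are double-only, and the test matrix, tolerance, and inner iteration count are illustrative choices rather than values from the talk.

    n = 1000;
    A = randn(n); A = A'*A + n*eye(n);     % well-conditioned SPD test matrix
    b = randn(n, 1);
    As = single(A);
    x = zeros(n, 1);                       % outer iterate, kept in double
    tol = 1e-10; inner_its = 20;
    for outer = 1:30
        r = b - A*x;                       % outer residual in 64-bit
        if norm(r) < tol*norm(b), break; end
        % inner: a fixed number of plain CG steps on A*z = r in 32-bit
        rs = single(r); z = zeros(n, 1, 'single');
        p = rs; rho = rs'*rs;
        for k = 1:inner_its
            q = As*p;
            alpha = rho / (p'*q);
            z = z + alpha*p;
            rs = rs - alpha*q;
            rho_new = rs'*rs;
            p = rs + (rho_new/rho)*p;
            rho = rho_new;
        end
        x = x + double(z);                 % correction applied in 64-bit
    end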


Slide 31: Mixed Precision Computations for Sparse Inner/Outer-type Iterative Solvers

[Bar chart: time speedups of mixed-precision inner SP / outer DP (SP/DP) iterative methods vs. the DP/DP reference methods (more is better), for CG, PCG, GMRES, and PGMRES with diagonal preconditioners, on matrices of size 11,142 / 25,980 / 79,275 / 230,793 / 602,091 with condition numbers 6,021 / 18,000 / 39,000 / 120,000 / 240,000.]

Machine: Intel Woodcrest (3 GHz, 1333 MHz bus).
Data movement is the main source of improvement.

Slide 32: Intriguing Potential

♦ Exploit lower precision as much as possible
  Payoff in performance: faster floating point, less data to move
♦ Automatically switch between SP and DP to match the desired accuracy
  Compute the solution in SP and then a correction to the solution in DP
♦ Potential for GPUs, FPGAs, and special-purpose processors
  What about 16-bit floating point? 128-bit floating point?
♦ Applies to linear systems and to eigenvalue and optimization problems where Newton's method is used.


Slide 33: Happy Birthday Gene!