SLIDE 1

Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs

Simplice DONFACK∗ Stanimire TOMOV† Jack DONGARRA‡ presenter: Piotr LUSZCZEK§

∗formerly: University of Tennessee, currently: CSCS Lugano, Switzerland †University of Tennessee ‡University of Tennessee, Oak Ridge National Laboratory, and University of Manchester §University of Tennessee


slide-2
SLIDE 2

HPC Hardware Zoo

  • Intel

– x86 tick-tock: ➥ Nehalem ➥ Westmere ➥ Sandy Bridge ➥ Ivy Bridge ➥ Haswell ➥ Broadwell – MIC/Phi core-counts: Knights Corner: 57, 62, . . .

  • AMD

– x86 architectures: ➥ Bulldozer ➥ Piledriver – x86 models: ➥ Barcelona ➥ Shanghai ➥ Istanbul ➥ Magny-Cours ➥ Warsaw ➥ Seattle

  • NVIDIA: ➥ Tesla ➥ Fermi ➥ Kepler
  • Per-core flop/s: 10, 20, 40
  • Per-socket flop/s: 100 – 600
  • Per-accelerator flop/s: 500 – 1500

Balance between CPU and accelerator: 2x – 10x

AsHES 2014 May 19, 2014 2/19

SLIDE 3

Motivation for Communication Avoiding Algorithms

  • Running time is a function of:

– Time for arithmetic operations = Total(flops) × time/flop. – Time for moving data = Total(messages) × latency + Total(bytes) / bandwidth.

  • Exponentially growing gaps between communication and computation.

– Annual improvement predictions [FOSC’04]:

              time/flop: 59%
              Network:   bandwidth 26%, latency 15%
              DRAM:      bandwidth 23%, latency 5%
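The running-time model on this slide can be written down directly; a minimal sketch, where all machine parameters are illustrative placeholders rather than measured values:

```python
def runtime(flops, messages, bytes_moved,
            time_per_flop, latency, inv_bandwidth):
    """Machine model from the slide:
    time = Total(flops) * time/flop
         + Total(messages) * latency
         + Total(bytes) / bandwidth
    """
    return (flops * time_per_flop
            + messages * latency
            + bytes_moved * inv_bandwidth)

# Illustrative numbers: 1 Gflop of work, 1000 messages, 100 MB moved,
# on a machine with 1 ns/flop, 1 us latency, 1 GB/s bandwidth.
t = runtime(1e9, 1e3, 1e8, 1e-9, 1e-6, 1e-9)
```

With these numbers the data-movement terms are already a tenth of the total, and the widening gaps in the table make them dominate over time.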

SLIDE 4

Communication avoiding algorithms:

  • aim at reducing communication by doing some redundant computations.

– Work more, talk less.

  • are becoming a part of the numerical algorithm design.

Communication avoiding LU (CALU):

  • removes the bottleneck in classic LU by performing the panel as a reduction operation.

– Tournament pivoting replaces partial pivoting.

  • factorizes the panel twice.

SLIDE 5

CALU [Grigori, Demmel, Xiang ’08]

The main difference from the classic approach lies in the panel factorization, which is performed in two steps.

  • A preprocessing step identifies good pivot rows at low communication cost.

  • The pivot rows are permuted into the first positions of the panel, and LU without pivoting is performed on the panel.

  • The update of the trailing matrix is performed as in classic LU (Gaussian Elimination with Partial Pivoting, GEPP).

  • In classic approaches such as ScaLAPACK, the panel is factorized column by column, while CALU factorizes it block by block using a reduction tree.

  • The algorithm was first introduced for QR (CAQR). The obvious generalization of CAQR to CALU was not stable in practice, so CALU uses a new pivoting strategy.

  • CALU is stable in practice (and so is classic LU).
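The reduction-tree idea can be sketched in a few lines. This is a toy illustration, not the CALU kernel: the function names, the binary pairwise merge shape, and the use of plain GEPP at every tree node are assumptions made for clarity.

```python
import numpy as np

def gepp_pivot_rows(block, b):
    """Indices (into `block`) of the first b pivot rows that Gaussian
    elimination with partial pivoting would select on this block."""
    A = block.astype(float).copy()
    perm = np.arange(A.shape[0])
    for k in range(b):
        p = k + np.argmax(np.abs(A[k:, k]))      # largest entry in column k
        A[[k, p]] = A[[p, k]]
        perm[[k, p]] = perm[[p, k]]
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
    return perm[:b]

def tournament_pivot_rows(panel, b):
    """One tournament on a panel of width b: each leaf block proposes its b
    locally best rows, then candidate sets are merged pairwise by re-running
    GEPP on their union until b winners remain."""
    m = panel.shape[0]
    blocks = [np.arange(i, min(i + 2 * b, m)) for i in range(0, m, 2 * b)]
    sets = [rows[gepp_pivot_rows(panel[rows], min(b, len(rows)))]
            for rows in blocks]
    while len(sets) > 1:            # reduction tree over candidate sets
        merged = []
        for i in range(0, len(sets), 2):
            rows = np.concatenate(sets[i:i + 2])
            merged.append(rows[gepp_pivot_rows(panel[rows],
                                               min(b, len(rows)))])
        sets = merged
    return sets[0]
```

Each tree node touches only 2b candidate rows, which is what keeps the communication cost of pivot selection low compared with a column-by-column search over the whole panel.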

SLIDE 6

CALU’s Tournament Pivoting

SLIDE 7

Communication Avoiding Algorithms: Lower Bounds

  • General lower bounds for all direct linear algebra:

– Total(bytes moved) = Ω(Total(flops) / √M) = Ω(n² / √P)
– Total(messages) = Ω(Total(flops) / M^(3/2)) [Ballard, Demmel, Holtz, Schwartz ’11]

  • Performance model of CALU and PDGETRF with optimal layout for a general matrix, with M = O(n²/P):

                   PDGETRF                    CALU                            Lower bounds
Total(messages)    n log P + (3/2)√P log P    3√P log³ P                      Ω(√P)
Total(words)       (n²/√P) log P              (n²/√P) log P                   Ω(n²/√P)
Total(flops)       (2/3)(n³/P)                (2/3)(n³/P) + O(n³/(P log² P))  (2/3)(n³/P)

SLIDE 8

MAGMA’s Approach to LU Factorization

  • MAGMA = Matrix Algebra on GPU and Multicore Architectures
  • Hybrid LU factorization in MAGMA

– Panels are factorized on the CPUs.
– Updates of the trailing submatrices are performed on the GPUs.
Example of execution of magma dgetrf() on a square matrix in 4 steps (figure: matrix/data view and DAG view).

  • Efficient updates and optimal use of the GPUs.
  • Load imbalance between CPUs and GPUs.
  • Poor multicore scalability.
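The CPU/GPU split follows the structure of right-looking blocked LU. A minimal sketch without pivoting, run entirely on the host; the comments mark which step MAGMA maps to the CPUs and which to the GPU (this is an assumption-laden illustration of the structure, not MAGMA's actual code):

```python
import numpy as np

def blocked_lu(A, B):
    """Right-looking blocked LU without pivoting, panel width B.
    Returns L and U packed into one matrix (unit diagonal of L implied)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, B):
        e = min(k + B, n)
        # "CPU" step: unblocked factorization of the panel A[k:, k:e]
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        # "GPU" step: forward substitution for the row block U12 ...
        for j in range(k, e):
            A[j+1:e, e:] -= np.outer(A[j+1:e, j], A[j, e:])
        # ... and the large GEMM update of the trailing submatrix
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```

The GEMM on the last line is the bulk of the flops, which is why it maps well to the GPU, while the sequential, latency-bound panel loop is left to the CPUs.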

SLIDE 9

CALU for MAGMA

First goal

  • Adapt and evaluate CALU as panel factorization in MAGMA.

Approach

  • Replace standard panel factorization in MAGMA with CALU.
  • Increase the panel block size B to improve the load balance.
  • Introduce two (algorithmic) block sizes:

– panel block size B, and – internal block size ib for CALU.

SLIDE 10

MAGMA approach with CALU as panel: Initial results

First performance results on AMD Opteron 6172

  • 4 sockets
  • 12 cores each @2.1 GHz
  • Peak performance CPU: 403.2 Gflops/s
  • NVIDIA Fermi GPU: 504 Gflops/s
  • Total: 907.2 Gflops/s.

A fast panel factorization technique alone is not enough.

SLIDE 11

Balanced Approach to Accelerated CALU

  • The matrix is partitioned into two parts for the CPUs and the GPU.
  • Each factorized panel is asynchronously sent to the GPU.
  • A block column is dynamically sent to the CPUs at runtime to balance the work.

Figure: (a) example of execution; (b) corresponding DAG.

SLIDE 12

Performance of Asynchronous CALU with Fixed Parameters

Variants of CALU on AMD Opteron 6172 using 12 cores and 1 GPU. Results on: ✧ AMD Opteron 6172 ✧ 4 sockets ✧ 12 cores each @2.1 GHz ✧ Peak performance CPU: 403.2 Gflops/s ✧ NVIDIA Fermi GPU: 504 Gflops/s ✧ Total: 907.2 Gflops/s. How to determine the initial amount of work for the CPU part?

SLIDE 13

Performance Model Parameters

Global parameters:

  • d — the number of block columns in the CPU part.
  • P — the number of processors in the CPU part.
  • g1 and g2 — the peak performance of one CPU and one GPU, respectively.

At each step K of the factorization, temporal parameters:

  • NK — the number of block columns of the remaining matrix.
  • WCPUs and WGPU — the amounts of work required to compute the CPU part and the GPU part, respectively.
  • TCPUs and TGPU — the times required to complete WCPUs and WGPU, respectively.

SLIDE 14

Performance Model’s Details

Initial matrix decomposition:

WCPUs = W1panel + (d − 1) · W1update   and   TCPUs = WCPUs / (P · g1)

WGPU = (NK − d) · W1update   and   TGPU = WGPU / g2

By solving TCPUs = TGPU, we obtain:

d / NK = P · g1 / (P · g1 + g2)

d / NK is the fraction of the matrix to assign to the CPUs.
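The closed form d/NK = P·g1/(P·g1 + g2) is trivial to evaluate. A sketch, assuming g1 is the per-core peak (the 8.4 Gflops/s figure below is derived from the 48-core, 403.2 Gflops/s AMD Opteron 6172 configuration on the earlier slides):

```python
def cpu_fraction(P, g1, g2):
    """d/NK: share of block columns assigned to the CPUs so that the
    CPU part and the GPU part finish at the same time (TCPUs == TGPU)."""
    return P * g1 / (P * g1 + g2)

# 48 cores at 8.4 Gflops/s each (403.2 total) vs. a 504 Gflops/s Fermi GPU:
frac = cpu_fraction(48, 8.4, 504)   # the CPUs get ~44% of the block columns
```

Note that the split depends only on the ratio of aggregate peaks, so the same formula adapts automatically to any of the CPU/GPU pairings tested on the next slides.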

SLIDE 15

Performance Model’s Prediction

  • AMD Opteron 6172: 4x12 cores @2.1 GHz; peak performance CPU: 403.2 Gflops/s, GPU: 504 Gflops/s, total: 907.2 Gflops/s.
  • AMD Opteron 6180: 4x12 cores @2.5 GHz; peak performance CPU: 480.0 Gflops/s, GPU: 504 Gflops/s, total: 984.0 Gflops/s.
  • Intel Xeon E5-2670: 2x8 cores @2.6 GHz; peak performance CPU: 332.8 Gflops/s, GPU: 665 Gflops/s, total: 997.8 Gflops/s.

SLIDE 16

Scalability Experiments

  • AMD Opteron 6172: 4x12 cores @2.1 GHz; peak performance CPU: 403.2 Gflops/s, GPU: 504 Gflops/s, total: 907.2 Gflops/s.

SLIDE 17

Performance of Asynchronous CALU with Estimated Parameters

Performance of CALU for square matrices.

  • AMD Opteron 6180: 4x12 cores @2.5 GHz; peak performance CPU: 480.0 Gflops/s, GPU: 504 Gflops/s, total: 984.0 Gflops/s.
  • Intel Xeon E5-2670: 2x8 cores @2.6 GHz; peak performance CPU: 332.8 Gflops/s, GPU: 665 Gflops/s, total: 997.8 Gflops/s.

SLIDE 18

Scalability of Asynchronous CALU for Tall-and-Skinny Matrices

Performance and scalability using 48 cores. Results on: ✧ AMD Opteron 6172 ✧ 4 sockets ✧ 12 cores each @2.1 GHz ✧ Peak performance CPU: 403.2 Gflops/s ✧ NVIDIA Fermi GPU: 504 Gflops/s ✧ Total: 907.2 Gflops/s.

SLIDE 19

Summary, Conclusions, and Future Work

Contributions:

  • Accelerated CALU LU factorization for a wide range of CPU-GPU hardware combinations.

  • Efficient and scalable implementation for tens of CPU cores.
  • Simple model that makes the algorithm self-adapting in practice.

Possible extensions:

  • Integrate dynamic load-balancing using runtime schedulers such as QUARK.
  • Extend the approach to other algorithms:

– Recursive parallel panel LU, RRLU, QR, CAQR.
– Two-sided factorizations: symmetric eigenvalue problems, SVD reduction.

∗ Please attend my talk on Friday.

– Support for multiple GPUs.
– Support for heterogeneous accelerator configurations.

∗ Please attend my talk on Tuesday.
