MAGMA: Matrix Algebra on GPU and Multicore Architectures - PowerPoint PPT Presentation

MAGMA: ¡Matrix ¡Algebra ¡on ¡GPU ¡ and ¡Multicore ¡Architectures ¡ Jack Dongarra University ¡of ¡Tennessee, ¡Knoxville ¡ Oak ¡Ridge ¡National ¡Laboratory ¡

ORNL’s “Titan” Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors SYSTEM SPECIFICATIONS: • Peak performance of 27 PF • 24.5 Pflop/s GPU + 2.6 Pflop/s AMD • 18,688 Compute Nodes each with: • 16-Core AMD Opteron CPU • NVIDIA Tesla “K20x” GPU • 32 + 6 GB memory • 512 Service and I/O nodes 4,352 ft 2 • 200 Cabinets 404 m 2 • 710 TB total system memory • Cray Gemini 3D Torus Interconnect • 9 MW peak power 2

Cray XK7 Compute Node XK7 Compute Node Characteristics AMD Opteron 6274 Interlagos 16 core processor Tesla K20x @ 1311 GF PCIe Gen2 Host Memory 32GB 1600 MHz DDR3 3 T H Tesla K20x Memory 3 T H 6GB GDDR5 Gemini High Speed Interconnect Z ¡ Y ¡ X ¡ Slide courtesy of Cray, Inc. 3

Titan: Cray XK7 System System: 200 Cabinets 18,688 Nodes 27 PF 710 TB Cabinet: 24 Boards 96 Nodes 139 TF Board: 3.6 TB 4 Compute Nodes 5.8 TF 152 GB Compute Node: 1.45 TF 38 GB 4

November 2012: The TOP10 Rmax % of Power MFlops Rank Site Computer Country Cores [Pflops] Peak [MW] /Watt DOE / OS Titan, Cray XK7 (16C) + Nvidia 1 USA 560,640 17.6 66 8.3 2120 Oak Ridge Nat Lab Kepler GPU (14c) + custom DOE / NNSA Sequoia, BlueGene/Q (16c) 2 USA 1,572,864 16.3 81 7.9 2063 L Livermore Nat Lab + custom RIKEN Advanced Inst K computer Fujitsu SPARC64 3 Japan 705,024 10.5 93 12.7 827 for Comp Sci VIIIfx (8c) + custom DOE / OS Mira, BlueGene/Q (16c) 4 USA 786,432 81 3.95 2066 8.16 Argonne Nat Lab + custom Forschungszentrum JuQUEEN, BlueGene/Q (16c) 5 Germany 393,216 4.14 82 1.97 2102 Juelich + custom Leibniz 6 SuperMUC, Intel (8c) + IB Germany 147,456 90* 3.42 848 2.90 Rechenzentrum Texas Advanced Stampede, Dell Intel (8) + Intel 7 USA 204,900 67 3.3 806 2.66 Computing Center Xeon Phi (61) + IB Tianhe-1A, NUDT Nat. SuperComputer Intel (6c) + Nvidia Fermi GPU 8 China 186,368 2.57 55 4.04 636 Center in Tianjin (14c) + custom Fermi, BlueGene/Q (16c) 9 CINECA Italy 163,840 82 .822 2105 1.73 + custom DARPA Trial System, Power7 10 IBM USA 63,360 1.51 78 .358 422 (8C) + custom 5 500 Slovak Academy Sci IBM Power 7 Slovak Rep 3,074 .077 81

Accelerators (62 systems) Intel ¡MIC ¡(7) ¡ 60 ¡ Clearspeed ¡CSX600 ¡(0) ¡ 50 ¡ ATI ¡GPU ¡(3) ¡ IBM ¡PowerXCell ¡8i ¡(2) ¡ 40 ¡ Systems ¡ NVIDIA ¡2070 ¡(7) ¡ NVIDIA ¡2050 ¡(11) ¡ 30 ¡ NVIDIA ¡2090 ¡(30) ¡ 20 ¡ NVIDIA ¡K20 ¡(2) ¡ 32 US 1 Australia 10 ¡ 6 China 1 Brazil 2 Japan 1 Canada 4 Russia 1 Saudi Arabia 0 ¡ 2 France 1 South Korea 2006 ¡ 2007 ¡ 2008 ¡ 2009 ¡ 2010 ¡ 2011 ¡ 2012 ¡ 2 Germany 1 Spain 1 India 1 Switzerland 2 Italy 1 Taiwan 2 Poland 1 UK

A New Generation of DLA Software Software/Algorithms follow hardware evolution in time LINPACK (70’s) Rely on (Vector operations) - Level-1 BLAS operations LAPACK (80’s) Rely on (Blocking, cache - Level-3 BLAS friendly) operations ScaLAPACK (90’s) Rely on (Distributed Memory) - PBLAS Mess Passing PLASMA (00’s) Rely on New Algorithms - a DAG/scheduler (many-core friendly) - block data layout - some extra kernels MAGMA Rely on - hybrid scheduler Hybrid Algorithms �� - hybrid kernels �� (heterogeneity friendly) ��

Methodology overview A ¡methodology ¡to ¡use ¡all ¡available ¡resources: ¡ ¡ ¨ MAGMA uses hybridization methodology based on Ø Representing linear algebra algorithms as collections Hybrid ¡CPU+GPU ¡algorithms ¡ of tasks and data dependencies among them (small ¡tasks ¡for ¡multicores ¡and ¡ ¡ large ¡tasks ¡for ¡GPUs) ¡ Ø Properly scheduling tasks' execution over multicore and GPU hardware components ¨ Successfully applied to fundamental linear algebra algorithms Ø One- and two-sided factorizations and solvers Ø Iterative linear and eigensolvers ¨ Productivity Ø 1) High level; 2) Leveraging prior developments; 3) Exceeding in performance homogeneous solutions

Commodity plus Accelerator Today 32 CUDA Cores/SMX Commodity Accelerator (GPU) Intel Xeon Nvidia M2090 “Fermi” 8 cores 512 “Cuda cores” 3 GHz 1.3 GHz 8*4 ops/cycle 512 ops/cycle 96 Gflop/s (DP) 665 Gflop/s (DP) or 1302 Gflop/s (SP) 6 GB Interconnect 9 PCI-X 16 lane 64 Gb/s (8 GB/s) 1 GW/s

Commodity plus Accelerator Today 192 Cuda cores/SMX Commodity Accelerator (GPU) Intel Xeon Nvidia K20X “Kepler” 8 cores 2688 “Cuda cores” 3 GHz .732 GHz 8*4 ops/cycle 2688*2/3 ops/cycle 96 Gflop/s (DP) 1.31 Tflop/s (DP) or 3.62 Tflop/s (SP) 6 GB Interconnect 10 PCI-X 16 lane 64 Gb/s (8 GB/s) 1 GW/s

MAGMA: LAPACK for GPUs ¨ MAGMA ¡ Ø Matrix ¡algebra ¡for ¡GPU ¡and ¡multicore ¡architecture ¡ Ø To ¡provide ¡LAPACK/ScaLAPACK ¡on ¡hybrid ¡architectures ¡ ¡ Ø http://icl.cs.utk.edu/magma/ ¡ ¨ MAGMA 1.3 Ø For NVIDIA CUDA GPUs on shared memory systems Ø 80+ hybrid algorithms have been developed (total of 320+ routines) Ø One-sided factorizations and linear system solvers Ø Two-sided factorizations and eigenproblem solvers Ø A subset of BLAS and auxiliary routines in CUDA ¨ MAGMA ¡developers ¡& ¡collaborators ¡ Ø UTK, ¡UC ¡Berkeley, ¡UC ¡Denver, ¡INRIA ¡(France), ¡KAUST ¡(Saudi ¡Arabia) ¡ Ø Community ¡effort, ¡similar ¡to ¡LAPACK/ScaLAPACK ¡

MAGMA Functionality ¨ ¡80+ ¡hybrid ¡algorithms ¡ have ¡been ¡developed ¡(total ¡of ¡320+ ¡ routines) ¡ Ø Every ¡algorithm ¡is ¡in ¡4 ¡precisions ¡(s/c/d/z) ¡ Ø There ¡are ¡3 ¡mixed ¡precision ¡algorithms ¡ ¡(zc ¡& ¡ds) ¡ Ø These ¡are ¡hybrid ¡algorithms, ¡expressed ¡in ¡terms ¡of ¡BLAS ¡ ¨ MAGMA ¡BLAS ¡ ¡ Ø A ¡subset ¡of ¡GPU ¡BLAS, ¡optimized ¡for ¡Tesla ¡and ¡Fermi ¡GPUs ¡

MAGMA Software Stack C P U H Y B R I D G P U distr . Hybrid LAPACK/ScaLAPACK & Tile Algorithms / StarPU / DAGuE MAGMA 1.2.1 Hybrid Tile (PLASMA) Algorithms multi PLASMA / Quark StarPU run-time system MAGMA 1.2.1 Hybrid LAPACK and Tile Kernels MAGMA 1.2.1 MAGMA SPARSE MAGMA BLAS single LAPACK BLAS BLAS CUDA / OpenCL / MIC Linux, Windows, Mac OS X | C/C++, Fortran | Matlab, Python

Hybrid Algorithms One-‑Sided ¡Factorizations ¡ (LU, ¡QR, ¡and ¡Cholesky) ¡ ¨ Hybridization Ø Panels (Level 2 BLAS) are factored on CPU using LAPACK Ø Trailing matrix updates (Level 3 BLAS) are done on the GPU using “look-ahead”

A Hybrid Algorithm Example ¨ Left-looking hybrid Cholesky factorization in MAGMA ¨ The difference with LAPACK – the 4 additional lines in red ¨ Line 8 (done on CPU) is overlapped with work on the GPU (from line 6)

Multiple precision support Performance of the LU factorization in various precisions 900 800 CGETRF_GPU 700 SGETRF_GPU ZGETRF_GPU 600 DGETRF_GPU 500 CPU DP peak (134 GFlop/s) 400 300 200 Keeneland GPU M2090 (14 MP @1.3 GHz, peak 583 GFlop/s) 100 CPU Intel Xeon X5660 (2x6 cores @2.8GHz) 0 Matrix size

LU Factorization (single GPU)

Mixed Precision Methods • Mixed precision, use the lowest precision required to achieve a given accuracy outcome § Improves runtime, reduce power consumption, lower data movement § Reformulate to find correction to solution, rather than solution; Δ x rather than x. 18

Idea Goes Something Like This… • Exploit 32 bit floating point as much as possible. § Especially for the bulk of the computation • Correct or update the solution with selective use of 64 bit floating point to provide a refined results • Intuitively: § Compute a 32 bit result, § Calculate a correction to 32 bit result using selected higher precision and, § Perform the update of the 32 bit results with the correction using high precision. 19

MAGMA: Matrix Algebra on GPU and Multicore Architectures - PowerPoint PPT Presentation

MAGMA: Matrix Algebra on GPU and Multicore Architectures Jack Dongarra University of Tennessee, Knoxville Oak Ridge National Laboratory ORNLs Titan Hybrid System: Cray

Introduction to Magma What is Magma? Magma is a computer algebra system for computations in

THEIA GPU Open Source multicore programmable GPU Problem Statement Develop an open source 3D

The Why, Where and How of Multicore Anant Agarwal MIT and Tilera Corp. What is Multicore?

State of Multicore OCaml KC Sivaramakrishnan University of OCaml Labs Cambridge Outline

Multicore Multicore curiculum 1 Motivation Moores Law: the number of transistors double

Sha um Za n Compan ecretary ACS No. 13918 Encl: as above corn/MagrnaFincorpLtd

MAGMA HIGH-TEMPERATURE RECYCLING OF MUNICIPAL WASTE BY WASTE-FREE TECHNOLOGY GERMANY |

The origins of magmas What is the primary magma? Basalt Origin of Basaltic Magma Basalt in

High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures Yusuke

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Hierarchical-matrix Linear Solver on GPU clusters with MAGMA variable-size batched kernel Ichitaro

PV Math Department MCL Vision Credit Options Credit General General/Post- College Honors

Architectures Architectural styles Software architectures Architectures versus middleware

Gov 2000: 10. Multiple Regression in Matrix Form Matthew Blackwell Fall 2016 1 / 64 1. Matrix

The challenges of climate change for energy markets Tenth ACCC Regulatory Conference The

Coca-Cola FEMSA Coca-Cola FEMSA October 2005 Cautionary Statement Cautionary Statement

Evolution Selects For and Against Complexity Larry Yaeger School of Informatics, Indiana

An evolutionary perspective on the philosophy of information Christophe Menant - Independent

Ground deformation in and around Sakurajima volcano: the data of precise leveling surveys and

February, 2014 Important Notice This presentation contains only a brief overview of Greenland

Gallagher Low Head Siphon Public Information Presentation 2020 April 2020 Project: 306-1621

Starting Soon Arts District Project Area WE ARE RECORDING WE HAVE MUTED THIS MEETING. THE