Fireiron - A Scheduling Language for GPUs. Vinod Grover | Dec 5, 2019 - PowerPoint PPT Presentation



SLIDE 1

Vinod Grover | Dec 5, 2019 Fireiron - A Scheduling Language for GPUs.

SLIDE 2

Acknowledgments

Joint work with:

  • Bastian Hagedorn
  • Sam Elliott
  • Henrik Barthels
  • Ras Bodik

And contributions from many others at NVIDIA.

SLIDE 3

OVERVIEW

High Performance DSL for linear algebra on GPUs


A hierarchical scheduling language based on Halide and TVM

  • designed to express GPU optimizations for maximum performance

Can directly represent elements of the

  • storage hierarchy
    ○ registers, fragments, shared memory
  • compute hierarchy
    ○ threads, warps, blocks, kernels

Can reason about Tensor Core and machine-level operations.

Suitable for auto-scheduling and auto-tuning.

SLIDE 4

DECOMPOSING MATMUL

Exploiting Hierarchical Structure of GPU Kernels

Matrix Multiplication Kernel (diagram: nested boxes, each labeled by the problem it implements)

Hierarchical Structure: the original problem is decomposed into “smaller” instances of the same type of problem.
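The decomposition idea can be sketched outside Fireiron: each output tile of C is itself a complete, smaller matrix multiplication. The Python below is an illustrative model only (not Fireiron code; `matmul_by_tiles` and its tile sizes are assumptions made for this example):

```python
# Illustrative Python model (not Fireiron code) of the hierarchical
# decomposition: each output tile of C is itself a complete, "smaller"
# matrix multiplication of the same type as the original problem.

def matmul(A, B):
    """Reference dense matrix multiply on lists of lists."""
    K, N = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(len(A))]

def matmul_by_tiles(A, B, tm, tn):
    """Compute C tile by tile; each tile is a smaller MatMul instance."""
    M, N = len(A), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tm):
        for j0 in range(0, N, tn):
            sub_A = A[i0:i0 + tm]                   # tm x K slice of A
            sub_B = [row[j0:j0 + tn] for row in B]  # K x tn slice of B
            tile = matmul(sub_A, sub_B)             # the "smaller" problem
            for di, row in enumerate(tile):
                C[i0 + di][j0:j0 + tn] = row
    return C
```

On a GPU each such tile can be handed to an independent unit of the compute hierarchy, which is what the later `.tile(...).to(Block)` slides make explicit.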


SLIDE 6

INTRODUCTION

GEMM Spec(ification)

Specs define the current problem to optimize and contain enough information to fully describe it.

Fireiron MatMul Spec:

MatMul(Kernel,
       A: Matrix(1536, 2048, GL, FP32, RowMajor),
       B: Matrix(2048, 1024, GL, FP32, ColMajor),
       C: Matrix(1536, 1024, GL, FP32, ColMajor))

Idea: a programmer should be able to provide a valid implementation for a given spec!
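A spec bundles a level of the compute hierarchy with fully described operand matrices. The dataclasses below are an illustrative stand-in for that information content; the class and field names are assumptions, not the actual Fireiron API:

```python
from dataclasses import dataclass

# Illustrative stand-in (not the actual Fireiron API) for the information
# a spec carries: a compute-hierarchy level plus fully described operands.

@dataclass(frozen=True)
class Matrix:
    rows: int
    cols: int
    location: str   # e.g. GL (global), SH (shared), RF (registers)
    dtype: str      # e.g. FP32
    layout: str     # RowMajor or ColMajor

@dataclass(frozen=True)
class MatMul:
    level: str      # Kernel, Block, Warp, or Thread
    A: Matrix
    B: Matrix
    C: Matrix

# The Kernel-level spec from the slide: enough information to fully
# describe the problem a valid implementation must solve.
spec = MatMul("Kernel",
              A=Matrix(1536, 2048, "GL", "FP32", "RowMajor"),
              B=Matrix(2048, 1024, "GL", "FP32", "ColMajor"),
              C=Matrix(1536, 1024, "GL", "FP32", "ColMajor"))
```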

SLIDE 7

INTRODUCTION

Working with Specs

Given a Spec, you can:

  a) provide a handwritten microkernel, or
  b) arrive at an executable Spec, or
  c) decompose it into a “smaller” spec

Fireiron MatMul Spec:

MatMul(Kernel,
       A: Matrix(1536, 2048, GL, FP32, RowMajor),
       B: Matrix(2048, 1024, GL, FP32, ColMajor),
       C: Matrix(1536, 1024, GL, FP32, ColMajor))

Goal: generate a high-performance MatMul kernel.

  • We start with the Kernel-level Spec.

SLIDE 9

DECOMPOSITIONS

Halide-like transformations constructing the IR

Every Decomposition:

  1. is a function Spec -> Spec (returning a “smaller” subspec)
  2. provides a partial implementation to our code generator

Two Main Decompositions:

  • .tile(m,n): enables descending the compute hierarchy
  • .load(matrix, loc, impl): enables descending the memory hierarchy

We also allow defining operation-specific Decompositions:

  • .split(k)
  • .epilog(...)
  • ...
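The rule that every Decomposition is a function Spec -> Spec can be modeled minimally. The sketch below uses plain dicts of (rows, cols) extents and is illustrative only; the partial implementation each real Fireiron decomposition also carries is omitted:

```python
# Illustrative model (not the real Fireiron implementation) of a
# Decomposition as a function Spec -> Spec: .tile(m, n) returns a
# "smaller" MatMul spec. Specs here are dicts of (rows, cols) extents.

def tile(spec, m, n):
    (M, K) = spec["A"]
    (K2, N) = spec["B"]
    assert K == K2, "inner dimensions must agree"
    assert M % m == 0 and N % n == 0, "tile sizes must divide the problem"
    # The sub-problem keeps the full K extent; only the output tile shrinks.
    return {"level": spec["level"], "A": (m, K), "B": (K, n), "C": (m, n)}

kernel_spec = {"level": "Kernel",
               "A": (1536, 2048), "B": (2048, 1024), "C": (1536, 1024)}
block_tile_spec = tile(kernel_spec, 128, 128)
```

Applied to the Kernel-level spec, this reproduces the 128 x 2048, 2048 x 128, and 128 x 128 extents shown on the `.tile(128,128)` slide.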
SLIDE 10

DESCENDING THE COMPUTE HIERARCHY

.tile(128,128)

Current Spec:

MatMul(Kernel,
       A: Matrix(1536, 2048, GL, FP32, RowMajor),
       B: Matrix(2048, 1024, GL, FP32, ColMajor),
       C: Matrix(1536, 1024, GL, FP32, ColMajor))

New Spec:

MatMul(Kernel,
       A: Matrix( 128, 2048, GL, FP32, RowMajor),
       B: Matrix(2048,  128, GL, FP32, ColMajor),
       C: Matrix( 128,  128, GL, FP32, ColMajor))
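A quick arithmetic check of this step (illustrative, not Fireiron output): tiling the 1536 x 1024 output C with 128 x 128 tiles yields 12 x 8 = 96 independent sub-problems.

```python
# Illustrative check of .tile(128,128): how many "smaller" MatMul
# instances does the 1536 x 1024 output C decompose into?
M, N = 1536, 1024      # extents of C from the Kernel-level spec
tm, tn = 128, 128      # tile sizes passed to .tile
assert M % tm == 0 and N % tn == 0   # tiles divide the problem exactly
num_tiles = (M // tm) * (N // tn)
print(num_tiles)  # 12 * 8 = 96
```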

SLIDE 11

DESCENDING THE COMPUTE HIERARCHY

.tile(128,128).to(Block)

“Refinement”: adding implementation details.

SLIDE 12

OUTER PRODUCT BLOCKED GEMM

.split(kBlock)

Current Spec:

MatMul(Block,
       A: Matrix( 128, 2048, GL, FP32, RowMajor),
       B: Matrix(2048,  128, GL, FP32, ColMajor),
       C: Matrix( 128,  128, GL, FP32, ColMajor))

.split(8)

New Spec:

MatMul(Block,
       A: Matrix(128,   8, GL, FP32, RowMajor),
       B: Matrix(  8, 128, GL, FP32, ColMajor),
       C: Matrix(128, 128, GL, FP32, ColMajor))
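The effect of `.split(8)` can be modeled numerically: splitting the K = 2048 reduction dimension into chunks turns one MatMul into a sum of thin, outer-product-shaped MatMuls (128 x 8 times 8 x 128). The pure-Python sketch below is illustrative only (not Fireiron code), shown on tiny matrices:

```python
# Illustrative pure-Python model (not Fireiron code) of .split(k): the
# reduction dimension K is split into chunks of size kc, turning one
# MatMul into a sum of thin MatMuls of the same type.

def matmul(A, B):
    """Reference dense matrix multiply on lists of lists."""
    K, N = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(len(A))]

def matmul_split_k(A, B, kc):
    """Accumulate C over K-chunks; each chunk is a "smaller" spec."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for k0 in range(0, K, kc):           # one thin MatMul per chunk
        for i in range(M):
            for j in range(N):
                C[i][j] += sum(A[i][k] * B[k][j]
                               for k in range(k0, min(k0 + kc, K)))
    return C
```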

SLIDE 13

DESCENDING THE MEMORY HIERARCHY

.load(Matrix, Location, Strategy)

.load(A,SH,strategy)

.load introduces a new Spec describing the data movement (GL -> SH); this spec is then decomposed with the given strategy.
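One plausible strategy for decomposing such a data-movement spec is a cooperative, strided copy across the threads of a block. The sketch below is an illustrative Python model of that idea; the function name and flat-index layout are assumptions, not Fireiron's generated code:

```python
# Illustrative model of one data-movement strategy (names and layout are
# assumptions): the threads of a block cooperatively copy a flat tile
# from global to shared memory, each thread taking a strided slice so
# that consecutive threads touch consecutive elements (coalesced access).

def cooperative_copy(src, num_threads):
    dst = [None] * len(src)          # stand-in for the SH destination
    for tid in range(num_threads):   # on a GPU these run concurrently
        for idx in range(tid, len(src), num_threads):
            dst[idx] = src[idx]
    return dst
```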

SLIDE 14

WMMA IN FIREIRON

adding support for CUDA’s WMMA API

Updating Fireiron’s Memory Hierarchy:

  GL -> SH -> RF   becomes   GL -> SH -> FR&lt;M,N,K&gt; -> RF

“Before the MMA operation is performed the operand matrices must be represented in the registers of the GPU. As an MMA is a warp-wide operation these registers are distributed amongst the threads of a warp with each thread holding a fragment of the overall matrix.”
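A back-of-envelope check of the quote (illustrative; the actual WMMA register mapping is architecture-specific and more intricate): a 16 x 16 operand fragment holds 256 elements, and an MMA is a warp-wide operation over 32 threads, so each thread holds a fragment of 8 elements.

```python
# Back-of-envelope check of the quote above (illustrative only; the real
# WMMA register mapping is architecture-specific): a 16 x 16 operand
# fragment holds 256 elements, distributed over the 32 threads of a warp.
FRAG_M, FRAG_N, WARP_SIZE = 16, 16, 32
elems_per_thread = (FRAG_M * FRAG_N) // WARP_SIZE
print(elems_per_thread)  # 8
```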

SLIDE 15

FP16 Performance on Volta (performance chart)

SLIDE 16

QUESTIONS?

bhagedorn@nvidia.com vgrover@nvidia.com