Apple LLVM GPU Compiler: Embedded Dragons
Charu Chandrasekaran, Apple
Marcello Maggioni, Apple

Agenda
- How Apple uses LLVM to build a GPU compiler
- Factors that affect GPU performance
- The Apple GPU compiler pipeline passes
How Apple uses LLVM

The GPU compiler is built on top of LLVM trunk, with continuous integration and testing against it. Every year a production compiler is branched off trunk, so the year 1, year 2, and year 3 production compilers coexist while development keeps tracking trunk.
The GPU SW Stack

[Diagram: inside an iOS / watchOS / tvOS process, the app hands the user's .metal shader to the Metal-FE XPC service, which emits IR; the backend XPC service compiles that IR into .obj / .exec objects; the Metal framework and GPU driver consume those and interact with the GPU to produce the result.]
About GPUs

GPUs are massively parallel vector processors. Threads are grouped together and execute in lockstep (they share the same PC).

[Diagram: a shader core with a single PC driving lanes 0-7.]

The parallelism is implicit; a single thread looks like normal CPU code:

float kernel(float a, float b) {
  float c = a + b;
  return c;
}

Across the lanes, though, the hardware effectively executes the vector version:

float8 kernel(float8 a, float8 b) {
  float8 c = add_v8(a, b);
  return c;
}
About GPUs: Latency hiding

Multiple groups of threads are resident on the GPU at the same time for latency hiding.

[Diagram: a shader core with one PC per resident thread group, all sharing lanes 0-7.]

The GPU picks up work from the various groups of threads to hide the latency from the other groups: while one group waits on its texture_fetch(), another group's independent math executes.

float kernel(struct In_PS) {
  float4 color = texture_fetch();
  float4 c = In_PS.a * In_PS.b;
  ...
  float4 d = c + color;
  ...
}
About GPUs: Register file

The groups of threads share a big register file that is split between the threads.

[Diagram: lanes 0-7 each hold their threads' registers a-d (0a-0d through 7a-7d) in per-lane slices of the register file.]
The number of registers used per thread impacts the number of thread groups resident on the machine (occupancy). This in turn impacts the latency hiding capability. VERY IMPORTANT!
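As a toy illustration of that relationship (every constant below is a made-up number, not a real Apple GPU parameter), occupancy falls as per-thread register use grows:

#include <algorithm>
#include <cstdio>

// Toy occupancy model: how many thread groups fit in a shared register file.
// All sizes are illustrative, not real Apple GPU parameters.
int residentGroups(int registerFileBytes, int bytesPerRegister,
                   int registersPerThread, int threadsPerGroup, int maxGroups) {
    int bytesPerGroup = registersPerThread * bytesPerRegister * threadsPerGroup;
    return std::min(maxGroups, registerFileBytes / bytesPerGroup);
}

int main() {
    // 64 KB register file, 32-bit registers, 32 threads per group, cap of 32 groups.
    const int regsPerThread[] = {16, 32, 64, 128};
    for (int regs : regsPerThread)
        std::printf("%3d regs/thread -> %2d resident groups\n",
                    regs, residentGroups(64 * 1024, 4, regs, 32, 32));
    // Fewer resident groups means fewer other groups to switch to while one
    // waits on memory, i.e. less latency hiding.
}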
About GPUs: Spilling

The huge register file and the number of concurrent threads make spilling pretty costly.

Example (spilling 1 register): 1024 threads x 32-bit register = 4 KB of traffic between the register file and the L1$!

Spilling is typically not an effective way of reducing register pressure to increase occupancy.
Inlining

Unoptimized IR -> Inlining

We support function calls and we try to exploit them. Like most GPU programming models, though, we can inline everything if we want: all functions plus the main kernel are linked together in a single module.

Not inlining showed significant speedup on some shaders where big functions were called multiple times: I-cache savings!

The inlining stage runs three steps: Dead Arg Elimination -> Argument Promotion -> Inlining.

Dead Arg Elimination: get rid of dead arguments to functions.
Argument Promotion: convert as many arguments as possible to pass-by-value.

Inlining: proceed to the actual inlining.
The inlining decision is based on the standard LLVM inlining policy, plus a custom threshold and additional constraints. We force inline functions taking pointers into the stack or constant address spaces:

int function(int addrspace(stack)* v) { ... }
int function(int addrspace(constant)* v) { ... }

The objective of our inlining policy is to be very conservative while trying to exploit the cases where keeping a function call can potentially benefit us a lot. Custom policies try to minimize the impact that not inlining could have on other key optimizations.
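A minimal sketch of how such a decision could be wired up (hypothetical types, field names, and threshold; the real pass builds on LLVM's inline cost machinery):

// Hypothetical model of the inlining decision described above; none of these
// names or numbers come from the actual Apple pass.
struct CallSiteInfo {
    int  llvmInlineCost;              // cost from the standard LLVM-style analysis
    bool argPointsToStackOrConstant;  // e.g. an int addrspace(stack)* parameter
    bool calleeIsBigAndCalledOften;   // keeping the call saves I-cache
};

bool shouldInline(const CallSiteInfo &cs) {
    // Forced cases: pointers into the stack/constant address spaces.
    if (cs.argPointsToStackOrConstant)
        return true;
    // Stay conservative where keeping the call can benefit us a lot.
    if (cs.calleeIsBigAndCalledOften)
        return false;
    // Otherwise: standard LLVM-style cost gated by a custom threshold.
    const int kCustomThreshold = 50;  // made-up value
    return cs.llvmInlineCost < kCustomThreshold;
}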
The new IPRA (inter-procedural register allocation) support in LLVM has been key in avoiding pointless calling-convention register store/reloads around calls:

int callee() {
  add r1, r2, r3
  ret
}

Without IPRA:

int caller() {
  mul r4, r1, r3
  push r4          // conservatively save r4 across the call
  call callee()
  pop r4
  add r1, r1, r4
}

With IPRA:

int caller() {
  mul r4, r1, r3
  call callee()    // callee is known not to touch r4: no save/restore
  add r1, r1, r4
}
SROA

Inlining -> SROA

We run it multiple times in our pipeline in order to promote as many allocas to register values as possible.
Alloca Opt

SROA -> Alloca Opt

Small stack arrays indexed dynamically can be rewritten into select chains:

int function(int i) {
  int a[4] = { x, y, z, w };
  ...
  ... = a[i];
}

becomes

int function(int i) {
  ...
  ... = i == 0 ? x : (i == 1 ? y : (i == 2 ? z : w));
}

Fewer stack accesses!
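Spelled out as self-contained, compilable code (the slide's x, y, z, w become parameters here so the example runs), the two forms compute the same result:

#include <cassert>

// Before Alloca Opt: the array lives on the stack and is indexed dynamically.
int viaStack(int i, int x, int y, int z, int w) {
    int a[4] = {x, y, z, w};
    return a[i];
}

// After Alloca Opt: the alloca is gone, replaced by a select chain.
int viaSelects(int i, int x, int y, int z, int w) {
    return i == 0 ? x : (i == 1 ? y : (i == 2 ? z : w));
}

int main() {
    for (int i = 0; i < 4; ++i)
        assert(viaStack(i, 10, 20, 30, 40) == viaSelects(i, 10, 20, 30, 40));
}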
Loop Unrolling

Alloca Opt -> Loop Unrolling

Completely unrolling loops allows SROA to remove stack accesses. If we have dynamic memory accesses to stack or constant memory that we can promote to uniform memory, we greatly increase the unrolling thresholds:

int a[5] = { x, y, z, w, q };
int b = 0;
for (int i = 0; i < 5; ++i) {
  b += a[i];
}

becomes

int a[5] = { x, y, z, w, q };
int b = x;
b += y;
b += z;
b += w;
b += q;

We also keep track of register pressure. Our scheduler is very eager to help latency hiding by moving most memory accesses to the top of the shader (and it is difficult to teach it otherwise), so we limit unrolling when we detect we could blow up the register pressure:

for (int i = 0; i < 5; ++i) {
  float4 a = texture_fetch();
  float4 b = texture_fetch();
  float4 c = texture_fetch();
  float4 d = texture_fetch();
  float4 e = texture_fetch();
  // Math involving the above
}

We allow partial unrolling if we detect a static loop count and the loop would be bigger than our unrolling threshold:

for (int i = 0; i < 16; ++i) {
  float4 a = texture_fetch();
  // Math involving the above
}

becomes

for (int i = 0; i < 4; ++i) {
  float4 a1 = texture_fetch();
  float4 a2 = texture_fetch();
  float4 a3 = texture_fetch();
  float4 a4 = texture_fetch();
  ...
  // Unrolled 4 times
}
Flatten CFG

Loop Unrolling -> Flatten CFG

Speculation helps in creating bigger blocks for the scheduler to do a better job, and reduces the total overhead introduced by small blocks:

if (val == x) {
  a = v + z;
  c = q + a;
} else {
  b = v * z;
  c = q * b;
}
... = c;

becomes

a = v + z;
c1 = q + a;
b = v * z;
c2 = q * b;
c = (val == x) ? c1 : c2;
... = c;
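A minimal sketch of the kind of legality/cost check such a pass needs (invented data structures; the real pass works on LLVM IR): after flattening, both arms execute unconditionally, so each must be small and safe to speculate.

#include <vector>

// Invented block model for illustration only.
struct Instr {
    bool hasSideEffects;  // stores, barriers, ...
    bool mayTrap;         // e.g. a division that could fault
};
struct Block {
    std::vector<Instr> instrs;
};

bool safeToSpeculate(const Block &b, unsigned maxInstrs) {
    if (b.instrs.size() > maxInstrs)
        return false;  // speculating a big arm costs more than the branch
    for (const Instr &i : b.instrs)
        if (i.hasSideEffects || i.mayTrap)
            return false;  // must not run when its condition is false
    return true;
}

bool shouldFlatten(const Block &thenArm, const Block &elseArm) {
    const unsigned kMaxArmSize = 8;  // made-up threshold
    return safeToSpeculate(thenArm, kMaxArmSize) &&
           safeToSpeculate(elseArm, kMaxArmSize);
}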
Uniformity Hoisting

Flatten CFG -> Uniformity Hoisting

GPUs are massively parallel, but often some computation in a shader can be statically determined to be the same for all the threads. Some of these patterns are really inconvenient or difficult for the shader writer to extract from the program:

void kernel(constant float4 *A, constant bool *b, global float *C) {
  float4 f_vec = *b ? *A : float4(1.0);
  ... = f_vec * C[tid];
}

We can move such computation to a program that runs at a lower rate (once). Even one instruction is a lot of parallel work saved:

void uniform_kernel(constant float4 *A, constant bool *b) {
  // uni_f_vec lives in uniform memory
  uni_f_vec = *b ? *A : float4(1.0);
}

void kernel(constant float4 *A, constant bool *b, global float *C) {
  ... = uni_f_vec * C[tid];
}
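A minimal sketch of the underlying analysis (assumed value-graph representation, acyclic and ignoring control-flow divergence; the real analysis runs on LLVM IR): a value is uniform if it never derives from a per-thread source such as the thread id.

#include <vector>

// Assumed use-def node; invented for illustration.
struct Value {
    bool isPerThreadSource = false;    // e.g. tid / lane id
    std::vector<const Value*> operands;
};

bool isUniform(const Value *v) {
    if (v->isPerThreadSource)
        return false;                  // varies across lanes
    for (const Value *op : v->operands)
        if (!isUniform(op))
            return false;              // any divergent input poisons the result
    return true;                       // built only from constants/uniform inputs
}

// f_vec above depends only on *A and *b (uniform constant memory), so it
// qualifies for hoisting into the once-per-dispatch uniform program.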
Some stack arrays that are initialized and never stored to (and haven't been optimized away previously) can be turned into global loads instead:

void kernel(constant float4 *A, constant bool *b, global float *C) {
  const int a[5] = { 3, 2, 1, 4, 2 };   // never stored to
  ... = a[i];
}

becomes

const int a[5] = { 3, 2, 1, 4, 2 };

void kernel(constant float4 *A, constant bool *b, global float *C) {
  ... = a[i];
}

File-scope constants can be initialized more efficiently before running the program. Moreover, on the stack the array is replicated for every thread, while in global memory it is shared by all the threads.
CFG Structurization

Uniformity Hoisting -> CFG Structurization

When control flow is unstructured (e.g., a block is controlled by multiple predecessors) execution on GPUs requires some special handling.

[Diagram: blocks A and B both branch into block C, which flows into D.]

Our backend supports full execution of unstructured control flow, handled at MI-level with little overhead, so we need only limited structurization (we do require loops to be transformed into LoopSimplify form, though).

For relatively small unstructured blocks we employ structurization based on duplication.

[Diagram: block C is duplicated into C', giving each predecessor its own properly nested copy.]

We thought about employing the LLVM StructurizeCFG pass, but the way it translated control flow wasn't optimal for us (higher register pressure on average, more control instructions).
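Written as source-level control flow (a hypothetical example matching the diagram, not code from the talk), duplication looks like this:

// Unstructured: block C is reached from two different control regions,
// so it is not a clean if/else join.
int unstructured(int p, int q, int x) {
    if (p) goto A;
    if (q) goto B;
    return x;
A:  x += 1;      // block A
    goto C;
B:  x *= 2;      // block B
C:  x -= 3;      // block C: predecessors A and B
    return x;    // block D
}

// Structurized by duplication: C is cloned into C', every path is properly
// nested, and the gotos disappear.
int structurized(int p, int q, int x) {
    if (p) {
        x += 1;  // A
        x -= 3;  // C
    } else if (q) {
        x *= 2;  // B
        x -= 3;  // C' (duplicate of C)
    }
    return x;    // D
}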
We run a bunch of optimizations (multiple times) in between these passes: InstCombine, DCE, SCCP, CSE, SimplifyCFG, Reassociate.
Instruction Selection

Instruction Selection is one of the most expensive steps of our compilation pipeline, taking between 15% and 35% of compile time. We use lots of custom combines to extract performance from our hardware.

We select through SelectionDAG, with FastISel as an alternative: on some devices FastISel helps keeping compile time in check.

The plan is to switch to GlobalISel in the near future as our main ISel. The switch should give us a better infrastructure while improving compile time.
Scheduling

Scheduling is key for exploiting ILP, improving latency hiding, and reducing power consumption by cutting register accesses. We try to achieve all of this while being very careful not to cause register pressure problems.

add r5, r0, r3
mul r7, r3, r4
sub r6, r5, r4
load r1
add r3, r1, r6
load r2
mul r4, r2, r3

Moving unrelated operations in after the memory accesses helps with in-thread latency hiding, so that the loads are already in flight by the time their results are needed:

load r1
load r2
add r5, r0, r3    // independent operations here
mul r7, r3, r4
sub r6, r5, r4
add r3, r1, r6    // wait here for the loads
mul r4, r2, r3

Interleaving independent operations improves ILP, and forwarding instruction results helps reduce register file traffic (lower power). This is pretty standard scheduling.

Many other target-specific policies are enforced, all aimed at improving ILP, latency hiding, and power (for example grouping instructions by type), all of this while battling register pressure. We are willing to spend a lot of compile time on scheduling.
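A toy priority function capturing that balance (invented node model and weights; the real scheduler is far more involved): hoist long-latency loads for latency hiding, but only while register pressure allows.

#include <algorithm>
#include <vector>

// Invented ready-list node; a real scheduler tracks much more state.
struct Node {
    bool isLoad;    // long-latency memory operation
    int  liveOuts;  // registers this node's result would add to the live set
};

int priority(const Node &n, int liveRegs, int pressureLimit) {
    int p = 0;
    if (n.isLoad)
        p += 100;    // prefer issuing loads early for latency hiding...
    if (liveRegs + n.liveOuts > pressureLimit)
        p -= 1000;   // ...but not if that would blow up register pressure
    return p;
}

// Pick the best candidate from a (non-empty) ready list under the
// current pressure.
const Node *pickNext(const std::vector<Node> &ready, int liveRegs, int limit) {
    return &*std::max_element(ready.begin(), ready.end(),
        [&](const Node &a, const Node &b) {
            return priority(a, liveRegs, limit) < priority(b, liveRegs, limit);
        });
}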
Compile-time and being a JIT

Shaders are compiled just-in-time, so compile time matters. The main offenders are the expensive steps above, instruction selection and scheduling.

Changing the passes to save compile time can unveil nasty bugs that used to be hidden ... and it did uncover some nasty bugs!
Register definitions

Some instructions support complex input/output operands loaded in contiguous registers. GPUs typically support register tuples, with overlapping tuple elements sharing many register units (RUs).

[Diagram: tuples of 2 and tuples of 4 overlapping the same underlying registers r0-r8.]

This kind of register hierarchy generates a substantial number of LLVM register definitions (one for each element of each tuple). Tuples can go up to 16-wide on some architectures!

Algorithms that scale with the number of registers, or that iterate over all the registers containing an RU, can take a hit. We had a problem with the IPRA implementation, for example, where in our case determining the registers used by a function was O(N^2) in the number of registers.
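Back-of-the-envelope math (illustrative register count; real targets usually also restrict tuple alignment, which shrinks these numbers) shows how quickly the definitions multiply:

#include <cstdio>

// A tuple of width w over N contiguous base registers adds N - w + 1
// overlapping register definitions. The numbers here are illustrative only.
int main() {
    const int baseRegs = 256;
    const int widths[] = {1, 2, 4, 8, 16};
    int total = 0;
    for (int width : widths) {
        int defs = baseRegs - width + 1;
        std::printf("width %2d: %3d definitions\n", width, defs);
        total += defs;
    }
    std::printf("total: %d register definitions for one bank\n", total);
}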
Register pressure awareness

Our passes are made register pressure aware, backing off when they are running out of registers; we avoid spilling, as it reduces occupancy.