From Shader Code to a Tera Terafl flop op: How Shader Cores Work - PowerPoint PPT Presentation

From Shader Code to a Tera Terafl flop op: How Shader Cores Work Kayvon Fatahalian Stanford University SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

This talk 1. Three major ideas that make GPU processing cores run fast 2. Closer look at real GPU designs – NVIDIA GTX 285 – AMD Radeon 4890 – Intel Larrabee 3. Memory hierarchy: moving data to processors 2 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Part 1: throughput processing • Three key concepts behind how modern GPU processing cores run code • Knowing these concepts will help you: 1. Understand space of GPU core (and throughput CPU processing core) designs 2. Optimize shaders/compute kernels 3. Establish intuition: what workloads might benefit from the design of these architectures? 3 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

What’s in a GPU? Input Assembly Shader Shader Tex Core Core Rasterizer Shader Shader Tex Output Blend Core Core Video Decode Shader Shader Tex Core Core HW Work or Shader Shader Tex Distributor Core Core SW? Heterogeneous chip multi-processor (highly tuned for graphics) 4 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

A diffuse reflectance shader sampler mySamp; Texture2D<float3> myTex; float3 lightDir; float4 diffuseShader(float3 norm, float2 uv) { float3 kd; kd = myTex.Sample(mySamp, uv); kd *= clamp( dot(lightDir, norm), 0.0, 1.0); return float4(kd, 1.0); } Independent, but no explicit parallelism 5 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Compile shader 1 unshaded fragment input record sampler mySamp; Texture2D<float3> myTex; <diffuseShader>: float3 lightDir; sample r0, v4, t0, s0 mul r3, v0, cb0[0] madd r3, v1, cb0[1], r3 float4 diffuseShader(float3 norm, float2 uv) madd r3, v2, cb0[2], r3 { clmp r3, r3, l(0.0), l(1.0) float3 kd; mul o0, r0, r3 kd = myTex.Sample(mySamp, uv); mul o1, r1, r3 kd *= clamp ( dot(lightDir, norm), 0.0, 1.0); mul o2, r2, r3 mov o3, l(1.0) return float4(kd, 1.0); } 1 shaded fragment output record 6 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Execute shader Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 mul r3, v0, cb0[0] ALU madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 (Execute) clmp r3, r3, l(0.0), l(1.0) mul o0, r0, r3 Execution mul o1, r1, r3 Context mul o2, r2, r3 mov o3, l(1.0) 7 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

CPU-“style” cores Fetch/ Out-of-order control logic Decode Fancy branch predictor ALU (Execute) Memory pre-fetcher Execution Context Data cache (A big one) 13 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Slimming down Fetch/ Decode Idea #1: ALU Remove components that (Execute) help a single instruction Execution stream run fast Context 14 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Two cores (two fragments in parallel) fragment 1 fragment 2 Fetch/ Fetch/ Decode Decode <diffuseShader>: <diffuseShader>: sample r0, v4, t0, s0 ALU ALU sample r0, v4, t0, s0 mul r3, v0, cb0[0] mul r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v1, cb0[1], r3 (Execute) (Execute) madd r3, v2, cb0[2], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) clmp r3, r3, l(0.0), l(1.0) mul o0, r0, r3 mul o0, r0, r3 mul o1, r1, r3 Execution Execution mul o1, r1, r3 mul o2, r2, r3 mul o2, r2, r3 mov o3, l(1.0) Context Context mov o3, l(1.0) 15 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Four cores (four fragments in parallel) Fetch/ Fetch/ Decode Decode ALU ALU (Execute) (Execute) Execution Execution Context Context Fetch/ Fetch/ Decode Decode ALU ALU (Execute) (Execute) Execution Execution Context Context 16 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Sixteen cores (sixteen fragments in parallel) ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU 16 cores = 16 simultaneous instruction streams 17 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Instruction stream sharing But… many fragments should be able to share an instruction str eam! <diffuseShader>: sample r0, v4, t0, s0 mul r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) mul o0, r0, r3 mul o1, r1, r3 mul o2, r2, r3 mov o3, l(1.0) 18 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Recall: simple processing core Fetch/ Decode ALU (Execute) Execution Context 19 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Add ALUs Idea #2: Fetch/ Decode Amortize cost/complexity of ALU 1 ALU 2 ALU 3 ALU 4 managing an instruction stream across many ALUs ALU 5 ALU 6 ALU 7 ALU 8 Ctx Ctx Ctx Ctx SIMD processing Ctx Ctx Ctx Ctx Shared Ctx Data 20 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Modifying the shader Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 ALU 1 ALU 2 ALU 3 ALU 4 mul r3, v0, cb0[0] madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 ALU 5 ALU 6 ALU 7 ALU 8 clmp r3, r3, l(0.0), l(1.0) mul o0, r0, r3 mul o1, r1, r3 Ctx Ctx Ctx Ctx mul o2, r2, r3 mov o3, l(1.0) Ctx Ctx Ctx Ctx Original compiled shader: Shared Ctx Data Processes one fragment using scalar ops on scalar registers 21 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Modifying the shader Fetch/ Decode <VEC8_diffuseShader>: VEC8_sample vec_r0, vec_v4, t0, vec_s0 ALU 1 ALU 2 ALU 3 ALU 4 VEC8_mul vec_r3, vec_v0, cb0[0] VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3 VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3 ALU 5 ALU 6 ALU 7 ALU 8 VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0) VEC8_mul vec_o0, vec_r0, vec_r3 VEC8_mul vec_o1, vec_r1, vec_r3 Ctx Ctx Ctx Ctx VEC8_mul vec_o2, vec_r2, vec_r3 VEC8_mov vec_o3, l(1.0) Ctx Ctx Ctx Ctx New compiled shader: Shared Ctx Data Processes 8 fragments using vector ops on vector registers 22 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Modifying the shader 1 2 3 4 5 6 7 8 Fetch/ Decode <VEC8_diffuseShader>: VEC8_sample vec_r0, vec_v4, t0, vec_s0 ALU 1 ALU 2 ALU 3 ALU 4 VEC8_mul vec_r3, vec_v0, cb0[0] VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3 VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3 ALU 5 ALU 6 ALU 7 ALU 8 VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0) VEC8_mul vec_o0, vec_r0, vec_r3 VEC8_mul vec_o1, vec_r1, vec_r3 Ctx Ctx Ctx Ctx VEC8_mul vec_o2, vec_r2, vec_r3 VEC8_mov vec_o3, l(1.0) Ctx Ctx Ctx Ctx Shared Ctx Data 23 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

128 fragments in parallel 16 cores = 128 ALUs = 16 simultaneous instruction streams 24 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

vertices / fragments primitives 128 [ ] in parallel CUDA threads OpenCL work items compute shader threads primitives vertices fragments 25 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

But what about branches? 1 2 ... ... 8 Time (clocks) ALU 1 ALU 2 . . . . . . ALU 8 <unconditional shader code> if (x > 0) { y = pow(x, exp); y *= Ks; refl = y + Ka; } else { x = 0; refl = Ka; } <resume unconditional shader code> 26 SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

From Shader Code to a Tera Terafl flop op: How Shader Cores Work - PowerPoint PPT Presentation

From Shader Code to a Tera Terafl flop op: How Shader Cores Work Kayvon Fatahalian Stanford University SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ This talk 1. Three major ideas that make GPU processing cores

TERA CONTRIBUTIONS TO PARTNER Ugo Amaldi University of Milano Bicocca and TERA Foundation

Displacement Shader Writing CSCD 472 Slide 1 4/5/10 Displacement Shader Variables CSCD 472

Nanostructures for Tera Tera- -bit Level bit Level Nanostructures for Charge Trap Flash

Those Who Live by the FLOP May Die by the Flop Cherri Pancake Oregon State University

RenderMan Shader Assignment So You Want to Write RenderMan shaders Due: Monday, May 3 rd

Shaders Rasmus Vahtra, Andres Traks What is a shader? Maybe this thing? Shader definition

Shader Programming Shader Programming vs CUDA vs CUDA Tien-Tsin Wong The Chinese University of

Global Risk Resources ITER & RiskIE Databases Andrea Wullenweber & Oliver Kroner

FlashBacter UPO-Sevilla Team APPLICATIONS Biosensors Killer proteins production Multiple

Activity-Sensitive Flip-Flop and Latch Selection for Reduced Energy Seongmoo Heo, Ronny

PATMOS 2010 An On-Chip Flip-flop Characterization Circuit Andrea Veggetti (ST Agrate) Abhishek

Flip-Flops Assume the an edge-triggered flip-flop FF implements a Boolean function f with

Markov modulated Brownian motion and the flip-flop fluid queue Guy Latouche Universit e libre

VkRunner A simple shader script tester for Vulkan Neil Roberts Based on Piglits

The Fragment Shader CS418 Computer Graphics John C. Hart Fragment Pipeline Rasterization Model

Shaders Slide credit to Prof. Zwicker Today Shader programming 2 Complete model Blinn

Computing Environments Saeid Mofrad, Ishtiaq Ahmed, Shiyong Lu, Ping Yang, Heming Cui, Fengwei

AMD Pacifica Virtualization Technology AMD Unveils Virtualization Platform AMD Pacifica

GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need

Reaching "EPYC" Virtualization Performance Case Study: Tuning VMs for Best Performance

Practical DirectX 12 - Programming Model and Hardware Capabilities Gareth Thomas & Alex Dunn

NUMA Siloing in the FreeBSD Network Stack Drew Gallatin EuroBSDCon 2019 (Or how to serve

A Scalable Ordering Primitive for Multicore Machines Sanidhya Kashyap Changwoo Min Kangnyeon Kim

in a 14nm FinFET Library: Comparison to an Industrial Synchronous Counterpart Weiwei Jiang

From Shader Code to a Tera Terafl flop op: How Shader Cores Work - PowerPoint PPT Presentation

From Shader Code to a Tera Terafl flop op: How Shader Cores Work Kayvon Fatahalian Stanford University SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ This talk 1. Three major ideas that make GPU processing cores

TERA CONTRIBUTIONS TO PARTNER Ugo Amaldi University of Milano Bicocca and TERA Foundation

Displacement Shader Writing CSCD 472 Slide 1 4/5/10 Displacement Shader Variables CSCD 472

Nanostructures for Tera Tera- -bit Level bit Level Nanostructures for Charge Trap Flash

Those Who Live by the FLOP May Die by the Flop Cherri Pancake Oregon State University

RenderMan Shader Assignment So You Want to Write RenderMan shaders Due: Monday, May 3 rd

Shaders Rasmus Vahtra, Andres Traks What is a shader? Maybe this thing? Shader definition

Shader Programming Shader Programming vs CUDA vs CUDA Tien-Tsin Wong The Chinese University of

Global Risk Resources ITER &amp; RiskIE Databases Andrea Wullenweber &amp; Oliver Kroner

FlashBacter UPO-Sevilla Team APPLICATIONS Biosensors Killer proteins production Multiple

Activity-Sensitive Flip-Flop and Latch Selection for Reduced Energy Seongmoo Heo, Ronny

PATMOS 2010 An On-Chip Flip-flop Characterization Circuit Andrea Veggetti (ST Agrate) Abhishek

Flip-Flops Assume the an edge-triggered flip-flop FF implements a Boolean function f with

Markov modulated Brownian motion and the flip-flop fluid queue Guy Latouche Universit e libre

VkRunner A simple shader script tester for Vulkan Neil Roberts Based on Piglits

The Fragment Shader CS418 Computer Graphics John C. Hart Fragment Pipeline Rasterization Model

Shaders Slide credit to Prof. Zwicker Today Shader programming 2 Complete model Blinn

Computing Environments Saeid Mofrad, Ishtiaq Ahmed, Shiyong Lu, Ping Yang, Heming Cui, Fengwei

AMD Pacifica Virtualization Technology AMD Unveils Virtualization Platform AMD Pacifica

GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need

Reaching &quot;EPYC&quot; Virtualization Performance Case Study: Tuning VMs for Best Performance

Practical DirectX 12 - Programming Model and Hardware Capabilities Gareth Thomas &amp; Alex Dunn

NUMA Siloing in the FreeBSD Network Stack Drew Gallatin EuroBSDCon 2019 (Or how to serve

A Scalable Ordering Primitive for Multicore Machines Sanidhya Kashyap Changwoo Min Kangnyeon Kim

in a 14nm FinFET Library: Comparison to an Industrial Synchronous Counterpart Weiwei Jiang

Global Risk Resources ITER & RiskIE Databases Andrea Wullenweber & Oliver Kroner

Reaching "EPYC" Virtualization Performance Case Study: Tuning VMs for Best Performance

Practical DirectX 12 - Programming Model and Hardware Capabilities Gareth Thomas & Alex Dunn