

SLIDE 1

From Shader Code to a Teraflop: How Shader Cores Work

Kayvon Fatahalian, Stanford University

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

SLIDE 2

This talk

1. Three major ideas that make GPU processing cores run fast
2. Closer look at real GPU designs
   – NVIDIA GTX 285
   – AMD Radeon 4890
   – Intel Larrabee
3. Memory hierarchy: moving data to processors

SLIDE 3

Part 1: throughput processing

  • Three key concepts behind how modern GPU processing cores run code
  • Knowing these concepts will help you:
    1. Understand the space of GPU core (and throughput CPU core) designs
    2. Optimize shaders/compute kernels
    3. Establish intuition: what workloads might benefit from the design of these architectures?

SLIDE 4

What’s in a GPU?

(diagram: 8 shader cores, 4 texture units (Tex), plus Input Assembly, Rasterizer, Output Blend, Video Decode, and a Work Distributor)

Heterogeneous chip multi-processor (highly tuned for graphics)

HW or SW?

SLIDE 5

A diffuse reflectance shader

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

Shades one fragment: independent, but no explicit parallelism

SLIDE 6

Compile shader

The HLSL from the previous slide compiles to:

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

1 unshaded fragment input record → 1 shaded fragment output record

SLIDE 7

Execute shader

(one simple core: Fetch/Decode, ALU (Execute), Execution Context)

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

SLIDE 8–12

Execute shader

(animation: the core steps through the <diffuseShader> instruction stream, one instruction at a time on the single ALU)

SLIDE 13

CPU-“style” cores

(Fetch/Decode, ALU (Execute), Execution Context — plus out-of-order control logic, fancy branch predictor, memory pre-fetcher, and a data cache. A big one.)

SLIDE 14

Slimming down

Idea #1: Remove components that help a single instruction stream run fast

(what remains: Fetch/Decode, ALU (Execute), Execution Context)

SLIDE 15

Two cores (two fragments in parallel)

(two copies of the simple core — Fetch/Decode, ALU (Execute), Execution Context — each running the compiled <diffuseShader> on its own fragment: fragment 1 and fragment 2)

SLIDE 16

Four cores (four fragments in parallel)

(diagram: four copies of the simple core)

SLIDE 17

Sixteen cores (sixteen fragments in parallel)

16 cores = 16 simultaneous instruction streams

SLIDE 18

Instruction stream sharing

But… many fragments should be able to share an instruction stream!

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

SLIDE 19

Recall: simple processing core

(Fetch/Decode, ALU (Execute), Execution Context)

SLIDE 20

Add ALUs

Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs

SIMD processing: one Fetch/Decode feeds ALU 1–8, with one context (Ctx) per fragment plus shared Ctx data

SLIDE 21

Modifying the shader

Original compiled shader:

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

Processes one fragment using scalar ops on scalar registers

SLIDE 22

Modifying the shader

New compiled shader:

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul vec_o0, vec_r0, vec_r3
VEC8_mul vec_o1, vec_r1, vec_r3
VEC8_mul vec_o2, vec_r2, vec_r3
VEC8_mov vec_o3, l(1.0)

Processes 8 fragments using vector ops on vector registers
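What a VEC8_* instruction buys can be sketched in plain Python (a toy stand-in, not a real ISA; the helper names vec8_mul / vec8_clmp and the input values are made up): one decoded instruction operates on a vector register holding one value per fragment, so eight ALUs do the work of eight scalar instructions.

```python
# Toy model of 8-wide SIMD execution: a "vector register" is a list
# holding one value per fragment (lane).
LANES = 8

def vec8_mul(a, b):
    # one decoded instruction -> eight multiplies, one per ALU lane
    return [x * y for x, y in zip(a, b)]

def vec8_clmp(a, lo, hi):
    # one decoded instruction -> eight clamps
    return [min(max(x, lo), hi) for x in a]

# hypothetical per-fragment N·L values after the madd chain:
vec_r3 = [1.3, 0.7, -0.2, 0.5, 2.0, 0.0, 0.9, -1.0]
vec_r3 = vec8_clmp(vec_r3, 0.0, 1.0)     # VEC8_clmp
vec_r0 = [0.5] * LANES                   # hypothetical sampled kd (one channel)
vec_o0 = vec8_mul(vec_r0, vec_r3)        # VEC8_mul: 8 fragments per instruction
print(vec_o0)
```

One fetch/decode per instruction is amortized over all eight lanes, which is exactly Idea #2.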

SLIDE 23

Modifying the shader

(the eight fragments, 1–8, are assigned one per lane; the single <VEC8_diffuseShader> stream drives ALU 1–8 in lockstep)

SLIDE 24

128 fragments in parallel

16 cores = 128 ALUs = 16 simultaneous instruction streams

SLIDE 25

128 [ ] in parallel

  • vertices
  • fragments
  • primitives
  • CUDA threads
  • OpenCL work items
  • compute shader threads

SLIDE 26

But what about branches?

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

(time, in clocks, runs down across ALU 1 … 8)

SLIDE 27–29

But what about branches?

Per-ALU condition results: T T T F F F F F

(the T lanes execute the if-branch while the F lanes idle; then the F lanes execute the else-branch while the T lanes idle)

Not all ALUs do useful work! Worst case: 1/8 performance
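The divergence cost can be sketched with an execution-mask model (a toy: the per-side instruction counts and the constants Ks, Ka, exp are illustrative, not from the slide). All lanes step through both sides of the branch; the mask decides which lanes commit results.

```python
# Toy model of branch divergence on a masked 8-wide SIMD core.
def run_branch(xs, Ks=2.0, Ka=0.1, exp=2.0):
    mask = [x > 0 for x in xs]      # per-ALU condition, e.g. T T T F F F F F
    refl = [0.0] * len(xs)
    slots = useful = 0              # lane-instruction slots issued / useful
    IF_INSTRS, ELSE_INSTRS = 3, 2   # rough sizes of the two branch bodies
    # if-side: every lane steps through it, only T lanes commit
    slots += IF_INSTRS * len(xs)
    useful += IF_INSTRS * sum(mask)
    for i, x in enumerate(xs):
        if mask[i]:
            refl[i] = (x ** exp) * Ks + Ka
    # else-side: every lane steps through it, only F lanes commit
    slots += ELSE_INSTRS * len(xs)
    useful += ELSE_INSTRS * (len(xs) - sum(mask))
    for i, x in enumerate(xs):
        if not mask[i]:
            refl[i] = Ka
    return refl, useful / slots

refl, util = run_branch([1.0, 2.0, 3.0, -1.0, -2.0, -3.0, -4.0, -5.0])
print(util)   # 3 of 8 lanes take the if-side: utilization well below 1
```

With all eight lanes agreeing, utilization would be 1.0; with a lone lane diverging it approaches the slide's 1/8 worst case.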

SLIDE 30

Clarification

SIMD processing does not imply SIMD instructions

  • Option 1: Explicit vector instructions
    – Intel/AMD x86 SSE, Intel Larrabee
  • Option 2: Scalar instructions, implicit HW vectorization
    – HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
    – NVIDIA GeForce (“SIMT” warps), AMD Radeon architectures

In practice: 16 to 64 fragments share an instruction stream

SLIDE 31

Stalls!

Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.

Texture access latency = 100’s to 1000’s of cycles

We’ve removed the fancy caches and logic that help avoid stalls.

SLIDE 32

But we have LOTS of independent fragments.

Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations.

SLIDE 33

Hiding shader stalls

(diagram: one SIMD core — Fetch/Decode, ALU 1–8, contexts plus shared Ctx data — running Frag 1 … 8; time, in clocks, runs down)

SLIDE 34

Hiding shader stalls

(the core now holds four groups of fragments: group 1 = Frag 1 … 8, group 2 = Frag 9 … 16, group 3 = Frag 17 … 24, group 4 = Frag 25 … 32)

SLIDE 35–36

Hiding shader stalls

(animation: group 1 runs until it stalls on a long-latency operation; the other groups are runnable)

SLIDE 37

Hiding shader stalls

(each of groups 1–4 in turn runs, stalls, then becomes runnable again once its data arrives; the core interleaves the four groups over time)

SLIDE 38

Throughput!

(each of groups 1–4 starts, runs until it stalls, waits while the other groups run, then resumes and finishes: Done!)

Increase run time of one group to maximize throughput of many groups
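Idea #3 can be sketched as a toy scheduler (the run/stall lengths are hypothetical, chosen only to make the effect visible): each group alternates a burst of ALU work with a long wait for a texture fetch, and a stalled group yields the core to the next runnable one.

```python
# Toy simulation of latency hiding by interleaving fragment groups.
def simulate(num_groups, run=20, stall=200, phases=3):
    ready = [0] * num_groups       # clock at which each group becomes runnable
    done_phases = [0] * num_groups # run phases each group has finished
    clock = busy = 0
    while any(p < phases for p in done_phases):
        g = next((g for g in range(num_groups)
                  if done_phases[g] < phases and ready[g] <= clock), None)
        if g is None:              # every group is stalled: the core idles
            clock = min(ready[i] for i in range(num_groups)
                        if done_phases[i] < phases)
            continue
        clock += run               # run group g until its next stall
        busy += run
        done_phases[g] += 1
        if done_phases[g] < phases:
            ready[g] = clock + stall   # texture data arrives later
    return busy / clock            # fraction of clocks the ALUs did work

print(simulate(1))    # one group: the core idles through every stall
print(simulate(4))    # four groups: much better
print(simulate(11))   # enough groups: stalls fully hidden
```

Each individual group takes longer to finish (it waits while others run), but total throughput goes up — the trade-off the slide states.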

SLIDE 39

Storing contexts

(Fetch/Decode, 8 ALUs, and a pool of context storage: 64 KB)

SLIDE 40

Twenty small contexts (maximal latency hiding ability)

(diagram: the 64 KB context pool divided into 20 small contexts, 1–20)

SLIDE 41

Twelve medium contexts

(diagram: the context pool divided into 12 medium contexts)

SLIDE 42

Four large contexts (low latency hiding ability)

(diagram: the context pool divided into 4 large contexts)

SLIDE 43

Clarification

Interleaving between contexts can be managed by HW or SW (or both!)

  • NVIDIA / AMD Radeon GPUs
    – HW schedules / manages all contexts (lots of them)
    – Special on-chip storage holds fragment state
  • Intel Larrabee
    – HW manages four x86 (big) contexts at fine granularity
    – SW scheduling interleaves many groups of fragments on each HW context
    – L1-L2 cache holds fragment state (as determined by SW)

SLIDE 44

My chip!

16 cores
8 mul-add ALUs per core (128 total)
16 simultaneous instruction streams
64 concurrent (but interleaved) instruction streams
512 concurrent fragments
= 256 GFLOPs (@ 1 GHz)
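The peak-rate arithmetic behind "= 256 GFLOPs" composes as follows (a mul-add counts as 2 flops):

```python
# Peak throughput of the hypothetical chip on the slide.
cores = 16
alus_per_core = 8
flops_per_alu_per_clock = 2          # one mul-add = multiply + add
clock_hz = 1e9                       # @ 1 GHz

alus = cores * alus_per_core
peak_gflops = alus * flops_per_alu_per_clock * clock_hz / 1e9

streams = 64                         # concurrent (interleaved) instruction streams
fragments_per_stream = 8             # each stream drives one 8-wide group
concurrent_fragments = streams * fragments_per_stream

print(alus, peak_gflops, concurrent_fragments)   # 128 256.0 512
```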

SLIDE 45

My “enthusiast” chip!

32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)

SLIDE 46

Summary: three key ideas

  • 1. Use many “slimmed down cores” to run in parallel
  • 2. Pack cores full of ALUs (by sharing instruction stream across groups of fragments)
    – Option 1: Explicit SIMD vector instructions
    – Option 2: Implicit sharing managed by hardware
  • 3. Avoid latency stalls by interleaving execution of many groups of fragments
    – When one group stalls, work on another group

SLIDE 47

Part 2:

Putting the three ideas into practice: a closer look at real GPUs

NVIDIA GeForce GTX 285
AMD Radeon HD 4890
Intel Larrabee (as proposed)

SLIDE 48

Disclaimer

  • The following slides describe “how one can think” about the architecture of NVIDIA, AMD, and Intel GPUs
  • Many factors play a role in actual chip performance

SLIDE 49

NVIDIA GeForce GTX 285

  • NVIDIA-speak:
    – 240 stream processors
    – “SIMT execution”
  • Generic speak:
    – 30 cores
    – 8 SIMD functional units per core

SLIDE 50

NVIDIA GeForce GTX 285 “core”

(diagram legend: instruction stream decode; SIMD functional unit, control shared across 8 units; execution context storage; multiply-add; multiply)

64 KB of storage for fragment contexts (registers)

SLIDE 51

NVIDIA GeForce GTX 285 “core”

64 KB of storage for fragment contexts (registers)

  • Groups of 32 [fragments/vertices/threads/etc.] share an instruction stream (they are called “warps”)
  • Up to 32 groups are simultaneously interleaved
  • Up to 1024 fragment contexts can be stored
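Those numbers compose as follows (the 4-clock issue cadence is an inference from running a 32-wide warp on 8 SIMD units, parallel to the AMD slide later — it is not stated here):

```python
# How the GTX 285 per-core numbers fit together.
warp_size = 32            # fragments sharing one instruction stream
warps_per_core = 32       # groups interleaved on one core
simd_units = 8            # SIMD functional units per core
cores = 30

contexts_per_core = warp_size * warps_per_core   # 1024 fragment contexts
issue_clocks = warp_size // simd_units           # a warp issues over 4 clocks
chip_fragments = cores * contexts_per_core       # the "30,000 fragments" claim

print(contexts_per_core, issue_clocks, chip_fragments)   # 1024 4 30720
```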
SLIDE 52

NVIDIA GeForce GTX 285

(diagram: the 30 cores, grouped with texture units (Tex))

There are 30 of these things on the GTX 285: 30,000 fragments!

SLIDE 53

NVIDIA GeForce GTX 285

  • Generic speak:
    – 30 processing cores
    – 8 SIMD functional units per core
    – Best case: 240 mul-adds + 240 muls per clock

SLIDE 54

AMD Radeon HD 4890

  • AMD-speak:
    – 800 stream processors
    – HW-managed instruction stream sharing (like “SIMT”)
  • Generic speak:
    – 10 cores
    – 16 “beefy” SIMD functional units per core
    – 5 multiply-adds per functional unit

SLIDE 55

AMD Radeon HD 4890 “core”

(diagram legend: instruction stream decode; SIMD functional unit, control shared across 16 units; execution context storage; multiply-add)

SLIDE 56

AMD Radeon HD 4890 “core”

  • Groups of 64 [fragments/vertices/etc.] share an instruction stream (AMD doesn’t have a fancy name like “WARP”)
    – One fragment processed by each of the 16 SIMD units
    – Repeat for four clocks

SLIDE 57

AMD Radeon HD 4890

(diagram: the 10 cores, grouped with texture units (Tex))

SLIDE 58

AMD Radeon HD 4890

  • Generic speak:
    – 10 processing “cores”
    – 16 “beefy” SIMD functional units per core
    – 5 multiply-adds per functional unit
    – Best case: 800 multiply-adds per clock
  • Scale of interleaving similar to NVIDIA GPUs
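The AMD counting works out the same way (the “× 4 clocks” factor for the group size comes from the previous slide; the five multiply-add slots per unit reflect the chip's five-wide units, as the slide states):

```python
# How "800 stream processors" decomposes for the Radeon HD 4890.
cores = 10
simd_units_per_core = 16
madds_per_unit = 5        # five multiply-adds per "beefy" functional unit

stream_processors = cores * simd_units_per_core * madds_per_unit
group_size = simd_units_per_core * 4   # one fragment per unit, over 4 clocks
madds_per_clock = stream_processors    # best case

print(stream_processors, group_size, madds_per_clock)   # 800 64 800
```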

SLIDE 59

Intel Larrabee

  • Intel speak:
    – We won’t say anything about core count or clock rate
    – Explicit 16-wide vector ISA
    – Each core interleaves four x86 instruction streams
    – Software implements additional interleaving
  • Generic speak:
    – That was the generic speak

SLIDE 60

Intel Larrabee “core”

32 KB of L1 cache, 256 KB of L2 cache; each HW context: 32 vector registers

(diagram legend: instruction stream decode; SIMD functional unit, control shared across 16 units; execution context storage / HW registers; mul-add)

SLIDE 61

Intel Larrabee

(diagram: an unspecified number of cores (?), with texture units (Tex))

SLIDE 62

The talk thus far: processing data

Part 3: moving data to processors

SLIDE 63

Recall: CPU-“style” core

(Fetch/Decode, ALU, Execution Context, OOO exec logic, branch pred. — and a big data cache)

SLIDE 64

CPU-“style” memory hierarchy

L1 cache (32 KB) → L2 cache (256 KB) → L3 cache (8 MB, shared across cores) → 25 GB/sec to memory

CPU cores run efficiently when data is resident in cache (caches reduce latency, provide high bandwidth)

SLIDE 65

Throughput core (GPU-style)

Fetch/Decode, 8 ALUs, Execution Contexts (64 KB) → 150 GB/sec to memory

More ALUs, no traditional cache hierarchy: need a high-bandwidth connection to memory

SLIDE 66

Bandwidth is critical

  • On a high-end GPU:
    – 11x compute performance of high-end CPU
    – 6x bandwidth to feed it
    – No complicated cache hierarchy
  • GPU memory system is designed for throughput
    – Wide bus (150 GB/sec)
    – Repack/reorder/interleave memory requests to maximize use of the memory bus

SLIDE 67

Bandwidth thought experiment

  • Element-wise multiply two long vectors A and B:
    Load input A[i]
    Load input B[i]
    Multiply
    Store result C[i]
  • 3 memory operations every 4 cycles (12 bytes)
  • Needs ~1 TB/sec of bandwidth on a high-end GPU — 7x available bandwidth
  • 15% efficiency… but 6x faster than a high-end CPU!
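The 15% and 6x figures fall out of the bandwidth numbers on the surrounding slides (150 GB/sec GPU bus, 25 GB/sec CPU bus; the ~1 TB/sec requirement is the slide's approximate figure):

```python
# Bandwidth arithmetic for the element-wise multiply experiment.
bytes_per_multiply = 12    # load A[i] + load B[i] + store C[i], 4 bytes each
needed_bw = 1e12           # ~1 TB/sec to keep every ALU fed
gpu_bw = 150e9             # GPU memory bus (slide 65)
cpu_bw = 25e9              # CPU memory bus (slide 64)

efficiency = gpu_bw / needed_bw   # fraction of peak ALU rate achievable
speedup = gpu_bw / cpu_bw         # both chips bandwidth-bound: ratio of buses

print(efficiency, speedup)        # 0.15 6.0
```

Since both processors are limited by memory, the GPU's win on this kernel is exactly its bandwidth advantage, not its ALU count.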

SLIDE 68

Bandwidth limited!

If processors request data at too high a rate, the memory system cannot keep up.

No amount of latency hiding helps this. Overcoming bandwidth limits is a common challenge for GPU-compute application developers.

SLIDE 69

Reducing required bandwidth

Request data less often (do more math)
Share/reuse data across fragments (increase on-chip storage)

SLIDE 70

Reducing required bandwidth

  • Two examples of on-chip storage
    – Texture cache
    – CUDA shared memory (“OpenCL local”)

(diagram: fragments 1–4 sharing cached texture data)

SLIDE 71

GPU memory hierarchy

Fetch/Decode, 8 ALUs, Execution Contexts (64 KB), Shared “local” storage (16 KB), Texture cache (read only) → Memory

On-chip storage takes load off the memory system. Many developers are calling for larger, more cache-like storage.

SLIDE 72

Blocks and warps on NVIDIA

  • The programmer groups threads into blocks, which are assigned to a stream processor
  • These in turn are grouped into warps: groups of threads that execute together. The number of threads in a warp is a function of the number of ALUs
  • Warps are a scheduling unit
  • Coalescing is attempted on memory accesses by a warp
    – Attempts to group a set of accesses into as few accesses as possible
    – Reduces the number of DMA operations, increases bandwidth
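The effect of coalescing can be sketched with a toy model (the 128-byte transaction granularity and the helper `transactions` are illustrative assumptions, not the exact HW rule): a warp's addresses are served with one transaction per aligned memory segment they touch.

```python
# Toy model of memory coalescing for a 32-thread warp.
SEGMENT = 128   # bytes per memory transaction (hypothetical granularity)

def transactions(addresses):
    # one transaction per distinct aligned segment touched by the warp
    return len({addr // SEGMENT for addr in addresses})

warp = range(32)
contiguous = [tid * 4 for tid in warp]      # A[tid]: 4-byte elements, adjacent
strided    = [tid * 128 for tid in warp]    # A[32*tid]: one segment per thread

print(transactions(contiguous))   # 1  (all 128 bytes fall in one segment)
print(transactions(strided))      # 32 (worst case: one transaction per thread)
```

Same amount of useful data in both cases, but the strided pattern issues 32x the transactions — which is why access patterns matter so much for the bandwidth limits above.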

SLIDE 73

Summary

SLIDE 74

Think of a GPU as a multi-core processor optimized for maximum throughput.

(currently at the extreme end of the design space)

SLIDE 75

An efficient GPU workload…

  • Has thousands of independent pieces of work
    – Uses many ALUs on many cores
    – Supports massive interleaving for latency hiding
  • Is amenable to instruction stream sharing
    – Maps to SIMD execution well
  • Is compute-heavy: the ratio of math operations to memory access is high
    – Not limited by bandwidth

SLIDE 76

Acknowledgements

  • Kurt Akeley
  • Solomon Boulos
  • Mike Doggett
  • Pat Hanrahan
  • Mike Houston
  • Jeremy Sugerman

SLIDE 77

Thank you

http://gates381.blogspot.com