Lecture 5: HW1 Discussion, Intro to GPUs

SLIDE 1

Lecture 5: HW1 Discussion, Intro to GPUs

G63.2011.002/G22.2945.001 · October 5, 2010

  • Discuss HW1
  • Intro to GPU Computing

SLIDE 2

Outline

  • Discuss HW1
  • Intro to GPU Computing

SLIDE 4

Dense Matrix Multiply: Blocking vs Scalar

We provided a blocked example matrix multiplication code. Why is blocked matmul faster than un-blocked?
Key: computational intensity.
Definition: flops per FPN (floating-point number) moved up the memory hierarchy.
Large intensity: good for deep memory hierarchies.
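One way to make this concrete (a standard accounting sketch, not from the slides): if a kernel performs f flops and moves m FPNs through the slow level of the hierarchy, its intensity is q = f/m, and a simple runtime model shows why large q matters:

    q = f / m,    T ≈ f·t_flop + m·t_mem = f·t_flop · (1 + (t_mem / t_flop) / q)

When q greatly exceeds the machine balance t_mem/t_flop, memory traffic no longer dominates the runtime.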

SLIDE 5

Computational Intensity for Scalar Matmul

Floating point operations: 2N³
Assume: Size(L1) ≪ N² FPNs
FPN traffic:
    N²    read each row of A once
  + N³    read each column of B N times
  + 2N²   read/write C
  = N³ + 3N² FPN-size cache misses (neglecting cache lines, etc.)
Computational intensity: 2N³ / (N³ + 3N²) ≈ 2
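For reference, the un-blocked triple loop that this count describes; a minimal sketch in C (row-major storage, names illustrative rather than matching the HW1 harness):

/* Naive N×N matmul: 2*N*N*N flops, C += A*B, row-major storage. */
void matmul_naive(int N, const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                C[i*N + j] += A[i*N + k] * B[k*N + j];
}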

SLIDE 6

Computational Intensity for Blocked Matmul

Floating point operations: still 2N³
b: block size, n = ⌈N/b⌉
FPN traffic:
    b²n³   read each of the n² blocks of A n times
  + b²n³   same for B
  + 2N²    read/write C
  = 2b²n³ + 2N² FPN-size cache misses
Rewrite: b²n³ ≈ b² (N/b)³ = N³/b
Computational intensity: 2N³ / (2N³/b + 2N²) ≈ 2N³ / (2N³/b) = b → incentive to choose b ≫ 2
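A minimal sketch of the blocked variant this count refers to (b is assumed to divide N; fringe handling omitted):

/* Blocked matmul: iterate over b×b tiles so each tile of A and B is
   reused from cache ~b times before being evicted. Illustrative only. */
void matmul_blocked(int N, int b, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += b)
        for (int jj = 0; jj < N; jj += b)
            for (int kk = 0; kk < N; kk += b)
                /* multiply the (ii,kk) block of A with the (kk,jj) block of B */
                for (int i = ii; i < ii + b; ++i)
                    for (int j = jj; j < jj + b; ++j)
                        for (int k = kk; k < kk + b; ++k)
                            C[i*N + j] += A[i*N + k] * B[k*N + j];
}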

SLIDE 7

Computational Intensity for Blocked Matmul

Floating point operations: still 2N³
b: block size, n = ⌈N/b⌉
FPN traffic:
    b²n³   read each of the n² blocks of A n times
  + b²n³   same for B
  + 2N²    read/write C
  = 2b²n³ + 2N² FPN-size cache misses
Rewrite: b²n³ ≈ b² (N/b)³ = N³/b
Computational intensity: 2N³ / (2N³/b + 2N²) ≈ 2N³ / (2N³/b) = b → incentive to choose b ≫ 2

The power of assumptions: can we choose b = N?

SLIDE 8

Hatching a Plan

Consider each level of the memory hierarchy. How do we exploit. . .

  • . . . L2: Ignore; we are nearly L2-local at most sizes.
  • . . . L1: 32 KiB = 4096 FPNs (doubles). Key: memory layout.
  • . . . registers: 16 FP registers. Key: loop/operation ordering.

SLIDE 9

Optimizing for L1: Memory Layout

Memory layout of A: column-major → only one entry of each cache line is used per fetch.
Better to store A in row-major order. Input is row-major.
If memory is available (not swap!), storing a transposed copy of A can be a good idea. (The copy takes O(N²) time.)
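If the access pattern of one operand fights the storage order, a one-time transposed copy is cheap next to the O(N³) multiply. A minimal sketch (names illustrative; which operand to transpose depends on the kernel's access pattern):

/* Make a transposed copy: O(N*N) work, done once before the O(N*N*N) multiply,
   so the kernel can walk both operands with unit stride. */
void transpose_copy(int N, const double *A, double *At)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            At[j*N + i] = A[i*N + j];
}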

SLIDE 11

Optimizing for L1: Reuse Pattern, Block Size

Question

Blocking: good idea. Optimal bL1? Follow-up question: How much needs to fit in L1?

SLIDE 12

Optimizing for L1: Reuse Pattern, Block Size

Question

Blocking: good idea. Optimal bL1? Follow-up question: How much needs to fit in L1? One block of each of A, B, C.

SLIDE 14

Optimizing for L1: Reuse Pattern, Block Size

Question

Blocking: good idea. Optimal b_L1?
Follow-up question: How much needs to fit in L1?
One block of each of A, B, C.
All of A, plus one column of B and C.
32 KiB: 8·b_L1² + 2·8·b_L1 bytes ≤ 32768 → b_L1 ≤ 60

SLIDE 15

L1 Block Copy

Further concerns:

  • Cache line boundaries
  • SIMD
  • Cache set conflicts

All solved by a small-block copy optimization. Copy all of A. Copy b_L1-sized blocks of A, B, and C, operate on those, then copy the output back.

SLIDE 16

L1 Block Copy: The Plan

Basic plan:

For each i:
  For each j:
    Load block C[i, j]
    For each k:
      Load block A[i, k]
      Load block B[k, j]
      ⌈b_L1/b_r⌉³ register kernels: C += A·B
    Store block C[i, j]

(can be improved: many A, B loads)

SLIDE 17

L1 Block Copy: The Plan

Basic plan:

For each i:
  For each j:
    Load block C[i, j]
    For each k:
      Load block A[i, k]
      Load block B[k, j]
      ⌈b_L1/b_r⌉³ register kernels: C += A·B
    Store block C[i, j]

(can be improved: many A, B loads)

Aside: also neatly deals with fringes. So: how does this solve the problems above? Can you define "alignment"? A C skeleton of this plan is sketched below.
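A minimal C skeleton of that plan, assuming b_L1 divides N (fringe and error handling omitted); the block size, copy routines, and names are illustrative, and the block kernel would in practice be tiled further into b_r×b_r register kernels as on the following slides:

enum { B_L1 = 48 };   /* illustrative L1 block size: 8*b*b + 2*8*b <= 32 KiB */

/* copy_in: gather a B_L1×B_L1 block starting at (r0,c0) of a row-major N×N
   matrix into a contiguous, column-major, cache/SIMD-friendly buffer. */
static void copy_in(int N, const double *M, int r0, int c0, double *blk)
{
    for (int j = 0; j < B_L1; ++j)
        for (int i = 0; i < B_L1; ++i)
            blk[i + j*B_L1] = M[(r0 + i)*N + (c0 + j)];
}

static void copy_out(int N, double *M, int r0, int c0, const double *blk)
{
    for (int j = 0; j < B_L1; ++j)
        for (int i = 0; i < B_L1; ++i)
            M[(r0 + i)*N + (c0 + j)] = blk[i + j*B_L1];
}

/* block_kernel: Cb += Ab * Bb on contiguous B_L1×B_L1 buffers. */
static void block_kernel(double *Cb, const double *Ab, const double *Bb)
{
    for (int j = 0; j < B_L1; ++j)
        for (int k = 0; k < B_L1; ++k)
            for (int i = 0; i < B_L1; ++i)
                Cb[i + j*B_L1] += Ab[i + k*B_L1] * Bb[k + j*B_L1];
}

void matmul_l1_blocked(int N, const double *A, const double *B, double *C)
{
    static double Ab[B_L1*B_L1], Bb[B_L1*B_L1], Cb[B_L1*B_L1];
    for (int i = 0; i < N; i += B_L1)
        for (int j = 0; j < N; j += B_L1) {
            copy_in(N, C, i, j, Cb);               /* load block C[i,j] */
            for (int k = 0; k < N; k += B_L1) {
                copy_in(N, A, i, k, Ab);           /* load block A[i,k] */
                copy_in(N, B, k, j, Bb);           /* load block B[k,j] */
                block_kernel(Cb, Ab, Bb);          /* C[i,j] += A[i,k]*B[k,j] */
            }
            copy_out(N, C, i, j, Cb);              /* store block C[i,j] */
        }
}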

SLIDE 18

Alignment

A memory address a is n-byte aligned when n is a power of two and a is a multiple of n bytes. (see also IBM devWorks article)

#include <stdlib.h>

/* dynamic allocation (array_size = number of bytes to allocate) */
double * __attribute__((aligned(64))) var;
int error = posix_memalign((void **) &var, 64, array_size);
if (error)
    abort();

/* static allocation */
double __attribute__((aligned(64))) ary2[500];

SLIDE 19

Alignment

A memory address a is n-byte aligned when n is a power of two and a is a multiple of n bytes. (see also IBM devWorks article)

#include <stdlib.h>

/* dynamic allocation (array_size = number of bytes to allocate) */
double * __attribute__((aligned(64))) var;
int error = posix_memalign((void **) &var, 64, array_size);
if (error)
    abort();

/* static allocation */
double __attribute__((aligned(64))) ary2[500];

Examples: Cache-line-aligned, SIMD-aligned. Code generation in the non-aligned case?

SLIDE 20

Register Kernel

Choose register block size b_r = 2^k, with b_L1 mod b_r = 0.

for (int j = 0; j < b_r; ++j)
  for (int k = 0; k < b_r; ++k)
    for (int i = 0; i < b_r; ++i)
      C[i + j*b_l1] += A[i + k*b_l1] * B[k + j*b_l1];

For each A·b matvec (one column of B): perform b_r scalar·vector updates.

  • Vectorizable
  • Pipeline-friendly (minimal data dependencies)
  • Access to A, C is unit-stride
  • Access to B is inner-loop invariant
  • Unrolling, software pipelining: left to the compiler

SLIDE 21

Psychoanalyzing the Compiler

Flags for Intel (icc):

  • -O3 -fno-alias -funroll-loops
  • -std=c99 -D_XOPEN_SOURCE=500
  • -opt-streaming-stores auto -static
  • -fast -xHost

Flags for GCC:

  • -O3 -funroll-loops -march=native
  • -std=c99 -D_XOPEN_SOURCE=500
  • -ftree-vectorizer-verbose=2
  • -ffast-math

GCC 4.3 is sometimes better than GCC 4.4. Self-study material:

  • Compiler references: Intel, GNU
  • C99 restrict keyword, aliasing (see the sketch below)
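A quick illustration of what restrict buys (my sketch, mirroring the effect of icc's -fno-alias): the qualifier promises the compiler that the pointers do not overlap, so it can keep values in registers and vectorize without per-iteration aliasing checks.

/* C99 restrict: x and y are promised not to alias, so the compiler may
   vectorize and reorder loads/stores freely (illustrative example). */
void axpy(int n, double alpha, const double *restrict x, double *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}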

SLIDE 22

Profiling

OProfile: a sampling profiler. Uses performance counters. Linux-only, needs root.

SLIDE 23

Profiling

OProfile: a sampling profiler. Uses performance counters. Linux-only, needs root.

Many event types countable:

  • CPU_CLK_UNHALTED: clock cycles when not halted
  • L2_RQSTS: number of L2 cache requests
  • LLC_MISSES: L2 cache demand requests from this core that missed the L2
  • FLOPS: number of FP computational micro-ops executed
  • IDLE_DURING_DIV: cycles the divider is busy and all other execution units are idle
  • L1D_ALL_REF: all references to the L1 data cache
  • L1D_PEND_MISS: total number of outstanding L1 data cache misses at any cycle
  • IFU_MEM_STALL: cycles the instruction fetch pipe is stalled
  • INST_RETIRED: number of instructions retired
  • UOPS_RETIRED: number of µops retired
  • MACHINE_NUKES_SMC: number of pipeline-flushing events
  • RAT_STALLS: partial register stall cycles
  • BR_INST_DECODED: number of branch instructions decoded

SLIDE 25

Profiling

OProfile: a sampling profiler. Uses performance counters. Linux-only, needs root.

[Annotated assembly of the inner kernel: per-instruction OProfile sample counts for the FLOPS and L1D_PEND_MISS events, showing where flops and outstanding L1 data cache misses concentrate.]

SLIDE 27

Solution Performance

[Plot: MFlop/s vs. matrix dimension N (100-800) for the basic, tuned, and BLAS versions.]

git clone ssh://git@forge.tiker.net:2234/hw1-solution.git (Private, works if you signed up for an account.)

SLIDE 28

Solution Performance

[Plot: MFlop/s vs. matrix dimension N (100-800) for the basic, tuned, and BLAS versions.]

git clone ssh://git@forge.tiker.net:2234/hw1-solution.git (Private; works if you signed up for an account.) Great, but: most BLAS implementations lose out to triple loops for special-case matrices. Want to see the code of a "real" BLAS? GotoBLAS2.

SLIDE 29

Key Messages of HW1

In HPC:

  • Very simple things quickly become rather complex.
  • Need: ideas, careful analysis.
  • Flexibility ↔ performance
  • Run-time code generation can be useful.

This class helps by introducing

  • known tricks
  • helpful tools.

SLIDE 30

Key Messages of HW1

In HPC:

  • Very simple things quickly become rather complex.
  • Need: ideas, careful analysis.
  • Flexibility ↔ performance
  • Run-time code generation can be useful.

This class helps by introducing

  • known tricks
  • helpful tools.

Matmul is a "microcosm" of single-processor optimization. Do not worry if you did not figure out the tricks here on your own.

SLIDE 31

Questions?

?

SLIDE 32

Outline

  • Discuss HW1
  • Intro to GPU Computing
SLIDE 33

GPUs: System Context

SLIDE 34

GPUs: System Context

Processor Memory

SLIDE 35

GPUs: System Context

Processor, Memory, Expansion Slots: PCI-Express (x4, x16, x1, x16) and regular PCI. PCIe v2, x16 bandwidth: ~6 GB/s.

SLIDE 36

GPUs: System Context

Processor, Memory, Expansion Slots: PCI-Express (x4, x16, x1, x16) and regular PCI. PCIe v2, x16 bandwidth: ~6 GB/s. The GPU goes here.

SLIDE 38

GPU Computing?

  • Design target for CPUs:
      • Make a single thread very fast
      • Take control away from the programmer
  • GPU computing takes a different approach:
      • Throughput matters; single threads do not
      • Give explicit control to the programmer

SLIDE 39

“CPU-style” Cores

ALU (Execute)
Fetch/Decode
Execution Context
Out-of-order control logic
Fancy branch predictor
Memory pre-fetcher
Data cache (a big one)

Credit: Kayvon Fatahalian (Stanford)

SLIDE 40

Slimming down

ALU (Execute)
Fetch/Decode
Execution Context

Idea #1: Remove components that help a single instruction stream run fast.

Credit: Kayvon Fatahalian (Stanford)

SLIDE 41

More Space: Double the Number of Cores

Two cores, each with: ALU (Execute), Fetch/Decode, Execution Context.

Credit: Kayvon Fatahalian (Stanford)

SLIDE 42

. . . again

Four cores, each with: ALU (Execute), Fetch/Decode, Execution Context.

Credit: Kayvon Fatahalian (Stanford)

SLIDE 43

. . . and again


16 cores = 16 simultaneous instruction streams

Credit: Kayvon Fatahalian (Stanford)

SLIDE 44

. . . and again


16 cores = 16 simultaneous instruction streams

Credit: Kayvon Fatahalian (Stanford)

→ 16 independent instruction streams. Reality: the instruction streams are not actually very different/independent.

SLIDE 45

Saving Yet More Space

Fetch/Decode
ALU (Execute)
Execution Context

Credit: Kayvon Fatahalian (Stanford)

SLIDE 46

Saving Yet More Space

Fetch/Decode
ALU (Execute)
Execution Context

Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs → SIMD

Credit: Kayvon Fatahalian (Stanford)

SLIDE 47

Saving Yet More Space

Fetch/Decode
ALU 1 · ALU 2 · ALU 3 · ALU 4 · ALU 5 · ALU 6 · ALU 7 · ALU 8
Execution Context

Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs → SIMD

Credit: Kayvon Fatahalian (Stanford)

SLIDE 48

Saving Yet More Space

Fetch/Decode
ALU 1 · ALU 2 · ALU 3 · ALU 4 · ALU 5 · ALU 6 · ALU 7 · ALU 8
Ctx · Ctx · Ctx · Ctx · Ctx · Ctx · Ctx · Ctx
Shared Ctx Data

Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs → SIMD (see the sketch below)

Credit: Kayvon Fatahalian (Stanford)
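To connect Idea #2 to something familiar on the CPU side, a minimal sketch using SSE2 intrinsics (my illustration, not from the slides): one decoded instruction drives multiple data lanes.

#include <emmintrin.h>   /* SSE2 intrinsics */

/* One addpd instruction (one fetch/decode) adds two double lanes at once.
   Illustrative only; n is assumed to be even. */
void add_vectors(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));
    }
}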

SLIDE 49

Gratuitous Amounts of Parallelism!

16 cores = 128 ALUs

Credit: Kayvon Fatahalian (Stanford)

SLIDE 50

Gratuitous Amounts of Parallelism!

16 cores = 128 ALUs

Credit: Kayvon Fatahalian (Stanford)

Example: 128 instruction streams in parallel; 16 independent groups of 8 synchronized streams.

SLIDE 51

Gratuitous Amounts of Parallelism!

16 cores = 128 ALUs

Credit: Kayvon Fatahalian (Stanford)

Example: 128 instruction streams in parallel; 16 independent groups of 8 synchronized streams. Great if everybody in a group does the same thing. But what if not? What leads to divergent instruction streams?

SLIDE 52

Branches

[Diagram: ALU 1 . . . ALU 8 executing over time (clocks).]

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

Credit: Kayvon Fatahalian (Stanford)

SLIDE 53

Branches

[Diagram: ALU 1 . . . ALU 8 executing over time (clocks).]

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

Per-ALU branch outcome: T T T F F F F F

Credit: Kayvon Fatahalian (Stanford)

SLIDE 54

Branches

[Diagram: ALU 1 . . . ALU 8 executing over time (clocks).]

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

Per-ALU branch outcome: T T T F F F F F

Not all ALUs do useful work! Worst case: 1/8 performance.

Credit: Kayvon Fatahalian (Stanford)

SLIDE 56

Remaining Problem: Slow Memory

Problem

Memory still has very high latency . . . but we've removed most of the hardware that helps us deal with that. We've removed:

  • caches
  • branch prediction
  • out-of-order execution

So what now?

SLIDE 57

Remaining Problem: Slow Memory

Problem

Memory still has very high latency . . . but we've removed most of the hardware that helps us deal with that. We've removed:

  • caches
  • branch prediction
  • out-of-order execution

So what now? Idea #3: Even more parallelism + some extra memory = a solution!

SLIDE 58

Remaining Problem: Slow Memory

Problem

Memory still has very high latency . . . but we've removed most of the hardware that helps us deal with that. We've removed:

  • caches
  • branch prediction
  • out-of-order execution

So what now?

[Diagram: Fetch/Decode, 8 ALUs, 8 contexts (Ctx) plus Shared Ctx Data.]

Idea #3: Even more parallelism + some extra memory = a solution!

SLIDE 59

Remaining Problem: Slow Memory

Problem

Memory still has very high latency . . . but we've removed most of the hardware that helps us deal with that. We've removed:

  • caches
  • branch prediction
  • out-of-order execution

So what now?

[Diagram: Fetch/Decode, 8 ALUs, context storage partitioned into four groups 1-4.]

Idea #3: Even more parallelism + some extra memory = a solution!

SLIDE 60

Hiding Memory Latency

[Diagram: one core (Fetch/Decode, 8 ALUs, 8 contexts plus shared context data) processing fragments 1 . . . 8 over time (clocks).]

Credit: Kayvon Fatahalian (Stanford)

SLIDE 61

Hiding Memory Latency

[Diagram: the core's context storage now holds four fragment groups (1 . . . 8, 9 . . . 16, 17 . . . 24, 25 . . . 32) sharing one Fetch/Decode unit and 8 ALUs.]

Credit: Kayvon Fatahalian (Stanford)

SLIDE 62

Hiding Memory Latency

[Diagram: when fragment group 1 stalls on memory, the core switches to a runnable group; the four groups take turns on the ALUs over time (clocks).]

Credit: Kayvon Fatahalian (Stanford)

SLIDE 64

Hiding Memory Latency

[Diagram: all four fragment groups alternate between stall and runnable; as long as some group is runnable, the ALUs stay busy.]

Credit: Kayvon Fatahalian (Stanford)

SLIDE 65

Hiding Memory Latency

[Diagram: each of the four fragment groups starts, runs, stalls, resumes, and finishes in turn; the stalls of one group are covered by work from the others.]

Increase run time of one group to maximize throughput of many groups.

Credit: Kayvon Fatahalian (Stanford)

SLIDE 66

GPU Architecture Summary

Core ideas:

  • 1. Many slimmed-down cores → lots of parallelism
  • 2. More ALUs, fewer control units
  • 3. Avoid memory stalls by interleaving execution of SIMD groups

Credit: Kayvon Fatahalian (Stanford)

SLIDE 67

GPU-CPU Bird’s Eye Comparison

Floorplan: VIA Isaiah (2008) 65 nm, 4 SP ops at a time, 1 MiB L2. Floorplan: AMD RV770 (2008) 55 nm, 800 SP ops at a time.

SLIDE 68

Nvidia GTX200

[Diagram: 30 cores. Each core: Fetch/Decode, 8 ALUs, DP ALU, 32 kiB Ctx Private, 16 kiB Ctx Shared.]

Off-chip Memory 150 GB/s

SLIDE 69

GPU Architecture (e.g. Nvidia GT200)

  • 1 GPU = 30 SIMD cores
  • 1 SIMD core: 32 × 32 PCs, HW sched + 1 ID (1/4 clock) + 8 SP + 1 DP + 16 KiB shared + 32 KiB reg
  • Device ↔ RAM: 140 GB/s
  • Device ↔ Host: 6 GB/s
  • User manages memory hierarchy

SLIDE 70

What is OpenCL?

OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec]

  • Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)
  • Vendor-neutral
  • Comes with RTCG (run-time code generation)

Defines:

  • Host-side programming interface (library)
  • Device-side programming language (!) (see the kernel sketch below)
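As a taste of the device-side language (a minimal sketch; the kernel name and arguments are made up for illustration), OpenCL C looks like C99 with a few qualifiers and built-ins:

/* Each work-item scales one element of x; get_global_id(0) returns the
   work-item's index in the first (and here only) launch dimension. */
__kernel void scale(__global float *x, const float alpha)
{
    size_t i = get_global_id(0);
    x[i] *= alpha;
}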

SLIDE 71

Questions?

?

SLIDE 72

Image Credits

  • Blocks: sxc.hu/Avolore
  • Flag: sxc.hu/Ambrozjo
  • Mainboard: Wikimedia Commons
  • PCI Express slots: Wikimedia Commons
  • Fighting chips: flickr.com/oskay
  • Isaiah die shot: VIA Technologies
  • RV770 die shot: AMD Corp.
  • Nvidia Tesla Architecture: Nvidia Corp.
