SLIDE 1

Single processor tuning (2/2)

  • Prof. Richard Vuduc

Georgia Institute of Technology
CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.16]
Thursday, February 28, 2008

SLIDE 2

Today’s sources

  • CS 267 (Demmel & Yelick @ UCB; Spring 2007)
  • “A family of high-performance matrix multiplication algorithms,” by Gunnels, et al. (2006)
  • “Anatomy of high-performance matrix multiplication,” by Goto and van de Geijn (2006)
  • “An experimental comparison of cache-oblivious and cache-conscious programs,” by Yotov, et al. (SPAA 2007)
  • Talk by Matteo Frigo at CScADS Autotuning Workshop (2007)

SLIDE 3

Review: GPGPUs.

(I don’t know; you tell me!)

SLIDE 4

Review: A one-level model of the memory hierarchy

SLIDE 5

A simple model of memory

  • m ≡ no. of words moved from slow to fast memory
  • f ≡ no. of flops
  • α ≡ time per slow memory op.
  • τ ≡ time per flop
  • q ≡ f/m = flop-to-mop ratio (“computational intensity”)
  • α/τ ≡ “machine balance”

Total time:

T = f·τ + m·α = f·τ·(1 + (α/τ)·(1/q))
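To make the model concrete, here is a quick worked example with hypothetical numbers (α/τ = 10 is illustrative, not a value from these slides):

    % Time model, normalized by the pure-flop time f·τ:
    \[ \frac{T}{f\tau} \;=\; 1 + \frac{\alpha}{\tau}\cdot\frac{1}{q} \]
    % With \alpha/\tau = 10 and q = 2:   1 + 10/2  = 6   (about 17% of peak)
    % With \alpha/\tau = 10 and q = 100: 1 + 10/100 = 1.1 (within 10% of peak)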
SLIDE 6

Blocked (tiled) matrix multiply

(Figure: C, A, B partitioned into b×b blocks, indexed by I, J, K.)

m ≈ n³/b  ⇒  q ≈ b,  so  T/(f·τ) = 1 + (α/τ)·(1/b)

// Let I, J, K = blocks of b indices
for I ← index blocks 1 to n/b do
  for J ← index blocks 1 to n/b do
    // Read block C_IJ
    for K ← index blocks 1 to n/b do
      // Read block A_IK
      // Read block B_KJ
      C_IJ ← C_IJ + A_IK · B_KJ
    // Write C_IJ to slow memory
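As a concrete companion to the pseudocode, a minimal C sketch of one-level blocking (my reconstruction, not the course's reference code; row-major storage, n assumed divisible by b):

    /* One-level blocked (tiled) matrix multiply: C += A*B.
       Row-major n-by-n matrices; block size b divides n. */
    #include <stddef.h>

    void matmul_blocked(size_t n, size_t b,
                        const double *A, const double *B, double *C)
    {
        for (size_t I = 0; I < n; I += b)
            for (size_t J = 0; J < n; J += b)
                for (size_t K = 0; K < n; K += b)
                    /* C[I:I+b, J:J+b] += A[I:I+b, K:K+b] * B[K:K+b, J:J+b] */
                    for (size_t i = I; i < I + b; ++i)
                        for (size_t k = K; k < K + b; ++k) {
                            double a_ik = A[i*n + k];
                            for (size_t j = J; j < J + b; ++j)
                                C[i*n + j] += a_ik * B[k*n + j];
                        }
    }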

SLIDE 7

Can we do better? Nope.

Theorem [Hong and Kung (1981)]: Any schedule of conventional matrix multiply must transfer Ω(n³ / √M) words between slow and fast memory, where M < n²/6.

Last time: We saw the intuitive proof by Toledo (1999). Historical note: Rutledge & Rubinstein (1951–52).

So cache-blocked matrix multiply is asymptotically optimal:

b = O(√M)  ⇒  m = O(n³/b) = O(n³/√M)

SLIDE 8

Architectural implications

Arch.       ≈ α/τ    M (min. fast memory)
Ultra 2i    25       1.5 MB
Ultra 3     14       460 KB
Pentium 3   6.3      94 KB
P-3M        10       240 KB
Power3      8.8      180 KB
Power4      15       527 KB
Itanium 1   36       3.0 MB
Itanium 2   5.5      71 KB

M ≡ size of fast memory. Blocking needs 3b² ≤ M with q ≈ b, so M ≥ 3q². To run within 10% of peak we need 1 + (α/τ)·(1/q) < 1.1, i.e., q ≥ 10·(α/τ), hence M ≥ 300·(α/τ)².

Note: “M” in bytes to 2 digits; assumes 8-byte (double-precision) words

SLIDE 9

What happens in practice?

Experiment: one-level cache-blocked matrix multiply; the (square) block size b is chosen by exhaustive search over sizes up to 64.
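A sketch of what that search harness could look like in C (my reconstruction; it reuses the matmul_blocked sketch from Slide 6 and the coarse clock() timer):

    #include <stddef.h>
    #include <time.h>

    extern void matmul_blocked(size_t n, size_t b,
                               const double *A, const double *B, double *C);

    /* Return the square block size b in 1..64 that runs fastest. */
    size_t best_block_size(size_t n, const double *A, const double *B, double *C)
    {
        size_t best_b = 1;
        double best_t = 1e300;
        for (size_t b = 1; b <= 64; ++b) {
            if (n % b != 0) continue;        /* keep the sketch simple */
            clock_t t0 = clock();
            matmul_blocked(n, b, A, B, C);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (t < best_t) { best_t = t; best_b = b; }
        }
        return best_b;
    }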

SLIDE 10

Tiled MM on AMD Opteron: 2.2 GHz (4.4 Gflop/s peak), 1 MB L2 cache. Result: < 25% of peak! We evidently still have a lot of work to do...

SLIDE 11

Review: Real memory hierarchies

SLIDE 12

What happened at powers of 2?

  • Byte-addressable, 32-bit addresses
  • Cache: direct-mapped, 8 KB capacity, 16-byte lines

The line index comes from address bits 4–12 (X = don’t care); the listing below enumerates the 512 possible index patterns, one per 16 B line. Any two addresses that agree in these bits — e.g., addresses a multiple of 8 KB apart, as arise at power-of-2 matrix dimensions — map to the same line and evict each other:

XXXX XXXX XXXX XXXX XXX0 0000 0000 0000
XXXX XXXX XXXX XXXX XXX0 0000 0001 0000
XXXX XXXX XXXX XXXX XXX0 0000 0010 0000
XXXX XXXX XXXX XXXX XXX0 0000 0011 0000
XXXX XXXX XXXX XXXX XXX0 0000 0100 0000
XXXX XXXX XXXX XXXX XXX0 0000 0101 0000
...
XXXX XXXX XXXX XXXX XXX1 1111 1111 0000   ← 16 B lines
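A tiny C illustration of the conflict (hypothetical addresses; the cache parameters are the ones above):

    #include <stdint.h>
    #include <stdio.h>

    /* 8 KB direct-mapped cache with 16 B lines => 512 lines;
       bits 4..12 of the address select the line. */
    static unsigned cache_line_index(uint32_t addr)
    {
        return (addr >> 4) & 0x1FFu;
    }

    int main(void)
    {
        /* Two addresses 8 KB apart map to the same line and evict
           each other -- exactly what a power-of-2 stride produces. */
        printf("%u %u\n", cache_line_index(0x1000u),
                          cache_line_index(0x1000u + 8192u));
        return 0;   /* prints the same index twice */
    }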

SLIDE 13

(Figure: memory hierarchy, fast to slow: registers → L1 → L2 → main memory.)

SLIDE 14

(Figure: memory hierarchy with the TLB added: registers → L1 → TLB → L2 → main memory.)

SLIDE 15

TLB is part of the memory hierarchy

Translation Look-aside Buffer (TLB) for virtual address space management

  • Divide address space into pages (4–32 KB typical; larger possible)
  • Page table maps virtual to physical addresses and records whether a page is in memory or on disk
  • Page table can be large; the TLB caches recent translations
  • May be set-associative or fully associative

Conceptually like a cache with a large block size, i.e., 1 page

  • May have multiple levels of TLB, just like cache
  • Can prefetch to hide cache misses, but not TLB misses

SLIDE 16

Experiment to observe memory parameters

Strided-stream through an array and measure the average access time (the Saavedra-Barrera benchmark).
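A minimal sketch of such a probe (my reconstruction of the idea, not Saavedra-Barrera's code; clock() is coarse, so real runs need many repetitions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Sweep a buffer of `size` bytes at a given stride and return
       the average time per access, in seconds. */
    double avg_access_time(char *buf, size_t size, size_t stride, size_t reps)
    {
        volatile char sink = 0;
        size_t accesses = 0;
        clock_t t0 = clock();
        for (size_t r = 0; r < reps; ++r)
            for (size_t i = 0; i < size; i += stride, ++accesses)
                sink += buf[i];
        return ((double)(clock() - t0) / CLOCKS_PER_SEC) / (double)accesses;
    }

    int main(void)
    {
        size_t size = 1u << 22;              /* 4 MB working set */
        char *buf = malloc(size);
        memset(buf, 1, size);
        for (size_t s = 4; s <= 512; s *= 2) /* knees reveal line/page sizes */
            printf("stride %4zu: %g s/access\n", s,
                   avg_access_time(buf, size, s, 20));
        free(buf);
        return 0;
    }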

SLIDE 17

Average Memory Access Time (Saavedra-Barrera) — Sun Ultra IIi (333 MHz)

(Figure: measured average access time vs. stride for various array sizes.)

  • L1: 16 KB, 16 B lines
  • L2: 2 MB, 64 B lines
  • TLB: 8 KB pages, 32 entries
  • Mem

SLIDE 18

Average Memory Access Time (Saavedra-Barrera) — Pentium III (Katmai; 550 MHz)

(Figure: measured average access time vs. stride for various array sizes.)

  • L1: 16 KB, 32 B lines
  • L2: 512 KB, 32 B lines
  • TLB: 4 KB pages, 64 entries
  • Mem

SLIDE 19

General multi-level blocking

[Goto & van de Geijn (2006)]

SLIDE 20

C ← C + A · B

(Figure: C (m×n) = C + A (m×k) · B (k×n), partitioned into panels.)

Shape taxonomy: “matrix-matrix” → “matrix-panel” → “panel-matrix” → “panel-panel” (“fat outer product”)

SLIDES 21–24

(Figures: animation frames stepping through the matrix-panel / panel-matrix decomposition of C, A, and B; no text content.)

SLIDE 25

C ← C + A · B

(Figure: C (m×n) = C + A (m×k) · B (k×n), partitioned into blocks and panels.)

Shape taxonomy: “matrix-matrix” → “block-panel” → “panel-block” → “fat dot product”

SLIDES 26–29

(Figures: animation frames stepping through the block-panel decomposition; no text content.)

SLIDE 30

(Figure: C (m×n), A (m×k), B (k×n) tiled with block sizes b_m, b_k, b_n.)

// Let I, J, K = blocks of indices
for K ← blocks 1 to k/bk do
  for I ← blocks 1 to m/bm do
    for J ← blocks 1 to n/bn do
      C_IJ ← C_IJ + A_IK × B_KJ

SLIDE 31

(Figure: a b_m×b_k block of A multiplies a b_k×n panel of B, b_n columns at a time.)

// “Block-panel” multiply
// Load bm × bk block of A into cache
for J ← blocks 1 to n/bn do
  // Load bk × bn block of B_J into cache
  // Load bm × bn block of C_J into cache
  C_J ← C_J + A × B_J
  // Store bm × bn block of C_J to memory

Cache capacity constraint: b_m·b_k + (b_k + b_m)·b_n ≤ M

Assumes:

  • 1. A, B_J, C_J fit in cache (of size M)
  • 2. Above ⇒ product runs at peak
  • 3. A not evicted prematurely

SLIDE 32

(Figure: the same block-panel multiply, annotated with its flop and memory-traffic counts.)

f = 2·b_m·b_k·n
m = b_m·b_k + (b_k + 2·b_m)·n
⇓
q = f/m = 2 / (1/n + 1/b_m + 2/b_k)

Cache capacity constraint: b_m·b_k + (b_k + b_m)·b_n ≤ M

Assumes:

  • 1. A, B_J, C_J fit in cache (of size M)
  • 2. Above ⇒ product runs at peak
  • 3. A not evicted prematurely
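To see where the expression for q comes from, divide f by m and normalize by b_m·b_k·n (a small derivation, consistent with the counts above):

    \[
      q \;=\; \frac{f}{m}
        \;=\; \frac{2\,b_m b_k n}{b_m b_k + (b_k + 2 b_m)\,n}
        \;=\; \frac{2}{\frac{1}{n} + \frac{1}{b_m} + \frac{2}{b_k}}
        \;\approx\; \frac{2}{\frac{1}{b_m} + \frac{2}{b_k}}
        \quad (n \gg b_m, b_k)
    \]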

SLIDE 33

(Figure: same block-panel picture.)

Given a multi-level memory hierarchy, in which cache should the “A” block live?

❖ Want a large A block
❖ L1 cache is usually quite small
❖ What about L2?

Assumes:

  • 1. A, B_J, C_J fit in cache (of size M)
  • 2. Above ⇒ product runs at peak
  • 3. A not evicted prematurely

b_m·b_k + (b_k + b_m)·b_n ≤ M

If A lives in L2, computing each b_m×b_n sub-block of C (2·b_m·b_k·b_n flops) must take at least as long as streaming the A block (b_m·b_k words) from L2:

2·b_m·b_k·b_n / ρ₁ ≥ b_m·b_k / β₂   ⇓   b_n ≥ ρ₁ / (2·β₂)

ρ₁ ≡ peak L1 flop/s; β₂ ≡ peak L2-to-CPU bandwidth

Typically, b_n ≥ 2 suffices.
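Plugging in hypothetical numbers (mine, not the slides') shows the scale of this bound:

    % Suppose rho_1 = 4 Gflop/s and beta_2 = 1 Gword/s (hypothetical values):
    \[
      b_n \;\ge\; \frac{\rho_1}{2\beta_2}
          \;=\; \frac{4\times 10^{9}}{2\times 10^{9}}
          \;=\; 2,
    \]
    % consistent with the slide's remark that b_n >= 2 typically suffices.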

SLIDE 34

(Figure: same block-panel picture, with both constraints annotated.)

Assumes:

  • 1. A, B_J, C_J fit in cache (of size M)
  • 2. Above ⇒ product runs at peak
  • 3. A not evicted prematurely

b_m·b_k + (b_k + b_m)·b_n ≤ M
b_n ≥ ρ₁ / (2·β₂)

SLIDE 35

(Figure: same block-panel picture.)

What about the TLB?

Assumes:

  • 1. A, B_J, C_J fit in cache (of size M)
  • 2. Above ⇒ product runs at peak
  • 3. A not evicted prematurely

b_m·b_k + (b_k + b_m)·b_n ≤ M
b_n ≥ ρ₁ / (2·β₂)

SLIDE 36

Considerations for the TLB

Matrix: n = 1024, column-major order
TLB: 4 KB pages, 32 entries

(Figure: each 1024-double column occupies 8 KB = 2 pages, so a block spanning just 16 columns already touches all 32 TLB entries; column 17 starts on page 33. Walking a block row-wise therefore thrashes the TLB.)

SLIDE 37

(Figure: same block-panel picture.)

What about the TLB? The block of A straddles pages, so re-pack it on-the-fly ⇒ the “copy optimization”. Copy the B panel as well.

Assumes:

  • 1. A, B_J, C_J fit in cache (of size M)
  • 2. Above ⇒ product runs at peak
  • 3. A not evicted prematurely
  • 4. Operands “fit in” the TLB

b_m·b_k + (b_k + b_m)·b_n ≤ M
b_n ≥ ρ₁ / (2·β₂)

SLIDE 38

“Panel-block” (“fat dot product”) variant

SLIDE 39

(Figure: C (m×n), A (m×k), B (k×n) tiled with block sizes b_m, b_k, b_n.)

// Let I, J, K = blocks of indices
for K ← blocks 1 to k/bk do
  B̃ ← B_{K,⋆}           // Pack row-panel of B
  for I ← blocks 1 to m/bm do
    Ã ← A_IK             // Pack block of A
    for J ← blocks 1 to n/bn do
      C̃ ← Ã × B̃_J        // Compute in buffer C̃
      C_IJ ← C_IJ + C̃    // Unpack C̃
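A minimal C sketch of the packing step that makes the copy optimization concrete (my reconstruction; column-major A with leading dimension lda):

    #include <stddef.h>

    /* Pack the bm-by-bk block of A starting at A_IK (column-major,
       leading dimension lda) into a contiguous buffer Atilde, so the
       inner kernel touches few pages and avoids conflict misses. */
    void pack_A(size_t lda, size_t bm, size_t bk,
                const double *A_IK, double *Atilde)
    {
        for (size_t k = 0; k < bk; ++k)      /* one column at a time */
            for (size_t i = 0; i < bm; ++i)
                Atilde[k*bm + i] = A_IK[k*lda + i];
    }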

SLIDE 40

(Figure: the block-panel and panel-block decompositions side by side.)

Which is better?

SLIDE 41

Source: Vuduc, Demmel, Bilmes (IJHPCA 2004)

(Figure: “Dense Matrix Multiply Performance (Square n×n Operands)” on a 333 MHz Sun Ultra 2i; Mflop/s and fraction of peak vs. matrix dimension n. Curves, best to worst: Vendor; Reg/insn-level + cache tiling + copy opt.; Cache tiling + copy opt.; Reference.)

SLIDE 42

(Figure: “Dense Matrix Multiply Performance (Square n×n Operands)” on an 800 MHz Intel Pentium III-mobile; Mflop/s and fraction of peak vs. matrix dimension n. Curves: Vendor; Goto-BLAS; Reg/insn-level + cache tiling + copy; Cache tiling + copy opt.; Reference.)

Source: Vuduc, Demmel, Bilmes (IJHPCA 2004)

SLIDE 43

(Figure: within the b_m×b_n sub-problem, the inner kernel computes an r×c register block of C.)

Inner kernel: instruction scheduling and register allocation
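For illustration, a 2×2 register-blocked inner kernel in plain C (a sketch under the assumption r = c = 2; production kernels choose r×c to fill the register file and are usually hand-scheduled or written in assembly):

    #include <stddef.h>

    /* C(2x2) += A(2xK) * B(Kx2), column-major with leading dimensions
       lda, ldb, ldc. The four C values live in registers for the whole
       K loop, so each A and B element is loaded exactly once. */
    void micro_kernel_2x2(size_t K,
                          const double *A, size_t lda,
                          const double *B, size_t ldb,
                          double *C, size_t ldc)
    {
        double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
        for (size_t k = 0; k < K; ++k) {
            double a0 = A[0 + k*lda], a1 = A[1 + k*lda];
            double b0 = B[k + 0*ldb], b1 = B[k + 1*ldb];
            c00 += a0*b0;  c01 += a0*b1;
            c10 += a1*b0;  c11 += a1*b1;
        }
        C[0 + 0*ldc] += c00;  C[0 + 1*ldc] += c01;
        C[1 + 0*ldc] += c10;  C[1 + 1*ldc] += c11;
    }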

SLIDE 44

Administrivia

SLIDE 45

Two joint classes with CS 8803 SC

  • Tues 2/19: Floating-point issues in parallel computing, by me
  • Tues 2/26: GPGPUs, by Prof. Hyesoon Kim

Scribe?

Both classes meet in Klaus 1116E

SLIDE 46

Homework 1: Parallel conjugate gradients

Extension: due Wednesday 2/27 @ 8:30 am

Implement a parallel solver for Ax = b (serial C version provided)

  • Evaluate on three matrices: a 27-pt stencil and two application matrices
  • “Simplified”: no preconditioning

Use performance models to understand the scalability of your implementation

  • Make measurements
  • Build predictive models

Collaboration encouraged: compare programming models or platforms

SLIDE 47

Administrative stuff

  • New room (dumpier, but cozier?): College of Computing Building (CCB) 101
  • Accounts: apparently, you already have them
  • Front-end login node: ccil.cc.gatech.edu (CoC Unix account)
  • We “own” warp43–warp56
  • Some docs (MPI): http://www-static.cc.gatech.edu/projects/ihpcl/mpi.html
  • Sign up for the mailing list: https://mailman.cc.gatech.edu/mailman/listinfo/ihpc-lab

SLIDE 48

Projects

Your goal should be to do something useful, interesting, and/or publishable!

  • Something you’re already working on, suitably adapted for this course
  • Faculty-sponsored/mentored
  • Collaborations encouraged

SLIDE 49

My criteria for “approving” your project

“Relevant to this course:” Many themes, so think (and “do”) broadly

  • Parallelism and architectures
  • Numerical algorithms
  • Programming models
  • Performance modeling/analysis

SLIDE 50

General styles of projects

  • Theoretical: prove something hard (high risk)
  • Experimental:
      – Parallelize something
      – Take an existing parallel program and improve it using models & experiments
      – Evaluate an algorithm, architecture, or programming model

SLIDE 51

Examples

  • Anything of interest to a faculty member/project outside CoC
  • Parallel sparse triple product (R·A·Rᵀ, used in multigrid)
  • Future FFT
  • Out-of-core or I/O-intensive data analysis and algorithms
  • Block iterative solvers (convergence & performance trade-offs)
  • Sparse LU
  • Data structures and algorithms (trees, graphs)
  • Look at mixed precision
  • Discrete-event approaches to continuous-systems simulation
  • Automated performance analysis and modeling, tuning
  • “Unconventional,” but related:
      – Distributed deadlock detection for MPI
      – UPC language extensions (dynamic block sizes)
      – Exact linear algebra

SLIDE 52

Inner-kernel

SLIDE 53

Doesn’t the compiler do scheduling and reg. allocation?

Theorem (Motwani, et al., 1995): Given a DAG, finding the schedule and register assignment that minimize register spills is NP-hard.

Theorem (Belady, 1966): Given a DAG and a schedule, the register assignment that minimizes register spills can be found in ≈ linear time.

Source: Talk by M. Frigo at the CScADS autotuning workshop (2007)

SLIDE 54

Loop unrolling: reducing loop overheads

Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
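The slide's own code isn't in this transcript; the following generic sketch shows the idea on a dot product (4-way unrolling, n assumed divisible by 4):

    #include <stddef.h>

    /* Unrolled 4x: one branch and one counter update per four
       multiply-adds, instead of per one. */
    double ddot_unroll4(size_t n, const double *x, const double *y)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i += 4) {
            s += x[i]   * y[i];
            s += x[i+1] * y[i+1];
            s += x[i+2] * y[i+2];
            s += x[i+3] * y[i+3];
        }
        return s;
    }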

SLIDE 55

Scalar expansion: removing serial dependencies

Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
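Again the slide's code isn't reproduced here; a generic sketch: the single accumulator in the unrolled loop above forms a serial dependence chain, and expanding it into independent partial sums lets the floating-point pipeline overlap the additions:

    #include <stddef.h>

    /* Four expanded accumulators break the add-to-add dependence;
       n assumed divisible by 4. */
    double ddot_expand4(size_t n, const double *x, const double *y)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i < n; i += 4) {
            s0 += x[i]   * y[i];
            s1 += x[i+1] * y[i+1];
            s2 += x[i+2] * y[i+2];
            s3 += x[i+3] * y[i+3];
        }
        return (s0 + s1) + (s2 + s3);
    }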

SLIDE 56

Unroll-and-jam + register blocking

Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
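A generic matrix-vector sketch of unroll-and-jam (not the slide's code): unroll the outer loop 2-way, jam the copies into one inner loop, and keep the two running sums in registers so each x[j] load is reused twice:

    #include <stddef.h>

    /* y += A*x for row-major A (leading dimension lda);
       n assumed divisible by 2. */
    void matvec_unroll_jam2(size_t n, const double *A, size_t lda,
                            const double *x, double *y)
    {
        for (size_t i = 0; i < n; i += 2) {
            double y0 = 0.0, y1 = 0.0;        /* register block of y */
            for (size_t j = 0; j < n; ++j) {
                double xj = x[j];             /* loaded once, used twice */
                y0 += A[(i    )*lda + j] * xj;
                y1 += A[(i + 1)*lda + j] * xj;
            }
            y[i]     += y0;
            y[i + 1] += y1;
        }
    }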

SLIDE 57

Software pipelining: interleave iterations to delay dependent instructions

(Figure: instructions from iterations i−4, i−3, i, i+1 interleaved in flight.)

Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
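A hand-pipelined sketch in C (illustrative only; compilers and hand-written kernels pipeline more deeply): issue iteration i+1's load before the multiply that consumes iteration i's value, so the load latency is hidden:

    #include <stddef.h>

    /* y[i] = alpha * x[i], with the load for iteration i+1 issued
       one iteration early (software pipeline of depth 1). */
    void scale_pipelined(size_t n, double alpha, const double *x, double *y)
    {
        if (n == 0) return;
        double next = x[0];                   /* prologue: first load */
        for (size_t i = 0; i + 1 < n; ++i) {
            double cur = next;
            next = x[i + 1];                  /* load ahead ...           */
            y[i] = alpha * cur;               /* ... overlaps the multiply */
        }
        y[n - 1] = alpha * next;              /* epilogue */
    }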

SLIDE 58

Fetch scheduling, for cache lines and hardware prefetching engines

Source: Clint Whaley’s code optimization course (UTSA Spring 2007)

SLIDE 59

Software prefetching

Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
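A sketch using the GCC/Clang __builtin_prefetch intrinsic (the builtin is real; the distance PF_AHEAD is a made-up tuning parameter, not a value from the slides):

    #include <stddef.h>

    #define PF_AHEAD 8   /* elements ahead; machine-dependent tuning knob */

    /* Sum an array, requesting each element's cache line a fixed
       distance ahead of its use so miss latency overlaps the adds. */
    double sum_prefetched(size_t n, const double *x)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) {
            if (i + PF_AHEAD < n)
                __builtin_prefetch(&x[i + PF_AHEAD], /*rw=*/0, /*locality=*/3);
            s += x[i];
        }
        return s;
    }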

SLIDE 60

(Figure: repeat of the Pentium III-mobile dense matrix multiply performance plot from Slide 42.)

SLIDE 61

“In conclusion…”

SLIDE 62

Backup slides

SLIDE 63

(Figure: average memory access time; measured parameters below.)

  • L1: 32 KB, 128 B lines, ~0.5+ cy
  • L2: 8 MB, 128 B lines, ~6 cy
  • TLB: 4 KB pages, 256 entries
  • Mem: ~21 cy?