Autotuning (1/2): Cache-oblivious algorithms Prof. Richard Vuduc - - PowerPoint PPT Presentation




SLIDE 1

Autotuning (1/2): Cache-oblivious algorithms

  • Prof. Richard Vuduc

Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.17] Tuesday, March 4, 2008

SLIDE 2

Today’s sources

  • CS 267 (Demmel & Yelick @ UCB; Spring 2007)
  • “An experimental comparison of cache-oblivious and cache-conscious programs?” by Yotov, et al. (SPAA 2007)
  • “The memory behavior of cache oblivious stencil computations,” by Frigo & Strumpen (2007)
  • Talks by Matteo Frigo and Kaushik Datta at CScADS Autotuning Workshop (2007)
  • Demaine’s @ MIT: http://courses.csail.mit.edu/6.897/spring03/scribe_notes

SLIDE 3

Review: Tuning matrix multiply

SLIDE 4

Tiled MM on AMD Opteron 2.2 GHz (4.4 Gflop/s peak), 1 MB L2 cache

  • < 25% peak!
  • We evidently still have a lot of work to do...

SLIDE 5

[Figure: memory hierarchy, ordered from fast to slow: Registers, L1, TLB, L2, Main]

SLIDE 6

[Figure: matrices C, B, and A]

SLIDE 7

Software pipelining: Interleave iterations to delay dependent instructions

[Figure: instructions from iterations i-4, i-3, i, and i+1 interleaved in one loop body]

Source: Clint Whaley’s code optimization course (UTSA Spring 2007)

SLIDE 8

[Figure: Dense Matrix Multiply Performance (Square n×n Operands) on an 800 MHz Intel Pentium III-mobile. Horizontal axis: matrix dimension (n). Vertical axes: Performance (Mflop/s) and Fraction of Peak. Curves: Vendor; Goto-BLAS; Reg/insn-level + cache tiling + copy; Cache tiling + copy opt.; Reference.]

Source: Vuduc, Demmel, Bilmes (IJHPCA 2004)

SLIDE 9

Cache-oblivious matrix multiply

[Yotov, Roeder, Pingali, Gunnels, Gustavson (SPAA 2007)] [Talk by M. Frigo at CScADS Autotuning Workshop 2007]

SLIDE 10

Memory model for analyzing cache-oblivious algorithms

  • Two-level memory hierarchy: fast (cache) and slow (main memory)
  • M = capacity of cache (“fast”)
  • L = cache line size
  • Fully associative
  • Optimal replacement: evicts the line whose next use is most distant
      • Sleator & Tarjan (CACM 1985): LRU and FIFO are within a constant factor of optimal, given a cache larger by a constant factor
  • “Tall-cache” assumption: M ≥ Ω(L²)
      • Limits: See Brodal & Fagerberg (STOC 2003)
      • When might this not hold?
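
The two replacement policies named in the model can be compared on a small trace. What follows is a minimal sketch, not part of the lecture: the function names `count_misses_opt` and `count_misses_lru` and the reference string are made up for illustration, and each reference is assumed to touch one whole cache line.

```python
from collections import OrderedDict

def count_misses_opt(refs, capacity):
    """Misses under optimal replacement: evict the line whose next use is farthest away."""
    cache, misses = set(), 0
    for pos, line in enumerate(refs):
        if line in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            def next_use(c):
                # Position of the next reference to c, or infinity if there is none.
                for later in range(pos + 1, len(refs)):
                    if refs[later] == c:
                        return later
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(line)
    return misses

def count_misses_lru(refs, capacity):
    """Misses under least-recently-used replacement."""
    cache, misses = OrderedDict(), 0
    for line in refs:
        if line in cache:
            cache.move_to_end(line)      # mark as most recently used
            continue
        misses += 1
        if len(cache) == capacity:
            cache.popitem(last=False)    # drop the least recently used line
        cache[line] = True
    return misses

# Hypothetical trace of cache-line ids, with room for 3 lines in the cache.
trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
opt_misses = count_misses_opt(trace, 3)
lru_misses = count_misses_lru(trace, 3)
```

On this trace the optimal policy never misses more often than LRU, which is the kind of gap the Sleator & Tarjan result bounds.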

SLIDE 11

A recursive algorithm for matrix-multiply

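[Figure: C, A, and B each partitioned into 2×2 blocks: C11, C12, C21, C22; A11, A12, A21, A22; B11, B12, B21, B22]

  • Divide all dimensions in half
  • Bilardi, et al.: Use Gray-code ordering

Cost (flops):

$$
T(n) =
\begin{cases}
8 \cdot T\left(\frac{n}{2}\right) & n > 1 \\
O(1) & n = 1
\end{cases}
\;=\; O(n^3)
$$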
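
The recursion above can be sketched in a few lines. This is an illustrative version only, assuming square matrices stored as lists of lists with n a power of two; the name `matmul_recursive` is hypothetical, and the eight subproblems are visited in plain loop order rather than the Gray-code order attributed to Bilardi, et al.

```python
def matmul_recursive(A, B, C, i0=0, j0=0, k0=0, n=None):
    """Accumulate the n-by-n block product A[i0.., k0..] * B[k0.., j0..] into C[i0.., j0..].

    Assumes n is a power of two. Each call splits i, j, and k in half,
    giving the 8 recursive calls of T(n) = 8 T(n/2).
    """
    if n is None:
        n = len(A)
    if n == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]   # the O(1) base case
        return
    h = n // 2
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                matmul_recursive(A, B, C, i0 + di, j0 + dj, k0 + dk, h)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
matmul_recursive(A, B, C)
```

No cache parameter appears anywhere in the code, which is the sense in which the algorithm is cache-oblivious.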

SLIDE 12

A recursive algorithm for matrix-multiply

[Figure: C, A, and B each partitioned into 2×2 blocks: C11, C12, C21, C22; A11, A12, A21, A22; B11, B12, B21, B22]

  • Divide all dimensions in half
  • Bilardi, et al.: Use Gray-code ordering

I/O Complexity?

SLIDE 13

A recursive algorithm for matrix-multiply

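[Figure: C, A, and B each partitioned into 2×2 blocks: C11, C12, C21, C22; A11, A12, A21, A22; B11, B12, B21, B22]

  • Divide all dimensions in half
  • Bilardi, et al.: Use Gray-code ordering
  • No. of misses, with tall-cache assumption:

$$
Q(n) =
\begin{cases}
8 \cdot Q\left(\frac{n}{2}\right) & \text{if } n > \sqrt{\frac{M}{3}} \\
\frac{3n^2}{L} & \text{otherwise}
\end{cases}
\;\le\; \Theta\left(\frac{n^3}{L\sqrt{M}}\right)
$$
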
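
The closed form follows by unrolling the miss recurrence until the three blocks fit in cache. The steps below are a sketch under simplifying assumptions not stated on the slide: n is a power-of-two multiple of the base-case size n_0, and k denotes the number of halvings needed to reach it.

```latex
\begin{align*}
Q(n) &= 8^{k}\, Q(n_0),
  \qquad n_0 = \frac{n}{2^{k}},
  \quad \tfrac{1}{2}\sqrt{M/3} < n_0 \le \sqrt{M/3} \\
     &= \left(\frac{n}{n_0}\right)^{3} \cdot \frac{3 n_0^{2}}{L}
      = \frac{3 n^{3}}{n_0 L} \\
     &< \frac{6\sqrt{3}\, n^{3}}{L\sqrt{M}}
      = \Theta\!\left(\frac{n^{3}}{L\sqrt{M}}\right)
\end{align*}
```

The last inequality uses the lower bound on n_0: the recursion stops at the first size that fits, so the base case is never more than a factor of two smaller than the largest block that fits.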
SLIDE 14

Alternative: Divide longest dimension (Frigo, et al.)

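[Figure: three ways to halve the longest dimension of C (m×n) = A (m×k) · B (k×n): split m (A into A1, A2 and C into C1, C2), split k (A into A1, A2 and B into B1, B2), or split n (B into B1, B2 and C into C1, C2)]

Cache misses:

$$
Q(m, k, n) \le
\begin{cases}
\Theta\left(\frac{mk + kn + mn}{L}\right) & \text{if } mk + kn + mn \le \alpha M \\
2\,Q\left(\frac{m}{2}, k, n\right) & \text{if } m \ge k, n \\
2\,Q\left(m, \frac{k}{2}, n\right) & \text{if } k > m,\ k \ge n \\
2\,Q\left(m, k, \frac{n}{2}\right) & \text{otherwise}
\end{cases}
\;=\; \Theta\left(\frac{mkn}{L\sqrt{M}}\right)
$$
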
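
The case analysis in the recurrence maps directly onto code. The sketch below is illustrative, not the authors' implementation: `matmul_longest` is a hypothetical name, matrices are lists of lists, all three dimensions are assumed to be at least 1, and the recursion runs all the way down to single elements.

```python
def matmul_longest(A, B, C, i0, i1, k0, k1, j0, j1):
    """Accumulate A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C[i0:i1, j0:j1].

    Halves whichever of m, k, n is longest, using the same case order as
    the recurrence: m first, then k, otherwise n.
    """
    m, k, n = i1 - i0, k1 - k0, j1 - j0
    if m == 1 and k == 1 and n == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif m >= k and m >= n:              # split rows of A and C
        mid = i0 + m // 2
        matmul_longest(A, B, C, i0, mid, k0, k1, j0, j1)
        matmul_longest(A, B, C, mid, i1, k0, k1, j0, j1)
    elif k > m and k >= n:               # split columns of A and rows of B
        mid = k0 + k // 2
        matmul_longest(A, B, C, i0, i1, k0, mid, j0, j1)
        matmul_longest(A, B, C, i0, i1, mid, k1, j0, j1)
    else:                                # split columns of B and C
        mid = j0 + n // 2
        matmul_longest(A, B, C, i0, i1, k0, k1, j0, mid)
        matmul_longest(A, B, C, i0, i1, k0, k1, mid, j1)

# Rectangular example: A is 3x5, B is 5x2, so C is 3x2.
A = [[i + k for k in range(5)] for i in range(3)]
B = [[k * j + 1 for j in range(2)] for k in range(5)]
C = [[0] * 2 for _ in range(3)]
matmul_longest(A, B, C, 0, 3, 0, 5, 0, 2)
```

Unlike the all-dimensions variant, this version handles rectangular operands and sizes that are not powers of two without padding.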
SLIDE 15

Relax tall-cache assumption using suitable layout

  • Row-major: Need tall cache
  • Row-block-row: M ≥ Ω(L)
  • Morton Z: No assumption

Source: Yotov, et al. (SPAA 2007) and Frigo, et al. (FOCS ’99)
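
The layouts differ only in how an element (i, j) maps to a linear offset. The index functions below are a sketch under assumed conventions that the slide does not specify: row bits occupy the odd bit positions in the Morton-Z code, and row-block-row stores nb×nb blocks in row-major order with row-major elements inside each block. The function names are hypothetical.

```python
def row_major_index(i, j, n):
    """Offset of element (i, j) in an n-by-n row-major array."""
    return i * n + j

def row_block_row_index(i, j, n, nb):
    """Offset when nb-by-nb blocks are stored row-major, each block row-major inside.

    Assumes nb divides n.
    """
    blocks_per_row = n // nb
    block = (i // nb) * blocks_per_row + (j // nb)
    return block * nb * nb + (i % nb) * nb + (j % nb)

def morton_z_index(i, j, bits=16):
    """Offset in Morton-Z order: interleave the bits of i (odd positions) and j (even)."""
    z = 0
    for b in range(bits):
        z |= ((j >> b) & 1) << (2 * b)
        z |= ((i >> b) & 1) << (2 * b + 1)
    return z

# Offsets of the first row of a 4x4 matrix under each layout.
first_row_rm = [row_major_index(0, j, 4) for j in range(4)]
first_row_rbr = [row_block_row_index(0, j, 4, 2) for j in range(4)]
first_row_z = [morton_z_index(0, j) for j in range(4)]
```

In row-major order a small square block touches one cache line per row, which is where the tall-cache requirement comes from; the blocked and recursive layouts keep a block contiguous instead.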

SLIDE 16

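Latency-centric vs. bandwidth-centric views of blocking

[Figure: blocked matrix multiply with index ranges I, J, and K]

Latency-centric view:

$$
\text{Time per flop} \approx 1 + \frac{\alpha}{\tau} \cdot \frac{1}{b},
\qquad
\frac{\alpha}{\tau} \cdot \frac{1}{\kappa} \le b \le \sqrt{\frac{M}{3}}
$$

Bandwidth-centric view (assume can perfectly overlap computation & communication), with peak flop/cy ≡ φ and bandwidth, word/cy ≡ β:

$$
\frac{2n^3}{b} \cdot \frac{1}{\beta} \le 2n^3 \cdot \frac{1}{\phi}
\quad\Longrightarrow\quad
\frac{\phi}{\beta} \le b
$$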

SLIDE 17

Latency-centric vs. bandwidth-centric views of blocking

Example platform: Itanium 2

[Figure: Itanium 2 memory hierarchy (FPU, Registers, L1, L2, L3, Memory), 2 FMAs/cycle, annotated with per-level bandwidths and block-size ranges]

  • Registers ↔ L2: 1 ≤ b_R ≤ 6, 1.33 ≤ β(R,L2) ≤ 4
  • L2 ↔ L3: 1 ≤ b_L2 ≤ 6, 1.33 ≤ β(L2,L3) ≤ 4
  • L3 ↔ Memory: 8 ≤ b_L3 ≤ 418, 0.02 ≤ β(L3,Memory) ≤ 0.5

  • Consider L3 ←→ memory bandwidth
      • φ = 4 flops / cycle; β = 0.5 words / cycle
      • L3 capacity = 4 MB (512 kwords)
      • Need 8 ≤ b_L3 ≤ 418
  • Implications: Approximate cache-oblivious blocking works
      • Wide range of block sizes should be OK
      • If upper bound > 2 × lower bound, divide-and-conquer generates a block size in range

Source: Yotov, et al. (SPAA 2007)
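
The L3 numbers can be reproduced from the bandwidth-centric bound φ/β ≤ b and the capacity bound b ≤ √(M/3). The snippet below is a small sketch with hypothetical helper names; the matrix dimension 5000 is an arbitrary example, and halving is assumed to round up.

```python
import math

def block_size_bounds(phi, beta, cache_words):
    """Return (lower, upper) bounds on the block size b.

    lower = phi / beta           (bandwidth-centric bound)
    upper = sqrt(cache_words/3)  (three b-by-b blocks must fit in cache)
    """
    return phi / beta, math.sqrt(cache_words / 3)

def first_block_in_range(n, upper):
    """Halve n (rounding up) until it no longer exceeds the upper bound."""
    while n > upper:
        n = (n + 1) // 2
    return n

# Itanium 2, L3 <-> memory: 4 flops/cycle, 0.5 words/cycle, 512 kwords of L3.
lower, upper = block_size_bounds(phi=4, beta=0.5, cache_words=512 * 1024)
b = first_block_in_range(5000, upper)
```

Because the upper bound is far more than twice the lower bound here, repeated halving cannot step over the admissible range, which is the argument for why approximate blocking suffices.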

SLIDE 18

Cache-oblivious vs. cache-aware

  • Does cache-oblivious perform as well as cache-aware?
  • If not, what can be done?
  • Next: Summary of Yotov, et al., study (SPAA 2007)
      • Stole slides liberally

SLIDE 19

All- vs. largest-dimension

Similar; assume “all-dim”

SLIDE 20

Data structures

Morton-Z complicated and yields same or worse performance, so assume row-block-row

SLIDE 21

Example 1: Ultra IIIi

  • 1 GHz ⇒ 2 Gflop/s peak
  • Memory hierarchy
      • 32 registers
      • L1 = 64 KB, 4-way
      • L2 = 1 MB, 4-way
  • Sun compiler

SLIDE 22
  • Iterative: triply nested loop
  • Recursive: down to 1 x 1 x 1

[Diagram: Outer Control Structure (Iterative, Recursive); Inner Control Structure (Statement)]

SLIDE 23

  • Recursion down to NB
  • Unfold completely below NB to get a basic block
  • Micro-Kernel:
      • The basic block compiled with native compiler
  • Best performance for NB = 12
  • Compiler unable to use registers
  • Unfolding reduces control overhead
      • limited by I-cache

[Diagram: Outer Control Structure (Iterative, Recursive); Inner Control Structure (Statement, Recursive); Micro-Kernel (None / Compiler)]

SLIDE 24
  • Recursion down to NB
  • Unfold completely below NB to get a basic block
  • Micro-Kernel
      • Scalarize all array references in the basic block
      • Compile with native compiler

[Diagram: Outer Control Structure (Iterative, Recursive); Inner Control Structure (Statement, Recursive); Micro-Kernel (None / Compiler)]

SLIDE 25
  • Recursion down to NB
  • Unfold completely below NB to get a basic block
  • Micro-Kernel
      • Perform Belady’s register allocation on the basic block
      • Schedule using BRILA compiler

[Diagram: Outer Control Structure (Iterative, Recursive); Inner Control Structure (Statement, Recursive); Micro-Kernel (None / Compiler, Scalarized / Compiler, Belady / BRILA)]

SLIDE 26
  • Recursion down to NB
  • Unfold completely below NB to get a basic block
  • Micro-Kernel
      • Construct a preliminary schedule
      • Perform Graph Coloring register allocation
      • Schedule using BRILA compiler

[Diagram: Outer Control Structure (Iterative, Recursive); Inner Control Structure (Statement, Recursive); Micro-Kernel (None / Compiler, Scalarized / Compiler, Belady / BRILA, Coloring / BRILA)]

SLIDE 27
  • Recursion down to MU x NU x KU
  • Micro-Kernel
      • Completely unroll MU x NU x KU triply nested loop
      • Construct a preliminary schedule
      • Perform Graph Coloring register allocation
      • Schedule using BRILA compiler

[Diagram: Outer Control Structure (Iterative, Recursive); Inner Control Structure (Statement, Recursive, Iterative); Micro-Kernel (None / Compiler, Scalarized / Compiler, Belady / BRILA, Coloring / BRILA)]

SLIDE 28

  • Recursion down to NB
  • Mini-Kernel
      • NB x NB x NB triply nested loop
      • Tiling for L1 cache
      • Body is Micro-Kernel

[Diagram: Outer Control Structure (Iterative, Recursive); Inner Control Structure (Statement, Recursive, Iterative); Mini-Kernel; Micro-Kernel (None / Compiler, Scalarized / Compiler, Belady / BRILA, Coloring / BRILA)]
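
The combination of a recursive outer structure with an iterative mini-kernel can be sketched as follows. This is an illustration of the control structure only, not the code generated in the Yotov, et al., study: the names are hypothetical, n is assumed to be NB times a power of two, and the innermost statement stands in for the tuned micro-kernel.

```python
def mini_kernel(A, B, C, i0, j0, k0, nb):
    """Iterative nb x nb x nb triply nested loop on one block."""
    for i in range(i0, i0 + nb):
        for j in range(j0, j0 + nb):
            acc = C[i][j]                 # keep the running sum in a scalar
            for k in range(k0, k0 + nb):
                acc += A[i][k] * B[k][j]  # stand-in for the micro-kernel body
            C[i][j] = acc

def matmul_cutoff(A, B, C, i0, j0, k0, n, nb):
    """Recurse on all three dimensions, switching to the mini-kernel at size nb."""
    if n <= nb:
        mini_kernel(A, B, C, i0, j0, k0, n)
        return
    h = n // 2
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                matmul_cutoff(A, B, C, i0 + di, j0 + dj, k0 + dk, h, nb)

n, nb = 8, 2
A = [[i - j for j in range(n)] for i in range(n)]
B = [[i + 2 * j for j in range(n)] for i in range(n)]
C = [[0] * n for _ in range(n)]
matmul_cutoff(A, B, C, 0, 0, 0, n, nb)
```

Cutting off at NB trades a little cache-obliviousness for much lower call overhead, since the recursion no longer descends to single elements.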

SLIDE 29

[Diagram: Outer Control Structure (Iterative, Recursive); Inner Control Structure (Statement, Recursive, Iterative); Mini-Kernel (ATLAS CGw/S, ATLAS Unleashed); Micro-Kernel (None / Compiler, Scalarized / Compiler, Belady / BRILA, Coloring / BRILA)]

Specialized code generator with search

SLIDE 30

[Diagram: Outer Control Structure (Iterative, Recursive); Inner Control Structure (Statement, Recursive, Iterative); Mini-Kernel (ATLAS CGw/S, ATLAS Unleashed); Micro-Kernel (None / Compiler, Scalarized / Compiler, Belady / BRILA, Coloring / BRILA)]

SLIDE 31

Summary: Engineering considerations

  • Need to cut off recursion
  • Careful scheduling/tuning required at “leaves”
  • Yotov, et al., report that full recursion + tuned micro-kernel achieves ≤ 2/3 of the best performance
  • Open issues
      • Recursively scheduled kernels are worse than iteratively scheduled kernels. Why?
      • Prefetching needed, but how best to apply in recursive case?

SLIDE 32

Administrivia

SLIDE 33

Upcoming schedule changes

  • Some adjustment of topics (TBD)
  • Tu 3/11: Project proposals due
  • Th 3/13: SIAM Parallel Processing (attendance encouraged)
  • Tu 4/1: No class
  • Th 4/3: Attend talk by Doug Post from DoD HPC Modernization Program

SLIDE 34

Homework 1: Parallel conjugate gradients

  • Put name on write-up!
  • Grading: 100 pts max
      • Correct implementation: 50 pts
      • Evaluation: 30 pts
          • Tested on two sample matrices: 5
          • Implemented and tested on stencil: 10
          • “Explained” performance (e.g., per proc, load balance, comp. vs. comm): 15
      • Performance model: 15 pts
      • Write-up “quality”: 5 pts

SLIDE 35

Projects

  • Proposals due Tu 3/11
  • Your goal should be to do something useful, interesting, and/or publishable!
      • Something you’re already working on, suitably adapted for this course
      • Faculty-sponsored/mentored
      • Collaborations encouraged

SLIDE 36

My criteria for “approving” your project

  • “Relevant to this course:” Many themes, so think (and “do”) broadly
      • Parallelism and architectures
      • Numerical algorithms
      • Programming models
      • Performance modeling/analysis

SLIDE 37

General styles of projects

  • Theoretical: Prove something hard (high risk)
  • Experimental:
      • Parallelize something
      • Take existing parallel program, and improve it using models & experiments
      • Evaluate algorithm, architecture, or programming model

SLIDE 38

Examples

  • Anything of interest to a faculty member/project outside CoC
  • Parallel sparse triple product (R*A*R^T, used in multigrid)
  • Future FFT
  • Out-of-core or I/O-intensive data analysis and algorithms
  • Block iterative solvers (convergence & performance trade-offs)
  • Sparse LU
  • Data structures and algorithms (trees, graphs)
  • Look at mixed-precision
  • Discrete-event approaches to continuous systems simulation
  • Automated performance analysis and modeling, tuning
  • “Unconventional,” but related
      • Distributed deadlock detection for MPI
      • UPC language extensions (dynamic block sizes)
      • Exact linear algebra

SLIDE 39

Switch: M. Frigo’s talk slides from CScADS 2007 autotuning workshop

http://cscads.rice.edu/workshops/july2007/autotune-workshop-07

SLIDE 40

Cache-oblivious stencil computations

[Frigo and Strumpen (ICS 2005)] [Datta, et al. (2007)]

SLIDE 41

Cache-oblivious stencil computation

[Figure: space-time diagram, with space axis x starting at x=0 and time axis t starting at t=0]

SLIDE 42

Cache-oblivious stencil computation

[Figure: space-time (x, t) diagram for the case w < 2×h]

SLIDE 43

Cache-oblivious stencil computation

[Figure: space-time (x, t) diagram for the case w < 2×h ⇒ “Time-cut”]

SLIDE 44

Cache-oblivious stencil computation

[Figure: space-time (x, t) diagram for the case w ≥ 2×h]

SLIDE 45

Cache-oblivious stencil computation

[Figure: space-time (x, t) diagram for the case w ≥ 2×h ⇒ “Space-cut”]

SLIDE 46

Cache-oblivious stencil computation

[Figure: space-time (x, t) diagram for the case w ≥ 2×h ⇒ “Space-cut”]

SLIDE 47

Cache-oblivious stencil computation

[Figure: space-time (x, t) diagram for the case w < 2×h ⇒ “Time-cut”]

SLIDE 48

Cache-oblivious stencil computation

[Figure: space-time (x, t) diagram]

Theorem [Frigo & Strumpen (ICS 2005)]: d = dimension ⇒

$$
Q(n, t; d) = O\left(\frac{n^d \cdot t}{M^{1/d}}\right)
$$

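
The space-cut and time-cut rules can be sketched for a 1-D, 3-point stencil. The code below is an adaptation written for these notes, following the recursive trapezoid walk described by Frigo & Strumpen; the heat-equation kernel, the fixed boundary values, the stencil slope ds = 1, and all function names are assumptions made for illustration.

```python
def walk1(t0, t1, x0, dx0, x1, dx1, kernel, ds=1):
    """Visit every point of the trapezoid with time range [t0, t1), whose
    left edge starts at x0 and moves by dx0 per step and whose right edge
    starts at x1 and moves by dx1 per step."""
    dt = t1 - t0
    if dt == 1:
        for x in range(x0, x1):
            kernel(t0, x)
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * ds * dt:
            # Space cut: the trapezoid is wide, so split it along a line of slope -ds.
            xm = (2 * (x0 + x1) + (2 * ds + dx0 + dx1) * dt) // 4
            walk1(t0, t1, x0, dx0, xm, -ds, kernel, ds)
            walk1(t0, t1, xm, -ds, x1, dx1, kernel, ds)
        else:
            # Time cut: the trapezoid is tall, so do the lower half, then the upper half.
            s = dt // 2
            walk1(t0, t0 + s, x0, dx0, x1, dx1, kernel, ds)
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1, kernel, ds)

def heat_cache_oblivious(u0, steps):
    """Run `steps` sweeps of a 3-point averaging stencil; the end points stay fixed."""
    u = [list(u0), list(u0)]              # two buffers, indexed by t mod 2
    def kernel(t, x):
        cur, nxt = u[t % 2], u[(t + 1) % 2]
        nxt[x] = 0.25 * cur[x - 1] + 0.5 * cur[x] + 0.25 * cur[x + 1]
    walk1(0, steps, 1, 0, len(u0) - 1, 0, kernel)
    return u[steps % 2]

def heat_naive(u0, steps):
    """Reference version: one full sweep over space per time step."""
    cur = list(u0)
    for _ in range(steps):
        nxt = list(cur)
        for x in range(1, len(cur) - 1):
            nxt[x] = 0.25 * cur[x - 1] + 0.5 * cur[x] + 0.25 * cur[x + 1]
        cur = nxt
    return cur

initial = [float((7 * x) % 11) for x in range(40)]
result = heat_cache_oblivious(initial, 25)
reference = heat_naive(initial, 25)
```

Both versions perform the same arithmetic on each point; only the order in which points of the space-time grid are visited differs, and that order is what reduces cache misses.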
SLIDE 49

Source: Datta, et al. (2007)

Cache-oblivious stencil computation: Fewer misses but more time

SLIDE 50

Cache-conscious algorithm

[Figure: space-time (x, t) diagram with block size b]

SLIDE 51

Cache-conscious algorithm

Source: Datta, et al. (2007)

SLIDE 52

“In conclusion…”

SLIDE 53

Backup slides
