Single processor tuning (1/2)
- Prof. Richard Vuduc
Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.14] Thursday, February 21, 2008
Today's sources:
CS 267 (Yelick @ UCB; Spring 2007)
“A survey of out-of-core algorithms in numerical linear algebra,” by Toledo (1999)
“A family of high-performance matrix multiplication algorithms,” by Gunnels, et al. (2006)
“On reducing TLB misses in matrix multiplication,” by Goto and van de Geijn (2002)
“Is search really necessary to generate high-performance BLAS?” by Yotov, et al. (2005)
Larger problems magnify errors: round-off, ill-conditioning, instabilities
Reproducibility: a + (b + c) ≠ (a + b) + c (see the example below)
A fast parallel algorithm may be much less stable than a fast serial algorithm
Flops are cheaper than communication
Speeds at different precisions may vary significantly [e.g., SSEk, Cell]
Perils of arithmetic heterogeneity, e.g., CPU vs. GPU support of IEEE
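A minimal C illustration of the reproducibility point above; the values are arbitrary, chosen so that one ordering cancels exactly while the other loses the small term to rounding:

```c
#include <stdio.h>

/* Floating-point addition is not associative: summing the same three
 * values in different orders can give different rounded results. */
int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    float left  = (a + b) + c;   /* = 0 + 1 = 1 */
    float right = a + (b + c);   /* b + c rounds to -1.0e8f, so the sum is 0 */

    printf("(a + b) + c = %g\n", left);
    printf("a + (b + c) = %g\n", right);
    return 0;
}
```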
Numerical stability: the computed answer is “near” the exact solution of a nearby problem.
Define the (relative) condition number of the problem; for Ax = b, κ(A) ≡ ‖A‖ · ‖A⁻¹‖.
Roughly: (Forward error) ≤ (Condition number) × (Backward error)
Inner loop of the mixed-precision iterative refinement algorithm, where x̂ is the estimated solution to Ax = b:
Factor and solve for x̂ in single precision: O(n³)
Repeat:
r̂ ← b − A · x̂ (residual, computed in double precision): O(n²)
Solve A · d̂ = r̂ using the single-precision factors: O(n²)
x̂(improved) ← x̂ + d̂ (update in double precision): O(n)
Theorem: Let x(t) be the estimate at iteration t (precision ε) and r(t) the residual (precision ε²). If η ≡ ε · ‖ |A⁻¹| · |L̂| · |Û| ‖∞ < 1, then repeated iterative refinement reduces the error by a factor of η at each stage, and ‖x(t) − x‖∞ / ‖x‖∞ → O(ε), independent of κ(A)!
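A minimal, self-contained C sketch of this scheme (not the lecture's code): the matrix is factored once in single precision, the residual and update are computed in double, and the correction reuses the single-precision LU factors. The test matrix, right-hand side, and iteration count are illustrative assumptions.

```c
/* A minimal sketch (not the lecture's code) of mixed-precision iterative
 * refinement: factor A once in single precision (O(n^3)); then repeat:
 * residual in double (O(n^2)), correction solve reusing the single-precision
 * factors (O(n^2)), update in double (O(n)). */
#include <stdio.h>
#include <math.h>

#define N 4

/* In-place LU factorization with partial pivoting, in single precision. */
static void lu_factor(float A[N][N], int piv[N])
{
    for (int k = 0; k < N; k++) {
        int p = k;
        for (int i = k + 1; i < N; i++)
            if (fabsf(A[i][k]) > fabsf(A[p][k])) p = i;
        piv[k] = p;
        for (int j = 0; j < N; j++) { float t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
        for (int i = k + 1; i < N; i++) {
            A[i][k] /= A[k][k];                      /* multiplier */
            for (int j = k + 1; j < N; j++)
                A[i][j] -= A[i][k] * A[k][j];
        }
    }
}

/* Solve using the single-precision LU factors; right-hand side in double. */
static void lu_solve(float LU[N][N], const int piv[N],
                     const double b[N], double x[N])
{
    for (int i = 0; i < N; i++) x[i] = b[i];
    for (int k = 0; k < N; k++) { double t = x[k]; x[k] = x[piv[k]]; x[piv[k]] = t; }
    for (int i = 0; i < N; i++)                      /* unit lower triangular */
        for (int j = 0; j < i; j++) x[i] -= (double)LU[i][j] * x[j];
    for (int i = N - 1; i >= 0; i--) {               /* upper triangular */
        for (int j = i + 1; j < N; j++) x[i] -= (double)LU[i][j] * x[j];
        x[i] /= (double)LU[i][i];
    }
}

int main(void)
{
    double A[N][N], b[N], x[N], r[N], d[N];
    float As[N][N];
    int piv[N];

    /* Small, well-conditioned test problem with known solution x = (1,...,1). */
    for (int i = 0; i < N; i++) {
        b[i] = 0.0;
        for (int j = 0; j < N; j++) {
            A[i][j] = (i == j) ? N + 1.0 : 1.0 / (1.0 + i + j);
            b[i] += A[i][j];                         /* row sums => solution of ones */
        }
    }
    for (int i = 0; i < N; i++)                      /* round A to single precision */
        for (int j = 0; j < N; j++) As[i][j] = (float)A[i][j];

    lu_factor(As, piv);                              /* single precision, O(n^3) */
    lu_solve(As, piv, b, x);                         /* initial estimate x-hat */

    for (int it = 0; it < 5; it++) {
        for (int i = 0; i < N; i++) {                /* residual in double, O(n^2) */
            r[i] = b[i];
            for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
        }
        lu_solve(As, piv, r, d);                     /* correction from single factors */
        double err = 0.0;
        for (int i = 0; i < N; i++) {                /* update in double, O(n) */
            x[i] += d[i];
            if (fabs(x[i] - 1.0) > err) err = fabs(x[i] - 1.0);
        }
        printf("iter %d: max |x_i - 1| = %.3e\n", it, err);
    }
    return 0;
}
```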
Algorithms that work on small problems may fail at large sizes
Round-off accumulates
The condition number increases
The probability of “random instability” increases
Fast (parallel) algorithm may be less stable ⇒ trade-off
Single precision: x = ± m · 2^(e − t), with −125 = emin ≤ e ≤ emax = 128, t = 24, and 0 ≤ m < 2^24 ≈ 16 million
“Normalized” numbers have a leading significand bit of 1, i.e., 2^(t−1) ≤ m
General format: x = ± m · 2^(e − t), with emin ≤ e ≤ emax and 0 ≤ m < 2^t
Format            Total bits   Exp. bits   (emin, emax)      t−1   ε          Fortran    C
Single            32           8           (−125, 128)       23    6 × 10⁻⁸   REAL*4     float
Double            64           11          (−1021, 1024)     52    10⁻¹⁶      REAL*8     double
Extended (Intel)  80           15          (−16381, 16384)   64    5 × 10⁻²⁰  REAL*10    long double
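As a quick check of the ε column, <float.h> exposes the corresponding machine epsilons; note that <float.h> defines ε as 2^(1−t), i.e., twice the unit roundoff u = 2^(−t) tabulated above:

```c
#include <stdio.h>
#include <float.h>

/* Machine epsilons from <float.h>.  The table above lists the unit
 * roundoff u = 2^(-t), which is half of each value printed here. */
int main(void)
{
    printf("float       : %2d significand bits, eps = %g\n",  FLT_MANT_DIG,  FLT_EPSILON);
    printf("double      : %2d significand bits, eps = %g\n",  DBL_MANT_DIG,  DBL_EPSILON);
    printf("long double : %2d significand bits, eps = %Lg\n", LDBL_MANT_DIG, LDBL_EPSILON);
    return 0;
}
```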
[Figure: the memory hierarchy: processor (control, datapath, registers, on-chip cache), second-level cache (SRAM), main memory (DRAM), secondary storage (disk), tertiary storage (disk/tape); capacities grow from bytes to terabytes while access times grow from ~1 ns to ~10 s.]
Cost of accessing data depends on where data lives.
Use caches as fast memory
Store data that will be reused many times: temporal locality
Save chunks of contiguous data: spatial locality (see the C sketch below)
Exploit the fact that bandwidth improves faster than latency: prefetch
Modern processors automate cache management:
All loads are cached automatically (LRU) and loaded in chunks (the cache line size)
Typically a hardware prefetcher detects simple access patterns
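A small C experiment contrasting the two kinds of locality above, assuming row-major storage: the row-order loop walks memory with unit stride, while the column-order loop strides by a full row and defeats both the cache lines and the prefetcher. The matrix size N is an arbitrary example value.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048   /* arbitrary example size (32 MB of doubles) */

/* Sum a row-major N x N matrix two ways: by rows (unit stride, good
 * spatial locality) and by columns (stride N doubles, poor locality). */
static double sum_rows(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i * (size_t)N + j];       /* contiguous accesses */
    return s;
}

static double sum_cols(const double *a) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i * (size_t)N + j];       /* jumps N doubles each time */
    return s;
}

int main(void)
{
    double *a = malloc((size_t)N * N * sizeof *a);
    for (size_t i = 0; i < (size_t)N * N; i++) a[i] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_rows(a);
    clock_t t1 = clock();
    double s2 = sum_cols(a);
    clock_t t2 = clock();

    printf("row-order sum %.0f: %.3f s\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("col-order sum %.0f: %.3f s\n", s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}
```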
Simple model: T = f·τ + m·α, where f = number of flops, m = number of slow-memory words moved, τ = time per flop, and α = time per slow-memory access. Equivalently, T = f·τ·(1 + (α/τ)·(1/q)): the machine balance α/τ competes against the computational intensity q ≡ f/m.
Example: matrix-vector multiply, y ← y + A·x, i.e., y(i) ← y(i) + A(i,j) · x(j).
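A minimal C version of this kernel (the function name and row-major layout are illustrative choices), annotated with the counts used in the model and assuming, as in the simplifications below, that x and y stay in fast memory: f = 2n² flops against m ≈ n² + 3n slow-memory words, so q = f/m → 2 for large n.

```c
/* Matrix-vector multiply y <- y + A*x (row-major A), annotated with the
 * counts used in the balance model: f = 2*n*n flops vs. m ~= n*n + 3*n
 * slow-memory words (A read once; x and y read, y written), assuming x
 * and y otherwise stay in fast memory.  So q = f/m -> 2 as n grows. */
void matvec(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = y[i];                  /* keep y(i) in a register */
        for (int j = 0; j < n; j++)
            yi += A[i * n + j] * x[j];     /* 2 flops per element of A */
        y[i] = yi;
    }
}
```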
[Figure: empirically derived sustainable machine balance α/τ for Ultra 2i, Ultra 3, Pentium 3, Pentium 3-M, Power3, Power4, and Itanium 1; see my thesis.]
Ignored flop/mop parallelism within the processor → drop the arithmetic term
Assumed fast memory is large enough to hold the vectors
Assumed fast-memory access is free
Assumed memory latency is constant, charged per word
Ignored cache lines / block transfers
Ignored bandwidth
[Figure: modeled vs. measured matrix-vector multiply performance (Mflop/s, up to ~1,500) on Ultra 2i, Ultra 3, Pentium 3, Pentium 3-M, Power3, Power4, Itanium 1, and Itanium 2; the model is a best-case bound on measured performance.]
// Let I, J, K = blocks of b indices each
for I ← 1 to n/b do
  for J ← 1 to n/b do
    // Read block C_IJ into fast memory
    for K ← 1 to n/b do
      // Read block A_IK into fast memory
      // Read block B_KJ into fast memory
      C_IJ ← C_IJ + A_IK · B_KJ
    // Write block C_IJ back to slow memory
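A C sketch of the blocked loop nest above, assuming row-major storage (the function name is an illustrative choice): b is the block-size parameter, the MIN bounds handle ragged edge blocks, and the i-k-j ordering inside each block keeps B and C accesses unit-stride.

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* One-level cache-blocked C <- C + A*B for row-major n x n matrices.
 * b is the block-size parameter; the MIN bounds handle ragged edge blocks.
 * The i-k-j ordering inside each block keeps B and C accesses unit-stride. */
void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
{
    for (int I = 0; I < n; I += b)                /* block row of C */
        for (int J = 0; J < n; J += b)            /* block column of C */
            for (int K = 0; K < n; K += b)        /* C_IJ += A_IK * B_KJ */
                for (int i = I; i < MIN(I + b, n); i++)
                    for (int k = K; k < MIN(K + b, n); k++) {
                        double aik = A[i * n + k];
                        for (int j = J; j < MIN(J + b, n); j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```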
M ≡ size of fast memory (words)
Three b × b blocks must fit: 3b² ≤ M
With q ≈ 2b, this gives M ≥ (3/4)·q²
To run within 10% of peak, need 1 + (α/τ)·(1/q) < 1.1, i.e., q > 10·(α/τ), hence M ≥ 75·(α/τ)²

Arch.         ≈ α/τ   KBytes
Ultra 2i      25      5.7
Ultra 3       14      1.8
Pentium 3     6.3     0.36
Pentium 3-M   10      0.92
Power3        8.8     0.71
Power4        15      2.1
Itanium 1     36      12
Itanium 2     5.5     0.28
Consider a schedule in phases of exactly M transfers each (except the last).
Definition: c(i,j) is live during phase p if ...
… for some k, we compute a(i,k) * b(k, j); and some partial sum of c(i, j) is either in cache or moved to main memory
At most 2M live elements of C in phase p.
At most 2M distinct elements of A are in cache during phase p (either in cache at the beginning or moved into cache during the phase); the same holds for B.
Let A_p be the set of elements of A in cache during phase p.
Let S_p,+ = the set of rows of A with √M or more elements in A_p, and S_p,− = the set of rows with fewer. Since |A_p| ≤ 2M, we have |S_p,+| ≤ 2√M.
Rows in S_p,+: the operation “a(i,:) × B” touches each element of B only once, so the number of scalar multiplies is ≤ |S_p,+| · (2M) ≤ 4·M^(3/2).
Rows in S_p,−: since c(i,j) = (row of A) · (column of B), the number of multiplies is ≤ (# live elements of C) × (max row length) ≤ 2M · √M = 2·M^(3/2).
Per phase: (# multiplies) ≤ 4·M^(3/2) + 2·M^(3/2) = 6·M^(3/2).
Total multiplies = n³, so (# phases) ≥ n³ / (6·M^(3/2)).
Each full phase transfers M words, so (# words transferred) ≥ M · (n³ / (6·M^(3/2)) − 1) = n³ / (6·√M) − M.
Theorem [Hong and Kung (1981)]: Any schedule of conventional matrix multiplication must transfer Ω(n³ / √M) words between slow and fast memory, where M < n²/6.
The preceding proof is due to Toledo (1999).
So cache-blocked matrix multiply is asymptotically optimal.
Experiment: One-level cache-blocked matrix multiply Block size chosen as square, by exhaustive search over sizes up to 64
[Figure: performance of tiled matrix multiply on an AMD Opteron, 2.2 GHz (4.4 Gflop/s peak), 1 MB L2 cache.]
We evidently still have a lot of work to do...
Tues 2/19: Floating-point issues in parallel computing, by me
Tues 2/26: GPGPUs, by Prof. Hyesoon Kim
Scribe?
Both classes meet in Klaus 1116E
Extension: due Wednesday 2/27 @ 8:30 am
Implement a parallel solver for Ax = b (serial C version provided)
Evaluate on three matrices: a 27-point stencil and two application matrices
“Simplified”: no preconditioning
Use performance models to understand the scalability of your implementation: make measurements, then build predictive models.
Collaboration encouraged: Compare programming models or platforms
New room (dumpier, but cozier?): College of Computing Building (CCB) 101
Accounts: apparently, you already have them
Front-end login node: ccil.cc.gatech.edu (CoC Unix account)
We “own” warp43 through warp56
Some docs (MPI): http://www-static.cc.gatech.edu/projects/ihpcl/mpi.html
Sign up for the mailing list: https://mailman.cc.gatech.edu/mailman/listinfo/ihpc-lab
Your goal should be to do something useful, interesting, and/or publishable!
Something you’re already working on, suitably adapted for this course
Faculty-sponsored/mentored projects
Collaborations encouraged
“Relevant to this course:” Many themes, so think (and “do”) broadly
Parallelism and architectures
Numerical algorithms
Programming models
Performance modeling/analysis
Theoretical: prove something hard (high risk)
Experimental:
Parallelize something
Take an existing parallel program and improve it using models & experiments
Evaluate an algorithm, architecture, or programming model
Anything of interest to a faculty member/project outside CoC
Parallel sparse triple product (R·A·R^T, used in multigrid)
Future FFT
Out-of-core or I/O-intensive data analysis and algorithms
Block iterative solvers (convergence & performance trade-offs)
Sparse LU
Data structures and algorithms (trees, graphs)
Discrete-event approaches to continuous-systems simulation
Automated performance analysis, modeling, and tuning
“Unconventional,” but related:
Distributed deadlock detection for MPI
UPC language extensions (dynamic block sizes)
Exact linear algebra
Examples
[Figure: the family of blocked matrix-multiply algorithms from Gunnels et al.: “Matrix-Panel,” “Panel-Matrix,” “Panel-Panel,” and their repeated variants “Repeated Matrix-Panel,” “Repeated Panel-Matrix,” and “Repeated Panel-Panel.”]
“Repeated Panel-Panel”
Consider matrices that “live” in cache level L_{h+1}; execute the “RPP” algorithm on panels that live in level L_h.
The panel dimensions satisfy m_{h+1} = m_h and n_{h+1} = n_h; only the k dimension is blocked, so k_{h+1} is covered by panels of width k_h.
“Repeated Panel-Panel”
ρ_h ≡ time to read a word from L_{h+1} into L_h
σ_h ≡ time to store a word from L_h to L_{h+1}
γ_h ≡ effective I/O time per flop at L_h
T_{h+1} ≡ 2·m_{h+1}·n_{h+1}·k_{h+1}·γ_{h+1} = m_{h+1}·n_{h+1}·(ρ_h + σ_h) + m_{h+1}·n_{h+1}·k_{h+1}·ρ_h·(1/n_h + 1/m_h), with m_{h+1} = m_h and n_{h+1} = n_h.
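One way to read off the terms on the right-hand side (my gloss, under the assumption that each operand is streamed through level L_h exactly once per RPP pass):

```latex
% Per-pass I/O at level L_h for the RPP step (each operand streamed once):
\begin{align*}
C \text{ read and written once}: &\quad m_{h+1} n_{h+1}\,(\rho_h + \sigma_h),\\
A \text{ read once}: &\quad m_{h+1} k_{h+1}\,\rho_h
   = \frac{m_{h+1} n_{h+1} k_{h+1}\,\rho_h}{n_h} \quad (n_{h+1}=n_h),\\
B \text{ read once}: &\quad k_{h+1} n_{h+1}\,\rho_h
   = \frac{m_{h+1} n_{h+1} k_{h+1}\,\rho_h}{m_h} \quad (m_{h+1}=m_h).
\end{align*}
```

Summing the three terms reproduces T_{h+1} above; dividing by the 2·m_{h+1}·n_{h+1}·k_{h+1} flops gives the effective per-flop I/O time γ_{h+1}.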
Translation Look-aside Buffer (TLB): Mechanism to support efficient implementation of virtual address spaces that exceed physical memory
Divide the address space into pages (4 to 32 KB typical; larger is possible)
The page table maps virtual to physical addresses, whether a page is in memory or on disk
The page table can be large; the TLB caches recent translations
Conceptually like a cache with a large block size (e.g., page)
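For a sense of scale: a 64-entry TLB with 4 KB pages reaches only 64 × 4 KB = 256 KB, so strided sweeps over a larger working set can miss in the TLB on every new page touched, even when the data still fits in L2.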
Strided-stream through array; measure average access time. (Saavedra-Barrera benchmark)
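A rough C sketch of such a benchmark for a single (size, stride) pair; the actual benchmark sweeps both parameters and plots the average access times, which is what the figures below show. The size, stride, and repetition count here are arbitrary example values.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Rough sketch of a Saavedra-Barrera-style probe for ONE (size, stride)
 * pair: sweep an array repeatedly at a fixed stride and report the average
 * time per access.  The real benchmark sweeps both size and stride. */
int main(void)
{
    const size_t size = 8u << 20;       /* 8 MB working set */
    const size_t stride = 64;           /* bytes between touched locations */
    const int reps = 50;

    volatile char *a = malloc(size);
    for (size_t i = 0; i < size; i++) a[i] = 1;

    long accesses = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < size; i += stride) {
            a[i]++;                     /* one read-modify-write per stride */
            accesses++;
        }
    double ns = 1e9 * (double)(clock() - t0) / CLOCKS_PER_SEC / (double)accesses;

    printf("size %zu B, stride %zu B: %.2f ns per access\n", size, stride, ns);
    free((void *)a);
    return 0;
}
```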
[Figure: average memory access time (Saavedra-Barrera) on a Sun Ultra IIi, 333 MHz. Features visible: L1 = 16 KB with 16 B lines; L2 = 2 MB with 64 B lines; TLB with 32 entries over 8 KB pages; then main memory.]
[Figure: average memory access time (Saavedra-Barrera) on a Pentium III (Katmai), 550 MHz. Features visible: L1 = 16 KB with 32 B lines; L2 = 512 KB with 32 B lines; TLB with 64 entries over 4 KB pages; then main memory.]
[Figure: average memory access time on a third machine. Features visible: L1 = 32 KB with 128 B lines (~0.5+ cycles); L2 = 8 MB with 128 B lines (~6 cycles); TLB with 256 entries over 4 KB pages; main memory (~21 cycles?).]