Single processor tuning (2/2)
- Prof. Richard Vuduc
Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.16] Thursday, February 28, 2008
Today's sources
- CS 267 (Demmel & Yelick @ UCB; Spring 2007)
- “A family of high-performance matrix multiplication algorithms,” by Gunnels, et al. (2006)
- “Anatomy of high-performance matrix multiplication,” by Goto and van de Geijn (2006)
- “An experimental comparison of cache-oblivious and cache-conscious programs,” by Yotov, et al. (SPAA 2007)
- Talk by Matteo Frigo at the CScADS Autotuning Workshop (2007)
(I don’t know; you tell me!)
Want: machine balance (α/τ) ≤ computational intensity (q)
[Figure: blocked view of C = C + A·B, with index blocks I, J, K]

// Let I, J, K = blocks of b indices
for I ← index blocks 1 to n/b do
  for J ← index blocks 1 to n/b do
    // Read block C_IJ
    for K ← index blocks 1 to n/b do
      // Read block A_IK
      // Read block B_KJ
      C_IJ ← C_IJ + A_IK · B_KJ
    // Write C_IJ to slow memory
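For concreteness, here is a minimal C rendering of the pseudocode above (my sketch, not code from the slides). BSIZE is a hypothetical block size, chosen so three blocks fit in fast memory (3·BSIZE²·8 bytes ≤ M); n is assumed to be a multiple of BSIZE.

#include <stddef.h>

#define BSIZE 32   /* hypothetical; tune so 3*BSIZE*BSIZE*8 <= M */

/* One-level cache-blocked C = C + A*B for row-major n-by-n matrices. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t I = 0; I < n; I += BSIZE)
        for (size_t J = 0; J < n; J += BSIZE)
            for (size_t K = 0; K < n; K += BSIZE)
                /* C_IJ <- C_IJ + A_IK * B_KJ on one block triple */
                for (size_t i = I; i < I + BSIZE; ++i)
                    for (size_t k = K; k < K + BSIZE; ++k) {
                        double aik = A[i * n + k];
                        for (size_t j = J; j < J + BSIZE; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}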
Theorem [Hong and Kung (1981)]: Any schedule of conventional matrix multiply must transfer Ω(n³ / √M) words between slow and fast memory, where M < n²/6.
Last time: we saw the intuitive proof by Toledo (1999).
Historical note: Rutledge & Rubinstein (1951–52).
So cache-blocked matrix multiply is asymptotically optimal.
Arch.       α/τ (≈)    M
Ultra 2i    25         1.5 MB
Ultra 3     14         460 KB
Pentium 3   6.3        94 KB
P-3M        10         240 KB
Power3      8.8        180 KB
Power4      15         527 KB
Itanium 1   36         3.0 MB
Itanium 2   5.5        71 KB
M ≡ size of fast memory (in words)
3b² ≤ M and q ≈ b ⟹ M ≥ 3q²
To run within 10% of peak: 1 + (α/τ)·(1/q) < 1.1 ⟹ q > 10·(α/τ) ⟹ M ≥ 300·(α/τ)²
Note: “M” in bytes to 2 digits; assumes 8-byte (double-precision) words
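Sanity check against the table: for the Ultra 2i, α/τ ≈ 25, so M ≥ 300 · 25² = 187,500 words = 1.5 MB at 8 bytes per word, exactly the entry above; likewise for the Power4, 300 · 15² words · 8 B = 540,000 B ≈ 527 KB.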
Experiment: one-level cache-blocked matrix multiply. Block size chosen square, by exhaustive search over sizes up to 64.
Tiled MM on AMD Opteron, 2.2 GHz (4.4 Gflop/s peak), 1 MB L2 cache: < 25% of peak! We evidently still have a lot of work to do...
What happened at powers of 2?
Example: byte-addressable machine, 32-bit addresses; direct-mapped cache, 8 KB capacity, 16-byte lines.
Address layout: the low 4 bits select a byte within a 16 B line, the next 9 bits select one of 512 lines, and the upper 19 bits are the tag. Any two addresses that agree in the low 13 bits (i.e., differ by a multiple of 8 KB) map to the same cache line.
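A small C helper (my illustration, not from the slides) that computes where an address lands in this cache; it makes the power-of-2 problem concrete, since addresses a multiple of 8 KB apart share an index and evict one another.

#include <stdint.h>

/* Direct-mapped 8 KB cache, 16 B lines: 4 offset bits, 9 index bits. */
static inline uint32_t cache_line_index(uint32_t addr)
{
    return (addr >> 4) & 0x1FF;   /* drop offset bits, keep 9 index bits */
}
/* Example: cache_line_index(x) == cache_line_index(x + 8192) for any x,
 * so strides that are a multiple of 8 KB keep hitting a single line. */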
[Figure: memory hierarchy, fast to slow: registers, L1, L2, main memory]
[Figure: memory hierarchy with the TLB added: registers, L1, TLB, L2, main memory]
Translation Look-aside Buffer (TLB) for virtual address space management
- Divide the address space into pages (4–32 KB typical; larger possible)
- A page table maps virtual to physical addresses and records whether each page is in memory or on disk
- The page table can be large; the TLB caches recent translations
- May be set-associative or fully associative
- Conceptually like a cache with a large block size, i.e., one page
- May have multiple levels of TLB, just like caches
- Can prefetch to hide cache misses, but not TLB misses
Stride through an array and measure the average access time (the Saavedra-Barrera benchmark).
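A rough C sketch of such a benchmark (a reconstruction under stated assumptions, not Saavedra-Barrera's code): for each (array size, stride) pair, stream through the array and report nanoseconds per access. Assumes POSIX clock_gettime.

#include <stddef.h>
#include <stdio.h>
#include <time.h>

#define MAX_BYTES (8 * 1024 * 1024)

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    static char a[MAX_BYTES];
    volatile char sink = 0;   /* keeps the loads from being optimized away */

    for (size_t size = 4 * 1024; size <= MAX_BYTES; size *= 2)
        for (size_t stride = 1; stride <= size / 2; stride *= 2) {
            long accesses = (long)(size / stride);
            long reps = 10000000L / accesses + 1;   /* ~1e7 total accesses */
            double t0 = now_sec();
            for (long r = 0; r < reps; ++r)
                for (size_t i = 0; i < size; i += stride)
                    sink += a[i];
            double ns = (now_sec() - t0) / ((double)reps * accesses) * 1e9;
            printf("size=%zu stride=%zu avg=%.2f ns\n", size, stride, ns);
        }
    return 0;
}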
Average Memory Access Time (Saavedra-Barrera) – Sun Ultra IIi (333 MHz)
[Plot annotations: L1 16 KB, 16 B lines; L2 2 MB, 64 B lines; TLB 32 entries, 8 KB pages; main memory]
Average Memory Access Time (Saavedra-Barrera) – Pentium III (Katmai; 550 MHz)
[Plot annotations: L1 16 KB, 32 B lines; L2 512 KB, 32 B lines; TLB 64 entries, 4 KB pages; main memory]
[Goto & van de Geijn (2006)]
[Figure: shapes of C = A·B with dimensions m, k, n]
Cases: “matrix-matrix,” “matrix-panel,” “panel-matrix,” “panel-panel”
[Figure: the same m, k, n shapes, decomposed one level further]
Cases: “matrix-matrix,” “block-panel,” “panel-block,” “fat dot product”
[Figure: C (m×n), A (m×k), B (k×n) partitioned into b_m × b_k blocks of A and b_k × b_n panels of B and C]
[Figure: a b_m × b_k block of A multiplying a b_k × n panel of B, swept in b_n-wide slices indexed by J]

// “Block-panel” multiply
// Load the b_m × b_k block of A into cache
for J ← blocks 1 to n/b_n do
  // Load b_k × b_n block B_J into cache
  // Load b_m × b_n block C_J into cache
  C_J ← C_J + A × B_J
  // Store the b_m × b_n block C_J to memory
Cost of one block-panel multiply (the b_m × b_k block of A stays in cache; panels of B and C stream through):
  words moved ≈ b_m·b_k + n·b_k + 2·n·b_m
  flops = 2·n·b_m·b_k
  ⟹ words per flop ≈ 1/(2n) + 1/(2·b_m) + 1/b_k
Given a multi-level memory hierarchy, in which cache should the “A” block live?
Typically, need b_n ≥ 2.
What about the TLB?
Considerations for the TLB
[Figure: n = 1024 matrix in column-major order; 4 KB pages, 32-entry TLB. Each column of 1024 doubles spans 8 KB = 2 pages, so the 32 TLB entries cover only 16 columns at a time; traversals that touch many columns thrash the TLB.]
What about the TLB? A block of A straddles pages, so re-pack it into contiguous memory.
Copy the panel of B as well.
The “panel-block” and “fat dot product” cases
// Let I, J, K = blocks of indices
for K ← blocks 1 to k/b_k do
  B̃ ← B_{K,⋆}                  // Pack the row-panel of B
  for I ← blocks 1 to m/b_m do
    Ã ← A_{IK}                 // Pack the block of A
    for J ← blocks 1 to n/b_n do
      C̃ ← Ã × B̃_J             // Compute in buffer C̃
      C_{IJ} ← C_{IJ} + C̃      // Unpack C̃
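A simplified C sketch of this structure (my reconstruction, not Goto and van de Geijn's implementation). BM, BK, BN are hypothetical block sizes; matrices are row-major with m, n, k assumed multiples of the block sizes, and n ≤ NMAX so the packed panel of B fits its static buffer. The point is that the packed copies are contiguous, so the inner loops walk unit-stride memory confined to a few pages.

#include <stddef.h>

#define BM 64
#define BK 64
#define BN 8
#define NMAX 1024

void gemm_packed(size_t m, size_t n, size_t k,
                 const double *A, const double *B, double *C)
{
    static double Atilde[BM * BK];     /* packed block of A */
    static double Btilde[BK * NMAX];   /* packed row-panel of B */

    for (size_t K = 0; K < k; K += BK) {
        /* Pack B_{K,*} (BK x n) into the contiguous buffer Btilde. */
        for (size_t p = 0; p < BK; ++p)
            for (size_t j = 0; j < n; ++j)
                Btilde[p * n + j] = B[(K + p) * n + j];

        for (size_t I = 0; I < m; I += BM) {
            /* Pack A_{IK} (BM x BK) into Atilde. */
            for (size_t i = 0; i < BM; ++i)
                for (size_t p = 0; p < BK; ++p)
                    Atilde[i * BK + p] = A[(I + i) * k + (K + p)];

            /* C_{IJ} += Atilde * Btilde_J, one BN-wide slice at a time. */
            for (size_t J = 0; J < n; J += BN)
                for (size_t i = 0; i < BM; ++i)
                    for (size_t p = 0; p < BK; ++p) {
                        double aip = Atilde[i * BK + p];
                        for (size_t j = J; j < J + BN; ++j)
                            C[(I + i) * n + j] += aip * Btilde[p * n + j];
                    }
        }
    }
}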
Source: Vuduc, Demmel, Bilmes (IJHPCA 2004)
[Plot: Dense Matrix Multiply Performance (Square n×n Operands), 333 MHz Sun Ultra 2i. x-axis: matrix dimension n (100–1200); y-axes: performance (Mflop/s) and fraction of peak. Curves: Vendor; Reg/insn-level + cache tiling + copy opt.; Cache tiling + copy opt.; Reference.]
[Plot: Dense Matrix Multiply Performance (Square n×n Operands), 800 MHz Intel Pentium III-mobile. x-axis: matrix dimension n (100–1200); y-axes: performance (Mflop/s) and fraction of peak. Curves: Vendor; Goto-BLAS; Reg/insn-level + cache tiling + copy; Cache tiling + copy opt.; Reference.]
Source: Vuduc, Demmel, Bilmes (IJHPCA 2004)
[Figure: the inner kernel updates an r × c register block of C inside the b_m × b_k × b_n cache block]
Inner-kernel scheduling
Register allocation
Tues 2/19: Floating-point issues in parallel computing, by me
Tues 2/26: GPGPUs, by Prof. Hyesoon Kim
Scribe?
Both classes meet in Klaus 1116E
Extension: due Wednesday 2/27 @ 8:30 am. Implement a parallel solver for Ax = b (serial C version provided).
Evaluate on three matrices: a 27-pt stencil and two application matrices. “Simplified”: no preconditioning.
Performance models to understand scalability of your implementation
Make measurements; build predictive models.
Collaboration encouraged: Compare programming models or platforms
New room (dumpier, but cozier?): College of Computing Building (CCB) 101
Accounts: apparently, you already have them
Front-end login node: ccil.cc.gatech.edu (CoC Unix account)
We “own” warp43–warp56
Some docs (MPI): http://www-static.cc.gatech.edu/projects/ihpcl/mpi.html
Sign up for the mailing list: https://mailman.cc.gatech.edu/mailman/listinfo/ihpc-lab
Your goal should be to do something useful, interesting, and/or publishable!
- Something you’re already working on, suitably adapted for this course
- Faculty-sponsored/mentored projects
- Collaborations encouraged
“Relevant to this course:” Many themes, so think (and “do”) broadly
- Parallelism and architectures
- Numerical algorithms
- Programming models
- Performance modeling/analysis
Theoretical: prove something hard (high risk)
Experimental:
- Parallelize something
- Take an existing parallel program and improve it using models & experiments
- Evaluate an algorithm, architecture, or programming model
Examples:
- Anything of interest to a faculty member/project outside CoC
- Parallel sparse triple product (R·A·Rᵀ, used in multigrid)
- Future FFT
- Out-of-core or I/O-intensive data analysis and algorithms
- Block iterative solvers (convergence & performance trade-offs)
- Sparse LU
- Data structures and algorithms (trees, graphs)
- Mixed-precision methods
- Discrete-event approaches to continuous-systems simulation
- Automated performance analysis, modeling, and tuning
- “Unconventional,” but related:
  - Distributed deadlock detection for MPI
  - UPC language extensions (dynamic block sizes)
  - Exact linear algebra
Theorem (Motwani, et al., 1995): Given a DAG, finding the schedule and register assignment that minimize register spills is NP-hard.
Theorem (Belady, 1966): Given a DAG and a schedule, the register assignment that minimizes register spills can be found in ≈ linear time.
Source: Talk by M. Frigo at the CScADS autotuning workshop (2007)
Loop unrolling: reducing loop overheads
Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
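A hypothetical before/after dot product in C (my example, not Whaley's): the unrolled version pays the loop branch and index update once per four elements instead of once per element. Assumes n is a multiple of 4.

#include <stddef.h>

/* Before: one branch and index update per element. */
double dot(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/* After: unrolled by 4; same dependence chain, less loop overhead. */
double dot_unrolled4(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s += x[i]     * y[i];
        s += x[i + 1] * y[i + 1];
        s += x[i + 2] * y[i + 2];
        s += x[i + 3] * y[i + 3];
    }
    return s;
}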
Scalar expansion: removing serial dependencies
Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
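A hypothetical C example (mine, not Whaley's): the single accumulator above serializes every add, so expanding it into four partial sums breaks the dependence chain and lets independent adds overlap. Assumes n is a multiple of 4; reassociation perturbs rounding slightly.

#include <stddef.h>

double dot_expanded(size_t n, const double *x, const double *y)
{
    /* Four independent accumulators instead of one serial chain. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}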
Unroll-and-jam + register blocking
Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
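A hypothetical 2×2 register-blocked kernel in C (my sketch, not Whaley's): the i and j loops are each unrolled by 2 and their bodies jammed together, so the four accumulators stay in registers and every load of A and B feeds two multiplies. Assumes m and n even; matrices are row-major.

#include <stddef.h>

void matmul_2x2(size_t m, size_t n, size_t k,
                const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; i += 2)
        for (size_t j = 0; j < n; j += 2) {
            double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
            for (size_t p = 0; p < k; ++p) {
                double a0 = A[i * k + p], a1 = A[(i + 1) * k + p];
                double b0 = B[p * n + j], b1 = B[p * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j]           += c00;
            C[i * n + j + 1]       += c01;
            C[(i + 1) * n + j]     += c10;
            C[(i + 1) * n + j + 1] += c11;
        }
}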
Software pipelining: interleave iterations to delay dependent instructions
[Figure: iterations i−4, i−3, i, i+1 in flight simultaneously]
Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
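A source-level cartoon of software pipelining in C (compilers normally do this on the instruction schedule; this is my illustration, not Whaley's code): the loads for iteration i+1 issue while iteration i's multiply-add completes. Assumes n ≥ 1.

#include <stddef.h>

double dot_pipelined(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    double xi = x[0], yi = y[0];              /* prologue: first loads */
    for (size_t i = 0; i + 1 < n; ++i) {
        double xn = x[i + 1], yn = y[i + 1];  /* load for iteration i+1 */
        s += xi * yi;                         /* compute iteration i */
        xi = xn;  yi = yn;
    }
    return s + xi * yi;                       /* epilogue: last iteration */
}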
Fetch scheduling, for cache lines and hardware prefetching engines
Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
Software prefetching
Source: Clint Whaley’s code optimization course (UTSA Spring 2007)
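A small C example using GCC/Clang's __builtin_prefetch (the intrinsic is real; the distance of 16 doubles, two 64 B lines ahead, is a made-up tuning parameter):

#include <stddef.h>

double sum_prefetch(size_t n, const double *x)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16], /*rw=*/0, /*locality=*/3);
        s += x[i];
    }
    return s;
}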
[Plot repeated from above: dense matrix multiply performance on the 800 MHz Intel Pentium III-mobile.]
[Plot annotations: L1 32 KB, 128 B lines, ~0.5+ cycles; L2 8 MB, 128 B lines, ~6 cycles; TLB 256 entries, 4 KB pages; main memory ~21 cycles?]