Autotuning (1/2): Cache-oblivious algorithms
- Prof. Richard Vuduc
Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.17] Tuesday, March 4, 2008
Today's sources

- CS 267 (Demmel & Yelick @ UCB; Spring 2007)
- "An experimental comparison of cache-oblivious and cache-conscious programs," by Yotov, et al. (SPAA 2007)
- "The memory behavior of cache oblivious stencil computations," by Frigo & Strumpen (2007)
- Talks by Matteo Frigo and Kaushik Datta at the CScADS Autotuning Workshop (2007)
- Demaine's notes @ MIT: http://courses.csail.mit.edu/6.897/spring03/scribe_notes
Tiled matrix multiply on a 2.2 GHz AMD Opteron (4.4 Gflop/s peak, 1 MB L2 cache) achieves < 25% of peak! We evidently still have a lot of work to do...
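For reference, the loop structure of a cache-tiled ("blocked") matrix multiply looks like the sketch below. This is a minimal Python model of the algorithm, not a performance kernel; the block size b and the list-of-lists layout are illustrative assumptions.

```python
def tiled_matmul(A, B, n, b):
    """Blocked n x n matrix multiply, C = A*B, with b x b cache tiles."""
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):          # tile row of C
        for kk in range(0, n, b):      # tile of the shared dimension
            for jj in range(0, n, b):  # tile column of C
                # The inner loops touch only O(b^2) data per tile,
                # which is what tiling buys over the naive loops.
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The result is identical to the naive triple loop; only the order of the updates, and hence the cache behavior, changes.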
Memory hierarchy (fast to slow): Registers → L1 → TLB → L2 → Main memory
Software pipelining: interleave instructions from several iterations (e.g., iterations i-4, i-3, i, and i+1 in flight simultaneously) so that dependent instructions are spaced apart.
Source: Clint Whaley's code optimization course (UTSA Spring 2007)
[Figure: Dense matrix multiply performance (square n×n operands) on an 800 MHz Intel Pentium III-mobile. Performance (Mflop/s, also as fraction of peak) vs. matrix dimension n, comparing: Vendor, Goto-BLAS, register/instruction-level + cache tiling + copy, cache tiling + copy optimization, and the reference implementation.]
Source: Vuduc, Demmel, Bilmes (IJHPCA 2004)
[Yotov, Roeder, Pingali, Gunnels, Gustavson (SPAA 2007)] [Talk by M. Frigo at CScADS Autotuning Workshop 2007]
Memory model for analyzing cache-oblivious algorithms

- Two-level memory hierarchy: fast cache + slow memory
- M = capacity of the cache ("fast"); L = cache line size
- Fully associative
- Optimal replacement: evict the line whose next use is most distant. Sleator & Tarjan (CACM 1985): LRU and FIFO are within a constant factor of optimal, given a cache larger by a constant factor.
- "Tall cache" assumption: M ≥ Ω(L²)
- Limits: see Brodal & Fagerberg (STOC 2003). When might the tall-cache assumption not hold?
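The model above is easy to make concrete. The sketch below simulates a fully associative cache with LRU replacement (standing in for optimal replacement, per Sleator & Tarjan) and counts misses. M and L are in words, the trace is a sequence of word addresses, and all names are illustrative.

```python
from collections import OrderedDict

def lru_misses(trace, M, L):
    """Count misses of a fully associative LRU cache with
    capacity M words and line size L words."""
    cache = OrderedDict()   # line number -> True, kept in LRU order
    nmiss = 0
    for addr in trace:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)      # mark most recently used
        else:
            nmiss += 1
            cache[line] = True
            if len(cache) > M // L:      # evict least recently used
                cache.popitem(last=False)
    return nmiss
```

A sequential scan of n words incurs n/L misses, and re-scanning a working set that fits in M words is free the second time; counts like these are exactly what the Q(...) analyses below bound.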
A recursive algorithm for matrix multiply

- Divide all dimensions in half and recurse on the eight half-size subproblems
- Bilardi, et al.: use Gray-code ordering of the subproblems to improve locality
Dividing all dimensions in half on square n × n operands gives Q(n) = 8·Q(n/2) with base case Θ(3n²/L) once the three operands fit in cache, which solves to Q(n) = Θ(n³ / (L·√M)).

Alternative: divide only the longest dimension in half (Frigo, et al.). For an (m × k)·(k × n) product, the cache misses satisfy

Q(m, k, n) = Θ((mk + kn + mn) / L)   if the operands fit in cache; otherwise
Q(m, k, n) ≤ 2·Q(m/2, k, n)   if m ≥ k and m ≥ n
Q(m, k, n) ≤ 2·Q(m, k/2, n)   if k > m and k ≥ n
Q(m, k, n) ≤ 2·Q(m, k, n/2)   otherwise

which solves to Q(m, k, n) = Θ(mkn / (L·√M)).
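The divide-longest-dimension recursion can be sketched as follows. The cutoff at which we fall back to a plain triple loop (standing in for a tuned micro-kernel), and the in-place convention C += A·B, are illustrative choices.

```python
def rec_matmul(A, B, C, i0, i1, j0, j1, k0, k1, cutoff=8):
    """C[i0:i1][j0:j1] += A[i0:i1][k0:k1] * B[k0:k1][j0:j1],
    recursively halving the longest of the three dimensions."""
    m, n, k = i1 - i0, j1 - j0, k1 - k0
    if max(m, n, k) <= cutoff:          # base case: plain triple loop
        for i in range(i0, i1):
            for kk in range(k0, k1):
                a = A[i][kk]
                for j in range(j0, j1):
                    C[i][j] += a * B[kk][j]
    elif m >= k and m >= n:             # m largest: split rows of A and C
        im = (i0 + i1) // 2
        rec_matmul(A, B, C, i0, im, j0, j1, k0, k1, cutoff)
        rec_matmul(A, B, C, im, i1, j0, j1, k0, k1, cutoff)
    elif k >= n:                        # k largest: split shared dimension
        km = (k0 + k1) // 2
        rec_matmul(A, B, C, i0, i1, j0, j1, k0, km, cutoff)
        rec_matmul(A, B, C, i0, i1, j0, j1, km, k1, cutoff)
    else:                               # n largest: split columns of B and C
        jm = (j0 + j1) // 2
        rec_matmul(A, B, C, i0, i1, j0, jm, k0, k1, cutoff)
        rec_matmul(A, B, C, i0, i1, jm, j1, k0, k1, cutoff)
```

Each split halves the largest dimension, so subproblems shrink toward square blocks that fit in cache at every level of the hierarchy, without the code ever naming M or L.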
Relax the tall-cache assumption using a suitable layout [Yotov, et al. (SPAA 2007); Frigo, et al. (FOCS 1999)]:

- Row-major: needs a tall cache
- Row-block-row: needs only M ≥ Ω(L)
- Morton Z-order: no assumption
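A Morton Z-order index is just a bit-interleaving of the row and column indices. A quick sketch (putting row bits in the odd positions is one common convention):

```python
def morton(i, j, bits=16):
    """Interleave the low `bits` bits of i (odd positions) and
    j (even positions) to form a Z-order index."""
    z = 0
    for b in range(bits):
        z |= ((j >> b) & 1) << (2 * b)
        z |= ((i >> b) & 1) << (2 * b + 1)
    return z
```

Storing blocks at offsets morton(i, j) keeps blocks that are close in 2-D close in memory at every scale, which is what removes the tall-cache requirement.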
Latency-centric vs. bandwidth-centric views of blocking

- Latency view: time per flop ≈ 1 + (α/τ)·(1/b), where α is the memory latency, τ the time per flop, and b the block size
- Bandwidth view: the block size need only satisfy (α/τ)·(1/κ) ≤ b ≤ √(M/3), assuming computation and communication can be perfectly overlapped
Example platform: Itanium 2 (2 FMAs/cycle ⇒ 4 flops/cycle peak)

[Diagram: FPU ← Registers ← L1 ← L2 ← L3 ← Memory, annotated with per-level latencies and bandwidths]

- Registers ↔ L2: 1 ≤ b_R ≤ 6, with 1.33 ≤ β(R, L2) ≤ 4
- L2 ↔ L3: 1 ≤ b_L2 ≤ 6, with 1.33 ≤ β(L2, L3) ≤ 4
- L3 ↔ Memory: 8 ≤ b_L3 ≤ 418, with 0.02 ≤ β(L3, Memory) ≤ 0.5

Consider L3 ↔ memory bandwidth: Φ = 4 flops/cycle, β = 0.5 words/cycle, L3 capacity = 4 MB (512 kwords) ⇒ 8 ≤ b_L3 ≤ 418.
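The L3 block-size window above can be recomputed in a few lines, assuming the lower bound comes from matching the kernel's flop-to-word ratio (≈ b for blocked matrix multiply) to the machine balance Φ/β, and the upper bound from fitting three b × b blocks of 8-byte words into L3:

```python
import math

phi = 4              # flops/cycle (2 FMAs/cycle)
beta = 0.5           # words/cycle across L3 <-> memory
M = 4 * 2**20 // 8   # L3 capacity in 8-byte words = 512K words

b_lo = int(phi / beta)     # need >= phi/beta flops per word moved
b_hi = math.isqrt(M // 3)  # three b x b blocks must fit in L3
```

This reproduces the 8 ≤ b_L3 ≤ 418 range quoted on the slide.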
Implications: approximate (cache-oblivious) blocking works.
- A wide range of block sizes is acceptable.
- If the upper bound exceeds 2× the lower bound, divide-and-conquer necessarily generates a block size within the acceptable range.
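The divide-and-conquer claim is easy to check: repeated halving must land in any window [lo, hi] with hi ≥ 2·lo, because the first value at or below hi had a predecessor above hi, so it is itself above hi/2 ≥ lo. A sketch (names illustrative):

```python
def dc_block_size(n, hi):
    """Halve n until it no longer exceeds hi, as divide-and-conquer would."""
    b = n
    while b > hi:
        b //= 2
    return b
```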
Source: Yotov, et al. (SPAA 2007)
Does cache-oblivious perform as well as cache-aware? If not, what can be done? Next: a summary of the Yotov, et al. study (SPAA 2007). (Slides borrowed liberally from their talk.)
The division strategies ("divide largest dimension" vs. "divide all dimensions") perform similarly; assume "all-dim" in what follows.
The Morton-Z layout is complicated and yields the same or worse performance, so assume the row-block-row layout.
Experimental platform:
- 1 GHz ⇒ 2 Gflop/s peak
- Memory hierarchy: 32 registers; L1 = 64 KB, 4-way; L2 = 1 MB, 4-way
- Sun compiler
The space of matrix-multiply implementations explored (built up one refinement per slide in the original deck):

- Outer control structure: iterative or recursive
- Inner control structure: a single statement, a recursive micro-kernel, or an iterative mini-kernel
- Recursive micro-kernel: unroll the recursion below a cutoff NB to obtain one large basic block (e.g., NB = 12 fills the registers). Scheduling and register allocation for the references in the basic block:
  - none: leave it to the native compiler
  - Belady (optimal replacement) scheduling / BRILA
  - scalarized references / compiler
  - graph-coloring register allocation / BRILA
- Iterative mini-kernel: an NB × NB × NB triply nested loop, register-tiled MU × NU × KU, scheduled and register-allocated as above
- ATLAS CGw/S and ATLAS Unleashed: a specialized code generator with search
Need to cut off the recursion: careful scheduling/tuning is required at the "leaves." Yotov, et al. report that full recursion + a tuned micro-kernel achieves ≤ 2/3 of the best performance.

Open issues:
- Recursively scheduled kernels are worse than iteratively scheduled kernels. Why?
- Prefetching is needed, but how is it best applied in the recursive case?
Some adjustment of topics (TBD):
- Tu 3/11: project proposals due
- Th 3/13: SIAM Parallel Processing (attendance encouraged)
- Tu 4/1: no class
- Th 4/3: attend talk by Doug Post of the DoD HPC Modernization Program
Put your name on the write-up! Grading: 100 pts max
- Correct implementation: 50 pts
- Evaluation: 30 pts
  - tested on two sample matrices: 5
  - implemented and tested on stencil: 10
  - "explained" performance (e.g., per processor, load balance, computation vs. communication): 15
- Performance model: 15 pts
- Write-up "quality": 5 pts
Proposals due Tu 3/11. Your goal should be to do something useful, interesting, and/or publishable!
- Something you're already working on, suitably adapted for this course
- Faculty-sponsored/mentored projects
- Collaborations encouraged
"Relevant to this course": many themes, so think (and "do") broadly.
- Parallelism and architectures
- Numerical algorithms
- Programming models
- Performance modeling/analysis
- Theoretical: prove something hard (high risk)
- Experimental:
  - parallelize something
  - take an existing parallel program and improve it using models & experiments
  - evaluate an algorithm, architecture, or programming model
Examples:
- Anything of interest to a faculty member/project outside CoC
- Parallel sparse triple product (R·A·Rᵀ, used in multigrid)
- Future FFT
- Out-of-core or I/O-intensive data analysis and algorithms
- Block iterative solvers (convergence & performance trade-offs)
- Sparse LU
- Data structures and algorithms (trees, graphs)
- Mixed-precision approaches
- Discrete-event approaches to continuous-systems simulation
- Automated performance analysis, modeling, and tuning
- "Unconventional," but related:
  - distributed deadlock detection for MPI
  - UPC language extensions (dynamic block sizes)
  - exact linear algebra
http://cscads.rice.edu/workshops/july2007/autotune-workshop-07
[Frigo and Strumpen (ICS 2005)] [Datta, et al. (2007)]
Cache-oblivious stencil computation [Frigo & Strumpen (ICS 2005)]

[Figure: space-time diagram of a 1-D stencil; time t vertical from t=0, space x horizontal from x=0]

Recursively partition the space-time region. For a trapezoid of width w and time extent h:
- If w < 2×h, apply a "time cut": split at the midpoint in time; recurse on the earlier half, then the later half.
- If w ≥ 2×h, apply a "space cut": split along a line of slope -1 into two trapezoids; recurse first on the piece that depends on nothing in the other, then on the second piece.

Theorem [Frigo & Strumpen (ICS 2005)]: for a d-dimensional stencil on an n^d grid over T time steps, the cache misses are Q = Θ(n^d · T / (L · M^(1/d))).
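The two cuts can be written down directly, following the structure of Frigo & Strumpen's trapezoid recursion. In this sketch the three-point averaging kernel and the fixed (Dirichlet) boundaries are illustrative choices; the space-cut test below is the w ≥ 2×h rule generalized to trapezoids with slanted edges.

```python
def make_walker(u, n):
    """u holds two time levels of an n-point 1-D grid (u[t % 2])."""
    def kernel(t, x):
        # Advance point x from time t to t+1.
        cur, nxt = u[t % 2], u[(t + 1) % 2]
        if 0 < x < n - 1:
            nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0
        else:
            nxt[x] = cur[x]              # fixed boundary value

    def walk(t0, t1, x0, dx0, x1, dx1):
        """Recurse over the space-time trapezoid whose edges at time t
        are x0 + dx0*(t - t0) and x1 + dx1*(t - t0)."""
        dt = t1 - t0
        if dt == 1:
            for x in range(x0, x1):
                kernel(t0, x)
        elif dt > 1:
            if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
                # wide trapezoid: space cut along a slope -1 line
                xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
                walk(t0, t1, x0, dx0, xm, -1)
                walk(t0, t1, xm, -1, x1, dx1)
            else:
                # tall trapezoid: time cut at the midpoint
                s = dt // 2
                walk(t0, t0 + s, x0, dx0, x1, dx1)
                walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1)

    return walk
```

Calling walk(0, T, 0, 0, n, 0) performs T steps over the whole grid in a cache-oblivious order; the traversal respects every dependence, so it computes exactly the same values as the plain nested t/x loops.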
Cache-oblivious stencil computation in practice: fewer misses, but more time.
Source: Datta, et al. (2007)
Cache-conscious algorithm

[Figure: space-time diagram of the explicitly blocked traversal]
Source: Datta, et al. (2007)