Cache-Oblivious Algorithms
SLIDE 1

Cache-Oblivious Algorithms

SLIDE 2

Cache-Oblivious Model

SLIDE 3

The Unknown Machine

    Algorithm                  Algorithm
        ↓                          ↓
    C program                  Java program
        ↓ gcc                      ↓ javac
    Object code                Java bytecode
        ↓ linux                    ↓ java
    Execution                  Interpretation

The C program can be executed on machines with a specific class of CPUs; the Java program can be executed on any machine with a Java interpreter.

SLIDE 4

The Unknown Machine

Goal: Develop algorithms that are optimized with respect to memory hierarchies without knowing the parameters of the hierarchy.

SLIDE 5

Cache-Oblivious Model

[Figure: CPU, memory, and disk, with block I/O between the levels]

  • The I/O model
  • Algorithms do not know the parameters B and M
  • An optimal off-line cache-replacement strategy is assumed

Frigo et al. 1999

SLIDE 6

Justification of the Ideal-Cache Model

Optimal replacement: LRU with twice the cache size incurs at most twice the cache misses of the optimal off-line strategy. [Sleator and Tarjan, 1985]

Corollary: If T_{M,B}(N) = O(T_{2M,B}(N)), then the number of cache misses using LRU is O(T_{M,B}(N)).

Two memory levels: An optimal cache-oblivious algorithm satisfying T_{M,B}(N) = O(T_{2M,B}(N)) achieves an optimal number of cache misses on each level of a multilevel cache using LRU.

Full associativity: an LRU cache can be simulated with explicit memory management, even on a direct-mapped cache:

  • a dictionary (2-universal hash functions) of cache lines kept in memory
  • expected O(1) access time to a cache line in memory

SLIDE 7

Matrix Multiplication

SLIDE 8

Matrix Multiplication

Problem: C = A · B, where c_ij = Σ_{k=1..N} a_ik · b_kj

Layout of matrices: [Figure: an 8 × 8 matrix stored in each of the four layouts]

Row major · Column major · 4 × 4-blocked · Bit interleaved

SLIDE 9

Matrix Multiplication

Algorithm 1: Nested loops (row major)

    for i = 1 to N
      for j = 1 to N
        c_ij = 0
        for k = 1 to N
          c_ij = c_ij + a_ik · b_kj

  • Reading a column of B uses N I/Os
  • Total O(N³) I/Os
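
As a concrete illustration (not from the slides), a minimal Python transcription of Algorithm 1, using 0-based indices:

```python
def matmul_naive(A, B):
    """Algorithm 1: three nested loops over row-major lists of lists.
    Scanning a column of B touches N different rows, hence N I/Os per
    column in the I/O model, and O(N^3) I/Os in total."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```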

SLIDE 10

Matrix Multiplication

Algorithm 2: Blocked algorithm (cache-aware)

  • Partition A and B into blocks of size s × s, where s = Θ(√M)
  • Apply Algorithm 1 to the N/s × N/s matrices whose elements are s × s matrices

[Figure: an 8 × 8 matrix partitioned into s × s blocks]
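
A sketch of Algorithm 2 in Python; here the block size s is an explicit parameter, which a cache-aware implementation would set to Θ(√M):

```python
def matmul_blocked(A, B, s):
    """Algorithm 2 (cache-aware): process the matrices in s x s blocks,
    with s = Theta(sqrt(M)) so that three blocks fit in cache at once."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for k0 in range(0, n, s):
            for j0 in range(0, n, s):
                # multiply block A[i0.., k0..] with block B[k0.., j0..]
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C
```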

SLIDE 11

Matrix Multiplication

Algorithm 2, number of I/Os, with the s × s-blocked layout or with row major layout and M = Ω(B²):

    O((N/s)³ · s²/B) = O(N³/(s·B)) = O(N³/(B·√M)) I/Os

SLIDE 12

Matrix Multiplication

  • Algorithm 2 uses O(N³/(B·√M)) I/Os, which is optimal

Hong & Kung, 1981

SLIDE 13

Matrix Multiplication

Algorithm 3: Recursive algorithm (cache-oblivious)

    ( A11 A12 ) ( B11 B12 )   ( A11·B11 + A12·B21   A11·B12 + A12·B22 )
    ( A21 A22 ) ( B21 B22 ) = ( A21·B11 + A22·B21   A21·B12 + A22·B22 )

  • 8 recursive N/2 × N/2 matrix multiplications + 4 N/2 × N/2 matrix sums
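
A sketch of Algorithm 3 in Python, assuming N is a power of two (the helper names are mine):

```python
def matmul_rec(A, B):
    """Algorithm 3 (cache-oblivious): split each matrix into four
    N/2 x N/2 quadrants, do 8 recursive multiplications and 4 sums.
    Assumes N is a power of two."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)]
                        for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = (quad(A, 0, 0), quad(A, 0, h),
                          quad(A, h, 0), quad(A, h, h))
    B11, B12, B21, B22 = (quad(B, 0, 0), quad(B, 0, h),
                          quad(B, h, 0), quad(B, h, h))
    C11 = add(matmul_rec(A11, B11), matmul_rec(A12, B21))
    C12 = add(matmul_rec(A11, B12), matmul_rec(A12, B22))
    C21 = add(matmul_rec(A21, B11), matmul_rec(A22, B21))
    C22 = add(matmul_rec(A21, B12), matmul_rec(A22, B22))
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

No cache parameter appears anywhere; with a recursive layout the subproblems become cache-resident automatically.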

SLIDE 14

Matrix Multiplication

Algorithm 3, number of I/Os, with the bit-interleaved layout or with row major layout and M = Ω(B²):

    T(N) ≤  O(N²/B)                if N ≤ ε·√M
            8·T(N/2) + O(N²/B)     otherwise

which solves to T(N) = O(N³/(B·√M)).

SLIDE 15

Matrix Multiplication

  • Algorithm 3 uses O(N³/(B·√M)) I/Os, which is optimal [Hong & Kung, 1981]
  • Extension to non-square matrices [Frigo et al., 1999]

SLIDE 16

Matrix Multiplication

Algorithm 4: Strassen's algorithm (cache-oblivious)

  • 7 recursive N/2 × N/2 matrix multiplications + O(1) matrix sums

    ( C11 C12 )   ( A11 A12 ) ( B11 B12 )
    ( C21 C22 ) = ( A21 A22 ) ( B21 B22 )

    m1 := (a21 + a22 − a11)(b22 − b12 + b11)      c11 := m2 + m3
    m2 := a11 · b11                               c12 := m1 + m2 + m5 + m6
    m3 := a12 · b21                               c21 := m1 + m2 + m4 − m7
    m4 := (a11 − a21)(b22 − b12)                  c22 := m1 + m2 + m4 + m5
    m5 := (a21 + a22)(b12 − b11)
    m6 := (a12 − a21 + a11 − a22) · b22
    m7 := a22 · (b11 + b22 − b12 − b21)
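
One level of the slide's 7-multiplication scheme, written out on scalars (in the recursion the entries would themselves be N/2 × N/2 matrices; the function name is mine):

```python
def strassen_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    """One level of the 7-multiplication scheme from the slide.
    Only 7 products m1..m7 are formed instead of 8."""
    m1 = (a21 + a22 - a11) * (b22 - b12 + b11)
    m2 = a11 * b11
    m3 = a12 * b21
    m4 = (a11 - a21) * (b22 - b12)
    m5 = (a21 + a22) * (b12 - b11)
    m6 = (a12 - a21 + a11 - a22) * b22
    m7 = a22 * (b11 + b22 - b12 - b21)
    c11 = m2 + m3
    c12 = m1 + m2 + m5 + m6
    c21 = m1 + m2 + m4 - m7
    c22 = m1 + m2 + m4 + m5
    return c11, c12, c21, c22

# strassen_2x2(1, 2, 3, 4, 5, 6, 7, 8) → (19, 22, 43, 50),
# the product of [[1,2],[3,4]] and [[5,6],[7,8]]
```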

SLIDE 17

Matrix Multiplication

Algorithm 4, number of I/Os, with the bit-interleaved layout or with row major layout and M = Ω(B²):

    T(N) ≤  O(N²/B)                if N ≤ ε·√M
            7·T(N/2) + O(N²/B)     otherwise

which solves to T(N) = O(N^(log₂ 7)/(B·√M)), where log₂ 7 ≈ 2.81.

SLIDE 18

Cache-Oblivious Search Trees

SLIDE 19

Static Cache-Oblivious Trees

Recursive memory layout ≡ van Emde Boas layout

[Figure: a tree of height h is split at the middle into a top tree A of height ⌈h/2⌉ and bottom trees B1, …, Bk of height ⌊h/2⌋; the memory layout is A B1 … Bk, recursively]

  • Degree O(1)
  • Searches use O(log_B N) I/Os

Prokop 1999

SLIDE 20

Static Cache-Oblivious Trees

  • Range reporting uses O(log_B N + k/B) I/Os

Prokop 1999

SLIDE 21

Static Cache-Oblivious Trees

  • Best possible: (log₂ e + o(1)) · log_B N I/Os per search

Bender, Brodal, Fagerberg, Ge, He, Hu, Iacono, López-Ortiz 2003

SLIDE 22

Dynamic Cache-Oblivious Trees

  • Embed a dynamic tree of small height into a complete tree
  • Static van Emde Boas layout
  • Rebuild the data structure whenever N doubles or halves

[Figure: example tree with keys 6 4 1 3 5 8 7 11 10 13]

  • Search: O(log_B N) I/Os
  • Range reporting: O(log_B N + k/B) I/Os
  • Updates: O(log_B N + (log² N)/B) I/Os

Brodal, Fagerberg, Jacob 2001

SLIDE 23

Example

Tree: 6 4 1 3 5 8 7 11 10 13

van Emde Boas layout of the embedded complete tree: 6 4 8 1 − 3 5 − − 7 − − 11 10 13
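
The layout on this slide can be reproduced mechanically. The sketch below (function names are mine) computes the van Emde Boas order of BFS indices, then applies it to the example tree embedded as a BFS array, with None marking empty slots:

```python
def veb_order(v, h):
    """van Emde Boas order of the complete subtree of height h rooted
    at BFS index v (root = 1, children 2v and 2v+1): lay out the top
    tree of height ceil(h/2) first, then the bottom trees left to right."""
    if h == 1:
        return [v]
    top = (h + 1) // 2                        # height of the top tree A
    out = veb_order(v, top)                   # A first ...
    for r in range(v << top, (v + 1) << top):
        out += veb_order(r, h - top)          # ... then B1, ..., Bk
    return out

# The slide's tree embedded into a complete tree of height 4,
# given as a BFS array (None marks an empty slot):
bfs = [6, 4, 8, 1, 5, 7, 11, None, 3, None, None, None, None, 10, 13]
layout = [bfs[i - 1] for i in veb_order(1, 4)]
# layout == [6, 4, 8, 1, None, 3, 5, None, None, 7, None, None, 11, 10, 13]
```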

SLIDE 24

Binary Trees of Small Height

[Figure: inserting key 2 makes the example tree too deep; the subtree holding keys 1 2 3 4 5 is rebuilt to minimum height with root 3]

  • If an insertion causes non-small height, rebuild the subtree at the nearest ancestor with sufficiently few descendants
  • Insertions require amortized time O(log² N)

Andersson and Lai 1990

SLIDE 25

Binary Trees of Small Height

  • For each level i there is a threshold τ_i = τ_L + i·∆, such that 0 < τ_L = τ_0 < τ_1 < · · · < τ_H = τ_U < 1
  • For a node v_i on level i, define the density ρ(v_i) = (# nodes below v_i) / m_i, where m_i = # possible nodes below v_i with depth at most H

Insertion

  • Insert the new element
  • If its depth > H, locate the nearest ancestor v_i with ρ(v_i) ≤ τ_i and rebuild the subtree at v_i to have minimum height, with elements evenly distributed between left and right subtrees

Andersson and Lai 1990

SLIDE 26

Binary Trees of Small Height

Theorem: Insertions require amortized time O(log² N)

Proof: Consider two consecutive redistributions at v_i.

  • After the first redistribution, ρ(v_i) ≤ τ_i
  • Before the second redistribution, a child v_{i+1} of v_i has ρ(v_{i+1}) > τ_{i+1}
  • # insertions below v_i ≥ m(v_{i+1}) · (τ_{i+1} − τ_i) = m(v_{i+1}) · ∆
  • A redistribution at v_i costs m(v_i), i.e. per insertion below v_i: m(v_i) / (m(v_{i+1}) · ∆) ≤ 2/∆
  • Total insertion cost per element: Σ_{i=0..H} 2/∆ = O(log² N) ✷

Andersson and Lai 1990

SLIDE 27

Memory Layouts of Trees

    DFS:            1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    inorder:        8 4 2 1 3 6 5 7 12 10 9 11 14 13 15
    BFS:            1 2 4 8 9 5 10 11 3 6 12 13 7 14 15
    van Emde Boas:  1 2 4 5 6 7 8 9 3 10 11 12 13 14 15   (in theory best)

SLIDE 28

Searches in Pointer-Based Layouts

[Figure: search time per element vs. number of elements (10³ to 10⁶); curves: pointer bfs, pointer dfs, pointer vEB, pointer random insert, pointer random layout]

  • The van Emde Boas layout wins, followed by the BFS layout

SLIDE 29

Searches with Implicit Layouts

[Figure: search time per element vs. number of elements (10³ to 10⁶); curves: implicit bfs, implicit dfs, implicit vEB, implicit in-order, implicit 9-ary bfs]

  • The BFS layout wins due to simplicity and caching of the topmost levels
  • The van Emde Boas layout requires quite complex index computations

SLIDE 30

Implicit vs Pointer-Based Layouts

[Figures: pointer vs implicit search times for the BFS layout and for the van Emde Boas layout]

  • Implicit layouts become competitive as n grows

SLIDE 31

Insertions in Implicit Layouts

[Figure: time per random insertion vs. number of elements (10⁵ to 10⁶); curves: implicit bfs, implicit in-order, implicit vEB]

  • Insertions are rather slow (a factor 10-100 over searches)

SLIDE 32

Summary

  • Dynamic cache-oblivious search trees:
      Search O(log_B N), Range reporting O(log_B N + k/B), Updates O(log_B N + (log² N)/B)
  • Update time O(log_B N) is possible with one level of indirection (implies sub-optimal range reporting)
  • Importance of memory layouts
  • The van Emde Boas layout gives good cache performance
  • Computation time is important when considering caches

[Figure: example layout 6 4 8 1 − 3 5 − − 7 − − 11 10 13]

SLIDE 33

Cache-Oblivious Sorting

SLIDE 34

Sorting Problem

  • Input: array containing x₁, …, x_N
  • Output: array with x₁, …, x_N in sorted order
  • Elements can be compared and copied

    3 4 8 2 8 4 4 4 6   →   2 3 4 4 4 4 6 8 8

SLIDE 35

Binary Merge-Sort

[Figure: merge tree; sorted runs are merged pairwise from the input stream to the output stream]

SLIDE 36

Binary Merge-Sort

[Figure: merge tree, as on the previous slide]

  • Recursive; merges two arrays at a time; runs of size O(M) are sorted internally in cache
  • O(N log N) comparisons
  • O(N/B · log₂(N/M)) I/Os
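
A minimal in-memory sketch of binary merge-sort (the I/O bound above concerns the external-memory variant; this shows only the merge structure):

```python
def merge_sort(a):
    """Binary merge-sort: O(N log N) comparisons; in the I/O model the
    external variant uses O(N/B * log2(N/M)) block transfers, since
    runs of size O(M) are produced entirely inside the cache."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # one merge step at a time
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```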

SLIDE 37

Merge-Sort

    Degree     I/Os
    2          O(N/B · log₂(N/M))
    d          O(N/B · log_d(N/M))                               (d ≤ M/B − 1)
    Θ(M/B)     O(N/B · log_{M/B}(N/M)) = O(Sort_{M,B}(N))        [Aggarwal and Vitter 1988]

Funnel-Sort

    2          O(1/ε · Sort_{M,B}(N))                            (M ≥ B^{1+ε})

Frigo, Leiserson, Prokop and Ramachandran 1999; Brodal and Fagerberg 2002

SLIDE 38

Lower Bound

Brodal and Fagerberg 2003

                 Block size   Memory   I/Os
    Machine 1    B1           M        t1
    Machine 2    B2           M        t2

One algorithm, two machines, B1 ≤ B2.

Trade-off:  8 t1 B1 + 3 t1 B1 · log(8 M t2 / (t1 B1)) ≥ N log(N/M) − 1.45 N

SLIDE 39

Lower Bound

                        Assumption      I/Os
    Lazy Funnel-Sort    B ≤ M^{1−ε}     (a) B₂ = M^{1−ε}:  Sort_{M,B₂}(N)
                                        (b) B₁ = 1:        Sort_{M,B₁}(N) · 1/ε
    Binary Merge-Sort   B ≤ M/2         (a) B₂ = M/2:      Sort_{M,B₂}(N)
                                        (b) B₁ = 1:        Sort_{M,B₁}(N) · log M

Corollary: (a) ⇒ (b)

SLIDE 40

Funnel-Sort

SLIDE 41

k-merger

Frigo et al., FOCS'99

[Figure: a k-merger M merges k sorted input streams into one sorted output stream]

SLIDE 42

k-merger

Frigo et al., FOCS'99

Recursive definition: [Figure: a k-merger is built from √k-mergers M₀, M₁, …, M_{√k}, connected by buffers B₁, …, B_{√k} of size k^{3/2}]

SLIDE 43

k-merger

Frigo et al., FOCS'99

Recursive layout: [Figure: the merger M₀, the buffers B₁, …, B_{√k}, and the mergers M₁, …, M_{√k} are stored recursively in consecutive memory]

SLIDE 44

Lazy k-merger

Brodal and Fagerberg 2002

[Figure: a k-merger with mergers M₀, …, M_{√k} and buffers B₁, …, B_{√k}]

SLIDE 45

Lazy k-merger

Brodal and Fagerberg 2002

    Procedure Fill(v)
      while out-buffer not full
        if left in-buffer empty
          Fill(left child)
        if right in-buffer empty
          Fill(right child)
        perform one merge step
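
The Fill procedure can be sketched in Python with binary merger nodes over in-memory lists (class and function names are mine; assumes the number of input runs is a power of two, at least 2):

```python
class Merger:
    """Binary lazy-merger node (sketch). A leaf wraps a sorted run; an
    internal node has an output buffer of capacity cap that fill()
    refills on demand, following the slide's pseudocode."""
    def __init__(self, cap, left=None, right=None, data=None):
        self.cap, self.left, self.right = cap, left, right
        self.leaf = data is not None
        self.buf = list(data) if self.leaf else []
        self.done = self.leaf and not self.buf   # no more output ever

    def fill(self):
        if self.leaf:
            return
        while not self.done and len(self.buf) < self.cap:
            for c in (self.left, self.right):    # refill empty in-buffers
                if not c.buf and not c.done:
                    c.fill()
            l, r = self.left, self.right
            if l.buf and r.buf:                  # one merge step
                src = l if l.buf[0] <= r.buf[0] else r
            elif l.buf or r.buf:
                src = l if l.buf else r
            else:
                self.done = True                 # both inputs exhausted
                break
            self.buf.append(src.buf.pop(0))
            if src.leaf and not src.buf:
                src.done = True

def k_merge(runs, cap=4):
    """Merge 2^i sorted runs (i >= 1) with a tree of binary mergers."""
    nodes = [Merger(cap, data=r) for r in runs]
    while len(nodes) > 1:
        nodes = [Merger(cap, left=nodes[i], right=nodes[i + 1])
                 for i in range(0, len(nodes), 2)]
    root, out = nodes[0], []
    while True:                                  # drain the root buffer
        root.fill()
        out += root.buf
        root.buf = []
        if root.done:
            return out
```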

SLIDE 46

Lazy k-merger

Brodal and Fagerberg 2002

Lemma: If M ≥ B² and the output buffer has size k³, then O(k³/B · log_M(k³) + k) I/Os are done during an invocation of Fill(root).

SLIDE 47

Funnel-Sort

Frigo, Leiserson, Prokop and Ramachandran 1999; Brodal and Fagerberg 2002

  • Divide the input into N^{1/3} segments of size N^{2/3}
  • Recursively Funnel-Sort each segment
  • Merge the sorted segments with an N^{1/3}-merger

Merger sizes in the recursion: k = N^{1/3}, N^{2/9}, N^{4/27}, …, 2
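
The division scheme can be sketched as follows; `heapq.merge` stands in for the N^{1/3}-merger, so only the recursion pattern (not the cache-oblivious merger itself) is shown:

```python
import heapq

def funnel_sort(a):
    """Funnel-Sort recursion pattern: split into ~N^(1/3) segments of
    size ~N^(2/3), sort each recursively, merge the sorted segments.
    The k-merger is replaced by heapq.merge in this sketch."""
    n = len(a)
    if n <= 8:
        return sorted(a)
    seg = max(2, round(n ** (2 / 3)))    # segment size ~ N^(2/3)
    runs = [funnel_sort(a[i:i + seg]) for i in range(0, n, seg)]
    return list(heapq.merge(*runs))
```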

SLIDE 48

Funnel-Sort

Theorem: Funnel-Sort performs O(Sort_{M,B}(N)) I/Os for M ≥ B².

SLIDE 49

Hardware

    Processor type           Pentium 4          Pentium 3          MIPS 10000
    Workstation              Dell PC            Delta PC           SGI Octane
    Operating system         GNU/Linux 2.4.18   GNU/Linux 2.4.18   IRIX 6.5
    Clock rate               2400 MHz           800 MHz            175 MHz
    Address space            32 bit             32 bit             64 bit
    Integer pipeline stages  20                 12                 6
    L1 data cache size       8 KB               16 KB              32 KB
    L1 line size             128 Bytes          32 Bytes           32 Bytes
    L1 associativity         4-way              4-way              2-way
    L2 cache size            512 KB             256 KB             1024 KB
    L2 line size             128 Bytes          32 Bytes           32 Bytes
    L2 associativity         8-way              4-way              2-way
    TLB entries              128                64                 64
    TLB associativity        Full               4-way              64-way
    TLB miss handler         Hardware           Hardware           Software
    Main memory              512 MB             256 MB             128 MB

SLIDE 50

Wall Clock

Pentium 4, 512/512

[Figure: wall-clock time per element (0.1µs to 100µs) vs. number of elements (10⁶ to 10⁹); curves: ffunnelsort, funnelsort, lowscosa, stdsort, ami_sort, msort-c, msort-m]

Kristoffer Vinther 2003

SLIDE 51

Page Faults

Pentium 4, 512/512

[Figure: page faults per block of elements vs. number of elements (10⁶ to 10⁹); curves: ffunnelsort, funnelsort, lowscosa, stdsort, msort-c, msort-m]

Kristoffer Vinther 2003

SLIDE 52

Cache Misses

MIPS 10000, 1024/128

[Figure: L2 cache misses per line of elements vs. number of elements (10⁵ to 10⁹); curves: ffunnelsort, funnelsort, lowscosa, stdsort, msort-c, msort-m]

Kristoffer Vinther 2003

SLIDE 53

TLB Misses

MIPS 10000, 1024/128

[Figure: TLB misses per block of elements vs. number of elements (10⁵ to 10⁹); curves: ffunnelsort, funnelsort, lowscosa, stdsort, msort-c, msort-m]

Kristoffer Vinther 2003

SLIDE 54

Conclusions

Cache-oblivious sorting

  • is possible
  • requires a tall-cache assumption M ≥ B^{1+ε}
  • has performance comparable with cache-aware algorithms

Future work

  • more experimental justification for the cache-oblivious model
  • limitations of the model: time-space trade-offs?
  • a tool-box for cache-oblivious algorithms