
Cache-Oblivious Algorithms


  1. Cache-Oblivious Algorithms

  2. Cache-Oblivious Model

  3. The Unknown Machine

     • Algorithm → C program → (gcc) → object code → (linux) → execution. Can be executed on machines with a specific class of CPUs.
     • Algorithm → Java program → (javac) → Java bytecode → (java) → interpretation. Can be executed on any machine with a Java interpreter.

     Goal: develop algorithms that are optimized w.r.t. memory hierarchies without knowing the parameters.

  5. Cache-Oblivious Model

     [Figure: CPU, memory, and disk, with I/Os between memory and disk]

     • I/O model
     • Algorithms do not know the parameters B (block size) and M (memory size)
     • Optimal off-line cache replacement strategy

     Frigo et al. 1999

  6. Justification of the ideal-cache model

     Optimal replacement: LRU + 2 × cache size ⇒ at most 2 × cache misses (Sleator and Tarjan, 1985)

     Corollary: $T_{M,B}(N) = O(T_{2M,B}(N))$ ⇒ #cache misses using LRU is $O(T_{M,B}(N))$

     Two memory levels: an optimal cache-oblivious algorithm satisfying $T_{M,B}(N) = O(T_{2M,B}(N))$ ⇒ optimal #cache misses on each level of a multilevel cache using LRU

     Fully associative cache: simulation of LRU
     • Direct mapped cache
     • Explicit memory management
     • Dictionary (2-universal hash functions) of cache lines in memory
     • Expected $O(1)$ access time to a cache line in memory
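The LRU claim above can be checked empirically on a small access trace. The following Python sketch is not part of the slides: the function names `lru_misses` and `opt_misses` are illustrative, and the off-line optimum is assumed to be Belady's rule (evict the block whose next use lies furthest in the future).

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses of an LRU cache holding `capacity` blocks."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = True
    return misses


def opt_misses(trace, capacity):
    """Count misses of the off-line optimal strategy (Belady's rule)."""
    n = len(trace)
    next_use = [n] * n  # n means "never used again"
    last_seen = {}
    for i in range(n - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], n)
        last_seen[trace[i]] = i
    cache = {}  # block -> position of its next use
    misses = 0
    for i, block in enumerate(trace):
        if block not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # furthest next use
                del cache[victim]
        cache[block] = next_use[i]
    return misses
```

On the trace `[1, 2, 3, 1, 2, 3]` with room for two blocks, LRU misses on every access while the off-line optimum misses four times; doubling the LRU cache to four blocks leaves only the three compulsory misses.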

  7. Matrix Multiplication

  8. Matrix Multiplication

     Problem: $C = A \cdot B$, $c_{ij} = \sum_{k=1..N} a_{ik} \cdot b_{kj}$

     Layout of matrices (memory position of each entry of an 8 × 8 matrix)

     Row major

          0  1  2  3  4  5  6  7
          8  9 10 11 12 13 14 15
         16 17 18 19 20 21 22 23
         24 25 26 27 28 29 30 31
         32 33 34 35 36 37 38 39
         40 41 42 43 44 45 46 47
         48 49 50 51 52 53 54 55
         56 57 58 59 60 61 62 63

     Column major

          0  8 16 24 32 40 48 56
          1  9 17 25 33 41 49 57
          2 10 18 26 34 42 50 58
          3 11 19 27 35 43 51 59
          4 12 20 28 36 44 52 60
          5 13 21 29 37 45 53 61
          6 14 22 30 38 46 54 62
          7 15 23 31 39 47 55 63

     4 × 4-blocked

          0  1  2  3 16 17 18 19
          4  5  6  7 20 21 22 23
          8  9 10 11 24 25 26 27
         12 13 14 15 28 29 30 31
         32 33 34 35 48 49 50 51
         36 37 38 39 52 53 54 55
         40 41 42 43 56 57 58 59
         44 45 46 47 60 61 62 63

     Bit interleaved

          0  1  4  5 16 17 20 21
          2  3  6  7 18 19 22 23
          8  9 12 13 24 25 28 29
         10 11 14 15 26 27 30 31
         32 33 36 37 48 49 52 53
         34 35 38 39 50 51 54 55
         40 41 44 45 56 57 60 61
         42 43 46 47 58 59 62 63
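Each of the four layouts is a function from a (row, column) pair to a memory position. The Python sketch below is an illustration rather than code from the slides; the function names and default parameters are assumptions, with the defaults chosen to match the 8 × 8 examples (0-based indices, 4 × 4 blocks stored in row-major order, column bits in the even positions of the interleaved index).

```python
def row_major(i, j, n=8):
    """Rows are stored one after another."""
    return i * n + j


def column_major(i, j, n=8):
    """Columns are stored one after another."""
    return j * n + i


def blocked(i, j, n=8, s=4):
    """s x s blocks in row-major order, each block itself row major."""
    blocks_per_row = n // s
    block = (i // s) * blocks_per_row + (j // s)
    return block * s * s + (i % s) * s + (j % s)


def bit_interleaved(i, j, bits=3):
    """Interleave the bits of j (even positions) and i (odd positions)."""
    index = 0
    for b in range(bits):
        index |= ((j >> b) & 1) << (2 * b)
        index |= ((i >> b) & 1) << (2 * b + 1)
    return index
```

For example, `[bit_interleaved(2, j) for j in range(8)]` gives the third row of the bit-interleaved layout, `[8, 9, 12, 13, 24, 25, 28, 29]`.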

  9. Matrix Multiplication

     Algorithm 1: Nested loops

         for i = 1 to N
           for j = 1 to N
             c_ij = 0
             for k = 1 to N
               c_ij = c_ij + a_ik · b_kj

     – Row major
     – Reading a column of B uses N I/Os
     – Total $O(N^3)$ I/Os

     Algorithm 2: Blocked algorithm (cache-aware)

     – Partition A and B into blocks of size $s \times s$ where $s = \Theta(\sqrt{M})$
     – Apply Algorithm 1 to the $\frac{N}{s} \times \frac{N}{s}$ matrices where elements are $s \times s$ matrices
     – $s \times s$-blocked or (row major and $M = \Omega(B^2)$)
     – $O\left(\left(\frac{N}{s}\right)^3 \cdot \frac{s^2}{B}\right) = O\left(\frac{N^3}{s \cdot B}\right) = O\left(\frac{N^3}{B\sqrt{M}}\right)$ I/Os
     – Optimal (Hong & Kung, 1981)
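Algorithms 1 and 2 can be sketched in Python as follows. This is a minimal illustration on lists of lists, not the slides' code: the names `matmul_naive` and `matmul_blocked` are assumptions, and the block size `s` is passed explicitly because a cache-aware algorithm has to know it (in the analysis above it would be chosen as Θ(√M)).

```python
def matmul_naive(a, b):
    """Algorithm 1: three nested loops over square matrices."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, s):
    """Algorithm 2: the same loops over s x s blocks (cache-aware)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, s):          # block row of C
        for jj in range(0, n, s):      # block column of C
            for kk in range(0, n, s):  # block index along the inner dimension
                # Multiply one s x s block of A with one s x s block of B.
                for i in range(ii, min(ii + s, n)):
                    for j in range(jj, min(jj + s, n)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + s, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Both functions perform the same N³ scalar multiplications; only the order of the memory accesses differs, which is what changes the I/O count.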

  13. Matrix Multiplication

     Algorithm 3: Recursive algorithm (cache-oblivious)

     $$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}$$

     – 8 recursive $\frac{N}{2} \times \frac{N}{2}$ matrix multiplications + 4 $\frac{N}{2} \times \frac{N}{2}$ matrix sums
     – # I/Os if bit interleaved or (row major and $M = \Omega(B^2)$):

     $$T(N) \le \begin{cases} O\left(\frac{N^2}{B}\right) & \text{if } N \le \varepsilon\sqrt{M} \\ 8 \cdot T\left(\frac{N}{2}\right) + O\left(\frac{N^2}{B}\right) & \text{otherwise} \end{cases}$$

     $$T(N) \le O\left(\frac{N^3}{B\sqrt{M}}\right)$$

     – Optimal (Hong & Kung, 1981)
     – Non-square matrices (Frigo et al., 1999)
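A minimal Python sketch of the recursive scheme, assuming square matrices whose side is a power of two. The name `matmul_recursive` and its offset parameters are illustrative and not taken from the slides. Instead of building the four matrix sums explicitly, this version accumulates each of the 8 sub-products directly into the output quadrant, which computes the same result. Note that the code never mentions B or M.

```python
def matmul_recursive(a, b, c, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0, n=None):
    """Add the product of an n x n block of a and an n x n block of b into c.

    (ai, aj), (bi, bj), (ci, cj) are the top-left corners of the blocks;
    n must be a power of two. Submatrices are addressed by offsets, so no
    data is copied.
    """
    if n is None:
        n = len(a)
    if n == 1:
        c[ci][cj] += a[ai][aj] * b[bi][bj]
        return
    h = n // 2
    for di in (0, h):          # quadrant row of C
        for dj in (0, h):      # quadrant column of C
            for dk in (0, h):  # which of the two products is summed in
                matmul_recursive(a, b, c,
                                 ai + di, aj + dk,
                                 bi + dk, bj + dj,
                                 ci + di, cj + dj, h)
```

A call looks like `c = [[0] * n for _ in range(n)]; matmul_recursive(a, b, c)`. A practical implementation would stop the recursion at a small base-case size rather than at 1 × 1, and would store the matrices in a layout such as bit interleaved so that each sub-block is contiguous in memory.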

  16. Matrix Multiplication

     Algorithm 4: Strassen's algorithm (cache-oblivious)

     – 7 recursive $\frac{N}{2} \times \frac{N}{2}$ matrix multiplications + $O(1)$ matrix sums

     $$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

     $$\begin{aligned} m_1 &:= (a_{21} + a_{22} - a_{11})(b_{22} - b_{12} + b_{11}) \\ m_2 &:= a_{11} b_{11} \\ m_3 &:= a_{12} b_{21} \\ m_4 &:= (a_{11} - a_{21})(b_{22} - b_{12}) \\ m_5 &:= (a_{21} + a_{22})(b_{12} - b_{11}) \\ m_6 &:= (a_{12} - a_{21} + a_{11} - a_{22})\, b_{22} \\ m_7 &:= a_{22}\,(b_{11} + b_{22} - b_{12} - b_{21}) \end{aligned}$$

     $$\begin{aligned} c_{11} &:= m_2 + m_3 \\ c_{12} &:= m_1 + m_2 + m_5 + m_6 \\ c_{21} &:= m_1 + m_2 + m_4 - m_7 \\ c_{22} &:= m_1 + m_2 + m_4 + m_5 \end{aligned}$$

     – # I/Os if bit interleaved or (row major and $M = \Omega(B^2)$):

     $$T(N) \le \begin{cases} O\left(\frac{N^2}{B}\right) & \text{if } N \le \varepsilon\sqrt{M} \\ 7 \cdot T\left(\frac{N}{2}\right) + O\left(\frac{N^2}{B}\right) & \text{otherwise} \end{cases}$$

     $$T(N) \le O\left(\frac{N^{\log_2 7}}{B\sqrt{M}}\right), \qquad \log_2 7 \approx 2.81$$
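The seven products and four output quadrants listed above translate directly into code. The Python sketch below is an illustration, not the slides' implementation: the helper names `add`, `sub`, `split`, and `strassen` are assumptions, the matrix side must be a power of two, and quadrants are copied into new lists for readability rather than addressed in place.

```python
def add(x, y):
    """Entrywise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]


def sub(x, y):
    """Entrywise difference of two equally sized matrices."""
    return [[p - q for p, q in zip(r, s)] for r, s in zip(x, y)]


def split(m):
    """Return the quadrants m11, m12, m21, m22 of a square matrix."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b):
    """Multiply two n x n matrices (n a power of two) with 7 recursive calls."""
    if len(a) == 1:
        return [[a[0][0] * b[0][0]]]
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    m1 = strassen(sub(add(a21, a22), a11), add(sub(b22, b12), b11))
    m2 = strassen(a11, b11)
    m3 = strassen(a12, b21)
    m4 = strassen(sub(a11, a21), sub(b22, b12))
    m5 = strassen(add(a21, a22), sub(b12, b11))
    m6 = strassen(sub(add(sub(a12, a21), a11), a22), b22)
    m7 = strassen(a22, sub(sub(add(b11, b22), b12), b21))
    c11 = add(m2, m3)
    c12 = add(add(add(m1, m2), m5), m6)
    c21 = sub(add(add(m1, m2), m4), m7)
    c22 = add(add(add(m1, m2), m4), m5)
    # Stitch the quadrants back together row by row.
    return [r + s for r, s in zip(c11, c12)] + [r + s for r, s in zip(c21, c22)]
```

Because the sub-blocks do not commute, the left factor of every product must come from A and the right factor from B, exactly as in the formulas.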

  18. Cache-Oblivious Search Trees

  19. Static Cache-Oblivious Trees

     Recursive memory layout ≡ van Emde Boas layout

     [Figure: a tree of height $h$ is split into a top tree $A$ of height $\lfloor h/2 \rfloor$ and bottom trees $B_1, \dots, B_k$ of height $\lceil h/2 \rceil$; in memory they are stored contiguously as $A, B_1, \dots, B_k$, each laid out recursively]

     Degree $O(1)$

     Searches use $O(\log_B N)$ I/Os

     Range reportings use $O\left(\log_B N + \frac{k}{B}\right)$ I/Os

     Prokop 1999

     Best possible: $(\log_2 e + o(1)) \log_B N$ (Bender, Brodal, Fagerberg, Ge, He, Hu, Iacono, López-Ortiz 2003)
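The recursive layout can be sketched in Python for a complete binary tree. The function name `veb_layout` and the node numbering are assumptions made for the illustration: nodes carry their breadth-first numbers (root 1, children of `v` are `2v` and `2v + 1`), and the function returns the order in which they would be placed in memory.

```python
def veb_layout(root=1, height=1):
    """Return the nodes of a complete binary tree in van Emde Boas order.

    `root` is the breadth-first number of the subtree root and `height`
    is the number of levels in the subtree.
    """
    if height == 1:
        return [root]
    top_height = height // 2               # floor(h/2) levels for the top tree A
    bottom_height = height - top_height    # ceil(h/2) levels for each B_i
    order = veb_layout(root, top_height)   # lay out A first
    # Roots of the bottom trees are the descendants of `root`
    # that lie top_height levels below it.
    first = root << top_height
    for bottom_root in range(first, first + (1 << top_height)):
        order.extend(veb_layout(bottom_root, bottom_height))
    return order
```

For a tree with four levels, `veb_layout(1, 4)` returns `[1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]`: the three-node top tree first, then each three-node bottom tree as one contiguous run.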
