a fast analytical model of fully associative caches
play

A Fast Analytical Model of Fully Associative Caches spcl.inf.ethz.ch - PowerPoint PPT Presentation

spcl.inf.ethz.ch @spcl_eth T OBIAS G YSI , T OBIAS G ROSSER , L AURIN B RANDNER , AND T ORSTEN H OEFLER A Fast Analytical Model of Fully Associative Caches spcl.inf.ethz.ch @spcl_eth The Cost of Data Movement Depends on Global State and Does


  1. spcl.inf.ethz.ch @spcl_eth T OBIAS G YSI , T OBIAS G ROSSER , L AURIN B RANDNER , AND T ORSTEN H OEFLER A Fast Analytical Model of Fully Associative Caches

  2. spcl.inf.ethz.ch @spcl_eth The Cost of Data Movement Depends on Global State and Does Not Compose int N = 1000; for ( int i = 0; i < N; i++) { for ( int j = 0; j < i; j++) { for ( int k = 0; k < j; k++) { percentage of cache misses? A[i][j] -= A[i][k] * A[j][k]; L1 cache 1.6% } L2 cache 1.4% A[i][j] /= A[j][j]; } most expensive memory access? for ( int k = 0; k < i; k++) { A[j][k] A[i][i] -= A[i][k] * A[i][k]; } amount of compulsory and capacity misses? A[i][i] = sqrt(A[i][i]); # compulsory misses 31,752 } # capacity misses 10,630,620 Cholesky kernel from https://sourceforge.net/projects/polybench/ 2

  3. spcl.inf.ethz.ch @spcl_eth HayStack Output for Cholesky Factorization parameters: relative number of cache misses (statement) โ€ข cache sizes (32k and 512k) โ€ข cacheline size (64B) 5 for (int i = 0; i < N; i++) { 6 for (int j = 0; j < i; j++) { 7 for (int k = 0; k < j; k++) { 8 A[i][j] -= A[i][k] * A[j][k]; ----------------------------------------------------------------------------- ref type comp[%] L1[%] L2[%] tot[%] reuse[ln] A[i][j] rd 0.00459 0.00000 0.00000 24.86910 8,10 A[i][k] rd 0.00000 0.00000 0.00000 24.86910 8,10 A[j][k] rd 0.00000 1.58635 1.38213 24.86910 8,10,13,15 A[i][j] wr 0.00000 0.00000 0.00000 24.86910 8 ----------------------------------------------------------------------------- absolute number of cache misses (program) compulsory: 31'752 capacity (L1): 10'630'620 capacity (L2): 9'258'460 total: 668'166'500 3

  4. spcl.inf.ethz.ch @spcl_eth Comparison to Simulation 1 day 1 hour 1 minute 1 seconds 4

  5. spcl.inf.ethz.ch @spcl_eth Symbolic Counting Avoids the Explicit Enumeration 1d illustration i j Barvinok algorithm enumeration #points = 3 #points = 1 #points = 2 #points = j-i+1 = 3 symbolic i j enumeration #points = 1 #points = 2 #points = 3 #points = 4 #points = 5 #points = 6 #points = 7 #points = 8 #points = 9 symbolic #points = j-i+1 = 9 Alexander I. Barvinok, A Polynomial Time Algorithm for Counting Integral Points in Polyhedra When the Dimension is Fixed . 1994. 5

  6. spcl.inf.ethz.ch @spcl_eth The LRU Stack Distance Allows Us to Model Fully Associative Caches example memory accesses LRU stack int sum = 0; distance 3 distance 4 distance 2 distance 1 i=0 M(0) M(0) M(1) M(0) M(2) M(1) M(3) M(2) for ( int i=0; i<4; ++i) hit i=1 M(1) M(0) M(1) M(2) M(2) M(3) M(1) S0: M[i] = i; for ( int j=0; j<4; ++j) i=2 M(2) M(1) M(0) M(3) M(2) S1: sum += M[3-j]; miss i=3 M(3) M(3) M(0) j=0 M(3) j=1 M(2) deliberately generic model j=2 M(1) j=3 M(0) Richard L Mattson, Jan Gecsei, Donald R Slutz, and Irving L Traiger, Evaluation techniques for storage hierarchies. 1970. 6

  7. spcl.inf.ethz.ch @spcl_eth Compute the LRU Stack Distance example ๐‘ž ๐‘˜ = ๐‘˜ + 1 int sum = 0; 4 for ( int i=0; i<4; ++i) S0: M[i] = i; stack distance 3 for ( int j=0; j<4; ++j) S1: sum += M[3-j]; 2 1 apply symbolic counting once j 0 1 2 3 4 Kristof Beyls and Erik H Dโ€™Hollander , Generating cache hints for improved program efficiency . 2005. 7

  8. spcl.inf.ethz.ch @spcl_eth Count the Cache Misses Given the LRU Stack Distance example ๐‘ž ๐‘˜ = ๐‘˜ + 1 int sum = 0; misses 4 for ( int i=0; i<4; ++i) hits S0: M[i] = i; stack distance 3 for ( int j=0; j<4; ++j) S1: sum += M[3-j]; cache size C=2 2 1 apply symbolic counting twice j 0 1 2 3 4 many different pieces and ๐‘˜ โˆถ ๐’’ ๐’Œ > ๐‘ซ โˆง 0 โ‰ค ๐‘˜ < 4 ๐‘˜ โˆถ ๐’’ ๐’Œ > ๐‘ซ โˆง 0 โ‰ค ๐‘˜ < 4 = 2 sometimes non-affine polynomials 8

  9. spcl.inf.ethz.ch @spcl_eth Some Access Patterns Result in Non-Linearities example original int sum = 0; int sum = 0; for ( int i=0; i<4; ++i) for ( int t=0; t<4; ++t) { S0: M[i] = i; for ( int i=0; i<4; ++i) for ( int j=0; j<4; ++j) S0: M[i] = i; i j S1: sum += M[3-j]; for ( int m=0; m<t; ++m) ๐‘ž ๐‘˜ = ๐‘˜ + 1 for ( int n=0; n<t; ++n) N[m][n] = t; additional time loop for ( int j=0; j<4; ++j) t S1: sum += M[3-j]; } i j ๐‘ž ๐‘˜, ๐‘ข = ๐’– ๐Ÿ‘ + ๐‘˜ + 1 partial enumeration 9

  10. spcl.inf.ethz.ch @spcl_eth Enumerate the Non-Affine Dimensions p ๐ฎ=๐Ÿ ๐‘˜ = ๐Ÿ + ๐‘˜ + 1 0 1 2 ๐‘ž ๐‘˜, ๐‘ข = ๐’– ๐Ÿ‘ + ๐‘˜ + 1 0 โ‰ค ๐‘ข < 3 (0,0) (1,0) (2,0) 0 ฯ€ t ๐‘ž ๐ฎ=๐Ÿ ๐‘˜ = ๐Ÿ + ๐‘˜ + 1 (0,1) (1,1) (2,1) 1 0 1 2 2 (0,2) (1,2) (2,2) 0 โ‰ค (๐‘˜, ๐‘ข) < 3 p ๐ฎ=๐Ÿ‘ ๐‘˜ = ๐Ÿ“ + ๐‘˜ + 1 0 1 2 12.4x speedup due to partial enumeration 10

  11. spcl.inf.ethz.ch @spcl_eth Modelling Cache Lines Introduces Floor Terms example original int sum = 0; for ( int i=0; i<4; ++i) S0: M[i] = i; i j for ( int j=0; j<4; ++j) S1: sum += M[3-j]; ๐‘ž ๐‘˜ = ๐‘˜ + 1 modelling cache lines equalization and rasterization i j ๐‘ž ๐‘˜ = ๐‘˜ ๐Ÿ‘ โˆ’ ๐’Œ โˆ’ ๐Ÿ ๐’Œ + 1 2 ๐Ÿ‘ 11

  12. spcl.inf.ethz.ch @spcl_eth Split the Domain to Eliminate Floor Terms ๐‘ž ๐‘˜ ๐’Œ%๐Ÿ‘=๐Ÿ = ๐‘˜ 2 ๐Ÿ + 1 0 2 ๐‘ž ๐‘˜ = ๐‘˜ ๐Ÿ‘ โˆ’ ๐’Œ โˆ’ ๐Ÿ ๐’Œ + 1 2 ๐Ÿ‘ 3 0 1 2 ๐‘ž ๐‘˜ ๐’Œ%๐Ÿ‘>๐Ÿ = ๐‘˜ 0 โ‰ค ๐‘˜ < 4 2 ๐Ÿ + 1 1 3 1.9x speedup due to equalization 12

  13. spcl.inf.ethz.ch @spcl_eth Accuracy of HayStack for the L1 Cache of Our Test System 13

  14. spcl.inf.ethz.ch @spcl_eth Error of HayStack Compared to Simulation (Dinero IV) HayStack (fully associative) Dinero IV (fully associative) Dinero IV (8-way associative) 14

  15. spcl.inf.ethz.ch @spcl_eth Performance of HayStack for the Large Problem Size of PolyBench 15

  16. spcl.inf.ethz.ch @spcl_eth Performance of HayStack Compared to PolyCache and Dinero Dinero IV PolyCache - simulator - analytical cache model - setup to simulate full associativity - models set associativity - problem size dependent performance - one core per cache set 370x speedup 21x speedup Jan Elder and Mark D. Hill, Dinero IV Trace-Driven Wenlei Bao, Sriram Krishnamoorthy, Louis-Noel Pouchet, and P Sadayappan, Uniprocessor Cache Simulator . 2003. Analytical modeling of cache behavior for affine programs. 2017. 16

  17. spcl.inf.ethz.ch @spcl_eth Conclusion generic model of fully associative caches accurate results compared to measurements fast enough to provide interactive feedback excellent performance compared to alternatives 17

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend