
Criticality Aware Tiered Cache Hierarchy (CATCH)

Anant Nori*, Jayesh Gaur*, Siddharth Rai#, Sreenivas Subramoney*, Hong Wang* (* Microarchitecture Research Lab, Intel; # Indian Institute of Technology Kanpur, India)


  1. Criticality Aware Tiered Cache Hierarchy (CATCH). Anant Nori*, Jayesh Gaur*, Siddharth Rai#, Sreenivas Subramoney*, Hong Wang* (* Microarchitecture Research Lab, Intel; # Indian Institute of Technology Kanpur, India)

  2-7. Popular Three Level Cache Hierarchy
     • Cache capacity ↔ Access latency
     • Target low average latency
     Skylake-like Server:
       L1 (Pvt.): Data / Code, 32 KB, 5 cyc
       L2 (Pvt.): 1MB, 15 cyc
       LLC (Shared, Exclusive): 1.375MB/core, 5.5MB (4 core), 40 cyc
     Broadwell-like Server:
       L1 (Pvt.): Data / Code, 32 KB
       L2 (Pvt.): 256 KB
       LLC (Inclusive): 2MB/core, 8MB (4 core)
     • Large distributed LLC, high latency
     • Lower L2 latency important
     • Trend to larger L2 sizes
     [Chart: perf. impact of NoL2 + 6.5MB LLC and NoL2 + 9.5MB LLC for client, FSPEC, HPC, ISPEC, server and GeoMean; y-axis from 0% to -15%; bar labels range from -2.7% to -14.0%]
     Is a large L2 the most efficient design choice?
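
The capacity versus latency trade-off on this slide can be made concrete with a small average-latency calculation. The sketch below uses the 5, 15 and 40 cycle latencies listed for the Skylake-like server; the hit rates, the 200-cycle memory latency and the function name are assumptions chosen only for illustration, and each latency is treated as the full load latency when the load is served from that level.

```python
def average_load_latency(levels, memory_latency):
    """Average load latency in cycles.

    levels: list of (hit_latency_cycles, local_hit_rate), ordered from L1 outward.
    memory_latency: latency paid by loads that miss every cache level.
    """
    latency = 0.0
    reach = 1.0  # fraction of all loads that get as far as this level
    for hit_latency, hit_rate in levels:
        latency += reach * hit_rate * hit_latency
        reach *= 1.0 - hit_rate
    return latency + reach * memory_latency


MEMORY_LATENCY = 200  # assumed, not taken from this slide

# Hypothetical hit rates; latencies (5, 15, 40 cycles) are from the slide.
with_l2 = average_load_latency([(5, 0.90), (15, 0.60), (40, 0.50)], MEMORY_LATENCY)

# Same traffic to memory (2% of loads), but former L2 hits now pay LLC latency.
no_l2 = average_load_latency([(5, 0.90), (40, 0.80)], MEMORY_LATENCY)

print(f"with L2: {with_l2:.1f} cycles, without L2: {no_l2:.1f} cycles")
```

With these made-up hit rates, removing the L2 raises the average by 1.5 cycles even though no extra loads go to memory, which is the effect the "Lower L2 latency important" bullet points at.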

  8-12. Large L2 caches
     [Diagram: LLC and L2 contents under an Inclusive and an Exclusive LLC]
     • Inclusive LLC → Exclusive LLC
     • Lower effective on-die cache per core
     • Large LLC better for multiple threads with disparate cache footprints
     • Area for Snoop-filter/Coherence-directory
     Despite area and power overheads, average latency reduction (performance) drives large L2
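
One possible reading of the capacity bullets, worked through with the cache sizes from the two server configurations shown earlier. This is an idealized sketch: it assumes every L2 is full of distinct lines, and the function names are hypothetical.

```python
def unique_capacity_mb(llc_mb, l2_mb_per_core, cores, inclusive):
    """Distinct data the L2s and LLC can hold together, in MB."""
    if inclusive:
        # Every L2 line is also kept in the LLC, so the L2s add nothing unique.
        return llc_mb
    # Exclusive: a line lives in an L2 or in the LLC, never in both.
    return llc_mb + l2_mb_per_core * cores


def reachable_by_one_core_mb(llc_mb, l2_mb_per_core, inclusive):
    """Capacity a single core can fill: its own L2 plus the shared LLC."""
    if inclusive:
        return llc_mb
    return llc_mb + l2_mb_per_core


# Broadwell-like: 256 KB L2 per core, 8 MB inclusive LLC, 4 cores.
broadwell_total = unique_capacity_mb(8.0, 0.25, 4, inclusive=True)
broadwell_single = reachable_by_one_core_mb(8.0, 0.25, inclusive=True)

# Skylake-like: 1 MB L2 per core, 5.5 MB exclusive LLC, 4 cores.
skylake_total = unique_capacity_mb(5.5, 1.0, 4, inclusive=False)
skylake_single = reachable_by_one_core_mb(5.5, 1.0, inclusive=False)

print(broadwell_total, broadwell_single)  # 8.0 8.0
print(skylake_total, skylake_single)      # 9.5 6.5
```

Under these assumptions the exclusive design holds more data in total, but a single core can reach less of it, because the other cores' private L2s are not available to it.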

  13-20. Loads and Program Criticality
     Program execution expressed in a Data Dependency Graph (Fields et al.)
     • Execution time governed by "Critical Path"
     [Diagram: data dependency graph of 7 instructions, each with D, E and C nodes; four are loads, labeled L2 Hit, LLC Hit, L2 Hit and LLC miss; edge weights of 2, 10, 30 and 200]
     CRITICAL L2 HIT LOAD: ~9% performance loss if L2 HIT → LLC HIT
     NON-CRITICAL L2 HIT LOAD: No performance impact if L2 HIT → LLC HIT
     Only critical load L2 hits matter to performance.
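
The critical-path argument can be sketched as a longest-path computation over a small dependency graph. The graph below is hypothetical and much simpler than the one on the slide (one node per instruction rather than separate D, E and C nodes); only the edge weights 2, 10, 30 and 200 are borrowed from the slide, read here as generic, L2 hit, LLC hit and LLC miss latencies.

```python
from functools import lru_cache


def critical_path_length(edges, source, sink):
    """Longest weighted path from source to sink in a DAG.

    edges: dict mapping node -> list of (successor, latency_cycles).
    """
    @lru_cache(maxsize=None)
    def longest_from(node):
        if node == sink:
            return 0
        best = float("-inf")
        for succ, latency in edges.get(node, []):
            best = max(best, latency + longest_from(succ))
        return best

    return longest_from(source)


def build_graph(ld1_latency=10, ld3_latency=10):
    """Two independent load chains; both first loads are L2 hits by default."""
    return {
        "start": [("ld1", 2), ("ld3", 2)],
        "ld1": [("ld2", ld1_latency)],  # ld2 depends on ld1
        "ld2": [("end", 200)],          # ld2 misses the LLC
        "ld3": [("ld4", ld3_latency)],  # ld4 depends on ld3
        "ld4": [("end", 30)],           # ld4 hits the LLC
    }


base = critical_path_length(build_graph(), "start", "end")
slow_critical = critical_path_length(build_graph(ld1_latency=30), "start", "end")
slow_noncritical = critical_path_length(build_graph(ld3_latency=30), "start", "end")

print(base, slow_critical, slow_noncritical)  # 212 232 212
```

Turning the L2 hit on the critical chain into an LLC hit lengthens execution by 20 cycles, while the same change on the other chain is hidden behind the LLC miss.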

  21. Cache Hierarchy and Program Criticality
     Oracle study
     • Track critical load PCs
     • Increase latencies of targeted load PCs

  22-23. Cache Hierarchy and Program Criticality
     [Chart: perf. impact (left axis, 0% to -18%) and % loads converted to higher latency (right axis, 0% to 100%) for two experiments, "L1 hits to L2 lat." and "L2 hits to LLC lat.", each run for ALL loads and for NonCritical loads; series: Perf. Impact (All loads), Perf. Impact (NonCritical loads), % loads converted; perf. impact labels -0.8%, -4.9%, -7.8%, -16.1%; % loads converted labels 39.6%, 49.1%]
     L2 cache most amenable to criticality optimizations

  24-27. Criticality Aware Tiered Cache Hierarchy (CATCH)
     A. Track critical load PCs
        • Served from non-L1 on-die caches
     B. Prefetch critical loads into L1
        • Accelerate the critical path
     [Chart: Oracle Performance Potential (*prefetchers disabled); perf. impact (left axis, 0% to 10%) and % L1 misses converted (right axis, 0% to 20%) for 32 PC, 128 PC, 2048 PC, All PC and NoL2 + 2048 PC; series: PerfImpact, %loads Converted; perf. impact labels 5.5%, 5.8%, 6.1%, 6.2%, 6.6%; % converted labels 14.1%, 15.5%, 17.0%]
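
Step A can be pictured as a small, bounded table of load PCs. The sketch below is an illustration only, not the mechanism from the talk: the class name, the counter-based eviction and the `on_critical_path` input (which would have to come from some criticality detector or oracle) are all assumptions. The table sizes in the chart (32, 128, 2048 PCs) suggest the kind of capacity involved.

```python
class CriticalLoadTable:
    """Hypothetical bounded table of critical load PCs."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.counts = {}  # load PC -> times it was seen on the critical path

    def record(self, pc, served_from, on_critical_path):
        # Track only critical loads served from a non-L1 on-die cache.
        if not on_critical_path or served_from not in ("L2", "LLC"):
            return
        if pc not in self.counts and len(self.counts) >= self.capacity:
            # Assumed policy: evict the PC seen least often.
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[pc] = self.counts.get(pc, 0) + 1

    def is_critical(self, pc):
        """True if loads from this PC are candidates for prefetch into L1."""
        return pc in self.counts


table = CriticalLoadTable(capacity=2)
table.record(0x400, "L2", True)
table.record(0x400, "L2", True)
table.record(0x408, "LLC", True)
table.record(0x410, "L1", True)   # ignored: already served from L1
table.record(0x418, "L2", False)  # ignored: not on the critical path
table.record(0x420, "L2", True)   # table full: evicts 0x408 (lowest count)

print([hex(pc) for pc in sorted(table.counts)])  # ['0x400', '0x420']
```

A prefetcher for step B would consult `is_critical(pc)` before bringing a load's data into L1, so that only loads on the critical path consume L1 capacity.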
