Criticality Aware Tiered Cache Hierarchy (CATCH) Anant Nori*, Jayesh - PowerPoint PPT Presentation

Criticality Aware Tiered Cache Hierarchy (CATCH) Anant Nori*, Jayesh Gaur*, Siddharth Rai # , Sreenivas Subramoney*, Hong Wang* * Microarchitecture Research Lab, Intel # Indian Institute of Technology Kanpur, India

Popular Three Level Cache Hierarchy 2

L1 (Pvt.) Data Code 32 Popular Three Level 5 cyc KB Cache Hierarchy L2 (Pvt) 1MB 15 cyc Cache capacity ↔ Access latency • 1.375MB/core LLC 5.5MB (4 core) • Target low average latency Exclusive (Shared) 40 cyc Skylake-like Server 2

L1 (Pvt.) Data Code 32 Popular Three Level 5 cyc KB Cache Hierarchy L2 (Pvt) 1MB 15 cyc Cache capacity ↔ Access latency • 1.375MB/core LLC 5.5MB (4 core) • Target low average latency Exclusive (Shared) 40 cyc Skylake-like Server • Large distributed LLC, high latency • Lower L2 latency important 2

L1 (Pvt.) Data Code 32 Popular Three Level 5 cyc KB Cache Hierarchy L2 (Pvt) 1MB 15 cyc Cache capacity ↔ Access latency • 1.375MB/core LLC 5.5MB (4 core) • Target low average latency Exclusive (Shared) 40 cyc Skylake-like Server • Large distributed LLC, high latency 0% • Lower L2 latency important -3% -2.7% -3.3% -4.0% -4.2% perf. impact -6% -5.1% -6.5% -6.8% -6.9% -7.0% -9% -7.8% NoL2 + 6.5MB LLC -9.5% -12% NoL2 + 9.5MB LLC -14.0% -15% client FSPEC HPC ISPEC server GeoMean 2

Data Code Data Code L1 (Pvt.) 32 32 Popular Three Level 5 cyc KB KB 256 Cache Hierarchy 1MB L2 (Pvt) KB 15 cyc Cache capacity ↔ Access latency • 2MB/core 1.375MB/core LLC 8MB (4 core) 5.5MB (4 core) • Target low average latency Exclusive (Shared) Inclusive 40 cyc Skylake-like Server Broadwell-like Server • Large distributed LLC, high latency 0% • Lower L2 latency important -3% -2.7% -3.3% -4.0% -4.2% perf. impact • -6% Trend to larger L2 sizes -5.1% -6.5% -6.8% -6.9% -7.0% -9% -7.8% NoL2 + 6.5MB LLC -9.5% -12% NoL2 + 9.5MB LLC -14.0% -15% client FSPEC HPC ISPEC server GeoMean 2

Data Code Data Code L1 (Pvt.) 32 32 Popular Three Level 5 cyc KB KB 256 Cache Hierarchy 1MB L2 (Pvt) KB 15 cyc Cache capacity ↔ Access latency • 2MB/core 1.375MB/core LLC 8MB (4 core) 5.5MB (4 core) • Target low average latency Exclusive (Shared) Inclusive 40 cyc Skylake-like Server Broadwell-like Server • Large distributed LLC, high latency 0% • Lower L2 latency important -3% -2.7% -3.3% -4.0% -4.2% perf. impact • -6% Trend to larger L2 sizes -5.1% -6.5% -6.8% -6.9% -7.0% -9% -7.8% Is a large L2 the most NoL2 + 6.5MB LLC -9.5% -12% efficient design choice? NoL2 + 9.5MB LLC -14.0% -15% client FSPEC HPC ISPEC server GeoMean 2

Large L2 caches LLC L Inclusive 2 • Inclusive LLC → Exclusive LLC LLC L 2 LLC L2 Exclusive LLC L2 3

Large L2 caches LLC L Inclusive 2 • Inclusive LLC → Exclusive LLC LLC L 2 • Lower effective on-die cache per core LLC L2 Exclusive LLC L2 3

Large L2 caches LLC  L  Inclusive 2  • Inclusive LLC → Exclusive LLC LLC L  2  • Lower effective on-die cache per core LLC • Large LLC better for multiple threads  L2  Exclusive with disparate cache footprints  LLC  L2  3

Large L2 caches LLC L Inclusive 2 • Inclusive LLC → Exclusive LLC LLC L 2 • Lower effective on-die cache per core LLC • Large LLC better for multiple threads L2 Exclusive with disparate cache footprints LLC L2 • Area for Snoop-filter/Coherence-directory 3

Large L2 caches LLC L Inclusive 2 • Inclusive LLC → Exclusive LLC LLC L 2 • Lower effective on-die cache per core LLC • Large LLC better for multiple threads L2 Exclusive with disparate cache footprints LLC L2 • Area for Snoop-filter/Coherence-directory Despite area and power overheads, average latency reduction (performance) drives large L2 3

Loads and Program Criticality Program execution expressed in a Data Dependency Graph (Fields et. al.) 5 Load 6 7 3 4 Load Load 2 Load 1 D D D D D D D L2 Hit LLC Hit L2 Hit LLC miss 2 30 10 E E E E E E E 2 10 200 2 10 2 200 2 30 10 C C C C C C C 4

Loads and Program Criticality Program execution expressed in a Data Dependency Graph (Fields et. al.)  Execution time governed by “Critical Path” 5 Load 6 7 3 4 Load Load 2 Load 1 D D D D D D D L2 Hit LLC Hit L2 Hit LLC miss 2 30 10 E E E E E E E 2 10 200 2 10 2 200 2 30 10 C C C C C C C 4

Loads and Program Criticality Program execution expressed in a Data Dependency Graph (Fields et. al.)  Execution time governed by “Critical Path” 5 Load 6 7 3 4 Load Load 2 Load 1 D D D D D D D L2 Hit LLC Hit L2 Hit LLC miss 2 30 10 E E E E E E E 2 10 200 2 10 2 200 2 30 10 C C C C C C C CRITICAL L2 HIT LOAD 4

Loads and Program Criticality Program execution expressed in a Data Dependency Graph (Fields et. al.)  Execution time governed by “Critical Path” 5 Load 6 7 3 4 Load Load 2 Load 1 D D D D D D D L2 Hit LLC hit LLC Hit L2 Hit LLC miss 2 30 10 E E E E E E E 30 2 10 200 2 30 10 2 200 2 30 10 C C C C C C C CRITICAL L2 HIT LOAD ~9% performance loss if L2 HIT → LLC HIT 4

Loads and Program Criticality Program execution expressed in a Data Dependency Graph (Fields et. al.)  Execution time governed by “Critical Path” 5 Load 6 7 3 4 Load Load 2 Load 1 D D D D D D D L2 Hit LLC Hit L2 Hit LLC miss 2 30 10 E E E E E E E 2 10 200 2 10 2 200 2 30 10 C C C C C C C CRITICAL L2 HIT LOAD NON-CRITICAL L2 HIT LOAD ~9% performance loss if L2 HIT → LLC HIT 4

Loads and Program Criticality Program execution expressed in a Data Dependency Graph (Fields et. al.)  Execution time governed by “Critical Path” 5 Load 6 7 3 4 Load Load 2 Load 1 D D D D D D D L2 Hit LLC Hit LLC hit L2 Hit LLC miss 2 30 30 10 E E E E E E E 2 10 200 2 30 10 2 200 2 30 10 C C C C C C C CRITICAL L2 HIT LOAD NON-CRITICAL L2 HIT LOAD ~9% performance loss if No performance impact if L2 HIT → LLC HIT L2 HIT → LLC HIT 4

Loads and Program Criticality Program execution expressed in a Data Dependency Graph (Fields et. al.)  Execution time governed by “Critical Path” 5 Load 6 7 3 4 Load Load 2 Load 1 D D D D D D D L2 Hit LLC Hit L2 Hit LLC miss 2 30 10 E E E E E E E 2 10 200 2 10 2 200 2 30 10 C C C C C C C CRITICAL L2 HIT LOAD NON-CRITICAL L2 HIT LOAD ~9% performance loss if No performance impact if Only critical load L2 hits matter to performance. L2 HIT → LLC HIT L2 HIT → LLC HIT 4

Cache Hierarchy and Program Criticality Oracle study • Track critical load PCs • Increase latencies of targeted load PCs 5

Cache Hierarchy and Program Criticality 0% 100% % loads converted to higer latency -0.8% 90% -2% 80% -4% 70% -4.9% -6% perf. impact 60% -8% -7.8% 50% 49.1% -10% 40% 39.6% -12% 30% -14% 20% Perf. Impact – All loads -16% 10% Perf. Impact – NonCritical loads % loads converted -16.1% -18% 0% ALL NonCritical ALL NonCritical L1 hits to L2 lat. L2 hits to LLC lat. 5

Cache Hierarchy and Program Criticality 0% 100% % loads converted to higer latency -0.8% 90% -2% 80% -4% 70% -4.9% -6% perf. impact 60% -8% -7.8% 50% 49.1% -10% 40% 39.6% -12% 30% -14% 20% Perf. Impact – All loads -16% 10% Perf. Impact – NonCritical loads % loads converted -16.1% -18% 0% ALL NonCritical ALL NonCritical L1 hits to L2 lat. L2 hits to LLC lat. L2 cache most amenable to criticality optimizations 5

Criticality Aware Tiered Cache Hierarchy (CATCH) A. Track critical load PCs 6

Criticality Aware Tiered Cache Hierarchy (CATCH) A. Track critical load PCs • Served from non-L1 on-die caches 6

Criticality Aware Tiered Cache Hierarchy (CATCH) A. Track critical load PCs • Served from non-L1 on-die caches B. Prefetch critical loads into L1 • Accelerate the critical path 6

Criticality Aware Tiered Cache Hierarchy (CATCH) A. Track critical load PCs Oracle Performance Potential (*prefetchers disabled) • Served from non-L1 on-die caches 10% 20% % L1 misses converted 17.0% 14.1% 15.5% 8% 16% perf. impact 6.6% 6.2% 6.1% 5.8% 6% 12% B. Prefetch critical loads into L1 5.5% 4% 8% • Accelerate the critical path 2% 4% 0% 0% 32 PC 128 2048 All PC NoL2 PC PC + 2048 PC PerfImpact %loads Converted 6

Criticality Aware Tiered Cache Hierarchy (CATCH) Anant Nori*, Jayesh - PowerPoint PPT Presentation

Criticality Aware Tiered Cache Hierarchy (CATCH) Anant Nori, Jayesh Gaur, Siddharth Rai # , Sreenivas Subramoney, Hong Wang * Microarchitecture Research Lab, Intel # Indian Institute of Technology Kanpur, India Popular Three Level Cache

Memory Hierarchy: Cache Memory hierarchy Cache basics Locality Cache organization Cache-aware

What Is Memory Hierarchy A typical memory hierarchy today: Lecture 13: Cache Basics and Cache

1 Classifying cache misses Cache Organization Classifying misses by causes (3Cs) Cache size,

Balancing Fairness and Efficiency in Tiered Storage Systems with Bottleneck-Aware Allocation Hui

Web Cache Consistency Web Cache Consistency Web Cache Consistency Web Cache Consistency

L09: Cache Name: ID: Question: Direct Mapping Cache Hit Rate Consider a 4-block empty Cache,

Doppelgnger: A Cache for Approximate Computing Joshua San Miguel Jorge Albericio Andreas

Cache Systems CPU Main Main CPU Memory Memory 400MHz 10MHz Cache 10MHz Memory Hierarchy

Lecture 21: Memory Hierarchy Todays topics: Cache organization Cache hits/misses 1

General Cache Mechanics CPU Block: unit of data in cache and memory. (a.k.a. line) Memory

Generations of Cache 1980: no cache in proc; 1989 first Intel proc with a cache on chip.

Cache Memory Chapter 17 S. Dandamudi Outline Introduction Types of cache misses

Cache Performance Associativity Replacement Samira Khan Cache Performance March 28,

Cache Memory Chapter 17 S. Dandamudi Outline Introduction Types of cache misses

Caches Electronic Computers M Caches 1 Cache LOCALITY PRINCIPLE (SPATIAL AND TEMPORAL)

Plan Hierarchical memories and their impact on our programs 1 Cache Memories, Cache Complexity

Code Similarity via Natural Language Descriptions Meital Ben Sinai & Eran Yahav Technion

Hang Zhao Massachusetts Institute of Technology Last years challenge at ECCV ADE Dataset New

3 rd Data Prefetching Championship June 23 rd , 2019 Held in conjunction with ISCA 2019 Seth

CSEP504: Advanced topics in software systems Tonight India trip report Apologies and

Probabilistic Programming Frank Wood frank@invrea.com

Make your code count Quantum simulations and collaborative code QuTiP: Shahnawaz Ahmed The

Whole-body Compliant Dynamical Contacts for Humanoids: the CoDyCo project (FP7 EU project No.

Nested Interpolants Matthias Heizmann Jochen Hoenicke Andreas Podelski University of Freiburg,

Criticality Aware Tiered Cache Hierarchy (CATCH) Anant Nori*, Jayesh - PowerPoint PPT Presentation

Criticality Aware Tiered Cache Hierarchy (CATCH) Anant Nori*, Jayesh Gaur*, Siddharth Rai # , Sreenivas Subramoney*, Hong Wang* * Microarchitecture Research Lab, Intel # Indian Institute of Technology Kanpur, India Popular Three Level Cache

Memory Hierarchy: Cache Memory hierarchy Cache basics Locality Cache organization Cache-aware

What Is Memory Hierarchy A typical memory hierarchy today: Lecture 13: Cache Basics and Cache

1 Classifying cache misses Cache Organization Classifying misses by causes (3Cs) Cache size,

Balancing Fairness and Efficiency in Tiered Storage Systems with Bottleneck-Aware Allocation Hui

Web Cache Consistency Web Cache Consistency Web Cache Consistency Web Cache Consistency

L09: Cache Name: ID: Question: Direct Mapping Cache Hit Rate Consider a 4-block empty Cache,

Doppelgnger: A Cache for Approximate Computing Joshua San Miguel Jorge Albericio Andreas

Cache Systems CPU Main Main CPU Memory Memory 400MHz 10MHz Cache 10MHz Memory Hierarchy

Lecture 21: Memory Hierarchy Todays topics: Cache organization Cache hits/misses 1

General Cache Mechanics CPU Block: unit of data in cache and memory. (a.k.a. line) Memory

Generations of Cache 1980: no cache in proc; 1989 first Intel proc with a cache on chip.

Cache Memory Chapter 17 S. Dandamudi Outline Introduction Types of cache misses

Cache Performance Associativity Replacement Samira Khan Cache Performance March 28,

Cache Memory Chapter 17 S. Dandamudi Outline Introduction Types of cache misses

Caches Electronic Computers M Caches 1 Cache LOCALITY PRINCIPLE (SPATIAL AND TEMPORAL)

Plan Hierarchical memories and their impact on our programs 1 Cache Memories, Cache Complexity

Code Similarity via Natural Language Descriptions Meital Ben Sinai &amp; Eran Yahav Technion

Hang Zhao Massachusetts Institute of Technology Last years challenge at ECCV ADE Dataset New

3 rd Data Prefetching Championship June 23 rd , 2019 Held in conjunction with ISCA 2019 Seth

CSEP504: Advanced topics in software systems Tonight India trip report Apologies and

Probabilistic Programming Frank Wood frank@invrea.com

Make your code count Quantum simulations and collaborative code QuTiP: Shahnawaz Ahmed The

Whole-body Compliant Dynamical Contacts for Humanoids: the CoDyCo project (FP7 EU project No.

Nested Interpolants Matthias Heizmann Jochen Hoenicke Andreas Podelski University of Freiburg,

Criticality Aware Tiered Cache Hierarchy (CATCH) Anant Nori, Jayesh Gaur, Siddharth Rai # , Sreenivas Subramoney, Hong Wang * Microarchitecture Research Lab, Intel # Indian Institute of Technology Kanpur, India Popular Three Level Cache

Code Similarity via Natural Language Descriptions Meital Ben Sinai & Eran Yahav Technion