
SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding

Daniel Sanchez and Christos Kozyrakis, Stanford University. HPCA-18, February 27th, 2012.


  1. SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding. Daniel Sanchez and Christos Kozyrakis, Stanford University. HPCA-18, February 27th, 2012.

  2. Executive Summary
     - Directories are hard to scale and degrade performance
     - SCD: a scalable directory with performance guarantees
     - Flexible sharer set encoding: lines with few sharers use one entry, widely shared lines use multiple entries. This provides scalability.
     - Use ZCache: efficient high associativity and analytical models. This gives negligible invalidations with minimal overprovisioning (~10%).
     - At 1024 cores, SCD is 13x smaller than a sparse directory, and 2x smaller, faster, and simpler than a hierarchical directory

  3. Outline
     - Introduction
     - SCD Design
     - Analytical Bounds on Overprovisioning
     - Evaluation

  4. Directory-Based Coherence
     [Figure: eight cores, each with a private L2, sharing an L3 with a directory, backed by main memory]
     - Scalable coherence protocols use a directory
     - Tracks the contents of private caches
     - Ordering point for conflicting requests

  5. Directory-Induced Invalidations
     [Figure: the same eight-core system (Cores 0-7, L2 0-7). A ld A sends GET A to the directory, the directory sends INV B to private L2s, and a later ld B misses.]
     - Limited associativity: to track A, the directory must invalidate B, C, D, or E

  6. Desirable Directory Properties
     1. Scalability: latency, energy, and area with constant or log(cores) growth
     2. Minimal complexity: no changes to the coherence protocol
     3. Exact sharer information
     4. Negligible directory-induced invalidations, with minimal, bounded overprovisioning

  7. Sparse Full-Map Directories
     - Associative array indexed by address
     - Sharer sets encoded in a bit-vector
     - Directory entry format (array drawn with Way 1 to Way 4): Line Address | Coherence State | Sharer Set, e.g. 0xF00 | Shared | 0 1 0 0 1 1 0 0
     - Single lookup: low latency, energy-efficient
     - Bit-vectors grow with the number of cores: area scales poorly
     - Limited associativity: directory-induced invalidations, overprovisioning (~2x)
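As a minimal sketch of the full-map encoding above: the snippet decodes the example sharer set and shows how the sharer field widens with core count. The function names are hypothetical, and the bit order (leftmost bit is core 0) is an assumption; the slide does not state it.

```python
def sharers_from_bitvector(bits):
    """Return the core IDs whose bit is set.

    Assumes position i in the list corresponds to core i (not stated on the slide).
    """
    return [core for core, bit in enumerate(bits) if bit]


def full_map_sharer_bits(num_cores):
    """A full-map entry needs one sharer bit per core."""
    return num_cores


# The example entry from the slide: 0xF00, Shared, 0 1 0 0 1 1 0 0
example = [0, 1, 0, 0, 1, 1, 0, 0]
print(sharers_from_bitvector(example))

# The sharer field alone grows linearly with the number of cores.
for cores in (128, 256, 512, 1024):
    print(cores, "cores ->", full_map_sharer_bits(cores), "sharer bits per entry")
```

Under that bit-order assumption the example entry tracks three sharers, and at 1024 cores every entry carries a 1024-bit sharer field regardless of how many caches actually share the line.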

  8. Hierarchical Sparse Directories
     - Multi-level hierarchy of sparse directories
     [Figure: a level-2 directory tracking L1 Dirs 0-31, above 32 level-1 directories covering Cores 0-31, Cores 32-63, …, Cores 992-1023]
     - Small bit-vectors: scalable area and energy
     - Multiple lookups in the critical path: additional latency
     - Needs a hierarchical coherence protocol: more complexity
     - Directory-induced invalidations are more expensive

  9. Single-Level Dirs with Inexact Sharer Sets
     - Coarse-grain bit-vectors (e.g., 1 bit for every 4 cores)
     - Limited pointers: maintain a few sharer pointers, invalidate or broadcast on overflow
     - Tagless [MICRO 09]: encode sharers with Bloom filters
     - SPACE [PACT 10]: de-duplicate sharing patterns
     - Reduced area and energy overheads
     - Overheads still not scalable
     - Inexact sharers: broadcasts, invalidations, or spurious lookups

  10. Efficient Highly-Associative Caches
      - ZCache [MICRO 10]: high-associativity cache with few ways
        - Draws from skew-associativity and Cuckoo hashing
        - Hits take a single lookup
        - On a miss, the replacement process provides many candidates
        - Provides cheap high associativity (e.g., 64-way associativity with 4 ways)
        - Described by simple and accurate analytical models
      [Figure: the line address is hashed by H1, H2, and H3 to index Way1, Way2, and Way3]
      - Cuckoo Directory [Ferdman et al., HPCA 11]:
        - Applies Cuckoo hashing to sparse directories
        - Empirically shows that smaller overprovisioning (~25%) eliminates most invalidations

  11. Outline
      - Introduction
      - SCD Design
      - Analytical Bounds on Overprovisioning
      - Evaluation

  12. Scalable Coherence Directory: Insights
      - Use ZCache: cheap high associativity
      - Analytical models: bounds on overprovisioning
        - Negligible difference from an ideal directory, regardless of workload
        - Validated in simulation
      - Provision space per tracked sharer, not per line
        - Flexible sharer set encoding: lines with few sharers use a single entry, widely shared lines use additional entries

  13. SCD Array
      - ZCache array indexed by (Line Address, Entry Number), which allows multiple entries per line
      [Figure: (Line Address, Entry Number) is hashed by H1, H2, and H3 to index Way1, Way2, and Way3]
      - Insertions walk the array until an unused entry is found or a limit of candidates (R) is reached, then invalidate one
      - Could use a replacement policy to decide the victim
      - Evictions are negligible, so there is no need for a replacement policy

  14. SCD Entry Formats
      - Example: 1024 sharers. Each entry holds a line address (44b), a type field (2b), and a 37b payload:

        Type  Format            Payload (37b)
        00    INVALID           Unused (37b)
        01    LIMITED POINTERS  Coherence state (5b), #ptrs (2b), 3x 10-bit sharer pointers (30b)
        10    ROOT BIT-VECTOR   Coherence state (5b), root bit-vector (32b)
        11    LEAF BIT-VECTOR   Leaf number (5b), leaf bit-vector (32b)

      - Lines with one or few sharers use a limited pointer entry
      - Lines with >3 sharers use root + leaves bit-vector entries
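The field widths on this slide can be checked with a little arithmetic. The sketch below assumes the 37b figure is the per-entry payload shared by all formats, and that a root bit and a leaf each cover 32 sharers; the variable names are illustrative only.

```python
NUM_SHARERS = 1024
STATE_BITS = 5      # coherence state field
COUNT_BITS = 2      # "#ptrs" field
NUM_POINTERS = 3
ROOT_BITS = 32      # root bit-vector width
LEAF_BITS = 32      # leaf bit-vector width

# Bits needed to name one of 1024 sharers, and one of 32 leaves.
pointer_bits = (NUM_SHARERS - 1).bit_length()
leaf_number_bits = (ROOT_BITS - 1).bit_length()

payload_bits = {
    "LIMITED POINTERS": STATE_BITS + COUNT_BITS + NUM_POINTERS * pointer_bits,
    "ROOT BIT-VECTOR": STATE_BITS + ROOT_BITS,
    "LEAF BIT-VECTOR": leaf_number_bits + LEAF_BITS,
}

# A root with one bit per leaf, each leaf covering LEAF_BITS sharers.
covered_sharers = ROOT_BITS * LEAF_BITS

for name, bits in payload_bits.items():
    print(f"{name}: {bits} payload bits")
print("sharers covered by root + leaves:", covered_sharers)
```

All three non-invalid formats come out to the same payload width, which is what lets them share one array, and the two-level bit-vector covers exactly the 1024 sharers of the example.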

  15. Example: Adding a Sharer
      Initial entry (LIMITED POINTERS): 0x5CA1AB1E | 01 | S | 3 | 37 | 265 | 267
      Add sharer 64 to address 0x5CA1AB1E:
      1. Lookup (0x5CA1AB1E, 0): all pointers are used, so switch to the multi-entry format
      2. Allocate entries (0x5CA1AB1E, leafNum+1) with leafNum = 1, 2, 8
      3. Write leaf bit-vectors
      4. Write (0x5CA1AB1E, 0) as a root bit-vector
      Resulting entries:
        0x5CA1AB1E | 10 | S | 01100000 10000000 0…0 0…0   (ROOT)
        0x5CA1AB1E | 11 | 1 | 00000010 00000000 0…0 0…0   (LEAF)
        0x5CA1AB1E | 11 | 2 | 10000000 00000000 0…0 0…0   (LEAF)
        0x5CA1AB1E | 11 | 8 | 00000000 10100000 0…0 0…0   (LEAF)
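The conversion in this example can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `encode_sharers` is a hypothetical helper, leaf `n` is assumed to cover sharers `32n` to `32n+31`, and the bit position of a sharer inside a leaf is an assumed layout that may differ from the drawing on the slide.

```python
LEAF_BITS = 32      # sharers covered by one leaf bit-vector
MAX_POINTERS = 3    # pointers available in a limited-pointer entry


def encode_sharers(sharers):
    """Map a sharer set to directory entries, keyed by entry number.

    Entry 0 is either the limited-pointer entry or the root bit-vector;
    leaf number n is stored at entry number n + 1, as in the slide.
    """
    sharers = sorted(set(sharers))
    if len(sharers) <= MAX_POINTERS:
        return {0: ("LIMITED_POINTERS", sharers)}

    # Group sharers by leaf and set one bit per sharer (assumed bit layout).
    leaves = {}
    for sharer in sharers:
        leaf_num, offset = divmod(sharer, LEAF_BITS)
        leaves[leaf_num] = leaves.get(leaf_num, 0) | (1 << offset)

    # The root bit-vector marks which leaves exist.
    root = 0
    for leaf_num in leaves:
        root |= 1 << leaf_num

    entries = {0: ("ROOT", root)}
    for leaf_num, bits in leaves.items():
        entries[leaf_num + 1] = ("LEAF", leaf_num, bits)
    return entries


before = encode_sharers([37, 265, 267])
after = encode_sharers([37, 265, 267, 64])
print(before)
print(sorted(after))  # entry numbers in use after adding sharer 64
```

With sharers 37, 265, 267, and 64, the sketch allocates leaves 1, 2, and 8, matching the leaf numbers in the slide, plus the root at entry 0.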

  16. SCD & Desirable Properties
      1. Scalability
         - Flexible sharer set encoding: scalable energy and area
         - Coherence state stored in a single entry: most operations have 1 lookup on the critical path, giving scalable latency
      2. Minimal complexity
         - All entries in the same array
         - No coherence protocol changes
      3. Exact sharer information
      4. Negligible directory-induced invalidations, with minimal, bounded overprovisioning (?)

  17. Outline
      - Introduction
      - SCD Design
      - Analytical Bounds on Overprovisioning
      - Evaluation

  18. Analytical Models
      - Directories built with ZCache arrays can be characterized with simple, workload-independent analytical models
      - Parameters: W = ways, R = replacement candidates, occ = occupancy (fraction of used entries)
      - Fraction of insertions that cause a directory invalidation: $P_{inv} = occ^{R}$. Determines performance impact and interference.
      - Average lookups per replacement: $AvgLookups = \frac{1 - occ^{R}}{W \, (1 - occ)}$. Determines replacement latency and energy.
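One way to see where expressions of this shape can come from is a short geometric-series argument. It rests on two assumptions that the slide does not spell out: each replacement candidate is in use independently with probability occ, and each lookup yields W candidates. Under those assumptions, with W, R, and occ as defined on the slide:

```latex
\begin{align*}
P_{inv} &= \Pr[\text{all } R \text{ candidates are in use}] = occ^{R} \\
E[\text{candidates examined}] &= \sum_{k=0}^{R-1} occ^{k} = \frac{1 - occ^{R}}{1 - occ} \\
AvgLookups &= \frac{1}{W} \cdot \frac{1 - occ^{R}}{1 - occ}
\end{align*}
```

The middle line uses the fact that candidate k+1 is examined only if the first k candidates were all in use, which happens with probability occ^k.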

  19. Bounding Invalidations
      - SCD bounds invalidations with minimal overprovisioning
      - Bounded worst-case behavior, independent of workload
      - For $P_{inv} = 10^{-3}$: W=4, R=64, 11% overprovisioning (max directory occupancy 90%)
      - Overprovisioning is:
        - Smaller than previous empirical results (25%-2x)
        - Bounded: strict guarantees, no design-time uncertainty
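The numbers on this slide can be reproduced with a short calculation. The sketch assumes the invalidation model is P_inv = occ^R (the reading that matches the slide's figures) and defines overprovisioning as the extra capacity needed so that a fully loaded directory sits at the maximum occupancy; the function names are hypothetical.

```python
def max_occupancy(p_inv, r):
    """Largest occupancy for which occ**r stays at or below p_inv."""
    return p_inv ** (1.0 / r)


def overprovisioning(p_inv, r):
    """Extra entries, as a fraction of the tracked lines, needed to stay at
    or below the maximum occupancy (assumed definition)."""
    return 1.0 / max_occupancy(p_inv, r) - 1.0


occ = max_occupancy(1e-3, 64)
extra = overprovisioning(1e-3, 64)
print(f"max occupancy: {occ:.1%}, overprovisioning: {extra:.1%}")

# Fewer replacement candidates need more headroom for the same bound.
for r in (16, 32, 64):
    print(r, f"{overprovisioning(1e-3, r):.1%}")
```

For R=64 and a target of one invalidation per thousand insertions, this gives roughly 90% occupancy and 11% overprovisioning, in line with the slide.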

  20. Outline
      - Introduction
      - SCD Design
      - Analytical Bounds on Overprovisioning
      - Evaluation

  21. Methodology
      - Simulated system: 1024-core tiled CMP
        - In-order cores with split L1s
        - Private inclusive L2s, 128KB/core
        - Shared non-inclusive L3, 256MB
        - MESI directory protocol
      - Directory implementations: Sparse, 2-level Hierarchical, SCD
        - Directories 100%-provisioned for L2s
        - All directories use ZCache arrays, giving negligible invalidations
      - 14 workloads from the PARSEC, SPLASH2, SPECOMP/JBB, and BioParallel suites
      [Figure: 64-tile CMP (1024 cores) and a 16-core tile; labels: Core, Router, L3 Bank, Dir Ctrl, Mem]

  22. Area

      Cores  Sparse  Hierarchical  SCD    Sparse/SCD  Hier/SCD
      128    34.2%   21.1%         10.9%  3.12x       1.93x
      256    59.2%   24.2%         12.5%  4.73x       1.94x
      512    109.2%  27.0%         13.9%  7.87x       1.95x
      1024   209.2%  30.9%         15.8%  13.22x      1.95x

      - Area is given as a percentage of the L2 caches
      - At 1024 cores, SCD is:
        - 13x smaller than Sparse
        - 2x smaller than Hierarchical
        - ~3% of total die area

  23. Performance
      [Figure: slowdown over Ideal Directory (%), axis 0-12, for Hierarchical, Sparse, and SCD on bscholes, applu, jbb, ocean, svm, and canneal]
      - Hierarchical is up to 10% slower than Ideal
      - Sparse has Ideal-like performance, but is too expensive
      - SCD is as fast as Ideal and Sparse, and is the cheapest

  24. Energy Efficiency
      - Directory energy = Accesses * Energy/access
      [Figure: SCD array accesses over Sparse (%), axis 0-25 plus a 97% label, on bscholes, applu, jbb, ocean, svm, and canneal]
      - SCD performs slightly more accesses (lookups, writes) than Sparse
        - Some operations require multiple lookups
        - SCD has higher occupancy, so replacements take longer
      - SCD energy/access is smaller (narrow entries)

  25. Analytical Models
      - Empirical results on invalidations match the analytical models
        - Bounds worst-case invalidations with minimal overprovisioning
        - Can provision the directory using simple formulas
      - Set-associative arrays do not meet the analytical models
        - Need significant overprovisioning (~2x), no bounds
      - Similar results for Sparse and Hierarchical
