
SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding
Daniel Sanchez and Christos Kozyrakis, Stanford University
HPCA-18, February 27th, 2012

Executive Summary
- Directories are hard to scale and degrade performance
- SCD: a scalable directory with performance guarantees
  - Flexible sharer set encoding: lines with few sharers use one entry, widely shared lines use multiple entries -> scalability
  - Use ZCache -> efficient high associativity, analytical models
    - Negligible invalidations with minimal overprovisioning (~10%)
- At 1024 cores, SCD is 13x smaller than a sparse directory, and 2x smaller, faster, and simpler than a hierarchical directory

Outline
- Introduction
- SCD Design
- Analytical Bounds on Overprovisioning
- Evaluation

Directory-Based Coherence
[Figure: eight cores, each with a private L2, above a shared L3 with a directory, backed by main memory.]
- Scalable coherence protocols use a directory
  - Tracks contents of private caches
  - Ordering point for conflicting requests

Directory-Induced Invalidations
[Figure: Cores 0-7 with private L2s 0-7, a shared L3 with a directory, and main memory. A "ld A" on one core sends GET A to the directory; the directory sends INV B to private L2s; a later "ld B" then misses.]
- Limited associativity -> to track A, the directory must invalidate B, C, D, or E

Desirable Directory Properties
1. Scalability
   - Latency, energy, area -> constant or log(cores) growth
2. Minimal complexity
   - No changes to coherence protocol
3. Exact sharer information
4. Negligible directory-induced invalidations
   - With minimal, bounded overprovisioning

Sparse Full-Map Directories
- Associative array indexed by address
- Sharer sets encoded in a bit-vector
[Figure: a 4-way array (Way 1 to Way 4) and the directory entry format.]
  Line Address | Coherence State | Sharer Set
  0xF00        | Shared          | 0 1 0 0 1 1 0 0
- Pro: single lookup -> low latency, energy-efficient
- Con: bit-vectors grow with # cores -> area scales poorly
- Con: limited associativity -> directory-induced invalidations, overprovisioning (~2x)
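A minimal sketch of the full-map sharer bit-vector, to make the area problem concrete: the vector needs one bit per core, so its width grows linearly with the core count. The function names are hypothetical, and treating the leftmost printed bit as core 0 is an assumption of this sketch, not something the slide states.

```python
def encode_sharers(sharers, num_cores):
    """Return an integer bit-vector with one bit per core (bit i = core i)."""
    bits = 0
    for core in sharers:
        if not 0 <= core < num_cores:
            raise ValueError("core id out of range: %d" % core)
        bits |= 1 << core
    return bits


def decode_sharers(bits):
    """Return the sorted list of core ids whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def bitvector_string(bits, num_cores):
    """Render the vector with core 0 on the left (assumed ordering)."""
    return " ".join("1" if (bits >> i) & 1 else "0" for i in range(num_cores))


def sharer_set_bits(num_cores):
    """A full-map sharer set costs exactly one bit per core."""
    return num_cores


if __name__ == "__main__":
    vec = encode_sharers([1, 4, 5], num_cores=8)
    print(bitvector_string(vec, 8))
    for cores in (128, 256, 512, 1024):
        print(cores, "cores ->", sharer_set_bits(cores), "sharer bits per entry")
```

Under the assumed ordering, cores 1, 4 and 5 produce the pattern shown in the entry format above.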

Hierarchical Sparse Directories
- Multi-level hierarchy of sparse directories
[Figure: one Level-2 directory tracking L1 Dirs 0-31, above 32 Level-1 directories covering Cores 0-31, Cores 32-63, ..., Cores 992-1023.]
- Pro: small bit-vectors -> scalable area & energy
- Con: multiple lookups in critical path -> additional latency
- Con: needs hierarchical coherence protocol -> more complexity
- Con: directory-induced invalidations more expensive

Single-Level Dirs with Inexact Sharer Sets
- Coarse-grain bit-vectors (e.g., 1 bit for every 4 cores)
- Limited pointers: maintain a few sharer pointers, invalidate or broadcast on overflow
- Tagless [MICRO 09]: encode sharers with Bloom filters
- SPACE [PACT 10]: de-duplicate sharing patterns
- Pro: reduced area & energy overheads
- Con: overheads still not scalable
- Con: inexact sharers -> broadcasts, invalidations or spurious lookups
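To illustrate why an inexact encoding causes spurious traffic, here is a small sketch of a coarse-grain bit-vector with 1 bit for every 4 cores, as in the first bullet. The helper names and the mapping of core ids to bits (core // 4) are assumptions made for the example.

```python
CORES_PER_BIT = 4  # granularity taken from the slide's example


def add_sharer(vector, core):
    """Set the bit covering the group of cores that contains `core`."""
    return vector | (1 << (core // CORES_PER_BIT))


def invalidation_targets(vector, num_cores):
    """Every core whose group bit is set must be sent an invalidation."""
    return [c for c in range(num_cores) if (vector >> (c // CORES_PER_BIT)) & 1]


if __name__ == "__main__":
    vec = add_sharer(0, 5)          # only core 5 actually shares the line
    print(invalidation_targets(vec, 16))
```

Only core 5 holds the line, yet cores 4 to 7 all receive an invalidation, because the vector cannot tell them apart.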

Efficient Highly-Associative Caches
- ZCache [MICRO 10]: high-associativity cache with few ways
  - Draws from skew-associativity and Cuckoo hashing
  - Hits take a single lookup
  - On a miss, the replacement process provides many candidates
  - Provides cheap high associativity (e.g., 64-way associativity with 4 ways)
  - Described by simple & accurate analytical models
[Figure: the line address is hashed by H1, H2, H3 to index Way1, Way2, Way3.]
- Cuckoo Directory [Ferdman et al., HPCA 11]:
  - Applies Cuckoo hashing to sparse directories
  - Empirically shows that smaller overprovisioning (~25%) eliminates most invalidations

Scalable Coherence Directory: Insights
- Use ZCache -> cheap high associativity
- Analytical models -> bounds on overprovisioning
  - Negligible difference with ideal directory regardless of workload
  - Validated in simulation
- Provision space per tracked sharer, not per line
  - Flexible sharer set encoding: lines with few sharers use a single entry, widely shared lines use additional entries

SCD Array
- ZCache array indexed by (Line Address, Entry Number)
  - Allows multiple entries per line
[Figure: (Line Address, Entry Number) is hashed by H1, H2, H3 to index Way1, Way2, Way3.]
- Insertions walk the array until an unused entry is found, or a limit of candidates (R) is reached, then invalidate one
  - Could use a replacement policy to decide the victim
  - Evictions are negligible -> no need for a replacement policy
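The insertion walk can be sketched as a breadth-first search over relocation candidates. This is a simplified software model, not the hardware design: the class name, the BLAKE2-based hash functions, the choice of the last examined candidate as victim, and the array sizes are all assumptions made for illustration.

```python
import hashlib


class SCDArraySketch:
    """Toy multi-way hashed array keyed by (line address, entry number)."""

    def __init__(self, ways=3, sets_per_way=16, max_candidates=16):
        self.ways = ways
        self.sets = sets_per_way
        self.max_candidates = max_candidates  # the limit R on the slide
        self.array = [[None] * sets_per_way for _ in range(ways)]

    def _index(self, way, key):
        # One hash function per way (assumed; any independent hashes work).
        data = repr((way, key)).encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.sets

    def lookup(self, key):
        # A hit needs one probe per way, all of which can proceed in parallel.
        for way in range(self.ways):
            slot = self.array[way][self._index(way, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def insert(self, key, value):
        """Insert (key, value); return the evicted key, or None."""
        for way in range(self.ways):
            idx = self._index(way, key)
            slot = self.array[way][idx]
            if slot is not None and slot[0] == key:
                self.array[way][idx] = (key, value)  # update in place
                return None
        # Candidates are (way, index, parent candidate); roots have no parent.
        cands = [(w, self._index(w, key), None) for w in range(self.ways)]
        seen = {(w, i) for w, i, _ in cands}
        target, n = None, 0
        while n < len(cands) and n < self.max_candidates:
            way, idx, _ = cands[n]
            occupant = self.array[way][idx]
            if occupant is None:
                target = n  # unused entry found, stop walking
                break
            # The occupant could move to its position in each other way.
            for other in range(self.ways):
                pos = (other, self._index(other, occupant[0]))
                if other != way and pos not in seen:
                    seen.add(pos)
                    cands.append((pos[0], pos[1], n))
            n += 1
        evicted = None
        if target is None:
            target = n - 1  # limit reached: evict the last examined candidate
            way, idx, _ = cands[target]
            evicted = self.array[way][idx][0]
        # Relocate entries along the path from the target back to a root.
        node = target
        while cands[node][2] is not None:
            parent = cands[node][2]
            way, idx, _ = cands[node]
            pway, pidx, _ = cands[parent]
            self.array[way][idx] = self.array[pway][pidx]
            node = parent
        way, idx, _ = cands[node]
        self.array[way][idx] = (key, value)
        return evicted
```

A usage sketch: `SCDArraySketch().insert((0x5CA1AB1E, 0), "root")` stores an entry for entry number 0 of a line, and `lookup((0x5CA1AB1E, 0))` retrieves it. Because the entry number is part of the key, several entries for the same line land in unrelated positions.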

SCD Entry Formats
- Example: 1024 sharers
- Every entry holds Line Address (44b), Type (2b), and a 37b type-specific payload:
  Format           | Type | Payload (37b)
  INVALID          | 00   | Unused (37b)
  LIMITED POINTERS | 01   | Coherence State (5b), #ptrs (2b), 3x 10-bit sharer pointers (30b)
  ROOT BIT-VECTOR  | 10   | Coherence State (5b), Root bit-vector (32b)
  LEAF BIT-VECTOR  | 11   | Leaf number (5b), Leaf bit-vector (32b)
- Lines with one or few sharers use a limited pointer entry
- Lines with >3 sharers use root + leaves bit-vector entries
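As a rough illustration of the limited-pointers format, the sketch below packs a 5-bit coherence state, a 2-bit pointer count and three 10-bit sharer pointers into a 37-bit payload. The field widths come from the slide; the order of the fields inside the payload and the function names are assumptions.

```python
STATE_BITS = 5
COUNT_BITS = 2
PTR_BITS = 10
MAX_PTRS = 3
PAYLOAD_BITS = STATE_BITS + COUNT_BITS + MAX_PTRS * PTR_BITS  # 37


def pack_limited_pointers(state, pointers):
    """Pack state, pointer count and up to three pointers into an integer."""
    if not 0 <= state < (1 << STATE_BITS):
        raise ValueError("state does not fit in 5 bits")
    if len(pointers) > MAX_PTRS:
        raise ValueError("more than 3 sharers need the multi-entry format")
    payload = (state << COUNT_BITS) | len(pointers)
    for k in range(MAX_PTRS):
        ptr = pointers[k] if k < len(pointers) else 0  # unused slots are zero
        if not 0 <= ptr < (1 << PTR_BITS):
            raise ValueError("sharer id does not fit in 10 bits")
        payload = (payload << PTR_BITS) | ptr
    return payload


def unpack_limited_pointers(payload):
    """Inverse of pack_limited_pointers; returns (state, pointers)."""
    pointers = []
    for _ in range(MAX_PTRS):
        pointers.append(payload & ((1 << PTR_BITS) - 1))
        payload >>= PTR_BITS
    pointers.reverse()
    count = payload & ((1 << COUNT_BITS) - 1)
    state = payload >> COUNT_BITS
    return state, pointers[:count]
```

With ten bits per pointer, each pointer can name any of 1024 sharers, which is why three pointers plus state fit in the same 37 bits as a 32-bit vector plus a 5-bit state or leaf number.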

Example: Adding a Sharer
Initial entry (LIMITED POINTERS):
  0x5CA1AB1E | 01 | S | 3 | 37 | 265 | 267
Add sharer 64 to address 0x5CA1AB1E:
1. Lookup (0x5CA1AB1E, 0): all pointers are used -> switch to multi-entry format
2. Allocate entries (0x5CA1AB1E, leafNum+1) with leafNum=1,2,8
3. Write leaf bit-vectors
4. Write (0x5CA1AB1E, 0) as a root bit-vector
Resulting entries:
  0x5CA1AB1E | 10 | S | 01100000 10000000 0…0 0…0   (ROOT)
  0x5CA1AB1E | 11 | 1 | 00000010 00000000 0…0 0…0   (LEAF)
  0x5CA1AB1E | 11 | 2 | 10000000 00000000 0…0 0…0   (LEAF)
  0x5CA1AB1E | 11 | 8 | 00000000 10100000 0…0 0…0   (LEAF)
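The conversion in this example can be sketched as follows, assuming 32-bit leaves so that a sharer's leaf number is its id divided by 32. The function name and the bit order inside each vector (bit k of a leaf covers sharer 32*leafNum + k) are assumptions of the sketch, so it reproduces the leaf numbers of the example rather than the drawn bit patterns.

```python
LEAF_BITS = 32  # 1024 sharers = 32 leaves of 32 bits each


def to_multi_entry(line_addr, pointers, new_sharer):
    """Build the root and leaf entries that replace a full limited-pointer entry.

    Returns a dict keyed by (line address, entry number).
    """
    sharers = sorted(set(pointers) | {new_sharer})
    root = 0
    leaves = {}
    for sharer in sharers:
        leaf_num, offset = divmod(sharer, LEAF_BITS)
        root |= 1 << leaf_num                      # mark the leaf as present
        leaves[leaf_num] = leaves.get(leaf_num, 0) | (1 << offset)
    entries = {(line_addr, 0): ("ROOT", root)}     # entry 0 is the root
    for leaf_num, bits in leaves.items():
        entries[(line_addr, leaf_num + 1)] = ("LEAF", leaf_num, bits)
    return entries


if __name__ == "__main__":
    result = to_multi_entry(0x5CA1AB1E, [37, 265, 267], 64)
    for (addr, entry_num), entry in sorted(result.items()):
        print(hex(addr), entry_num, entry)
```

Sharers 37, 64, 265 and 267 fall into leaves 1, 2, 8 and 8, so the line ends up with one root entry and three leaf entries at entry numbers 2, 3 and 9.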

SCD & Desirable Properties
1. Scalability
   - Flexible sharer set encoding -> scalable energy and area
   - Coherence state stored in a single entry -> most operations have 1 lookup on critical path -> scalable latency
2. Minimal complexity
   - All entries in the same array -> no coherence protocol changes
3. Exact sharer information
4. Negligible directory-induced invalidations
   - With minimal, bounded overprovisioning?

Analytical Models
- Directories built with ZCache arrays can be characterized with simple, workload-independent analytical models
- Parameters:
  - W: ways
  - R: replacement candidates
  - occ: occupancy (fraction of used entries)
- Fraction of insertions that cause a directory invalidation (determines performance impact, interference):
  $P_{inv} = occ^{R}$
- Average lookups per replacement (determines replacement latency and energy):
  $AvgLookups = \frac{1 - occ^{R}}{W \, (1 - occ)}$
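A small sketch that evaluates the two models, reading them as P_inv = occ^R and AvgLookups = (1 - occ^R) / (W (1 - occ)). That reading of the slide's formulas, the function names and the handling of occ = 1 are assumptions of this example.

```python
def invalidation_probability(occ, R):
    """Fraction of insertions that cause a directory invalidation."""
    if not 0.0 <= occ <= 1.0:
        raise ValueError("occupancy must be in [0, 1]")
    return occ ** R


def average_lookups(occ, R, W):
    """Average lookups per replacement for a W-way array with R candidates."""
    if not 0.0 <= occ <= 1.0:
        raise ValueError("occupancy must be in [0, 1]")
    if occ == 1.0:
        return R / W  # limit of the expression as occ approaches 1
    return (1.0 - occ ** R) / (W * (1.0 - occ))


if __name__ == "__main__":
    for occ in (0.5, 0.8, 0.9, 0.95):
        print(occ, invalidation_probability(occ, R=64), average_lookups(occ, R=64, W=4))
```

Both quantities depend only on W, R and occupancy, which is what makes the models workload-independent.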

Bounding Invalidations
- SCD bounds invalidations with minimal overprovisioning
  - Bounded worst-case behavior independent of workload
- For $P_{inv} = 10^{-3}$: W=4, R=64, 11% overprovisioning
  - Max directory occupancy 90%
- Overprovisioning is:
  - Smaller than previous empirical results (25%-2x)
  - Bounded -> strict guarantees, no design-time uncertainty
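The provisioning numbers on this slide can be checked by inverting P_inv = occ^R for the occupancy. The sketch below assumes that overprovisioning means the extra entries needed, relative to the tracked entries, to keep occupancy at or below that maximum; the function names are hypothetical.

```python
def max_occupancy(p_inv, R):
    """Largest occupancy for which occ**R stays at or below p_inv."""
    if not 0.0 < p_inv <= 1.0:
        raise ValueError("p_inv must be in (0, 1]")
    return p_inv ** (1.0 / R)


def overprovisioning(p_inv, R):
    """Extra capacity, as a fraction of tracked entries (assumed definition)."""
    return 1.0 / max_occupancy(p_inv, R) - 1.0


if __name__ == "__main__":
    occ = max_occupancy(1e-3, R=64)
    extra = overprovisioning(1e-3, R=64)
    print("max occupancy: %.1f%%" % (100 * occ))
    print("overprovisioning: %.1f%%" % (100 * extra))
```

For P_inv = 10^-3 and R = 64 this gives a maximum occupancy just under 90% and roughly 11% overprovisioning, in line with the figures quoted above.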

Methodology
[Figure: 64-tile CMP (1024 cores); each 16-core tile contains cores, a router, an L3 bank and directory logic.]
- Simulated system: 1024-core tiled CMP
  - In-order cores with split L1s
  - Private inclusive L2s, 128KB/core
  - Shared non-inclusive L3, 256MB
  - MESI directory protocol
- Directory implementations: Sparse, 2-level Hierarchical, SCD
  - Directories 100%-provisioned for L2s
  - All directories use ZCache arrays -> negligible invalidations
- 14 workloads from PARSEC, SPLASH2, SPECOMP/JBB, BioParallel suites

Area
  Cores | Sparse | Hierarchical | SCD   | Sparse/SCD | Hier/SCD
  128   | 34.2%  | 21.1%        | 10.9% | 3.12x      | 1.93x
  256   | 59.2%  | 24.2%        | 12.5% | 4.73x      | 1.94x
  512   | 109.2% | 27.0%        | 13.9% | 7.87x      | 1.95x
  1024  | 209.2% | 30.9%        | 15.8% | 13.22x     | 1.95x
- Area given as a percentage of L2 caches
- At 1024 cores, SCD is:
  - 13x smaller than Sparse
  - 2x smaller than Hierarchical
  - Takes ~3% of total die area

Performance
[Figure: slowdown over Ideal Directory (%), 0 to 12, for Hierarchical, Sparse and SCD on bscholes, applu, jbb, ocean, svm, canneal.]
- Hierarchical up to 10% slower than Ideal
- Sparse has Ideal-like performance, but too expensive
- SCD as fast as Ideal & Sparse, cheapest

Energy Efficiency
- Directory energy = Accesses * Energy/access
[Figure: SCD array accesses over Sparse (%), 0 to 25, on bscholes, applu, jbb, ocean, svm, canneal; one off-scale bar is labeled 97%.]
- SCD performs slightly more accesses (lookups, writes) than Sparse
  - Some operations require multiple lookups
  - SCD has higher occupancy, replacements take longer
- SCD energy/access is smaller (narrow entries)

Analytical Models
- Empirical results on invalidations match analytical models
  - Bounds worst-case invalidations with minimal overprovisioning
  - Can provision directory using simple formulas
- Set-associative arrays do not meet analytical models
  - Need significant overprovisioning (~2x), no bounds
- Similar results for Sparse & Hierarchical
