SCD: A SCALABLE COHERENCE DIRECTORY WITH FLEXIBLE SHARER SET ENCODING




slide-1
SLIDE 1

SCD: A SCALABLE COHERENCE DIRECTORY

WITH FLEXIBLE SHARER SET ENCODING

Daniel Sanchez and Christos Kozyrakis Stanford University HPCA-18, February 27th 2012

slide-2
SLIDE 2

Executive Summary

• Directories are hard to scale and degrade performance
• SCD: A scalable directory with performance guarantees
  • Flexible sharer set encoding: lines with few sharers use one entry, widely shared lines use multiple entries → Scalability
  • Use ZCache → Efficient high associativity, analytical models
  • Negligible invalidations with minimal overprovisioning (~10%)
• At 1024 cores, SCD is 13x smaller than a sparse directory, and 2x smaller, faster, and simpler than a hierarchical directory

slide-3
SLIDE 3

Outline

• Introduction
• SCD Design
• Analytical Bounds on Overprovisioning
• Evaluation

slide-4
SLIDE 4

Directory-Based Coherence

• Scalable coherence protocols use a directory
  • Tracks contents of private caches
  • Ordering point for conflicting requests

[Figure: eight cores, each with a private L2, connected to a shared L3 with its directory, backed by main memory]

slide-5
SLIDE 5

Directory-Induced Invalidations

[Figure: Cores 0-7, each with a private L2 (0-7), connected to a shared L3 with its directory, backed by main memory. Animation steps: a core issues ld A and sends GET A to the directory; the directory sends INV B to the private caches; a later ld B → MISS]

• Limited associativity → To track A, must invalidate B, C, D, or E

slide-6
SLIDE 6

Desirable Directory Properties

1. Scalability
   • Latency, energy, area
   • Constant or log(cores) growth
2. Minimal complexity
   • No changes to coherence protocol
3. Exact sharer information
4. Negligible directory-induced invalidations
   • With minimal, bounded overprovisioning

slide-7
SLIDE 7

Sparse Full-Map Directories

• Associative array indexed by address
• Sharer sets encoded in a bit-vector

Directory entry format (array with Way 1 to Way 4):

Line Address  Coherence State  Sharer Set
0xF00         Shared           1 1 1 (set bits in the bit-vector)

• Single lookup → Low latency, energy-efficient
• Bit-vectors grow with # cores → Area scales poorly
• Limited associativity → Directory-induced invalidations, overprovisioning (~2x)
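A minimal sketch of the full-map encoding, assuming one bit per core and a Python integer standing in for the hardware bit-vector. The function names are illustrative and do not come from the talk.

```python
def add_sharer(bitvec: int, core: int) -> int:
    """Set the bit for `core` in a full-map sharer bit-vector."""
    return bitvec | (1 << core)


def remove_sharer(bitvec: int, core: int) -> int:
    """Clear the bit for `core`."""
    return bitvec & ~(1 << core)


def sharers(bitvec: int) -> list[int]:
    """List the core IDs whose bits are set (the exact invalidation targets)."""
    return [core for core in range(bitvec.bit_length()) if (bitvec >> core) & 1]


vec = 0
for core in (1, 4, 6):
    vec = add_sharer(vec, core)
vec = remove_sharer(vec, 4)
print(sharers(vec))  # cores still tracked as sharers
```

The sharer set is exact, but the vector needs one bit per core, which is why the entry width grows linearly with the core count.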

slide-8
SLIDE 8

Hierarchical Sparse Directories

• Multi-level hierarchy of sparse directories

[Figure: one Level-2 directory tracking L1 Dirs 0-31; 32 Level-1 directories tracking Cores 0-31, Cores 32-63, …, Cores 992-1023]

• Small bit-vectors → Scalable area & energy
• Multiple lookups in critical path → Additional latency
• Needs hierarchical coherence protocol → More complexity
• Directory-induced invalidations more expensive

slide-9
SLIDE 9

Single-Level Dirs with Inexact Sharer Sets

• Coarse-grain bit-vectors (e.g., 1 bit for every 4 cores)
• Limited pointers: Maintain a few sharer pointers, invalidate or broadcast on overflow
• Tagless [MICRO 09]: Encode sharers with Bloom filters
• SPACE [PACT 10]: De-duplicate sharing patterns

• Reduced area & energy overheads
• Overheads still not scalable
• Inexact sharers → Broadcasts, invalidations or spurious lookups
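To illustrate why inexact sharer sets cost extra traffic, here is a small sketch of the coarse-grain case, assuming 1 bit per group of 4 cores as in the example above. The helper names are hypothetical.

```python
CORES_PER_BIT = 4  # coarse-grain factor taken from the slide's example


def coarse_encode(sharer_cores):
    """Mark the group of every sharer in a coarse bit-vector."""
    vec = 0
    for core in sharer_cores:
        vec |= 1 << (core // CORES_PER_BIT)
    return vec


def coarse_decode(vec):
    """Cores that must receive an invalidation: all cores in each marked group."""
    targets = []
    for group in range(vec.bit_length()):
        if (vec >> group) & 1:
            start = group * CORES_PER_BIT
            targets.extend(range(start, start + CORES_PER_BIT))
    return targets


true_sharers = [2, 9]
targets = coarse_decode(coarse_encode(true_sharers))
spurious = [core for core in targets if core not in true_sharers]
print(targets)   # superset of the true sharers
print(spurious)  # cores that get an unnecessary invalidation
```

With two real sharers, the directory has to contact eight cores, six of them spuriously.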

slide-10
SLIDE 10

Efficient Highly-Associative Caches

• ZCache [MICRO 10]: High-associativity cache with few ways
  • Draws from skew-associativity and Cuckoo hashing
  • Hits take a single lookup
  • On a miss, the replacement process provides many candidates
  • Provides cheap high associativity (e.g., 64-way associativity with 4 ways)
  • Described by simple & accurate analytical models
• Cuckoo Directory [Ferdman et al., HPCA 11]:
  • Apply Cuckoo hashing to sparse directories
  • Empirically show that smaller overprovisioning (~25%) eliminates most invalidations

[Figure: the line address is hashed by H1, H2, H3 to index Way1, Way2, Way3]
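A rough sketch of how a few ways can yield many replacement candidates. It assumes the usual candidate-tree expansion for this kind of array: the incoming line has W possible positions, and each occupant found there could move to W-1 positions in the other ways, level after level, with no repeated positions. The slide does not state this formula, and the function name is illustrative.

```python
def replacement_candidates(ways: int, levels: int) -> int:
    """Candidates reachable by expanding `levels` levels of the walk."""
    total, frontier = 0, ways
    for _ in range(levels):
        total += frontier       # candidates found at this level
        frontier *= ways - 1    # each one can move to the other ways
    return total


for levels in (1, 2, 3, 4):
    print(levels, replacement_candidates(4, levels))
```

Under these assumptions a 4-way array reaches 52 candidates after three levels, so a limit such as 64 candidates would be hit partway through the fourth level.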

slide-11
SLIDE 11

Outline

• Introduction
• SCD Design
• Analytical Bounds on Overprovisioning
• Evaluation

slide-12
SLIDE 12

Scalable Coherence Directory: Insights

• Use ZCache
  • Cheap high associativity
  • Analytical models → Bounds on overprovisioning
  • Negligible difference with ideal directory regardless of workload
  • Validated in simulation
• Provision space per tracked sharer, not line
  • Flexible sharer set encoding: lines with few sharers use a single entry, widely shared lines use additional entries

slide-13
SLIDE 13

SCD Array

• ZCache array indexed by (Line Address, Entry Number)
  • Allows multiple entries per line
• Insertions walk array until an unused entry is found, or a limit of candidates (R) is reached, then invalidate one
  • Could use a replacement policy to decide victim
  • Evictions are negligible → no need for replacement policy

[Figure: (Line Address, Entry Number) is hashed by H1, H2, H3 to index Way1, Way2, Way3]
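The insertion walk can be sketched in software as follows. This is a toy model under several assumptions, not the hardware design: the class name `ZArraySketch`, the use of `blake2b` as a stand-in for the per-way hash functions, the breadth-first walk order, and the choice of the last candidate as the victim are all illustrative.

```python
import hashlib


class ZArraySketch:
    """Toy model of a multi-way hashed array with a bounded insertion walk."""

    def __init__(self, ways=4, rows=64, max_candidates=64):
        self.ways, self.rows, self.max_candidates = ways, rows, max_candidates
        self.slots = [[None] * rows for _ in range(ways)]

    def _row(self, way, key):
        # One hash function per way (blake2b is an arbitrary stand-in).
        digest = hashlib.blake2b(repr((way, key)).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.rows

    def lookup(self, key):
        # A hit needs one probe per way.
        return any(self.slots[w][self._row(w, key)] == key for w in range(self.ways))

    def insert(self, key):
        """Insert `key` (assumed absent); return the invalidated key, or None."""
        # Each node is (way, row, parent index); roots are the key's own positions.
        nodes = [(w, self._row(w, key), -1) for w in range(self.ways)]
        seen = {(w, r) for w, r, _ in nodes}
        i = 0
        while i < len(nodes):
            way, row, _ = nodes[i]
            occupant = self.slots[way][row]
            if occupant is None:
                break  # unused entry found: no invalidation needed
            # The occupant could move to its position in any other way.
            for other in range(self.ways):
                pos = (other, self._row(other, occupant))
                if other != way and pos not in seen and len(nodes) < self.max_candidates:
                    seen.add(pos)
                    nodes.append((other, pos[1], i))
            i += 1
        victim = min(i, len(nodes) - 1)  # no free slot: invalidate the last candidate
        way, row, parent = nodes[victim]
        evicted = self.slots[way][row]
        # Shift entries along the path from the victim back to a root slot.
        while parent != -1:
            p_way, p_row, grand = nodes[parent]
            self.slots[way][row] = self.slots[p_way][p_row]
            way, row, parent = p_way, p_row, grand
        self.slots[way][row] = key
        return evicted


directory = ZArraySketch()
invalidated = [directory.insert((0x1000 + n, 0)) for n in range(200)]
print(sum(v is not None for v in invalidated))  # invalidations at ~78% occupancy
```

Keys are (line address, entry number) pairs, so several entries of the same line simply hash to unrelated positions in the same array.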

slide-14
SLIDE 14

SCD Entry Formats

• Lines with one or few sharers use a limited pointer entry
• Lines with >3 sharers use root + leaves bit-vector entries
• Example: 1024 sharers

Every entry holds a Line Address (44b), a Type (2b), and a 37b type-specific payload:

Type  Format            Payload (37b)
00    INVALID           Unused (37b)
01    LIMITED POINTERS  Coherence State (5b), #ptrs (2b), 3x 10-bit sharer pointers (30b)
10    ROOT BIT-VECTOR   Coherence State (5b), Root bit-vector (32b)
11    LEAF BIT-VECTOR   Leaf number (5b), Leaf bit-vector (32b)

slide-15
SLIDE 15

Example: Adding a Sharer

Add sharer 64 to address 0x5CA1AB1E:

1. Lookup (0x5CA1AB1E, 0): all pointers are used → switch to multi-entry format
2. Allocate entries (0x5CA1AB1E, leafNum+1) with leafNum = 1, 2, 8
3. Write leaf bit-vectors
4. Write (0x5CA1AB1E, 0) as a root bit-vector

Before (LIMPTRS):

Line Address  State  #ptrs  Type  Pointers
0x5CA1AB1E    S      3      01    37, 265, 267

After (ROOT + LEAF entries):

Line Address  Entry   Type  Contents
0x5CA1AB1E    ROOT    10    State S, root bit-vector 01100000 10000000 0…0 0…0
0x5CA1AB1E    LEAF 1  11    00000010 00000000 0…0 0…0
0x5CA1AB1E    LEAF 2  11    10000000 00000000 0…0 0…0
0x5CA1AB1E    LEAF 8  11    00000000 10100000 0…0 0…0
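The steps above can be sketched with a dictionary keyed by (line address, entry number) standing in for the SCD array. This is an illustrative model only: the dictionary layout, the helper names, and the bit position of a core within its leaf (core modulo 32) are assumptions, and the sketch does not model the order in which hardware writes the entries.

```python
LEAF_BITS = 32      # sharers covered by one leaf bit-vector (from the entry format)
MAX_POINTERS = 3    # limited-pointer capacity (from the entry format)


def _set_leaf_bit(directory, addr, root, core):
    """Mark `core` in its leaf entry and mark that leaf in the root."""
    leaf_num, bit = divmod(core, LEAF_BITS)
    leaf = directory.setdefault(
        (addr, leaf_num + 1), {"type": "LEAF", "leaf": leaf_num, "bits": 0}
    )
    leaf["bits"] |= 1 << bit
    root["bits"] |= 1 << leaf_num


def add_sharer(directory, addr, core, state="S"):
    """Add `core` to the sharer set of `addr`."""
    root = directory.get((addr, 0))
    if root is None:
        directory[(addr, 0)] = {"type": "LIMPTRS", "state": state, "ptrs": [core]}
        return
    if root["type"] == "LIMPTRS":
        if core in root["ptrs"]:
            return
        if len(root["ptrs"]) < MAX_POINTERS:
            root["ptrs"].append(core)
            return
        # All pointers are used: switch to the root + leaves format.
        old_sharers = root["ptrs"]
        root = {"type": "ROOT", "state": root["state"], "bits": 0}
        directory[(addr, 0)] = root
        for sharer in old_sharers:
            _set_leaf_bit(directory, addr, root, sharer)
    _set_leaf_bit(directory, addr, root, core)


directory = {}
for core in (37, 265, 267, 64):
    add_sharer(directory, 0x5CA1AB1E, core)
leaves = sorted(e["leaf"] for e in directory.values() if e["type"] == "LEAF")
print(leaves)  # leaf numbers allocated for this line
```

Replaying the slide's example (sharers 37, 265, 267, then 64) allocates leaves 1, 2, and 8, stored at entry numbers 2, 3, and 9.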

slide-16
SLIDE 16

SCD & Desirable Properties

1. Scalability
   • Flexible sharer set encoding → Scalable energy and area
   • Coherence state stored in a single entry → Most operations have 1 lookup on critical path → Scalable latency
2. Minimal complexity
   • All entries in the same array → No coherence protocol changes
3. Exact sharer information
4. Negligible directory-induced invalidations (?)
   • With minimal, bounded overprovisioning

slide-17
SLIDE 17

Outline

• Introduction
• SCD Design
• Analytical Bounds on Overprovisioning
• Evaluation

slide-18
SLIDE 18

Analytical Models

• Directories built with ZCache arrays can be characterized with simple, workload-independent analytical models

Fraction of insertions that cause a directory invalidation (determines performance impact, interference):

$$P_{inv} = occ^{R}$$

Average lookups per replacement (determines replacement latency and energy):

$$AvgLookups = \frac{1 - occ^{R}}{W \cdot (1 - occ)}$$

where W = ways, R = replacement candidates, and occ = occupancy (fraction of used entries).
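As a hedged numeric check of these models, the sketch below takes P_inv = occ^R and AvgLookups = (1 - occ^R) / (W (1 - occ)), the forms that the slide's symbols (W, R, occ) appear to spell out, and inverts the first one to find the highest occupancy that still meets a target invalidation rate. The function names are illustrative, and occupancy is assumed to be below 1.

```python
def invalidation_probability(occupancy: float, candidates: int) -> float:
    """P_inv = occ^R: every one of the R candidates is in use."""
    return occupancy ** candidates


def average_lookups(occupancy: float, candidates: int, ways: int) -> float:
    """AvgLookups = (1 - occ^R) / (W * (1 - occ)), for occupancy < 1."""
    return (1 - occupancy ** candidates) / (ways * (1 - occupancy))


def max_occupancy(target_p_inv: float, candidates: int) -> float:
    """Invert P_inv = occ^R to get the occupancy bound for a target P_inv."""
    return target_p_inv ** (1 / candidates)


def overprovisioning(occupancy: float) -> float:
    """Extra entries needed so that the tracked state fills only `occupancy`."""
    return 1 / occupancy - 1


occ = max_occupancy(1e-3, 64)
print(f"max occupancy: {occ:.3f}")
print(f"overprovisioning: {overprovisioning(occ):.1%}")
print(f"average lookups (W=4): {average_lookups(occ, 64, 4):.2f}")
```

For a target of one invalidation per thousand insertions with R = 64, this gives an occupancy bound near 90% and overprovisioning of roughly 11%.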

slide-19
SLIDE 19

Bounding Invalidations

• SCD bounds invalidations with minimal overprovisioning
  • Bounded worst-case behavior independent of workload
  • For Pinv = 10^-3 → W=4, R=64, 11% overprovisioning
  • Max directory occupancy 90%
• Overprovisioning is:
  • Smaller than previous empirical results (25%-2x)
  • Bounded → Strict guarantees, no design-time uncertainty

slide-20
SLIDE 20

Outline

• Introduction
• SCD Design
• Analytical Bounds on Overprovisioning
• Evaluation

slide-21
SLIDE 21

Methodology

• Simulated system: 1024-core tiled CMP
  • In-order cores with split L1s
  • Private inclusive L2s, 128KB/core
  • Shared non-inclusive L3, 256MB
  • MESI directory protocol
• Directory implementations:
  • Sparse, 2-level Hierarchical, SCD
  • Directories 100%-provisioned for L2s
  • All directories use ZCache arrays → negligible invalidations
• 14 workloads from PARSEC, SPLASH2, SPECOMP/JBB, BioParallel suites

[Figure: 16-core tile (16 cores, L3 bank, directory bank, memory controller, router); 64-tile CMP (1024 cores)]

slide-22
SLIDE 22

Area

• Area given as a percentage of L2 caches
• At 1024 cores, SCD is:
  • 13x smaller than Sparse
  • 2x smaller than Hierarchical
  • Takes ~3% of total die area

Cores  Sparse  Hierarchical  SCD    Sparse/SCD  Hier/SCD
128    34.2%   21.1%         10.9%  3.12x       1.93x
256    59.2%   24.2%         12.5%  4.73x       1.94x
512    109.2%  27.0%         13.9%  7.87x       1.95x
1024   209.2%  30.9%         15.8%  13.22x      1.95x
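A quick arithmetic check on the table, using only the percentages shown. Ratios recomputed from these rounded values can differ from the slide's ratio columns in the last digit, since the slide presumably used unrounded areas.

```python
# Area as a percentage of L2 capacity, copied from the table.
area = {
    128:  {"Sparse": 34.2,  "Hierarchical": 21.1, "SCD": 10.9},
    256:  {"Sparse": 59.2,  "Hierarchical": 24.2, "SCD": 12.5},
    512:  {"Sparse": 109.2, "Hierarchical": 27.0, "SCD": 13.9},
    1024: {"Sparse": 209.2, "Hierarchical": 30.9, "SCD": 15.8},
}

for cores, row in area.items():
    sparse_ratio = row["Sparse"] / row["SCD"]
    hier_ratio = row["Hierarchical"] / row["SCD"]
    print(f"{cores:5d}  Sparse/SCD={sparse_ratio:.2f}x  Hier/SCD={hier_ratio:.2f}x")

# Growth of the Sparse directory each time the core count doubles.
core_counts = sorted(area)
sparse_growth = [
    round(area[b]["Sparse"] - area[a]["Sparse"], 1)
    for a, b in zip(core_counts, core_counts[1:])
]
print(sparse_growth)
```

The Sparse increments double with every doubling of the core count, which is what a bit-vector with one bit per core would produce, while the SCD column grows by only a few points overall.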

slide-23
SLIDE 23

Performance

• Hierarchical up to 10% slower than Ideal
• Sparse has Ideal-like performance, but too expensive
• SCD as fast as Ideal & Sparse, cheapest

[Figure: bar chart of slowdown over Ideal Directory (%), axis from 2 to 12, for Hierarchical, Sparse, and SCD on bscholes, applu, jbb, ocean, svm, canneal]

slide-24
SLIDE 24

Energy Efficiency

• Directory energy = Accesses * Energy/access
• SCD performs slightly more accesses (lookups, writes) than Sparse
  • Some operations require multiple lookups
  • SCD has higher occupancy, replacements take longer
• SCD energy/access is smaller (narrow entries)

[Figure: bar chart of SCD array accesses over Sparse (%), axis from 5 to 25, on bscholes, applu, jbb, ocean, svm, canneal; a 97% label appears in the chart]

slide-25
SLIDE 25

Analytical Models

• Empirical results on invalidations match analytical models
  • Bounds worst-case invalidations with minimal overprovisioning
  • Can provision directory using simple formulas
• Set-associative arrays do not meet analytical models
  • Need significant overprovisioning (~2x), no bounds
• Similar results for Sparse & Hierarchical

slide-26
SLIDE 26

Conclusions

• SCD insights:
  • Use a variable number of entries/line
  • Keep entries small
  • Use ZCache → High associativity + Analytical models
• SCD = Scalability + Performance guarantees
  • Scalable area, energy, latency
  • Simple: No modifications to coherence protocol
  • Negligible invalidations with bounded overprovisioning
• At 1024 cores, SCD is 13x smaller than Sparse, and 2x smaller, faster and simpler than Hierarchical

slide-27
SLIDE 27

THANK YOU FOR YOUR ATTENTION

QUESTIONS?