Tiny Directory: Making Coherence Tracking Feather-light - PowerPoint PPT Presentation


slide-1
SLIDE 1

Tiny Directory

Making Coherence Tracking Feather-light

Mainak Chaudhuri, Indian Institute of Technology Kanpur (Joint work with Sudhanshu Shukla, IITK)

slide-2
SLIDE 2

Forty-year anniversary

  • Forty years of directory-based coherence
  • L. M. Censier and P. Feautrier. A New Solution to Coherence Problems in Multicache Systems. In IEEE Transactions on Computers, C-27(12):1112-1118, December 1978.

– “A new solution is presented and discussed here: the presence flag solution.”

Lucien M. Censier, CII-Honeywell Bull; Paul Feautrier, University Pierre et Marie Curie

slide-3
SLIDE 3

Sketch

  • Talk in one slide
  • Result highlights
  • Introduction
  • Tiny Directory

– In-LLC coherence tracking
– Tiny Directory design
– Spilling into LLC space

  • Simulation infra-structure
  • Simulation results
  • Summary and extensions
slide-5
SLIDE 5

Talk in One Slide

[Figure: four cores (C0 to C3) with private cache(s) holding blocks B1, B2, and B3, connected through an interconnection network to a shared LLC bank and its sparse directory slice; the sparse directory height is marked]

slide-6
SLIDE 6

Talk in One Slide

Sparse directory height is an important determinant of performance

[Figure: cores C0 to C3 with private caches, interconnection network, shared LLC bank, and sparse directory slice]

slide-7
SLIDE 7

Talk in One Slide

We show how to design very small sparse directories while delivering high performance

[Figure: cores C0 to C3 with private caches, interconnection network, shared LLC bank, and sparse directory slice]

slide-8
SLIDE 8

Talk in One Slide

Track privately owned blocks by borrowing bits from LLC data way

[Figure: cores C0 to C3 with private caches, interconnection network, shared LLC bank, and sparse directory slice]

slide-13
SLIDE 13

Talk in One Slide

Critical shared blocks with large-scale read sharing are tracked in a tiny directory

[Figure: cores C0 to C3 with private caches, interconnection network, shared LLC bank, and sparse directory slice]

slide-14
SLIDE 14

Talk in One Slide

Entries evicted from tiny directory can be spilled into LLC space at a controlled rate

[Figure: cores C0 to C3 with private caches, interconnection network, shared LLC bank, and sparse directory slice]

slide-18
SLIDE 18

Result highlights

  • 128-core chip-multiprocessor running scientific computing, general-purpose, and commercial multi-threaded workloads

– Our Tiny Directory proposal using sparse directories with (1/32)x to (1/256)x entries performs within 1% of a 2x sparse directory

  • Tiny Directory capacity ranges from 187 KB to 23.75 KB

– Our Tiny Directory proposal exercising (1/256)x entries saves 16% energy in the LLC and the sparse directory compared to the 2x baseline
– Our proposal outperforms the state-of-the-art multi-grain directory by large margins

A significant leap forward in saving on-die SRAM investment for coherence tracking

slide-20
SLIDE 20

Introduction

  • Sparse directory is a set-associative tagged structure attached to each last-level cache (LLC) bank

– Each sparse directory entry tracks the location(s) of an LLC block in the private cache hierarchy attached to each core
– Sparse directory implementation needs to be space-efficient as the number of cores in the chip-multiprocessor increases
– The number of sparse directory entries imposes an upper bound on the number of distinct blocks tracked at any point in time

  • This parameter plays an important role in determining the overall performance and the total space investment for coherence tracking

slide-23
SLIDE 23

Sparse directory height

  • Sparse directory height is an important determinant of performance

– Number of sparse directory entries is expressed as a fraction of the number of blocks in the last-level private cache (L2 cache in our case)

With decreasing directory height, premature directory evictions cause back-invalidation of live blocks from the private cache hierarchy

Compared to a 2x sparse directory, execution time increases by 3%, 11%, and 28% for (1/4)x, (1/8)x, and (1/16)x directory heights

slide-24
SLIDE 24

Private vs. shared blocks

  • Recent proposals have recognized the presence of a large volume of private blocks in the on-chip cache hierarchy

– 79% of all allocated LLC blocks in our case
– Techniques have been proposed to reduce the overhead of tracking private blocks
– Multi-grain directory devotes one directory entry to track a 1 KB private region [MICRO’13]

  • Requires support for dual-grain coherence

– Stash directory does not back-invalidate a private block on evicting its directory entry [HPCA’14]

  • Requires broadcast-based recovery if such a block gets shared in future

– OS-identified private pages not tracked [ISCA’11]

  • Requires custom OS support
slide-27
SLIDE 27

Tracking shared blocks: Limit study

  • How small can the sparse directory be if private blocks are not tracked in the directory?

– A block is tracked in the directory only when it has at least two sharers; tracked until it becomes unowned/non-shared or evicted from directory

Not possible to maintain good performance below (1/16)x even when all tracking overhead of private blocks is eliminated

Compared to a 2x sparse directory, execution time increases by 1%, 4%, 13%, and 28% for (1/16)x, (1/32)x, (1/64)x, and (1/128)x directories

slide-29
SLIDE 29

Tiny Directory: General plan

  • Three steps to optimize directory height

– Start with a naïve design that doesn’t have a sparse directory

  • A block is tracked by borrowing bits from the block’s LLC data way (in-LLC coherence tracking)

– Assumes a traditional non-inclusive/non-exclusive LLC where blocks are filled in LLC on miss and no back-invalidation is sent on LLC eviction

  • Works well for private blocks
  • All read requests to a shared block must be forwarded to a sharer

– Track critical read-shared blocks in a tiny directory to avoid three-hop critical paths
– Make the design robust by spilling a subset of evicted tiny directory entries into LLC

slide-31
SLIDE 31

In-LLC coherence tracking

  • Salient features

– Uses no extra storage for coherence tracking
– Borrows bits from the LLC data way of a block for tracking its location(s)
– Extends the traditional baseline MESI protocol

  • Coherence state encoding

– Two state bits per LLC block as in the baseline

  • V=0, D=0: invalid LLC block
  • V=1, D=0: valid LLC block, not modified, unowned, not shared
  • V=1, D=1: valid LLC block, modified, unowned, not shared
  • V=0, D=1: valid LLC block, either owned by a core or shared, bits of data way used for extended encoding
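
The two-bit encoding above can be read as a small lookup table. The following is a minimal Python sketch of that table only; the function name `decode_llc_state` and the returned strings are hypothetical and are not taken from the talk.

```python
def decode_llc_state(v: int, d: int) -> str:
    """Map the two per-block state bits (V, D) to the meaning listed on the slide."""
    states = {
        (0, 0): "invalid",
        (1, 0): "valid, not modified, unowned, not shared",
        (1, 1): "valid, modified, unowned, not shared",
        # V=0, D=1 is the otherwise unused combination: the block is owned by a
        # core or shared, and some bits of its data way hold the extended encoding.
        (0, 1): "valid, owned or shared, extended encoding in data way",
    }
    return states[(v, d)]
```

The point of the sketch is that the baseline's two state bits are reused as-is: only the (V=0, D=1) combination changes meaning.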

slide-32
SLIDE 32

In-LLC coherence tracking

  • Extended state encoding when V=0, D=1

– Data bit#0: dirty
– Data bit#1: pending/busy
– Data bit#2: owned if 1 and shared if 0
– Data bit#3: owner/sharer encoding format

  • If set to 1, next log C bits encode a sharer/owner (C is the number of cores)
  • If set to 0, next C bits encode a sharer bitvector

– Either 4+C or 4+log C data bits can be corrupted
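
As a rough illustration of how many data bits the extended encoding borrows, here is a small Python sketch. The function name `corrupted_bits` is hypothetical, and rounding log C up for a core count that is not a power of two is an assumption; the slide only states 4+C and 4+log C.

```python
def corrupted_bits(num_cores: int, pointer_format: bool) -> int:
    """Data-way bits borrowed for the extended encoding of one LLC block.

    Four fixed bits (dirty, pending/busy, owned-vs-shared, format) are followed
    by either a log2(C)-bit core id or a C-bit sharer bitvector.
    """
    fixed = 4
    if pointer_format:
        # ceil(log2(C)) computed with integers; the rounding is an assumption.
        return fixed + (num_cores - 1).bit_length()
    return fixed + num_cores
```

For the 128-core configuration simulated in the talk this gives 11 bits in the pointer format and 132 bits in the bitvector format, out of a 64-byte (512-bit) block.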

slide-33
SLIDE 33

In-LLC coherence tracking

[Figure: entry formats]

  • Sparse directory entry: V, B, O/S, Tag, Full-Map Sharer Set (C bits)
  • Baseline LLC entry: V, D, Tag, Data Block
  • LLC entry with in-LLC coherence tracking: V, D, Tag, Corrupted Data (D, B, O/S, En, Sharers: C bits), Partial Data

slide-34
SLIDE 34

In-LLC coherence tracking

  • The state transitions are trivially extended from baseline MESI
  • Two performance issues to watch out for

– Extra interconnect traffic due to block reconstruction bits (4+C or 4+log C) being carried by the clean block (E and a fraction of S) eviction notices to the LLC from cores

  • M state evictions carry the full block as in the baseline

– Reads to shared blocks suffer from lengthened critical path

  • Two hops in baseline and three hops in in-LLC tracking

slide-42
SLIDE 42

In-LLC coherence tracking

[Figure: read to a shared block under in-LLC coherence tracking. Requester R sends a Read to the home bank; the LLC tag lookup is a tag hit with V=0, D=1, and the LLC data decodes to 000 (shared). The home bank marks the block busy, elects a sharer S, and forwards the request. S responds with data and the busy bit is cleared.]

In baseline, home LLC bank would have responded to R directly
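
The difference in critical path can be sketched as the list of nodes a read to a read-shared block visits. This is an illustrative Python sketch only: the names `shared_read_path` and `hops` are hypothetical, and electing the sharer with the smallest id is an assumption, since the slides only say "elect a sharer".

```python
def shared_read_path(requester, home, sharers, corrupted):
    """Return the sequence of nodes visited by a read to a read-shared block."""
    if not corrupted:
        # Baseline: the home LLC bank holds intact data and replies directly.
        return [requester, home, requester]
    # In-LLC tracking: part of the data way holds coherence bits, so the home
    # bank forwards the request to an elected sharer (smallest id: an assumption).
    forwarder = min(sharers)
    return [requester, home, forwarder, requester]


def hops(path):
    """Number of network traversals on the path."""
    return len(path) - 1
```

With these definitions the baseline path has two hops and the in-LLC tracking path has three, matching the counts quoted in the talk.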

slide-45
SLIDE 45

In-LLC coherence tracking

  • Interconnect traffic (bytes of header and payload) comparison between in-LLC coherence tracking and sparse 2x directory

Additional three-hop read requests to shared blocks lead to coherence traffic increase

Compared to a 2x sparse directory, processor request and eviction traffic increases by a percentage each; coherence traffic increases by >5%

slide-46
SLIDE 46

In-LLC coherence tracking

  • Performance comparison with 2x sparse directory

– On average, in-LLC coherence tracking performs 11% worse than a 2x sparse directory
– Several applications lose at least 10% performance: swaptions, barnes, ocean_cp, 316.applu, 324.apsi, SPECWeb
– Primary reason for this loss in performance is the lengthened critical path of reads to shared blocks

slide-49
SLIDE 49

In-LLC coherence tracking

  • Fraction of LLC accesses that experience lengthened critical path

On average, 30% of LLC accesses suffer from this problem

For commercial applications, code accesses suffer more than data

slide-50
SLIDE 50

In-LLC coherence tracking

  • Fraction of allocated LLC blocks that experience accesses with lengthened critical path

On average, only 8% of LLC blocks experience this problem

Can we design a small sparse directory to track these offending blocks?

slide-51
SLIDE 51

In-LLC coherence tracking

  • Among the small fraction of offending LLC blocks, is there a small subset covering a large fraction of lengthened accesses?

– Define Shared Three-hop Read Access (STRA) ratio of a block = fraction of LLC read accesses to the block that need forwarding to a sharer because the block is in shared corrupted state

  • All offending LLC blocks have non-zero STRA ratio
  • Rest of the LLC blocks have zero STRA ratio

– Divide all LLC blocks into eight categories (C0 to C7) based on their STRA ratio: 0, (0, 1/2], (1/2, 3/4], (3/4, 7/8], …, (31/32, 63/64], (63/64, 1]

  • A block may change its STRA ratio category during its residence in the LLC
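
The category boundaries above follow the pattern 1 - 1/2^k, so the mapping from an STRA ratio to its category can be sketched in a few lines of Python. The function name `stra_category` and its two counter arguments are hypothetical; treating a block with no reads yet as C0 is an assumption.

```python
from fractions import Fraction


def stra_category(forwarded_reads: int, total_reads: int) -> int:
    """Return k such that the block's STRA ratio falls in category Ck (0..7)."""
    if total_reads == 0 or forwarded_reads == 0:
        return 0  # C0: STRA ratio is exactly zero (no reads yet: an assumption)
    ratio = Fraction(forwarded_reads, total_reads)
    # Upper bounds of C1..C6 are 1/2, 3/4, 7/8, 15/16, 31/32, 63/64.
    for k in range(1, 7):
        if ratio <= 1 - Fraction(1, 2 ** k):
            return k
    return 7  # C7: ratio in (63/64, 1]
```

Exact fractions are used so that a ratio sitting right on a boundary such as 3/4 lands in the closed end of its interval, as the slide's notation requires.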

slide-52
SLIDE 52

In-LLC coherence tracking

  • Among the small fraction of offending LLC blocks, is there a small subset covering a large fraction of lengthened accesses?

– Key observation: LLC blocks in STRA ratio categories C6 and C7, with STRA ratio in (31/32, 1], account for only 12% of the offending blocks (i.e., 12% of 8%), but cover 54% of the accesses with lengthened critical path

  • Higher STRA ratio categories have fewer offending blocks, but cover more lengthened accesses
  • Blocks in these higher STRA ratio categories could be the target of a small sparse directory to avoid the problem of lengthened accesses

– Sets the stage for Tiny Directory design
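
A back-of-the-envelope reading of these averages, assuming the reported percentages can simply be multiplied (an assumption, since each is an average across workloads), using the 8% of LLC blocks and 30% of LLC accesses quoted earlier in the talk:

```latex
\[
\underbrace{0.12}_{\text{share of offending blocks in } C_6 \cup C_7}
\times
\underbrace{0.08}_{\text{offending share of LLC blocks}}
= 0.0096 \approx 1\% \text{ of allocated LLC blocks}
\]
\[
\underbrace{0.54}_{\text{share of lengthened accesses covered}}
\times
\underbrace{0.30}_{\text{lengthened share of LLC accesses}}
= 0.162 \approx 16\% \text{ of LLC accesses}
\]
```

Under that assumption, tracking roughly one percent of the allocated LLC blocks would address about half of the lengthened accesses.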

slide-54
SLIDE 54

Tiny Directory design

  • Tiny Directory is a traditional sparse directory

– Augments in-LLC coherence tracking and specializes in tracking a subset of the critical read-shared blocks (with high STRA ratio)

  • These blocks remain uncorrupted in the LLC and tracked in the Tiny Directory so that reads to these blocks can be responded to by the LLC without forwarding

– Very small in size and therefore must carefully select what to track
– A block is considered for tracking in the Tiny Directory on an LLC read to the block if

  • State of the block is corrupted shared or
  • Code block in invalid/unowned/non-shared state

– Tracking such a block in Tiny Directory allows future reads to the block to conclude in two hops

slide-61
SLIDE 61

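Tiny Directory policy#1: DSTRA

[Figure: DSTRA allocation on a read. Requester R sends a Read to the home bank; the LLC tag lookup is a tag hit with V=0, D=1, the LLC data decodes to 000 (shared), and the Tiny Directory lookup is a tag miss. The minimum STRA category in the Tiny Directory is Ci and the STRA category of the block is Ck. If i < k, the block is tracked in the Tiny Directory. The home bank elects a sharer S and forwards the request; S responds with data, and reconstruction bits are shown at the home bank.]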
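
The allocation test in the figure (track the block when its STRA category exceeds the minimum category already present) can be sketched as follows. This is a hedged Python sketch, not the authors' implementation: the name `dstra_allocate`, the per-set dictionary, filling a free way unconditionally, and evicting the minimum-category entry as the victim are all assumptions made for illustration.

```python
def dstra_allocate(set_entries: dict, block: str, block_cat: int, ways: int) -> bool:
    """Try to track `block` (STRA category `block_cat`) in one Tiny Directory set.

    `set_entries` maps a tracked block tag to its STRA category and is updated
    in place. Returns True if the block is tracked after the call.
    """
    if block in set_entries:
        return True  # already tracked
    if len(set_entries) < ways:
        set_entries[block] = block_cat  # free way: allocate (assumption)
        return True
    # Entry with the minimum STRA category Ci among the current entries.
    victim = min(set_entries, key=set_entries.get)
    if set_entries[victim] < block_cat:  # i < k: replace the victim
        del set_entries[victim]
        set_entries[block] = block_cat
        return True
    return False  # allocation denied
```

A denied allocation is one of the two spill situations discussed later in the talk.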

slide-62
SLIDE 62

Tiny Directory policy#2

  • Allocation/Eviction policy#2: DSTRA+gNRU

– Major shortcoming of DSTRA: tracking entries for C7 blocks may stay for too long in the Tiny Directory even if they are not useful any more
– Augment DSTRA with a generational NRU policy

  • If an entry does not receive any access for a full generation, it is considered for eviction
  • The length of a generation is defined to be the average interval between two consecutive reads to a shared block
  • Generation length is determined dynamically
slide-64
SLIDE 64

Spilling into LLC space

  • Tiny Directory needs to be sized to accommodate the critical read-shared working set

– Such a requirement is impractical because the size of the critical read-shared working set is unknown at design time

  • Can vary across applications and across phases of an application

– To make the proposal robust and practical, we incorporate the provision of spilling tracking entries into the LLC
– Two possible spill situations: eviction from the Tiny Directory and denial of allocation in the Tiny Directory by the allocation policy

slide-67
SLIDE 67

Spilling into LLC space

EB: Coherence Information, B: Block, T: Tag

[Figure: spill flow between the Tiny Directory and the LLC tag and data arrays]

  • Triggers: eviction of EB from Tiny Directory, or allocation of EB in Tiny Directory denied
  • Decision: spill EB in LLC?

– Yes: EB is spilled into the LLC as an entry with tag T and V=0, D=1, alongside the entry holding T and B (the figure labels LLC sets Set A and Set B)
– No: use in-LLC coherence tracking; the LLC entry holds T, EB, and partial B

  • Eviction of EB from LLC: use in-LLC coherence tracking

slide-68
SLIDE 68

Spilling into LLC space

  • Controlling spill rate to constrain LLC miss rate increase

– Goal is to allow as much spill as possible from high STRA categories while keeping the LLC miss rate in check
– Each LLC bank dynamically computes the smallest STRA category Ci such that all categories Ck with k ≥ i are allowed to spill provided the miss rate of that bank increases by no more than δ
– For a given δ, how to determine Ci for a bank?

slide-69
SLIDE 69

Spilling into LLC space

[Figure: spill-rate control in an LLC bank (256 sets)]

  • 240 spill sets: miss rate = MRspill
  • 16 no-spill sets: miss rate = MRno-spill
  • Each bank keeps the current lower bound category index i
  • At the end of each 8K-access window, test MRspill – MRno-spill ≤ δ

– Yes: increase spilling, i←i-1
– No: decrease spilling, i←i+1
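
The end-of-window decision in the flowchart can be sketched as a small update rule. This is a minimal Python sketch under stated assumptions: the names `miss_rate` and `next_lower_bound` are hypothetical, the test is taken literally from the figure as an absolute difference of miss rates, and clamping i to the range 1..8 (with 8 meaning no category may spill) is an assumption the slides do not state.

```python
def miss_rate(misses: int, accesses: int) -> float:
    """Miss rate of a group of sets over one observation window."""
    return misses / accesses if accesses else 0.0


def next_lower_bound(i: int, mr_spill: float, mr_no_spill: float,
                     delta: float, lo: int = 1, hi: int = 8) -> int:
    """Update the lower-bound category index i at the end of a window.

    Categories Ck with k >= i are allowed to spill. The bounds lo and hi are
    assumptions made for this sketch.
    """
    if mr_spill - mr_no_spill <= delta:
        return max(lo, i - 1)  # miss-rate penalty acceptable: spill more
    return min(hi, i + 1)      # penalty too high: spill less
```

The 16 no-spill sets act as a control sample, so each bank can estimate the miss-rate cost of spilling without any extra information from other banks.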

slide-70
SLIDE 70

Spilling into LLC space Spilling into LLC space

At th d f h 8K i d LLC b k l ifi th

L t ti l i f ill Not much gain from spill

At the end of each 8K-access window an LLC bank classifies the running application into categories A, B, C, D

100% Large potential gain from spill, Relatively high tolerance Not much gain from spill Class B δB=1/32 Class A δA=1/4 ate 100% Class D δ =1/32 Class C δ =1/16

B A

Miss Ra 10% δD=1/32 δC=1/16 0% STRA Ratio 0.4 1.0 0.0 Latency sensitive, Medium tolerance Low tolerance

slide-71
SLIDE 71

Putting it all together

[Figure: flow of a core request]

  • Tiny Directory hit: usual coherence flow
  • Tiny Directory miss: look up the LLC

– Single tag match, V=1: usual flow
– Single tag match, V=0, D=1 (corrupted): extra latency; allocate in Tiny Directory/spill?
– Dual tag match: spilled entry flow; move to corrupted state?
– No tag match: LLC fill flow; allocate in Tiny Directory/spill (for code)?

  • Tiny Directory eviction: move to corrupted state or spill?

Read to corrupted shared: extra one cyc. for state decoding

Read to corrupted exclusive: extra two cyc. (data read)+one cyc.

slide-72
SLIDE 72

Sketch

  • Talk in one slide
  • Result highlights
  • Introduction
  • Tiny Directory
– In-LLC coherence tracking
– Tiny Directory design
– Spilling into LLC space
  • Simulation infra-structure
  • Simulation results
  • Summary and extensions
slide-73
SLIDE 73

Simulation infra-structure

  • CPU cores
– 128 out-of-order issue dynamically scheduled x86 cores clocked at 2 GHz (private L1$, L2$)
  • L3 cache
– Shared across all cores, 128 banks (set interleaved), 256 KB 16-way per bank, 64B blocks, LRU, 4 cycles tag+2 cycles data per bank
  • Sparse directory
– Each L3 cache bank has a sparse directory slice responsible for tracking the blocks of the bank
  • Main memory
– Eight single-channel DDR3-2133 controllers
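As a quick consistency check, the L3 geometry can be derived from the parameters on this slide. The sketch below assumes 1 KB = 1024 bytes; the variable names are illustrative only.

```python
KB = 1024  # assumption: binary kilobytes

banks = 128
bank_bytes = 256 * KB   # 256 KB per bank
ways = 16               # 16-way set-associative
block_bytes = 64        # 64B blocks

sets_per_bank = bank_bytes // (ways * block_bytes)
blocks_per_bank = bank_bytes // block_bytes
llc_bytes = banks * bank_bytes

print(sets_per_bank)             # 256
print(blocks_per_bank)           # 4096
print(llc_bytes // (KB * KB))    # 32 (MB of shared L3 in total)
```

The 256 sets per bank agree with the "LLC bank (256 sets)" figure on the spilling slide, where 240 sets are spill sets and 16 are no-spill sets.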

slide-74
SLIDE 74

Simulation infra-structure

  • Sparse directory overhead
– Baseline 2x: about 11 MB (8-way set-associative)
    • About 100 KB per slice
– Tiny (1/32)x: 187 KB (8-way set-associative)
    • About 1500 bytes per slice
– Tiny (1/64)x: 94 KB (8-way set-associative)
    • About 750 bytes per slice
– Tiny (1/128)x: 47.5 KB (fully associative 16/slice)
    • 380 bytes per slice
– Tiny (1/256)x: 23.75 KB (fully associative 8/slice)
    • 190 bytes per slice
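The per-slice figures follow from dividing each total by the number of slices. The sketch below assumes one slice per L3 bank (128 slices) and 1 KB = 1024 bytes; the per-entry size for the fully associative configurations is derived here and is not a number stated on the slide.

```python
KB = 1024      # assumption: binary kilobytes
SLICES = 128   # assumption: one directory slice per L3 bank

total_kb = {
    "Tiny (1/32)x": 187,
    "Tiny (1/64)x": 94,
    "Tiny (1/128)x": 47.5,
    "Tiny (1/256)x": 23.75,
}

# Bytes of directory storage in each slice.
per_slice_bytes = {name: kb * KB / SLICES for name, kb in total_kb.items()}

# Entries per slice are given only for the fully associative configurations.
entries_per_slice = {"Tiny (1/128)x": 16, "Tiny (1/256)x": 8}
per_entry_bytes = {
    name: per_slice_bytes[name] / n for name, n in entries_per_slice.items()
}

for name, size in per_slice_bytes.items():
    print(f"{name}: {size:g} bytes per slice")
```

Under these assumptions the two smallest configurations come out at exactly 380 and 190 bytes per slice, and both work out to the same per-entry size.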

slide-75
SLIDE 75

Sketch

  • Talk in one slide
  • Result highlights
  • Introduction
  • Tiny Directory
– In-LLC coherence tracking
– Tiny Directory design
– Spilling into LLC space
  • Simulation infra-structure
  • Simulation results
  • Summary and extensions
slide-91
SLIDE 91

Simulation results

  • Execution cycles normalized to baseline 2x

[Bar chart, lower is better: normalized execution cycles (axis from 1.00 to 1.08) for Tiny (1/32)x, Tiny (1/64)x, Tiny (1/128)x, and Tiny (1/256)x; series: DSTRA, DSTRA+gNRU, DSTRA+gNRU+Spill]

– In-LLC coherence tracking performs 11% worse than baseline 2x
– DSTRA and DSTRA+gNRU always perform better than in-LLC coherence tracking
– DSTRA+gNRU+Spill almost bridges the gap with baseline 2x and performs within 1%

slide-92
SLIDE 92

Simulation results

  • Energy comparison

– Execution cycles and LLC+Dir. total energy (dynamic+leakage) at 22 nm relative to Tiny (1/256)x exercising DSTRA+gNRU+Spill

                 Cycles   Energy
Tiny (1/128)x    0.998    0.995
Base 2x          0.988    1.198
Base 1x          0.995    1.095
Base (1/2)x      1.003    1.044
Base (1/4)x      1.025    1.039   (lowest energy base)
Base (1/8)x      1.100    1.104
Base (1/16)x     1.268    1.269

Chart annotation: 1.5 MB space
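The table can be checked with a few lines of Python. The numbers are copied from this slide; the derived energy saving relative to Base 2x is an illustrative calculation, not a figure quoted in the talk.

```python
# (cycles, energy) relative to Tiny (1/256)x with DSTRA+gNRU+Spill, from the slide.
results = {
    "Tiny (1/128)x": (0.998, 0.995),
    "Base 2x": (0.988, 1.198),
    "Base 1x": (0.995, 1.095),
    "Base (1/2)x": (1.003, 1.044),
    "Base (1/4)x": (1.025, 1.039),
    "Base (1/8)x": (1.100, 1.104),
    "Base (1/16)x": (1.268, 1.269),
}

# Baseline configuration with the lowest LLC+Dir. energy.
bases = {name: v for name, v in results.items() if name.startswith("Base")}
lowest_energy_base = min(bases, key=lambda name: bases[name][1])

# Energy of Tiny (1/256)x is 1.0 by definition, so its saving against
# Base 2x is 1 - 1/1.198.
saving_vs_base_2x = 1 - 1 / results["Base 2x"][1]

print(lowest_energy_base)
print(f"{saving_vs_base_2x:.1%}")
```

Even the lowest-energy baseline spends more energy than Tiny (1/256)x while also taking more cycles, which is the point the annotation on the slide makes.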

slide-93
SLIDE 93

Simulation results

  • Comparison to related proposals

– Multi-grain directory (MgD) devotes one directory entry to track a 1 KB private region [MICRO’13]
    • Requires support for dual-grain coherence
– Stash directory does not back-invalidate a private block on evicting its directory entry [HPCA’14]
    • Requires broadcast-based recovery if such a block gets shared in future
– Exec. cycles relative to base 2x (lower is better)
    MgD (1/8)x: 1.001 (baseline (1/8)x: 1.11)
    MgD (1/16)x: 1.08 (baseline (1/16)x: 1.28)
    MgD (1/32)x: 1.29 (baseline (1/32)x: 1.71)
    Stash (1/32)x: 1.41
    Tiny (1/32)x to (1/256)x: 1.005 to 1.01

slide-94
SLIDE 94

Sketch

  • Talk in one slide
  • Result highlights
  • Introduction
  • Tiny Directory
– In-LLC coherence tracking
– Tiny Directory design
– Spilling into LLC space
  • Simulation infra-structure
  • Simulation results
  • Summary and extensions
slide-95
SLIDE 95

Summary

  • A novel coherence tracking mechanism exercising very small sparse directories in the range (1/32)x to (1/256)x
– Just a few hundred bytes of storage per LLC slice
  • Smart allocation policies for the sparse directory entries backed by controlled spilling of entries into the LLC space
  • Performs within a percent of a traditional 2x sparse directory
  • A significant leap forward in saving on-die SRAM investment for coherence tracking
  • Ref: Shukla and Chaudhuri. HPCA 2017.
slide-96
SLIDE 96

Next Steps/Questions

  • Why so many directory/LLC accesses to high STRA ratio blocks?
– Private caches are inefficient in retaining these critical blocks: equal treatment for all blocks
– Opportunity for designing better private cache hierarchies in many-core server processors
– A sharing-aware private cache hierarchy design discussed in Shukla and Chaudhuri. ICCD 2017.
  • Can Tiny Directory be applied to more exclusive LLCs a la Magny-Cours/Skylake-X?
  • Can Tiny Directory be applied to inter-socket coherence tracking in a multi-socket system?

slide-97
SLIDE 97

Next Steps/Questions

  • Can Tiny Directory be applied to CPU-GPU heterogeneous chip-multiprocessors?
– Most heterogeneous workloads exercise bulk synchronous model with little fine-grain sharing and large regions private to CPU or GPU
    • Ideal for Tiny Directory: most tracking entries would be amalgamated with shared LLC data ways

slide-98
SLIDE 98

... nothing clears up a case so much as stating it to another person …

  • Sherlock Holmes

[Sir Arthur Conan Doyle. Silver Blaze.]