Tiny Directory: Making Coherence Tracking Feather-light
Mainak Chaudhuri Indian
Forty-year anniversary
- Forty years of directory-based coherence
- L. M. Censier and P. Feautrier. A New Solution to Coherence Problems in Multicache Systems. In IEEE Transactions on Computers, C-27(12):1112-1118, December 1978.
– “A new solution is presented and discussed here: the presence flag solution.”
Lucien M. Censier, CII-Honeywell Bull; Paul Feautrier, University Pierre et Marie Curie
Sketch
- Talk in one slide
- Result highlights
- Introduction
- Tiny Directory
– In-LLC coherence tracking
– Tiny Directory design
– Spilling into LLC space
- Simulation infrastructure
- Simulation results
- Summary and extensions
Talk in One Slide
[Figure: cores C0 to C3 with private cache(s) holding blocks B1, B2, and B3, an interconnection network, and a shared LLC bank with its sparse directory slice tracking the privately cached blocks; the sparse directory height is marked on the slice.]
- Sparse directory height is an important determinant of performance
- We show how to design very small sparse directories while delivering high performance
- Track privately owned blocks by borrowing bits from LLC data way
- Critical shared blocks with large-scale read sharing are tracked in a tiny directory
- Entries evicted from tiny directory can be spilled into LLC space at a controlled rate
Result highlights
- 128-core chip-multiprocessor running scientific computing, general-purpose, and commercial multi-threaded workloads
– Our Tiny Directory proposal using sparse directories with (1/32)x to (1/256)x entries performs within 1% of a 2x sparse directory
- Tiny Directory capacity ranges from 187 KB to 23.75 KB
– Our Tiny Directory proposal exercising (1/256)x entries saves 16% energy in the LLC and the sparse directory compared to the 2x baseline
– Our proposal outperforms the state-of-the-art multi-grain directory by large margins
- A significant leap forward in saving on-die SRAM investment for coherence tracking
Introduction
- Sparse directory is a set-associative tagged structure attached to each last-level cache (LLC) bank
– Each sparse directory entry tracks the location(s) of an LLC block in the private cache hierarchy attached to each core
– Sparse directory implementation needs to be space-efficient as the number of cores in the chip-multiprocessor increases
– The number of sparse directory entries imposes an upper bound on the number of distinct blocks tracked at any point in time
- This parameter plays an important role in determining the overall performance and the total space investment for coherence tracking
Sparse directory height
- Sparse directory height is an important determinant of performance
– Number of sparse directory entries is mentioned as a fraction of the number of blocks in the last-level private cache (L2 cache in our case)
Compared to a 2x sparse directory, execution time increases by 3%, 11%, and 28% for (1/4)x, (1/8)x, and (1/16)x directory heights
With decreasing directory height, premature directory evictions cause back-invalidation of live blocks from private cache hierarchy
Private vs. shared blocks
- Recent proposals have recognized the presence of a large volume of private blocks in the on-chip cache hierarchy
– 79% of all allocated LLC blocks in our case
– Techniques have been proposed to reduce the overhead of tracking private blocks
– Multi-grain directory devotes one directory entry to track a 1 KB private region [MICRO’13]
- Requires support for dual-grain coherence
– Stash directory does not back-invalidate a private block on evicting its directory entry [HPCA’14]
- Requires broadcast-based recovery if such a block gets shared in future
– OS-identified private pages not tracked [ISCA’11]
- Requires custom OS support
Tracking shared blocks: Limit study
- How small can the sparse directory be if private blocks are not tracked in the directory?
– A block is tracked in the directory only when it has at least two sharers; tracked until it becomes unowned/non-shared or evicted from directory
Compared to a 2x sparse directory, execution time increases by 1%, 4%, 13%, and 28% for (1/16)x, (1/32)x, (1/64)x, (1/128)x directories
Not possible to maintain good performance below (1/16)x even when all tracking overhead of private blocks is eliminated
Tiny Directory: General plan
- Three steps to optimize directory height
– Start with a naïve design that doesn’t have a sparse directory
- A block is tracked by borrowing bits from the block’s LLC data way (in-LLC coherence tracking)
– Assumes a traditional non-inclusive/non-exclusive LLC where blocks are filled in LLC on miss and no back-invalidation is sent on LLC eviction
- Works well for private blocks
- All read requests to a shared block must be forwarded to a sharer
– Track critical read-shared blocks in a tiny directory to avoid three-hop critical paths
– Make the design robust by spilling a subset of evicted tiny directory entries into LLC
In-LLC coherence tracking
- Salient features
– Uses no extra storage for coherence tracking
– Borrows bits from the LLC data way of a block for tracking its location(s)
– Extends the traditional baseline MESI protocol
- Coherence state encoding
– Two state bits per LLC block as in the baseline
- V=0, D=0: invalid LLC block
- V=1, D=0: valid LLC block, not modified, unowned, not shared
- V=1, D=1: valid LLC block, modified, unowned, not shared
- V=0, D=1: valid LLC block, either owned by a core or shared, bits of data way used for extended encoding
- Extended state encoding when V=0, D=1
– Data bit#0: dirty
– Data bit#1: pending/busy
– Data bit#2: owned if 1 and shared if 0
– Data bit#3: owner/sharer encoding format
- If set to 1, next log C bits encode a sharer/owner (C is the number of cores)
- If set to 0, next C bits encode a sharer bitvector
– Either 4+C or 4+log C data bits can be corrupted
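The encoding above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, bit#0 is assumed to be the least significant bit, and the choice of the pointer format whenever exactly one core holds the block is an assumption (the slides do not say when each format is used).

```python
from math import ceil, log2

# Data bit positions taken from the slide.
DIRTY, BUSY, OWNED, POINTER_FORMAT = 0, 1, 2, 3

def llc_state(v: int, d: int) -> str:
    """Decode the two per-block state bits (V, D)."""
    return {
        (0, 0): "invalid",
        (1, 0): "valid, not modified, unowned, not shared",
        (1, 1): "valid, modified, unowned, not shared",
        (0, 1): "owned or shared (extended encoding in data way)",
    }[(v, d)]

def encode_extended(dirty, busy, owned, sharers, num_cores):
    """Return (bits, width): the borrowed data-way bits and how many are used."""
    sharers = sorted(set(sharers))
    header = (dirty << DIRTY) | (busy << BUSY) | (owned << OWNED)
    if len(sharers) == 1:
        # Assumption: pointer format (bit#3 = 1) for a single owner/sharer.
        width = 4 + ceil(log2(num_cores))
        return header | (1 << POINTER_FORMAT) | (sharers[0] << 4), width
    # Bitvector format (bit#3 = 0): one presence bit per core.
    vector = 0
    for core in sharers:
        vector |= 1 << core
    return header | (vector << 4), 4 + num_cores

def decode_extended(bits: int, num_cores: int) -> dict:
    """Recover the extended state fields from the borrowed bits."""
    fields = {
        "dirty": bool(bits >> DIRTY & 1),
        "busy": bool(bits >> BUSY & 1),
        "owned": bool(bits >> OWNED & 1),
    }
    payload = bits >> 4
    if bits >> POINTER_FORMAT & 1:
        mask = (1 << ceil(log2(num_cores))) - 1
        fields["sharers"] = [payload & mask]
    else:
        fields["sharers"] = [c for c in range(num_cores) if payload >> c & 1]
    return fields
```

With C = 128 cores the two formats use 4+7 = 11 and 4+128 = 132 of the 512 bits in a 64B block.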
[Figure: entry layouts. Sparse directory entry: Tag, V, B, O/S, and a full-map sharer set of C bits. Baseline LLC entry: V, D, Tag, Data Block. LLC entry under in-LLC coherence tracking: V, D, Tag, and a data way whose corrupted portion holds D, B, O/S, En, and Sharers (C bits), with the rest holding partial data.]
- The state transitions are trivially extended from baseline MESI
- Two performance issues to watch out for
– Extra interconnect traffic due to block reconstruction bits (4+C or 4+log C) being carried by the clean block (E and a fraction of S) eviction notices to the LLC from cores
- M state evictions carry the full block as in the baseline
– Reads to shared blocks suffer from lengthened critical path
- Two hops in baseline and three hops in in-LLC tracking
[Figure: read to a block in corrupted shared state. Requester R sends a read to the home bank; the LLC tag lookup hits with V=0, D=1; the leading data-way bits read 000 (shared); the bank marks the block busy, elects a sharer S, and forwards the request; S responds with data and the busy state is cleared.]
In baseline, home LLC bank would have responded to R directly
- Interconnect traffic (bytes of header and payload) comparison between in-LLC coherence tracking and sparse 2x directory
Compared to a 2x sparse directory, processor request and eviction traffic increases by a percentage each; coherence traffic increases by >5%
Additional three-hop read requests to shared blocks lead to coherence traffic increase
- Performance comparison with 2x sparse directory
– On average, in-LLC coherence tracking performs 11% worse than a 2x sparse directory
– Several applications lose at least 10% performance: swaptions, barnes, ocean_cp, 316.applu, 324.apsi, SPECWeb
– Primary reason for this loss in performance is the lengthened critical path of reads to shared blocks
- Fraction of LLC accesses that experience lengthened critical path
On average, 30% LLC accesses suffer from this problem
For commercial applications, code accesses suffer more than data
- Fraction of allocated LLC blocks that experience accesses with lengthened critical path
On average, only 8% LLC blocks experience this problem
Can we design a small sparse directory to track these offending blocks?
- Among the small fraction of offending LLC blocks, is there a small subset covering a large fraction of lengthened accesses?
– Define Shared Three-hop Read Access (STRA) ratio of a block = fraction of LLC read accesses to the block that need forwarding to a sharer because the block is in shared corrupted state
- All offending LLC blocks have non-zero STRA ratio
- Rest of the LLC blocks have zero STRA ratio
– Divide all LLC blocks into eight categories (C0 to C7) based on their STRA ratio: 0, (0, 1/2], (1/2, 3/4], (3/4, 7/8], …, (31/32, 63/64], (63/64, 1]
- A block may change its STRA ratio category during its residence in the LLC
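The category boundaries follow the pattern 1 - 1/2^k, so the mapping from STRA ratio to category can be written compactly. The sketch below assumes per-block counters of total LLC reads and forwarded reads; how the hardware keeps such counts is not described in the slides, and the function name is hypothetical.

```python
from fractions import Fraction

def stra_category(forwarded_reads: int, total_reads: int) -> int:
    """Return k such that a block with this STRA ratio falls in category Ck."""
    if total_reads == 0 or forwarded_reads == 0:
        return 0  # C0: STRA ratio is exactly zero
    ratio = Fraction(forwarded_reads, total_reads)
    # Upper bounds of C1..C6 are 1/2, 3/4, 7/8, 15/16, 31/32, 63/64.
    for k in range(1, 7):
        if ratio <= 1 - Fraction(1, 2 ** k):
            return k
    return 7  # C7: ratio in (63/64, 1]
```

For example, a block with 3 forwarded reads out of 4 lands in C2, while a block whose reads are always forwarded lands in C7.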
– Key observation: LLC blocks in STRA ratio categories C6 and C7 with STRA ratio in (31/32, 1] have only 12% of the offending blocks (i.e., 12% of 8%), but cover 54% of the accesses with lengthened critical path
- Higher STRA ratio categories have fewer offending blocks, but cover more lengthened accesses
- Blocks in these higher STRA ratio categories could be the target of a small sparse directory to avoid the problem of lengthened accesses
– Sets the stage for Tiny Directory design
Tiny Directory design
- Tiny Directory is a traditional sparse directory
– Augments in-LLC coherence tracking and specializes in tracking a subset of the critical read-shared blocks (with high STRA ratio)
- These blocks remain uncorrupted in the LLC and tracked in the Tiny Directory so that reads to these blocks can be responded to by the LLC w/o forwarding
– Very small in size and therefore must carefully select what to track
– A block is considered for tracking in the Tiny Directory on an LLC read to the block if
- State of the block is corrupted shared or
- Code block in invalid/unowned/non-shared state
– Tracking such a block in Tiny Directory allows future reads to the block to conclude in two hops
Tiny Directory policy#1: DSTRA
[Figure: read R arrives at the home bank; LLC tag hit with V=0, D=1 and Tiny Directory tag miss; the leading data-way bits read 000 (shared). The block's STRA category Ck is compared with the minimum STRA category Ci on the Tiny Directory side; if i < k, the block is tracked in the Tiny Directory. The bank elects a sharer S and forwards the request; S responds with data, and reconstruction bits flow to the home bank.]
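The comparison in the figure (allocate only if i < k) can be sketched as follows. The sketch assumes that Ci is the lowest STRA category among the entries of the target Tiny Directory set, that its holder is the victim, and that free ways are filled unconditionally; none of these details is spelled out in the slides, and `dstra_allocate` is a hypothetical name.

```python
def dstra_allocate(tiny_set: dict, capacity: int, tag: int, category: int) -> bool:
    """Try to track a block in one Tiny Directory set.

    tiny_set maps tag -> STRA category index of each tracked block.
    Returns True if the block is tracked afterwards, False if denied.
    """
    if tag in tiny_set:
        tiny_set[tag] = category      # already tracked: refresh its category
        return True
    if len(tiny_set) < capacity:      # assumption: free ways are always used
        tiny_set[tag] = category
        return True
    victim = min(tiny_set, key=tiny_set.get)   # entry with minimum category Ci
    if tiny_set[victim] < category:            # i < k: replace the victim
        del tiny_set[victim]
        tiny_set[tag] = category
        return True
    return False                      # allocation denied by the policy
```

A denied allocation or an evicted victim is exactly the point where the spill mechanism described later can take over.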
Tiny Directory policy#2
- Allocation/Eviction policy#2: DSTRA+gNRU
– Major shortcoming of DSTRA: tracking entries for C7 blocks may stay for too long in the Tiny Directory even if they are not useful any more
– Augment DSTRA with a generational NRU policy
- If an entry does not receive any access for a full generation, it is considered for eviction
- The length of a generation is defined to be the average interval between two consecutive reads to a shared block
- Generation length is determined dynamically
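A minimal sketch of the generational NRU idea, assuming one "referenced" flag per entry that is cleared at each generation boundary. The class and function names are hypothetical, and treating the generation length as the mean gap over a list of read timestamps is one possible reading of the definition above.

```python
class GnruSet:
    """Tracks which Tiny Directory entries went a full generation unused."""

    def __init__(self):
        self.referenced = {}   # tag -> accessed during the current generation?
        self.stale = set()     # tags that are candidates for eviction

    def touch(self, tag):
        """Record an access (or an allocation) for the entry."""
        self.referenced[tag] = True
        self.stale.discard(tag)

    def end_generation(self):
        """Mark entries with no access in the finished generation as stale."""
        for tag, used in self.referenced.items():
            if not used:
                self.stale.add(tag)
            self.referenced[tag] = False

def generation_length(shared_read_times):
    """Average interval between consecutive reads, or None if undefined."""
    gaps = [b - a for a, b in zip(shared_read_times, shared_read_times[1:])]
    return sum(gaps) / len(gaps) if gaps else None
```

An entry allocated in the middle of a generation is not penalized at the next boundary; it becomes stale only after a complete generation passes without an access.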
Spilling into LLC space
- Tiny Directory needs to be sized to accommodate the critical read-shared working set
– Such a requirement is impractical because the size of the critical read-shared working set is unknown at design time
- Can vary across applications and across phases of an application
– To make the proposal robust and practical, we incorporate the provision of spilling tracking entries into the LLC
– Two possible spill situations: eviction from the Tiny Directory and denial of allocation in the Tiny Directory by the allocation policy
[Figure: spilling flow across the Tiny Directory and the LLC tag and data arrays. EB: coherence information, B: block, T: tag. On eviction of EB from the Tiny Directory, or when allocation of EB in the Tiny Directory is denied, the bank decides whether to spill in the LLC. Yes: EB is spilled into the LLC as an entry with tag T in state V=0, D=1, in addition to the entry holding B under tag T (the figure labels Set A and Set B). No, or on eviction of EB from the LLC: use in-LLC coherence tracking, with one entry under tag T holding EB and partial B.]
- Controlling spill rate to constrain LLC miss rate increase
– Goal is to allow as much spill as possible from high STRA categories while keeping the LLC miss rate in check
– Each LLC bank dynamically computes the smallest STRA category Ci such that all categories Ck with k ≥ i are allowed to spill provided the miss rate of that bank increases by no more than δ
– For a given δ, how to determine Ci for a bank?
[Figure: per-bank spill control. An LLC bank (256 sets) is split into 240 spill sets (miss rate MRspill) and 16 no-spill sets (miss rate MRno-spill), and keeps the current lower bound category index i. At the end of each 8K-access window: if MRspill – MRno-spill ≤ δ, increase spilling (i←i-1); otherwise decrease spilling (i←i+1).]
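The window-based adjustment can be sketched as a small controller per LLC bank. The class name, the starting index, and the clamping of i to the range 0..8 (8 meaning that nothing may spill) are assumptions for illustration; the window size and the comparison against δ come from the slide.

```python
class SpillController:
    WINDOW = 8 * 1024   # accesses per observation window (8K)

    def __init__(self, delta: float, start_index: int = 7):
        self.delta = delta
        self.index = start_index      # current lower bound category index i
        self._reset()

    def _reset(self):
        self.accesses = {"spill": 0, "no_spill": 0}
        self.misses = {"spill": 0, "no_spill": 0}
        self.seen = 0

    def record(self, set_kind: str, is_miss: bool):
        """Account one LLC access to a 'spill' or 'no_spill' set."""
        self.accesses[set_kind] += 1
        self.misses[set_kind] += is_miss
        self.seen += 1
        if self.seen == self.WINDOW:
            self._adjust()
            self._reset()

    def _adjust(self):
        mr = {k: self.misses[k] / self.accesses[k] if self.accesses[k] else 0.0
              for k in self.accesses}
        if mr["spill"] - mr["no_spill"] <= self.delta:
            self.index = max(self.index - 1, 0)   # increase spilling
        else:
            self.index = min(self.index + 1, 8)   # decrease spilling

    def may_spill(self, category: int) -> bool:
        """Categories Ck with k >= i are allowed to spill."""
        return category >= self.index
```

The 16 no-spill sets act as a reference sample: their miss rate approximates what the bank would see without any spilled entries occupying LLC space.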
At the end of each 8K-access window an LLC bank classifies the running application into categories A, B, C, D
[Figure: application classes on a plane of miss rate (0%, 10%, 100%) versus STRA ratio (0.0, 0.4, 1.0), with Class A δA=1/4, Class B δB=1/32, Class C δC=1/16, Class D δD=1/32. Annotations: large potential gain from spill, relatively high tolerance; not much gain from spill; latency sensitive, medium tolerance; low tolerance.]
Putting it all together
- Core request looks up the Tiny Dir.
– Hit: usual coherence flow
– Miss: look up the LLC
- Single tag match, V=1: usual flow
- Single tag match, V=0, D=1 (corrupted): extra latency; allocate in Tiny Dir./spill?
- Dual tag match: spilled entry flow; move to corrupted state?
- No tag match: LLC fill flow; move to corrupted state? Allocate in Tiny Dir./spill (for code)?
- Tiny Dir. eviction: move to corrupted state or spill?
- Read to corrupted shared: extra one cyc. for state decoding
- Read to corrupted exclusive: extra two cyc. (data read)+one cyc.
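The lookup outcomes on this slide can be summarized as a small dispatch function. This is a reading of the flow diagram rather than a protocol specification: the function name and the returned labels are hypothetical, and the follow-up decisions (allocate, spill, move to corrupted state) are left out.

```python
def lookup_flow(tiny_dir_hit: bool, llc_tag_matches: int, v: int = 0, d: int = 0) -> str:
    """Pick the handling flow for a core request at the home bank.

    llc_tag_matches is the number of LLC tags matching the request (0, 1, or 2);
    v and d are the state bits of the matching block when there is one match.
    """
    if tiny_dir_hit:
        return "usual coherence flow"
    if llc_tag_matches == 0:
        return "LLC fill flow"
    if llc_tag_matches == 2:
        return "spilled entry flow"        # block and its spilled entry both match
    if (v, d) == (0, 1):
        return "corrupted block flow"      # extended state lives in the data way
    return "usual flow"
```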
Simulation infrastructure
- CPU cores
– 128 out-of-order issue dynamically scheduled x86 cores clocked at 2 GHz (private L1$, L2$)
- L3 cache
– Shared across all cores, 128 banks (set interleaved), 256 KB 16-way per bank, 64B blocks, LRU, 4 cycles tag+2 cycles data per bank
- Sparse directory
– Each L3 cache bank has a sparse directory slice responsible for tracking the blocks of the bank
- Main memory
– Eight single-channel DDR3-2133 controllers
- Sparse directory overhead
– Baseline 2x: about 11 MB (8-way set-associative)
- About 100 KB per slice
– Tiny (1/32)x: 187 KB (8-way set-associative)
- About 1500 bytes per slice
– Tiny (1/64)x: 94 KB (8-way set-associative)
- About 750 bytes per slice
– Tiny (1/128)x: 47.5 KB (fully associative 16/slice)
- 380 bytes per slice
– Tiny (1/256)x: 23.75 KB (fully associative 8/slice)
- 190 bytes per slice
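The totals follow from the per-slice sizes and the 128 slices (one per L3 bank). The short check below reproduces them; the per-entry size is derived here from the two fully associative configurations, not stated in the slides, and it assumes that a slice holds nothing but its entries.

```python
SLICES = 128   # one sparse directory slice per L3 bank
KB = 1024

def total_kb(bytes_per_slice: float) -> float:
    """Total directory capacity in KB across all slices."""
    return bytes_per_slice * SLICES / KB

per_slice_bytes = {"(1/32)x": 1500, "(1/64)x": 750, "(1/128)x": 380, "(1/256)x": 190}
for name, size in per_slice_bytes.items():
    print(f"Tiny {name}: {total_kb(size)} KB")

# Derived: bits per entry in the fully associative slices.
bits_per_entry_128 = 380 * 8 / 16   # 16 entries per slice
bits_per_entry_256 = 190 * 8 / 8    # 8 entries per slice

# Derived: entries per slice at 1x, from either fully associative configuration.
entries_1x_per_slice = 16 * 128
```

Both fully associative configurations give the same 190 bits per entry and the same 2048 entries per slice at 1x, so the listed sizes are mutually consistent.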
Simulation results
- Execution cycles normalized to baseline 2x (lower is better)
[Figure: normalized execution cycles for Tiny (1/32)x, Tiny (1/64)x, Tiny (1/128)x, and Tiny (1/256)x; series: DSTRA, DSTRA+gNRU, DSTRA+gNRU+Spill; axis from 1.00 to 1.08]
– In-LLC coherence tracking performs 11% worse than baseline 2x
– DSTRA and DSTRA+gNRU always perform better than in-LLC coherence tracking
– DSTRA+gNRU+Spill almost bridges the gap with baseline 2x and performs within 1%
Simulation results
- Energy comparison
– Execution cycles and LLC+Dir. total energy (dynamic+leakage) at 22 nm relative to Tiny (1/256)x exercising DSTRA+gNRU+Spill

Configuration    Cycles   Energy
Tiny (1/128)x    0.998    0.995
Base 2x          0.988    1.198
Base 1x          0.995    1.095
Base (1/2)x      1.003    1.044
Base (1/4)x      1.025    1.039
Base (1/8)x      1.100    1.104
Base (1/16)x     1.268    1.269

– Callouts: "1.5 MB space"; "Lowest energy base" (Base (1/4)x has the lowest energy among the Base rows)
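Because every number in this comparison is relative to Tiny (1/256)x with DSTRA+gNRU+Spill, the energy that design saves against a given configuration is 1 - 1/energy. The sketch below applies that to the values on the slide; the dictionary layout and function name are illustrative choices, not part of the original study.

```python
# (cycles, energy) relative to Tiny (1/256)x with DSTRA+gNRU+Spill,
# copied from the slide.
RESULTS = {
    "Tiny (1/128)x": (0.998, 0.995),
    "Base 2x": (0.988, 1.198),
    "Base 1x": (0.995, 1.095),
    "Base (1/2)x": (1.003, 1.044),
    "Base (1/4)x": (1.025, 1.039),
    "Base (1/8)x": (1.100, 1.104),
    "Base (1/16)x": (1.268, 1.269),
}


def tiny256_energy_saving(name):
    """Fraction of energy Tiny (1/256)x saves relative to configuration `name`."""
    energy = RESULTS[name][1]
    return 1.0 - 1.0 / energy


# Lowest-energy configuration among the baseline sparse directories.
base_rows = {k: v for k, v in RESULTS.items() if k.startswith("Base")}
lowest_energy_base = min(base_rows, key=lambda k: base_rows[k][1])

print(lowest_energy_base)
print(f"{tiny256_energy_saving('Base 2x'):.1%} energy saved vs Base 2x")
print(f"{tiny256_energy_saving(lowest_energy_base):.1%} energy saved vs {lowest_energy_base}")
```

Read this way, Tiny (1/256)x uses roughly one sixth less LLC+directory energy than Base 2x while being about 1.2% slower (cycles 0.988 for Base 2x).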
Simulation results
- Comparison to related proposals
– Multi-grain directory (MgD) devotes one directory entry to track a 1 KB private region [MICRO’13]
- Requires support for dual-grain coherence
– Stash directory does not back-invalidate a private block on evicting its directory entry [HPCA’14]
- Requires broadcast-based recovery if such a block gets shared in future
– Exec. cycles relative to base 2x (lower is better)
- MgD (1/8)x: 1.001 (baseline (1/8)x: 1.11)
- MgD (1/16)x: 1.08 (baseline (1/16)x: 1.28)
- MgD (1/32)x: 1.29 (baseline (1/32)x: 1.71)
- Stash (1/32)x: 1.41
- Tiny (1/32)x to (1/256)x: 1.005 to 1.01
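The relative cycle counts are easier to compare as percentage slowdowns over base 2x. A small sketch using the (1/32)x numbers quoted above; for Tiny it takes the upper end (1.01) of the quoted range, which is a simplification since that range spans (1/32)x to (1/256)x.

```python
# Execution cycles relative to base 2x, as quoted on the slide.
RELATIVE_CYCLES = {
    "Baseline (1/32)x": 1.71,
    "Stash (1/32)x": 1.41,
    "MgD (1/32)x": 1.29,
    "Tiny (upper bound)": 1.01,  # upper end of the quoted 1.005 to 1.01 range
}


def slowdown_percent(relative_cycles):
    """Percentage slowdown over base 2x, rounded to one decimal place."""
    return round((relative_cycles - 1.0) * 100, 1)


for name, cycles in sorted(RELATIVE_CYCLES.items(), key=lambda kv: kv[1]):
    print(f"{name:20s} {slowdown_percent(cycles):5.1f}% slower than base 2x")
```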
Summary
- A novel coherence tracking mechanism exercising very small sparse directories in the range (1/32)x to (1/256)x
– Just a few hundred bytes of storage per LLC slice
- Smart allocation policies for the sparse directory entries backed by controlled spilling of entries into the LLC space
- Performs within one percent of a traditional 2x sparse directory
- A significant leap forward in saving on-die SRAM investment for coherence tracking
- Ref: Shukla and Chaudhuri. HPCA 2017.
Next Steps/Questions
- Why so many directory/LLC accesses to high STRA ratio blocks?
– Private caches are inefficient in retaining these critical blocks: equal treatment for all blocks
– Opportunity for designing better private cache hierarchies in many-core server processors
– A sharing-aware private cache hierarchy design discussed in Shukla and Chaudhuri. ICCD 2017.
- Can Tiny Directory be applied to more exclusive LLCs a la Magny-Cours/Skylake-X?
- Can Tiny Directory be applied to inter-socket coherence tracking in a multi-socket system?
Next Steps/Questions
- Can Tiny Directory be applied to CPU-GPU heterogeneous chip-multiprocessors?
– Most heterogeneous workloads exercise bulk synchronous model with little fine-grain sharing and large regions private to CPU or GPU
- Ideal for Tiny Directory: most tracking entries would be amalgamated with shared LLC data ways
... nothing clears up a case so much as stating it to another person …
- Sherlock Holmes
[Sir Arthur Conan Doyle. Silver Blaze.]