Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures
Jeff Brown, Rakesh Kumar, Dean Tullsen
UC San Diego; University of Illinois at Urbana-Champaign
SPAA19, June 9, 2007
Introduction
- The chip multiprocessor (CMP) era is upon us!
- Caching complicates writes
- Cache coherence ensures caching is done safely
- Multi-core designs offer new tradeoffs
Background: Directory-based Cache Coherence
- Directory-based; explicit per-block accounting
– Doesn't rely on broadcasts
- Directory operation: client/server
– Processors request data, permissions
– Directory controllers manage memory access
- Updates, conflicts
[Figure: processors (P), memory (M), and directory (Dir)]
Background: Historical MP Cache Coherence
- Distributed directory, memory
[Figure: four processor/memory (P, M) nodes; labels: Cache Miss, "Home Node", Data Request, Reply]
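A distributed directory needs some rule that gives every block a fixed home node. The slides do not say which rule is used, so the sketch below assumes plain modulo interleaving of block addresses; `BLOCK_BYTES`, `NUM_NODES`, `home_node`, and `service_miss` are hypothetical names chosen for illustration.

```python
from typing import List

# Hypothetical parameters; the slides do not give a block size or mapping.
BLOCK_BYTES = 64
NUM_NODES = 4


def home_node(address: int) -> int:
    """Return the node whose directory and memory own this address.

    Assumes block-granularity modulo interleaving across nodes.
    """
    block = address // BLOCK_BYTES
    return block % NUM_NODES


def service_miss(requester: int, address: int) -> List[str]:
    """List the messages for a simple read miss served by the home node."""
    home = home_node(address)
    if home == requester:
        return [f"node {requester}: local memory access"]
    return [
        f"node {requester} -> node {home}: data request",
        f"node {home} -> node {requester}: reply",
    ]
```

Under these assumptions, a miss whose home is remote always costs a request and a reply across the interconnect, which matches the Data Request and Reply labels in the figure.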
Motivation: Multi-core Cache Coherence
[Figure: processors (P) and memories (M) in a multi-core layout; labels: Cache Miss, "Home Node", Data Request, Reply, Additional Sharer]
- Multi-core designs present radically different relative latency & bandwidth
Outline
- Introduction & Background
- System Architecture
- Proximity-Aware Coherence
- Results
- Conclusion
Directory-based Cache Coherence
- Directory structures
– Directory Memory
– Directory Entries
– Directory Controller
[Figure: Main Memory, Directory Memory, Controller]
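The three structures can be pictured with a toy model: directory memory as a map from block to entry, an entry as a state plus a sharer set, and a controller that turns requests into targeted messages. This is a minimal sketch, not the authors' design; the state names, message names, and class names are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class DirectoryEntry:
    """Per-block accounting: who holds the block, and in what mode."""
    state: str = "uncached"  # assumed states: uncached, shared, exclusive
    sharers: Set[int] = field(default_factory=set)


class DirectoryController:
    """Toy controller; directory memory is a dict of block -> entry."""

    def __init__(self) -> None:
        self.directory_memory: Dict[int, DirectoryEntry] = {}

    def _entry(self, block: int) -> DirectoryEntry:
        return self.directory_memory.setdefault(block, DirectoryEntry())

    def read(self, node: int, block: int) -> List[Tuple[str, int]]:
        """Grant read permission; return (message, target) pairs to send."""
        entry = self._entry(block)
        msgs: List[Tuple[str, int]] = []
        if entry.state == "exclusive":
            # The single owner must give up write permission first.
            owner = next(iter(entry.sharers))
            if owner != node:
                msgs.append(("downgrade", owner))
        entry.state = "shared"
        entry.sharers.add(node)
        msgs.append(("data", node))
        return msgs

    def write(self, node: int, block: int) -> List[Tuple[str, int]]:
        """Grant exclusive permission; invalidate every other copy."""
        entry = self._entry(block)
        msgs = [("invalidate", s) for s in sorted(entry.sharers - {node})]
        entry.state = "exclusive"
        entry.sharers = {node}
        msgs.append(("data", node))
        return msgs
```

Because the entry records exactly which nodes hold a copy, a write sends invalidations only to those nodes rather than to every node in the system.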
A Traditional Multiprocessor
[Figure: each node has Core, L2 $, Dir, Mem; nodes (…) connect through an Interconnect (chassis, board, etc.)]
Our 16-Core Chip Multiprocessor
[Figure: tile diagram with labels Core, L2 $, Bus, Dir control, Net. switch, Dir $, Mem. channel; tiles shown: Tile, Tile 1, ..., Tile 15]
Proximity-Aware Coherence
- Idea: home node asks sharer nearest requester to forward its cached copy
– Stay on-chip when possible
– Minimize transit of large data-carrying replies
[Figure: processors (P) and memories (M); labels: Cache Miss, Data Request, "Home Node", Additional Sharer, Forward Request, Reply]
Proximity-Aware Coherence
- To service read misses for shared data, traditional protocols use main memory
- Other nodes may hold copies
- On the CMP landscape, inter-node latency is much less than memory latency
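A back-of-the-envelope model shows why forwarding from a sharer can beat a main-memory access. The cycle counts below are invented for illustration and are not taken from the talk; the model also ignores directory and cache lookup time and any network contention.

```python
# Hypothetical cycle counts, chosen only to illustrate the comparison.
HOP_CYCLES = 5        # one on-chip network hop
MEMORY_CYCLES = 200   # one off-chip main-memory access


def via_memory(req_to_home_hops: int) -> int:
    """Request to home, home reads main memory, reply back to requester."""
    return 2 * req_to_home_hops * HOP_CYCLES + MEMORY_CYCLES


def via_sharer(req_to_home_hops: int, home_to_sharer_hops: int,
               sharer_to_req_hops: int) -> int:
    """Request to home, forward to a sharer, sharer replies to requester."""
    hops = req_to_home_hops + home_to_sharer_hops + sharer_to_req_hops
    return hops * HOP_CYCLES
```

With these made-up numbers, `via_memory(3)` is 230 cycles while `via_sharer(3, 2, 1)` is 30 cycles: the extra forwarding hop is cheap as long as a hop costs far less than a memory access.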
Sharer Selection
- When the home node lacks a cached copy, it selects a sharer to ask
– rand
– near1
– via1
- Retries didn't prove beneficial
[Figure: Miss, Home]
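The slide names three policies (rand, near1, via1) without defining them. The sketch below shows three plausible selection rules of that flavor; the function names and their exact behavior are guesses, not the authors' definitions. It also assumes the 16 tiles form a 4x4 mesh numbered row-major, which the slides do not state.

```python
import random
from typing import Set

MESH_WIDTH = 4  # assumption: 16 tiles in a 4x4 mesh, numbered row-major


def hops(a: int, b: int) -> int:
    """Manhattan distance between tiles a and b on the assumed mesh."""
    ax, ay = a % MESH_WIDTH, a // MESH_WIDTH
    bx, by = b % MESH_WIDTH, b // MESH_WIDTH
    return abs(ax - bx) + abs(ay - by)


def pick_random(sharers: Set[int], rng=random) -> int:
    """Pick any sharer uniformly at random."""
    return rng.choice(sorted(sharers))


def pick_nearest_to_requester(sharers: Set[int], requester: int) -> int:
    """Pick the sharer with the fewest hops to the requester."""
    return min(sorted(sharers), key=lambda s: hops(s, requester))


def pick_shortest_path(sharers: Set[int], home: int, requester: int) -> int:
    """Pick the sharer minimizing home -> sharer -> requester hops."""
    return min(sorted(sharers),
               key=lambda s: hops(home, s) + hops(s, requester))
```

The two distance-based rules can disagree. With home tile 0, requester tile 10, and sharers {5, 11}, the nearest-to-requester rule picks tile 11 (one hop from the requester), while the shortest-path rule picks tile 5 (four hops in total instead of six).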
Methodology
- Detailed, execution-driven processor and network simulation
- "RSIM" simulator, adapted to our CMP model
- Parallel workloads from several suites
- Hardware, benchmark details in paper
Proximity-Aware: Potential Coverage
[Figure: fraction of read misses to shared lines per benchmark (appbt, fft, lu, mp3d, ocean, quicksort, unstruct); series labeled 1 through 6; annotations: Overall x=43%, dist 1 x=75%]
Proximity-Aware: Latency Benefit
[Figure: normalized L2 miss latency for rand, near1, via1 across appbt, fft, lu, mp3d, ocean, quicksort, unstruct, and mean; annotations: Latency -25%, Reply traffic -6%]
Proximity-Aware: Speedup
- L2 latency sensitivity of workloads
[Figure: speedup for rand, near1, via1 across appbt, fft, lu, mp3d, ocean, quicksort, unstruct, and mean; axis 0.00 to 1.80; annotation: Speedup 16%]
Conclusion
- The latency/bandwidth aspects of CMPs motivate multicore-aware coherence redesign
- One such change: Proximity-Aware Coherence
– Ideas: stay on-chip, decrease "bulk" transit
– Mean speedup 16%, mean L2 latency down 25%
- More aggressive techniques are under study
- Questions?