Reducing the Interconnection Network Cost of Chip Multiprocessors
Pablo Abad, Valentín Puente and José Ángel Gregorio. NOCS'08
NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 2
Outline
- Motivation
- Reactive traffic
- End-to-end deadlock
- Rotary solution
- In order delivery
- Evaluation
- Conclusions
Motivation: NOCs for CMPs
- CMP systems usually assume the presence of cache coherence mechanisms.
- Cache coherence imposes two requirements on the communication subsystem:
– Handling of reactive traffic (end-to-end deadlock).
– In-order message delivery.
- Solutions for these requirements should have a minimal impact on NoC technological boundaries.
Reactive Traffic
Messages involved in a memory transaction depend on one another.
- Minimum of two messages:
– CPU-A requests a cache line.
– CPU-B's L2 provides the block.
- Longer dependency chains:
– CPU-A requests a cache line.
– The line is not in CPU-B's L2, so the request goes to memory.
– Memory provides the block.
[Figure: CMP chip of nine tiles (CPU, L1 and L2 each) and main memory.]
Reactive Traffic
This kind of communication can cause message-dependent deadlocks.
[Figure: Router A and Router B, each with a crossbar and a network interface holding REQ-IN, REQ-OUT, REP-OUT and REP-IN queues.]
1. Router A and Router B flood the network with REQUEST messages.
2. A REQUEST is serviced only if a REPLY can be generated. The slot freed by a serviced REQUEST is taken by another REQUEST.
3. DEADLOCK: no more REQUESTS can be serviced, and REPLIES cannot reach their destinations.
Reactive Traffic
A widely utilized solution to avoid this problem is buffer replication. REQ and REP travel through different buffering resources (virtual networks).
[Figure: router with a crossbar and separate REQ-IN, REQ-OUT, REP-OUT and REP-IN buffers.]
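The deadlock scenario and the buffer replication fix described above can be reproduced with a toy model. This is only a sketch under stated assumptions: two nodes wired back to back, one message moved per queue per cycle, and injection taking any free output slot before a reply can be generated. The names `Node` and `simulate` and the queue capacity are hypothetical and do not come from the slides.

```python
from collections import deque

REQ, REP = "REQ", "REP"

class Node:
    """Network interface with finite input/output queues (illustrative model)."""
    def __init__(self, separate, capacity):
        n = 2 if separate else 1          # separate=True: one queue pair per message class
        self.inq = [deque() for _ in range(n)]
        self.outq = [deque() for _ in range(n)]
        self.separate = separate
        self.capacity = capacity
        self.completed = 0                # replies consumed by this node

    def _idx(self, kind):
        return (0 if kind == REQ else 1) if self.separate else 0

    def room(self, queues, kind):
        return len(queues[self._idx(kind)]) < self.capacity

    def push(self, queues, kind):
        queues[self._idx(kind)].append(kind)

def simulate(separate, capacity=4, max_cycles=1000):
    """Return (completed transactions, deadlocked?)."""
    a, b = Node(separate, capacity), Node(separate, capacity)
    for _ in range(max_cycles):
        progress = False
        # 1. Links: move the head of each output queue to the peer's input queue.
        for src, dst in ((a, b), (b, a)):
            for q in src.outq:
                if q and dst.room(dst.inq, q[0]):
                    dst.push(dst.inq, q.popleft())
                    progress = True
        # 2. Injection: each node floods requests into every free output slot.
        for node in (a, b):
            while node.room(node.outq, REQ):
                node.push(node.outq, REQ)
                progress = True
        # 3. Consumption: replies always sink; a request needs room for its reply.
        for node in (a, b):
            for q in node.inq:
                if not q:
                    continue
                if q[0] == REP:
                    q.popleft()
                    node.completed += 1
                    progress = True
                elif node.room(node.outq, REP):
                    q.popleft()
                    node.push(node.outq, REP)
                    progress = True
        if not progress:                  # nothing moved in a whole cycle
            return a.completed + b.completed, True
    return a.completed + b.completed, False

print("shared buffers:  ", simulate(separate=False))
print("virtual networks:", simulate(separate=True))
```

With shared queues the model stalls before any transaction completes, because every slot ends up holding a request. With one queue pair per message class, replies always have somewhere to go and the run never stalls.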
Reactive Traffic
Path replication solves the end-to-end deadlock problem, but it can seriously affect other relevant design aspects such as area, complexity and power.
[Figure: router resources (injector, consumer, buffers, crossbar, routing and arbitration) for 1 message type, 4 message types, and the Alpha 21364 router with 7 message types.]
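A back-of-the-envelope count shows how replication scales. The sketch below assumes one input buffer per port per message type and a five-port router (four network directions plus the local port). Both figures are illustrative assumptions, not values from the slides, and `input_buffers` is a hypothetical helper.

```python
def input_buffers(message_types, ports=5):
    """Input buffers needed if every message type gets its own virtual network.

    ports=5 is an assumed router radix (N, E, S, W and local), not a slide value.
    """
    return ports * message_types

# Message-type counts taken from the slide: 1, 4 and 7.
for types in (1, 4, 7):
    print(types, "message type(s) ->", input_buffers(types), "input buffers")
```

Under these assumptions the buffer count grows linearly with the number of message types, which is the cost that path replication pays in area and power.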
Previous work: The Rotary Router
[REF] P. Abad, V. Puente, P. Prieto, J.A. Gregorio, “Rotary Router: An Efficient Architecture for CMP Interconnection Networks”, International Symposium on Computer Architecture (ISCA), 2007.
Rotary Router Sketch
[Figure: router with N, E, S and W ports, injector, consumer and buffers.]
- Input Stage: packet pre-routing, ring selection.
- Output Stage: flow control, packet storage.
- Buffering Segment Stage: packet movement, output arbitration.
Rotary Router Advantages
- Head-of-line blocking avoidance.
- Improved buffer utilization.
- Adaptive routing without virtual channels.
- No centralized structures (crossbar, arbiter).
- Topology-agnostic deadlock avoidance mechanism.
[Chart: normalized EDP (%) for LU, JAVA, FT and HTTP, Classic vs. Rotary.]
Reactive Traffic
Continuous movement of packets inside the router rings allows the Rotary Router to implement a solution to end-to-end Deadlock without requiring path replication.
[Figure: communication pattern M-1, M-2, M-3, M-4; buffer occupancy marks Empty, Lim M-1, Lim M-2 and Lim M-3; packet sources Transit and Inject.]
In-Order Delivery
- This requirement is imposed by some memory coherence protocols (e.g. the Token Coherence protocol) or by maintenance tasks.
- In these cases, only specific transactions need to be ordered (e.g. persistent request deactivation).
- Ordered messages represent only a small portion of total network traffic (~5%).
In-Order Delivery
Fulfilling this requirement is extremely simple for input buffered routers. It becomes a challenge for the Rotary Router:
- Adaptive routing allows inter-router packet overtaking.
- Internal router rings allow intra-router overtaking.
[Figure: Rotary Router sketch (N, E, S and W ports, injector, consumer, buffers) with packets labeled 1 and 2.]
In-Order Delivery
Inter-router overtaking is avoided through specific routing decisions for in-order messages:
- Wraparound links are avoided (the network is used as a mesh).
- Adaptive routing is not allowed (DOR is used).
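The effect of these two restrictions can be sketched with dimension-order routing on a mesh. The sketch assumes X-then-Y ordering and (x, y) node coordinates; the slides only say "DOR", so both choices, as well as the helper names `dor_next_hop` and `dor_path`, are hypothetical.

```python
def dor_next_hop(cur, dst):
    """Next hop under X-then-Y dimension-order routing on a mesh (no wraparound)."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                          # finish the X dimension first
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                          # then move along Y
        return (cx, cy + (1 if dy > cy else -1))
    return None                           # already at the destination

def dor_path(src, dst):
    """Full list of nodes visited from src to dst, endpoints included."""
    path = [src]
    nxt = dor_next_hop(src, dst)
    while nxt is not None:
        path.append(nxt)
        nxt = dor_next_hop(nxt, dst)
    return path

print(dor_path((0, 0), (2, 1)))           # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(len(dor_path((0, 0), (7, 0))) - 1)  # 7 hops: the wraparound link is not taken
```

Because the path depends only on the source and destination, two in-order messages between the same pair of nodes always cross the same sequence of routers, so one cannot overtake the other by taking a different route.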
In-Order Delivery
Avoiding intra-router overtaking requires a special mechanism.
[Figure: intra-router overtaking avoidance; IN and OUT labels numbered 1 and 2.]
Performance Evaluation
- Compared to three different routers:
– Adaptive Bubble Router.
– Deterministic Router.
– Low Latency Router.
[Figure: router diagrams with labels DOR, Adaptive and Latch.]
Performance Evaluation
- Synthetic Traffic Patterns
- Real Workloads
– GEMS + SICOSYS.
Number of cores: 16
L1 I/D cache: private, 32 KB, 2-way, 64-byte block, 1 cycle
L2 cache: SNUCA, 16x16 banks, 4 per router
L2 cache bank: 128 KB, 16-way, 3 cycles, pseudo-LRU, 64-byte block
Main memory: 4 GB, 260 cycles, 320 GB/s
Command size: 16 bytes
Network topology: 8x8 torus
Network link: 128 bits, 1-cycle latency
Performance Evaluation
- Synthetic Traffic Patterns
– 5 message types.
– 32,000 messages of each type delivered.
– Low-lat topology: mesh.
[Chart: results under RAND, BIT-REV., MAT-TR and PERM traffic for ROTARY, BADA, BDOR and LOW-LAT.]
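The pattern names on the chart are standard synthetic workloads. The sketch below shows how destinations are commonly computed for three of them, using the usual textbook definitions and assuming 64 nodes (6 address bits). Neither the definitions nor the node count for this experiment are spelled out on the slides, and PERM is left out because the slide does not say which permutation it denotes.

```python
import random

def bit_reversal(src, bits):
    """Destination is the source address with its bit string reversed."""
    return int(format(src, "0{}b".format(bits))[::-1], 2)

def matrix_transpose(src, bits):
    """Destination swaps the two halves of the address, i.e. (row, col) -> (col, row)."""
    half = bits // 2
    mask = (1 << half) - 1
    return ((src & mask) << half) | (src >> half)

def uniform_random(bits, rng=random):
    """Destination chosen uniformly among all nodes."""
    return rng.randrange(1 << bits)

BITS = 6  # assumed: 64 nodes
print(bit_reversal(1, BITS))      # 32  (000001 -> 100000)
print(matrix_transpose(1, BITS))  # 8   (row 0, col 1 -> row 1, col 0)
print(uniform_random(BITS))       # some node in 0..63
```

Bit reversal and matrix transpose are fixed permutations, so each source always sends to the same destination, which tends to concentrate load on particular links. Random traffic spreads it evenly.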
Performance Evaluation
- Real Workloads
– Transactional & Scientific applications.
[Chart: results for IS, LU, FT, OLTP, Java, HTTP1 and HTTP2 with ROTARY, BADA, BDOR and LOW-LAT.]
Conclusions
- The Rotary Router has served as the basis for a mechanism that deals with end-to-end deadlocks.
- This mechanism does not require path replication.
- We solve in-order delivery with a simple method that requires little extra hardware.
- Flexible buffer utilization allows our router to obtain better performance results.