SLIDE 1

Reducing the Interconnection Network Cost of Chip Multiprocessors

Pablo Abad, Valentín Puente and José Ángel Gregorio.

SLIDE 2

NOCS'08 Reducing the Interconnection Network Cost of Chip Multiprocessors 2

Outline

Motivation
Reactive traffic
End-to-end deadlock
Rotary solution
In order delivery
Evaluation
Conclusions

SLIDE 3

Motivation: NOCs for CMPs

CMP systems usually assume the presence of cache coherency mechanisms.

Cache coherence requirements for the communication subsystem:

– Handling of reactive traffic (end-to-end deadlock).
– In-order message delivery.

Solutions for these requirements should have a minimal impact on NoC technological boundaries.

SLIDE 4

Outline

Motivation
Reactive traffic
End-to-end deadlock
Rotary solution
In order delivery
Evaluation
Conclusions

SLIDE 5

Reactive Traffic

Messages involved in a memory transaction depend on one another.

Minimum of 2 messages:

– CPU-A requests a cache line.
– CPU-B L2 provides the block.

Longer dependencies:

– CPU-A requests a cache line.
– The line is not in CPU-B L2, so the request goes to memory.
– Memory provides the block.

[Figure: CMP diagram with nine tiles, each labeled CPU, L1 and L2, plus the labels MAIN MEM and CHIP.]

SLIDE 6

Reactive Traffic

This kind of communication can cause message-dependent deadlocks.

[Figure: Router A and Router B, each with a crossbar and a network interface holding REQ-IN, REQ-OUT, REP-OUT and REP-IN queues.]

1. Router A and Router B flood the network with REQUEST messages.
2. REQUEST messages are only attended if a REPLY can be generated. The hole left by an attended REQUEST is occupied by another REQUEST.
3. DEADLOCK: no more REQUESTS can be attended and REPLIES cannot reach their destination.
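
The deadlock scenario on this slide can be reproduced on a toy model. The sketch below is not taken from the presentation: it assumes two nodes, one input and one output queue per message class, a queue capacity of 2, and a starting state in which every request-carrying queue is already full of REQUESTs. The names `flooded_state`, `drain` and `CAPACITY` are hypothetical. With a single shared queue class no move is possible; with separate REQ and REP classes (the virtual networks introduced on the next slide) every REQUEST is eventually answered.

```python
from collections import deque

CAPACITY = 2  # slots per queue (illustrative value, not from the slides)


def flooded_state(separate_networks):
    """Two nodes whose request-carrying queues are completely full of REQ."""
    classes = ("REQ", "REP") if separate_networks else ("SHARED",)
    req_class = "REQ" if separate_networks else "SHARED"
    nodes = []
    for _ in range(2):
        node = {"in": {c: deque() for c in classes},
                "out": {c: deque() for c in classes}}
        for direction in ("in", "out"):
            node[direction][req_class].extend(["REQ"] * CAPACITY)
        nodes.append(node)
    return nodes


def drain(separate_networks):
    """Apply enabled moves until none is left.

    Returns (replies delivered, messages still stuck in the queues).
    """
    nodes = flooded_state(separate_networks)
    rep_class = "REP" if separate_networks else "SHARED"
    delivered = 0
    progress = True
    while progress:
        progress = False
        for i, node in enumerate(nodes):
            other = nodes[1 - i]
            # Consumption: a REP is always sunk; a REQ is attended only
            # if there is room to store the REP it generates.
            for q in node["in"].values():
                if q and q[0] == "REP":
                    q.popleft()
                    delivered += 1
                    progress = True
                elif q and len(node["out"][rep_class]) < CAPACITY:
                    q.popleft()
                    node["out"][rep_class].append("REP")
                    progress = True
            # Link traversal: the head of each output queue moves to the
            # peer's input queue of the same class, if it has a free slot.
            for c, q in node["out"].items():
                if q and len(other["in"][c]) < CAPACITY:
                    other["in"][c].append(q.popleft())
                    progress = True
    stuck = sum(len(q) for n in nodes for d in ("in", "out")
                for q in n[d].values())
    return delivered, stuck


shared = drain(separate_networks=False)
split = drain(separate_networks=True)
```

In this model `shared` reports zero delivered replies with all eight REQUESTs stuck, while `split` drains completely, which is the behaviour steps 1 to 3 describe.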

SLIDE 7

Reactive Traffic

A widely utilized solution to avoid this problem is buffer replication. REQ and REP travel through different buffering resources (virtual networks).

[Figure: router with a crossbar and REQ-IN, REQ-OUT, REP-OUT and REP-IN buffers.]

SLIDE 8

Reactive Traffic

Path replication solves the end-to-end deadlock problem, but can seriously affect other relevant design aspects, such as area, complexity and power.

[Figure: router layouts for 1 message type and for 4 message types, with blocks labeled Injector, Consumer, Buffer, Crossbar, and Rtg. & Arb.]

Alpha 21364 router: 7 message types.
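
To give a feel for how replication scales, here is a back-of-the-envelope sketch. It assumes one buffer per input port per message type and a 5-port router (four directions plus the local port); both are illustrative assumptions, not figures from the presentation, and real routers such as the Alpha 21364 organize their buffers in their own way. The function name `replicated_buffers` is hypothetical.

```python
def replicated_buffers(ports, message_types, buffers_per_type=1):
    """Buffers needed when every port keeps separate storage per message type."""
    return ports * message_types * buffers_per_type


# 1, 4 and 7 message types, as in the cases named on this slide.
counts = {types: replicated_buffers(ports=5, message_types=types)
          for types in (1, 4, 7)}
```

Under these assumptions the buffer count grows linearly with the number of message types (5, 20 and 35 buffers), and the routing and arbitration logic has to grow with it.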

SLIDE 9

Previous work: The Rotary Router

[REF] P. Abad, V. Puente, P. Prieto, J.A. Gregorio, “Rotary Router: An Efficient Architecture for CMP Interconnection Networks”, International Symposium on Computer Architecture (ISCA), 2007.

SLIDE 10

Rotary Router Sketch

[Figure: rotary router with N, E, S and W ports and Injector, Consumer and Buffer blocks; annotation "Free? No!".]

Input Stage: packet pre-routing, ring selection.
Output Stage: flow control, packet storage.
Buffering Segment Stage: packet movement, output arbitration.

SLIDE 11

Rotary Router Advantages

Head of Line Blocking Avoidance.
Improved Buffering utilization.
Adaptive routing without virtual channels.
Centralized structures avoidance (Xbar, Arbiter).
Topology agnostic Deadlock avoidance Mechanism.

[Figure: Normalized EDP (%) for LU, JAVA, FT and HTTP, Classic vs. Rotary.]

SLIDE 12

Reactive Traffic

Continuous movement of packets inside the router rings allows the Rotary Router to implement a solution to end-to-end Deadlock without requiring path replication.

[Figure: communication pattern M-1, M-2, M-3, M-4 and buffer occupancy limits Empty, Lim M-1, Lim M-2, Lim M-3.]

SLIDE 13

Reactive Traffic

Continuous movement of packets inside the router rings allows the Rotary Router to implement a solution to end-to-end Deadlock without requiring path replication.

[Figure: communication pattern M-1, M-2, M-3, M-4; occupancy limits Empty, Lim M-1, Lim M-2, Lim M-3 repeated across several panels; labels Transit and Inject.]
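
One possible reading of the occupancy diagram, stated here as an assumption because the slides do not spell out the rule: a message of type M-k may be injected only while the buffer occupancy is below its limit Lim M-k, so that space always remains for the types that come later in the dependency chain, and the last type (M-4) is bounded only by the buffer capacity. Packets already in transit are not modelled. The limit values, the capacity and the name `may_inject` below are hypothetical.

```python
CAPACITY = 8                                # buffer slots (hypothetical)
LIMITS = {"M-1": 2, "M-2": 4, "M-3": 6}     # Lim M-1 < Lim M-2 < Lim M-3 (hypothetical)


def may_inject(msg_type, occupied_slots, limits=LIMITS, capacity=CAPACITY):
    """Return True if a message of this type may enter the buffer.

    Types without an entry in `limits` (here M-4, the end of the
    dependency chain) are only bounded by the total capacity.
    """
    limit = limits.get(msg_type, capacity)
    return occupied_slots < limit


# With 2 slots occupied, M-1 is held back while later types still get in.
decisions = {t: may_inject(t, occupied_slots=2)
             for t in ("M-1", "M-2", "M-3", "M-4")}
```

Under this reading a flood of M-1 requests can never occupy the slots that the replies they trigger will need, which is the role path replication plays in a conventional router.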

SLIDE 14

Outline

Motivation
Reactive traffic
End-to-end deadlock
Rotary solution
In order delivery
Evaluation
Conclusions

SLIDE 15

In-Order Delivery

This requirement is imposed by some memory coherence protocols (e.g. the Token Coherence protocol) or by maintenance tasks.

In these cases, only specific transactions need to be ordered (e.g. persistent request deactivation).

Ordered messages represent only a small portion of total network traffic (~5%).

SLIDE 16

In-Order Delivery

Fulfilling this requirement is extremely simple for input buffered routers. It becomes a challenge for the Rotary Router:

Adaptive routing allows inter-router packet overtaking.
Internal router rings allow intra-router overtaking.

[Figure: rotary router sketch (N, E, S and W ports, Injector, Consumer, Buffer) with packets 1 and 2.]

SLIDE 17

In-Order Delivery

Inter-router overtaking is avoided through specific routing decisions for in-order messages:

Wraparound links are avoided (the network is treated as a mesh).
Adaptive routing is not allowed (dimension-order routing, DOR, is used).
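
The two restrictions above amount to plain dimension-order routing on a mesh. A minimal sketch of XY routing follows; the coordinate convention and the name `dor_path` are assumptions made for illustration, and the presentation does not state which dimension is routed first.

```python
def dor_path(src, dst):
    """Dimension-order (X first, then Y) route on a mesh without wraparound.

    Returns the list of (x, y) routers visited, including src and dst.
    """
    x, y = src
    path = [src]
    while x != dst[0]:                 # finish the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then route along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = dor_path((0, 0), (2, 1))
```

Because the path depends only on the source and destination, every in-order message between the same pair of nodes follows the same sequence of routers, so one cannot overtake another by taking a different route.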

SLIDE 18

In-Order Delivery

Avoiding intra-router overtaking requires a special mechanism.

[Figure: packets 1 and 2 at the router IN and OUT ports.]

SLIDE 19

Outline

Motivation
Reactive traffic
End-to-end deadlock
Rotary solution
In order delivery
Evaluation
Conclusions

SLIDE 20

Performance Evaluation

Compared to three different routers:

Adaptive Bubble Router
Deterministic Router
Low Latency Router

[Figure: diagrams of the three routers, with labels DOR, Adaptive and Latch.]

SLIDE 21

Performance Evaluation

Synthetic Traffic Patterns
Real Workloads

– GEMS + SICOSYS.

Number of cores: 16
Main memory: 4 GB, 260 cycles, 320 GB/s
L1 I/D cache: private, 32 KB, 2-way, 64-byte block, 1 cycle
Command size: 16 bytes
L2 cache: SNUCA, 16x16 banks, 4 per router
L2 cache bank: 128 KB, 16-way, 3 cycles, pseudo-LRU, 64-byte block
Network topology: 8x8 torus
Network link: 128 bits, 1-cycle latency

SLIDE 22

Performance Evaluation

Synthetic Traffic Patterns

– 5 message types.
– 32,000 messages of each type delivered.
– Low-lat topology: Mesh.

[Figure: bar chart over traffic patterns RAND, BIT-REV., MAT-TR and PERM for the ROTARY, BADA, BDOR and LOW-LAT routers; axis ticks from 100 to 500.]
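
For readers unfamiliar with the pattern names, the sketch below gives the usual textbook definitions of two of them, bit reversal and matrix transpose, as functions from a source node index to its destination. These are assumptions: the slides do not define the patterns, and RAND and PERM are left out because their exact definitions in this evaluation are not stated. The function names are hypothetical; `bits=6` corresponds to a 64-node (8x8) network.

```python
def bit_reversal(node, bits):
    """Destination whose binary address is the source address reversed."""
    return int(format(node, f"0{bits}b")[::-1], 2)


def matrix_transpose(node, bits):
    """Destination obtained by swapping the upper and lower address halves,
    i.e. node (x, y) sends to node (y, x). Assumes an even number of bits."""
    half = bits // 2
    low = node & ((1 << half) - 1)
    high = node >> half
    return (low << half) | high


# Destinations of node 1 in a 64-node network.
examples = (bit_reversal(1, bits=6), matrix_transpose(1, bits=6))
```

Both are fixed permutations, so each source always sends to the same destination, which concentrates load on specific links in a way that uniform random traffic does not.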

SLIDE 23

Performance Evaluation

Real Workloads

– Transactional & Scientific applications.

[Figure: bar chart over workloads IS, LU, FT, OLTP, Java, HTTP1 and HTTP2 for the ROTARY, BADA, BDOR and LOW-LAT routers; axis ticks from 50 to 200.]

SLIDE 24

Outline

Motivation
Reactive traffic
End-to-end deadlock
Rotary solution
In order delivery
Evaluation
Conclusions

SLIDE 25

Conclusions

The Rotary Router has been the basis for implementing a mechanism able to deal with end-to-end deadlocks.

This mechanism does not require path replication.

We solve in-order delivery with a simple method which requires little extra hardware.

Flexible buffer utilization allows our router to obtain better performance results.

SLIDE 26

Questions?