Ahmed Abousamra, Rami Melhem, Alex Jones, University of Pittsburgh: PowerPoint Presentation

SLIDE 1

Déjà Vu Switching for Multiplane NoCs NOCS’12

Ahmed Abousamra, Rami Melhem, Alex Jones, University of Pittsburgh

SLIDE 2

  • Power efficiency has become a primary concern in the design of CMPs.
  • The NoC of Intel’s TeraFLOPS processor consumes more than 28% of the chip’s power.
  • Network messages can be classified into critical and non-critical messages.
  • It may be possible to send non-critical messages on a slower plane without hurting performance.

SLIDE 3

Baseline: Single Plane NoC

Control Plane
  • Carries control and coherence messages: data requests, invalidates, acknowledgments, …
  • Operates at the baseline’s voltage & frequency

Data Plane
  • Carries data messages
  • Operates at a lower voltage & frequency to save power
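The split between the two planes can be sketched as a static mapping from message type to plane. This is a minimal illustration under stated assumptions, not the authors' simulator: the `MsgType` categories, the `Plane` record, and the 2 GHz data-plane clock are hypothetical; only the control/data division and the 4 GHz control-plane clock come from the slides. Voltage is not modeled here.

```python
from dataclasses import dataclass
from enum import Enum

class MsgType(Enum):
    # Hypothetical message categories modeled on the slide's examples.
    DATA_REQUEST = "data request"
    INVALIDATE = "invalidate"
    ACK = "acknowledgment"
    DATA = "data"

@dataclass(frozen=True)
class Plane:
    name: str
    freq_ghz: float  # clock frequency of the plane

CONTROL_PLANE = Plane("control", 4.0)  # baseline frequency (4 GHz in the evaluation)
DATA_PLANE = Plane("data", 2.0)        # hypothetical reduced frequency

def select_plane(msg: MsgType) -> Plane:
    """Data messages use the slow data plane; control/coherence messages use the control plane."""
    return DATA_PLANE if msg is MsgType.DATA else CONTROL_PLANE

for m in MsgType:
    print(m.value, "->", select_plane(m).name)
```

Keeping the mapping static means no message ever needs to be re-routed between planes at run time.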

SLIDE 4

  • Motivation & Related Work
  • Importance of data messages to performance
  • The Optimization Problem
  • Proposed Solution
  • Déjà Vu Switching
  • Analysis of the acceptable data plane speed
  • Evaluation
  • Summary

SLIDE 5

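Cache line having 8 data words: 1 2 3 4 5 6 7 8

Data message: [Header] 5 1 2 3 4 6 7 8
  • Critical word: word 5, the word whose access missed, is sent first, right after the header.
  • Non-critical words: the remaining words 1 2 3 4 6 7 8 follow.
  • A subsequent miss for another word in the same line, while the line is still arriving, is a delayed cache hit.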
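The critical-word-first ordering and the resulting delayed-hit stall can be illustrated with a small model. It assumes, hypothetically, that one word arrives per cycle after the header (cycle 0) and that a later access simply waits until its word arrives; the function names are illustrative only.

```python
def arrival_order(critical: int, words: int = 8) -> list[int]:
    """Word order inside the data message: critical word first, then the rest in index order."""
    return [critical] + [w for w in range(1, words + 1) if w != critical]

def arrival_cycles(critical: int, words: int = 8) -> dict[int, int]:
    """Cycle at which each word arrives, counting the header as cycle 0 (one word per cycle)."""
    return {w: i + 1 for i, w in enumerate(arrival_order(critical, words))}

def delayed_hit_stall(arrival: dict[int, int], word: int, access_cycle: int) -> int:
    """Extra cycles a later access to `word` waits because the line is still arriving."""
    return max(0, arrival[word] - access_cycle)

arr = arrival_cycles(critical=5)
print(arrival_order(5))              # [5, 1, 2, 3, 4, 6, 7, 8]
print(delayed_hit_stall(arr, 8, 3))  # 5: word 8 arrives in cycle 8, accessed in cycle 3
```

The slower the data message streams in, the larger this stall becomes for non-critical words, which is why delayed hits tie data-plane speed to performance.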

SLIDE 6

If delayed cache hits are overly delayed, performance can suffer.

[Figure: Percentage of L1 misses that are delayed cache hits, per benchmark (barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, raytrace, specjbb, water nsquared, water spatial) and geometric mean.]

SLIDE 7

  • Motivation & Related Work
  • Importance of data messages to performance
  • The Optimization Problem
  • Proposed Solution
  • Déjà Vu Switching
  • Analysis of the acceptable data plane speed
  • Evaluation
  • Summary

SLIDE 8

How slow can we operate the data plane to maximize energy savings while not impacting performance?

SLIDE 9

  • Split the NoC physically into 2 planes: control + data.
  • On the data plane:
    • Use circuit switching to speed up communication.
    • Reduce voltage & frequency.
  • Send a control message to establish the circuit once a cache hit is detected.
  • Do not block the circuit establishment message: Déjà Vu Switching.
  • Analyze the acceptable slowdown of the data plane to minimize energy while maintaining performance.
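Why frequency reduction is paired with voltage reduction can be seen from the textbook CMOS dynamic-power relation, P_dyn ≈ αCV²f, whose energy per transition is ≈ CV². The sketch below ignores leakage and assumes activity α and capacitance C stay fixed; the 0.8 voltage ratio is a hypothetical value, not one reported in the talk.

```python
def relative_dynamic_power(v_ratio: float, f_ratio: float) -> float:
    """Dynamic power relative to the baseline: (V/V0)^2 * (f/f0), alpha and C unchanged."""
    return v_ratio ** 2 * f_ratio

def relative_dynamic_energy(v_ratio: float) -> float:
    """Dynamic energy per transition relative to the baseline; frequency cancels out."""
    return v_ratio ** 2

# Data plane at 2 GHz instead of 4 GHz (f_ratio = 0.5).
print(relative_dynamic_power(1.0, 0.5), relative_dynamic_energy(1.0))  # frequency only
print(relative_dynamic_power(0.8, 0.5), relative_dynamic_energy(0.8))  # plus hypothetical 0.8x voltage
```

Halving the frequency alone halves dynamic power but leaves the dynamic energy of moving each flit unchanged; in this model, energy falls only once the lower frequency permits a lower voltage.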

SLIDE 10

Legend: Request Packet, Reservation Packet, Data Packet; Control Plane, Data Plane

1) Upon a cache miss, a data request is sent on the control plane.

SLIDE 11

2) Upon a data hit in the next cache level, the circuit reservation packet is sent.

SLIDE 12

3) Finally, the data packet is sent to the requester on the reserved circuit.
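The three steps can be summarized as the message sequence of one read that misses in L1 and hits in the next cache level. This is a sketch for illustration; the `Message` record, the `read_transaction` helper, and the node names are hypothetical, and misses in the next level (which would go further down the hierarchy) are not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    kind: str   # "request", "reservation" or "data"
    plane: str  # "control" or "data"
    src: str
    dst: str

def read_transaction(requester: str, responder: str) -> list[Message]:
    """Messages exchanged for one L1 miss that hits in the next cache level."""
    return [
        Message("request", "control", requester, responder),      # 1) sent on the L1 miss
        Message("reservation", "control", responder, requester),  # 2) sent as soon as the hit is detected
        Message("data", "data", responder, requester),            # 3) follows on the reserved circuit
    ]

for m in read_transaction("L1@core0", "L2@bank5"):
    print(f"{m.kind:12} {m.plane:8} {m.src} -> {m.dst}")
```

Because the reservation travels back along the same source and destination as the data, it can set up the circuit while the cache is still reading out the line.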

SLIDE 13

[Figure: A router on the Control Plane and on the Data Plane, showing Reservation Packets 1, 2, 3, Data Packets 1, 2, 3, and the Reserved Circuits East-North, West-North, and West-East.]

SLIDE 14

Reserving the West-East connection:
  • The reservation queue of the West input port records E (the output port).
  • The reservation queue of the East output port records W (the input port).
  • The connection is used when these entries reach the head of the queues.

Port names: North, West, South, East, Local

SLIDE 15

[Figure: Reservation queues of the input ports and reservation queues of the output ports (North, West, South, East, Local), with the head of the queues marked.]

SLIDE 16

[Figure (continued): Reservation queues of the input and output ports (North, West, South, East, Local), with the head of the queues marked.]

SLIDE 17

[Figure (continued): Reservation queues of the input and output ports (North, West, South, East, Local), with the head of the queues marked.]
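The per-port reservation queues can be approximated with one FIFO per input port and one per output port. The class and method names, and the rule that a connection is ready when its entries sit at the head of both the input-port and the output-port queue, are our reading of the slides, not the authors' implementation.

```python
from collections import deque

PORTS = ("North", "West", "South", "East", "Local")

class ReservationRouter:
    """Hypothetical data-plane router state: one FIFO reservation queue per port."""

    def __init__(self) -> None:
        self.in_q = {p: deque() for p in PORTS}   # input port -> queued output ports
        self.out_q = {p: deque() for p in PORTS}  # output port -> queued input ports

    def reserve(self, inp: str, out: str) -> None:
        """Record a reservation packet that will use the inp -> out connection."""
        self.in_q[inp].append(out)
        self.out_q[out].append(inp)

    def ready_connections(self) -> list[tuple[str, str]]:
        """Connections whose entries are at the head of both of their queues."""
        ready = []
        for inp, q in self.in_q.items():
            if q and self.out_q[q[0]] and self.out_q[q[0]][0] == inp:
                ready.append((inp, q[0]))
        return ready

    def release(self, inp: str, out: str) -> None:
        """Free the connection after the data packet's tail flit has passed."""
        if not self.in_q[inp] or self.in_q[inp][0] != out or self.out_q[out][0] != inp:
            raise ValueError("connection is not at the head of both queues")
        self.in_q[inp].popleft()
        self.out_q[out].popleft()

# Reservations in the order East-North, West-North, West-East.
r = ReservationRouter()
for inp, out in [("East", "North"), ("West", "North"), ("West", "East")]:
    r.reserve(inp, out)
print(r.ready_connections())  # [('East', 'North')]
```

West-East is not ready yet even though the East output is otherwise idle: the West input must first serve West-North, which in turn waits behind East-North. Each port only tracks its own queue, matching the requirement that ports independently track the circuits they are part of.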

SLIDE 18

  • Each input and output port must independently track the reserved circuits it is part of.
  • Any two reservation packets that share part of their paths must traverse all the shared links in the same order.
  • Data packets must be injected onto the data plane in the same order their reservation packets are injected onto the control plane.
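The second rule can be checked mechanically: every pair of reservations must appear in the same relative order on every link they share. One way to see why it matters, given FIFO reservation queues at each port (slides 14 to 17): if reservations A and B were recorded A-then-B on one shared link but B-then-A on another, each data packet could end up waiting behind the other's reservation, a circular wait. The function and the link/packet names below are hypothetical.

```python
from itertools import combinations

def same_order_on_shared_links(link_orders: dict[str, list[str]]) -> bool:
    """True if every pair of reservations is seen in the same relative order on every link both use."""
    seen: dict[tuple[str, ...], str] = {}  # unordered pair -> reservation that went first
    for order in link_orders.values():
        # combinations() preserves list order, so `earlier` precedes `later` on this link.
        for earlier, later in combinations(order, 2):
            pair = tuple(sorted((earlier, later)))
            if seen.setdefault(pair, earlier) != earlier:
                return False
    return True

print(same_order_on_shared_links({"A->B": ["r1", "r2"], "B->C": ["r1", "r2"]}))  # True
print(same_order_on_shared_links({"A->B": ["r1", "r2"], "B->C": ["r2", "r1"]}))  # False
```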

SLIDE 19

  • Motivation & Related Work
  • Importance of data messages to performance
  • The Optimization Problem
  • Proposed Solution
  • Déjà Vu Switching
  • Analysis of the acceptable data plane speed
  • Evaluation
  • Summary

SLIDE 20

[Figure: Timeline over cycles 1 to 13.]

Reservation packet injected early at cycle 1; data packet injected at cycle 3.

SLIDE 21

[Figure (continued): Timeline over cycles 1 to 13.]

Reservation packet injected early at cycle 1; data packet injected at cycle 3.
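A toy timing model can illustrate why injecting the reservation early lets the data plane run slower. It assumes, hypothetically, that the reservation spends 3 control-plane cycles per router (the baseline's pipeline depth; the control plane's own depth is not stated), that a data flit crosses a router on an established circuit in 1 data-plane cycle, that link traversal is free, and that the slide's cycles 1 and 3 are 4 GHz control-plane cycles (0.25 ns and 0.75 ns). This is an illustration, not the analysis presented in the talk.

```python
def data_delivery_ns(hops: int, f_ctrl: float, f_data: float,
                     t_res_ns: float, t_data_ns: float, flits: int,
                     ctrl_cycles_per_hop: int = 3,
                     data_cycles_per_hop: int = 1) -> tuple[float, int]:
    """Return (tail-flit arrival time in ns, number of routers where the data head waited)."""
    ctrl_hop = ctrl_cycles_per_hop / f_ctrl  # ns per router for the reservation packet
    data_hop = data_cycles_per_hop / f_data  # ns per router on an established circuit
    head, waits = t_data_ns, 0
    for h in range(1, hops + 1):
        reserved_at = t_res_ns + h * ctrl_hop  # reservation has configured router h
        if reserved_at > head:                 # data head caught up: wait for the circuit
            waits += 1
        head = max(head, reserved_at) + data_hop
    tail = head + (flits - 1) / f_data         # remaining flits stream behind the head
    return tail, waits

# 4 routers, 7-flit data packet, control plane at 4 GHz.
for f_data in (4.0, 2.0, 1.0):
    print(f_data, data_delivery_ns(4, 4.0, f_data, 0.25, 0.75, 7))
```

In this model the data head waits for the reservation at every router at both 4 GHz and 2 GHz (tail arrivals of 5.0 ns and 6.75 ns), so the 4 GHz data plane only gains on the last hop and on serialization; at 1 GHz it falls behind the reservation after the first router (11.0 ns). Data-plane speed beyond the reservation's pace is wasted, which is the intuition behind asking how slow the data plane can run.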

SLIDE 22

SLIDE 23

  • Motivation & Related Work
  • Importance of data messages to performance
  • The Optimization Problem
  • Proposed Solution
  • Déjà Vu Switching
  • Analysis of the acceptable data plane speed
  • Evaluation
  • Summary

SLIDE 24

  • We use the functional simulator Simics to simulate cache-coherent CMPs with 16 and 64 cores.
  • We use Orion2 to obtain power numbers for the interconnect routers.
  • We evaluate with:
    • Synthetic traces, which allow varying the network load.
    • Execution-driven simulation of parallel benchmarks from the SPLASH-2 and PARSEC suites.
  • Interconnect topology: Mesh

SLIDE 25

Baseline NoC:
  • Single plane, 16-byte links, packet switched with a 3-cycle router pipeline, clocked at 4 GHz

Evaluated NoC (6 + 10 = 16 bytes of link width in total, matching the baseline):
  • Control plane: 6-byte links, packet switched, 4 GHz
  • Data plane: 10-byte links, circuit switched

Packet sizes:
  • Control and coherence packets: 1 flit
  • Data packets: 5 flits on the baseline NoC, 7 flits on the data plane
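A rough check of what the narrower data-plane links cost in serialization time. It assumes one flit per cycle and reads the "4/x GHz" legend labels on the results slides as control/data-plane clocks; the 64-byte line size (8 words of 8 bytes) is a hypothetical value, and how packet headers are carried is not stated on the slide.

```python
import math

def flits_needed(packet_bytes: int, link_bytes: int) -> int:
    """Flits of `link_bytes` width needed to carry `packet_bytes` bytes."""
    return math.ceil(packet_bytes / link_bytes)

def serialization_ns(flits: int, freq_ghz: float) -> float:
    """Time to stream `flits` flits over one link at one flit per cycle (cycles / GHz = ns)."""
    return flits / freq_ghz

# Hypothetical 64-byte data payload on each link width.
print(flits_needed(64, 16), flits_needed(64, 10))  # 4 7
for f in (4.0, 3.0, 2.66, 2.0):  # data-plane clocks from the results slides
    print(f"7 flits at {f} GHz: {serialization_ns(7, f):.2f} ns")
print(f"baseline, 5 flits at 4 GHz: {serialization_ns(5, 4.0):.2f} ns")
```

A 5-flit baseline data packet serializes in 1.25 ns, while the 7-flit data-plane packet needs 1.75 ns at 4 GHz and 3.5 ns at 2 GHz; this is part of the latency a slower data plane adds.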

SLIDE 26

[Figure: Normalized NoC energy and completion time of synthetic traces on a 64-core CMP. Panels: Normalized Energy Consumption and Normalized Completion Time versus Traffic Injection Rate (requests/cycle/node: 0.01, 0.03, 0.05). Configurations: 4 GHz, 4/4 GHz, 4/3 GHz, 4/2.66 GHz, 4/2 GHz.]

*Each request receives a reply data packet.

SLIDE 27

[Figure: Normalized execution time on a 16-core CMP, per benchmark (barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, raytrace, specjbb, water nsquared, water spatial) and geometric mean. Configurations: 4 GHz, 4/4 GHz, 4/3 GHz, 4/2.66 GHz, 4/2 GHz.]

SLIDE 28

[Figure: Normalized NoC energy consumption on a 16-core CMP, per benchmark (barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, raytrace, specjbb, water nsquared, water spatial) and geometric mean. Configurations: 4 GHz, 4/4 GHz, 4/3 GHz, 4/2.66 GHz, 4/2 GHz.]

SLIDE 29

[Figure: Normalized NoC energy and execution time on a 64-core CMP. Panels: Normalized Execution Time and Normalized Energy Consumption, per benchmark (barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, specjbb, water nsquared, water spatial) and geometric mean. Configurations: 4 GHz, 4/4 GHz, 4/3 GHz, 4/2.66 GHz, 4/2 GHz.]

SLIDE 30

[Figure: Performance relative to a CMP with a single plane NoC, on a 16-core CMP, per benchmark (barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, raytrace, specjbb, water nsquared, water spatial) and geometric mean. Configurations: PS 4/4, DV 4/4, PS 4/2.66, PS+CW 4/2.66, DV 4/2.66.]

SLIDE 31

  • L. Cheng et al. (ISCA'06) and A. Flores et al. (IEEE Trans. Computers'10): a heterogeneous NoC using wires of different latency and power characteristics to improve performance and reduce NoC energy. The proposal requires wide links (75 bytes); performance degrades with narrow links.
  • Our work differs in:
    • Tying the latency of messages to performance
    • Using Déjà Vu Switching to compensate for the slower data plane

SLIDE 32

  • Problem: saving power in the NoC by reducing the data plane’s power consumption without impacting performance.
  • Delayed cache hits are important to performance.
  • Operating the data plane in circuit-switched mode allows it to operate at a reduced frequency.
  • Déjà Vu Switching allows reservations to proceed when resources are not currently available.
  • The constraints governing the speed of the data plane can be estimated analytically.

SLIDE 33