
SLIDE 1

Design of Adaptive Communication Channel Buffers for Low-Power Area-Efficient Network-on-Chip Architecture

ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS’07), Dec 3-4, 2007

Avinash Kodi†, Ashwini Sarathy* and Ahmed Louri*

†Department of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701
*Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85719

E-mail: kodi@ohio.edu, sarathya@ece.arizona.edu, louri@ece.arizona.edu

Sponsored: National Science Foundation (NSF) grant ECCS-0725765 (at the High Performance Computing Architectures and Technologies Lab, University of Arizona, Tucson)

SLIDE 2

Talk Outline

Motivation & Introduction
iDEAL – Inter-router Dual-function Energy and Area-efficient Links for NoC architectures
– Link and Router Architecture
Performance Evaluation
– Power & Area estimation for the Links & Routers
– Simulation results for Throughput, Latency & Overall network power
Conclusions

SLIDE 3

Motivation

System-on-Chip (SoC) paradigm
Processor Cores, SRAM/Flash & Memory Controllers, USB / Ethernet controllers, UART / GPIO
Increasing wire delay with decreasing feature size
Scalable, modular interconnect – Network-on-Chip (NoC)

[Figure: 4 x 4 mesh NoC with nodes (0,0) to (3,3); each node pairs a NoC Router with a Processing Element (Processors, DSPs, Peripheral Controllers, Memory Subsystems); routers are connected by Channels / Links]

SLIDE 4

Motivation

Recent NSF-sponsored workshop on On-Chip Interconnection Networks¹:
“The most important technology constraint for on-chip networks is power consumption”.
Power consumption of OCINs implemented with current techniques exceeds expected needs by a factor of 10.

[Figure: Generic NoC Router with Input Buffers on the x, y and Processing Element (PE) ports, Route Computation (RC), Virtual Channel (VC), Switch Allocator (SA) and Crossbar Switch]

Power Break-up in the NoC Router: Buffers 46%, Crossbar 35%, Clock Buffer 16%, Arbiter 3%

1. Reference: J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler and L. S. Peh, “Research Challenges for On-Chip Interconnection Networks”, IEEE Micro, vol. 27, no. 5, pp. 96–108, September-October 2007.

SLIDE 5

iDEAL – Inter-router Dual-function Energy and Area-efficient Links for NoC architectures

iDEAL Methodology (circuit and architectural techniques)

Reduce the number of router buffers
To prevent performance degradation, use adaptive channel buffers to store data along the links when required
Dynamic buffer allocation within the router buffers

[Figure: Generic NoC architecture vs. iDEAL architecture. Both routers show Input Buffers, Route Computation (RC), Virtual Channel (VC), Switch Allocator (SA), Crossbar Switch and Processing Element (PE); the iDEAL router has a reduced router buffer size and adaptive channel buffers along the link]

SLIDE 6

Conventional Links

[Figure: conventional link from the Output Port of Router A to the Input Port of Router B]

SLIDE 7

iDEAL – Channel Buffer Design (1/2)

[Figure: link from the Output Port of Router A to the Input Port of Router B, with Control blocks driven by the Congestion signal]

SLIDE 8

iDEAL – Channel Buffer Design (2/2)

No congestion: functions as a conventional repeater. Control block is turned ‘OFF’.
During congestion: repeater is tri-stated and holds the sampled value. Control block is turned ‘ON’.
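The repeater/hold behaviour can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the circuit from the slides: the class name `DualFunctionLink`, the one-flit-per-stage capacity, and the one-stage-per-cycle movement are all hypothetical simplifications chosen for illustration.

```python
from typing import List, Optional


class DualFunctionLink:
    """Behavioural sketch: a link whose repeater stages can each hold one flit.

    Assumption (not from the slides): a flit advances one stage per clock
    cycle, and a stage holds its value whenever the stage ahead is occupied.
    """

    def __init__(self, stages: int) -> None:
        self.stages: List[Optional[int]] = [None] * stages

    def step(self, data_in: Optional[int], congested: bool) -> Optional[int]:
        """Advance one cycle; return the flit delivered downstream, if any."""
        out = None
        # The last stage delivers only when the downstream router is not congested.
        if not congested and self.stages[-1] is not None:
            out = self.stages[-1]
            self.stages[-1] = None
        # Move flits forward into empty stages, starting from the router end.
        for i in range(len(self.stages) - 1, 0, -1):
            if self.stages[i] is None and self.stages[i - 1] is not None:
                self.stages[i] = self.stages[i - 1]
                self.stages[i - 1] = None
        # A new flit is accepted only if the first stage is free.
        if data_in is not None:
            if self.stages[0] is not None:
                raise RuntimeError("link full: sender must stall")
            self.stages[0] = data_in
        return out


link = DualFunctionLink(stages=3)
# Three flits enter while the downstream router signals congestion.
outputs = [link.step(flit, congested=True) for flit in (1, 2, 3)]
held = list(link.stages)
# Congestion persists for one more cycle, then is released.
outputs.append(link.step(None, congested=True))
outputs += [link.step(None, congested=False) for _ in range(3)]
```

In this model the three flits sit in the link stages while congestion lasts and leave in arrival order once it is released.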

SLIDE 9

iDEAL – Control Block

Power efficient
Stable at varying frequencies

[Figure: control block circuit; labels: O/P Port Router A, I/P Port Router A, Congestion signal, CLK, CLK1, CLK2]

SLIDE 10

iDEAL: Dual-function Link

[Figure: dual-function link over Cycles 1 to 3: flits 1, 2, 3 on Data-In, Congestion Signal asserted, then Congestion Release with flits 1, 2, 3 on Data-Out]

SLIDE 11

Link - Power & Area Estimation

Psegment(repeater) (dynamic, leakage, short-circuit)
Psegment(chl-buffer) (leakage, control block)
Pcontrol-blk (inverters, clock, switched-cap.)

[Figure: link from the Output Port of Router A to the Input Port of Router B with Control blocks; control block inputs Congestion and CLK, outputs CLK1 and CLK2]

SLIDE 12

iDEAL – Router Buffer Design

Static buffer allocation
Fixed number of buffers per VC
HoL blocking

[Figure: Input Port P with static allocation: DEMUX, vc 1 to vc v each holding Flit 1 to Flit r, MUX, Credit Return, Congestion Control (C*), and a VC State Table with fields VCID, RP, WP, OP, OVC, CR, Status]

RP = read pointer, WP = write pointer, OP = output port, OVC = output VC, CR = credits, C* = congestion
Status = status of the VC (idle, waiting, RC, VA, SA, ST)

SLIDE 13

iDEAL – Router Buffer Design

Dynamic buffer allocation
Approximately (z + c)/v buffers per VC (z = router buffers, c = channel buffers, v = # of VCs)

[Figure: Input Port P with a unified buffer of slots Flit 1 to Flit z between DEMUX and MUX; Buffer Slot Availability table (Buffer Slot 1 to z, Free Y/N); Unified VC State Table (VC 1 to v; fields RP, WP, OP, OVC, CR, Status, and flit entries F0, F1, ... F(z+c)/v); Write Pointer with Input Flit Tracking, Read Pointer with Output Flit Tracking, Credit Return, Congestion Control]

RP = read pointer, WP = write pointer, OP = output port, OVC = output VC, CR = credits, C* = congestion
Status = status of the VC (idle, waiting, RC, VA, SA, ST)
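The idea of dynamic allocation (any VC may take any free slot, up to roughly (z + c)/v flits per VC) can be sketched as a small data structure. This is an illustrative sketch, not the router's actual logic: the class `UnifiedInputBuffer`, its method names, and the `max_per_vc` cap are hypothetical, and credits, pointers and VC status are left out.

```python
from typing import Dict, List, Optional


class UnifiedInputBuffer:
    """Sketch of one input port: z shared buffer slots used by v virtual channels."""

    def __init__(self, num_slots: int, num_vcs: int, max_per_vc: int) -> None:
        self.slots: List[Optional[str]] = [None] * num_slots  # flit storage
        self.free: List[bool] = [True] * num_slots            # slot availability (Y/N)
        self.max_per_vc = max_per_vc                          # roughly (z + c) / v
        # Per-VC ordered slot indices, playing the role of F0, F1, ... in the table.
        self.vc_slots: Dict[int, List[int]] = {vc: [] for vc in range(num_vcs)}

    def write(self, vcid: int, flit: str) -> bool:
        """Store an incoming flit in the first free slot; False if none is usable."""
        if len(self.vc_slots[vcid]) >= self.max_per_vc:
            return False
        for idx, is_free in enumerate(self.free):
            if is_free:
                self.free[idx] = False
                self.slots[idx] = flit
                self.vc_slots[vcid].append(idx)
                return True
        return False  # buffer full: the flit has to wait upstream

    def read(self, vcid: int) -> Optional[str]:
        """Remove and return the oldest flit of a VC, freeing its slot."""
        if not self.vc_slots[vcid]:
            return None
        idx = self.vc_slots[vcid].pop(0)
        flit = self.slots[idx]
        self.slots[idx] = None
        self.free[idx] = True
        return flit


# 4 VCs, 8 router slots, 8 channel buffers: (8 + 8) / 4 = 4 tracked flits per VC.
buf = UnifiedInputBuffer(num_slots=8, num_vcs=4, max_per_vc=4)
vc0_writes = [buf.write(0, f"a{i}") for i in range(5)]  # fifth write hits the per-VC cap
vc1_writes = [buf.write(1, f"b{i}") for i in range(4)]  # fills the remaining slots
vc2_blocked = buf.write(2, "c0")                        # no free slot left
first_out = buf.read(0)                                 # frees slot 0
vc2_ok = buf.write(2, "c0")                             # reuses the freed slot
```

With static allocation, VC 0 would be limited to 2 slots in this 8-slot configuration; here it can hold 4 while other VCs are idle.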

SLIDE 14

iDEAL – Router Buffer Design

Example illustrating Dynamic buffer allocation in iDEAL

[Figure: Unified VC State Table for VC 1 to 3 (fields RP, WP, OP, OVC, CR, Status, F0 to F4) and Buffer Slot Availability for Buffer Slots 1 to 7, updated for an incoming flit (VCID = 1) through the Write Pointer with Input Flit Tracking, Read Pointer with Output Flit Tracking and Congestion Control]

RP = read pointer, WP = write pointer, OP = output port, OVC = output VC, CR = credits, C* = congestion
Status = status of the VC (idle, waiting, RC, VA, SA, ST)

SLIDE 15

Router - Power & Area Estimation

Buffer Power (Pwrite + Pread)
Crossbar Power (Switch + Arbiter)
Power reduces on decreasing the buffer size

[Figure: router block diagram (Input Buffers, Route Computation (RC), Virtual Channel (VC), Switch Allocator (SA), Crossbar Switch, Processing Element (PE)); buffer detail showing 6T SRAM cell, Wordlines, Bitlines and Sense Amp]

SLIDE 16

Performance Evaluation

Evaluated on a cycle-accurate on-chip network simulator
Simulated 8 x 8 Mesh and 8 x 8 Folded Torus topologies
Synthetic benchmarks with uniform and non-uniform workloads (Butterfly, Complement, Perfect Shuffle, Matrix Transpose, Bit Reversal) were evaluated
Parameters evaluated include throughput, latency and overall network power
Considered 5 different configurations – (vnV – rnR – cnC)
(nV = No. of VCs per input port, nR = No. of router buffers per VC, nC = number of channel buffers)
– Baseline = 440
– 434, 428, 344, 531
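The configuration shorthand can be unpacked with a few lines of arithmetic. The sketch below is illustrative only: `Config`, `parse_config`, `router_slots` and `buffers_per_vc` are hypothetical names, and taking z = nV x nR is an assumption read off the buffer-slot labels in the router figure.

```python
import re
from typing import NamedTuple


class Config(NamedTuple):
    vcs: int           # nV: VCs per input port
    router_bufs: int   # nR: router buffers per VC
    channel_bufs: int  # nC: channel buffers


def parse_config(label: str) -> Config:
    """Parse '428' or 'v4-r2-c8' (single-digit fields only)."""
    match = re.fullmatch(r"v?(\d)-?r?(\d)-?c?(\d)", label)
    if match is None:
        raise ValueError(f"unrecognised configuration label: {label!r}")
    return Config(*(int(g) for g in match.groups()))


def router_slots(cfg: Config) -> int:
    """z: router buffer slots per input port, assumed to be nV * nR."""
    return cfg.vcs * cfg.router_bufs


def buffers_per_vc(cfg: Config) -> float:
    """Approximate buffers per VC under dynamic allocation: (z + c) / v."""
    return (router_slots(cfg) + cfg.channel_bufs) / cfg.vcs


for label in ("440", "434", "428", "344", "531"):
    cfg = parse_config(label)
    total = router_slots(cfg) + cfg.channel_bufs
    print(label, router_slots(cfg), cfg.channel_bufs, total, round(buffers_per_vc(cfg), 2))
```

Under this reading, all five configurations keep 16 buffer slots per input port in total (router plus channel), and 428 halves the router buffers relative to the 440 baseline (8 versus 16).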


SLIDE 17

Power Estimation - Summary

vnV-rnR-cnC   Buffer Power (mW)   % Change   Mesh Link + Control Power (mW)   % Change   Folded Torus Link + Control Power (mW)   % Change
v4-r4-c0      2.020               -          2.032 + 0                        -          4.068 + 0                                -
v4-r3-c4      1.646               -18.51     2.164 + 0.0122                   +7.0       4.195 + 0.0122                           +3.4
v4-r2-c8      1.272               -37.02     2.296 + 0.0205                   +13.9      4.327 + 0.0205                           +6.8
v3-r3-c7      1.365               -32.41     2.263 + 0.0184                   +12.2      4.294 + 0.0184                           +6.0
v5-r2-c6      1.459               -27.76     2.230 + 0.0164                   +10.5      4.261 + 0.0164                           +5.1
v3-r4-c4      1.646               -18.51     2.164 + 0.0122                   +7.0       4.195 + 0.0122                           +3.4
v5-r3-c1      1.926               -4.65      2.065 + 0.0059                   +1.8       4.096 + 0.0059                           +0.8

nV = number of VCs per input port, nR = number of router buffers per VC, nC = number of channel buffers

SLIDE 18

Buffer Power – 8x8 Mesh and Folded Torus

Uniformly distributed traffic
⇒ Nearly 40% power savings for 50% buffer size reduction (428), using Dynamic buffer allocation
(428 = 4 VCs per port, 2 router buffers per VC, 8 channel buffers)

[Figure: Buffer Power (8x8 Mesh) and Buffer Power (8x8 Folded Torus), UN - Dynamic; x-axis: Configuration (v4-r4-c0, v4-r3-c4, v4-r2-c8, v3-r4-c4, v5-r3-c1); y-axis: Power (watts)]

SLIDE 19

Throughput – 8x8 Mesh and Folded Torus

Uniformly distributed traffic
⇒ Only about 5% drop in throughput for the 428 case (Dynamic buffer allocation)
(428 = 4 VCs per port, 2 router buffers per VC, 8 channel buffers)

[Figure: Throughput (8x8 Mesh) and Throughput (8x8 Folded Torus), UN - Dynamic; x-axis: Offered Load (as a fraction of network capacity); y-axis: Throughput (GBps); series: v4-r4-c0, v4-r3-c4, v4-r2-c8, v3-r4-c4, v5-r3-c1]

SLIDE 20

Overall Network Power – 8x8 Mesh and Folded Torus

Total power consumed for a network load of 0.5
⇒ Nearly 20% savings for the 428 case, using Dynamic buffer allocation
(428 = 4 VCs per port, 2 router buffers per VC, 8 channel buffers)

[Figure: Total Power (8x8 Mesh) and Total Power (8x8 Folded Torus), UN - Dynamic; x-axis: Configuration (v4-r4-c0, v4-r3-c4, v4-r2-c8, v3-r4-c4, v5-r3-c1); y-axis: Power (watts); stacked components: Congestion Power, Link Power, Crossbar Power, Buffer Power]

SLIDE 21

Buffer Power – 8x8 Mesh – all Traffic Patterns

Reduction in power for all configurations, under all traffic patterns, compared to the baseline (440)
For example, under Complement traffic the 428 configuration achieves 45% savings under Static allocation and 37.5% savings under Dynamic allocation

[Figure: Buffer Power (8x8 Mesh) at an offered load = 0.5; x-axis: Traffic Pattern (S-UN, S-CO, S-TO, S-PS, S-BR, S-MT, S-NE, S-BU, D-UN, D-CO, D-TO, D-PS, D-BR, D-MT, D-NE, D-BU); y-axis: Power (watts); series: v4-r4-c0, v4-r3-c4, v4-r2-c8]

SLIDE 22

Throughput – 8x8 Mesh – all Traffic Patterns

No significant decrease in throughput under any traffic pattern, using Dynamic allocation

[Figure: Throughput (8x8 Mesh) at an offered load = 0.5; x-axis: Traffic Pattern (S-UN, S-CO, S-TO, S-PS, S-BR, S-MT, S-NE, S-BU, D-UN, D-CO, D-TO, D-PS, D-BR, D-MT, D-NE, D-BU); y-axis: Throughput (GBps); series: v4-r4-c0, v4-r3-c4, v4-r2-c8]

SLIDE 23

Conclusion

The iDEAL architecture provides a Low-Power Area-efficient solution for NoCs, by reducing power consumption through circuit-level and architecture-level techniques.

Simulation results show that by reducing the buffer size by half, a 40-52% savings in power is achieved, with a significant reduction in router area. There is only a marginal 1-5% drop in performance, under dynamic buffer allocation.

Future work will involve
(a) Simulation using real-application traces
(b) Exploring architectural improvements such as aggressive speculation in the credit loop

SLIDE 24

Backup Slides

SLIDE 25

Area Estimation – Summary with values from Synopsys Design Compiler

vnV-rnR-cnC   Buffer Area (μm²)   Link Repeater Area (μm²)   Total Buffer + Link Area (μm²)   % Change
v4-r4-c0      81,407              32                         81,439                           -
v4-r3-c4      63,991              52                         64,011                           -21.40
v4-r2-c8      48,066              80                         48,146                           -40.88
v3-r3-c7      50,373              73                         50,446                           -38.05
v5-r2-c6      53,712              66                         53,778                           -33.96
v3-r4-c4      63,250              52                         63,302                           -22.27
v5-r3-c1      73,797              38                         73,803                           -9.37

nV = number of VCs per input port, nR = number of router buffers per VC, nC = number of channel buffers

SLIDE 26

Latency – 8x8 Mesh and Folded Torus

Uniformly distributed traffic
⇒ For all cases (except 531), saturation occurs at a network load of about 0.3 for the Mesh and about 0.4 for the Folded Torus

[Figure: Average Latency (8x8 Mesh) and Average Latency (8x8 Folded Torus), UN - Dynamic; x-axis: Offered Load (as a fraction of network capacity); y-axis: Average Latency (microsec); series: v4-r4-c0, v4-r3-c4, v4-r2-c8, v3-r4-c4, v5-r3-c1]

SLIDE 27

Comparison with FC-CB and DAMQ

FC-CB shows performance similar to the dynamically allocated 440 case
434 and 428 achieve nearly 4% increase in saturation throughput compared to FC-CB
428 achieves nearly 12.5% improvement in saturation throughput compared to DAMQ

[Figure: Comparison of Saturation Throughput (8x8 Mesh) - Uniform Traffic; x-axis: Configuration (v4-r3-c4, v4-r2-c8, FC-CB, DAMQ); y-axis: Throughput (in GBps)]
[Figure: Comparison of Average Latency (8x8 Mesh) - Uniform Traffic; x-axis: Offered Traffic (as a fraction of network capacity); y-axis: Average Latency (in microsec); series: v4-r3-c4, v4-r2-c8, FC-CB, DAMQ]

SLIDE 28

Power calculations using Synopsys Power Compiler

428 case shows nearly 40% reduction in buffer power alone
Nearly 30% decrease in overall network power for the 428 case

[Figure: Total Power (8x8 Mesh) - Uniform Traffic; x-axis: Configuration (v4-r4-c0, v5-r3-c1, v3-r4-c4, v4-r3-c4, v4-r2-c8); y-axis: Power (Watts); stacked components: Control, Link, Switch, Arbiter, Buffer]
[Figure: Buffer Power (8x8 Mesh) - Uniform Traffic; x-axis: Configuration (v4-r4-c0, v5-r3-c1, v3-r4-c4, v4-r3-c4, v4-r2-c8); y-axis: Power (Watts); stacked components: Leakage Power, Dynamic Power]

SLIDE 29

Data flow Control Simulated with Synopsys VCS

[Figure: simulation waveform over Time (ns); signals: 500 MHz Clock Signal, Data_in, Congestion input, Congestion at stage 1 to stage 4, Data_out1 to Data_out4 from stages 1 to 4]

SLIDE 30

Router - Power Estimation

Component / Power or Area Calculation / Explanation

$C_{buf} = \left(\tfrac{1}{2} \cdot W \cdot L \cdot C_{ox}\right) + \left(W \cdot L_{ov} \cdot C_{ox}\right)$
C_buf = additional capacitance due to three-state repeater along the links; W, L = width & length of min. sized inverter; C_ox = oxide capacitance; L_ov = gate-drain/source overlap length

$P_{dynamic} = a \cdot \left[k\,(C_o + C_p + C_{buf}) + \ell\, C_w\right] \cdot V_{DD}^{2} \cdot freq$
a = activity factor; k = repeater sizing; ℓ = repeater spacing; C_o = diffusion capacitance; C_p = gate capacitance; C_w = wire capacitance; V_DD = supply voltage; freq = operating frequency

$P_{leakage} = 2 \cdot \left[\tfrac{1}{2} \cdot V_{DD} \cdot \left(I_{off}\,(W_n + W_p)\,k\right)\right]$
I_off = subthreshold leakage current; W_n (W_p) = width of the NMOS (PMOS) in the repeater

$P_{short\text{-}ckt} = a \cdot t_{rise} \cdot W_n \cdot k \cdot V_{DD} \cdot I_{sc} \cdot freq$
t_rise = rise time of the short-ckt current I_sc
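The four expressions on this slide translate directly into code. The following is a minimal sketch that only restates the slide's formulas: the names `LinkParams`, `buffer_capacitance`, `p_dynamic`, `p_leakage` and `p_short_circuit` are hypothetical, and the numeric values in the usage lines are arbitrary placeholders, not data from the presentation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LinkParams:
    a: float        # activity factor
    k: float        # repeater sizing
    spacing: float  # repeater spacing (script l on the slide)
    c_o: float      # diffusion capacitance
    c_p: float      # gate capacitance
    c_w: float      # wire capacitance
    c_buf: float    # extra capacitance of the three-state repeater
    vdd: float      # supply voltage
    freq: float     # operating frequency
    i_off: float    # subthreshold leakage current
    w_n: float      # NMOS width in the repeater
    w_p: float      # PMOS width in the repeater
    t_rise: float   # rise time of the short-circuit current
    i_sc: float     # short-circuit current


def buffer_capacitance(w: float, length: float, l_ov: float, c_ox: float) -> float:
    """Cbuf = (1/2 * W * L * Cox) + (W * Lov * Cox)."""
    return 0.5 * w * length * c_ox + w * l_ov * c_ox


def p_dynamic(p: LinkParams) -> float:
    """a * [k(Co + Cp + Cbuf) + l*Cw] * VDD^2 * freq."""
    return p.a * (p.k * (p.c_o + p.c_p + p.c_buf) + p.spacing * p.c_w) * p.vdd ** 2 * p.freq


def p_leakage(p: LinkParams) -> float:
    """2 * [1/2 * VDD * (Ioff * (Wn + Wp) * k)]."""
    return 2.0 * (0.5 * p.vdd * (p.i_off * (p.w_n + p.w_p) * p.k))


def p_short_circuit(p: LinkParams) -> float:
    """a * trise * Wn * k * VDD * Isc * freq."""
    return p.a * p.t_rise * p.w_n * p.k * p.vdd * p.i_sc * p.freq


# Placeholder values for illustration only (not from the slides).
params = LinkParams(a=0.5, k=10.0, spacing=1.0, c_o=1e-15, c_p=2e-15, c_w=1e-13,
                    c_buf=5e-16, vdd=1.0, freq=5e8, i_off=1e-7, w_n=1.0, w_p=2.0,
                    t_rise=1e-11, i_sc=1e-5)
with_buffer = p_dynamic(params)
plain_repeater = p_dynamic(replace(params, c_buf=0.0))
```

Comparing `with_buffer` against `plain_repeater` isolates the dynamic-power overhead that the C_buf term adds to an ordinary repeater segment.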

SLIDE 31

iDEAL – Control Block

Self-checking Double-sampling technique for the Control block
Slightly more power (0.02 µW vs. 0.06 µW) and area, but more reliable

[Figure: control block between the Output Port of Router A and the Input Port of Router B, double-sampling the congestion input; labels: Clock, Congestion, Delay Buffer, D Flip-Flop, XOR, Error, MUX]

SLIDE 32

Aggressive Speculation

Aggressive speculation by increasing the number of credits available to 8
Additional credits are accounted for by the channel buffers
⇒ Saturation throughput improves by 10% for the 428 case

[Figure: Saturation Throughput (8x8 Folded Torus) - Uniform Traffic; x-axis: Configuration (v4-r4-c0, v4-r3-c4, v4-r2-c8, v3-r4-c4, v3-r3-c7); y-axis: Throughput (in GBps)]
[Figure: Average Latency (8x8 Folded Torus) - Uniform Traffic; x-axis: Offered Traffic (as a fraction of network capacity); y-axis: Average Latency (in microsec); series: v4-r4-c0, v4-r3-c4, v4-r2-c8, v3-r4-c4, v3-r3-c7]
[Figure: Total Power (8x8 Folded Torus) - Uniform Traffic; x-axis: Configuration (v4-r4-c0, v3-r4-c4, v4-r3-c4, v3-r3-c7, v4-r2-c8); y-axis: Power (Watts); stacked components: Control, Link, Switch, Arbiter, Buffer]

SLIDE 33

Power Estimation – Summary with values from Synopsys Power Compiler

vnV-rnR-cnC   Buffer Power (mW)   Mesh Link + Control Power (mW)   Total Power (Buffer + Link) (mW)   % Change
v4-r4-c0      19.54               2.45                             21.99                              -
v4-r3-c4      14.51               2.91                             17.42                              -20.78
v4-r2-c8      11.57               3.57                             15.14                              -31.15
v3-r3-c7      12.56               3.50                             16.06                              -26.96
v5-r2-c6      14.41               3.31                             17.72                              -19.41
v3-r4-c4      15.09               2.91                             18.00                              -18.14
v5-r3-c1      19.29               2.81                             22.10                              +0.50

nV = number of VCs per input port, nR = number of router buffers per VC, nC = number of channel buffers
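The Total and % Change columns of this summary follow from simple arithmetic on the buffer and link figures. The sketch below recomputes them for a few rows, assuming the row assignment as read from this slide; `POWER_MW`, `total_power` and `percent_change` are hypothetical names introduced only for illustration.

```python
BASELINE = "v4-r4-c0"

# (buffer power, mesh link + control power) in mW, as read from the summary table.
POWER_MW = {
    "v4-r4-c0": (19.54, 2.45),
    "v4-r3-c4": (14.51, 2.91),
    "v4-r2-c8": (11.57, 3.57),
    "v3-r4-c4": (15.09, 2.91),
    "v5-r3-c1": (19.29, 2.81),
}


def total_power(config: str) -> float:
    """Total power = buffer power + link/control power (mW)."""
    buffer_mw, link_mw = POWER_MW[config]
    return buffer_mw + link_mw


def percent_change(config: str) -> float:
    """Change in total power relative to the baseline, in percent."""
    base = total_power(BASELINE)
    return 100.0 * (total_power(config) - base) / base


for name in POWER_MW:
    print(f"{name}: total = {total_power(name):.2f} mW, change = {percent_change(name):+.2f}%")
```

The v4-r2-c8 row shows the trade-off: link and control power rises from 2.45 mW to 3.57 mW, but the buffer saving dominates, so total power still falls by about 31%.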