Adding Slow-Silent Virtual Channels for Low-Power On-Chip Networks


slide-1
SLIDE 1

Adding Slow-Silent Virtual Channels for Low-Power On-Chip Networks

Hiroki Matsutani (Keio Univ, Japan), Michihiro Koibuchi (NII, Japan), Daihan Wang (Keio Univ, Japan), Hideharu Amano (Keio Univ, Japan)

slide-2
SLIDE 2

I am very sorry…

  • My flight was canceled on April 6.
  • I waited at the airport for seven hours for rebooking, but I couldn’t get a ticket. I got a fever.
  • I arrived at Newcastle on April 7.
  • I couldn’t find my baggage; I wore only a shirt.
  • My hotel reservation was canceled without asking; I didn’t have a place to sleep…
  • I went to another hotel to book a room in my shirtsleeves in the rain. My fever got worse.
  • Ms. Jerder kindly did her presentation on Apr 8.
  • I would like to thank her and the ASYNC/NOCS program committee.

slide-3
SLIDE 3

Adding Slow-Silent Virtual Channels for Low-Power On-Chip Networks

Hiroki Matsutani (Keio Univ, Japan), Michihiro Koibuchi (NII, Japan), Daihan Wang (Keio Univ, Japan), Hideharu Amano (Keio Univ, Japan)

Power gating
Voltage and frequency scaling

slide-4
SLIDE 4

Introduction: Area and power

  • Due to the finer process technology,

– Area constraint is relaxed
– But power density becomes more serious

  • Adding extra hardware resources (e.g., VCs)

– We can get a performance margin; so
– We can reduce voltage and frequency to reduce power

[Figure: routers (a), (b), and (c), each with VC#0, VC#1, and VC#2]

Issues to be tackled in this presentation

  • Adding extra hardware increases the leakage power
  • How much resource is required to minimize total power?
slide-5
SLIDE 5

Outline: Slow-silent virtual channels

  • Network-on-Chip (NoC)
  • On-Chip Router

– Architecture and its power consumption

  • Slow-silent virtual channels

– Voltage and frequency scaling
– Run-time power gating of virtual channels
– Adaptive VC activation

  • Evaluations (1VC, 2VC, 3VC, and 4VC)

– Throughput
– Power consumption (with PG & voltage freq scaling)
– How many VCs are required to minimize power

slide-6
SLIDE 6

Network-on-Chip (NoC)

[Figure: an example tile architecture (ASPLA 90nm CMOS) showing the processor core and the router]

  • Processor core

– Largest component
– Various low-power techniques are used

  • On-chip router

– Area is not so large
– Always active, waiting for packet injection

e.g., Standby current 11uA [Ishikawa,IEICE’05]

The next slides show “Router architecture” and “Its power”

slide-7
SLIDE 7

On-Chip Router: Architecture

  • 5-input 5-output router (data width is 64 bits)

[Figure: router block diagram with input ports X+, X-, Y+, Y-, CORE, one FIFO per input port, an arbiter, a 5x5 XBAR, and output ports X+, X-, Y+, Y-, CORE]

Each physical channel has 2 VCs

HW amount is 34 kilo gates and 64% of area is used for FIFO

Each VC has a FIFO buffer (4 x 64 bits)
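The buffer figures on this slide can be combined into a total storage count. The following is a minimal sketch, assuming all five input ports carry the same number of VCs; the function name is illustrative and not from the presentation.

```python
# Defaults come from the slide: 5 input ports, 2 VCs per physical channel,
# and a 4 x 64-bit FIFO per VC. The function name is hypothetical.
def fifo_storage_bits(ports=5, vcs_per_port=2, depth_flits=4, flit_bits=64):
    """Total FIFO storage of one router, in bits."""
    return ports * vcs_per_port * depth_flits * flit_bits


print(fifo_storage_bits())                # 2560 bits for the 2-VC router
print(fifo_storage_bits(vcs_per_port=4))  # 5120 bits if each port had 4 VCs
```

Storage grows linearly with the VC count, which is why extra VCs raise leakage.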

slide-8
SLIDE 8

On-Chip Router: Pipeline

  • A header flit goes through a router in 3 cycles

– RC (Routing computation)
– VSA (Virtual channel / Switch allocation)
– ST (Switch traversal)

  • E.g., Packet transfer from router A to C

[Timing diagram: elapsed time in cycles 1 to 12; at each of routers A, B, and C the HEAD flit takes RC, VSA, ST, and DATA 1, DATA 2, DATA 3 each take ST]

A packet consists of a header and 3 data flits
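The 12-cycle transfer in the timing diagram can be reproduced with a short calculation. This is a sketch under the assumption that the header enters the next router's RC stage in the cycle right after its ST stage, and that each data flit follows one cycle behind the previous flit; the function names are illustrative.

```python
def packet_latency_cycles(hops, data_flits=3, header_cycles_per_hop=3):
    """Cycle in which the last data flit finishes ST at the last router."""
    return hops * header_cycles_per_hop + data_flits


def flit_schedule(hops, data_flits=3):
    """Map router index -> list of (stage, cycle) for the whole packet."""
    schedule = {}
    for router in range(hops):
        start = router * 3  # header reaches this router's RC at start + 1
        stages = [("RC", start + 1), ("VSA", start + 2), ("ST", start + 3)]
        # Data flits reuse the header's route and only need switch traversal.
        stages += [("ST", start + 3 + k) for k in range(1, data_flits + 1)]
        schedule[router] = stages
    return schedule


print(packet_latency_cycles(3))  # routers A, B, C
print(flit_schedule(3)[2])       # stages at router C
```

For the three routers A, B, and C this gives 3 x 3 + 3 = 12 cycles, matching the time axis of the diagram.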

slide-9
SLIDE 9

On-Chip Router: Power consumption

  • Placed and routed with 90nm CMOS
  • Post-layout simulation at 200MHz

[Figure: power consumption of a router when n ports are used [mW]]

A router consumes more power as the router processes more packets

Packet switching power is large → Voltage and frequency scaling

slide-10
SLIDE 10

On-Chip Router: Power consumption

[Figure: standby power of the on-chip router: Leakage (55.0%), Dynamic (45.0%); Channels (49.4%)]

Standby power: power consumption when no port is used

Leakage of channel buffers is the largest → Run-time power gating

Packet switching power is large → Voltage and frequency scaling

slide-11
SLIDE 11

Outline: Slow-silent virtual channels

  • Network-on-Chip (NoC)
  • On-Chip Router

– Architecture and its power consumption

  • Slow-silent virtual channels

– Voltage and frequency scaling
– Run-time power gating of virtual channels
– Adaptive VC activation

  • Evaluations (1VC, 2VC, 3VC, and 4VC)

– Throughput
– Power consumption (with PG & voltage freq scaling)
– How many VCs are required to minimize power

slide-12
SLIDE 12

Slow-Silent Virtual Channels

  • Adding extra VCs

– Performance improves
– We can reduce voltage and frequency

  • Voltage & frequency scaling (VFS)

– Set the reduced voltage and frequency
– In response to the performance margin

$$f \propto \frac{(V - V_{th})^{\alpha}}{C V}$$

$$P_{switching} = a \cdot C \cdot f \cdot V^{2}$$

  • Problem

– Adding extra VCs increases leakage power
– It may overwhelm VFS

[Figure: latency vs. accepted traffic for 1-VC, 2-VC, 3-VC, and 4-VC, with the performance margin marked]

We focus on run-time power gating of VCs to reduce leakage

slide-13
SLIDE 13

Power Gating of virtual channels

[Figure: router with a 5x5 XBAR, an arbiter, and X+, X-, Y+, Y-, CORE input and output ports; five blocks are labeled sleep]

  • Run-time power gating of virtual channels

– No packets in a VC → Sleep (turn off the power supply)
– Packet arrives at the VC → Wakeup (turn on the power)
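The sleep and wakeup rule can be sketched as a small per-VC state machine. This is an illustration only: the `WAKING` state, the `wakeup_delay` parameter and its default of 2 cycles, and all class and method names are assumptions, not details from the presentation.

```python
from enum import Enum


class VCState(Enum):
    SLEEP = "sleep"
    WAKING = "waking"
    ACTIVE = "active"


class PowerGatedVC:
    """Naive run-time power gating of one virtual channel (hypothetical model)."""

    def __init__(self, wakeup_delay=2):
        self.wakeup_delay = wakeup_delay  # assumed wakeup latency in cycles
        self.state = VCState.SLEEP
        self._countdown = 0

    def step(self, packets_in_vc, packet_arriving):
        """Advance one cycle and return the resulting state."""
        if self.state is VCState.SLEEP:
            if packet_arriving:  # packet arrives at the VC -> wakeup
                if self.wakeup_delay == 0:
                    self.state = VCState.ACTIVE
                else:
                    self.state = VCState.WAKING
                    self._countdown = self.wakeup_delay
        elif self.state is VCState.WAKING:
            self._countdown -= 1  # the arriving packet stalls meanwhile
            if self._countdown == 0:
                self.state = VCState.ACTIVE
        elif packets_in_vc == 0 and not packet_arriving:
            self.state = VCState.SLEEP  # no packets in the VC -> sleep
        return self.state


vc = PowerGatedVC(wakeup_delay=2)
for occupancy, arriving in [(0, True), (0, True), (0, True), (1, False), (0, False)]:
    print(vc.step(occupancy, arriving).value)
```

This naive version sleeps as soon as the VC drains, so it makes no attempt to avoid short sleeps.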

slide-14
SLIDE 14

Power Gating of virtual channels

[Figure: router with a 5x5 XBAR, an arbiter, and X+, X-, Y+, Y-, CORE input and output ports; five blocks are labeled sleep]

Link shutdown has been studied for on- & off-chip networks, but prior work uses SRAM buffers [Chen,ISLPED’03] [Soteriou,TPDS’07]

We use small registered FIFOs for light-weight NoC routers

  • Run-time power gating of virtual channels

– No packets in a VC → Sleep (turn off the power supply)
– Packet arrives at the VC → Wakeup (turn on the power)

slide-15
SLIDE 15

Power Gating: Various overheads

  • Area overhead

– Power switches

  • Performance overhead

– Wakeup delay
– Pipeline stall is caused

  • Power overhead

– Driving power switches
– Short sleeps adversely increase dynamic power

[Figure: FIFO sleep → waiting for channel wakeup → FIFO active; a pipeline stall of the router occurs while waiting]

Frequent on/off should be avoided

slide-16
SLIDE 16

Power Gating: Various overheads

  • Area overhead

– Power switches

  • Performance overhead

– Wakeup delay
– Pipeline stall is caused

  • Power overhead

– Driving power switches
– Short sleeps adversely increase dynamic power

[Figure: power switch circuit with Vdd, the sleep signal, the power switch, virtual Vdd, the circuit block, and GND]

[Figure: FIFO sleep → waiting for channel wakeup → FIFO active; a pipeline stall of the router occurs while waiting]

Frequent on/off should be avoided

Control that gradually activates VCs in response to workload

slide-17
SLIDE 17

Power Gating: VC activation policy

  • Virtual channel (VC) level power gating
  • Virtual-channel selection:

– All packets use VC#0 when they are injected to NoC
– VC number is increased when the packet conflicts

[Figure: routers (a), (b), and (c), each with VC#0, VC#1, and VC#2]

Only VC#0 is used if workload is low

slide-18
SLIDE 18

Power Gating: VC activation policy

  • Virtual channel (VC) level power gating
  • Virtual-channel selection:

– All packets use VC#0 when they are injected to NoC
– VC number is increased when the packet conflicts

[Figure: routers (a), (b), and (c), each with VC#0, VC#1, and VC#2]

High peak performance of VCs with the least leakage power

All VCs are activated if workload is high
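The selection rule can be sketched as follows. This is a minimal illustration, assuming a packet that conflicts on the highest-numbered VC simply stays on it and waits; the function names and the per-hop conflict flags are hypothetical.

```python
INJECTION_VC = 0  # all packets use VC#0 when they are injected


def next_vc(current_vc, conflict, num_vcs):
    """VC used at the next hop: move up one VC number on a conflict."""
    if conflict and current_vc + 1 < num_vcs:
        return current_vc + 1
    return current_vc  # no conflict, or already on the last VC (assumed: wait)


def vcs_used(conflicts, num_vcs=3):
    """Set of VC numbers a packet occupies, given one conflict flag per hop."""
    vc = INJECTION_VC
    used = {vc}
    for conflict in conflicts:
        vc = next_vc(vc, conflict, num_vcs)
        used.add(vc)
    return used


print(vcs_used([False, False, False]))  # low workload: only VC#0
print(vcs_used([True, True, True]))     # high workload: all three VCs
```

Under low load no conflicts occur, so the higher-numbered VCs never receive packets and can stay power-gated.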

slide-19
SLIDE 19

Power Gating: Routing design

  • A virtual-channel layer

– A virtual network consisting of VCs with the same VC#

  • Deadlock-freedom

– Moving from upper to lower layers
– Only the bottom layer must guarantee deadlock-freedom

[Figure: routers (a), (b), and (c), each with VC#0 to VC#3; VCs with the same number form VC layers #0 to #3]

[Duato,TPDS’93] [Koibuchi,ICPP’03]

All VC layers except for the bottom can employ any routing, as long as the bottom layer guarantees deadlock-freedom by itself

slide-20
SLIDE 20

Outline: Slow-silent virtual channels

  • Network-on-Chip (NoC)
  • On-Chip Router

– Architecture and its power consumption

  • Slow-silent virtual channels

– Voltage and frequency scaling
– Run-time power gating of virtual channels
– Adaptive VC activation

  • Evaluations (1VC, 2VC, 3VC, and 4VC)

– Throughput
– Power consumption (with PG & voltage freq scaling)
– How many VCs are required to minimize power

slide-21
SLIDE 21

Evaluations of slow-silent VCs

  • Preliminary

– Leakage modeling of PG
– Breakeven point of PG

  • Evaluation items

– Original throughput
– Power consumption w/o PG and VFS
– Power consumption w/ PG and VFS

  • Which is the best?

– 1VC, 2VC, 3VC, and 4VC

  • Process technology

– ASPLA 90nm CMOS
– 1.00V (baseline)

  • Simulation parameters

Topology      2-D Mesh (8x8)
Routing       DOR (XY routing)
Buffer size   4-flit (WH switching)
# of VCs      1VC, 2VC, 3VC, 4VC
Latency       3-cycle per 1-hop

  • Traffic patterns

– Uniform + NPB traces (BT, SP, CG, MG, IS)

slide-22
SLIDE 22

Preliminary: Leakage power modeling

  • Power gating model [Hu,ISLPED’04]

– Eoverhead: Energy consumed for turning the power switch (PS) on/off
– Esaved: Leakage energy saved by an N-cycle sleep

How many cycles of sleep are required to compensate for Eoverhead?

We calculate the breakeven point of PG based on the following parameters

Supply voltage             1.0 V
Switching factor           0.12
Leakage power              52 uW
Dynamic power (200MHz)     78 uW
Dynamic power (500MHz)     194 uW
Power switch size ratio    0.1
Power switch cap ratio     0.5

Based on the post-layout simulation of the on-chip router (90nm CMOS)

slide-23
SLIDE 23

Preliminary: Leakage power modeling

  • Power gating model [Hu,ISLPED’04]

– Eoverhead: Energy consumed for turning the power switch (PS) on/off
– Esaved: Leakage energy saved by an N-cycle sleep

How many cycles of sleep are required to compensate for Eoverhead?

[Figure: curves for No power gating (PG), PG router (200MHz), and PG router (500MHz)]

Breakeven point is 7 cycles (200MHz)
Breakeven point is 16 cycles (500MHz)

Power consumption is reduced as sleep duration becomes long

slide-24
SLIDE 24

Preliminary: Leakage power modeling

  • Power gating model [Hu,ISLPED’04]

– Eoverhead: Energy consumed for turning the power switch (PS) on/off
– Esaved: Leakage energy saved by an N-cycle sleep

How many cycles of sleep are required to compensate for Eoverhead?

[Figure: curves for No power gating (PG), PG router (200MHz), PG router (300MHz), PG router (400MHz), and PG router (500MHz)]

Breakeven point is…
PG(200MHz): 7 cycles
PG(300MHz): 10 cycles
PG(400MHz): 13 cycles
PG(500MHz): 16 cycles

Power consumption is reduced as sleep duration becomes long
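The reported breakeven points can be converted from cycles to time, and to the leakage energy that such a sleep would avoid. This is a rough sketch, assuming the gated block leaks the 52 uW listed in the parameter table and that gating removes all of that leakage; the function names are illustrative, and the full model of [Hu,ISLPED’04] is not reproduced here.

```python
LEAKAGE_POWER_UW = 52.0  # leakage power from the parameter table

# Breakeven points reported in the slides: frequency [MHz] -> cycles
REPORTED_BREAKEVEN = {200: 7, 300: 10, 400: 13, 500: 16}


def cycles_to_ns(cycles, freq_mhz):
    """Duration of `cycles` clock cycles at `freq_mhz`, in nanoseconds."""
    return cycles * 1000.0 / freq_mhz


def leakage_energy_pj(cycles, freq_mhz, leakage_uw=LEAKAGE_POWER_UW):
    """Leakage energy over a sleep of `cycles` cycles (uW x us = pJ)."""
    return leakage_uw * cycles / freq_mhz


for freq, cycles in sorted(REPORTED_BREAKEVEN.items()):
    print(f"{freq} MHz: {cycles} cycles = {cycles_to_ns(cycles, freq):.1f} ns, "
          f"{leakage_energy_pj(cycles, freq):.2f} pJ of leakage")
```

Under these assumptions the four breakeven points all correspond to roughly 32 to 35 ns of sleep, so the cycle count grows with frequency mainly because each cycle is shorter.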

slide-25
SLIDE 25

Evaluations of slow-silent VCs

  • Preliminary

– Leakage modeling of PG
– Breakeven point of PG

  • Evaluation items

– Original throughput
– Power consumption w/o PG and VFS
– Power consumption w/ PG and VFS

  • Which is the best?

– 1VC, 2VC, 3VC, and 4VC

  • Process technology

– ASPLA 90nm CMOS
– 1.00V (baseline)

  • Simulation parameters

Topology      2-D Mesh (8x8)
Routing       DOR (XY routing)
Buffer size   4-flit (WH switching)
# of VCs      1VC, 2VC, 3VC, 4VC
Latency       3-cycle per 1-hop

  • Traffic patterns

– Uniform + NPB traces (BT, SP, CG, MG, IS)

slide-26
SLIDE 26

Evaluations: Uniform (64-core) 1/4

[Figure: original throughput for 1-VC, 2-VC, 3-VC, and 4-VC]

slide-27
SLIDE 27

Evaluations: Uniform (64-core) 2/4

[Figure: power (without PG & VFS), leakage and total, for 1-VC, 2-VC, 3-VC, and 4-VC]

slide-28
SLIDE 28

Evaluations: Uniform (64-core) 3/4

[Figure: power (without PG & VFS), leakage and total, for 1-VC, 2-VC, 3-VC, and 4-VC]

Static voltage and frequency scaling

       Freq [MHz]   Voltage [V]
1VC    500.0        1.00
2VC    301.8        0.77
3VC    238.8        0.70
4VC    224.8        0.68

1) We re-characterized low-voltage libraries (0.68-0.77V) by Cadence SignalStorm
2) We confirmed that our design works at these reduced voltages
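To see how much these operating points scale the switching power, the relation P_switching = a · C · f · V² from the earlier slide can be applied to the table. This is a sketch only: it assumes the activity factor a and capacitance C stay fixed, which ignores the extra buffers that more VCs bring, so it shows the scaling factor rather than the measured power. The names below are illustrative.

```python
# VC count -> (frequency [MHz], voltage [V]) for uniform traffic, from the table
OPERATING_POINTS = {
    1: (500.0, 1.00),
    2: (301.8, 0.77),
    3: (238.8, 0.70),
    4: (224.8, 0.68),
}


def relative_switching_power(freq_mhz, volt, base_freq_mhz=500.0, base_volt=1.00):
    """f * V^2 relative to the 1VC baseline, with a and C assumed unchanged."""
    return (freq_mhz / base_freq_mhz) * (volt / base_volt) ** 2


for vcs, (freq, volt) in sorted(OPERATING_POINTS.items()):
    print(f"{vcs}VC: {relative_switching_power(freq, volt):.3f}")
```

Under that assumption the factor falls from 1.000 at 1VC to about 0.358, 0.234, and 0.208 for 2VC, 3VC, and 4VC.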

slide-29
SLIDE 29

Evaluations: Uniform (64-core) 4/4

[Figure: power (without PG & VFS) and power (with PG & VFS), leakage and total, for 1-VC, 2-VC, 3-VC, and 4-VC]

Static voltage and frequency scaling

       Freq [MHz]   Voltage [V]
1VC    500.0        1.00
2VC    301.8        0.77
3VC    238.8        0.70
4VC    224.8        0.68

4-VC is the lowest

The same results can be seen in all-to-all traffic (e.g., IS)

slide-30
SLIDE 30

Evaluations: BT traffic (64-core) 1/4

[Figure: original throughput for 1-VC, 2-VC, 3-VC, and 4-VC]

Performance improvements of 3-VC and 4-VC are small
slide-31
SLIDE 31

Evaluations: BT traffic (64-core) 2/4

[Figure: power (without PG & VFS), leakage and total, for 1-VC, 2-VC, 3-VC, and 4-VC]

Performance improvements of 3-VC and 4-VC are small
slide-32
SLIDE 32

Evaluations: BT traffic (64-core) 3/4

[Figure: power (without PG & VFS), leakage and total, for 1-VC, 2-VC, 3-VC, and 4-VC]

Static voltage and frequency scaling

       Freq [MHz]   Voltage [V]
1VC    500.0        1.00
2VC    350.1        0.82
3VC    346.2        0.82
4VC    346.1        0.82

2VC, 3VC, and 4VC are almost the same

1) We re-characterized the low-voltage library (0.82V) by Cadence SignalStorm
2) We confirmed that our design works at this reduced voltage

slide-33
SLIDE 33

Evaluations: BT traffic (64-core) 4/4

[Figure: power (without PG & VFS) and power (with PG & VFS), leakage and total, for 1-VC, 2-VC, 3-VC, and 4-VC]

Static voltage and frequency scaling

       Freq [MHz]   Voltage [V]
1VC    500.0        1.00
2VC    350.1        0.82
3VC    346.2        0.82
4VC    346.1        0.82

2-VC is the lowest

The same result can be seen in neighboring traffic (e.g., SP)

slide-34
SLIDE 34

How many VCs are best for LP?

  • All-to-all traffic

– Uniform, IS traffic
– 3 or 4 VCs are better

  • Neighboring traffic

– BT, SP traffic
– 2 VCs are enough

[Figure: power, leakage and total, for 1-VC to 4-VC: Uniform (with SVFS & PG), where 4-VC is the lowest, and BT traffic (with SVFS & PG), where 2-VC is the lowest]

It depends on the traffic pattern of the application

slide-35
SLIDE 35
Summary: Slow-silent virtual channels

  • Slow-silent virtual channels

– Adding extra VCs → Performance margin is available
– We can reduce the freq and voltage
– But adding extra VCs increases leakage power …

  • Run-time power gating of VCs

– Adaptive VC activation

  • How many VCs are required for minimizing power?

– It depends on the traffic pattern of the application
– All-to-all traffic: 3 or 4 VCs are better
– Neighboring traffic: 2 VCs are enough

slide-36
SLIDE 36
Future work: Slow-silent fat trees

  • Very “FAT” trees

– Adding more trees & voltage frequency scaling
– Run-time power gating

  • There are many types of fat trees
  • How many trees are required to minimize power?

[Figure label: fatter]

slide-37
SLIDE 37

Thank you for your attention

slide-38
SLIDE 38

Backup slides

slide-39
SLIDE 39

Wakeup delay: Performance impact

  • Wakeup delays in the literature

– ALU: 2 cycles [Tschanz,JSSC’03]
– FPMAC in Intel’s 80-tile chip: 6 cycles [Vangal,ISSCC’07]

  • Performance impact of wakeup delay (naïve mode)

[Figure: curves for Twakeup=0, Twakeup=1, Twakeup=2, and Twakeup=3]

slide-40
SLIDE 40

Look-Ahead Sleep Control

  • Look-ahead sleep control

– To mitigate the wakeup delay and short-term sleeps

  • Normal routing:

– Router i calculates the output port of Router i

  • Look-ahead routing:

– Router i calculates the output port of Router i+1

E.g., a packet goes through R3, R4, R5, and R2

[Figure: mesh of routers R0 to R8; R2 detects a packet arrival when the packet arrives at R4, so the packet will arrive after two hops]

[Timing diagram: look-ahead pipeline stages RC, SA, ST at Router 4, Router 5, and Router 2; five-cycle margin until packet arrival]

Look-ahead can eliminate a wakeup delay of less than 5 cycles

[Matsutani,ASP-DAC’08]
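The difference between normal and look-ahead routing can be sketched with XY routing, the DOR scheme listed in the simulation parameters. The 3x3 layout with routers numbered row by row, the meaning of Y- as "toward lower router numbers", and all function names are assumptions made for this illustration.

```python
MESH_WIDTH = 3  # assumed 3x3 mesh: R0..R2 in row 0, R3..R5 in row 1, R6..R8 in row 2


def xy_output_port(cur, dst, width=MESH_WIDTH):
    """Normal routing: output port at router `cur` (X first, then Y)."""
    cx, cy = cur % width, cur // width
    dx, dy = dst % width, dst // width
    if dx > cx:
        return "X+"
    if dx < cx:
        return "X-"
    if dy > cy:
        return "Y+"
    if dy < cy:
        return "Y-"
    return "CORE"  # the packet has reached its destination


def neighbor(cur, port, width=MESH_WIDTH):
    """Router reached by leaving `cur` through `port`."""
    step = {"X+": 1, "X-": -1, "Y+": width, "Y-": -width, "CORE": 0}
    return cur + step[port]


def lookahead_output_port(cur, dst, width=MESH_WIDTH):
    """Look-ahead routing: router `cur` computes the next router's output port."""
    nxt = neighbor(cur, xy_output_port(cur, dst, width), width)
    return xy_output_port(nxt, dst, width)


def path(src, dst, width=MESH_WIDTH):
    """Routers visited from `src` to `dst` under XY routing."""
    hops = [src]
    while hops[-1] != dst:
        port = xy_output_port(hops[-1], dst, width)
        hops.append(neighbor(hops[-1], port, width))
    return hops


print(path(3, 2))                   # R3, R4, R5, R2 as in the example
print(lookahead_output_port(4, 2))  # port R5 will use, already known at R4
```

With this numbering, R4 already knows that R5 will forward the packet toward R2, which is what lets R2 be woken two hops ahead of the packet.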

slide-41
SLIDE 41

Look-ahead method: HW resources

  • Routing computation of next router

– Just changing the routing function
– Area overhead is very small

  • Wakeup signals are needed

– Sender asserts “wakeup” signal to receiver
– Wakeup signals become long
– Negative impact of multi-cycle or repeater buffers

[Timing diagram: cycles 1 to 8; HEAD, DATA 1, and DATA 2 flits pass through NRC, SA, ST stages; wakeup signals to router 1]

NRC stage: Next Routing Computation

[Matsutani,ASP-DAC’08]

slide-42
SLIDE 42

VC activation: three grouping methods

  • 4VC x 1 (# of lanes is 1)

– Starting from VC#0,
– A packet moves VC#0 → VC#1 → VC#2 → VC#3

  • 2VC x 2 (# of lanes is 2)

– If (dst%2)=0: a packet moves VC#0 → VC#1
– If (dst%2)=1: a packet moves VC#2 → VC#3

  • 1VC x 4 (# of lanes is 4)

– If (dst%4)=0: a packet uses VC#0
– …
– If (dst%4)=3: a packet uses VC#3

[Figure: VC#0 to VC#3 under each grouping: all packets share one lane (4VC x 1); dst=0,2,4 and dst=1,3,5 (2VC x 2); dst=0,4, dst=1,5, dst=2,6, and dst=3,7 (1VC x 4)]

The first one (used in this paper) achieves the highest performance with the least leakage power
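The three grouping methods differ only in how many lanes the VCs are split into, so they can be expressed with one helper. This is a minimal sketch; the function name and the error handling are illustrative rather than taken from the paper.

```python
def vc_lane(dst, lanes, num_vcs=4):
    """Ordered list of VCs that a packet destined for `dst` may use.

    lanes=1 -> 4VC x 1, lanes=2 -> 2VC x 2, lanes=4 -> 1VC x 4.
    A packet starts on the first VC in the list and moves to the next one.
    """
    if lanes <= 0 or num_vcs % lanes != 0:
        raise ValueError("num_vcs must be a positive multiple of lanes")
    per_lane = num_vcs // lanes
    first = (dst % lanes) * per_lane  # the lane is chosen by dst modulo lanes
    return list(range(first, first + per_lane))


print(vc_lane(dst=5, lanes=1))  # 4VC x 1: [0, 1, 2, 3]
print(vc_lane(dst=3, lanes=2))  # 2VC x 2, (dst%2)=1: [2, 3]
print(vc_lane(dst=6, lanes=4))  # 1VC x 4, (dst%4)=2: [2]
```

With a single lane every packet starts on VC#0, so the higher-numbered VCs are used only under contention.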

slide-43
SLIDE 43

Buffer design: Registers or SRAMs

  • It depends on buffer depth, not width

– Depth > 32-flit → Buffers are designed with SRAMs
– Otherwise → Buffers are designed with registers

[Figure: router with X+, X-, Y+, Y-, CORE input and output ports, five FIFOs, an arbiter, and a 5x5 XBAR]

In our design: Buffer depth is 4-flit → FIFO buffers are designed with registers