Adding Slow-Silent Virtual Channels for Low-Power On-Chip Networks
Hiroki Matsutani (Keio Univ, Japan), Michihiro Koibuchi (NII, Japan), Daihan Wang (Keio Univ, Japan), Hideharu Amano (Keio Univ, Japan)
I am very sorry…
- My flight was canceled on April 6.
- I waited at the airport for seven hours for rebooking, but I couldn’t get a ticket. I got a fever.
- I arrived at Newcastle on April 7.
- I couldn’t find my baggage; I had only the shirt I was wearing.
- My hotel reservation was canceled without asking; I didn’t have a place to sleep…
- I went to another hotel to book a room, in my shirt sleeves in the rain. My fever went up.
- Ms. Jerder kindly gave the presentation on Apr 8.
- I would like to thank her and the ASYNC/NOCS program committee.
Power gating; voltage and frequency scaling
Introduction: Area and power
- Due to finer process technology,
  – The area constraint is relaxed
  – But power density becomes more serious
- Adding extra hardware resources (e.g., VCs)
  – We can get a performance margin; so
  – We can reduce voltage and frequency to reduce power
[Figure: Routers (a), (b), and (c), each with VC#0, VC#1, and VC#2]
Issues to be tackled in this presentation
- Adding extra hardware increases the leakage power
- How much resource is required to minimize total power?
Outline: Slow-silent virtual channels
- Network-on-Chip (NoC)
- On-Chip Router
  – Architecture and its power consumption
- Slow-silent virtual channels
  – Voltage and frequency scaling
  – Run-time power gating of virtual channels
  – Adaptive VC activation
- Evaluations (1VC, 2VC, 3VC, and 4VC)
  – Throughput
  – Power consumption (with PG & voltage/frequency scaling)
  – How many VCs are required to minimize power
Network-on-Chip (NoC)
[Figure: An example tile architecture (ASPLA 90nm CMOS) with a processor core and a router]
- Processor core
  – Largest component
  – Various low-power techniques are used
- On-chip router
  – Area is not so large
  – Always prepared (active) for packet injection
[Ishikawa,IEICE’05] e.g., standby current of 11 uA
The next slides show “Router architecture” and “Its power”
On-Chip Router: Architecture
- 5-input 5-output router (data width is 64 bits)
[Figure: Router block diagram with input ports X+, X-, Y+, Y-, CORE, one FIFO per input, a 5x5 XBAR with an ARBITER, and output ports X+, X-, Y+, Y-, CORE]
- Each physical channel has 2 VCs
- Each VC has a FIFO buffer (4 x 64 bits)
- HW amount is 34 kilo gates, and 64% of the area is used for FIFOs
On-Chip Router: Pipeline
- A header flit goes through a router in 3 cycles
  – RC (Routing computation)
  – VSA (Virtual channel / Switch allocation)
  – ST (Switch traversal)
- E.g., packet transfer from router A to C
  – A packet consists of a header and 3 data flits
[Figure: Pipeline timing, elapsed time 1 to 12 cycles. At each of router A, router B, and router C the packet occupies the stages RC, VSA, ST, ST, ST, ST (HEAD, DATA 1, DATA 2, DATA 3)]
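The timing in the pipeline example above can be checked with a short calculation. This is a minimal sketch, assuming no contention, a header that spends three cycles (RC, VSA, ST) in every router, and data flits that follow the header one per cycle; the function name and parameters are illustrative and do not come from the slides.

```python
def packet_latency_cycles(routers: int, flits: int, cycles_per_router: int = 3) -> int:
    """Cycles until the last flit leaves the last router on the path.

    Assumptions (not stated in the slides): no contention, and data flits
    stream behind the header at one flit per cycle.
    """
    if routers < 1 or flits < 1:
        raise ValueError("routers and flits must be at least 1")
    # The header pays the full pipeline depth at every router; each of the
    # remaining flits adds one switch-traversal cycle at the end.
    return cycles_per_router * routers + (flits - 1)


# Routers A, B, C and a packet of 1 header + 3 data flits.
print(packet_latency_cycles(routers=3, flits=4))
```

Under these assumptions the example yields 12 cycles, which is consistent with the 12-cycle time axis of the figure.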
On-Chip Router: Power consumption
- Placed and routed with 90nm CMOS
- Post-layout simulation at 200MHz
[Figure: Power consumption of a router when n ports are used [mW]]
- A router consumes more power as it processes more packets
- Packet switching power is large → Voltage and frequency scaling
[Figure: Standby power of the on-chip router (power consumption when no port is used): Leakage (55.0%), Dynamic (45.0%); Channels (49.4%)]
- Leakage of the channel buffers is the largest → Run-time power gating
Slow-Silent Virtual Channels
- Adding extra VCs
  – Performance improves
  – We can reduce voltage and frequency
- Voltage & frequency scaling (VFS)
  – Set the reduced voltage and frequency
  – In response to the performance margin
  $f \propto \frac{(V - V_{th})^{\alpha}}{C V}$
  $P_{switching} = a \cdot C \cdot f \cdot V^{2}$
- Problem
  – Adding extra VCs increases leakage power
  – It may overwhelm VFS
[Figure: Latency vs. accepted traffic for 1-VC, 2-VC, 3-VC, and 4-VC, showing the performance margin]
We focus on run-time power gating of VCs to reduce leakage
Power Gating of virtual channels
[Figure: Router with a 5x5 XBAR and ARBITER, ports X+, X-, Y+, Y-, CORE, and a sleep signal for each channel buffer]
- Run-time power gating of virtual channels
  – No packets in a VC → Sleep (turn off the power supply)
  – Packet arrives at the VC → Wakeup (turn on the power)
- Link shutdown has been studied for on- & off-chip networks, but prior work uses SRAM buffers [Chen,ISLPED’03] [Soteriou,TPDS’07]
- We use small register-based FIFOs for light-weight NoC routers
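The sleep and wakeup rule above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class `GatedVC`, its two-cycle `wakeup_delay`, and the "offer, then tick" cycle ordering are hypothetical choices for illustration, not the authors' hardware design.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class GatedVC:
    """Behavioral model of one power-gated VC FIFO (illustrative only)."""
    depth: int = 4              # 4-flit buffer, as in the slides
    wakeup_delay: int = 2       # hypothetical wakeup latency in cycles
    asleep: bool = True         # an empty VC starts powered off
    wake_countdown: int = 0
    fifo: deque = field(default_factory=deque)

    def offer(self, flit) -> bool:
        """Try to enqueue a flit; False means the sender stalls this cycle."""
        if self.asleep:
            # Packet arrives at a sleeping VC: start the wakeup sequence.
            self.asleep = False
            self.wake_countdown = self.wakeup_delay
        if self.wake_countdown > 0 or len(self.fifo) >= self.depth:
            return False
        self.fifo.append(flit)
        return True

    def pop(self):
        """Dequeue the oldest flit (switch traversal)."""
        return self.fifo.popleft()

    def tick(self) -> None:
        """End of cycle: advance the wakeup counter, or sleep when empty."""
        if self.wake_countdown > 0:
            self.wake_countdown -= 1
        elif not self.fifo and not self.asleep:
            self.asleep = True


vc = GatedVC()
accepted = []
for _ in range(3):
    accepted.append(vc.offer("head"))
    vc.tick()
print(accepted)  # the sender stalls while the VC is waking up
```

The stalled cycles in this model correspond to the pipeline stall caused by the wakeup delay; the model sleeps as soon as the FIFO drains, which is the naive policy rather than a tuned one.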
Power Gating: Various overheads
- Area overhead
  – Power switches
- Performance overhead
  – Wakeup delay
  – Pipeline stall is caused
- Power overhead
  – Driving power switches
  – Short sleeps adversely increase dynamic power
[Figure: Power gating circuit: sleep signal, Vdd, power switch, virtual Vdd, circuit block, GND]
[Figure: FIFO going from sleep to active; while waiting for channel wakeup, a pipeline stall of the router occurs]
Frequent on/off should be avoided
→ Control that gradually activates VCs in response to workload
Power Gating: VC activation policy
- Virtual channel (VC) level power gating
- Virtual-channel selection:
  – All packets use VC#0 when they are injected into the NoC
  – VC number is increased when the packet conflicts
[Figure: Routers (a), (b), and (c), each with VC#0, VC#1, and VC#2]
- Only VC#0 is used if the workload is low
- All VCs are activated if the workload is high
High peak performance of VCs with the least leakage power
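The selection rule above (start at VC#0, move to a higher VC number on conflict) can be sketched as a small function. The name `select_vc` and the `busy` list are hypothetical, and the sketch assumes that a packet never moves back to a lower VC number.

```python
from typing import List, Optional


def select_vc(current_vc: int, busy: List[bool]) -> Optional[int]:
    """Pick the VC a packet requests on its next channel.

    busy[i] is True when VC#i of that channel is occupied. The packet keeps
    its current VC number when possible and moves to the next higher number
    on conflict; None means every candidate is busy and the packet waits.
    """
    for vc in range(current_vc, len(busy)):
        if not busy[vc]:
            return vc
    return None


# Low workload: only VC#0 is ever requested, so VC#1 and VC#2 can stay asleep.
print(select_vc(0, [False, False, False]))
# Conflict on VC#0: the packet moves up to VC#1.
print(select_vc(0, [True, False, False]))
```

Because higher-numbered VCs are requested only after a conflict, they stay idle at low load, which is what lets them remain power gated.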
Power Gating: Routing design
- A virtual-channel layer
  – A virtual network consisting of VCs with the same VC#
- Deadlock-freedom
  – Packets move from upper to lower layers
  – Only the bottom layer must guarantee deadlock-freedom
[Figure: Routers (a), (b), and (c), each with VC#0 to VC#3; VCs with the same number form VC Layer #0 to VC Layer #3] [Duato,TPDS’93] [Koibuchi,ICPP’03]
All VC layers except for the bottom one can employ any routing, as long as the bottom layer guarantees deadlock-freedom by itself
Evaluations of slow-silent VCs
- Preliminary
  – Leakage modeling of PG
  – Breakeven point of PG
- Evaluation items
  – Original throughput
  – Power consumption w/o PG and VFS
  – Power consumption w/ PG and VFS
- Which is the best?
  – 1VC, 2VC, 3VC, and 4VC
- Process technology
  – ASPLA 90nm CMOS
  – 1.00V (baseline)
- Traffic patterns
  – Uniform + NPB traces (BT, SP, CG, MG, IS)
- Simulation parameters
  Topology      2-D Mesh (8x8)
  Routing       DOR (XY routing)
  Buffer size   4-flit (WH switching)
  # of VCs      1VC, 2VC, 3VC, 4VC
  Latency       3 cycles per hop
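For readers unfamiliar with dimension-order (XY) routing on the 8x8 mesh listed above, a minimal sketch follows. The port names X+, X-, Y+, Y-, and CORE come from the router figure; the coordinate convention (X+ increases x, Y+ increases y) and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

MESH_SIZE = 8  # 8x8 2-D mesh, as in the simulation parameters


def xy_next_port(cur: Tuple[int, int], dst: Tuple[int, int]) -> str:
    """Output port under XY routing: finish the X dimension, then Y."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return "X+" if dx > cx else "X-"
    if dy != cy:
        return "Y+" if dy > cy else "Y-"
    return "CORE"  # arrived: eject to the local core


def xy_ports(src: Tuple[int, int], dst: Tuple[int, int]) -> List[str]:
    """List of output ports taken from src to dst, ending with CORE."""
    for x, y in (src, dst):
        if not (0 <= x < MESH_SIZE and 0 <= y < MESH_SIZE):
            raise ValueError("coordinate outside the mesh")
    step = {"X+": (1, 0), "X-": (-1, 0), "Y+": (0, 1), "Y-": (0, -1)}
    ports, cur = [], src
    while True:
        port = xy_next_port(cur, dst)
        ports.append(port)
        if port == "CORE":
            return ports
        cur = (cur[0] + step[port][0], cur[1] + step[port][1])


print(xy_ports((1, 1), (0, 3)))
```

Since a packet never turns from the Y dimension back to X, this routing is deadlock-free on a mesh by itself, which is the property the bottom VC layer needs.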
Preliminary: Leakage power modeling
- Power gating model [Hu,ISLPED’04]
  – Eoverhead: Power consumed for turning the power switch (PS) on/off
  – Esaved: Leakage power saving for an N-cycle sleep
- How many cycles of sleep are required to compensate for Eoverhead?
- We calculate the breakeven point of PG based on the following parameters
  Supply voltage            1.0 V
  Switching factor          0.12
  Leakage power             52 uW
  Dynamic power (200MHz)    78 uW
  Dynamic power (500MHz)    194 uW
  Power switch size ratio   0.1
  Power switch cap ratio    0.5
  Based on the post-layout simulation of the on-chip router (90nm CMOS)
[Figure: Power consumption vs. sleep duration for no power gating (PG) and for the PG router at 200MHz, 300MHz, 400MHz, and 500MHz]
- Breakeven point is…
  PG (200MHz): 7 cycles
  PG (300MHz): 10 cycles
  PG (400MHz): 13 cycles
  PG (500MHz): 16 cycles
- Power consumption is reduced as the sleep duration becomes longer
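The breakeven idea can be made concrete with a simplified calculation: sleeping pays off once the leakage energy saved over N cycles exceeds the switching overhead. The sketch below uses the 52 uW leakage power from the parameter table, but `E_OVERHEAD_J` is a hypothetical value chosen here so that this simple model lands on the breakeven points listed above; it is not a figure reported in the slides, and the model in [Hu,ISLPED’04] has more terms. The sketch also assumes that gating removes all leakage.

```python
import math

LEAKAGE_POWER_W = 52e-6   # leakage power from the parameter table (52 uW)
E_OVERHEAD_J = 1.6e-12    # HYPOTHETICAL on/off overhead energy, for illustration


def breakeven_cycles(freq_hz: float,
                     leakage_w: float = LEAKAGE_POWER_W,
                     overhead_j: float = E_OVERHEAD_J) -> int:
    """Smallest N such that the leakage saved in N cycles covers the overhead."""
    saved_per_cycle_j = leakage_w / freq_hz  # energy = power x cycle time
    return math.ceil(overhead_j / saved_per_cycle_j)


for mhz in (200, 300, 400, 500):
    print(mhz, "MHz:", breakeven_cycles(mhz * 1e6), "cycles")
```

The trend is the useful part: a faster clock shortens each cycle, so more cycles of sleep are needed to recover the same overhead energy.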
Evaluations: Uniform (64-core)
[Figure 1/4: Original throughput for 1-VC, 2-VC, 3-VC, and 4-VC]
[Figure 2/4: Power (without PG & VFS), leakage and total, for 1-VC, 2-VC, 3-VC, and 4-VC]
[Figure 3/4: Power (without PG & VFS), leakage and total, with the operating points below]
Static voltage and frequency scaling
        Freq [MHz]   Voltage [V]
  1VC   500.0        1.00
  2VC   301.8        0.77
  3VC   238.8        0.70
  4VC   224.8        0.68
1) We re-characterized low-voltage libraries (0.68-0.77V) with Cadence SignalStorm
2) We confirmed that our design works at these reduced voltages
[Figure 4/4: Power (without PG & VFS) and power (with PG & VFS), leakage and total, for 1-VC, 2-VC, 3-VC, and 4-VC]
4-VC is the lowest
The same results can be seen in all-to-all traffic (e.g., IS)
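To see how much the operating points in the table above reduce switching power, the relation P_switching = a · C · f · V^2 from the earlier slide can be applied as a ratio against the 1VC baseline. This is a rough sketch that assumes the activity factor a and the capacitance C are unchanged across configurations, which the slides do not state; it also ignores leakage entirely.

```python
BASE_FREQ_MHZ = 500.0
BASE_VOLT = 1.00

# (frequency [MHz], voltage [V]) taken from the uniform-traffic table.
OPERATING_POINTS = {
    "1VC": (500.0, 1.00),
    "2VC": (301.8, 0.77),
    "3VC": (238.8, 0.70),
    "4VC": (224.8, 0.68),
}


def relative_switching_power(freq_mhz: float, volt: float) -> float:
    """P/P_base for P = a*C*f*V^2 with a and C assumed constant."""
    return (freq_mhz / BASE_FREQ_MHZ) * (volt / BASE_VOLT) ** 2


for name, (freq, volt) in OPERATING_POINTS.items():
    print(f"{name}: {relative_switching_power(freq, volt):.3f}")
```

The ratios only capture the dynamic part; the measured totals in the figures also include leakage, which is where power gating matters.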
Evaluations: BT traffic (64-core)
[Figure 1/4: Original throughput for 1-VC, 2-VC, 3-VC, and 4-VC]
Performance improvements of 3-VC and 4-VC are small
[Figure 2/4: Power (without PG & VFS), leakage and total, for 1-VC, 2-VC, 3-VC, and 4-VC]
[Figure 3/4: Power (without PG & VFS), leakage and total, with the operating points below]
Static voltage and frequency scaling
        Freq [MHz]   Voltage [V]
  1VC   500.0        1.00
  2VC   350.1        0.82
  3VC   346.2        0.82
  4VC   346.1        0.82
The operating points of 2VC, 3VC, and 4VC are almost the same
1) We re-characterized the low-voltage library (0.82V) with Cadence SignalStorm
2) We confirmed that our design works at this reduced voltage
[Figure 4/4: Power (without PG & VFS) and power (with PG & VFS), leakage and total, for 1-VC, 2-VC, 3-VC, and 4-VC]
2-VC is the lowest
The same result can be seen in neighboring traffic (e.g., SP)
How many VCs are best for low power (LP)?
- All-to-all traffic
  – Uniform, IS traffic
  – 3 or 4 VCs are better
- Neighboring traffic
  – BT, SP traffic
  – 2 VCs are enough
[Figure: Uniform (with SVFS & PG), leakage and total, for 1-VC to 4-VC: 4-VC is the lowest]
[Figure: BT traffic (with SVFS & PG), leakage and total, for 1-VC to 4-VC: 2-VC is the lowest]
It depends on the traffic pattern of the application
Summary: Slow-silent virtual channels
- Slow-silent virtual channels
  – Adding extra VCs → Performance margin is available
  – We can reduce the frequency and voltage
  – But adding extra VCs increases leakage power …
- Run-time power gating of VCs
  – Adaptive VC activation
- How many VCs are required for minimizing power?
  – It depends on the traffic pattern of the application
  – All-to-all traffic: 3 or 4 VCs are better
  – Neighboring traffic: 2 VCs are enough
Future work: Slow-silent fat trees
- Very “FAT” trees
  – Adding more trees & voltage and frequency scaling
  – Run-time power gating
- There are many types of fat trees
- How many trees are required to minimize power?
[Figure: fat-tree diagram, labeled “fatter”]
Thank you for your attention
Backup slides
Wakeup delay: Performance impact
- Wakeup delays in the literature
  – ALU: 2 cycles [Tschanz,JSSC’03]
  – FPMAC in Intel’s 80-tile chip: 6 cycles [Vangal,ISSCC’07]
- Performance impact of wakeup delay (naïve mode)
[Figure: Performance for Twakeup=0, Twakeup=1, Twakeup=2, and Twakeup=3]
Look-Ahead Sleep Control
- Look-ahead sleep control
  – To mitigate the wakeup delay and short-term sleeps
- Normal routing:
  – Router i calculates the output port of Router i
- Look-ahead routing:
  – Router i calculates the output port of Router i+1
- E.g., a packet goes through R3, R4, R5, and R2
  – R2 detects a packet arrival when the packet arrives at R4
  – The packet will arrive after two hops: five-cycle margin until packet arrival
[Figure: Routers R0 to R8 and the look-ahead pipeline timing (RC, SA, ST stages) at Router 4, Router 5, and Router 2]
Look-ahead can eliminate a wakeup delay of less than 5 cycles
[Matsutani,ASP-DAC’08]
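The R3, R4, R5, R2 example can be reproduced with a small look-ahead routing sketch. It assumes a 3x3 mesh with row-major router numbering (R0 to R8), XY routing, and the convention that Y+ increases the row index; these are assumptions made to match the example path, and the function names are hypothetical.

```python
from typing import Tuple

WIDTH = 3  # assumed 3x3 mesh, routers numbered row by row (R0..R8)
STEP = {"X+": 1, "X-": -1, "Y+": WIDTH, "Y-": -WIDTH, "CORE": 0}


def xy_port(cur: int, dst: int) -> str:
    """Normal routing: output port chosen at router `cur` (X first, then Y)."""
    cx, cy = cur % WIDTH, cur // WIDTH
    dx, dy = dst % WIDTH, dst // WIDTH
    if dx != cx:
        return "X+" if dx > cx else "X-"
    if dy != cy:
        return "Y+" if dy > cy else "Y-"
    return "CORE"


def look_ahead(cur: int, dst: int) -> Tuple[str, int]:
    """Look-ahead routing: router `cur` computes the output port of the NEXT
    router on the path and the router behind that port, which can be sent a
    wakeup signal two hops before the packet arrives."""
    nxt = cur + STEP[xy_port(cur, dst)]
    nxt_port = xy_port(nxt, dst)
    return nxt_port, nxt + STEP[nxt_port]


# Packet from R3 to R2: when it is at R4, the port of R5 is already known,
# so R2 can be woken up early.
print(look_ahead(4, 2))
```

In this model the early wakeup target at R4 is R2, matching the statement that R2 detects the arrival while the packet is still at R4.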
Look-ahead method: HW resources
- Routing computation of the next router
  – Just changing the routing function
  – Area overhead is very small
- Wakeup signals are needed
  – Sender asserts a “wakeup” signal to the receiver
  – Wakeup signal wires become long
  – Negative impact of multi-cycle transfers or repeater buffers
[Figure: Pipeline timing over cycles 1 to 8 with stages NRC, SA, ST, ST, ST (HEAD, DATA 1, DATA 2) at three routers, and the wakeup signals to router 1]
NRC stage: Next Routing Computation
[Matsutani,ASP-DAC’08]
VC activation: three grouping methods
- 4VC x 1 (# of lanes is 1)
  – Starting from VC#0,
  – A packet moves VC#0 → VC#1 → VC#2 → VC#3
- 2VC x 2 (# of lanes is 2)
  – If (dst%2)=0: a packet moves VC#0 → VC#1
  – If (dst%2)=1: a packet moves VC#2 → VC#3
- 1VC x 4 (# of lanes is 4)
  – If (dst%4)=0: a packet uses VC#0
  – …
  – If (dst%4)=3: a packet uses VC#3
[Figure: VC#0 to VC#3 under the three groupings: all packets share one lane; dst=0,2,4 and dst=1,3,5 use two lanes; dst=0,4, dst=1,5, dst=2,6, and dst=3,7 use four lanes]
The first one (used in this paper) achieves the highest performance with the least leakage power
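The three grouping methods differ only in how the destination selects a lane. A minimal sketch, assuming four VCs per channel split evenly among lanes; the function name `vc_sequence` is hypothetical:

```python
from typing import List


def vc_sequence(dst: int, lanes: int, total_vcs: int = 4) -> List[int]:
    """VCs a packet may use, in activation order, for a given destination.

    lanes=1 -> 4VC x 1, lanes=2 -> 2VC x 2, lanes=4 -> 1VC x 4.
    """
    if lanes < 1 or total_vcs % lanes != 0:
        raise ValueError("total_vcs must be divisible by lanes")
    per_lane = total_vcs // lanes
    first = (dst % lanes) * per_lane  # lane chosen by dst modulo lane count
    return list(range(first, first + per_lane))


print(vc_sequence(dst=5, lanes=1))  # 4VC x 1: every packet starts at VC#0
print(vc_sequence(dst=5, lanes=2))  # 2VC x 2: odd destinations use VC#2, VC#3
print(vc_sequence(dst=5, lanes=4))  # 1VC x 4: dst%4 = 1 uses VC#1 only
```

With a single lane, every packet starts in VC#0, so the higher VCs are woken only under contention; with more lanes, even light traffic spreads over several VCs and keeps them awake.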
Buffer design: Registers or SRAMs
- It depends on buffer depth, not width
  – Depth > 32 flits → Buffers are designed with SRAMs
  – Otherwise → Buffers are designed with registers
[Figure: Router block diagram with input ports X+, X-, Y+, Y-, CORE, one FIFO per input, a 5x5 XBAR with an ARBITER, and output ports X+, X-, Y+, Y-, CORE]
In our design: Buffer depth is 4 flits → FIFO buffers are designed with registers