A Variable-pipeline On-chip Router Optimized to Traffic Pattern
Yuto Hirata (Keio University), Hiroki Matsutani (University of Tokyo), Michihiro Koibuchi (National Institute of Informatics), Hideharu Amano (Keio University), Japan
Outline
- The NoC is the heart of a many-core processor
- The router pipeline affects performance and power
– Various pipeline structures exist
- Trade-off between latency, throughput, and power
- We propose a variable-pipeline router (1-, 2-, and 3-cycle modes)
– 1cycle mode has the lowest latency
– 2cycle mode is better at throughput and power
– 3cycle mode is used to avoid hotspots
Our target region
[Figure: number of PEs (caches not included; axis from 4 to 256) versus year (2002 to 2010?). Chips plotted: MIT RAW, STI Cell BE, Sun T1, Sun T2, TILERA TILE64, Intel Core, IBM Power7, AMD Opteron, Intel 80-core, ClearSpeed CSX600, ClearSpeed CSX700, picoChip PC102, picoChip PC205, UT TRIPS (OPN). Region labels: "Hundreds of simple PEs", "Chip multi-processor (CMP)", and "Target".]
Our target: NoC for future CMPs
- 8-CPU CMP example
– 8 CPUs (each has a private L1 cache)
– Shared L2 cache (divided into 64 banks)
[Figure: tiled CMP with UltraSPARC cores, L1 caches (I & D, 16kB), and L2 cache banks (256kB, 4-way) [Beckmann, MICRO’04].]
Table of Contents
- Trade-off Problem of Router Structures
- Solution: Variable-pipeline Router
- Evaluation
- Related Work
- Conclusions
Conventional On-chip Router
- Modules
– Input channel
– Crossbar switch
– Output channel
- Pipeline stages
– Routing computation (RC)
– VC allocation (VA)
– Switch allocation (SA)
– Switch traversal (ST)
[Figure: pipeline timing over cycles 1 to 7. Head flit: RC, VA, SA, ST. Body flit 1, body flit 2, and tail flit: SA, ST each.]
[Figure: router block diagram with input channels (North, East, West, South, Core), an arbiter, a crossbar (X-bar), and output channels with output buffers (North, East, West, South, Core).]
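The 7-cycle timing of a 4-flit packet can be reproduced with a minimal Python sketch. It assumes, as is conventional for wormhole routers but not stated on the slide, that only the head flit performs RC and VA and that each following flit enters SA one cycle after the previous flit. The function name `flit_schedule` is hypothetical.

```python
def flit_schedule(num_flits):
    """Return {flit_index: {stage: cycle}} for one packet in a 4-stage router.

    Assumption (not from the slides): the head flit runs RC, VA, SA, ST in
    consecutive cycles; every later flit skips RC/VA and performs SA one
    cycle after the previous flit's SA, followed by ST.
    """
    schedule = {}
    for i in range(num_flits):
        if i == 0:
            # Head flit goes through all four stages.
            schedule[i] = {"RC": 1, "VA": 2, "SA": 3, "ST": 4}
        else:
            # Body and tail flits reuse the head flit's route and VC.
            sa = schedule[i - 1]["SA"] + 1
            schedule[i] = {"SA": sa, "ST": sa + 1}
    return schedule


if __name__ == "__main__":
    for flit, stages in flit_schedule(4).items():
        print(flit, stages)
```

Under these assumptions the tail flit of a 4-flit packet finishes ST in cycle 7, which matches the cycle axis of the timing diagram.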
1cycle Router Pipeline
- Trade-off between router pipeline structures
- 1cycle pipeline structure
Good: 1-cycle transfer → lowest communication latency
Weak: sequential execution of the NRC/VSA and ST stages → lowest frequency and throughput
[Figure: 1cycle router pipeline between links, showing the NRC, VA, SA, and ST stages.]
2cycle Router Pipeline
- 2cycle pipeline structure
Good: NRC and VSA (VA + SA) are executed in parallel → highest frequency
Modest: 2-cycle transfer → shorter communication latency
[Figure: 2cycle router pipeline between links, showing the NRC, VA, SA, and ST stages.]
3cycle Router Pipeline
- 3cycle pipeline structure
Good: adaptive routing (Duato’s protocol) → avoids hotspots
Modest: medium frequency
Weak: 3-cycle transfer → large communication latency in cycles
[Figure: 3cycle router pipeline between links, showing the RC, VA1, VA2, SA1, SA2, and ST stages.]
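The idea behind adaptive routing in the style of Duato's protocol can be sketched in Python. This is a simplified illustration, not the authors' implementation: the port names, the coordinate convention (N means increasing y), and the `free_adaptive_vcs` interface are all assumptions, and the availability of the escape VC itself is not modeled.

```python
def productive_ports(cur, dst):
    """Output ports that move a packet closer to dst on a 2-D mesh."""
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx > cx:
        ports.append("E")
    elif dx < cx:
        ports.append("W")
    if dy > cy:
        ports.append("N")
    elif dy < cy:
        ports.append("S")
    return ports


def select_output(cur, dst, free_adaptive_vcs):
    """Pick (port, vc_class) for a packet at router `cur`.

    free_adaptive_vcs maps a port name to its number of free adaptive VCs
    (a hypothetical interface). Adaptive VCs may be used on any productive
    port. If none is free, the packet takes the escape VC on the
    dimension-order (X first, then Y) port, so the escape network stays
    deadlock-free.
    """
    ports = productive_ports(cur, dst)
    if not ports:
        return ("Core", "eject")
    # Prefer the productive port with the most free adaptive VCs.
    best = max(ports, key=lambda p: free_adaptive_vcs.get(p, 0))
    if free_adaptive_vcs.get(best, 0) > 0:
        return (best, "adaptive")
    # productive_ports lists the X-dimension port first, so ports[0] is DOR.
    return (ports[0], "escape")
```

A packet heading toward a congested port can take the other productive direction when an adaptive VC is free there, which is how this mode steers traffic around a hotspot.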
Trade-off of Pipeline Structures
- Pros (red) and cons (blue) of each pipeline structure
- The optimal pipeline depends on the traffic requirements
Pipeline depth                   Operating freq.   Throughput   Latency
1cycle (deterministic routing)   Low               Low          Lowest
2cycle (deterministic routing)   High              High         Lower
3cycle (adaptive routing)        Mid               Mid          High
→ Our solution: switch the pipeline structures dynamically
Variable-pipeline (VP) Router
- Using DVFS
– High throughput by increasing freq. and voltage
  - Increasing the number of pipeline stages
– Low latency by decreasing freq. and voltage
  - Decreasing the number of pipeline stages
- The local processor changes the router pipeline structure for each application
Mode     Routing                       Purpose
3cycle   Duato’s protocol (adaptive)   Avoiding hotspots
2cycle   DOR (deterministic)           High throughput and low power
1cycle   DOR (deterministic)           Low latency
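The mode table can be captured as a small lookup structure. The sketch below is in Python. The routing and purpose strings come from the table, but the `choose_mode` policy and its two boolean inputs are hypothetical: the slides say the local processor selects a mode per application without giving the decision rule.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ONE_CYCLE = 1
    TWO_CYCLE = 2
    THREE_CYCLE = 3


@dataclass(frozen=True)
class ModeInfo:
    routing: str
    purpose: str


# Routing algorithm and purpose of each mode, as listed in the table.
MODES = {
    Mode.ONE_CYCLE: ModeInfo("DOR (deterministic)", "low latency"),
    Mode.TWO_CYCLE: ModeInfo("DOR (deterministic)", "high throughput and low power"),
    Mode.THREE_CYCLE: ModeInfo("Duato's protocol (adaptive)", "avoiding hotspots"),
}


def choose_mode(has_hotspot, latency_sensitive):
    """Hypothetical per-application policy (not specified in the slides)."""
    if has_hotspot:
        return Mode.THREE_CYCLE
    if latency_sensitive:
        return Mode.ONE_CYCLE
    return Mode.TWO_CYCLE
```

For example, `choose_mode(False, True)` returns `Mode.ONE_CYCLE`, and `MODES[Mode.ONE_CYCLE].routing` gives the routing algorithm used in that mode.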
Design of Variable-pipeline Router
- Each mode uses a different path
[Figure: datapath from the input channel to the output channel through the RC, NRC, and VA/SA modules, shown in turn for 1cycle mode, 2cycle mode, and 3cycle mode.]
- 5 ports (North, East, West, South + Core) for a 2-D mesh/torus
- Flit width: 66 bits (64-bit data + 2-bit flit type)
– Packet size is variable
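The 66-bit flit layout can be illustrated with a short Python sketch. The slide gives only the field widths. Placing the type field in the top two bits and the meaning of each type code are assumptions made for illustration.

```python
DATA_BITS = 64
TYPE_BITS = 2
FLIT_BITS = DATA_BITS + TYPE_BITS  # 66 bits in total

# Hypothetical encoding of the 2-bit flit type field.
HEAD, BODY, TAIL, HEAD_TAIL = 0, 1, 2, 3


def pack_flit(flit_type, data):
    """Pack a flit type and a 64-bit payload into one 66-bit integer.

    Assumption: the type occupies the two most significant bits.
    """
    if not 0 <= flit_type < (1 << TYPE_BITS):
        raise ValueError("flit type must fit in 2 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit in 64 bits")
    return (flit_type << DATA_BITS) | data


def unpack_flit(flit):
    """Split a 66-bit flit back into (flit_type, data)."""
    return flit >> DATA_BITS, flit & ((1 << DATA_BITS) - 1)
```

Because the packet size is variable, a packet in this sketch would be any sequence of flits that starts with a `HEAD` flit and ends with a `TAIL` flit, or a single `HEAD_TAIL` flit.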
- Reconfiguration takes only a single cycle
– when no packets arrive
Most modules are shared by the different modes
Shared by 1-, 2-, and 3-cycle modes: input buffers, NRC (RC), and VA/SA modules (the input buffers dominate the router area)
Shared by 2- and 3-cycle modes: output latch in the output port
[Figure: area of the 3cycle router in kGates (axis from 10 to 70), broken down into input channel, crossbar, and other.]
Fault Tolerance for RC module
- When routers A and B are running in 1- or 2-cycle mode with look-ahead (NRC),
– if the NRC of router B fails, router C executes both RC and NRC
[Figure: routers A, B, and C, each with RC and NRC modules. The packet from A to B includes the NRC information; the packet from B to C has no NRC information, so router C executes both RC and NRC.]
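The fallback can be sketched in Python as follows. This is an illustrative model rather than the authors' design: the packet field names, the coordinate convention, and the use of X-then-Y dimension-order routing for both RC and NRC are assumptions.

```python
def dor_port(cur, dst):
    """Dimension-order (X then Y) output port on a 2-D mesh."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return "E" if dx > cx else "W"
    if dy != cy:
        return "N" if dy > cy else "S"
    return "Core"


# Coordinate offset of the neighbor reached through each port.
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def route_at(router, packet, nrc_ok=True):
    """Return the output port used at `router` and update the look-ahead field.

    packet["next_port"] holds the port precomputed by the upstream router's
    NRC, or None if that NRC failed (hypothetical field names).
    """
    port = packet["next_port"]
    if port is None:
        # Upstream NRC failed: fall back to a normal RC at this router.
        port = dor_port(router, packet["dst"])
    if port == "Core" or not nrc_ok:
        # Nothing to look ahead to, or this router's NRC is faulty.
        packet["next_port"] = None
    else:
        ox, oy = STEP[port]
        downstream = (router[0] + ox, router[1] + oy)
        # NRC: precompute the port the downstream router will use.
        packet["next_port"] = dor_port(downstream, packet["dst"])
    return port
```

With routers A, B, and C at (0, 0), (1, 0), and (2, 0), calling `route_at` at B with `nrc_ok=False` clears the look-ahead field, so the call at C runs `dor_port` for itself (RC) and then again for its downstream neighbor (NRC).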
Evaluation
- RTL simulation
– NC-Verilog 8.1
- Design synthesis
– Design Compiler 2007.12-SP3
– Nangate 45nm library, typical (1.2V, 25℃)
- Network simulation
– GEMS/Simics
  - Full-system simulator
– Flit-level network simulator
- Application
– 9 benchmarks from SPLASH-2
Target 8-core Processor
Sun Solaris 9, Sun Studio 12
Routers are connected by a 2-dimensional mesh
[Figure: tiled 8-core processor with on-chip routers, UltraSPARC cores, L1 caches (I & D, 16kB), and L2 cache banks (256kB, 4-way).]
Evaluation Items
1. Hardware synthesis results
– Area (kGates), frequency (MHz), power (bit/J)
2. Full-system CMP simulation results (GEMS/Simics simulator; SPLASH-2 benchmark)
– Application execution time
– Collect the packet traces
3. Average hop count for each trace
– Get the zero-load latency data
4. Network simulation results using the packet traces
– Maximum throughput and power consumption
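How an average hop count can be turned into a zero-load latency is sketched below in Python. The model is a common textbook approximation and an assumption here, since the slides do not state the formula used. The function names, the one-cycle link delay, and the Manhattan-distance hop count (minimal routing on a mesh) are likewise assumptions.

```python
def average_hops(trace):
    """Mean hop count over (src, dst) coordinate pairs of a packet trace.

    Assumption: minimal routing on a 2-D mesh, so hops = Manhattan distance.
    """
    total = sum(abs(sx - dx) + abs(sy - dy) for (sx, sy), (dx, dy) in trace)
    return total / len(trace)


def zero_load_latency_ns(hops, pipeline_cycles, freq_mhz,
                         packet_flits=1, link_cycles=1):
    """Approximate zero-load latency in nanoseconds.

    Assumed model (not from the slides): a packet crossing `hops` links
    visits hops + 1 routers at `pipeline_cycles` each, every link costs
    `link_cycles`, and each flit after the first adds one cycle of
    serialization delay.
    """
    cycles = ((hops + 1) * pipeline_cycles
              + hops * link_cycles
              + (packet_flits - 1))
    period_ns = 1000.0 / freq_mhz
    return cycles * period_ns
```

Since the result is the cycle count multiplied by the clock period, a mode with fewer pipeline cycles can still have the lowest latency in nanoseconds even though it runs at a lower frequency.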
Area Overhead
- Area of the variable-pipeline router
– Increased by 13.3%
– The input buffer is dominant in routers
[Figure: area in kGates (axis from 10 to 80) of the 1cycle, 2cycle, 3cycle, and VP routers, broken down into input channel, crossbar, output channel, and other; the VP router overhead is marked as 13.3%.]
Operating Frequency
- Frequency of each pipeline stage
- Supply voltage: 0.6V to 1.2V
– Frequency improves as the supply voltage increases
- The VP router has a 12% frequency overhead
[Figure: frequency in MHz (axis from 100 to 700) versus supply voltage (0.6 V to 1.2 V) for the 1cycle router, 2cycle router, 3cycle router, and VP router (1cycle mode); a 12% gap is marked.]
Application Execution Time
- SPLASH-2 benchmarks executed on the 1-, 2-, and 3-cycle routers
– Lower execution time is better
– 2cycle is best
[Figure: normalized execution time (axis from 0.2 to 1) for the 1cycle, 2cycle, and 3cycle routers.]
Zero-load Latency
- Flight time without packet conflicts
– Strongly affects performance
– Lower latency is better
- 1-cycle mode is best
[Figure: zero-load latency in nsec (axis from 10 to 50) for the 1cycle, 2cycle, and 3cycle routers.]
Maximum Throughput
- The 2-cycle router achieves the highest throughput
- The overhead of the adaptive 3-cycle router is a bottleneck
[Figure: normalized maximum throughput (axis from 0.2 to 1) for the 1cycle, 2cycle, and 3cycle routers.]
- 2cycle mode is best
Power Consumption
SPLASH-2 radiosity benchmark
[Figure: power consumption in mW (axis from 2 to 18) versus throughput in M flit/sec (axis from 0.3 to 1.3) for the 1cycle, 2cycle, and 3cycle routers.]
Related Work
1. Pipeline integration of processors (Shimada, 2007)
– Multiple pipeline stages are integrated into one stage when the frequency decreases
  - Using DVFS
  - Power efficiency improves
2. Router micro-architectures that optimize pipelines
– Speculative router
  - VA and SA in parallel (Peh, HPCA00)
– Prediction router (Matsutani, HPCA09)
– Look-ahead (LA) router (Galles, HOTI’96)
  - NRC and VSA can be executed in parallel
→ We integrated different pipeline stages on an on-chip router
Conclusions
- The on-chip router is the heart of NoCs
– Various pipeline structures exist
- Trade-off between latency, throughput, and power
- We designed a variable-pipeline router
– Switching among 1-, 2-, and adaptive 3-cycle pipelines
A variable-pipeline router micro-architecture