A Variable-pipeline On-chip Router Optimized to Traffic Pattern

Yuto Hirata (Keio University), Hiroki Matsutani (University of Tokyo), Michihiro Koibuchi (National Institute of Informatics), Hideharu Amano (Keio University), Japan

Outline
• NoC is the heart of a many-core processor
• Router pipeline affects performance and power
  – Various existing pipeline structures
    • Trade-off between latency, throughput, and power
• We propose a variable-pipeline router (1-, 2-, and 3-cycle)
  – 1cycle mode has the lowest latency
  – 2cycle mode is better in throughput and power
  – 3cycle mode is used to avoid hotspots

Our target region
[Figure: number of PEs per chip (caches are not included; 2 to 256) versus year (2002 to 2010?). Chips shown: picoChip PC102, picoChip PC205, ClearSpeed CSX600, ClearSpeed CSX700, Intel 80-core, TILERA TILE64, MIT RAW, UT TRIPS (OPN), STI Cell BE, Sun T1, Sun T2, Intel Core, IBM Power7, AMD Opteron. Regions labeled "Hundreds of simple PEs", "Chip multi-processor (CMP)" and "Target".]

Our target: NoC for future CMPs
• 8-CPU CMP example [Beckmann, MICRO’04]
  – 8 CPUs (each has a private L1 cache)
  – Shared L2 cache (divided into 64 banks)
[Figure: chip floorplan with UltraSPARC cores, L1 caches (I & D, 16kB) and L2 cache banks (256kB, 4-way)]

Table of Contents
• Trade-off Problem of Router Structures
• Solution: Variable-pipeline Router
• Evaluation
• Related Work
• Conclusions

Conventional On-chip Router
• Module
  – Input channel
  – Crossbar switch
  – Output channel
• Pipeline stage
  – Routing computation (RC)
  – VC allocation (VA)
  – Switch allocation (SA)
  – Switch traversal (ST)
[Figure: router block diagram with input channels (North, East, West, South, Core), arbiter, crossbar (X-bar) and output channels with output buffers; pipeline timing chart over cycles 1 to 7 in which the head flit passes through RC, VA, SA, ST and body flit 1, body flit 2 and the tail flit each pass through SA, ST.]

1cycle Router Pipeline
• Trade-off between router pipeline structures
• 1cycle pipeline structure
  – Good: 1-cycle transfer → Lowest communication latency
  – Weak: Sequential execution of NRC/VSA and ST stages → Lowest frequency and throughput
[Figure: 1cycle router pipeline between links, showing the NRC, VA, SA and ST stages]

2cycle Router Pipeline
• 2cycle pipeline
  – Good: NRC and VSA (VA+SA) are executed in parallel → Highest frequency
  – Modest: 2-cycle transfer → Shorter communication latency
[Figure: 2cycle router pipeline between links, showing the NRC, VA, SA and ST stages]

3cycle Router Pipeline
• 3cycle pipeline
  – Good: Adaptive routing (Duato’s protocol) → Avoids the hotspots
  – Modest: Medium frequency
  – Weak: 3-cycle transfer → Large communication latency in cycles
[Figure: 3cycle router pipeline between links, showing the RC, VA1/SA1, VA2/SA2 and ST stages]

Trade-off of Pipeline Structures
• Pros and cons of each pipeline structure
  Pipeline depth   Routing         Operating freq.   Throughput   Latency
  1cycle           deterministic   Low               Low          Lowest
  2cycle           deterministic   High              High         Lower
  3cycle           adaptive        Mid               Mid          High
• An optimal pipeline depends on the traffic requirement
  → Our solution: switch the pipeline structures dynamically

Variable-pipeline (VP) Router
• Using DVFS
  – High throughput by increasing freq. and voltage
    • Increasing the number of pipeline stages
  – Low latency by decreasing freq. and voltage
    • Decreasing the number of pipeline stages
• Local processor changes the router pipeline structure for each application
  Mode     Routing                       Purpose
  3cycle   Duato’s protocol (adaptive)   Avoiding hotspot
  2cycle   DOR (deterministic)           High throughput and low power
  1cycle   DOR (deterministic)           Low latency
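The mode table on this slide can be read as a small selection policy. The sketch below is only an illustration of that reading: the names `PipelineMode`, `ROUTING`, and `choose_mode`, and the two boolean inputs, are hypothetical and do not come from the slides, which only say that the local processor picks a mode per application.

```python
from enum import Enum


class PipelineMode(Enum):
    ONE_CYCLE = 1
    TWO_CYCLE = 2
    THREE_CYCLE = 3


# Routing algorithm used in each mode, as listed on the slide.
ROUTING = {
    PipelineMode.ONE_CYCLE: "DOR (deterministic)",
    PipelineMode.TWO_CYCLE: "DOR (deterministic)",
    PipelineMode.THREE_CYCLE: "Duato's protocol (adaptive)",
}


def choose_mode(hotspot_expected: bool, latency_critical: bool) -> PipelineMode:
    """Hypothetical policy: map an application's traffic requirement to a mode.

    The priority order (hotspot first, then latency) is an assumption.
    """
    if hotspot_expected:
        return PipelineMode.THREE_CYCLE  # adaptive routing avoids hotspots
    if latency_critical:
        return PipelineMode.ONE_CYCLE    # lowest latency
    return PipelineMode.TWO_CYCLE        # high throughput and low power


if __name__ == "__main__":
    mode = choose_mode(hotspot_expected=False, latency_critical=True)
    print(mode.name, "->", ROUTING[mode])
```

A real policy would probably be driven by profiling data for each application; the two flags here only stand in for that decision.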

Design of Variable-pipeline Router
• Each mode uses a different path
[Figure: datapath between input channel and output channel through the NRC, RC and VA/SA modules, shown in turn for the 1cycle mode, the 2cycle mode and the 3cycle mode]

Design of Variable-pipeline Router
• 5 ports (N, E, W, S + Core) for 2-D mesh/torus
• Flit width: 66 bits (64-bit data + 2-bit flit type)
  – Packet size is variable
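As an illustration of the 66-bit flit format (64-bit data plus a 2-bit flit type), here is a minimal packing sketch. The slides do not give the bit layout or the type encoding, so placing the type in the two most significant bits and the `FlitType` values (including a single-flit `HEAD_TAIL` type) are assumptions.

```python
from enum import IntEnum

DATA_BITS = 64
TYPE_BITS = 2
FLIT_BITS = DATA_BITS + TYPE_BITS  # 66 bits, as on the slide
DATA_MASK = (1 << DATA_BITS) - 1


class FlitType(IntEnum):
    # Assumed encoding; the slides only say the type field is 2 bits wide.
    HEAD = 0
    BODY = 1
    TAIL = 2
    HEAD_TAIL = 3  # hypothetical type for a single-flit packet


def pack_flit(flit_type: FlitType, data: int) -> int:
    """Place the 2-bit type above the 64-bit data word (assumed layout)."""
    if not 0 <= data <= DATA_MASK:
        raise ValueError("data must fit in 64 bits")
    return (int(flit_type) << DATA_BITS) | data


def unpack_flit(flit: int) -> tuple[FlitType, int]:
    """Split a 66-bit flit back into (type, data)."""
    if not 0 <= flit < (1 << FLIT_BITS):
        raise ValueError("flit must fit in 66 bits")
    return FlitType(flit >> DATA_BITS), flit & DATA_MASK


def packetize(words: list[int]) -> list[int]:
    """Turn a variable-length list of 64-bit words into a packet of flits."""
    if not words:
        raise ValueError("a packet needs at least one word")
    if len(words) == 1:
        return [pack_flit(FlitType.HEAD_TAIL, words[0])]
    types = [FlitType.HEAD] + [FlitType.BODY] * (len(words) - 2) + [FlitType.TAIL]
    return [pack_flit(t, w) for t, w in zip(types, words)]
```

Because the packet size is variable, the tail (or head-tail) type is what tells a router that a packet has ended in this sketch.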

Design of Variable-pipeline Router
• Reconfiguration takes only a single cycle
  – when no packets arrive
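One way to picture the single-cycle reconfiguration is a router that latches a requested mode and applies it in the first cycle in which no packet arrives. The behavioral sketch below assumes the router must also have drained its buffers; the class `VPRouterModel` and its fields are hypothetical and are not taken from the authors' RTL.

```python
class VPRouterModel:
    """Toy cycle-level model of mode switching (not the authors' design)."""

    VALID_MODES = (1, 2, 3)

    def __init__(self, mode: int = 2) -> None:
        self.mode = mode
        self.pending_mode = None  # mode requested by the local processor
        self.buffered_flits = 0   # flits currently held in input buffers

    def request_mode(self, mode: int) -> None:
        if mode not in self.VALID_MODES:
            raise ValueError("mode must be 1, 2, or 3")
        self.pending_mode = mode

    def tick(self, arriving_flits: int = 0, departing_flits: int = 0) -> int:
        """Advance one cycle and return the mode in effect afterwards."""
        self.buffered_flits = max(
            0, self.buffered_flits + arriving_flits - departing_flits
        )
        idle = arriving_flits == 0 and self.buffered_flits == 0
        if self.pending_mode is not None and idle:
            # Assumption: the switch completes within this single idle cycle.
            self.mode = self.pending_mode
            self.pending_mode = None
        return self.mode
```

For example, a request issued while a flit is still buffered is deferred until the cycle after that flit has left.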

• Most of the modules are shared by different modes
  – Shared by 1-, 2-, and 3-cycle mode
    • Input buffers, NRC (RC), crossbar, and VA/SA modules
    • Input buffer is dominating the router area
  – Shared by 2- and 3-cycle mode
    • Output latch in the output port
[Figure: area (kGates, 0 to 70) of the 3cycle router, broken down by module (legend includes input channel and other)]

Fault Tolerance for RC module
• When routers A and B are running in 1- or 2-cycle mode with look-ahead routing (NRC),
  – if the NRC of router B fails, router C executes both RC and NRC
[Figure: routers A, B and C, each with RC and NRC modules. The packet leaving router A includes NRC information; the packet leaving router B has no NRC information; router C executes both RC and NRC.]
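The fallback can be sketched with dimension-order routing (DOR) on a 2-D mesh. This is an illustrative model, not the authors' logic: the coordinate convention (east is +x, north is +y), the X-then-Y order, and the function names are assumptions. A packet normally carries the output port precomputed by the previous router's NRC; when that information is missing, the current router runs RC for itself before running NRC for the next hop.

```python
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1), "C": (0, 0)}


def dor_port(cur, dst):
    """X-then-Y dimension-order routing; 'C' delivers to the local core."""
    (x, y), (dx, dy) = cur, dst
    if dx > x:
        return "E"
    if dx < x:
        return "W"
    if dy > y:
        return "N"
    if dy < y:
        return "S"
    return "C"


def neighbor(cur, port):
    """Coordinates of the router reached through `port`."""
    sx, sy = STEP[port]
    return (cur[0] + sx, cur[1] + sy)


def route_at_router(cur, dst, carried_port, nrc_ok=True):
    """Return (output_port, lookahead_port_for_next_router).

    carried_port is the look-ahead result from the previous router, or None
    if that router's NRC failed. nrc_ok models this router's own NRC module.
    """
    # RC fallback: compute our own output port only if none was carried in.
    port = carried_port if carried_port is not None else dor_port(cur, dst)
    # NRC: precompute the output port the next router will use.
    lookahead = dor_port(neighbor(cur, port), dst) if nrc_ok else None
    return port, lookahead


if __name__ == "__main__":
    dst = (2, 1)
    a = route_at_router((0, 0), dst, None)                # source router A
    b = route_at_router((1, 0), dst, a[1], nrc_ok=False)  # B's NRC has failed
    c = route_at_router((2, 0), dst, b[1])                # C runs RC and NRC
    print(a, b, c)
```

In the usage example router C receives no look-ahead information from router B, so it computes both its own port and the next router's port.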

Evaluation Items
• RTL simulation
  – NC-Verilog 8.1
• Design synthesis
  – Design Compiler 2007.12-SP3
  – Nangate 45nm library, typical (1.2V, 25℃)
• Network simulation
  – GEMS/Simics
    • Full system simulator
  – Flit-level network simulator
• Application
  – 9 benchmarks from SPLASH-2

Target 8-core Processor
• Sun Solaris 9, Sun Studio 12
• Routers are connected by a 2-dimensional mesh
[Figure: chip floorplan with UltraSPARC cores, L1 caches (I & D, 16kB), L2 cache banks (256kB, 4-way) and on-chip routers]

1. Hardware synthesis results
  – Area (kGates), frequency (MHz), power (bit/J)
2. Full system CMP simulation results (GEMS/Simics simulator; SPLASH-2 benchmark)
  – Application execution time
  – Collect the packet trace
3. The average hop count for each trace
  – Get the zero-load latency data
4. Network simulation results using packet traces
  – Maximum throughput and power consumption

Area Overhead
• Area of variable-pipeline router
  – Increased by 13.3%
  – Input buffer is dominant in routers
[Figure: area (kGates, 0 to 80) of the 1cycle, 2cycle, 3cycle and VP routers, broken down into input channel, crossbar, output channel and other; the 13.3% increase is annotated]

Operating Frequency
• Frequency of each pipeline stage
• Supply voltage: 0.6V to 1.2V
  – As supply voltage increases, frequency improves
• VP router has 12% frequency overhead
[Figure: frequency (MHz, 0 to 700) versus supply voltage (V, 0.6 to 1.2) for the 1cycle router, 2cycle router, 3cycle router and VP router (1cycle mode); the 12% overhead is annotated]

Application Execution Time
• Execute SPLASH-2 benchmarks for the 1-, 2-, and 3-cycle routers
  – Lower execution time is better
  – 2cycle is best
[Figure: normalized execution time (0 to 1) for the 1cycle, 2cycle and 3cycle routers]

• Flight time without packet conflicts
  – Strongly affects performance
  – Lower latency is better
• 1-cycle mode is best
[Figure: zero-load latency (nsec, 0 to 50) for the 1cycle, 2cycle and 3cycle routers]
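Zero-load latency combines the average hop count of a trace with the per-hop pipeline depth and the clock period. The sketch below uses a common first-order model and is not the authors' exact formula: it ignores link traversal and injection/ejection overhead, and the function name and parameters are hypothetical.

```python
def zero_load_latency_ns(hops: float, cycles_per_hop: int, freq_mhz: float,
                         packet_flits: int = 1) -> float:
    """First-order zero-load latency in nanoseconds.

    latency = (hops * cycles_per_hop + (packet_flits - 1)) * clock period

    Assumptions: no contention, link delay folded into the router cycles,
    and one extra cycle per additional flit for serialization.
    """
    if hops < 0 or cycles_per_hop < 1 or freq_mhz <= 0 or packet_flits < 1:
        raise ValueError("invalid parameter")
    period_ns = 1000.0 / freq_mhz
    head_cycles = hops * cycles_per_hop
    serialization_cycles = packet_flits - 1
    return (head_cycles + serialization_cycles) * period_ns


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show how depth and frequency trade off.
    print(zero_load_latency_ns(hops=4, cycles_per_hop=1, freq_mhz=250))
    print(zero_load_latency_ns(hops=4, cycles_per_hop=2, freq_mhz=500))
```

The model makes the trade-off visible: a deeper pipeline only pays off in wall-clock latency if its frequency gain outweighs the extra cycles per hop.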

• 2-cycle router achieves the highest throughput
• Overhead of the adaptive 3-cycle router is a bottleneck
[Figure: normalized maximum throughput (0 to 1) for the 1cycle, 2cycle and 3cycle routers]

Power Consumption
• 2cycle mode is best
[Figure: power consumption (mW, 0 to 18) versus throughput (M flit/sec, 0.3 to 1.3) for the 1cycle, 2cycle and 3cycle routers; SPLASH-2 radiosity benchmark]

Related Work
1. Pipeline integration of processors (Shimada, 2007)
  – Multiple pipeline stages are integrated into a stage when freq. decreases
    • Using DVFS
    • Power efficiency improves
2. Router micro-architecture optimizing pipelines
  – Speculative router: VA and SA in parallel (Peh, HPCA00)
  – Prediction router (Matsutani, HPCA09)
  – Look-ahead (LA) router (Galles, HOTI’96)
    • NRC and VSA can be executed in parallel
→ We integrated different pipeline stages on an on-chip router

Conclusions
• On-chip router is the heart of NoCs
  – Various existing pipeline structures
    • Trade-off between latency, throughput, and power
• We designed a variable-pipeline router
  – Switching 1-, 2-, and adaptive 3-cycle pipelines
[Figure: a variable-pipeline router micro-architecture]
