A Variable-pipeline On-chip Router Optimized to Traffic Pattern


slide-1
SLIDE 1

A Variable-pipeline On-chip Router Optimized to Traffic Pattern

Yuto Hirata (Keio University) Hiroki Matsutani (University of Tokyo) Michihiro Koibuchi (National Institute of Informatics) Hideharu Amano(Keio University) Japan

slide-2
SLIDE 2

Outline

  • NoC is the heart of a many-core processor
  • Router pipeline affects performance and power

– Various existing pipeline structures

  • Trade-off between latency, throughput, and power
  • We propose a variable-pipeline router (1-, 2-, and 3-cycle modes)

– 1cycle mode has the lowest latency
– 2cycle mode is better at throughput and power
– 3cycle mode is used to avoid hotspots

slide-3
SLIDE 3

Our target region

[Figure: number of PEs per chip (4 to 256, caches are not included) by year (2002 to 2010?). Chips plotted: MIT RAW, STI Cell BE, Sun T1, Sun T2, TILERA TILE64, Intel Core, IBM Power7, AMD Opteron, Intel 80-core, ClearSpeed CSX600, ClearSpeed CSX700, picoChip PC102, picoChip PC205, UT TRIPS (OPN). Regions labeled: hundreds of simple PEs; chip multiprocessor (CMP). The target region is marked.]

slide-4
SLIDE 4
Our target: NoC for future CMPs

  • 8-CPU CMP example

– 8 CPUs (each has a private L1 cache)
– Shared L2 cache (divided into 64 banks)

[Figure: chip layout with UltraSPARC cores, L1 caches (I & D, 16 kB), and L2 cache banks (256 kB, 4-way) [Beckmann, MICRO’04]]

slide-5
SLIDE 5

Table of Contents

  • Trade-off Problem of Router Structures
  • Solution: Variable-pipeline Router
  • Evaluation
  • Related Work
  • Conclusions
slide-6
SLIDE 6

Conventional On-chip Router

  • Module

– Input channel
– Crossbar switch
– Output channel

  • Pipeline stage

– Routing computation (RC)
– VC allocation (VA)
– Switch allocation (SA)
– Switch traversal (ST)

[Figure: pipeline timing over cycles 1 to 7. Head flit: RC, VA, SA, ST. Body flit 1, body flit 2, and tail flit: SA, ST each.]

[Figure: router block diagram with input channels (North, East, West, South, Core), arbiter, crossbar (X-bar), and output channels (North, East, West, South, Core) with output buffers.]
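The timing of the conventional 4-stage pipeline can be sketched in a few lines. This is a minimal illustration, assuming no contention and one flit entering SA per cycle; the function names `schedule_packet` and `total_cycles` are hypothetical and not taken from the slides.

```python
from typing import Dict, List

HEAD_STAGES = ["RC", "VA", "SA", "ST"]  # the head flit passes all four stages
BODY_STAGES = ["SA", "ST"]              # body and tail flits reuse the head's route and VC


def schedule_packet(num_flits: int) -> List[Dict[int, str]]:
    """Return, per flit, a map from 1-based cycle number to pipeline stage.

    Assumption: no contention, so each flit performs ST one cycle after
    the flit ahead of it.
    """
    schedule = []
    for i in range(num_flits):
        stages = HEAD_STAGES if i == 0 else BODY_STAGES
        st_cycle = len(HEAD_STAGES) + i           # cycle in which flit i traverses the switch
        first_cycle = st_cycle - len(stages) + 1  # cycle of the flit's first stage
        schedule.append({first_cycle + k: s for k, s in enumerate(stages)})
    return schedule


def total_cycles(num_flits: int) -> int:
    """Cycle in which the last flit of the packet completes ST."""
    return max(max(flit) for flit in schedule_packet(num_flits))
```

Under these assumptions, a 4-flit packet (head, two body flits, tail) finishes in cycle 7, which agrees with the cycle axis of the timing figure.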

slide-7
SLIDE 7

1cycle Router Pipeline

  • Trade-off between router pipeline structures
  • 1cycle pipeline structure

Good: 1-cycle transfer → lowest communication latency
Weak: sequential execution of NRC/VSA and ST stages → lowest frequency and throughput

[Figure: router between input and output links, with NRC, VA, SA, and ST stages.]

slide-8
SLIDE 8

2cycle Router Pipeline

  • 2cycle Pipeline

Good: NRC and VSA (VA+SA) are executed in parallel → highest frequency
Modest: 2-cycle transfer → shorter communication latency

[Figure: router between input and output links, with NRC, VA, SA, and ST stages.]

slide-9
SLIDE 9

3cycle Router Pipeline

  • 3cycle pipeline

Good: adaptive routing (Duato’s protocol) → avoids hotspots
Modest: medium frequency
Weak: 3-cycle transfer → large communication latency in cycles

[Figure: router between input and output links, with RC, VA1, VA2, SA1, SA2, and ST stages.]

slide-10
SLIDE 10

Trade-off of Pipeline Structures

  • Pros (red) and cons (blue) of each pipeline structure
  • The optimal pipeline depends on the traffic requirement

Pipeline depth                   Operating freq.   Throughput   Latency
1cycle (deterministic routing)   Low               Low          Lowest
2cycle (deterministic routing)   High              High         Lower
3cycle (adaptive routing)        Mid               Mid          High

→ Our solution: switch the pipeline structures dynamically

slide-11
SLIDE 11

Table of Contents

  • Trade-off Problem of Router Structures
  • Solution: Variable-pipeline Router
  • Evaluation
  • Related Work
  • Conclusions
slide-12
SLIDE 12

Variable-pipeline(VP) Router

  • Using DVFS

– High throughput by increasing freq. and voltage (increasing the number of pipeline stages)
– Low latency by decreasing freq. and voltage (decreasing the number of pipeline stages)

  • Local processor changes the router pipeline structure for each application

mode     Routing                       Purpose
3cycle   Duato’s protocol (adaptive)   Avoiding hotspots
2cycle   DOR (deterministic)           High throughput and low power
1cycle   DOR (deterministic)           Low latency
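The mode table can be read as a lookup from an application's traffic requirement to a pipeline mode. The sketch below only encodes that table; the `Goal` enum, the `PipelineMode` class, and `select_mode` are hypothetical names introduced for illustration, and the slides do not describe how the local processor actually makes this choice.

```python
from dataclasses import dataclass
from enum import Enum


class Goal(Enum):
    """Traffic requirements named in the mode table (the enum itself is hypothetical)."""
    LOW_LATENCY = "low latency"
    HIGH_THROUGHPUT = "high throughput"
    LOW_POWER = "low power"
    AVOID_HOTSPOT = "avoid hotspot"


@dataclass(frozen=True)
class PipelineMode:
    cycles: int   # pipeline depth of the router in this mode
    routing: str  # routing algorithm used in this mode


MODES = {
    1: PipelineMode(1, "DOR (deterministic)"),
    2: PipelineMode(2, "DOR (deterministic)"),
    3: PipelineMode(3, "Duato's protocol (adaptive)"),
}

# Direct encoding of the "Purpose" column of the table.
GOAL_TO_CYCLES = {
    Goal.LOW_LATENCY: 1,
    Goal.HIGH_THROUGHPUT: 2,
    Goal.LOW_POWER: 2,
    Goal.AVOID_HOTSPOT: 3,
}


def select_mode(goal: Goal) -> PipelineMode:
    """Return the pipeline mode the table associates with the given goal."""
    return MODES[GOAL_TO_CYCLES[goal]]
```

For example, `select_mode(Goal.AVOID_HOTSPOT)` returns the 3-cycle mode with adaptive routing.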

slide-13
SLIDE 13

Design of Variable-pipeline Router

  • Each mode uses a different path

[Figure: router datapath with input channel, RC, NRC, VA/SA, and output channel.]

slide-14
SLIDE 14

Design of Variable-pipeline Router

  • 1cycle mode

[Figure: router datapath with input channel, RC, NRC, VA/SA, and output channel; path used in 1cycle mode.]

slide-15
SLIDE 15

Design of Variable-pipeline Router

  • 2cycle mode

[Figure: router datapath with input channel, RC, NRC, VA/SA, and output channel; path used in 2cycle mode.]

slide-16
SLIDE 16

Design of Variable-pipeline Router

  • 3cycle mode

[Figure: router datapath with input channel, RC, NRC, VA/SA, and output channel; path used in 3cycle mode.]

slide-17
SLIDE 17

Design of Variable-pipeline Router

  • 5 ports (N, E, W, S + Core) for 2-D mesh/torus
  • Flit width: 66 bits (64-bit data + 2-bit flit type)

– Packet size is variable
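A 66-bit flit made of 64 data bits and a 2-bit type field can be modeled as a single integer. The sketch below is an illustration only: the slides do not give the type encoding or the bit positions, so the four type codes and the choice to place the type in the two most significant bits are assumptions.

```python
DATA_BITS = 64
TYPE_BITS = 2
FLIT_BITS = DATA_BITS + TYPE_BITS  # 66 bits, as stated on the slide

# Hypothetical encoding of the 2-bit flit type (not specified in the slides).
HEAD, BODY, TAIL, HEAD_TAIL = 0b00, 0b01, 0b10, 0b11


def pack_flit(flit_type: int, data: int) -> int:
    """Pack a flit type and 64-bit payload into one 66-bit integer.

    Assumption: the type occupies the two most significant bits.
    """
    if not 0 <= flit_type < (1 << TYPE_BITS):
        raise ValueError("flit type must fit in 2 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit in 64 bits")
    return (flit_type << DATA_BITS) | data


def unpack_flit(flit: int) -> tuple:
    """Split a 66-bit flit back into (flit_type, data)."""
    if not 0 <= flit < (1 << FLIT_BITS):
        raise ValueError("flit must fit in 66 bits")
    return flit >> DATA_BITS, flit & ((1 << DATA_BITS) - 1)
```

Because the type travels with every flit, a receiver can detect the end of a packet from the tail marker, which is one way a variable packet size can be supported without a length field.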

slide-18
SLIDE 18

Design of Variable-pipeline Router

  • Reconfiguration takes only a single cycle

– when no packets arrive
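One way to read this condition is that a requested mode change is applied in a single cycle, but only in a cycle when the router holds no packets. The behavioral sketch below follows that reading; the class `VPRouterModel`, its fields, and the idea of a pending request are hypothetical and are not taken from the slides.

```python
class VPRouterModel:
    """Behavioral sketch of mode switching (hypothetical model, not the RTL)."""

    def __init__(self, mode: int = 2) -> None:
        self.mode = mode          # current pipeline depth: 1, 2, or 3
        self.pending_mode = None  # mode requested by the local processor, if any
        self.flits_in_flight = 0  # flits currently held by this router

    def request_mode(self, mode: int) -> None:
        """Record a mode change request from the local processor."""
        if mode not in (1, 2, 3):
            raise ValueError("mode must be 1, 2, or 3")
        self.pending_mode = mode

    def tick(self) -> None:
        """Advance one cycle; switch modes only when the router is empty."""
        if self.pending_mode is not None and self.flits_in_flight == 0:
            self.mode = self.pending_mode
            self.pending_mode = None
```

In this model a request issued while flits are still in flight simply waits, and the switch completes in the first idle cycle.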

slide-19
SLIDE 19

  • Most modules are shared by the different modes

– Shared by 1-, 2-, and 3-cycle modes: input buffers, NRC (RC), and VA/SA modules (the input buffer dominates the router area)
– Shared by 2- and 3-cycle modes: output latch in the output port

[Figure: area (kGates, 10 to 70) of the 3cycle router, broken down into other, crossbar, and input channel.]

slide-20
SLIDE 20

Fault Tolerance for RC module

  • When routers A and B are running in 1- or 2-cycle mode with look-ahead (NRC):

– If the NRC of router B fails, router C executes both RC and NRC

[Figure: routers A, B, and C, each with RC and NRC. The packet from A to B includes the NRC information; the packet from B to C has no NRC information, so router C executes both RC and NRC.]
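The fallback can be sketched as follows: each packet carries the output port precomputed by the upstream router's NRC, and a router that receives a packet without it runs RC itself before running its own NRC. This is a minimal sketch assuming dimension-order (X then Y) routing on a 2-D mesh with North as +y; `Packet`, `dor_port`, `neighbor`, and `route` are hypothetical names, and the `nrc_ok` flag stands in for whatever fault detection the real router uses.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Coord = Tuple[int, int]

# Assumed mesh orientation: East = +x, North = +y.
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1), "CORE": (0, 0)}


def dor_port(current: Coord, dest: Coord) -> str:
    """Dimension-order (X then Y) output port at `current` toward `dest`."""
    (cx, cy), (dx, dy) = current, dest
    if dx != cx:
        return "E" if dx > cx else "W"
    if dy != cy:
        return "N" if dy > cy else "S"
    return "CORE"


def neighbor(current: Coord, port: str) -> Coord:
    """Coordinates of the router reached through `port`."""
    sx, sy = STEP[port]
    return current[0] + sx, current[1] + sy


@dataclass
class Packet:
    dest: Coord
    next_port: Optional[str] = None  # look-ahead (NRC) info; None means missing


def route(current: Coord, pkt: Packet, nrc_ok: bool = True) -> str:
    """Return the output port used at `current` and refresh the look-ahead info."""
    # RC runs only when the upstream router supplied no look-ahead information.
    port = pkt.next_port if pkt.next_port is not None else dor_port(current, pkt.dest)
    # NRC precomputes the port for the next router, unless this router's NRC has failed.
    pkt.next_port = dor_port(neighbor(current, port), pkt.dest) if nrc_ok else None
    return port
```

With routers A=(0,0), B=(1,0), C=(2,0) and a destination of (2,1), calling `route` at B with `nrc_ok=False` leaves `next_port` empty, so the call at C computes the port itself and then runs NRC as usual.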

slide-21
SLIDE 21

Table of Contents

  • Trade-off Problem of Router Structures
  • Solution: Variable-pipeline Router
  • Evaluation
  • Related Work
  • Conclusions
slide-22
SLIDE 22
Evaluation Items

  • RTL simulation

– NC-Verilog 8.1

  • Design synthesis

– Design Compiler 2007.12-SP3
– Nangate 45nm library, typical (1.2V, 25℃)

  • Network simulation

– GEMS/Simics (full-system simulator)
– Flit-level network simulator

  • Application

– 9 benchmarks from SPLASH-2

slide-23
SLIDE 23

Target 8-core Processor

  • Sun Solaris 9, Sun Studio 12
  • Routers are connected by a 2-dimensional mesh

[Figure: target chip layout with on-chip routers, UltraSPARC cores, L1 caches (I & D, 16 kB), and L2 cache banks (256 kB, 4-way).]

slide-24
SLIDE 24

1. Hardware synthesis results
– Area (kGates), frequency (MHz), power (bit/J)

2. Full-system CMP simulation results (GEMS/Simics simulator; SPLASH-2 benchmark)
– Application execution time
– Collect the packet trace

3. The average hop count for each trace
– Get the zero-load latency data

4. Network simulation results using packet traces
– Maximum throughput and power consumption
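Step 3 can be sketched as below. This is an illustration under stated assumptions: the trace is taken to be a list of (source, destination) router coordinates on the 2-D mesh, routing is minimal, and a hop is counted per link traversed. The trace format and the function names are hypothetical; the slides do not describe the actual trace format.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def hop_count(src: Coord, dst: Coord) -> int:
    """Links traversed by a minimally routed packet on a 2-D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def average_hop_count(trace: List[Tuple[Coord, Coord]]) -> float:
    """Mean hop count over all packets in a (source, destination) trace."""
    if not trace:
        raise ValueError("trace is empty")
    return sum(hop_count(s, d) for s, d in trace) / len(trace)


# Example with a made-up three-packet trace (hop counts 3, 1, and 2).
example_trace = [((0, 0), (3, 0)), ((1, 1), (1, 2)), ((0, 0), (1, 1))]
example_average = average_hop_count(example_trace)
```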

slide-25
SLIDE 25

Area Overhead

  • Area of Variable-pipeline router

– Increased by 13.3%
– Input buffer is dominant in routers

[Figure: area (kGates, 10 to 80) of the 1cycle, 2cycle, 3cycle, and VP routers, broken down into other, output channel, crossbar, and input channel; annotated with 13.3%.]

slide-26
SLIDE 26

Operating Frequency

  • Frequency of each pipeline stage
  • Supply voltage: 0.6V to 1.2V

– As supply voltage increases, frequency improves

  • VP router has 12% frequency overhead

[Figure: frequency (MHz, 100 to 700) versus supply voltage (0.6 V to 1.2 V) for the 1cycle router, 2cycle router, 3cycle router, and VP router (1cycle mode); annotated with 12%.]

slide-27
SLIDE 27

Application Execution Time

  • Execute SPLASH-2 benchmarks on the 1-, 2-, and 3-cycle routers

– Lower execution time is better
– 2cycle is best

[Figure: normalized execution time (0.2 to 1) for 1cycle, 2cycle, and 3cycle.]

slide-28
SLIDE 28
  • Flight time without packet conflicts

– Strongly affects performance
– Lower latency is better

  • 1-cycle mode is best

[Figure: zero-load latency (nsec, 10 to 50) for 1cycle, 2cycle, and 3cycle.]
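Zero-load latency in nanoseconds depends on both the pipeline depth and the clock frequency, which is why a router with a lower frequency can still have the lowest latency. The sketch below uses a simplified model, hops × cycles per hop × clock period, that ignores link traversal and serialization delay; the function name and the frequencies in the example are hypothetical and are not the measured values from the evaluation.

```python
def zero_load_latency_ns(hops: int, cycles_per_hop: int, freq_mhz: float) -> float:
    """Simplified zero-load latency: hops x cycles per hop x clock period.

    Assumption: link traversal and serialization delay are ignored.
    """
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    period_ns = 1000.0 / freq_mhz  # one clock period in nanoseconds
    return hops * cycles_per_hop * period_ns


# Hypothetical frequencies chosen only to illustrate the trade-off.
lat_1cycle = zero_load_latency_ns(hops=4, cycles_per_hop=1, freq_mhz=400.0)
lat_2cycle = zero_load_latency_ns(hops=4, cycles_per_hop=2, freq_mhz=500.0)
```

With these made-up numbers the 1-cycle router needs 10 ns and the 2-cycle router 16 ns for four hops, even though the 2-cycle router runs at the higher clock.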

slide-29
SLIDE 29
  • 2-cycle router achieves the highest throughput
  • Overhead of the adaptive 3-cycle router is a bottleneck

[Figure: normalized maximum throughput (0.2 to 1) for 1cycle, 2cycle, and 3cycle.]

slide-30
SLIDE 30
Power Consumption

  • 2cycle mode is best

[Figure: power consumption (mW) versus throughput (M flit/sec) for 1cycle, 2cycle, and 3cycle; SPLASH-2 radiosity benchmark.]

slide-31
SLIDE 31

Table of Contents

  • Trade-off Problem of Router Structures
  • Solution: Variable-pipeline Router
  • Evaluation
  • Related Work
  • Conclusions
slide-32
SLIDE 32

Related Work

  • 1. Pipeline integration of processors (Shimada, 2007)

– Multiple pipeline stages are merged into one stage when the frequency decreases (using DVFS)
– Power efficiency improves

  • 2. Router micro-architecture optimizing pipelines

– Speculative router: VA and SA in parallel (Peh, HPCA’00)
– Prediction router (Matsutani, HPCA’09)
– Look-ahead (LA) router (Galles, HOTI’96): NRC and VSA can be executed in parallel

→ We integrated different pipeline stages on an on-chip router

slide-33
SLIDE 33
  • On-chip router is the heart of NoCs

– Various existing pipeline structures

  • Trade-off between latency, throughput, and power
  • We designed a variable-pipeline router

– Switching between 1-, 2-, and adaptive 3-cycle pipelines

A variable-pipeline router micro-architecture