Photonic Networks-on-Chip for Maximizing Performance and Improving - - PowerPoint PPT Presentation


Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance Randy Morris , Avinash Kodi and Ahmed Louri School of Electrical Engineering and Computer Science, Ohio University


slide-1
SLIDE 1

Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance

Randy Morris†, Avinash Kodi† and Ahmed Louri‡
†School of Electrical Engineering and Computer Science, Ohio University
‡Department of Electrical and Computer Engineering, University of Arizona
E-mail: kodi@ohio.edu, louri@email.arizona.edu

45th International Symposium on Microarchitecture (MICRO)
December 1 – December 5, 2012, Vancouver BC, Canada

slide-2
SLIDE 2

Talk Outline

  • Motivation & Background
  • R-3PO: Architecture & Reconfiguration
  • Performance Analysis
  • Conclusions


slide-3
SLIDE 3

Multicores & Network-on-Chips

  • With increasing core counts, the communication-centric design paradigm (Networks-on-Chip) is becoming important

  • Energy for communication is increasing
  • Delivered throughput is decreasing

Examples: Tilera-64 [1], 80-core Intel TeraFlops [2], 512-core FERMI (Nvidia) [3]

[1] http://www.tilera.com/products/processors/TILE64
[2] http://techresearch.intel.com/ProjectDetails.aspx?Id=151
[3] http://www.nvidia.com/object/fermi_architecture.html

slide-4
SLIDE 4

Energy Discrepancy & Throughput

  • Energy discrepancy between computation and global communication with technology scaling
  • Reduced throughput due to aggressive voltage and clock scaling

[Figure: Tile Power: Intel Tera-Flops (65 nm) [1]. Power (watts) vs. voltage. Marked points: 1.33 Tflops at 230 W; 1 Tflops at 97 W]

[Figure: On-die energy. Relative compute energy and interconnect energy vs. technology (45, 32, 22, 14, 10, 7 nm). Source: Shekhar Borkar, Intel]

  • Need to provide scalable bandwidth without sacrificing performance
  • Need to reduce global communication energy

=> Potential solutions: Nanophotonics, 3D Stacking

[1] Y. Hoskote, “A 5-GHz Mesh Interconnect for A Teraflops Processor,” IEEE Computer Society, 2007 pp. 51-61

slide-5
SLIDE 5

Nanophotonics & Optical 3D Stacking

  • Nanophotonics offers several advantages:
      • Low energy (7.9 fJ/bit)
      • Small footprint (~2.5 µm)
      • High bandwidth (~40 Gbps)
      • CMOS compatibility
  • Optical 3D stacking offers several advantages:
      • Shorter interconnect length
      • Higher bandwidth density
      • Optical vias create power-efficient inter-layer communication

[Figure: stacked optical layers, labeled Layer 1 and Layer 2]

References:
  1. L. Xu, W. Zhang, Q. Li, J. Chan, H. L. R. Lira, M. Lipson, K. Bergman, "40-Gb/s DPSK Data Transmission Through a Silicon Microring Switch," IEEE Photonics Technology Letters 24.
  2. S. Manipatruni, K. Preston, L. Chen, and M. Lipson, "Ultra-low voltage, ultra-small mode volume silicon microring modulator," Opt. Express 18, 18235-18242 (2010).
  3. P. Koonath and B. Jalali, “Multilayer 3-d photonics in silicon,” Opt. Express, vol. 15, pp. 12686–12691, 2007.
  4. A. Biberman, K. Preston, G. Hendry, N. Sherwood-Droz, J. Chan, J. S. Levy, M. Lipson, and K. Bergman, “Photonic network-on-chip architectures using multilayer deposited silicon materials for high performance chip multiprocessors,” J. Emerg. Technol. Comput. Syst., vol. 7, pp. 1–25, July 2011.

slide-6
SLIDE 6

Recent Work on Photonic NoC, among others

  • Shared-Bus [Cornell, MICRO’06]
  • Circuit Switch [Columbia, NoCs’07]
  • CORONA [HP/Wisconsin, ISCA’08]
  • Processor-DRAM [MIT, Hot Int’08]
  • Firefly [Northwestern, ISCA’09]
  • Phastlane [Cornell, ISCA’09]
  • Flexishare [Northwestern, HPCA’10]
  • Oblivious Router [Cornell, ASPLOS’10]
  • ATAC [MIT, PACT’10]
  • MPNoC [Arizona, DAC’10]
  • Free-Space Architecture [ISCA’10]
  • Optical Proximity [Sun, ISCA’10]
  • PROPEL [Ohio, NoCs’10]
  • System Level Trimming [UC Davis, HPCA’11]
  • Atomic Coherence [Wisconsin/HP, HPCA’11]
  • FeatherWeight [Northwestern/KAIST, MICRO’11]
  • Resilient Microring Design [UC Davis, MICRO’11]
  • Tolerating Process Variations [Pittsburgh, ISCA’12]

  • However, there are several issues not addressed:
      • 2D planar connections have waveguide crossings
      • Static network resource allocation
      • Lack of fault tolerance

slide-7
SLIDE 7

Talk Outline

  • Motivation & Background
  • R-3PO: Architecture & Reconfiguration
  • Performance Analysis
  • Conclusions


slide-8
SLIDE 8

R-3PO Architecture

  • Decomposed optical crossbar
      • Reduces optical hardware complexity by having smaller crossbars
      • Reduces crossover losses (~0.05 dB/crossing)
  • Optical vias
      • Light switched via photonic rings (reduces electrical power)
      • Eases fabrication as optical and electrical dies can be separately grown
  • Reconfiguration of network resources by re-allocating bandwidth
      • Reduces application execution time by monitoring link and buffer utilization
      • Provides fault tolerance as faulty channels are bypassed

slide-9
SLIDE 9

R-3PO Architecture (1/6)

[Figure: 3D chip stack. Labels: heat sink; electrical die (core + cache + MC); TSVs; electro-optic transceivers; electrical contact; optical die with Optical Layers 0-3; external laser]

slide-10
SLIDE 10

R-3PO Architecture (1/6)

[Figure: electrical die (core + cache + MC) below the heat sink. Labels: four cores, four L1 caches, shared L2]

slide-11
SLIDE 11

R-3PO Architecture (2/6)

[Figure: electro-optic transceivers connected to the electrical die through TSVs, fed by an external (off-chip) laser. Link between Core A and Core B with four transmitters (Tx) and four receivers (Rx) on wavelengths λ1-λ4. Labels: micro-ring resonator; photodetector; driver; TIA; limiting amplifier; buffer chain]

slide-12
SLIDE 12

R-3PO Architecture (3/6)

[Figure: stack from the previous step with Optical Layer 0 added. Groups shown: Group 0, Group 1, Group 2, Group 3]

slide-13
SLIDE 13

R-3PO Architecture (4/6)

[Figure: stack with Optical Layers 0 and 1. Groups shown: Group 0, Group 1, Group 2, Group 3]

slide-14
SLIDE 14

R-3PO Architecture (5/6)

[Figure: stack with Optical Layers 0, 1, and 2. Groups shown: Group 2, Group 3, Group 0, Group 1]

slide-15
SLIDE 15

R-3PO Architecture (6/6)

[Figure: complete stack with electrical die, electrical contact, and an optical die holding Optical Layers 0-3, fed by the external laser. Groups shown: Group 0, Group 2, Group 1, Group 3]

slide-16
SLIDE 16

Router Microarchitecture

[Figure: router for Tile 0. Labels: L2 shared cache; header route computation (RC); demux; input buffers IB0-IB3; token request and release; token control; token capture/release; token re-generation; switch allocator (SA); mux; output buffers OB0-OB3; E/O Tx with MRR modulators to Optical Layers 0-3; O/E Rx with MRR filters from Optical Layers 0-3]

Pipeline: BWS, RC, EO, OL, OL, OL, OE, BWD, SA

  • BWS: Buffer Write (Source)
  • RC: Route Computation
  • EO: Electrical to Optical Driver
  • OL: Optical link latency (1-3 cycles)
  • OE: Optical to Electrical (Dest)
  • BWD: Buffer Write (Dest)
  • SA: Switch Allocation
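As a rough illustration of the pipeline legend on this slide, the sketch below adds up the stage latencies. It assumes one cycle for every stage other than the optical link. That is an assumption: the slide only gives the optical link latency (1-3 cycles). The function names are hypothetical, and the 5 GHz clock comes from the simulation parameters listed later in the deck.

```python
# Stage names as listed in the slide's legend.
STAGES = ["BWS", "RC", "EO", "OL", "OE", "BWD", "SA"]


def pipeline_cycles(optical_link_cycles, cycles_per_stage=1):
    """Total cycles through the router pipeline.

    Assumption (not from the slides): every non-optical stage takes
    `cycles_per_stage` cycles. The optical link takes 1-3 cycles.
    """
    if not 1 <= optical_link_cycles <= 3:
        raise ValueError("optical link latency is given as 1-3 cycles")
    electrical_stages = [s for s in STAGES if s != "OL"]
    return len(electrical_stages) * cycles_per_stage + optical_link_cycles


def cycles_to_ns(cycles, clock_ghz=5.0):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles / clock_ghz


if __name__ == "__main__":
    for ol in (1, 2, 3):
        total = pipeline_cycles(ol)
        print(f"OL={ol} cycles: {total} cycles, {cycles_to_ns(total):.2f} ns")
```

Under these assumptions, the end-to-end latency varies only with the optical link length, which is the one stage the slide gives a range for.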

slide-17
SLIDE 17

Static Communication

[Figure: Layer 2 with Groups 0-3; source marked]

  • Communication demand between Tile 0 and Tile 15 can be high, depending on the application
  • If there are under-utilized links, their bandwidth can be re-allocated to improve performance

slide-18
SLIDE 18

Network Reconfiguration

[Figure: Layer 0 and Layer 1, each with Groups 0-3; source, destination, switch points, and combine points marked]

2x increase in bandwidth is obtained by routing half the data through two other nanophotonic channels

slide-19
SLIDE 19

Reconfiguration

  • Reconfiguration in R-3PO takes place between the different layers as follows:
      • R-3PO-L1: Reconfiguration between Layer0/Layer1 & Layer2/Layer3
      • R-3PO-LA: Reconfiguration between adjacent layers
      • R-3PO-L2: Reconfiguration between two adjacent layers
      • R-3PO-L3: Reconfiguration between all layers
  • Reconfiguration algorithm monitors network resources
      • Link & buffer utilization
      • Accomplished with hardware counters & electrical circuitry
slide-20
SLIDE 20

Reconfiguration Algorithm

Step 1: Wait for reconfiguration window RW_t.
Step 2: RC_i sends a request packet to all local tiles requesting Link_util and Buffer_util for the previous window RW_{t-1}.
Step 3: Each hardware counter sends its Link_util and Buffer_util statistics from the previous window RW_{t-1} to RC_i.
Step 4: RC_i classifies the link statistic for each hardware counter as:
    If Link_util = 0.0: Not-Utilized, use β4
    If Link_util ≤ L_min: Under-Utilized, use β3
    If Link_util ≥ L_min and Buffer_util < B_con: Normal-Utilized, use β2
    If Buffer_util > B_con: Over-Utilized, use β1
Step 5: Each RC_i sends its available-bandwidth information to RC_j (i ≠ j).
Step 6: If RC_j can use any of the free links, it notifies RC_i of their use; otherwise RC_j forwards the information to the next RC_j.
Step 7a: RC_i receives the response from RC_j and activates the corresponding microrings.
Step 7b: RC_j notifies the tiles of the additional bandwidth, and RC_i notifies RC_j that the additional bandwidth is now available.
Step 8: Go to Step 1.
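The classification in Step 4 can be sketched in a few lines of Python. This is only an illustration, not the authors' hardware implementation. The rules are applied in the order the slide lists them (first match wins), the threshold values in the usage lines are made up, and the case Buffer_util = B_con, which the slide does not cover, is treated here as normal-utilized by assumption.

```python
from enum import Enum


class LinkState(Enum):
    # Values name the bandwidth setting the slide attaches to each class.
    NOT_UTILIZED = "beta4"
    UNDER_UTILIZED = "beta3"
    NORMAL_UTILIZED = "beta2"
    OVER_UTILIZED = "beta1"


def classify_link(link_util, buffer_util, l_min, b_con):
    """Classify one link from its statistics for the previous window.

    Rules follow Step 4 in slide order. Assumption: buffer_util == b_con
    (unspecified on the slide) falls into the normal-utilized class.
    """
    if link_util == 0.0:
        return LinkState.NOT_UTILIZED
    if link_util <= l_min:
        return LinkState.UNDER_UTILIZED
    if buffer_util > b_con:
        return LinkState.OVER_UTILIZED
    return LinkState.NORMAL_UTILIZED


if __name__ == "__main__":
    # Hypothetical thresholds, chosen only for illustration.
    L_MIN, B_CON = 0.1, 0.5
    for link, buf in [(0.0, 0.0), (0.05, 0.1), (0.4, 0.3), (0.9, 0.8)]:
        print(link, buf, classify_link(link, buf, L_MIN, B_CON).name)
```

Links classified as not utilized or under-utilized are the ones whose bandwidth an RC can offer to other RCs in Steps 5 and 6.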

slide-21
SLIDE 21

Fault Tolerance

  • Channel faults cause communication breakdown, isolating healthy cores due to transceiver failure (e.g., ring resonator failure due to thermal drift or process variation)
  • As redundant channels are available in the decomposed crossbar, fault tolerance can be implemented
  • Augment the reconfiguration algorithm to detect link faults
  • When faults are detected, bandwidth from working links is shared with faulty links to communicate with the isolated core
  • Fault tolerance techniques allow performance to degrade gracefully

slide-22
SLIDE 22

Fault Tolerance Example

[Figure: Layer 0 and Layer 1, each with Groups 0-3; faulty link, switch points, and combine points marked]

Bandwidth from Group 0’s interconnects in Layer 0 is switched to the interconnects in Layer 1 that are used to communicate with Group 0

slide-23
SLIDE 23

Talk Outline

  • Motivation & Background
  • R-3PO: Architecture & Reconfiguration
  • Performance Analysis
  • Conclusions


slide-24
SLIDE 24

Performance Analysis

– Synthetic, SPLASH-2, PARSEC, & SPEC CPU 2006 application traces on a cycle-accurate simulator
  • SPLASH-2: FFT, LU, radix, ocean, & water
  • PARSEC: blackscholes, facesim, fluidanimate, freqmine, & streamcluster
  • SPEC CPU 2006: bzip & hmmer
– Power Analysis
  • Optical Power (micro-ring resonators & laser power)
  • Electrical Power (receiver & router)
– Compared to the following networks
  • Electrical: Mesh & Flattened-Butterfly
  • Optical: Firefly, Corona, & MPNoC

slide-25
SLIDE 25

Energy Evaluation

[Figure: optical link between two cores (off-chip laser, Tx and Rx on λ1-λ4, photodetector, TIA, limiting amplifier, buffer chain, driver), annotated with the energy components: laser power; pre-driver and SERDES; ring heating & ring modulation; ring heating; optical receiver circuitry & DESERDES]

Device                Energy
Ring heating          2.6 fJ/bit
Ring modulation       50 fJ/bit
Pre-driver            19 fJ/bit
SERDES                1.5 fJ/bit
DESERDES              1.5 fJ/bit
Receiver circuitry    66 fJ/bit

  • C. Batten, A. Joshi, J. Orcutt, A. Khilo, B. Moss, C. Holzwarth, M. Popović, H. Li, H. Smith, J. Hoyt, F. Kärtner, R. Ram, V. Stojanović, and K. Asanović. "Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics." 16th Symposium on High-Performance Interconnects (HOTI-16), Aug. 2008.
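To see how the per-device figures on this slide combine, the sketch below sums them per side of the link. The split into transmitter and receiver components follows the figure annotations as best they can be read, so it is an assumption, as are all names in the code. Laser energy is left out because the slide gives no per-bit value for it.

```python
# Per-bit energies from the slide's table, in femtojoules per bit.
ENERGY_FJ_PER_BIT = {
    "ring_heating": 2.6,
    "ring_modulation": 50.0,
    "pre_driver": 19.0,
    "serdes": 1.5,
    "deserdes": 1.5,
    "receiver_circuitry": 66.0,
}

# Assumed grouping: ring heating is paid on both the modulator
# (transmit) side and the filter (receive) side.
TX_COMPONENTS = ["pre_driver", "serdes", "ring_heating", "ring_modulation"]
RX_COMPONENTS = ["ring_heating", "receiver_circuitry", "deserdes"]


def side_energy_fj(components):
    """Sum the per-bit energy of the named components (fJ/bit)."""
    return sum(ENERGY_FJ_PER_BIT[name] for name in components)


def link_energy_fj():
    """Transmit plus receive energy per bit, excluding laser power."""
    return side_energy_fj(TX_COMPONENTS) + side_energy_fj(RX_COMPONENTS)


if __name__ == "__main__":
    print(f"Tx:   {side_energy_fj(TX_COMPONENTS):.1f} fJ/bit")
    print(f"Rx:   {side_energy_fj(RX_COMPONENTS):.1f} fJ/bit")
    print(f"Link: {link_energy_fj():.1f} fJ/bit (laser excluded)")
```

Under this grouping, ring modulation dominates the transmit side and the receiver circuitry dominates the receive side.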

slide-26
SLIDE 26

System Simulation Parameters

Parameter                     Value
L1/L2 coherence               MOESI
L2 cache size/assoc           4MB/16-way
L2 access latency (cycles)    4
L1 cache/assoc                64KB/4-way
L1 access latency (cycles)    2
Core frequency (GHz)          5
Threads (core)                2
Issue policy                  In-order
Memory size (GB)              4
Memory latency (cycles)       160

R-3PO is compared to the following networks: Mesh, Flattened-Butterfly, Firefly, Corona, & MPNOC

slide-27
SLIDE 27

Energy per bit (256 Cores): Uniform

[Figure: stacked bars of energy per bit (pJ) for Mesh, FB, Firefly, Corona, MPNOC, R-3PO-L1, R-3PO-LA, R-3PO-L2, and R-3PO-L3; components: ring modulation, ring heating, laser, back-end circuit, electrical link, router]

R-3PO reduces energy consumption by 36%

slide-28
SLIDE 28

Application Traffic (64 Cores)

[Figure: speed-up for blackscholes, facesim, fluidanimate, freqmine, streamcluster, bzip, and hmmer; series: Mesh, Flattened-Butterfly, Firefly, Corona, MPNOC, R-3PO-L1, R-3PO-LA]

R-3PO shows an increase in performance of about 2.5x

slide-29
SLIDE 29

Synthetic Traffic (256 Cores)

[Figure: speed-up for Uniform, Bit-reversal, Butterfly, Complement, Matrix-Transpose, Perfect Shuffle, and Neighbor traffic; series: Mesh, FB, Firefly, Corona, MPNOC, R-3PO-L1, R-3PO-LA]

R-3PO shows an increase in performance of about 4x

slide-30
SLIDE 30

Fault Tolerance

[Figure: performance degradation for blackscholes, facesim, fluidanimate, freqmine, streamcluster, bzip, and hmmer; series: R-3PO-L1, R-3PO-L1 (10%), R-3PO-L1 (25%), R-3PO-L1 (50%)]

Faults degrade performance relative to R-3PO as follows:
  • With 10% faults, performance loss is 3%
  • With 25% faults, performance loss is 13%
  • With 50% faults, performance loss is 35%

slide-31
SLIDE 31

Talk Outline

  • Motivation & Background
  • R-3PO: Architecture & Reconfiguration
  • Performance Analysis
  • Conclusions


slide-32
SLIDE 32

Conclusions

  • R-3PO combines the benefits of nanophotonics and 3D stacking to reduce energy consumption while eliminating waveguide crossings
  • We evaluate the power-performance trade-off by analyzing the design space of implementing reconfiguration across multiple layers
  • We apply our reconfiguration algorithm to bypass faulty channels by sharing bandwidth
  • Our results indicate that energy/bit can be decreased by 23-36% for various real applications while improving application speedup by 2-4X

slide-33
SLIDE 33

Thank You Questions?

slide-34
SLIDE 34

Application Traffic (64 Cores/16λ)

[Figure: speed-up for blackscholes, facesim, fluidanimate, freqmine, streamcluster, bzip, and hmmer; series: Mesh, Flattened-Butterfly, Firefly, Corona, MPNOC, R-3PO-L1, R-3PO-LA]

R-3PO shows an increase in performance of about 2.5x

slide-35
SLIDE 35

Power Analysis

Device                        Loss (dB)
Coupler (Lc)                  1
Filter drop (Lf)              1
Non-linearity (Ln)            1
Bending (LB)                  1
Photo-detector (Lp)           1
Waveguide crossing (Lwc)      0.05
Modulator insertion (Li)      1
Waveguide (per cm) (LW)       1.3
Splitter (Ls)                 3

Receiver sensitivity (LRS): -26 dBm (R-3PO)
Laser efficiency: 30%

  • C. Batten, A. Joshi, J. Orcutt, A. Khilo, B. Moss, C. Holzwarth, M. Popović, H. Li, H. Smith, J. Hoyt, F. Kärtner, R. Ram, V. Stojanović, and K. Asanović. "Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics." 16th Symposium on High-Performance Interconnects (HOTI-16), Aug. 2008.
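A standard optical link budget shows how these loss values translate into laser power. The sketch below is a generic calculation, not the authors' power model. It assumes each fixed device loss is incurred once per path, and the waveguide length, crossing count, and splitter count in the usage lines are hypothetical. The result is the electrical laser power for a single wavelength.

```python
# Fixed per-path device losses from the slide's table, in dB.
FIXED_LOSS_DB = {
    "coupler": 1.0,
    "filter_drop": 1.0,
    "non_linearity": 1.0,
    "bending": 1.0,
    "photodetector": 1.0,
    "modulator_insertion": 1.0,
}
WAVEGUIDE_DB_PER_CM = 1.3
CROSSING_DB = 0.05
SPLITTER_DB = 3.0
RECEIVER_SENSITIVITY_DBM = -26.0
LASER_EFFICIENCY = 0.30


def path_loss_db(waveguide_cm, crossings=0, splitters=0):
    """Total optical loss along one path, in dB.

    Assumption: each fixed device loss is counted exactly once.
    """
    return (
        sum(FIXED_LOSS_DB.values())
        + WAVEGUIDE_DB_PER_CM * waveguide_cm
        + CROSSING_DB * crossings
        + SPLITTER_DB * splitters
    )


def laser_power_mw(waveguide_cm, crossings=0, splitters=0):
    """Electrical laser power (mW) needed for one wavelength."""
    required_dbm = RECEIVER_SENSITIVITY_DBM + path_loss_db(
        waveguide_cm, crossings, splitters
    )
    optical_mw = 10 ** (required_dbm / 10)  # dBm to milliwatts
    return optical_mw / LASER_EFFICIENCY


if __name__ == "__main__":
    # Hypothetical path: 2 cm of waveguide, one splitter, no crossings.
    print(f"loss:  {path_loss_db(2.0, splitters=1):.2f} dB")
    print(f"laser: {laser_power_mw(2.0, splitters=1):.4f} mW per wavelength")
    # The same path with 100 waveguide crossings, for comparison.
    print(f"laser: {laser_power_mw(2.0, 100, 1):.4f} mW per wavelength")
```

Because losses add in dB, every extra 3 dB on the path doubles the required laser power, which is why removing waveguide crossings matters even at 0.05 dB each.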

slide-36
SLIDE 36

Variation in Laser Power

[Figure (a): laser power (watts) vs. number of wavelengths and receiver sensitivity (dBm); marked point: 64 wavelengths, -26 dBm, 6.1 W]

[Figure (b): laser power (watts) vs. waveguide loss (dB) and ring filter loss (dB)]

slide-37
SLIDE 37

Reconfiguration Combinations

[Figure: four panels (R-3PO-L1, R-3PO-LA, R-3PO-L2, R-3PO-L3), each a Layer 0-3 by Layer 0-3 grid over the channel assignment below]

Layer      Channels
Layer 0    G0 <-> G0, G1 <-> G2, G3 <-> G3
Layer 1    G0 <-> G2, G1 <-> G3
Layer 2    G1 <-> G1, G0 <-> G3, G2 <-> G2
Layer 3    G0 <-> G1, G2 <-> G3
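The layer-to-channel assignment on this slide can be checked mechanically. The sketch below encodes the assignment as a dictionary and confirms that the four layers together carry every group pair exactly once, which is what a decomposed crossbar needs. The data structure and function names are illustrative only.

```python
from itertools import combinations_with_replacement

# Group pairs carried by each optical layer, as listed on the slide.
# Each pair is stored with the smaller group index first.
LAYER_CHANNELS = {
    0: [(0, 0), (1, 2), (3, 3)],
    1: [(0, 2), (1, 3)],
    2: [(1, 1), (0, 3), (2, 2)],
    3: [(0, 1), (2, 3)],
}


def layer_for(group_a, group_b):
    """Return the optical layer that carries traffic between two groups."""
    pair = tuple(sorted((group_a, group_b)))
    for layer, pairs in LAYER_CHANNELS.items():
        if pair in pairs:
            return layer
    raise KeyError(f"no layer carries groups {pair}")


def covers_all_pairs(num_groups=4):
    """True if every group pair (including a group with itself)
    is assigned to exactly one layer."""
    assigned = sorted(p for pairs in LAYER_CHANNELS.values() for p in pairs)
    expected = sorted(combinations_with_replacement(range(num_groups), 2))
    return assigned == expected


if __name__ == "__main__":
    print("all pairs covered once:", covers_all_pairs())
    print("G0 <-> G3 uses layer", layer_for(0, 3))
    print("G1 <-> G0 uses layer", layer_for(1, 0))
```

With four groups there are ten pairs (four intra-group, six inter-group), and the assignment splits them 3, 2, 3, 2 across Layers 0-3.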