
SLIDE 1

Scalable Nanophotonic Interconnect for Cache Coherent Multicores

Randy W. Morris, Jr. and Avinash K. Kodi

Department of Electrical Engineering and Computer Science
Ohio University, Athens, OH 45701
E-mail: rmorris@cs.ohiou.edu, kodi@ohio.edu
Website: http://oucsace.cs.ohiou.edu/~avinashk/

WINDS 2010: Workshop on the Interaction between Nanophotonic Devices and Systems
Atlanta, GA, December 5, 2010

SLIDE 2

Talk Outline

Section I: Motivation & Background
Section II: Dual Sub-Network for Snoopy Cache Coherent Nanophotonic Architecture

Section IV: Performance Analysis
Section V: Future Work


SLIDE 3

Why Nanophotonics?

Power consumption of Networks-on-Chip (NoCs) using metallic interconnects is projected to exceed expectations by a factor of 10 [1]

Tile Power: Intel Tera-Flops (65 nm) [2]

Clock Distribution   11%
Dual FPMACs          36%
Router & Links       28%
10-port RF            4%
IMEM + DMEM          21%

1. J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and L. S. Peh, “Research Challenges for On-Chip Interconnection Networks,” IEEE Micro, vol. 27, no. 5, pp. 96-108, September-October 2007.
2. Y. Hoskote, “A 5-GHz Mesh Interconnect for a Teraflops Processor,” IEEE Computer Society, 2007, pp. 51-61.

Nanophotonic Technology

Low Power
Small Footprint (10 – 15 µm)
High Bandwidth (10 – 20 Gbps)
CMOS Compatibility

SLIDE 4


Resonant wavelength ($\lambda_0$): $\lambda_0 \cdot m = n_{\mathrm{eff}} \cdot 2\pi R$
where $m$ is an integer, $n_{\mathrm{eff}}$ is the effective refractive index, and $R$ is the radius of the ring resonator
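The resonance condition can be evaluated numerically. The sketch below lists the integer modes m and the matching resonant wavelengths that fall inside a chosen band. The function name, the 5 µm radius, the effective index of 2.5, and the 1500 to 1600 nm band are illustrative assumptions, not values from the slides.

```python
import math

def resonant_wavelengths_nm(radius_um, n_eff, lo_nm=1500.0, hi_nm=1600.0):
    """Return (m, wavelength_nm) pairs satisfying lambda0 * m = n_eff * 2*pi*R
    for every integer m whose wavelength lies inside [lo_nm, hi_nm]."""
    # Optical path length around the ring, converted from micrometres to nanometres.
    optical_path_nm = n_eff * 2.0 * math.pi * radius_um * 1000.0
    m_min = math.ceil(optical_path_nm / hi_nm)   # smallest m -> longest wavelength
    m_max = math.floor(optical_path_nm / lo_nm)  # largest m -> shortest wavelength
    return [(m, optical_path_nm / m) for m in range(m_min, m_max + 1)]

# Assumed example values (hypothetical, chosen only for illustration).
for m, wavelength in resonant_wavelengths_nm(radius_um=5.0, n_eff=2.5):
    print(f"m = {m}: lambda0 = {wavelength:.1f} nm")
```

Because the resonant wavelength is proportional to the effective index, a small change in that index moves each resonance by the same relative amount.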

Micro-ring Resonators

[Figure: micro-ring resonator with n+/p+/n+ doped regions, Input Port 0, Output Port 0, and Output Port 1, shown for ring voltage VR = VOFF and VR = VON]

1. Lipson, M., Compact Electro-Optic Modulators on a Silicon Chip, IEEE J. Sel. Top. Quant., Vol. 12, No. 6, Nov.-Dec. 2006, p. 1520-6.
2. M. Lipson, Guiding, Modulating and Emitting Light on Silicon - Challenges and Opportunities, IEEE Journal of Lightwave Technology, Vol. 23, No. 12, December 2005 (invited).

SLIDE 5

Cache Coherence

Write propagation (a write by any processor should become visible to all other processors)

Write serialization (all writes from the same or different processors are seen in the same order by all processors)

Snoopy Protocols

[Figure: processors P1 to P4, each with a cache ($), connected to memory over a broadcast medium]

Easy to program, but not easily scalable

Directory Protocols

[Figure: processors P1 through P4, each with a cache ($), memory (M), and directory (D), connected point-to-point through an interconnection network]

Scalable, but high miss latency
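The trade-off between the two protocol families can be illustrated with a toy cost model for a single read miss. This is a sketch under simplifying assumptions: the class and function names are hypothetical, the block is assumed to be held by exactly one remote cache, and a directory miss is modeled as the usual three-hop path (requester to directory, directory to owner, owner to requester).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MissCost:
    tag_lookups: int   # remote caches that must check their tags
    network_hops: int  # serialized network traversals before data arrives

def snoopy_read_miss(num_caches: int) -> MissCost:
    # The request is broadcast, so every other cache performs a tag lookup;
    # the owner then replies directly (request + response = 2 hops).
    return MissCost(tag_lookups=num_caches - 1, network_hops=2)

def directory_read_miss() -> MissCost:
    # The request goes to the home directory, is forwarded to the single
    # owner, and the owner returns the data (3 hops, 1 remote tag lookup).
    return MissCost(tag_lookups=1, network_hops=3)

for n in (4, 16, 64):
    print(n, snoopy_read_miss(n), directory_read_miss())
```

In this model the snoopy lookup count grows linearly with the number of caches, while the directory cost stays constant but always pays the extra hop.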

SLIDE 6

Problems with Snoopy Networks

Two major problems with snoopy cache coherent networks

(1) Interconnect bandwidth for broadcasting of memory requests

Bus Networks: Limits one request per cycle
Multiple Buses: Increases cache controllers
Point-to-Point Networks: Selective multicasting & Ordering

(2) Cache Access Rate

Cache tag lookup (latency)
Increased power consumption

SLIDE 7


Related Work (to name a few)

Electrical

– Split Transactional Bus
– Sun Fireplane (SC 2001)
– Timestamp Snooping (ASPLOS 2000), Multicast Snooping (ISCA 2001)
– Jetty (HPCA 2001), Region Scout (ISCA 2005), Intel QPI
– Broadcasting on Ordered Networks (HPCA 2009, MICRO 2009)

Optical/Nanophotonic

– SYMNET (Trans on Parallel & Dist Systems 2004)
– Shared Bus (MICRO 2006), Wavelength Routed Oblivious Network (ASPLOS 2010)
– Spectra (ISLPED 2009), ATAC (PACT 2010)

SLIDE 8


CC-NPA Architecture

Advantages of the proposed architecture

– Dual sub-networks for memory requests
Broadcast & Multicast networks

– Broadcast network used by all tiles to fetch the missed block
Network access implemented using tokens
Determines the sharing pattern

– Multicast network shared between nodes to send selective requests
Reduces the broadcast requirement
Simultaneous transient requests in progress to different memory locations

– Reduces the external laser power through unique power-guiding techniques

SLIDE 9

Proposed Broadcast Sub-Network Architecture: CC-NPA

[Figure: 16 tiles (Tile 0 to Tile 15) connected to a control center. Each tile contains four cores (Core 0 to Core 3), each with an L1 cache, plus a shared L2, a transmitter, and a receiver.]

SLIDE 10

Power Guiding

As only one core can transmit, route power to a column of cores.

Reduction in optical power (~75%)

[Figure: laser power guided to Column 0, Column 1, Column 2, or Column 3; 2 dB optical loss]

The active column is determined by the circulating optical tokens
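The roughly 75% figure follows from lighting one of four columns at a time instead of all four. The sketch below works through that arithmetic. The function name is hypothetical, and the second calculation, which assumes the 2 dB loss noted on the slide must be made up by extra laser power, is an assumption added for illustration rather than something the slide states.

```python
def laser_power_saving(num_columns: int, guiding_loss_db: float = 0.0) -> float:
    """Fractional laser-power saving when only one of num_columns is powered,
    relative to powering every column at once."""
    if num_columns < 1:
        raise ValueError("num_columns must be at least 1")
    # Extra power needed to compensate the guiding loss (dB -> linear factor).
    loss_factor = 10 ** (guiding_loss_db / 10.0)
    return 1.0 - loss_factor / num_columns

ideal = laser_power_saving(4)                           # lossless guiding
with_loss = laser_power_saving(4, guiding_loss_db=2.0)  # assumed 2 dB penalty

print(f"ideal saving: {ideal:.0%}")
print(f"saving if 2 dB must be compensated: {with_loss:.1%}")
```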

SLIDE 11

Optical Token System (1/3)

[Figure: tiles in the order 0, 4, 8, 12, 13, 9, 5, 1, 2, 6, 10, 14, 15, 11, 7, 3, with power, inject, and return lines and the control center. Labels: "Requests a token", "inject token", "Received Token".]

SLIDE 12

Optical Token System (2/3)

[Figure: tiles in the order 0, 4, 8, 12, 13, 9, 5, 1, 2, 6, 10, 14, 15, 11, 7, 3, with power, inject, and return lines. Labels: "Requests a token", "inject token", "Token Re-Injected".]

SLIDE 13

Optical Token System (3/3)

[Figure: tiles in the order 0, 4, 8, 12, 13, 9, 5, 1, 2, 6, 10, 14, 15, 11, 7, 3, with power, inject, and return lines. Labels: "Requests a token", "inject token", "Token Returns", "To next column".]

Fairness can be ensured with additional techniques (Fair slot, Two pass)
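Token-based access can be illustrated with a small simulation. This is a deliberately simplified sketch, not the CC-NPA design itself: it assumes a single token that visits tiles in the order shown in the figure labels, that each requesting tile captures the token once, broadcasts, and re-injects it, and it ignores per-column tokens and the control center. The names `RING_ORDER` and `arbitrate` are hypothetical.

```python
# Tile order as listed in the slide's figure labels (assumed to be the token path).
RING_ORDER = [0, 4, 8, 12, 13, 9, 5, 1, 2, 6, 10, 14, 15, 11, 7, 3]

def arbitrate(requests, ring=RING_ORDER, start=0):
    """Return the order in which requesting tiles are granted the token.

    requests: tile ids that each have one pending broadcast.
    start:    index into `ring` where the token begins circulating.
    """
    pending = set(requests)
    unknown = pending - set(ring)
    if unknown:
        raise ValueError(f"tiles not on the ring: {sorted(unknown)}")
    grants = []
    pos = start
    while pending:
        tile = ring[pos % len(ring)]
        if tile in pending:
            # The tile captures the token, broadcasts, then re-injects it.
            pending.remove(tile)
            grants.append(tile)
        pos += 1  # the token moves on to the next tile on the ring
    return grants

print(arbitrate([3, 13, 4]))           # token starts at tile 0
print(arbitrate([3, 13, 4], start=5))  # token starts at tile 9
```

The grant order depends only on where the token currently is, which is why tiles just downstream of the token are favored and why extra fairness mechanisms can be useful.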

SLIDE 14

Proposed Multicast Sub-Network

For larger networks, snoopy-based cache coherence reduces performance

Broadcasting data to all shared tiles consumes more address bandwidth
Consumes more latency and power at the caches

Wavelength-routed second multicast sub-network

Filter and route cache requests to nodes that hold the cache data
Reduction in required bandwidth and power dissipation
Potential for simultaneous multiple requests (could lead to race conditions)

[Figure: Percentage of requests with multiple sharers for FFT, LU, Radix, and Ocean; series: Single-Sharer, Multiple-Sharer]

SLIDE 15


Initial Performance Analysis

Performance Comparison

– Simics with Gems Memory Module
– FFT, LU, Radiosity, Ocean, Radix, & Water

Area & Power Analysis

Simics Parameters

Parameter                   Value
L1/L2 coherence             [garbled in source]
L2 cache size/assoc         256 KB/16-way
L1 cache/assoc              64 KB/4-way
Cache line size             64 B
Memory controllers          16
Core frequency              5 GHz
Threads (core)              2
Issue policy                In-order
Memory size (GB)            4
Address bandwidth (opt)     [garbled in source]
Address bandwidth (elec)    320 GBps

SLIDE 16

Splash-2 Speed up (16-cores)

[Figure: speedup for FFT, LU, Radiosity, Radix, and Raytrace; series: Electrical, CC-NPA]

CC-NPA increases performance by about 25%

SLIDE 17

Splash-2 Speed up (64-cores)

[Figure: speedup for FFT, LU, Radix, and Ocean; series: Electrical, CC-NPA]

CC-NPA increases performance by up to 2x

SLIDE 18

Power Analysis

Device                        Loss (dB)
Coupler (LC)                  1
Filter drop (LF)              1
Non-linearity (LN)            1
Bending (LB)                  1
Photo-detector (LP)           1
Waveguide crossing (LWC)      0.05
Modulator insertion (LI)      1
Waveguide, per cm (LW)        1.3
Splitter (LS)                 3

Receiver sensitivity (LRS)    -20 dBm
Laser efficiency              30%
Ring modulation               150 [unit garbled in source]
Ring heating                  100 [unit garbled in source]
TIA/voltage amp.              1.1 [unit garbled in source], 100 [unit garbled in source]

[Figure: optical path annotated with the loss terms LB, LC, LS, LI, LF, LWC, LP, LRS, and LW]

5×LS + 7×LW + LC + LN + 3×LI + LF + 8×LB + 100×LWC

-43.1 dB (per wavelength)

Total Power (opt) = 5.44 W (8 wavelengths)
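The loss expression and the total power can be checked numerically. The sketch below evaluates the loss sum with the per-device losses from the table, then applies one plausible reading of the power budget: each wavelength must arrive at the -20 dBm receiver sensitivity, and the resulting optical power for 8 wavelengths is divided by the 30% laser efficiency. The slide does not spell out this derivation, so the method is an assumption; it is shown because it reproduces the stated 43.1 dB and 5.44 W. All variable names are hypothetical.

```python
# Per-device losses in dB, as read from the slide's table.
LOSS_DB = {"LS": 3.0, "LW": 1.3, "LC": 1.0, "LN": 1.0,
           "LI": 1.0, "LF": 1.0, "LB": 1.0, "LWC": 0.05}
# Multipliers from 5*LS + 7*LW + LC + LN + 3*LI + LF + 8*LB + 100*LWC.
COUNT = {"LS": 5, "LW": 7, "LC": 1, "LN": 1,
         "LI": 3, "LF": 1, "LB": 8, "LWC": 100}

RECEIVER_SENSITIVITY_DBM = -20.0
LASER_EFFICIENCY = 0.30
NUM_WAVELENGTHS = 8

# Worst-case path loss per wavelength, in dB.
path_loss_db = sum(COUNT[k] * LOSS_DB[k] for k in COUNT)

# Optical power each wavelength needs at the source (assumed budget method).
per_wavelength_dbm = RECEIVER_SENSITIVITY_DBM + path_loss_db
per_wavelength_mw = 10 ** (per_wavelength_dbm / 10.0)

# Total laser power after accounting for laser efficiency, in watts.
total_w = NUM_WAVELENGTHS * per_wavelength_mw / LASER_EFFICIENCY / 1000.0

print(f"path loss: {path_loss_db:.1f} dB per wavelength")
print(f"total power: {total_w:.2f} W for {NUM_WAVELENGTHS} wavelengths")
```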

SLIDE 19

Area Analysis

Device                    Area
Waveguide (pitch)         5.5 µm
Micro-ring resonator      100 µm²
Photo-detector            100 µm²
TIA/Limiting amplifier    0.02625 mm²

[Figure: link components. Optical layer: off-chip laser, on-chip modulator, transmission medium, photodetector. Electronics layer: TIA, buffer chain, limiting amplifier, driver for electronics.]

Broadcast Sub-Network: 24 mm² (optical), 51 mm² (electrical)

SLIDE 20

Conclusion & Future Work

CC-NPA is both a low-power and high-bandwidth network for future cache coherent many-core processors

CC-NPA combines the benefits of snoopy cache coherent protocols and nanophotonics

CC-NPA provides scalable bandwidth using two sub-networks (broadcast and multicast)

Future work will involve designing and optimizing the