Scalable Nanophotonic Interconnect for Cache Coherent Multicores




SLIDE 1

Scalable Nanophotonic Interconnect for Cache Coherent Multicores

Randy W. Morris, Jr. and Avinash K. Kodi

Department of Electrical Engineering and Computer Science Ohio University, Athens, OH 45701 E-mail: rmorris@cs.ohiou.edu, kodi@ohio.edu Website: http://oucsace.cs.ohiou.edu/~avinashk/

WINDS 2010: Workshop on the Interaction between Nanophotonic Devices and Systems Atlanta, GA December 5, 2010

SLIDE 2

Talk Outline

  • Section I: Motivation & Background
  • Section II: Dual Sub-Network for Snoopy Cache Coherent Nanophotonic Architecture
  • Section IV: Performance Analysis
  • Section V: Future Work

SLIDE 3

Why Nanophotonics?

  • Power consumption of Network-on-Chips (NoCs) [1] using metallic interconnects is projected to exceed expectations by a factor of 10

Tile Power: Intel Tera-Flops (65 nm) [2]

  • Dual FPMACs: 36%
  • Router & Links: 28%
  • IMEM + DMEM: 21%
  • Clock Distribution: 11%
  • 10-port RF: 4%

Nanophotonic Technology

  • Low Power
  • Small Footprint (10 – 15 µm)
  • High Bandwidth (10 – 20 Gbps)
  • CMOS Compatibility

  • 1. J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and L. S. Peh, “Research Challenges for On-Chip Interconnection Networks,” IEEE Micro, vol. 27, no. 5, pp. 96–108, September-October 2007.
  • 2. Y. Hoskote, “A 5-GHz Mesh Interconnect for a Teraflops Processor,” IEEE Computer Society, 2007, pp. 51–61.
SLIDE 4

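Micro-ring Resonators

Resonant wavelength ($\lambda_0$):

$$\lambda_0 \cdot m = n_{\text{eff}} \cdot 2\pi R$$

  • $m$: an integer
  • $n_{\text{eff}}$: effective refractive index
  • $R$: radius of the ring resonator

[Figure: micro-ring resonator with an n+/p+/n+ junction and ring bias VR, shown in two states (VR = VOFF and VR = VON), with Input Port 0, Output Port 0, and Output Port 1]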
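The resonance condition on this slide can be turned into a small calculation. The sketch below lists the wavelengths inside a band that satisfy m · λ0 = n_eff · 2πR for integer m. The function name and the numeric values in the usage note (n_eff = 2.5, R = 5 µm, a 1.5 to 1.6 µm band) are illustrative assumptions, not figures from the talk.

```python
import math

def resonant_wavelengths(n_eff, radius_m, lo_m, hi_m):
    """Return (m, wavelength) pairs with m * wavelength = n_eff * 2*pi*R
    for every integer m whose wavelength falls inside [lo_m, hi_m]."""
    optical_path = n_eff * 2 * math.pi * radius_m   # right-hand side of the condition
    m_min = math.ceil(optical_path / hi_m)          # longest wavelength -> smallest m
    m_max = math.floor(optical_path / lo_m)         # shortest wavelength -> largest m
    return [(m, optical_path / m) for m in range(m_min, m_max + 1)]

# Hypothetical ring: n_eff = 2.5, R = 5 um, band 1.5 um to 1.6 um
for m, lam in resonant_wavelengths(2.5, 5e-6, 1.5e-6, 1.6e-6):
    print(m, round(lam * 1e6, 4), "um")
```

With these assumed values the ring has three resonances in the band (m = 50, 51, 52). Changing the effective index, which is what the bias voltage does in an electro-optic ring, shifts all of them together.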

  • 1. M. Lipson, “Compact Electro-Optic Modulators on a Silicon Chip,” IEEE J. Sel. Top. Quant., vol. 12, no. 6, pp. 1520–1526, Nov.-Dec. 2006.
  • 2. M. Lipson, “Guiding, Modulating and Emitting Light on Silicon - Challenges and Opportunities,” IEEE Journal of Lightwave Technology, vol. 23, no. 12, December 2005 (invited).

SLIDE 5

Cache Coherence

  • Write propagation (a write by any processor should become visible to all other processors)
  • Write serialization (all writes from the same or different processors are seen in the same order by all processors)

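Snoopy Protocols

[Figure: processors P1 to P4, each with a cache ($), broadcasting to Memory]

  • Easy to program
  • Not easily scalable

Directory Protocols

[Figure: processors P1 ... P4, each with a cache ($), memory (M), and directory (D), connected point-to-point through an Interconnection Network]

  • Scalable
  • High miss latency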
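Write propagation and write serialization can be illustrated with a toy write-invalidate bus. This is a minimal sketch under simplifying assumptions (write-through memory, one bus transaction at a time, no coherence states); the class and method names are hypothetical and it is not the protocol evaluated in the talk.

```python
class SnoopyBus:
    """Toy write-invalidate bus: one request at a time, every cache snoops it."""

    def __init__(self, n_caches):
        self.memory = {}                                # address -> value
        self.caches = [{} for _ in range(n_caches)]     # per-CPU valid copies
        self.log = []                                   # global order of writes on the bus

    def write(self, cpu, addr, value):
        self.log.append((cpu, addr, value))             # bus order = serialization order
        for i, cache in enumerate(self.caches):         # broadcast: all other caches snoop
            if i != cpu:
                cache.pop(addr, None)                   # invalidate stale copies
        self.caches[cpu][addr] = value
        self.memory[addr] = value                       # write-through, for simplicity

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                           # miss: fetch from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

bus = SnoopyBus(4)
bus.read(1, 0x40)          # P2 caches the initial value
bus.write(0, 0x40, 7)      # P1 writes; P2's copy is invalidated
print(bus.read(1, 0x40))   # P2 re-fetches and sees the new value
```

Because every write passes through the single bus, all caches observe writes in the order recorded in `log`. That is also why a bus handles only one request per cycle, which is the scalability limit the next slide discusses.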

SLIDE 6

Problems with Snoopy Networks

Two major problems with snoopy cache coherent networks:

(1) Interconnect bandwidth for broadcasting memory requests

  • Bus networks: limited to one request per cycle
  • Multiple buses: increase the number of cache controllers
  • Point-to-point networks: require selective multicasting & ordering

(2) Cache access rate

  • Cache tag lookup (latency)
  • Increased power consumption
SLIDE 7

Related Work (to name a few)

Electrical

  • Split Transactional Bus
  • Sun Fireplane (SC 2001)
  • Timestamp Snooping (ASPLOS 2000), Multicast Snooping (ISCA 2001)
  • Jetty (HPCA 2001), Region Scout (ISCA 2005), Intel QPI
  • Broadcasting on Ordered Networks (HPCA 2009, MICRO 2009)

Optical/Nanophotonic

  • SYMNET (Trans on Parallel & Dist Systems 2004)
  • Shared Bus (MICRO 2006), Wavelength Routed Oblivious Network (ASPLOS 2010)
  • Spectra (ISLPED 2009), ATAC (PACT 2010)
SLIDE 8

CC-NPA Architecture

  • Advantages of the proposed architecture
      – Dual sub-networks for memory requests
          • Broadcast & Multicast networks
      – Broadcast network used by all tiles to fetch the missed block
          • Network access implemented using tokens
          • Determines the sharing pattern
      – Multicast network to be shared between nodes to send selective requests
          • Reduces the broadcast requirement
          • Simultaneous transient requests in progress to different memory locations
      – Reducing the external laser power by unique power guiding techniques

SLIDE 9

Proposed Broadcast Sub-Network Architecture: CC-NPA

[Figure: 4 × 4 tile layout with a control center]

  Tile 0   Tile 4   Tile 8    Tile 12
  Tile 1   Tile 5   Tile 9    Tile 13
  Tile 2   Tile 6   Tile 10   Tile 14
  Tile 3   Tile 7   Tile 11   Tile 15

[Figure: tile detail showing Core 0 to Core 3, each with an L1 cache, a shared L2, a transmitter, and a receiver]

SLIDE 10

Power Guiding

As only one core can transmit at a time, optical power is routed to a single column of cores.

  • Reduction in optical power (~75%)
  • The active column is determined by the circulating optical tokens
  • 2 dB optical loss

[Figure: laser power guided to Column 0, Column 1, Column 2, or Column 3]
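The ~75% figure is consistent with powering one of four columns instead of all four. The sketch below shows that arithmetic and, separately, what a 2 dB loss means as a linear factor. Treating the saving as 1 − 1/4 is an assumption about how the number was obtained, and the function names are hypothetical.

```python
def power_saving(n_columns, active_columns=1):
    """Fraction of optical power saved when only some columns are lit."""
    return 1 - active_columns / n_columns

def db_to_linear(loss_db):
    """Fraction of optical power that survives a loss given in dB."""
    return 10 ** (-loss_db / 10)

print(power_saving(4))               # one of four columns active
print(round(db_to_linear(2.0), 3))   # transmission through a 2 dB loss
```

A 2 dB loss passes roughly 63% of the power. Whether the quoted ~75% reduction is net of that loss is not stated on the slide.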

SLIDE 11

Optical Token System (1/3)

[Figure: the 16 tiles in the order Tile 0, 4, 8, 12, 13, 9, 5, 1, 2, 6, 10, 14, 15, 11, 7, 3, connected to the Control Center by power, inject, and return lines. Step shown: a token is requested, injected, and received.]

SLIDE 12

Optical Token System (2/3)

[Figure: the 16 tiles in the order Tile 0, 4, 8, 12, 13, 9, 5, 1, 2, 6, 10, 14, 15, 11, 7, 3, with power, inject, and return lines. Step shown: a token is requested and injected, then re-injected.]

SLIDE 13

Optical Token System (3/3)

[Figure: the 16 tiles in the order Tile 0, 4, 8, 12, 13, 9, 5, 1, 2, 6, 10, 14, 15, 11, 7, 3, with power, inject, and return lines. Step shown: the token returns and is passed to the next column.]

  • Fairness can be ensured with additional techniques (Fair slot, Two pass)
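The token steps on these three slides can be mimicked in software as a circulating token that stops at the first requesting tile it reaches and is then re-injected. This is a simplified sketch: the visiting order is taken from the tile listing in the figure and is assumed to be the token path, and `TokenRing` and `next_grant` are hypothetical names. It does not model the optical implementation or the fairness techniques mentioned above.

```python
# Tile order as listed in the figure (assumed to be the token path)
RING_ORDER = [0, 4, 8, 12, 13, 9, 5, 1, 2, 6, 10, 14, 15, 11, 7, 3]

class TokenRing:
    def __init__(self, order):
        self.order = list(order)
        self.pos = 0                       # index of the next tile the token will pass

    def next_grant(self, requesting):
        """Advance the token until it reaches a requesting tile.
        Returns that tile, or None after one full lap with no requester."""
        for _ in range(len(self.order)):
            tile = self.order[self.pos]
            self.pos = (self.pos + 1) % len(self.order)   # token moves on (re-injected)
            if tile in requesting:
                return tile
        return None

ring = TokenRing(RING_ORDER)
pending = {5, 12, 3}                       # tiles that currently want to broadcast
while pending:
    winner = ring.next_grant(pending)
    pending.discard(winner)
    print("tile", winner, "transmits")
```

In this model only the token holder transmits, so broadcasts are ordered by token arrival rather than by request time.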

SLIDE 14

Proposed Multicast Sub-Network

For larger networks, snoopy-based cache coherence reduces performance

  • Broadcasting data to all shared tiles consumes more address bandwidth
  • Incurs more latency and power at the caches

  • Wavelength-routed second multicast sub-network
  • Filter and route cache requests to nodes that hold the cache data
  • Reduction in required bandwidth and power dissipation
  • Potential for simultaneous multiple requests (could lead to race conditions)

[Figure: Percentage of requests with multiple sharers (Single-Sharer vs. Multiple-Sharer) for FFT, LU, Radix, and Ocean]
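The filtering idea (send a request only to the nodes that hold the block) can be sketched as a destination-set lookup. The slides list the multicast sub-network as future work, so this is purely illustrative: the sharer table, the fallback to broadcast, and the function name are all assumptions.

```python
def destinations(addr, requester, sharers, n_tiles):
    """Tiles that must see a request for `addr`.
    `sharers` maps an address to the set of tiles known to hold it."""
    known = sharers.get(addr)
    if known is None:                               # sharing pattern unknown: broadcast
        return set(range(n_tiles)) - {requester}
    return set(known) - {requester}                 # selective multicast

sharers = {0x80: {2, 5}}                            # hypothetical sharing pattern
print(len(destinations(0x100, 2, sharers, 16)))     # unknown block: all other tiles
print(destinations(0x80, 2, sharers, 16))           # known block: only the other sharer
```

When most blocks have a single sharer, the destination set is small, which is where the bandwidth and cache-lookup power savings claimed above would come from.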

SLIDE 15

Initial Performance Analysis

  • Performance Comparison
      – Simics with GEMS memory module
      – FFT, LU, Radiosity, Ocean, Radix, & Water
  • Area & Power Analysis

Simics Parameters

  Parameter                    Value
  L1/L2 coherence              (garbled in source)
  Core frequency               5 GHz
  L2 cache size/assoc          256 KB/16-way
  Threads (core)               2
  L1 cache/assoc               64 KB/4-way
  Issue policy                 In-order
  Cache line size              64 B
  Memory size (GB)             4
  Memory controllers           16
  Address bandwidth (opt)      (garbled in source)
  Address bandwidth (elec)     320 (unit garbled in source)

SLIDE 16

Splash-2 Speed up (16-cores)

[Figure: Splash-2 speedup on 16 cores, Electrical vs. CC-NPA, for FFT, LU, Radiosity, Radix, and Raytrace]

  • CC-NPA increases performance by about 25%
SLIDE 17

Splash-2 Speed up (64-cores)

[Figure: Splash-2 speedup on 64 cores, Electrical vs. CC-NPA, for FFT, LU, Radix, and Ocean]

  • CC-NPA increases performance by up to 2x
SLIDE 18

Power Analysis

  Device                        Loss (dB)
  Coupler (L_C)                 1
  Filter drop (L_F)             1
  Non-linearity (L_N)           1
  Bending (L_B)                 1
  Photo-detector (L_P)          1
  Waveguide crossing (L_WC)     0.05
  Modulator insertion (L_I)     1
  Waveguide (per cm) (L_W)      1.3
  Splitter (L_S)                3

  Other parameters              Value
  Receiver sensitivity (L_RS)   -20 dBm
  Laser efficiency              30%
  Ring modulation               150 per bit (unit garbled in source)
  Ring heating                  100 per bit (unit garbled in source)
  TIA/voltage amp.              1.1 + 100 per bit (units garbled in source)

[Figure: optical path annotated with the loss terms L_B, L_C, L_S, L_I, L_F, L_WC, L_W, L_P, and L_RS]

Total optical loss per wavelength:

$$5 L_S + 7 L_W + L_C + L_N + 3 L_I + L_F + 8 L_B + 100 L_{WC} = 43.1\ \text{dB}$$

that is, -43.1 dB (per wavelength).

Total Power (opt) = 5.44 W (8 wavelengths)
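The loss expression and the total power can be checked with a short link-budget calculation. The sketch below uses the loss values and term counts from this slide, and assumes that each wavelength must arrive at the receiver at the -20 dBm sensitivity and that the 30% laser efficiency converts optical output to input power. That chain of assumptions is not spelled out on the slide, but it does reproduce both quoted numbers. All names are hypothetical.

```python
# Loss per device in dB, from the table on this slide
LOSS_DB = {"splitter": 3.0, "waveguide_per_cm": 1.3, "coupler": 1.0,
           "non_linearity": 1.0, "modulator_insertion": 1.0,
           "filter_drop": 1.0, "bending": 1.0, "waveguide_crossing": 0.05}

# How many times each term appears in the slide's loss expression
COUNTS = {"splitter": 5, "waveguide_per_cm": 7, "coupler": 1,
          "non_linearity": 1, "modulator_insertion": 3,
          "filter_drop": 1, "bending": 8, "waveguide_crossing": 100}

def total_loss_db():
    """Sum of count * loss over all terms of the path expression."""
    return sum(COUNTS[name] * LOSS_DB[name] for name in COUNTS)

def laser_input_power_w(sensitivity_dbm=-20.0, wavelengths=8, efficiency=0.30):
    """Power drawn by the laser so every wavelength still meets the
    receiver sensitivity after the total path loss (assumed model)."""
    launch_dbm = sensitivity_dbm + total_loss_db()   # required power at the source
    launch_mw = 10 ** (launch_dbm / 10)              # dBm -> mW, per wavelength
    return wavelengths * launch_mw / efficiency / 1000.0

print(round(total_loss_db(), 1), "dB per wavelength")
print(round(laser_input_power_w(), 2), "W for 8 wavelengths")
```

The splitter term (5 × 3 dB) and the waveguide term (7 × 1.3 dB) dominate the budget, which suggests why guiding power to one column at a time reduces the laser requirement so sharply.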

SLIDE 19

Area Analysis

  Device                      Area
  Waveguide (pitch)           5.5 µm
  Micro-ring resonator        100 µm²
  Photo-detector              100 µm²
  TIA/Limiting Amplifier      0.02625 mm²

[Figure: link components across the optical and electronics layers: off-chip laser, on-chip modulator, transmission medium, photodetector, TIA, limiting amplifier, buffer chain, and driver for electronics]

  • Broadcast Sub-Network: 24 mm² (optical), 51 mm² (electrical)

SLIDE 20

Conclusion & Future Work

  • CC-NPA is both a low-power & high-bandwidth network for future cache coherent many-core processors
  • CC-NPA combines the benefits of snoopy cache coherent protocols and nanophotonics
  • CC-NPA provides scalable bandwidth using two sub-networks (broadcast and multicast)
  • Future work will involve designing and optimizing the multicast sub-network