 
Scalable Nanophotonic Interconnect for Cache Coherent Multicores
Randy W. Morris, Jr. and Avinash K. Kodi
Department of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701
E-mail: rmorris@cs.ohiou.edu, kodi@ohio.edu
Website: http://oucsace.cs.ohiou.edu/~avinashk/
WINDS 2010: Workshop on the Interaction between Nanophotonic Devices and Systems, Atlanta, GA, December 5, 2010
Talk Outline
• Section I: Motivation & Background
• Section II: Dual Sub-Network for Snoopy Cache Coherent Nanophotonic Architecture
• Section IV: Performance Analysis
• Section V: Future Work
Why Nanophotonics?
• Power consumption of Networks-on-Chip (NoCs) [1] using metallic interconnects is projected to exceed expectations by a factor of 10
Tile Power: Intel Tera-Flops (65 nm) [2]
- Dual FPMACs: 36%
- Router & Links: 28%
- IMEM + DMEM: 21%
- Clock Distribution: 11%
- 10-port RF: 4%
Nanophotonic Technology
- Low Power
- Small Footprint (10 – 15 μm)
- High Bandwidth (10 – 20 Gbps)
- CMOS Compatibility
1. J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler and L. S. Peh, "Research Challenges for On-Chip Interconnection Networks", IEEE Micro, vol. 27, no. 5, pp. 96 – 108, September-October 2007.
2. Y. Hoskote, "A 5-GHz Mesh Interconnect for A Teraflops Processor," IEEE Computer Society, 2007, pp. 51-61.
Micro-ring Resonators
Resonant wavelength ($\lambda_0$):
$$\lambda_0 \cdot m = n_{\mathrm{eff}} \cdot 2\pi R$$
- $m$: an integer
- $n_{\mathrm{eff}}$: effective refractive index
- $R$: radius of the ring resonator
[Figure: micro-ring resonator with n+ / p+ / n+ doped regions, shown in two bias states (V_R = V_ON and V_R = V_OFF); ports labeled Input Port 0, Output Port 0, and Output Port 1]
1. M. Lipson, "Compact Electro-Optic Modulators on a Silicon Chip", IEEE J. Sel. Top. Quant., Vol. 12, No. 6, Nov.-Dec. 2006, p. 1520-6.
2. M. Lipson, "Guiding, Modulating and Emitting Light on Silicon - Challenges and Opportunities", IEEE Journal of Lightwave Technologies, Vol. 23, No. 12, December 2005 (invited).
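The resonance condition above can be evaluated numerically. The following is a minimal sketch, not taken from the talk: the effective index (2.5), the ring radius (5 μm), and the 1500 – 1600 nm band are hypothetical values chosen only for illustration, and the effective index is assumed constant with wavelength.

```python
import math

def resonant_wavelengths(n_eff, radius_um, band_nm=(1500.0, 1600.0)):
    """Return (m, wavelength_nm) pairs satisfying lambda0 * m = n_eff * 2*pi*R.

    n_eff, radius_um and band_nm are illustrative inputs, not values from the slides.
    """
    optical_path_nm = n_eff * 2.0 * math.pi * radius_um * 1000.0  # um -> nm
    lo, hi = band_nm
    m_min = math.ceil(optical_path_nm / hi)   # smallest integer order inside the band
    m_max = math.floor(optical_path_nm / lo)  # largest integer order inside the band
    return [(m, optical_path_nm / m) for m in range(m_min, m_max + 1)]

modes = resonant_wavelengths(n_eff=2.5, radius_um=5.0)
for m, wl in modes:
    print(f"m = {m}: lambda0 = {wl:.2f} nm")
```

Each integer order m gives one resonance, so a larger ring (or higher effective index) packs more resonances into the same band.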
Cache Coherence
- Write propagation (a write by any processor should become visible to all other processors)
- Write serialization (all writes, from the same or different processors, are seen in the same order by all processors)
Snoopy Protocols
- Broadcast
- Easy to Program
- Not Easily Scalable
Directory Protocols
- Point-to-Point
- Scalable
- High miss latency
[Figure: snoopy system (processors P1 – P4 with caches, broadcast to memory) beside a directory system (processors with caches, memory, and directories joined by an interconnection network)]
Problems with Snoopy Networks
Two major problems with snoopy cache coherent networks:
(1) Interconnect bandwidth for broadcasting of memory requests
- Bus Networks: limited to one request per cycle
- Multiple Buses: increase the number of cache controllers
- Point-to-Point Networks: require selective multicasting & ordering
(2) Cache Access Rate
- Cache tag lookup (latency)
- Increased power consumption
Related Work (to name a few)
Electrical
– Split Transactional Bus
– Sun Fireplane (SC 2001)
– Timestamp Snooping (ASPLOS 2000), Multicast Snooping (ISCA 2001)
– Jetty (HPCA 2001), Region Scout (ISCA 2005), Intel QPI
– Broadcasting on Ordered Networks (HPCA 2009, MICRO 2009)
Optical/Nanophotonic
- SYMNET (Trans on Parallel & Dist Systems 2004)
- Shared Bus (MICRO 2006), Wavelength Routed Oblivious Network (ASPLOS 2010)
- Spectra (ISPLED 2009), ATAC (PACT 2010)
CC-NPA Architecture
• Advantages of the proposed architecture
  – Dual sub-networks for memory requests
    • Broadcast & Multicast networks
  – Broadcast network used by all tiles to fetch the missed block
    • Network access implemented using tokens
    • Determines the sharing pattern
  – Multicast network shared between nodes to send selective requests
    • Reduces the broadcast requirement
    • Simultaneous transient requests in progress to different memory locations
  – Reduced external laser power through unique power guiding techniques
Proposed Broadcast Sub-Network Architecture: CC-NPA
[Figure: 4 × 4 grid of tiles (Tile 0 – Tile 15) connected to a control center; tile detail shows Core 0 – Core 3, their L1 caches, a shared L2, a transmitter, and a receiver]
Power Guiding
As only one core can transmit at a time, optical power is routed to a single column of cores.
- Reduction in optical power (~75%)
The active column is determined by the circulating optical tokens.
[Figure: power guiding stage routing laser power to Column 0, Column 1, Column 2, or Column 3; label: 2 dB optical loss]
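One plausible reading of the ~75% figure is that power is steered to one of four columns instead of being split across all four. The sketch below works through that arithmetic under two stated assumptions that are not confirmed by the slides: every column would otherwise receive an equal share of the laser power, and the loss of the guiding stage itself is ignored.

```python
def power_reduction(num_columns: int) -> float:
    """Fraction of optical power saved when only one of num_columns is fed.

    Assumes equal power per column and a lossless guiding stage (illustrative only).
    """
    guided_fraction = 1.0 / num_columns  # only the active column is powered
    return 1.0 - guided_fraction

print(f"{power_reduction(4):.0%}")  # prints "75%"
```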
Optical Token System (1/3)
[Figure: 4 × 4 tile grid (Tile 0 – Tile 15) with the control center. Labels: "Requests a token", "inject token power", "Token Received", "inject power", "return"]
Optical Token System (2/3)
[Figure: 4 × 4 tile grid (Tile 0 – Tile 15). Labels: "Requests a token", "inject token power", "Token Re-Injected", "inject power", "return"]
Optical Token System (3/3)
[Figure: 4 × 4 tile grid (Tile 0 – Tile 15). Labels: "Requests a token", "inject token power", "Token Returns", "To next column", "inject power", "return"]
Fairness can be ensured with additional techniques (Fair slot, Two pass)
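The request, capture, and re-injection steps shown across these three slides can be approximated in software. The sketch below is a simplified model, not the authors' hardware scheme: it assumes a single token that visits tiles in index order, lets a tile serve one pending request per capture, and uses a hypothetical `pending` map of queued requests. The column-wise power injection by the control center is not modeled.

```python
def simulate_token_ring(num_tiles, pending, max_hops=1000):
    """Grant broadcast slots by passing one token around the tiles.

    pending maps tile id -> number of queued requests (hypothetical input).
    Returns the tile ids in the order they captured the token.
    """
    remaining = dict(pending)
    grants = []
    holder = 0
    for _ in range(max_hops):
        if not any(remaining.values()):
            break                           # every request has been served
        if remaining.get(holder, 0) > 0:
            grants.append(holder)           # tile captures the token and broadcasts
            remaining[holder] -= 1          # one request served per capture
        holder = (holder + 1) % num_tiles   # token is re-injected toward the next tile
    return grants

order = simulate_token_ring(16, {5: 2, 2: 1, 9: 1})
print(order)  # [2, 5, 9, 5]
```

Serving only one request per capture means a tile with a deep queue cannot hold the token indefinitely, which is one simple way to get the kind of fairness the slide alludes to.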
Proposed Multicast Sub-Network
For larger networks, snoopy-based cache coherence reduces performance
- Broadcasting data to all shared tiles consumes more address bandwidth
- Increases latency and power consumption at the caches
• Wavelength-routed second multicast sub-network
• Filters and routes cache requests to nodes that hold the cache data
• Reduction in required bandwidth and power dissipation
• Potential for simultaneous multiple requests (could lead to race conditions)
[Figure: percentage of requests with multiple sharers for FFT, LU, Radix, and Ocean; legend: Single-Sharer, Multiple-Sharer]
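The bandwidth saving from filtering can be illustrated by counting how many tiles must snoop a request. This is a minimal sketch under stated assumptions: the function name `snoop_targets` and the sharer set are hypothetical, and how the sharing pattern is actually learned or stored is not modeled here.

```python
def snoop_targets(requester, num_tiles, sharers=None):
    """Return the tiles that must look up their cache tags for one request.

    sharers=None models a full broadcast; a set of tile ids models a
    filtered multicast (the sharer set is a hypothetical input).
    """
    if sharers is None:
        return [t for t in range(num_tiles) if t != requester]
    return sorted(t for t in sharers if t != requester)

broadcast = snoop_targets(requester=0, num_tiles=16)
multicast = snoop_targets(requester=0, num_tiles=16, sharers={3, 7})
print(len(broadcast), len(multicast))  # 15 2
```

With 16 tiles, a broadcast triggers 15 tag lookups while a request with two known sharers triggers only 2, which is where the latency and power savings at the caches would come from.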
Initial Performance Analysis
• Performance Comparison
  – Simics with GEMS memory module
  – FFT, LU, Radiosity, Ocean, Radix, & Water
• Area & Power Analysis
Simics Parameters
Parameter | Value
L1/L2 coherence | [illegible in source]
Core frequency | 5 GHz
L2 cache size/assoc | 256 KB/16-way
L1 cache/assoc | 64 KB/4-way
Cache line size | 64 B
Threads (core) | 2
Issue policy | In-order
Memory size (GB) | 4
Memory controllers | 16
Address bandwidth (opt) | [illegible in source]
Address bandwidth (elec) | 320 GBps
Splash-2 Speed up (16-cores)
[Figure: bar chart of speed-up, Electrical vs. CC-NPA, for FFT, LU, Radiosity, Radix, and Raytrace]
- CC-NPA increases performance by about 25%
Splash-2 Speed up (64-cores)
[Figure: bar chart of speed-up, Electrical vs. CC-NPA, for FFT, LU, Radix, and Ocean]
- CC-NPA increases performance by up to 2x
Power Analysis
Device | Loss (dB)
Coupler ($L_c$) | 1
Filter drop ($L_f$) | 1
Non-linearity ($L_n$) | 1
Bending ($L_B$) | 1
Photo-detector ($L_p$) | 1
Waveguide crossing ($L_{wc}$) | 0.05
Modulator insertion ($L_i$) | 1
Waveguide, per cm ($L_W$) | 1.3
Splitter ($L_s$) | 3
Receiver sensitivity ($L_{RS}$) | -20 dBm
Other parameters
Laser efficiency | 30%
Ring modulation | 150 fJ/b
Ring heating | 100 fJ/b
TIA/voltage amp. | 1.1 pJ/b [symbol illegible in source] 100 fJ/bit
Loss budget (per wavelength):
$$5 L_s + 7 L_W + L_c + L_n + 3 L_i + L_f + 8 L_B + 100 L_{wc} = 43.1\ \mathrm{dB}$$
Total Power (opt) = 5.44 W (8 wavelengths)
[Figure: optical path annotated with the loss components $L_{wc}$, $L_s$, $L_{RS}$, $L_p$, $L_f$, $L_B$, $L_W$, $L_c$]
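The loss budget and total power on this slide can be checked with a short calculation. The sketch below is an interpretation rather than the authors' stated method: the component counts are read from the budget expression as 5 splitters, 7 cm of waveguide, 1 coupler, 1 non-linearity term, 3 modulator insertions, 1 filter drop, 8 bends, and 100 waveguide crossings, and the total is assumed to be the per-wavelength laser output needed to meet the -20 dBm receiver sensitivity, multiplied by 8 wavelengths and divided by the 30% laser efficiency. Under those assumptions the slide's 43.1 dB and 5.44 W are reproduced.

```python
# Per-device losses in dB, as listed in the table on this slide.
LOSS_DB = {
    "splitter": 3.0, "waveguide_per_cm": 1.3, "coupler": 1.0,
    "non_linearity": 1.0, "modulator_insertion": 1.0,
    "filter_drop": 1.0, "bending": 1.0, "waveguide_crossing": 0.05,
}
# Component counts along the worst-case path (read from the budget expression).
COUNTS = {
    "splitter": 5, "waveguide_per_cm": 7, "coupler": 1,
    "non_linearity": 1, "modulator_insertion": 3,
    "filter_drop": 1, "bending": 8, "waveguide_crossing": 100,
}
RECEIVER_SENSITIVITY_DBM = -20.0
LASER_EFFICIENCY = 0.30
NUM_WAVELENGTHS = 8

def path_loss_db():
    """Sum count * loss over every device on the path."""
    return sum(COUNTS[name] * LOSS_DB[name] for name in COUNTS)

def total_laser_power_w():
    """Laser input power for all wavelengths (assumed interpretation)."""
    required_dbm = RECEIVER_SENSITIVITY_DBM + path_loss_db()
    per_wavelength_mw = 10 ** (required_dbm / 10.0)  # dBm -> mW
    return NUM_WAVELENGTHS * per_wavelength_mw / LASER_EFFICIENCY / 1000.0

print(f"{path_loss_db():.1f} dB")         # 43.1 dB
print(f"{total_laser_power_w():.2f} W")   # 5.44 W
```

Because the loss is in decibels, every additional 3 dB on the path roughly doubles the required laser power, which is why the splitter and crossing counts dominate the budget.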
Area Analysis
Device | Area
Waveguide (pitch) | 5.5 μm
Micro-ring resonator | 100 μm²
Photo-detector | 100 μm²
TIA/Limiting Amplifier | 0.02625 mm²
Broadcast Sub-Network: 24 mm² (optical), 51 mm² (electrical)
[Figure: link cross-section. Optical layer: off-chip laser, on-chip modulator, transmission medium, on-chip photodetector. Electronics layer: buffer chain, driver for electronics, TIA, limiting amplifier]
Conclusion & Future Work
• CC-NPA is both a low power & high bandwidth network for future cache coherent many-core processors
• CC-NPA combines the benefits of snoopy cache coherent protocols and nanophotonics
• CC-NPA provides scalable bandwidth using two sub-networks (broadcast and multicast)
• Future work will involve designing and optimizing the multicast sub-network