SLIDE 1 Building manycore processor-to-DRAM networks using monolithic silicon photonics
Ajay Joshi†, Christopher Batten†, Vladimir Stojanović†, Krste Asanović‡
†MIT, 77 Massachusetts Ave, Cambridge MA 02139 ‡UC Berkeley, 430 Soda Hall, MC #1776, Berkeley, CA 94720
{joshi, cbatten, vlada}@mit.edu, krste@eecs.berkeley.edu
High Performance Embedded Computing (HPEC) Workshop
23-25 September 2008
SLIDE 2
MIT/UCB
Manycore systems design space
SLIDE 3
Manycore system bandwidth requirements
SLIDE 4
Manycore systems – bandwidth, pin count, and power scaling
1 Byte/Flop, 8 Flops/core @ 5 GHz; Server & HPC vs. Mobile Client
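The bandwidth requirement implied by these numbers can be checked with a quick calculation, sketched below; the 256-core count is an illustrative choice, not fixed by this slide.

```python
# Back-of-envelope processor-to-DRAM bandwidth requirement, using the
# slide's figures: 1 byte/flop, 8 flops/cycle per core, 5 GHz clock.
def required_bandwidth_TBps(cores, flops_per_cycle=8, freq_ghz=5.0,
                            bytes_per_flop=1.0):
    """Aggregate memory bandwidth in TB/s for a manycore chip."""
    flops_per_sec = cores * flops_per_cycle * freq_ghz * 1e9
    return flops_per_sec * bytes_per_flop / 1e12

print(required_bandwidth_TBps(256))   # 10.24 TB/s for a 256-core chip
```

At this scale, pin count and per-pin energy, not compute, become the limiting resources, which motivates the rest of the talk.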
SLIDE 5
Interconnect bottlenecks
[Figure: manycore system with CPU cores, caches, and DRAM DIMMs connected through on-chip and off-chip interconnect networks]
Bottlenecks due to energy and bandwidth-density limitations
SLIDE 6
Interconnect bottlenecks
Bottlenecks due to energy and bandwidth-density limitations: the on-chip and off-chip interconnect networks need to be designed jointly
SLIDE 7
Outline
Motivation Monolithic silicon photonic technology Processor-memory network architecture exploration Manycore system using silicon photonics Conclusion
SLIDE 8
Unified on-chip/off-chip photonic link
Supports dense wavelength-division multiplexing, which improves bandwidth density
Uses monolithic integration, which reduces energy consumption
Utilizes the standard bulk CMOS flow
SLIDE 9
Optical link components
65 nm bulk CMOS chip designed to test various optical devices
SLIDE 10
Silicon photonics area and energy advantage
Metric                                              Energy (pJ/b)   Bandwidth density (Gb/s/μm)
Global on-chip photonic link                        0.25            160-320
Global on-chip optimally repeated electrical link   1               5
Off-chip photonic link (50 μm coupler pitch)        0.25            13-26
Off-chip electrical SERDES (100 μm pitch)           5               0.1
On-chip/off-chip seamless photonic link             0.25            –
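The table's advantage can be made concrete with a small sketch comparing off-chip options: photonic links at 0.25 pJ/b and 13 Gb/s/μm of coupler pitch (the conservative end) versus electrical SERDES at 5 pJ/b and 0.1 Gb/s/μm. The 1 mm of chip edge used below is an illustrative budget, not a figure from the slide.

```python
# Bandwidth and power achievable off a fixed length of chip edge,
# using the table's per-technology numbers.
def edge_bandwidth_gbps(edge_um, density_gbps_per_um):
    return edge_um * density_gbps_per_um

def link_power_w(bandwidth_gbps, energy_pj_per_bit):
    return bandwidth_gbps * 1e9 * energy_pj_per_bit * 1e-12

photonic_bw = edge_bandwidth_gbps(1000, 13)   # 13000 Gb/s off 1 mm of edge
serdes_bw = edge_bandwidth_gbps(1000, 0.1)    # 100 Gb/s off the same edge
print(photonic_bw / serdes_bw)                # 130x bandwidth-density advantage
print(link_power_w(photonic_bw, 0.25))        # 3.25 W to run that photonic edge
```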
SLIDE 11
Outline
Motivation Monolithic silicon photonic technology Processor-memory network architecture exploration
Baseline electrical mesh topology Electrical mesh with optical global crossbar topology
Manycore system using silicon photonics Conclusion
SLIDE 12
Baseline electrical system architecture
Access point (AP) per DM, distributed across the chip
Two on-chip electrical mesh networks:
Request path: core → access point → DRAM module
Response path: DRAM module → access point → core
Mesh physical and logical views
C = core, DM = DRAM module
SLIDE 13
Interconnect network design methodology
Ideal throughput and zero-load latency used as design metrics
An energy-constrained approach is adopted
Energy components in a network:
Mesh energy (Em): router-to-router links (RRL) and routers
I/O energy (Eio): logic-to-memory links (LML)
Sizing flow: flit width → on-chip RRL energy → on-chip router energy → mesh throughput → total mesh energy → LML energy budget (total budget minus mesh energy) → LML width → I/O throughput → zero-load latency
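The sizing flow above can be sketched as a short function: pick a flit width, charge the mesh for it, and spend whatever energy budget remains on the logic-to-memory (LML) I/O links. All per-bit energies, channel counts, and link counts below are illustrative placeholders, not the talk's calibrated 22 nm values.

```python
# Sketch of the energy-constrained sizing flow, under assumed numbers.
def size_network(flit_bits, budget_nj_per_cycle=8.0, freq_ghz=2.5,
                 mesh_channels=512,   # concurrently active mesh channels
                 e_rrl_pj_b=0.5,      # router-to-router link energy, pJ/b
                 e_router_pj_b=0.5,   # router traversal energy, pJ/b
                 e_lml_pj_b=5.0,      # electrical off-chip link energy, pJ/b
                 io_links=16):        # one LML per DRAM module
    # On-chip RRL + router energy for one cycle of full mesh activity
    e_mesh_nj = mesh_channels * flit_bits * (e_rrl_pj_b + e_router_pj_b) * 1e-3
    mesh_tput_gbps = mesh_channels * flit_bits * freq_ghz
    # Leftover budget goes to the LML links
    e_io_nj = budget_nj_per_cycle - e_mesh_nj
    if e_io_nj <= 0:
        return None                   # mesh alone exceeds the budget
    # LML width (bits/cycle) the leftover energy can afford
    lml_bits = e_io_nj * 1e3 / (io_links * e_lml_pj_b)
    io_tput_gbps = io_links * lml_bits * freq_ghz
    return mesh_tput_gbps, io_tput_gbps

print(size_network(flit_bits=8))   # I/O-limited: mesh 10240 Gb/s, I/O 1952 Gb/s
```

With expensive electrical I/O, the I/O throughput comes out far below the mesh throughput, which is exactly the bottleneck the next slides quantify.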
SLIDE 14
Network throughput and zero-load latency
System throughput limited by the on-chip mesh or the I/O links
On-chip mesh could be over-provisioned (OPF: 1, 2, 4) to overcome the mesh bottleneck
Zero-load latency limited by data serialization (on-chip and off-chip serialization)
(22 nm tech, 256 cores @ 2.5 GHz, 8 nJ/cycle energy budget)
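The serialization-limited zero-load latency described on these slides can be sketched numerically. Hop count, router delay, packet size, and channel widths below are illustrative assumptions, not the talk's calibrated values.

```python
# Zero-load latency = hop latency plus serialization onto the on-chip
# and off-chip channels.
def zero_load_latency_cycles(packet_bits, hops=16, router_delay_cyc=2,
                             onchip_bits_per_cyc=128, io_bits_per_cyc=16):
    serialization = (packet_bits / onchip_bits_per_cyc +
                     packet_bits / io_bits_per_cyc)
    return hops * router_delay_cyc + serialization

# The narrow electrical off-chip link dominates; widening it (as
# photonic I/O allows) removes most of the zero-load latency.
print(zero_load_latency_cycles(512))                        # 32 + 4 + 32 = 68.0
print(zero_load_latency_cycles(512, io_bits_per_cyc=128))   # 32 + 4 + 4 = 40.0
```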
SLIDE 18
Outline
Motivation Monolithic silicon photonic technology Processor-memory network architecture exploration
Baseline electrical mesh topology Electrical mesh with optical global crossbar topology
Manycore system using silicon photonics Conclusion
SLIDE 19
Optical system architecture
Off-chip electrical links replaced with optical links
Electrical-to-optical conversion at the access point
Wavelengths in each optical link distributed across core-DRAM module pairs
Mesh physical and logical views
C = core, DM = DRAM module
SLIDE 20
Network throughput and zero-load latency
Reduced I/O cost improves system bandwidth
Reduction in latency due to lower serialization latency
On-chip network is the new bottleneck
SLIDE 22
Optical multi-group system architecture
Break the single on-chip electrical mesh into several groups
Each group has its own smaller mesh
Each group still has one AP for each DM
More APs → each AP is narrower (uses fewer λs)
Use the optical network as a very efficient global crossbar
Need a crossbar switch at the memory for arbitration
Ci = core in group i, DM = DRAM module, S = global crossbar switch
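The narrowing effect of grouping can be sketched directly: with g groups and d DRAM modules there are g·d access points sharing the wavelength supply. The total of 1024 wavelengths below is an assumed figure for illustration; the group and DM counts follow the talk's 256-core, 16-DM example.

```python
# Wavelengths per access point (AP) as the mesh is split into groups.
def lambdas_per_ap(total_lambdas, groups, dram_modules):
    return total_lambdas / (groups * dram_modules)

print(lambdas_per_ap(1024, groups=1, dram_modules=16))   # 64.0 per (wide) AP
print(lambdas_per_ap(1024, groups=4, dram_modules=16))   # 16.0 per (narrow) AP
```

More, narrower APs shorten the average electrical path to an AP, which is why grouping shifts traffic onto the photonic channels.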
SLIDE 23
Network throughput vs. zero-load latency
Grouping moves traffic from energy-inefficient mesh channels to energy-efficient photonic channels
Grouping and silicon photonics provide a 10x-15x throughput improvement
Grouping reduces zero-load latency in the photonic range, but increases it in the electrical range
SLIDE 24
Simulation results
Grouping: 2x improvement in bandwidth at comparable latency
Overprovisioning: 2x-3x improvement in bandwidth for small group counts at comparable latency; minimal improvement for large group counts
(256 cores, 16 DM, uniform random traffic)
SLIDE 25
Simulation results
Replacing off-chip electrical links with photonics (Eg1x4 → Og1x4): 2x improvement in bandwidth at comparable latency
Using an opto-electrical global crossbar (Eg4x2 → Og16x1): 8x-10x improvement in bandwidth at comparable latency
(256 cores, 16 DM, uniform random traffic)
SLIDE 26
Outline
Motivation Monolithic silicon photonic technology Processor-memory network architecture exploration Manycore system using silicon photonics Conclusion
SLIDE 27
Simplified 16-core system design
SLIDE 32
Full 256-core system design
SLIDE 33
Outline
Motivation Monolithic silicon photonic technology Processor-memory network architecture exploration Manycore system using silicon photonics Conclusion
SLIDE 34
Conclusion
On-chip network design and memory bandwidth will limit manycore system performance
A unified on-chip/off-chip photonic link is proposed to address this problem
Grouping with an optical global crossbar improves system throughput
For an energy-constrained design, photonics provides an 8x-10x improvement in throughput at comparable latency
SLIDE 35
Backup
SLIDE 36
MIT Eos1 65 nm test chip
Texas Instruments standard 65 nm bulk CMOS process
First-ever photonic chip in sub-100 nm CMOS
Automated photonic device layout
Monolithic integration with electrical modulator drivers
SLIDE 37
[Die photo annotations: ring modulator, paperclips, waveguide crossings, Mach-Zehnder test structures, digital driver, 4-ring filter banks, photodetector, two-ring filter, one-ring filter, vertical coupler grating]
SLIDE 38
Optical waveguide
Waveguide made of polysilicon
Silicon substrate under the waveguide etched away to provide optical cladding
64 wavelengths per waveguide, in opposite directions
SEM image of a polysilicon waveguide; cross-sectional view of a photonic chip
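The aggregate bandwidth of one waveguide under dense WDM follows directly from the 64-wavelength figure on this slide; the 10 Gb/s per-wavelength data rate below is an assumed illustrative figure, not stated here.

```python
# Aggregate DWDM bandwidth of a single polysilicon waveguide.
def waveguide_bw_gbps(n_lambdas=64, rate_gbps_per_lambda=10):
    return n_lambdas * rate_gbps_per_lambda

print(waveguide_bw_gbps())   # 640 Gb/s per waveguide
```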
SLIDE 39
Modulators and filters
2nd-order ring filters used
Rings tuned using sizing and heating
Resonant racetrack modulator; double-ring resonant filter
Modulator is tuned using charge injection
Sub-100 fJ/bit energy cost for the modulator driver
SLIDE 40
Photodetectors
Embedded SiGe used to create photodetectors
Monolithic integration enables good optical coupling
Sub-100 fJ/bit energy cost required for the receiver
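The per-bit electrical energy of one photonic link can be rolled up from these backup slides: a sub-100 fJ/b modulator driver plus a sub-100 fJ/b receiver. The 50 fJ/b tuning term below is an assumed placeholder used to reconcile the sum with the ~0.25 pJ/b quoted earlier in the talk, not a figure from the slides.

```python
# Per-bit electrical energy of a photonic link, summed from the
# component budgets on the backup slides (tuning term is assumed).
def link_energy_pj_per_bit(modulator_fj=100, receiver_fj=100, tuning_fj=50):
    return (modulator_fj + receiver_fj + tuning_fj) / 1000.0

print(link_energy_pj_per_bit())   # 0.25 pJ/b
```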