SLIDE 1

Building manycore processor-to-DRAM networks using monolithic silicon photonics

Ajay Joshi†, Christopher Batten†, Vladimir Stojanović†, Krste Asanović‡

†MIT, 77 Massachusetts Ave, Cambridge, MA 02139
‡UC Berkeley, 430 Soda Hall, MC #1776, Berkeley, CA 94720

{joshi, cbatten, vlada}@mit.edu, krste@eecs.berkeley.edu

High Performance Embedded Computing (HPEC) Workshop

23-25 September 2008

SLIDE 2

Manycore systems design space

SLIDE 3

Manycore system bandwidth requirements

SLIDE 4

Manycore systems – bandwidth, pin count and power scaling

[Chart: bandwidth, pin count, and power scaling for server/HPC and mobile client systems, assuming 1 byte/flop and 8 flops/core at 5 GHz]
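The rule of thumb on this slide pins down the memory bandwidth target; a quick sketch of the arithmetic, assuming the 256-core configuration used later in the talk:

```python
# Back-of-envelope memory bandwidth demand from the slide's rule of
# thumb: 1 byte/flop, 8 flops/core, 5 GHz clock. The 256-core count is
# the configuration used later in the talk.
CORES = 256
CLOCK_HZ = 5e9
FLOPS_PER_CORE_PER_CYCLE = 8
BYTES_PER_FLOP = 1

peak_flops = CORES * CLOCK_HZ * FLOPS_PER_CORE_PER_CYCLE
demand_bytes = peak_flops * BYTES_PER_FLOP
print(f"Peak compute : {peak_flops / 1e12:.2f} Tflop/s")
print(f"Memory demand: {demand_bytes / 1e12:.2f} TB/s")  # ~10.24 TB/s
```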

SLIDE 5

Interconnect bottlenecks

[Diagram: single-CPU and manycore systems, with cores, caches, and DRAM DIMMs connected through interconnect networks]

Bottlenecks arise due to energy and bandwidth-density limitations.

SLIDE 6

Interconnect bottlenecks

- Bottlenecks arise due to energy and bandwidth-density limitations
- The on-chip and off-chip interconnect networks need to be optimized jointly

SLIDE 7

Outline

- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
- Manycore system using silicon photonics
- Conclusion

SLIDE 8

Unified on-chip/off-chip photonic link

- Supports dense wavelength-division multiplexing, which improves bandwidth density
- Uses monolithic integration, which reduces energy consumption
- Utilizes the standard bulk CMOS flow

SLIDE 9

Optical link components

65 nm bulk CMOS chip designed to test various optical devices

SLIDE 10

Silicon photonics area and energy advantage

Link type                                          Energy (pJ/b)   Bandwidth density (Gb/s/μm)
Global on-chip photonic link                       0.25            160-320
Global on-chip optimally repeated electrical link  1               5
Off-chip photonic link (50 μm coupler pitch)       0.25            13-26
Off-chip electrical SERDES (100 μm pitch)          5               0.1
On-chip/off-chip seamless photonic link            0.25            —
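To put the table's energy column in perspective, the sketch below converts pJ/b into link power at an illustrative aggregate bandwidth of 1 TB/s (an assumed figure, not from the deck):

```python
# Link power implied by the table's energy-per-bit figures at an
# illustrative 1 TB/s of aggregate off-chip bandwidth (assumed figure).
BANDWIDTH_BITS_PER_S = 1e12 * 8  # 1 TB/s in bits/s

LINKS_PJ_PER_BIT = {
    "off-chip photonic link": 0.25,
    "off-chip electrical SERDES": 5.0,
}
for name, pj in LINKS_PJ_PER_BIT.items():
    watts = BANDWIDTH_BITS_PER_S * pj * 1e-12
    print(f"{name}: {watts:.0f} W")
# -> ~2 W photonic vs. ~40 W electrical for the same bandwidth
```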

SLIDE 11

Outline

- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
  - Baseline electrical mesh topology
  - Electrical mesh with optical global crossbar topology
- Manycore system using silicon photonics
- Conclusion

SLIDE 12

Baseline electrical system architecture

- One access point (AP) per DM, distributed across the chip
- Two on-chip electrical mesh networks:
  - Request path: core → access point → DRAM module
  - Response path: DRAM module → access point → core

[Figures: mesh physical view and mesh logical view]

C = core, DM = DRAM module

SLIDE 13

Interconnect network design methodology

- Ideal throughput and zero-load latency are used as design metrics
- An energy-constrained approach is adopted
- Energy components in a network:
  - Mesh energy (Em): router-to-router links (RRL) and routers
  - I/O energy (Eio): logic-to-memory links (LML)

[Flowchart: from the chosen flit width, calculate on-chip RRL energy, on-chip router energy, mesh throughput, and total mesh energy; subtracting the mesh energy from the total energy budget gives the energy budget for the LML, which determines LML width, I/O throughput, and zero-load latency]
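A minimal code sketch of this flow follows; the per-bit energies are hypothetical placeholders (only the 2.5 GHz clock and 8 nJ/cycle budget come from the study configuration on the next slide), so it illustrates the budget split rather than reproducing the paper's model:

```python
# Illustrative sketch of the energy-constrained sizing flow above.
# The per-bit energies are hypothetical placeholders, not values from
# the paper; the clock and budget match the 22 nm study configuration.
FREQ_HZ = 2.5e9    # core clock
E_BUDGET = 8e-9    # total network energy budget per cycle (8 nJ)

def size_io(flit_bits, e_rrl=0.5e-12, e_router=0.3e-12, e_lml=5e-12):
    """Return (LML width in bits/cycle, I/O throughput in bits/s), or
    None if the mesh alone exhausts the energy budget."""
    e_mesh = flit_bits * (e_rrl + e_router)  # mesh energy per cycle (J)
    e_io_budget = E_BUDGET - e_mesh          # leftover budget for I/O
    if e_io_budget <= 0:
        return None
    lml_bits = e_io_budget / e_lml           # I/O bits affordable per cycle
    return lml_bits, lml_bits * FREQ_HZ

# Wider flits spend more of the budget on the mesh, leaving less for I/O.
for w in (64, 128, 256):
    res = size_io(w)
    if res:
        print(f"flit {w:3d} b -> LML {res[0]:6.0f} b/cycle, "
              f"I/O {res[1] / 8e12:.2f} TB/s")
```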

SLIDE 14

Network throughput and zero load latency

- System throughput is limited by either the on-chip mesh or the I/O links
- The on-chip mesh can be over-provisioned (OPF = 1, 2, 4) to overcome the mesh bottleneck
- Zero-load latency is limited by data serialization, both on-chip and off-chip

(22 nm technology, 256 cores @ 2.5 GHz, 8 nJ/cycle energy budget)
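The serialization effect can be seen in a generic zero-load latency model; this is a textbook mesh approximation with assumed parameters, not the paper's simulation setup:

```python
import math

# Generic zero-load latency model for a k x k mesh under uniform random
# traffic. The per-hop delay and message size are hypothetical; only the
# 16 x 16 shape follows from the 256-core configuration.
def zero_load_latency(k=16, msg_bits=512, chan_bits=128, cyc_per_hop=2):
    avg_hops = 2 * k / 3                     # mean hop count, k-ary 2-mesh
    t_head = avg_hops * cyc_per_hop          # header traversal (cycles)
    t_ser = math.ceil(msg_bits / chan_bits)  # body serialization (cycles)
    return t_head + t_ser

# Narrower channels save energy but inflate the serialization term:
for width in (512, 128, 32):
    zll = zero_load_latency(chan_bits=width)
    print(f"channel {width:3d} b -> ZLL {zll:.1f} cycles")
```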


SLIDE 18

Outline

- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
  - Baseline electrical mesh topology
  - Electrical mesh with optical global crossbar topology
- Manycore system using silicon photonics
- Conclusion

SLIDE 19

Optical system architecture

- Off-chip electrical links replaced with optical links
- Electrical-to-optical conversion at the access point
- Wavelengths in each optical link distributed across various core-DRAM-module pairs

[Figures: mesh physical view and mesh logical view]

C = core, DM = DRAM module

SLIDE 20

Network throughput and zero load latency

- Reduced I/O cost improves system bandwidth
- Latency is reduced due to lower serialization latency
- The on-chip network is the new bottleneck


SLIDE 22

Optical multi-group system architecture

Break the single on-chip electrical mesh into several groups:

- Each group has its own smaller mesh
- Each group still has one AP for each DM
- More APs means each AP is narrower (uses fewer λs)

Use the optical network as a very efficient global crossbar:

- A crossbar switch is needed at the memory for arbitration

Ci = core in group i, DM = DRAM module, S = global crossbar switch
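To make the grouping arithmetic concrete, here is a sketch for the 256-core, 16-DM configuration; the 64-wavelength total per DM is an illustrative assumption (the backup slides state 64 λ per waveguide), not a stated system parameter:

```python
# How grouping divides access points (APs) and wavelengths for the
# 256-core / 16-DRAM-module system. The 64-wavelength total per DM is
# an illustrative assumption, not a stated system parameter.
CORES, DMS, LAMBDAS_PER_DM = 256, 16, 64

for groups in (1, 4, 16):
    cores_per_group = CORES // groups          # each group gets a smaller mesh
    aps_per_dm = groups                        # one AP per DM in every group
    lambdas_per_ap = LAMBDAS_PER_DM // groups  # more APs -> narrower APs
    print(f"{groups:2d} groups: {cores_per_group:3d} cores/group, "
          f"{aps_per_dm:2d} APs per DM, {lambdas_per_ap:2d} wavelengths per AP")
```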

SLIDE 23

Network throughput vs zero load latency

- Grouping moves traffic from energy-inefficient mesh channels to energy-efficient photonic channels
- Grouping and silicon photonics together provide a 10x-15x throughput improvement
- Grouping reduces zero-load latency in the photonic range, but increases it in the electrical range

SLIDE 24

Simulation results

Grouping:

- 2x improvement in bandwidth at comparable latency

Overprovisioning:

- 2x-3x improvement in bandwidth for small group counts at comparable latency
- Minimal improvement for large group counts

(256 cores, 16 DMs, uniform random traffic)

SLIDE 25

Simulation results

Replacing off-chip electrical links with photonics (Eg1x4 → Og1x4):

- 2x improvement in bandwidth at comparable latency

Using an opto-electrical global crossbar (Eg4x2 → Og16x1):

- 8x-10x improvement in bandwidth at comparable latency

(256 cores, 16 DMs, uniform random traffic)

SLIDE 26

Outline

- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
- Manycore system using silicon photonics
- Conclusion

SLIDE 27

Simplified 16-core system design


SLIDE 32

Full 256-core system design

SLIDE 33

Outline

- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
- Manycore system using silicon photonics
- Conclusion

SLIDE 34

Conclusion

- On-chip network design and memory bandwidth will limit manycore system performance
- A unified on-chip/off-chip photonic link is proposed to solve this problem
- Grouping with an optical global crossbar improves system throughput
- Under an energy-constrained approach, photonics provides an 8x-10x improvement in throughput at comparable latency

SLIDE 35

Backup

SLIDE 36

MIT Eos1 65 nm test chip

- Texas Instruments standard 65 nm bulk CMOS process
- First-ever photonic chip in sub-100 nm CMOS
- Automated photonic device layout
- Monolithic integration with electrical modulator drivers

SLIDE 37

[Die photo annotations: ring modulator, paperclips, waveguide crossings, Mach-Zehnder test structures, digital driver, 4 ring filter banks, photodetector, two-ring filter, one-ring filter, vertical coupler grating]

SLIDE 38

Optical waveguide

- Waveguide made of polysilicon
- Silicon substrate under the waveguide is etched away to provide optical cladding
- 64 wavelengths per waveguide, traveling in opposite directions

[Figures: SEM image of a polysilicon waveguide; cross-sectional view of a photonic chip]
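The 64-wavelength figure lines up with the bandwidth-density range on slide 10; in the sketch below, the per-wavelength data rates are inferred to make the numbers meet, not stated in the deck:

```python
# Aggregate dense-WDM bandwidth per waveguide. The per-wavelength data
# rates are assumptions chosen to illustrate how 64 wavelengths could
# yield slide 10's 160-320 Gb/s on-chip figures.
WAVELENGTHS = 64
for gbps in (2.5, 5.0):
    print(f"{gbps} Gb/s per wavelength -> "
          f"{WAVELENGTHS * gbps:.0f} Gb/s per waveguide")
```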

SLIDE 39

Modulators and filters

- Second-order ring filters used
- Rings tuned using sizing and heating
- Modulator tuned using charge injection
- Sub-100 fJ/bit energy cost for the modulator driver

[Figures: resonant racetrack modulator; double-ring resonant filter]

SLIDE 40

Photodetectors

- Embedded SiGe used to create photodetectors
- Monolithic integration enables good optical coupling
- Sub-100 fJ/bit energy cost required for the receiver